Purpose

The number of words (and the number of characters) in a source-language sentence is related to the corresponding number in its target-language translation. If this relationship can be established and captured in a separate model, that model can be helpful in at least two ways:

For training: Validate the alignment of two sentences (in a training example) by comparing their word counts and/or character counts
For inference: Validate the word count and/or character count of a translated/proofread sentence

In this notebook we continue to explore such models for each language and to evaluate their use in the above roles.
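As a rough illustration of the two roles above, the sketch below fits a straight line to (English count, translation count) pairs and then checks whether a new pair lies close to that line. The names `fit_length_model` and `is_plausible` and the tolerance of three residual standard deviations are assumptions made for this example; they are not the model developed in this notebook. The sample counts are the six POR rows shown in the column preview output.

```python
import numpy as np

def fit_length_model(src_counts, tgt_counts):
    """Fit tgt ~ slope * src + intercept; return (slope, intercept, residual_std)."""
    src = np.asarray(src_counts, dtype=float)
    tgt = np.asarray(tgt_counts, dtype=float)
    slope, intercept = np.polyfit(src, tgt, deg=1)
    residuals = tgt - (slope * src + intercept)
    return slope, intercept, residuals.std()

def is_plausible(src_count, tgt_count, model, n_std=3.0):
    """True if tgt_count lies within n_std residual standard deviations of the fit."""
    slope, intercept, resid_std = model
    expected = slope * src_count + intercept
    return bool(abs(tgt_count - expected) <= n_std * resid_std)

# chars_E and chars_V of the first six POR rows
chars_E = [98, 27, 36, 13, 85, 36]
chars_V = [104, 25, 32, 11, 93, 35]
model = fit_length_model(chars_E, chars_V)

print(is_plausible(50, 52, model))   # close to the fitted line
print(is_plausible(50, 200, model))  # far above the fitted line
```

The same check could serve both roles: during training it would flag sentence pairs that are probably misaligned, and during inference it would flag a translation whose length is unexpected for its source sentence.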

Dataset and Variables

The dataset used in this notebook contains the following features:

m_descriptor: Unique identifier of a document
t_lan_E: Language of the translation (English is also considered a translation)
t_version: Version of a translation
s_rsen: Number of a sentence within a document
c_id: Database primary key of a contribution
e_content_E: Text content of an English contribution
chars_E: Number of characters in an English contribution
words_E: Number of words in an English contribution
t_lan_V: Language of the translation
e_top: N/A
be_top: N/A
c_created_at: Creation time of a contribution
c_kind: Kind of a contribution
c_base: N/A
a_role: N/A
u_name: N/A
e_content_V: Text content of a translated contribution
chars_V: Number of characters in a translated contribution
words_V: Number of words in a translated contribution
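The feature list does not say how the character and word counts were derived. The sketch below shows one plausible way to compute them with pandas, assuming that characters are counted with the string length and words by splitting on whitespace. The helper name `add_length_features` is hypothetical.

```python
import pandas as pd

def add_length_features(df, text_col, suffix):
    """Return a copy of df with chars_<suffix> and words_<suffix> columns added."""
    out = df.copy()
    text = out[text_col].fillna('')
    out[f'chars_{suffix}'] = text.str.len()           # characters, spaces included
    out[f'words_{suffix}'] = text.str.split().str.len()  # whitespace-separated tokens
    return out

# Two short English sentences that appear later in this notebook
sample = pd.DataFrame({'e_content_E': ['Spastic.', 'It’s turpentine.']})
sample = add_length_features(sample, 'e_content_E', 'E')
print(sample[['chars_E', 'words_E']])
```

If the stored counts were produced with a different tokenizer, the word counts from this sketch may differ slightly from those in the dataset.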

Set Up the Environment

from pathlib import Path
import pandas as pd
%matplotlib inline

!python --version

Python 3.6.9

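# base_dir is not set in this notebook; it must already be defined and point to the
# folder that contains PredictTranslationWordAndCharCount/
PATH = Path(base_dir)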

import plotly.express as px

Language POR

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-POR-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1964-0621') & (df['s_rsen']==1125)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

Sometime ago, about fifteen years ago, I remember one night being called to a hospital, to a boy dying with black diphtheria.
Há algum tempo atrás, há cerca de quinze anos atrás, lembro-me de uma noite ser chamado a um hospital, a respeito de um rapaz a morrer com difteria negra [Doença provocada por uma bactéria que se transmite por contacto físico com um doente ou pela respiração e que atinge a faringe. Manifesta-se sobretudo em crianças. - Trad.] .

This data point is an outlier caused by over-translation: the translator's note in square brackets inflates the character and word counts of the translation.
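Outliers of this kind could also be found programmatically. The sketch below assumes that translator's notes are always enclosed in square brackets and end with "Trad.]", as in the Portuguese examples in this notebook. The pattern and the helper names are illustrative only and have not been checked against the full dataset.

```python
import re

# Assumption: a translator's note looks like "[... - Trad.]"
TRANSLATOR_NOTE = re.compile(r'\s*\[[^\]]*Trad\.\]')

def has_translator_note(text):
    """True if the text contains at least one bracketed translator's note."""
    return TRANSLATOR_NOTE.search(text) is not None

def strip_translator_notes(text):
    """Remove bracketed translator's notes and surrounding whitespace."""
    return TRANSLATOR_NOTE.sub('', text).strip()

example = ('Espasmódico. [Pessoa que sofre de paralisia cerebral '
           'com espasmos musculares acentuados. - Trad.]')
print(has_translator_note(example))
print(strip_translator_notes(example))
```

Stripping the notes and recomputing the counts would keep these sentences in the dataset, whereas this notebook drops them.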

df = df.drop(2989)

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1949-0718') & (df['s_rsen']==458)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

Spastic.
Espasmódico. [Pessoa que sofre de paralisia cerebral com espasmos musculares acentuados. - Trad.]

df = df.drop(1622)

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-1031y') & (df['s_rsen']==632)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

It’s turpentine.
É terebintina. [Terebintina é um líquido obtido por destilação de resina de árvores - Trad.]

df = df.drop(4969)

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-1031y') & (df['s_rsen']==426)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

And mother say, “Did you get A’s on your report card?”
E a mãe diz: "Tiveste A na tua ficha de avaliação?" [No sistema educativo Americano, a nota A equivale à avaliação mais elevada - Trad.]

df = df.drop(4763)

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-1031y') & (df['s_rsen']==534)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

Flies on the bread, flies on the beef, flies on the butter, f-o-b, flies on everything.
Moscas no pão [no Inglês a sigla f.o.b. representa "flies on bread" - moscas no pão - Trad.], moscas na carne ["flies on beef" - Trad.], moscas na manteiga ["flies on butter" - Trad.], f-o-b, moscas em tudo.

df = df.drop(4871)

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-1031y') & (df['s_rsen']==54)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

Now, children, that’s the horn of plenty and I’ll take this and hang it in our new home.
Agora, crianças, isso é a cornucópia [vaso em forma de chifre que representava abundância e generosidade - Trad.] e vou levar isto e pendurá-lo na nossa nova casa.

df = df.drop(4391)

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-0117') & (df['s_rsen']==164)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

And then the strange thing of it was, my favorite bird, robin, picture was on the marker, the little bird with the red breast.
E depois o estranho disso foi que, o meu pássaro favorito, o pisco [Ave existente no Sul do Canadá, Estados Unidos da América e México - Trad.], a imagem dele estava no marcador, o passarinho com o peito vermelho.

df = df.drop(3243)

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-1031y') & (df['s_rsen']==191)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

*Then Jesus beholding him loved him*, this young fellow; *and* he *said unto him, One thing thou lackest: go thy way, sell whatsoever thou hast, and give to the poor, and thou shalt have treasures in heaven: and come, take up* thy *cross, and follow me*.
E Jesus, olhando para ele, o amou e lhe disse: Falta-te uma coisa: vai, e vende tudo quanto tens, e dá-o aos pobres, e terás um tesouro no céu; e vem e segue-me.

df = df.drop(4528)
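The `drop` calls above use hard-coded index labels, which are only valid while the CSV keeps its current row order. A minimal sketch of an alternative is to drop rows by the (m_descriptor, s_rsen) pair used to look them up. The helper name `drop_sentences` and the small `demo` frame are hypothetical.

```python
import pandas as pd

def drop_sentences(df, keys):
    """Return df without the rows whose (m_descriptor, s_rsen) pair is in keys."""
    key_set = set(keys)
    keep = [pair not in key_set
            for pair in zip(df['m_descriptor'], df['s_rsen'])]
    return df[keep]

# Three of the pairs reviewed above
outliers = [('1964-0621', 1125), ('1949-0718', 458), ('1965-1031y', 632)]

# Tiny stand-in for the real dataframe
demo = pd.DataFrame({'m_descriptor': ['1964-0621', '1964-0621', '1949-0718'],
                     's_rsen': [1125, 1126, 458]})
demo = drop_sentences(demo, outliers)
print(demo)
```

Keeping the reviewed pairs in one list also documents which sentences were removed and why.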

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-POR-output.csv', sep='~', index=False, header=True)

Language SHO

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-SHO-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

# #outlier
# pd.set_option('display.max_colwidth',20)
# outdf = df[(df['m_descriptor']=='1965-0822x') & (df['s_rsen']==1)]
# outdf

# print(outdf.loc[:,['e_content_E']].values[0][0])
# print(outdf.loc[:,['e_content_V']].values[0][0])

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-SHO-output.csv', sep='~', index=False, header=True)

Language SWA

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-SWA-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

# #outlier
# pd.set_option('display.max_colwidth',20)
# outdf = df[(df['m_descriptor']=='1965-0822x') & (df['s_rsen']==1)]
# outdf

# print(outdf.loc[:,['e_content_E']].values[0][0])
# print(outdf.loc[:,['e_content_V']].values[0][0])

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-SWA-output.csv', sep='~', index=False, header=True)

Language TWI

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-TWI-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-1204') & (df['s_rsen']==1297)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

Now, if you want to read in Matthew 18:16, It said, “There is three that bear record,” see, in Saint … in First John 5:7, so forth.
Seisei, sԑ wo pԑ sԑ wokan Mateo ti du-nnwↄtwe nkyekyԑmu du-nsia, ԑkaa sԑ, "mmiԑnsa na ԑdi ho adansie, "hwԑ, wↄ kronkron no ... wↄ Yohane nwoma ԑdikan ti num nkyekyԑmu nson, ne deԑ ԑkeka ho.

df = df.drop(5170)

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-TWI-output.csv', sep='~', index=False, header=True)

Language YOR

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-YOR-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

# #outlier
# pd.set_option('display.max_colwidth',20)
# outdf = df[(df['m_descriptor']=='1965-0822x') & (df['s_rsen']==1)]
# outdf

# print(outdf.loc[:,['e_content_E']].values[0][0])
# print(outdf.loc[:,['e_content_V']].values[0][0])

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-YOR-output.csv', sep='~', index=False, header=True)

Output of the column preview for Language POR (first six rows):
	m_descriptor	t_lan_E	t_version	s_rsen	c_id	chars_E	words_E	t_lan_V	chars_V	words_V
0	1948-0304	ENG	15-0902	1	660286	98	21	POR	104	18
1	1948-0304	ENG	15-0902	2	660287	27	6	POR	25	5
2	1948-0304	ENG	15-0902	3	660288	36	7	POR	32	7
3	1948-0304	ENG	15-0902	4	660289	13	4	POR	11	3
4	1948-0304	ENG	15-0902	5	660290	85	18	POR	93	19
5	1948-0304	ENG	15-0902	6	660291	36	9	POR	35	6

Output of the column preview for Language SWA (first six rows):
	m_descriptor	t_lan_E	t_version	s_rsen	c_id	chars_E	words_E	t_lan_V	chars_V	words_V
0	1965-1212	ENG	12-1201	1	808450	76	14	SWA	83	14
1	1965-1212	ENG	12-1201	2	808451	84	20	SWA	113	18
2	1965-1212	ENG	12-1201	3	808452	57	10	SWA	62	10
3	1965-1212	ENG	12-1201	4	808453	67	15	SWA	64	10
4	1965-1212	ENG	12-1201	5	808454	23	6	SWA	21	4
5	1965-1212	ENG	12-1201	6	808455	72	16	SWA	72	9

Output of the column preview for Language TWI (first six rows):
	m_descriptor	t_lan_E	t_version	s_rsen	c_id	chars_E	words_E	t_lan_V	chars_V	words_V
0	1965-0725x	ENG	15-1102	1	174446	18	2	TWI	26	6
1	1965-0725x	ENG	15-1102	2	174447	29	7	TWI	41	8
2	1965-0725x	ENG	15-1102	3	174448	113	20	TWI	95	20
3	1965-0725x	ENG	15-1102	4	174449	177	34	TWI	150	36
4	1965-0725x	ENG	15-1102	5	174450	72	12	TWI	74	15
5	1965-0725x	ENG	15-1102	6	174451	85	18	TWI	75	17

Output of the column preview for Language YOR (first six rows):
	m_descriptor	t_lan_E	t_version	s_rsen	c_id	chars_E	words_E	t_lan_V	chars_V	words_V
0	1965-1206	ENG	14-0901	1	3195	20	5	YOR	19	5
1	1965-1206	ENG	14-0901	2	3196	82	14	YOR	85	21
2	1965-1206	ENG	14-0901	3	3197	60	10	YOR	89	24
3	1965-1206	ENG	14-0901	4	3198	95	18	YOR	88	23
4	1965-1206	ENG	14-0901	5	3199	80	17	YOR	42	11
5	1965-1206	ENG	14-0901	6	3200	115	23	YOR	94	23

Output of the column preview for Language SHO (first six rows):
	m_descriptor	t_lan_E	t_version	s_rsen	c_id	chars_E	words_E	t_lan_V	chars_V	words_V
0	1965-1204	ENG	16-1102	1	124021	95	19	SHO	90	12
1	1965-1204	ENG	16-1102	2	124022	44	9	SHO	54	6
2	1965-1204	ENG	16-1102	3	124023	175	36	SHO	165	19
3	1965-1204	ENG	16-1102	4	124024	142	26	SHO	137	20
4	1965-1204	ENG	16-1102	5	124025	123	25	SHO	141	19
5	1965-1204	ENG	16-1102	6	124026	40	9	SHO	47	7