Purpose

The number of words (and the number of characters) in a source-language sentence is related to the corresponding count in its target-language translation. If this relationship can be captured in a separate model, that model can help in at least two ways:

  • For training: Validate the alignment of two sentences (in a training example) by comparing their word counts and/or character counts
  • For inference: Validate the word count and/or character count of a translated/proofread sentence

In this notebook we continue to explore such models for each language and to evaluate their use in the roles above.
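
As a minimal sketch of the validation roles described above, a straight-line fit of translated length on English length can flag pairs whose translated length deviates strongly from the prediction. The helper names (`fit_line`, `is_suspicious`) and the 50% tolerance are illustrative assumptions, not the model this notebook settles on. The sample values are the first six POR rows and two pairs that appear later in this notebook.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit: y is approximated by slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def is_suspicious(x, y, slope, intercept, tol=0.5):
    """Flag a pair whose target length is off by more than tol * predicted length."""
    predicted = max(slope * x + intercept, 1.0)  # guard against tiny or negative predictions
    return abs(y - predicted) > tol * predicted


# chars_E / chars_V of the first six POR rows
chars_e = [98, 27, 36, 13, 85, 36]
chars_v = [104, 25, 32, 11, 93, 35]
slope, intercept = fit_line(chars_e, chars_v)

# An over-translated pair (125 English chars, 329 Portuguese chars)
# versus a well-aligned pair (85, 93).
print(is_suspicious(125, 329, slope, intercept))
print(is_suspicious(85, 93, slope, intercept))
```

The same two functions apply unchanged to `words_E` and `words_V`; the tolerance would need tuning per language.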

Dataset and Variables

The dataset used in this notebook contains the following features:

  • m_descriptor: Unique identifier of a document
  • t_lan_E: Language of the translation (English is also considered a translation)
  • t_version: Version of a translation
  • s_rsen: Number of a sentence within a document
  • c_id: Database primary key of a contribution
  • e_content_E: Text content of an English contribution
  • chars_E: Number of characters in an English contribution
  • words_E: Number of words in an English contribution
  • t_lan_V: Language of the translation
  • e_top: N/A
  • be_top: N/A
  • c_created_at: Creation time of a contribution
  • c_kind: Kind of a contribution
  • c_base: N/A
  • a_role: N/A
  • u_name: N/A
  • e_content_V: Text content of a translated contribution
  • chars_V: Number of characters in a translated contribution
  • words_V: Number of words in a translated contribution
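
This notebook does not show how the count columns were produced. One hedged guess, which happens to agree with the two short English sentences printed in the outlier cells of this notebook, is to take `len()` of the text for characters and to count runs of word characters for words. `count_chars` and `count_words` are hypothetical helper names.

```python
import re


def count_chars(text):
    # Every character counts, including spaces and punctuation
    return len(text)


def count_words(text):
    # Runs of letters, digits or underscores; an apostrophe splits a word in two
    return len(re.findall(r"\w+", text))


for sentence in ["Spastic.", "It’s turpentine."]:
    print(sentence, count_chars(sentence), count_words(sentence))
```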

Set Up the Environment

from pathlib import Path
import pandas as pd
%matplotlib inline
!python --version
Python 3.6.9
PATH = Path(base_dir)  # base_dir is expected to be defined before this cell
import plotly.express as px

Language POR

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-POR-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
   m_descriptor  t_lan_E  t_version  s_rsen    c_id  chars_E  words_E  t_lan_V  chars_V  words_V
0     1948-0304      ENG    15-0902       1  660286       98       21      POR      104       18
1     1948-0304      ENG    15-0902       2  660287       27        6      POR       25        5
2     1948-0304      ENG    15-0902       3  660288       36        7      POR       32        7
3     1948-0304      ENG    15-0902       4  660289       13        4      POR       11        3
4     1948-0304      ENG    15-0902       5  660290       85       18      POR       93       19
5     1948-0304      ENG    15-0902       6  660291       36        9      POR       35        6

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1964-0621') & (df['s_rsen']==1125)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
2989 1964-0621 ENG 18-0101 1125 729001 Sometime ago, ab... 125 22 POR T N 2020-08-21 09:22... 1 V c CE pausil Há algum tempo a... 329 58
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
Sometime ago, about fifteen years ago, I remember one night being called to a hospital, to a boy dying with black diphtheria.
Há algum tempo atrás, há cerca de quinze anos atrás, lembro-me de uma noite ser chamado a um hospital, a respeito de um rapaz a morrer com difteria negra [Doença provocada por uma bactéria que se transmite por contacto físico com um doente ou pela respiração e que atinge a faringe. Manifesta-se sobretudo em crianças. - Trad.] .

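This data point is an outlier because of over-translation: the translator appended a bracketed explanatory note ([... - Trad.]) to the sentence, which inflates its character and word counts.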

df = df.drop(2989)
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1949-0718') & (df['s_rsen']==458)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
1622 1949-0718 ENG 15-0901 458 2252 Spastic. 8 1 POR T N 2020-02-08 17:24... 1 V c CE pausil Espasmódico. [Pe... 97 12
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
Spastic.
Espasmódico. [Pessoa que sofre de paralisia cerebral com espasmos musculares acentuados. - Trad.]
df = df.drop(1622)
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-1031y') & (df['s_rsen']==632)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
4969 1965-1031y ENG 17-0901 632 73575 It’s turpentine. 16 3 POR T N 2020-07-20 08:05... 1 V c CE pausil É terebintina. [... 92 14
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
It’s turpentine.
É terebintina. [Terebintina é um líquido obtido por destilação de resina de árvores - Trad.]
df = df.drop(4969)
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-1031y') & (df['s_rsen']==426)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
4763 1965-1031y ENG 17-0901 426 73369 And mother say, ... 54 12 POR T N 2020-07-20 05:03... 1 V c CE pausil E a mãe diz: "Ti... 136 24
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
And mother say, “Did you get A’s on your report card?”
E a mãe diz: "Tiveste A na tua ficha de avaliação?" [No sistema educativo Americano, a nota A equivale à avaliação mais elevada - Trad.]
df = df.drop(4763)
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-1031y') & (df['s_rsen']==534)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
4871 1965-1031y ENG 17-0901 534 73477 Flies on the bre... 87 18 POR T N 2020-07-20 07:30... 1 V c CE pausil Moscas no pão [n... 207 38
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
Flies on the bread, flies on the beef, flies on the butter, f-o-b, flies on everything.
Moscas no pão [no Inglês a sigla f.o.b. representa "flies on bread" - moscas no pão - Trad.], moscas na carne ["flies on beef" - Trad.], moscas na manteiga ["flies on butter" - Trad.], f-o-b, moscas em tudo.
df = df.drop(4871)
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-1031y') & (df['s_rsen']==54)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
4391 1965-1031y ENG 17-0901 54 72997 Now, children, t... 88 20 POR T N 2020-07-19 06:14... 1 V c CE pausil Agora, crianças,... 163 28
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
Now, children, that’s the horn of plenty and I’ll take this and hang it in our new home.
Agora, crianças, isso é a cornucópia [vaso em forma de chifre que representava abundância e generosidade - Trad.] e vou levar isto e pendurá-lo na nossa nova casa.
df = df.drop(4391)
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-0117') & (df['s_rsen']==164)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
3243 1965-0117 ENG 18-0101 164 510934 And then the str... 126 24 POR T N 2020-07-23 07:08... 1 V c CE pausil E depois o estra... 213 38
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
And then the strange thing of it was, my favorite bird, robin, picture was on the marker, the little bird with the red breast.
E depois o estranho disso foi que, o meu pássaro favorito, o pisco [Ave existente no Sul do Canadá, Estados Unidos da América e México - Trad.], a imagem dele estava no marcador, o passarinho com o peito vermelho.
df = df.drop(3243)
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-1031y') & (df['s_rsen']==191)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
4528 1965-1031y ENG 17-0901 191 73134 *Then Jesus beho... 254 46 POR M M 2020-07-19 07:13... 2 V t CE pausil E Jesus, olhando... 161 36
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
*Then Jesus beholding him loved him*, this young fellow; *and* he *said unto him, One thing thou lackest: go thy way, sell whatsoever thou hast, and give to the poor, and thou shalt have treasures in heaven: and come, take up* thy *cross, and follow me*.
E Jesus, olhando para ele, o amou e lhe disse: Falta-te uma coisa: vai, e vende tudo quanto tens, e dá-o aos pobres, e terás um tesouro no céu; e vem e segue-me.
df = df.drop(4528)
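
Most of the rows dropped above share a bracketed translator's note ending in `- Trad.]`. As a sketch under that assumption, such rows could be flagged with a regular expression instead of being located one by one on the scatter plot. The pattern and the `has_note` column are illustrative, and the small frame stands in for the real `df`.

```python
import pandas as pd

# Stand-in for df, using two translations printed in this notebook
sample = pd.DataFrame({
    'e_content_V': [
        'Espasmódico. [Pessoa que sofre de paralisia cerebral com espasmos musculares acentuados. - Trad.]',
        'E Jesus, olhando para ele, o amou e lhe disse: Falta-te uma coisa: vai, e vende tudo quanto tens, e dá-o aos pobres, e terás um tesouro no céu; e vem e segue-me.',
    ]
})

# A bracketed span that ends in "- Trad.]" (assumed convention of this translator)
note_pattern = r'\[[^\]]*-\s*Trad\.\]'
sample['has_note'] = sample['e_content_V'].str.contains(note_pattern, regex=True, na=False)

# Keep only rows without a translator's note
clean = sample[~sample['has_note']]
print(sample['has_note'].tolist())
```

Rows such as the last outlier above, where the translation is shorter than the English, contain no note and would still need manual review.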

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-POR-output.csv', sep='~', index=False, header=True)

Language SHO

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-SHO-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
   m_descriptor  t_lan_E  t_version  s_rsen    c_id  chars_E  words_E  t_lan_V  chars_V  words_V
0     1965-1204      ENG    16-1102       1  124021       95       19      SHO       90       12
1     1965-1204      ENG    16-1102       2  124022       44        9      SHO       54        6
2     1965-1204      ENG    16-1102       3  124023      175       36      SHO      165       19
3     1965-1204      ENG    16-1102       4  124024      142       26      SHO      137       20
4     1965-1204      ENG    16-1102       5  124025      123       25      SHO      141       19
5     1965-1204      ENG    16-1102       6  124026       40        9      SHO       47        7

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
# #outlier
# pd.set_option('display.max_colwidth',20)
# outdf = df[(df['m_descriptor']=='1965-0822x') & (df['s_rsen']==1)]
# outdf
# print(outdf.loc[:,['e_content_E']].values[0][0])
# print(outdf.loc[:,['e_content_V']].values[0][0])

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-SHO-output.csv', sep='~', index=False, header=True)

Language SWA

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-SWA-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
   m_descriptor  t_lan_E  t_version  s_rsen    c_id  chars_E  words_E  t_lan_V  chars_V  words_V
0     1965-1212      ENG    12-1201       1  808450       76       14      SWA       83       14
1     1965-1212      ENG    12-1201       2  808451       84       20      SWA      113       18
2     1965-1212      ENG    12-1201       3  808452       57       10      SWA       62       10
3     1965-1212      ENG    12-1201       4  808453       67       15      SWA       64       10
4     1965-1212      ENG    12-1201       5  808454       23        6      SWA       21        4
5     1965-1212      ENG    12-1201       6  808455       72       16      SWA       72        9

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
# #outlier
# pd.set_option('display.max_colwidth',20)
# outdf = df[(df['m_descriptor']=='1965-0822x') & (df['s_rsen']==1)]
# outdf
# print(outdf.loc[:,['e_content_E']].values[0][0])
# print(outdf.loc[:,['e_content_V']].values[0][0])

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-SWA-output.csv', sep='~', index=False, header=True)

Language TWI

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-TWI-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
   m_descriptor  t_lan_E  t_version  s_rsen    c_id  chars_E  words_E  t_lan_V  chars_V  words_V
0    1965-0725x      ENG    15-1102       1  174446       18        2      TWI       26        6
1    1965-0725x      ENG    15-1102       2  174447       29        7      TWI       41        8
2    1965-0725x      ENG    15-1102       3  174448      113       20      TWI       95       20
3    1965-0725x      ENG    15-1102       4  174449      177       34      TWI      150       36
4    1965-0725x      ENG    15-1102       5  174450       72       12      TWI       74       15
5    1965-0725x      ENG    15-1102       6  174451       85       18      TWI       75       17

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-1204') & (df['s_rsen']==1297)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
5170 1965-1204 ENG 16-1102 1297 125317 Now, if you want... 131 28 TWI M N 2018-03-10 07:44... 1 V c TE mosart Seisei, sԑ wo pԑ... 189 36
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
Now, if you want to read in Matthew 18:16, It said, “There is three that bear record,” see, in Saint … in First John 5:7, so forth.
Seisei, sԑ wo pԑ sԑ wokan Mateo ti du-nnwↄtwe nkyekyԑmu du-nsia, ԑkaa sԑ, "mmiԑnsa na ԑdi ho adansie, "hwԑ, wↄ kronkron no ... wↄ Yohane nwoma ԑdikan ti num nkyekyԑmu nson, ne deԑ ԑkeka ho.
df = df.drop(5170)
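
`df.drop(5170)` relies on the row label staying the same each time the CSV is read. A sketch of an alternative, assuming (`m_descriptor`, `s_rsen`) identifies a single row within one language file as the filters above suggest, removes the row by those keys. `drop_sentence` is a hypothetical helper and the small frame uses values from the TWI tables above.

```python
import pandas as pd


def drop_sentence(frame, descriptor, sentence_no):
    """Return frame without the row(s) matching the document and sentence number."""
    mask = (frame['m_descriptor'] == descriptor) & (frame['s_rsen'] == sentence_no)
    return frame[~mask]


demo = pd.DataFrame({
    'm_descriptor': ['1965-0725x', '1965-0725x', '1965-1204'],
    's_rsen': [1, 2, 1297],
    'chars_E': [18, 29, 131],
    'chars_V': [26, 41, 189],
})

# Remove the outlier by its keys rather than by its index label
demo = drop_sentence(demo, '1965-1204', 1297)
print(len(demo))
```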

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-TWI-output.csv', sep='~', index=False, header=True)

Language YOR

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-YOR-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
   m_descriptor  t_lan_E  t_version  s_rsen    c_id  chars_E  words_E  t_lan_V  chars_V  words_V
0     1965-1206      ENG    14-0901       1    3195       20        5      YOR       19        5
1     1965-1206      ENG    14-0901       2    3196       82       14      YOR       85       21
2     1965-1206      ENG    14-0901       3    3197       60       10      YOR       89       24
3     1965-1206      ENG    14-0901       4    3198       95       18      YOR       88       23
4     1965-1206      ENG    14-0901       5    3199       80       17      YOR       42       11
5     1965-1206      ENG    14-0901       6    3200      115       23      YOR       94       23

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
# #outlier
# pd.set_option('display.max_colwidth',20)
# outdf = df[(df['m_descriptor']=='1965-0822x') & (df['s_rsen']==1)]
# outdf
# print(outdf.loc[:,['e_content_E']].values[0][0])
# print(outdf.loc[:,['e_content_V']].values[0][0])

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-YOR-output.csv', sep='~', index=False, header=True)