Purpose

The number of words (and of characters) in a source-language sentence is related to the number in its target-language translation. If this relationship can be captured in a separate model, that model can help in at least two ways:

  • For training: Validate the alignment of two sentences (in a training example) by comparing their word counts and/or character counts
  • For inference: Validate the word count and/or character count of a translated/proofread sentence
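
As a minimal sketch of how such a model could serve both roles, the example below fits a straight line to a handful of (chars_E, chars_V) pairs and checks whether a new pair falls close to that line. The six pairs are the first IBO rows shown in this notebook. The helper name is_plausible and the three-standard-deviation tolerance are assumptions made for illustration, not the model this project settles on.

```python
import numpy as np

# Illustrative (chars_E, chars_V) pairs: the first six IBO rows in this notebook.
chars_e = np.array([20, 82, 60, 95, 80, 115], dtype=float)
chars_v = np.array([22, 80, 64, 84, 80, 101], dtype=float)

# Least-squares line: chars_V is approximated by slope * chars_E + intercept.
slope, intercept = np.polyfit(chars_e, chars_v, deg=1)

# Hypothetical tolerance: three times the standard deviation of the residuals.
residuals = chars_v - (slope * chars_e + intercept)
tolerance = 3 * residuals.std()


def is_plausible(n_source, n_target):
    """Return True when n_target lies within the tolerance of the fitted line."""
    predicted = slope * n_source + intercept
    return abs(n_target - predicted) <= tolerance


print(is_plausible(80, 80))    # a pair taken from the fitted sample
print(is_plausible(139, 206))  # the IBO sentence examined later in this notebook
```

Six points are far too few for a usable fit; the sketch only shows the shape of the check. The same function could be applied to word counts by fitting on words_E and words_V instead.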

In this notebook we continue to develop a model for each language and to evaluate its use in the roles above.

Dataset and Variables

The dataset used in this notebook contains the following features:

  • m_descriptor: Unique identifier of a document
  • t_lan_E: Language of the translation (English is also considered a translation)
  • t_version: Version of a translation
  • s_rsen: Number of a sentence within a document
  • c_id: Database primary key of a contribution
  • e_content_E: Text content of an English contribution
  • chars_E: Number of characters in an English contribution
  • words_E: Number of words in an English contribution
  • t_lan_V: Language of the translation
  • e_top: N/A
  • be_top: N/A
  • c_created_at: Creation time of a contribution
  • c_kind: Kind of a contribution
  • c_base: N/A
  • a_role: N/A
  • u_name: N/A
  • e_content_V: Text content of a translated contribution
  • chars_V: Number of characters in a translated contribution
  • words_V: Number of words in a translated contribution
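
The notebook does not show how the chars_* and words_* columns were computed. The sketch below is one assumption that reproduces the counts for the LUG sentence pair printed later in this notebook: characters are counted with len, and words are counted as runs of word characters, so an apostrophe or hyphen splits a word. The function names count_chars and count_words are hypothetical.

```python
import re


def count_chars(text: str) -> int:
    """Number of characters, including spaces and punctuation."""
    return len(text)


def count_words(text: str) -> int:
    """Number of maximal runs of word characters (letters, digits, underscore)."""
    return len(re.findall(r"\w+", text))


# Sentence pair printed in the LUG section of this notebook.
english = ("It was the last day, of the service that I was to speak at the "
           "International Convention of the Full Gospel Business Men.")
luganda = "Olukunngaana olw’ensi yonna olwa Full Gospel Business Men."

print(count_chars(english), count_words(english))  # 120 23
print(count_chars(luganda), count_words(luganda))  # 58 9
```

A plain whitespace split would give 8 words for the Luganda sentence rather than the 9 recorded in words_V, which is why the regex-based count is used here.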

Set Up the Environment

from pathlib import Path
import pandas as pd
%matplotlib inline
!python --version
Python 3.6.9
PATH = Path(base_dir)  # base_dir must already be defined; its definition is not shown in this notebook
import plotly.express as px

Language IBO

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-IBO-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
   m_descriptor  t_lan_E  t_version  s_rsen    c_id  chars_E  words_E  t_lan_V  chars_V  words_V
0     1965-1206      ENG    14-0901       1    3195       20        5      IBO       22        5
1     1965-1206      ENG    14-0901       2    3196       82       14      IBO       80       19
2     1965-1206      ENG    14-0901       3    3197       60       10      IBO       64       14
3     1965-1206      ENG    14-0901       4    3198       95       18      IBO       84       19
4     1965-1206      ENG    14-0901       5    3199       80       17      IBO       80       16
5     1965-1206      ENG    14-0901       6    3200      115       23      IBO      101       25

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-1206') & (df['s_rsen']==1120)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
1119 1965-1206 ENG 14-0901 1120 4314 Look, they laugh... 139 22 IBO M NaN 2019-02-05 08:31... 1 V c TE vicadi Lee, ha chìri ya... 206 47
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
Look, they laughed at him, called him “a wild, screaming, unlearned fanatic,” as usual, that prophet forerunning the first coming of Jesus.
Lee, ha chìri ya ọchì, kpọọ ya onye n'emebiga ihe oke bụ "mmadụ-ọhia, nke n'eti sọ mkpu, n'enweghi mmụta, onye ihe n'anụ-ọkụ n'obi", dika ọ na adi, na onye amụma ahụ bụ onye mbu-uzọ n'obibia Kraist nke mbu.

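This data point could be an outlier: the translation has 206 characters and 47 words against 139 characters and 22 words in the English source.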

df = df.drop(1119)
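
The row dropped above is identified by sentence number after inspecting the scatter plot. As a rough sketch of how candidates could be flagged automatically instead, the example below marks rows whose chars_V / chars_E ratio lies far from the median ratio, using the median absolute deviation (MAD) as the scale. The function name flag_ratio_outliers and the cutoff k=3.0 are assumptions for illustration; the sample rows are the first six IBO pairs plus the sentence dropped above.

```python
import pandas as pd

# Illustrative rows: the first six IBO pairs plus sentence 1120.
sample = pd.DataFrame({
    "s_rsen":  [1, 2, 3, 4, 5, 6, 1120],
    "chars_E": [20, 82, 60, 95, 80, 115, 139],
    "chars_V": [22, 80, 64, 84, 80, 101, 206],
})


def flag_ratio_outliers(frame, source_col, target_col, k=3.0):
    """Return a boolean Series marking rows whose target/source ratio lies
    more than k scaled MADs away from the median ratio."""
    ratio = frame[target_col] / frame[source_col]
    deviation = (ratio - ratio.median()).abs()
    # 1.4826 rescales the MAD so that it is comparable to a standard
    # deviation when the ratios are roughly normally distributed.
    scaled_mad = 1.4826 * deviation.median()
    return deviation > k * scaled_mad


mask = flag_ratio_outliers(sample, "chars_E", "chars_V")
print(sample.loc[mask, "s_rsen"].tolist())  # [1120]
```

Flagged rows would still need to be read by a person before being dropped, as is done in this notebook; a long ratio can reflect a legitimate translation choice.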

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-IBO-output.csv', sep='~', index=False, header=True)

Language IND

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-IND-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
   m_descriptor  t_lan_E  t_version  s_rsen    c_id  chars_E  words_E  t_lan_V  chars_V  words_V
0     1961-0218      ENG    19-0201       1  492254       14        3      IND       16        3
1     1961-0218      ENG    19-0201       2  492255       11        2      IND       23        2
2     1961-0218      ENG    19-0201       3  492256       47        9      IND       45        7
3     1961-0218      ENG    19-0201       4  492257       60       13      IND       78       11
4     1961-0218      ENG    19-0201       5  492258       35        9      IND       38        7
5     1961-0218      ENG    19-0201       6  492259       29        5      IND       38        5

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
# #outlier
# pd.set_option('display.max_colwidth',20)
# outdf = df[(df['m_descriptor']=='1965-0822x') & (df['s_rsen']==1)]
# outdf
# print(outdf.loc[:,['e_content_E']].values[0][0])
# print(outdf.loc[:,['e_content_V']].values[0][0])

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-IND-output.csv', sep='~', index=False, header=True)

Language LIN

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-LIN-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
   m_descriptor  t_lan_E  t_version  s_rsen    c_id  chars_E  words_E  t_lan_V  chars_V  words_V
0     1960-0626      ENG    19-0201       1  520727       41        7      LIN       53        9
1     1960-0626      ENG    19-0201       2  520728       87       17      LIN      106       17
2     1960-0626      ENG    19-0201       3  520729       56       11      LIN       60       11
3     1960-0626      ENG    19-0201       4  520730       53       10      LIN       75       15
4     1960-0626      ENG    19-0201       5  520731       33        6      LIN       40        7
5     1960-0626      ENG    19-0201       6  520732       99       19      LIN      149       28

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1960-0626') & (df['s_rsen']==88)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
87 1960-0626 ENG 19-0201 88 520814 You go all the w... 349 63 LIN M N 2020-06-14 00:25... 2 V c TE chaona Okokita boye kin... 463 79
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
You go all the way down the turnpike, and then about the same distance, or a little farther again, on over through, down by Brother Beeler’s place, and on through that city, and another city, and another city, and another city, to a little church that my grandfather built, a little Methodist church that I preached in twenty-five, thirty years ago.
Okokita boye kino na balabala esika bafutisaka mbongo mpo na koleka na ngambo mosusu, mpe na nsima okotambwisa pene na ntaka ndenge moko, to mwa mosika, okoleka pembeni na esika ya Ndeko Beeler, okokatisa engumba wana, mpe engumba mosusu, mpe engumba mosusu, mpe engumba mosusu lisusu, kino na mwa ndako na losambo moko oyo nkoko na ngai ya mobali atongaká, mwa ndako na losambo moko ya ba-Metodiste, nateya kuna eleki mibu ntuku mibale na mitano to ntuku misato.

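This data point could be an outlier: the translation has 463 characters and 79 words against 349 characters and 63 words in the English source.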

df = df.drop(87)

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-LIN-output.csv', sep='~', index=False, header=True)

Language LUA

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-LUA-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
   m_descriptor  t_lan_E  t_version  s_rsen    c_id  chars_E  words_E  t_lan_V  chars_V  words_V
0    1965-0801z      ENG    15-1101       1  134793       36        8      LUA       41        6
1    1965-0801z      ENG    15-1101       2  134794      205       38      LUA      246       42
2    1965-0801z      ENG    15-1101       3  134795      182       35      LUA      218       36
3    1965-0801z      ENG    15-1101       4  134796       26        5      LUA       32        4
4    1965-0801z      ENG    15-1101       5  134797       79       17      LUA      108       16
5    1965-0801z      ENG    15-1101       6  134798      104       21      LUA      132       20

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
# #outlier
# pd.set_option('display.max_colwidth',20)
# outdf = df[(df['m_descriptor']=='1965-0822x') & (df['s_rsen']==1)]
# outdf
# print(outdf.loc[:,['e_content_E']].values[0][0])
# print(outdf.loc[:,['e_content_V']].values[0][0])

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-LUA-output.csv', sep='~', index=False, header=True)

Language LUG

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-LUG-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
   m_descriptor  t_lan_E  t_version  s_rsen    c_id  chars_E  words_E  t_lan_V  chars_V  words_V
0     1965-0219      ENG    15-1101       1  306022       79       16      LUG       99       16
1     1965-0219      ENG    15-1101       2  306023      144       22      LUG      159       25
2     1965-0219      ENG    15-1101       3  306024       83       14      LUG       63        9
3     1965-0219      ENG    15-1101       4  306025      215       37      LUG      199       29
4     1965-0219      ENG    15-1101       5  306026       47        9      LUG       50        8
5     1965-0219      ENG    15-1101       6  306027      161       28      LUG      153       21

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-0219') & (df['s_rsen']==87)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
86 1965-0219 ENG 15-1101 87 306108 It was the last ... 120 23 LUG M N 2018-07-31 09:40... 1 V c TE julmuk Olukunngaana olw... 58 9
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
It was the last day, of the service that I was to speak at the International Convention of the Full Gospel Business Men.
Olukunngaana olw’ensi yonna olwa Full Gospel Business Men.

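This data point could be an outlier: the translation has only 58 characters and 9 words against 120 characters and 23 words in the English source, which suggests that part of the sentence was left untranslated.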

df = df.drop(86)

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-0219') & (df['s_rsen']==830)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
829 1965-0219 ENG 15-1101 830 306851 And remember, Ab... 160 43 LUG M N 2018-08-17 13:54... 1 V c TE julmuk 208 Kati jjukira... 205 52
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
And remember, Abraham, his name was *Abram* a few days before that, and Sarah was *Sarra* before that; S-a-r-r-a then S-a-r-a-h, and A-b-r-a-m to A-b-r-a-h-a-m.
208 Kati jjukira, Ibulayimu, erinnya lye yali Ibulaamu emabegako ng’ekyo tekinnabaawo, era ne Saala nga ye Salayi ekyo nga tekinnabaawo; S-a-l-a-y-i kati S-a-a-l-a, ne I-b-u-l-a-a-m-u mu I-b-u-l-a-y-i-m-u.

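This data point could be an outlier: the translation has 52 words against 43 in the English source. Both sentences spell names out letter by letter, and the translation also begins with the number 208, which does not appear in the English text.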

df = df.drop(829)
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-LUG-output.csv', sep='~', index=False, header=True)

Language POB

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-POB-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
   m_descriptor  t_lan_E  t_version  s_rsen    c_id  chars_E  words_E  t_lan_V  chars_V  words_V
0     1950-0110      ENG    15-0901       1   51236       66       14      POB       73       14
1     1950-0110      ENG    15-0901       2   51237      105       22      POB      113       19
2     1950-0110      ENG    15-0901       3   51238       87       19      POB       89       17
3     1950-0110      ENG    15-0901       4   51239       90       17      POB      106       19
4     1950-0110      ENG    15-0901       5   51240       61       11      POB       64       10
5     1950-0110      ENG    15-0901       6   51241       94       18      POB       96       17

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1963-0412x') & (df['s_rsen']==1283)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
2170 1963-0412x ENG 15-0402 1283 578647 You ought to go ... 107 20 POB M N 2019-07-06 18:10... 1 V c CE calamo Vocês deveriam v... 162 28
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
You ought to go back into Africa, the Hottentots, and let them kill an animal and blood theirself all over.
Vocês deveriam voltar para a África, os Hottentots [os pastores nômades indígenas não-bantus da África do Sul], e deixá-los matar um animal e sangue deles mesmos.

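This data point is an outlier because of over-translation: the translator inserted a bracketed explanatory note about the Hottentots that has no counterpart in the English source, which raises the translation to 162 characters and 28 words against 107 characters and 20 words.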

df = df.drop(2170)
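
Over-translation of the kind dropped above takes the form of a note in square brackets. A minimal sketch of how such insertions might be detected is shown below: it reports bracketed passages in the translation when the source sentence contains none. The helper name added_glosses is hypothetical, and the check covers only square brackets, not other ways a translator might add text.

```python
import re

# Matches a passage enclosed in square brackets, such as a translator's note.
BRACKETED = re.compile(r"\[[^\]]*\]")


def added_glosses(source: str, target: str) -> list:
    """Return the bracketed passages in target when source has none."""
    if BRACKETED.search(source):
        # The source itself uses brackets, so brackets in the target
        # are not evidence of added material.
        return []
    return BRACKETED.findall(target)


# Sentence pair printed above for document 1963-0412x, sentence 1283.
english = ("You ought to go back into Africa, the Hottentots, and let them "
           "kill an animal and blood theirself all over.")
portuguese = ("Vocês deveriam voltar para a África, os Hottentots [os pastores "
              "nômades indígenas não-bantus da África do Sul], e deixá-los matar "
              "um animal e sangue deles mesmos.")

print(added_glosses(english, portuguese))
```

Applied to a whole DataFrame, a check like this could list candidate rows for review before the scatter plot is drawn.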

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-POB-output.csv', sep='~', index=False, header=True)