Purpose

There is a relationship between the number of words (and the number of characters) in the source language and the target language. If this relationship can be established and captured in yet another model, such a model can be helpful in at least two ways:

  • For training: Validate the alignment of two sentences (in a training example) by comparing their word counts and/or character counts
  • For inference: Validate the word count and/or character count of a translated/proofread sentence

In this notebook we continue to develop a model for each language and to evaluate its use in the two roles above.
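As an illustration of the two roles above, here is a minimal sketch of a length model, assuming a plain linear relationship between English and translated character counts. The six (chars_E, chars_V) pairs are the first CHN rows displayed later in this notebook; `fit_line`, `looks_plausible`, and the 50% tolerance are hypothetical choices, not the model this project uses.

```python
# Sketch only: fit chars_V ~ slope * chars_E + intercept by ordinary least squares.
# The pairs are the (chars_E, chars_V) values of the first six CHN rows in this notebook.
pairs = [(22, 5), (82, 28), (151, 46), (85, 28), (89, 27), (59, 17)]


def fit_line(points):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def looks_plausible(chars_e, chars_v, slope, intercept, tolerance=0.5):
    """Hypothetical check: is chars_v within `tolerance` (relative) of the prediction?"""
    expected = slope * chars_e + intercept
    return abs(chars_v - expected) <= tolerance * max(expected, 1.0)


slope, intercept = fit_line(pairs)
print(looks_plausible(85, 28, slope, intercept))   # True: a row from the head of the CHN data
print(looks_plausible(56, 119, slope, intercept))  # False: the first CHN row inspected below
```

The same check could serve both roles: applied to a training pair it tests alignment, and applied to a fresh translation it tests whether its length is plausible.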

Dataset and Variables

The dataset used in this notebook contains the following features:

  • m_descriptor: Unique identifier of a document
  • t_lan_E: Language of the translation (English is also considered a translation)
  • t_version: Version of a translation
  • s_rsen: Number of a sentence within a document
  • c_id: Database primary key of a contribution
  • e_content_E: Text content of an English contribution
  • chars_E: Number of characters in an English contribution
  • words_E: Number of words in an English contribution
  • t_lan_V: Language of the translation
  • e_top: N/A
  • be_top: N/A
  • c_created_at: Creation time of a contribution
  • c_kind: Kind of a contribution
  • c_base: N/A
  • a_role: N/A
  • u_name: N/A
  • e_content_V: Text content of a translated contribution
  • chars_V: Number of characters in a translated contribution
  • words_V: Number of words in a translated contribution
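How the count features were computed is not shown in this notebook. The values for the sentences printed further down are consistent with counting every character and counting runs of word characters, so the following is a minimal sketch under that assumption; `char_count` and `word_count` are hypothetical names.

```python
import re


def char_count(text):
    """Number of characters, including spaces and punctuation."""
    return len(text)


def word_count(text):
    """Number of runs of word characters (assumed rule, not confirmed by the source)."""
    return len(re.findall(r'\w+', text))


# Two English sentences that are printed later in this notebook.
paul = 'Paul, out in the ocean, obeying the commandments of God.'
malachi = "That’s Malachi 3, was the first coming, now here is the next coming."

print(char_count(paul), word_count(paul))        # 56 10, matching chars_E and words_E
print(char_count(malachi), word_count(malachi))  # 68 14, "That’s" counts as two words
```

Under this rule a run of Chinese characters between punctuation marks counts as one word, which would explain the low words_V values in the CHN data.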

Set Up the Environment

from pathlib import Path
import pandas as pd
%matplotlib inline
!python --version
Python 3.6.9
PATH = Path(base_dir)  # base_dir is assumed to be defined before this cell
import plotly.express as px

Language CHN

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-CHN-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
   m_descriptor  t_lan_E  t_version  s_rsen    c_id  chars_E  words_E  t_lan_V  chars_V  words_V
0     1951-0508      ENG    19-0201       1  634231       22        4      CHN        5        1
1     1951-0508      ENG    19-0201       2  634232       82       13      CHN       28        6
2     1951-0508      ENG    19-0201       3  634233      151       30      CHN       46        5
3     1951-0508      ENG    19-0201       4  634234       85       19      CHN       28        3
4     1951-0508      ENG    19-0201       5  634235       89       18      CHN       27        2
5     1951-0508      ENG    19-0201       6  634236       59       12      CHN       17        1

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 )
fig.show()
# Inspect a suspected outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1960-0607') & (df['s_rsen']==909)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
39218 1960-0607 ENG 19-0401 909 683539 Paul, out in the... 56 10 CHN M N 2019-09-10 04:28... 1 V c TE dawnxu 他们上去等着,他们认出那是神,他... 119 15
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
Paul, out in the ocean, obeying the commandments of God.
他们上去等着,他们认出那是神,他们立刻行动起来。你知道结果是什么。他们像醉汉一样摇摇晃晃,他们说方言,他们……哦,这是你一生中听到的最可怕的打扮,直到人们说:“这些人都喝了新酒。”不…是保罗在海里。(现在注意)……保罗在海里顺从神的诫命。

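This data point is definitely an outlier: the Chinese contribution contains several sentences (119 characters), while the English contribution is a single sentence of 56 characters.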

df = df.drop(39218)
# Inspect a suspected outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1955-1118') & (df['s_rsen']==359)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
19955 1955-1118 ENG 15-0401 359 117132 I'll cut my hand... 446 96 CHN M N 2018-01-27 22:37... 1 V c TE estzhe 我要是今晚用刀割我的手,而且我倒... 166 11
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
I'll cut my hand tonight with a knife, and I'd fall down here dead, you'd take my body out, and embalm it, and let a doctors, the best in the world, come every day and give me a shot of penicillin, and put sulfa drug in it, and whatever they want to, sew it up, and embalm my body with a fluid and make me look natural for fifty years, in fifty years from now that knife cut would look just exactly like it was when it was cut in the first place.
我要是今晚用刀割我的手,而且我倒在这里死了,你就会把我的身体带出去并进行防腐处理。然后让一个世界上最好的医生,每天都来给我打一针青霉素并放一些磺胺类药物,不管他们想放什么。把被刀切割的伤口缝好,然后用一种溶液来给我的身体防腐,使我在未来五十年里看起来像活人一样。但从现在开始的五十年里,被刀切割的切口看起来会和刚刚被切伤时是一样的。

This data point is definitely an outlier.

df = df.drop(19955)
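The outliers above were picked out by eye from the scatter plot. A simple ratio rule could pre-select candidates for such inspection. The following is a minimal sketch, assuming that a row is suspicious when its chars_V / chars_E ratio differs from the median ratio by more than a factor of two; `flag_by_ratio` and the factor are hypothetical. The values are the six head rows and the two inspected rows of the CHN data.

```python
from statistics import median

# index -> (chars_E, chars_V), taken from the CHN rows shown in this notebook
rows = {
    0: (22, 5), 1: (82, 28), 2: (151, 46), 3: (85, 28), 4: (89, 27), 5: (59, 17),
    39218: (56, 119),   # first inspected row
    19955: (446, 166),  # second inspected row
}


def flag_by_ratio(rows, factor=2.0):
    """Return indices whose chars_V / chars_E ratio is far from the median ratio."""
    ratios = {idx: chars_v / chars_e for idx, (chars_e, chars_v) in rows.items()}
    typical = median(ratios.values())
    return [idx for idx, ratio in ratios.items()
            if ratio > typical * factor or ratio < typical / factor]


print(flag_by_ratio(rows))  # [39218]
```

Row 19955 is not flagged by this rule even though it is dropped above, so a ratio rule would complement rather than replace the visual inspection.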

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                )
fig.show()
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-CHN-output.csv', sep='~', index=False, header=True)

Language FAS

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-FAS-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
   m_descriptor  t_lan_E  t_version  s_rsen   c_id  chars_E  words_E  t_lan_V  chars_V  words_V
0    1965-1031y      ENG    17-0901       1  72944       56       10      FAS       54       11
1    1965-1031y      ENG    17-0901       2  72945       14        3      FAS       20        4
2    1965-1031y      ENG    17-0901       3  72946       62       15      FAS       68       16
3    1965-1031y      ENG    17-0901       4  72947       62       11      FAS       84       15
4    1965-1031y      ENG    17-0901       5  72948       16        4      FAS       21        4
5    1965-1031y      ENG    17-0901       6  72949       43        7      FAS       42        8

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 )
fig.show()
# Inspect a suspected outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-1031y') & (df['s_rsen']==575)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
574 1965-1031y ENG 17-0901 575 73518 When you get out... 99 20 FAS M N 2019-09-15 15:28... 1 V c TE matdeh هنگامی که شما از... 193 38
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
When you get out of school, when you get away from, you need another leader, but let that be Jesus.
هنگامی که شما از مدرسه فارقلتحصیل شدید، وقتی که شما از جو راهنمایی کننده مدرسه دورمی شوید، آن موقع می باشد که شما به یک رهبری دیگری احتیاج دارید ، بگذارید آن راهنما در زندگی شما عیسی مسیح باشد.

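This data point could be an outlier: the Persian translation (193 characters, 38 words) is roughly twice as long as the English source (99 characters, 20 words).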

df = df.drop(574)

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                )
fig.show()
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-FAS-output.csv', sep='~', index=False, header=True)

Language FIJ

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-FIJ-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
   m_descriptor  t_lan_E  t_version  s_rsen   c_id  chars_E  words_E  t_lan_V  chars_V  words_V
0     1963-0728      ENG    15-0401       1  30399       27        4      FIJ       32        4
1     1963-0728      ENG    15-0401       2  30400       19        4      FIJ       36        5
2     1963-0728      ENG    15-0401       3  30401       64       12      FIJ       66       11
3     1963-0728      ENG    15-0401       4  30402      312       61      FIJ      385       70
4     1963-0728      ENG    15-0401       5  30403      360       70      FIJ      456       87
5     1963-0728      ENG    15-0401       6  30404      128       27      FIJ      131       25

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                )
fig.show()
# Inspect a suspected outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1963-0728') & (df['s_rsen']==1149)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
1148 1963-0728 ENG 15-0401 1149 31547 Where, these lit... 302 45 FIJ M M 2018-12-05 09:01... 2 V t TE pitrai Na vanua se evei... 529 110
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
Where, these little doctrines, like Luther brought out catechism and everything else; and Wesley brought *this, that,* and the *other,* and these other things; and then Pentecost brought organization just the same, and “Father, Son, Holy Ghost” baptism, and things; not knowing any different, ’cause … 
Na vanua se evei, na vei vakavuvuli lalai kece; me vaka ea kauta mai ko Luca mai na yavu ni veika me baleta na Lotu, taro kei na kena i sau ni taro kei na kena veika kece tale eso, ko Wesele ea kauta mai oqo, koya, kei na dua tale oya, veika tale vaka koya; sa qai lako mai ko Penitiko sa qai kauta mai na veika vakayavu buli yavutaki ga ena veika vaka tamata me vaka ga na veika ea sa yaco tiko mai, kei na veipapitaisotaki "Tamana, Luvena, Yalo Tabu" kei na veika tale eso, era a sega tu ni a kila na kena duidui, 'e baleta ...
df = df.drop(1148)
# Inspect a suspected outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1963-0728') & (df['s_rsen']==612)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
611 1963-0728 ENG 15-0401 612 31010 That’s why these... 186 36 FIJ M N 2018-01-30 01:12... 2 V c QE engest O ya na vuna na ... 426 93
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
That’s why these added books that they call the *Second Book of Daniel*, and the—the *Book of the Maccabees*, adds purgatory and stuff like that, see, it’s not spoke of in the Scripture.
O ya na vuna na vei i vola oqo ka na vakuri tale kina ka vakatokai na i Karua ni i Vola i Taniela, kei na—na nodra i Vola na Maccabees, ena vakuria kina na me vaka na nodra vakabauta na Katolika, e dua na vanua era na lako yani kina na yalodra na Tamata i Valavala Ca ka veisorovaki kina ni bera ni ra na lako ki lomalagi kei na veika kece sara tale vaka ko ya, raica, veika kece oqo era sega tu ni tukuni ena i Vola ni Kalou.

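This data point could be an outlier: the Fijian translation (426 characters, 93 words) is more than twice as long as the English source (186 characters, 36 words).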

df = df.drop(611)

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                )
fig.show()
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-FIJ-output.csv', sep='~', index=False, header=True)

Language GER

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-GER-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
   m_descriptor  t_lan_E  t_version  s_rsen    c_id  chars_E  words_E  t_lan_V  chars_V  words_V
0     1948-0304      ENG    15-0902       1  660286       98       21      GER      112       20
1     1948-0304      ENG    15-0902       2  660287       27        6      GER       28        5
2     1948-0304      ENG    15-0902       3  660288       36        7      GER       41        7
3     1948-0304      ENG    15-0902       4  660289       13        4      GER       18        4
4     1948-0304      ENG    15-0902       5  660290       85       18      GER      106       18
5     1948-0304      ENG    15-0902       6  660291       36        9      GER       36        8

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
# Inspect a suspected outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-0725z') & (df['s_rsen']==509)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
19825 1965-0725z ENG 17-0401 509 27680 That’s Malachi 3... 68 14 GER M N 2018-02-10 21:18... 2 V c TE eliwal Das ist Maleachi... 204 35
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
That’s Malachi 3, was the first coming, now here is the next coming.
Das ist Maleachi 3, [Vers. 1. ] das war Sein erstes Kommen, jetzt hier ist Sein nächstes kommen. [Dies gehört nicht zu Maleachi 3, Vers 1, das Sein erstes Kommen betraf, sondern zu Seinem zweiten Kommen.]

This data point is an outlier because of the bracketed explanations added by the translator.

df = df.drop(19825)
# Inspect a suspected outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1964-0208') & (df['s_rsen']==223)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
8272 1964-0208 ENG 15-0402 223 243251 We find out now ... 90 18 GER M N 2018-11-09 23:06... 2 V c TE eliwal Wir erfahren nun... 201 33
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
We find out now that the Holy Spirit was the One now that identifies us with Jesus Christ.
Wir erfahren nun, daß der Heilige Geist der war jetzt, uns mit Jesus Christus identifiziert. Wir finden jetzt heraus, daß der Heilige Geist der Eine war, der uns jetzt mit Jesus Christus identifiziert.

This data point is an outlier; the sentence seems to have been translated twice.

df = df.drop(8272)
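The look-up and drop steps repeated for each suspected outlier could be wrapped in small helpers. The following is a minimal sketch, assuming the column names used in this notebook; `find_sentence` and `drop_sentence` are hypothetical names, and the demo frame holds only the truncated strings displayed in the two GER tables above.

```python
import pandas as pd


def find_sentence(df, descriptor, sentence_no):
    """Return the rows for one sentence of one document."""
    mask = (df['m_descriptor'] == descriptor) & (df['s_rsen'] == sentence_no)
    return df[mask]


def drop_sentence(df, descriptor, sentence_no):
    """Return a copy of df without the rows for the given sentence."""
    rows = find_sentence(df, descriptor, sentence_no)
    return df.drop(index=rows.index)


# Tiny demo frame; the index values mirror the row labels of the two GER rows above.
demo = pd.DataFrame(
    {
        'm_descriptor': ['1965-0725z', '1964-0208'],
        's_rsen': [509, 223],
        'e_content_E': ['That’s Malachi 3...', 'We find out now ...'],
        'e_content_V': ['Das ist Maleachi...', 'Wir erfahren nun...'],
    },
    index=[19825, 8272],
)

demo = drop_sentence(demo, '1964-0208', 223)
print(list(demo.index))  # [19825]
```

Dropping by document and sentence number rather than by a hard-coded row label keeps the cell valid if the row order of the input file changes.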
# Inspect a suspected outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1964-0314') & (df['s_rsen']==112)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
13481 1964-0314 ENG 15-0402 112 692094 *Then Jesus beho... 236 43 GER M N 2019-08-27 11:38... 1 V c TE hugmes Da liebte ihn Je... 157 31
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
*Then Jesus beholding him loved him, and said unto him, One thing thou lackest: go thy way, sell whatsoever thou hast, and give to the poor, and thou shalt have* treasures *in heaven: and come*, and *take up* thy *cross, and follow me*.
Da liebte ihn Jesus und sprach zu ihm: Was dir fehlt, gehe hin und verkaufe, was du hast, und gib den Armen, und du wirst Schätze im Himmel haben; Folge mir.

This data point seems OK, so it is kept.

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-GER-output.csv', sep='~', index=False, header=True)