Purpose

The number of words (and the number of characters) in a source-language sentence is related to the number in its target-language translation. If this relationship can be captured in a model of its own, that model can help in at least two ways:

For training: Validate the alignment of the two sentences in a training example by comparing their word counts and/or character counts
For inference: Validate the word count and/or character count of a translated or proofread sentence
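Both roles come down to checking whether a pair of counts is plausible. A minimal sketch, assuming a simple ratio test: the function name `length_ratio_ok` and the bounds `low` and `high` are hypothetical and not taken from this notebook, and real bounds would have to be estimated per language.

```python
def length_ratio_ok(n_source, n_target, low, high):
    """Return True when n_target / n_source lies within [low, high].

    n_source and n_target are word or character counts. low and high are
    hypothetical bounds that would have to be estimated for each language.
    """
    if n_source <= 0:
        return False  # no meaningful ratio for an empty source sentence
    ratio = n_target / n_source
    return low <= ratio <= high


# Illustrative bounds only; they are not fitted to the dataset.
print(length_ratio_ok(100, 120, low=0.5, high=2.0))
print(length_ratio_ok(100, 600, low=0.5, high=2.0))
```

The same check could serve training (reject badly aligned pairs) and inference (flag a translation whose length is implausible).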

In this notebook we continue to explore such models for each language and to evaluate their use in the roles above.

Dataset and Variables

The dataset used in this notebook contains the following features:

m_descriptor: Unique identifier of a document
t_lan_E: Language of the translation (English is also considered a translation)
t_version: Version of a translation
s_rsen: Number of a sentence within a document
c_id: Database primary key of a contribution
e_content_E: Text content of an English contribution
chars_E: Number of characters in an English contribution
words_E: Number of words in an English contribution
t_lan_V: Language of the translation
e_top: N/A
be_top: N/A
c_created_at: Creation time of a contribution
c_kind: Kind of a contribution
c_base: N/A
a_role: N/A
u_name: N/A
e_content_V: Text content of a translated contribution
chars_V: Number of characters in a translated contribution
words_V: Number of words in a translated contribution
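How the count columns were produced is not shown in this notebook. As a sketch under the assumption that characters are counted with `len()` and words by splitting on whitespace, the columns could be derived as follows (the toy rows are made up):

```python
import pandas as pd

# Toy frame using the dataset's column names; the sentences are made up.
toy = pd.DataFrame({
    'e_content_E': ['Good evening, friends.', 'Let us pray.'],
})

# Assumption: character count is the string length,
# word count is the number of whitespace-separated tokens.
toy['chars_E'] = toy['e_content_E'].str.len()
toy['words_E'] = toy['e_content_E'].str.split().str.len()
print(toy)
```

Whitespace splitting treats any run of text without spaces as one word, so word counts for languages written without spaces depend on how the text was segmented.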

Set Up the Environment

from pathlib import Path
import pandas as pd
%matplotlib inline

!python --version

Python 3.6.9

PATH = Path(base_dir + './')  # base_dir is assumed to be defined in an earlier cell

import plotly.express as px

Language CHN

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-CHN-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 )
fig.show()
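The plot above is inspected by eye to find outliers. As a complementary sketch, and not the method used in this notebook, a straight line can be fitted with `numpy.polyfit` and the points with the largest absolute residuals listed for manual review. The linear relationship is an assumption and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy data: five points near a line plus one obvious outlier (last row).
toy = pd.DataFrame({
    'chars_E': [20, 40, 60, 80, 100, 56],
    'chars_V': [7, 13, 20, 27, 33, 120],
})

# Least-squares fit of chars_V as a linear function of chars_E.
slope, intercept = np.polyfit(toy['chars_E'], toy['chars_V'], deg=1)
toy['residual'] = toy['chars_V'] - (slope * toy['chars_E'] + intercept)

# Index label of the row that deviates most from the fitted line.
worst = toy['residual'].abs().idxmax()
print(toy.loc[[worst]])
```

A flagged row is only a candidate: the text of both sentences still needs to be read before deciding to drop it.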

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1960-0607') & (df['s_rsen']==909)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

Paul, out in the ocean, obeying the commandments of God.
他们上去等着，他们认出那是神，他们立刻行动起来。你知道结果是什么。他们像醉汉一样摇摇晃晃，他们说方言，他们……哦，这是你一生中听到的最可怕的打扮，直到人们说:“这些人都喝了新酒。”不…是保罗在海里。(现在注意)……保罗在海里顺从神的诫命。

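This data point is definitely an outlier: the translated text covers several sentences, while the English text contains only the last of them.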

df = df.drop(39218)
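The row label passed to `drop` is hard-coded, which relies on the frame keeping the default index assigned by `read_csv`. A sketch of an alternative, using a toy frame with made-up counts, removes the row through the same descriptor and sentence number that were used to find it:

```python
import pandas as pd

# Toy frame with the dataset's column names; the counts are made up.
toy = pd.DataFrame({
    'm_descriptor': ['1960-0607', '1960-0607', '1955-1118'],
    's_rsen': [908, 909, 359],
    'chars_E': [40, 56, 450],
    'chars_V': [15, 120, 170],
})

# Boolean mask that selects the outlier by its identifying columns.
is_outlier = (toy['m_descriptor'] == '1960-0607') & (toy['s_rsen'] == 909)

# Keep every row that is not the outlier.
cleaned = toy[~is_outlier]
print(cleaned)
```

This variant keeps working if the rows are re-sorted or the index is reset before the drop.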

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1955-1118') & (df['s_rsen']==359)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

I'll cut my hand tonight with a knife, and I'd fall down here dead, you'd take my body out, and embalm it, and let a doctors, the best in the world, come every day and give me a shot of penicillin, and put sulfa drug in it, and whatever they want to, sew it up, and embalm my body with a fluid and make me look natural for fifty years, in fifty years from now that knife cut would look just exactly like it was when it was cut in the first place.
我要是今晚用刀割我的手，而且我倒在这里死了，你就会把我的身体带出去并进行防腐处理。然后让一个世界上最好的医生，每天都来给我打一针青霉素并放一些磺胺类药物，不管他们想放什么。把被刀切割的伤口缝好，然后用一种溶液来给我的身体防腐，使我在未来五十年里看起来像活人一样。但从现在开始的五十年里，被刀切割的切口看起来会和刚刚被切伤时是一样的。

This data point is definitely an outlier.

df = df.drop(19955)

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                )
fig.show()

df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-CHN-output.csv', sep='~', index=False, header=True)

Language FAS

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-FAS-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 )
fig.show()

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-1031y') & (df['s_rsen']==575)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

When you get out of school, when you get away from, you need another leader, but let that be Jesus.
هنگامی که شما از مدرسه فارقلتحصیل شدید، وقتی که شما از جو راهنمایی کننده مدرسه دورمی شوید، آن موقع می باشد که شما به یک رهبری دیگری احتیاج دارید ، بگذارید آن راهنما در زندگی شما عیسی مسیح باشد.

This data point could be an outlier.

df = df.drop(574)

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                )
fig.show()

df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-FAS-output.csv', sep='~', index=False, header=True)

Language FIJ

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-FIJ-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                )
fig.show()

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1963-0728') & (df['s_rsen']==1149)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

Where, these little doctrines, like Luther brought out catechism and everything else; and Wesley brought *this, that,* and the *other,* and these other things; and then Pentecost brought organization just the same, and “Father, Son, Holy Ghost” baptism, and things; not knowing any different, ’cause … 
Na vanua se evei, na vei vakavuvuli lalai kece; me vaka ea kauta mai ko Luca mai na yavu ni veika me baleta na Lotu, taro kei na kena i sau ni taro kei na kena veika kece tale eso, ko Wesele ea kauta mai oqo, koya, kei na dua tale oya, veika tale vaka koya; sa qai lako mai ko Penitiko sa qai kauta mai na veika vakayavu buli yavutaki ga ena veika vaka tamata me vaka ga na veika ea sa yaco tiko mai, kei na veipapitaisotaki "Tamana, Luvena, Yalo Tabu" kei na veika tale eso, era a sega tu ni a kila na kena duidui, 'e baleta ...

df = df.drop(1148)

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1963-0728') & (df['s_rsen']==612)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

That’s why these added books that they call the *Second Book of Daniel*, and the—the *Book of the Maccabees*, adds purgatory and stuff like that, see, it’s not spoke of in the Scripture.
O ya na vuna na vei i vola oqo ka na vakuri tale kina ka vakatokai na i Karua ni i Vola i Taniela, kei na—na nodra i Vola na Maccabees, ena vakuria kina na me vaka na nodra vakabauta na Katolika, e dua na vanua era na lako yani kina na yalodra na Tamata i Valavala Ca ka veisorovaki kina ni bera ni ra na lako ki lomalagi kei na veika kece sara tale vaka ko ya, raica, veika kece oqo era sega tu ni tukuni ena i Vola ni Kalou.

This data point could be an outlier.

df = df.drop(611)

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                )
fig.show()

df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-FIJ-output.csv', sep='~', index=False, header=True)

Language GER

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-GER-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-0725z') & (df['s_rsen']==509)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

That’s Malachi 3, was the first coming, now here is the next coming.
Das ist Maleachi 3, [Vers. 1. ] das war Sein erstes Kommen, jetzt hier ist Sein nächstes kommen. [Dies gehört nicht zu Maleachi 3, Vers 1, das Sein erstes Kommen betraf, sondern zu Seinem zweiten Kommen.]

This data point is an outlier because of the bracketed explanations added by the translator.
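Bracketed additions like the ones in this translation can be detected mechanically. A minimal sketch, assuming that text inside square brackets is always translator commentary (which may not hold for every document); the helper name `strip_bracketed` is hypothetical and the sample is a shortened form of the sentence above:

```python
import re

# Matches one [ ... ] span that contains no nested closing bracket.
BRACKETED = re.compile(r'\[[^\]]*\]')


def strip_bracketed(text):
    """Remove [ ... ] spans and collapse the whitespace left behind."""
    without = BRACKETED.sub(' ', text)
    return re.sub(r'\s+', ' ', without).strip()


sample = 'Das ist Maleachi 3, [Vers. 1. ] das war Sein erstes Kommen.'
print(bool(BRACKETED.search(sample)))
print(strip_bracketed(sample))
```

Counting characters and words on the stripped text, rather than dropping the row, would be one way to keep such pairs in the dataset.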

df = df.drop(19825)

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1964-0208') & (df['s_rsen']==223)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

We find out now that the Holy Spirit was the One now that identifies us with Jesus Christ.
Wir erfahren nun, daß der Heilige Geist der war jetzt, uns mit Jesus Christus identifiziert. Wir finden jetzt heraus, daß der Heilige Geist der Eine war, der uns jetzt mit Jesus Christus identifiziert.

This data point is an outlier; the translation seems to be repeated.
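A repeated translation can be hinted at by comparing the sentences inside one translated field with each other. A rough sketch using `difflib.SequenceMatcher` from the standard library; the helper name `max_sentence_similarity` is hypothetical, the sentence splitting is naive, and any cut-off for "too similar" would be an assumption to tune:

```python
import re
from difflib import SequenceMatcher


def max_sentence_similarity(text):
    """Return the highest pairwise similarity ratio between sentences in text.

    Sentences are split naively after '.', '!' or '?'. The result is 0.0
    when the text contains fewer than two sentences.
    """
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    best = 0.0
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            ratio = SequenceMatcher(None, sentences[i], sentences[j]).ratio()
            best = max(best, ratio)
    return best


translation = ('Wir erfahren nun, daß der Heilige Geist der war jetzt, uns mit '
               'Jesus Christus identifiziert. Wir finden jetzt heraus, daß der '
               'Heilige Geist der Eine war, der uns jetzt mit Jesus Christus '
               'identifiziert.')
print(round(max_sentence_similarity(translation), 2))
```

A high score only suggests a repetition; the pair should still be read before the row is dropped.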

df = df.drop(8272)

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1964-0314') & (df['s_rsen']==112)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

*Then Jesus beholding him loved him, and said unto him, One thing thou lackest: go thy way, sell whatsoever thou hast, and give to the poor, and thou shalt have* treasures *in heaven: and come*, and *take up* thy *cross, and follow me*.
Da liebte ihn Jesus und sprach zu ihm: Was dir fehlt, gehe hin und verkaufe, was du hast, und gib den Armen, und du wirst Schätze im Himmel haben; Folge mir.

This data point seems OK, so the row is kept.

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-GER-output.csv', sep='~', index=False, header=True)

Output of the preview cell for Language CHN (first six rows):
	m_descriptor	t_lan_E	t_version	s_rsen	c_id	chars_E	words_E	t_lan_V	chars_V	words_V
0	1951-0508	ENG	19-0201	1	634231	22	4	CHN	5	1
1	1951-0508	ENG	19-0201	2	634232	82	13	CHN	28	6
2	1951-0508	ENG	19-0201	3	634233	151	30	CHN	46	5
3	1951-0508	ENG	19-0201	4	634234	85	19	CHN	28	3
4	1951-0508	ENG	19-0201	5	634235	89	18	CHN	27	2
5	1951-0508	ENG	19-0201	6	634236	59	12	CHN	17	1

Output of the preview cell for Language FAS (first six rows):
	m_descriptor	t_lan_E	t_version	s_rsen	c_id	chars_E	words_E	t_lan_V	chars_V	words_V
0	1965-1031y	ENG	17-0901	1	72944	56	10	FAS	54	11
1	1965-1031y	ENG	17-0901	2	72945	14	3	FAS	20	4
2	1965-1031y	ENG	17-0901	3	72946	62	15	FAS	68	16
3	1965-1031y	ENG	17-0901	4	72947	62	11	FAS	84	15
4	1965-1031y	ENG	17-0901	5	72948	16	4	FAS	21	4
5	1965-1031y	ENG	17-0901	6	72949	43	7	FAS	42	8

Output of the preview cell for Language GER (first six rows):
	m_descriptor	t_lan_E	t_version	s_rsen	c_id	chars_E	words_E	t_lan_V	chars_V	words_V
0	1948-0304	ENG	15-0902	1	660286	98	21	GER	112	20
1	1948-0304	ENG	15-0902	2	660287	27	6	GER	28	5
2	1948-0304	ENG	15-0902	3	660288	36	7	GER	41	7
3	1948-0304	ENG	15-0902	4	660289	13	4	GER	18	4
4	1948-0304	ENG	15-0902	5	660290	85	18	GER	106	18
5	1948-0304	ENG	15-0902	6	660291	36	9	GER	36	8

Output of the preview cell for Language FIJ (first six rows):
	m_descriptor	t_lan_E	t_version	s_rsen	c_id	chars_E	words_E	t_lan_V	chars_V	words_V
0	1963-0728	ENG	15-0401	1	30399	27	4	FIJ	32	4
1	1963-0728	ENG	15-0401	2	30400	19	4	FIJ	36	5
2	1963-0728	ENG	15-0401	3	30401	64	12	FIJ	66	11
3	1963-0728	ENG	15-0401	4	30402	312	61	FIJ	385	70
4	1963-0728	ENG	15-0401	5	30403	360	70	FIJ	456	87
5	1963-0728	ENG	15-0401	6	30404	128	27	FIJ	131	25