Purpose

The number of words (and the number of characters) in a source-language sentence is related to the corresponding number in its target-language translation. If this relationship can be established and captured in a separate model, that model can be helpful in at least two ways:

For training: Validate the alignment of two sentences (in a training example) by comparing their word counts and/or character counts
For inference: Validate the word count and/or character count of a translated/proofread sentence
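
As a rough illustration of both roles, the check below flags sentence pairs whose target/source length ratio strays too far from an expected value. This is only a sketch: the function name `flag_length_mismatch`, the `expected_ratio` and `tolerance` values, and the three sample rows are all made up for the example; in practice the expected ratio would come from a model fitted per language.

```python
import pandas as pd

def flag_length_mismatch(df, src_col="chars_E", tgt_col="chars_V",
                         expected_ratio=1.1, tolerance=0.5):
    """Return a boolean Series that is True where tgt/src deviates from
    expected_ratio by more than tolerance * expected_ratio.

    expected_ratio and tolerance are hypothetical placeholders, not values
    derived from the dataset.
    """
    ratio = df[tgt_col] / df[src_col]
    return (ratio - expected_ratio).abs() > tolerance * expected_ratio

# Invented sample pairs: one plausible, one too short, one too long.
pairs = pd.DataFrame({
    "chars_E": [100, 100, 100],
    "chars_V": [110, 40, 230],
})
pairs["suspect"] = flag_length_mismatch(pairs)
print(pairs)
```

The same function could be applied to `words_E` and `words_V` by passing different column names.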

In this notebook we continue to discover a model for each language and to evaluate its use in the two roles above.

Dataset and Variables

The dataset used in this notebook contains the following features:

m_descriptor: Unique identifier of a document
t_lan_E: Language of the translation (English is also considered a translation)
t_version: Version of a translation
s_rsen: Number of a sentence within a document
c_id: Database primary key of a contribution
e_content_E: Text content of an English contribution
chars_E: Number of characters in an English contribution
words_E: Number of words in an English contribution
t_lan_V: Language of the translation
e_top: N/A
be_top: N/A
c_created_at: Creation time of a contribution
c_kind: Kind of a contribution
c_base: N/A
a_role: N/A
u_name: N/A
e_content_V: Text content of a translated contribution
chars_V: Number of characters in a translated contribution
words_V: Number of words in a translated contribution
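
The notebook does not state how the count columns were produced. A minimal sketch, assuming characters are counted with `len()` and words by splitting on whitespace (the helper name `count_chars_and_words` is hypothetical):

```python
def count_chars_and_words(text):
    """Return (character count, word count) for one sentence.

    Assumption: every character (including spaces and punctuation) counts,
    and words are whitespace-separated tokens. The dataset's actual counting
    rules may differ.
    """
    return len(text), len(text.split())

chars, words = count_chars_and_words("Look, they laughed at him.")
print(chars, words)
```

Under these assumptions, punctuation attached to a word does not add to the word count but does add to the character count.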

Set Up the Environment

from pathlib import Path
import pandas as pd
%matplotlib inline

!python --version

Python 3.6.9

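# base_dir must already be defined (its definition is not shown in this notebook).
# Appending './' to the string could silently change the directory name when
# base_dir has no trailing slash, so the path is built from base_dir directly.
PATH = Path(base_dir)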

import plotly.express as px

Language IBO

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-IBO-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-1206') & (df['s_rsen']==1120)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

Look, they laughed at him, called him “a wild, screaming, unlearned fanatic,” as usual, that prophet forerunning the first coming of Jesus.
Lee, ha chìri ya ọchì, kpọọ ya onye n'emebiga ihe oke bụ "mmadụ-ọhia, nke n'eti sọ mkpu, n'enweghi mmụta, onye ihe n'anụ-ọkụ n'obi", dika ọ na adi, na onye amụma ahụ bụ onye mbu-uzọ n'obibia Kraist nke mbu.

This data-point could be an outlier.

df = df.drop(outdf.index)  # drop the row inspected above (index label 1119)
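
Outliers in this notebook are picked by eye from the scatter plot. A possible automated complement, sketched under assumptions, is to fit a straight line and flag points whose residual is unusually large. The function name `flag_outliers`, the threshold `z_thresh=3.0`, and the synthetic data are illustrative choices, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def flag_outliers(df, x="chars_E", y="chars_V", z_thresh=3.0):
    """Fit y = slope * x + intercept by least squares and return a boolean
    Series that is True where the standardized residual exceeds z_thresh."""
    slope, intercept = np.polyfit(df[x], df[y], deg=1)
    residuals = df[y] - (slope * df[x] + intercept)
    z = (residuals - residuals.mean()) / residuals.std()
    return z.abs() > z_thresh

# Synthetic data: a perfect linear relationship with one corrupted row.
x = np.arange(10, 210)
demo = pd.DataFrame({"chars_E": x, "chars_V": 1.2 * x})
demo.loc[100, "chars_V"] += 300  # inject a single, obvious outlier
mask = flag_outliers(demo)
print(demo[mask])
```

Flagged rows would still need a manual look at `e_content_E` and `e_content_V`, since a long or short translation is not necessarily a wrong one.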

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-IBO-output.csv', sep='~', index=False, header=True)

Language IND

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-IND-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

# #outlier
# pd.set_option('display.max_colwidth',20)
# outdf = df[(df['m_descriptor']=='1965-0822x') & (df['s_rsen']==1)]
# outdf

# print(outdf.loc[:,['e_content_E']].values[0][0])
# print(outdf.loc[:,['e_content_V']].values[0][0])

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-IND-output.csv', sep='~', index=False, header=True)

Language LIN

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-LIN-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1960-0626') & (df['s_rsen']==88)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

You go all the way down the turnpike, and then about the same distance, or a little farther again, on over through, down by Brother Beeler’s place, and on through that city, and another city, and another city, and another city, to a little church that my grandfather built, a little Methodist church that I preached in twenty-five, thirty years ago.
Okokita boye kino na balabala esika bafutisaka mbongo mpo na koleka na ngambo mosusu, mpe na nsima okotambwisa pene na ntaka ndenge moko, to mwa mosika, okoleka pembeni na esika ya Ndeko Beeler, okokatisa engumba wana, mpe engumba mosusu, mpe engumba mosusu, mpe engumba mosusu lisusu, kino na mwa ndako na losambo moko oyo nkoko na ngai ya mobali atongaká, mwa ndako na losambo moko ya ba-Metodiste, nateya kuna eleki mibu ntuku mibale na mitano to ntuku misato.

This data-point could be an outlier.

df = df.drop(outdf.index)  # drop the row inspected above (index label 87)

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-LIN-output.csv', sep='~', index=False, header=True)

Language LUA

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-LUA-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

# #outlier
# pd.set_option('display.max_colwidth',20)
# outdf = df[(df['m_descriptor']=='1965-0822x') & (df['s_rsen']==1)]
# outdf

# print(outdf.loc[:,['e_content_E']].values[0][0])
# print(outdf.loc[:,['e_content_V']].values[0][0])

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-LUA-output.csv', sep='~', index=False, header=True)

Language LUG

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-LUG-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-0219') & (df['s_rsen']==87)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

It was the last day, of the service that I was to speak at the International Convention of the Full Gospel Business Men.
Olukunngaana olw’ensi yonna olwa Full Gospel Business Men.

This data point could be an outlier: the translation is far shorter than the English sentence.

df = df.drop(outdf.index)  # drop the row inspected above (index label 86)

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-0219') & (df['s_rsen']==830)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

And remember, Abraham, his name was *Abram* a few days before that, and Sarah was *Sarra* before that; S-a-r-r-a then S-a-r-a-h, and A-b-r-a-m to A-b-r-a-h-a-m.
208 Kati jjukira, Ibulayimu, erinnya lye yali Ibulaamu emabegako ng’ekyo tekinnabaawo, era ne Saala nga ye Salayi ekyo nga tekinnabaawo; S-a-l-a-y-i kati S-a-a-l-a, ne I-b-u-l-a-a-m-u mu I-b-u-l-a-y-i-m-u.

This data-point could be an outlier.

df = df.drop(outdf.index)  # drop the row inspected above (index label 829)
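
When several sentences have to be removed, as in this section, the rows can also be dropped in one step by their (m_descriptor, s_rsen) keys instead of by index label. The following is a sketch only: the helper `drop_sentences` is hypothetical, and the `words_E` values in the demo frame are invented.

```python
import pandas as pd

def drop_sentences(df, keys):
    """Return df without the rows whose (m_descriptor, s_rsen) pair is in keys."""
    index = pd.MultiIndex.from_frame(df[["m_descriptor", "s_rsen"]])
    return df[~index.isin(keys)]

# Small invented frame containing the two sentence numbers examined above.
demo = pd.DataFrame({
    "m_descriptor": ["1965-0219"] * 4,
    "s_rsen": [86, 87, 829, 830],
    "words_E": [10, 23, 12, 30],
})
cleaned = drop_sentences(demo, [("1965-0219", 87), ("1965-0219", 830)])
print(cleaned["s_rsen"].tolist())
```

Keying on the document and sentence number keeps the removal valid even if the row order or the index of the CSV changes between runs.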

df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-LUG-output.csv', sep='~', index=False, header=True)

Language POB

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-POB-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1963-0412x') & (df['s_rsen']==1283)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

You ought to go back into Africa, the Hottentots, and let them kill an animal and blood theirself all over.
Vocês deveriam voltar para a África, os Hottentots [os pastores nômades indígenas não-bantus da África do Sul], e deixá-los matar um animal e sangue deles mesmos.

This data point is an outlier due to over-translation: the translator added a bracketed explanatory note that is not in the English sentence.

df = df.drop(outdf.index)  # drop the row inspected above (index label 2170)

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-POB-output.csv', sep='~', index=False, header=True)

	m_descriptor	t_lan_E	t_version	s_rsen	c_id	chars_E	words_E	t_lan_V	chars_V	words_V
0	1965-1206	ENG	14-0901	1	3195	20	5	IBO	22	5
1	1965-1206	ENG	14-0901	2	3196	82	14	IBO	80	19
2	1965-1206	ENG	14-0901	3	3197	60	10	IBO	64	14
3	1965-1206	ENG	14-0901	4	3198	95	18	IBO	84	19
4	1965-1206	ENG	14-0901	5	3199	80	17	IBO	80	16
5	1965-1206	ENG	14-0901	6	3200	115	23	IBO	101	25

	m_descriptor	t_lan_E	t_version	s_rsen	c_id	chars_E	words_E	t_lan_V	chars_V	words_V
0	1961-0218	ENG	19-0201	1	492254	14	3	IND	16	3
1	1961-0218	ENG	19-0201	2	492255	11	2	IND	23	2
2	1961-0218	ENG	19-0201	3	492256	47	9	IND	45	7
3	1961-0218	ENG	19-0201	4	492257	60	13	IND	78	11
4	1961-0218	ENG	19-0201	5	492258	35	9	IND	38	7
5	1961-0218	ENG	19-0201	6	492259	29	5	IND	38	5

	m_descriptor	t_lan_E	t_version	s_rsen	c_id	chars_E	words_E	t_lan_V	chars_V	words_V
0	1965-0801z	ENG	15-1101	1	134793	36	8	LUA	41	6
1	1965-0801z	ENG	15-1101	2	134794	205	38	LUA	246	42
2	1965-0801z	ENG	15-1101	3	134795	182	35	LUA	218	36
3	1965-0801z	ENG	15-1101	4	134796	26	5	LUA	32	4
4	1965-0801z	ENG	15-1101	5	134797	79	17	LUA	108	16
5	1965-0801z	ENG	15-1101	6	134798	104	21	LUA	132	20

	m_descriptor	t_lan_E	t_version	s_rsen	c_id	chars_E	words_E	t_lan_V	chars_V	words_V
0	1965-0219	ENG	15-1101	1	306022	79	16	LUG	99	16
1	1965-0219	ENG	15-1101	2	306023	144	22	LUG	159	25
2	1965-0219	ENG	15-1101	3	306024	83	14	LUG	63	9
3	1965-0219	ENG	15-1101	4	306025	215	37	LUG	199	29
4	1965-0219	ENG	15-1101	5	306026	47	9	LUG	50	8
5	1965-0219	ENG	15-1101	6	306027	161	28	LUG	153	21

	m_descriptor	t_lan_E	t_version	s_rsen	c_id	chars_E	words_E	t_lan_V	chars_V	words_V
0	1950-0110	ENG	15-0901	1	51236	66	14	POB	73	14
1	1950-0110	ENG	15-0901	2	51237	105	22	POB	113	19
2	1950-0110	ENG	15-0901	3	51238	87	19	POB	89	17
3	1950-0110	ENG	15-0901	4	51239	90	17	POB	106	19
4	1950-0110	ENG	15-0901	5	51240	61	11	POB	64	10
5	1950-0110	ENG	15-0901	6	51241	94	18	POB	96	17

	m_descriptor	t_lan_E	t_version	s_rsen	c_id	chars_E	words_E	t_lan_V	chars_V	words_V
0	1960-0626	ENG	19-0201	1	520727	41	7	LIN	53	9
1	1960-0626	ENG	19-0201	2	520728	87	17	LIN	106	17
2	1960-0626	ENG	19-0201	3	520729	56	11	LIN	60	11
3	1960-0626	ENG	19-0201	4	520730	53	10	LIN	75	15
4	1960-0626	ENG	19-0201	5	520731	33	6	LIN	40	7
5	1960-0626	ENG	19-0201	6	520732	99	19	LIN	149	28