Purpose

There is a relationship between the number of words (and the number of characters) in the source language and the target language. If this relationship can be established and captured in yet another model, such a model can be helpful in at least two ways:

For training: Validate the alignment of two sentences (in a training example) by comparing their word counts and/or character counts
For inference: Validate the word count and/or character count of a translated/proofread sentence

In this notebook we follow through on that proposal: we look for such a model for each language and evaluate its use in the two roles above.
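
As a minimal sketch of the idea, and not necessarily the model this project ends up with, a straight line can be fitted to character counts and then used to check whether a sentence pair looks aligned. The six data points are the AFR rows from the preview table in this notebook; the function name `looks_aligned`, the threshold `k`, and the second test pair are assumptions made for illustration.

```python
import numpy as np

# First six AFR rows from this notebook's preview table (chars_E, chars_V).
chars_e = np.array([18, 276, 105, 121, 174, 113], dtype=float)
chars_v = np.array([17, 299, 120, 133, 169, 111], dtype=float)

# Least-squares line: chars_V ~ slope * chars_E + intercept.
slope, intercept = np.polyfit(chars_e, chars_v, deg=1)

# Spread of the residuals around the fitted line (two fitted parameters).
residuals = chars_v - (slope * chars_e + intercept)
sigma = residuals.std(ddof=2)

def looks_aligned(n_source, n_target, k=3.0):
    """Hypothetical check: is the target size within k sigma of the prediction?"""
    predicted = slope * n_source + intercept
    return abs(n_target - predicted) <= k * sigma

print(f"slope={slope:.3f}, intercept={intercept:.3f}, sigma={sigma:.3f}")
print(looks_aligned(121, 133))  # a pair taken from the sample rows
print(looks_aligned(21, 280))   # illustrative pair: short source, very long target
```

Six points are far too few for a usable model; the same steps would be applied to the full dataset for each language.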

Dataset and Variables

The dataset used in this notebook contains the following features:

m_descriptor: Unique identifier of a document
t_lan_E: Language of the translation (English is also considered a translation)
t_version: Version of a translation
s_rsen: Number of a sentence within a document
c_id: Database primary key of a contribution
e_content_E: Text content of an English contribution
chars_E: Number of characters in an English contribution
words_E: Number of words in an English contribution
t_lan_V: Language of the translation
e_top: N/A
be_top: N/A
c_created_at: Creation time of a contribution
c_kind: Kind of a contribution
c_base: N/A
a_role: N/A
u_name: N/A
e_content_V: Text content of a translated contribution
chars_V: Number of characters in a translated contribution
words_V: Number of words in a translated contribution
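
The notebook does not show how the `chars_*` and `words_*` columns were computed. A plausible convention, assumed here purely for illustration, is the raw string length for characters and whitespace splitting for words. The helper names `count_chars` and `count_words` are hypothetical.

```python
def count_chars(text: str) -> int:
    """Character count including spaces and punctuation (assumed convention)."""
    return len(text)

def count_words(text: str) -> int:
    """Word count by splitting on whitespace (assumed convention)."""
    return len(text.split())

sentence = "Let us bow our heads."
print(count_chars(sentence), count_words(sentence))
```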

Set Up the Environment

from pathlib import Path
import pandas as pd
import plotly.express as px
%matplotlib inline

!python --version

Python 3.6.9

PATH = Path(base_dir)  # base_dir is assumed to be defined earlier in the notebook

Get train/valid data

We now ingest all the data we prepared in part 1. Note that the content (e_content) for each contribution is not displayed as it often makes the presentation unwieldy.

Language AFR

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-AFR-output.csv', sep='~')
# df.loc[:5, ~df.columns.isin(['e_content_E','u_name','e_content_V','e_top','be_top','c_created_at','count','c_kind','c_base','a_role'])]
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                #  trendline="ols", 
                #  trendline="lowess", 
                #  trendline_color_override='black',
                 )
fig.show()
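
The outliers below were picked by eye from the scatter plot. As a complementary sketch, candidate rows can also be ranked by how far their character ratio lies from the median ratio. The `ratio` and `deviation` columns and the cut-off of three rows are assumptions; the sample rows are the AFR rows from the preview table in this notebook.

```python
import pandas as pd

# Sample rows copied from the AFR preview table in this notebook.
sample = pd.DataFrame({
    "s_rsen":  [1, 2, 3, 4, 5, 6],
    "chars_E": [18, 276, 105, 121, 174, 113],
    "chars_V": [17, 299, 120, 133, 169, 111],
})

# Ratio of translated to English characters for each sentence.
sample["ratio"] = sample["chars_V"] / sample["chars_E"]

# Absolute distance of each ratio from the median ratio.
sample["deviation"] = (sample["ratio"] - sample["ratio"].median()).abs()

# The rows with the largest deviation are candidates for manual inspection.
candidates = sample.sort_values("deviation", ascending=False).head(3)
print(candidates[["s_rsen", "chars_E", "chars_V", "ratio", "deviation"]])
```

A ranking like this only suggests where to look; each candidate still needs to be read, as is done for the rows below.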

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='CAB-05') & (df['s_rsen']==400)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

*The law of reproduction is that each specie brings forth after its own kind, even according to Genesis 1:11, ”And God said, Let the earth bring forth grass, and the herb yielding seed, and the fruit tree yielding fruit after his kind, whose seed is in itself, upon the earth: and it was so.“ Whatever life was in the seed came forth into a plant and thence into fruit.*
*Die wet van voortplanting is dat elke spesie voortbring volgens sy eie soort, selfs volgens Génesis 1: 11, "En God het gesê: Laat die aarde voortbring grasspruitjies, plante wat saad gee en bome wat, volgens hulle soorte, vrugte dra, waarin hulle saad is, op die aarde.*

Sure enough, this sentence is 'under-translated': it omits the final part of the English source sentence. We treat this data point as an outlier and remove it.

df = df.drop(70153)
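
Dropping by a hard-coded index label works only as long as the row order of the input file never changes. A sketch of an alternative, assuming the pair (`m_descriptor`, `s_rsen`) identifies the row, is to drop whatever the filter selects. The toy frame below stands in for `df`; its rows and index labels are illustrative only.

```python
import pandas as pd

# Toy frame standing in for df; rows and index labels are illustrative only.
toy = pd.DataFrame(
    {"m_descriptor": ["CAB-05", "CAB-05", "1965-0822x"],
     "s_rsen": [399, 400, 899]},
    index=[70152, 70153, 54210],
)

# Select the outlier by its identifying columns instead of a fixed label.
mask = (toy["m_descriptor"] == "CAB-05") & (toy["s_rsen"] == 400)
toy = toy.drop(toy[mask].index)
print(toy)
```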

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-0822x') & (df['s_rsen']==899)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

*All scripture* (yeah) is *given by inspiration of* (Prophets? No.) … *inspiration of* (What?) *God, and is profitable for doctrine*, and *reproof*, and *correction*, and *instruction in righteousness:*
*Die hele Skrif is deur God ingegee en is nuttig tot lering, tot weerlegging, tot teregwysing, tot onderwysing in die geregtigheid*,

The translation lacks the parenthetical interjections present on the English side. We treat this data point as an outlier as well.
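
To see how much of the English sentence sits inside such interjections, the parenthesized spans can be pulled out with a regular expression. This is only an illustrative sketch on an excerpt of the sentence printed above; the notebook itself simply drops the row.

```python
import re

# Excerpt of the English sentence printed above.
english = ("*All scripture* (yeah) is *given by inspiration of* (Prophets? No.) … "
           "*inspiration of* (What?) *God, and is profitable for doctrine*")

# Text inside each pair of parentheses.
interjections = re.findall(r"\(([^)]*)\)", english)

# The same sentence with the parenthesized spans removed.
stripped = re.sub(r"\s*\([^)]*\)", "", english)

print(interjections)
print(len(english.split()), len(stripped.split()))
```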

df = df.drop(54210)

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1964-0418z') & (df['s_rsen']==178)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

*Now when the Pharisee which had bidden him saw it, he spake within himself*, (remember, not out loud), *within himself, saying*, If this man … was *a prophet*, he would *know who and what manner of woman this is that* touched *him … for she is a sinner*.
*Toe die Fariseër wat Hom genooi het, dit sien, sê hy by homself: Hy as Hy 'n profeet was, sou geweet het wie en watter soort vrou dit is wat Hom aanraak; want sy is 'n sondares.*

Again, the translation lacks the parenthetical interjection present on the English side. We treat this data point as an outlier too.

df = df.drop(26290)

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 )
fig.show()

df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-AFR-output.csv', sep='~', index=False, header=True)

Language BEM

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-BEM-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 )
fig.show()

#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-0822x') & (df['s_rsen']==1)]
outdf

pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

Let us bow our heads.
Natukontamike imitwe yesu. Shikulu Yesu, ka Cema wa mukuni mukalamba,Ifye natukwata imisha ishingi kuli Imwe,Shikulu, isho ifwe ta twa katale lipila pa citemwiko ico mwabika mu mitima Yesu. Ifwe tuleufwa abashalinga ilyo tunkontamike imitwe yesu no kwiminina mumulola Wenu.

Here the translation runs well beyond the single short English sentence, so this data point is clearly an outlier.

df = df.drop(24010)

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                )
fig.show()

df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-BEM-output.csv', sep='~', index=False, header=True)
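
The AFR and BEM sections read and write files that differ only in the language code and the stage number. A small helper, sketched here under the assumption that this naming pattern holds for every language, keeps those paths in one place. The name `stage_file` is hypothetical.

```python
from pathlib import Path

def stage_file(base: Path, lang: str, stage: int) -> Path:
    """Build the CSV path for a language and stage, following this notebook's naming."""
    name = "PredictTranslationWordAndCharCount"
    return base / name / f"{name}_{stage}-{lang}-output.csv"

# Stage 2 is the input of this notebook, stage 3 its output.
for lang in ["AFR", "BEM"]:
    src = stage_file(Path("."), lang, 2)
    dst = stage_file(Path("."), lang, 3)
    print(src, "->", dst)
```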

Output of the AFR preview (first six rows):
	m_descriptor	t_lan_E	t_version	s_rsen	c_id	chars_E	words_E	t_lan_V	chars_V	words_V
0	1958-0928z	ENG	14-0101	1	461719	18	4	AFR	17	4
1	1958-0928z	ENG	14-0101	2	461720	276	51	AFR	299	53
2	1958-0928z	ENG	14-0101	3	461721	105	20	AFR	120	21
3	1958-0928z	ENG	14-0101	4	461722	121	22	AFR	133	23
4	1958-0928z	ENG	14-0101	5	461723	174	34	AFR	169	34
5	1958-0928z	ENG	14-0101	6	461724	113	21	AFR	111	19

Output of the BEM preview (first six rows):
	m_descriptor	t_lan_E	t_version	s_rsen	c_id	chars_E	words_E	t_lan_V	chars_V	words_V
0	1955-0121	ENG	15-0401	1	846565	23	4	BEM	24	3
1	1955-0121	ENG	15-0401	2	846566	27	4	BEM	27	4
2	1955-0121	ENG	15-0401	3	846567	38	6	BEM	38	5
3	1955-0121	ENG	15-0401	4	846568	44	10	BEM	46	8
4	1955-0121	ENG	15-0401	5	846569	198	37	BEM	213	34
5	1955-0121	ENG	15-0401	6	846570	158	29	BEM	167	27