Purpose

The number of words (and the number of characters) in a source-language sentence is related to the number in its target-language translation. If this relationship can be established and captured in a model of its own, that model can help in at least two ways:

  • For training: Validate the alignment of the two sentences in a training example by comparing their word counts and/or character counts
  • For inference: Validate the word count and/or character count of a translated/proofread sentence against the source sentence

In this notebook we set out to discover such a model for each language and to evaluate its use in the two roles above.
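As a rough illustration of the idea, the sketch below fits a straight line to a handful of (English, translation) character counts and then checks whether a new pair falls within a tolerance of the prediction. It is a minimal sketch, not the model developed in this series: the helper names `fit_line` and `is_plausible`, the linear form, and the 25% tolerance are all assumptions made for illustration. The sample counts are the six AFR rows displayed later in this notebook.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def is_plausible(source_count, target_count, slope, intercept, tolerance=0.25):
    """True if target_count is within a relative tolerance of the predicted count.

    The 25% default tolerance is an arbitrary illustrative choice.
    """
    predicted = slope * source_count + intercept
    return abs(target_count - predicted) <= tolerance * max(predicted, 1.0)


# Character counts (chars_E, chars_V) of the six AFR rows shown in this notebook
chars_e = [18, 276, 105, 121, 174, 113]
chars_v = [17, 299, 120, 133, 169, 111]
slope, intercept = fit_line(chars_e, chars_v)

# Check one ordinary pair and one pair later identified as an outlier
print(is_plausible(174, 169, slope, intercept))
print(is_plausible(370, 271, slope, intercept))
```

The same check could be applied to word counts by fitting on `words_E` and `words_V` instead.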

Dataset and Variables

The dataset used in this notebook contains the following features:

  • m_descriptor: Unique identifier of a document
  • t_lan_E: Language of the translation (English is also considered a translation)
  • t_version: Version of a translation
  • s_rsen: Number of a sentence within a document
  • c_id: Database primary key of a contribution
  • e_content_E: Text content of an English contribution
  • chars_E: Number of characters in an English contribution
  • words_E: Number of words in an English contribution
  • t_lan_V: Language of the translation
  • e_top: N/A
  • be_top: N/A
  • c_created_at: Creation time of a contribution
  • c_kind: Kind of a contribution
  • c_base: N/A
  • a_role: N/A
  • u_name: N/A
  • e_content_V: Text content of a translated contribution
  • chars_V: Number of characters in a translated contribution
  • words_V: Number of words in a translated contribution
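
The counting rule behind the chars_* and words_* features was applied in part 1 and is not shown in this notebook. The sketch below assumes the simplest possible rule: characters counted with `len()` and words counted by splitting on whitespace. The helper `count_chars_and_words` is hypothetical. For the sentence "Let us bow our heads." this rule agrees with the chars_E and words_E values shown in the BEM data further down, but that single match does not confirm the rule in general.

```python
def count_chars_and_words(text):
    """Return (character count, whitespace-delimited word count) for a sentence.

    Assumed counting rule; the rule actually used in part 1 may differ.
    """
    return len(text), len(text.split())


chars, words = count_chars_and_words("Let us bow our heads.")
print(chars, words)
```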

Set Up the Environment

from pathlib import Path
import pandas as pd
import plotly.express as px
%matplotlib inline
!python --version
Python 3.6.9
# base_dir is not set in this notebook; it must be defined before this cell runs
PATH = Path(base_dir)

Get train/valid data

We now ingest the data we prepared in part 1. Note that the text content (e_content_E and e_content_V) of each contribution is not displayed in the previews, as it often makes the presentation unwieldy.

Language AFR

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-AFR-output.csv', sep='~')
# df.loc[:5, ~df.columns.isin(['e_content_E','u_name','e_content_V','e_top','be_top','c_created_at','count','c_kind','c_base','a_role'])]
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
   m_descriptor  t_lan_E  t_version  s_rsen    c_id  chars_E  words_E  t_lan_V  chars_V  words_V
0    1958-0928z      ENG    14-0101       1  461719       18        4      AFR       17        4
1    1958-0928z      ENG    14-0101       2  461720      276       51      AFR      299       53
2    1958-0928z      ENG    14-0101       3  461721      105       20      AFR      120       21
3    1958-0928z      ENG    14-0101       4  461722      121       22      AFR      133       23
4    1958-0928z      ENG    14-0101       5  461723      174       34      AFR      169       34
5    1958-0928z      ENG    14-0101       6  461724      113       21      AFR      111       19

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                #  trendline="ols", 
                #  trendline="lowess", 
                #  trendline_color_override='black',
                 )
fig.show()
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='CAB-05') & (df['s_rsen']==400)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
70153 CAB-05 ENG 18-1101 400 704340 *The law of repr... 370 70 AFR T N 2020-03-26 00:11... 1 V c CE engest *Die wet van voo... 271 46
pd.set_option('display.max_colwidth',1000)
print(outdf['e_content_E'].iloc[0])
print(outdf['e_content_V'].iloc[0])
*The law of reproduction is that each specie brings forth after its own kind, even according to Genesis 1:11, ”And God said, Let the earth bring forth grass, and the herb yielding seed, and the fruit tree yielding fruit after his kind, whose seed is in itself, upon the earth: and it was so.“ Whatever life was in the seed came forth into a plant and thence into fruit.*
*Die wet van voortplanting is dat elke spesie voortbring volgens sy eie soort, selfs volgens Génesis 1: 11, "En God het gesê: Laat die aarde voortbring grasspruitjies, plante wat saad gee en bome wat, volgens hulle soorte, vrugte dra, waarin hulle saad is, op die aarde.*

Sure enough, this sentence is under-translated: the Afrikaans omits the final sentence of the English source ("Whatever life was in the seed ..."). We treat this data point as an outlier and remove it.

df = df.drop(70153)
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-0822x') & (df['s_rsen']==899)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
54210 1965-0822x ENG 18-0101 899 131669 *All scripture* ... 202 27 AFR T N 2018-03-14 20:04... 1 V c QE engest *Die hele Skrif ... 132 21
pd.set_option('display.max_colwidth',1000)
print(outdf['e_content_E'].iloc[0])
print(outdf['e_content_V'].iloc[0])
*All scripture* (yeah) is *given by inspiration of* (Prophets? No.) … *inspiration of* (What?) *God, and is profitable for doctrine*, and *reproof*, and *correction*, and *instruction in righteousness:*
*Die hele Skrif is deur God ingegee en is nuttig tot lering, tot weerlegging, tot teregwysing, tot onderwysing in die geregtigheid*,

The translation lacks the parenthetical interjections present on the English side. It is treated as an outlier and removed.

df = df.drop(54210)
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1964-0418z') & (df['s_rsen']==178)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
26290 1964-0418z ENG 15-0402 178 740031 *Now when the Ph... 255 46 AFR M M 2020-05-18 02:01... 2 V t QE kobes2 *Toe die Fariseë... 179 37
pd.set_option('display.max_colwidth',1000)
print(outdf['e_content_E'].iloc[0])
print(outdf['e_content_V'].iloc[0])
*Now when the Pharisee which had bidden him saw it, he spake within himself*, (remember, not out loud), *within himself, saying*, If this man … was *a prophet*, he would *know who and what manner of woman this is that* touched *him … for she is a sinner*.
*Toe die Fariseër wat Hom genooi het, dit sien, sê hy by homself: Hy as Hy 'n profeet was, sou geweet het wie en watter soort vrou dit is wat Hom aanraak; want sy is 'n sondares.*

Again, the translation lacks the parenthetical interjection present on the English side. It is treated as an outlier and removed.

df = df.drop(26290)
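
The outliers above were picked by eye from the scatter plot. As a complement, candidates could also be ranked programmatically. The sketch below is one possible approach, not the procedure used in this notebook: it computes the translation/English ratio per row and lists the rows that deviate most from the median ratio. The helper `rank_outlier_candidates` and its parameters are hypothetical, and the small sample frame reuses four rows shown above.

```python
import pandas as pd


def rank_outlier_candidates(frame, src='chars_E', tgt='chars_V', top=5):
    """Return the rows whose tgt/src ratio deviates most from the median ratio."""
    ratio = frame[tgt] / frame[src]
    deviation = (ratio - ratio.median()).abs()
    return frame.assign(ratio=ratio, deviation=deviation).nlargest(top, 'deviation')


# Four AFR rows taken from the outputs above (three regular rows, one removed outlier)
sample = pd.DataFrame({
    'm_descriptor': ['1958-0928z', '1958-0928z', '1958-0928z', 'CAB-05'],
    's_rsen': [2, 3, 4, 400],
    'chars_E': [276, 105, 121, 370],
    'chars_V': [299, 120, 133, 271],
})

candidates = rank_outlier_candidates(sample, top=1)
print(candidates[['m_descriptor', 's_rsen', 'ratio']])
```

Candidates found this way would still need to be inspected manually, as done above, before being dropped.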

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 )
fig.show()
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-AFR-output.csv', sep='~', index=False, header=True)

Language BEM

df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-BEM-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
   m_descriptor  t_lan_E  t_version  s_rsen    c_id  chars_E  words_E  t_lan_V  chars_V  words_V
0     1955-0121      ENG    15-0401       1  846565       23        4      BEM       24        3
1     1955-0121      ENG    15-0401       2  846566       27        4      BEM       27        4
2     1955-0121      ENG    15-0401       3  846567       38        6      BEM       38        5
3     1955-0121      ENG    15-0401       4  846568       44       10      BEM       46        8
4     1955-0121      ENG    15-0401       5  846569      198       37      BEM      213       34
5     1955-0121      ENG    15-0401       6  846570      158       29      BEM      167       27

Characters

fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 )
fig.show()
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-0822x') & (df['s_rsen']==1)]
outdf
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V chars_V words_V
24010 1965-0822x ENG 18-0101 1 130771 Let us bow our h... 21 5 BEM T NaN 2018-07-16 07:47... 1 V c TE davmwa Natukontamike im... 273 41
pd.set_option('display.max_colwidth',1000)
print(outdf['e_content_E'].iloc[0])
print(outdf['e_content_V'].iloc[0])
Let us bow our heads.
Natukontamike imitwe yesu. Shikulu Yesu, ka Cema wa mukuni mukalamba,Ifye natukwata imisha ishingi kuli Imwe,Shikulu, isho ifwe ta twa katale lipila pa citemwiko ico mwabika mu mitima Yesu. Ifwe tuleufwa abashalinga ilyo tunkontamike imitwe yesu no kwiminina mumulola Wenu.

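This data point is clearly an outlier: the translation (273 characters) contains several sentences beyond the 21-character English source. We remove it.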

df = df.drop(24010)
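
Dropping by a hard-coded index label such as 24010 only works while the row labels of the input file stay the same. A possible alternative, sketched below, is to drop by the same (m_descriptor, s_rsen) pair that was used to look the row up. The helper `drop_sentence` is hypothetical, and the small sample frame reuses three BEM rows shown above.

```python
import pandas as pd


def drop_sentence(frame, descriptor, sentence_no):
    """Return frame without the row(s) matching the descriptor and sentence number."""
    mask = (frame['m_descriptor'] == descriptor) & (frame['s_rsen'] == sentence_no)
    return frame[~mask]


# Three BEM rows taken from the outputs above; the first one is the outlier
sample = pd.DataFrame({
    'm_descriptor': ['1965-0822x', '1955-0121', '1955-0121'],
    's_rsen': [1, 1, 2],
    'chars_E': [21, 23, 27],
    'chars_V': [273, 24, 27],
})

cleaned = drop_sentence(sample, '1965-0822x', 1)
print(cleaned)
```

Note that this removes every row matching the pair, so it assumes the pair identifies exactly the contribution to be dropped.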

Words

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                )
fig.show()
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-BEM-output.csv', sep='~', index=False, header=True)
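
The input and output file names for AFR and BEM differ only in the language code and the stage number (2 for the input, 3 for the cleaned output). A small helper could build these paths in one place. This is a sketch only: `language_csv` is a hypothetical helper, and `'.'` stands in for the notebook's PATH.

```python
from pathlib import Path


def language_csv(base, lang, stage):
    """Build the CSV path for a language code (e.g. 'AFR') and a stage number (2 or 3)."""
    folder = 'PredictTranslationWordAndCharCount'
    return Path(base) / folder / f'{folder}_{stage}-{lang}-output.csv'


# '.' is a placeholder for the notebook's PATH
for lang in ['AFR', 'BEM']:
    print(language_csv('.', lang, 2), '->', language_csv('.', lang, 3))
```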