Translation Word/Char Count Prediction (Part 3a)
Languages AFR, BEM, CHN, FAS
Purpose
There is a relationship between the number of words (and the number of characters) in the source language and the target language. If this relationship can be established and captured in yet another model, such a model can be helpful in at least two ways:
- For training: Validate the alignment of two sentences (in a training example) by comparing their word counts and/or character counts
- For inference: Validate the word count and/or character count of a translated/proofread sentence
In this notebook we set out to find such a model for each language and to evaluate how useful it is in the two roles above.
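As a rough illustration of the validation idea above, the sketch below fits a straight line to (source count, target count) pairs and flags pairs whose residual is unusually large. The linear form, the function names `fit_length_model` and `flag_misaligned`, the toy numbers, and the threshold of 3 residual standard deviations are all assumptions made for illustration; they are not the model developed in this notebook series.

```python
import numpy as np

def fit_length_model(src_counts, tgt_counts):
    """Fit tgt ~ slope * src + intercept by least squares (assumed linear form)."""
    src = np.asarray(src_counts, dtype=float)
    tgt = np.asarray(tgt_counts, dtype=float)
    slope, intercept = np.polyfit(src, tgt, deg=1)
    # Spread of the training residuals, used later as the tolerance scale
    resid_std = float(np.std(tgt - (slope * src + intercept)))
    return slope, intercept, resid_std

def flag_misaligned(src_counts, tgt_counts, model, n_std=3.0):
    """Return a boolean array, True where a pair deviates by more than n_std residual std devs."""
    slope, intercept, resid_std = model
    src = np.asarray(src_counts, dtype=float)
    tgt = np.asarray(tgt_counts, dtype=float)
    residuals = tgt - (slope * src + intercept)
    return np.abs(residuals) > n_std * resid_std

# Toy data (made up): the target is roughly 1.2x as long as the source
src = np.array([10, 20, 30, 40, 50, 60, 70, 80])
tgt = np.array([12, 25, 35, 49, 59, 73, 83, 97])
model = fit_length_model(src, tgt)

# First pair is consistent with the fit, second is far too short
flags = flag_misaligned([50, 50], [60, 20], model)
print(flags)
```

The same check could serve both roles: applied to training pairs it screens for misalignment, and applied to a fresh translation it screens for suspicious length. A real threshold would have to be tuned per language.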
Dataset and Variables
The dataset used in this notebook contains the following features:
- m_descriptor: Unique identifier of a document
- t_lan_E: Language of the translation (English is also considered a translation)
- t_version: Version of a translation
- s_rsen: Number of a sentence within a document
- c_id: Database primary key of a contribution
- e_content_E: Text content of an English contribution
- chars_E: Number of characters in an English contribution
- words_E: Number of words in an English contribution
- t_lan_V: Language of the translation
- e_top: N/A
- be_top: N/A
- c_created_at: Creation time of a contribution
- c_kind: Kind of a contribution
- c_base: N/A
- a_role: N/A
- u_name: N/A
- e_content_V: Text content of a translated contribution
- chars_V: Number of characters in a translated contribution
- words_V: Number of words in a translated contribution
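The count columns can be illustrated with a small sketch. It assumes that `chars_*` is the string length and `words_*` is the number of whitespace-separated tokens; the notebook does not state how the counts were actually derived, and whitespace splitting would not be meaningful for languages that do not separate words with spaces. The helper `add_counts` and the sample sentences are hypothetical.

```python
import pandas as pd

def add_counts(frame, text_col, suffix):
    """Add chars_<suffix> and words_<suffix> columns derived from a text column."""
    out = frame.copy()
    text = out[text_col].fillna('')
    out[f'chars_{suffix}'] = text.str.len()
    # str.split() with no argument splits on runs of whitespace
    out[f'words_{suffix}'] = text.str.split().str.len()
    return out

# Made-up sample rows, not taken from the dataset
demo = pd.DataFrame({'e_content_E': ['In the beginning', 'Amen.'],
                     'e_content_V': ['In die begin', 'Amen.']})
demo = add_counts(demo, 'e_content_E', 'E')
demo = add_counts(demo, 'e_content_V', 'V')
print(demo[['chars_E', 'words_E', 'chars_V', 'words_V']])
```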
from pathlib import Path
import pandas as pd
import plotly.express as px
%matplotlib inline
!python --version
PATH = Path(base_dir)  # base_dir must be defined before this cell runs
df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-AFR-output.csv', sep='~')
# df.loc[:5, ~df.columns.isin(['e_content_E','u_name','e_content_V','e_top','be_top','c_created_at','count','c_kind','c_base','a_role'])]
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V',
title='Translation Characters vs English Characters',
opacity=.5,
hover_data=['m_descriptor','t_lan_V','s_rsen'],
labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
# trendline="ols",
# trendline="lowess",
# trendline_color_override='black',
)
fig.show()
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='CAB-05') & (df['s_rsen']==400)]
outdf
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
Sure enough, this sentence is 'under-translated': it omits the last part of the English source sentence. We treat this data point as an outlier and remove it.
df = df.drop(70153)
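The hard-coded row label in `df.drop(70153)` depends on the index of this particular CSV. A more robust variant, sketched here with a hypothetical helper `drop_sentence` and made-up data, removes the row using the same descriptor and sentence-number condition that was used to locate it:

```python
import pandas as pd

def drop_sentence(frame, descriptor, sentence_no):
    """Return frame without the rows matching the given document and sentence number."""
    mask = (frame['m_descriptor'] == descriptor) & (frame['s_rsen'] == sentence_no)
    return frame[~mask]

# Made-up rows that mimic the relevant columns of the dataset
demo = pd.DataFrame({
    'm_descriptor': ['CAB-05', 'CAB-05', 'CAB-06'],
    's_rsen': [399, 400, 400],
    'chars_E': [80, 210, 95],
    'chars_V': [85, 60, 101],
})
cleaned = drop_sentence(demo, 'CAB-05', 400)
print(len(demo), '->', len(cleaned))
```

Because the mask matches on content rather than on a row label, the removal stays valid if the input file is regenerated with a different row order.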
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-0822x') & (df['s_rsen']==899)]
outdf
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
The translation lacks the parenthetical insertions that appear in the English sentence. We treat this data point as an outlier and remove it.
df = df.drop(54210)
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1964-0418z') & (df['s_rsen']==178)]
outdf
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
Again, the translation lacks the parenthetical insertion that appears in the English sentence. We treat this data point as an outlier and remove it.
df = df.drop(26290)
fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V',
title='Translation Words vs English Words',
opacity=.5,
hover_data=['m_descriptor','t_lan_V','s_rsen'],
labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
)
fig.show()
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-AFR-output.csv', sep='~', index=False, header=True)
df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-BEM-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]
fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V',
title='Translation Characters vs English Characters',
opacity=.5,
hover_data=['m_descriptor','t_lan_V','s_rsen'],
labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
)
fig.show()
#outlier
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-0822x') & (df['s_rsen']==1)]
outdf
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])
This data point is clearly an outlier, so we remove it.
df = df.drop(24010)
fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V',
title='Translation Words vs English Words',
opacity=.5,
hover_data=['m_descriptor','t_lan_V','s_rsen'],
labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
)
fig.show()
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-BEM-output.csv', sep='~', index=False, header=True)