Translation Word/Char Count Prediction (Part 2)
The predicted word or character count of a translation can serve as a quality or validation check.
- Dataset and Variables
- Setup the Environment
- Get train/valid data
- Inspect the ranges of chars and words variables
- Translation Word Count
- Translation Character Count
- Output a data file for each language
In this notebook we will start to explore the contribution data-points.
As pointed out in part 1, there is a relationship between the number of words (and the number of characters) in the source language and the target language. If this relationship can be established and captured in yet another model, such a model can be helpful in at least two ways:
- For training: Validate the alignment of two sentences (in a training example) by comparing their word counts and/or character counts
- For inference: Validate the word count and/or character count of a translated/proofread sentence
In this notebook we will color the data-points by target language and output a data file for each language for further analysis.
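The validation idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the model developed in this series: it assumes the relationship is a simple proportion (target count ≈ ratio × source count), and the names `fit_count_ratio`, `is_plausible` and the `tolerance` parameter are hypothetical.

```python
def fit_count_ratio(src_counts, tgt_counts):
    """Least-squares slope through the origin: tgt is approximated by ratio * src."""
    num = sum(s * t for s, t in zip(src_counts, tgt_counts))
    den = sum(s * s for s in src_counts)
    return num / den


def is_plausible(src_count, tgt_count, ratio, tolerance=0.5):
    """True when tgt_count deviates from the expected count by at most
    `tolerance` (relative to the expected count). The default is arbitrary."""
    expected = ratio * src_count
    return abs(tgt_count - expected) <= tolerance * expected


# Toy word counts (made up for illustration): source vs. target
ratio = fit_count_ratio([10, 20, 30], [12, 24, 36])

aligned = is_plausible(10, 13, ratio)     # close to the expected count
misaligned = is_plausible(10, 40, ratio)  # far too long for the source
```

A pair that fails such a check would be a candidate for manual inspection rather than automatic rejection, since the tolerance here is chosen arbitrarily.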
Dataset and Variables
The dataset used in this notebook contains the following features:
- m_descriptor: Unique identifier of a document
- t_lan_E: Language of the translation (English is also considered a translation)
- t_version: Version of a translation
- s_rsen: Number of a sentence within a document
- c_id: Database primary key of a contribution
- e_content_E: Text content of an English contribution
- chars_E: Number of characters in an English contribution
- words_E: Number of words in an English contribution
- t_lan_V: Language of the translation
- e_top: N/A
- be_top: N/A
- c_created_at: Creation time of a contribution
- c_kind: Kind of a contribution
- c_base: N/A
- a_role: N/A
- u_name: N/A
- e_content_V: Text content of a translated contribution
- chars_V: Number of characters in a translated contribution
- words_V: Number of words in a translated contribution
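How the count columns were computed is not shown in this notebook. One plausible derivation, given here purely as an assumption, counts characters with `len` and words by splitting on whitespace; the helper names `count_chars` and `count_words` are hypothetical.

```python
def count_chars(text):
    """Number of characters, including spaces and punctuation (assumed definition)."""
    return len(text)


def count_words(text):
    """Number of whitespace-separated tokens (assumed definition)."""
    return len(text.split())


sample = "The quick brown fox"
n_chars = count_chars(sample)
n_words = count_words(sample)
```

Whitespace splitting undercounts words in languages that are written without spaces between words, so a word count defined this way would not be comparable across all target languages.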
from pathlib import Path
import pandas as pd
import plotly.express as px
%matplotlib inline
!python --version
# base_dir is not defined in this notebook; it must be set before this cell runs
PATH = Path(base_dir + './')
df_raw = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_1-output.csv', sep='~')
df_raw.loc[:5, ~df_raw.columns.isin(['e_content_E','u_name','e_content_V'])]
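The next cell calls `display_all`, which is not defined anywhere in this notebook. A minimal stand-in is sketched here, assuming the helper is meant to show a DataFrame without truncating rows or columns; the limits of 1000 are arbitrary, and the fallback to `print` outside a notebook is an addition for illustration.

```python
import pandas as pd

try:
    from IPython.display import display  # rich output inside a notebook
except ImportError:
    display = print  # plain-text fallback outside a notebook


def display_all(df, max_rows=1000, max_columns=1000):
    """Show df with the row/column display limits raised temporarily."""
    with pd.option_context('display.max_rows', max_rows,
                           'display.max_columns', max_columns):
        display(df)


# Toy frame (made-up values) to exercise the helper
toy = pd.DataFrame({'words_E': [5, 7], 'words_V': [6, 8]})
result = display_all(toy.describe(include='all').T)
```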
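# display_all is a helper that shows a DataFrame without truncating rows or columns;
# it must be defined before this cell runs
display_all(df_raw.describe(include='all').T)
# Conclusion: the min and max values look fine for both the chars and the words variables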
fig = px.scatter(data_frame=df_raw, x='words_E', y='words_V', color='t_lan_V',
                 title='Translation Words vs English Words', opacity=.3,
                 hover_data=['m_descriptor', 't_lan_V', 's_rsen'],
                 labels={'m_descriptor': 'Descriptor', 't_lan_V': '', 's_rsen': 'Sentence No'})
fig.show()
fig = px.scatter(data_frame=df_raw, x='chars_E', y='chars_V', color='t_lan_V',
                 title='Translation Characters vs English Characters', opacity=.3,
                 hover_data=['m_descriptor', 't_lan_V', 's_rsen'],
                 labels={'m_descriptor': 'Descriptor', 't_lan_V': '', 's_rsen': 'Sentence No'})
fig.show()
With all languages in one plot the points overlap heavily, so the analysis may be easier if we select a single target language at a time.
df_raw['t_lan_V'].unique()
outlangs = list(df_raw['t_lan_V'].unique())
for outlang in outlangs:
    # Keep only the rows for this target language and write them to their own file
    df_out = df_raw[df_raw['t_lan_V'] == outlang]
    df_out.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-{outlang}-output.csv', sep='~', index=False, header=True)
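The per-language export above can also be written with `groupby`, which splits the frame in a single pass. The sketch below runs on a toy frame and writes to a temporary directory; the function name `export_by_language`, the file name pattern and the language labels are hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_by_language(df, out_dir, lang_col='t_lan_V', sep='~'):
    """Write one CSV per value of lang_col; return {language: file path}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for lang, df_lang in df.groupby(lang_col):
        out_file = out_dir / f'counts-{lang}.csv'  # hypothetical file name
        df_lang.to_csv(out_file, sep=sep, index=False, header=True)
        written[lang] = out_file
    return written


# Toy data with made-up language labels and counts
toy = pd.DataFrame({'t_lan_V': ['xx', 'yy', 'xx'],
                    'words_E': [5, 7, 9],
                    'words_V': [6, 8, 11]})

with tempfile.TemporaryDirectory() as tmp:
    files = export_by_language(toy, tmp)
    # Read each file back to confirm the rows were split by language
    row_counts = {lang: len(pd.read_csv(path, sep='~'))
                  for lang, path in files.items()}
```

By default `groupby` skips rows whose language value is missing, so such rows would not end up in any output file.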