Translation Word/Char Count Prediction (Part 1)
Predicting the word or character count of a translation as a quality or validation check
- Purpose
- Dataset and Variables
- Setup the Environment
- Get train/valid data
- Merge E and V contributions
- Save prepared data to file
Purpose
Machine translation is increasingly used to lighten the load of human translators. The critical component is the translation engine, a model that takes a sequence of source words and outputs a sequence of translated words. Training such a model requires many thousands of aligned sentence pairs as training examples.
There is a relationship between the number of words (and the number of characters) in the source language and the target language. If this relationship can be established and captured in yet another model, such a model can be helpful in at least two ways:
- For training: Validate the alignment of two sentences (in a training example) by comparing their word counts and/or character counts
- For inference: Validate the word count and/or character count of a translated/proofread sentence
The purpose of this project is to discover such a model for a variety of languages and to evaluate its use in the above roles.
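As a rough illustration of the alignment check described above, the sketch below flags a sentence pair whose target/source word-count ratio falls outside a tolerance band. The names `count_words` and `length_ratio_ok`, as well as the `expected_ratio` and `tolerance` values, are hypothetical placeholders and not results of this project; the actual relationship per language is what the later modeling is meant to discover.

```python
import re

def count_words(text: str) -> int:
    """Count word tokens with the same regex-based approach used later in this notebook."""
    return len(re.findall(r'\w+', text))

def length_ratio_ok(source: str, target: str,
                    expected_ratio: float = 1.0,
                    tolerance: float = 0.5) -> bool:
    """Return True if the target/source word-count ratio is close to expected_ratio.

    expected_ratio and tolerance are assumed values for illustration only.
    """
    src_words = count_words(source)
    tgt_words = count_words(target)
    if src_words == 0:
        # An empty source can only align with an empty target
        return tgt_words == 0
    ratio = tgt_words / src_words
    return abs(ratio - expected_ratio) <= tolerance

# A plausible pair and an obviously misaligned pair (made-up sentences)
print(length_ratio_ok("The cat sat on the mat", "Die Katze saß auf der Matte"))
print(length_ratio_ok("The cat sat on the mat", "Ja"))
```

A fixed ratio with a symmetric tolerance is the simplest possible rule; a fitted model could replace both constants with language-specific predictions.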
Dataset and Variables
The dataset comes in the form of contributions, each stored as one of 167289 rows (data points). Each contribution is a sentence, either in the source language (always English) or a translation of a source sentence. A translated sentence can have many versions, including the initial version provided by the translation engine. Human proofreaders then provide their own corrections as further versions.
There are 4 kinds of contributions:
- E: English contributions
- T: Translate contributions - provided by the translation engine
- C: Create contributions - corrections provided by human proofreaders/translators
- V: Vote contributions - whenever a human proofreader/translator indicates agreement with a contribution provided by the translation engine, it is recorded in the form of a vote contribution
The features of the dataset are:
- m_descriptor: Unique identifier of a document
- t_lan: Language of the translation (English is also considered a translation)
- t_senc: Number of sentences in a document
- t_version: Version of a translation
- s_typ: Type of the sentence
- s_rsen: Number of a sentence within a document
- e_id: Database primary key of a contribution's content
- e_top: Content of the contribution that got the most votes
- be_id: N/A
- be_top: N/A
- c_id: Database primary key of a contribution
- c_created_at: Creation time of a contribution
- c_kind: Kind of a contribution
- c_eis: N/A
- c_base: N/A
- a_role: N/A
- u_name: N/A
- e_content: Text content of a contribution
- chars: Number of characters in a contribution
- words: Number of words in a contribution
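To make the `chars` and `words` features concrete, here is a minimal sketch on a toy DataFrame. The rows are made up for illustration; the counting logic mirrors the approach used in this notebook (string length for `chars`, the regex `\w+` for `words`), with missing content replaced by an empty string first.

```python
import re
import pandas as pd

# Toy rows standing in for real contributions (made-up content)
toy = pd.DataFrame({'e_content': ['Hello, world!', None, "It's 3 o'clock"]})

# Replace missing content with an empty string so len() and re.findall() work
toy['e_content'] = toy['e_content'].fillna('')

# chars: length of the text; words: number of \w+ matches
toy['chars'] = [len(e) for e in toy['e_content']]
toy['words'] = [len(re.findall(r'\w+', e)) for e in toy['e_content']]

print(toy)
```

Note that `\w+` splits on apostrophes and other punctuation, so "It's 3 o'clock" counts as five words rather than three. Whether that matters depends on the language and on how the counts are used downstream.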
This notebook only prepares the dataset. Exploratory data analysis and modeling follow in later notebooks.
from pathlib import Path
import pandas as pd
import seaborn as sns
import glob
import re
%matplotlib inline
!python --version
PATH = Path(base_dir)  # base_dir is assumed to be defined beforehand (it is not set in this notebook)
all_files = glob.glob(f"{PATH}/contributions/*E-contributions.csv")
li = []
for filename in all_files:
    dft = pd.read_csv(filename, index_col=None, header=0, sep='~')
    li.append(dft)
df_E = pd.concat(li, axis=0, ignore_index=True)
df_E.iloc[:5,:-2]
df_E = df_E.drop(['e_id','t_senc','s_typ','e_top','be_id','be_top','c_created_at','c_kind','c_eis','c_base','a_role','u_name'], axis=1) #each record now unique
df_E.iloc[:5,:-1]
#handle NaNs in e_content
e_content_nans = df_E['e_content'].isna()
df_E[e_content_nans]
#replace e_content NaNs with empty strings
df_E.loc[e_content_nans, 'e_content'] = ''
# df_E.loc[e_content_nans, ['e_content']]
# OR
df_E[df_E['e_content']=='']
#add chars column
df_E['chars'] = [len(e) for e in df_E['e_content']]
# df_E['chars'] = [len(e) if type(e)==str else 1 for e in df_E['e_content']]
df_E.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','c_id','chars']]
# df_E.loc[e_content_nans, ['e_content','chars']]
# OR
df_E[df_E['chars']==0]
#add words column
#https://www.geeksforgeeks.org/python-program-to-count-words-in-a-sentence/
df_E['words'] = [len(re.findall(r'\w+', e)) for e in df_E['e_content']]
df_E.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','c_id','chars','words']]
#remove BER part of version from t_version so that we can use this column to join the English contributions with their matching translated contributions
df_E['t_version'] = ['-'.join(e.split('-')[:2]) for e in df_E['t_version']]
df_E.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','c_id','chars','words']]
all_files = glob.glob(f"{PATH}/contributions/*V-contributions.csv")
li = []
for filename in all_files:
    dft = pd.read_csv(filename, index_col=None, header=0, sep='~')
    li.append(dft)
df_V = pd.concat(li, axis=0, ignore_index=True)
df_V.iloc[:5,:-2]
df_V = df_V.drop(['t_senc','s_typ','be_id','c_eis'], axis=1)
df_V.iloc[:5,:-2]
#keep only top edits
# df_V[df_V['e_top'].isin(['M','T'])] #Majority & Tie
df_V = df_V[df_V['e_top'].isin(['M','T'])] #Majority & Tie
df_V.iloc[:5,:-2]
#sort by creation time within each sentence so that 'last' picks the most recent contribution
df_V = df_V.sort_values(by=['m_descriptor', 't_lan','t_version','s_rsen','c_created_at'])
df_V = df_V.groupby(['m_descriptor', 't_lan','t_version','s_rsen']).agg({'e_top':'last', 'be_top':'last', 'c_created_at':['last','count'], 'c_kind':'last', 'c_base':'last', 'a_role':'last', 'u_name':'last', 'e_content':'last'})
df_V.iloc[:,:-2]
# use T-contributions.csv to verify that all sentences have votes (i.e. no red ones left)
all_files = glob.glob(f"{PATH}/contributions/*T-contributions.csv")
li = []
for filename in all_files:
    dft = pd.read_csv(filename, index_col=None, header=0, sep='~')
    li.append(dft)
df_T = pd.concat(li, axis=0, ignore_index=True)
df_T.iloc[:5,:-2]
assert len(df_V)==len(df_T), "df_V and df_T differ in length: there may be sentences without any votes (red ones), i.e. contributions with df_T['e_top']=='Z'"
#IF THE PREVIOUS ASSERTION FAILS: check below for sentences without votes; if there are any, go vote for them and run this notebook again. This is unusual because each translation's CE should have voted on (i.e. signed off on) ALL sentences.
df_T[df_T['e_top']=='Z']
df_T[~df_T['e_top'].isin(['M','T','Z','N'])]
df_V = df_V.reset_index()
df_V.iloc[:5,:-2]
df_V.columns
#rename columns
df_V.columns = ['m_descriptor','t_lan','t_version','s_rsen','e_top','be_top','c_created_at','count','c_kind','c_base','a_role','u_name','e_content']
#handle NaNs in e_content
e_content_nans = df_V['e_content'].isna()
#replace e_content NaNs with empty strings
df_V.loc[e_content_nans, 'e_content'] = ''
#add chars column
df_V['chars'] = [len(e) for e in df_V['e_content']] #NaNs were replaced above; otherwise len() raises TypeError on float NaN
# df_V['chars'] = [len(e) if type(e)==str else 1 for e in df_V['e_content']]
df_V.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','e_top','be_top','c_created_at','count','c_kind','c_base','a_role','chars']]
#add words column
#https://www.geeksforgeeks.org/python-program-to-count-words-in-a-sentence/
df_V['words'] = [len(re.findall(r'\w+', e)) for e in df_V['e_content']]
df_V.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','e_top','be_top','c_created_at','count','c_kind','c_base','a_role','chars','words']]
#remove BER part from t_version; allows for joining to English contributions
df_V['t_version'] = ['-'.join(e.split('-')[:2]) for e in df_V['t_version']]
df_V.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','e_top','be_top','c_created_at','count','c_kind','c_base','a_role','chars','words']]
df_joind_EV = pd.merge(df_E, df_V, how='inner', on=['m_descriptor', 't_version', 's_rsen'], suffixes=('_E', '_V'), sort=True)
df_joind_EV.loc[:5,['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','e_top','be_top','c_created_at','count','c_kind','c_base','a_role','chars_V','words_V']]
df_joind_EV.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_1-output.csv', sep='~', index=False, header=True)
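As a sanity check on the save step, the sketch below writes a toy DataFrame with the same `sep='~'` convention and reads it back. The column values and the file name `toy-output.csv` are made up for illustration, and a temporary directory stands in for the real output location.

```python
import tempfile
from pathlib import Path
import pandas as pd

# Toy stand-in for the joined E/V data (made-up values)
toy = pd.DataFrame({
    'm_descriptor': ['doc-a', 'doc-a'],
    's_rsen': [1, 2],
    'e_content_E': ['Hello, world!', 'Good morning'],
    'words_E': [2, 2],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = Path(tmp_dir) / 'toy-output.csv'
    # Same separator and options as the save step above
    toy.to_csv(out_file, sep='~', index=False, header=True)
    # The reader must use the same separator to recover the columns
    restored = pd.read_csv(out_file, sep='~')

print(restored)
```

Using `~` as the separator keeps commas inside sentence text from being mistaken for column breaks. One caveat: with default settings, `read_csv` typically reads empty strings back as NaN, so the NaN handling for `e_content` may need to be repeated when the saved file is loaded.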