Purpose

Machine translation is used increasingly to lighten the load of human translators. The critical component here is the translation engine, a model that takes a sequence of source words and outputs a sequence of translated words. Training such a model requires many thousands of aligned sentence pairs to serve as training examples.

There is a relationship between the number of words (and the number of characters) in the source language and the target language. If this relationship can be established and captured in yet another model, such a model can be helpful in at least two ways:

  • For training: Validate the alignment of two sentences (in a training example) by comparing their word counts and/or character counts
  • For inference: Validate the word count and/or character count of a translated/proofread sentence

The purpose of this project is to discover such a model for a variety of languages and to evaluate its use in the above roles.
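To make the two validation roles concrete, here is a minimal sketch of a length-ratio check. The function name `is_length_plausible`, the `expected_ratio` of 1.0, and the `tolerance` of 0.5 are placeholders chosen for illustration; the actual relationship per language is what this project sets out to discover.

```python
def is_length_plausible(src_count, tgt_count, expected_ratio=1.0, tolerance=0.5):
    """Return True when tgt_count / src_count is close to expected_ratio.

    'Close' means within tolerance * expected_ratio. All defaults are
    illustrative placeholders, not fitted values.
    """
    if src_count <= 0:
        # An empty source sentence can only pair with an empty target.
        return tgt_count == 0
    ratio = tgt_count / src_count
    return abs(ratio - expected_ratio) <= tolerance * expected_ratio


# Word counts of an English sentence and a candidate translation.
print(is_length_plausible(21, 20))   # similar lengths
print(is_length_plausible(21, 60))   # target almost three times as long
```

The same function works for character counts; only the expected ratio and tolerance would differ per language and per unit.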

Dataset and Variables

The dataset comes in the form of contributions, each captured in one of 167289 rows (data points). Each contribution is a sentence, either in the source language (always English) or a translation of a source sentence. A translated sentence may have several versions, including the initial version provided by the translation engine. Human proofreaders then provide their own corrections as further versions.

There are 4 kinds of contributions:

  • E: English contributions
  • T: Translate contributions - provided by the translation engine
  • C: Create contributions - corrections provided by human proofreaders/translators
  • V: Vote contributions - whenever a human proofreader/translator indicates agreement with a contribution provided by the translation engine, it is recorded in the form of a vote contribution

The features of the dataset are:

  • m_descriptor: Unique identifier of a document
  • t_lan: Language of the translation (English is also considered a translation)
  • t_senc: Number of sentences in a document
  • t_version: Version of a translation
  • s_typ: Type of the sentence
  • s_rsen: Number of a sentence within a document
  • e_id: Database primary key of a contribution's content
  • e_top: Content of the contribution that got the most votes
  • be_id: N/A
  • be_top: N/A
  • c_id: Database primary key of a contribution
  • c_created_at: Creation time of a contribution
  • c_kind: Kind of a contribution
  • c_eis: N/A
  • c_base: N/A
  • a_role: N/A
  • u_name: N/A
  • e_content: Text content of a contribution
  • chars: Number of characters in a contribution
  • words: Number of words in a contribution

In this notebook we will only prepare the dataset. Exploratory data analysis and modeling will take place in follow-up notebooks.

Set Up the Environment

from pathlib import Path
import pandas as pd
import seaborn as sns
import glob
import re
%matplotlib inline
!python --version
Python 3.6.9
PATH = Path(base_dir)  # base_dir (root data directory) is assumed to be defined earlier in the session

Get train/valid data

Next we will ingest all the data we need. Note that the content (e_content) for each contribution is not displayed as it often makes the presentation unwieldy.
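The same read-and-concatenate loop is used for the E, V, and T files below. As a sketch, it could be wrapped in a small helper. The name `load_contributions` is hypothetical, and the demonstration runs on throwaway files with made-up rows, not on the real data.

```python
import glob
import tempfile
from pathlib import Path

import pandas as pd


def load_contributions(base_path, kind, sep="~"):
    """Concatenate every '*{kind}-contributions.csv' under base_path/contributions."""
    pattern = str(Path(base_path) / "contributions" / f"*{kind}-contributions.csv")
    files = sorted(glob.glob(pattern))
    if not files:
        # pd.concat would otherwise fail with a less helpful ValueError.
        raise FileNotFoundError(f"no files match {pattern}")
    frames = [pd.read_csv(f, header=0, sep=sep) for f in files]
    return pd.concat(frames, axis=0, ignore_index=True)


# Tiny demonstration on temporary files (made-up rows).
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "contributions"
    folder.mkdir()
    (folder / "a-E-contributions.csv").write_text("c_id~e_content\n1~Hello there\n")
    (folder / "b-E-contributions.csv").write_text("c_id~e_content\n2~Good morning\n")
    demo_E = load_contributions(tmp, "E")

print(demo_E)
```

With such a helper, each ingest step would reduce to a call like `load_contributions(PATH, 'E')`.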

Ingest all E-contributions

all_files = glob.glob(f"{PATH}/contributions/*E-contributions.csv")
li = []
for filename in all_files:
    dft = pd.read_csv(filename, index_col=None, header=0, sep='~')
    li.append(dft)
df_E = pd.concat(li, axis=0, ignore_index=True)
df_E.iloc[:5,:-2]
m_descriptor t_lan t_senc t_version s_typ s_rsen e_id e_top be_id be_top c_id c_created_at c_kind c_eis c_base a_role
0 1965-0418x ENG 1870 18-0101-E1R n 1 174684 Z NaN NaN 224461 2018-03-29 23:10:24.573038 E 0 NaN EP
1 1965-0418x ENG 1870 18-0101-E1R n 2 174685 Z NaN NaN 224462 2018-03-29 23:10:24.595501 E 0 NaN EP
2 1965-0418x ENG 1870 18-0101-E1R n 3 174686 Z NaN NaN 224463 2018-03-29 23:10:24.628362 E 0 NaN EP
3 1965-0418x ENG 1870 18-0101-E1R n 4 174687 Z NaN NaN 224464 2018-03-29 23:10:24.650119 E 0 NaN EP
4 1965-0418x ENG 1870 18-0101-E1R n 5 174688 Z NaN NaN 224465 2018-03-29 23:10:24.670806 E 0 NaN EP
df_E = df_E.drop(['e_id','t_senc','s_typ','e_top','be_id','be_top','c_created_at','c_kind','c_eis','c_base','a_role','u_name'], axis=1) #each record now unique
df_E.iloc[:5,:-1]
m_descriptor t_lan t_version s_rsen c_id
0 1965-0418x ENG 18-0101-E1R 1 224461
1 1965-0418x ENG 18-0101-E1R 2 224462
2 1965-0418x ENG 18-0101-E1R 3 224463
3 1965-0418x ENG 18-0101-E1R 4 224464
4 1965-0418x ENG 18-0101-E1R 5 224465
#handle NaNs in e_content
e_content_nans = df_E['e_content'].isna()
df_E[e_content_nans]
m_descriptor t_lan t_version s_rsen c_id e_content
33415 1956-0805 ENG 15-0402-b 1176 454335 NaN
43018 1957-0419 ENG 15-0401-b 505 13306 NaN
#replace e_content NaNs with empty strings
df_E.loc[e_content_nans, 'e_content'] = ''
# df_E.loc[e_content_nans, ['e_content']]
# OR
df_E[df_E['e_content']=='']
m_descriptor t_lan t_version s_rsen c_id e_content
33415 1956-0805 ENG 15-0402-b 1176 454335
43018 1957-0419 ENG 15-0401-b 505 13306
#add chars column
df_E['chars'] = [len(e) for e in df_E['e_content']]
# df_E['chars'] = [len(e) if type(e)==str else 1 for e in df_E['e_content']]
df_E.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','c_id','chars']]
m_descriptor t_lan t_version s_rsen c_id chars
0 1965-0418x ENG 18-0101-E1R 1 224461 21
1 1965-0418x ENG 18-0101-E1R 2 224462 225
2 1965-0418x ENG 18-0101-E1R 3 224463 109
3 1965-0418x ENG 18-0101-E1R 4 224464 49
4 1965-0418x ENG 18-0101-E1R 5 224465 163
5 1965-0418x ENG 18-0101-E1R 6 224466 96
# df_E.loc[e_content_nans, ['e_content','chars']]
# OR
df_E[df_E['chars']==0]
m_descriptor t_lan t_version s_rsen c_id e_content chars
33415 1956-0805 ENG 15-0402-b 1176 454335 0
43018 1957-0419 ENG 15-0401-b 505 13306 0
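The NaN replacement and the character count can also be written with vectorized pandas string methods. This is a sketch on a made-up three-row frame, offered as an equivalent alternative rather than a change to the steps above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"e_content": ["Hello there", np.nan, "Amen."]})

# fillna('') replaces missing content; .str.len() counts characters per row.
demo["e_content"] = demo["e_content"].fillna("")
demo["chars"] = demo["e_content"].str.len()

print(demo)
```

Rows whose content was missing end up with `chars == 0`, which is the same condition used in the check above.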
#add words column
#https://www.geeksforgeeks.org/python-program-to-count-words-in-a-sentence/
df_E['words'] = [len(re.findall(r'\w+', e)) for e in df_E['e_content']]
df_E.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','c_id','chars','words']]
m_descriptor t_lan t_version s_rsen c_id chars words
0 1965-0418x ENG 18-0101-E1R 1 224461 21 5
1 1965-0418x ENG 18-0101-E1R 2 224462 225 40
2 1965-0418x ENG 18-0101-E1R 3 224463 109 20
3 1965-0418x ENG 18-0101-E1R 4 224464 49 10
4 1965-0418x ENG 18-0101-E1R 5 224465 163 30
5 1965-0418x ENG 18-0101-E1R 6 224466 96 17
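It is worth knowing exactly what the `\w+` pattern counts as a word. The sketch below uses made-up sentences and a hypothetical helper `count_words` that wraps the same expression: apostrophes and hyphens split a token, and in Python 3 `\w` also matches non-ASCII letters, which matters for the translated languages.

```python
import re


def count_words(text):
    """Count maximal runs of word characters, using the notebook's regex."""
    return len(re.findall(r"\w+", text))


print(count_words("Let us bow our heads."))   # five runs of word characters
print(count_words("don't stop"))              # 'don', 't', 'stop'
print(count_words("well-known fact"))         # 'well', 'known', 'fact'
print(count_words("für Übersetzung"))         # umlauts are word characters
print(count_words(""))                        # empty content has no words
```

Contractions and hyphenated words are therefore counted as more than one word, which may shift the word-count relationship slightly for languages that use them heavily.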
#remove BER part of version from t_version so that we can use this column to join the English contributions with their matching translated contributions
df_E['t_version'] = ['-'.join(e.split('-')[:2]) for e in df_E['t_version']]
df_E.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','c_id','chars','words']]
m_descriptor t_lan t_version s_rsen c_id chars words
0 1965-0418x ENG 18-0101 1 224461 21 5
1 1965-0418x ENG 18-0101 2 224462 225 40
2 1965-0418x ENG 18-0101 3 224463 109 20
3 1965-0418x ENG 18-0101 4 224464 49 10
4 1965-0418x ENG 18-0101 5 224465 163 30
5 1965-0418x ENG 18-0101 6 224466 96 17
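The version trimming can also be expressed with pandas string accessors instead of a list comprehension. A small sketch, using version strings of the form seen in this notebook:

```python
import pandas as pd

versions = pd.Series(["18-0101-E1R", "18-0101-B123E1R", "15-0902-B123"])

# Split on '-', keep the first two pieces, and join them back together.
trimmed = versions.str.split("-").str[:2].str.join("-")

print(trimmed.tolist())
```

After trimming, English and translated rows of the same document share an identical `t_version`, which is what makes the later join possible.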

Ingest all V-contributions

all_files = glob.glob(f"{PATH}/contributions/*V-contributions.csv")
li = []
for filename in all_files:
    dft = pd.read_csv(filename, index_col=None, header=0, sep='~')
    li.append(dft)
df_V = pd.concat(li, axis=0, ignore_index=True)
df_V.iloc[:5,:-2]
m_descriptor t_lan t_senc t_version s_typ s_rsen e_id e_top be_id be_top c_id c_created_at c_kind c_eis c_base a_role
0 1965-0418x AFR 1870 18-0101-B123E1R n 1 181444 M 181444.0 M 844713 2020-01-15 02:13:34.847562 V 11 a CE
1 1965-0418x AFR 1870 18-0101-B123E1R n 1 181444 M 181444.0 M 256723 2018-04-23 11:04:31.787641 V 28 a TE
2 1965-0418x AFR 1870 18-0101-B123E1R n 2 339948 T 200635.0 N 468379 2019-01-30 22:21:29.62162 V 0 c CE
3 1965-0418x AFR 1870 18-0101-B123E1R n 2 200635 N 181445.0 N 256725 2018-04-23 11:23:43.781013 V 0 c TE
4 1965-0418x AFR 1870 18-0101-B123E1R n 3 200637 M 200636.0 N 256727 2018-04-23 11:26:37.965897 V 0 c TE
df_V = df_V.drop(['t_senc','s_typ','be_id','c_eis'], axis=1)
df_V.iloc[:5,:-2]
m_descriptor t_lan t_version s_rsen e_id e_top be_top c_id c_created_at c_kind c_base a_role
0 1965-0418x AFR 18-0101-B123E1R 1 181444 M M 844713 2020-01-15 02:13:34.847562 V a CE
1 1965-0418x AFR 18-0101-B123E1R 1 181444 M M 256723 2018-04-23 11:04:31.787641 V a TE
2 1965-0418x AFR 18-0101-B123E1R 2 339948 T N 468379 2019-01-30 22:21:29.62162 V c CE
3 1965-0418x AFR 18-0101-B123E1R 2 200635 N N 256725 2018-04-23 11:23:43.781013 V c TE
4 1965-0418x AFR 18-0101-B123E1R 3 200637 M N 256727 2018-04-23 11:26:37.965897 V c TE
#keep only top edits
# df_V[df_V['e_top'].isin(['M','T'])] #Majority & Tie
df_V = df_V[df_V['e_top'].isin(['M','T'])] #Majority & Tie
df_V.iloc[:5,:-2]
m_descriptor t_lan t_version s_rsen e_id e_top be_top c_id c_created_at c_kind c_base a_role
0 1965-0418x AFR 18-0101-B123E1R 1 181444 M M 844713 2020-01-15 02:13:34.847562 V a CE
1 1965-0418x AFR 18-0101-B123E1R 1 181444 M M 256723 2018-04-23 11:04:31.787641 V a TE
2 1965-0418x AFR 18-0101-B123E1R 2 339948 T N 468379 2019-01-30 22:21:29.62162 V c CE
4 1965-0418x AFR 18-0101-B123E1R 3 200637 M N 256727 2018-04-23 11:26:37.965897 V c TE
5 1965-0418x AFR 18-0101-B123E1R 3 200637 M M 468380 2019-01-30 22:21:51.780404 V t CE
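#sort by creation time so that 'last' picks the most recent vote for each sentence
tmp = df_V.sort_values(by=['m_descriptor', 't_lan','t_version','s_rsen','c_created_at'])
#group the sorted frame (tmp); grouping the unsorted df_V would discard the sort
df_V = tmp.groupby(['m_descriptor', 't_lan','t_version','s_rsen']).agg({'e_top':'last', 'be_top':'last', 'c_created_at':['last','count'], 'c_kind':'last', 'c_base':'last', 'a_role':'last', 'u_name':'last', 'e_content':'last'})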
df_V.iloc[:,:-2]
                                       e_top be_top c_created_at                     c_kind c_base a_role
                                       last  last   last                       count last   last   last
m_descriptor t_lan t_version    s_rsen
1948-0304    GER   15-0902-B123 1      M     NaN    2019-07-30 13:44:14.904495 1     V      c      TE
                                2      M     M      2019-07-30 13:44:34.151158 1     V      a      TE
                                3      M     M      2019-07-30 13:44:45.966924 1     V      a      TE
                                4      M     M      2019-07-30 13:44:53.096079 1     V      a      TE
                                5      M     N      2019-07-30 13:45:56.611909 1     V      c      TE
...
CAB-06       AFR   18-1101-B123 831    T     N      2020-05-25 02:43:59.956802 1     V      c      CE
                                832    M     M      2020-05-25 02:44:26.239919 2     V      t      CE
                                833    M     M      2020-05-25 02:44:38.590448 2     V      t      CE
                                834    M     N      2020-02-02 00:24:14.79957  2     V      c      TE
                                835    M     N      2020-02-02 00:24:42.429672 2     V      c      TE

248389 rows × 7 columns
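The 'last' aggregation depends on the row order inside each group, so the sort must be applied to the frame that is actually grouped. The sketch below shows the pattern on made-up votes and uses named aggregation (available in pandas 0.25 and later, which is an assumption about the installed version) to get flat column names directly. The timestamps are sorted as strings here; converting them with `pd.to_datetime` first would be the safer choice.

```python
import pandas as pd

votes = pd.DataFrame({
    "m_descriptor": ["d1", "d1", "d1"],
    "t_lan": ["AFR", "AFR", "AFR"],
    "s_rsen": [1, 1, 2],
    "c_created_at": ["2020-01-15 02:13:34", "2018-04-23 11:04:31", "2018-04-23 11:23:43"],
    "a_role": ["CE", "TE", "TE"],
})

keys = ["m_descriptor", "t_lan", "s_rsen"]
latest = (
    votes.sort_values(keys + ["c_created_at"])   # oldest first within each sentence
         .groupby(keys)
         .agg(c_created_at=("c_created_at", "last"),   # most recent vote time
              count=("c_created_at", "count"),         # number of votes
              a_role=("a_role", "last"))               # role of the most recent voter
         .reset_index()
)
print(latest)
```

Because the output columns are named in the `agg` call, no MultiIndex columns are produced and no manual renaming is needed afterwards.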

# use T-contributions.csv to verify that all sentences have votes (i.e. no red ones left)
all_files = glob.glob(f"{PATH}/contributions/*T-contributions.csv")
li = []
for filename in all_files:
    dft = pd.read_csv(filename, index_col=None, header=0, sep='~')
    li.append(dft)
df_T = pd.concat(li, axis=0, ignore_index=True)
df_T.iloc[:5,:-2]
m_descriptor t_lan t_senc t_version s_typ s_rsen e_id e_top be_id be_top c_id c_created_at c_kind c_eis c_base a_role
0 1965-0418x AFR 1870 18-0101-B123E1R n 1 181444 M NaN NaN 231225 2018-03-30 13:05:50.489319 T 0 NaN MT
1 1965-0418x AFR 1870 18-0101-B123E1R n 2 181445 N NaN NaN 231226 2018-03-30 13:05:50.524888 T 0 NaN MT
2 1965-0418x AFR 1870 18-0101-B123E1R n 3 181446 N NaN NaN 231227 2018-03-30 13:05:50.56683 T 0 NaN MT
3 1965-0418x AFR 1870 18-0101-B123E1R n 4 181447 N NaN NaN 231228 2018-03-30 13:05:50.601543 T 0 NaN MT
4 1965-0418x AFR 1870 18-0101-B123E1R n 5 181448 N NaN NaN 231229 2018-03-30 13:05:50.635029 T 0 NaN MT
assert len(df_V)==len(df_T), "df_V and df_T differ in length: there may be sentences without any votes (red ones), i.e. contributions with df_T['e_top']=='Z'"
#IF THE PREVIOUS ASSERTION FAILS: check below for contributions with e_top=='Z'. If there are any, go vote for them and run this notebook again. This is unusual because each translation's CE should have voted for (i.e. signed off on) ALL sentences.
df_T[df_T['e_top']=='Z']
m_descriptor t_lan t_senc t_version s_typ s_rsen e_id e_top be_id be_top c_id c_created_at c_kind c_eis c_base a_role u_name e_content
df_T[~df_T['e_top'].isin(['M','T','Z','N'])]
m_descriptor t_lan t_senc t_version s_typ s_rsen e_id e_top be_id be_top c_id c_created_at c_kind c_eis c_base a_role u_name e_content
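If the lengths ever differ, the offending sentences can be located directly with an anti-join. This is a sketch on made-up frames named `machine` and `voted` (hypothetical stand-ins for the T-contributions and the aggregated votes); it relies on the `indicator` option of `DataFrame.merge`.

```python
import pandas as pd

keys = ["m_descriptor", "t_lan", "t_version", "s_rsen"]

machine = pd.DataFrame({
    "m_descriptor": ["d1", "d1", "d1"],
    "t_lan": ["AFR", "AFR", "AFR"],
    "t_version": ["18-0101-B123", "18-0101-B123", "18-0101-B123"],
    "s_rsen": [1, 2, 3],
})
voted = machine.iloc[[0, 2]].copy()   # pretend sentence 2 never received a vote

# A left merge with indicator=True labels rows that exist only in `machine`.
check = machine.merge(voted, on=keys, how="left", indicator=True)
unvoted = check.loc[check["_merge"] == "left_only", keys]
print(unvoted)
```

An empty `unvoted` frame would correspond to the situation the assertion expects.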
df_V = df_V.reset_index()
df_V.iloc[:5,:-2]
   m_descriptor t_lan t_version    s_rsen e_top be_top c_created_at                     c_kind c_base a_role
                                          last  last   last                       count last   last   last
0  1948-0304    GER   15-0902-B123 1      M     NaN    2019-07-30 13:44:14.904495 1     V      c      TE
1  1948-0304    GER   15-0902-B123 2      M     M      2019-07-30 13:44:34.151158 1     V      a      TE
2  1948-0304    GER   15-0902-B123 3      M     M      2019-07-30 13:44:45.966924 1     V      a      TE
3  1948-0304    GER   15-0902-B123 4      M     M      2019-07-30 13:44:53.096079 1     V      a      TE
4  1948-0304    GER   15-0902-B123 5      M     N      2019-07-30 13:45:56.611909 1     V      c      TE
df_V.columns
MultiIndex([('m_descriptor',      ''),
            (       't_lan',      ''),
            (   't_version',      ''),
            (      's_rsen',      ''),
            (       'e_top',  'last'),
            (      'be_top',  'last'),
            ('c_created_at',  'last'),
            ('c_created_at', 'count'),
            (      'c_kind',  'last'),
            (      'c_base',  'last'),
            (      'a_role',  'last'),
            (      'u_name',  'last'),
            (   'e_content',  'last')],
           )
#rename columns
df_V.columns = ['m_descriptor','t_lan','t_version','s_rsen','e_top','be_top','c_created_at','count','c_kind','c_base','a_role','u_name','e_content']
#handle NaNs in e_content
e_content_nans = df_V['e_content'].isna()
#replace e_content NaNs with empty strings
df_V.loc[e_content_nans, 'e_content'] = ''
#add chars column
df_V['chars'] = [len(e) for e in df_V['e_content']] #NaNs were replaced above; len() on a float NaN would raise TypeError
# df_V['chars'] = [len(e) if type(e)==str else 1 for e in df_V['e_content']]
df_V.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','e_top','be_top','c_created_at','count','c_kind','c_base','a_role','chars']]
m_descriptor t_lan t_version s_rsen e_top be_top c_created_at count c_kind c_base a_role chars
0 1948-0304 GER 15-0902-B123 1 M NaN 2019-07-30 13:44:14.904495 1 V c TE 112
1 1948-0304 GER 15-0902-B123 2 M M 2019-07-30 13:44:34.151158 1 V a TE 28
2 1948-0304 GER 15-0902-B123 3 M M 2019-07-30 13:44:45.966924 1 V a TE 41
3 1948-0304 GER 15-0902-B123 4 M M 2019-07-30 13:44:53.096079 1 V a TE 18
4 1948-0304 GER 15-0902-B123 5 M N 2019-07-30 13:45:56.611909 1 V c TE 106
5 1948-0304 GER 15-0902-B123 6 M N 2019-07-30 13:46:34.354399 1 V c TE 36
#add words column
#https://www.geeksforgeeks.org/python-program-to-count-words-in-a-sentence/
df_V['words'] = [len(re.findall(r'\w+', e)) for e in df_V['e_content']]
df_V.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','e_top','be_top','c_created_at','count','c_kind','c_base','a_role','chars','words']]
m_descriptor t_lan t_version s_rsen e_top be_top c_created_at count c_kind c_base a_role chars words
0 1948-0304 GER 15-0902-B123 1 M NaN 2019-07-30 13:44:14.904495 1 V c TE 112 20
1 1948-0304 GER 15-0902-B123 2 M M 2019-07-30 13:44:34.151158 1 V a TE 28 5
2 1948-0304 GER 15-0902-B123 3 M M 2019-07-30 13:44:45.966924 1 V a TE 41 7
3 1948-0304 GER 15-0902-B123 4 M M 2019-07-30 13:44:53.096079 1 V a TE 18 4
4 1948-0304 GER 15-0902-B123 5 M N 2019-07-30 13:45:56.611909 1 V c TE 106 18
5 1948-0304 GER 15-0902-B123 6 M N 2019-07-30 13:46:34.354399 1 V c TE 36 8
#remove BER part from t_version; allows for joining to English contributions
df_V['t_version'] = ['-'.join(e.split('-')[:2]) for e in df_V['t_version']]
df_V.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','e_top','be_top','c_created_at','count','c_kind','c_base','a_role','chars','words']]
m_descriptor t_lan t_version s_rsen e_top be_top c_created_at count c_kind c_base a_role chars words
0 1948-0304 GER 15-0902 1 M NaN 2019-07-30 13:44:14.904495 1 V c TE 112 20
1 1948-0304 GER 15-0902 2 M M 2019-07-30 13:44:34.151158 1 V a TE 28 5
2 1948-0304 GER 15-0902 3 M M 2019-07-30 13:44:45.966924 1 V a TE 41 7
3 1948-0304 GER 15-0902 4 M M 2019-07-30 13:44:53.096079 1 V a TE 18 4
4 1948-0304 GER 15-0902 5 M N 2019-07-30 13:45:56.611909 1 V c TE 106 18
5 1948-0304 GER 15-0902 6 M N 2019-07-30 13:46:34.354399 1 V c TE 36 8

Merge E and V contributions

df_joind_EV = pd.merge(df_E, df_V, how='inner', on=['m_descriptor', 't_version', 's_rsen'], suffixes=('_E', '_V'), sort=True)
df_joind_EV.loc[:5,['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','e_top','be_top','c_created_at','count','c_kind','c_base','a_role','chars_V','words_V']]
m_descriptor t_lan_E t_version s_rsen c_id chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role chars_V words_V
0 1948-0304 ENG 15-0902 1 660286 98 21 GER M NaN 2019-07-30 13:44:14.904495 1 V c TE 112 20
1 1948-0304 ENG 15-0902 1 660286 98 21 POR M M 2020-05-25 18:11:14.358631 2 V t QE 104 18
2 1948-0304 ENG 15-0902 2 660287 27 6 GER M M 2019-07-30 13:44:34.151158 1 V a TE 28 5
3 1948-0304 ENG 15-0902 2 660287 27 6 POR M M 2020-05-25 18:11:30.820954 3 V a QE 25 5
4 1948-0304 ENG 15-0902 3 660288 36 7 GER M M 2019-07-30 13:44:45.966924 1 V a TE 41 7
5 1948-0304 ENG 15-0902 3 660288 36 7 POR M M 2020-05-25 18:11:42.224382 2 V t QE 32 7
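Each English sentence is expected to match one row per translated language, so the join is one-to-many. As a sketch, the `validate` argument of `pd.merge` can enforce that expectation. The frames `eng` and `trans` below are small made-up stand-ins for `df_E` and `df_V`.

```python
import pandas as pd

eng = pd.DataFrame({
    "m_descriptor": ["d1", "d1"],
    "t_version": ["15-0902", "15-0902"],
    "s_rsen": [1, 2],
    "words": [21, 6],
})
trans = pd.DataFrame({
    "m_descriptor": ["d1", "d1", "d1"],
    "t_version": ["15-0902", "15-0902", "15-0902"],
    "s_rsen": [1, 1, 2],
    "t_lan": ["GER", "POR", "GER"],
    "words": [20, 18, 5],
})

# validate='one_to_many' raises pandas.errors.MergeError if the join keys
# are duplicated on the English (left) side.
joined = pd.merge(eng, trans, how="inner",
                  on=["m_descriptor", "t_version", "s_rsen"],
                  suffixes=("_E", "_V"), validate="one_to_many")
print(joined[["s_rsen", "t_lan", "words_E", "words_V"]])
```

A duplicated English sentence would otherwise silently multiply the matching translated rows in the joined frame.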

Save prepared data to file

df_joind_EV.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_1-output.csv', sep='~', index=False, header=True)
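One thing to keep in mind when a follow-up notebook reads this file back: by default `pd.read_csv` turns empty fields into NaN again, which undoes the empty-string replacement made earlier. The sketch below demonstrates this on a made-up two-row frame, with an in-memory buffer standing in for the output file.

```python
import io

import pandas as pd

out = pd.DataFrame({"e_content": ["Hello there", ""], "chars": [11, 0]})

buf = io.StringIO()                    # stands in for the output file
out.to_csv(buf, sep="~", index=False, header=True)

buf.seek(0)
default_read = pd.read_csv(buf, sep="~")                        # '' becomes NaN
buf.seek(0)
kept_read = pd.read_csv(buf, sep="~", keep_default_na=False)    # '' stays ''

print(default_read["e_content"].isna().tolist())
print(kept_read["e_content"].tolist())
```

Note that `keep_default_na=False` applies to every column, so it is only a convenient choice when no other column relies on empty fields being read as missing values; calling `fillna('')` on `e_content` after loading is an alternative.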