Purpose

Machine translation is increasingly used to lighten the load on human translators. Its critical component is the translation engine, a model that takes a sequence of source words and outputs a sequence of translated words. Training such a model requires many thousands of aligned sentence pairs to serve as training examples.

There is a relationship between the number of words (and the number of characters) in a source-language sentence and in its translation. If this relationship can be established and captured in a second model, that model can help in at least two ways:

For training: Validate the alignment of two sentences (in a training example) by comparing their word counts and/or character counts
For inference: Validate the word count and/or character count of a translated/proofread sentence
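
As a rough illustration of the training-time check, the sketch below flags a sentence pair whose character-count ratio falls outside a tolerance band. The function name looks_aligned, the expected ratio of 1.1, and the 50% tolerance are hypothetical placeholders, not values derived from this dataset; the model this project sets out to discover would supply them.

```python
def looks_aligned(source: str, target: str,
                  expected_ratio: float = 1.1,
                  tolerance: float = 0.5) -> bool:
    """Return True if len(target)/len(source) lies near expected_ratio.

    expected_ratio and tolerance are illustrative assumptions only.
    """
    if not source or not target:
        # An empty side cannot be compared meaningfully.
        return False
    ratio = len(target) / len(source)
    lower = expected_ratio * (1 - tolerance)
    upper = expected_ratio * (1 + tolerance)
    return lower <= ratio <= upper
```

The same shape of check would work with word counts in place of character counts.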

The purpose of this project is to discover such a model for a variety of languages and to evaluate its use in the above roles.

Dataset and Variables

The dataset consists of contributions, one per row, for a total of 167289 rows (data points). Each contribution is a sentence, either in the source language (always English) or a translation of a source sentence. A translated sentence can have many versions, including the initial version produced by the translation engine. Human proofreaders then supply their corrections as further versions.

There are 4 kinds of contributions:

E: English contributions
T: Translate contributions - provided by the translation engine
C: Create contributions - corrections provided by human proofreaders/translators
V: Vote contributions - whenever a human proofreader/translator indicates agreement with a contribution provided by the translation engine, it is recorded in the form of a vote contribution

The features of the dataset are:

m_descriptor: Unique identifier of a document
t_lan: Language of the translation (English is also considered a translation)
t_senc: Number of sentences in a document
t_version: Version of a translation
s_typ: Type of the sentence
s_rsen: Number of a sentence within a document
e_id: Database primary key of a contribution's content
e_top: Top-vote status of a contribution's content (M = majority, T = tie; Z marks a sentence without any votes)
be_id: N/A
be_top: N/A
c_id: Database primary key of a contribution
c_created_at: Creation time of a contribution
c_kind: Kind of a contribution
c_eis: N/A
c_base: N/A
a_role: N/A
u_name: N/A
e_content: Text content of a contribution
chars: Number of characters in a contribution
words: Number of words in a contribution

In this notebook we only prepare the dataset. Exploratory data analysis and modeling follow in later notebooks.

Set up the Environment

from pathlib import Path
import pandas as pd
import seaborn as sns
import glob
import re
%matplotlib inline

!python --version

Python 3.6.9

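PATH = Path(base_dir)  # base_dir (the root data directory) must be defined before this cell runs; its definition is not shown here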

Get train/valid data

Next we ingest all the data we need. Note that the content (e_content) of each contribution is not displayed, as it often makes the output unwieldy.

Ingest all E-contributions

all_files = glob.glob(f"{PATH}/contributions/*E-contributions.csv")
li = []
for filename in all_files:
    dft = pd.read_csv(filename, index_col=None, header=0, sep='~')
    li.append(dft)
df_E = pd.concat(li, axis=0, ignore_index=True)
df_E.iloc[:5,:-2]
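
The same glob-read-concatenate loop is repeated for each kind of contribution. One way to factor it out is sketched below; the helper name load_contributions is hypothetical, and the sketch assumes the same directory layout and '~' separator used in the cell above.

```python
import glob
from pathlib import Path

import pandas as pd


def load_contributions(base: Path, kind: str) -> pd.DataFrame:
    """Concatenate every '*<kind>-contributions.csv' file under base/contributions."""
    pattern = str(base / "contributions" / f"*{kind}-contributions.csv")
    files = sorted(glob.glob(pattern))  # sorted for a reproducible row order
    if not files:
        raise FileNotFoundError(f"no files match {pattern}")
    frames = [pd.read_csv(f, index_col=None, header=0, sep="~") for f in files]
    return pd.concat(frames, axis=0, ignore_index=True)
```

With such a helper, the three ingest cells would reduce to calls like load_contributions(PATH, 'E').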

df_E = df_E.drop(['e_id','t_senc','s_typ','e_top','be_id','be_top','c_created_at','c_kind','c_eis','c_base','a_role','u_name'], axis=1) #each record now unique
df_E.iloc[:5,:-1]

#handle NaNs in e_content
e_content_nans = df_E['e_content'].isna()
df_E[e_content_nans]

#replace e_content NaNs with empty strings
df_E.loc[e_content_nans, 'e_content'] = ''
#show the rows that now hold an empty string
#(alternative view: df_E.loc[e_content_nans, ['e_content']])
df_E[df_E['e_content']=='']

#add chars column
df_E['chars'] = [len(e) for e in df_E['e_content']]
# df_E['chars'] = [len(e) if type(e)==str else 1 for e in df_E['e_content']]
df_E.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','c_id','chars']]

#show the rows with zero characters
#(alternative view: df_E.loc[e_content_nans, ['e_content','chars']])
df_E[df_E['chars']==0]
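
The NaN replacement and the character count can also be written with pandas string methods instead of a list comprehension. The toy frame below is made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy data; None stands in for a missing e_content value.
toy = pd.DataFrame({"e_content": ["Hello world", None, "Guten Tag"]})

# fillna('') turns missing content into an empty string, so .str.len() yields 0 for it.
toy["e_content"] = toy["e_content"].fillna("")
toy["chars"] = toy["e_content"].str.len()
```

This gives the same counts as len(e) per row and avoids the TypeError that len() raises on a float NaN.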

#add words column
#https://www.geeksforgeeks.org/python-program-to-count-words-in-a-sentence/
df_E['words'] = [len(re.findall(r'\w+', e)) for e in df_E['e_content']]
df_E.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','c_id','chars','words']]
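
It helps to know what the pattern \w+ actually counts: each maximal run of letters, digits, and underscores is one word, so apostrophes and hyphens split a token. The sample strings below are made up for illustration.

```python
import re


def count_words(text: str) -> int:
    """Count maximal runs of word characters, as the cell above does."""
    return len(re.findall(r"\w+", text))


samples = ["Good morning", "don't stop", "well-known fact", "Straße 12", ""]
counts = {s: count_words(s) for s in samples}
# "don't stop" counts as 3 (don, t, stop); "well-known fact" counts as 3;
# "Straße 12" counts as 2, since ß is a word character and 12 is a run of digits.
```

This behavior is consistent across rows, which matters more for modeling a ratio than matching any particular linguistic definition of a word.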

#remove BER part of version from t_version so that we can use this column to join the English contributions with their matching translated contributions
df_E['t_version'] = ['-'.join(e.split('-')[:2]) for e in df_E['t_version']]
df_E.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','c_id','chars','words']]
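
The version trimming keeps the first two hyphen-separated fields and drops the rest. A small sketch, using version strings of the shapes that appear in this dataset (the function name base_version is hypothetical):

```python
def base_version(t_version: str) -> str:
    """Keep only the first two hyphen-separated fields of a version string."""
    return "-".join(t_version.split("-")[:2])


examples = ["18-0101-E1R", "18-0101-B123E1R", "15-0402-b", "18-0101"]
trimmed = [base_version(v) for v in examples]
# trimmed == ['18-0101', '18-0101', '15-0402', '18-0101']
```

An English version and its translated versions therefore map to the same key, which is what the later join relies on.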

Ingest all V-contributions

all_files = glob.glob(f"{PATH}/contributions/*V-contributions.csv")
li = []
for filename in all_files:
    dft = pd.read_csv(filename, index_col=None, header=0, sep='~')
    li.append(dft)
df_V = pd.concat(li, axis=0, ignore_index=True)
df_V.iloc[:5,:-2]

df_V = df_V.drop(['t_senc','s_typ','be_id','c_eis'], axis=1)
df_V.iloc[:5,:-2]

#keep only top edits: e_top is 'M' (majority) or 'T' (tie)
df_V = df_V[df_V['e_top'].isin(['M','T'])]
df_V.iloc[:5,:-2]

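#sort by creation time within each sentence so that 'last' in the aggregation picks the most recent vote
df_V = df_V.sort_values(by=['m_descriptor', 't_lan','t_version','s_rsen','c_created_at'])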

#collapse the votes for each sentence into one row: keep the last vote's fields and count the votes
df_V = df_V.groupby(['m_descriptor', 't_lan','t_version','s_rsen']).agg({
    'e_top':'last', 'be_top':'last', 'c_created_at':['last','count'], 'c_kind':'last',
    'c_base':'last', 'a_role':'last', 'u_name':'last', 'e_content':'last'})
df_V.iloc[:,:-2]
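
The aggregation takes the 'last' row in each group, which is the most recent vote only if the rows are ordered by c_created_at within each group. A toy sketch of that dependency, with made-up rows and a single grouping key for brevity:

```python
import pandas as pd

# Toy votes: sentence 1 has two votes, listed newest first.
votes = pd.DataFrame({
    "s_rsen": [1, 1, 2],
    "c_created_at": ["2020-01-15 02:13:34", "2018-04-23 11:04:31", "2019-01-30 22:21:29"],
    "a_role": ["CE", "TE", "CE"],
})

# Sort by creation time first so that 'last' means "most recent".
votes = votes.sort_values(by=["s_rsen", "c_created_at"])
latest = votes.groupby("s_rsen").agg({"c_created_at": ["last", "count"], "a_role": "last"})
```

Without the sort, 'last' for sentence 1 would return the 2018 vote, simply because it appears later in the frame.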

# use T-contributions.csv to verify that all sentences have votes (i.e. no red ones left)

all_files = glob.glob(f"{PATH}/contributions/*T-contributions.csv")
li = []
for filename in all_files:
    dft = pd.read_csv(filename, index_col=None, header=0, sep='~')
    li.append(dft)
df_T = pd.concat(li, axis=0, ignore_index=True)
df_T.iloc[:5,:-2]

assert len(df_V)==len(df_T), f"df_V ({len(df_V)} rows) and df_T ({len(df_T)} rows) differ in length: there may be sentences without any votes (red ones), i.e. contributions with df_T['e_top']=='Z'"

#IF THE PREVIOUS ASSERTION FAILS: list the sentences without votes. If there are any, go vote for them and run this notebook again. This is unusual because each translation's CE should have voted on (i.e. signed off on) ALL sentences.
df_T[df_T['e_top']=='Z']

#list any contributions with an unexpected e_top value
df_T[~df_T['e_top'].isin(['M','T','Z','N'])]
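
Comparing lengths tells us that something is missing, but not which sentences. One way to list them is a left merge with indicator=True, sketched here on made-up frames; toy_T and toy_V are stand-ins, and the key columns are an assumption about what identifies a sentence.

```python
import pandas as pd

keys = ["m_descriptor", "t_lan", "s_rsen"]
toy_T = pd.DataFrame({"m_descriptor": ["d1"] * 3, "t_lan": ["GER"] * 3, "s_rsen": [1, 2, 3]})
toy_V = pd.DataFrame({"m_descriptor": ["d1"] * 2, "t_lan": ["GER"] * 2, "s_rsen": [1, 3]})

# indicator=True adds a '_merge' column; 'left_only' marks rows of toy_T with no match in toy_V.
merged = toy_T.merge(toy_V, how="left", on=keys, indicator=True)
unvoted = merged.loc[merged["_merge"] == "left_only", keys]
```

Here unvoted holds the single row for sentence 2, the one without a vote.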

df_V = df_V.reset_index()
df_V.iloc[:5,:-2]

df_V.columns

MultiIndex([('m_descriptor',      ''),
            (       't_lan',      ''),
            (   't_version',      ''),
            (      's_rsen',      ''),
            (       'e_top',  'last'),
            (      'be_top',  'last'),
            ('c_created_at',  'last'),
            ('c_created_at', 'count'),
            (      'c_kind',  'last'),
            (      'c_base',  'last'),
            (      'a_role',  'last'),
            (      'u_name',  'last'),
            (   'e_content',  'last')],
           )

#rename columns
df_V.columns = ['m_descriptor','t_lan','t_version','s_rsen','e_top','be_top','c_created_at','count','c_kind','c_base','a_role','u_name','e_content']
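
The MultiIndex columns and the manual rename can be avoided with named aggregation, which assigns flat output names directly. This is a sketch on toy data, assuming pandas 0.25 or later; it is an alternative, not what the notebook ran.

```python
import pandas as pd

votes = pd.DataFrame({
    "s_rsen": [1, 1, 2],
    "c_created_at": ["2018-04-23", "2020-01-15", "2019-01-30"],
    "a_role": ["TE", "CE", "CE"],
})

# Each keyword is an output column name; each value is (input column, aggregation).
flat = (
    votes.groupby("s_rsen")
    .agg(
        c_created_at=("c_created_at", "last"),
        count=("c_created_at", "count"),
        a_role=("a_role", "last"),
    )
    .reset_index()
)
```

The result has ordinary single-level columns, so no positional list of new names has to be kept in sync with the aggregation.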

#handle NaNs in e_content
e_content_nans = df_V['e_content'].isna()

#replace e_content NaNs with empty strings
df_V.loc[e_content_nans, 'e_content'] = ''

#add chars column (NaNs were replaced above; otherwise len() raises "TypeError: object of type 'float' has no len()")
df_V['chars'] = [len(e) for e in df_V['e_content']]
# df_V['chars'] = [len(e) if type(e)==str else 1 for e in df_V['e_content']]
df_V.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','e_top','be_top','c_created_at','count','c_kind','c_base','a_role','chars']]

#add words column
#https://www.geeksforgeeks.org/python-program-to-count-words-in-a-sentence/
df_V['words'] = [len(re.findall(r'\w+', e)) for e in df_V['e_content']]
df_V.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','e_top','be_top','c_created_at','count','c_kind','c_base','a_role','chars','words']]

#remove BER part from t_version; allows for joining to English contributions
df_V['t_version'] = ['-'.join(e.split('-')[:2]) for e in df_V['t_version']]
df_V.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','e_top','be_top','c_created_at','count','c_kind','c_base','a_role','chars','words']]

Merge E and V contributions

df_joind_EV = pd.merge(df_E, df_V, how='inner', on=['m_descriptor', 't_version', 's_rsen'], suffixes=('_E', '_V'), sort=True)
df_joind_EV.loc[:5,['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','e_top','be_top','c_created_at','count','c_kind','c_base','a_role','chars_V','words_V']]
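
The join pairs each English sentence with every translated sentence that shares its document, base version, and sentence number. The toy sketch below mirrors that call and adds validate='one_to_many', an optional check (not used in the notebook) that the English side has unique keys. The frames eng and tra are made up.

```python
import pandas as pd

keys = ["m_descriptor", "t_version", "s_rsen"]
eng = pd.DataFrame({"m_descriptor": ["d1", "d1"], "t_version": ["15-0902"] * 2,
                    "s_rsen": [1, 2], "t_lan": ["ENG"] * 2, "chars": [98, 27]})
tra = pd.DataFrame({"m_descriptor": ["d1"] * 3, "t_version": ["15-0902"] * 3,
                    "s_rsen": [1, 1, 2], "t_lan": ["GER", "POR", "GER"],
                    "chars": [112, 104, 28]})

# Non-key columns present in both frames get the _E / _V suffixes.
# validate raises MergeError if a key occurs more than once in eng.
joined = pd.merge(eng, tra, how="inner", on=keys, suffixes=("_E", "_V"),
                  sort=True, validate="one_to_many")
```

Each English row is repeated once per matching translation, which is why the English character and word counts appear duplicated in the joined output.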

Save prepared data to file

df_joind_EV.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_1-output.csv', sep='~', index=False, header=True)

	m_descriptor	t_lan	t_senc	t_version	s_typ	s_rsen	e_id	e_top	be_id	be_top	c_id	c_created_at	c_kind	c_base	a_role
0	1965-0418x	ENG	1870	18-0101-E1R	n	1	174684	Z	NaN	NaN	224461	2018-03-29 23:10:24.573038	E	NaN	EP
1	1965-0418x	ENG	1870	18-0101-E1R	n	2	174685	Z	NaN	NaN	224462	2018-03-29 23:10:24.595501	E	NaN	EP
2	1965-0418x	ENG	1870	18-0101-E1R	n	3	174686	Z	NaN	NaN	224463	2018-03-29 23:10:24.628362	E	NaN	EP
3	1965-0418x	ENG	1870	18-0101-E1R	n	4	174687	Z	NaN	NaN	224464	2018-03-29 23:10:24.650119	E	NaN	EP
4	1965-0418x	ENG	1870	18-0101-E1R	n	5	174688	Z	NaN	NaN	224465	2018-03-29 23:10:24.670806	E	NaN	EP

	m_descriptor	t_lan	t_version	s_rsen	c_id
0	1965-0418x	ENG	18-0101-E1R	1	224461
1	1965-0418x	ENG	18-0101-E1R	2	224462
2	1965-0418x	ENG	18-0101-E1R	3	224463
3	1965-0418x	ENG	18-0101-E1R	4	224464
4	1965-0418x	ENG	18-0101-E1R	5	224465

	m_descriptor	t_lan	t_version	s_rsen	c_id	e_content
33415	1956-0805	ENG	15-0402-b	1176	454335	NaN
43018	1957-0419	ENG	15-0401-b	505	13306	NaN

	m_descriptor	t_lan	t_version	s_rsen	c_id	e_content
33415	1956-0805	ENG	15-0402-b	1176	454335
43018	1957-0419	ENG	15-0401-b	505	13306

	m_descriptor	t_lan	t_version	s_rsen	c_id	chars
0	1965-0418x	ENG	18-0101-E1R	1	224461	21
1	1965-0418x	ENG	18-0101-E1R	2	224462	225
2	1965-0418x	ENG	18-0101-E1R	3	224463	109
3	1965-0418x	ENG	18-0101-E1R	4	224464	49
4	1965-0418x	ENG	18-0101-E1R	5	224465	163
5	1965-0418x	ENG	18-0101-E1R	6	224466	96

	m_descriptor	t_lan	t_version	s_rsen	c_id	chars	words
0	1965-0418x	ENG	18-0101	1	224461	21	5
1	1965-0418x	ENG	18-0101	2	224462	225	40
2	1965-0418x	ENG	18-0101	3	224463	109	20
3	1965-0418x	ENG	18-0101	4	224464	49	10
4	1965-0418x	ENG	18-0101	5	224465	163	30
5	1965-0418x	ENG	18-0101	6	224466	96	17

	m_descriptor	t_lan	t_senc	t_version	s_typ	s_rsen	e_id	e_top	be_id	be_top	c_id	c_created_at	c_kind	c_eis	c_base	a_role
0	1965-0418x	AFR	1870	18-0101-B123E1R	n	1	181444	M	181444.0	M	844713	2020-01-15 02:13:34.847562	V	11	a	CE
1	1965-0418x	AFR	1870	18-0101-B123E1R	n	1	181444	M	181444.0	M	256723	2018-04-23 11:04:31.787641	V	28	a	TE
2	1965-0418x	AFR	1870	18-0101-B123E1R	n	2	339948	T	200635.0	N	468379	2019-01-30 22:21:29.62162	V	0	c	CE
3	1965-0418x	AFR	1870	18-0101-B123E1R	n	2	200635	N	181445.0	N	256725	2018-04-23 11:23:43.781013	V	0	c	TE
4	1965-0418x	AFR	1870	18-0101-B123E1R	n	3	200637	M	200636.0	N	256727	2018-04-23 11:26:37.965897	V	0	c	TE

				e_top	be_top	c_created_at		c_kind	c_base	a_role
				last	last	last	count	last	last	last
m_descriptor	t_lan	t_version	s_rsen
1948-0304	GER	15-0902-B123	1	M	NaN	2019-07-30 13:44:14.904495	1	V	c	TE
			2	M	M	2019-07-30 13:44:34.151158	1	V	a	TE
			3	M	M	2019-07-30 13:44:45.966924	1	V	a	TE
			4	M	M	2019-07-30 13:44:53.096079	1	V	a	TE
			5	M	N	2019-07-30 13:45:56.611909	1	V	c	TE
...	...	...	...	...	...	...	...	...	...	...
CAB-06	AFR	18-1101-B123	831	T	N	2020-05-25 02:43:59.956802	1	V	c	CE
			832	M	M	2020-05-25 02:44:26.239919	2	V	t	CE
			833	M	M	2020-05-25 02:44:38.590448	2	V	t	CE
			834	M	N	2020-02-02 00:24:14.79957	2	V	c	TE
			835	M	N	2020-02-02 00:24:42.429672	2	V	c	TE

	m_descriptor	t_lan	t_version	s_rsen	e_top	be_top	c_created_at	count	c_kind	c_base	a_role	chars	words
0	1948-0304	GER	15-0902	1	M	NaN	2019-07-30 13:44:14.904495	1	V	c	TE	112	20
1	1948-0304	GER	15-0902	2	M	M	2019-07-30 13:44:34.151158	1	V	a	TE	28	5
2	1948-0304	GER	15-0902	3	M	M	2019-07-30 13:44:45.966924	1	V	a	TE	41	7
3	1948-0304	GER	15-0902	4	M	M	2019-07-30 13:44:53.096079	1	V	a	TE	18	4
4	1948-0304	GER	15-0902	5	M	N	2019-07-30 13:45:56.611909	1	V	c	TE	106	18
5	1948-0304	GER	15-0902	6	M	N	2019-07-30 13:46:34.354399	1	V	c	TE	36	8

	m_descriptor	t_lan_E	t_version	s_rsen	c_id	chars_E	words_E	t_lan_V	e_top	be_top	c_created_at	count	c_kind	c_base	a_role	chars_V	words_V
0	1948-0304	ENG	15-0902	1	660286	98	21	GER	M	NaN	2019-07-30 13:44:14.904495	1	V	c	TE	112	20
1	1948-0304	ENG	15-0902	1	660286	98	21	POR	M	M	2020-05-25 18:11:14.358631	2	V	t	QE	104	18
2	1948-0304	ENG	15-0902	2	660287	27	6	GER	M	M	2019-07-30 13:44:34.151158	1	V	a	TE	28	5
3	1948-0304	ENG	15-0902	2	660287	27	6	POR	M	M	2020-05-25 18:11:30.820954	3	V	a	QE	25	5
4	1948-0304	ENG	15-0902	3	660288	36	7	GER	M	M	2019-07-30 13:44:45.966924	1	V	a	TE	41	7
5	1948-0304	ENG	15-0902	3	660288	36	7	POR	M	M	2020-05-25 18:11:42.224382	2	V	t	QE	32	7