In this notebook we will start to explore the contribution data points.

As pointed out in part 1, there is a relationship between the number of words (and the number of characters) in the source language and the target language. If this relationship can be established and captured in yet another model, such a model can be helpful in at least two ways:

  • For training: Validate the alignment of two sentences (in a training example) by comparing their word size and/or character size
  • For inference: Validate the word size and/or character size of a translated/proofread sentence
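Such a length-based check can be sketched as follows. The function name and the ratio bounds are hypothetical placeholders; in practice the bounds would come from the model fitted in a later part.

```python
# Sketch of a length-based alignment check (hypothetical bounds;
# a fitted per-language model would replace them later).
def plausible_alignment(words_src, words_tgt, lo=0.5, hi=2.0):
    """Flag a sentence pair whose target/source word ratio is extreme."""
    if words_src == 0 or words_tgt == 0:
        return False  # empty sentences cannot be validated this way
    ratio = words_tgt / words_src
    return lo <= ratio <= hi

plausible_alignment(21, 20)  # a typical pair
plausible_alignment(21, 2)   # a suspiciously short translation
```

The same idea applies to character counts, with bounds fitted per target language.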

In this notebook we will color the data points by target language and output a per-language data file for further analysis.

Dataset and Variables

The dataset used in this notebook contains the following features:

  • m_descriptor: Unique identifier of a document
  • t_lan_E: Language of the translation (English is also considered a translation)
  • t_version: Version of a translation
  • s_rsen: Number of a sentence within a document
  • c_id: Database primary key of a contribution
  • e_content_E: Text content of an English contribution
  • chars_E: Number of characters in an English contribution
  • words_E: Number of words in an English contribution
  • t_lan_V: Language of the target translation
  • e_top: N/A
  • be_top: N/A
  • c_created_at: Creation time of a contribution
  • c_kind: Kind of a contribution
  • c_base: N/A
  • a_role: N/A
  • u_name: N/A
  • e_content_V: Text content of a translated contribution
  • chars_V: Number of characters in a translated contribution
  • words_V: Number of words in a translated contribution

Setup the Environment

from pathlib import Path
import pandas as pd
import plotly.express as px
%matplotlib inline
!python --version
Python 3.6.9
PATH = Path(base_dir)  # base_dir points at the data directory; #PATH

Get train/valid data

We now ingest all the data we prepared in part 1. Note that the content (e_content) for each contribution is not displayed as it often makes the presentation unwieldy.

df_raw = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_1-output.csv', sep='~')
df_raw.loc[:5, ~df_raw.columns.isin(['e_content_E', 'u_name', 'e_content_V'])]
m_descriptor t_lan_E t_version s_rsen c_id chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role chars_V words_V
0 1948-0304 ENG 15-0902 1 660286 98 21 GER M NaN 2019-07-30 13:44:14.904495 1 V c TE 112 20
1 1948-0304 ENG 15-0902 1 660286 98 21 POR M M 2020-05-25 18:11:14.358631 2 V t QE 104 18
2 1948-0304 ENG 15-0902 2 660287 27 6 GER M M 2019-07-30 13:44:34.151158 1 V a TE 28 5
3 1948-0304 ENG 15-0902 2 660287 27 6 POR M M 2020-05-25 18:11:30.820954 3 V a QE 25 5
4 1948-0304 ENG 15-0902 3 660288 36 7 GER M M 2019-07-30 13:44:45.966924 1 V a TE 41 7
5 1948-0304 ENG 15-0902 3 660288 36 7 POR M M 2020-05-25 18:11:42.224382 2 V t QE 32 7

Inspect the ranges of chars and words variables

def display_all(df):  # helper: show the full frame without pandas truncation
    with pd.option_context('display.max_rows', 1000, 'display.max_columns', 1000):
        display(df)
display_all(df_raw.describe(include='all').T)
# conclusion: min and max are fine for both chars and words
count unique top freq mean std min 25% 50% 75% max
m_descriptor 248389 114 1965-0725x 8031 NaN NaN NaN NaN NaN NaN NaN
t_lan_E 248389 1 ENG 248389 NaN NaN NaN NaN NaN NaN NaN
t_version 248389 32 15-0402 43958 NaN NaN NaN NaN NaN NaN NaN
s_rsen 248389 NaN NaN NaN 802.646 573.306 1 353 720 1141 3714
c_id 248389 NaN NaN NaN 441003 298336 1795 171614 454291 701986 1.10324e+06
e_content_E 248387 139965 See? 4671 NaN NaN NaN NaN NaN NaN NaN
chars_E 248389 NaN NaN NaN 59.2004 48.1156 0 25 45 80 621
words_E 248389 NaN NaN NaN 11.82 9.31669 0 5 9 16 120
t_lan_V 248389 17 AFR 72063 NaN NaN NaN NaN NaN NaN NaN
e_top 248389 2 M 233435 NaN NaN NaN NaN NaN NaN NaN
be_top 241253 3 M 137328 NaN NaN NaN NaN NaN NaN NaN
c_created_at 248389 248389 2018-11-10 03:44:51.85236 1 NaN NaN NaN NaN NaN NaN NaN
count 248389 NaN NaN NaN 1.45781 0.625253 1 1 1 2 6
c_kind 248389 1 V 248389 NaN NaN NaN NaN NaN NaN NaN
c_base 244521 4 c 104785 NaN NaN NaN NaN NaN NaN NaN
a_role 248389 4 TE 174112 NaN NaN NaN NaN NaN NaN NaN
u_name 248389 52 dawnxu 29849 NaN NaN NaN NaN NaN NaN NaN
e_content_V 248364 224698 Sien? 1476 NaN NaN NaN NaN NaN NaN NaN
chars_V 248389 NaN NaN NaN 55.725 51.7913 0 19 39 77 629
words_V 248389 NaN NaN NaN 10.0926 9.77821 0 3 7 14 126
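The conclusion above can also be checked programmatically. This is a rough sketch: the column names come from the dataset, but the `check_ranges` helper and its upper bounds are assumptions, and the sample frame only illustrates the call (with the real data, pass `df_raw`).

```python
import pandas as pd

# Programmatic version of the eyeball check above. The caps are
# assumptions loosely based on the describe() output, not project constants.
def check_ranges(df, max_chars=1000, max_words=200):
    for col, cap in [('chars_E', max_chars), ('words_E', max_words),
                     ('chars_V', max_chars), ('words_V', max_words)]:
        assert df[col].min() >= 0, f'{col} has negative values'
        assert df[col].max() <= cap, f'{col} exceeds {cap}'

sample = pd.DataFrame({'chars_E': [98, 27], 'words_E': [21, 6],
                       'chars_V': [112, 28], 'words_V': [20, 5]})
check_ranges(sample)  # passes silently when the ranges look sane
```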

Translation Word Count

First, we will plot all the data and color by target language.

fig = px.scatter(data_frame=df_raw, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', opacity=.3, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
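As a first quantitative cut at the relationship the plot shows, one could compute a per-language median of the words_V/words_E ratio. The snippet below sketches this on a tiny illustrative frame; with the real data the same expression would be applied to `df_raw`.

```python
import pandas as pd

# Per-language median word ratio, sketched on a toy frame shaped like
# df_raw (column names from the dataset; the values are illustrative).
toy = pd.DataFrame({
    't_lan_V': ['GER', 'GER', 'POR', 'POR'],
    'words_E': [21, 6, 21, 6],
    'words_V': [20, 5, 18, 5],
})
ratio = (toy['words_V'] / toy['words_E']).groupby(toy['t_lan_V']).median()
print(ratio)
```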

Translation Character Count

fig = px.scatter(data_frame=df_raw, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', opacity=.3, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
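For character counts, a least-squares slope through the origin condenses the relationship into a single scale factor. A minimal sketch with illustrative values; with the real data this would be computed per `t_lan_V` group of `df_raw`.

```python
import numpy as np

# Least-squares slope through the origin for chars_V vs chars_E
# (illustrative values taken from the preview rows above).
chars_E = np.array([98, 27, 36], dtype=float)
chars_V = np.array([112, 28, 41], dtype=float)
slope = (chars_E @ chars_V) / (chars_E @ chars_E)
print(round(slope, 3))
```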

Output a data file for each language

Analysis may be easier still if we select a single target language at a time.

df_raw['t_lan_V'].unique()
array(['GER', 'POR', 'POB', 'CHN', 'BEM', 'AFR', 'LIN', 'IND', 'FIJ',
       'LUG', 'TWI', 'LUA', 'FAS', 'SHO', 'IBO', 'YOR', 'SWA'],
      dtype=object)
outlangs = list(df_raw['t_lan_V'].unique())
for outlang in outlangs:
    df_out = df_raw[df_raw['t_lan_V'] == outlang]
    df_out.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-{outlang}-output.csv',
                  sep='~', index=False, header=True)
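A quick sanity check that the per-language slices partition the data without loss: the row counts of the slices should sum to the full frame. Sketched on a toy frame here; with the real output one could read the written CSVs back instead.

```python
import pandas as pd

# The per-language slices should partition the frame exactly:
# every row lands in exactly one slice.
toy = pd.DataFrame({'t_lan_V': ['GER', 'POR', 'GER'], 'c_id': [1, 2, 3]})
parts = {lang: toy[toy['t_lan_V'] == lang] for lang in toy['t_lan_V'].unique()}
assert sum(len(p) for p in parts.values()) == len(toy)
```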