In this notebook we will start to explore the contribution data points.

As pointed out in part 1, there is a relationship between the number of words (and the number of characters) in the source language and the target language. If this relationship can be established and captured in yet another model, such a model can be helpful in at least two ways:

  • For training: Validate the alignment of two sentences (in a training example) by comparing their word size and/or character size
  • For inference: Validate the word size and/or character size of a translated/proofread sentence
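Such a length-based check can be sketched as follows. The function name and the ratio bounds are hypothetical placeholders; in practice the bounds would come from the model fitted in a later part.

```python
# Sketch of a length-based alignment check (hypothetical bounds;
# a fitted per-language model would replace them later).
def plausible_alignment(words_src, words_tgt, lo=0.5, hi=2.0):
    """Flag a sentence pair whose target/source word ratio is extreme."""
    if words_src == 0 or words_tgt == 0:
        return False  # empty sentences cannot be validated this way
    ratio = words_tgt / words_src
    return lo <= ratio <= hi

plausible_alignment(21, 20)  # a typical pair
plausible_alignment(21, 2)   # a suspiciously short translation
```

The same idea applies to character counts, with bounds fitted per target language.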

In this notebook we will color the data points by target language and output a per-language data file for further analysis.

Dataset and Variables

The dataset used in this notebook contains the following features:

  • m_descriptor: Unique identifier of a document
  • t_lan_E: Language of the translation (English is also considered a translation)
  • t_version: Version of a translation
  • s_rsen: Number of a sentence within a document
  • c_id: Database primary key of a contribution
  • e_content_E: Text content of an English contribution
  • chars_E: Number of characters in an English contribution
  • words_E: Number of words in an English contribution
  • t_lan_V: Language of the target translation
  • e_top: N/A
  • be_top: N/A
  • c_created_at: Creation time of a contribution
  • c_kind: Kind of a contribution
  • c_base: N/A
  • a_role: N/A
  • u_name: N/A
  • e_content_V: Text content of a translated contribution
  • chars_V: Number of characters in a translated contribution
  • words_V: Number of words in a translated contribution

Setup the Environment

from pathlib import Path
import pandas as pd
import plotly.express as px
%matplotlib inline
!python --version
Python 3.6.9
PATH = Path(base_dir)  # base_dir points at the data directory; #PATH

Get train/valid data

We now ingest all the data we prepared in part 1. Note that the content (e_content) for each contribution is not displayed as it often makes the presentation unwieldy.

df_raw = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_1-output.csv', sep='~')
df_raw.loc[:5, ~df_raw.columns.isin(['e_content_E', 'u_name', 'e_content_V'])]
m_descriptor t_lan_E t_version s_rsen c_id chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role chars_V words_V
0 1948-0304 ENG 15-0902 1 660286 98 21 GER M NaN 2019-07-30 13:44:14.904495 1 V c TE 112 20
1 1948-0304 ENG 15-0902 1 660286 98 21 POR M M 2020-05-25 18:11:14.358631 2 V t QE 104 18
2 1948-0304 ENG 15-0902 2 660287 27 6 GER M M 2019-07-30 13:44:34.151158 1 V a TE 28 5
3 1948-0304 ENG 15-0902 2 660287 27 6 POR M M 2020-05-25 18:11:30.820954 3 V a QE 25 5
4 1948-0304 ENG 15-0902 3 660288 36 7 GER M M 2019-07-30 13:44:45.966924 1 V a TE 41 7
5 1948-0304 ENG 15-0902 3 660288 36 7 POR M M 2020-05-25 18:11:42.224382 2 V t QE 32 7

Inspect the ranges of chars and words variables

def display_all(df):  # helper: show the full frame without pandas truncation
    with pd.option_context('display.max_rows', 1000, 'display.max_columns', 1000):
        display(df)
display_all(df_raw.describe(include='all').T)
# conclusion: min and max are fine for both chars and words
count unique top freq mean std min 25% 50% 75% max
m_descriptor 248389 114 1965-0725x 8031 NaN NaN NaN NaN NaN NaN NaN
t_lan_E 248389 1 ENG 248389 NaN NaN NaN NaN NaN NaN NaN
t_version 248389 32 15-0402 43958 NaN NaN NaN NaN NaN NaN NaN
s_rsen 248389 NaN NaN NaN 802.646 573.306 1 353 720 1141 3714
c_id 248389 NaN NaN NaN 441003 298336 1795 171614 454291 701986 1.10324e+06
e_content_E 248387 139965 See? 4671 NaN NaN NaN NaN NaN NaN NaN
chars_E 248389 NaN NaN NaN 59.2004 48.1156 0 25 45 80 621
words_E 248389 NaN NaN NaN 11.82 9.31669 0 5 9 16 120
t_lan_V 248389 17 AFR 72063 NaN NaN NaN NaN NaN NaN NaN
e_top 248389 2 M 233435 NaN NaN NaN NaN NaN NaN NaN
be_top 241253 3 M 137328 NaN NaN NaN NaN NaN NaN NaN
c_created_at 248389 248389 2018-11-10 03:44:51.85236 1 NaN NaN NaN NaN NaN NaN NaN
count 248389 NaN NaN NaN 1.45781 0.625253 1 1 1 2 6
c_kind 248389 1 V 248389 NaN NaN NaN NaN NaN NaN NaN
c_base 244521 4 c 104785 NaN NaN NaN NaN NaN NaN NaN
a_role 248389 4 TE 174112 NaN NaN NaN NaN NaN NaN NaN
u_name 248389 52 dawnxu 29849 NaN NaN NaN NaN NaN NaN NaN
e_content_V 248364 224698 Sien? 1476 NaN NaN NaN NaN NaN NaN NaN
chars_V 248389 NaN NaN NaN 55.725 51.7913 0 19 39 77 629
words_V 248389 NaN NaN NaN 10.0926 9.77821 0 3 7 14 126
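The conclusion above can also be checked programmatically. This is a rough sketch: the column names come from the dataset, but the `check_ranges` helper and its upper bounds are assumptions, and the sample frame only illustrates the call (with the real data, pass `df_raw`).

```python
import pandas as pd

# Programmatic version of the eyeball check above. The caps are
# assumptions loosely based on the describe() output, not project constants.
def check_ranges(df, max_chars=1000, max_words=200):
    for col, cap in [('chars_E', max_chars), ('words_E', max_words),
                     ('chars_V', max_chars), ('words_V', max_words)]:
        assert df[col].min() >= 0, f'{col} has negative values'
        assert df[col].max() <= cap, f'{col} exceeds {cap}'

sample = pd.DataFrame({'chars_E': [98, 27], 'words_E': [21, 6],
                       'chars_V': [112, 28], 'words_V': [20, 5]})
check_ranges(sample)  # passes silently when the ranges look sane
```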

Translation Word Count

First, we will plot all the data and color by target language.

fig = px.scatter(data_frame=df_raw, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', opacity=.3, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
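As a first quantitative cut at the relationship the plot shows, one could compute a per-language median of the words_V/words_E ratio. The snippet below sketches this on a tiny illustrative frame; with the real data the same expression would be applied to `df_raw`.

```python
import pandas as pd

# Per-language median word ratio, sketched on a toy frame shaped like
# df_raw (column names from the dataset; the values are illustrative).
toy = pd.DataFrame({
    't_lan_V': ['GER', 'GER', 'POR', 'POR'],
    'words_E': [21, 6, 21, 6],
    'words_V': [20, 5, 18, 5],
})
ratio = (toy['words_V'] / toy['words_E']).groupby(toy['t_lan_V']).median()
print(ratio)
```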

Translation Character Count

fig = px.scatter(data_frame=df_raw, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', opacity=.3, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
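For character counts, a least-squares slope through the origin condenses the relationship into a single scale factor. A minimal sketch with illustrative values; with the real data this would be computed per `t_lan_V` group of `df_raw`.

```python
import numpy as np

# Least-squares slope through the origin for chars_V vs chars_E
# (illustrative values taken from the preview rows above).
chars_E = np.array([98, 27, 36], dtype=float)
chars_V = np.array([112, 28, 41], dtype=float)
slope = (chars_E @ chars_V) / (chars_E @ chars_E)
print(round(slope, 3))
```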

Output a data file for each language

Analysis may be easier still if we select a single target language at a time.

df_raw['t_lan_V'].unique()
array(['GER', 'POR', 'POB', 'CHN', 'BEM', 'AFR', 'LIN', 'IND', 'FIJ',
       'LUG', 'TWI', 'LUA', 'FAS', 'SHO', 'IBO', 'YOR', 'SWA'],
      dtype=object)
outlangs = list(df_raw['t_lan_V'].unique())
for outlang in outlangs:
    df_out = df_raw[df_raw['t_lan_V'] == outlang]
    df_out.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-{outlang}-output.csv',
                  sep='~', index=False, header=True)
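A quick sanity check that the per-language slices partition the data without loss: the row counts of the slices should sum to the full frame. Sketched on a toy frame here; with the real output one could read the written CSVs back instead.

```python
import pandas as pd

# The per-language slices should partition the frame exactly:
# every row lands in exactly one slice.
toy = pd.DataFrame({'t_lan_V': ['GER', 'POR', 'GER'], 'c_id': [1, 2, 3]})
parts = {lang: toy[toy['t_lan_V'] == lang] for lang in toy['t_lan_V'].unique()}
assert sum(len(p) for p in parts.values()) == len(toy)
```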