Purpose

There is a relationship between the number of words (and the number of characters) in a source-language sentence and in its target-language translation. If this relationship can be established and captured in a model, such a model can be helpful in at least two ways:

  • For training: Validate the alignment of a sentence pair (a training example) by comparing the two sentences' word and/or character counts
  • For inference: Validate the word and/or character count of a translated/proofread sentence

In this notebook we make good on that proposition: we discover such a model for each language and evaluate its use in the two roles above (a sketch of the intended usage follows below).
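To make the first role concrete, the sketch below flags a training pair whose character counts are implausibly far apart for the language in question. The per-language ratios and the tolerance are placeholder values standing in for the model built later in this notebook.

```python
# Illustrative stand-in for the learned model: an expected chars_V / chars_E
# ratio per target language, plus a relative tolerance (both hypothetical).
EXPECTED_RATIO = {'AFR': 1.05, 'GER': 1.10, 'CHN': 0.40}

def plausible_alignment(chars_E, chars_V, lang, rel_tol=0.5):
    """Return True if the pair's character counts are consistent enough
    to accept the alignment as a training example."""
    expected = EXPECTED_RATIO[lang] * chars_E
    return abs(chars_V - expected) <= rel_tol * expected

print(plausible_alignment(chars_E=276, chars_V=60, lang='AFR'))  # False -> suspect alignment
```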

Dataset and Variables

The dataset used in this notebook contains the following features:

  • m_descriptor: Unique identifier of a document
  • t_lan_E: Language of the translation (English is also considered a translation)
  • t_version: Version of a translation
  • s_rsen: Number of a sentence within a document
  • c_id: Database primary key of a contribution
  • e_content_E: Text content of an English contribution
  • chars_E: Number of characters in an English contribution
  • words_E: Number of words in an English contribution
  • t_lan_V: Language of the translation
  • e_top: N/A
  • be_top: N/A
  • c_created_at: Creation time of a contribution
  • count: N/A
  • c_kind: Kind of a contribution
  • c_base: N/A
  • a_role: N/A
  • u_name: N/A
  • e_content_V: Text content of a translated contribution
  • chars_V: Number of characters in a translated contribution
  • words_V: Number of words in a translated contribution

Setup the Environment

# # ! pip install fastai
# ! pip install fastai2
# ! pip install nbdev

!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()
! pip list | grep fastai
# ! pip list | grep fastai2
from fastai.tabular.all import *
from fastbook import *

!python --version
Python 3.6.9
# base_dir is assumed to have been defined earlier (e.g. a project folder on mounted Drive)
PATH = Path(base_dir + './'); #PATH
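If `base_dir` is not already defined in your environment, one way to set it in Colab is to mount Google Drive and point it at the project folder. A minimal sketch; the folder name is an assumption:

```python
# Mount Drive and set base_dir (folder name 'translation_counts' is hypothetical).
from google.colab import drive
from pathlib import Path

drive.mount('/content/gdrive')
base_dir = '/content/gdrive/My Drive/translation_counts/'
PATH = Path(base_dir)
```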

Get train/valid data

We now ingest the data we prepared in Part 1. Note that the text content columns (e_content_E, e_content_V) are truncated in the display, since the full text makes the presentation unwieldy.

# for lang in langs:
#   df_out = df_raw[df_raw['t_lan_V']==lang]
#   df_out.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-{lang}-output.csv', sep='~', index = False, header=True)
import glob

pd.set_option('display.max_colwidth', 10)  # truncate long text columns in the display

# Ingest the per-language CSVs prepared in Part 1 and concatenate them into one frame
all_files = glob.glob(f"{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-*-output.csv")
li = []
for filename in all_files:
    dft = pd.read_csv(filename, index_col=None, header=0, sep='~')
    li.append(dft)
df = pd.concat(li, axis=0, ignore_index=True)
df.iloc[:5, :-2]
# df.iloc[:5, ~df.columns.isin([])]
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V
0 1958-0... ENG 14-0101 1 461719 The Se... 18 4 AFR M M 2019-0... 2 V a TE tilvan Die Sl...
1 1958-0... ENG 14-0101 2 461720 Dear G... 276 51 AFR M M 2019-0... 2 V t CE engest God, d...
2 1958-0... ENG 14-0101 3 461721 As we ... 105 20 AFR M N 2019-0... 2 V c TE tilvan Soos o...
3 1958-0... ENG 14-0101 4 461722 “When ... 121 22 AFR M M 2019-0... 2 V t CE engest "Toe d...
4 1958-0... ENG 14-0101 5 461723 How do... 174 34 AFR M M 2019-0... 2 V t CE engest Hoe we...
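Before going further, it can be worth checking that the ingested frame exposes the columns listed under Dataset and Variables that the rest of the notebook relies on. A minimal sketch:

```python
# Columns used downstream for modelling; fail fast if any are missing.
required = ['t_lan_V', 'chars_E', 'words_E', 'chars_V', 'words_V']

missing = [c for c in required if c not in df.columns]
assert not missing, f"Missing expected columns: {missing}"
print(f"{len(df):,} rows across {df['t_lan_V'].nunique()} target languages")
```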

Inspect the distribution of the output data points

import seaborn as sns
plt.figure(figsize=(20,10))
plt.xlabel('words_V', fontsize=14, color='black')
plt.ylabel('Count', fontsize=14, rotation=90, color='black')
sns.histplot(df['words_V'], bins=100);

Inspect the signal in the predictors

import plotly.express as px
fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', opacity=.3, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
#try instead to relate chars, rather than words
fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', opacity=.3, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

Set y

y = 'chars_V'

Inspect the distribution of y=chars_V

plt.figure(figsize=(20,10))
sns.histplot(df[y], bins=50);

Log-transform y

df['log_chars_V'] = np.log(df['chars_V'] + 1)
plt.figure(figsize=(20,10))
sns.histplot(df['log_chars_V'], bins=50);
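Since the target is log(chars_V + 1), any prediction must later be mapped back to a character count with the inverse transform exp(pred) - 1. A minimal round-trip check (np.expm1 is the numerically stable inverse of np.log1p):

```python
# Round-trip sanity check of the target transform.
import numpy as np

chars = np.array([0, 1, 18, 276])
logged = np.log(chars + 1)       # same as np.log1p(chars)
recovered = np.expm1(logged)     # inverse: exp(x) - 1
assert np.allclose(recovered, chars)
```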

Setup training

procs = [Categorify, FillMissing, Normalize]  # standard fastai tabular preprocessing (per Howard)
splits = RandomSplitter(valid_pct=0.2)(range_of(df)); splits
((#198692) [204747,231121,174994,99849,45226,4105,135057,26667,190650,158384...],
 (#49672) [232278,174106,215598,45325,149150,122943,216610,159334,30470,20072...])
df.columns
# print(*list(df.columns), sep='\n')
Index(['m_descriptor', 't_lan_E', 't_version', 's_rsen', 'c_id', 'e_content_E',
       'chars_E', 'words_E', 't_lan_V', 'e_top', 'be_top', 'c_created_at',
       'count', 'c_kind', 'c_base', 'a_role', 'u_name', 'e_content_V',
       'chars_V', 'words_V', 'log_chars_V'],
      dtype='object')
cats = ['t_lan_V']; conts = ['chars_E', 'words_E']
logy = 'log_chars_V'
pd.options.mode.chained_assignment = None
to = TabularPandas(df, procs=procs, cat_names=cats, cont_names=conts, y_names=logy, 
                  #  y_block=RegressionBlock(), 
                   splits=splits, inplace=True, reduce_memory=True)
to.show()
t_lan_V chars_E words_E log_chars_V
204747 GER 15.0 3.0 2.890372
231121 POR 39.0 7.0 3.367296
174994 FIJ 76.0 14.0 4.653960
99849 BEM 59.0 12.0 4.304065
45226 AFR 219.0 44.0 5.438079
4105 AFR 41.0 10.0 3.828641
135057 CHN 36.0 8.0 2.484907
26667 AFR 184.0 36.0 5.236442
190650 GER 60.0 12.0 4.094345
158384 FIJ 63.0 10.0 4.406719
BS = 1024 #1024, 512, 256, 128, 64
dls = to.dataloaders(bs=BS)
dls.show_batch()
t_lan_V chars_E words_E log_chars_V
0 CHN 58.000000 12.0 3.044523
1 FIJ 28.000000 5.0 3.610918
2 FIJ 43.000000 10.0 4.025352
3 CHN 108.000001 25.0 3.713572
4 GER 36.000000 7.0 3.828641
5 CHN 51.000000 11.0 2.944439
6 BEM 41.000000 7.0 4.007333
7 BEM 20.000001 4.0 3.218876
8 POR 80.999999 16.0 4.356709
9 FIJ 41.000000 11.0 3.931826
# np.min(df[logy]), np.max(df[logy])
# min_log_y = np.min(df[logy]); min_log_y
# max_log_y = np.max(df[logy])*1.2; max_log_y
# y_range = torch.tensor([min_log_y, max_log_y]); y_range

y = to.train.y
y.min(),y.max()
(0.0, 6.4457197189331055)

Train model

Without Hyper-Parameter-Optimization (HPO)

# from fastai2.callback.all import EarlyStoppingCallback, SaveModelCallback
from fastai.callback.all import EarlyStoppingCallback, SaveModelCallback

learn = tabular_learner(dls=dls,
                        layers=[500, 250],
                        config=tabular_config(
                            # ps=[.001, .01]
                            # embed_p=0.04
                            y_range=(0, 6.5)),  # clamp predictions to the observed log range
                        n_out=1,                # single regression output
                        loss_func=F.mse_loss,
                        metrics=[exp_rmspe],
                        # metrics=[r2_score, exp_rmspe],
                        # metrics=[mean_squared_error],
                        # wd=.01,
                        # loss_func=MSELossFlat(),
                        # cbs=EarlyStoppingCallback(monitor='valid_loss', min_delta=0.001, patience=3),
                        )
learn.lr_find()
SuggestedLRs(lr_min=0.0015848932787775993, lr_steep=0.00013182566908653826)
! ls -ltrh {learn.path}/{learn.model_dir}
total 0
callbacks = [
  SaveModelCallback(monitor='valid_loss', comp=np.less, min_delta=0.001, fname='best'),
  EarlyStoppingCallback(monitor='valid_loss', comp=np.less, min_delta=0.001, patience=3) #comp=np.greater
]
# learn.fit_one_cycle(n_epoch=20, lr_max=.02e-3)
learn.fit_one_cycle(n_epoch=20, lr_max=1e-4, cbs=callbacks)
epoch train_loss valid_loss _exp_rmspe time
0 0.364237 0.141579 0.466485 00:11
1 0.059181 0.043871 0.244650 00:11
2 0.040825 0.036200 0.217114 00:11
3 0.037213 0.033582 0.218490 00:11
4 0.037571 0.033678 0.216463 00:11
5 0.037653 0.032857 0.210304 00:11
6 0.035941 0.033145 0.205564 00:11
Better model found at epoch 0 with valid_loss value: 0.14157940447330475.
Better model found at epoch 1 with valid_loss value: 0.04387086257338524.
Better model found at epoch 2 with valid_loss value: 0.036199815571308136.
Better model found at epoch 3 with valid_loss value: 0.033582184463739395.
No improvement since epoch 3: early stopping
! ls -ltrh {learn.path}/{learn.model_dir}/
! pwd
total 524K
-rw-r--r-- 1 root root 524K Oct 22 15:14 best.pth
/content
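SaveModelCallback only stores raw weights (the best.pth listed above). To reuse the model outside this notebook, e.g. in the validation roles described under Purpose, the learner can also be exported together with its preprocessing; the file name below is an assumption.

```python
# Export the trained learner (model + preprocessing) for later inference.
learn.export('char_count_model.pkl')

# Later / elsewhere: reload and predict without re-running this notebook.
# learn_inf = load_learner('char_count_model.pkl')
```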
learn.recorder.plot_loss(skip_start=100, with_valid=True)
inputs, preds, targs = learn.get_preds(with_input=True)  # inputs = (categorical codes, continuous values)
# F.r_mse(preds, targs)
# pd.Series(inputs)
inputs[0].squeeze()
#- inputs.squeeze()
tensor([13,  5,  9,  ...,  1,  5,  3])
# Compare validation targets with predictions (both on the log scale);
# the earlier version of this cell reused inputs[0] for all three columns by mistake.
dfval = pd.DataFrame({'targs': targs.squeeze().numpy(), 'preds': preds.squeeze().numpy()})
dfval

learn.show_results()
t_lan_V chars_E words_E log_chars_V log_chars_V_pred
0 1.0 -0.980276 -1.053134 2.833213 2.417679
1 3.0 -0.253450 -0.088161 2.944439 2.815471
2 1.0 4.232105 3.986167 5.493062 5.540545
3 1.0 -0.731078 -0.517038 3.258096 3.354484
4 5.0 -0.232683 -0.409819 3.828641 4.095329
5 1.0 -0.523414 -0.517038 3.784190 3.588533
6 6.0 -0.502647 -0.302600 3.891820 3.750283
7 3.0 0.078814 0.019058 3.091043 3.027725
8 1.0 0.909472 0.662373 4.700480 4.621726
learn.recorder.show_results()
t_lan_V chars_E words_E log_chars_V log_chars_V_pred
0 3.0 -0.211917 -0.195380 2.995732 2.832705
1 1.0 -1.063341 -1.053134 2.484907 2.235619
2 12.0 0.390310 0.126277 4.382027 4.418166
3 3.0 -0.544180 -0.624257 2.397895 2.505552
4 3.0 -1.146407 -1.160353 1.098612 1.426897
5 2.0 0.784873 0.984030 4.820282 4.698935
6 1.0 -0.294983 -0.302600 3.850147 3.853727
7 15.0 1.366334 1.091250 4.787492 4.794899
8 6.0 -0.959509 -0.945915 2.484907 2.728739
learn.recorder.values
[(#2) [0.030756311491131783,0.1863650381565094]]
trn_losses = [row[0] for row in learn.recorder.values]; trn_losses
[0.030756311491131783]
val_losses = [row[1] for row in learn.recorder.values]; val_losses
[0.1863650381565094]
plt.plot(trn_losses)
plt.plot(val_losses)
[<matplotlib.lines.Line2D at 0x7fbc74f4fcf8>]
best_epoch = np.argmin(val_losses)
ret = val_losses[best_epoch]
ret
0.1863650381565094
trn_losses = [row[0] for row in learn.recorder.values]
val_losses = [row[1] for row in learn.recorder.values]
best_epoch = np.argmin(val_losses); best_epoch
0
type(dls)
fastai.tabular.data.TabularDataLoaders
df.columns
Index(['m_descriptor', 't_lan_E', 't_version', 's_rsen', 'c_id', 'e_content_E',
       'chars_E', 'words_E', 't_lan_V', 'e_top', 'be_top', 'c_created_at',
       'count', 'c_kind', 'c_base', 'a_role', 'u_name', 'e_content_V',
       'chars_V', 'words_V', 'log_chars_V'],
      dtype='object')
x0 = pd.DataFrame({
    'chars_E':[20], #chars
    'words_E':[5], #words
    't_lan_V':['CHN'],
    })
full_dec,dec,out = learn.predict(x0.iloc[0]); out
tensor([2.1375])
out.numpy()[0]
# float(out.data[0])
2.137506
# Invert the log(x+1) target transform: predicted chars = exp(pred) - 1
print(np.exp(out.numpy()[0]) - 1, 'translated chars ...')
7.478267 translated chars ...
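To use this prediction in the validation role described under Purpose, one option is to compare the predicted character count against the actual translation and flag large deviations. A minimal sketch; the helper name and the 30% tolerance are illustrative assumptions, not part of the trained model:

```python
def check_translation_length(learn, chars_E, words_E, t_lan_V, actual_chars, rel_tol=0.3):
    """Predict the expected character count for a translation and flag it
    if the actual count deviates by more than rel_tol (illustrative threshold)."""
    row = pd.DataFrame({'chars_E': [chars_E], 'words_E': [words_E], 't_lan_V': [t_lan_V]})
    _, _, out = learn.predict(row.iloc[0])
    expected = float(np.exp(out.numpy()[0]) - 1)  # invert the log(x+1) target transform
    ok = abs(actual_chars - expected) <= rel_tol * expected
    return ok, expected

ok, expected = check_translation_length(learn, chars_E=20, words_E=5,
                                        t_lan_V='CHN', actual_chars=30)
print(f"expected {expected:.1f} chars, within tolerance: {ok}")
```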

Inference with validation data

# DatasetType was fastai v1 API; in fastai v2, get_preds takes ds_idx
# (0 = training set, 1 = validation set, which is the default).
tmp = learn.get_preds(ds_idx=1, with_loss=True); type(tmp), len(tmp)
tmp = learn.get_preds(); tmp
(tensor([[3.5538],
         [4.6666],
         [2.9208],
         ...,
         [5.1425],
         [3.4644],
         [3.4176]]), tensor([[3.7377],
         [4.8363],
         [3.1781],
         ...,
         [4.9053],
         [3.6376],
         [3.7612]]))
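To judge how useful the model is for the validation roles, it helps to look at the error on the original character scale rather than the log scale. A minimal sketch, assuming the `preds`/`targs` obtained from get_preds above:

```python
# Convert log-scale predictions and targets back to character counts
# and report mean/median absolute error in characters.
pred_chars = np.expm1(preds.squeeze().numpy())
true_chars = np.expm1(targs.squeeze().numpy())

abs_err = np.abs(pred_chars - true_chars)
print(f"MAE: {abs_err.mean():.1f} chars, median AE: {np.median(abs_err):.1f} chars")
```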
 
 
# sns.barplot(x=topcounts.index, y=topcounts.values, alpha=0.8)
# plt.title('Distribution of e_top values')
# plt.ylabel('Number of Occurrences', fontsize=12)
# plt.xlabel('e_top values', fontsize=12)
# plt.show();