Purpose

There is a relationship between the number of words (and the number of characters) in a source-language sentence and in its target-language translation. If this relationship can be established and captured in a model, such a model can be helpful in at least two ways:

  • For training: Validate the alignment of a sentence pair (a training example) by comparing the two sentences' word and/or character counts
  • For inference: Validate the word and/or character count of a translated/proofread sentence

In this notebook we make good on that proposition: we discover such a model for each language and evaluate its use in the two roles above (a sketch of the intended usage follows below).
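To make the first role concrete, the sketch below flags a training pair whose character counts are implausibly far apart for the language in question. The per-language ratios and the tolerance are placeholder values standing in for the model built later in this notebook.

```python
# Illustrative stand-in for the learned model: an expected chars_V / chars_E
# ratio per target language, plus a relative tolerance (both hypothetical).
EXPECTED_RATIO = {'AFR': 1.05, 'GER': 1.10, 'CHN': 0.40}

def plausible_alignment(chars_E, chars_V, lang, rel_tol=0.5):
    """Return True if the pair's character counts are consistent enough
    to accept the alignment as a training example."""
    expected = EXPECTED_RATIO[lang] * chars_E
    return abs(chars_V - expected) <= rel_tol * expected

print(plausible_alignment(chars_E=276, chars_V=60, lang='AFR'))  # False -> suspect alignment
```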

Dataset and Variables

The dataset used in this notebook contains the following features:

  • m_descriptor: Unique identifier of a document
  • t_lan_E: Language of the translation (English is also considered a translation)
  • t_version: Version of a translation
  • s_rsen: Number of a sentence within a document
  • c_id: Database primary key of a contribution
  • e_content_E: Text content of an English contribution
  • chars_E: Number of characters in an English contribution
  • words_E: Number of words in an English contribution
  • t_lan_V: Language of the translation
  • e_top: N/A
  • be_top: N/A
  • c_created_at: Creation time of a contribution
  • count: N/A
  • c_kind: Kind of a contribution
  • c_base: N/A
  • a_role: N/A
  • u_name: N/A
  • e_content_V: Text content of a translated contribution
  • chars_V: Number of characters in a translated contribution
  • words_V: Number of words in a translated contribution

Setup the Environment

# # ! pip install fastai
# ! pip install fastai2
# ! pip install nbdev

!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()
! pip list | grep fastai
# ! pip list | grep fastai2
from fastai.tabular.all import *
from fastbook import *

!python --version
Python 3.6.9
# base_dir is assumed to have been defined earlier (e.g. a project folder on mounted Drive)
PATH = Path(base_dir + './'); #PATH
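If `base_dir` is not already defined in your environment, one way to set it in Colab is to mount Google Drive and point it at the project folder. A minimal sketch; the folder name is an assumption:

```python
# Mount Drive and set base_dir (folder name 'translation_counts' is hypothetical).
from google.colab import drive
from pathlib import Path

drive.mount('/content/gdrive')
base_dir = '/content/gdrive/My Drive/translation_counts/'
PATH = Path(base_dir)
```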

Get train/valid data

We now ingest the data we prepared in Part 1. Note that the text content columns (e_content_E, e_content_V) are truncated in the display, since the full text makes the presentation unwieldy.

# for lang in langs:
#   df_out = df_raw[df_raw['t_lan_V']==lang]
#   df_out.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-{lang}-output.csv', sep='~', index = False, header=True)
import glob

pd.set_option('display.max_colwidth', 10)  # truncate long text columns in the display

# Ingest the per-language CSVs prepared in Part 1 and concatenate them into one frame
all_files = glob.glob(f"{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-*-output.csv")
li = []
for filename in all_files:
    dft = pd.read_csv(filename, index_col=None, header=0, sep='~')
    li.append(dft)
df = pd.concat(li, axis=0, ignore_index=True)
df.iloc[:5, :-2]
# df.iloc[:5, ~df.columns.isin([])]
m_descriptor t_lan_E t_version s_rsen c_id e_content_E chars_E words_E t_lan_V e_top be_top c_created_at count c_kind c_base a_role u_name e_content_V
0 1958-0... ENG 14-0101 1 461719 The Se... 18 4 AFR M M 2019-0... 2 V a TE tilvan Die Sl...
1 1958-0... ENG 14-0101 2 461720 Dear G... 276 51 AFR M M 2019-0... 2 V t CE engest God, d...
2 1958-0... ENG 14-0101 3 461721 As we ... 105 20 AFR M N 2019-0... 2 V c TE tilvan Soos o...
3 1958-0... ENG 14-0101 4 461722 “When ... 121 22 AFR M M 2019-0... 2 V t CE engest "Toe d...
4 1958-0... ENG 14-0101 5 461723 How do... 174 34 AFR M M 2019-0... 2 V t CE engest Hoe we...
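Before going further, it can be worth checking that the ingested frame exposes the columns listed under Dataset and Variables that the rest of the notebook relies on. A minimal sketch:

```python
# Columns used downstream for modelling; fail fast if any are missing.
required = ['t_lan_V', 'chars_E', 'words_E', 'chars_V', 'words_V']

missing = [c for c in required if c not in df.columns]
assert not missing, f"Missing expected columns: {missing}"
print(f"{len(df):,} rows across {df['t_lan_V'].nunique()} target languages")
```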

Inspect the distribution of the output data points

import seaborn as sns
plt.figure(figsize=(20,10))
plt.xlabel('words_V', fontsize=14, color='black')
plt.ylabel('Count', fontsize=14, rotation=90, color='black')
sns.histplot(df['words_V'], bins=100);

Inspect the signal in the predictors

import plotly.express as px
fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', opacity=.3, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
#try instead to relate chars, rather than words
fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', opacity=.3, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

Set y

y = 'chars_V'

Inspect the distribution of y=chars_V

plt.figure(figsize=(20,10))
sns.histplot(df[y], bins=50);

Log-transform y

df['log_chars_V'] = np.log(df['chars_V'] + 1)
plt.figure(figsize=(20,10))
sns.histplot(df['log_chars_V'], bins=50);
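Since the target is log(chars_V + 1), any prediction must later be mapped back to a character count with the inverse transform exp(pred) - 1. A minimal round-trip check (np.expm1 is the numerically stable inverse of np.log1p):

```python
# Round-trip sanity check of the target transform.
import numpy as np

chars = np.array([0, 1, 18, 276])
logged = np.log(chars + 1)       # same as np.log1p(chars)
recovered = np.expm1(logged)     # inverse: exp(x) - 1
assert np.allclose(recovered, chars)
```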

Setup training

procs = [Categorify, FillMissing, Normalize]  # standard fastai tabular preprocessing (per Howard)
splits = RandomSplitter(valid_pct=0.2)(range_of(df)); splits
((#198692) [204747,231121,174994,99849,45226,4105,135057,26667,190650,158384...],
 (#49672) [232278,174106,215598,45325,149150,122943,216610,159334,30470,20072...])
df.columns
# print(*list(df.columns), sep='\n')
Index(['m_descriptor', 't_lan_E', 't_version', 's_rsen', 'c_id', 'e_content_E',
       'chars_E', 'words_E', 't_lan_V', 'e_top', 'be_top', 'c_created_at',
       'count', 'c_kind', 'c_base', 'a_role', 'u_name', 'e_content_V',
       'chars_V', 'words_V', 'log_chars_V'],
      dtype='object')
cats = ['t_lan_V']; conts = ['chars_E', 'words_E']
logy = 'log_chars_V'
pd.options.mode.chained_assignment = None
to = TabularPandas(df, procs=procs, cat_names=cats, cont_names=conts, y_names=logy, 
                  #  y_block=RegressionBlock(), 
                   splits=splits, inplace=True, reduce_memory=True)
to.show()
t_lan_V chars_E words_E log_chars_V
204747 GER 15.0 3.0 2.890372
231121 POR 39.0 7.0 3.367296
174994 FIJ 76.0 14.0 4.653960
99849 BEM 59.0 12.0 4.304065
45226 AFR 219.0 44.0 5.438079
4105 AFR 41.0 10.0 3.828641
135057 CHN 36.0 8.0 2.484907
26667 AFR 184.0 36.0 5.236442
190650 GER 60.0 12.0 4.094345
158384 FIJ 63.0 10.0 4.406719
BS = 1024 #1024, 512, 256, 128, 64
dls = to.dataloaders(bs=BS)
dls.show_batch()
t_lan_V chars_E words_E log_chars_V
0 CHN 58.000000 12.0 3.044523
1 FIJ 28.000000 5.0 3.610918
2 FIJ 43.000000 10.0 4.025352
3 CHN 108.000001 25.0 3.713572
4 GER 36.000000 7.0 3.828641
5 CHN 51.000000 11.0 2.944439
6 BEM 41.000000 7.0 4.007333
7 BEM 20.000001 4.0 3.218876
8 POR 80.999999 16.0 4.356709
9 FIJ 41.000000 11.0 3.931826
# np.min(df[logy]), np.max(df[logy])
# min_log_y = np.min(df[logy]); min_log_y
# max_log_y = np.max(df[logy])*1.2; max_log_y
# y_range = torch.tensor([min_log_y, max_log_y]); y_range

y = to.train.y
y.min(),y.max()
(0.0, 6.4457197189331055)

Train model

Without Hyper-Parameter-Optimization (HPO)

# from fastai2.callback.all import EarlyStoppingCallback, SaveModelCallback
from fastai.callback.all import EarlyStoppingCallback, SaveModelCallback

learn = tabular_learner(dls=dls,
                        layers=[500, 250],
                        config=tabular_config(
                            # ps=[.001, .01]
                            # embed_p=0.04
                            y_range=(0, 6.5)),  # clamp predictions to the observed log range
                        n_out=1,                # single regression output
                        loss_func=F.mse_loss,
                        metrics=[exp_rmspe],
                        # metrics=[r2_score, exp_rmspe],
                        # metrics=[mean_squared_error],
                        # wd=.01,
                        # loss_func=MSELossFlat(),
                        # cbs=EarlyStoppingCallback(monitor='valid_loss', min_delta=0.001, patience=3),
                        )
learn.lr_find()
SuggestedLRs(lr_min=0.0015848932787775993, lr_steep=0.00013182566908653826)
! ls -ltrh {learn.path}/{learn.model_dir}
total 0
callbacks = [
  SaveModelCallback(monitor='valid_loss', comp=np.less, min_delta=0.001, fname='best'),
  EarlyStoppingCallback(monitor='valid_loss', comp=np.less, min_delta=0.001, patience=3) #comp=np.greater
]
# learn.fit_one_cycle(n_epoch=20, lr_max=.02e-3)
learn.fit_one_cycle(n_epoch=20, lr_max=1e-4, cbs=callbacks)
epoch train_loss valid_loss _exp_rmspe time
0 0.364237 0.141579 0.466485 00:11
1 0.059181 0.043871 0.244650 00:11
2 0.040825 0.036200 0.217114 00:11
3 0.037213 0.033582 0.218490 00:11
4 0.037571 0.033678 0.216463 00:11
5 0.037653 0.032857 0.210304 00:11
6 0.035941 0.033145 0.205564 00:11
Better model found at epoch 0 with valid_loss value: 0.14157940447330475.
Better model found at epoch 1 with valid_loss value: 0.04387086257338524.
Better model found at epoch 2 with valid_loss value: 0.036199815571308136.
Better model found at epoch 3 with valid_loss value: 0.033582184463739395.
No improvement since epoch 3: early stopping
! ls -ltrh {learn.path}/{learn.model_dir}/
! pwd
total 524K
-rw-r--r-- 1 root root 524K Oct 22 15:14 best.pth
/content
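SaveModelCallback only stores raw weights (the best.pth listed above). To reuse the model outside this notebook, e.g. in the validation roles described under Purpose, the learner can also be exported together with its preprocessing; the file name below is an assumption.

```python
# Export the trained learner (model + preprocessing) for later inference.
learn.export('char_count_model.pkl')

# Later / elsewhere: reload and predict without re-running this notebook.
# learn_inf = load_learner('char_count_model.pkl')
```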
learn.recorder.plot_loss(skip_start=100, with_valid=True)
inputs, preds, targs = learn.get_preds(with_input=True)  # inputs = (categorical codes, continuous values)
# F.r_mse(preds, targs)
# pd.Series(inputs)
inputs[0].squeeze()
#- inputs.squeeze()
tensor([13,  5,  9,  ...,  1,  5,  3])
# Compare validation targets with predictions (both on the log scale);
# the earlier version of this cell reused inputs[0] for all three columns by mistake.
dfval = pd.DataFrame({'targs': targs.squeeze().numpy(), 'preds': preds.squeeze().numpy()})
dfval

learn.show_results()
t_lan_V chars_E words_E log_chars_V log_chars_V_pred
0 1.0 -0.980276 -1.053134 2.833213 2.417679
1 3.0 -0.253450 -0.088161 2.944439 2.815471
2 1.0 4.232105 3.986167 5.493062 5.540545
3 1.0 -0.731078 -0.517038 3.258096 3.354484
4 5.0 -0.232683 -0.409819 3.828641 4.095329
5 1.0 -0.523414 -0.517038 3.784190 3.588533
6 6.0 -0.502647 -0.302600 3.891820 3.750283
7 3.0 0.078814 0.019058 3.091043 3.027725
8 1.0 0.909472 0.662373 4.700480 4.621726
learn.recorder.show_results()
t_lan_V chars_E words_E log_chars_V log_chars_V_pred
0 3.0 -0.211917 -0.195380 2.995732 2.832705
1 1.0 -1.063341 -1.053134 2.484907 2.235619
2 12.0 0.390310 0.126277 4.382027 4.418166
3 3.0 -0.544180 -0.624257 2.397895 2.505552
4 3.0 -1.146407 -1.160353 1.098612 1.426897
5 2.0 0.784873 0.984030 4.820282 4.698935
6 1.0 -0.294983 -0.302600 3.850147 3.853727
7 15.0 1.366334 1.091250 4.787492 4.794899
8 6.0 -0.959509 -0.945915 2.484907 2.728739
learn.recorder.values
[(#2) [0.030756311491131783,0.1863650381565094]]
trn_losses = [row[0] for row in learn.recorder.values]; trn_losses
[0.030756311491131783]
val_losses = [row[1] for row in learn.recorder.values]; val_losses
[0.1863650381565094]
plt.plot(trn_losses)
plt.plot(val_losses)
[<matplotlib.lines.Line2D at 0x7fbc74f4fcf8>]
best_epoch = np.argmin(val_losses)
ret = val_losses[best_epoch]
ret
0.1863650381565094
trn_losses = [row[0] for row in learn.recorder.values]
val_losses = [row[1] for row in learn.recorder.values]
best_epoch = np.argmin(val_losses); best_epoch
0
type(dls)
fastai.tabular.data.TabularDataLoaders
df.columns
Index(['m_descriptor', 't_lan_E', 't_version', 's_rsen', 'c_id', 'e_content_E',
       'chars_E', 'words_E', 't_lan_V', 'e_top', 'be_top', 'c_created_at',
       'count', 'c_kind', 'c_base', 'a_role', 'u_name', 'e_content_V',
       'chars_V', 'words_V', 'log_chars_V'],
      dtype='object')
x0 = pd.DataFrame({
    'chars_E':[20], #chars
    'words_E':[5], #words
    't_lan_V':['CHN'],
    })
full_dec,dec,out = learn.predict(x0.iloc[0]); out
tensor([2.1375])
out.numpy()[0]
# float(out.data[0])
2.137506
# Invert the log(x+1) target transform: predicted chars = exp(pred) - 1
print(np.exp(out.numpy()[0]) - 1, 'translated chars ...')
7.478267 translated chars ...
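To use this prediction in the validation role described under Purpose, one option is to compare the predicted character count against the actual translation and flag large deviations. A minimal sketch; the helper name and the 30% tolerance are illustrative assumptions, not part of the trained model:

```python
def check_translation_length(learn, chars_E, words_E, t_lan_V, actual_chars, rel_tol=0.3):
    """Predict the expected character count for a translation and flag it
    if the actual count deviates by more than rel_tol (illustrative threshold)."""
    row = pd.DataFrame({'chars_E': [chars_E], 'words_E': [words_E], 't_lan_V': [t_lan_V]})
    _, _, out = learn.predict(row.iloc[0])
    expected = float(np.exp(out.numpy()[0]) - 1)  # invert the log(x+1) target transform
    ok = abs(actual_chars - expected) <= rel_tol * expected
    return ok, expected

ok, expected = check_translation_length(learn, chars_E=20, words_E=5,
                                        t_lan_V='CHN', actual_chars=30)
print(f"expected {expected:.1f} chars, within tolerance: {ok}")
```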

Inference with validation data

# DatasetType was fastai v1 API; in fastai v2, get_preds takes ds_idx
# (0 = training set, 1 = validation set, which is the default).
tmp = learn.get_preds(ds_idx=1, with_loss=True); type(tmp), len(tmp)
tmp = learn.get_preds(); tmp
(tensor([[3.5538],
         [4.6666],
         [2.9208],
         ...,
         [5.1425],
         [3.4644],
         [3.4176]]), tensor([[3.7377],
         [4.8363],
         [3.1781],
         ...,
         [4.9053],
         [3.6376],
         [3.7612]]))
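To judge how useful the model is for the validation roles, it helps to look at the error on the original character scale rather than the log scale. A minimal sketch, assuming the `preds`/`targs` obtained from get_preds above:

```python
# Convert log-scale predictions and targets back to character counts
# and report mean/median absolute error in characters.
pred_chars = np.expm1(preds.squeeze().numpy())
true_chars = np.expm1(targs.squeeze().numpy())

abs_err = np.abs(pred_chars - true_chars)
print(f"MAE: {abs_err.mean():.1f} chars, median AE: {np.median(abs_err):.1f} chars")
```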
 
 
# sns.barplot(x=topcounts.index, y=topcounts.values, alpha=0.8)
# plt.title('Distribution of e_top values')
# plt.ylabel('Number of Occurrences', fontsize=12)
# plt.xlabel('e_top values', fontsize=12)
# plt.show();