Translation Word/Char Count Prediction (Part 4)
The predicted word or character count of a translation is used as a quality or validation check
- Purpose
- Dataset and Variables
- Set up the Environment
- Get train/valid data
- Inspect the distribution of the output data points
- Inspect the signal in the predictor points
- Set y
- Inspect the distribution of y=chars_V
- Log-transform y
- Set up training
- Train model
- Inference with validation data
Purpose
There is a relationship between the number of words (and the number of characters) in the source language and the target language. If this relationship can be captured in a dedicated model, that model can be helpful in at least two ways:
- For training: validate the alignment of two sentences (in a training example) by comparing their word and/or character counts
- For inference: validate the word and/or character count of a translated/proofread sentence
In this notebook we act on that idea: we fit such a model for each language and evaluate its usefulness in both roles.
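As an illustration of these roles, here is a minimal sketch of such a validation check. The helper is_plausible_translation and the 30% tolerance are hypothetical choices for illustration; the predicted count would come from the per-language model trained later in this notebook.
# Hypothetical validation check (not part of the original pipeline):
# accept a sentence pair only if the actual translated length lies within
# a tolerance band around the model's predicted length.
def is_plausible_translation(predicted_chars_V, actual_chars_V, tolerance=0.3):
    lower = predicted_chars_V * (1 - tolerance)
    upper = predicted_chars_V * (1 + tolerance)
    return lower <= actual_chars_V <= upper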
Dataset and Variables
The dataset used in this notebook contains the following features:
- m_descriptor: Unique identifier of a document
- t_lan_E: Language of the translation (English is also considered a translation)
- t_version: Version of a translation
- s_rsen: Number of a sentence within a document
- c_id: Database primary key of a contribution
- e_content_E: Text content of an English contribution
- chars_E: Number of characters in an English contribution
- words_E: Number of words in an English contribution (a sketch of how these count features can be derived follows this list)
- t_lan_V: Language of the translation
- e_top: N/A
- be_top: N/A
- c_created_at: Creation time of a contribution
- c_kind: Kind of a contribution
- c_base: N/A
- a_role: N/A
- u_name: N/A
- e_content_V: Text content of a translated contribution
- chars_V: Number of characters in a translated contribution
- words_V: Number of words in a translated contribution
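For reference, a minimal sketch of how the count features could be derived from the text columns; whitespace tokenization is an assumption here, since the upstream counting rules are not shown in this notebook.
# Illustrative only: deriving chars_*/words_*-style features from a text column.
def add_count_features(df, text_col, suffix):
    df[f'chars_{suffix}'] = df[text_col].str.len()               # character count
    df[f'words_{suffix}'] = df[text_col].str.split().str.len()   # whitespace word count
    return df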
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()
! pip list | grep fastai
from fastai.tabular.all import *
from fastbook import *
!python --version
PATH = Path(base_dir)  # base_dir is assumed to be defined during environment setup; #PATH
# The per-language CSVs read below were produced earlier with:
# for lang in langs:
#     df_out = df_raw[df_raw['t_lan_V']==lang]
#     df_out.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-{lang}-output.csv', sep='~', index=False, header=True)
import glob

pd.set_option('display.max_colwidth', 10)
all_files = glob.glob(f"{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-*-output.csv")
li = []
for filename in all_files:
    dft = pd.read_csv(filename, index_col=None, header=0, sep='~')
    li.append(dft)
df = pd.concat(li, axis=0, ignore_index=True)
df.iloc[:5,:-2]
import seaborn as sns
plt.figure(figsize=(20,10))
plt.xlabel('words_V', fontsize=14, color='black')
plt.ylabel('Count', fontsize=14, rotation=90, color='black')
sns.histplot(df['words_V'], bins=100);
import plotly.express as px

fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V',
                 title='Translation Words vs English Words', opacity=.3,
                 hover_data=['m_descriptor','t_lan_V','s_rsen'],
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
# Try relating character counts instead of word counts
fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V',
                 title='Translation Characters vs English Characters', opacity=.3,
                 hover_data=['m_descriptor','t_lan_V','s_rsen'],
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()
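To put a number on the linear signal visible in the scatter plots, a quick per-language correlation check (an added illustration, not part of the original flow):
# Pearson correlation between English and translated character counts per language;
# values close to 1 indicate a strong linear relationship worth modelling.
df.groupby('t_lan_V').apply(lambda g: g['chars_E'].corr(g['chars_V']))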
y = 'chars_V'
plt.figure(figsize=(20,10))
sns.histplot(df[y], bins=50);
df['log_chars_V'] = np.log(df['chars_V'] + 1)
plt.figure(figsize=(20,10))
sns.histplot(df['log_chars_V'], bins=50);
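The raw counts are heavy-tailed, so the target is compressed with log(x + 1), i.e. np.log1p, which also keeps zero-length rows finite. The inverse transform needed to read predictions back in characters is therefore exp(y) - 1, i.e. np.expm1; a quick round-trip check:
# expm1 inverts log1p, so predictions on the log scale can be decoded exactly.
assert np.allclose(np.expm1(np.log1p(df['chars_V'])), df['chars_V'])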
procs = [Categorify, FillMissing, Normalize]  # standard fastai tabular preprocessing
splits = RandomSplitter(valid_pct=0.2)(range_of(df)); splits
df.columns
cats = ['t_lan_V']; conts = ['chars_E', 'words_E']
logy = 'log_chars_V'
pd.options.mode.chained_assignment = None
to = TabularPandas(df, procs=procs, cat_names=cats, cont_names=conts, y_names=logy,
                   splits=splits, inplace=True, reduce_memory=True)
to.show()
BS = 1024 #1024, 512, 256, 128, 64
dls = to.dataloaders(bs=BS)
dls.show_batch()
# Alternative: derive y_range from the data instead of hardcoding it below:
# min_log_y = np.min(df[logy])
# max_log_y = np.max(df[logy]) * 1.2
# y_range = torch.tensor([min_log_y, max_log_y])
y = to.train.y
y.min(),y.max()
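As a sanity check on the hardcoded y_range=(0, 6.5) used below (my reading of that choice, not stated in the original): on the log1p scale, an upper bound of 6.5 corresponds to roughly exp(6.5) - 1 ≈ 664 characters, which should sit comfortably above y.max().
# Upper bound of y_range decoded back into character space (illustrative).
float(np.expm1(6.5))  # ≈ 664 translated characters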
from fastai.callback.all import EarlyStoppingCallback, SaveModelCallback
learn = tabular_learner(dls=dls,
                        layers=[500, 250],
                        config=tabular_config(y_range=(0, 6.5)),
                        metrics=[exp_rmspe],
                        n_out=1,
                        loss_func=F.mse_loss)
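Because the target is on a log scale, exp_rmspe is a natural metric: fastai exponentiates predictions and targets before computing the root mean squared percentage error, so the reported number is approximately a percentage error in character space (strictly, in chars + 1 space, since the target used log1p). A hand-rolled equivalent for intuition; the library's version is what the learner actually uses:
# Sketch of what exp_rmspe computes (assumed equivalent to fastai's metric).
def exp_rmspe_manual(preds, targs):
    p, t = torch.exp(preds), torch.exp(targs)  # back out of log space
    return torch.sqrt((((t - p) / t) ** 2).mean())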
learn.lr_find()
! ls -ltrh {learn.path}/{learn.model_dir}
callbacks = [
    SaveModelCallback(monitor='valid_loss', comp=np.less, min_delta=0.001, fname='best'),
    EarlyStoppingCallback(monitor='valid_loss', comp=np.less, min_delta=0.001, patience=3)
]
learn.fit_one_cycle(n_epoch=20, lr_max=1e-4, cbs=callbacks)
! ls -ltrh {learn.path}/{learn.model_dir}/
! pwd
learn.recorder.plot_loss(skip_start=100, with_valid=True)
inputs, preds, targs = learn.get_preds(with_input=True)
dfval = pd.DataFrame({'inputs': inputs[0].squeeze(),  # categorical input (t_lan_V codes)
                      'targs': targs.squeeze(),        # actual log_chars_V
                      'preds': preds.squeeze()})       # predicted log_chars_V
dfval
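A quick calibration plot of predictions against targets (an added check, not in the original flow); points hugging the diagonal indicate a well-calibrated model:
# Predicted vs actual log_chars_V on the validation set, with identity line.
plt.figure(figsize=(8, 8))
plt.scatter(dfval['targs'], dfval['preds'], alpha=0.2)
lims = [dfval['targs'].min(), dfval['targs'].max()]
plt.plot(lims, lims, color='red')  # perfect-prediction line
plt.xlabel('target log_chars_V'); plt.ylabel('predicted log_chars_V');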
learn.show_results()
learn.recorder.values  # per-epoch rows of [train_loss, valid_loss, metric]
trn_losses = [row[0] for row in learn.recorder.values]
val_losses = [row[1] for row in learn.recorder.values]
plt.plot(trn_losses)
plt.plot(val_losses)
best_epoch = np.argmin(val_losses); best_epoch
ret = val_losses[best_epoch]
ret
type(dls)
df.columns
x0 = pd.DataFrame({
    'chars_E': [20],
    'words_E': [5],
    't_lan_V': ['CHN'],
})
full_dec,dec,out = learn.predict(x0.iloc[0]); out
out.numpy()[0]
print(np.expm1(out.numpy()[0]), 'translated chars ...')  # expm1 inverts the log(x + 1) transform
# fastai v2 selects the dataset for get_preds with ds_idx (0=train, 1=valid),
# not the v1-era DatasetType enum.
tmp = learn.get_preds(ds_idx=1, with_loss=True); type(tmp), len(tmp)
tmp = learn.get_preds(); tmp
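Finally, returning to the validation role from the Purpose section: a sketch of how the validation-set predictions could flag suspicious sentence pairs. The 30% tolerance and the flagged name are illustrative choices, not from the original notebook.
# Decode predictions and targets back to character counts and flag rows whose
# actual translated length deviates from the prediction by more than 30%.
pred_chars = np.expm1(preds.squeeze().numpy())
true_chars = np.expm1(targs.squeeze().numpy())
tolerance = 0.3  # illustrative threshold
flagged = np.abs(true_chars - pred_chars) > tolerance * pred_chars
print(f'{flagged.mean():.1%} of validation pairs flagged as suspicious')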