Luganda Translation Engine
Build a translation engine (English to another language) from scratch and train with opennmt
- Purpose
- Dataset and Variables
- Data preparation process
- Setup the Environment
- Part 1(a): Prepare Data for a Single Document
- Part 1(b): Prepare Data for All Documents
- Part 2: Position Train Data on Azure
- Part 3: Train engine using opennmt
Purpose
Machine translation is used increasingly to lighten the load of human translators. The critical component here is the translation engine which is a model that takes a sequence of source words and outputs another sequence of translated words. To train such a model many thousands of sentence pairs need to be aligned for training examples.
This project is about constructing a translation engine, from English to Luganda, for a client in Uganda.
Dataset and Variables
A data-point in the dataset comes in the form of a sequence of "scalars" of type (nominal) categorical. The value assumed by each categorical is one of the words in the vocabulary for each language. Put in simple terms, each data-point is a English/Luganda sentence pair.
Currently, there are only 31,172 sentence pairs. This is not sufficient for a good enough translation engine. However, this can serve as a base to build upon and to derive an initial BLEU score for the engine.
In summary, the features of the dataset are:
- English sentence
- Luganda sentence
There is a relationship between the number of words (and the number of characters) in the source language and the target language. These relationships can be used to help identify miss-alignments in the training data. We make use of visualizations as well as the technique of identifying outliers beyond 3 sample standard deviations. So-called studentized outliers are used after fitting an Ordinary Least Squares (OLS) model.
Data preparation process
Training data is forwarded as newly translated documents are completed. The current training data is extracted from about 24 translated documents.
After a new translation is supplied, a human aligner aligns the Luganda text against an English file, separeted into sentences. This part of the process is error-prone. Consequently, the file with the alignments is dropped into a folder called WORK_PATH and then subjected to a Java-based suite of data preparation utilities:
- validator.jar
- selector.jar
- inspector.jar
- separator.jar
from pathlib import Path
import pandas as pd
import glob
import re
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm
import numpy as np
!python --version
# set the language
LANG = 'LUG'; print(LANG)
# set the prefix for files
# PREFIX = '19'; print(PREFIX)
PREFIX = 'CA'; print(PREFIX)
# set the suffix for original English file
# ORIG_SUFFIX = 't'
ORIG_SUFFIX = 'B123'
# ORIG_SUFFIX = 'B123E1R'
print(ORIG_SUFFIX)
# set English and Other language filenames
ENG = !ls {WORK_PATH}/{PREFIX}*_*_ENG_*-P*.txt
ENG = ENG[0].split('/')[-1].split('.')[0]
OTH = !ls {WORK_PATH}/{PREFIX}*_*_{LANG}_*-P*.txt
OTH = OTH[0].split('/')[-1].split('.')[0]
# set Original English filename
ENG_ORIG = !ls {WORK_PATH}/{PREFIX}*_*_ENG_*-{ORIG_SUFFIX}.txt
ENG_ORIG = ENG_ORIG[0].split('/')[-1].split('.')[0]
#find number of lines (including blank lines) to calc utilization
ENG_N_LINES = !wc -l {WORK_PATH}/"{ENG}".txt
ENG_N_LINES = int(ENG_N_LINES[0].split('/')[0])
#find number of lines (including blank lines) to calc utilization
ENG_ORIG_N_LINES = !wc -l {WORK_PATH}/"{ENG_ORIG}".txt
ENG_ORIG_N_LINES = int(ENG_ORIG_N_LINES[0].split('/')[0])
print('SENTENCE UTILIZATION:', ENG_N_LINES/ENG_ORIG_N_LINES)
# validator.jar tries to align English with Other language sentences
# !java -jar {UTIL_PATH}/validator.jar {WORK_PATH}/{ENG}.txt {WORK_PATH}/{OTH}.txt
#missed marks that human aligner might have entered to indicate problems
# !cat {WORK_PATH}/19*_*_{LANG}_*.txt | grep XX
!cat {WORK_PATH}/{PREFIX}*_*_{LANG}_*.txt | grep XX
#missed marks that human aligner might have entered to indicate problems
# !cat {WORK_PATH}/19*_*_{LANG}_*.txt | grep xx
!cat {WORK_PATH}/{PREFIX}*_*_{LANG}_*.txt | grep xx
#missed marks that human aligner might have entered to indicate problems
# !cat {WORK_PATH}/19*_*_{LANG}_*.txt | grep !!
!cat {WORK_PATH}/{PREFIX}*_*_{LANG}_*.txt | grep !!
#write -v file
!java -jar {UTIL_PATH}/validator.jar {WORK_PATH}/"{ENG}".txt {WORK_PATH}/"{OTH}".txt > {WORK_PATH}/0_"{ENG}"___"{OTH}"-v.txt
def print_pairs(pairs_list):
for i,p in enumerate(pairs_list):
print(p)
if i%2==1:
if pairs_list[i-1].split(' ',1)[0] != pairs_list[i].split(' ',1)[0]:
print('================ NOT A PAIR:', pairs_list[i-1].split(' ',1)[0], pairs_list[i].split(' ',1)[0])
break
else:
print('\n')
#write -m file (contains all sentence matches)
!java -jar {UTIL_PATH}/selector.jar {WORK_PATH}/0_*-v.txt > {WORK_PATH}/0_"{ENG}"___"{OTH}"-m.txt
#remove header
!tail -n +7 {WORK_PATH}/0_"{ENG}"___"{OTH}"-m.txt > {WORK_PATH}/new-m.txt
!mv {WORK_PATH}/new-m.txt {WORK_PATH}/0_"{ENG}"___"{OTH}"-m.txt
#works better in a maxed command line window with small font
# !java -jar {UTIL_PATH}/inspector.jar {WORK_PATH}/0_*-m.txt 40
# function to convert a -m file to a dataframe
def dash_m_to_dataframe(all_files):
dfs = []
for i,filepath in enumerate(all_files):
#print(filepath)
filename = filepath.split('/')[-1]; print(filename)
f = open(filepath, 'r', encoding='utf-8')
f.readline() #remove title line
rows = []
parts = filename.split('_')
descriptor = parts[1]; #print('descriptor=', descriptor)
lan_E = 'ENG'
version_E = parts[4]
lan_O = parts[9]
version_O = parts[10].split('-m.')[0]
eng_line = f.readline(); #print('eng_line =', eng_line)
while eng_line != "":
try:
signature_E,content_E = eng_line.split(" ", 1)
except:
print('------- EXCEPTION: ', filepath)
print('eng_line =', eng_line)
chars_E = len(content_E)
words_E = len(re.findall(r'\w+', content_E))
if not re.search("^[0-9]+\.[0-9]+\.[ncspqhijk]$", signature_E):
print(f"------- INVALID ENG SIGNATURE: {signature_E}")
oth_line = f.readline(); #print('oth_line =', oth_line)
try:
signature_O,content_O = oth_line.split(" ", 1)
except:
print('------- EXCEPTION: ', filepath)
print('oth_line =', oth_line)
chars_O = len(content_O)
words_O = len(re.findall(r'\w+', content_O))
if not re.search("^[0-9]+\.[0-9]+\.[ncspqhijk]$", signature_O):
print(f"------- INVALID OTH SIGNATURE: {signature_O}")
blank_line = f.readline()
row = [descriptor, signature_E, lan_E, version_E, content_E, chars_E, words_E, lan_O, version_O, content_O, chars_O, words_O]; #print('row', row)
rows.append(row)
eng_line = f.readline(); #print('eng_line =', eng_line)
#if i==200: break
df = pd.DataFrame(rows)
dfs.append(df)
df_A = pd.concat(dfs, axis=0, ignore_index=True)
df_A.columns = ['descriptor','sign','lan_E','version_E','content_E','chars_E','words_E','lan_O','version_O','content_O','chars_O','words_O']
# pd.set_option('display.max_colwidth',20)
print(df_A.shape)
return df_A
#inspect the ranges of some key features
df_file.loc[:, ['chars_E','chars_O','words_E','words_O']].describe()
#chars
x = df_file.loc[:, ['chars_E']].values
y = df_file.loc[:, ['chars_O']].values
mod_c = sm.OLS(y, sm.add_constant(x)).fit() #add_constant() equiv of fit_intercept=True
influence = mod_c.get_influence()
leverage = influence.hat_matrix_diag #hat values
# cooks_d = influence.cooks_distance #Cook's D values (and p-values) as tuple of arrays
# sres_c = influence.resid_studentized_internal #standardized residuals
tres_c = influence.resid_studentized_external #studentized residuals
print('R-squared: \n', mod_c.rsquared)
# mod_c.summary()
mse_c = mod_c.mse_resid
# mod_c.mse_total
yh_c = mod_c.predict(sm.add_constant(x))
#words
x = df_file.loc[:, ['words_E']].values
y = df_file.loc[:, ['words_O']].values
mod_w = sm.OLS(y, sm.add_constant(x)).fit() #add_constant() equiv of fit_intercept=True
influence = mod_w.get_influence()
leverage = influence.hat_matrix_diag #hat values
# cooks_d = influence.cooks_distance #Cook's D values (and p-values) as tuple of arrays
# sres_w = influence.resid_studentized_internal #standardized residuals
tres_w = influence.resid_studentized_external #studentized residuals
print('R-squared: \n', mod_w.rsquared)
# mod_w.summary()
mse_w = mod_w.mse_resid
# mod_w.mse_total
yh_w = mod_w.predict(sm.add_constant(x))
#add yhat columns
df = df_file.copy()
df['yh_c'] = yh_c
df['yh_w'] = yh_w
# df.head()
#add columns to indicate the presence of outliers
df['out_tres_c'] = False #char outlier
df['out_tres_w'] = False #word outlier
df['out_both'] = False #both outlier
df['outlier'] = 'not' #outlier type, n means not an outlier
# df['res'] = df['chars_O'] - df['prd_c']
# df['lev'] = leverage
# df['calc_sres'] = df['res']/(np.sqrt(mse*(1-df['lev'])))
df['tres_c'] = tres_c
df['tres_w'] = tres_w
df.loc[ np.abs(df['tres_c']) > 3, ['out_tres_c'] ] = True
df.loc[ np.abs(df['tres_c']) > 3, ['outlier'] ] = 'char' #char outlier
df.loc[ np.abs(df['tres_w']) > 3, ['out_tres_w'] ] = True
df.loc[ np.abs(df['tres_w']) > 3, ['outlier'] ] = 'word' #word outlier
df.loc[ (np.abs(df['tres_c']) > 3)&(np.abs(df['tres_w']) > 3), ['out_both'] ] = True
df.loc[ (np.abs(df['tres_c']) > 3)&(np.abs(df['tres_w']) > 3), ['outlier'] ] = 'both' #both outlier
print('total data-points:',len(df))
print('useful char data-points:',len(df[df['out_tres_c']==False]))
print('useful word data-points:',len(df[df['out_tres_w']==False]))
print('useful char & word data-points:',len(df[(df['out_tres_c']==False)&(df['out_tres_w']==False)]))
print('outlier char data-points:',len(df[df['out_tres_c']==True]))
print('outlier word data-points:',len(df[df['out_tres_w']==True]))
print('outlier char & word data-points:',len(df[df['out_both']==True]))
# df.head()
sns.set_style('darkgrid')
plt.figure(figsize=(12,8))
# palette = {False: 'green', True: 'red'}
palette = {'not': 'green', 'char': 'orange', 'word': 'blue', 'both': 'red'}
ax = sns.scatterplot(x='chars_E', y='chars_O',
#hue='out_tres_c',
hue='outlier',
#style='out_both',
data=df, alpha=.8, palette=palette)
ax.set_title("Translation Chars vs English Chars (outliers color-coded)", fontdict={'fontsize': 20})
plt.plot(df['chars_E'], df['yh_c'], color='grey', linestyle='-.', linewidth=.2);
sns.set_style('darkgrid')
plt.figure(figsize=(12,8))
# palette = {False: 'green', True: 'red'}
palette = {'not': 'green', 'char': 'orange', 'word': 'blue', 'both': 'red'}
ax = sns.scatterplot(x='words_E', y='words_O',
#hue='out_tres_w',
hue='outlier',
data=df, alpha=.8, palette=palette)
ax.set_title("Translation Words vs English Words (outliers color-coded)", fontdict={'fontsize': 20})
plt.plot(df['words_E'], df['yh_w'], color='grey', linestyle='-.', linewidth=.2);
Next, we remove all the outliers of type 'both'. These are the ones visualized in red above.
#previous summary again
print('total data-points:',len(df))
print('useful char data-points:',len(df[df['out_tres_c']==False]))
print('useful word data-points:',len(df[df['out_tres_w']==False]))
print('useful char & word data-points:',len(df[(df['out_tres_c']==False)&(df['out_tres_w']==False)]))
print('outlier char data-points:',len(df[df['out_tres_c']==True]))
print('outlier word data-points:' ,len(df[df['out_tres_w']==True]))
print('outlier char & word data-points:',len(df[df['out_both']==True]))
# DON'T USE THIS OPTION: We may want to change the definition of an outlier in the future,
# e.g. only values beyond 4 sample standard deviations. This option throws away outliers permanently.
# if doing -PO files:
# manually delete the outliers in the previous cell (from both -P files and change names to -PO)
# and run again from the top, UNTIL no more combined outliers
# df_O = df.copy()
# if doing -P files:
# find outliers on BOTH counts (char model AND word model), drop them
# ixs = df[(df['out_tres_c']==True)&(df['out_tres_w']==True)].index; print(ixs)
out_ixs = df[df['out_both']==True].index; print(out_ixs)
# if doing -P files: drop outliers
df_O = df.drop(out_ixs)
print('total data-points:',len(df))
print('useful data-points:',len(df_O))
print('outlier char & word data-points:',len(df)-len(df_O))
assert( len(df) == len(df_O) + len(out_ixs) )
sns.set_style('darkgrid')
plt.figure(figsize=(12,8))
# palette = {False: 'green', True: 'red'}
palette = {'not': 'green', 'char': 'orange', 'word': 'blue', 'both': 'red'}
ax = sns.scatterplot(x='chars_E', y='chars_O',
#hue='out_tres_c',
hue='outlier',
#style='out_both',
data=df_O, alpha=.8, palette=palette)
ax.set_title("Translation Chars vs English Chars (outliers color-coded)", fontdict={'fontsize': 20})
plt.plot(df['chars_E'], df['yh_c'], color='grey', linestyle='-.', linewidth=.2);
sns.set_style('darkgrid')
plt.figure(figsize=(12,8))
# palette = {False: 'green', True: 'red'}
palette = {'not': 'green', 'char': 'orange', 'word': 'blue', 'both': 'red'}
ax = sns.scatterplot(x='words_E', y='words_O',
#hue='out_tres_w',
hue='outlier',
data=df_O, alpha=.8, palette=palette)
ax.set_title("Translation Words vs English Words (outliers color-coded)", fontdict={'fontsize': 20})
plt.plot(df['words_E'], df['yh_w'], color='grey', linestyle='-.', linewidth=.2);
#find number of lines (including blank lines) to calc updated utilization
ENG_N_LINES = !wc -l {WORK_PATH}/"{ENG}".txt
ENG_N_LINES = int(ENG_N_LINES[0].split('/')[0])
print('ENG_N_LINES - len(out_ixs):', ENG_N_LINES - len(out_ixs))
#find number of lines (including blank lines) to calc updated utilization
ENG_ORIG_N_LINES = !wc -l {WORK_PATH}/"{ENG_ORIG}".txt
ENG_ORIG_N_LINES = int(ENG_ORIG_N_LINES[0].split('/')[0])
print('ENG_ORIG_N_LINES:', ENG_ORIG_N_LINES)
print('SENTENCE UTILIZATION (AFTER OUTLIER REMOVAL):', (ENG_N_LINES - len(out_ixs))/ENG_ORIG_N_LINES)
# NOTE: ALWAYS USE -PO TO END THE FILE NAME!
#write separated files
with(open(f'{WORK_PATH}/f_{ENG}___{OTH}.eng-PO.txt', mode='w', encoding='UTF-8')) as f:
for s in df_O['content_E']:
f.write(s)
with(open(f'{WORK_PATH}/e_{ENG}___{OTH}.{lang}-PO.txt', mode='w', encoding='UTF-8')) as f:
for s in df_O['content_O']:
f.write(s)
#remove -v and -i files
!rm ./0_"{ENG}"___"{OTH}"-v.txt
!rm ./0_"{ENG}"___"{OTH}"-m.txt
!rm {WORK_PATH}/0_"{ENG}"___"{OTH}"-v.txt
# !rm {WORK_PATH}/0_"{ENG}"___"{OTH}"-i.txt
#move aligned source files to database folder
!mv {WORK_PATH}/"{ENG}".txt "{MMDB_OTH_PATH}"
!mv {WORK_PATH}/"{OTH}".txt "{MMDB_OTH_PATH}"
#move -m file to database folder
!mv {WORK_PATH}/0_"{ENG}"___"{OTH}"-m.txt "{MMDB_OTH_PATH}"
#move separated files to database folder
!mv {WORK_PATH}/e_"{ENG}"___"{OTH}".{lang}*.txt "{MMDB_OTH_PATH}"
!mv {WORK_PATH}/f_"{ENG}"___"{OTH}".eng*.txt "{MMDB_OTH_PATH}"
#move original ENG file to database folder
!mv {WORK_PATH}/"{ENG_ORIG}".txt "{MMDB_ENG_PATH}"
#chars
x = df_all_files.loc[:, ['chars_E']].values
y = df_all_files.loc[:, ['chars_O']].values
mod_c = sm.OLS(y, sm.add_constant(x)).fit() #add_constant() equiv of fit_intercept=True
influence = mod_c.get_influence()
leverage = influence.hat_matrix_diag #hat values
# cooks_d = influence.cooks_distance #Cook's D values (and p-values) as tuple of arrays
# sres_c = influence.resid_studentized_internal #standardized residuals
tres_c = influence.resid_studentized_external #studentized residuals
print('R-squared: \n', mod_c.rsquared)
# mod_c.summary()
mse_c = mod_c.mse_resid
# mod_c.mse_total
yh_c = mod_c.predict(sm.add_constant(x))
#words
x = df_all_files.loc[:, ['words_E']].values
y = df_all_files.loc[:, ['words_O']].values
mod_w = sm.OLS(y, sm.add_constant(x)).fit() #add_constant() equiv of fit_intercept=True
influence = mod_w.get_influence()
leverage = influence.hat_matrix_diag #hat values
# cooks_d = influence.cooks_distance #Cook's D values (and p-values) as tuple of arrays
# sres_w = influence.resid_studentized_internal #standardized residuals
tres_w = influence.resid_studentized_external #studentized residuals
print('R-squared: \n', mod_w.rsquared)
# mod_w.summary()
mse_w = mod_w.mse_resid
# mod_w.mse_total
yh_w = mod_w.predict(sm.add_constant(x))
#add yhat columns
df = df_all_files.copy()
df['yh_c'] = yh_c
df['yh_w'] = yh_w
pd.set_option('display.max_colwidth',20)
# df.head()
#add columns to indicate the presence of outliers
df['out_tres_c'] = False #char outlier
df['out_tres_w'] = False #word outlier
df['out_both'] = False #both outlier
df['outlier'] = 'not' #outlier type, n means not an outlier
df['tres_c'] = tres_c
df['tres_w'] = tres_w
df.loc[ np.abs(df['tres_c']) > 3, ['out_tres_c'] ] = True
df.loc[ np.abs(df['tres_c']) > 3, ['outlier'] ] = 'char' #char outlier
df.loc[ np.abs(df['tres_w']) > 3, ['out_tres_w'] ] = True
df.loc[ np.abs(df['tres_w']) > 3, ['outlier'] ] = 'word' #word outlier
df.loc[ (np.abs(df['tres_c']) > 3)&(np.abs(df['tres_w']) > 3), ['out_both'] ] = True
df.loc[ (np.abs(df['tres_c']) > 3)&(np.abs(df['tres_w']) > 3), ['outlier'] ] = 'both' #both outlier
print('total data-points:',len(df))
print('useful char data-points:',len(df[df['out_tres_c']==False]))
print('useful word data-points:',len(df[df['out_tres_w']==False]))
print('useful char & word data-points:',len(df[(df['out_tres_c']==False)&(df['out_tres_w']==False)]))
print('outlier char data-points:',len(df[df['out_tres_c']==True]))
print('outlier word data-points:',len(df[df['out_tres_w']==True]))
print('outlier char & word data-points:',len(df[df['out_both']==True]))
# df.head()
sns.set_style('darkgrid')
plt.figure(figsize=(12,8))
# palette = {False: 'green', True: 'red'}
palette = {'not': 'green', 'char': 'orange', 'word': 'blue', 'both': 'red'}
ax = sns.scatterplot(x='chars_E', y='chars_O',
#hue='out_tres_c',
hue='outlier',
#style='out_both',
data=df, alpha=.8, palette=palette)
ax.set_title("Translation Chars vs English Chars (outliers color-coded)", fontdict={'fontsize': 20})
plt.plot(df['chars_E'], df['yh_c'], color='grey', linestyle='-.', linewidth=.2);
sns.set_style('darkgrid')
plt.figure(figsize=(12,8))
# palette = {False: 'green', True: 'red'}
palette = {'not': 'green', 'char': 'orange', 'word': 'blue', 'both': 'red'}
ax = sns.scatterplot(x='words_E', y='words_O',
#hue='out_tres_w',
hue='outlier',
data=df, alpha=.8, palette=palette)
ax.set_title("Translation Words vs English Words (outliers color-coded)", fontdict={'fontsize': 20})
plt.plot(df['words_E'], df['yh_w'], color='grey', linestyle='-.', linewidth=.2);
# find outliers on BOTH counts (char model AND word model), drop them
# ixs = df[(df['out_tres_c']==True)&(df['out_tres_w']==True)].index; print(ixs)
out_ixs = df[df['out_both']==True].index; print(out_ixs)
# drop outliers
df_O = df.drop(out_ixs)
print('total data-points:',len(df))
print('useful data-points:',len(df_O))
print('outlier char & word data-points:',len(df)-len(df_O))
assert( len(df) == len(df_O) + len(out_ixs) )
sns.set_style('darkgrid')
plt.figure(figsize=(12,8))
# palette = {False: 'green', True: 'red'}
palette = {'not': 'green', 'char': 'orange', 'word': 'blue', 'both': 'red'}
ax = sns.scatterplot(x='chars_E', y='chars_O',
#hue='out_tres_c',
hue='outlier',
#style='out_both',
data=df_O, alpha=.8, palette=palette)
ax.set_title("Translation Chars vs English Chars (outliers color-coded)", fontdict={'fontsize': 20})
plt.plot(df['chars_E'], df['yh_c'], color='grey', linestyle='-.', linewidth=.2);
sns.set_style('darkgrid')
plt.figure(figsize=(12,8))
# palette = {False: 'green', True: 'red'}
palette = {'not': 'green', 'char': 'orange', 'word': 'blue', 'both': 'red'}
ax = sns.scatterplot(x='words_E', y='words_O',
#hue='out_tres_w',
hue='outlier',
data=df_O, alpha=.8, palette=palette)
ax.set_title("Translation Words vs English Words (outliers color-coded)", fontdict={'fontsize': 20})
plt.plot(df['words_E'], df['yh_w'], color='grey', linestyle='-.', linewidth=.2);
df.columns
df.to_csv(f'{PATH}/dash-m-dataframe-{LANG}-O.csv', sep='~', index = False, header=True)
The purpose of this part is to position the train data for the translation engine, on an Azure virtual machine (VM). On this VM a copy of opennmt was installed previously. Opennmt is built on the popular PyTorch deep learning framework.
- if not correct name endings (example):
- select all e_ files
- rename .en-af.af to .bem.txt
- select all f_ files
- rename .en-af.en to .eng.txt
- select all e_ files
!cp "{MMDB_OTH_PATH}"/e_* {WORK_PATH}
!cp "{MMDB_OTH_PATH}"/f_* {WORK_PATH}
!ls -l {WORK_PATH}/e_* | wc -l
!ls -l {WORK_PATH}/f_* | wc -l
# - get names of train set, in a maxed mac terminal
# cd ~/work_train
# ls e_*
# - highlight all entries and copy
# - paste temporarily into textedit, then copy from here
# - open OpenNMTExperiment_OTH.xlsx, TrainSets tab
# - enter new TrainSet and paste in at the 'Titles' row
# - set the bottom of the sheet to 'Count'
# - select the paste and fill in the number of Messages
!cat `ls {WORK_PATH}/f_*` > {WORK_PATH}/train.eng.txt
!rm -f {WORK_PATH}/f_*
!cat `ls {WORK_PATH}/e_*` > {WORK_PATH}/train.{lang}.txt
!rm -f {WORK_PATH}/e_*
# - open to validate, with vim
# = maximize a terminal on mac
# = vim .
# = select oth file
# = cmd+ until zoom level is OK
# = :vsp .
# = select eng file
# = :set number in both windows, use ctrl-w left/right to move between windows
# = :$ to go to end and see how many sentences, :1 to go to top again
# = take an evenly spread sample and verify alignment
# % use round numbers like 20000, 40000, 60000, etc //easier to recognize
# % locate with :20000, :40000, etc //puts target in center, easier to find
# % switch left-right with Ctrl-W, h/j
# = close vim
!cp "{MMDB_PATH}"/TestSet/e_* {WORK_PATH}
!cp "{MMDB_PATH}"/TestSet/f_* {WORK_PATH}
!ls -l {WORK_PATH}/e_* | wc -l
!ls -l {WORK_PATH}/f_* | wc -l
# - get names of train set, in a maxed mac terminal
# cd ~/work_train
# ls e_*
# - highlight all entries and copy
# - paste temporarily into textedit, then copy from here
# - open OpenNMTExperiment_OTH.xlsx, TestSets tab
# - enter new id and paste in at the correct position
# - set the bottom of the sheet to 'Count'
# - select the paste and fill in the # messages
# - ensure the id of the TestSet is updated manually in the
# next statements (matching above id):
!cat `ls {WORK_PATH}/f_*` > {WORK_PATH}/valid1.eng.txt
!rm -f {WORK_PATH}/f_*
!cat `ls {WORK_PATH}/e_*` > {WORK_PATH}/valid1.{lang}.txt
!rm -f {WORK_PATH}/e_*
# - open to validate, with vim
# = maximize a terminal on mac
# = vim .
# = select oth file
# = cmd+ until zoom level is OK
# = :vsp .
# = select eng file
# = :set number in both windows, use ctrl-w left/right to move between windows
# = :$ to go to end and see how many sentences, :1 to go to top again
# = take an evenly spread sample and verify alignment
# % use round numbers like 20000, 40000, 60000, etc //easier to recognize
# % locate with :20000, :40000, etc //puts target in center, easier to find
# % switch left-right with Ctrl-W, h/j
# = close vim
The purpose of this part is to train a translation engine, making use of training and validation data. Training happens on a VM in Azure, pre-installed with the PyTorch-based opennmt library.