Predict Contribution Effort (Part 1)
Estimation of effort to proofread pre-translated sentences
- 1. Purpose
- 2. Dataset and Variables
- 3. Requirements
- 4. Design
- 5. Setup
- 6. Get train/valid data
- 7. Inspect distribution of output data-points
- 8. Select ONLY TE contributions (first eyes)
- 9. Save prepared data to file
1. Purpose
Machine-translated content in longer documents usually cannot be presented to users directly; a human proofreader must verify the quality first. Many factors determine the amount of human effort needed. Because the sentence is a natural unit of meaning, it makes sense to frame effort estimation in terms of the individual sentences that need to be proofread. Examples of factors that determine proofreading effort are:
- length of sentence in characters
- length of sentence in words
- source language
- target language
- role of the proofreader (e.g. first pair of human eyes, or a second proofreader)
- quality of the pre-translator (BLEU score)
- skill of the proofreader in source and target languages
The purpose of this project is to create an effort estimation model. The client has a number of pre-translation models that vary in quality. Similarly, the human proofreaders on the team have varying levels of skill. In addition, the length of documents varies. Currently, a fixed compensation is paid per document. This model will enhance fairness when the client determines compensation for proofreaders.
2. Dataset and Variables
The dataset comes in the form of contributions, each captured as a row or data-point. Each contribution is a sentence that could be in the source language (always English) or a translation of the source sentence. There could be many variations/versions of a translated sentence, including the version provided by the translation engine initially. Human proofreaders then provide their own corrections in the form of other versions.
There are 4 kinds of contributions:
- E: English contributions
- T: Translate contributions - provided by the translation engine
- C: Create contributions - corrections provided by human proofreaders
- V: Vote contributions - whenever a human proofreader indicates agreement with a contribution provided by the translation engine, it is recorded in the form of a vote contribution
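As a toy illustration of the four kinds (hypothetical rows, not the client dataset), pandas can tally contributions per kind:

```python
import pandas as pd

# Hypothetical contributions; c_kind follows the E/T/C/V scheme above
toy = pd.DataFrame({
    "c_kind":    ["E", "T", "C", "V", "C", "V", "E"],
    "e_content": ["Hello.", "Hallo.", "Hallo!", "", "Hoi.", "", "World."],
})

kind_counts = toy["c_kind"].value_counts().to_dict()
```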
The features of the dataset are:
- m_descriptor: Unique identifier of a document
- t_lan: Language of the translation (English is also considered a translation)
- t_senc: Number of sentences in a document
- t_version: Version of a translation
- s_typ: Type of the sentence
- s_rsen: Number of a sentence within a document
- e_id: Database primary key of a contribution's content
- e_top: Content of the contribution that got the most votes
- be_id: N/A
- be_top: N/A
- c_id: Database primary key of a contribution
- c_created_at: Creation time of a contribution
- c_kind: Kind of a contribution
- c_eis: N/A
- c_base: N/A
- a_role: N/A
- u_name: N/A
- e_content: Text content of a contribution
- chars: Number of characters in a contribution
- words: Number of words in a contribution
In this notebook we will only prepare the dataset. Modeling will occur in follow-up notebooks.
3. Requirements
- Input variables:
- document id
- target language (source language will always be English)
- quality of pre-translation model (with associated BLEU score)
- role of the proofreader
- proofreader code
- Output: The model should estimate the effort for each source (English) sentence in the document and eventually sum the estimated effort for all sentences.
- Model updates: The model should be updated periodically as new training data becomes available.
- Platform:
- The model should be deployed on Amazon Web Services (AWS)
- The model code should allow for a substantial growth in available data into the future
4. Design
The development will be undertaken on the AWS platform, making use of SageMaker Notebook instances. In addition, to make provision for eventual big-data needs, Apache Spark technology will be used.
Initially, because the current dataset is still relatively small, development and training will be done locally. Eventually, however, the code will be linked to an AWS Elastic MapReduce (EMR) cluster to provide a big-data training platform.
At a lower level, development will occur in Jupyter notebooks. Dataframes will not be used in the native Spark format; instead we will make use of the relatively new Databricks Koalas library. The advantage of this approach is that Koalas offers Spark functionality packaged behind familiar pandas function calls.
Note that this notebook runs on a SageMaker Notebook instance (VM). It is not accessed via SageMaker Studio.
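As a sketch of why Koalas matters here: the pandas idiom below is exactly what one would write against a Koalas frame (swap `pd.DataFrame` for `ks.DataFrame`), except that under Koalas it executes on Spark. The column names are illustrative only.

```python
import pandas as pd
# import databricks.koalas as ks  # the same calls work on a ks.DataFrame

df = pd.DataFrame({"t_lan": ["NL", "FR", "NL"], "chars": [12, 7, 9]})

# Familiar pandas groupby; under Koalas this would run as a distributed Spark job
mean_chars = df.groupby("t_lan")["chars"].mean()
```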
!python --version
Databricks Koalas is central to our activities:
!pip install koalas
# !conda install koalas -c conda-forge #takes long!
# https://www.kaggle.com/general/185679
!pip install -U seaborn
import seaborn as sns
sns.__version__ #should be 0.11.1
import os
import boto3
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import sagemaker
from sagemaker import get_execution_role
import sagemaker_pyspark
import numpy as np
import re
import databricks.koalas as ks
ks.__version__
import io
import matplotlib.pyplot as plt
import pandas as pd
pd.__version__
!pip list | grep pyspark
sagemaker.__version__
role = get_execution_role()
# Configure Spark to use the SageMaker Spark dependency jars
jars = sagemaker_pyspark.classpath_jars()
classpath = ":".join(jars)
# See the SageMaker Spark Github to learn how to connect to EMR from a notebook instance
spark = SparkSession.builder.config("spark.driver.extraClassPath", classpath)\
.master("local[*]").getOrCreate()
spark
# Koalas default index
# ks.set_option('compute.default_index_type', 'sequence')
ks.get_option('compute.default_index_type')
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df)
# data_location should point to the English (E) contributions CSV on S3 (defined elsewhere)
df_E = ks.read_csv(data_location, sep='~'); df_E.shape
#- df_E = ks.read_csv(data_location, index_col=None, header=None, sep='~', encoding='utf-8'); df_E.shape
df_E.iloc[:5,:-2]
df_E.info()
# drop features that won't be used
df_E = df_E.drop(['e_id','e_top','be_id','be_top','c_id','c_created_at','c_kind','c_eis','c_base','a_role','u_name'], axis=1)
df_E.iloc[:5,:-1]
# handle NaNs in e_content
e_content_nans = df_E['e_content'].isna()
df_E[e_content_nans]
#replace e_content NaNs with empty strings
df_E.loc[e_content_nans, 'e_content'] = ''
# df_E.loc[e_content_nans, ['e_content']]
# OR
df_E[df_E['e_content']=='']
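The NaN-to-empty-string replacement follows the usual boolean-mask pattern; a minimal pandas sketch with made-up values:

```python
import pandas as pd

s = pd.Series(["ok", None, "tekst"])  # hypothetical e_content values
nans = s.isna()
s[nans] = ""  # replace missing content with empty strings
```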
# add chars column
#- df_E['chars'] = [len(e) for e in df_E['e_content']] #does not work with Koalas
df_E['chars'] = df_E['e_content'].apply(len)
df_E.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','chars']]
# df_E.loc[e_content_nans, ['e_content','chars']]
# OR
df_E[df_E['chars']==0]
# add words column
# https://www.geeksforgeeks.org/python-program-to-count-words-in-a-sentence/
#- df_E['words'] = [len(re.findall(r'\w+', e)) for e in df_E['e_content']] #does not work in Koalas
df_E['words'] = df_E['e_content'].apply(lambda e: len(re.findall(r'\w+', e)))
df_E.loc[:5,['m_descriptor','t_lan','t_senc','t_version','s_typ','s_rsen','chars','words']]
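The per-element `apply` calls can be sanity-checked on plain pandas (Koalas mirrors the call); the word count uses the same `\w+` regex as above:

```python
import re
import pandas as pd

s = pd.Series(["One two three.", "", "Hello, world!"])  # made-up content
chars = s.apply(len)
words = s.apply(lambda e: len(re.findall(r"\w+", e)))
```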
#remove BER part of version from t_version so that we can use this
# column to join the English contributions with their matching
# translated contributions
#- df_E['t_version'] = ['-'.join(e.split('-')[:2]) for e in df_E['t_version']] #does not work in Koalas
df_E['t_version'] = df_E['t_version'].apply(lambda e: '-'.join(e.split('-')[:2]))
df_E.loc[:5,['m_descriptor','t_lan','t_senc','t_version','s_typ','s_rsen','chars','words']]
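The trim keeps only the first two dash-separated components of `t_version`, which is what makes the English and translated rows joinable. A standalone sketch (the exact version-string format is an assumption):

```python
def trim_version(v: str) -> str:
    """Drop the trailing BER part, keeping the first two dash-separated fields."""
    return "-".join(v.split("-")[:2])

trimmed = trim_version("v12-7-BER3")  # hypothetical t_version value
```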
# data_location should now point to the Create (C) contributions CSV
df_C = ks.read_csv(data_location, sep='~'); df_C.shape
df_C.iloc[:5,:-2]
# data_location should now point to the Vote (V) contributions CSV
df_V = ks.read_csv(data_location, sep='~'); df_V.shape
df_CV = ks.concat([df_C, df_V], axis=0); df_CV.shape
# sort so that the 'last' aggregation below picks up the most recent contribution per group
df_CV = df_CV.sort_values(by=['m_descriptor', 't_lan','t_version','s_rsen','a_role','u_name','c_created_at'])
df_aggd_CV = df_CV.groupby(['m_descriptor', 't_lan','t_version','s_rsen','a_role','u_name']).agg({'c_eis':['sum','count'], 'e_content':'last'})
df_aggd_CV.shape
df_aggd_CV = df_aggd_CV.reset_index(); df_aggd_CV.shape
df_aggd_CV.columns
df_aggd_CV.columns = ['m_descriptor','t_lan','t_version','s_rsen','a_role','u_name','c_eis_sum','c_eis_count','e_content']
df_aggd_CV.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','a_role','c_eis_sum','c_eis_count']]
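The multi-level columns that `.agg` produces are flattened by assigning a plain list, as above. The same pattern on pandas with toy data (names and values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "u_name":    ["ann", "ann", "bob"],
    "c_eis":     [10, 20, 5],
    "e_content": ["v1", "v2", "x"],
})

agg = toy.groupby("u_name").agg({"c_eis": ["sum", "count"], "e_content": "last"})
agg = agg.reset_index()
agg.columns = ["u_name", "c_eis_sum", "c_eis_count", "e_content"]  # flatten MultiIndex
```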
# remove BER from t_version
#- df_aggd_CV['t_version'] = ['-'.join(e.split('-')[:2]) for e in df_aggd_CV['t_version']] #does not work in Koalas
df_aggd_CV['t_version'] = df_aggd_CV['t_version'].apply(lambda e: '-'.join(e.split('-')[:2]))
df_aggd_CV.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','a_role','c_eis_sum','c_eis_count']]
# df_aggd_joind_E_CV = ks.merge(df_E, df_aggd_CV, how='inner', on=['m_descriptor', 't_version', 's_rsen'], suffixes=('_E', '_CV'), sort=True) #sort not proper here
df_aggd_joind_E_CV = ks.merge(df_E, df_aggd_CV, how='inner', on=['m_descriptor', 't_version', 's_rsen'], suffixes=('_E', '_CV'))
df_aggd_joind_E_CV = df_aggd_joind_E_CV.sort_values(by=['m_descriptor', 't_version', 's_rsen'])
df_aggd_joind_E_CV.loc[20:30,['m_descriptor','t_lan_E','t_senc','t_version','s_typ','s_rsen','chars','words','t_lan_CV','a_role','c_eis_sum','c_eis_count']]
df_aggd_joind_E_CV.shape
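Since both frames carry a `t_lan` column, the merge disambiguates it with the `_E`/`_CV` suffixes; a pandas sketch of that behavior with invented rows:

```python
import pandas as pd

left  = pd.DataFrame({"s_rsen": [1, 2], "t_lan": ["EN", "EN"]})  # English side
right = pd.DataFrame({"s_rsen": [1, 2], "t_lan": ["NL", "FR"]})  # translated side

# Overlapping column t_lan becomes t_lan_E / t_lan_CV
joined = pd.merge(left, right, how="inner", on="s_rsen", suffixes=("_E", "_CV"))
```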
# rename
df = df_aggd_joind_E_CV.copy()
# inspect range of ENG chars and words, and OTH c_eis_sum
#- display_all(ks.describe(include='all').T)
display_all(df.describe().T)
# conclusion: min and max are fine for both
df.columns
plt.figure(figsize=(20,10))
# convert the Koalas column to pandas for matplotlib
plt.hist(df['c_eis_sum'].to_pandas(), bins=200, density=False, histtype='bar', color='green')
plt.title('Distribution of c_eis (seconds)\n', fontweight ="bold")
plt.xlabel("effort (seconds)")
plt.show()
# remove outliers
df_deoutlrd = df[df['c_eis_sum']<180] #less than 3 minutes
# df_deoutlrd = df[df['c_eis_sum']<300] #less than 5 minutes
# df_deoutlrd = df[df['c_eis_sum']<600] #less than 10 minutes
# df_deoutlrd = df[df['c_eis_sum']<1200] #less than 20 minutes
plt.figure(figsize=(20,10))
# convert the Koalas column to pandas for matplotlib
plt.hist(df_deoutlrd['c_eis_sum'].to_pandas(), bins=200, density=False, histtype='bar', color='green')
plt.title('Distribution of c_eis (seconds)\n', fontweight ="bold")
plt.xlabel("effort (seconds)")
plt.show()
df_deoutlrd.columns
plt.figure(figsize=(20,10))
plt.title('c_eis_sum vs English chars', fontsize=30, color='black')
plt.xlabel('English chars', fontsize=20, color='black')
plt.ylabel('c_eis_sum', fontsize=20, rotation=90, color='black')
# plt.scatter(x='chars', y='c_eis_sum', data=df_deoutlrd);
# plt.scatter(x=df_deoutlrd['chars'], y=df_deoutlrd['c_eis_sum']);
plt.scatter(x='chars', y='c_eis_sum', data=df_deoutlrd.to_pandas());
df_deoutlrd_TEs = df_deoutlrd[df_deoutlrd['a_role']=='TE']; df_deoutlrd_TEs.shape
df_deoutlrd_TEs.loc[0:10,['m_descriptor','t_lan_E','t_senc','t_version','s_typ','s_rsen','chars','words','t_lan_CV','a_role','c_eis_sum','c_eis_count']]
df_deoutlrd_TEs.head()
plt.figure(figsize=(20,10))
plt.title('c_eis_sum vs English chars (TEs)', fontsize=30, color='black')
plt.xlabel('English chars', fontsize=20, color='black')
plt.ylabel('c_eis_sum', fontsize=20, rotation=90, color='black')
plt.scatter(x='chars', y='c_eis_sum', data=df_deoutlrd_TEs.to_pandas());
sdf = df_deoutlrd_TEs.to_spark()
# bucket_str holds the S3 bucket name (defined elsewhere)
csv_path = f"s3a://{bucket_str}/output/"
# coalesce(1) writes a single output file; t_lan was suffixed to t_lan_E by the merge
sdf.coalesce(1).orderBy(['m_descriptor','t_lan_E','t_version','s_rsen']).write.mode('overwrite').format(
    "com.databricks.spark.csv").option("header","true").option("sep","~").save(csv_path)
# test
sdf = spark.read.csv(csv_path, sep='~', header=True, inferSchema=True, nanValue='null', nullValue='null')
# csv_df = spark.read.csv(csv_path, header=True, inferSchema = True)
kdf = sdf.to_koalas()
kdf.shape
kdf.columns
kdf.loc[:5,['m_descriptor','t_lan_E','t_senc','t_version','s_typ','s_rsen','chars','words','t_lan_CV','a_role','c_eis_sum','c_eis_count']]