Tutorial¶
Training a QE model¶
One can use DeepQuest to train QE models at word, phrase, sentence or document level. In its current version, DeepQuest provides the following multi-level QE models:
POSTECH: a two-stage end-to-end stacked neural architecture that combines a Predictor and an Estimator, designed by Kim et al., 2017.
BiRNN: a simple architecture relying on two bi-directional RNNs, designed by Ive et al., 2018.
The configuration differs depending on the desired level of prediction, and this section gives a detailed description of the parameters to customise.
The first step is to create a configuration file (see configs/example_config-WordQE.py for an example), which defines the parameters of the model to train, starting with the definition of the task:
TASK_NAME: name given to the task;
SRC_LAN, TRG_LAN: extensions of the corresponding source-language and MT files (target-language file for the Predictor);
DATA_ROOT_PATH: directory where the data can be found;
TEXT_FILES: a (Python) dictionary that contains the names of the training, development and test sets (without extension).
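Put together, the task definition part of the configuration file might look like the following sketch (the task name, extensions, data path and dictionary keys below are illustrative and should be adapted to your data; configs/example_config-WordQE.py shows the exact format expected):

# Task definition (illustrative values)
TASK_NAME = 'qe-2017'                 # name given to the task
SRC_LAN = 'src'                       # extension of the source-language files
TRG_LAN = 'mt'                        # extension of the MT files
DATA_ROOT_PATH = 'examples/qe-2017'   # directory containing the data
# training, development and test set names, without extension
# (dictionary keys are assumed; check the example config)
TEXT_FILES = {'train': 'train', 'val': 'dev', 'test': 'test'}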
INPUTS_IDS_DATASETS – defines the datasets used to train the QE model:
source_text – source text;
state_below – target text (reference for the Predictor, MT for the Estimator), shifted one position to the right (for the left POSTECH context; the same as the previous word in NMT-Keras Teacher);
state_above – target text (reference for the Predictor, MT for the Estimator), shifted one position to the left (for the right POSTECH context; the same as the next word in NMT-Keras Teacher);
target – unshifted MT text, used to obtain Predictor scores for it.
Note: only the source_text and target_text inputs are used for BiRNN models.
For single-task models, set an output in OUTPUTS_IDS_DATASET from the following (and set MULTI_TASK=False, keeping the pre-set task names):
target_text – for the Predictor; Predictor training can be stopped after 2-3 epochs, as soon as the BLEU score stops improving;
word_qe – for the word-level quality Estimator;
phrase_qe – for the phrase-level quality Estimator;
sent_qe – for the sentence-level quality Estimator;
doc_qe – for document-level models.
LOSS – defines the loss function:
categorical_crossentropy for the Predictor (POSTECH architecture);
mse for QE models.
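For instance, a single-task sentence-level BiRNN configuration could combine the inputs, output and loss as in the sketch below (the list structure of the input/output parameters is an assumption; refer to the example configs for the exact format):

# Inputs, output and loss for a single-task sentence-level BiRNN model
INPUTS_IDS_DATASETS = ['source_text', 'target_text']   # BiRNN models use only these two inputs
OUTPUTS_IDS_DATASET = ['sent_qe']                       # sentence-level quality Estimator output
MULTI_TASK = False                                      # single-task training
LOSS = 'mse'                                            # regression loss for QE models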
MODEL_TYPE – defines the type of model to train:
POSTECH: Predictor, Estimator{Word, Phrase, Sent, Doc, DocAtt};
BiRNN: Enc{Word, PhraseAtt, Sent, Doc, DocAtt}.
Note: document-level models take the last BiRNN states to produce the QE labels, while document-level models with an attention mechanism (DocAtt) take the sum of the BiRNN states, weighted by attention (see model_zoo.py for implementation details). EncPhraseAtt takes into account attended parts of the source while estimating MT phrase quality (useful in the absence of phrase alignments).
Parameters per model type:
WORD_QE_CLASSES, PHRASE_QE_CLASSES – set to 5: besides the OK and BAD labels, a set of standard labels related to padding and other pre-processing is used;
SAMPLE_WEIGHTS – specifies a dictionary with the task names above, the labels and their weights (for non-regression tasks, such as word-level QE);
PRED_SCORE – set to the extension of the tag file (e.g. PRED_SCORE = 'bleu') for both sentence- and document-level QE; for word-level QE, set to the 'tags' extension;
SECOND_DIM_SIZE – (for phrase- and document-level QE only) fixes the size of a document (e.g. to the maximum length of the most frequent quartile);
OUT_ACTIVATION – set to the 'relu' function if the predicted scores are in (0, +infinity), to the 'sigmoid' function for scores in (0, 1) (for example, BLEU or HTER), or to the 'linear' function for scores in (-infinity, +infinity).
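For sentence-level QE on HTER scores, for example (as in the scripts later in this tutorial), these parameters might be set as in the following sketch (values are illustrative):

# Sentence-level QE on HTER scores (illustrative values)
MODEL_TYPE = 'EncSent'       # sentence-level BiRNN model
PRED_SCORE = 'hter'          # extension of the files holding the sentence-level scores
OUT_ACTIVATION = 'sigmoid'   # HTER scores lie in (0, 1)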
MULTI_TASK – Multi-Task Learning (MTL) (POSTECH model only):
MULTI_TASK = True / False – activates / deactivates MTL;
OUTPUTS_IDS_DATASET_FULL – defines the order of the multiple outputs for MTL. The standard order of tasks is target_text, word_qe, sent_qe (LOSS and MODEL_TYPE will be ignored). MTL first pre-trains the word-level weights (keeping the Predictor weights unchanged), and then the Estimator (sentence-level);
EPOCH_PER_UPDATE = 1 – number of times every task is consecutively repeated (each for N epochs, as specified by the parameters below);
EPOCH_PER_PRED = 5 – Predictor epochs;
EPOCH_PER_EST_SENT = 5 – EstimatorSent epochs;
EPOCH_PER_EST_WORD = 5 – EstimatorWord epochs.
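A POSTECH MTL setup could therefore look like the sketch below (the list structure of OUTPUTS_IDS_DATASET_FULL is an assumption; check the example configs):

# Multi-Task Learning for the POSTECH model (illustrative sketch)
MULTI_TASK = True
# standard order of tasks; LOSS and MODEL_TYPE are ignored in MTL
OUTPUTS_IDS_DATASET_FULL = ['target_text', 'word_qe', 'sent_qe']
EPOCH_PER_UPDATE = 1      # times the whole task cycle is repeated
EPOCH_PER_PRED = 5        # Predictor epochs per cycle
EPOCH_PER_EST_SENT = 5    # EstimatorSent epochs per cycle
EPOCH_PER_EST_WORD = 5    # EstimatorWord epochs per cycle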
Neural network parameters (these should be kept the same for the large Predictor training and the subsequent MTL training).
For a small POSTECH-inspired model the following parameters should be used:
IN{OUT}PUT_VOCABULARY_SIZE = 30000
SOURCE{TARGET}_TEXT_EMBEDDING_SIZE = 300
EN{DE}CODER_HIDDEN_SIZE = 500
QE_VECTOR_SIZE = 75
For a large POSTECH-inspired model:
IN{OUT}PUT_VOCABULARY_SIZE = 70000
SOURCE{TARGET}_TEXT_EMBEDDING_SIZE = 500
EN{DE}CODER_HIDDEN_SIZE = 700
QE_VECTOR_SIZE = 100
For document-level QE:
DOC_DECODER_HIDDEN_SIZE = 50
For BiRNN models:
ENCODER_HIDDEN_SIZE = 50
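Written out with explicit parameter names (assuming the IN{OUT}PUT / SOURCE{TARGET} / EN{DE}CODER shorthand expands to the corresponding pairs of config variables), the small POSTECH-inspired settings would read:

# Small POSTECH-inspired model, shorthand expanded (assumed parameter names)
INPUT_VOCABULARY_SIZE = 30000
OUTPUT_VOCABULARY_SIZE = 30000
SOURCE_TEXT_EMBEDDING_SIZE = 300
TARGET_TEXT_EMBEDDING_SIZE = 300
ENCODER_HIDDEN_SIZE = 500
DECODER_HIDDEN_SIZE = 500
QE_VECTOR_SIZE = 75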
Other training-related parameters:
PRED_VOCAB – set to the dictionary pickle dumped by the pre-trained model (dumped to the datasets folder);
PRED_WEIGHTS – set to the pre-trained weights (as dumped to the trained_models/{model_name} folder);
BATCH_SIZE – typically 50 or 70 for smaller models; set to 5 for document-level QE;
MAX_EPOCH – maximum number of epochs the code will run (for MTL, the maximum number of iterations over all three tasks);
MAX_IN(OUT)PUT_TEXT_LEN – longer sequences are cut to the specified length;
MAX_SRC(TRG)_INPUT_TEXT_LEN – longer sequences are cut to the specified length; set these lengths separately if they differ for the source and MT inputs (for example, for phrase-level QE, where source sentences and MT phrases are given as inputs);
RELOAD = {epoch_number}, combined with RELOAD_EPOCH = True – helpful when you want to continue training from a certain epoch; it is also a good idea to specify the previously pickled vocabulary (PRED_VOCAB);
OPTIMIZER = {optimizer} – also adjust the learning rate LR accordingly;
EARLY_STOP = True – activates early stopping with the required PATIENCE (e.g. 5); set the appropriate stop metric STOP_METRIC, e.g. 'pearson' (for regression QE tasks also 'mae' and 'rmse'; for classification tasks 'precision', 'recall' and 'f1').
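As an illustration, the training-related block of a small sentence-level configuration might look like this sketch (all values are illustrative; the optimizer name and matching learning rate should be checked against the example configs):

# Training-related settings (illustrative values)
BATCH_SIZE = 50           # 50 or 70 for smaller models; 5 for document-level QE
MAX_EPOCH = 500           # upper bound; early stopping usually ends training sooner
OPTIMIZER = 'Adam'        # assumed optimizer name; adjust LR accordingly
LR = 0.001
EARLY_STOP = True
PATIENCE = 5
STOP_METRIC = 'pearson'   # regression QE task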
Once all the training parameters are defined in the configuration file quest/config.py, one can run the training of the QE model as follows:
export KERAS_BACKEND=theano
export MKL_THREADING_LAYER=GNU
THEANO_FLAGS=device={device_name} python main.py | tee -a /tmp/deepQuest.log 2>&1 &
One can observe the progression of the training in the log file created in the temporary directory.
Scoring¶
Test sets are scored after each epoch with an inbuilt procedure, using the standard metrics of the WMT QE Shared tasks. New test sets can be scored with already trained models by launching the same command as for training, after changing the following parameters in your initial config (see configs/config-sentQEbRNNEval.py for an example; for now the scoring procedure has been tested only for sentence-level QE models):
EVAL_ON_SETS – specifies the set to score;
PRED_VOCAB – set to the path of the vocabulary of the pre-trained model (as dumped to datasets/Dataset_{task_name}_{src_extension}{trg_extension}.pkl);
PRED_WEIGHTS – set to the path of the pre-trained weights (as dumped to the trained_models/{model_name} folder) of the model that will be used for scoring;
MODE – set to 'sampling';
NO_REF – set to True if you do not have a file with gold-standard labels.
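A scoring configuration for an already trained sentence-level model could therefore contain something like the following sketch (the paths, file names and the list structure of EVAL_ON_SETS are illustrative assumptions):

# Scoring a new test set with a pre-trained sentence-level model (illustrative)
EVAL_ON_SETS = ['test']    # set to score
PRED_VOCAB = 'datasets/Dataset_qe-2017_srcmt.pkl'
PRED_WEIGHTS = 'trained_models/qe-2017_srcmt_EncSent/epoch_12_weights.h5'
MODE = 'sampling'
NO_REF = True              # no gold-standard labels available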
Examples¶
We also provide two scripts to train and test Sentence QE models for BiRNN and POSTECH (configs/train-test-sentQEbRNN.sh and configs/train-test-sentQEPostech.sh respectively). The examples below assume that the correct environment is already activated and all the environment variables are set.
Sentence QE data in a format compatible with deepQuest can be downloaded, for example, from the WMT QE Shared Task 2017 page. Download the task1_en-de_training-dev.tar.gz, task1_en-de_test.tar.gz and wmt17_en-de_gold.tar.gz archives. Make sure to get the original version of the data and not the latest version it was replaced with. Create the folder examples/qe-2017 in the quest directory and extract all three archives into it. Execute the following commands to rename the 2017 test data:
cd examples/qe-2017
rename 's/^test.2017/test/' *
mv en-de_task1_test.2017.hter test.hter
Copy the necessary BiRNN shell script to the 'quest' folder and launch it from there, specifying the name of the data folder, the extensions of the source and machine-translated files, as well as the CUDA device (specify 'cpu' to train on CPUs):
cd deepQuest/quest
cp ../configs/train-test-sentQEbRNN.sh .
./train-test-sentQEbRNN.sh --task qe-2017 --source src --target mt --score hter --activation sigmoid --device cuda0 > log-sentQEbRNN-qe-2017.txt 2>&1 &
The complete log is in quest/log-qe-2017_srcmt_EncSent.txt. The log log-sentQEbRNN-qe-2017.txt should show results comparable to the ones below:
cat log-sentQEbRNN-qe-2017.txt
Analysing input parameters
Traning the model qe-2017_srcmt_EncSent
Best model weights are dumped into saved_models/qe-2017_srcmt_EncSent/epoch_12_weights.h5
Scoring test.mt
Model output in trained_models/qe-2017_srcmt_EncSent/test_epoch_12_output_0.pred
Evaluations results
[24/07/2018 12:08:33] **SentQE**
[24/07/2018 12:08:33] Pearson 0.3871
[24/07/2018 12:08:33] MAE 0.1380
[24/07/2018 12:08:33] RMSE 0.1819
Note: if you launch the scripts with your own data and do not have gold-standard labels for your test data, cf. the respective note in the Scoring section.
For POSTECH Predictor pre-training, parallel data containing human reference translations should be prepared. For example, the Europarl corpus can be used. The data can be pre-processed with a standard Moses pipeline (see the Corpus Preparation section). Typically, around 2M parallel lines are used for training and 3K lines for testing (small Predictor model).
We provide an example of training the POSTECH architecture using Europarl and the WMT 2017 Sentence QE data:
Create a data directory and download the EN-DE Europarl data:
mkdir -p europarl/raw && cd "$_"
wget http://opus.nlpl.eu/download.php?f=Europarl/de-en.txt.zip
unzip download.php\?f=Europarl%2Fde-en.txt.zip
Create your copy of the Moses toolkit:
git clone https://github.com/moses-smt/mosesdecoder.git
Copy the preprocessing scripts provided with the deepQuest tool to your main data directory and launch them, specifying the data information and the location of the Moses clone. This step may take a while.
cd /{your_path}/europarl
cp deepQuest/configs/preprocess-data-predictor.sh ./
cp deepQuest/configs/split.py ./
./preprocess-data-predictor.sh --name Europarl.de-en --source en --target de --dir /{your_path}/europarl --mosesdir /{your_path}/mosesdecoder
The final preprocessed data should look as follows:
wc -l /{your_path}/europarl/clean/en-de/*
3000 clean/en-de/dev.de
3000 clean/en-de/dev.en
3000 clean/en-de/test.de
3000 clean/en-de/test.en
1862790 clean/en-de/train.de
1862790 clean/en-de/train.en
3737580 total
Copy the prepared data files into the quest data directory:
mkdir /{your_path}/quest/examples/europarl-en-de
cp /{your_path}/europarl/clean/en-de/* /{your_path}/quest/examples/europarl-en-de
Launch the Postech script:
cd deepQuest/quest
cp ../configs/train-test-sentQEPostech.sh .
./train-test-sentQEPostech.sh --pred-task europarl-en-de --pred-source en --pred-target de --est-task qe-2017 --est-source src --est-target mt --score hter --activation sigmoid --device cuda0 > log-sentQEPostech-qe-2017.txt 2>&1 &
The complete logs are in quest/log-europarl-en-de_ende_Predictor.txt and quest/log-qe-2017_srcmt_EstimatorSent.txt. The log log-sentQEPostech-qe-2017.txt should show results comparable to the following ones:
cat log-sentQEPostech-qe-2017.txt
Analysing input parameters
Traning the model europarl-en-de_ende_Predictor
Traning the model qe-2017_srcmt_EstimatorSent
Best model weights are dumped into saved_models/qe-2017_srcmt_EstimatorSent/epoch_3_weights.h5
Scoring test.mt
Model output in trained_models/qe-2017_srcmt_EstimatorSent/test_epoch_3_output_0.pred
Evaluations results
[30/07/2018 14:24:51] Pearson 0.5276
[30/07/2018 14:24:51] MAE 0.1279
[30/07/2018 14:24:51] RMSE 0.1649
[30/07/2018 14:24:51] Done evaluating on metric qe_metrics