This repository contains the syntactic augmentation dataset used to improve robustness in NLI, from our ACL 2020 paper, Syntactic Data Augmentation Increases Robustness to Inference Heuristics, by Junghyun Min¹, Tom McCoy¹, Dipanjan Das², Emily Pitler², and Tal Linzen¹. A 7-minute presentation on the paper can be accessed here.
¹Department of Cognitive Science, Johns Hopkins University, Baltimore, MD
²Google Research, New York, NY
Augmentation datasets are in the `datasets` folder. Each file is named using the following abbreviations:

Transformation strategies:
- `inv`: inversion
- `pass`: passivization
- `comb`: combination of inversion and passivization
- `chaos`: random shuffling condition

Sentence pair:
- `orig`: original premise as premise, transformed hypothesis as hypothesis
- `trsf`: original hypothesis as premise, transformed hypothesis as hypothesis

Label:
- `pos`: augmentation examples whose label is entailment
- `neg`: augmentation examples whose label is nonentailment

Size:
- `small`: 101 examples
- `medium`: 405 examples
- `large`: 1215 examples
For example, `pass_trsf_pos_small.tsv` is a set of 101 passivization examples with transformed hypotheses whose labels are entailment. Also, please note that the combined transformed-hypothesis nonentailment datasets (`comb_trsf_neg_large.tsv`, etc.) are not discussed or reported in our paper.
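As a quick illustration of this naming scheme, the following sketch (a hypothetical helper of ours, not a repository script) decodes a filename into its components:

```python
# Hypothetical helper (ours, not part of the repository) that decodes a
# dataset filename into its components. The hypothesis is always the
# transformed sentence; the "pair" code says where the premise comes from.
STRATEGY = {"inv": "inversion", "pass": "passivization",
            "comb": "inversion + passivization", "chaos": "random shuffling"}
PREMISE = {"orig": "original premise", "trsf": "original hypothesis"}
LABEL = {"pos": "entailment", "neg": "nonentailment"}
SIZE = {"small": 101, "medium": 405, "large": 1215}

def decode(filename):
    strategy, pair, label, size = filename.rsplit(".", 1)[0].split("_")
    return STRATEGY[strategy], PREMISE[pair], LABEL[label], SIZE[size]

print(decode("pass_trsf_pos_small.tsv"))
# ('passivization', 'original hypothesis', 'entailment', 101)
```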
Fields within each file are equivalent to the MNLI datasets downloadable from GLUE. However, only four fields are populated: `index`, `sentence1` (premise), `sentence2` (hypothesis), and `gold_label`.
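To inspect one of these files, a sketch along these lines should work, assuming the file carries a tab-separated header row like MNLI's `train.tsv` (if not, pass the field names to `DictReader` explicitly):

```python
import csv

# Print the four populated fields of an augmentation file; the remaining
# MNLI columns are left empty in these datasets.
with open("datasets/pass_trsf_pos_small.tsv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        print(row["index"], row["gold_label"])
        print("  premise:   ", row["sentence1"])
        print("  hypothesis:", row["sentence2"])
```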
The attached `.tsv` data files were used to augment the MultiNLI training set in our experiments. They are randomly selected subsets, or unions of subsets, of the transformations created by running `generate_dataset.py`, which requires MultiNLI's JSON Lines file `multinli_1.0_train.jsonl` to run. Simply modify the MNLI path argument before running `python2 generate_dataset.py`.
This will create four files: `inv_orig.tsv`, `inv_trsf.tsv`, `pass_orig.tsv`, and `pass_trsf.tsv`. From these four files, individual augmentation sets similar to those included in the `datasets` folder can be created by subsetting and/or concatenating, as sketched below.
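For example, here is a minimal sketch of that subsetting and concatenation, in our own Python rather than a repository script (the sizes follow the README; the even split for the combined set is our assumption, and we assume the generated files have no header row):

```python
import random

def sample_lines(path, n, seed=0):
    # Randomly sample n examples (lines) from a generated TSV.
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)
    return lines[:n]

# A large (1215-example) inversion set with transformed hypotheses:
with open("inv_trsf_large.tsv", "w", encoding="utf-8") as out:
    out.writelines(sample_lines("inv_trsf.tsv", 1215))

# A combined inversion + passivization set of the same total size:
with open("comb_trsf_large.tsv", "w", encoding="utf-8") as out:
    out.writelines(sample_lines("inv_trsf.tsv", 608) +
                   sample_lines("pass_trsf.tsv", 607))
```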
In the `config` folder, `bert_config.json` contains BERT configurations, while `train.sh` and `hans_pred.sh` contain training, evaluation, and prediction parameters for running BERT's `run_classifier.py`.
If you haven't already downloaded BERT and the MNLI data, now is the time. You can download BERT from its repository, and the MNLI data by running `download_glue_data.py`, which produces the files mentioned below, such as `train.tsv` and `test_matched.tsv`:

```
python download_glue_data.py --data_dir ~/download/path --tasks MNLI
```
To finetune BERT on an augmented training set, concatenate an augmentation set to the training set `train.tsv`:

```
shuf -n1215 inv_trsf.tsv > inv_trsf_large.tsv
mv train.tsv train_orig.tsv
cat train_orig.tsv inv_trsf_large.tsv > train.tsv
```
and finetune BERT as you would on an unaugmented set by running `train.sh`.
Once the model is trained, it will also be evaluated on MNLI, and the results will be recorded in `eval_results.txt` in your output folder. It will look something like this:

```
eval_accuracy = 0.8471727
eval_loss = 0.481841
global_step = 36929
loss = 0.48185167
```
Along with the results file, you'll also see checkpoint files of the form `model.ckpt-<number>`. These are the model weights at particular points in training; the higher the number, the closer the checkpoint is to the end of training. If you used large augmentation, you'll have `model.ckpt-36929` as your trained model.
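That final step count can be sanity-checked with a bit of arithmetic, assuming BERT's default hyperparameters (`train_batch_size=32`, `num_train_epochs=3`) and MNLI's 392,702 training examples plus the 1,215 large-augmentation examples:

```python
# Expected final global_step, assuming run_classifier.py's defaults of
# train_batch_size=32 and num_train_epochs=3. MNLI has 392,702 training
# examples; large augmentation adds 1,215 more.
train_examples = 392702 + 1215
batch_size = 32
epochs = 3

steps = int(train_examples / batch_size * epochs)
print(steps)  # 36929
```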
To evaluate the model on HANS, you'll need to have downloaded the scripts and dataset from the HANS repository. Then, format `heuristics_evaluation_set.txt` to resemble `test_matched.tsv`, with `sentence1` (premise) and `sentence2` (hypothesis) as the 9th and 10th fields; the other fields can be filled with dummy values. The formatted file will also need to be named `test_matched.tsv`, so it is a good idea to keep the MNLI and HANS directories separate.
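Here is a minimal conversion sketch, assuming the tab-separated column names (`sentence1`, `sentence2`, `pairID`) distributed with HANS and the ten-column layout of GLUE's `test_matched.tsv`; the dummy filler values are our own:

```python
import csv

# Rewrite HANS's evaluation file so that sentence1 and sentence2 land in
# the 9th and 10th columns, mirroring MNLI's test_matched.tsv layout.
header = ["index", "promptID", "pairID", "genre",
          "sentence1_binary_parse", "sentence2_binary_parse",
          "sentence1_parse", "sentence2_parse",
          "sentence1", "sentence2"]

with open("heuristics_evaluation_set.txt", newline="", encoding="utf-8") as fin, \
     open("test_matched.tsv", "w", newline="", encoding="utf-8") as fout:
    reader = csv.DictReader(fin, delimiter="\t")
    writer = csv.writer(fout, delimiter="\t")
    writer.writerow(header)
    for i, row in enumerate(reader):
        writer.writerow([i, "0", row.get("pairID", i), "dummy",
                         "(())", "(())", "(())", "(())",
                         row["sentence1"], row["sentence2"]])
```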
Then, you can create the model's predictions on HANS with `hans_pred.sh`. Once it is finished, it will produce `test_results.tsv` in your output folder. To analyze it, process the results:

```
python process_results.py
python evaluate_heur_output.py preds.txt
```
This will output HANS performance by heuristic, by subcase, and by template.
This repository is licensed under the MIT license.