Evaluating consistency of Question-Answering Models

This repository contains code for creating implications and evaluating the consistency of question-answering models, as described in the following paper:

Are Red Roses Red? Evaluating Consistency of Question-Answering Models
Marco Tulio Ribeiro, Carlos Guestrin, Sameer Singh
Association for Computational Linguistics (ACL), 2019

Installation

Clone this repository and cd to the folder:

git clone git@github.com:marcotcr/qa_consistency.git
cd qa_consistency

Create and activate a virtual environment, e.g.:

virtualenv -p python3.6 env
source env/bin/activate

Run the following, replacing [gpu] with [cpu] if you don't have a gpu.

pip install cython numpy
pip install benepar[gpu]
pip install -e .
cd qa_consistency
git clone https://github.com/kelvinguu/qanli.git
cd ..
python -c "import benepar;benepar.download('benepar_en_small')"
python -m spacy download en_core_web_sm

Generating implications:

VQA

import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
import qa_consistency.implication
gen = qa_consistency.implication.ImplicationsVQA()
gen.implications('How many birds?', '3')

[('Are there 3 birds ?', 'yes', 'yeseqcount'),
('Are there 4 birds ?', 'no', 'n+1'),
('Are there any birds ?', 'yes', 'ans>0 implies some')]

SQuAD

import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
import qa_consistency.implication
gen = qa_consistency.implication.ImplicationsSquad()
passage = 'Kublai originally named his eldest son, Zhenjin, as the Crown Prince, \
but he died before Kublai in 1285.'
gen.implications('When did Zhenjin die?', '1285', passage)

[('Who died in 1285?', 'Zhenjin', 'subj')]

Evaluating the consistency of models

VQA

Download and extract precomputed implications here. Create a folder for the consistency dataset (CONSISTENCY_FOLDER). Output your model predictions into a json file (PRED_FILE) in the VQA format. Then run:

import qa_consistency.dataset_utils
all_imps = pickle.load(open('vqa_imps.pkl', 'rb'))
vqa = qa_consistency.dataset_utils.load_vqa(vqa_path, 'validation')
# Uncomment the line below if you want vqa v2
# vqa = qa_consistency.dataset_utils.load_vqav2(vqa_path, 'validation')
qa_consistency.dataset_utils.generate_implication_vqa(vqa, PRED_FILE, all_imps, CONSISTENCY_FOLDER)

This will write CONSISTENCY_FOLDER/{questions,annotations}.json. At this point you should run your model on these files, and generate a new prediction file (CONSISTENCY_PRED_FILE), and then run:

stats = qa_consistency.dataset_utils.evaluate_consistency_vqa(CONSISTENCY_FOLDER, CONSISTENCY_PREDS_FILE)
print('Consistency by implication type:')
print()
for x, v in stats.items():
    if x == 'all':
        continue
    print('%s : %.1f' % (x, 100* v))
print()
print('Avg  : %.1f' % (100 * stats['all']))

SQuAD

Download and extract precomputed implications here. Let SQUAD_PATH be a pointer to the original squad dev set json (dev-v1.1.json), PRED_FILE be the predictions json on the dev set from your model in the SQuAD official format (dictionary of id : answer). Run:

import qa_consistency.dataset_utils
all_imps = pickle.load(open('squad_imps.pkl', 'rb'))
qa_consistency.dataset_utils.generate_implication_squad(
SQUAD_PATH, PRED_FILE, all_imps, NEW_SQUAD_JSON)

This will generate a new dataset in the SQuAD format in the NEW_SQUAD_JSON path. At this point you should run your model on this file, and generate a new prediction file (CONSISTENCY_PRED_FILE), and then run:

stats = qa_consistency.dataset_utils.evaluate_consistency_squad(NEW_SQUAD_JSON, CONSISTENCY_PRED_FILE)
print('Consistency by implication type:')
print()
for x, v in stats.items():
    if x == 'all':
        continue
    print('%s : %.1f' % (x, 100* v))
print()
print('Avg  : %.1f' % (100 * stats['all']))

Notebooks where we bring it all together

VQA notebook
SQuAD notebook

Code of Conduct

Microsoft Open Source Code of Conduct

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Evaluating consistency of Question-Answering Models

Installation

Generating implications:

VQA

SQuAD

Evaluating the consistency of models

VQA

SQuAD

Notebooks where we bring it all together

Code of Conduct

Files

README.md

Latest commit

History

README.md

File metadata and controls

Evaluating consistency of Question-Answering Models

Installation

Generating implications:

VQA

SQuAD

Evaluating the consistency of models

VQA

SQuAD

Notebooks where we bring it all together

Code of Conduct