# Visual Question Answering in PyTorch
- Run `pip install -r requirements.txt` to install all the required Python packages.
- Download the VQA (2.0) dataset from [visualqa.org](https://visualqa.org/).

We assume you use image embeddings, which you can generate with `preprocess_images.py`:
```
python preprocess_images.py <path to instances_train2014.json> \
    --root <path to dataset root "train2014|val2014"> \
    --split <train|val> --arch <vgg16|vgg19_bn|resnet152>
```
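If you are curious what the feature extraction boils down to, here is a minimal sketch using a pretrained torchvision backbone (the image path and the choice of ResNet-152 are illustrative; `preprocess_images.py` additionally handles COCO annotation parsing, batching, and saving the embeddings):

```python
# Minimal sketch of image feature extraction with a pretrained torchvision
# backbone; preprocess_images.py does the real work for the full dataset.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Standard ImageNet preprocessing.
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# ResNet-152 with the final classification layer replaced by an identity,
# so the forward pass yields a 2048-d feature vector per image.
backbone = models.resnet152(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

img = transform(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    feats = backbone(img)  # shape: (1, 2048)
```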
I have already pre-processed all the COCO images (both train and test sets) using the VGG-16, VGG-19-BN, and ResNet-152 models. To download them, please go into the `image_embeddings` directory and run `make <model>`. Here `<model>` can be either `vgg16`, `vgg19_bn`, or `resnet152`, depending on which model's embeddings you need, e.g. `make resnet152`.

Alternatively, you can find them here.
To run the training and evaluation code with default values, just type `make`.

If you wish to only run the training code, you can run `make train`.

If you want to use the raw RGB images from COCO, you can type `make raw_images`. This takes the same arguments as `make train`.

You can get a list of options with `make options` or `python main.py -h`. Check out the `Makefile` to get an idea of how to run the code.
**NOTE:** The code takes care of all the text preprocessing (a rough sketch of what this involves follows the argument list below). Just sit back and relax.
The minimum arguments required are:

- The VQA train annotations dataset
- The VQA train open-ended questions dataset
- The path to the COCO train image feature embeddings
- The VQA val annotations dataset
- The VQA val open-ended questions dataset
- The path to the COCO val image feature embeddings
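For a rough idea of what the text preprocessing involves, it typically amounts to tokenizing each question and mapping words to integer ids. The sketch below is a simplified stand-in (the function names and `<unk>` handling are assumptions), not the repo's actual vocabulary code:

```python
# Simplified sketch of question tokenization and vocabulary building;
# the repo handles this automatically during training.
from collections import Counter

def build_vocab(questions, min_count=1):
    """Map every word seen at least min_count times to an integer id."""
    counts = Counter(w for q in questions for w in q.lower().rstrip("?").split())
    vocab = {"<unk>": 0}
    for word, n in counts.items():
        if n >= min_count:
            vocab[word] = len(vocab)
    return vocab

def encode(question, vocab):
    """Turn a question string into a list of word ids."""
    return [vocab.get(w, vocab["<unk>"]) for w in question.lower().rstrip("?").split()]

questions = ["What room is this?", "How many dogs are there?"]
vocab = build_vocab(questions)
print(encode("What room is this?", vocab))  # [1, 2, 3, 4] with this toy vocabulary
```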
Evaluating the model's performance on a fine-grained basis is important, so this repo supports evaluating answers by answer type (e.g. "yes/no" questions).
To evaluate the model, run `make evaluate`. You are required to pass in the `--resume` argument to point to the trained model weights. The other arguments are the same as in training.
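As a rough illustration, per-answer-type evaluation amounts to grouping predictions by the `answer_type` field of the VQA annotations ("yes/no", "number", "other") and computing accuracy within each group. The sketch below uses plain exact-match scoring and assumed field names, not the official VQA consensus metric:

```python
# Simplified per-answer-type accuracy; the dict keys are assumptions and the
# scoring is plain exact-match, unlike the official VQA accuracy metric.
from collections import defaultdict

def per_type_accuracy(examples):
    """examples: iterable of dicts with 'answer_type', 'prediction', 'answer'."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        total[ex["answer_type"]] += 1
        if ex["prediction"] == ex["answer"]:
            correct[ex["answer_type"]] += 1
    return {t: 100.0 * correct[t] / total[t] for t in total}

print(per_type_accuracy([
    {"answer_type": "yes/no", "prediction": "yes", "answer": "yes"},
    {"answer_type": "number", "prediction": "2", "answer": "3"},
]))  # {'yes/no': 100.0, 'number': 0.0}
```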
We have a sample demo that you can run with `make demo`. You can also use your own image or question:

```
python demo.py demo_img.jpg "what room is this?"
```
**NOTE:** We train and evaluate on the balanced datasets.
The `DeeperLSTM` model in this repo achieves the following results:

Overall accuracy: **49.15**

Per answer type:

| Answer type | Accuracy (%) |
| ----------- | ------------ |
| other       | 38.12        |
| yes/no      | 69.55        |
| number      | 32.17        |
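For readers who want a picture of what a Deeper-LSTM-style VQA model does, here is a minimal, self-contained sketch (vocabulary size, answer count, and dimensions are assumptions; this is not the repo's actual `DeeperLSTM` implementation): a two-layer LSTM encodes the question, its final hidden state is fused with normalized image features, and a classifier scores the candidate answers.

```python
# Illustrative sketch of a Deeper-LSTM-style VQA model, NOT the exact
# DeeperLSTM implementation in this repo. All sizes below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, img_dim=2048,
                 embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.img_fc = nn.Linear(img_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, image_feats, question_ids):
        # Encode the question; keep the last hidden state of the top LSTM layer.
        _, (h, _) = self.lstm(self.embed(question_ids))
        q = h[-1]                                   # (batch, hidden_dim)
        # L2-normalize the image features and project them to the same space.
        v = self.img_fc(F.normalize(image_feats, dim=-1))
        # Element-wise fusion followed by the answer classifier.
        return self.classifier(torch.tanh(q * v))   # (batch, num_answers)

model = SimpleVQA(vocab_size=10000, num_answers=1000)
logits = model(torch.randn(2, 2048), torch.randint(0, 10000, (2, 8)))
print(logits.shape)  # torch.Size([2, 1000])
```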