Welcome! In this repo, you can find the code for our paper published in WACV 2022.
The main idea of this metric is to use the classic image captioning metrics CIDEr or SPICE to better evaluate retrieval models.
The code for using our metric as an adaptive margin can be found at https://github.com/andrespmd/semantic_adaptive_margin. Roughly, we want to take into account the effect of the non-ground-truth items among the top-k retrieved items to better evaluate what our models do.
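To make the idea a bit more concrete, here is a rough sketch. It is an illustration only, not the exact formulation from the paper or from `eval.py`. The function name `topk_semantic_score`, its arguments, and the normalization by the best achievable top-k sum are all assumptions made for this example. `relevance` stands for a captioning-metric score (for instance CIDEr or SPICE) between each candidate and the query's ground-truth captions.

```python
def topk_semantic_score(model_scores, relevance, k):
    """Hypothetical helper: score the top-k items a model retrieves by
    their semantic relevance instead of only counting ground-truth hits.

    model_scores: similarity the retrieval model assigns to each candidate
    relevance:    captioning-metric score of each candidate (assumed >= 0)
    k:            number of retrieved items to consider
    """
    # Indices of the k candidates the model ranks highest.
    ranked = sorted(range(len(model_scores)),
                    key=lambda i: model_scores[i],
                    reverse=True)[:k]
    retrieved = sum(relevance[i] for i in ranked)
    # Best achievable sum: the k most relevant candidates overall.
    best = sum(sorted(relevance, reverse=True)[:k])
    return retrieved / best if best > 0 else 0.0


# Candidate 0 is the ground truth; candidates 2 and 3 are not ground truth
# but are still semantically close to it.
relevance = [1.0, 0.0, 0.5, 0.5]
model_a = [0.9, 0.8, 0.1, 0.3]  # retrieves the ground truth plus an unrelated item
model_b = [0.9, 0.1, 0.8, 0.3]  # retrieves the ground truth plus a related item

print(topk_semantic_score(model_a, relevance, k=2))
print(topk_semantic_score(model_b, relevance, k=2))
```

Both toy models retrieve the ground-truth item in their top 2, so a recall-style metric cannot tell them apart, while a score that also looks at the non-ground-truth items ranks `model_b` higher.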
Now, this repo is divided into two main sections. The former is for the curious, while the latter is for the pragmatists!
First off, we had to change the code of SPICE to save all the pairwise distances.
If you would like to compile from scratch or would like to see the changes we made to SPICE, please check the submodule!
Here is the link to download the compiled version:
SPICE.zip.
After downloading, unzip the file and run python get_stanford models
and then run
java -Xmx8G -jar spice-1.0.jar ./example.json
to see if it works. This should result in a file called spice_pairwise.csv.
Now, to obtain the pairwise distances of captions with CIDEr, we run:
python custom_cider.py --dataset [coco/f30k]
To obtain these distances we used MSCOCO and Flickr30k; here they are for you to download. We run these commands to precompute all the pairwise distances, which reduces the time it takes to run the NCS metric.
You are a pragmatist and just wanna use the code (I feel you!). Download the precomputed pairwise distances here.
As a format, we expect a similarity matrix saved as JSON, where each row corresponds to an image and each column to a sentence. For example, for Flickr30k, the matrix would have dimensions of 1000x5000: 1000 images, 5000 sentences. The choice of distance metric doesn't matter; you can use anything. As an example of the format, we provide the similarity matrices of some models.
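If you are producing your own matrix, the sketch below shows one way to write and read it using only the standard library. The helper names `save_similarity_matrix` and `load_similarity_matrix` are made up for this example, and the assumption that the JSON is a plain nested list (one inner list per image) should be checked against the provided example matrices before relying on it.

```python
import json


def save_similarity_matrix(sims, path):
    """Write an images x sentences matrix (list of lists) as JSON.

    Assumption: a plain nested list is the expected layout.
    """
    n_cols = len(sims[0])
    if any(len(row) != n_cols for row in sims):
        raise ValueError("all rows must have the same number of columns")
    with open(path, "w") as f:
        json.dump([[float(s) for s in row] for row in sims], f)


def load_similarity_matrix(path):
    """Read the matrix back and report its (images, sentences) shape."""
    with open(path) as f:
        sims = json.load(f)
    return sims, (len(sims), len(sims[0]))
```

For Flickr30k, the reported shape should come out as `(1000, 5000)`; a transposed matrix, with sentences as rows, would show up as `(5000, 1000)`.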
Finally, to get the results, just run:
python eval.py --dataset [coco/f30k] --metric_name [spice/cider] --model_path [ThePathToSimilarityMatrix]
There are more options available; you can read about them in the code.
To err is human.