A Tutorial on Image-Text Matching: A Preliminary Overview.
Given time constraints, we currently list only results of representative state-of-the-art methods below; the tutorial will be updated continuously.
(* indicates ensemble models; ^ indicates results of questionable authenticity)
**Flickr8K**

Method_name | Concise_note | Sentence retrieval R@1 | R@5 | R@10 | Image retrieval R@1 | R@5 | R@10 |
---|---|---|---|---|---|---|---|
DeViSE | RCNN | 4.8 | 16.5 | 27.3 | 5.9 | 20.1 | 29.6 |
SDT-RNN | AlexNet | 4.5 | 18.0 | 28.6 | 6.1 | 18.5 | 29.0 |
SDT-RNN | RCNN | 6.0 | 22.7 | 34.0 | 6.6 | 21.6 | 31.7 |
DeFrag | AlexNet | 5.9 | 19.2 | 27.3 | 5.2 | 17.6 | 26.5 |
DeFrag | RCNN | 12.6 | 32.9 | 44.0 | 9.7 | 29.6 | 42.5 |
m-RNN | AlexNet | 14.5 | 37.2 | 48.5 | 11.5 | 31.0 | 42.4 |
DVSA | DepTree | 14.8 | 37.9 | 50.0 | 11.6 | 31.4 | 43.8 |
DVSA | RCNN | 16.5 | 40.6 | 54.2 | 11.8 | 32.1 | 44.7 |
UVSE | AlexNet | 13.5 | 36.2 | 45.7 | 10.4 | 31.0 | 43.7 |
UVSE | VggNet | 18.0 | 40.9 | 55.0 | 12.5 | 37.0 | 51.5 |
NIC | GoogleNet | 20 | -- | 61 | 19 | -- | 64 |
m-CNN* | OverFeat | 14.9 | 35.9 | 49.0 | 11.8 | 34.5 | 48.0 |
m-CNN* | VggNet | 24.8 | 53.7 | 67.1 | 20.3 | 47.6 | 61.7 |
HM-LSTM | RCNN | 27.7 | -- | 68.6 | 24.4 | -- | 68.1 |
SPE | VggNet | 30.1 | 60.4 | 73.7 | 23.0 | 51.3 | 64.8 |
FV | GMM+HGLMM | 31.0 | 59.3 | 73.7 | 21.2 | 50.0 | 64.8 |
NAA | ResNet | 37.2 | 68.1 | 79.1 | 27.7 | 59.6 | 71.8 |
SCAN* | BUTD | 52.2 | 81.0 | 89.2 | 38.3 | 67.8 | 78.9 |
IMRAM | BUTD, Image | 48.5 | 78.1 | 85.3 | 32.0 | 61.4 | 73.9 |
IMRAM | BUTD, Text | 52.1 | 81.5 | 90.1 | 40.2 | 69.0 | 79.2 |
IMRAM | BUTD, Full | 54.7 | 84.2 | 91.0 | 41.0 | 69.2 | 79.9 |
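All tables report Recall@K (R@K): the percentage of queries whose ground-truth match appears among the top-K retrieved items. As a minimal illustration (our own sketch, not the evaluation code of any listed method; real Flickr/MSCOCO protocols additionally handle five captions per image by taking the best-ranked one), R@K can be computed from a similarity matrix whose correct pairs lie on the diagonal:

```python
import numpy as np

def recall_at_k(sim, k):
    """Percentage of queries whose ground-truth match (assumed to sit on the
    diagonal of `sim`) is ranked within the top-k retrieved items."""
    order = np.argsort(-sim, axis=1)  # per-row ranking by descending similarity
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(sim.shape[0])])
    return 100.0 * np.mean(ranks < k)

# Toy 4x4 similarity matrix: query i's correct item is column i.
sim = np.array([[0.9, 0.1, 0.2, 0.3],
                [0.2, 0.8, 0.1, 0.4],
                [0.7, 0.3, 0.5, 0.1],   # correct item ranked second here
                [0.1, 0.2, 0.3, 0.6]])
print(recall_at_k(sim, 1))  # 75.0
```

With this toy matrix, three of the four queries rank their match first (R@1 = 75.0) and all four have it in the top two (R@2 = 100.0).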
**Flickr30K**

Method_name | Concise_note | Sentence retrieval R@1 | R@5 | R@10 | Image retrieval R@1 | R@5 | R@10 |
---|---|---|---|---|---|---|---|
DeViSE | RCNN | 4.5 | 18.1 | 29.2 | 6.7 | 21.9 | 32.7 |
SDT-RNN | RCNN | 9.6 | 29.8 | 41.1 | 8.9 | 29.8 | 41.1 |
DeFrag | RCNN | 14.2 | 37.7 | 51.3 | 10.2 | 30.8 | 44.2 |
DeFrag | ftRCNN | 16.4 | 40.2 | 54.7 | 10.3 | 31.4 | 44.5 |
DCCA | AlexNet | 16.7 | 39.3 | 52.9 | 12.6 | 31.0 | 43.0 |
NIC | GoogleNet | 17 | -- | 56 | 17 | -- | 57 |
DVSA | DepTree | 20.0 | 46.6 | 59.4 | 15.0 | 36.5 | 48.2 |
DVSA | RCNN | 22.2 | 48.2 | 61.4 | 15.2 | 37.7 | 50.5 |
UVSE | AlexNet | 14.8 | 39.2 | 50.9 | 11.8 | 34.0 | 46.3 |
UVSE | VggNet | 23.0 | 50.7 | 62.9 | 16.8 | 42.0 | 56.5 |
LRCN | VggNet | 23.6 | 46.6 | 58.3 | 17.5 | 40.3 | 50.8 |
m-CNN* | OverFeat | 20.1 | 44.2 | 56.3 | 15.9 | 40.3 | 51.9 |
m-CNN* | VggNet | 33.6 | 64.1 | 74.9 | 26.2 | 56.3 | 69.6 |
m-RNN | AlexNet | 18.4 | 40.2 | 50.9 | 12.6 | 31.2 | 41.5 |
m-RNN | VggNet | 35.4 | 63.8 | 73.7 | 22.8 | 50.7 | 63.1 |
FV | GMM+HGLMM | 35.0 | 62.0 | 73.8 | 25.0 | 52.7 | 66.0 |
HM-LSTM | RCNN | 38.1 | -- | 76.5 | 27.7 | -- | 68.8 |
SPE | VggNet | 40.3 | 68.9 | 79.9 | 29.7 | 60.1 | 72.1 |
sm-LSTM | VggNet | 42.4 | 67.5 | 79.9 | 28.2 | 57.0 | 68.4 |
sm-LSTM* | VggNet | 42.5 | 71.9 | 81.5 | 30.2 | 60.4 | 72.3 |
CSE | ResNet | 44.6 | 74.3 | 83.8 | 36.9 | 69.1 | 79.6 |
RRF-Net | ResNet | 47.6 | 77.4 | 87.1 | 35.4 | 68.3 | 79.9 |
CMPL | MobileNet | 40.3 | 66.9 | 76.7 | 30.4 | 58.2 | 68.5 |
CMPL | ResNet | 49.6 | 76.8 | 86.1 | 37.3 | 65.7 | 75.5 |
2WayNet | VggNet | 49.8 | 67.5 | -- | 36.0 | 55.6 | -- |
VSE++ | VggNet | 41.3 | 69.1 | 77.9 | 31.4 | 60.0 | 71.2 |
VSE++ | ResNet | 52.9 | 80.5 | 87.2 | 39.6 | 70.1 | 79.5 |
TIMAM | ResNet, Bert | 53.1 | 78.8 | 87.6 | 42.6 | 71.6 | 81.9 |
DAN | VggNet | 41.4 | 73.5 | 82.5 | 31.8 | 61.7 | 72.5 |
DAN | ResNet | 55.0 | 81.8 | 89.0 | 39.4 | 69.2 | 79.1 |
NAA | ResNet | 55.1 | 80.3 | 89.6 | 39.4 | 68.8 | 79.9 |
SCO | VggNet | 44.2 | 74.1 | 83.6 | 32.8 | 64.3 | 74.9 |
SCO | ResNet | 55.5 | 82.0 | 89.3 | 41.1 | 70.5 | 80.1 |
Dual-Path | VggNet | 47.6 | 77.3 | 87.1 | 35.3 | 66.6 | 78.2 |
Dual-Path | ResNet | 55.6 | 81.9 | 89.5 | 39.1 | 69.2 | 80.9 |
CVSE++ | ResNet | 56.6 | 82.5 | 90.2 | 42.4 | 71.6 | 80.8 |
GXN | ResNet | 56.8 | -- | 89.6 | 41.5 | -- | 80.1 |
Align2Ground | BUTD | -- | -- | -- | 49.7 | 74.8 | 83.3 |
A3VSE | BUTD | 65.0 | 89.2 | 94.5 | 49.5 | 79.5 | 86.6 |
R-SCAN | BUTD, VrR-VG | 66.3 | 90.6 | 96.0 | 51.4 | 77.8 | 84.9 |
SAVE | ResNet | 67.2 | 88.3 | 94.2 | 49.8 | 78.7 | 86.2 |
SCAN | BUTD, t2i_AVE | 61.8 | 87.5 | 93.7 | 45.8 | 74.4 | 83.0 |
SCAN | BUTD, i2t_AVE | 67.9 | 89.0 | 94.4 | 43.9 | 74.2 | 82.8 |
SCAN* | BUTD, AVE+LSE | 67.4 | 90.3 | 95.8 | 48.6 | 77.7 | 85.2 |
BFAN | BUTD, prob | 65.5 | 89.4 | -- | 47.9 | 77.6 | -- |
BFAN | BUTD, equal | 64.5 | 89.7 | -- | 48.8 | 77.3 | -- |
BFAN* | BUTD | 68.1 | 91.4 | -- | 50.8 | 78.4 | -- |
CAMP | BUTD | 68.1 | 89.7 | 95.2 | 51.5 | 77.1 | 85.3 |
RDAN | BUTD | 68.1 | 91.0 | 95.9 | 54.1 | 80.9 | 87.2 |
Personality | ResNeXt, Transformer | 68.4 | 90.6 | 95.3 | -- | -- | -- |
CASC | ResNet | 68.5 | 90.6 | 95.9 | 50.2 | 78.3 | 86.3 |
GVSE* | BUTD | 68.5 | 90.9 | 95.5 | 50.6 | 79.8 | 87.6 |
HAL | SCAN_i2t | 68.6 | 89.9 | 94.7 | 46.0 | 74.0 | 82.3 |
OAN | BUTD | 68.6 | 93.0 | 96.0 | 53.3 | 80.1 | 87.1 |
SAEM | BUTD, Bert | 69.1 | 91.0 | 95.1 | 52.4 | 81.1 | 88.1 |
MPL | SCAN_i2t | 69.4 | 89.9 | 95.4 | 47.5 | 75.5 | 83.1 |
PFAN | BUTD, t2i | 66.0 | 89.6 | 94.3 | 49.6 | 77.0 | 84.2 |
PFAN | BUTD, i2t | 67.6 | 90.0 | 93.8 | 45.7 | 74.7 | 83.6 |
PFAN* | BUTD | 70.0 | 91.8 | 95.0 | 50.4 | 78.7 | 86.1 |
CAAN | BUTD | 70.1 | 91.6 | 97.2 | 52.8 | 79.0 | 87.9 |
DP-RNN | BUTD | 70.2 | 91.6 | 95.8 | 55.5 | 81.3 | 88.2 |
HOAD | BUTD | 70.8 | 92.7 | 96.0 | 59.5 | 85.6 | 91.0 |
HOAD | BUTD, +Dist | 70.8 | 92.7 | 96.0 | 60.9 | 86.1 | 91.0 |
GOT | SCAN_i2t | 70.9 | 92.8 | 95.5 | 50.7 | 78.7 | 86.2 |
VSRN* | BUTD | 71.3 | 90.6 | 96.0 | 54.7 | 81.8 | 88.2 |
SCG | VggNet, Prod | 57.2 | 85.1 | 92.1 | 40.1 | 69.5 | 79.5 |
SCG | VggNet, Gated | 71.8 | 90.8 | 94.8 | 49.3 | 76.4 | 85.6 |
SGM | BUTD | 71.8 | 91.7 | 95.5 | 53.5 | 79.6 | 86.5 |
IMRAM | BUTD, Image | 67.0 | 90.5 | 95.6 | 51.2 | 78.2 | 85.5 |
IMRAM | BUTD, Text | 68.8 | 91.6 | 96.0 | 53.0 | 79.0 | 87.1 |
IMRAM | BUTD, Full | 74.1 | 93.0 | 96.6 | 53.9 | 79.4 | 87.2 |
MMCA | BUTD, Bert | 74.2 | 92.8 | 96.4 | 54.8 | 81.4 | 87.8 |
SAN^ | VggNet | 67.0 | 88.0 | 94.6 | 51.4 | 77.2 | 85.2 |
SAN^ | ResNet | 75.5 | 92.6 | 96.2 | 60.1 | 84.7 | 90.6 |
GSMN | BUTD, sparse | 71.4 | 92.0 | 96.1 | 53.9 | 79.7 | 87.1 |
GSMN | BUTD, dense | 72.6 | 93.5 | 96.8 | 53.7 | 80.0 | 87.0 |
GSMN* | BUTD | 76.4 | 94.3 | 97.3 | 57.4 | 82.3 | 89.0 |
ADAPT | BUTD, i2t | 70.2 | 90.8 | 95.8 | 55.5 | 82.7 | 89.8 |
ADAPT | BUTD, t2i | 73.6 | 93.7 | 96.7 | 57.0 | 83.6 | 90.3 |
ADAPT* | BUTD | 76.6 | 95.4 | 97.6 | 60.7 | 86.6 | 92.0 |
SGRAF | BUTD, SAF | 73.7 | 93.3 | 96.3 | 56.1 | 81.5 | 88.0 |
SGRAF | BUTD, SGR | 75.2 | 93.3 | 96.6 | 56.2 | 81.0 | 86.5 |
SGRAF* | BUTD | 77.8 | 94.1 | 97.4 | 58.5 | 83.0 | 88.8 |
ACMM | BUTD | 80.0 | 95.5 | 98.2 | 50.2 | 76.8 | 84.7 |
ACMM* | BUTD | 85.2 | 96.7 | 98.4 | 53.8 | 79.8 | 86.8 |
**MSCOCO1K**

Method_name | Concise_note | Sentence retrieval R@1 | R@5 | R@10 | Image retrieval R@1 | R@5 | R@10 |
---|---|---|---|---|---|---|---|
STV | combine-skip | 33.8 | 67.7 | 82.1 | 25.9 | 60.0 | 74.6 |
DVSA | RCNN | 38.4 | 69.9 | 80.5 | 27.4 | 60.2 | 74.8 |
FV | GMM+HGLMM | 39.4 | 67.9 | 80.9 | 25.1 | 59.8 | 76.6 |
m-RNN | VggNet | 41.0 | 73.0 | 83.5 | 29.0 | 42.2 | 77.0 |
m-CNN* | VggNet | 42.8 | 73.1 | 84.1 | 32.6 | 68.6 | 82.8 |
UVSE | VggNet | 43.4 | 75.7 | 85.8 | 31.0 | 66.7 | 79.9 |
HM-LSTM | RCNN | 43.9 | -- | 87.8 | 36.1 | -- | 86.7 |
Order-emb | VggNet | 46.7 | -- | 88.9 | 37.9 | -- | 85.9 |
SPE | VggNet | 50.1 | 79.7 | 89.2 | 39.6 | 75.2 | 86.9 |
SEAM | VggNet | 50.7 | 81.4 | 90.9 | 40.3 | 75.7 | 87.4 |
sm-LSTM | VggNet | 52.4 | 81.7 | 90.8 | 38.6 | 73.4 | 84.6 |
sm-LSTM* | VggNet | 53.2 | 83.1 | 91.5 | 40.7 | 75.8 | 87.4 |
CMPL | MobileNet | 52.9 | 83.8 | 92.1 | 41.3 | 74.6 | 85.9 |
2WayNet | VggNet | 55.8 | 75.2 | -- | 39.7 | 63.3 | -- |
CMPM | ResNet | 56.1 | 86.3 | 92.9 | 44.6 | 78.8 | 89.0 |
CSE | ResNet | 56.3 | 84.4 | 92.2 | 45.7 | 81.2 | 90.6 |
RRF-Net | ResNet | 56.4 | 85.3 | 91.5 | 43.9 | 78.1 | 88.6 |
CHAIN-VSE | VggNet | 51.6 | 82.0 | 91.3 | 38.6 | 75.1 | 87.2 |
CHAIN-VSE | ResNet | 59.4 | 88.0 | 94.2 | 43.5 | 79.8 | 90.2 |
NAA | ResNet | 61.3 | 87.9 | 95.4 | 47.0 | 80.8 | 90.1 |
VSE++ | VggNet | 57.2 | 86.0 | 93.3 | 45.9 | 79.4 | 89.1 |
VSE++ | ResNet | 64.6 | 90.0 | 95.7 | 52.0 | 84.3 | 92.0 |
Dual-Path | VggNet | 59.4 | 86.2 | 92.9 | 41.6 | 76.3 | 87.5 |
Dual-Path | ResNet | 65.6 | 89.8 | 95.5 | 47.1 | 79.9 | 90.0 |
Personality | ResNeXt, Transformer | 67.3 | 91.7 | 96.5 | -- | -- | -- |
Align2Ground | BUTD | -- | -- | -- | 56.6 | 84.9 | 92.8 |
GXN | ResNet | 68.5 | -- | 97.9 | 56.6 | -- | 94.5 |
CVSE++ | ResNet | 69.1 | 92.2 | 96.1 | 55.6 | 86.7 | 93.8 |
PVSE | ResNet | 69.2 | 91.6 | 96.6 | 55.2 | 86.5 | 93.7 |
SCO | VggNet | 66.6 | 91.8 | 96.6 | 55.5 | 86.6 | 93.8 |
SCO | ResNet | 69.9 | 92.9 | 97.5 | 56.7 | 87.5 | 94.8 |
R-SCAN | BUTD, VrR-VG | 70.3 | 94.5 | 98.1 | 57.6 | 87.3 | 93.7 |
SAVE | ResNet | 70.8 | 93.2 | 97.6 | 56.9 | 87.6 | 94.4 |
MPL | SCAN_i2t | 71.1 | 93.7 | 98.2 | 56.8 | 86.7 | 93.0 |
SAEM | BUTD, Bert | 71.2 | 94.1 | 97.7 | 57.8 | 88.6 | 94.9 |
OAN | BUTD | 71.7 | 96.4 | 99.3 | 60.2 | 88.6 | 94.5 |
GVSE* | BUTD | 72.2 | 94.1 | 98.1 | 60.5 | 89.4 | 95.8 |
CAMP | BUTD | 72.3 | 94.8 | 98.3 | 58.5 | 87.9 | 95.0 |
CASC | ResNet | 72.3 | 96.0 | 99.0 | 58.9 | 89.8 | 96.0 |
SCAN | BUTD, t2i_AVE | 70.9 | 94.5 | 97.8 | 56.4 | 87.0 | 93.9 |
SCAN | BUTD, i2t_AVE | 69.2 | 93.2 | 97.5 | 54.4 | 86.0 | 93.6 |
SCAN* | BUTD, LSE+AVE | 72.7 | 94.8 | 98.4 | 58.8 | 88.4 | 94.8 |
SGM | BUTD | 73.4 | 93.8 | 97.8 | 57.5 | 87.3 | 94.3 |
ParNet | BUTD, NP | 72.8 | 94.9 | 97.9 | 57.9 | 87.4 | 94.0 |
ParNet | BUTD, P | 73.5 | 94.5 | 98.3 | 58.3 | 88.2 | 94.1 |
RDAN | BUTD | 74.6 | 96.2 | 98.7 | 61.6 | 89.2 | 94.7 |
MMCA | BUTD, Bert | 74.8 | 95.6 | 97.7 | 61.6 | 89.8 | 95.2 |
BFAN | BUTD, prob | 73.0 | 94.8 | -- | 58.0 | 87.6 | -- |
BFAN | BUTD, equal | 73.7 | 94.9 | -- | 58.3 | 87.5 | -- |
BFAN* | BUTD | 74.9 | 95.2 | -- | 59.4 | 88.4 | -- |
DP-RNN | BUTD | 75.3 | 95.8 | 98.6 | 62.5 | 89.7 | 95.1 |
CAAN | BUTD | 75.5 | 95.4 | 98.5 | 61.3 | 89.7 | 95.2 |
VSRN* | BUTD | 76.2 | 94.8 | 98.2 | 62.8 | 89.7 | 95.1 |
ADAPT | BUTD, i2t | 74.5 | 94.2 | 97.9 | 62.0 | 90.4 | 95.5 |
ADAPT | BUTD, t2i | 75.3 | 95.1 | 98.4 | 63.3 | 90.0 | 95.5 |
ADAPT* | BUTD | 76.5 | 95.6 | 98.9 | 62.2 | 90.5 | 96.0 |
PFAN | BUTD, t2i | 75.8 | 95.9 | 99.0 | 61.0 | 89.1 | 95.1 |
PFAN | BUTD, i2t | 70.7 | 94.1 | 97.8 | 53.0 | 84.5 | 92.6 |
PFAN* | BUTD | 76.5 | 96.3 | 99.0 | 61.6 | 89.6 | 95.2 |
SCG | VggNet, Prod | 73.4 | 94.8 | 97.6 | 56.3 | 85.6 | 93.5 |
SCG | VggNet, Gated | 76.6 | 96.3 | 99.2 | 61.4 | 88.9 | 95.1 |
IMRAM | BUTD, Image | 76.1 | 95.3 | 98.2 | 61.0 | 88.6 | 94.5 |
IMRAM | BUTD, Text | 74.0 | 95.6 | 98.4 | 60.6 | 88.9 | 94.6 |
IMRAM | BUTD, Full | 76.7 | 95.6 | 98.5 | 61.7 | 89.1 | 95.0 |
HOAD | BUTD | 77.0 | 96.1 | 98.7 | 65.1 | 93.1 | 97.9 |
HOAD | BUTD, +Dist | 77.8 | 96.1 | 98.7 | 66.2 | 93.0 | 97.9 |
HAL | SCAN_i2t | 78.3 | 96.3 | 98.5 | 60.1 | 86.7 | 92.8 |
GSMN | BUTD, sparse | 76.1 | 95.6 | 98.3 | 60.4 | 88.7 | 95.0 |
GSMN | BUTD, dense | 74.7 | 95.3 | 98.2 | 60.3 | 88.5 | 94.6 |
GSMN* | BUTD | 78.4 | 96.4 | 98.6 | 63.3 | 90.1 | 95.7 |
SGRAF | BUTD, SAF | 76.1 | 95.4 | 98.3 | 61.8 | 89.4 | 95.3 |
SGRAF | BUTD, SGR | 78.0 | 95.8 | 98.2 | 61.4 | 89.3 | 95.4 |
SGRAF* | BUTD | 79.6 | 96.2 | 98.5 | 63.2 | 90.7 | 96.1 |
ACMM | BUTD | 81.9 | 98.0 | 99.3 | 58.2 | 87.3 | 93.9 |
ACMM* | BUTD | 84.1 | 97.8 | 99.4 | 60.7 | 88.7 | 94.9 |
SAN^ | VggNet | 74.9 | 94.9 | 98.2 | 60.8 | 90.3 | 95.7 |
SAN^ | ResNet | 85.4 | 97.5 | 99.0 | 69.1 | 93.4 | 97.2 |
**MSCOCO5K**

Method_name | Concise_note | Sentence retrieval R@1 | R@5 | R@10 | Image retrieval R@1 | R@5 | R@10 |
---|---|---|---|---|---|---|---|
DVSA | RCNN | 16.5 | 39.2 | 52.0 | 10.7 | 29.6 | 42.2 |
FV | GMM+HGLMM | 17.3 | 39.0 | 50.2 | 10.8 | 28.3 | 40.1 |
Order-emb | VggNet | 23.3 | -- | 65.0 | 18.0 | -- | 57.6 |
CSE | ResNet | 27.9 | 57.1 | 70.4 | 22.2 | 50.2 | 64.4 |
CMPL | MobileNet | 24.6 | 52.3 | 66.4 | 19.1 | 44.6 | 58.4 |
CMPM | ResNet | 31.1 | 60.7 | 73.9 | 22.9 | 50.2 | 63.8 |
Dual-Path | VggNet | 35.5 | 63.2 | 75.6 | 21.0 | 47.5 | 60.9 |
Dual-Path | ResNet | 41.2 | 70.5 | 81.1 | 25.3 | 53.4 | 66.4 |
VSE++ | VggNet | 32.9 | 61.7 | 74.7 | 24.1 | 52.8 | 66.2 |
VSE++ | ResNet | 41.3 | 71.1 | 81.2 | 30.3 | 59.4 | 72.4 |
GXN | ResNet | 42.0 | -- | 84.7 | 31.7 | -- | 74.6 |
SCO | VggNet | 40.2 | 70.1 | 81.3 | 31.3 | 61.5 | 73.9 |
SCO | ResNet | 42.8 | 72.3 | 83.0 | 33.1 | 62.9 | 75.5 |
CVSE++ | ResNet | 43.2 | 73.5 | 84.1 | 32.4 | 62.2 | 74.6 |
PVSE | ResNet | 45.2 | 74.3 | 84.5 | 32.4 | 63.0 | 75.0 |
R-SCAN | BUTD, VrR-VG | 45.4 | 77.9 | 87.9 | 36.2 | 65.5 | 76.7 |
SAVE | ResNet | 46.7 | 76.3 | 86.1 | 34.0 | 64.8 | 77.0 |
MPL | SCAN_i2t | 46.9 | 77.7 | 87.6 | 34.4 | 64.2 | 75.9 |
CASC | ResNet | 47.2 | 78.3 | 87.4 | 34.7 | 64.8 | 76.8 |
OAN | BUTD | 47.8 | 81.2 | 90.4 | 37.0 | 66.6 | 78.0 |
A3VSE | BUTD | 49.3 | 81.1 | 90.2 | 39.0 | 68.0 | 80.1 |
GVSE* | BUTD | 49.9 | 77.4 | 87.6 | 38.4 | 68.5 | 79.7 |
SGM | BUTD | 50.0 | 79.3 | 87.9 | 35.3 | 64.9 | 76.5 |
CAMP | BUTD | 50.1 | 82.1 | 89.7 | 39.0 | 68.9 | 80.2 |
SCAN | BUTD, i2t_LSE | 46.4 | 77.4 | 87.2 | 34.4 | 63.7 | 75.7 |
SCAN* | BUTD, AVE+LSE | 50.4 | 82.2 | 90.0 | 38.6 | 69.3 | 80.4 |
GOT | SCAN_i2t | 50.5 | 80.2 | 89.8 | 38.1 | 66.8 | 78.5 |
HOAD | BUTD | 51.2 | 81.7 | 89.1 | 39.4 | 72.5 | 84.1 |
HOAD | BUTD, +Dist | 51.4 | 81.8 | 89.1 | 40.5 | 73.5 | 84.1 |
CAAN | BUTD | 52.5 | 83.3 | 90.9 | 41.2 | 70.3 | 82.9 |
VSRN* | BUTD | 53.0 | 81.1 | 89.4 | 40.5 | 70.6 | 81.1 |
IMRAM | BUTD, Image | 53.2 | 82.5 | 90.4 | 38.9 | 68.5 | 79.2 |
IMRAM | BUTD, Text | 52.0 | 81.8 | 90.1 | 38.6 | 68.1 | 79.1 |
IMRAM | BUTD, Full | 53.7 | 83.2 | 91.0 | 39.7 | 69.1 | 79.8 |
MMCA | BUTD, Bert | 54.0 | 82.5 | 90.7 | 38.7 | 69.7 | 80.8 |
SCG | VggNet, Prod | 49.9 | 78.9 | 88.1 | 33.2 | 62.4 | 74.7 |
SCG | VggNet, Gated | 56.6 | 84.5 | 92.0 | 39.2 | 68.0 | 81.3 |
SGRAF | BUTD, SAF | 53.3 | 82.3 | 90.1 | 39.8 | 69.0 | 80.2 |
SGRAF | BUTD, SGR | 56.9 | 83.2 | 90.5 | 40.2 | 69.0 | 79.8 |
SGRAF* | BUTD | 57.8 | 84.9 | 91.6 | 41.9 | 70.7 | 81.3 |
SAN^ | ResNet | 65.4 | 89.4 | 94.8 | 46.2 | 77.4 | 86.6 |
ACMM | BUTD | 63.5 | 88.0 | 93.6 | 36.7 | 65.1 | 76.7 |
ACMM* | BUTD | 66.9 | 89.6 | 94.9 | 39.5 | 69.6 | 81.1 |
**CUHK-PEDES**

Method_name | Concise_note | R@1 | R@5 | R@10 |
---|---|---|---|---|
LSTM-Q+I | VggNet | 17.19 | -- | 57.82 |
GNA-RNN | VggNet | 19.05 | -- | 53.64 |
IATV | VggNet | 25.94 | -- | 60.48 |
PWM-ATH | VggNet | 27.14 | 49.45 | 61.02 |
GLA | ResNet | 43.58 | 66.93 | 76.26 |
Dual-Path | VggNet | 32.15 | 54.42 | 64.30 |
Dual-Path | ResNet | 44.40 | 66.26 | 75.07 |
CMPM | MobileNet | 44.02 | -- | 77.00 |
CMPL | MobileNet | 49.37 | -- | 79.27 |
PMA | VggNet | 47.02 | 68.54 | 78.06 |
PMA | ResNet | 53.81 | 73.54 | 81.23 |
TIMAM | ResNet, Bert | 54.51 | 77.56 | 84.78 |
**CUB / Flowers**

Method_name | Concise_note | CUB Image-to-Text R@1 | CUB Text-to-Image AP@50 | Flowers Image-to-Text R@1 | Flowers Text-to-Image AP@50 |
---|---|---|---|---|---|
FV | GMM+HGLMM | 36.5 | 35.6 | 54.8 | 52.8 |
Word2Vec | -- | 38.6 | 33.5 | 54.2 | 52.1 |
Word-NN | CNN | 51.0 | 43.3 | 60.7 | 56.3 |
Word-NN | CNN-RNN | 56.8 | 48.7 | 65.6 | 59.6 |
IATV | Triplet | 52.5 | 52.4 | 64.3 | 64.9 |
IATV | VggNet | 61.5 | 57.6 | 68.4 | 70.1 |
CMPM | MobileNet | 62.1 | 64.6 | 66.1 | 67.7 |
CMPL | MobileNet | 64.3 | 67.9 | 68.9 | 69.7 |
TIMAM | ResNet, Bert | 67.7 | 70.3 | 70.6 | 73.7 |
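CUB and Flowers are class-level benchmarks: a retrieved item counts as relevant if it shares the query's class. One common definition of AP@50 averages the precision at each relevant hit within the top 50 results. A minimal sketch of that definition (our own illustration, not the protocol code of any listed method):

```python
def ap_at_k(retrieved_labels, query_label, k=50):
    """Average precision over the top-k results; an item is relevant when it
    shares the query's class label (class-level CUB/Flowers protocol)."""
    hits, precisions = 0, []
    for rank, label in enumerate(retrieved_labels[:k], start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)  # precision at this relevant hit
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevant items at ranks 1, 3, 4 -> mean of 1/1, 2/3, 3/4
print(round(ap_at_k(["a", "b", "a", "a"], "a"), 4))  # 0.8056
```

Note that AP conventions differ in the normalizer (number of hits within k versus total relevant items); the variant above normalizes by hits found within the top k.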
(NIPS2013_DeViSE) DeViSE: A Deep Visual-Semantic Embedding Model.
Andrea Frome, Greg S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, Tomas Mikolov.
[paper]
(TACL2014_SDT-RNN) Grounded Compositional Semantics for Finding and Describing Images with Sentences.
Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, Andrew Y. Ng.
[paper]
(NIPSws2014_UVSE) Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models.
Ryan Kiros, Ruslan Salakhutdinov, Richard S. Zemel.
[paper]
[code]
[demo]
(NIPS2014_DeFrag) Deep fragment embeddings for bidirectional image sentence mapping.
Andrej Karpathy, Armand Joulin, Li Fei-Fei.
[paper]
(ICCV2015_m-CNN) Multimodal Convolutional Neural Networks for Matching Image and Sentence.
Lin Ma, Zhengdong Lu, Lifeng Shang, Hang Li.
[paper]
(CVPR2015_DCCA) Deep Correlation for Matching Images and Text.
Fei Yan, Krystian Mikolajczyk.
[paper]
(CVPR2015_FV) Associating Neural Word Embeddings with Deep Image Representations using Fisher Vectors.
Benjamin Klein, Guy Lev, Gil Sadeh, Lior Wolf.
[paper]
(CVPR2015_DVSA) Deep Visual-Semantic Alignments for Generating Image Descriptions.
Andrej Karpathy, Li Fei-Fei.
[paper]
(NIPS2015_STV) Skip-thought Vectors.
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler.
[paper]
(CVPR2016_SPE) Learning Deep Structure-Preserving Image-Text Embeddings.
Liwei Wang, Yin Li, Svetlana Lazebnik.
[paper]
(ICCV2017_HM-LSTM) Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding.
Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, Gang Hua.
[paper]
(ICCV2017_RRF-Net) Learning a Recurrent Residual Fusion Network for Multimodal Matching.
Yu Liu, Yanming Guo, Erwin M. Bakker, Michael S. Lew.
[paper]
(CVPR2017_2WayNet) Linking Image and Text with 2-Way Nets.
Aviv Eisenschtat, Lior Wolf.
[paper]
(WACV2018_SEAM) Fast Self-Attentive Multimodal Retrieval.
Jônatas Wehrmann, Maurício Armani Lopes, Martin D More, Rodrigo C. Barros.
[paper]
[code]
(CVPR2018_CSE) End-to-end Convolutional Semantic Embeddings.
Quanzeng You, Zhengyou Zhang, Jiebo Luo.
[paper]
(CVPR2018_CHAIN-VSE) Bidirectional Retrieval Made Simple.
Jonatas Wehrmann, Rodrigo C. Barros.
[paper]
[code]
(CVPR2018_SCO) Learning Semantic Concepts and Order for Image and Sentence Matching.
Yan Huang, Qi Wu, Liang Wang.
[paper]
(MM2019_SAEM) Learning Fragment Self-Attention Embeddings for Image-Text Matching.
Yiling Wu, Shuhui Wang, Guoli Song, Qingming Huang.
[paper]
[code]
(ICCV2019_VSRN) Visual Semantic Reasoning for Image-Text Matching.
Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, Yun Fu.
[paper]
[code]
(CVPR2019_Personality) Engaging Image Captioning via Personality.
Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, Jason Weston.
[paper]
(CVPR2019_PVSE) Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval.
Yale Song, Mohammad Soleymani.
[paper]
[code]
(TOMM2020_NIS) Upgrading the Newsroom: An Automated Image Selection System for News Articles.
Fangyu Liu, Rémi Lebret, Didier Orel, Philippe Sordet, Karl Aberer.
[paper]
[slides]
[demo]
(WACV2020_SGM) Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval.
Sijin Wang, Ruiping Wang, Ziwei Yao, Shiguang Shan, Xilin Chen.
[paper]
(arXiv2014_NIC) Show and Tell: A Neural Image Caption Generator.
Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan.
[paper]
(ICLR2015_m-RNN) Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN).
Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille.
[paper]
[code]
(CVPR2015_LRCN) Long-term Recurrent Convolutional Networks for Visual Recognition and Description.
Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, Trevor Darrell.
[paper]
(CVPR2017_DAN) Dual Attention Networks for Multimodal Reasoning and Matching.
Hyeonseob Nam, Jung-Woo Ha, Jeonghee Kim.
[paper]
(CVPR2017_sm-LSTM) Instance-aware Image and Sentence Matching with Selective Multimodal LSTM.
Yan Huang, Wei Wang, Liang Wang.
[paper]
(ECCV2018_CITE) Conditional Image-Text Embedding Networks.
Bryan A. Plummer, Paige Kordas, M. Hadi Kiapour, Shuai Zheng, Robinson Piramuthu, Svetlana Lazebnik.
[paper]
(ECCV2018_SCAN) Stacked Cross Attention for Image-Text Matching.
Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, Xiaodong He.
[paper]
[code]
(arXiv2019_R-SCAN) Learning Visual Relation Priors for Image-Text Matching and Image Captioning with Neural Scene Graph Generators.
Kuang-Huei Lee, Hamid Palang, Xi Chen, Houdong Hu, Jianfeng Gao.
[paper]
(arXiv2019_ParNet) ParNet: Position-aware Aggregated Relation Network for Image-Text matching.
Yaxian Xia, Lun Huang, Wenmin Wang, Xiaoyong Wei, Jie Chen.
[paper]
(ACML2019_SAVE) Multi-Scale Visual Semantics Aggregation with Self-Attention for End-to-End Image-Text Matching.
Zhuobin Zheng, Youcheng Ben, Chun Yuan.
[paper]
(ICMR2019_OAN) Improving What Cross-Modal Retrieval Models Learn through Object-Oriented Inter- and Intra-Modal Attention Networks.
Po-Yao Huang, Vaibhav, Xiaojun Chang, Alexander Georg Hauptmann.
[paper]
[code]
(MM2019_BFAN) Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching.
Chunxiao Liu, Zhendong Mao, An-An Liu, Tianzhu Zhang, Bin Wang, Yongdong Zhang.
[paper]
[code]
(MM2019_MTFN) Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking.
Tan Wang, Xing Xu, Yang Yang, Alan Hanjalic, Heng Tao Shen, Jingkuan Song.
[paper]
[code]
(IJCAI2019_RDAN) Multi-Level Visual-Semantic Alignments with Relation-Wise Dual Attention Network for Image and Text Matching.
Zhibin Hu, Yongsheng Luo, Jiong Lin, Yan Yan, Jian Chen.
[paper]
(IJCAI2019_PFAN) Position Focused Attention Network for Image-Text Matching.
Yaxiong Wang, Hao Yang, Xueming Qian, Lin Ma, Jing Lu, Biao Li, Xin Fan.
[paper]
[code]
(ICCV2019_CAMP) CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval.
Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng, Junjie Yan, Xiaogang Wang, Jing Shao.
[paper]
[code]
(ICCV2019_SAN) Saliency-Guided Attention Network for Image-Sentence Matching.
Zhong Ji, Haoran Wang, Jungong Han, Yanwei Pang.
[paper]
[code]
(TNNLS2020_CASC) Cross-Modal Attention With Semantic Consistence for Image-Text Matching.
Xing Xu, Tan Wang, Yang Yang, Lin Zuo, Fumin Shen, Heng Tao Shen.
[paper]
[code]
(AAAI2020_DP-RNN) Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching.
Tianlang Chen, Jiebo Luo.
[paper]
(AAAI2020_ADAPT) Adaptive Cross-modal Embeddings for Image-Text Alignment.
Jonatas Wehrmann, Camila Kolling, Rodrigo C Barros.
[paper]
[code]
(CVPR2020_CAAN) Context-Aware Attention Network for Image-Text Retrieval.
Qi Zhang, Zhen Lei, Zhaoxiang Zhang, Stan Z. Li.
[paper]
(CVPR2020_MMCA) Multi-Modality Cross Attention Network for Image and Sentence Matching.
Xi Wei, Tianzhu Zhang, Yan Li, Yongdong Zhang, Feng Wu.
[paper]
(CVPR2020_IMRAM) IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval.
Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, Jungong Han.
[paper]
[code]
(ICLR2016_Order-emb) Order-Embeddings of Images and Language.
Ivan Vendrov, Ryan Kiros, Sanja Fidler, Raquel Urtasun.
[paper]
(CVPR2020_HOAD) Visual-Semantic Matching by Exploring High-Order Attention and Distraction.
Yongzhi Li, Duo Zhang, Yadong Mu.
[paper]
(CVPR2020_GSMN) Graph Structured Network for Image-Text Matching.
Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, Hongtao Xie, Bin Wang, Yongdong Zhang.
[paper]
[code]
(ICML2020_GOT) Graph Optimal Transport for Cross-Domain Alignment.
Liqun Chen, Zhe Gan, Yu Cheng, Linjie Li, Lawrence Carin, Jingjing Liu.
[paper]
[code]
(AAAI2021_SGRAF) Similarity Reasoning and Filtration for Image-Text Matching.
Haiwen Diao, Ying Zhang, Lin Ma, Huchuan Lu.
[paper]
[code]
(TPAMI2018_TBNN) Learning Two-Branch Neural Networks for Image-Text Matching Tasks.
Liwei Wang, Yin Li, Jing Huang, Svetlana Lazebnik.
[paper]
[code]
(BMVC2018_VSE++) VSE++: Improving Visual-Semantic Embeddings with Hard Negatives.
Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler.
[paper]
[code]
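The key idea of VSE++ is a max-of-hinges triplet loss that penalizes only the hardest negative in the mini-batch for each direction. A minimal NumPy sketch of that loss (illustrative only; refer to the official code linked above for the real implementation):

```python
import numpy as np

def vsepp_loss(sim, margin=0.2):
    """Hardest-negative (max-of-hinges) triplet loss in the style of VSE++.
    `sim[i, j]` scores image i against caption j; positives lie on the diagonal."""
    n = sim.shape[0]
    pos = np.diag(sim)
    off = ~np.eye(n, dtype=bool)  # mask selecting the negatives
    total = 0.0
    for i in range(n):
        hardest_caption = sim[i][off[i]].max()      # hardest negative caption for image i
        hardest_image = sim[:, i][off[:, i]].max()  # hardest negative image for caption i
        total += max(0.0, margin + hardest_caption - pos[i])
        total += max(0.0, margin + hardest_image - pos[i])
    return total / n

print(round(vsepp_loss(np.array([[0.5, 0.4], [0.45, 0.5]])), 6))  # 0.25
```

Compared with summing hinges over all negatives, focusing on the single hardest one is what the paper credits for its large R@1 gains.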
(ECCV2018_CMPL) Deep Cross-Modal Projection Learning for Image-Text Matching.
Ying Zhang, Huchuan Lu.
[paper]
[code]
(ACLws2019_kNN-loss) A Strong and Robust Baseline for Text-Image Matching.
Fangyu Liu, Rongtian Ye.
[paper]
(ICASSP2019_NAA) A Neighbor-aware Approach for Image-text Matching.
Chunxiao Liu, Zhendong Mao, Wenyu Zang, Bin Wang.
[paper]
(CVPR2019_PVSE) Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval.
Yale Song, Mohammad Soleymani.
[paper]
[code]
(TOMM2020_Dual-Path) Dual-path Convolutional Image-Text Embeddings with Instance Loss.
Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, Mingliang Xu, YiDong Shen.
[paper]
[code]
(AAAI2020_HAL) HAL: Improved Text-Image Matching by Mitigating Visual Semantic Hubs.
Fangyu Liu, Rongtian Ye, Xun Wang, Shuaipeng Li.
[paper]
[code]
(AAAI2020_CVSE++) Ladder Loss for Coherent Visual-Semantic Embedding.
Mo Zhou, Zhenxing Niu, Le Wang, Zhanning Gao, Qilin Zhang, Gang Hua.
[paper]
(CVPR2020_MPL) Universal Weighting Metric Learning for Cross-Modal Matching.
Jiwei Wei, Xing Xu, Yang Yang, Yanli Ji, Zheng Wang, Heng Tao Shen.
[paper]
(ECCV2018_VSA-AE-MMD) Visual-Semantic Alignment Across Domains Using a Semi-Supervised Approach.
Angelo Carraggi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara.
[paper]
(MM2019_A3VSE) Annotation Efficient Cross-Modal Retrieval with Adversarial Attentive Alignment.
Po-Yao Huang, Guoliang Kang, Wenhe Liu, Xiaojun Chang, Alexander G Hauptmann.
[paper]
(CVPR2017_DEM) Learning a Deep Embedding Model for Zero-Shot Learning.
Li Zhang, Tao Xiang, Shaogang Gong.
[paper]
[code]
(AAAI2019_GVSE) Few-shot image and sentence matching via gated visual-semantic matching.
Yan Huang, Yang Long, Liang Wang.
[paper]
(ICCV2019_ACMM) ACMM: Aligned Cross-Modal Memory for Few-Shot Image and Sentence Matching.
Yan Huang, Liang Wang.
[paper]
(MM2017_ACMR) Adversarial Cross-Modal Retrieval.
Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, Heng Tao Shen.
[paper]
[code]
(COLING2018_CAS) Learning Visually-Grounded Semantics from Contrastive Adversarial Samples.
Haoyue Shi, Jiayuan Mao, Tete Xiao, Yuning Jiang, Jian Sun.
[paper]
[code]
(CVPR2018_GXN) Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models.
Jiuxiang Gu, Jianfei Cai, Shafiq Joty, Li Niu, Gang Wang.
[paper]
(ICCV2019_TIMAM) Adversarial Representation Learning for Text-to-Image Matching.
Nikolaos Sarafianos, Xiang Xu, Ioannis A. Kakadiaris.
[paper]
(CVPR2019_UniVSE) Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations.
Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, Wei-Ying Ma.
[paper]
(IJCAI2019_SCG) Knowledge Aware Semantic Concept Expansion for Image-Text Matching.
Botian Shi, Lei Ji, Pan Lu, Zhendong Niu, Nan Duan.
[paper]
(ICCV2015_LSTM-Q+I) VQA: Visual question answering.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh.
[paper]
(CVPR2016_Word-NN) Learning Deep Representations of Fine-grained Visual Descriptions.
Scott Reed, Zeynep Akata, Bernt Schiele, Honglak Lee.
[paper]
(CVPR2017_GNA-RNN) Person search with natural language description.
Shuang Li, Tong Xiao, Hongsheng Li, Bolei Zhou, Dayu Yue, Xiaogang Wang.
[paper]
[code]
(ICCV2017_IATV) Identity-aware textual-visual matching with latent co-attention.
Shuang Li, Tong Xiao, Hongsheng Li, Wei Yang, Xiaogang Wang.
[paper]
(WACV2018_PWM-ATH) Improving text-based person search by spatial matching and adaptive threshold.
Tianlang Chen, Chenliang Xu, Jiebo Luo.
[paper]
(ECCV2018_GLA) Improving deep visual representation for person re-identification by global and local image-language association.
Dapeng Chen, Hongsheng Li, Xihui Liu, Yantao Shen, Jing Shao, Zejian Yuan, Xiaogang Wang.
[paper]
(CVPR2019_DSCMR) Deep Supervised Cross-modal Retrieval.
Liangli Zhen, Peng Hu, Xu Wang, Dezhong Peng.
[paper]
[code]
(AAAI2020_PMA) Pose-Guided Multi-Granularity Attention Network for Text-Based Person Search.
Ya Jing, Chenyang Si, Junbo Wang, Wei Wang, Liang Wang, Tieniu Tan.
[paper]
(Machine Learning 2010) Large scale image annotation: learning to rank with joint word-image embeddings.
Jason Weston, Samy Bengio, Nicolas Usunier.
[paper]
(NIPS2013_Word2Vec) Distributed Representations of Words and Phrases and their Compositionality.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean.
[paper]
(CVPR2017_DVSQ) Deep Visual-Semantic Quantization for Efficient Image Retrieval.
Yue Cao, Mingsheng Long, Jianmin Wang, Shichen Liu.
[paper]
(ACL2018_ILU) Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search.
Jamie Kiros, William Chan, Geoffrey Hinton.
[paper]
(AAAI2018_VSE-ens) VSE-ens: Visual-Semantic Embeddings with Efficient Negative Sampling.
Guibing Guo, Songlin Zhai, Fajie Yuan, Yuan Liu, Xingwei Wang.
[paper]
(ECCV2018_HTG) An Adversarial Approach to Hard Triplet Generation.
Yiru Zhao, Zhongming Jin, Guo-jun Qi, Hongtao Lu, Xian-sheng Hua.
[paper]
(ECCV2018_WebNet) CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images.
Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R. Scott, Dinglong Huang.
[paper]
[code]
(CVPR2018_BUTD) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering.
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang.
[paper]
[code]
(CVPR2018_DML) Deep Mutual Learning.
Ying Zhang, Tao Xiang, Timothy M. Hospedales, Huchuan Lu.
[paper]
[code]
(EMNLP2019_GMMR) Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations.
Po-Yao Huang, Xiaojun Chang, Alexander Hauptmann.
[paper]
(EMNLP2019_MIMSD) Unsupervised Discovery of Multimodal Links in Multi-Image, Multi-Sentence Documents.
Jack Hessel, Lillian Lee, David Mimno.
[paper]
[code]
(ICCV2019_DRNet) Fashion Retrieval via Graph Reasoning Networks on a Similarity Pyramid.
Zhanghui Kuang, Yiming Gao, Guanbin Li, Ping Luo, Yimin Chen, Liang Lin, Wayne Zhang.
[paper]
(ICCV2019_Align2Ground) Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment.
Samyak Datta, Karan Sikka, Anirban Roy, Karuna Ahuja, Devi Parikh, Ajay Divakaran.
[paper]
(CVPR2019_TIRG) Composing Text and Image for Image Retrieval - An Empirical Odyssey.
Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, James Hays.
[paper]
(SIGIR2019_PAICM) Prototype-guided Attribute-wise Interpretable Scheme for Clothing Matching.
Xianjing Han, Xuemeng Song, Jianhua Yin, Yinglong Wang, Liqiang Nie.
[paper]
(SIGIR2019_NCR) Neural Compatibility Ranking for Text-based Fashion Matching.
Suthee Chaidaroon, Mix Xie, Yi Fang, Alessandro Magnani.
[paper]
(ECCV2020_InfoNCE) Contrastive Learning for Weakly Supervised Phrase Grounding.
Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, Derek Hoiem.
[paper]
[code]
(ECCV2020) Adaptive Offline Quintuplet Loss for Image-Text Matching.
Tianlang Chen, Jiajun Deng, Jiebo Luo.
[paper]
[code]
(ECCV2020) Learning Joint Visual Semantic Matching Embeddings for Language-guided Retrieval.
Yanbei Chen, Loris Bazzani.
[paper]
(ECCV2020) Consensus-Aware Visual-Semantic Embedding for Image-Text Matching.
Haoran Wang, Ying Zhang, Zhong Ji, Yanwei Pang, Lin Ma.
[paper]
[code]
(ECCV2020) Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval.
Christopher Thomas, Adriana Kovashka.
[paper]
[code]
(COLING2020) Probing Multimodal Embeddings for Linguistic Properties: the Visual-Semantic Case.
Adam Dahlgren Lindström, Suna Bensch, Johanna Björklund, Frank Drewes.
[paper]
[code]