Cross-modal_Retrieval_Tutorial

A tutorial on image-text matching for preliminary insight.
Due to time constraints, some recent state-of-the-art methods are temporarily listed under Posted in. The tutorial will be updated continuously.


Catalogue


Performance comparison

Performance of Flickr8K

(Sentence retrieval: image query → text. Image retrieval: text query → image. * indicates ensemble models, ^ indicates results of questionable authenticity.)

| Method | Note | Sent. R@1 | Sent. R@5 | Sent. R@10 | Img. R@1 | Img. R@5 | Img. R@10 |
|--------|------|-----------|-----------|------------|----------|----------|-----------|
| DeViSE | RCNN | 4.8 | 16.5 | 27.3 | 5.9 | 20.1 | 29.6 |
| SDT-RNN | AlexNet | 4.5 | 18.0 | 28.6 | 6.1 | 18.5 | 29.0 |
| SDT-RNN | RCNN | 6.0 | 22.7 | 34.0 | 6.6 | 21.6 | 31.7 |
| DeFrag | AlexNet | 5.9 | 19.2 | 27.3 | 5.2 | 17.6 | 26.5 |
| DeFrag | RCNN | 12.6 | 32.9 | 44.0 | 9.7 | 29.6 | 42.5 |
| m-RNN | AlexNet | 14.5 | 37.2 | 48.5 | 11.5 | 31.0 | 42.4 |
| DVSA | DepTree | 14.8 | 37.9 | 50.0 | 11.6 | 31.4 | 43.8 |
| DVSA | RCNN | 16.5 | 40.6 | 54.2 | 11.8 | 32.1 | 44.7 |
| UVSE | AlexNet | 13.5 | 36.2 | 45.7 | 10.4 | 31.0 | 43.7 |
| UVSE | VggNet | 18.0 | 40.9 | 55.0 | 12.5 | 37.0 | 51.5 |
| NIC | GoogleNet | 20 | -- | 61 | 19 | -- | 64 |
| m-CNN* | OverFeat | 14.9 | 35.9 | 49.0 | 11.8 | 34.5 | 48.0 |
| m-CNN* | VggNet | 24.8 | 53.7 | 67.1 | 20.3 | 47.6 | 61.7 |
| HM-LSTM | RCNN | 27.7 | -- | 68.6 | 24.4 | -- | 68.1 |
| SPE | VggNet | 30.1 | 60.4 | 73.7 | 23.0 | 51.3 | 64.8 |
| FV | GMM+HGLMM | 31.0 | 59.3 | 73.7 | 21.2 | 50.0 | 64.8 |
| NAA | ResNet | 37.2 | 68.1 | 79.1 | 27.7 | 59.6 | 71.8 |
| SCAN* | BUTD | 52.2 | 81.0 | 89.2 | 38.3 | 67.8 | 78.9 |
| IMRAM | BUTD, Image | 48.5 | 78.1 | 85.3 | 32.0 | 61.4 | 73.9 |
| IMRAM | BUTD, Text | 52.1 | 81.5 | 90.1 | 40.2 | 69.0 | 79.2 |
| IMRAM | BUTD, Full | 54.7 | 84.2 | 91.0 | 41.0 | 69.2 | 79.9 |
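All of the tables in this section report Recall@K (R@K): the percentage of queries whose ground-truth match appears in the top K retrieved items. Below is a minimal, illustrative sketch (not code from this repository) of how R@K is typically computed from an image-caption similarity matrix, assuming the standard Flickr/MSCOCO test protocol of 5 captions per image, with caption j belonging to image j // 5.

```python
# Illustrative Recall@K computation for cross-modal retrieval (assumed protocol:
# 5 captions per image, caption j matches image j // 5). Not the repo's code.
import numpy as np

def recall_at_k(sims: np.ndarray, ks=(1, 5, 10)):
    """sims: (n_images, 5 * n_images) image-by-caption similarity matrix."""
    n_images, n_captions = sims.shape
    assert n_captions == 5 * n_images

    # Sentence retrieval (image -> text): rank of the best-ranked ground-truth caption.
    i2t_ranks = []
    for i in range(n_images):
        order = np.argsort(-sims[i])                     # captions sorted by similarity
        gt = np.arange(5 * i, 5 * i + 5)                 # the 5 captions of image i
        i2t_ranks.append(np.where(np.isin(order, gt))[0].min())

    # Image retrieval (text -> image): rank of the single ground-truth image.
    t2i_ranks = []
    for j in range(n_captions):
        order = np.argsort(-sims[:, j])                  # images sorted by similarity
        t2i_ranks.append(np.where(order == j // 5)[0][0])

    i2t = {f"R@{k}": 100.0 * np.mean(np.array(i2t_ranks) < k) for k in ks}
    t2i = {f"R@{k}": 100.0 * np.mean(np.array(t2i_ranks) < k) for k in ks}
    return i2t, t2i

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sims = rng.standard_normal((1000, 5000))             # random scores, MSCOCO1K-sized
    print(recall_at_k(sims))
```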

Performance of Flickr30K

| Method | Note | Sent. R@1 | Sent. R@5 | Sent. R@10 | Img. R@1 | Img. R@5 | Img. R@10 |
|--------|------|-----------|-----------|------------|----------|----------|-----------|
| DeViSE | RCNN | 4.5 | 18.1 | 29.2 | 6.7 | 21.9 | 32.7 |
| SDT-RNN | RCNN | 9.6 | 29.8 | 41.1 | 8.9 | 29.8 | 41.1 |
| DeFrag | RCNN | 14.2 | 37.7 | 51.3 | 10.2 | 30.8 | 44.2 |
| DeFrag (ft) | RCNN | 16.4 | 40.2 | 54.7 | 10.3 | 31.4 | 44.5 |
| DCCA | AlexNet | 16.7 | 39.3 | 52.9 | 12.6 | 31.0 | 43.0 |
| NIC | GoogleNet | 17 | -- | 56 | 17 | -- | 57 |
| DVSA | DepTree | 20.0 | 46.6 | 59.4 | 15.0 | 36.5 | 48.2 |
| DVSA | RCNN | 22.2 | 48.2 | 61.4 | 15.2 | 37.7 | 50.5 |
| UVSE | AlexNet | 14.8 | 39.2 | 50.9 | 11.8 | 34.0 | 46.3 |
| UVSE | VggNet | 23.0 | 50.7 | 62.9 | 16.8 | 42.0 | 56.5 |
| LRCN | VggNet | 23.6 | 46.6 | 58.3 | 17.5 | 40.3 | 50.8 |
| m-CNN* | OverFeat | 20.1 | 44.2 | 56.3 | 15.9 | 40.3 | 51.9 |
| m-CNN* | VggNet | 33.6 | 64.1 | 74.9 | 26.2 | 56.3 | 69.6 |
| m-RNN | AlexNet | 18.4 | 40.2 | 50.9 | 12.6 | 31.2 | 41.5 |
| m-RNN | VggNet | 35.4 | 63.8 | 73.7 | 22.8 | 50.7 | 63.1 |
| FV | GMM+HGLMM | 35.0 | 62.0 | 73.8 | 25.0 | 52.7 | 66.0 |
| HM-LSTM | RCNN | 38.1 | -- | 76.5 | 27.7 | -- | 68.8 |
| SPE | VggNet | 40.3 | 68.9 | 79.9 | 29.7 | 60.1 | 72.1 |
| sm-LSTM | VggNet | 42.4 | 67.5 | 79.9 | 28.2 | 57.0 | 68.4 |
| sm-LSTM* | VggNet | 42.5 | 71.9 | 81.5 | 30.2 | 60.4 | 72.3 |
| CSE | ResNet | 44.6 | 74.3 | 83.8 | 36.9 | 69.1 | 79.6 |
| RRF-Net | ResNet | 47.6 | 77.4 | 87.1 | 35.4 | 68.3 | 79.9 |
| CMPL | MobileNet | 40.3 | 66.9 | 76.7 | 30.4 | 58.2 | 68.5 |
| CMPL | ResNet | 49.6 | 76.8 | 86.1 | 37.3 | 65.7 | 75.5 |
| 2WayNet | VggNet | 49.8 | 67.5 | -- | 36.0 | 55.6 | -- |
| VSE++ | VggNet | 41.3 | 69.1 | 77.9 | 31.4 | 60.0 | 71.2 |
| VSE++ | ResNet | 52.9 | 80.5 | 87.2 | 39.6 | 70.1 | 79.5 |
| TIMAM | ResNet, Bert | 53.1 | 78.8 | 87.6 | 42.6 | 71.6 | 81.9 |
| DAN | VggNet | 41.4 | 73.5 | 82.5 | 31.8 | 61.7 | 72.5 |
| DAN | ResNet | 55.0 | 81.8 | 89.0 | 39.4 | 69.2 | 79.1 |
| NAA | ResNet | 55.1 | 80.3 | 89.6 | 39.4 | 68.8 | 79.9 |
| SCO | VggNet | 44.2 | 74.1 | 83.6 | 32.8 | 64.3 | 74.9 |
| SCO | ResNet | 55.5 | 82.0 | 89.3 | 41.1 | 70.5 | 80.1 |
| Dual-Path | VggNet | 47.6 | 77.3 | 87.1 | 35.3 | 66.6 | 78.2 |
| Dual-Path | ResNet | 55.6 | 81.9 | 89.5 | 39.1 | 69.2 | 80.9 |
| CVSE++ | ResNet | 56.6 | 82.5 | 90.2 | 42.4 | 71.6 | 80.8 |
| GXN | ResNet | 56.8 | -- | 89.6 | 41.5 | -- | 80.1 |
| Align2Ground | BUTD | -- | -- | -- | 49.7 | 74.8 | 83.3 |
| A3VSE | BUTD | 65.0 | 89.2 | 94.5 | 49.5 | 79.5 | 86.6 |
| R-SCAN | BUTD, VrR-VG | 66.3 | 90.6 | 96.0 | 51.4 | 77.8 | 84.9 |
| SAVE | ResNet | 67.2 | 88.3 | 94.2 | 49.8 | 78.7 | 86.2 |
| SCAN | BUTD, t2i_AVE | 61.8 | 87.5 | 93.7 | 45.8 | 74.4 | 83.0 |
| SCAN | BUTD, i2t_AVE | 67.9 | 89.0 | 94.4 | 43.9 | 74.2 | 82.8 |
| SCAN* | BUTD, AVE+LSE | 67.4 | 90.3 | 95.8 | 48.6 | 77.7 | 85.2 |
| BFAN | BUTD, prob | 65.5 | 89.4 | -- | 47.9 | 77.6 | -- |
| BFAN | BUTD, equal | 64.5 | 89.7 | -- | 48.8 | 77.3 | -- |
| BFAN* | BUTD | 68.1 | 91.4 | -- | 50.8 | 78.4 | -- |
| CAMP | BUTD | 68.1 | 89.7 | 95.2 | 51.5 | 77.1 | 85.3 |
| RDAN | BUTD | 68.1 | 91.0 | 95.9 | 54.1 | 80.9 | 87.2 |
| Personality | ResNeXt, Transformer | 68.4 | 90.6 | 95.3 | -- | -- | -- |
| CASC | ResNet | 68.5 | 90.6 | 95.9 | 50.2 | 78.3 | 86.3 |
| GVSE* | BUTD | 68.5 | 90.9 | 95.5 | 50.6 | 79.8 | 87.6 |
| HAL | SCAN_i2t | 68.6 | 89.9 | 94.7 | 46.0 | 74.0 | 82.3 |
| OAN | BUTD | 68.6 | 93.0 | 96.0 | 53.3 | 80.1 | 87.1 |
| SAEM | BUTD, Bert | 69.1 | 91.0 | 95.1 | 52.4 | 81.1 | 88.1 |
| MPL | SCAN_i2t | 69.4 | 89.9 | 95.4 | 47.5 | 75.5 | 83.1 |
| PFAN | BUTD, t2i | 66.0 | 89.6 | 94.3 | 49.6 | 77.0 | 84.2 |
| PFAN | BUTD, i2t | 67.6 | 90.0 | 93.8 | 45.7 | 74.7 | 83.6 |
| PFAN* | BUTD | 70.0 | 91.8 | 95.0 | 50.4 | 78.7 | 86.1 |
| CAAN | BUTD | 70.1 | 91.6 | 97.2 | 52.8 | 79.0 | 87.9 |
| DP-RNN | BUTD | 70.2 | 91.6 | 95.8 | 55.5 | 81.3 | 88.2 |
| HOAD | BUTD | 70.8 | 92.7 | 96.0 | 59.5 | 85.6 | 91.0 |
| HOAD | BUTD, +Dist | 70.8 | 92.7 | 96.0 | 60.9 | 86.1 | 91.0 |
| GOT | SCAN_i2t | 70.9 | 92.8 | 95.5 | 50.7 | 78.7 | 86.2 |
| VSRN* | BUTD | 71.3 | 90.6 | 96.0 | 54.7 | 81.8 | 88.2 |
| SCG | VggNet, Prod | 57.2 | 85.1 | 92.1 | 40.1 | 69.5 | 79.5 |
| SCG | VggNet, Gated | 71.8 | 90.8 | 94.8 | 49.3 | 76.4 | 85.6 |
| SGM | BUTD | 71.8 | 91.7 | 95.5 | 53.5 | 79.6 | 86.5 |
| IMRAM | BUTD, Image | 67.0 | 90.5 | 95.6 | 51.2 | 78.2 | 85.5 |
| IMRAM | BUTD, Text | 68.8 | 91.6 | 96.0 | 53.0 | 79.0 | 87.1 |
| IMRAM | BUTD, Full | 74.1 | 93.0 | 96.6 | 53.9 | 79.4 | 87.2 |
| MMCA | BUTD, Bert | 74.2 | 92.8 | 96.4 | 54.8 | 81.4 | 87.8 |
| SAN^ | VggNet | 67.0 | 88.0 | 94.6 | 51.4 | 77.2 | 85.2 |
| SAN^ | ResNet | 75.5 | 92.6 | 96.2 | 60.1 | 84.7 | 90.6 |
| GSMN | BUTD, sparse | 71.4 | 92.0 | 96.1 | 53.9 | 79.7 | 87.1 |
| GSMN | BUTD, dense | 72.6 | 93.5 | 96.8 | 53.7 | 80.0 | 87.0 |
| GSMN* | BUTD | 76.4 | 94.3 | 97.3 | 57.4 | 82.3 | 89.0 |
| ADAPT | BUTD, i2t | 70.2 | 90.8 | 95.8 | 55.5 | 82.7 | 89.8 |
| ADAPT | BUTD, t2i | 73.6 | 93.7 | 96.7 | 57.0 | 83.6 | 90.3 |
| ADAPT* | BUTD | 76.6 | 95.4 | 97.6 | 60.7 | 86.6 | 92.0 |
| SGRAF | BUTD, SAF | 73.7 | 93.3 | 96.3 | 56.1 | 81.5 | 88.0 |
| SGRAF | BUTD, SGR | 75.2 | 93.3 | 96.6 | 56.2 | 81.0 | 86.5 |
| SGRAF* | BUTD | 77.8 | 94.1 | 97.4 | 58.5 | 83.0 | 88.8 |
| ACMM | BUTD | 80.0 | 95.5 | 98.2 | 50.2 | 76.8 | 84.7 |
| ACMM* | BUTD | 85.2 | 96.7 | 98.4 | 53.8 | 79.8 | 86.8 |

Performance of MSCOCO1K

| Method | Note | Sent. R@1 | Sent. R@5 | Sent. R@10 | Img. R@1 | Img. R@5 | Img. R@10 |
|--------|------|-----------|-----------|------------|----------|----------|-----------|
| STV | combine-skip | 33.8 | 67.7 | 82.1 | 25.9 | 60.0 | 74.6 |
| DVSA | RCNN | 38.4 | 69.9 | 80.5 | 27.4 | 60.2 | 74.8 |
| FV | GMM+HGLMM | 39.4 | 67.9 | 80.9 | 25.1 | 59.8 | 76.6 |
| m-RNN | VggNet | 41.0 | 73.0 | 83.5 | 29.0 | 42.2 | 77.0 |
| m-CNN* | VggNet | 42.8 | 73.1 | 84.1 | 32.6 | 68.6 | 82.8 |
| UVSE | VggNet | 43.4 | 75.7 | 85.8 | 31.0 | 66.7 | 79.9 |
| HM-LSTM | RCNN | 43.9 | -- | 87.8 | 36.1 | -- | 86.7 |
| Order-emb | VggNet | 46.7 | -- | 88.9 | 37.9 | -- | 85.9 |
| SPE | VggNet | 50.1 | 79.7 | 89.2 | 39.6 | 75.2 | 86.9 |
| SEAM | VggNet | 50.7 | 81.4 | 90.9 | 40.3 | 75.7 | 87.4 |
| sm-LSTM | VggNet | 52.4 | 81.7 | 90.8 | 38.6 | 73.4 | 84.6 |
| sm-LSTM* | VggNet | 53.2 | 83.1 | 91.5 | 40.7 | 75.8 | 87.4 |
| CMPL | MobileNet | 52.9 | 83.8 | 92.1 | 41.3 | 74.6 | 85.9 |
| 2WayNet | VggNet | 55.8 | 75.2 | -- | 39.7 | 63.3 | -- |
| CMPM | ResNet | 56.1 | 86.3 | 92.9 | 44.6 | 78.8 | 89.0 |
| CSE | ResNet | 56.3 | 84.4 | 92.2 | 45.7 | 81.2 | 90.6 |
| RRF-Net | ResNet | 56.4 | 85.3 | 91.5 | 43.9 | 78.1 | 88.6 |
| CHAIN-VSE | VggNet | 51.6 | 82.0 | 91.3 | 38.6 | 75.1 | 87.2 |
| CHAIN-VSE | ResNet | 59.4 | 88.0 | 94.2 | 43.5 | 79.8 | 90.2 |
| NAA | ResNet | 61.3 | 87.9 | 95.4 | 47.0 | 80.8 | 90.1 |
| VSE++ | VggNet | 57.2 | 86.0 | 93.3 | 45.9 | 79.4 | 89.1 |
| VSE++ | ResNet | 64.6 | 90.0 | 95.7 | 52.0 | 84.3 | 92.0 |
| Dual-Path | VggNet | 59.4 | 86.2 | 92.9 | 41.6 | 76.3 | 87.5 |
| Dual-Path | ResNet | 65.6 | 89.8 | 95.5 | 47.1 | 79.9 | 90.0 |
| Personality | ResNeXt, Transformer | 67.3 | 91.7 | 96.5 | -- | -- | -- |
| Align2Ground | BUTD | -- | -- | -- | 56.6 | 84.9 | 92.8 |
| GXN | ResNet | 68.5 | -- | 97.9 | 56.6 | -- | 94.5 |
| CVSE++ | ResNet | 69.1 | 92.2 | 96.1 | 55.6 | 86.7 | 93.8 |
| PVSE | ResNet | 69.2 | 91.6 | 96.6 | 55.2 | 86.5 | 93.7 |
| SCO | VggNet | 66.6 | 91.8 | 96.6 | 55.5 | 86.6 | 93.8 |
| SCO | ResNet | 69.9 | 92.9 | 97.5 | 56.7 | 87.5 | 94.8 |
| R-SCAN | BUTD, VrR-VG | 70.3 | 94.5 | 98.1 | 57.6 | 87.3 | 93.7 |
| SAVE | ResNet | 70.8 | 93.2 | 97.6 | 56.9 | 87.6 | 94.4 |
| MPL | SCAN_i2t | 71.1 | 93.7 | 98.2 | 56.8 | 86.7 | 93.0 |
| SAEM | BUTD, Bert | 71.2 | 94.1 | 97.7 | 57.8 | 88.6 | 94.9 |
| OAN | BUTD | 71.7 | 96.4 | 99.3 | 60.2 | 88.6 | 94.5 |
| GVSE* | BUTD | 72.2 | 94.1 | 98.1 | 60.5 | 89.4 | 95.8 |
| CAMP | BUTD | 72.3 | 94.8 | 98.3 | 58.5 | 87.9 | 95.0 |
| CASC | ResNet | 72.3 | 96.0 | 99.0 | 58.9 | 89.8 | 96.0 |
| SCAN | BUTD, t2i_AVE | 70.9 | 94.5 | 97.8 | 56.4 | 87.0 | 93.9 |
| SCAN | BUTD, i2t_AVE | 69.2 | 93.2 | 97.5 | 54.4 | 86.0 | 93.6 |
| SCAN* | BUTD, LSE+AVE | 72.7 | 94.8 | 98.4 | 58.8 | 88.4 | 94.8 |
| SGM | BUTD | 73.4 | 93.8 | 97.8 | 57.5 | 87.3 | 94.3 |
| ParNet | BUTD, NP | 72.8 | 94.9 | 97.9 | 57.9 | 87.4 | 94.0 |
| ParNet | BUTD, P | 73.5 | 94.5 | 98.3 | 58.3 | 88.2 | 94.1 |
| RDAN | BUTD | 74.6 | 96.2 | 98.7 | 61.6 | 89.2 | 94.7 |
| MMCA | BUTD, Bert | 74.8 | 95.6 | 97.7 | 61.6 | 89.8 | 95.2 |
| BFAN | BUTD, prob | 73.0 | 94.8 | -- | 58.0 | 87.6 | -- |
| BFAN | BUTD, equal | 73.7 | 94.9 | -- | 58.3 | 87.5 | -- |
| BFAN* | BUTD | 74.9 | 95.2 | -- | 59.4 | 88.4 | -- |
| DP-RNN | BUTD | 75.3 | 95.8 | 98.6 | 62.5 | 89.7 | 95.1 |
| CAAN | BUTD | 75.5 | 95.4 | 98.5 | 61.3 | 89.7 | 95.2 |
| VSRN* | BUTD | 76.2 | 94.8 | 98.2 | 62.8 | 89.7 | 95.1 |
| ADAPT | BUTD, i2t | 74.5 | 94.2 | 97.9 | 62.0 | 90.4 | 95.5 |
| ADAPT | BUTD, t2i | 75.3 | 95.1 | 98.4 | 63.3 | 90.0 | 95.5 |
| ADAPT* | BUTD | 76.5 | 95.6 | 98.9 | 62.2 | 90.5 | 96.0 |
| PFAN | BUTD, t2i | 75.8 | 95.9 | 99.0 | 61.0 | 89.1 | 95.1 |
| PFAN | BUTD, i2t | 70.7 | 94.1 | 97.8 | 53.0 | 84.5 | 92.6 |
| PFAN* | BUTD | 76.5 | 96.3 | 99.0 | 61.6 | 89.6 | 95.2 |
| SCG | VggNet, Prod | 73.4 | 94.8 | 97.6 | 56.3 | 85.6 | 93.5 |
| SCG | VggNet, Gated | 76.6 | 96.3 | 99.2 | 61.4 | 88.9 | 95.1 |
| IMRAM | BUTD, Image | 76.1 | 95.3 | 98.2 | 61.0 | 88.6 | 94.5 |
| IMRAM | BUTD, Text | 74.0 | 95.6 | 98.4 | 60.6 | 88.9 | 94.6 |
| IMRAM | BUTD, Full | 76.7 | 95.6 | 98.5 | 61.7 | 89.1 | 95.0 |
| HOAD | BUTD | 77.0 | 96.1 | 98.7 | 65.1 | 93.1 | 97.9 |
| HOAD | BUTD, +Dist | 77.8 | 96.1 | 98.7 | 66.2 | 93.0 | 97.9 |
| HAL | SCAN_i2t | 78.3 | 96.3 | 98.5 | 60.1 | 86.7 | 92.8 |
| GSMN | BUTD, sparse | 76.1 | 95.6 | 98.3 | 60.4 | 88.7 | 95.0 |
| GSMN | BUTD, dense | 74.7 | 95.3 | 98.2 | 60.3 | 88.5 | 94.6 |
| GSMN* | BUTD | 78.4 | 96.4 | 98.6 | 63.3 | 90.1 | 95.7 |
| SGRAF | BUTD, SAF | 76.1 | 95.4 | 98.3 | 61.8 | 89.4 | 95.3 |
| SGRAF | BUTD, SGR | 78.0 | 95.8 | 98.2 | 61.4 | 89.3 | 95.4 |
| SGRAF* | BUTD | 79.6 | 96.2 | 98.5 | 63.2 | 90.7 | 96.1 |
| ACMM | BUTD | 81.9 | 98.0 | 99.3 | 58.2 | 87.3 | 93.9 |
| ACMM* | BUTD | 84.1 | 97.8 | 99.4 | 60.7 | 88.7 | 94.9 |
| SAN^ | VggNet | 74.9 | 94.9 | 98.2 | 60.8 | 90.3 | 95.7 |
| SAN^ | ResNet | 85.4 | 97.5 | 99.0 | 69.1 | 93.4 | 97.2 |

Performance of MSCOCO5K

| Method | Note | Sent. R@1 | Sent. R@5 | Sent. R@10 | Img. R@1 | Img. R@5 | Img. R@10 |
|--------|------|-----------|-----------|------------|----------|----------|-----------|
| DVSA | RCNN | 16.5 | 39.2 | 52.0 | 10.7 | 29.6 | 42.2 |
| FV | GMM+HGLMM | 17.3 | 39.0 | 50.2 | 10.8 | 28.3 | 40.1 |
| Order-emb | VggNet | 23.3 | -- | 65.0 | 18.0 | -- | 57.6 |
| CSE | ResNet | 27.9 | 57.1 | 70.4 | 22.2 | 50.2 | 64.4 |
| CMPL | MobileNet | 24.6 | 52.3 | 66.4 | 19.1 | 44.6 | 58.4 |
| CMPM | ResNet | 31.1 | 60.7 | 73.9 | 22.9 | 50.2 | 63.8 |
| Dual-Path | VggNet | 35.5 | 63.2 | 75.6 | 21.0 | 47.5 | 60.9 |
| Dual-Path | ResNet | 41.2 | 70.5 | 81.1 | 25.3 | 53.4 | 66.4 |
| VSE++ | VggNet | 32.9 | 61.7 | 74.7 | 24.1 | 52.8 | 66.2 |
| VSE++ | ResNet | 41.3 | 71.1 | 81.2 | 30.3 | 59.4 | 72.4 |
| GXN | ResNet | 42.0 | -- | 84.7 | 31.7 | -- | 74.6 |
| SCO | VggNet | 40.2 | 70.1 | 81.3 | 31.3 | 61.5 | 73.9 |
| SCO | ResNet | 42.8 | 72.3 | 83.0 | 33.1 | 62.9 | 75.5 |
| CVSE++ | ResNet | 43.2 | 73.5 | 84.1 | 32.4 | 62.2 | 74.6 |
| PVSE | ResNet | 45.2 | 74.3 | 84.5 | 32.4 | 63.0 | 75.0 |
| R-SCAN | BUTD, VrR-VG | 45.4 | 77.9 | 87.9 | 36.2 | 65.5 | 76.7 |
| SAVE | ResNet | 46.7 | 76.3 | 86.1 | 34.0 | 64.8 | 77.0 |
| MPL | SCAN_i2t | 46.9 | 77.7 | 87.6 | 34.4 | 64.2 | 75.9 |
| CASC | ResNet | 47.2 | 78.3 | 87.4 | 34.7 | 64.8 | 76.8 |
| OAN | BUTD | 47.8 | 81.2 | 90.4 | 37.0 | 66.6 | 78.0 |
| A3VSE | BUTD | 49.3 | 81.1 | 90.2 | 39.0 | 68.0 | 80.1 |
| GVSE* | BUTD | 49.9 | 77.4 | 87.6 | 38.4 | 68.5 | 79.7 |
| SGM | BUTD | 50.0 | 79.3 | 87.9 | 35.3 | 64.9 | 76.5 |
| CAMP | BUTD | 50.1 | 82.1 | 89.7 | 39.0 | 68.9 | 80.2 |
| SCAN | BUTD, i2t_LSE | 46.4 | 77.4 | 87.2 | 34.4 | 63.7 | 75.7 |
| SCAN* | BUTD, AVE+LSE | 50.4 | 82.2 | 90.0 | 38.6 | 69.3 | 80.4 |
| GOT | SCAN_i2t | 50.5 | 80.2 | 89.8 | 38.1 | 66.8 | 78.5 |
| HOAD | BUTD | 51.2 | 81.7 | 89.1 | 39.4 | 72.5 | 84.1 |
| HOAD | BUTD, +Dist | 51.4 | 81.8 | 89.1 | 40.5 | 73.5 | 84.1 |
| CAAN | BUTD | 52.5 | 83.3 | 90.9 | 41.2 | 70.3 | 82.9 |
| VSRN* | BUTD | 53.0 | 81.1 | 89.4 | 40.5 | 70.6 | 81.1 |
| IMRAM | BUTD, Image | 53.2 | 82.5 | 90.4 | 38.9 | 68.5 | 79.2 |
| IMRAM | BUTD, Text | 52.0 | 81.8 | 90.1 | 38.6 | 68.1 | 79.1 |
| IMRAM | BUTD, Full | 53.7 | 83.2 | 91.0 | 39.7 | 69.1 | 79.8 |
| MMCA | BUTD, Bert | 54.0 | 82.5 | 90.7 | 38.7 | 69.7 | 80.8 |
| SCG | VggNet, Prod | 49.9 | 78.9 | 88.1 | 33.2 | 62.4 | 74.7 |
| SCG | VggNet, Gated | 56.6 | 84.5 | 92.0 | 39.2 | 68.0 | 81.3 |
| SGRAF | BUTD, SAF | 53.3 | 82.3 | 90.1 | 39.8 | 69.0 | 80.2 |
| SGRAF | BUTD, SGR | 56.9 | 83.2 | 90.5 | 40.2 | 69.0 | 79.8 |
| SGRAF* | BUTD | 57.8 | 84.9 | 91.6 | 41.9 | 70.7 | 81.3 |
| SAN^ | ResNet | 65.4 | 89.4 | 94.8 | 46.2 | 77.4 | 86.6 |
| ACMM | BUTD | 63.5 | 88.0 | 93.6 | 36.7 | 65.1 | 76.7 |
| ACMM* | BUTD | 66.9 | 89.6 | 94.9 | 39.5 | 69.6 | 81.1 |

Performance of CUHK-PEDES

| Method | Note | R@1 | R@5 | R@10 |
|--------|------|-----|-----|------|
| LSTM-Q+I | VggNet | 17.19 | -- | 57.82 |
| GNA-RNN | VggNet | 19.05 | -- | 53.64 |
| IATV | VggNet | 25.94 | -- | 60.48 |
| PWM-ATH | VggNet | 27.14 | 49.45 | 61.02 |
| GLA | ResNet | 43.58 | 66.93 | 76.26 |
| Dual-Path | VggNet | 32.15 | 54.42 | 64.30 |
| Dual-Path | ResNet | 44.40 | 66.26 | 75.07 |
| CMPM | MobileNet | 44.02 | -- | 77.00 |
| CMPL | MobileNet | 49.37 | -- | 79.27 |
| PMA | VggNet | 47.02 | 68.54 | 78.06 |
| PMA | ResNet | 53.81 | 73.54 | 81.23 |
| TIMAM | ResNet, Bert | 54.51 | 77.56 | 84.78 |

Performance of CUB-Flowers

| Method | Note | CUB I→T R@1 | CUB T→I AP@50 | Flowers I→T R@1 | Flowers T→I AP@50 |
|--------|------|-------------|---------------|-----------------|-------------------|
| FV | GMM+HGLMM | 36.5 | 35.6 | 54.8 | 52.8 |
| Word2Vec | | 38.6 | 33.5 | 54.2 | 52.1 |
| Word-NN | CNN | 51.0 | 43.3 | 60.7 | 56.3 |
| Word-NN | CNN-RNN | 56.8 | 48.7 | 65.6 | 59.6 |
| IATV | Triplet | 52.5 | 52.4 | 64.3 | 64.9 |
| IATV | VggNet | 61.5 | 57.6 | 68.4 | 70.1 |
| CMPM | MobileNet | 62.1 | 64.6 | 66.1 | 67.7 |
| CMPL | MobileNet | 64.3 | 67.9 | 68.9 | 69.7 |
| TIMAM | ResNet, Bert | 67.7 | 70.3 | 70.6 | 73.7 |

Method summary

*Generic-feature extraction*

(NIPS2013_DeViSE) DeViSE: A Deep Visual-Semantic Embedding Model.
Andrea Frome, Greg S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, Tomas Mikolov.
[paper]

(TACL2014_SDT-RNN) Grounded Compositional Semantics for Finding and Describing Images with Sentences.
Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, Andrew Y. Ng.
[paper]

(NIPSws2014_UVSE) Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models.
Ryan Kiros, Ruslan Salakhutdinov, Richard S. Zemel.
[paper] [code] [demo]

(NIPS2014_DeFrag) Deep fragment embeddings for bidirectional image sentence mapping.
Andrej Karpathy, Armand Joulin, Li Fei-Fei.
[paper]

(ICCV2015_m-CNN) Multimodal Convolutional Neural Networks for Matching Image and Sentence.
Lin Ma, Zhengdong Lu, Lifeng Shang, Hang Li.
[paper]

(CVPR2015_DCCA) Deep Correlation for Matching Images and Text.
Fei Yan, Krystian Mikolajczyk.
[paper]

(CVPR2015_FV) Associating Neural Word Embeddings with Deep Image Representations using Fisher Vectors.
Benjamin Klein, Guy Lev, Gil Sadeh, Lior Wolf.
[paper]

(CVPR2015_DVSA) Deep Visual-Semantic Alignments for Generating Image Descriptions.
Andrej Karpathy, Li Fei-Fei.
[paper]

(NIPS2015_STV) Skip-thought Vectors.
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler.
[paper]

(CVPR2016_SPE) Learning Deep Structure-Preserving Image-Text Embeddings.
Liwei Wang, Yin Li, Svetlana Lazebnik.
[paper]

(ICCV2017_HM-LSTM) Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding.
Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, Gang Hua.
[paper]

(ICCV2017_RRF-Net) Learning a Recurrent Residual Fusion Network for Multimodal Matching.
Yu Liu, Yanming Guo, Erwin M. Bakker, Michael S. Lew.
[paper]

(CVPR2017_2WayNet) Linking Image and Text with 2-Way Nets.
Aviv Eisenschtat, Lior Wolf.
[paper]

(WACV2018_SEAM) Fast Self-Attentive Multimodal Retrieval.
Jônatas Wehrmann, Maurício Armani Lopes, Martin D More, Rodrigo C. Barros.
[paper] [code]

(CVPR2018_CSE) End-to-end Convolutional Semantic Embeddings.
Quanzeng You, Zhengyou Zhang, Jiebo Luo.
[paper]

(CVPR2018_CHAIN-VSE) Bidirectional Retrieval Made Simple.
Jonatas Wehrmann, Rodrigo C. Barros.
[paper] [code]

(CVPR2018_SCO) Learning Semantic Concepts and Order for Image and Sentence Matching.
Yan Huang, Qi Wu, Liang Wang.
[paper]

(MM2019_SAEM) Learning Fragment Self-Attention Embeddings for Image-Text Matching.
Yiling Wu, Shuhui Wang, Guoli Song, Qingming Huang.
[paper] [code]

(ICCV2019_VSRN) Visual Semantic Reasoning for Image-Text Matching.
Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, Yun Fu.
[paper] [code]

(CVPR2019_Personality) Engaging Image Captioning via Personality.
Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, Jason Weston.
[paper]

(CVPR2019_PVSE) Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval.
Yale Song, Mohammad Soleymani.
[paper] [code]

(TOMM2020_NIS) Upgrading the Newsroom: An Automated Image Selection System for News Articles.
Fangyu Liu, Rémi Lebret, Didier Orel, Philippe Sordet, Karl Aberer.
[paper] [slides] [demo]

(WACV2020_SGM) Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval.
Sijin Wang, Ruiping Wang, Ziwei Yao, Shiguang Shan, Xilin Chen.
[paper]

*Cross-modal interaction*

(arXiv2014_NIC) Show and Tell: A Neural Image Caption Generator.
Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan.
[paper]

(ICLR2015_m-RNN) Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN).
Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille.
[paper] [code]

(CVPR2015_LRCN) Long-term Recurrent Convolutional Networks for Visual Recognition and Description.
Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, Trevor Darrell.
[paper]

(CVPR2017_DAN) Dual Attention Networks for Multimodal Reasoning and Matching.
Hyeonseob Nam, Jung-Woo Ha, Jeonghee Kim.
[paper]

(CVPR2017_sm-LSTM) Instance-aware Image and Sentence Matching with Selective Multimodal LSTM.
Yan Huang, Wei Wang, Liang Wang.
[paper]

(ECCV2018_CITE) Conditional Image-Text Embedding Networks.
Bryan A. Plummer, Paige Kordas, M. Hadi Kiapour, Shuai Zheng, Robinson Piramuthu, Svetlana Lazebnik.
[paper]

(ECCV2018_SCAN) Stacked Cross Attention for Image-Text Matching.
Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, Xiaodong He.
[paper] [code]
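As a rough, simplified sketch of the general stacked cross-attention idea behind this family of methods (each caption word attends over image region features, and per-word relevances are pooled into an image-caption score), the snippet below illustrates the text-to-image attention direction. It is not the authors' implementation; the exact normalizations, the temperature value, and the pooling variants differ in the paper.

```python
# Simplified word-to-region cross attention (t2i direction); illustrative only.
import torch
import torch.nn.functional as F

def cross_attention_score(regions: torch.Tensor, words: torch.Tensor, lam: float = 9.0) -> torch.Tensor:
    """regions: (n_regions, d) image region features; words: (n_words, d) word features."""
    v = F.normalize(regions, dim=-1)
    w = F.normalize(words, dim=-1)
    sim = (w @ v.t()).clamp(min=0)          # (n_words, n_regions) word-region similarities
    attn = F.softmax(lam * sim, dim=-1)     # each word attends over the image regions
    attended = attn @ v                     # (n_words, d) attended image vector per word
    relevance = F.cosine_similarity(w, attended, dim=-1)
    return relevance.mean()                 # average pooling over words ("AVE"-style)
```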

(arXiv2019_R-SCAN) Learning Visual Relation Priors for Image-Text Matching and Image Captioning with Neural Scene Graph Generators.
Kuang-Huei Lee, Hamid Palangi, Xi Chen, Houdong Hu, Jianfeng Gao.
[paper]

(arXiv2019_ParNet) ParNet: Position-aware Aggregated Relation Network for Image-Text matching.
Yaxian Xia, Lun Huang, Wenmin Wang, Xiaoyong Wei, Jie Chen.
[paper]

(ACML2019_SAVE) Multi-Scale Visual Semantics Aggregation with Self-Attention for End-to-End Image-Text Matching.
Zhuobin Zheng, Youcheng Ben, Chun Yuan.
[paper]

(ICMR2019_OAN) Improving What Cross-Modal Retrieval Models Learn through Object-Oriented Inter- and Intra-Modal Attention Networks.
Po-Yao Huang, Vaibhav, Xiaojun Chang, Alexander Georg Hauptmann.
[paper] [code]

(MM2019_BFAN) Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching.
Chunxiao Liu, Zhendong Mao, An-An Liu, Tianzhu Zhang, Bin Wang, Yongdong Zhang.
[paper] [code]

(MM2019_MTFN) Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking.
Tan Wang, Xing Xu, Yang Yang, Alan Hanjalic, Heng Tao Shen, Jingkuan Song.
[paper] [code]

(IJCAI2019_RDAN) Multi-Level Visual-Semantic Alignments with Relation-Wise Dual Attention Network for Image and Text Matching.
Zhibin Hu, Yongsheng Luo, Jiong Lin, Yan Yan, Jian Chen.
[paper]

(IJCAI2019_PFAN) Position Focused Attention Network for Image-Text Matching.
Yaxiong Wang, Hao Yang, Xueming Qian, Lin Ma, Jing Lu, Biao Li, Xin Fan.
[paper] [code]

(ICCV2019_CAMP) CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval.
Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng, Junjie Yan, Xiaogang Wang, Jing Shao.
[paper] [code]

(ICCV2019_SAN) Saliency-Guided Attention Network for Image-Sentence Matching.
Zhong Ji, Haoran Wang, Jungong Han, Yanwei Pang.
[paper] [code]

(TNNLS2020_CASC) Cross-Modal Attention With Semantic Consistence for Image-Text Matching.
Xing Xu, Tan Wang, Yang Yang, Lin Zuo, Fumin Shen, Heng Tao Shen.
[paper] [code]

(AAAI2020_DP-RNN) Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching.
Tianlang Chen, Jiebo Luo.
[paper]

(AAAI2020_ADAPT) Adaptive Cross-modal Embeddings for Image-Text Alignment.
Jonatas Wehrmann, Camila Kolling, Rodrigo C Barros.
[paper] [code]

(CVPR2020_CAAN) Context-Aware Attention Network for Image-Text Retrieval.
Qi Zhang, Zhen Lei, Zhaoxiang Zhang, Stan Z. Li.
[paper]

(CVPR2020_MMCA) Multi-Modality Cross Attention Network for Image and Sentence Matching.
Xi Wei, Tianzhu Zhang, Yan Li, Yongdong Zhang, Feng Wu.
[paper]

(CVPR2020_IMRAM) IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval.
Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, Jungong Han.
[paper] [code]

*Similarity measurement*

(ICLR2016_Order-emb) Order-Embeddings of Images and Language.
Ivan Vendrov, Ryan Kiros, Sanja Fidler, Raquel Urtasun.
[paper]

(CVPR2020_HOAD) Visual-Semantic Matching by Exploring High-Order Attention and Distraction.
Yongzhi Li, Duo Zhang, Yadong Mu.
[paper]

(CVPR2020_GSMN) Graph Structured Network for Image-Text Matching.
Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, Hongtao Xie, Bin Wang, Yongdong Zhang.
[paper] [code]

(ICML2020_GOT) Graph Optimal Transport for Cross-Domain Alignment.
Liqun Chen, Zhe Gan, Yu Cheng, Linjie Li, Lawrence Carin, Jingjing Liu.
[paper] [code]

(AAAI2021_SGRAF) Similarity Reasoning and Filtration for Image-Text Matching.
Haiwen Diao, Ying Zhang, Lin Ma, Huchuan Lu.
[paper] [code]

*Loss function*

(TPAMI2018_TBNN) Learning Two-Branch Neural Networks for Image-Text Matching Tasks.
Liwei Wang, Yin Li, Jing Huang, Svetlana Lazebnik.
[paper] [code]

(BMVC2018_VSE++) VSE++: Improving Visual-Semantic Embeddings with Hard Negatives.
Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler.
[paper] [code]
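The hinge-based triplet ranking loss with the hardest in-batch negatives ("max violation") introduced by VSE++ is the de facto training objective for many of the embedding methods listed above. Below is a minimal sketch of that idea, not the authors' released code; the margin value and tensor shapes are assumptions for illustration.

```python
# Illustrative max-violation triplet loss over a batch of matched image-caption pairs.
import torch

def max_violation_triplet_loss(im, txt, margin=0.2):
    """im, txt: L2-normalized embeddings of shape (batch, dim), aligned by index."""
    scores = im @ txt.t()                    # (batch, batch) cosine similarities
    pos = scores.diag().view(-1, 1)          # similarity of each matched pair

    cost_txt = (margin + scores - pos).clamp(min=0)      # caption negatives per image
    cost_im = (margin + scores - pos.t()).clamp(min=0)   # image negatives per caption

    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(mask, 0)              # ignore the positive pairs
    cost_im = cost_im.masked_fill(mask, 0)

    # Max violation: keep only the hardest negative per row (image) and column (caption).
    return cost_txt.max(dim=1)[0].sum() + cost_im.max(dim=0)[0].sum()
```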

(ECCV2018_CMPL) Deep Cross-Modal Projection Learning for Image-Text Matching.
Ying Zhang, Huchuan Lu.
[paper] [code]

(ACLws2019_kNN-loss) A Strong and Robust Baseline for Text-Image Matching.
Fangyu Liu, Rongtian Ye.
[paper]

(ICASSP2019_NAA) A Neighbor-aware Approach for Image-text Matching.
Chunxiao Liu, Zhendong Mao, Wenyu Zang, Bin Wang.
[paper]

(CVPR2019_PVSE) Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval.
Yale Song, Mohammad Soleymani.
[paper] [code]

(TOMM2020_Dual-Path) Dual-path Convolutional Image-Text Embeddings with Instance Loss.
Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, Mingliang Xu, YiDong Shen.
[paper] [code]

(AAAI2020_HAL) HAL: Improved Text-Image Matching by Mitigating Visual Semantic Hubs.
Fangyu Liu, Rongtian Ye, Xun Wang, Shuaipeng Li.
[paper] [code]

(AAAI2020_CVSE++) Ladder Loss for Coherent Visual-Semantic Embedding.
Mo Zhou, Zhenxing Niu, Le Wang, Zhanning Gao, Qilin Zhang, Gang Hua.
[paper]

(CVPR2020_MPL) Universal Weighting Metric Learning for Cross-Modal Matching.
Jiwei Wei, Xing Xu, Yang Yang, Yanli Ji, Zheng Wang, Heng Tao Shen.
[paper]

*Un-supervised or Semi-supervised*

(ECCV2018_VSA-AE-MMD) Visual-Semantic Alignment Across Domains Using a Semi-Supervised Approach.
Angelo Carraggi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara.
[paper]

(MM2019_A3VSE) Annotation Efficient Cross-Modal Retrieval with Adversarial Attentive Alignment.
Po-Yao Huang, Guoliang Kang, Wenhe Liu, Xiaojun Chang, Alexander G Hauptmann.
[paper]

*Zero-shot or Fewer-shot*

(CVPR2017_DEM) Learning a Deep Embedding Model for Zero-Shot Learning.
Li Zhang, Tao Xiang, Shaogang Gong.
[paper] [code]

(AAAI2019_GVSE) Few-shot image and sentence matching via gated visual-semantic matching.
Yan Huang, Yang Long, Liang Wang.
[paper]

(ICCV2019_ACMM) ACMM: Aligned Cross-Modal Memory for Few-Shot Image and Sentence Matching.
Yan Huang, Liang Wang.
[paper]

*Adversarial learning*

(MM2017_ACMR) Adversarial Cross-Modal Retrieval.
Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, Heng Tao Shen.
[paper] [code]

(COLING2018_CAS) Learning Visually-Grounded Semantics from Contrastive Adversarial Samples.
Haoyue Shi, Jiayuan Mao, Tete Xiao, Yuning Jiang, Jian Sun.
[paper] [code]

(CVPR2018_GXN) Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models.
Jiuxiang Gu, Jianfei Cai, Shafiq Joty, Li Niu, Gang Wang.
[paper]

(ICCV2019_TIMAM) Adversarial Representation Learning for Text-to-Image Matching.
Nikolaos Sarafianos, Xiang Xu, Ioannis A. Kakadiaris.
[paper]

(CVPR2019_UniVSE) Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations.
Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, Wei-Ying Ma.
[paper]

*Commonsense learning*

(IJCAI2019_SCG) Knowledge Aware Semantic Concept Expansion for Image-Text Matching.
Botian Shi, Lei Ji, Pan Lu, Zhendong Niu, Nan Duan.
[paper]

*Identification learning*

(ICCV2015_LSTM-Q+I) VQA: Visual question answering.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh.
[paper]

(CVPR2016_Word-NN) Learning Deep Representations of Fine-grained Visual Descriptions.
Scott Reed, Zeynep Akata, Bernt Schiele, Honglak Lee.
[paper]

(CVPR2017_GNA-RNN) Person search with natural language description.
Shuang Li, Tong Xiao, Hongsheng Li, Bolei Zhou, Dayu Yue, Xiaogang Wang.
[paper] [code]

(ICCV2017_IATV) Identity-aware textual-visual matching with latent co-attention.
Shuang Li, Tong Xiao, Hongsheng Li, Wei Yang, Xiaogang Wang.
[paper]

(WACV2018_PWM-ATH) Improving text-based person search by spatial matching and adaptive threshold.
Tianlang Chen, Chenliang Xu, Jiebo Luo.
[paper]

(ECCV2018_GLA) Improving deep visual representation for person re-identification by global and local image-language association.
Dapeng Chen, Hongsheng Li, Xihui Liu, Yantao Shen, Jing Shao, Zejian Yuan, Xiaogang Wang.
[paper]

(CVPR2019_DSCMR) Deep Supervised Cross-modal Retrieval.
Liangli Zhen, Peng Hu, Xu Wang, Dezhong Peng.
[paper] [code]

(AAAI2020_PMA) Pose-Guided Multi-Granularity Attention Network for Text-Based Person Search.
Ya Jing, Chenyang Si, Junbo Wang, Wei Wang, Liang Wang, Tieniu Tan.
[paper]

*Related works*

(Machine Learning 2010) Large scale image annotation: learning to rank with joint word-image embeddings.
Jason Weston, Samy Bengio, Nicolas Usunier.
[paper]

(NIPS2013_Word2Vec) Distributed Representations of Words and Phrases and their Compositionality.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean.
[paper]

(CVPR2017_DVSQ) Deep Visual-Semantic Quantization for Efficient Image Retrieval.
Yue Cao, Mingsheng Long, Jianmin Wang, Shichen Liu.
[paper]

(ACL2018_ILU) Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search.
Jamie Kiros, William Chan, Geoffrey Hinton.
[paper]

(AAAI2018_VSE-ens) VSE-ens: Visual-Semantic Embeddings with Efficient Negative Sampling.
Guibing Guo, Songlin Zhai, Fajie Yuan, Yuan Liu, Xingwei Wang.
[paper]

(ECCV2018_HTG) An Adversarial Approach to Hard Triplet Generation.
Yiru Zhao, Zhongming Jin, Guo-jun Qi, Hongtao Lu, Xian-sheng Hua.
[paper]

(ECCV2018_WebNet) CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images.
Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R. Scott, Dinglong Huang.
[paper] [code]

(CVPR2018_BUTD) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering.
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang.
[paper] [code]

(CVPR2018_DML) Deep Mutual Learning.
Ying Zhang, Tao Xiang, Timothy M. Hospedales, Huchuan Lu.
[paper] [code]

(EMNLP2019_GMMR) Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations.
Po-Yao Huang, Xiaojun Chang, Alexander Hauptmann.
[paper]

(EMNLP2019_MIMSD) Unsupervised Discovery of Multimodal Links in Multi-Image, Multi-Sentence Documents.
Jack Hessel, Lillian Lee, David Mimno.
[paper] [code]

(ICCV2019_DRNet) Fashion Retrieval via Graph Reasoning Networks on a Similarity Pyramid.
Zhanghui Kuang, Yiming Gao, Guanbin Li, Ping Luo, Yimin Chen, Liang Lin, Wayne Zhang.
[paper]

(ICCV2019_Align2Ground) Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment.
Samyak Datta, Karan Sikka, Anirban Roy, Karuna Ahuja, Devi Parikh, Ajay Divakaran.
[paper]

(CVPR2019_TIRG) Composing Text and Image for Image Retrieval - An Empirical Odyssey.
Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, James Hays.
[paper]

(SIGIR2019_PAICM) Prototype-guided Attribute-wise Interpretable Scheme for Clothing Matching.
Xianjing Han, Xuemeng Song, Jianhua Yin, Yinglong Wang, Liqiang Nie.
[paper]

(SIGIR2019_NCR) Neural Compatibility Ranking for Text-based Fashion Matching.
Suthee Chaidaroon, Mix Xie, Yi Fang, Alessandro Magnani.
[paper]

(ECCV2020_InfoNCE) Contrastive Learning for Weakly Supervised Phrase Grounding.
Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, Derek Hoiem.
[paper] [code]

Posted in


(ECCV2020) Adaptive Offline Quintuplet Loss for Image-Text Matching.
Tianlang Chen, Jiajun Deng, Jiebo Luo.
[paper] [code]

(ECCV2020) Learning Joint Visual Semantic Matching Embeddings for Language-guided Retrieval.
Yanbei Chen, Loris Bazzani.
[paper]

(ECCV2020) Consensus-Aware Visual-Semantic Embedding for Image-Text Matching.
Haoran Wang, Ying Zhang, Zhong Ji, Yanwei Pang, Lin Ma.
[paper] [code]

(ECCV2020) Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval.
Christopher Thomas, Adriana Kovashka.
[paper] [code]

(COLING2020) Probing Multimodal Embeddings for Linguistic Properties: the Visual-Semantic Case.
Adam Dahlgren Lindström, Suna Bensch, Johanna Björklund, Frank Drewes.
[paper] [code]
