This is the official source code for the paper CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval (ECCV 2022).
Abstract: Image-Text Retrieval (ITR) is challenging in bridging visual and lingual modalities. Contrastive learning has been adopted by most prior arts. Besides the limited number of negative image-text pairs, the capability of contrastive learning is restricted by manually weighting negative pairs as well as unawareness of external knowledge. In this paper, we propose our novel Coupled Diversity-Sensitive Momentum Contrastive Learning (CODER) for improving cross-modal representation. Firstly, a novel diversity-sensitive contrastive learning (DCL) architecture is invented. We introduce dynamic dictionaries for both modalities to enlarge the scale of image-text pairs, and diversity-sensitiveness is achieved by adaptive negative pair weighting. Furthermore, two branches are designed in CODER. One learns instance-level embeddings from image/text, and it also generates pseudo online clustering labels for its input image/text based on their embeddings. Meanwhile, the other branch learns to query from a commonsense knowledge graph to form concept-level descriptors for both modalities. Afterwards, both branches leverage DCL to align the cross-modal embedding spaces, while an extra pseudo clustering label prediction loss is utilized to promote concept-level representation learning for the second branch. Extensive experiments conducted on two popular benchmarks, i.e. MSCOCO and Flickr30K, validate that CODER remarkably outperforms the state-of-the-art approaches.
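The abstract mentions momentum and dynamic dictionaries for both modalities. As a rough illustration only, the sketch below shows the generic MoCo-style recipe these terms usually refer to: an exponential-moving-average update of a key encoder, plus a fixed-size FIFO queue of past embeddings used as additional negatives. This is an assumption about the general technique, not the code in this repo; `momentum_update`, `EmbeddingQueue`, the momentum value, and the queue size are hypothetical names and values chosen for illustration. The adaptive negative pair weighting is not shown.

```python
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """EMA update of the key encoder: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class EmbeddingQueue:
    """Fixed-size FIFO dictionary of past embeddings (extra negatives)."""

    def __init__(self, max_size):
        # deque(maxlen=...) silently drops the oldest entries when full.
        self.items = deque(maxlen=max_size)

    def enqueue(self, batch):
        self.items.extend(batch)

    def negatives(self):
        return list(self.items)


# Toy usage with 2-d "embeddings" and a queue that holds 4 entries.
queue = EmbeddingQueue(max_size=4)
queue.enqueue([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
queue.enqueue([[0.7, 0.8], [0.9, 1.0]])  # the oldest entry is evicted

# Toy "parameters": the key encoder moves slowly towards the query encoder.
key_params = momentum_update([1.0, 0.0], [0.0, 1.0], m=0.9)
```

In such a setup one queue would be kept per modality, so that each image (or text) query can be contrasted against many more negatives than a single mini-batch provides.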
The results on the MSCOCO and Flickr30K datasets:
| Dataset | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 | R@sum |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MSCOCO | 82.1 | 96.6 | 98.8 | 65.5 | 91.5 | 96.2 | 530.6 |
| Flickr30K | 83.2 | 96.5 | 98.0 | 63.1 | 87.1 | 93.0 | 520.9 |
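For reference, R@K is commonly computed as the percentage of queries whose first correct match appears among the top K retrieved items, and R@sum is the sum of the six R@K values. The pure-Python sketch below works under that assumption; the helper names and the toy similarity matrix are hypothetical and are not taken from the evaluation script in this repo.

```python
def rank_of_first_match(scores, relevant):
    """0-based rank of the best-scoring relevant candidate for one query."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return min(order.index(j) for j in relevant)


def recall_at_k(ranks, k):
    """Percentage of queries whose first correct match is in the top k."""
    return 100.0 * sum(r < k for r in ranks) / len(ranks)


# Toy similarity matrix: 3 image queries x 3 captions,
# where caption i is the only correct match for image i.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]
ranks = [rank_of_first_match(row, {i}) for i, row in enumerate(sim)]
r_at_1 = recall_at_k(ranks, 1)
r_at_2 = recall_at_k(ranks, 2)

# R@sum adds up the three image-to-text and three text-to-image recalls.
flickr30k_row = [83.2, 96.5, 98.0, 63.1, 87.1, 93.0]
rsum = round(sum(flickr30k_row), 1)  # agrees with the R@sum column for Flickr30K
```

On datasets with several captions per image, `relevant` would contain all caption indices that belong to the query image.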
You can configure the running environment with:

    pip install -r requirements.txt

We recommend the following dependencies:
- Python 3.7
- NumPy 1.19
- PyTorch 1.8
- transformers 2.1.0
- TensorBoard
- torchtext 0.4.0
- torchvision 0.9.0
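To compare an existing environment against the list above, a small check like the following may help. It is only a sketch: the `RECOMMENDED` table and the `installed_versions` helper are hypothetical names introduced here, and the import names (`numpy`, `torch`, `tensorboard`, and so on) are the usual module names for these packages.

```python
import importlib
import sys

# Versions recommended in this README (import name -> recommended version).
RECOMMENDED = {
    "numpy": "1.19",
    "torch": "1.8",
    "transformers": "2.1.0",
    "torchtext": "0.4.0",
    "torchvision": "0.9.0",
    "tensorboard": None,  # no specific version is listed
}


def installed_versions(modules):
    """Return {module: version string, or None if it cannot be imported}."""
    found = {}
    for name in modules:
        try:
            mod = importlib.import_module(name)
            found[name] = str(getattr(mod, "__version__", "unknown"))
        except Exception:  # treat any import failure as "not installed"
            found[name] = None
    return found


print("Python", ".".join(map(str, sys.version_info[:3])), "(recommended: 3.7)")
for name, version in installed_versions(RECOMMENDED).items():
    wanted = RECOMMENDED[name] or "any"
    print(f"{name}: installed={version}, recommended={wanted}")
```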
Download the dataset files. We use the image features created by SCAN, downloaded here. All the data needed to reproduce the experiments in the paper, including image features, text, vocabularies, and concept annotation files, can be downloaded from:

    wget "https://pan.baidu.com/s/1ATcSpcOxn6CJCHvYL0ap-A?pwd=duxh"

The checkpoints of our trained models can be downloaded from:

    wget "https://pan.baidu.com/s/1otO_LB5RSNH235HNkYJZvQ?pwd=qp7b"

Extract runs.tar.gz to get the trained model files for the Flickr30K dataset, and put the extracted folder runs in the root directory.
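The extraction step can also be done from Python with the standard `tarfile` module. This is a minimal sketch that assumes the archive has been saved as `runs.tar.gz` in the repository root and that its top-level folder is `runs`; the `extract_runs` helper is a hypothetical name, not part of this repo.

```python
import tarfile
from pathlib import Path


def extract_runs(archive="runs.tar.gz", dest="."):
    """Unpack the checkpoint archive so that a runs/ folder appears under dest."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=dest)
    return Path(dest) / "runs"


# Only extract when the archive is actually present in the current directory.
if Path("runs.tar.gz").exists():
    print("Checkpoints extracted to", extract_runs())
```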
- Train on the MSCOCO dataset:

      python train_mine_coco_CODER.py

- Train on the Flickr30K dataset:

      python train_mine_f30k_CODER.py

- Test on the Flickr30K dataset:

      python eval_mine_f30k_CODER.py

If this repo is useful for your research, please cite our paper:
@inproceedings{wang2022coder,
title={Coder: Coupled diversity-sensitive momentum contrastive learning for image-text retrieval},
author={Wang, Haoran and He, Dongliang and Wu, Wenhao and Xia, Boyang and Yang, Min and Li, Fu and Yu, Yunlong and Ji, Zhong and Ding, Errui and Wang, Jingdong},
booktitle={European conference on computer vision},
pages={700--716},
year={2022},
organization={Springer}
}
@inproceedings{Wang2020CVSE,
title={Consensus-Aware Visual-Semantic Embedding for Image-Text Matching},
author={Wang, Haoran and Zhang, Ying and Ji, Zhong and Pang, Yanwei and Ma, Lin},
booktitle={ECCV},
year={2020}
}
