From 97f2f30a30415128d4a36cd3777e1c3d61c36107 Mon Sep 17 00:00:00 2001
From: weili-nlp
Date: Fri, 20 May 2022 18:04:15 +0800
Subject: [PATCH] open release code for UNIMO-2

---
 NLP/UNIMO-2/README-md-bak | 216 ++++++++++++++++++++++++++++++++++++++
 NLP/UNIMO-2/README.md     |  41 +-------
 2 files changed, 217 insertions(+), 40 deletions(-)
 create mode 100644 NLP/UNIMO-2/README-md-bak

diff --git a/NLP/UNIMO-2/README-md-bak b/NLP/UNIMO-2/README-md-bak
new file mode 100644
index 00000000..18ec6f58
--- /dev/null
+++ b/NLP/UNIMO-2/README-md-bak
@@ -0,0 +1,216 @@
+UNIMO-2
+====
+Code for the Findings of ACL 2022 long paper [UNIMO-2: End-to-End Unified Vision-Language Grounded Learning](https://arxiv.org/pdf/2203.09067.pdf)
+
+
+Abstract
+---
+
+Vision-Language Pre-training (VLP) has achieved impressive performance on various cross-modal downstream tasks.
+However, most existing methods can only learn from aligned image-caption data and rely heavily on expensive regional
+features, which greatly limits their scalability and performance. In this paper, we propose an end-to-end unified-modal
+pre-training framework, namely UNIMO-2, for joint learning on both aligned image-caption data and unaligned image-only
+and text-only corpora. We build a unified Transformer model to jointly learn visual representations, textual
+representations and semantic alignment between images and texts. In particular, we propose to conduct grounded learning
+on both images and texts via a shared grounded space, which helps bridge unaligned images and texts, and align the
+visual and textual semantic spaces on different types of corpora. The experiments show that our grounded learning
+method can improve textual and visual semantic alignment, thereby improving performance on various cross-modal tasks.
+Moreover, benefiting from effective joint modeling of different types of corpora, our model also achieves impressive
+performance on single-modal visual and textual tasks. Our code and models are publicly available at the UNIMO project
+page: https://unimo-ptm.github.io
+
+![UNIMO-2](images/paper.png#pic_center)
+
+
+
+Dependencies
+---
+python3.7.4\
+cuda-10.1\
+cudnn_v7.6\
+nccl2.4.2\
+java1.8\
+paddlepaddle-gpu==2.1.2\
+pyrouge==0.1.3
+
+
+Pre-trained Models
+---
+Similar to UNIMO, UNIMO-2 adopts large-scale text corpora, image collections and image-text aligned datasets as its pre-training data.
+We provide the pre-trained UNIMO-2 model:
+
+```
+cd /path/to/model_files
+wget --no-check-certificate -q https://unimo-2.bj.bcebos.com/model/UNIMO-2.tar.gz
+tar -zxf UNIMO-2.tar.gz
+```
+
+
+Experiments
+---
+
+Our fine-tuning experiments are carried out on V100 GPUs.
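+
+Before launching the fine-tuning scripts below, it may help to sanity-check that the
+pre-trained checkpoint archive was downloaded and extracted correctly. The following is a
+minimal Python sketch, not part of the released scripts, and the archive path is an
+assumption based on the "Pre-trained Models" section above:
+
+```
+import tarfile
+
+# List the first few entries of the downloaded archive without extracting it.
+# The path assumes the download location used in the Pre-trained Models section.
+with tarfile.open("/path/to/model_files/UNIMO-2.tar.gz", "r:gz") as tf:
+    for name in tf.getnames()[:10]:
+        print(name)
+```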
+Here are the results of the UNIMO-2 model:
+
+
+1 Cross-Modal Tasks
+---
+
+
+### (1) Image-Text Retrieval
+
+#### Download the Flickr30k dataset:
+
+```
+cd /path/to/data
+wget --no-check-certificate -q https://unimo-2.bj.bcebos.com/data/Flickr30k.tar.gz
+tar -zxf Flickr30k.tar.gz
+```
+
+#### Run the following command to train and evaluate on the Flickr30k dataset:
+
+```
+bash ./script/retrieval-grounded/Flickr30k-fleet/run.sh
+```
+
+#### Evaluation Results:
+
+R@K denotes Recall@K, the percentage of queries whose ground-truth item is ranked within the
+top K retrieved results (a minimal reference sketch is given in the appendix at the end of this file).
+
+Results of the Image Retrieval task on the Flickr30k dataset
+
+| Model | R@1 | R@5 | R@10 |
+| ----------- | ------- | ------- | ------- |
+| UNIMO-2 (zero-shot) | 72.70 | 91.18 | 94.60 |
+| UNIMO-2 (finetuned) | 80.14 | 95.58 | 97.75 |
+
+Results of the Text Retrieval task on the Flickr30k dataset
+
+| Model | R@1 | R@5 | R@10 |
+| ----------- | ------- | ------- | ------- |
+| UNIMO-2 (zero-shot) | 88.46 | 96.84 | 98.92 |
+| UNIMO-2 (finetuned) | 92.01 | 99.31 | 99.51 |
+
+
+
+### (2) Image Caption Generation
+
+#### Download the COCO Caption dataset:
+
+```
+cd /path/to/data
+wget --no-check-certificate -q https://unimo-2.bj.bcebos.com/data/coco.tar.gz
+tar -zxf coco.tar.gz
+```
+
+#### Download the evaluation script:
+
+```
+mkdir -p src/eval/tasks
+cd src/eval/tasks
+wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/coco.tar.gz
+tar -zxf coco.tar.gz
+```
+
+#### Run the following command to train and evaluate on the COCO Caption dataset:
+
+```
+bash ./script/img2txt-grounded/coco-oscar/run.sh
+```
+
+
+#### Evaluation Results:
+
+| Model | BLEU4 | CIDEr |
+| ----------- | ------- | ------- |
+| UNIMO-2 | 39.7 | 131.2 |
+
+
+
+### (3) Visual Entailment
+#### TODO
+
+### (4) Visual Question Answering (VQA)
+#### TODO
+
+
+
+
+
+2 Visual Tasks
+---
+
+### (1) Image Classification
+#### TODO
+
+### (2) Zero-shot Image Classification
+#### TODO
+
+
+
+3 Textual Tasks
+---
+
+### (1) Natural Language Inference
+
+#### Download the MNLI-AX dataset:
+```
+cd /path/to/data
+wget --no-check-certificate -q https://unimo-2.bj.bcebos.com/data/MNLI-AX.tar.gz
+tar -zxf MNLI-AX.tar.gz
+```
+
+#### Run the following command to train and evaluate on the MNLI-AX dataset:
+
+```
+bash ./script/classification/MNLI-AX/run.sh
+```
+
+
+#### Evaluation Results:
+
+| Model | Acc (m/mm) |
+| ----------- | ------- |
+| UNIMO-2 | 87.5/87.5 |
+
+
+
+
+### (2) Sentiment Classification
+#### TODO
+
+
+
+
+
+### (3) Similarity Tasks
+#### TODO
+
+
+
+
+
+### (4) Linguistic Acceptability Judgments
+#### TODO
+
+
+
+
+
+Citation
+---
+If you find our paper and code useful, please cite the following paper:
+```
+@article{li2022unimo,
+  title={UNIMO-2: End-to-End Unified Vision-Language Grounded Learning},
+  author={Li, Wei and Gao, Can and Niu, Guocheng and Xiao, Xinyan and Liu, Hao and Liu, Jiachen and Wu, Hua and Wang, Haifeng},
+  journal={arXiv preprint arXiv:2203.09067},
+  year={2022}
+}
+```
+
+Contact Information
+---
+
+For help or issues using UNIMO-2, please submit a GitHub issue.
+
+For personal communication related to UNIMO-2, please contact Wei Li (liwei85@baidu.com), Can Gao (gaocan01@baidu.com), or Guocheng Niu (niuguocheng@baidu.com).
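+
+
+Appendix: Recall@K Reference
+---
+The retrieval tables above report R@1/R@5/R@10. The sketch below is a minimal reference for
+how Recall@K can be computed from a query-by-gallery similarity matrix. It is not part of the
+released code, and it assumes a single ground-truth item per query; on Flickr30k text
+retrieval each image actually has five reference captions, and retrieving any of them counts
+as a hit:
+
+```
+import numpy as np
+
+def recall_at_k(sim, ks=(1, 5, 10)):
+    """sim[i, j] scores query i against gallery item j; item i is the ground truth."""
+    # Gallery indices sorted by descending similarity for each query.
+    order = np.argsort(-sim, axis=1)
+    # Rank position of the ground-truth item for each query (0 = retrieved first).
+    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(sim.shape[0])])
+    return {f"R@{k}": 100.0 * float(np.mean(ranks < k)) for k in ks}
+
+# Toy check: random scores with a boosted diagonal so matched pairs tend to rank first.
+rng = np.random.default_rng(0)
+sim = rng.normal(size=(100, 100))
+sim[np.arange(100), np.arange(100)] += 3.0
+print(recall_at_k(sim))
+```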
diff --git a/NLP/UNIMO-2/README.md b/NLP/UNIMO-2/README.md
index 18ec6f58..52817bab 100644
--- a/NLP/UNIMO-2/README.md
+++ b/NLP/UNIMO-2/README.md
@@ -124,30 +124,9 @@ bash ./script/img2txt-grounded/coco-oscar/run.sh
 
 
 
-### (3) Visual Entailment
-#### TODO
-
-### (4) Visual Question Answering (VQA)
-#### TODO
-
-
-
-
-
-2 Visual Tasks
----
-
-### (1) Image Classification
-#### TODO
-
-### (2) Zero-shot Image Classification
-#### TODO
-
-
-
-3 Textual Tasks
+2 Textual Tasks
 ---
 
 ### (1) Natural Language Inference
@@ -175,24 +154,6 @@ bash ./script/classification/MNLI-AX/run.sh
 
 
 
-### (2) Sentiment Classification
-#### TODO
-
-
-
-
-
-### (3) Similarity Tasks
-#### TODO
-
-
-
-
-
-### (4) Linguistic Acceptability Judgments
-#### TODO
-
-