commit synclm code #303

Open · wants to merge 1 commit into base: master
73 changes: 71 additions & 2 deletions NLP/ACL2022-SynCLM/README.md
@@ -1,4 +1,73 @@
SynCLM
====
Code for Findings of ACL 2022 long paper: [Syntax-guided Contrastive Learning for Pre-trained Language Model](https://aclanthology.org/2022.findings-acl.191/)




Abstract
---
Syntactic information has proven useful for transformer-based pre-trained language models. Previous studies often rely on additional syntax-guided attention components to enhance the transformer, which require more parameters and additional syntactic parsing in downstream tasks. This increase in complexity severely limits the application of syntax-enhanced language models in a wide range of scenarios. To inject syntactic knowledge into pre-trained language models effectively and efficiently, we propose a novel syntax-guided contrastive learning method that does not change the transformer architecture. Based on the constituency and dependency structures of syntax trees, we design phrase-guided and tree-guided contrastive objectives and optimize them in the pre-training stage, helping the pre-trained language model capture rich syntactic knowledge in its representations. Experimental results show that our contrastive method achieves consistent improvements on a variety of tasks, including grammatical error detection, entity tasks, structural probing, and GLUE. Detailed analysis further verifies that the improvements come from the utilization of syntactic information, and that the learned attention weights are more linguistically interpretable.


![SynCLM](images/framework.png#pic_center)
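For intuition, both the phrase-guided and the tree-guided objectives are contrastive (InfoNCE-style) losses computed over syntax-derived positive and negative pairs. Below is a minimal, framework-agnostic sketch of such a loss; it is an illustration only, not the exact objectives used in pre-training, which are implemented with PaddlePaddle in this repository.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for a single anchor.

    anchor, positive: 1-D vectors; negatives: iterable of 1-D vectors.
    Simplified illustration only; the paper's phrase-guided and
    tree-guided losses build their positive/negative pairs from
    constituency and dependency structures.
    """
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    # Similarity of the anchor to the positive (index 0) and to the negatives.
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits = logits / temperature

    # Cross-entropy with the positive pair as the target class (log-sum-exp trick).
    m = logits.max()
    log_sum_exp = m + np.log(np.exp(logits - m).sum())
    return log_sum_exp - logits[0]

# Toy usage with random vectors standing in for token/phrase representations.
rng = np.random.default_rng(0)
loss = info_nce(rng.normal(size=768), rng.normal(size=768), rng.normal(size=(8, 768)))
print(loss)
```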



Dependencies
---
python3.7.4\
cuda-10.1\
cudnn_v7.6\
nccl2.4.2\
java1.8\
paddlepaddle-gpu2.1.2\
stanza1.2\
numpy1.20.2



Pre-trained Models
---
SynCLM is trained on top of RoBERTa. Users can download the Paddle version of the RoBERTa model with the following commands:

```shell
cd /path/to/model_files
# download base model
sh ./download_roberta_base_en.sh
# or download large model
# sh ./download_roberta_large_en.sh
cd -
```
To obtain the syntactic structures of the text, we use [Stanza](https://github.com/stanfordnlp/stanza) to preprocess the pre-training data (English Wikipedia and BookCorpus). Input examples are provided in the `/path/to/data/pretrain` directory.
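For illustration, dependency structures can be obtained with Stanza roughly as follows. This is a minimal sketch assuming the standard Stanza Python API; the actual preprocessing pipeline and output format follow the examples under `/path/to/data/pretrain`.

```python
import stanza

# Download the English models once (requires network access).
stanza.download('en')

# Tokenization, POS tagging, lemmatization and dependency parsing.
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')

doc = nlp("SynCLM injects syntactic knowledge into pre-trained language models.")
for sent in doc.sentences:
    for word in sent.words:
        # word.head is the 1-based index of the governing word (0 means root).
        print(word.id, word.text, word.head, word.deprel)
```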

After preparing the data, you can run the following command for training:
```shell
cd /path/to
# base model
sh ./script/roberta_base_en/run.sh
# or large model
# sh ./script/roberta_large_en/run.sh
```
After pre-training the model, users can fine-tune it on downstream tasks with the following commands:
```shell
# classification
python ./src/run_classifier.py
# regression
python ./src/run_regression.py
```


Citation
---
If you find our paper and code useful, please cite the following paper:
```
@inproceedings{zhang2022syntax,
  title={Syntax-guided Contrastive Learning for Pre-trained Language Model},
  author={Zhang, Shuai and Wang, Lijie and Xiao, Xinyan and Wu, Hua},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
  pages={2430--2440},
  year={2022}
}
```

1,000 changes: 1,000 additions & 0 deletions NLP/ACL2022-SynCLM/data/pretrain/demo_input


1 change: 1 addition & 0 deletions NLP/ACL2022-SynCLM/data/pretrain/train_filelist
@@ -0,0 +1 @@
./data/pretrain/demo_input
1 change: 1 addition & 0 deletions NLP/ACL2022-SynCLM/data/pretrain/valid_filelist
@@ -0,0 +1 @@
./data/pretrain/demo_input
25 changes: 25 additions & 0 deletions NLP/ACL2022-SynCLM/env_local/env.sh
@@ -0,0 +1,25 @@
#!/usr/bin/env bash
set -x
# Add the CUDA library path to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/home/work/cuda-10.1_cudnn7.6.5/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export LD_LIBRARY_PATH=/home/work/cuda-10.1_cudnn7.6.5/extras/CUPTI/lib64:$LD_LIBRARY_PATH
# Add the cuDNN library path to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/home/work/cudnn/cudnn_v7.6/cuda/lib64:$LD_LIBRARY_PATH
# Download NCCL first, then add the NCCL library path to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/home/work/nccl2.4.2_cuda10.1/lib:$LD_LIBRARY_PATH
# If FLAGS_sync_nccl_allreduce is 1, cudaStreamSynchronize(nccl_stream) is called in allreduce_op_handle; this mode can give better performance in some cases
export FLAGS_sync_nccl_allreduce=1
# Fraction of the total available GPU memory to allocate, in the range [0, 1]
export FLAGS_fraction_of_gpu_memory_to_use=1
# Whether to use garbage collection to optimize the network's memory usage; <0 disables it, >=0 enables it
export FLAGS_eager_delete_tensor_gb=1.0
# Whether to use the fast garbage-collection strategy
export FLAGS_fast_eager_deletion_mode=1
# Fraction of variable memory released by the garbage-collection strategy, in the range [0.0, 1.0]
export FLAGS_memory_fraction_of_eager_deletion=1

export iplist=`hostname -i`
#http_proxy
unset http_proxy
unset https_proxy
set +x
Binary file added NLP/ACL2022-SynCLM/images/framework.png
14 changes: 14 additions & 0 deletions NLP/ACL2022-SynCLM/model_files/config/roberta_base_en.json
@@ -0,0 +1,14 @@
{
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"max_position_embeddings": 514,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 0,
"sent_type_vocab_size": 0,
"task_type_vocab_size": 0,
"vocab_size": 50265
}
14 changes: 14 additions & 0 deletions NLP/ACL2022-SynCLM/model_files/config/roberta_large_en.json
@@ -0,0 +1,14 @@
{
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"initializer_range": 0.02,
"max_position_embeddings": 514,
"num_attention_heads": 16,
"num_hidden_layers": 24,
"type_vocab_size": 0,
"sent_type_vocab_size": 0,
"task_type_vocab_size": 0,
"vocab_size": 50265
}
