Merge pull request #18 from airaria/v0.2.0dev
v0.2.0dev
airaria authored Jul 30, 2020
2 parents 6883348 + 4ea0515 commit 000f4af
Showing 27 changed files with 1,280 additions and 55 deletions.
16 changes: 11 additions & 5 deletions README.md
@@ -28,8 +28,14 @@ Check our paper through [ACL Anthology](https://www.aclweb.org/anthology/2020.ac

## Update

**Jul 29, 2020**

* **Updated to 0.2.0**:
* Added support for distributed data-parallel training with `DistributedDataParallel`: `TrainingConfig` now accepts the `local_rank` argument. See the documentation of `TrainingConfig` for details, and the sketch below.
* Added an example of distillation on the Chinese NER task to demonstrate distributed data-parallel training. See [examples/msra_ner_example](examples/msra_ner_example).
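
For illustration, a minimal sketch of how the new option might be wired up under `torch.distributed.launch` (single node, multi-GPU); the distiller setup itself (models, adaptors, dataloader) is omitted, and the exact keyword set of `TrainingConfig` should be checked against the documentation:

```python
import argparse
import torch
import textbrewer

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to each spawned process.
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

if args.local_rank != -1:
    torch.cuda.set_device(args.local_rank)
    # Relies on the rendezvous environment variables set by the launcher.
    torch.distributed.init_process_group(backend="nccl")

# Passing local_rank lets the distiller use DistributedDataParallel internally.
train_config = textbrewer.TrainingConfig(local_rank=args.local_rank)
```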

**Jul 14, 2020**
* Updated to 0.1.10:
* **Updated to 0.1.10**:
* Now supports mixed precision training with Apex! Just set `fp16` to `True` in `TrainingConfig`. See the documentation of `TrainingConfig` for details.
* Added the `data_parallel` option in `TrainingConfig` so that data-parallel training and mixed precision training can work together (see the sketch below).
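
As a quick illustration (a sketch only; see the `TrainingConfig` documentation for the full argument list), enabling both options might look like:

```python
import textbrewer

# Mixed precision via Apex together with single-process data parallelism;
# all other TrainingConfig arguments keep their defaults here.
train_config = textbrewer.TrainingConfig(fp16=True, data_parallel=True)
```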

@@ -41,17 +47,17 @@ Check our paper through [ACL Anthology](https://www.aclweb.org/anthology/2020.ac

**Apr 22, 2020**

* Updated to 0.1.9 (added cache option which speeds up distillation; fixed some bugs). See details in [releases](https://github.com/airaria/TextBrewer/releases/tag/v0.1.9).
* **Updated to 0.1.9** (added cache option which speeds up distillation; fixed some bugs). See details in [releases](https://github.com/airaria/TextBrewer/releases/tag/v0.1.9).
* Added experimental results for distilling Electra-base to Electra-small on Chinese tasks.
* TextBrewer has been accepted by [ACL 2020](http://acl2020.org) as a demo paper; please use our new [bib entry](#Citation).

**Mar 17, 2020**

* Added CoNLL-2003 English NER distillation example, see [examples/conll2003_example](examples/conll2003_example).
* Added CoNLL-2003 English NER distillation example. See [examples/conll2003_example](examples/conll2003_example).

**Mar 11, 2020**

* Updated to 0.1.8 (Improvements on TrainingConfig and train method). See details in [releases](https://github.com/airaria/TextBrewer/releases/tag/v0.1.8).
* **Updated to 0.1.8** (Improvements on TrainingConfig and train method). See details in [releases](https://github.com/airaria/TextBrewer/releases/tag/v0.1.8).

**Mar 2, 2020**

@@ -398,7 +404,7 @@ We recommend that users use pre-trained student models whenever possible to full

## Known Issues

* Multi-GPU training support is only available through `DataParallel` currently.
* ~~Multi-GPU training support is only available through `DataParallel` currently.~~

## Citation

8 changes: 7 additions & 1 deletion README_ZH.md
@@ -29,6 +29,12 @@

## Update

**Jul 29, 2020**

* **Updated to 0.2.0**:
* Added support for distributed data-parallel training: enable it by passing the appropriate `local_rank` to `TrainingConfig`. See the documentation of `TrainingConfig` for details.
* Added a usage example of distributed data-parallel training: distillation of an ELECTRA-base model on the Chinese NER task. See [examples/msra_ner_example](examples/msra_ner_example).

**Jul 14, 2020**
* **Updated to 0.1.10**:
* Added support for mixed precision training with Apex: enable it by setting `fp16=True` in `TrainingConfig`. See the documentation of `TrainingConfig` for details.
@@ -389,7 +395,7 @@ The distiller performs the actual distillation process. The following distillers are currently implemented:

## Known Issues

* Multi-GPU training strategies other than DataParallel are not yet supported.
* ~~Multi-GPU training strategies other than DataParallel are not yet supported.~~

## Citation

2 changes: 1 addition & 1 deletion examples/cmrc2018_example/README.md
@@ -7,7 +7,7 @@ This example demonstrates distillation on the CMRC 2018 task, and using DRCD dataset
* run_cmrc2018_distill_T3.sh : distills the teacher to T3 with CMRC 2018 and DRCD datasets.
* run_cmrc2018_distill_T4tiny.sh : distills the teacher to T4tiny with CMRC 2018 and DRCD datasets.

Modify the following variables in the shell scripts before running:
Set the following variables in the shell scripts before running:

* BERT_DIR : where RoBERTa-wwm-base is stored, including vocab.txt, pytorch_model.bin, bert_config.json
* OUTPUT_ROOT_DIR : this directory stores logs and trained model weights
2 changes: 1 addition & 1 deletion examples/cmrc2018_example/README_ZH.md
@@ -6,7 +6,7 @@
* run_cmrc2018_distill_T3.sh : distills the teacher model to T3 on the CMRC 2018 and DRCD datasets
* run_cmrc2018_distill_T4tiny.sh : distills the teacher model to T4-tiny on the CMRC 2018 and DRCD datasets

Before running the scripts, please modify the relevant variables according to your environment
Before running the scripts, please set the relevant variables according to your environment

* BERT_DIR : the directory containing the RoBERTa-wwm-base model, including vocab.txt, pytorch_model.bin, bert_config.json
* OUTPUT_ROOT_DIR : stores the trained model weights and logs
2 changes: 1 addition & 1 deletion examples/conll2003_example/README_ZH.md
@@ -10,7 +10,7 @@
* Transformers
* seqeval

Before running the scripts, please modify the relevant variables according to your environment
Before running the scripts, please set the relevant variables according to your environment

* BERT_MODEL : the directory containing the BERT-base model, including vocab.txt, pytorch_model.bin, config.json
* OUTPUT_DIR : stores the trained model weights
2 changes: 1 addition & 1 deletion examples/mnli_example/README.md
@@ -6,7 +6,7 @@ This example demonstrates distillation on the MNLI task.
* run_mnli_distill_T4tiny.sh : distills the teacher to T4tiny.
* run_mnli_distill_multiteacher.sh : runs multi-teacher distillation, distilling several teacher models into a student model.

Modify the following variables in the shell scripts before running:
Set the following variables in the shell scripts before running:

* BERT_DIR : where BERT-base-cased is stored, including vocab.txt, pytorch_model.bin, bert_config.json
* OUTPUT_ROOT_DIR : this directory stores logs and trained model weights
2 changes: 1 addition & 1 deletion examples/mnli_example/README_ZH.md
@@ -6,7 +6,7 @@
* run_mnli_distill_T4tiny.sh : distills the teacher model to T4tiny on MNLI
* run_mnli_distill_multiteacher.sh : performs multi-teacher distillation, compressing several teacher models into one student model

Before running the scripts, please modify the relevant variables according to your environment
Before running the scripts, please set the relevant variables according to your environment

* BERT_DIR : the directory containing the BERT-base-cased model, including vocab.txt, pytorch_model.bin, bert_config.json
* OUTPUT_ROOT_DIR : stores the trained models and logs
25 changes: 25 additions & 0 deletions examples/msra_ner_example/README.md
@@ -0,0 +1,25 @@
[**中文说明**](README_ZH.md) | [**English**](README.md)

This example demonstrates distilling a [Chinese-ELECTRA-base](https://github.com/ymcui/Chinese-ELECTRA) model on the MSRA NER task with **distributed data-parallel training** (single node, multi-GPU).


* ner_ElectraTrain_dist.sh : trains a teacher model (Chinese-ELECTRA-base) on MSRA NER.
* ner_ElectraDistill_dist.sh : distills the teacher to an ELECTRA-small model.


Set the following variables in the shell scripts before running:

* ELECTRA_DIR_BASE : where Chinese-ELECTRA-base is located; should include vocab.txt, pytorch_model.bin and config.json.

* OUTPUT_DIR : this directory stores the logs and the trained model weights.
* DATA_DIR : contains the MSRA NER dataset:
* msra_train_bio.txt
* msra_test_bio.txt

For distillation:

* ELECTRA_DIR_SMALL : where the pretrained Chinese-ELECTRA-small weights are located; should include pytorch_model.bin. This is optional: if you don't provide the ELECTRA-small weights, the student model will be initialized randomly.
* student_config_file : the model config file (i.e., config.json) for the student. Usually it should be in $\{ELECTRA_DIR_SMALL\}.
* trained_teacher_model_file : the ELECTRA-base teacher model that has been fine-tuned.

The scripts have been tested under **PyTorch==1.2, Transformers==2.8**.
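
For reference, a rough sketch of the per-process setup such a distributed script performs before training (the variable names are illustrative, not necessarily those used in the example code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

local_rank = 0  # in practice read from --local_rank, which torch.distributed.launch supplies
torch.cuda.set_device(local_rank)
# Requires the rendezvous environment variables set by the launcher.
torch.distributed.init_process_group(backend="nccl")

# Placeholder dataset; the example builds its features from msra_train_bio.txt instead.
train_dataset = TensorDataset(torch.zeros(8, 2, dtype=torch.long))

# DistributedSampler gives each process a distinct shard of the data.
train_loader = DataLoader(train_dataset, sampler=DistributedSampler(train_dataset), batch_size=32)
```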
25 changes: 25 additions & 0 deletions examples/msra_ner_example/README_ZH.md
@@ -0,0 +1,25 @@
[**中文说明**](README_ZH.md) | [**English**](README.md)

This example demonstrates distilling a [Chinese-ELECTRA-base](https://github.com/ymcui/Chinese-ELECTRA) model on the MSRA NER (Chinese named entity recognition) task in **distributed data-parallel training** (DDP) mode (single node, multi-GPU).


* ner_ElectraTrain_dist.sh : trains the teacher model (ELECTRA-base).
* ner_ElectraDistill_dist.sh : distills the teacher model into the student model (ELECTRA-small).


Before running the scripts, please set the following variables according to your environment:

* ELECTRA_DIR_BASE : the directory containing the Chinese-ELECTRA-base model, including vocab.txt, pytorch_model.bin and config.json.

* OUTPUT_DIR : stores the trained model weights and logs.
* DATA_DIR : the MSRA NER dataset directory, containing
* msra_train_bio.txt
* msra_test_bio.txt

For distillation, also set:

* ELECTRA_DIR_SMALL : the directory containing the pretrained Chinese-ELECTRA-small weights; should include pytorch_model.bin. The pretrained weights are optional: if not provided, the student model will be initialized randomly.
* student_config_file : the model config file for the student (usually named config.json), also located in $\{ELECTRA_DIR_SMALL\}.
* trained_teacher_model_file : the ELECTRA-base teacher model fine-tuned on the MSRA NER task.

The scripts have been tested under **PyTorch==1.2, Transformers==2.8**.
90 changes: 90 additions & 0 deletions examples/msra_ner_example/config.py
@@ -0,0 +1,90 @@
import argparse

args = None

def parse(opt=None):
parser = argparse.ArgumentParser()

## Required parameters

parser.add_argument("--vocab_file", default=None, type=str, required=True,
help="The vocabulary file that the BERT model was trained on.")
parser.add_argument("--output_dir", default=None, type=str, required=True,
help="The output directory where the model checkpoints will be written.")

## Other parameters
parser.add_argument("--train_file", default=None, type=str)
parser.add_argument("--predict_file", default=None, type=str)
parser.add_argument("--do_lower_case", action='store_true',
help="Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
parser.add_argument("--max_seq_length", default=416, type=int,
help="The maximum total input sequence length after WordPiece tokenization. Sequences "
"longer than this will be truncated, and sequences shorter than this will be padded.")
parser.add_argument("--do_train", default=False, action='store_true', help="Whether to run training.")
parser.add_argument("--do_predict", default=False, action='store_true', help="Whether to run eval on the dev set.")
parser.add_argument("--train_batch_size", default=32, type=int, help="Total batch size for training.")
parser.add_argument("--predict_batch_size", default=8, type=int, help="Total batch size for predictions.")
parser.add_argument("--learning_rate", default=3e-5, type=float, help="The initial learning rate for Adam.")
parser.add_argument("--num_train_epochs", default=3.0, type=float,
help="Total number of training epochs to perform.")
parser.add_argument("--warmup_proportion", default=0.1, type=float,
help="Proportion of training to perform linear learning rate warmup for. E.g., 0.1 = 10% "
"of training.")
parser.add_argument("--verbose_logging", default=False, action='store_true',
help="If true, all of the warnings related to data processing will be printed. "
"A number of warnings are expected for a normal SQuAD evaluation.")
parser.add_argument("--no_cuda",
default=False,
action='store_true',
help="Whether not to use CUDA when available")
    parser.add_argument('--gradient_accumulation_steps',
                        type=int,
                        default=1,
                        help="Number of update steps to accumulate before performing a backward/update pass.")
parser.add_argument("--local_rank",
type=int,
default=-1,
help="local_rank for distributed training on gpus")
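    # --local_rank is supplied automatically by torch.distributed.launch; the default -1 means non-distributed training.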
    parser.add_argument('--fp16',
                        default=False,
                        action='store_true',
                        help="Whether to use 16-bit float precision instead of 32-bit")

parser.add_argument('--random_seed',type=int,default=10236797)
parser.add_argument('--load_model_type',type=str,default='bert',choices=['bert','all','none'])
parser.add_argument('--weight_decay_rate',type=float,default=0.01)
parser.add_argument('--do_eval',action='store_true')
parser.add_argument('--PRINT_EVERY',type=int,default=200)
parser.add_argument('--weight',type=float,default=1.0)
parser.add_argument('--ckpt_frequency',type=int,default=2)

parser.add_argument('--tuned_checkpoint_T',type=str,default=None)
parser.add_argument('--tuned_checkpoint_S',type=str,default=None)
parser.add_argument("--init_checkpoint_S", default=None, type=str)
parser.add_argument("--bert_config_file_T", default=None, type=str, required=True)
parser.add_argument("--bert_config_file_S", default=None, type=str, required=True)
parser.add_argument("--temperature", default=1, type=float, required=False)
parser.add_argument("--teacher_cached",action='store_true')

parser.add_argument('--schedule',type=str,default='warmup_linear_release')

parser.add_argument('--no_inputs_mask',action='store_true')
parser.add_argument('--no_logits', action='store_true')
parser.add_argument('--output_encoded_layers' ,default='true',choices=['true','false'])
parser.add_argument('--output_attention_layers',default='true',choices=['true','false'])
parser.add_argument('--matches',nargs='*',type=str)

parser.add_argument('--lr_decay',default=None,type=float)
parser.add_argument('--official_schedule',default='linear',type=str)
global args
if opt is None:
args = parser.parse_args()
else:
args = parser.parse_args(opt)


if __name__ == '__main__':
    # Smoke test: parse a placeholder argument list covering the required options, then print the result.
    parse(['--vocab_file', 'vocab.txt',
           '--output_dir', 'output',
           '--bert_config_file_T', 'bert_config_T.json',
           '--bert_config_file_S', 'bert_config_S.json'])
    print(args)
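
For context, a minimal sketch of how a training script in this example might consume the module-level `args` (the import path and the printed attributes are taken from the parser above; the usage pattern itself is an assumption):

```python
import config

config.parse()   # reads sys.argv and fills the module-level `args`
args = config.args
print(args.output_dir, args.local_rank, args.fp16)
```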