Merge pull request #18 from airaria/v0.2.0dev
v0.2.0dev
airaria authored Jul 30, 2020
2 parents 6883348 + 4ea0515 commit 000f4af
Showing 27 changed files with 1,280 additions and 55 deletions.
16 changes: 11 additions & 5 deletions README.md
@@ -28,8 +28,14 @@ Check our paper through [ACL Anthology](https://www.aclweb.org/anthology/2020.ac

## Update

**Jul 29, 2020**

* **Updated to 0.2.0**:
* Added support for distributed data-parallel training with `DistributedDataParallel`: `TrainingConfig` now accepts the `local_rank` argument. See the documentation of `TrainingConfig` for details, and the sketch below.
* Added an example of distillation on the Chinese NER task to demonstrate distributed data-parallel training. See [examples/msra_ner_example](examples/msra_ner_example).
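
For illustration, a minimal sketch of how the new option might be wired up under `torch.distributed.launch` (single node, multi-GPU); the distiller setup itself (models, adaptors, dataloader) is omitted, and the exact keyword set of `TrainingConfig` should be checked against the documentation:

```python
import argparse
import torch
import textbrewer

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to each spawned process.
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

if args.local_rank != -1:
    torch.cuda.set_device(args.local_rank)
    # Relies on the rendezvous environment variables set by the launcher.
    torch.distributed.init_process_group(backend="nccl")

# Passing local_rank lets the distiller use DistributedDataParallel internally.
train_config = textbrewer.TrainingConfig(local_rank=args.local_rank)
```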

**Jul 14, 2020**
* Updated to 0.1.10:
* **Updated to 0.1.10**:
* Now supports mixed precision training with Apex! Just set `fp16` to `True` in `TrainingConfig`. See the documentation of `TrainingConfig` for details.
* Added the `data_parallel` option in `TrainingConfig` so that data-parallel training and mixed precision training can work together (see the sketch below).
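
As a quick illustration (a sketch only; see the `TrainingConfig` documentation for the full argument list), enabling both options might look like:

```python
import textbrewer

# Mixed precision via Apex together with single-process data parallelism;
# all other TrainingConfig arguments keep their defaults here.
train_config = textbrewer.TrainingConfig(fp16=True, data_parallel=True)
```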

@@ -41,17 +47,17 @@ Check our paper through [ACL Anthology](https://www.aclweb.org/anthology/2020.ac

**Apr 22, 2020**

* Updated to 0.1.9 (added cache option which speeds up distillation; fixed some bugs). See details in [releases](https://github.com/airaria/TextBrewer/releases/tag/v0.1.9).
* **Updated to 0.1.9** (added cache option which speeds up distillation; fixed some bugs). See details in [releases](https://github.com/airaria/TextBrewer/releases/tag/v0.1.9).
* Added experimental results for distilling Electra-base to Electra-small on Chinese tasks.
* TextBrewer has been accepted by [ACL 2020](http://acl2020.org) as a demo paper; please use our new [bib entry](#Citation).

**Mar 17, 2020**

* Added CoNLL-2003 English NER distillation example, see [examples/conll2003_example](examples/conll2003_example).
* Added CoNLL-2003 English NER distillation example. See [examples/conll2003_example](examples/conll2003_example).

**Mar 11, 2020**

* Updated to 0.1.8 (Improvements on TrainingConfig and train method). See details in [releases](https://github.com/airaria/TextBrewer/releases/tag/v0.1.8).
* **Updated to 0.1.8** (Improvements on TrainingConfig and train method). See details in [releases](https://github.com/airaria/TextBrewer/releases/tag/v0.1.8).

**Mar 2, 2020**

@@ -398,7 +404,7 @@ We recommend that users use pre-trained student models whenever possible to full

## Known Issues

* Multi-GPU training support is only available through `DataParallel` currently.
* ~~Multi-GPU training support is only available through `DataParallel` currently.~~

## Citation

8 changes: 7 additions & 1 deletion README_ZH.md
@@ -29,6 +29,12 @@

## Update

**Jul 29, 2020**

* **Updated to 0.2.0**:
* Added support for distributed data-parallel training: enable it by passing the appropriate `local_rank` to `TrainingConfig`. See the documentation of `TrainingConfig` for details.
* Added a usage example of distributed data-parallel training: distillation of an ELECTRA-base model on the Chinese NER task. See [examples/msra_ner_example](examples/msra_ner_example).

**Jul 14, 2020**
* **Updated to 0.1.10**:
* Added support for mixed precision training with Apex: enable it by setting `fp16=True` in `TrainingConfig`. See the documentation of `TrainingConfig` for details.
@@ -389,7 +395,7 @@ The distiller performs the actual distillation process. The following distillers are currently implemented:

## Known Issues

* Multi-GPU training strategies other than DataParallel are not yet supported.
* ~~Multi-GPU training strategies other than DataParallel are not yet supported.~~

## Citation

2 changes: 1 addition & 1 deletion examples/cmrc2018_example/README.md
@@ -7,7 +7,7 @@ This example demonstrates distillation on the CMRC 2018 task, and using DRCD dataset
* run_cmrc2018_distill_T3.sh : distills the teacher to T3 with CMRC 2018 and DRCD datasets.
* run_cmrc2018_distill_T4tiny.sh : distills the teacher to T4tiny with CMRC 2018 and DRCD datasets.

Modify the following variables in the shell scripts before running:
Set the following variables in the shell scripts before running:

* BERT_DIR : where RoBERTa-wwm-base is stored, including vocab.txt, pytorch_model.bin, bert_config.json
* OUTPUT_ROOT_DIR : this directory stores logs and trained model weights
2 changes: 1 addition & 1 deletion examples/cmrc2018_example/README_ZH.md
@@ -6,7 +6,7 @@
* run_cmrc2018_distill_T3.sh : distills the teacher model to T3 on the CMRC 2018 and DRCD datasets
* run_cmrc2018_distill_T4tiny.sh : distills the teacher model to T4-tiny on the CMRC 2018 and DRCD datasets

Before running the scripts, please modify the relevant variables according to your environment
Before running the scripts, please set the relevant variables according to your environment

* BERT_DIR : the directory containing the RoBERTa-wwm-base model, including vocab.txt, pytorch_model.bin, bert_config.json
* OUTPUT_ROOT_DIR : stores the trained model weights and logs
2 changes: 1 addition & 1 deletion examples/conll2003_example/README_ZH.md
@@ -10,7 +10,7 @@
* Transformers
* seqeval

Before running the scripts, please modify the relevant variables according to your environment
Before running the scripts, please set the relevant variables according to your environment

* BERT_MODEL : the directory containing the BERT-base model, including vocab.txt, pytorch_model.bin, config.json
* OUTPUT_DIR : stores the trained model weights
2 changes: 1 addition & 1 deletion examples/mnli_example/README.md
@@ -6,7 +6,7 @@ This example demonstrates distillation on the MNLI task.
* run_mnli_distill_T4tiny.sh : distills the teacher to T4tiny.
* run_mnli_distill_multiteacher.sh : runs multi-teacher distillation, distilling several teacher models into a student model.

Modify the following variables in the shell scripts before running:
Set the following variables in the shell scripts before running:

* BERT_DIR : where BERT-base-cased is stored, including vocab.txt, pytorch_model.bin, bert_config.json
* OUTPUT_ROOT_DIR : this directory stores logs and trained model weights
2 changes: 1 addition & 1 deletion examples/mnli_example/README_ZH.md
@@ -6,7 +6,7 @@
* run_mnli_distill_T4tiny.sh : distills the teacher model to T4tiny on MNLI
* run_mnli_distill_multiteacher.sh : performs multi-teacher distillation, compressing several teacher models into one student model

Before running the scripts, please modify the relevant variables according to your environment
Before running the scripts, please set the relevant variables according to your environment

* BERT_DIR : the directory containing the BERT-base-cased model, including vocab.txt, pytorch_model.bin, bert_config.json
* OUTPUT_ROOT_DIR : stores the trained models and logs
25 changes: 25 additions & 0 deletions examples/msra_ner_example/README.md
@@ -0,0 +1,25 @@
[**中文说明**](README_ZH.md) | [**English**](README.md)

This example demonstrates distilling a [Chinese-ELECTRA-base](https://github.com/ymcui/Chinese-ELECTRA) model on the MSRA NER task with **distributed data-parallel training** (single node, multi-GPU).


* ner_ElectraTrain_dist.sh : trains a teacher model (Chinese-ELECTRA-base) on MSRA NER.
* ner_ElectraDistill_dist.sh : distills the teacher to an ELECTRA-small model.


Set the following variables in the shell scripts before running:

* ELECTRA_DIR_BASE : where Chinese-ELECTRA-base is located; should include vocab.txt, pytorch_model.bin and config.json.

* OUTPUT_DIR : this directory stores the logs and the trained model weights.
* DATA_DIR : contains the MSRA NER dataset:
* msra_train_bio.txt
* msra_test_bio.txt

For distillation:

* ELECTRA_DIR_SMALL : where the pretrained Chinese-ELECTRA-small weights are located; should include pytorch_model.bin. This is optional: if you don't provide the ELECTRA-small weights, the student model will be initialized randomly.
* student_config_file : the model config file (i.e., config.json) for the student. Usually it should be in $\{ELECTRA_DIR_SMALL\}.
* trained_teacher_model_file : the ELECTRA-base teacher model that has been fine-tuned.

The scripts have been tested under **PyTorch==1.2, Transformers==2.8**.
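
For reference, a rough sketch of the per-process setup such a distributed script performs before training (the variable names are illustrative, not necessarily those used in the example code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

local_rank = 0  # in practice read from --local_rank, which torch.distributed.launch supplies
torch.cuda.set_device(local_rank)
# Requires the rendezvous environment variables set by the launcher.
torch.distributed.init_process_group(backend="nccl")

# Placeholder dataset; the example builds its features from msra_train_bio.txt instead.
train_dataset = TensorDataset(torch.zeros(8, 2, dtype=torch.long))

# DistributedSampler gives each process a distinct shard of the data.
train_loader = DataLoader(train_dataset, sampler=DistributedSampler(train_dataset), batch_size=32)
```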
25 changes: 25 additions & 0 deletions examples/msra_ner_example/README_ZH.md
@@ -0,0 +1,25 @@
[**中文说明**](README_ZH.md) | [**English**](README.md)

This example demonstrates distilling a [Chinese-ELECTRA-base](https://github.com/ymcui/Chinese-ELECTRA) model on the MSRA NER (Chinese named entity recognition) task in **distributed data-parallel training** (DDP) mode (single node, multi-GPU).


* ner_ElectraTrain_dist.sh : trains the teacher model (ELECTRA-base).
* ner_ElectraDistill_dist.sh : distills the teacher model into the student model (ELECTRA-small).


Before running the scripts, please set the following variables according to your environment:

* ELECTRA_DIR_BASE : the directory containing the Chinese-ELECTRA-base model, including vocab.txt, pytorch_model.bin and config.json.

* OUTPUT_DIR : stores the trained model weights and logs.
* DATA_DIR : the MSRA NER dataset directory, containing
* msra_train_bio.txt
* msra_test_bio.txt

For distillation, also set:

* ELECTRA_DIR_SMALL : the directory containing the pretrained Chinese-ELECTRA-small weights; should include pytorch_model.bin. The pretrained weights are optional: if not provided, the student model will be initialized randomly.
* student_config_file : the model config file for the student (usually named config.json), also located in $\{ELECTRA_DIR_SMALL\}.
* trained_teacher_model_file : the ELECTRA-base teacher model fine-tuned on the MSRA NER task.

The scripts have been tested under **PyTorch==1.2, Transformers==2.8**.
90 changes: 90 additions & 0 deletions examples/msra_ner_example/config.py
@@ -0,0 +1,90 @@
import argparse

args = None

def parse(opt=None):
parser = argparse.ArgumentParser()

## Required parameters

parser.add_argument("--vocab_file", default=None, type=str, required=True,
help="The vocabulary file that the BERT model was trained on.")
parser.add_argument("--output_dir", default=None, type=str, required=True,
help="The output directory where the model checkpoints will be written.")

## Other parameters
parser.add_argument("--train_file", default=None, type=str)
parser.add_argument("--predict_file", default=None, type=str)
parser.add_argument("--do_lower_case", action='store_true',
help="Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
parser.add_argument("--max_seq_length", default=416, type=int,
help="The maximum total input sequence length after WordPiece tokenization. Sequences "
"longer than this will be truncated, and sequences shorter than this will be padded.")
parser.add_argument("--do_train", default=False, action='store_true', help="Whether to run training.")
parser.add_argument("--do_predict", default=False, action='store_true', help="Whether to run eval on the dev set.")
parser.add_argument("--train_batch_size", default=32, type=int, help="Total batch size for training.")
parser.add_argument("--predict_batch_size", default=8, type=int, help="Total batch size for predictions.")
parser.add_argument("--learning_rate", default=3e-5, type=float, help="The initial learning rate for Adam.")
parser.add_argument("--num_train_epochs", default=3.0, type=float,
help="Total number of training epochs to perform.")
parser.add_argument("--warmup_proportion", default=0.1, type=float,
help="Proportion of training to perform linear learning rate warmup for. E.g., 0.1 = 10% "
"of training.")
parser.add_argument("--verbose_logging", default=False, action='store_true',
help="If true, all of the warnings related to data processing will be printed. "
"A number of warnings are expected for a normal SQuAD evaluation.")
parser.add_argument("--no_cuda",
default=False,
action='store_true',
help="Whether not to use CUDA when available")
    parser.add_argument('--gradient_accumulation_steps',
                        type=int,
                        default=1,
                        help="Number of update steps to accumulate before performing a backward/update pass.")
parser.add_argument("--local_rank",
type=int,
default=-1,
help="local_rank for distributed training on gpus")
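    # --local_rank is supplied automatically by torch.distributed.launch; the default -1 means non-distributed training.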
    parser.add_argument('--fp16',
                        default=False,
                        action='store_true',
                        help="Whether to use 16-bit float precision instead of 32-bit")

parser.add_argument('--random_seed',type=int,default=10236797)
parser.add_argument('--load_model_type',type=str,default='bert',choices=['bert','all','none'])
parser.add_argument('--weight_decay_rate',type=float,default=0.01)
parser.add_argument('--do_eval',action='store_true')
parser.add_argument('--PRINT_EVERY',type=int,default=200)
parser.add_argument('--weight',type=float,default=1.0)
parser.add_argument('--ckpt_frequency',type=int,default=2)

parser.add_argument('--tuned_checkpoint_T',type=str,default=None)
parser.add_argument('--tuned_checkpoint_S',type=str,default=None)
parser.add_argument("--init_checkpoint_S", default=None, type=str)
parser.add_argument("--bert_config_file_T", default=None, type=str, required=True)
parser.add_argument("--bert_config_file_S", default=None, type=str, required=True)
parser.add_argument("--temperature", default=1, type=float, required=False)
parser.add_argument("--teacher_cached",action='store_true')

parser.add_argument('--schedule',type=str,default='warmup_linear_release')

parser.add_argument('--no_inputs_mask',action='store_true')
parser.add_argument('--no_logits', action='store_true')
parser.add_argument('--output_encoded_layers' ,default='true',choices=['true','false'])
parser.add_argument('--output_attention_layers',default='true',choices=['true','false'])
parser.add_argument('--matches',nargs='*',type=str)

parser.add_argument('--lr_decay',default=None,type=float)
parser.add_argument('--official_schedule',default='linear',type=str)
global args
if opt is None:
args = parser.parse_args()
else:
args = parser.parse_args(opt)


if __name__ == '__main__':
    # Smoke test: parse a placeholder argument list covering the required options, then print the result.
    parse(['--vocab_file', 'vocab.txt',
           '--output_dir', 'output',
           '--bert_config_file_T', 'bert_config_T.json',
           '--bert_config_file_S', 'bert_config_S.json'])
    print(args)
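
For context, a minimal sketch of how a training script in this example might consume the module-level `args` (the import path and the printed attributes are taken from the parser above; the usage pattern itself is an assumption):

```python
import config

config.parse()   # reads sys.argv and fills the module-level `args`
args = config.args
print(args.output_dir, args.local_rank, args.fp16)
```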