# NLP-course-project

An NLP course project experimenting with document-level relation extraction.

## Dataset

Since this project is a rough integration of reproductions of Fine-tune Bert for DocRED with Two-step Process and DREEAM, two separate datasets need to be prepared in practice.

### DREEAM

The DocRED dataset can be downloaded following the instructions in the official DocRED repository. The expected file structure is:

```
DREEAM
 |-- dataset
 |    |-- docred
 |    |    |-- train_annotated.json
 |    |    |-- train_distant.json
 |    |    |-- dev.json
 |    |    |-- test.json
 |    |    |-- (train_revised.json)
 |    |    |-- (dev_revised.json)
 |    |    |-- (test_revised.json)
 |-- meta
 |    |-- rel2id.json
 |    |-- rel_info.json
```
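As a quick sanity check, here is a minimal sketch that verifies the layout above. The script name and the repo root path are our additions, not part of either upstream repo:

```python
# check_layout.py -- hypothetical helper, not part of the original repos.
# Verifies that the DREEAM dataset layout described above is in place.
from pathlib import Path

ROOT = Path("DREEAM")  # adjust to wherever the repo is checked out

REQUIRED = [
    "dataset/docred/train_annotated.json",
    "dataset/docred/train_distant.json",
    "dataset/docred/dev.json",
    "dataset/docred/test.json",
    "meta/rel2id.json",
    "meta/rel_info.json",
]

missing = [rel for rel in REQUIRED if not (ROOT / rel).is_file()]
if missing:
    print("Missing files:")
    for rel in missing:
        print("  -", rel)
else:
    print("Dataset layout looks complete.")
```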

### Fine-tune Bert for DocRED with Two-step Process

Download the metadata for the baseline method from TsinghuaCloud or GoogleDrive and put it into the `prepro_data` folder, then run:

```bash
python3 gen_data.py --in_path ../data --out_path prepro_data
```

The data itself can be downloaded from Google Drive; put it into the `data` folder.
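For orientation, each DocRED file is a JSON list of documents with `title`, `sents`, `vertexSet`, and `labels` fields (`test.json` has no `labels`). A minimal sketch for peeking at one document, assuming `dev.json` sits in the `data` folder mentioned above:

```python
# inspect_docred.py -- hypothetical helper for peeking at the DocRED format.
import json

with open("data/dev.json") as f:  # path assumes the layout above
    docs = json.load(f)

doc = docs[0]
print("title:", doc["title"])
print("sentences:", len(doc["sents"]))        # list of tokenized sentences
print("entities:", len(doc["vertexSet"]))     # each entity = list of mentions
print("relation facts:", len(doc["labels"]))  # absent in test.json

# Each label links two entities (head/tail indices into vertexSet)
# by a relation id, with supporting evidence sentence ids.
first = doc["labels"][0]
head = doc["vertexSet"][first["h"]][0]["name"]
tail = doc["vertexSet"][first["t"]][0]["name"]
print(f'{head} --{first["r"]}--> {tail}, evidence sents: {first["evidence"]}')
```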

## Train

### DREEAM

To train DREEAM in the fully-supervised setting, make sure the file structure matches the layout above, and run:

```bash
bash scripts/run_bert.sh ${name} ${lambda} ${seed}     # for BERT
bash scripts/run_roberta.sh ${name} ${lambda} ${seed}  # for RoBERTa
```

where `${name}` is the identifier of the run displayed in wandb, `${lambda}` is the scalar that controls the weight of the evidence loss (see Eq. 11 in the DREEAM paper), and `${seed}` is the random seed.

The training loss and evaluation results on the dev set are synced to the wandb dashboard. All outputs, including checkpoints, predictions, and evaluation scores, are stored under a directory named `${name}_lambda${lambda}/${timestamp}/`, where `${timestamp}` is a timestamp generated automatically by the code.
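For example, a hypothetical BERT run (the run name, lambda value, and seed below are arbitrary placeholders):

```bash
# logged to wandb as "docred-bert", evidence-loss weight 0.1, random seed 66
bash scripts/run_bert.sh docred-bert 0.1 66
# outputs are then stored under docred-bert_lambda0.1/${timestamp}/
```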

### Main

Training:

```bash
CUDA_VISIBLE_DEVICES=0 python3 train.py --model_name BiLSTM --save_name checkpoint_BiLSTM --train_prefix dev_train --test_prefix dev_dev
```

## Test

Testing (use `--test_prefix dev_dev` for the dev set, `dev_test` for the test set):

```bash
CUDA_VISIBLE_DEVICES=0 python3 test.py --model_name BiLSTM --save_name checkpoint_BiLSTM --train_prefix dev_train --test_prefix dev_dev --input_theta 0.3601
```
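Here `--input_theta` is the probability threshold above which a (head, tail, relation) triple is output, typically the best threshold found on the dev set. A minimal sketch of that thresholding step, using made-up scores rather than the repo's internals:

```python
# threshold_sketch.py -- illustrative only; the scores below are made up.
import numpy as np

input_theta = 0.3601

# pretend per-triple relation probabilities: (num_entity_pairs, num_relations)
scores = np.array([
    [0.05, 0.72, 0.10],
    [0.40, 0.02, 0.38],
])

# every (pair, relation) whose score clears the threshold becomes a prediction
pairs, rels = np.nonzero(scores > input_theta)
for p, r in zip(pairs, rels):
    print(f"pair {p}: predict relation {r} (score {scores[p, r]:.2f})")
```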

You can get a pre-trained model from GoogleDrive. Remember to rename the model to match the argument `--save_name checkpoint_BiLSTM`.