NLP-course-project

NLP course project trying something on document-level relation extraction

Dataset

Since the whole project is a fairly rough integration of reproductions of Fine-tune Bert for DocRED with Two-step Process and DREEAM, two separate datasets have to be prepared in actual use.

DREEAM

The DocRED dataset can be downloaded following the instructions in the official DocRED repository. The expected file structure is:

DREEAM
 |-- dataset
 |    |-- docred
 |    |    |-- train_annotated.json
 |    |    |-- train_distant.json
 |    |    |-- dev.json
 |    |    |-- test.json
 |    |    |-- (train_revised.json)
 |    |    |-- (dev_revised.json)
 |    |    |-- (test_revised.json)
 |-- meta
 |    |-- rel2id.json
 |    |-- rel_info.json
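
As a quick sanity check, the snippet below (a minimal sketch, assuming it is run from the DREEAM directory) verifies that the required files are in place:

for f in dataset/docred/train_annotated.json dataset/docred/train_distant.json \
         dataset/docred/dev.json dataset/docred/test.json \
         meta/rel2id.json meta/rel_info.json; do
  # report each expected file as present or missing
  [ -f "$f" ] && echo "ok: $f" || echo "missing: $f"
done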

Fine-tune Bert for DocRED with Two-step Process

Download the data from Google Drive and put it into the data folder. Download the metadata for the baseline method from TsinghuaCloud or GoogleDrive and put it into the prepro_data folder. Then generate the preprocessed inputs:

python3 gen_data.py --in_path ../data --out_path prepro_data
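
If the folders are not present yet, they can be created up front (a sketch; the relative ../data matches the gen_data.py invocation above, which assumes the script is run from a directory alongside data):

# create the input and output folders expected by gen_data.py
mkdir -p ../data prepro_data
# place the DocRED json files in ../data and the metadata in prepro_data before running gen_data.py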

Train

DREEAM

To train DREEAM in the fully-supervised setting, make sure the file structure matches the layout above and run one of the commands below:

bash scripts/run_bert.sh ${name} ${lambda} ${seed} # for BERT
bash scripts/run_roberta.sh ${name} ${lambda} ${seed} # for RoBERTa

where ${name} is the identifier of the run displayed in wandb, ${lambda} is the scalar that controls the weight of the evidence loss (see Eq. 11 in the DREEAM paper), and ${seed} is the random seed.

The training loss and the evaluation results on the dev set are synced to the wandb dashboard. All outputs, including checkpoints, predictions, and evaluation scores, are stored under a directory named ${name}_lambda${lambda}/${timestamp}/, where ${timestamp} is a timestamp generated automatically by the code.
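
For example, to train the BERT variant with run name docred_bert, evidence-loss weight 0.1, and seed 66 (illustrative values, not tuned settings):

bash scripts/run_bert.sh docred_bert 0.1 66
# outputs will be collected under docred_bert_lambda0.1/${timestamp}/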

Main

Training:

CUDA_VISIBLE_DEVICES=0 python3 train.py --model_name BiLSTM --save_name checkpoint_BiLSTM --train_prefix dev_train --test_prefix dev_dev

Test

Testing (--test_prefix dev_dev for the dev set, dev_test for the test set; --input_theta is the classification threshold, typically taken from the value reported when evaluating on the dev set):

CUDA_VISIBLE_DEVICES=0 python3 test.py --model_name BiLSTM --save_name checkpoint_BiLSTM --train_prefix dev_train --test_prefix dev_dev --input_theta 0.3601

A pre-trained model is available on GoogleDrive. Remember to rename the downloaded model to match the --save_name checkpoint_BiLSTM argument.
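
For instance, a rename along these lines (a sketch; downloaded_model.chkpt is a placeholder for the actual Drive file name, and checkpoint/ is assumed to be where the baseline code looks for --save_name):

mv downloaded_model.chkpt checkpoint/checkpoint_BiLSTM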
