DKEC: Domain Knowledge Enhanced Multi-Label Classification for Diagnosis Prediction

Code repository for the EMNLP 2024 main conference paper DKEC: Domain Knowledge Enhanced Multi-Label Classification for Diagnosis Prediction.

Introduction

Multi-label text classification (MLTC) tasks in the medical domain often face the long-tail label distribution problem. Prior work has explored hierarchical label structures to find relevant information for few-shot classes, but has mostly neglected external knowledge from medical guidelines. This paper presents DKEC, Domain Knowledge Enhanced Classification for diagnosis prediction, with two innovations:

(1) automated construction of heterogeneous knowledge graphs from external sources to capture semantic relations among diverse medical entities.

(2) incorporating the heterogeneous knowledge graphs in few-shot classification using a label-wise attention mechanism.

Dataset

EMS dataset
- The EMS dataset is available only under restricted access due to patient privacy concerns.
MIMIC-III dataset
- The MIMIC-III dataset is publicly available. Refer to this page to apply online.
- The adjacency matrices created for ICD-9 codes are stored in the corresponding MIMIC-III data folder:
  - SYMPTOM: icd9code2symptom.json, symptom2icd9code.json
  - TREATMENT: icd9code2treatment.json, treatment2icd9.json
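
As a rough illustration of how mapping files like these could be turned into an adjacency matrix, the sketch below assumes each JSON file maps an ICD-9 code to a list of strings (symptoms or treatments). That schema is an assumption, and the helper name `build_adjacency` and the sample entries are hypothetical; check the actual files before relying on it.

```python
import json


def build_adjacency(code2items):
    """Build a binary code-by-item matrix from a {code: [items]} mapping.

    Assumption: every value in the mapping is a list of strings.
    """
    codes = sorted(code2items)
    items = sorted({item for values in code2items.values() for item in values})
    column = {item: j for j, item in enumerate(items)}
    matrix = [[0] * len(items) for _ in codes]
    for i, code in enumerate(codes):
        for item in code2items[code]:
            matrix[i][column[item]] = 1
    return codes, items, matrix


# Hypothetical sample standing in for the contents of icd9code2symptom.json.
sample = json.loads('{"428.0": ["dyspnea", "edema"], "486": ["cough", "dyspnea"]}')
codes, symptoms, adj = build_adjacency(sample)
```

With a real file, `sample` would instead come from `json.load` on the opened file. Rows follow the sorted code order and columns the sorted item order.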
Web Annotation
- To evaluate the accuracy of different methods for constructing knowledge graphs, we evenly sampled 50 codes from head, middle, and tail classes and manually annotated symptoms and treatments from Wikipedia and Mayo Clinic website contents for ICD-9 diagnosis codes. For EMS protocols, we manually annotated all 43 protocols in ODEMSA documents.

Environment

Run the following commands to create an Anaconda environment named DKEC:

chmod +x install.sh
./install.sh

Steps to run the code

Generate train / val / test:

Download the code from CAML and run the notebook dataproc_mimic_III.ipynb. You also need to download the pre-trained embeddings BioWordVec_PubMed_MIMICIII_d200.vec.bin from link.
Run mimic_iii_6668.ipynb, mimic_iii_3737.ipynb, and mimic_iii_1000.ipynb in sequence.
- You need to specify the root paths for the CAML code and BioWordVec_PubMed_MIMICIII_d200.vec.bin in mimic_iii_6668.ipynb.

Generate pre-trained embedding

First, fill in the config file for each backbone.
Run python Heterogeneous_graph.py config/whichname.json to generate embeddings for the different backbones.
- We suggest setting dataset to MIMIC3-6668, since this generates the initial node embeddings for all 6668 ICD-9 codes.

Config

This section lists the parameters that can be changed in the config file.

train
- dataset: MIMIC3-3737 or MIMIC3-1000 or MIMIC3-6668
- root_dir: the absolute path to the DKEC directory
- topk: 8 (MIMIC3-3737); 6 (MIMIC3-1000); 12 (MIMIC3-6668)
- seed: 3407 or 1234 or 42 or 0 or 1
test
- epoch: select the epoch whose saved model has the best performance
- seed: set this to the same seed used in train
- is_test: True when testing, False when training
wandb
- enable: True if you use wandb to check training curves
- entity: your wandb account name
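
The parameters above can be assembled programmatically. The following is a minimal sketch, assuming the config file nests them under `train`, `test`, and `wandb` keys as the list suggests; the exact schema expected by the repository may differ, and `make_config` and the path `/path/to/DKEC` are hypothetical.

```python
import json

# topk values listed above for each dataset.
TOPK = {"MIMIC3-3737": 8, "MIMIC3-1000": 6, "MIMIC3-6668": 12}


def make_config(dataset, root_dir, seed=3407, is_test=False, epoch=None, wandb_entity=None):
    """Assemble a config dict; the nesting is an assumption, not the repo's schema."""
    if dataset not in TOPK:
        raise ValueError(f"unknown dataset: {dataset}")
    return {
        "train": {"dataset": dataset, "root_dir": root_dir, "topk": TOPK[dataset], "seed": seed},
        # The test seed mirrors the train seed, as the list above requires.
        "test": {"epoch": epoch, "seed": seed, "is_test": is_test},
        "wandb": {"enable": wandb_entity is not None, "entity": wandb_entity},
    }


cfg = make_config("MIMIC3-1000", "/path/to/DKEC", seed=42)
print(json.dumps(cfg, indent=2))
```

Deriving `topk` from the dataset name and reusing one seed for both sections avoids the two mismatches that are easiest to introduce when editing the file by hand.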

Slurm

This section lists the terminal commands.

Cluster: to run the project with Slurm, go to the slurm folder and run sbatch whichname.slurm
Local machine: python main.py config/whichname.json
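
For scripted runs, the two invocations above can be generated from a single name. This is only a sketch: it assumes the Slurm script and the JSON config share the same base name (for example `DKEC_CNN`), which should be verified against the slurm and config folders, and `build_command` is a hypothetical helper.

```python
import shlex


def build_command(config_name, cluster=False):
    """Return the argument list for one run, e.g. config_name='DKEC_CNN'."""
    if cluster:
        # Assumes the command is launched from inside the slurm folder.
        return ["sbatch", f"{config_name}.slurm"]
    return ["python", "main.py", f"config/{config_name}.json"]


cmd = build_command("DKEC_CNN")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the run.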

Reproduce Experimental Results

The following table specifies how to reproduce the main experimental results in Table 4 using Slurm. You can also find the corresponding JSON file in config to run on a local machine. For ISD, we directly use their GitHub code.

Model	Slurm script or URL
CAML	`sbatch CAML.slurm`
ZAGCNN	`sbatch ZAGCNN.slurm`
MultiResCNN	`sbatch MultiResCNN.slurm`
ISD	https://github.com/tongzhou21/ISD/tree/master
DKEC-M-CNN	`sbatch DKEC_CNN.slurm`
DKEC-GatorTron	`sbatch DKEC_GatorTron.slurm`

Citation

If you find this work helpful, please cite:

@article{ge2023dkec,
  title={Dkec: Domain knowledge enhanced multi-label classification for electronic health records},
  author={Ge, Xueren and Williams, Ronald Dean and Stankovic, John A and Alemzadeh, Homa},
  journal={arXiv preprint arXiv:2310.07059},
  year={2023}
}

