
PhenoGPT2

PhenoGPT2 is an advanced phenotype recognition model that leverages the capabilities of large language models. It is an improved version of PhenoGPT (Yang et al., 2023). The model is fine-tuned on synthetic medical data generated by Llama 3.1 70B and on the Human Phenotype Ontology database to improve prediction accuracy and alignment. Like general-purpose GPT models, PhenoGPT2 can process diverse clinical texts, which makes it flexible across sources. For greater precision and specialization, you can further fine-tune PhenoGPT2 on your own clinical datasets; this process is described in the Fine-tuning section below.

PhenoGPT2 is distributed under the MIT License by the Wang Genomics Lab.

Installation

Install the required packages for model fine-tuning and inference:

conda create -n phenogpt2 python=3.11
conda activate phenogpt2
conda install pandas numpy scikit-learn matplotlib seaborn requests
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda install -c "nvidia/label/cuda-12.1" cuda-toolkit
conda install -c conda-forge jupyter
conda install intel-openmp blas
conda install mpi4py
pip install transformers datasets
pip install fastobo sentencepiece einops protobuf
pip install evaluate sacrebleu scipy accelerate deepspeed
# PLEASE LOAD THE CUDA MODULE IN YOUR ENVIRONMENT BEFORE INSTALLING THE FLASH ATTENTION PACKAGE. FOR EXAMPLE:
module load CUDA/12.1.1
pip install flash-attn --no-build-isolation
pip install xformers
conda install -c anaconda ipykernel
python -m ipykernel install --user --name=phenogpt2
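
After installation, a quick sanity check can confirm that PyTorch sees your GPU and that the optional flash-attn package imports. This is a minimal sketch, not part of the repository:

# sanity_check.py -- minimal environment check (sketch, not part of the repository)
import torch
import transformers

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("Transformers:", transformers.__version__)

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not importable; check that the CUDA module was loaded before installation")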

Set Up Model, Input, and Output directories

  1. Models:
    • To use the Llama 3.1 8B model, please apply for access first and download it to your local drive. Download here
    • Save the base model in Llama3_1/Meta-Llama-3.1-8B-Instruct
    • Download the updated fine-tuned weights from the Releases section on GitHub (latest version: v1.0.0)
    • Save the model weights in ./model/
  2. Input:
    • Input files should be plain-text (.txt) files
    • The input argument can be either a single .txt file or a directory containing all input .txt files
    • Please see the input and output directories in the repository for reference
  3. JSON-formatted answers:
    • Ideally, the output files include the raw results in _phenogpt2.txt and the parsed results in _phenogpt2.json
    • However, due to the nature of LLMs, the generated text sometimes does not conform to valid JSON. In that case you will receive the error: "Please review the output file at _phenogpt2.txt. The result was successfully generated in text format; however, there may be some extra single or double quotes, or colons, which could cause a JSON format error. Please inspect and remove any of them." This means the _phenogpt2.txt answer likely contains improperly formatted JSON, so please inspect it manually; a small helper like the one sketched after this list can locate the problem.
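
If you hit that error, a helper like the hypothetical sketch below (not part of the repository) can report where JSON parsing fails in a _phenogpt2.txt file so you know what to inspect:

# check_json.py -- hypothetical helper for inspecting a _phenogpt2.txt output (not part of the repository)
import json
import sys

def check_output(path):
    with open(path, encoding="utf-8") as fh:
        text = fh.read()
    try:
        json.loads(text)
        print(f"{path}: valid JSON")
    except json.JSONDecodeError as err:
        # Show the line/column of the failure and a short context window around it
        start = max(err.pos - 40, 0)
        end = min(err.pos + 40, len(text))
        print(f"{path}: invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")
        print("Context:", repr(text[start:end]))

if __name__ == "__main__":
    check_output(sys.argv[1])  # e.g. python check_json.py sample_phenogpt2.txt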

Fine-tuning

You can reproduce the PhenoGPT2 model with your own datasets or other foundation models. To fine-tune a specialized phenotype recognition language model, we recommend following this script for details. Please modify the directories accordingly. We provide the data used for fine-tuning: data generated by the Llama 3.1 70B model covering all Human Phenotype Ontology terms and demographics, and data consisting of definitions and comments for many HPO terms from the HPO database.
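
As a rough illustration of the data preparation side only: the sketch below turns annotated notes into instruction-style prompt/completion pairs. The field names ("text", "phenotypes"), the file name, and the prompt wording are assumptions for illustration; the actual prompt template and hyperparameters are defined in the repository's fine-tuning script.

# prepare_data.py -- illustrative sketch of building instruction-style examples
# Field names ("text", "phenotypes") and the prompt wording are assumptions, not the repository's format.
import json
from datasets import Dataset

def build_example(record):
    prompt = (
        "Identify all Human Phenotype Ontology (HPO) terms in the following clinical note "
        "and return them as JSON.\n\n" + record["text"]
    )
    answer = json.dumps(record["phenotypes"])
    return {"prompt": prompt, "completion": answer}

with open("train_records.json", encoding="utf-8") as fh:   # hypothetical file name
    records = json.load(fh)

dataset = Dataset.from_list([build_example(r) for r in records])
dataset.save_to_disk("finetune_dataset")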

Inference

If you simply want to run PhenoGPT2 on your local machine for inference, the fine-tuned models are saved in the model directory. Please follow the inference section of the script to run the model.

Please use the following command:

python inference.py -i your_input_folder_directory -o your_output_folder_directory -model_dir your_model_directory

-model_dir (optional): you can substitute any other model directory to generate results using our fine-tuning prompts
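
Once inference finishes, the per-file JSON outputs can be aggregated for downstream analysis. The sketch below assumes the output files follow the *_phenogpt2.json naming described above; the exact keys inside each file depend on the generated answer.

# collect_results.py -- sketch for aggregating *_phenogpt2.json outputs (file layout assumed, not guaranteed)
import json
from pathlib import Path

output_dir = Path("your_output_folder_directory")   # same directory passed via -o
results = {}
for path in sorted(output_dir.glob("*_phenogpt2.json")):
    with open(path, encoding="utf-8") as fh:
        results[path.stem] = json.load(fh)

print(f"Loaded {len(results)} result files")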

Developers:

Quan Minh Nguyen - Bioengineering PhD student at the University of Pennsylvania
Dr. Kai Wang - Professor of Pathology and Laboratory Medicine at the University of Pennsylvania and Children's Hospital of Philadelphia

Citations

Yang, J., Liu, C., Deng, W., Wu, D., Weng, C., Zhou, Y., & Wang, K. (2023). Enhancing phenotype recognition in clinical notes using large language models: PhenoBCBERT and PhenoGPT. Patterns (New York, N.Y.), 5(1), 100887. https://doi.org/10.1016/j.patter.2023.100887
