Predicate-Argument Graphs extracted from unstructured text have a high cardinality of verbs (predicates), which limits the usefulness of the graphs. In the biomedical domain in particular, there are no existing data sources that could be used to train or map verbs. The key challenge is to reduce the verb count without losing information.
verbReduce
verbReduce does not:
- Require an existing resource for the biomedical domain
- Require a 'gold' verb set
- Require a predefined number of target verbs 'K'
- Require an evaluation dataset
Given unlabeled data, our approach produces a lookup table mapping each source verb to a target verb.
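For illustration, the lookup table can be thought of as a simple dictionary. The verb pairs below are hypothetical examples, not output of the actual pipeline:

```python
# Hypothetical lookup table: each source verb maps to a target verb
# drawn from a smaller, reduced verb set.
verb_map = {
    "upregulates": "increases",
    "augments": "increases",
    "suppresses": "inhibits",
    "inhibits": "inhibits",
}

def reduce_verb(verb: str) -> str:
    """Replace a verb with its target; unknown verbs pass through unchanged."""
    return verb_map.get(verb, verb)
```

For example, `reduce_verb("augments")` returns `"increases"`, while a verb absent from the table is kept as-is.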
Run the following to set up the code:

```shell
make install-dependencies
```
Tests:

```shell
pytest -s
```
We use several external libraries in the code, so it is useful to be familiar with how they work. The libraries are:
- PyTorch Lightning
  - In particular, we use the LightningModule and LightningDataModule
- PySAT
  - In particular, we use the minimum/minimal hitting set solver
- Prefect
  - We only use the basic task-flow paradigm of Prefect.
- Dynaconf
  - We use this to specify the parameters for the train and predict flows
Environment Variables:

```shell
export PREFECT_HOME=<path where you have enough space>
```
- Prefect stores task outputs on local disk. Make sure to provide a path where there is enough space.

```shell
export TOKENIZERS_PARALLELISM=false
```
- This disables the warning messages thrown by HuggingFace Tokenizers.

```shell
export ENV_FOR_DYNACONF=default
```
- This selects which Dynaconf environment the settings are read from, e.g. from settings.local.toml (this file is not tracked by git and can vary with each local config). Please refer to the Dynaconf documentation for further information.
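For reference, a settings.local.toml selecting the `default` environment might look like the following. The keys shown are illustrative placeholders, not the project's actual parameter names:

```toml
# Hypothetical settings.local.toml; ENV_FOR_DYNACONF=default selects
# the [default] section below.
[default]
batch_size = 32
max_epochs = 10

[predict]
checkpoint_path = "outputs/last.ckpt"
```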
We address the challenge in three parts:
- Identify a set of candidate substitute verbs
- Reduce the cardinality of verbs
- Evaluate the accuracy of the replacements
Future work:
- Support multi-GPU training/inference (currently the code only supports one GPU)
- Use context in verb substitution prediction
- Deal with multi-token verbs (currently the approach only uses verbs found in the vocabulary as a single token; if a verb is split into two tokens, we ignore it)
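The three-part approach above can be sketched end to end on toy data. Everything here is a hypothetical placeholder: the candidate sets would come from a model, and the project reduces cardinality with PySAT's exact hitting-set solver rather than the greedy approximation shown:

```python
# Part 1 (placeholder): candidate substitute verbs per source verb.
candidates = {
    "upregulates": ["increases", "activates"],
    "augments": ["increases"],
    "suppresses": ["inhibits", "decreases"],
    "blocks": ["inhibits"],
}

def greedy_target_verbs(candidates):
    # Part 2: pick a small set of target verbs covering every source verb
    # (a greedy approximation of the minimum hitting set).
    uncovered = set(candidates)
    targets = []
    while uncovered:
        # Choose the candidate verb that covers the most uncovered sources.
        best = max(
            {c for subs in candidates.values() for c in subs},
            key=lambda c: sum(1 for v in uncovered if c in candidates[v]),
        )
        targets.append(best)
        uncovered -= {v for v in uncovered if best in candidates[v]}
    return targets

def build_mapping(candidates, targets):
    # Map each source verb to the first of its candidates kept as a target.
    return {v: next(c for c in subs if c in targets)
            for v, subs in candidates.items()}

targets = greedy_target_verbs(candidates)
mapping = build_mapping(candidates, targets)
# Part 3 would then evaluate how well the substituted verbs preserve meaning.
```

On this toy input, the four source verbs collapse onto the two targets "increases" and "inhibits".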