Predicate-Argument Graphs extracted from unstructured text have a high cardinality of verbs (predicates), which limits the usefulness of the graphs. In the biomedical domain in particular, there are no existing data sources that could be used to train or map verbs. The key challenge is to reduce the verb count without losing information.
verbReduce
verbReduce does not:
- Require an existing resource for the biomedical domain
- Require a 'gold' verb set
- Require a predefined number of target verbs 'K'
- Require an evaluation dataset
Given unlabeled data, our approach produces a lookup table mapping each source verb to a target verb.
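For illustration, the lookup table can be thought of as a simple dictionary. The verb pairs below are hypothetical examples, not output of the actual pipeline:

```python
# Hypothetical lookup table: each source verb maps to a target verb
# drawn from a smaller, reduced verb set.
verb_map = {
    "upregulates": "increases",
    "augments": "increases",
    "suppresses": "inhibits",
    "inhibits": "inhibits",
}

def reduce_verb(verb: str) -> str:
    """Replace a verb with its target; unknown verbs pass through unchanged."""
    return verb_map.get(verb, verb)
```

For example, `reduce_verb("augments")` returns `"increases"`, while a verb absent from the table is kept as-is.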
Run the following to set up the code:

```shell
make install-dependencies
```
Tests:

```shell
pytest -s
```
We use several external libraries in the code, so it is useful to be familiar with how they work. The libraries are:
- PyTorch Lightning
  - In particular, we use the LightningModule and LightningDataModule
- PySAT
  - In particular, we use the minimum/minimal hitting set solver
- Prefect
  - We only use the basic task-flow paradigm of Prefect.
- Dynaconf
  - We use this to specify the parameters for the train and predict flows
Environment Variables:

```shell
export PREFECT_HOME=<path where you have enough space>
```
- Prefect stores task outputs on local disk. Make sure to provide a path where there is enough space.

```shell
export TOKENIZERS_PARALLELISM=false
```
- This disables the warning messages thrown by HuggingFace Tokenizers.

```shell
export ENV_FOR_DYNACONF=default
```
- This selects which Dynaconf environment the settings are read from, e.g. from settings.local.toml (this file is not tracked by git and can vary with each local config). Please refer to the Dynaconf documentation for further information.
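For reference, a settings.local.toml selecting the `default` environment might look like the following. The keys shown are illustrative placeholders, not the project's actual parameter names:

```toml
# Hypothetical settings.local.toml; ENV_FOR_DYNACONF=default selects
# the [default] section below.
[default]
batch_size = 32
max_epochs = 10

[predict]
checkpoint_path = "outputs/last.ckpt"
```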
We address the challenge in three parts:
- Identify a set of candidate substitute verbs
- Reduce the cardinality of verbs
- Evaluate the accuracy of the replacements
Future work:
- Support multi-GPU training/inference (currently the code only supports one GPU)
- Use context in verb substitution prediction
- Deal with multi-token verbs (currently the approach only uses verbs found in the vocabulary as a single token; if a verb is split into two tokens, we ignore it)
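The three-part approach above can be sketched end to end on toy data. Everything here is a hypothetical placeholder: the candidate sets would come from a model, and the project reduces cardinality with PySAT's exact hitting-set solver rather than the greedy approximation shown:

```python
# Part 1 (placeholder): candidate substitute verbs per source verb.
candidates = {
    "upregulates": ["increases", "activates"],
    "augments": ["increases"],
    "suppresses": ["inhibits", "decreases"],
    "blocks": ["inhibits"],
}

def greedy_target_verbs(candidates):
    # Part 2: pick a small set of target verbs covering every source verb
    # (a greedy approximation of the minimum hitting set).
    uncovered = set(candidates)
    targets = []
    while uncovered:
        # Choose the candidate verb that covers the most uncovered sources.
        best = max(
            {c for subs in candidates.values() for c in subs},
            key=lambda c: sum(1 for v in uncovered if c in candidates[v]),
        )
        targets.append(best)
        uncovered -= {v for v in uncovered if best in candidates[v]}
    return targets

def build_mapping(candidates, targets):
    # Map each source verb to the first of its candidates kept as a target.
    return {v: next(c for c in subs if c in targets)
            for v, subs in candidates.items()}

targets = greedy_target_verbs(candidates)
mapping = build_mapping(candidates, targets)
# Part 3 would then evaluate how well the substituted verbs preserve meaning.
```

On this toy input, the four source verbs collapse onto the two targets "increases" and "inhibits".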