6. Advanced usage

Overview

leADS can be trained in several ways. The training data provided to users consists of (i) EC number indices with embeddings (biocyc21_Xe.pkl) and (ii) pathway indices (biocyc21_y.pkl). To train leADS to your own specifications, a preprocessing step must be run first.

Note: Place the source code directory leADS/ (see Installing leADS) in the same directory as explained in the Download files section. Additionally, create log/ and result/ folders (if you have not already created them during pathway prediction) in the same leADS_materials/ directory. The final structure should look like this:

leADS_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       └── ...
	├── log/
        │       └── ...
	└── leADS/
                └── ...
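
The folder setup described above can be scripted. The following is a minimal sketch, not part of leADS: the helper name prepare_layout is hypothetical, and the folder names are taken from the structure shown above. It creates log/ and result/ if they are absent and reports any other folder that is still missing.

```python
from pathlib import Path

# Folder names taken from the directory structure shown above.
EXPECTED_DIRS = ("objectset", "model", "dataset", "result", "log", "leADS")


def prepare_layout(root):
    """Create log/ and result/ under root and list folders still missing.

    Hypothetical helper for illustration; leADS does not ship this function.
    """
    root = Path(root)
    # Only log/ and result/ are created; the others come from the downloads.
    for name in ("log", "result"):
        (root / name).mkdir(parents=True, exist_ok=True)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

Calling prepare_layout("leADS_materials") from the parent directory returns an empty list once every expected folder is in place.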

For all experiments, open a terminal (on Linux and macOS) or an Anaconda command prompt (on Windows), navigate to the src/ folder in the leADS/ directory, and run the commands shown in the Example sections of Preprocessing and Training.

To display leADS' running options, use: python main.py --help. The help output should be self-explanatory.

Preprocessing

This step is required only if you wish to generate feature files such as "[DATANAME]_Xe.pkl" or "[DATANAME]_Xa.pkl" from a "[DATANAME]_X.pkl" file in order to use them for training.

Input:

The input file used for preprocessing is any matrix file containing EC number indices (e.g. biocyc21_X.pkl, cami_X.pkl).

Other files required for preprocessing:

biocyc.pkl
pathway2ec.pkl
pathway2ec_idx.pkl
pathway2vec_embeddings.npz
hin.pkl

Command:

The basic command is represented below. Do not use this for preprocessing. This command is only a representation of all the flags used. See Example below on how to preprocess your datasets.

python main.py \
--preprocess-dataset \
--object-name "biocyc.pkl" \
--pathway2ec-name "pathway2ec.pkl" \
--pathway2ec-idx-name "pathway2ec_idx.pkl" \
--hin-name "hin.pkl" \
--features-name "pathway2vec_embeddings.npz" \
--X-name "[DATANAME]_X*.pkl" \
--file-name "[input (or save) file name]" \
--ospath "[absolute path to the object files directory (e.g. objectset)]" \
--dspath "[absolute path to the dataset directory (e.g. dataset)]" \
--batch 50 \
--num-jobs 2

Argument descriptions:

The table below summarizes all the command-line arguments that are specific to this framework:

Argument name	Description	Value
--preprocess-dataset	Preprocess inputs based on Biocyc collection	True
--object-name	The preprocessed MetaCyc database file	biocyc.pkl
--pathway2ec-name	The matrix file representing Pathway-EC association	pathway2ec.pkl
--pathway2ec-idx-name	The pathway2ec association indices file	pathway2ec_idx.pkl
--hin-name	The heterogeneous information network file	hin.pkl
--features-name	The features corresponding to ECs and pathways	pathway2vec_embeddings.npz
--X-name	The input file name to be provided for preprocessing	[DATANAME]_X*.pkl
--file-name	The names of input preprocessed files (without extension)	[input (or save) file name]
--ospath	The path to the data object that contains extracted information from the MetaCyc database (e.g. "biocyc.pkl")	Outside source code
--dspath	The path to the datasets	Outside source code
--batch	Batch size	50
--num-jobs	The number of parallel workers	2
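
When preprocessing several datasets, it can help to assemble the argument list in a script rather than by hand. The sketch below is only an illustration: the function name build_preprocess_command is hypothetical, the flags and default values are the ones listed in the table above, and the paths in the usage lines are placeholders. It only builds the command; running it (for example with subprocess.run from inside src/) is left to the reader.

```python
import shlex


def build_preprocess_command(dataname, ospath, dspath, batch=50, num_jobs=2):
    """Return the preprocessing command as an argument list.

    Hypothetical helper; flags and values mirror the table above.
    """
    return [
        "python", "main.py",
        "--preprocess-dataset",
        "--object-name", "biocyc.pkl",
        "--pathway2ec-name", "pathway2ec.pkl",
        "--pathway2ec-idx-name", "pathway2ec_idx.pkl",
        "--hin-name", "hin.pkl",
        "--features-name", "pathway2vec_embeddings.npz",
        "--X-name", f"{dataname}_X.pkl",
        "--file-name", dataname,
        "--ospath", str(ospath),
        "--dspath", str(dspath),
        "--batch", str(batch),
        "--num-jobs", str(num_jobs),
    ]


# Placeholder paths; replace them with absolute paths on your machine.
cmd = build_preprocess_command("cami", "/path/to/objectset", "/path/to/dataset")
print(shlex.join(cmd))  # shell-quoted form, requires Python 3.8 or later
```

Keeping the command as a list avoids quoting problems when a path contains spaces.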

Output:

The output files generated after running the command are:

File	Description
[DATANAME]_Xa.pkl	A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices and abundance features.
[DATANAME]_Xc.pkl	A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices and coverage features.
[DATANAME]_Xe.pkl	A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices and embeddings.
[DATANAME]_Xea.pkl	A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices, embeddings, and abundance features.
[DATANAME]_Xec.pkl	A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices, embeddings, and coverage features.
[DATANAME]_Xm.pkl	A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices, embeddings, abundance, and coverage features.
[DATANAME]_Xp.pkl	A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and their embeddings in columns.

Note: These files differ in the number of columns they contain. If you train your own model on one of these file types, use the same file type during prediction.
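
Because the feature files differ in column count, a quick comparison before prediction can catch a mismatch early. The sketch below rests on an assumption not confirmed by this page: that each .pkl file unpickles to a two-dimensional matrix exposing a .shape attribute (as NumPy arrays and SciPy sparse matrices do). The function names are hypothetical. Only unpickle files you trust, and note that the library that produced the matrix must be installed for pickle.load to succeed.

```python
import pickle


def load_pickle(path):
    """Load one pickled object from disk (trusted files only)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def num_columns(matrix):
    # Assumption: the object is two-dimensional and exposes .shape.
    return matrix.shape[1]


def same_feature_width(train_path, predict_path):
    """Return True if both feature files have the same number of columns."""
    return num_columns(load_pickle(train_path)) == num_columns(load_pickle(predict_path))
```

For instance, same_feature_width("biocyc21_Xe.pkl", "cami_Xe.pkl") would be expected to hold, whereas comparing an Xe file with an Xa file would not.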

Example

Execute the following command to preprocess "cami" data (as an example):

python main.py --preprocess-dataset --object-name "biocyc.pkl" --pathway2ec-name "pathway2ec.pkl" --pathway2ec-idx-name "pathway2ec_idx.pkl" --hin-name "hin.pkl" --features-name "pathway2vec_embeddings.npz" --X-name "cami_X.pkl" --file-name "cami" --batch 50 --num-jobs 2

After running the command, the output will be saved to the dataset/ folder. All the feature files described in the table above are generated. The tree structure for the folder with the outputs will look like this:

leADS_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       └── ...
	├── dataset/
        │       ├── cami_Xa.pkl
        │       ├── cami_Xc.pkl
        │       ├── cami_Xe.pkl
        │       ├── cami_Xea.pkl
        │       ├── cami_Xec.pkl
        │       ├── cami_Xm.pkl
        │       ├── cami_Xp.pkl
        │       └── ...
	├── result/
        │       └── ...
	└── leADS/
                └── ...
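
To confirm that preprocessing produced every feature file, the dataset folder can be checked against the expected names. This is a small sketch with hypothetical helper names; the suffixes are the seven listed in the Output table above.

```python
from pathlib import Path

# Suffixes of the feature files listed in the Output table.
FEATURE_SUFFIXES = ("Xa", "Xc", "Xe", "Xea", "Xec", "Xm", "Xp")


def expected_feature_files(dataname):
    """Return the feature file names expected for a dataset name."""
    return [f"{dataname}_{suffix}.pkl" for suffix in FEATURE_SUFFIXES]


def missing_feature_files(dspath, dataname):
    """Return the expected feature files that are absent from dspath."""
    dspath = Path(dspath)
    return [
        name
        for name in expected_feature_files(dataname)
        if not (dspath / name).is_file()
    ]
```

After the "cami" example, missing_feature_files("/path/to/dataset", "cami") should return an empty list (the path is a placeholder).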

Training

Training can be done using one of the output files from the preprocessing step and the [DATANAME]_y.pkl file.

Keep in mind that the same feature file type must be used during pathway prediction, because each output file from the preprocessing step contains a different number of columns and a mismatch causes errors. The recommended command is shown here, but you can also use other flags described in the argument descriptions table to suit your requirements.

Input:

The input to the command is the output of the preprocessing step above (any one of the [DATANAME]_X*.pkl files) together with the [DATANAME]_y.pkl file.

Recommended command:

The basic command is represented below. Do not use this for training. This command is only a representation of all the flags used. See Examples below on how to train a model.

python main.py \
--train \
--train-labels \
--calc-ads \
--ads-percent 0.7 \
--acquisition-type "psp" \
--top-k 50 \
--ssample-input-size 0.7 \
--ssample-label-size 2000 \
--calc-subsample-size 1000 \
--lambdas 0.01 0.01 0.01 0.01 0.01 10 \
--penalty "l21" \
--X-name "[DATANAME]_X*.pkl" \
--y-name "[DATANAME]_y.pkl" \
--model-name "[MODELNAME] (without extension)" \
--ospath "[absolute path to the object files directory (e.g. objectset)]" \
--dspath "[absolute path to the dataset directory (e.g. dataset)]" \
--mdpath "[absolute path to the model directory (e.g. model)]" \
--rspath "[absolute path to the result directory (e.g. result)]" \
--logpath "[absolute path to the log directory (e.g. log)]" \
--batch 50 \
--max-inner-iter 100 \
--num-epochs 10 \
--num-models 10 \
--num-jobs 2

Argument descriptions:

The table below summarizes all the command-line arguments that are specific to this framework:

Argument name	Description	Default Value
--train	Training the leADS model	True
--train-labels	A boolean variable to suggest training leADS using only class-labels data (e.g. "biocyc21_Xe.pkl" and "biocyc21_y.pkl")	False
--calc-ads	A boolean variable indicating whether to subsample dataset using active dataset subsampling (ADS)	False
--ads-percent	Corresponds to the dataset subsampling size (within [0, 1])	0.7
--acquisition-type	The acquisition function for estimating the predictive uncertainty (["entropy", "mutual", "variation", "psp"])	"psp"
--top-k	The labels to be considered for variation ratio or psp acquisition functions	10
--ssample-input-size	Corresponds to the size of random subsampled inputs	0.7
--ssample-label-size	Corresponds to the size of random subsampled pathway labels	2000
--calc-subsample-size	The number of samples on which the cost function is computed	1000
--lambdas	Corresponds to the six hyper-parameters for constraints	0.01, 0.01, 0.01, 0.01, 0.01, 10
--penalty	The type of regularization term to be applied	l21
--X-name	Input space of multi-label data	biocyc_Xe.pkl
--y-name	Pathway space of multi-label data	biocyc_y.pkl
--model-name	Corresponds to the name of the model excluding any EXTENSION. The model name will have .pkl extension	leADS
--mdpath	Path to store model	Outside source code
--rspath	Path to store costs and resulting samples indices	Outside source code
--logpath	Path to the log directory	Outside source code
--ospath	The path to the data object that contains extracted information from the MetaCyc database (e.g. "biocyc.pkl")	Outside source code
--dspath	The path to the datasets	Outside source code
--batch	Batch size	50
--max-inner-iter	Corresponds to the number of inner iterations for logistic regression	100
--num-epochs	Corresponds to the number of iterations over the training set	10
--num-models	Corresponds to the number of base learners in an ensemble	10
--num-jobs	The number of parallel workers	2
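
Several of these flags depend on one another: the subsampling flags apply only together with --calc-ads, and --top-k matters only for the variation and psp acquisition functions. The sketch below shows one way to encode that when assembling arguments in a script. The function name build_train_args is hypothetical and the values are those of the recommended command; path, batch, and iteration flags are omitted for brevity and would be appended in the same way.

```python
VALID_ACQUISITIONS = {"entropy", "mutual", "variation", "psp"}
# Acquisition functions for which --top-k is considered (see the table above).
TOP_K_ACQUISITIONS = {"variation", "psp"}


def build_train_args(x_name, y_name, model_name, calc_ads=True,
                     acquisition_type="psp", top_k=50, ads_percent=0.7):
    """Return training flags for main.py as a list (hypothetical helper)."""
    if acquisition_type not in VALID_ACQUISITIONS:
        raise ValueError(f"unknown acquisition type: {acquisition_type}")
    args = ["--train", "--train-labels"]
    if calc_ads:
        # Subsampling flags are only added when ADS is requested.
        args += ["--calc-ads", "--ads-percent", str(ads_percent),
                 "--acquisition-type", acquisition_type]
        if acquisition_type in TOP_K_ACQUISITIONS:
            args += ["--top-k", str(top_k)]
    args += [
        "--ssample-input-size", "0.7",
        "--ssample-label-size", "2000",
        "--calc-subsample-size", "1000",
        "--lambdas", "0.01", "0.01", "0.01", "0.01", "0.01", "10",
        "--penalty", "l21",
        "--X-name", x_name,
        "--y-name", y_name,
        "--model-name", model_name,
    ]
    return args
```

The returned list can be prefixed with ["python", "main.py"] and passed to subprocess.run from inside the src/ folder.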

Output:

File	Description
[MODELNAME].pkl	The trained model
[MODELNAME]_cost.txt	This file contains error values between predicted values and expected values
[MODELNAME]_samples.pkl	This file contains the sample indices that were produced during the training of the model. It is only created if the subsampling flag (`--calc-ads`) is applied. See Examples 1 and 2
log file	This file contains information regarding the run such as time taken to train the model, arguments applied and the files to which the results were stored
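
The output names follow directly from the model name, so they can be derived in a script, for example to check that a run finished. This is a sketch with a hypothetical helper name. It assumes, as the directory trees in the examples on this page show, that the model is written to the model directory and the cost and samples files to the result directory. The log file is left out because its name does not depend on the model name.

```python
from pathlib import Path


def expected_training_outputs(model_name, mdpath, rspath, calc_ads):
    """Return the output paths expected after training (hypothetical helper)."""
    outputs = [
        Path(mdpath) / f"{model_name}.pkl",
        Path(rspath) / f"{model_name}_cost.txt",
    ]
    if calc_ads:
        # The samples file is only written when --calc-ads is used.
        outputs.append(Path(rspath) / f"{model_name}_samples.pkl")
    return outputs
```

Checking path.is_file() for each returned path gives a quick way to confirm that all expected outputs were written.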

Examples

Example 1:

If you wish to train on a pathway dataset (e.g. "biocyc21_Xe.pkl" and "biocyc21_y.pkl") using the subsampling step with variation or psp as the acquisition function, you need to provide the additional argument --top-k. Run the following command:

python main.py --train --train-labels --calc-ads --ads-percent 0.7 --acquisition-type "psp" --top-k 50 --ssample-input-size 0.7 --ssample-label-size 2000 --calc-subsample-size 1000 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --X-name "biocyc21_Xe.pkl" --y-name "biocyc21_y.pkl" --model-name "leADS_retrained_1" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-models 3 --num-jobs 2

After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:

leADS_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       ├── leADS_retrained_1.pkl
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       ├── leADS_retrained_1_cost.txt
        │       ├── leADS_retrained_1_samples.pkl
        │       └── ...
	├── log/
        │       ├── leADS_events
        │       └── ...
	└── leADS/
                └── ...

Example 2:

To train on a pathway dataset using subsampling and a different acquisition function (here, entropy, which does not need --top-k), execute the following command:

python main.py --train --train-labels --calc-ads --ads-percent 0.7 --acquisition-type "entropy" --ssample-input-size 0.7 --ssample-label-size 2000 --calc-subsample-size 1000 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --X-name "biocyc21_Xe.pkl" --y-name "biocyc21_y.pkl" --model-name "leADS_retrained_2" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-models 3 --num-jobs 2

After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:

leADS_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       ├── leADS_retrained_2.pkl
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       ├── leADS_retrained_2_cost.txt
        │       ├── leADS_retrained_2_samples.pkl
        │       └── ...
	├── log/
        │       ├── leADS_events
        │       └── ...
	└── leADS/
                └── ...

Example 3:

To train on a pathway dataset without the subsampling step, execute the following command:

python main.py --train --train-labels --ssample-input-size 0.7 --ssample-label-size 2000 --calc-subsample-size 1000 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --X-name "biocyc21_Xe.pkl" --y-name "biocyc21_y.pkl" --model-name "leADS_retrained_3" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-models 3 --num-jobs 2

After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:

leADS_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       ├── leADS_retrained_3.pkl
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       ├── leADS_retrained_3_cost.txt
        │       └── ...
	├── log/
        │       ├── leADS_events
        │       └── ...
	└── leADS/
                └── ...

Example 4:

To train on a pathway dataset using predefined samples, provide the additional flag --train-selected-sample and pass the name of the samples file (e.g. "leADS_samples.pkl"), which must be stored in rspath, to --samples-ids:

python main.py --train --train-labels --train-selected-sample --ssample-input-size 0.7 --ssample-label-size 2000 --calc-subsample-size 1000 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --X-name "biocyc21_Xe.pkl" --y-name "biocyc21_y.pkl" --samples-ids "leADS_samples.pkl" --model-name "leADS_retrained_4" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-models 3 --num-jobs 2

After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:

leADS_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       ├── leADS_retrained_4.pkl
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       ├── leADS_retrained_4_cost.txt
        │       └── ...
	├── log/
        │       ├── leADS_events
        │       └── ...
	└── leADS/
                └── ...
