
6. Advanced usage

Abdurrahman Abul-Basher edited this page Jun 3, 2021 · 31 revisions

Overview

leADS can be trained in several ways. The training data made available to users consists of (i) the Enzyme Commission (EC) number indices with embeddings (biocyc21_Xe.pkl) and (ii) the pathway indices (biocyc21_y.pkl). To train leADS on features of your own choice, a preprocessing step must be run first.

               --------------------------------------------------
              | Train the model based on features using files    |
              | other than [DATANAME]_X.pkl and [DATANAME]_y.pkl |
               --------------------------------------------------
                                        │
                  [Yes]─────────────────┴─────────────────[No]
                    │                                       │
               ------------                  ------------------------------
              | Preprocess |                | Train using [DATANAME]_X.pkl |
               ------------                 |     and [DATANAME]_y.pkl     |
                    │                        ------------------------------
                    │
    ---------------------------------
   | Train using data of your choice |
    ---------------------------------

Note: As before, make sure the leADS source code (see Installing leADS) is placed in the same directory as explained in Download files. Additionally, create log/ and result/ folders in the same leADS_materials/ directory (if you have not already created them during pathway prediction). The final structure should look like this:

leADS_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       └── ...
	├── log/
        │       └── ...
	└── leADS/
                └── ...
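
The two extra folders can also be created from Python. The snippet below is a minimal sketch, not part of leADS: the helper name ensure_output_dirs is hypothetical, and it only assumes the leADS_materials/ layout shown above.

```python
from pathlib import Path


def ensure_output_dirs(materials_dir):
    """Create log/ and result/ under leADS_materials/ if they are missing.

    Hypothetical helper (not part of leADS). Returns the names of the
    folders that were actually created.
    """
    root = Path(materials_dir)
    created = []
    for name in ("log", "result"):
        folder = root / name
        if not folder.exists():
            folder.mkdir(parents=True)
            created.append(name)
    return created
```

Calling ensure_output_dirs("leADS_materials") a second time returns an empty list, so it is safe to run before every experiment.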

For all experiments, open a terminal, navigate to the src folder in the leADS directory, and run the commands from there. To display leADS's running options, use: python main.py --help. The help output should be self-explanatory.
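
If you prefer to drive leADS from a script instead of changing directories by hand, something like the following may help. It is a sketch under assumptions: run_leads is a hypothetical wrapper, and the src folder is assumed to sit at leADS_materials/leADS/src, following the layout above.

```python
import subprocess
import sys
from pathlib import Path


def run_leads(args, materials_dir, dry_run=True):
    """Run `python main.py <args>` with leADS/src as the working directory.

    Hypothetical wrapper (not part of leADS). With dry_run=True nothing is
    executed; the command and working directory are returned for inspection.
    """
    src_dir = Path(materials_dir) / "leADS" / "src"
    cmd = [sys.executable, "main.py", *args]
    if not dry_run:
        # check=True raises CalledProcessError if main.py exits with an error.
        subprocess.run(cmd, cwd=src_dir, check=True)
    return cmd, src_dir
```

For example, run_leads(["--help"], "leADS_materials", dry_run=False) would print the running options, provided the directory layout matches the assumption.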

Preprocessing

This step is required if users wish to use a different input file (other than just Enzyme Commission (EC) numbers) for training followed by pathway prediction. Preprocessing is only needed when the user wants to generate embedding files such as "[DATANAME]_Xe.pkl" or "[DATANAME]_Xa.pkl" from a "[DATANAME]_X.pkl" file in order to use them for training.

Input:

The input file used for preprocessing is any matrix file containing Enzyme Commission (EC) numbers (e.g. biocyc21_X.pkl or cami_X.pkl).

Other files required for preprocessing:

  1. biocyc.pkl
  2. pathway2ec.pkl
  3. pathway2ec_idx.pkl
  4. pathway2vec_embeddings.npz
  5. hin.pkl
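
Before launching preprocessing it can be worth confirming that these five files are in place. The check below is a minimal sketch: missing_files is a hypothetical helper, and it assumes all five files live in a single directory (for example objectset/), which this page states explicitly only for biocyc.pkl.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = (
    "biocyc.pkl",
    "pathway2ec.pkl",
    "pathway2ec_idx.pkl",
    "pathway2vec_embeddings.npz",
    "hin.pkl",
)


def missing_files(directory, names=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]
```

An empty list means every required file was found.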

Command:

python main.py \
--preprocess-dataset \
--object-name "biocyc.pkl" \
--pathway2ec-name "pathway2ec.pkl" \
--pathway2ec-idx-name "pathway2ec_idx.pkl" \
--hin-name "hin.pkl" \
--features-name "pathway2vec_embeddings.npz" \
--X-name "[DATANAME]_X.pkl" \
--file-name "[input (or save) file name]" \
--ospath "[absolute path to the object files directory (e.g. objectset)]" \
--dspath "[absolute path to the dataset directory (e.g. dataset)]" \
--batch 50 \
--num-jobs 2

Argument descriptions:

The table below summarizes all the command-line arguments that are specific to this framework:

| Argument name | Description | Value |
| --- | --- | --- |
| --preprocess-dataset | Preprocess inputs based on the BioCyc collection | True |
| --object-name | The preprocessed MetaCyc database file | biocyc.pkl |
| --pathway2ec-name | The matrix file representing the pathway-EC association | pathway2ec.pkl |
| --pathway2ec-idx-name | The pathway2ec association indices file | pathway2ec_idx.pkl |
| --hin-name | The heterogeneous information network file | hin.pkl |
| --features-name | The features corresponding to ECs and pathways | pathway2vec_embeddings.npz |
| --X-name | The input file name to be provided for preprocessing | [input file name]_X.pkl |
| --file-name | The name of the input preprocessed files (without extension) | [input (or save) file name] |
| --ospath | The path to the data object that contains extracted information from the MetaCyc database (biocyc.pkl) | Outside source code |
| --dspath | The path to the datasets | Outside source code |
| --batch | Batch size | 50 |
| --num-jobs | The number of parallel workers | 2 |
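
When preprocessing several datasets, assembling the argument list programmatically avoids copy-and-paste mistakes. The following is a sketch only: preprocess_args is a hypothetical helper, it uses no flags beyond those documented on this page, and it leaves out --ospath and --dspath unless you pass them.

```python
import shlex


def preprocess_args(x_name, file_name, ospath=None, dspath=None,
                    batch=50, num_jobs=2):
    """Build the main.py argument list for the preprocessing step."""
    args = [
        "--preprocess-dataset",
        "--object-name", "biocyc.pkl",
        "--pathway2ec-name", "pathway2ec.pkl",
        "--pathway2ec-idx-name", "pathway2ec_idx.pkl",
        "--hin-name", "hin.pkl",
        "--features-name", "pathway2vec_embeddings.npz",
        "--X-name", x_name,
        "--file-name", file_name,
    ]
    # Paths are optional here; the documented command passes absolute paths.
    if ospath is not None:
        args += ["--ospath", str(ospath)]
    if dspath is not None:
        args += ["--dspath", str(dspath)]
    args += ["--batch", str(batch), "--num-jobs", str(num_jobs)]
    return args


# Print a shell-ready command line for the "cami" dataset.
print(shlex.join(["python", "main.py", *preprocess_args("cami_X.pkl", "cami")]))
```

The printed line can be pasted into a terminal opened in the src folder.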

Output:

The output files generated after running the command are:

All files are matrix files stored in the "dspath" location, with organism IDs as rows.

| File | Columns |
| --- | --- |
| [DATANAME]_Xa.pkl | Concatenated abundance features |
| [DATANAME]_Xc.pkl | Concatenated coverage features |
| [DATANAME]_Xe.pkl | Concatenated EC features |
| [DATANAME]_Xea.pkl | Concatenated EC and abundance features |
| [DATANAME]_Xec.pkl | Concatenated EC and coverage features |
| [DATANAME]_Xm.pkl | Concatenated EC, abundance, and coverage features |
| [DATANAME]_Xp.pkl | The transformed instances to EC |

Note: These files differ in the total number of columns they contain. If you train your own model on one of them, the same type of file must also be used during prediction.
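
A quick way to catch a mismatch early is to compare column counts before predicting. The sketch below rests on assumptions that this page does not confirm: that the .pkl files are standard Python pickles, and that the stored object exposes a shape attribute (as NumPy arrays and SciPy sparse matrices do). The helper names are hypothetical. Unpickling requires the libraries used to create the file, and you should only unpickle files you trust.

```python
import pickle


def num_columns(path):
    """Return the number of columns of a pickled feature matrix."""
    with open(path, "rb") as handle:
        matrix = pickle.load(handle)
    shape = getattr(matrix, "shape", None)
    if shape is not None:
        return shape[1]
    # Fallback for a plain list of rows.
    return len(matrix[0])


def same_feature_width(train_path, predict_path):
    """True if both matrices have the same number of columns."""
    return num_columns(train_path) == num_columns(predict_path)
```

For instance, same_feature_width("dataset/biocyc21_Xe.pkl", "dataset/cami_Xe.pkl") should be True if both files were produced with the same feature type.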

Example

Execute the following command to preprocess "cami" data (as an example):

python main.py --preprocess-dataset --object-name "biocyc.pkl" --pathway2ec-name "pathway2ec.pkl" --pathway2ec-idx-name "pathway2ec_idx.pkl" --hin-name "hin.pkl" --features-name "pathway2vec_embeddings.npz" --X-name "cami_X.pkl" --file-name "cami" --batch 50 --num-jobs 2

After running the command, the output will be saved to the dataset/ folder. All the feature files described in the output table above are generated. The tree structure for the folder with the outputs will look like this:

leADS_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       └── ...
	├── dataset/
        │       ├── cami_Xa.pkl
        │       ├── cami_Xc.pkl
        │       ├── cami_Xe.pkl
        │       ├── cami_Xea.pkl
        │       ├── cami_Xec.pkl
        │       ├── cami_Xm.pkl
        │       ├── cami_Xp.pkl
        │       └── ...
	├── result/
        │       └── ...
	└── leADS/
                └── ...
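
To confirm that preprocessing produced every file listed in the output table, a small check such as the one below can be used. It is a sketch: the helper names are hypothetical, and the suffix list simply mirrors the seven files shown in the tree above.

```python
from pathlib import Path

# Suffixes of the seven feature files listed in the output table.
FEATURE_SUFFIXES = ("Xa", "Xc", "Xe", "Xea", "Xec", "Xm", "Xp")


def expected_feature_files(file_name):
    """File names preprocessing is expected to write for --file-name."""
    return [f"{file_name}_{suffix}.pkl" for suffix in FEATURE_SUFFIXES]


def missing_feature_files(dataset_dir, file_name):
    """Return the expected feature files not found in the dataset folder."""
    base = Path(dataset_dir)
    return [name for name in expected_feature_files(file_name)
            if not (base / name).is_file()]
```

For the example above, missing_feature_files("leADS_materials/dataset", "cami") should return an empty list once the run has finished.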

Training

Training uses one of the output files from the preprocessing step together with the [DATANAME]_y.pkl file.

Keep in mind that the same type of file must be used during pathway prediction to avoid errors, since each output file from the preprocessing step contains a different number of columns. The recommended command is shown here, but you can also use other flags, as described in the argument descriptions table, to suit your requirements.

Input:

The input to the command is the output of the preprocessing step above (any one of the [DATANAME]_X*.pkl files) together with the [DATANAME]_y.pkl file.

Recommended command:

python main.py \
--train \
--train-labels \
--calc-ads \
--ads-percent 0.7 \
--acquisition-type "psp" \
--top-k 50 \
--ssample-input-size 0.7 \
--ssample-label-size 2000 \
--calc-subsample-size 1000 \
--lambdas 0.01 0.01 0.01 0.01 0.01 10 \
--penalty "l21" \
--X-name "[DATANAME]_X*.pkl" \
--y-name "[DATANAME]_y.pkl" \
--model-name "[MODELNAME] (without extension)" \
--ospath "[absolute path to the object files directory (e.g. objectset)]" \
--dspath "[absolute path to the dataset directory (e.g. dataset)]" \
--mdpath "[absolute path to the model directory (e.g. model)]" \
--rspath "[absolute path to the result directory (e.g. result)]" \
--logpath "[absolute path to the log directory (e.g. log)]" \
--batch 50 \
--max-inner-iter 100 \
--num-epochs 10 \
--num-models 10 \
--num-jobs 2

Argument descriptions:

The table below summarizes all the command-line arguments that are specific to this framework:

| Argument name | Description | Default Value |
| --- | --- | --- |
| --train | Train the leADS model | True |
| --train-labels | A boolean flag to train leADS using only class-label data (e.g. "biocyc21_Xe.pkl" and "biocyc21_y.pkl") | False |
| --calc-ads | A boolean flag indicating whether to subsample the dataset using active dataset subsampling (ADS) | False |
| --ads-percent | The dataset subsampling size (within [0, 1]) | 0.7 |
| --acquisition-type | The acquisition function for estimating the predictive uncertainty (["entropy", "mutual", "variation", "psp"]) | "psp" |
| --top-k | The number of labels to be considered for the variation ratio or psp acquisition functions | 10 |
| --ssample-input-size | The size of the randomly subsampled inputs | 0.7 |
| --ssample-label-size | The size of the randomly subsampled pathway labels | 2000 |
| --calc-subsample-size | The number of samples on which the cost function is computed | 1000 |
| --lambdas | The six hyper-parameters for constraints | 0.01, 0.01, 0.01, 0.01, 0.01, 10 |
| --penalty | The type of regularization term to be applied | l21 |
| --X-name | Input space of multi-label data | biocyc_Xe.pkl |
| --y-name | Pathway space of multi-label data | biocyc_y.pkl |
| --model-name | The name of the model, excluding any extension. The saved model will have a .pkl extension | leADS |
| --mdpath | Path to store the model | Outside source code |
| --rspath | Path to store costs and resulting sample indices | Outside source code |
| --logpath | Path to the log directory | Outside source code |
| --ospath | The path to the data object that contains extracted information from the MetaCyc database (biocyc.pkl) | Outside source code |
| --dspath | The path to the datasets | Outside source code |
| --batch | Batch size | 50 |
| --max-inner-iter | The number of inner iterations for logistic regression | 100 |
| --num-epochs | The number of iterations over the training set | 10 |
| --num-models | The number of base learners in an ensemble | 10 |
| --num-jobs | The number of parallel workers | 2 |
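
A few of these constraints are easy to check before starting a long training run. The function below is a sketch: validate_training_options is a hypothetical helper, and whether main.py enforces the same rules itself is not stated on this page.

```python
# Acquisition functions listed in the table above.
ACQUISITION_TYPES = ("entropy", "mutual", "variation", "psp")


def validate_training_options(ads_percent, acquisition_type, lambdas):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    if not 0 <= ads_percent <= 1:
        problems.append("--ads-percent must lie within [0, 1]")
    if acquisition_type not in ACQUISITION_TYPES:
        problems.append(
            "--acquisition-type must be one of " + ", ".join(ACQUISITION_TYPES)
        )
    if len(lambdas) != 6:
        problems.append("--lambdas expects six values")
    return problems
```

With the recommended settings, validate_training_options(0.7, "psp", [0.01, 0.01, 0.01, 0.01, 0.01, 10]) returns an empty list.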

Output:

| File | Description |
| --- | --- |
| [MODELNAME].pkl | The trained model |
| [MODELNAME]_cost.txt | The error values between predicted values and expected values |
| [MODELNAME]_samples.pkl | The sample indices that were produced during training of the model. It is only created if the subsampling flag (--calc-ads) is applied. See Examples 1 and 2 |
| log file | Information regarding the run, such as the time taken to train the model, the arguments applied, and the files to which the results were stored |
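
The relationship between --model-name, --calc-ads and the files written can be summarised in a few lines. This is a sketch with a hypothetical helper name, and it assumes --mdpath and --rspath point at the model/ and result/ folders inside leADS_materials/, as in the examples on this page.

```python
from pathlib import Path


def expected_training_outputs(materials_dir, model_name, calc_ads):
    """List the files a training run is expected to write."""
    root = Path(materials_dir)
    outputs = [
        root / "model" / f"{model_name}.pkl",
        root / "result" / f"{model_name}_cost.txt",
    ]
    if calc_ads:
        # The samples file is only written when subsampling is enabled.
        outputs.append(root / "result" / f"{model_name}_samples.pkl")
    return outputs
```

Checking path.is_file() for each returned path after a run gives a quick confirmation that training completed.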

Examples

Example 1:

If you wish to train a multi-label dataset (e.g. "biocyc21_Xe.pkl" and "biocyc21_y.pkl") using the subsampling step with variation or psp as an acquisition function, you will need to provide an additional argument --top-k. Run the following command:

python main.py --train --train-labels --calc-ads --ads-percent 0.7 --acquisition-type "psp" --top-k 50 --ssample-input-size 0.7 --ssample-label-size 2000 --calc-subsample-size 1000 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --X-name "biocyc21_Xe.pkl" --y-name "biocyc21_y.pkl" --model-name "leADS_retrained_1" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-models 3 --num-jobs 2

After running the command, the output will be saved to the model/, result/ and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:

leADS_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       ├── leADS_retrained_1.pkl
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       ├── leADS_retrained_1_cost.txt
        │       ├── leADS_retrained_1_samples.pkl
        │       └── ...
	├── log/
        │       ├── leADS_events
        │       └── ...
	└── leADS/
                └── ...
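
The rule that --top-k accompanies only the variation and psp acquisition functions can be captured in a small helper. This is a sketch, not leADS code: subsampling_args is hypothetical, and the default of 10 for top_k is taken from the argument table above.

```python
# Acquisition functions that take --top-k, per the argument table.
TOP_K_ACQUISITIONS = ("variation", "psp")


def subsampling_args(acquisition_type, ads_percent=0.7, top_k=10):
    """Build the subsampling-related flags for a training command."""
    args = [
        "--calc-ads",
        "--ads-percent", str(ads_percent),
        "--acquisition-type", acquisition_type,
    ]
    if acquisition_type in TOP_K_ACQUISITIONS:
        args += ["--top-k", str(top_k)]
    return args
```

subsampling_args("psp", top_k=50) yields the subsampling flags used in Example 1, while subsampling_args("entropy") leaves --top-k out.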

Example 2:

To train a multi-label dataset using subsampling and a different acquisition function, execute the following command:

python main.py --train --train-labels --calc-ads --ads-percent 0.7 --acquisition-type "entropy" --ssample-input-size 0.7 --ssample-label-size 2000 --calc-subsample-size 1000 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --X-name "biocyc21_Xe.pkl" --y-name "biocyc21_y.pkl" --model-name "leADS_retrained_2" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-models 3 --num-jobs 2

After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:

leADS_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       ├── leADS_retrained_2.pkl
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       ├── leADS_retrained_2_cost.txt
        │       ├── leADS_retrained_2_samples.pkl
        │       └── ...
	├── log/
        │       ├── leADS_events
        │       └── ...
	└── leADS/
                └── ...

Example 3:

To train a multi-label dataset without the subsampling step, execute the following command:

python main.py --train --train-labels --ssample-input-size 0.7 --ssample-label-size 2000 --calc-subsample-size 1000 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --X-name "biocyc21_Xe.pkl" --y-name "biocyc21_y.pkl" --model-name "leADS_retrained_3" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-models 3 --num-jobs 2

After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:

leADS_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       ├── leADS_retrained_3.pkl
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       ├── leADS_retrained_3_cost.txt
        │       └── ...
	├── log/
        │       ├── leADS_events
        │       └── ...
	└── leADS/
                └── ...

Example 4:

To train a multi-label dataset using predefined samples, you need to provide the additional argument --train-selected-sample and pass the name of the samples file (e.g. "leADS_samples.pkl"), which must be stored in rspath, to --samples-ids:

python main.py --train --train-labels --train-selected-sample --ssample-input-size 0.7 --ssample-label-size 2000 --calc-subsample-size 1000 --lambdas 0.01 0.01 0.01 0.01 0.01 10 --penalty "l21" --X-name "biocyc21_Xe.pkl" --y-name "biocyc21_y.pkl" --samples-ids "leADS_samples.pkl" --model-name "leADS_retrained_4" --batch 50 --max-inner-iter 5 --num-epochs 10 --num-models 3 --num-jobs 2

After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:

leADS_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       ├── leADS_retrained_4.pkl
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       ├── leADS_retrained_4_cost.txt
        │       └── ...
	├── log/
        │       ├── leADS_events
        │       └── ...
	└── leADS/
                └── ...
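
Because Example 4 depends on a samples file that already exists in rspath, it may be worth checking for it before starting the run. The helper below is a sketch with a hypothetical name; it only verifies that the file is present and does not inspect its contents.

```python
from pathlib import Path


def selected_sample_args(result_dir, samples_name):
    """Return the flags for training on predefined samples.

    Raises FileNotFoundError if the samples file is not in the result folder.
    """
    samples_path = Path(result_dir) / samples_name
    if not samples_path.is_file():
        raise FileNotFoundError(
            f"{samples_path} not found; a samples file is only written "
            "by a training run that used --calc-ads"
        )
    return ["--train-selected-sample", "--samples-ids", samples_name]
```

The returned list can be appended to the remaining training flags from Example 4.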