5. Advanced usage

Overview

To train reMap, the main input data consists of i) EC number indices ("biocyc205_tier23_9255_X.pkl") and ii) pathway indices ("biocyc205_tier23_9255_y.pkl"). The remaining files can be generated with the preprocessing flags (see Preprocessing).

Note: Make sure to put the source code reMap/ (see Installing reMap) into the same directory as explained in the Download files section. Additionally, create log/ and result/ folders in the same reMap_materials/ directory. The final structure should look like this:

reMap_materials/
	├── model/
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       └── ...
	├── log/
        │       └── ...
	└── reMap/
                └── ...
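
The layout above can also be prepared with a short script. The following is a minimal sketch, assuming Python 3; the helper name `prepare_layout` and its `root` argument are illustrative and are not part of reMap. It creates the log/ and result/ folders and reports which of the expected folders are still missing.

```python
from pathlib import Path

# Folder names taken from the directory layout above.
EXPECTED_DIRS = ("model", "dataset", "result", "log", "reMap")


def prepare_layout(root):
    """Create log/ and result/ under root; return expected folders still missing."""
    root = Path(root)
    for name in ("log", "result"):
        # exist_ok avoids an error when the folder is already there.
        (root / name).mkdir(parents=True, exist_ok=True)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

Calling `prepare_layout("reMap_materials")` returns an empty list once model/, dataset/, and reMap/ are in place.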

For all experiments, open a terminal (on Linux and macOS) or an Anaconda command prompt (on Windows), navigate to the src/ folder in the reMap/ directory, and then run the commands as shown in the Examples sections of Preprocessing and Training.

To display reMap's running options, use: python main.py --help. The help output should be self-explanatory.

Preprocessing

This step is crucial, but it is only needed if you wish to build the pathway group centroids and to recover the maximum expected pathways for each group. Its outputs are several supplementary files that are required for transformation and training, such as "[FILENAME]_centroid.npz" and "[DATANAME]_B.pkl".

Input:

The input files used for preprocessing are:

phi.npz
sigma.npz
pathway2vec_embeddings.npz
hin.pkl
vocab.pkl

Command:

The basic command is represented below. Do not use this for preprocessing. This command is only a representation of all the flags used. See Example below on how to preprocess your datasets.

python main.py \
--define-bags \
--recover-max-bags \
--alpha 16 \
--top-k 90 \
--v-cos 0.1 \
--vocab-name "vocab.pkl" \
--bag-phi-name "phi.npz" \
--bag-sigma-name "sigma.npz" \
--hin-name "hin.pkl" \
--features-name "pathway2vec_embeddings.npz" \
--file-name "[FILENAME]" \
--y-name "[DATANAME]_y.pkl" \
--mdpath "[absolute path to the model directory (e.g. model)]" \
--dspath "[absolute path to the dataset directory (e.g. dataset)]"

Argument descriptions:

The table below summarizes all the command-line arguments that are specific to this framework:

Argument name	Description	Value
--define-bags	Whether to construct pathway groups centroids	False
--recover-max-bags	Whether to recover the maximum number of pathway groups	False
--alpha	A hyper-parameter for controlling pathway groups centroids	16
--top-k	Top k pathways to be considered for each pathway group	90
--v-cos	A cutoff threshold for cosine similarity	0.1
--vocab-name	A dictionary file representing pathway indices as keys and MetaCyc pathway ids as values	vocab.pkl
--bag-phi-name	The filename for pathways distribution over pathway groups	phi.npz
--bag-sigma-name	The filename for pathway groups covariance	sigma.npz
--hin-name	The heterogeneous information network file	hin.pkl
--features-name	The features corresponding to ECs and pathways	pathway2vec_embeddings.npz
--y-name	The input file name to be provided for preprocessing	[DATANAME]_y.pkl
--file-name	The names of input preprocessed files (without extension)	[FILENAME]
--mdpath	The path to the supplementary files	[Outside source code]
--dspath	The path to the datasets	[Outside source code]
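
The flags in the table can also be assembled programmatically, which helps avoid quoting and line-continuation mistakes. The following is a minimal sketch, not part of reMap: the helper `build_command` is hypothetical, the flag names and values are the ones listed in the table, and "temp" is the `--file-name` value used in the Examples section.

```python
import shlex
import sys


def build_command(switches, options, script="main.py"):
    """Assemble an argument list for main.py from switches and name/value options."""
    argv = [sys.executable, script]
    argv += [f"--{name}" for name in switches]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    return argv


argv = build_command(
    switches=["define-bags"],
    options={
        "alpha": 16,
        "top-k": 90,
        "hin-name": "hin.pkl",
        "vocab-name": "vocab.pkl",
        "bag-phi-name": "phi.npz",
        "bag-sigma-name": "sigma.npz",
        "features-name": "pathway2vec_embeddings.npz",
        "file-name": "temp",
    },
)
# Print the command for review only; nothing is executed here.
print(shlex.join(argv[1:]))
```

To actually run the command, `argv` could be passed to `subprocess.run(argv, check=True)` from inside the src/ folder.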

Output:

The output files generated after running the command are:

With the `--define-bags` flag only

See Example 1 for the command:

File	Description
[FILENAME]_centroid.npz	A matrix file (stored in the "dspath" location) representing the pathway group centroids.
[FILENAME]_exp_phi_trim.npz	A matrix file (stored in the "dspath" location) representing the distribution of pathways over groups. The rows correspond to the group indices and columns represent the pathway indices.
[FILENAME]_features.npz	A matrix file (stored in the "dspath" location) representing pathway features. It contains 2526 pathway indices shown in the first column and their 128 features in the remaining columns.
[FILENAME]_rho.npz	A matrix file (stored in the "dspath" location) representing the group-group correlations.
[FILENAME]_idxvocab.pkl	A file (stored in the "dspath" location) representing the pathway indices.
[FILENAME]_labels_distr_idx.pkl	A file (stored in the "dspath" location) representing information about indices of pathways and their associated pathway groups indices.
[FILENAME]_pathway_group.pkl	A binary matrix file (stored in the "dspath" location) indicating the association of group indices (rows) with pathway indices (columns).
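
To check what one of the .npz outputs contains without loading it, the archive members can be listed with the Python standard library. This is a minimal sketch, assuming the files were written by the standard NumPy or SciPy savers, which store each array as a `.npy` member of a ZIP archive; the helper name `npz_members` is hypothetical.

```python
import zipfile


def npz_members(path):
    """List the arrays stored in an .npz archive without loading them."""
    with zipfile.ZipFile(path) as archive:
        # Each array is stored as "<name>.npy"; strip the suffix for readability.
        return [
            name[:-4] if name.endswith(".npy") else name
            for name in archive.namelist()
        ]
```

For example, `npz_members("dataset/temp_centroid.npz")` would list the array names stored in the centroid file produced with `--file-name "temp"`; the path is relative to reMap_materials/ and is only illustrative.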

With the `--recover-max-bags` flag only

See Example 2 for the command:

File	Description
[FILENAME]_B.pkl	A +1/-1 matrix file (stored in the "dspath" location) indicating the presence/absence of group indices for each organism (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate pathway group indices.
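
To make the +1/-1 encoding concrete, the toy example below reads off the pathway groups marked as present for each row. It is only an illustration: the values are made up, the matrix is a plain list of lists, and the actual object stored in the .pkl file may be a different type.

```python
def present_groups(row):
    """Return the column indices (pathway group indices) marked +1 in one row."""
    return [index for index, value in enumerate(row) if value == 1]


# Toy matrix: 2 organisms (rows) x 4 pathway groups (columns). Made-up values.
B = [
    [1, -1, -1, 1],
    [-1, 1, -1, -1],
]
for organism, row in enumerate(B):
    print(organism, present_groups(row))
# Prints:
# 0 [0, 3]
# 1 [1]
```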

With both the `--define-bags` and `--recover-max-bags` flags

See Example 3. Running with both flags produces the combined outputs of running each flag separately.

Examples

Example 1:

To construct groups, execute the following command:

python main.py --define-bags --alpha 16 --top-k 90 --hin-name "hin.pkl" --vocab-name "vocab.pkl" --bag-phi-name "phi.npz" --bag-sigma-name "sigma.npz" --features-name "pathway2vec_embeddings.npz" --file-name "temp"

After running the command, the output will be saved to the dataset/ folder. All the files described in the table above are generated.

reMap_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       └── ...
	├── dataset/
        │       ├── temp_centroid.npz
        │       ├── temp_exp_phi_trim.npz
        │       ├── temp_features.npz
        │       ├── temp_rho.npz
        │       ├── temp_labels_distr_idx.pkl
        │       ├── temp_idxvocab.pkl
        │       ├── temp_pathway_group.pkl
        │       └── ...
	├── result/
        │       └── ...
	└── reMap/
                └── ...
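
Whether Example 1 produced every file can be verified with a small check. The sketch below is an assumption-laden helper, not part of reMap: the name `missing_outputs` is hypothetical, and the suffixes are copied from the `--define-bags` output table.

```python
from pathlib import Path

# Suffixes taken from the --define-bags output table.
DEFINE_BAGS_SUFFIXES = (
    "_centroid.npz",
    "_exp_phi_trim.npz",
    "_features.npz",
    "_rho.npz",
    "_idxvocab.pkl",
    "_labels_distr_idx.pkl",
    "_pathway_group.pkl",
)


def missing_outputs(dspath, file_name):
    """Return expected --define-bags outputs that are absent from dspath."""
    dspath = Path(dspath)
    expected = [file_name + suffix for suffix in DEFINE_BAGS_SUFFIXES]
    return [name for name in expected if not (dspath / name).is_file()]
```

After Example 1, `missing_outputs("dataset", "temp")` should return an empty list when run from reMap_materials/.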

Example 2:

To recover the maximum set of groups, execute the following command:

python main.py --recover-max-bags --alpha 16 --v-cos 0.1 --file-name "temp" --y-name "biocyc_y.pkl"

After running the command, the output will be saved to the dataset/ folder. The file described in the `--recover-max-bags` output table is generated.

reMap_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       └── ...
	├── dataset/
        │       ├── temp_B.pkl
        │       └── ...
	├── result/
        │       └── ...
	└── reMap/
                └── ...

Example 3:

If you wish to perform the above two examples, execute the following command:

python main.py --define-bags --recover-max-bags --alpha 16 --top-k 90 --v-cos 0.1 --hin-name "hin.pkl" --vocab-name "vocab.pkl" --bag-phi-name "phi.npz" --bag-sigma-name "sigma.npz" --features-name "pathway2vec_embeddings.npz" --file-name "temp" --y-name "biocyc_y.pkl"

After running the command, the output will be saved to the dataset/ folder. All the files described in Example 1 and Example 2 above are generated.

Training

Training can be done using one of the output files from the preprocessing step and the [DATANAME]_y.pkl file.

Use the same file for pathway prediction as for training: each output file from the preprocessing step contains a different number of columns, so mixing files causes errors. The recommended command is shown here, but you can also use other flags, as described in the argument descriptions table, to suit your requirements.
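
The column-count requirement can be checked before training or prediction. The following is a toy sketch under stated assumptions: the matrices are plain lists of rows, whereas the actual .pkl files may hold a different matrix type, and the helper `same_column_count` is hypothetical.

```python
def same_column_count(train_rows, predict_rows):
    """Return True when every row in both inputs has the same number of columns."""
    widths = {len(row) for row in train_rows} | {len(row) for row in predict_rows}
    # Exactly one distinct width means the two inputs are compatible.
    return len(widths) == 1


# Toy data with made-up values: 3 columns for training, 2 for prediction.
train = [[1, 0, 1], [0, 0, 1]]
predict = [[0, 1]]
print(same_column_count(train, predict))  # prints False
```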

Input:

The input to the command is the output obtained from the preprocessing step above (any one of the [DATANAME]_X*.pkl files and the [DATANAME]_y.pkl file).

Recommended command:

The basic command is represented below. Do not use this for training. This command is only a representation of all the flags used. See Examples below on how to train a model.

python main.py \
--train \
--alpha 16 \
--ssample-input-size 0.05 \
--ssample-label-size 50 \
--calc-subsample-size 50 \
--bags-labels "bag_pathway.pkl" \
--features-name "features.npz" \
--bag-centroid-name "bag_centroid.npz" \
--rho-name "rho.npz" \
--X-name "[DATANAME]_X.pkl" \
--y-name "[DATANAME]_y.pkl" \
--yB-name "[DATANAME]_B.pkl" \
--model-name "[MODELNAME] (without extension)" \
--dspath "[absolute path to the dataset directory (e.g. dataset)]" \
--mdpath "[absolute path to the model directory (e.g. model)]" \
--rspath "[absolute path to the result directory (e.g. result)]" \
--logpath "[absolute path to the log directory (e.g. log)]" \
--batch 30 \
--num-epochs 3 \
--num-jobs 2

Argument descriptions:

The table below summarizes all the command-line arguments that are specific to this framework:

Argument name	Description	Default Value
--train	Train the reMap model	False
--alpha	A hyper-parameter for controlling pathway groups centroids	16
--ssample-input-size	The size of input subsample	0.05
--ssample-label-size	Maximum number of pathways to be sampled	50
--calc-subsample-size	Compute loss on selected samples	50
--bags-labels	The input file name for pathway groups consisting of associated pathways to groups	pathway_group.pkl
--features-name	The features corresponding to pathways	features.npz
--bag-centroid-name	The input file name for the pathway groups centroids	centroid.npz
--rho-name	The input file name for the pathway group correlation	rho.npz
--X-name	The input file name to be provided for transformation	[DATANAME]_X.pkl
--y-name	The input file name to be provided for transformation	[DATANAME]_y.pkl
--yB-name	The input file name to be provided for transformation	[DATANAME]_B.pkl
--model-name	The name of the model, without any extension. The saved model file will have a .pkl extension	reMap
--dspath	The path to the datasets	[Outside source code]
--mdpath	The path to store model	[Outside source code]
--rspath	The path to store costs	[Outside source code]
--logpath	The path to the log directory	[Outside source code]
--batch	Batch size	30
--num-epochs	Corresponds to the number of iterations over the training set	3
--num-jobs	The number of parallel workers	2

Output:

The output files generated after running the command are:

File	Description
[MODELNAME].pkl	The trained model
[MODELNAME]_cost.txt	This file contains error values between predicted values and expected values
log file	This file contains information regarding the run such as time taken to train the model, arguments applied and the files to which the results were stored

Examples

To train reMap given pathway data (e.g. "biocyc_X.pkl" and "biocyc_y.pkl") and a pathway group dataset (e.g. "biocyc_B.pkl") obtained from the Preprocessing step, execute the following command:

python main.py --train --alpha 16 --ssample-input-size 0.05 --ssample-label-size 50 --calc-subsample-size 50 --bags-labels "pathway_group.pkl" --features-name "features.npz" --bag-centroid-name "centroid.npz" --rho-name "rho.npz" --X-name "biocyc_X.pkl" --y-name "biocyc_y.pkl" --yB-name "biocyc_B.pkl" --model-name "reMap_retrained" --batch 30 --num-epochs 3 --num-jobs 2

After running the command, the output will be saved to the model/, result/, and log/ folders. A short description of the output is given in the table above. The tree structure for the folder with the outputs will look like this:

reMap_materials/
	├── model/
        │       ├── reMap_retrained.pkl
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       ├── reMap_retrained_cost.txt
        │       └── ...
	├── log/
        │       ├── reMap_events
        │       └── ...
	└── reMap/
                └── ...
