4. Tutorial on transformation process
reMap generates a pathway group dataset to improve the sensitivity of pathway prediction for both organismal and multi-organismal genomes. This tutorial walks you through the basic steps of the transformation process using either your own input data or the test data we provide. Once the input (in the format specified below), the trained model, and the other required files are provided, reMap generates a pathway group dataset that can be used for pathway prediction with leADS.
Note: Make sure to put the source code `reMap/` (see Installing reMap) in the same directory as explained in the Download files section. The final structure of the folder should look like this:
reMap_materials/
├── model/
│   └── ...
├── dataset/
│   └── ...
└── reMap/
    └── ...
For all experiments, open a terminal (on Linux and macOS) or an Anaconda command prompt (on Windows), navigate to the `src/` folder in the `reMap/` directory, and run the commands shown in the Examples section.
To display reMap's running options, use: `python main.py --help`. The help text should be self-contained.
The inputs are the files generated by following the steps under Advanced usage. Alternatively, you can use the files we provide. The required files are:
- pathway_group.pkl
- features.npz
- centroid.npz
- rho.npz
- reMap.pkl
- [DATANAME]_X.pkl
- [DATANAME]_y.pkl
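Before running the transformation, it can help to confirm that every required file is in place. The following is a minimal sketch and not part of reMap: the helper name `missing_inputs` is hypothetical, and it assumes that `reMap.pkl` sits in the model directory while all other files sit in the dataset directory.

```python
from pathlib import Path

# Assumed placement: the model file lives under model/, everything else under dataset/.
DATASET_FILES = ["pathway_group.pkl", "features.npz", "centroid.npz", "rho.npz"]
MODEL_FILES = ["reMap.pkl"]


def missing_inputs(dspath, mdpath, dataname):
    """Return the paths of required input files that do not exist (hypothetical helper)."""
    dataset_files = DATASET_FILES + [f"{dataname}_X.pkl", f"{dataname}_y.pkl"]
    expected = [Path(dspath) / name for name in dataset_files]
    expected += [Path(mdpath) / name for name in MODEL_FILES]
    return [str(path) for path in expected if not path.is_file()]
```

For example, `missing_inputs("reMap_materials/dataset", "reMap_materials/model", "golden")` returns an empty list once all seven files are present.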
The basic command is shown below. Do not run it as is: it only illustrates all the flags used. See the Examples below for how to carry out the transformation.
python main.py \
--transform \
--ssample-label-size 50 \
--bags-labels "pathway_group.pkl" \
--features-name "features.npz" \
--bag-centroid-name "centroid.npz" \
--rho-name "rho.npz" \
--X-name "[DATANAME]_X.pkl" \
--y-name "[DATANAME]_y.pkl" \
--file-name "[FILENAME]" \
--model-name "reMap" \
--dspath "[absolute path to the dataset directory (e.g. dataset)]" \
--mdpath "[absolute path to the model directory (e.g. model)]" \
--batch 30 \
--num-jobs 2
The table below summarizes all the command-line arguments that are specific to this framework:
Argument name | Description | Value |
---|---|---|
--transform | Transform pathway data to pathway group data using reMap | False |
--ssample-label-size | Maximum number of pathways to be sampled | 50 |
--bags-labels | The input file name for the pathway groups, i.e. the pathways associated with each group | pathway_group.pkl |
--features-name | The input file name for the features corresponding to pathways | features.npz |
--bag-centroid-name | The input file name for the pathway groups centroids | centroid.npz |
--rho-name | The input file name for the pathway group correlation | rho.npz |
--X-name | The input file name to be provided for transformation | [DATANAME]_X.pkl |
--y-name | The input file name to be provided for transformation | [DATANAME]_y.pkl |
--file-name | The name of the input file (without extension) | [FILENAME] |
--model-name | The name of the model excluding any **EXTENSION** | reMap |
--dspath | The path to the datasets | [Outside source code] |
--mdpath | The path to the pre-trained model (e.g. reMap.pkl) | [Outside source code] |
--batch | Batch size | 30 |
--num-jobs | The number of parallel workers | 2 |
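When the transformation is launched from a script, the flags in the table above can be assembled programmatically. The sketch below is only an illustration under stated assumptions: `build_transform_command` is a hypothetical helper, the `/path/to/...` values are placeholders, and the flag names and default values are copied from the table.

```python
import shlex


def build_transform_command(dataname, file_name, dspath, mdpath,
                            sample_size=50, batch=30, num_jobs=2):
    """Assemble the argument list for a transformation run (hypothetical helper)."""
    return [
        "python", "main.py",
        "--transform",
        "--ssample-label-size", str(sample_size),
        "--bags-labels", "pathway_group.pkl",
        "--features-name", "features.npz",
        "--bag-centroid-name", "centroid.npz",
        "--rho-name", "rho.npz",
        "--X-name", f"{dataname}_X.pkl",
        "--y-name", f"{dataname}_y.pkl",
        "--file-name", file_name,
        "--model-name", "reMap",
        "--dspath", str(dspath),
        "--mdpath", str(mdpath),
        "--batch", str(batch),
        "--num-jobs", str(num_jobs),
    ]


# Placeholder paths: replace with the absolute paths on your machine.
command = build_transform_command(
    "golden", "biocyc_golden",
    "/path/to/reMap_materials/dataset", "/path/to/reMap_materials/model")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, cwd=...)` with `cwd` pointing at the `src/` folder, which avoids shell quoting and line-continuation mistakes.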
The output files generated after running the command are:
File | Description |
---|---|
[FILENAME]_B.pkl | A +1/-1 matrix file (stored in the "dspath" location) indicating the presence/absence of group indices for each organism (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate pathway group indices. |
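To make the +1/-1 encoding concrete, the sketch below uses a small hand-written matrix rather than a real `[FILENAME]_B.pkl` file. The values and the helper `present_groups` are made up for illustration. The object type stored in the real output file is not specified here (it may be a sparse matrix), so the loading step is left out and the logic may need adapting.

```python
def present_groups(matrix):
    """For each row, list the column indices whose entry is +1 (group present)."""
    return [[j for j, value in enumerate(row) if value == 1] for row in matrix]


# Toy data: 2 organisms (rows) x 4 pathway groups (columns).
toy_B = [
    [1, -1, -1, 1],
    [-1, 1, -1, -1],
]
print(present_groups(toy_B))
```

For this toy matrix the first organism has groups 0 and 3, and the second has group 1 only.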
To transform a dataset (e.g. "golden_X.pkl") into pathway group data using a pre-trained model ("reMap.pkl"), execute the following command:
python main.py --transform --ssample-label-size 50 --bags-labels "pathway_group.pkl" --features-name "features.npz" --bag-centroid-name "centroid.npz" --rho-name "rho.npz" --X-name "golden_X.pkl" --y-name "golden_y.pkl" --file-name "biocyc_golden" --model-name "reMap" --batch 30 --num-jobs 2
Upon executing this command, the file "biocyc_golden_B.pkl" will be produced in the `dataset/` folder. The tree structure of the folder with the outputs will look like this:
reMap_materials/
├── model/
│   └── ...
├── dataset/
│   ├── biocyc_golden_B.pkl
│   └── ...
└── reMap/
    └── ...
Having obtained the transformed data `biocyc_golden_B.pkl`, you may perform pathway prediction and training using leADS.