3. Download files
leADS requires a set of object files to run its core commands, along with test samples for training and predicting pathways. The test samples can be used either to train or to test the leADS model. Please download these files from Zenodo. Once you have downloaded the leADS_materials.zip file, unzip it and make sure you obtain three folders, "objectset", "model", and "dataset", as depicted below:
leADS_materials/
├── objectset/
│ ├── biocyc.pkl
│ ├── pathway2ec.pkl
│ ├── pathway2ec_idx.pkl
│ ├── hin.pkl
│ ├── pathway2vec_embeddings.npz
│ └── ...
├── model/
│ ├── leADS.pkl
│ └── ...
└── dataset/
  ├── biocyc21_X.pkl, biocyc21_Xe.pkl, biocyc21_y.pkl, biocyc21_species.pkl, leADS_samples.pkl
  ├── three_ecoli/
  │ ├── MG1655
  │ │ └── 0.pf
  │ ├── EDL933
  │ │ └── 0.pf
  │ └── CFT073
  │   └── 0.pf
  ├── golden_X.pkl, golden_Xe.pkl, golden_y.pkl
  ├── cami_X.pkl, cami_Xe.pkl, cami_y.pkl
  ├── symbionts_X.pkl, symbionts_Xe.pkl
  ├── hots_4_X.pkl, hots_4_Xe.pkl
  ├── delicious_X.pkl, delicious_y.pkl
  ├── birds_X.pkl, birds_y.pkl
  └── ...
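The unzip-and-check step can be scripted. The following is a minimal sketch, assuming the archive unpacks into a top-level leADS_materials/ directory as in the tree above; the helper name `extract_and_check` is hypothetical and not part of leADS. The demo builds a tiny stand-in archive so the snippet runs without the real download.

```python
import tempfile
import zipfile
from pathlib import Path

# Folder names taken from the directory tree above.
EXPECTED_FOLDERS = ("objectset", "model", "dataset")


def extract_and_check(zip_path, dest_dir):
    """Unzip the archive and return the expected folders that are missing.

    Assumes (not confirmed by leADS) that the archive contains a
    top-level 'leADS_materials/' directory.
    """
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    root = dest / "leADS_materials"
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


# Demo on a small stand-in archive (not the real Zenodo download).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "leADS_materials.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        for name in EXPECTED_FOLDERS:
            archive.writestr(f"leADS_materials/{name}/placeholder.txt", "")
    missing = extract_and_check(demo_zip, Path(tmp) / "out")
    print("missing folders:", missing)
```

For the real archive, pass the path of the downloaded leADS_materials.zip and a destination directory; an empty list means all three folders were found.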
Below you will find a short description of the contents of these folders.
In the objectset/ folder, 8 core object files are provided that contain various pathway and enzyme information. These files are required for preprocessing, predicting, and training with leADS. We will use the following five object files in this wiki:
File | Description | Size |
---|---|---|
biocyc.pkl | An object containing the preprocessed MetaCyc database in the form of Pathway IDs, Enzyme Commission (EC) numbers, Reaction IDs, Gene names, and Gene IDs, etc. | 91.8MB |
pathway2ec.pkl | A matrix representing the pathway-enzyme association. It contains 2526 pathways and 3650 enzymes (represented as Enzyme Commission (EC) numbers) in rows and columns respectively. | 81.0kB |
pathway2ec_idx.pkl | A matrix of pathway2ec association indices. | 29.4kB |
hin.pkl | A sample of heterogeneous information network. | 10.5MB |
pathway2vec_embeddings.npz | A matrix file containing a sample of embeddings using RUST-norm. The rows (22593) correspond to the pathway, enzyme, and compound embeddings and the columns (128) represent the features. These features can be generated using pathway2vec. | 11.6MB |
Here, we show you a visual depiction of some of the object files to help deepen your understanding.
The biocyc.pkl file contains the preprocessed MetaCyc database. Genes, Proteins, Enzymes, Reactions, Pathways, and Compounds are all represented as dictionaries containing the individual IDs for each of the 6 categories. This file can be obtained by following the steps highlighted in prepBioCyc.
biocyc.pkl
├── [Gene]
│ ├── [XXX]
│ ├── [XXX]
├── [Pathway]
│ ├── [XXX]
│ ├── [XXX]
├── [Enzyme]
│ ├── [XXX]
│ ├── [XXX]
└── [Reaction]
├── [XXX]
└── [XXX]
├── [XXX]
├── [XXX]
├── [XXX]
└── [XXX]
The pathway2ec.pkl file contains the pathway-enzyme associations. Each value in an enzyme column is the number of times that enzyme contributes to the pathway in that row. For example, in the table below, the enzyme enoyl-[acyl-carrier-protein] reductase (EC 1.3.1.9) contributes 7 times to the palmitate biosynthesis II pathway (bacteria and plants) but does not contribute to any of the other pathways shown.
Pathway | EC-1.1.1.86 | EC-1.3.1.9 | EC-2.1.1.79 | EC-2.2.1.6 | EC-2.6.1.42 | EC-2.6.1.13 | EC-3.5.3.1 | EC-4.2.1.59 | EC-6.2.1.3 | EC-6.3.2.M5 |
---|---|---|---|---|---|---|---|---|---|---|
L-valine biosynthesis | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
L-arginine degradation VI (arginase 2 pathway) | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
cyclopropane fatty acid (CFA) biosynthesis | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
palmitate biosynthesis II (bacteria and plants) | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 7 | 2 | 0 |
jasmonoyl-amino acid conjugates biosynthesis I | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
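To make the reading of this count matrix concrete, here is a small sketch that rebuilds the excerpt above as a NumPy array and looks up the enzymes contributing to one pathway. The array is copied from the table, not loaded from pathway2ec.pkl, whose in-memory type is not described here; the helper `enzymes_for` is hypothetical.

```python
import numpy as np

# Row and column labels copied from the table excerpt above.
pathways = [
    "L-valine biosynthesis",
    "L-arginine degradation VI (arginase 2 pathway)",
    "cyclopropane fatty acid (CFA) biosynthesis",
    "palmitate biosynthesis II (bacteria and plants)",
    "jasmonoyl-amino acid conjugates biosynthesis I",
]
ecs = [
    "EC-1.1.1.86", "EC-1.3.1.9", "EC-2.1.1.79", "EC-2.2.1.6", "EC-2.6.1.42",
    "EC-2.6.1.13", "EC-3.5.3.1", "EC-4.2.1.59", "EC-6.2.1.3", "EC-6.3.2.M5",
]
# counts[i, j] = number of times enzyme j contributes to pathway i.
counts = np.array([
    [1, 0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 7, 0, 0, 0, 0, 0, 7, 2, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
])


def enzymes_for(pathway):
    """Return {EC label: count} for the enzymes with a nonzero count."""
    row = counts[pathways.index(pathway)]
    return {ecs[j]: int(row[j]) for j in np.flatnonzero(row)}


print(enzymes_for("palmitate biosynthesis II (bacteria and plants)"))
# {'EC-1.3.1.9': 7, 'EC-4.2.1.59': 7, 'EC-6.2.1.3': 2}
```

The same indexing pattern would apply to the full 2526 x 3650 matrix once it is loaded, provided it supports row indexing in this way.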
The pathway2vec_embeddings.npz file is a matrix file where rows represent the pathway and enzyme information and columns represent features. These features are generated using pathway2vec.
Pathway and EC | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
L-valine biosynthesis | 0.089106 | 0.092924 | 0.089035 | 0.101823 | 0.072792 | 0.083173 | 0.096259 | 0.064823 | 0.071481 | 0.094392 |
methylquercetin biosynthesis | 0.112329 | 0.075717 | 0.087717 | 0.094391 | 0.081035 | 0.074514 | 0.095572 | 0.072581 | 0.068458 | 0.096449 |
cyanide degradation | 0.073566 | 0.094817 | 0.087664 | 0.099661 | 0.089182 | 0.103727 | 0.093147 | 0.093047 | 0.083330 | 0.095017 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
EC-1.1.1.10 | 0.095318 | 0.094138 | 0.097567 | 0.087115 | 0.084483 | 0.098668 | 0.078173 | 0.091465 | 0.086675 | 0.086497 |
EC-1.1.1.100 | 0.047987 | 0.096748 | 0.092529 | 0.092395 | 0.116745 | 0.092556 | 0.106274 | 0.107414 | 0.079025 | 0.098948 |
EC-1.1.1.101 | 0.090137 | 0.085566 | 0.087589 | 0.089496 | 0.082936 | 0.088855 | 0.083835 | 0.091411 | 0.085721 | 0.090588 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
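A quick way to see what an .npz archive holds is to list its stored arrays and their shapes with NumPy. The sketch below assumes the file is a standard NumPy archive; the array names inside pathway2vec_embeddings.npz are not documented here, so the demo writes a small stand-in file (6 rows by 128 features) instead of reading the real one. The helper `describe_npz` is hypothetical.

```python
import os
import tempfile

import numpy as np


def describe_npz(path):
    """Return {array name: shape} for every array stored in an .npz file."""
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}


# Demo on a stand-in file; the real file is expected to hold a
# 22593 x 128 matrix according to the description above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_embeddings.npz")
    np.savez(demo_path, np.random.default_rng(0).random((6, 128)))
    shapes = describe_npz(demo_path)
    print(shapes)
```

Running `describe_npz` on the downloaded file should reveal which key holds the embedding matrix before you index into it.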
In the model/ folder, 4 pre-trained models are provided to predict metabolic pathways using the datasets described in the dataset/ section. We will be using "leADS.pkl" in this wiki:
File | Description | Size |
---|---|---|
leADS.pkl | A pre-trained model generated using the files biocyc21_Xe.pkl and biocyc21_y.pkl with nPSP (k=50), ensemble size 10, and per% = 70%. This model was trained using the class-labels (pathways) approach. | 763.7MB |
In the dataset/ folder, 25 datasets are provided to predict, train, and evaluate metabolic pathways using the pre-trained leADS model (e.g., "leADS.pkl") or to train a new model. We will use 22 datasets that are categorized into the following three types: 1) Pathway training data, 2) Pathway prediction and evaluation data, and 3) Other multi-label data.
The following five files can be used to train leADS. Biocyc version 21 tier 2 and 3 PGDBs were processed using prepBioCyc.
File | Description | Size |
---|---|---|
biocyc21_X.pkl | A file (matrix format) containing the Biocyc version 21 tier 2 and 3 Pathway/Genome Databases (PGDBs) in the form of rows and columns. The rows (9429) represent organisms and the columns (3650) represent the Enzyme Commission (EC) number indices. | 27.1MB |
biocyc21_Xe.pkl | A file (matrix format) containing the Biocyc version 21 tier 2 and 3 Pathway/Genome Databases (PGDBs) in the form of rows and columns. The rows (9429) represent organisms and the columns (3778) represent the Enzyme Commission (EC) number and embeddings. | 79.8MB |
biocyc21_y.pkl | A file (matrix format) containing the Biocyc version 21 tier 2 and 3 Pathway/Genome Databases (PGDBs) in the form of rows and columns. The rows (9429) represent organisms and the columns (2526) represent the pathway indices. | 67.6MB |
biocyc21_species.pkl | Metadata in tuple format (folder id, taxa id, species) extracted from the Biocyc version 21 tier 2 and 3 Pathway/Genome Databases (PGDBs). | 6.47MB |
leADS_samples.pkl | A file containing a list of integer indices of 4752 samples from biocyc21_Xe.pkl. This file was generated while training "leADS.pkl". | 146.7KB |
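The column counts in this table fit together: 3778 equals 3650 EC columns plus 128, the embedding width listed for pathway2vec_embeddings.npz. That reading is an inference from the numbers above, not a documented fact. The sketch below checks the shapes of a training triple against those counts and selects rows by a list of indices, as leADS_samples.pkl would allow. It uses zero-filled stand-in matrices with 20 rows rather than the real files, and `check_training_shapes` is a hypothetical helper.

```python
import numpy as np

# Column counts taken from the table above; N_EMBED is the embedding
# width listed for pathway2vec_embeddings.npz (3650 + 128 = 3778).
N_EC, N_EMBED, N_PATHWAYS = 3650, 128, 2526


def check_training_shapes(X, Xe, y):
    """Return a list of human-readable shape problems (empty if none)."""
    problems = []
    if not (X.shape[0] == Xe.shape[0] == y.shape[0]):
        problems.append("row counts differ between X, Xe, and y")
    if X.shape[1] != N_EC:
        problems.append(f"X has {X.shape[1]} columns, expected {N_EC}")
    if Xe.shape[1] != N_EC + N_EMBED:
        problems.append(f"Xe has {Xe.shape[1]} columns, expected {N_EC + N_EMBED}")
    if y.shape[1] != N_PATHWAYS:
        problems.append(f"y has {y.shape[1]} columns, expected {N_PATHWAYS}")
    return problems


# Stand-in matrices with 20 organisms (the real files have 9429 rows).
n_organisms = 20
X = np.zeros((n_organisms, N_EC))
Xe = np.zeros((n_organisms, N_EC + N_EMBED))
y = np.zeros((n_organisms, N_PATHWAYS))
problems = check_training_shapes(X, Xe, y)
print("problems:", problems)

# Hypothetical stand-in for the index list stored in leADS_samples.pkl.
sample_idx = [0, 3, 7]
Xe_subset = Xe[sample_idx]
print(Xe_subset.shape)
```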
The following table depicts the biocyc21_X.pkl file, where the rows represent organisms and the columns represent Enzyme Commission (EC) number indices.
Taxa | Species | EC-1.1.1.10 | EC-1.1.1.101 | EC-1.1.1.102 | EC-6.4.1.4 | EC-6.4.1.5 | EC-6.4.1.6 | EC-6.4.1.7 | EC-6.4.1.8 | EC-6.4.1.b | EC-6.5.1.8 | EC-6.6.1.1 | EC-6.6.1.2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TAX-887700 | Acetobacter aceti | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
TAX-1048834 | Alicyclobacillus acidocaldarius | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
TAX-521098 | Alicyclobacillus acidocaldarius | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
TAX-1035194 | Aggregatibacter actinomycetemcomitans | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
TAX-1089447 | Aggregatibacter actinomycetemcomitans | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
The following table depicts the biocyc21_y.pkl file, where the rows represent organisms and the columns represent the pathway indices.
Taxa | Species | L-valine biosynthesis | L-arginine degradation VI (arginase 2 pathway) | cyclopropane fatty acid (CFA) biosynthesis | palmitate biosynthesis II (bacteria and plants) | jasmonoyl-amino acid conjugates biosynthesis I | pyridoxal 5'-phosphate salvage I | adenosine deoxyribonucleotides de novo biosynthesis |
---|---|---|---|---|---|---|---|---|
TAX-887700 | Acetobacter aceti | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
TAX-1048834 | Alicyclobacillus acidocaldarius | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
TAX-521098 | Alicyclobacillus acidocaldarius | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
TAX-1035194 | Aggregatibacter actinomycetemcomitans | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 |
TAX-1089447 | Aggregatibacter actinomycetemcomitans | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 |
The following data can be used for pathway prediction and for evaluating the pre-trained leADS model. Please see the mlLGPR and triUMPF repositories for how to obtain and preprocess the data below.
Files | Description | Size |
---|---|---|
three_ecoli/ | This directory contains "0.pf" files for E. coli K-12 substr. MG1655 (TAX-511145), E. coli str. CFT073 (TAX-199310), and E. coli O157:H7 str. EDL933 (TAX-155864). A tutorial on how to use this data type is provided in Tutorial on pathway prediction (Example 1). | 767KB |
golden_X.pkl, golden_Xe.pkl, golden_y.pkl | This is the Golden dataset in a matrix format where rows correspond to AraCyc, EcoCyc, HumanCyc, LeishCyc, TrypanoCyc, and YeastCyc, respectively. Columns for "*X.pkl", "*Xe.pkl", and "*y.pkl" correspond to 3650 enzyme indices, 3778 enzyme indices and embeddings, and 2526 pathway indices, respectively. | 154KB |
cami_X.pkl, cami_Xe.pkl, cami_y.pkl | These files correspond to the CAMI low complexity data with the rows representing 40 species. Columns for "*X.pkl", "*Xe.pkl", and "*y.pkl" correspond to 3650 enzyme indices, 3778 enzyme indices and embeddings, and 2526 pathway indices. | 396KB |
symbionts_X.pkl, symbionts_Xe.pkl | These files correspond to the symbiont dataset with the rows representing: Moranella, Tremblaya, and a composition of both genomes. Columns for "*X.pkl" and "*Xe.pkl" correspond to 3650 enzyme indices and 3778 enzyme indices and embeddings. | 13.1KB |
hots_4_X.pkl, hots_4_Xe.pkl | These files correspond to the Hawaii Ocean Time Series (HOTS) data at 10 m (rows 0 and 1), 75 m (rows 2 and 3), 110 m (row 4), and 500 m (rows 5 and 6) ocean depth intervals. Columns for "*X.pkl" and "*Xe.pkl" correspond to 3650 enzyme indices and 3778 enzyme indices and embeddings. | 172KB |
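Because the HOTS rows are identified only by position, it can help to keep the row-to-depth mapping next to any analysis code. The sketch below encodes the mapping stated in the table; the names `HOTS_ROW_DEPTH_M` and `rows_at_depth` are hypothetical and not part of leADS.

```python
# Row index -> sampling depth in metres, as listed in the table above.
HOTS_ROW_DEPTH_M = {0: 10, 1: 10, 2: 75, 3: 75, 4: 110, 5: 500, 6: 500}


def rows_at_depth(depth_m):
    """Return the row indices of the HOTS matrices sampled at depth_m."""
    return [row for row, depth in HOTS_ROW_DEPTH_M.items() if depth == depth_m]


print(rows_at_depth(75))   # [2, 3]
print(rows_at_depth(110))  # [4]
```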
The three_ecoli data corresponds to three E. coli strains, E. coli K-12 substr. MG1655 (TAX-511145), E. coli str. CFT073 (TAX-199310), and E. coli O157:H7 str. EDL933 (TAX-155864), which have been used in several benchmarking analyses. The 0.pf file represented below is the main input file to Pathway Tools that PathoLogic uses to make pathway predictions. It contains the annotated enzymes that result from MetaPathways v2 comparing the Open Reading Frames (ORFs) to the MetaCyc database.
Below is an example of a "0.pf" file from E. coli K-12 substr. MG1655.
ID ecoli-COLI-K12_0_2
NAME ecoli-COLI-K12_0_2
STARTBASE 3734
ENDBASE 5020
PRODUCT Threonine synthase # THRESYN-RXN 4.2.3.1
PRODUCT-TYPE P
EC 4.2.3.1
//
ID ecoli-COLI-K12_0_6
NAME ecoli-COLI-K12_0_6
STARTBASE 8238
ENDBASE 9191
PRODUCT Transaldolase # TRANSALDOL-RXN 2.2.1.2
PRODUCT-TYPE P
EC 2.2.1.2
//
ID ecoli-COLI-K12_0_7
NAME ecoli-COLI-K12_0_7
STARTBASE 9306
ENDBASE 9893
PRODUCT Molybdopterin adenylyltransferase # RXN-8344 2.7.7.75
PRODUCT-TYPE P
EC 2.7.7.75
//
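The record layout above (one attribute per line, records terminated by `//`) is simple to read programmatically. The following is a minimal sketch of such a reader, written only from the excerpt shown here; it is not the parser used by leADS or Pathway Tools, and `parse_pf` is a hypothetical name. Values are stored as lists so that an attribute appearing more than once in a record is not lost.

```python
SAMPLE_PF = """\
ID ecoli-COLI-K12_0_2
NAME ecoli-COLI-K12_0_2
STARTBASE 3734
ENDBASE 5020
PRODUCT Threonine synthase # THRESYN-RXN 4.2.3.1
PRODUCT-TYPE P
EC 4.2.3.1
//
ID ecoli-COLI-K12_0_6
NAME ecoli-COLI-K12_0_6
STARTBASE 8238
ENDBASE 9191
PRODUCT Transaldolase # TRANSALDOL-RXN 2.2.1.2
PRODUCT-TYPE P
EC 2.2.1.2
//
ID ecoli-COLI-K12_0_7
NAME ecoli-COLI-K12_0_7
STARTBASE 9306
ENDBASE 9893
PRODUCT Molybdopterin adenylyltransferase # RXN-8344 2.7.7.75
PRODUCT-TYPE P
EC 2.7.7.75
//
"""


def parse_pf(text):
    """Split .pf text into records; each record maps attribute -> list of values."""
    records, current = [], {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "//":  # record terminator
            if current:
                records.append(current)
            current = {}
            continue
        # Attribute name and value are separated by whitespace (space or tab).
        parts = line.split(None, 1)
        value = parts[1] if len(parts) > 1 else ""
        current.setdefault(parts[0], []).append(value)
    if current:  # tolerate a final record without a terminator
        records.append(current)
    return records


records = parse_pf(SAMPLE_PF)
ec_numbers = [ec for record in records for ec in record.get("EC", [])]
print(ec_numbers)  # ['4.2.3.1', '2.2.1.2', '2.7.7.75']
```

To read a downloaded file instead of the inline sample, pass the contents of, for example, three_ecoli/MG1655/0.pf to `parse_pf`.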
leADS can be trained on any multi-label data (without including embeddings). Here, we include two well-known multi-label datasets obtained from Mulan.
Files | Description | Size |
---|---|---|
delicious_X.pkl, delicious_y.pkl | These files correspond to the delicious multi-label dataset. | 4.85MB |
birds_X.pkl, birds_y.pkl | These files correspond to the birds multi-label dataset. | 1.09MB |