
4 Process and outputs detailed descriptions

Matteopaluh edited this page Jan 10, 2023 · 2 revisions

1) KEGG Modules Completeness evaluation

Starting from a previously obtained functional annotation, the input MAG/Genome is evaluated for the presence of KEGG Orthologs (KOs). The evaluation is performed in the framework of KEGG Modules, i.e. manually defined functional units composed of KOs, in order to summarize the (meta)genomic potential.
For this task, KEMET produces 2 files for each MAG/Genome of interest.
They are located in the output folders which, by default, are KEMET/report_tsv/ and KEMET/report_txt/, respectively:

- reportKMC_FASTA.tsv

This output is a tabular file with information on each KEGG Module, indicating the metabolic potential of the MAG/Genome defined with the FASTA name.

Each line includes the tab-separated info as in the following example table:

| Module_id | Module_name | Completeness | complete/total blocks | missing KOs | KOs present |
|---|---|---|---|---|---|
| M00029 | Urea cycle | 1 BLOCK MISSING | 4__5 | K01948,K14681 | K00611,K01940,K01755,K01476 |
  • The Completeness field takes one of the following values: "INCOMPLETE", "2 BLOCKS MISSING", "1 BLOCK MISSING", or "COMPLETE".

  • complete/total blocks is indicated with the format "COMPLETE__TOTAL" (with two underscores).
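
As an illustration of the format described above, the following minimal sketch reads such a report and splits the "COMPLETE__TOTAL" field and the comma-separated KO lists. The dictionary keys and the function name `parse_kmc_report` are hypothetical, chosen to mirror the example table; the header of an actual reportKMC file may differ.

```python
import csv
import io

# Keys mirror the example table above; the real header in a
# reportKMC_*.tsv file may differ, so these names are assumptions.
FIELDS = ["module_id", "module_name", "completeness",
          "blocks", "missing_kos", "present_kos"]


def parse_kmc_report(handle):
    """Yield one dict per KEGG Module row of a tab-separated report."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and a header line, if one is present
        if not row or row[0] == "Module_id":
            continue
        record = dict(zip(FIELDS, row))
        # "4__5" -> 4 complete blocks out of 5 (two underscores)
        complete, total = record.pop("blocks").split("__")
        record["complete_blocks"] = int(complete)
        record["total_blocks"] = int(total)
        # KO lists are comma-separated; an empty field gives an empty list
        for key in ("missing_kos", "present_kos"):
            record[key] = [ko for ko in record.get(key, "").split(",") if ko]
        yield record


example = ("M00029\tUrea cycle\t1 BLOCK MISSING\t4__5\t"
           "K01948,K14681\tK00611,K01940,K01755,K01476\n")
rows = list(parse_kmc_report(io.StringIO(example)))
print(rows[0]["module_id"], rows[0]["complete_blocks"], rows[0]["total_blocks"])
```

To read a real report, pass an open file handle instead of the `io.StringIO` object.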

- reportKMC_FASTA.txt

This output is a flat file reporting KEGG Module completeness for every Module, down to the block level. It indicates which sequential steps of the Module path have missing KOs.
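
The block-level evaluation can be pictured with the minimal sketch below. It is not KEMET's implementation: a Module is simplified here to a list of blocks, each holding alternative KOs (real KEGG definitions also include complexes and optional subunits), and the mapping from the number of missing blocks to the completeness label is an assumption. The block layout given for M00029 is illustrative only.

```python
def evaluate_module(blocks, found_kos):
    """Evaluate a Module given as a list of sets of alternative KOs.

    A block counts as complete when at least one of its alternative
    KOs is present (a simplification of real KEGG definitions).
    """
    found = set(found_kos)
    # 1-based positions of the sequential steps with no KO present
    missing_steps = [i for i, alternatives in enumerate(blocks, start=1)
                     if not (alternatives & found)]
    n_missing = len(missing_steps)
    # Assumed mapping from missing-block count to the report label
    if n_missing == 0:
        label = "COMPLETE"
    elif n_missing == 1:
        label = "1 BLOCK MISSING"
    elif n_missing == 2:
        label = "2 BLOCKS MISSING"
    else:
        label = "INCOMPLETE"
    all_kos = set().union(*blocks)
    return {
        "completeness": label,
        "blocks": f"{len(blocks) - n_missing}__{len(blocks)}",
        "missing_steps": missing_steps,
        "missing_kos": sorted(all_kos - found),
        "present_kos": sorted(all_kos & found),
    }


# Illustrative block layout for the urea cycle Module (M00029)
urea_cycle = [{"K01948"}, {"K00611"}, {"K01940"},
              {"K01755", "K14681"}, {"K01476"}]
result = evaluate_module(urea_cycle, ["K00611", "K01940", "K01755", "K01476"])
print(result["completeness"], result["blocks"], result["missing_steps"])
```

With this input only the first step has no KO present, while K14681 is reported as missing even though its block is satisfied by the alternative K01755.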


2) HMM-based ortholog search

KEMET performs a bulk download of nucleotide sequences from KEGG GENES using the KEGG API. For license terms see this site. The API service is available free of charge to academic users only. Users who prefer a different download option are encouraged to request a KEGG FTP subscription.
Downloaded GENES sequences are filtered (each unique sequence is considered only once for HMM building) and aligned using the MAFFT multiple sequence aligner, and a profile is created using the HMMER suite. The resulting nucleotide profiles are then searched against the MAG/Genome of interest.
By default, a score threshold is imposed in order to enrich for complete profiles and exclude hits resulting from partial sequences.
Only hits with a score that surpasses the threshold are considered proper hits, indicating the presence of the KO(s) of interest in the MAG/Genome sequences.
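
Two of the steps above, keeping each unique sequence once and discarding hits below the threshold, can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names, the `score` key and the threshold value of 0.4 are hypothetical and are not taken from KEMET.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def unique_sequences(records):
    """Keep the first record of each distinct sequence (case-insensitive)."""
    seen, unique = set(), []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


def filter_hits(hits, threshold):
    """Keep only hits whose score is strictly above the threshold."""
    return [hit for hit in hits if hit["score"] > threshold]


genes = read_fasta(">a\nATGC\n>b\natgc\n>c\nATGG\n")
kept = unique_sequences(genes)           # record "b" duplicates "a"
hits = [{"ko": "K00611", "score": 0.9}, {"ko": "K01948", "score": 0.2}]
proper = filter_hits(hits, threshold=0.4)  # hypothetical threshold value
print([h for h, _ in kept], [hit["ko"] for hit in proper])
```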

Information regarding HMM hits is included in the output files:

- FASTA_HMM_hits.txt

A tabular file including the HMM hits of a single MAG/Genome, defined with the FASTA name. It contains information on the hits in the form:

KO corr_score, e-value contig_name strand genome_left_bound genome_right_bound profile_lenght begin_of_HMMsequence_hit end_of_HMMsequence_hit
  • corr_score is a metric that describes HMM profile scoring, corrected for the sequence length of that profile.

- file_recap_DATE.tsv

After a single KEMET run, a tabular summary file is generated. It gathers the information from every "_HMM_hits" file into a single table.

Moreover, the file includes further fields:

frame seq xseq
  • frame indicates the most likely translated reading-frame.

  • seq is the nucleotide sequence as retrieved from the MAG/Genome.

  • xseq is the amino acid sequence translated from the HMM hit sequence (seq) using the generic Bacterial/Archaeal translation table (t11).
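
The relation between frame, seq and xseq can be illustrated with the self-contained sketch below. The codon-to-amino-acid mapping is that of NCBI translation table 11. How KEMET selects the most likely frame is not described here, so the rule used in `best_frame` (the forward frame with the fewest internal stop codons) is an assumption, as are the function names.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of NCBI translation table 11, in TCAG codon order
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq, frame=0):
    """Translate a nucleotide sequence starting at offset `frame` (0, 1 or 2).

    Codons with ambiguous bases are rendered as "X"; stops as "*".
    """
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def best_frame(seq):
    """Assumed heuristic: forward frame with the fewest internal stops."""
    def internal_stops(frame):
        return translate(seq, frame).rstrip("*").count("*")
    return min(range(3), key=internal_stops)


seq = "GATGCTAACTGAATAA"
frame = best_frame(seq)
xseq = translate(seq, frame)
print(frame, xseq)
```

Only the three forward frames are considered in this sketch; a hit on the reverse strand would first need to be reverse-complemented.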


3) Genome-scale metabolic model gapfilling

The script connects the missing KOs recovered via HMM hits to reactions in the BiGG namespace (the ModelSEED namespace will be added in a future release).

Based on the --gsmm_mode parameter it operates in two different ways:

--gsmm_mode denovo performs automatic gene calling on the MAG/Genome sequences using Prodigal, and automatically adds the hits retrieved with HMMs to the protein multiFASTA (.faa) files.

After that, KEMET performs a CarveMe reconstruction including these newly found sequences.
NOTE This usage is subject to the CarveMe dependencies, including the IBM CPLEX Optimizer. More on the dependencies can be read in the CarveMe installation procedure here.

Using this mode, the newly generated gene prediction and GSMM are included in the KEMET/de_novo_models folder.
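
The order of operations in the denovo mode can be outlined with the rough sketch below. It only builds the external commands as argument lists and does not run them; the options shown (`prodigal -i/-a` and `carve -o`) are basic options of those tools and not necessarily the ones KEMET passes. All file names, identifiers and function names are hypothetical.

```python
def denovo_commands(genome_fasta, proteins_faa, model_xml):
    """Return the Prodigal and CarveMe commands as argument lists.

    The commands are not executed here; they could be passed to
    subprocess.run() on a system where both tools are installed.
    """
    prodigal = ["prodigal", "-i", genome_fasta, "-a", proteins_faa]
    carve = ["carve", proteins_faa, "-o", model_xml]
    return prodigal, carve


def append_hits(faa_text, hits):
    """Append (identifier, protein_sequence) pairs as FASTA records."""
    if faa_text and not faa_text.endswith("\n"):
        faa_text += "\n"
    return faa_text + "".join(f">{name}\n{seq}\n" for name, seq in hits)


# Hypothetical file names, for illustration only
prodigal_cmd, carve_cmd = denovo_commands("bin1.fna", "bin1.faa", "bin1.xml")

# Between the two commands, translated HMM hits are added to the .faa content
predicted = ">gene_1\nMKV*"
augmented = append_hits(predicted, [("K01948_hit", "MLTE*")])
print(prodigal_cmd, carve_cmd)
print(augmented)
```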

--gsmm_mode existing allows the identified reactions to be incorporated into existing genome-scale metabolic models (GSMMs) previously generated with CarveMe, if those reactions are missing from them.

At the time of manuscript publication (March 2022), the only tested way to add reactions to pre-existing GSMMs is via the ReFramed package. Further improvements would permit adding them through the cobrapy platform.

Information regarding reaction gapfilling (if performed using the --gsmm_mode existing parameter) is included in several output files:

- bigg_log_FASTA.txt

A flat file listing every BiGG reaction that could potentially be added to the input model, defined with the FASTA name. The BiGG reactions are listed one per line.

- FASTA_added_reactions

A flat file with the reactions that were actually added to a given MAG/Genome-derived GSMM, defined with the FASTA name. Reaction names are listed one per line, followed by the respective reaction string.
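
The difference between the candidate reactions and the reactions actually added can be thought of as a set difference against the reactions already in the model. The sketch below illustrates that idea under the assumption that candidate and model reaction identifiers are directly comparable; the function name and the identifiers are illustrative, and the actual insertion into the model (done by KEMET via ReFramed) is not shown.

```python
def reactions_to_add(candidates, model_reactions):
    """Return candidate reactions absent from the model, in input order.

    Duplicated candidates are reported only once.
    """
    present = set(model_reactions)
    seen, to_add = set(), []
    for reaction in candidates:
        if reaction not in present and reaction not in seen:
            seen.add(reaction)
            to_add.append(reaction)
    return to_add


# Illustrative identifiers: candidates as they might be listed one per line,
# and the reactions already present in an existing model
candidates = ["PGI", "ARGSL", "PFK", "ARGSL"]
model_reactions = ["PGI", "PFK", "FBA"]
added = reactions_to_add(candidates, model_reactions)
print(added)
```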

- gapfilled model

Individual GSMMs are saved again after the gapfilling procedure with new reactions and metabolites content as FASTA_KEGGadd_DATE.xml, where FASTA follows the input definition and DATE includes the day of analysis. Files generated this way are stored in the KEMET/model_gapfilled/ folder.