Merge pull request nf-core#343 from Darcy220606/add_mmseqs2_taxonomy
Add mmseqs2 taxonomy
Darcy220606 authored Apr 2, 2024
2 parents f980af6 + d974f15 commit f4495a5
Showing 48 changed files with 2,274 additions and 252 deletions.
31 changes: 31 additions & 0 deletions .github/workflows/ci.yml
@@ -77,3 +77,34 @@ jobs:
- name: Run pipeline with test data (BGC workflow)
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test_bgc,docker --outdir ./results ${{ matrix.parameters }} --bgc_skip_deepbgc
test_taxonomy:
name: Run pipeline with test data (AMP, ARG and BGC taxonomy workflows)
# Only run on push if this is the nf-core dev branch (merged PRs)
if: "${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/funcscan') }}"
runs-on: ubuntu-latest
strategy:
matrix:
NXF_VER:
- "23.04.0"
- "latest-everything"
parameters:
- "--annotation_tool prodigal"
- "--annotation_tool prokka"
- "--annotation_tool bakta --annotation_bakta_db_downloadtype light"

steps:
- name: Check out pipeline code
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4

- name: Install Nextflow
uses: nf-core/setup-nextflow@v1
with:
version: "${{ matrix.NXF_VER }}"

- name: Disk space cleanup
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1

- name: Run pipeline with test data (AMP, ARG and BGC taxonomy workflows)
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test_taxonomy,docker --outdir ./results ${{ matrix.parameters }}
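The new `test_taxonomy` job crosses two Nextflow versions with three annotation-tool parameter sets. A rough local sketch of the annotation-tool axis of that matrix (commands are only printed here; remove the `echo` to actually execute them, assuming Nextflow and Docker are installed):

```shell
# Sketch: expand the annotation-tool axis of the CI matrix into the
# commands the workflow would run for each matrix cell.
for params in \
    "--annotation_tool prodigal" \
    "--annotation_tool prokka" \
    "--annotation_tool bakta --annotation_bakta_db_downloadtype light"; do
    echo "nextflow run nf-core/funcscan -profile test_taxonomy,docker --outdir ./results ${params}"
done
```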
6 changes: 3 additions & 3 deletions CHANGELOG.md
@@ -11,12 +11,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#324](https://github.com/nf-core/funcscan/pull/324) Removed separate DeepARG test profile because database download is now stable. (by @jasmezz)
- [#332](https://github.com/nf-core/funcscan/pull/332) & [#327](https://github.com/nf-core/funcscan/pull/327) Merged pipeline template of nf-core/tools version 2.12.1 (by @jfy133, @jasmezz)
- [#338](https://github.com/nf-core/funcscan/pull/338) Set `--meta` parameter to default for Bakta, with singlemode optional. (by @jasmezz)
- [#343](https://github.com/nf-core/funcscan/pull/343) Added contig taxonomic classification using [MMseqs2](https://github.com/soedinglab/MMseqs2/). (by @darcy220606)

### `Fixed`

- [#348](https://github.com/nf-core/funcscan/pull/348) Updated samplesheet for pipeline tests to 'samplesheet_reduced.csv' with smaller datasets to reduce resource consumption. Updated prodigal module to fix pigz issue. (by @darcy220606)

### `Dependencies`
- [#343](https://github.com/nf-core/funcscan/pull/343) Standardized the resulting workflow summary tables to always start with 'sample_id\tcontig_id\t..'. Reformatted the output of `hamronization/summarize` module. (by @darcy220606)
- [#348](https://github.com/nf-core/funcscan/pull/348) Updated samplesheet for pipeline tests to 'samplesheet_reduced.csv' with smaller datasets to reduce resource consumption. Updated prodigal module to fix pigz issue. Removed `tests/` from `.gitignore`. (by @darcy220606)

| Tool | Previous version | New version |
| ------------- | ---------------- | ----------- |
4 changes: 4 additions & 0 deletions CITATIONS.md
@@ -90,6 +90,10 @@

> Alcock, B. P., Huynh, W., Chalil, R., Smith, K. W., Raphenya, A. R., Wlodarski, M. A., Edalatmand, A., Petkau, A., Syed, S. A., Tsang, K. K., Baker, S. J. C., Dave, M., McCarthy, M. C., Mukiri, K. M., Nasir, J. A., Golbon, B., Imtiaz, H., Jiang, X., Kaur, K., Kwong, M., Liang, Z. C., Niu, K. C., Shan, P., Yang, J. Y. J., Gray, K. L., Hoad, G. R., Jia, B., Bhando, T., Carfrae, L. A., Farha, M. A., French, S., Gordzevich, R., Rachwalski, K., Tu, M. M., Bordeleau, E., Dooley, D., Griffiths, E., Zubyk, H. L., Brown, E. D., Maguire, F., Beiko, R. G., Hsiao, W. W. L., Brinkman F. S. L., Van Domselaar, G., McArthur, A. G. (2023). CARD 2023: expanded curation, support for machine learning, and resistome prediction at the Comprehensive Antibiotic Resistance Database. Nucleic acids research, 51(D1):D690-D699. [DOI: 10.1093/nar/gkac920](https://doi.org/10.1093/nar/gkac920)
- [MMseqs2](https://doi.org/10.1093/bioinformatics/btab184)

> Mirdita, M., Steinegger, M., Breitwieser, F., Söding, J., Levy Karin, E. (2021). Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics, 37(18), 3029–3031. [DOI: 10.1093/bioinformatics/btab184](https://doi.org/10.1093/bioinformatics/btab184)

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)
13 changes: 7 additions & 6 deletions README.md
@@ -30,12 +30,13 @@ The nf-core/funcscan AWS full test dataset are contigs generated by the MGnify s

## Pipeline summary

1. Annotation of assembled prokaryotic contigs with [`Prodigal`](https://github.com/hyattpd/Prodigal), [`Pyrodigal`](https://github.com/althonos/pyrodigal), [`Prokka`](https://github.com/tseemann/prokka), or [`Bakta`](https://github.com/oschwengers/bakta)
2. Screening contigs for antimicrobial peptide-like sequences with [`ampir`](https://cran.r-project.org/web/packages/ampir/index.html), [`Macrel`](https://github.com/BigDataBiology/macrel), [`HMMER`](http://hmmer.org/), [`AMPlify`](https://github.com/bcgsc/AMPlify)
3. Screening contigs for antibiotic resistant gene-like sequences with [`ABRicate`](https://github.com/tseemann/abricate), [`AMRFinderPlus`](https://github.com/ncbi/amr), [`fARGene`](https://github.com/fannyhb/fargene), [`RGI`](https://card.mcmaster.ca/analyze/rgi), [`DeepARG`](https://bench.cs.vt.edu/deeparg)
4. Screening contigs for biosynthetic gene cluster-like sequences with [`antiSMASH`](https://antismash.secondarymetabolites.org), [`DeepBGC`](https://github.com/Merck/deepbgc), [`GECCO`](https://gecco.embl.de/), [`HMMER`](http://hmmer.org/)
5. Creating aggregated reports for all samples across the workflows with [`AMPcombi`](https://github.com/Darcy220606/AMPcombi) for AMPs, [`hAMRonization`](https://github.com/pha4ge/hAMRonization) for ARGs, and [`comBGC`](https://raw.githubusercontent.com/nf-core/funcscan/master/bin/comBGC.py) for BGCs
6. Software version and methods text reporting with [`MultiQC`](http://multiqc.info/)
1. Taxonomic classification of contigs of **prokaryotic origin** with [`MMseqs2`](https://github.com/soedinglab/MMseqs2)
2. Annotation of assembled prokaryotic contigs with [`Prodigal`](https://github.com/hyattpd/Prodigal), [`Pyrodigal`](https://github.com/althonos/pyrodigal), [`Prokka`](https://github.com/tseemann/prokka), or [`Bakta`](https://github.com/oschwengers/bakta)
3. Screening contigs for antimicrobial peptide-like sequences with [`ampir`](https://cran.r-project.org/web/packages/ampir/index.html), [`Macrel`](https://github.com/BigDataBiology/macrel), [`HMMER`](http://hmmer.org/), [`AMPlify`](https://github.com/bcgsc/AMPlify)
4. Screening contigs for antibiotic resistant gene-like sequences with [`ABRicate`](https://github.com/tseemann/abricate), [`AMRFinderPlus`](https://github.com/ncbi/amr), [`fARGene`](https://github.com/fannyhb/fargene), [`RGI`](https://card.mcmaster.ca/analyze/rgi), [`DeepARG`](https://bench.cs.vt.edu/deeparg)
5. Screening contigs for biosynthetic gene cluster-like sequences with [`antiSMASH`](https://antismash.secondarymetabolites.org), [`DeepBGC`](https://github.com/Merck/deepbgc), [`GECCO`](https://gecco.embl.de/), [`HMMER`](http://hmmer.org/)
6. Creating aggregated reports for all samples across the workflows with [`AMPcombi`](https://github.com/Darcy220606/AMPcombi) for AMPs, [`hAMRonization`](https://github.com/pha4ge/hAMRonization) for ARGs, and [`comBGC`](https://raw.githubusercontent.com/nf-core/funcscan/master/bin/comBGC.py) for BGCs
7. Software version and methods text reporting with [`MultiQC`](http://multiqc.info/)

![funcscan metro workflow](docs/images/funcscan_metro_workflow.png)

7 changes: 7 additions & 0 deletions bin/comBGC.py
@@ -1,5 +1,8 @@
#!/usr/bin/env python3

# Written by Jasmin Frangenberg and released under the MIT license.
# See below for full license text.

from Bio import SeqIO
import pandas as pd
import argparse
@@ -643,6 +646,10 @@ def gecco_workflow(gecco_paths):
inplace=True,
)

# Rearrange and rename the columns in the summary df
summary_all = summary_all.iloc[:, [0, 2, 1] + list(range(3, len(summary_all.columns)))]
summary_all.rename(columns={'Sample_ID':'sample_id', 'Contig_ID':'contig_id', 'CDS_ID':'BGC_region_contig_ids'}, inplace=True)

# Write results to TSV
if not os.path.exists(outdir):
os.makedirs(outdir)
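The comBGC change above reorders the summary so every workflow table starts with `sample_id`, `contig_id`. A minimal pandas sketch of that reorder-and-rename (toy columns only; the real summary carries many more):

```python
import pandas as pd

# Toy frame mimicking the comBGC summary column order before the fix.
summary_all = pd.DataFrame({
    "Sample_ID": ["s1"],
    "CDS_ID": ["s1_contig_1_cds_3"],
    "Contig_ID": ["s1_contig_1"],
    "Product": ["bacteriocin"],
})
# Swap columns 1 and 2 so the contig ID comes second, keep the rest in place,
# then rename to the standardized lowercase headers.
summary_all = summary_all.iloc[:, [0, 2, 1] + list(range(3, len(summary_all.columns)))]
summary_all = summary_all.rename(columns={
    "Sample_ID": "sample_id",
    "Contig_ID": "contig_id",
    "CDS_ID": "BGC_region_contig_ids",
})
print(list(summary_all.columns))
# → ['sample_id', 'contig_id', 'BGC_region_contig_ids', 'Product']
```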
231 changes: 231 additions & 0 deletions bin/merge_taxonomy.py
@@ -0,0 +1,231 @@
#!/usr/bin/env python3

# Written by Anan Ibrahim and released under the MIT license.
# See git repository (https://github.com/Darcy220606/AMPcombi) for full license text.
# Date: March 2024
# Version: 0.1.0

# Required modules
import sys
import os
import pandas as pd
import numpy as np
import argparse

tool_version = "0.1.0"
#########################################
# TOP LEVEL: AMPCOMBI
#########################################
parser = argparse.ArgumentParser(prog = 'merge_taxonomy', formatter_class=argparse.RawDescriptionHelpFormatter,
usage='%(prog)s [options]',
description=('''\
.............................................................................
*merge_taxonomy*
.............................................................................
This script merges all three funcscan workflows with
MMseqs2 taxonomy results. This is done in three submodules that can be
activated separately.
.............................................................................'''),
epilog='''Thank you for running taxonomy_merge!''',
add_help=True)
parser.add_argument('--version', action='version', version='merge_taxonomy ' + tool_version)

#########################################
# SUBPARSERS
#########################################
subparsers = parser.add_subparsers(required=True)

#########################################
# SUBPARSER: AMPCOMBI
#########################################
ampcombi_parser = subparsers.add_parser('ampcombi_taxa')

ampcombi_parser.add_argument("--ampcombi", dest="amp", nargs='?', help="Enter the path to the ampcombi_complete_summary.tsv \n (default: %(default)s)",
type=str, default='ampcombi_complete_summary.csv')
ampcombi_parser.add_argument("--taxonomy", dest="taxa1", nargs='+', help="Enter the list of taxonomy files for all samples.")

#########################################
# SUBPARSER: COMBGC
#########################################
combgc_parser = subparsers.add_parser('combgc_taxa')

combgc_parser.add_argument("--combgc", dest="bgc", nargs='?', help="Enter the path to the combgc_complete_summary.tsv \n (default: %(default)s)",
type=str, default='combgc_complete_summary.csv')
combgc_parser.add_argument("--taxonomy", dest="taxa2", nargs='+', help="Enter the list of taxonomy files for all samples.")

#########################################
# SUBPARSER: HAMRONIZATION
#########################################
hamronization_parser = subparsers.add_parser('hamronization_taxa')

hamronization_parser.add_argument("--hamronization", dest="arg", nargs='?', help="Enter the path to the hamronization_complete_summary.tsv \n (default: %(default)s)",
type=str, default='hamronization_complete_summary.csv')
hamronization_parser.add_argument("--taxonomy", dest="taxa3", nargs='+', help="Enter the list of taxonomy files for all samples.")

#########################################
# TAXONOMY
#########################################
def reformat_mmseqs_taxonomy(mmseqs_taxonomy):
mmseqs2_df = pd.read_csv(mmseqs_taxonomy, sep='\t', header=None, names=['contig_id', 'taxid', 'rank_label', 'scientific_name', 'lineage', 'mmseqs_lineage_contig'])
# remove the lineage column
mmseqs2_df.drop('lineage', axis=1, inplace=True)
# convert any classification containing Eukaryota or root to NaN, as funcscan targets bacteria only
for i, row in mmseqs2_df.iterrows():
lineage = str(row['mmseqs_lineage_contig'])
if 'Eukaryota' in lineage or 'root' in lineage:
mmseqs2_df.at[i, 'mmseqs_lineage_contig'] = np.nan
# insert the sample name in the first column according to the file basename
file_basename = os.path.basename(mmseqs_taxonomy)
filename = os.path.splitext(file_basename)[0]
mmseqs2_df.insert(0, 'sample_id', filename)
return mmseqs2_df

#########################################
# FUNCTION: AMPCOMBI
#########################################
def ampcombi_taxa(args):
merged_df = pd.DataFrame()

# assign input args to variables
ampcombi = args.amp
taxa_list = args.taxa1

# prepare the taxonomy files
taxa_df = pd.DataFrame()
# append the dfs to the taxonomy_files_combined
for file in taxa_list: # list of taxa files ['','']
df = reformat_mmseqs_taxonomy(file)
taxa_df = pd.concat([taxa_df, df])

# filter the tool df
tool_df = pd.read_csv(ampcombi, sep=',') # current AMPcombi version is comma-separated; change with version 0.2.0
# make sure 1st and 2nd column have the same column labels
tool_df.rename(columns={tool_df.columns[0]: 'sample_id'}, inplace=True)
tool_df.rename(columns={tool_df.columns[1]: 'contig_id'}, inplace=True)
# grab the real contig id in another column copy for merging
tool_df['contig_id_merge'] = tool_df['contig_id'].str.rsplit('_', n=1).str[0]

# merge rows from taxa to ampcombi_df based on substring match in sample_id
# grab the unique sample names from the taxonomy table
samples_taxa = taxa_df['sample_id'].unique()
# for every sampleID in taxadf merge the results
for sampleID in samples_taxa:
# subset ampcombi
subset_tool = tool_df.loc[tool_df['sample_id'].str.contains(sampleID)]
# subset taxa
subset_taxa = taxa_df.loc[taxa_df['sample_id'].str.contains(sampleID)]
# merge
subset_df = pd.merge(subset_tool, subset_taxa, left_on = 'contig_id_merge', right_on='contig_id', how='left')
# cleanup the table
columnsremove = ['contig_id_merge','contig_id_y', 'sample_id_y']
subset_df.drop(columnsremove, axis=1, inplace=True)
subset_df.rename(columns={'contig_id_x': 'contig_id', 'sample_id_x':'sample_id'},inplace=True)
# append in the combined_df
merged_df = pd.concat([merged_df, subset_df], ignore_index=True) # DataFrame.append was removed in pandas 2.0

# write to file
merged_df.to_csv('ampcombi_complete_summary_taxonomy.tsv', sep='\t', index=False)

#########################################
# FUNCTION: COMBGC
#########################################
def combgc_taxa(args):
merged_df = pd.DataFrame()

# assign input args to variables
combgc = args.bgc
taxa_list = args.taxa2

# prepare the taxonomy files
taxa_df = pd.DataFrame()
# append the dfs to the taxonomy_files_combined
for file in taxa_list: # list of taxa files ['','']
df = reformat_mmseqs_taxonomy(file)
taxa_df = pd.concat([taxa_df, df])

# filter the tool df
tool_df = pd.read_csv(combgc, sep='\t')
# make sure 1st and 2nd column have the same column labels
tool_df.rename(columns={tool_df.columns[0]: 'sample_id'}, inplace=True)
tool_df.rename(columns={tool_df.columns[1]: 'contig_id'}, inplace=True)

# merge rows from taxa to combgc_df based on substring match in sample_id
# grab the unique sample names from the taxonomy table
samples_taxa = taxa_df['sample_id'].unique()
# for every sampleID in taxadf merge the results
for sampleID in samples_taxa:
# subset combgc
subset_tool = tool_df.loc[tool_df['sample_id'].str.contains(sampleID)]
# subset taxa
subset_taxa = taxa_df.loc[taxa_df['sample_id'].str.contains(sampleID)]
# merge
subset_df = pd.merge(subset_tool, subset_taxa, left_on = 'contig_id', right_on='contig_id', how='left')
# cleanup the table
columnsremove = ['sample_id_y']
subset_df.drop(columnsremove, axis=1, inplace=True)
subset_df.rename(columns={'sample_id_x':'sample_id'},inplace=True)
# append in the combined_df
merged_df = pd.concat([merged_df, subset_df], ignore_index=True) # DataFrame.append was removed in pandas 2.0

# write to file
merged_df.to_csv('combgc_complete_summary_taxonomy.tsv', sep='\t', index=False)

#########################################
# FUNCTION: HAMRONIZATION
#########################################
def hamronization_taxa(args):
merged_df = pd.DataFrame()

# assign input args to variables
hamronization = args.arg
taxa_list = args.taxa3

# prepare the taxonomy files
taxa_df = pd.DataFrame()
# append the dfs to the taxonomy_files_combined
for file in taxa_list: # list of taxa files ['','']
df = reformat_mmseqs_taxonomy(file)
taxa_df = pd.concat([taxa_df, df])

# filter the tool df
tool_df = pd.read_csv(hamronization, sep='\t')
# rename the columns
tool_df.rename(columns={'input_file_name':'sample_id', 'input_sequence_id':'contig_id'}, inplace=True)
# reorder the columns
new_order = ['sample_id', 'contig_id'] + [col for col in tool_df.columns if col not in ['sample_id', 'contig_id']]
tool_df = tool_df.reindex(columns=new_order)
# grab the real contig id in another column copy for merging
tool_df['contig_id_merge'] = tool_df['contig_id'].str.rsplit('_', n=1).str[0]

# merge rows from taxa to hamronization_df based on substring match in sample_id
# grab the unique sample names from the taxonomy table
samples_taxa = taxa_df['sample_id'].unique()
# for every sampleID in taxadf merge the results
for sampleID in samples_taxa:
# subset hamronization
subset_tool = tool_df.loc[tool_df['sample_id'].str.contains(sampleID)]
# subset taxa
subset_taxa = taxa_df.loc[taxa_df['sample_id'].str.contains(sampleID)]
# merge
subset_df = pd.merge(subset_tool, subset_taxa, left_on = 'contig_id_merge', right_on='contig_id', how='left')
# cleanup the table
columnsremove = ['contig_id_merge','contig_id_y', 'sample_id_y']
subset_df.drop(columnsremove, axis=1, inplace=True)
subset_df.rename(columns={'contig_id_x': 'contig_id', 'sample_id_x':'sample_id'},inplace=True)
# append in the combined_df
merged_df = pd.concat([merged_df, subset_df], ignore_index=True) # DataFrame.append was removed in pandas 2.0

# write to file
merged_df.to_csv('hamronization_complete_summary_taxonomy.tsv', sep='\t', index=False)

#########################################
# SUBPARSERS: DEFAULT
#########################################
ampcombi_parser.set_defaults(func=ampcombi_taxa)
combgc_parser.set_defaults(func=combgc_taxa)
hamronization_parser.set_defaults(func=hamronization_taxa)

if __name__ == '__main__':
args = parser.parse_args()
args.func(args) # call the default function
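The join at the heart of all three submodules can be sketched with toy data: tool hits carry gene-level IDs of the form `<contig>_<n>`, so the script trims the trailing gene index with `rsplit` to recover the contig ID that MMseqs2 reports, then left-merges onto the taxonomy table. A simplified sketch (hypothetical sample data; the suffix handling in the script uses explicit column renames instead):

```python
import pandas as pd

# Tool hits with gene-level IDs ("<contig>_<gene-number>").
tool_df = pd.DataFrame({
    "sample_id": ["s1", "s1"],
    "contig_id": ["contig_1_1", "contig_2_4"],
})
# MMseqs2 taxonomy, keyed by contig ID.
taxa_df = pd.DataFrame({
    "contig_id": ["contig_1", "contig_2"],
    "mmseqs_lineage_contig": ["d_Bacteria;p_Firmicutes", "d_Bacteria;p_Proteobacteria"],
})
# Strip the last "_" field to recover the contig ID, then left-merge.
tool_df["contig_id_merge"] = tool_df["contig_id"].str.rsplit("_", n=1).str[0]
merged = tool_df.merge(
    taxa_df, left_on="contig_id_merge", right_on="contig_id",
    how="left", suffixes=("", "_taxa"),
).drop(columns=["contig_id_merge", "contig_id_taxa"])
print(merged["mmseqs_lineage_contig"].tolist())
# → ['d_Bacteria;p_Firmicutes', 'd_Bacteria;p_Proteobacteria']
```

A left merge keeps every tool hit even when a contig received no taxonomic assignment, which matches how the script tolerates unclassified (NaN) lineages.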
