Merge pull request nf-core#343 from Darcy220606/add_mmseqs2_taxonomy
Add mmseqs2 taxonomy
Darcy220606 authored Apr 2, 2024
2 parents f980af6 + d974f15 commit f4495a5
Showing 48 changed files with 2,274 additions and 252 deletions.
31 changes: 31 additions & 0 deletions .github/workflows/ci.yml
@@ -77,3 +77,34 @@ jobs:
- name: Run pipeline with test data (BGC workflow)
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test_bgc,docker --outdir ./results ${{ matrix.parameters }} --bgc_skip_deepbgc
test_taxonomy:
name: Run pipeline with test data (AMP, ARG and BGC taxonomy workflows)
# Only run on push if this is the nf-core dev branch (merged PRs)
if: "${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/funcscan') }}"
runs-on: ubuntu-latest
strategy:
matrix:
NXF_VER:
- "23.04.0"
- "latest-everything"
parameters:
- "--annotation_tool prodigal"
- "--annotation_tool prokka"
- "--annotation_tool bakta --annotation_bakta_db_downloadtype light"

steps:
- name: Check out pipeline code
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4

- name: Install Nextflow
uses: nf-core/setup-nextflow@v1
with:
version: "${{ matrix.NXF_VER }}"

- name: Disk space cleanup
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1

- name: Run pipeline with test data (AMP, ARG and BGC taxonomy workflows)
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test_taxonomy,docker --outdir ./results ${{ matrix.parameters }}
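The new `test_taxonomy` job crosses two Nextflow versions with three annotation-tool parameter sets. A rough local sketch of the annotation-tool axis of that matrix (commands are only printed here; remove the `echo` to actually execute them, assuming Nextflow and Docker are installed):

```shell
# Sketch: expand the annotation-tool axis of the CI matrix into the
# commands the workflow would run for each matrix cell.
for params in \
    "--annotation_tool prodigal" \
    "--annotation_tool prokka" \
    "--annotation_tool bakta --annotation_bakta_db_downloadtype light"; do
    echo "nextflow run nf-core/funcscan -profile test_taxonomy,docker --outdir ./results ${params}"
done
```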
6 changes: 3 additions & 3 deletions CHANGELOG.md
@@ -11,12 +11,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#324](https://github.com/nf-core/funcscan/pull/324) Removed separate DeepARG test profile because database download is now stable. (by @jasmezz)
- [#332](https://github.com/nf-core/funcscan/pull/332) & [#327](https://github.com/nf-core/funcscan/pull/327) Merged pipeline template of nf-core/tools version 2.12.1 (by @jfy133, @jasmezz)
- [#338](https://github.com/nf-core/funcscan/pull/338) Set `--meta` parameter to default for Bakta, with singlemode optional. (by @jasmezz)
- [#343](https://github.com/nf-core/funcscan/pull/343) Added contig taxonomic classification using [MMseqs2](https://github.com/soedinglab/MMseqs2/). (by @darcy220606)

### `Fixed`

- [#348](https://github.com/nf-core/funcscan/pull/348) Updated samplesheet for pipeline tests to 'samplesheet_reduced.csv' with smaller datasets to reduce resource consumption. Updated prodigal module to fix pigz issue. (by @darcy220606)

### `Dependencies`
- [#343](https://github.com/nf-core/funcscan/pull/343) Standardized the resulting workflow summary tables to always start with 'sample_id\tcontig_id\t..'. Reformatted the output of `hamronization/summarize` module. (by @darcy220606)
- [#348](https://github.com/nf-core/funcscan/pull/348) Updated samplesheet for pipeline tests to 'samplesheet_reduced.csv' with smaller datasets to reduce resource consumption. Updated prodigal module to fix pigz issue. Removed `tests/` from `.gitignore`. (by @darcy220606)

| Tool | Previous version | New version |
| ------------- | ---------------- | ----------- |
4 changes: 4 additions & 0 deletions CITATIONS.md
@@ -90,6 +90,10 @@

> Alcock, B. P., Huynh, W., Chalil, R., Smith, K. W., Raphenya, A. R., Wlodarski, M. A., Edalatmand, A., Petkau, A., Syed, S. A., Tsang, K. K., Baker, S. J. C., Dave, M., McCarthy, M. C., Mukiri, K. M., Nasir, J. A., Golbon, B., Imtiaz, H., Jiang, X., Kaur, K., Kwong, M., Liang, Z. C., Niu, K. C., Shan, P., Yang, J. Y. J., Gray, K. L., Hoad, G. R., Jia, B., Bhando, T., Carfrae, L. A., Farha, M. A., French, S., Gordzevich, R., Rachwalski, K., Tu, M. M., Bordeleau, E., Dooley, D., Griffiths, E., Zubyk, H. L., Brown, E. D., Maguire, F., Beiko, R. G., Hsiao, W. W. L., Brinkman F. S. L., Van Domselaar, G., McArthur, A. G. (2023). CARD 2023: expanded curation, support for machine learning, and resistome prediction at the Comprehensive Antibiotic Resistance Database. Nucleic acids research, 51(D1):D690-D699. [DOI: 10.1093/nar/gkac920](https://doi.org/10.1093/nar/gkac920)
- [MMseqs2](https://doi.org/10.1093/bioinformatics/btab184)

> Mirdita, M., Steinegger, M., Breitwieser, F., Söding, J., Levy Karin, E. (2021). Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics, 37(18), 3029–3031. [DOI: 10.1093/bioinformatics/btab184](https://doi.org/10.1093/bioinformatics/btab184)

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)
13 changes: 7 additions & 6 deletions README.md
@@ -30,12 +30,13 @@ The nf-core/funcscan AWS full test dataset are contigs generated by the MGnify s

## Pipeline summary

1. Annotation of assembled prokaryotic contigs with [`Prodigal`](https://github.com/hyattpd/Prodigal), [`Pyrodigal`](https://github.com/althonos/pyrodigal), [`Prokka`](https://github.com/tseemann/prokka), or [`Bakta`](https://github.com/oschwengers/bakta)
2. Screening contigs for antimicrobial peptide-like sequences with [`ampir`](https://cran.r-project.org/web/packages/ampir/index.html), [`Macrel`](https://github.com/BigDataBiology/macrel), [`HMMER`](http://hmmer.org/), [`AMPlify`](https://github.com/bcgsc/AMPlify)
3. Screening contigs for antibiotic resistant gene-like sequences with [`ABRicate`](https://github.com/tseemann/abricate), [`AMRFinderPlus`](https://github.com/ncbi/amr), [`fARGene`](https://github.com/fannyhb/fargene), [`RGI`](https://card.mcmaster.ca/analyze/rgi), [`DeepARG`](https://bench.cs.vt.edu/deeparg)
4. Screening contigs for biosynthetic gene cluster-like sequences with [`antiSMASH`](https://antismash.secondarymetabolites.org), [`DeepBGC`](https://github.com/Merck/deepbgc), [`GECCO`](https://gecco.embl.de/), [`HMMER`](http://hmmer.org/)
5. Creating aggregated reports for all samples across the workflows with [`AMPcombi`](https://github.com/Darcy220606/AMPcombi) for AMPs, [`hAMRonization`](https://github.com/pha4ge/hAMRonization) for ARGs, and [`comBGC`](https://raw.githubusercontent.com/nf-core/funcscan/master/bin/comBGC.py) for BGCs
6. Software version and methods text reporting with [`MultiQC`](http://multiqc.info/)
1. Taxonomic classification of contigs of **prokaryotic origin** with [`MMseqs2`](https://github.com/soedinglab/MMseqs2)
2. Annotation of assembled prokaryotic contigs with [`Prodigal`](https://github.com/hyattpd/Prodigal), [`Pyrodigal`](https://github.com/althonos/pyrodigal), [`Prokka`](https://github.com/tseemann/prokka), or [`Bakta`](https://github.com/oschwengers/bakta)
3. Screening contigs for antimicrobial peptide-like sequences with [`ampir`](https://cran.r-project.org/web/packages/ampir/index.html), [`Macrel`](https://github.com/BigDataBiology/macrel), [`HMMER`](http://hmmer.org/), [`AMPlify`](https://github.com/bcgsc/AMPlify)
4. Screening contigs for antibiotic resistant gene-like sequences with [`ABRicate`](https://github.com/tseemann/abricate), [`AMRFinderPlus`](https://github.com/ncbi/amr), [`fARGene`](https://github.com/fannyhb/fargene), [`RGI`](https://card.mcmaster.ca/analyze/rgi), [`DeepARG`](https://bench.cs.vt.edu/deeparg)
5. Screening contigs for biosynthetic gene cluster-like sequences with [`antiSMASH`](https://antismash.secondarymetabolites.org), [`DeepBGC`](https://github.com/Merck/deepbgc), [`GECCO`](https://gecco.embl.de/), [`HMMER`](http://hmmer.org/)
6. Creating aggregated reports for all samples across the workflows with [`AMPcombi`](https://github.com/Darcy220606/AMPcombi) for AMPs, [`hAMRonization`](https://github.com/pha4ge/hAMRonization) for ARGs, and [`comBGC`](https://raw.githubusercontent.com/nf-core/funcscan/master/bin/comBGC.py) for BGCs
7. Software version and methods text reporting with [`MultiQC`](http://multiqc.info/)

![funcscan metro workflow](docs/images/funcscan_metro_workflow.png)

7 changes: 7 additions & 0 deletions bin/comBGC.py
@@ -1,5 +1,8 @@
#!/usr/bin/env python3

# Written by Jasmin Frangenberg and released under the MIT license.
# See below for full license text.

from Bio import SeqIO
import pandas as pd
import argparse
@@ -643,6 +646,10 @@ def gecco_workflow(gecco_paths):
inplace=True,
)

# Rearrange and rename the columns in the summary df
summary_all = summary_all.iloc[:, [0, 2, 1] + list(range(3, len(summary_all.columns)))]
summary_all.rename(columns={'Sample_ID':'sample_id', 'Contig_ID':'contig_id', 'CDS_ID':'BGC_region_contig_ids'}, inplace=True)

# Write results to TSV
if not os.path.exists(outdir):
os.makedirs(outdir)
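The comBGC change above reorders the summary so every workflow table starts with `sample_id`, `contig_id`. A minimal pandas sketch of that reorder-and-rename (toy columns only; the real summary carries many more):

```python
import pandas as pd

# Toy frame mimicking the comBGC summary column order before the fix.
summary_all = pd.DataFrame({
    "Sample_ID": ["s1"],
    "CDS_ID": ["s1_contig_1_cds_3"],
    "Contig_ID": ["s1_contig_1"],
    "Product": ["bacteriocin"],
})
# Swap columns 1 and 2 so the contig ID comes second, keep the rest in place,
# then rename to the standardized lowercase headers.
summary_all = summary_all.iloc[:, [0, 2, 1] + list(range(3, len(summary_all.columns)))]
summary_all = summary_all.rename(columns={
    "Sample_ID": "sample_id",
    "Contig_ID": "contig_id",
    "CDS_ID": "BGC_region_contig_ids",
})
print(list(summary_all.columns))
# → ['sample_id', 'contig_id', 'BGC_region_contig_ids', 'Product']
```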
231 changes: 231 additions & 0 deletions bin/merge_taxonomy.py
@@ -0,0 +1,231 @@
#!/usr/bin/env python3

# Written by Anan Ibrahim and released under the MIT license.
# See git repository (https://github.com/Darcy220606/AMPcombi) for full license text.
# Date: March 2024
# Version: 0.1.0

# Required modules
import sys
import os
import pandas as pd
import numpy as np
import argparse

tool_version = "0.1.0"
#########################################
# TOP LEVEL: AMPCOMBI
#########################################
parser = argparse.ArgumentParser(prog = 'merge_taxonomy', formatter_class=argparse.RawDescriptionHelpFormatter,
usage='%(prog)s [options]',
description=('''\
.............................................................................
*merge_taxonomy*
.............................................................................
This script merges all three funcscan workflows with
MMseqs2 taxonomy results. This is done in three submodules that can be
activated separately.
.............................................................................'''),
epilog='''Thank you for running taxonomy_merge!''',
add_help=True)
parser.add_argument('--version', action='version', version='merge_taxonomy ' + tool_version)

#########################################
# SUBPARSERS
#########################################
subparsers = parser.add_subparsers(required=True)

#########################################
# SUBPARSER: AMPCOMBI
#########################################
ampcombi_parser = subparsers.add_parser('ampcombi_taxa')

ampcombi_parser.add_argument("--ampcombi", dest="amp", nargs='?', help="Enter the path to the ampcombi_complete_summary.tsv \n (default: %(default)s)",
type=str, default='ampcombi_complete_summary.csv')
ampcombi_parser.add_argument("--taxonomy", dest="taxa1", nargs='+', help="Enter the list of taxonomy files for all samples.")

#########################################
# SUBPARSER: COMBGC
#########################################
combgc_parser = subparsers.add_parser('combgc_taxa')

combgc_parser.add_argument("--combgc", dest="bgc", nargs='?', help="Enter the path to the combgc_complete_summary.tsv \n (default: %(default)s)",
type=str, default='combgc_complete_summary.csv')
combgc_parser.add_argument("--taxonomy", dest="taxa2", nargs='+', help="Enter the list of taxonomy files for all samples.")

#########################################
# SUBPARSER: HAMRONIZATION
#########################################
hamronization_parser = subparsers.add_parser('hamronization_taxa')

hamronization_parser.add_argument("--hamronization", dest="arg", nargs='?', help="Enter the path to the hamronization_complete_summary.tsv \n (default: %(default)s)",
type=str, default='hamronization_complete_summary.csv')
hamronization_parser.add_argument("--taxonomy", dest="taxa3", nargs='+', help="Enter the list of taxonomy files for all samples.")

#########################################
# TAXONOMY
#########################################
def reformat_mmseqs_taxonomy(mmseqs_taxonomy):
mmseqs2_df = pd.read_csv(mmseqs_taxonomy, sep='\t', header=None, names=['contig_id', 'taxid', 'rank_label', 'scientific_name', 'lineage', 'mmseqs_lineage_contig'])
# remove the lineage column
mmseqs2_df.drop('lineage', axis=1, inplace=True)
# convert any classification containing Eukaryota or root to NaN, as funcscan targets bacteria only
for i, row in mmseqs2_df.iterrows():
lineage = str(row['mmseqs_lineage_contig'])
if 'Eukaryota' in lineage or 'root' in lineage:
mmseqs2_df.at[i, 'mmseqs_lineage_contig'] = np.nan
# insert the sample name in the first column according to the file basename
file_basename = os.path.basename(mmseqs_taxonomy)
filename = os.path.splitext(file_basename)[0]
mmseqs2_df.insert(0, 'sample_id', filename)
return mmseqs2_df

#########################################
# FUNCTION: AMPCOMBI
#########################################
def ampcombi_taxa(args):
merged_df = pd.DataFrame()

# assign input args to variables
ampcombi = args.amp
taxa_list = args.taxa1

# prepare the taxonomy files
taxa_df = pd.DataFrame()
# append the dfs to the taxonomy_files_combined
for file in taxa_list: # list of taxa files ['','']
df = reformat_mmseqs_taxonomy(file)
taxa_df = pd.concat([taxa_df, df])

# filter the tool df
tool_df = pd.read_csv(ampcombi, sep=',') # current AMPcombi version is comma-separated; change with version 0.2.0
# make sure 1st and 2nd column have the same column labels
tool_df.rename(columns={tool_df.columns[0]: 'sample_id'}, inplace=True)
tool_df.rename(columns={tool_df.columns[1]: 'contig_id'}, inplace=True)
# grab the real contig id in another column copy for merging
tool_df['contig_id_merge'] = tool_df['contig_id'].str.rsplit('_', n=1).str[0]

# merge rows from taxa to ampcombi_df based on substring match in sample_id
# grab the unique sample names from the taxonomy table
samples_taxa = taxa_df['sample_id'].unique()
# for every sampleID in taxadf merge the results
for sampleID in samples_taxa:
# subset ampcombi
subset_tool = tool_df.loc[tool_df['sample_id'].str.contains(sampleID)]
# subset taxa
subset_taxa = taxa_df.loc[taxa_df['sample_id'].str.contains(sampleID)]
# merge
subset_df = pd.merge(subset_tool, subset_taxa, left_on = 'contig_id_merge', right_on='contig_id', how='left')
# cleanup the table
columnsremove = ['contig_id_merge','contig_id_y', 'sample_id_y']
subset_df.drop(columnsremove, axis=1, inplace=True)
subset_df.rename(columns={'contig_id_x': 'contig_id', 'sample_id_x':'sample_id'},inplace=True)
# append in the combined_df
merged_df = pd.concat([merged_df, subset_df], ignore_index=True) # DataFrame.append was removed in pandas 2.0

# write to file
merged_df.to_csv('ampcombi_complete_summary_taxonomy.tsv', sep='\t', index=False)

#########################################
# FUNCTION: COMBGC
#########################################
def combgc_taxa(args):
merged_df = pd.DataFrame()

# assign input args to variables
combgc = args.bgc
taxa_list = args.taxa2

# prepare the taxonomy files
taxa_df = pd.DataFrame()
# append the dfs to the taxonomy_files_combined
for file in taxa_list: # list of taxa files ['','']
df = reformat_mmseqs_taxonomy(file)
taxa_df = pd.concat([taxa_df, df])

# filter the tool df
tool_df = pd.read_csv(combgc, sep='\t')
# make sure 1st and 2nd column have the same column labels
tool_df.rename(columns={tool_df.columns[0]: 'sample_id'}, inplace=True)
tool_df.rename(columns={tool_df.columns[1]: 'contig_id'}, inplace=True)

# merge rows from taxa to combgc_df based on substring match in sample_id
# grab the unique sample names from the taxonomy table
samples_taxa = taxa_df['sample_id'].unique()
# for every sampleID in taxadf merge the results
for sampleID in samples_taxa:
# subset combgc
subset_tool = tool_df.loc[tool_df['sample_id'].str.contains(sampleID)]
# subset taxa
subset_taxa = taxa_df.loc[taxa_df['sample_id'].str.contains(sampleID)]
# merge
subset_df = pd.merge(subset_tool, subset_taxa, left_on = 'contig_id', right_on='contig_id', how='left')
# cleanup the table
columnsremove = ['sample_id_y']
subset_df.drop(columnsremove, axis=1, inplace=True)
subset_df.rename(columns={'sample_id_x':'sample_id'},inplace=True)
# append in the combined_df
merged_df = pd.concat([merged_df, subset_df], ignore_index=True) # DataFrame.append was removed in pandas 2.0

# write to file
merged_df.to_csv('combgc_complete_summary_taxonomy.tsv', sep='\t', index=False)

#########################################
# FUNCTION: HAMRONIZATION
#########################################
def hamronization_taxa(args):
merged_df = pd.DataFrame()

# assign input args to variables
hamronization = args.arg
taxa_list = args.taxa3

# prepare the taxonomy files
taxa_df = pd.DataFrame()
# append the dfs to the taxonomy_files_combined
for file in taxa_list: # list of taxa files ['','']
df = reformat_mmseqs_taxonomy(file)
taxa_df = pd.concat([taxa_df, df])

# filter the tool df
tool_df = pd.read_csv(hamronization, sep='\t')
# rename the columns
tool_df.rename(columns={'input_file_name':'sample_id', 'input_sequence_id':'contig_id'}, inplace=True)
# reorder the columns
new_order = ['sample_id', 'contig_id'] + [col for col in tool_df.columns if col not in ['sample_id', 'contig_id']]
tool_df = tool_df.reindex(columns=new_order)
# grab the real contig id in another column copy for merging
tool_df['contig_id_merge'] = tool_df['contig_id'].str.rsplit('_', n=1).str[0]

# merge rows from taxa to hamronization_df based on substring match in sample_id
# grab the unique sample names from the taxonomy table
samples_taxa = taxa_df['sample_id'].unique()
# for every sampleID in taxadf merge the results
for sampleID in samples_taxa:
# subset hamronization
subset_tool = tool_df.loc[tool_df['sample_id'].str.contains(sampleID)]
# subset taxa
subset_taxa = taxa_df.loc[taxa_df['sample_id'].str.contains(sampleID)]
# merge
subset_df = pd.merge(subset_tool, subset_taxa, left_on = 'contig_id_merge', right_on='contig_id', how='left')
# cleanup the table
columnsremove = ['contig_id_merge','contig_id_y', 'sample_id_y']
subset_df.drop(columnsremove, axis=1, inplace=True)
subset_df.rename(columns={'contig_id_x': 'contig_id', 'sample_id_x':'sample_id'},inplace=True)
# append in the combined_df
merged_df = pd.concat([merged_df, subset_df], ignore_index=True) # DataFrame.append was removed in pandas 2.0

# write to file
merged_df.to_csv('hamronization_complete_summary_taxonomy.tsv', sep='\t', index=False)

#########################################
# SUBPARSERS: DEFAULT
#########################################
ampcombi_parser.set_defaults(func=ampcombi_taxa)
combgc_parser.set_defaults(func=combgc_taxa)
hamronization_parser.set_defaults(func=hamronization_taxa)

if __name__ == '__main__':
args = parser.parse_args()
args.func(args) # call the default function
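The join at the heart of all three submodules can be sketched with toy data: tool hits carry gene-level IDs of the form `<contig>_<n>`, so the script trims the trailing gene index with `rsplit` to recover the contig ID that MMseqs2 reports, then left-merges onto the taxonomy table. A simplified sketch (hypothetical sample data; the suffix handling in the script uses explicit column renames instead):

```python
import pandas as pd

# Tool hits with gene-level IDs ("<contig>_<gene-number>").
tool_df = pd.DataFrame({
    "sample_id": ["s1", "s1"],
    "contig_id": ["contig_1_1", "contig_2_4"],
})
# MMseqs2 taxonomy, keyed by contig ID.
taxa_df = pd.DataFrame({
    "contig_id": ["contig_1", "contig_2"],
    "mmseqs_lineage_contig": ["d_Bacteria;p_Firmicutes", "d_Bacteria;p_Proteobacteria"],
})
# Strip the last "_" field to recover the contig ID, then left-merge.
tool_df["contig_id_merge"] = tool_df["contig_id"].str.rsplit("_", n=1).str[0]
merged = tool_df.merge(
    taxa_df, left_on="contig_id_merge", right_on="contig_id",
    how="left", suffixes=("", "_taxa"),
).drop(columns=["contig_id_merge", "contig_id_taxa"])
print(merged["mmseqs_lineage_contig"].tolist())
# → ['d_Bacteria;p_Firmicutes', 'd_Bacteria;p_Proteobacteria']
```

A left merge keeps every tool hit even when a contig received no taxonomic assignment, which matches how the script tolerates unclassified (NaN) lineages.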
