16S rRNA gene sequence curation and phylogenetic reference set creation
- Confirm that the libraries needed to compile dependencies are available. On Ubuntu:
sudo apt-get install gfortran libopenblas-dev liblapack-dev
- Install into Python >= 3.8 or a Python 3 virtual environment:
% python3 -m venv bin-env
% source bin-env/bin/activate
% bin/bootstrap.sh
The deenurp executable should now be on your $PATH.
See the required system libraries above.
First, install binary dependencies:
- Python 3
- pip, for installing Python dependencies (http://www.pip-installer.org/)
- Python packages: run pip install PACKAGE for every PACKAGE listed in requirements.txt, e.g. cat requirements.txt | xargs -n 1 pip install
- vsearch (https://github.com/torognes/vsearch)
- Infernal version 1.1 (http://infernal.janelia.org/)
- pplacer suite (http://matsen.fhcrc.org/pplacer)
- FastTree 2 (http://www.microbesonline.org/fasttree/#Install)
Optional (for filter-outliers and pairwise-distances):
- muscle (http://www.drive5.com/muscle/)
Finally, install:
python setup.py install
Deenurp can be run from a Docker image, which can be built locally from the Dockerfile or pulled:
docker pull nghoffman/deenurp:v0.3.0
Similarity-search based reference sequence selection
The deenurp package under the current directory provides several subcommands, accessed via the script deenurp.py, or the command deenurp if installed.
Subcommands fall into two general categories:
- Building a set of reference sequences for use in refpkg building
- Selecting sequences for a specific reference package
Removes outlier sequences from a reference database
Expands poorly-represented names in a sequence file by similarity search
Clusters reference sequences, first by tax-id at a specified rank
(default: species), then by similarity for unnamed sequences or
sequences not classified to the desired rank. Serves as input to
search-sequences.
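The two-stage grouping described above can be sketched as follows. This is only an illustration, not deenurp's implementation: the function names, the record layout, the greedy centroid strategy, and the toy identity function (which stands in for a real similarity search) are all assumptions.

```python
from collections import defaultdict


def identity(a, b):
    """Toy similarity (assumption): fraction of matching positions
    between two equal-length strings, 0.0 otherwise."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_refs(records, similarity=identity, threshold=0.85):
    """Group records of (seq_id, tax_id_at_rank_or_None, sequence).

    Stage 1: sequences with a tax_id at the chosen rank are grouped by it.
    Stage 2: the remaining sequences are clustered greedily by similarity
    to the first member (centroid) of each cluster.
    """
    named = defaultdict(list)
    unnamed = []
    for seq_id, tax_id, seq in records:
        if tax_id is not None:
            named[tax_id].append(seq_id)
        else:
            unnamed.append((seq_id, seq))

    clusters = [sorted(ids) for _, ids in sorted(named.items())]

    centroids = []  # list of (centroid_sequence, member_ids)
    for seq_id, seq in unnamed:
        for centroid_seq, members in centroids:
            if similarity(seq, centroid_seq) >= threshold:
                members.append(seq_id)
                break
        else:
            # No centroid was close enough: start a new cluster.
            centroids.append((seq, [seq_id]))

    clusters.extend(members for _, members in centroids)
    return clusters
```

Each returned list is one cluster of sequence ids; named clusters come first, ordered by tax_id.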
Builds a set of hierarchical reference packages.
Searches a set of sequences against a FASTA file containing possible reference sequences.
This subcommand searches sequences against a reference FASTA
file, saving the results and some metadata to a sqlite database for
use in select-references.
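As a rough sketch of what saving search results to a sqlite database might look like, the example below stores the best hit per query sequence. The table name, column names, and the save_hits function are hypothetical; the schema deenurp actually writes is not described here.

```python
import sqlite3


def save_hits(conn, hits):
    """Store the best hit per query.

    hits: iterable of (query_id, ref_id, pct_id) tuples.
    The best_hits table and its columns are assumptions for illustration.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS best_hits ("
        "query_id TEXT PRIMARY KEY, "
        "ref_id TEXT NOT NULL, "
        "pct_id REAL NOT NULL)"
    )
    best = {}
    for query_id, ref_id, pct_id in hits:
        # Keep only the highest-identity hit seen for each query.
        if query_id not in best or pct_id > best[query_id][1]:
            best[query_id] = (ref_id, pct_id)
    conn.executemany(
        "INSERT OR REPLACE INTO best_hits VALUES (?, ?, ?)",
        [(q, r, p) for q, (r, p) in best.items()],
    )
    conn.commit()
```

A later step could then read the table back with an ordinary SELECT, which is the role the database plays between the two subcommands.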
Given the output of search-sequences, select-references
attempts to find a good set of reference sequences.
For each reference cluster (see cluster-refs) with at least a minimum
number of sequences having best hits to the cluster, selects a set number
of sequences to serve as references.
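The selection rule above can be sketched as a filter on per-cluster hit counts. The function name, parameter names, and defaults are assumptions, and taking the first few members of a cluster is only a placeholder for however deenurp actually chooses sequences within a cluster.

```python
from collections import Counter


def select_references(best_hits, clusters, min_hits=2, refs_per_cluster=2):
    """Pick references from clusters that attract enough best hits.

    best_hits: dict mapping query_id -> cluster_id of its best hit.
    clusters:  dict mapping cluster_id -> ordered list of reference ids.
    """
    hit_counts = Counter(best_hits.values())
    selected = {}
    for cluster_id, members in clusters.items():
        if hit_counts[cluster_id] >= min_hits:
            # Placeholder choice: the first refs_per_cluster members.
            selected[cluster_id] = members[:refs_per_cluster]
    return selected
```

Clusters that receive fewer than min_hits best hits contribute no references.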
Taxa that are the sole descendant of their parent can complicate taxonomic classification.
The fill-lonely subcommand finds some company for these lonely taxa.
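To make "lonely" concrete, the sketch below finds taxa that are the only child of their parent in a small taxonomy. The find_lonely function and the dict representation of the taxonomy are assumptions for illustration; how fill-lonely then picks companion sequences is not shown.

```python
from collections import defaultdict


def find_lonely(parent_of):
    """Return tax_ids that are the sole child of their parent.

    parent_of: dict mapping tax_id -> parent tax_id (None for the root).
    """
    children = defaultdict(list)
    for tax_id, parent in parent_of.items():
        if parent is not None:
            children[parent].append(tax_id)
    return sorted(kids[0] for kids in children.values() if len(kids) == 1)
```

For example, a genus represented by a single species yields that species as a lonely taxon, while a genus with two species yields none.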
Fetches sequences from a sequence file which match the taxtable for a reference set at a given rank. Useful for adding type strains.
Runs the tax2tree program on a reference package, updating the
seq_info file.
Sequences whose lineage changes are relabeled. The prior tax_id is
added to the seq_info file in the reference package.
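The relabeling step can be illustrated with a small CSV rewrite. This is a sketch under assumptions: the relabel function, the seqname and tax_id column names, and the prior_tax_id column name are chosen for illustration and may not match what the subcommand writes.

```python
import csv
import io


def relabel(seq_info_csv, new_tax_ids, prior_column="prior_tax_id"):
    """Return seq_info CSV text with changed tax_ids and the prior value kept.

    seq_info_csv: CSV text with (assumed) columns seqname and tax_id.
    new_tax_ids:  dict mapping seqname -> new tax_id.
    """
    reader = csv.DictReader(io.StringIO(seq_info_csv))
    fieldnames = list(reader.fieldnames)
    if prior_column not in fieldnames:
        fieldnames.append(prior_column)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        new = new_tax_ids.get(row["seqname"], row["tax_id"])
        if new != row["tax_id"]:
            # Record the old label before overwriting it.
            row[prior_column] = row["tax_id"]
            row["tax_id"] = new
        writer.writerow(row)  # missing prior values are written as empty
    return out.getvalue()
```

Rows whose tax_id is unchanged get an empty value in the prior column.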