Taxonomic Workflows
- This article refers to the lambda-next branch and releases >= 1.9.2
- These features are currently EXPERIMENTAL, please give us feedback on them!
If you are using LAMBDA in taxonomic workflows, you might want to make use of some of the following features:
You only need to do this once:
- Make sure your subject sequences contain accession numbers; GIs are not supported. The following accession numbers are automatically detected and extracted from fasta/fastq headers:
* UniProt (more information)
* NCBI nucl, NCBI prot, NCBI wgs and NCBI mga (more information)
* Refseq (not yet supported)
- Download a mapping file from the NCBI (make sure it's the correct one).
- Rebuild your index, but add `--acc-tax-map /path/to/file.accession2taxid[.gz]` (you don't have to unzip the file).
* Building the new index will take longer, but it only increases the index's size by a few MBs.
* If LAMBDA fails to assign most of your sequences to taxa, it will warn you!
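Before rebuilding, it can help to check how many of your subject accessions actually occur in the mapping file. The following Python sketch is not part of LAMBDA. It assumes the usual tab-separated NCBI layout (accession, accession.version, taxid, gi) with a header line, and it accepts either the plain or the versioned accession because it is not stated here which of the two LAMBDA matches on. All function names are made up for illustration.

```python
import gzip


def open_maybe_gzip(path):
    """Open a mapping file as text, whether or not it is gzip-compressed."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"  # gzip magic bytes
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def load_acc2taxid(lines, wanted):
    """Return {accession: taxid} for the accessions listed in `wanted`.

    Assumed columns: accession, accession.version, taxid, gi (tab-separated).
    """
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] == "accession":  # header or short line
            continue
        acc, acc_ver, taxid = fields[0], fields[1], fields[2]
        for key in (acc, acc_ver):
            if key in wanted:
                mapping[key] = int(taxid)
    return mapping


def coverage(mapping, wanted):
    """Fraction of the wanted accessions that received a taxon ID."""
    return len(set(mapping) & set(wanted)) / len(wanted) if wanted else 0.0
```

With `wanted` being the set of accessions taken from your FASTA headers, `coverage(load_acc2taxid(open_maybe_gzip("/path/to/file.accession2taxid.gz"), wanted), wanted)` gives a rough idea of whether you downloaded the correct mapping file.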
You need to tell LAMBDA to print the taxonomic information:
- for the tabular BLAST Output Formats, specify e.g. `--output-columns 'std staxids'`.
- for the SAMTOOLS Output Formats, specify e.g. `--sam-bam-tags 'AS NM ZE ZI ZF st'` (the last tag is the important one).
- there is no impact on the run-time of lambda.
Note that this implies no taxonomic binning; you just get the taxa corresponding to the subject sequences of your individual matches, i.e. staxids is a per-match specifier.
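Because the taxon IDs are reported per match, any per-query summary has to be built afterwards. Here is a minimal Python sketch, assuming tab-separated output written with `--output-columns 'std staxids'`, so that the 12 standard BLAST columns are followed by the taxon ID field as column 13. That several IDs in one field are joined by `;` is BLAST's convention and only an assumption for LAMBDA; the function name is hypothetical.

```python
from collections import defaultdict


def taxids_per_query(lines):
    """Collect the subject taxon IDs of every match, grouped by query ID."""
    per_query = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):  # blank or comment line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 13:  # no staxids column present
            continue
        query_id, staxids = fields[0], fields[12]
        # One list per match; a match without a numeric taxon ID yields [].
        ids = [int(t) for t in staxids.split(";") if t.strip().isdigit()]
        per_query[query_id].append(ids)
    return dict(per_query)
```

Each query maps to one list per match, which keeps the per-match nature of the field visible instead of silently merging matches.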
Lambda can do taxonomic binning, i.e. it will compute the lowest common ancestor (LCA) taxon for all matches of one query sequence. This helps with taxonomic assessment, although it should be noted that it does no statistical evaluation or weighting of matches. Other tools like SLIMM do a more complex analysis.
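To make the binning step concrete, here is a small Python sketch of a lowest-common-ancestor computation over a child-to-parent map. It only illustrates the idea and is not LAMBDA's actual implementation; the toy taxonomy and the function names are assumptions.

```python
def lineage(taxid, parent):
    """Path from `taxid` up to the root (the root is its own parent)."""
    path = [taxid]
    while parent[taxid] != taxid:
        taxid = parent[taxid]
        path.append(taxid)
    return path


def lowest_common_ancestor(taxids, parent):
    """LCA of all given taxon IDs; None if no IDs are given."""
    taxids = list(taxids)
    if not taxids:
        return None
    # Start with the first lineage, then keep only ancestors shared with the rest.
    common = lineage(taxids[0], parent)
    for taxid in taxids[1:]:
        others = set(lineage(taxid, parent))
        common = [t for t in common if t in others]
    return common[0]  # the deepest shared node comes first


# Toy taxonomy: 1 is the root; 10 and 20 are its children; 11 and 12 sit below 10.
PARENT = {1: 1, 10: 1, 20: 1, 11: 10, 12: 10}
print(lowest_common_ancestor([11, 12], PARENT))      # 10
print(lowest_common_ancestor([11, 12, 20], PARENT))  # 1
```

Note that every match counts equally here, which mirrors the remark above that no weighting of matches takes place.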
You only need to do this once:
- Do all of the things for printing subject taxonomic IDs as described above.
- But before that, also download the `taxdump.tar.gz` and untar it to some place.
- Rebuild your index, but in addition to `--acc-tax-map /path/to/file.accession2taxid[.gz]`, also add the path to the untarred taxdump directory, i.e. `--tax-dump-dir /path/to/directory`
* this will only marginally increase your indexing build time and index size
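The untarred taxdump directory contains, among other files, `nodes.dmp`, which records the parent of every taxon. As a hedged illustration of what that file provides (not of how LAMBDA reads it), this Python sketch parses its tab-pipe-tab separated layout into a child-to-parent map; the function name is made up.

```python
def parse_nodes(lines):
    r"""Build {tax_id: (parent_tax_id, rank)} from nodes.dmp-style lines.

    Fields are separated by "\t|\t" and every line ends with "\t|";
    the first three fields are tax_id, parent tax_id and rank.
    """
    nodes = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            line = line[:-2]  # drop the trailing field terminator
        fields = line.split("\t|\t")
        if len(fields) < 3:  # skip malformed or empty lines
            continue
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes
```

A call such as `parse_nodes(open("/path/to/directory/nodes.dmp"))` yields the tree structure that an LCA computation needs; the scientific names live in a separate file of the same dump.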
You need to tell LAMBDA to print the LCA information (either as taxon id or scientific name):
- for the tabular BLAST Output Formats, specify e.g. `--output-columns 'std lcaid lcataxid'`.
- for the SAMTOOLS Output Formats, specify e.g. `--sam-bam-tags 'AS NM ZE ZI ZF ls lt'` (the last two tags are the important ones).
- there is no significant impact on the run-time of lambda.
- although this information field is printed per-match (like all other fields), it of course refers to all matches of the query sequence and is thus identical for all matches of each query sequence.
Some things to note:
- matches against subject sequences that have no identified taxon id do not contribute to LCA computation. Alternatively we could assign all unknown sequences to the root taxon, but this would skew results strongly.
- the `--num-matches` parameter strongly influences the LCA. Choose smaller values if you always end up with very generic LCAs.
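Both notes can be illustrated with a toy example in Python. This is a sketch under assumptions, not LAMBDA's code: the taxonomy is invented, `None` stands in for a subject without an identified taxon ID, and `lca_of_top` is a hypothetical helper that mimics capping the number of matches per query.

```python
def ancestors(taxid, parent):
    """Taxon followed by all of its ancestors, ending at the root."""
    chain = [taxid]
    while parent[chain[-1]] != chain[-1]:
        chain.append(parent[chain[-1]])
    return chain


def lca_of_top(match_taxids, parent, num_matches):
    """LCA over the first `num_matches` matches, ignoring unknown taxa (None)."""
    known = [t for t in match_taxids[:num_matches] if t is not None]
    if not known:
        return None
    shared = ancestors(known[0], parent)
    for taxid in known[1:]:
        lineage = set(ancestors(taxid, parent))
        shared = [t for t in shared if t in lineage]
    return shared[0]


# Invented taxonomy: root 1 > 10 > 100 > {1000, 1001}, and root 1 > 20.
PARENT = {1: 1, 10: 1, 20: 1, 100: 10, 1000: 100, 1001: 100}
# Matches of one query, best first; the third subject has no taxon ID.
MATCHES = [1000, 1001, None, 10, 20]

for n in (1, 2, 3, 4, 5):
    print(n, lca_of_top(MATCHES, PARENT, n))
```

In this toy data the LCA moves from 1000 to 100, stays at 100 when the unknown subject is added, and then climbs to 10 and finally to the root as more matches are admitted. Adding matches can only keep the LCA where it is or push it towards the root, which is why a smaller match limit tends to give more specific results.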
If anything is unclear, don't hesitate to contact me.