Tony Liu, Mahan Rafiee, Connor Morgan-Lang, Joe Ho, Avery Noonan, Kateryna Ievdokymenko, Zach Armstrong, Steven Hallam
conda install -c hallamlab \
-c imperial-college-research-computing \
-c bioconda -c conda-forge fabfos
# minimal
fabfos --output ./example_out \
--interleaved --reads /.../interleaved.fastq \
--background /.../host_background_genome.fasta \
--pool-size 384
# full
fabfos --output ./example_out \
--threads 8 \
--verbose \
--assembler megahit \
--reads /.../forward_reads.fastq \
--reverse /.../reverse_reads.fastq \
--parity pe \
--background /.../host_background_genome.fasta \
--vector /.../plasmid_backbone.fasta \
--ends /.../end_sequences.fq \
--ends-name-regex "\w+_\d+" \
--ends-fw-flag "FW"
Fabfos is a pipeline for resolving cloned inserts from pooled fosmid libraries using the following steps:
- Read QC
- Estimate pool size
- Reads crossing one of the two junctions between the vector backbone and insert are prepared and fed to VSEARCH [10], which estimates the number of unique sequences as a proxy to the number of clones in the pool.
- Alternatively, the user can provide an estimate directly.
- Assembly
- Assessing the completeness of inserts using end sequences
- End sequences covering both junctions between the insert and vector backbone are aligned to the assembled contigs using BLAST [1]. Contigs with mapped end sequences for both junctions are considered complete.
Pick one of Conda, Singularity, Docker, or Manual.
conda install \
-c hallamlab \
-c imperial-college-research-computing \
-c bioconda \
-c conda-forge \
fabfos
Consider using mamba, a drop in multithreaded replacement for conda
singularity pull ./fabfos.sif docker://quay.io/hallamlab/fabfos
# example run
singularity exec \
--bind ./:/ws \
--workdir /ws \
./fabfos.sif fabfos --help
docker pull quay.io/hallamlab/fabfos
# example run
docker run -it --rm \
-u $(id -u):$(id -g) \
--mount type=bind,source="./",target="/ws"\
--workdir="/ws" \
quay.io/hallamlab/fabfos fabfos --help
Clone this repo
Install dependencies from .yml
file
./dev.sh --ibase
# or
conda env create --no-default-packages -n fabfos_env -f ./envs/base.yml
Activate the environment
conda activate fabfos_env
and run the source code directly
./dev.sh -r --help
# or
cd ./src
python -m fabfos --help
fabfos --output ./example_out \
--threads 8 \
--verbose \
--assembler megahit \
--reads /.../forward_reads.fastq \
--reverse /.../reverse_reads.fastq \
--parity pe \
--background /.../host_background_genome.fasta \
--vector /.../plasmid_backbone.fasta \
--ends /.../end_sequences.fq \
--ends-name-regex "\w+_\d+" \
--ends-fw-flag "FW"
Explanation of arguments:
--output
: The folder where Fabfos should store intermediate and output files. If it doesn't exist, Fabfos will create it.--threads
: Maximum number of threads to use. Fabfos itself will only ever use 1, due to the current limitations of python--verbose
: Include debug messages in printouts--assembler
: The assembler and preset to use. Options are:megahit
megahit is significantly faster than spades with comparable assembly performance, however each option tends to resolve a slightly different set contigs so trying them all will produce the most complete setspades_meta
spades_isolate
spades_sc
--reads
,--reverse
: paths to the raw reads of the pooled- default is paried end fastqs
- Fabfos should handle gzipped reads automatically
- for single end reads
fabfos ... --reads /.../reads.fq --parity se
, leaving out--reverse
- for interleaved reads
fabfos ... --reads /.../reads.fq --interleaved
, leaving out--reverse
--background
: path to genomic fasta of host background. Fabfos filters out reads that map to the host. Example for E. coli k12--vector
: path to fasta of the vector backbone sequence. An example would be pcc1- used to estimate the pool size
- can be replaced with a manual estimate, example:
--pool-size 384
, in which case Fabfos will not perform an estimate
- (optional)
--ends
: end sequences should be sequenced inward from one of the two junctions between the insert and vector backbone--ends-name-regex
: regex to pull the name of the clone from the header of the end sequence fastq. Example:"\w+_\d+"
would get "ABC_123" from ">ABC_123_FW"--ends-fw-flag
a token that, if found within the fastq header of the end sequence, would indicate that it was sequenced from the "forward" junction. Example:FW
would indicate that ">ABC_123_FW" is the end sequence of the "forward" juction.- if ommitted, assembled contigs will not be checked for completeness
Within the output folder specified in --output /.../NAME
there will exist the following...
temp_*/
work folders for various steps and toolsfabfos.log
main logNAME_metadata.tsv
metadata table including read stats, assembly stats, and pool size estimateNAME_fosmids_*.fasta
resolved fosmidsNAME_fosmids_all_contigs.fasta
all assembled contigsNAME_fosmids_both_mapped.fasta
inserts (contigs) with end sequences mapped to both endsNAME_fosmids_single_mapped.fasta
inserts (contigs) with only one end mapped to an end sequenceNAME_fosmids_not_mapped
contigs with that didn't map to any end sequence
NAME_end_mapping.tsv
blast results from mapping end sequences to assembled contigsNAME_end_mapping_failures.tsv
table of provided end sequences that didn't map to any contig
Some outputs may be ommitted if some inputs are not provided. For example, the end mapping tables will not exist if end sequences were not provided.
Example 1
- Paird end reads
megahit
as assembler- Fabfos estimates pool size
- End sequences
fabfos --output ./example_out \
--assembler megahit \
--reads /.../forward_reads.fastq \
--reverse /.../reverse_reads.fastq \
--background /.../host_background_genome.fasta \
--vector /.../plasmid_backbone.fasta \
--ends /.../end_sequences.fq \
--ends-name-regex "\w+_\d+" \
--ends-fw-flag "FW"
Example 2
- Single end reads
spades_isolate
as assembler- Fabfos estimates pool size
- No end sequences
fabfos --output ./example_out \
--assembler spades_isolate \
--reads /.../se.fastq \
--parity se \
--background /.../host_background_genome.fasta \
--vector /.../plasmid_backbone.fasta \
Example 3
- Interleaved reads
spades_meta
as assembler- User provides pool size estimate
- End sequences
- headers look like this
">AB001-R"
and">AB001-L"
for left and right instead of forward and reverse
- headers look like this
fabfos --output ./example_out \
--assembler spades_meta \
--interleaved \
--reads /.../se.fastq \
--background /.../host_background_genome.fasta \
--pool-size 384 \
--ends /.../end_sequences.fq \
--ends-name-regex "\w+\d+" \
--ends-fw-flag "L"
usage: fabfos [-b BACKGROUND] [--vector VECTOR] [-n POOL_SIZE] [--nanopore_reads NANOPORE_READS] [--skip_correction]
[-r READS] [-2 REVERSE] [-t {B,F}] [-i] [-p {pe,se}] [-a {spades_meta,spades_isolate,spades_sc,megahit}]
[-o OUTPUT] [-T THREADS] [-e ENDS] [--ends-name-regex ENDS_NAME_REGEX] [--overwrite]
[--ends-fw-flag ENDS_FW_FLAG] [-v] [--verbose] [-h]
Pipeline for filtering, assembling and organizing fosmid sequence information.
Required arguments:
-b BACKGROUND, --background BACKGROUND
Path to fosmid background fasta
--vector VECTOR Path to vector backbone fasta, used for estimating pool size, required if --pool-size not
given
-n POOL_SIZE, --pool-size POOL_SIZE
Estimate of number of fosmids in pool, required if --vector not given
Sequence read-specific arguments:
-r READS, --reads READS
Path to the forward strand file or the interleaved paired-end file. Can be in either FastQ or
BAM format.
-2 REVERSE, --reverse REVERSE
Path to the reverse-end read file (if applicable)
-t {B,F}, --type {B,F}
Enter B if input type is BAM, F for FastQ. [ DEFAULT = 'F' ]
-i, --interleaved Flag indicating the reads are interleaved (i.e. forward and reverse pairs are in the same
file). [DEFAULT = False]
-p {pe,se}, --parity {pe,se}
Specifying the sequencing chemistry used, either paired-end (pe) or single-end (se). [DEFAULT
= 'pe']
Nanopore-specific [development] options:
--nanopore_reads NANOPORE_READS
A FASTA file containing nanopore reads to be used in assembly.
--skip_correction Do not perform error-correction of nanopore reads using proovread
Optional arguments:
-a {spades_meta,spades_isolate,spades_sc,megahit}, --assembler {spades_meta,spades_isolate,spades_sc,megahit}
Genome assembly software to use. [DEFAULT = spades_isolate]
-o OUTPUT, --output OUTPUT
path to temp. workspace [DEFAULT = /tmp]
-T THREADS, --threads THREADS
The number of threads that can be used [DEFAULT = 8]
-e ENDS, --ends ENDS FASTA file containing fosmid ends - these will be used for alignment to fosmid contigs.
--ends-name-regex ENDS_NAME_REGEX
regex for getting name of endseq., ex. "\w+_\d+" would get ABC_123 from ABC_123_FW
--overwrite overwrite the output directory if it already exists
--ends-fw-flag ENDS_FW_FLAG
string that marks forward ends, ex. "_FW" would cause ABC_123_FW and ABC_123_RE to be assigned
to forward and reverse respectively.
Miscellaneous options:
-v, --version Print the FabFos version and exit.
--verbose Increase the level of verbosity in runtime log.
-h, --help Show this help message and exit
Please consider citing these along with Fabfos.
-
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410. https://doi.org/10.1016/S0022-2836(05)80360-2
-
Bankevich, A., Nurk, S., Antipov, D., Gurevich, A.A., Dvorkin, M., Kulikov, A.S., Lesin, V.M., Nikolenko, S.I., Pham, S., Prjibelski, A.D., Pyshkin, A.V., Sirotkin, A.V., Vyahhi, N., Tesler, G., Alekseyev, M.A., Pevzner, P.A., 2012. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J. Comput. Biol. 19, 455–477. https://doi.org/10.1089/cmb.2012.0021
-
Bolger, A.M., Lohse, M., Usadel, B., 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120. https://doi.org/10.1093/bioinformatics/btu170
-
Danecek, P., Bonfield, J.K., Liddle, J., Marshall, J., Ohan, V., Pollard, M.O., Whitwham, A., Keane, T., McCarthy, S.A., Davies, R.M., Li, H., 2021. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008. https://doi.org/10.1093/gigascience/giab008
-
Koren, S., Walenz, B.P., Berlin, K., Miller, J.R., Bergman, N.H., Phillippy, A.M., 2017. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736. https://doi.org/10.1101/gr.215087.116
-
Li, D., Liu, C.-M., Luo, R., Sadakane, K., Lam, T.-W., 2015. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676. https://doi.org/10.1093/bioinformatics/btv033
-
Li, H., Durbin, R., 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760. https://doi.org/10.1093/bioinformatics/btp324
-
Nurk, S., Meleshko, D., Korobeynikov, A., Pevzner, P.A., 2017. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834. https://doi.org/10.1101/gr.213959.116
-
Prjibelski, A., Antipov, D., Meleshko, D., Lapidus, A., Korobeynikov, A., 2020. Using SPAdes De Novo Assembler. Curr. Protoc. Bioinforma. 70, e102. https://doi.org/10.1002/cpbi.102
-
Rognes, T., Flouri, T., Nichols, B., Quince, C., Mahé, F., 2016. VSEARCH: a versatile open source tool for metagenomics. PeerJ 4, e2584. https://doi.org/10.7717/peerj.2584