\ No newline at end of file
diff --git a/dev/index.html b/dev/index.html
index 27c5a5a..781ba84 100644
--- a/dev/index.html
+++ b/dev/index.html
@@ -1 +1 @@
- XAVIER Documentation
XAVIER - eXome Analysis and Variant explorER. XAVIER is an open-source, reproducible, and scalable solution for analyzing Whole Exome sequencing data. Its long-term goals: to accurately call germline and somatic variants, to infer CNVs, and to boldly annotate variants like no pipeline before!
Welcome to XAVIER's documentation! This guide is the main source of documentation for users that are getting started with the XAVIER pipeline.
The xavier pipeline is composed of several inter-related subcommands to set up and run the pipeline across different systems. Each of the available subcommands performs a different function:
xavier run: Run the XAVIER pipeline with your input files.
xavier unlock: Unlocks a previous run's output directory.
XAVIER is a comprehensive whole exome-sequencing pipeline following the Broad's set of best practices. It relies on technologies like Singularity1 to maintain the highest level of reproducibility. The pipeline consists of a series of data processing and quality-control steps orchestrated by Snakemake2, a flexible and scalable workflow management system, to submit jobs to a cluster or cloud provider.
The pipeline is compatible with data generated from Illumina short-read sequencing technologies. As input, it accepts a set of FastQ or BAM files and can be run locally on a compute instance, on-premise using a cluster, or on the cloud (feature coming soon!). A user can define the method or mode of execution. The pipeline can submit jobs to a cluster using a job scheduler like SLURM, or run on AWS using Tibanna (feature coming soon!). A hybrid approach ensures the pipeline is accessible to all users.
Before getting started, we highly recommend reading through the usage section of each available sub command.
For more information about issues or troubleshooting a problem, please check out our FAQ prior to opening an issue on GitHub.
This site is a living document, created for and by members like you. XAVIER is maintained by the members of CCBR and is improved by continuous feedback! We encourage you to contribute new content and make improvements to existing content via pull request to our GitHub repository.
XAVIER - eXome Analysis and Variant explorER. XAVIER is an open-source, reproducible, and scalable solution for analyzing Whole Exome sequencing data. Its long-term goals: to accurately call germline and somatic variants, to infer CNVs, and to boldly annotate variants like no pipeline before!
Welcome to XAVIER's documentation! This guide is the main source of documentation for users that are getting started with the XAVIER pipeline.
The xavier pipeline is composed of several inter-related subcommands to set up and run the pipeline across different systems. Each of the available subcommands performs a different function:
xavier run: Run the XAVIER pipeline with your input files.
xavier unlock: Unlocks a previous run's output directory.
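The two subcommands above can be exercised from a shell. A hypothetical first session follows; the subcommand names come from this page, but the output path and argument form are placeholders, so consult `xavier --help` for the real interface:

```shell
# Hedged sketch: subcommand names are from the docs above; the directory name
# and the exact argument form for `unlock` are assumptions, not the real CLI.
outdir="xavier_output"            # placeholder: output directory from a previous run
if command -v xavier >/dev/null 2>&1; then
  xavier unlock "$outdir"         # release a stale Snakemake lock before re-running
else
  echo "xavier not found on PATH; on biowulf, try: module load ccbrpipeliner"
fi
```

Unlocking is typically only needed after an interrupted run leaves the output directory locked by Snakemake.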
XAVIER is a comprehensive whole exome-sequencing pipeline following the Broad's set of best practices. It relies on technologies like Singularity1 to maintain the highest level of reproducibility. The pipeline consists of a series of data processing and quality-control steps orchestrated by Snakemake2, a flexible and scalable workflow management system, to submit jobs to a cluster or cloud provider.
The pipeline is compatible with data generated from Illumina short-read sequencing technologies. As input, it accepts a set of FastQ or BAM files and can be run locally on a compute instance, on-premise using a cluster, or on the cloud (feature coming soon!). A user can define the method or mode of execution. The pipeline can submit jobs to a cluster using a job scheduler like SLURM, or run on AWS using Tibanna (feature coming soon!). A hybrid approach ensures the pipeline is accessible to all users.
Before getting started, we highly recommend reading through the usage section of each available sub command.
For more information about issues or troubleshooting a problem, please check out our FAQ prior to opening an issue on GitHub.
This site is a living document, created for and by members like you. XAVIER is maintained by the members of CCBR and is improved by continuous feedback! We encourage you to contribute new content and make improvements to existing content via pull request to our GitHub repository.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Last update: 2024-01-30
\ No newline at end of file
+ License - XAVIER Documentation
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
\ No newline at end of file
diff --git a/dev/pipeline-details/methods/index.html b/dev/pipeline-details/methods/index.html
index 37a26f0..9f27bf1 100644
--- a/dev/pipeline-details/methods/index.html
+++ b/dev/pipeline-details/methods/index.html
@@ -1 +1 @@
- Methods - XAVIER Documentation
Low-quality bases and adapter sequences are trimmed from the raw sequencing reads using Trimmomatic (v. 0.39)1. Trimmed reads are then aligned to the human hg38 reference genome using BWA mapping software (v. 0.7.17)2. Duplicate reads are marked using Samblaster (v. 0.1.25)3 and sorted using samtools (v. 1.8). Finally, base quality score recalibration is performed as indicated in the GATK4 (v. 4.2.2.0) best practices 4.
HaplotypeCaller from GATK4 (v. 4.2.2.0) is used to call germline variants, parallelized across chromosomes, and all samples in the cohort are joint genotyped together 4,5.
Somatic variant calling (SNPs and Indels) is performed using Mutect (v. 1.1.7)6, Mutect2 (GATK v. 4.2.0)7, Strelka2 (v. 2.9.0)8, and VarDict (v. 1.4)9 in tumor-normal mode. Variants from all callers are merged using the CombineVariants tool from GATK version 3.8-1. Genomic, functional and consequence annotations are added using Variant Effect Predictor (VEP v. 99)10 and converted to Mutation Annotation Format (MAF) using the vcf2maf tool (v. 1.6.16)11.
For Copy Number Variants (CNVs), Control-Freec (v. 11.6)12 is used to generate pileups, which are used as input for the R package 'sequenza' (v. 3.0.0)13. The complete Control-Freec workflow is then re-run using ploidy and cellularity estimates from 'sequenza'.
SOBDetector is a tool that scores variants based on strand-orientation bias, which can be a sign of DNA damage caused by tissue fixation. This pipeline runs SOBDetector in two passes: the first pass uses the parameters provided with the software (calculated from publicly available TCGA data); cohort-specific bias metrics are then computed from those results, and SOBDetector is re-run using the cohort-specific values.
Ancestry and relatedness scores are generated using Somalier (v. 0.2.13)14. Contamination analyses are performed against viral and bacterial genomes from NCBI using Kraken2 (v. 2.1.2)15, as well as against mouse, human, and UniVec databases using FastQ Screen (v. 0.14.1)16. Sequence, mapping, and variant statistics are computed using FastQC (v. 0.11.9), Qualimap (v. 2.2.1)17 and SnpEff (v. 4.3t)18. All of these metrics are combined into an interactive HTML report using MultiQC (v. 1.11)19.
Bolger, A.M., M. Lohse, and B. Usadel, Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 2014. 30(15): p. 2114-20. ↩
Li, H. and R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 2009. 25(14): p. 1754-60. ↩
Faust, G.G. and I.M. Hall, SAMBLASTER: fast duplicate marking and structural variant read extraction. Bioinformatics, 2014. 30(17): p. 2503-5. ↩
Van der Auwera, G.A. and B.D. O'Connor, Genomics in the Cloud: Using Docker, GATK, and WDL in Terra. First edition. 2020, Sebastopol, CA: O'Reilly Media. ↩↩
Poplin, R., et al., Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv, 2018: p. 201178. ↩
Cibulskis, K., et al., Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol, 2013. 31(3): p. 213-9. ↩
Benjamin, D., et al., Calling Somatic SNVs and Indels with Mutect2. bioRxiv, 2019: p. 861054. ↩
Kim, S., et al., Strelka2: fast and accurate calling of germline and somatic variants. Nat Methods, 2018. 15(8): p. 591-594. ↩
Lai, Z., et al., VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res, 2016. 44(11): p. e108. ↩
McLaren, W., et al., The Ensembl Variant Effect Predictor. Genome Biol, 2016. 17(1): p. 122. ↩
Boeva, V., et al., Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics, 2012. 28(3): p. 423-5. ↩
Favero, F., et al., Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data. Ann Oncol, 2015. 26(1): p. 64-70. ↩
Pedersen, B. somalier: extract informative sites, evaluate relatedness, and perform quality-control on BAM/CRAM/BCF/VCF/GVCF. 2018; Available from: https://github.com/brentp/somalier. ↩
Wood, D.E., J. Lu, and B. Langmead, Improved metagenomic analysis with Kraken 2. Genome Biol, 2019. 20(1): p. 257. ↩
Wingett, S.W. and S. Andrews, FastQ Screen: A tool for multi-genome mapping and quality control. F1000Res, 2018. 7: p. 1338. ↩
Okonechnikov, K., A. Conesa, and F. Garcia-Alcalde, Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics, 2016. 32(2): p. 292-4. ↩
Cingolani, P., et al., A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin), 2012. 6(2): p. 80-92. ↩
Ewels, P., et al., MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 2016. 32(19): p. 3047-8. ↩
Koster, J. and S. Rahmann, Snakemake-a scalable bioinformatics workflow engine. Bioinformatics, 2018. 34(20): p. 3600. ↩
Last update: 2024-01-30
\ No newline at end of file
+ Methods - XAVIER Documentation
Low-quality bases and adapter sequences are trimmed from the raw sequencing reads using Trimmomatic (v. 0.39)1. Trimmed reads are then aligned to the human hg38 reference genome using BWA mapping software (v. 0.7.17)2. Duplicate reads are marked using Samblaster (v. 0.1.25)3 and sorted using samtools (v. 1.8). Finally, base quality score recalibration is performed as indicated in the GATK4 (v. 4.2.2.0) best practices 4.
HaplotypeCaller from GATK4 (v. 4.2.2.0) is used to call germline variants, parallelized across chromosomes, and all samples in the cohort are joint genotyped together 4,5.
Somatic variant calling (SNPs and Indels) is performed using Mutect (v. 1.1.7)6, Mutect2 (GATK v. 4.2.0)7, Strelka2 (v. 2.9.0)8, and VarDict (v. 1.4)9 in tumor-normal mode. Variants from all callers are merged using the CombineVariants tool from GATK version 3.8-1. Genomic, functional and consequence annotations are added using Variant Effect Predictor (VEP v. 99)10 and converted to Mutation Annotation Format (MAF) using the vcf2maf tool (v. 1.6.16)11.
For Copy Number Variants (CNVs), Control-Freec (v. 11.6)12 is used to generate pileups, which are used as input for the R package 'sequenza' (v. 3.0.0)13. The complete Control-Freec workflow is then re-run using ploidy and cellularity estimates from 'sequenza'.
SOBDetector is a tool that scores variants based on strand-orientation bias, which can be a sign of DNA damage caused by tissue fixation. This pipeline runs SOBDetector in two passes: the first pass uses the parameters provided with the software (calculated from publicly available TCGA data); cohort-specific bias metrics are then computed from those results, and SOBDetector is re-run using the cohort-specific values.
Ancestry and relatedness scores are generated using Somalier (v. 0.2.13)14. Contamination analyses are performed against viral and bacterial genomes from NCBI using Kraken2 (v. 2.1.2)15, as well as against mouse, human, and UniVec databases using FastQ Screen (v. 0.14.1)16. Sequence, mapping, and variant statistics are computed using FastQC (v. 0.11.9), Qualimap (v. 2.2.1)17 and SnpEff (v. 4.3t)18. All of these metrics are combined into an interactive HTML report using MultiQC (v. 1.11)19.
Bolger, A.M., M. Lohse, and B. Usadel, Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 2014. 30(15): p. 2114-20. ↩
Li, H. and R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 2009. 25(14): p. 1754-60. ↩
Faust, G.G. and I.M. Hall, SAMBLASTER: fast duplicate marking and structural variant read extraction. Bioinformatics, 2014. 30(17): p. 2503-5. ↩
Van der Auwera, G.A. and B.D. O'Connor, Genomics in the Cloud: Using Docker, GATK, and WDL in Terra. First edition. 2020, Sebastopol, CA: O'Reilly Media. ↩↩
Poplin, R., et al., Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv, 2018: p. 201178. ↩
Cibulskis, K., et al., Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol, 2013. 31(3): p. 213-9. ↩
Benjamin, D., et al., Calling Somatic SNVs and Indels with Mutect2. bioRxiv, 2019: p. 861054. ↩
Kim, S., et al., Strelka2: fast and accurate calling of germline and somatic variants. Nat Methods, 2018. 15(8): p. 591-594. ↩
Lai, Z., et al., VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res, 2016. 44(11): p. e108. ↩
McLaren, W., et al., The Ensembl Variant Effect Predictor. Genome Biol, 2016. 17(1): p. 122. ↩
Boeva, V., et al., Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics, 2012. 28(3): p. 423-5. ↩
Favero, F., et al., Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data. Ann Oncol, 2015. 26(1): p. 64-70. ↩
Pedersen, B. somalier: extract informative sites, evaluate relatedness, and perform quality-control on BAM/CRAM/BCF/VCF/GVCF. 2018; Available from: https://github.com/brentp/somalier. ↩
Wood, D.E., J. Lu, and B. Langmead, Improved metagenomic analysis with Kraken 2. Genome Biol, 2019. 20(1): p. 257. ↩
Wingett, S.W. and S. Andrews, FastQ Screen: A tool for multi-genome mapping and quality control. F1000Res, 2018. 7: p. 1338. ↩
Okonechnikov, K., A. Conesa, and F. Garcia-Alcalde, Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics, 2016. 32(2): p. 292-4. ↩
Cingolani, P., et al., A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin), 2012. 6(2): p. 80-92. ↩
Ewels, P., et al., MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 2016. 32(19): p. 3047-8. ↩
Koster, J. and S. Rahmann, Snakemake-a scalable bioinformatics workflow engine. Bioinformatics, 2018. 34(20): p. 3600. ↩
\ No newline at end of file
diff --git a/dev/pipeline-details/output/index.html b/dev/pipeline-details/output/index.html
index 1726043..72738ac 100644
--- a/dev/pipeline-details/output/index.html
+++ b/dev/pipeline-details/output/index.html
@@ -1,4 +1,4 @@
- Output Files - XAVIER Documentation
The output files and their locations for the XAVIER pipeline are broken down here. Pre-processing and germline variant calling steps are common to both modes, but the somatic variant calling output depends on whether the pipeline was run in (A) tumor-normal pair or (B) tumor-only analysis mode. All file locations are relative to the output directory specified during job submission.
The output directory after a complete XAVIER run should look like:
The output files and their locations for the XAVIER pipeline are broken down here. Pre-processing and germline variant calling steps are common to both modes, but the somatic variant calling output depends on whether the pipeline was run in (A) tumor-normal pair or (B) tumor-only analysis mode. All file locations are relative to the output directory specified during job submission.
The output directory after a complete XAVIER run should look like:
xavier_output/
├──bams
├──cluster.json # cluster info for the run
├──config
@@ -139,4 +139,4 @@
│└──vcf
├──vardict_out
└──varscan_out
-
Last update: 2024-01-30
\ No newline at end of file
+
\ No newline at end of file
diff --git a/dev/pipeline-details/overview/index.html b/dev/pipeline-details/overview/index.html
index 037c563..56e16c3 100644
--- a/dev/pipeline-details/overview/index.html
+++ b/dev/pipeline-details/overview/index.html
@@ -1 +1 @@
- Overview - XAVIER Documentation
Workflow diagram of XAVIER: the pipeline is composed of a series of data processing steps to trim, align, and recalibrate reads prior to calling variants. These data processing steps closely follow GATK's best practices for cleaning up raw alignments. The pipeline also includes a series of comprehensive quality-control steps.
Last update: 2024-01-30
\ No newline at end of file
+ Overview - XAVIER Documentation
Workflow diagram of XAVIER: the pipeline is composed of a series of data processing steps to trim, align, and recalibrate reads prior to calling variants. These data processing steps closely follow GATK's best practices for cleaning up raw alignments. The pipeline also includes a series of comprehensive quality-control steps.
\ No newline at end of file
diff --git a/dev/pipeline-details/settings/index.html b/dev/pipeline-details/settings/index.html
index d7da9fb..94fc3d3 100644
--- a/dev/pipeline-details/settings/index.html
+++ b/dev/pipeline-details/settings/index.html
@@ -1 +1 @@
- Settings - XAVIER Documentation
Removes variants carrying certain VarDict status flags: (1) Germline: detected in the germline sample (passes all quality parameters); (2) LikelyLOH: detected in the germline but either lost in the tumor, OR present at 20-80% in the germline and increased to 1-opt_V (95%) in the tumor; (3) AFDiff: detected in the tumor (passes quality parameters) and present in the germline but failing quality parameters there.
varscan
pileup
-d 100000 -q 15 -Q 15
samtools mpileup arguments; max depth of 100,000; min mapping quality of 15; min base quality of 15
calling
--strand-filter 0
Do not ignore variants with >90% support on one strand
--min-var-freq 0.01
Minimum variant allele frequency threshold 0.01
--output-vcf 1
Output in VCF format
--variants 1
Report only variant (SNP/indel) positions
all
GATK SelectVariants
--exclude-filtered
Removes non-PASS variants
--discordance
Remove variants found in supplied file (same as panel-of-normals file)
Removes variants carrying certain VarDict status flags: (1) Germline: detected in the germline sample (passes all quality parameters); (2) LikelyLOH: detected in the germline but either lost in the tumor, OR present at 20-80% in the germline and increased to 1-opt_V (95%) in the tumor; (3) AFDiff: detected in the tumor (passes quality parameters) and present in the germline but failing quality parameters there.
varscan
pileup
-d 100000 -q 15 -Q 15
samtools mpileup arguments; max depth of 100,000; min mapping quality of 15; min base quality of 15
calling
--strand-filter 0
Do not ignore variants with >90% support on one strand
--min-var-freq 0.01
Minimum variant allele frequency threshold 0.01
--output-vcf 1
Output in VCF format
--variants 1
Report only variant (SNP/indel) positions
all
GATK SelectVariants
--exclude-filtered
Removes non-PASS variants
--discordance
Remove variants found in supplied file (same as panel-of-normals file)
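Read together, the pileup and calling rows in the varscan table above describe one two-stage command. A minimal sketch of how those flags combine (the reference path, BAM name, and the exact VarScan subcommand are assumptions, not the pipeline's literal invocation):

```shell
# Assemble the varscan settings from the table above into one illustrative command.
# hg38.fa, sample.bam, and the mpileup2cns subcommand are placeholders.
pileup_args="-d 100000 -q 15 -Q 15"
calling_args="--strand-filter 0 --min-var-freq 0.01 --output-vcf 1 --variants 1"
cmd="samtools mpileup $pileup_args -f hg38.fa sample.bam | java -jar VarScan.jar mpileup2cns $calling_args"
echo "$cmd"
```

The pileup arguments control what samtools feeds to VarScan (depth and quality floors), while the calling arguments control which candidate sites VarScan reports.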
Make sure you're keeping the changelog up-to-date during development. Ideally, every PR that includes a user-facing change (e.g. a new feature, bug fix, or any API change) should add a concise summary to the changelog with a link to the PR. Only approve or merge PRs that either update the changelog or have no user-facing changes.
Edit the heading for the development version to match the new version.
If needed, clean up the changelog -- fix any typos, optionally create subheadings for 'New features' and 'Bug fixes' if there are lots of changes, etc.
Make sure you're keeping the changelog up-to-date during development. Ideally, every PR that includes a user-facing change (e.g. a new feature, bug fix, or any API change) should add a concise summary to the changelog with a link to the PR. Only approve or merge PRs that either update the changelog or have no user-facing changes.
Edit the heading for the development version to match the new version.
If needed, clean up the changelog -- fix any typos, optionally create subheadings for 'New features' and 'Bug fixes' if there are lots of changes, etc.
# go to the shared pipeline directory on biowulf
cd /data/CCBR_Pipeliner/Pipelines/XAVIER
# clone the new version tag (e.g. v3.0.2) to a hidden directory
@@ -17,4 +17,4 @@
# you can verify that the symlink points to the new version with readlink
readlink -f v3.0
Versions of the ccbrpipeliner module only specify the major and minor version of each pipeline. If the new pipeline release only increments the patch number, ccbrpipeliner will use it automatically after you update the symlink as above. If you need to release a new major or minor version of a pipeline on biowulf, contact Kelly or Vishal.
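The symlink convention above can be stated as a one-line rule: the link name is the release tag with its patch component stripped. A tiny sketch (the tag value is illustrative):

```shell
# Derive the minor-version symlink name (e.g. v3.0) from a release tag (e.g. v3.0.2).
tag="v3.0.2"
minor="${tag%.*}"   # strip everything after the last dot (the patch component)
echo "$minor"       # → v3.0
```

This is why a patch release only needs the symlink updated, while a new major or minor version needs a new link name (and a ccbrpipeliner update).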
Verify that ccbrpipeliner uses the latest version with:
module load ccbrpipeliner && xavier --version
-
Last update: 2024-01-30
\ No newline at end of file
+
\ No newline at end of file
diff --git a/dev/requirements.txt b/dev/requirements.txt
index 64f9a23..62dc6a6 100644
--- a/dev/requirements.txt
+++ b/dev/requirements.txt
@@ -16,9 +16,11 @@ MarkupSafe==2.1.4
mergedeep==1.3.4
mkdocs==1.5.3
mkdocs-git-revision-date-plugin==0.3.2
+mkdocs-git-revision-date-localized-plugin
mkdocs-material==9.5.6
mkdocs-material-extensions==1.3.1
mkdocs-minify-plugin==0.8.0
+mike
packaging==23.2
paginate==0.5.6
pathspec==0.12.1
diff --git a/dev/search/search_index.json b/dev/search/search_index.json
index dfb2a09..c5bf22d 100644
--- a/dev/search/search_index.json
+++ b/dev/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"XAVIER - eXome Analysis and Variant explorER \ud83d\udd2c","text":"
XAVIER - eXome Analysis and Variant explorER. XAVIER is an open-source, reproducible, and scalable solution for analyzing Whole Exome sequencing data. Its long-term goals: to accurately call germline and somatic variants, to infer CNVs, and to boldly annotate variants like no pipeline before!
Welcome to XAVIER's documentation! This guide is the main source of documentation for users that are getting started with the XAVIER pipeline.
The xavier pipeline is composed of several inter-related subcommands to set up and run the pipeline across different systems. Each of the available subcommands performs a different function:
xavier run: Run the XAVIER pipeline with your input files.
xavier unlock: Unlocks a previous run's output directory.
XAVIER is a comprehensive whole exome-sequencing pipeline following the Broad's set of best practices. It relies on technologies like Singularity1 to maintain the highest level of reproducibility. The pipeline consists of a series of data processing and quality-control steps orchestrated by Snakemake2, a flexible and scalable workflow management system, to submit jobs to a cluster or cloud provider.
The pipeline is compatible with data generated from Illumina short-read sequencing technologies. As input, it accepts a set of FastQ or BAM files and can be run locally on a compute instance, on-premise using a cluster, or on the cloud (feature coming soon!). A user can define the method or mode of execution. The pipeline can submit jobs to a cluster using a job scheduler like SLURM, or run on AWS using Tibanna (feature coming soon!). A hybrid approach ensures the pipeline is accessible to all users.
Before getting started, we highly recommend reading through the usage section of each available sub command.
For more information about issues or troubleshooting a problem, please check out our FAQ prior to opening an issue on GitHub.
This site is a living document, created for and by members like you. XAVIER is maintained by the members of CCBR and is improved by continuous feedback! We encourage you to contribute new content and make improvements to existing content via pull request to our GitHub repository.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Make sure you're keeping the changelog up-to-date during development. Ideally, every PR that includes a user-facing change (e.g. a new feature, bug fix, or any API change) should add a concise summary to the changelog with a link to the PR. Only approve or merge PRs that either update the changelog or have no user-facing changes.
"},{"location":"release-guide/#how-to-release-a-new-version-on-github","title":"How to release a new version on GitHub","text":"
Determine the new version number according to semantic versioning guidelines.
Update CHANGELOG.md:
Edit the heading for the development version to match the new version.
If needed, clean up the changelog -- fix any typos, optionally create subheadings for 'New features' and 'Bug fixes' if there are lots of changes, etc.
Update the version in src/__init__.py.
On GitHub, go to \"Releases\" and click \"Draft a new release\". https://github.com/CCBR/XAVIER/releases/new
Choose a tag: same as the version number.
Choose the target: most likely this should be the main branch, or a specific commit hash.
Set the title as the new version number, e.g. v3.0.2
Copy and paste the release notes from the CHANGELOG into the description box.
Check the box \"Set as the latest release\".
Click \"Publish release\".
Post release chores:
Add a new \"development version\" heading to the top of CHANGELOG.md.
Bump the version number in src/__init__.py to include -dev, e.g. v3.0.2-dev if you just released v3.0.2.
"},{"location":"release-guide/#how-to-install-a-release-on-biowulf","title":"How to install a release on biowulf","text":"
After releasing a new version on GitHub:
# go to the shared pipeline directory on biowulf\ncd /data/CCBR_Pipeliner/Pipelines/XAVIER\n\n# clone the new version tag (e.g. v3.0.2) to a hidden directory\ngit clone --depth 1 --branch v3.0.2 https://github.com/CCBR/XAVIER .v3.0.2\n\n# change permissions for the new directory so anyone will be able to use the pipeline\nchown -R :CCBR_Pipeliner .v3.0.2\nchmod -R a+rX /data/CCBR_Pipeliner/Pipelines/XAVIER/.v3.0.2\n\n# if needed, remove the old symlink for the minor version number\nrm -i v3.0\n\n# recreate the symlink to point to the new latest version\nln -s .v3.0.2 v3.0\n\n# you can verify that the symlink points to the new version with readlink\nreadlink -f v3.0\n
Versions of the ccbrpipeliner module only specify the major and minor version of each pipeline. If the new pipeline release only increments the patch number, ccbrpipeliner will use it automatically after you update the symlink as above. If you need to release a new major or minor version of a pipeline on biowulf, contact Kelly or Vishal.
Verify that ccbrpipeliner uses the latest version with:
Low-quality bases and adapter sequences are trimmed from the raw sequencing reads using Trimmomatic (v. 0.39)1. Trimmed reads are then aligned to the human hg38 reference genome using BWA mapping software (v. 0.7.17)2. Duplicate reads are marked using Samblaster (v. 0.1.25)3 and sorted using samtools (v. 1.8). Finally, base quality score recalibration is performed as indicated in the GATK4 (v. 4.2.2.0) best practices 4.
HaplotypeCaller from GATK4 (v. 4.2.2.0) is used to call germline variants, parallelized across chromosomes, and all samples in the cohort are joint genotyped together 4,5.
Somatic variant calling (SNPs and Indels) is performed using Mutect (v. 1.1.7)6, Mutect2 (GATK v. 4.2.0)7, Strelka2 (v. 2.9.0)8, and VarDict (v. 1.4)9 in tumor-normal mode. Variants from all callers are merged using the CombineVariants tool from GATK version 3.8-1. Genomic, functional and consequence annotations are added using Variant Effect Predictor (VEP v. 99)10 and converted to Mutation Annotation Format (MAF) using the vcf2maf tool (v. 1.6.16)11.
For Copy Number Variants (CNVs), Control-Freec (v. 11.6)12 is used to generate pileups, which are used as input for the R package 'sequenza' (v. 3.0.0)13. The complete Control-Freec workflow is then re-run using ploidy and cellularity estimates from 'sequenza'.
SOBDetector is a tool that scores variants based on strand-orientation bias, which can be a sign of DNA damage caused by tissue fixation. This pipeline runs SOBDetector in two passes: the first pass uses the parameters provided with the software (calculated from publicly available TCGA data); cohort-specific bias metrics are then computed from those results, and SOBDetector is re-run using the cohort-specific values.
"},{"location":"pipeline-details/methods/#quality-and-identity-metrics","title":"Quality and identity metrics","text":"
Ancestry and relatedness scores are generated using Somalier (v. 0.2.13)14. Contamination analyses are performed against viral and bacterial genomes from NCBI using Kraken2 (v. 2.1.2)15, as well as against mouse, human, and UniVec databases using FastQ Screen (v. 0.14.1)16. Sequence, mapping, and variant statistics are computed using FastQC (v. 0.11.9), Qualimap (v. 2.2.1)17 and SnpEff (v. 4.3t)18. All of these metrics are combined into an interactive HTML report using MultiQC (v. 1.11)19.
Bolger, A.M., M. Lohse, and B. Usadel, Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 2014. 30(15): p. 2114-20.
Li, H. and R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 2009. 25(14): p. 1754-60.
Faust, G.G. and I.M. Hall, SAMBLASTER: fast duplicate marking and structural variant read extraction. Bioinformatics, 2014. 30(17): p. 2503-5.
Van der Auwera, G.A. and B.D. O'Connor, Genomics in the cloud: using Docker, GATK, and WDL in Terra. First edition. 2020, Sebastopol, CA: O'Reilly Media.
Poplin, R., et al., Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv, 2018: p. 201178.
Cibulskis, K., et al., Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol, 2013. 31(3): p. 213-9.
Benjamin, D., et al., Calling Somatic SNVs and Indels with Mutect2. bioRxiv, 2019: p. 861054.
Kim, S., et al., Strelka2: fast and accurate calling of germline and somatic variants. Nat Methods, 2018. 15(8): p. 591-594.
Lai, Z., et al., VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res, 2016. 44(11): p. e108.
McLaren, W., et al., The Ensembl Variant Effect Predictor. Genome Biol, 2016. 17(1): p. 122.
Memorial Sloan Kettering Cancer Center. vcf2maf. 2013; Available from: https://github.com/mskcc/vcf2maf.
Boeva, V., et al., Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics, 2012. 28(3): p. 423-5.
Favero, F., et al., Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data. Ann Oncol, 2015. 26(1): p. 64-70.
Pedersen, B. somalier: extract informative sites, evaluate relatedness, and perform quality-control on BAM/CRAM/BCF/VCF/GVCF. 2018; Available from: https://github.com/brentp/somalier.
Wood, D.E., J. Lu, and B. Langmead, Improved metagenomic analysis with Kraken 2. Genome Biol, 2019. 20(1): p. 257.
Wingett, S.W. and S. Andrews, FastQ Screen: A tool for multi-genome mapping and quality control. F1000Res, 2018. 7: p. 1338.
Okonechnikov, K., A. Conesa, and F. Garcia-Alcalde, Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics, 2016. 32(2): p. 292-4.
Cingolani, P., et al., A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin), 2012. 6(2): p. 80-92.
Ewels, P., et al., MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 2016. 32(19): p. 3047-8.
Koster, J. and S. Rahmann, Snakemake--a scalable bioinformatics workflow engine. Bioinformatics, 2018. 34(20): p. 3600.
The output files and their locations are broken down here for the XAVIER pipeline. Pre-processing and germline variant calling steps are common but somatic variant calling is dependent on whether the pipeline was run in either (A) tumor-normal pair or (B) tumor-only analysis mode. All file locations are relative to the output directory specified during the job submission.
The output directory after a complete XAVIER run should look like:
xavier_output/
├── bams
├── cluster.json # cluster info for the run
├── config
├── config.json # config file for the run
├── fastqs
├── germline
├── indels.vcf.gz[.tbi] # raw germline INDELs
├── input_files
├── intervals.list
├── {sample1}-normal.R1.fastq.gz -> /path/to/{sample1}-normal.R1.fastq.gz
├── {sample1}-normal.R2.fastq.gz -> /path/to/{sample1}-normal.R2.fastq.gz
├── {sample1}-tumor.R1.fastq.gz -> /path/to/{sample1}-tumor.R1.fastq.gz
├── {sample1}-tumor.R2.fastq.gz -> /path/to/{sample1}-tumor.R2.fastq.gz
.
.
.
├── kickoff.sh
├── logfiles
├── QC
├── resources
├── snps.vcf.gz[.tbi] # raw germline SNPs
├── somatic_paired # in case of tumor-normal paired run
├── somatic_tumor_only # in case of tumor-only run
└── workflow
Below, we describe the different folders that contain the outputs obtained for all samples from the XAVIER pipeline.
The QC folder contains all the quality-control analyses performed at different steps of the pipeline for each sample to assess sequencing quality before and after adapter trimming, microbial taxonomic composition, contamination, variant calling, etc. The final summary report and data are available in the finalQC folder. The MultiQC report also contains results from other analyses like mapping statistics, ancestry, and relatedness. It is recommended to study the MultiQC report first to get a bird's-eye view of the sequence data quality.
The bams folder contains two subfolders, chrom_split and final_bams. final_bams contains the final processed BAM files for each sample in the run, and chrom_split contains the sample BAM files split by chromosome.
This folder contains the output from the GATK Best Practices pipeline for germline variants, with a few alterations detailed below. Briefly, joint SNP and INDEL variant detection is conducted across all samples included in a pipeline run using GATK HaplotypeCaller under default settings. Raw variants are then filtered based on several GATK annotations: a strict set of criteria (QD < 2.0, FS > 60.0, MQ < 40.0, MQRankSum < -12.5, ReadPosRankSum < -8.0 for SNPs; QD < 2.0, FS > 200.0, ReadPosRankSum < -20.0 for INDELs) generates the 'combined.strictFilter.vcf'.
This call set is highly stringent, maximizing the true positive rate at the expense of an elevated false negative rate. This call set is really only intended for more general population genetic scale analyses (e.g., burden tests, admixture, linkage/pedigree based analysis, etc.) where false positives can be significantly confounding.
For human sequence data, a basic analysis of sample relatedness and ancestry (e.g., % European, African, etc.) is also performed using somalier.
The output folder looks like:
germline/
├── gVCFs
.
.
.
├── somalier # only for hg38 genome
└── VCF
The VCF folder contains the final filtered germline variants (SNPs and INDELs) for all samples combined. The folder also contains raw variants for each sample, for all samples combined, and the combined raw variants split by chromosome.
This folder contains the Snakemake log files and computational statistics for the XAVIER run. All the log files (i.e., standard output and error) for each individual step are in the slurmfiles folder. These log files are important for diagnosing errors if the pipeline fails.
This workflow calls somatic SNPs and INDELs using multiple variant detection algorithms. For each of these tools, variants are called in a paired tumor-normal fashion, with default settings. See Pipeline Details for more information about the tools used and their parameter settings.
For each sample, the resulting VCF is fully annotated using VEP and converted to a MAF file using the vcf2maf tool. The resulting MAF files are found in the maf folder within each caller's results directory (i.e., mutect2_out, strelka_out, etc.). Individual sample MAF files are then merged and saved in the merged_somatic_variants directory.
For Mutect2, we use a panel of normals (PON) developed from the ExAC dataset (excluding TCGA), filtered for variants with a population allele frequency below 0.001, plus an in-house set of blacklisted recurrent germline variants that are not found in any population database.
For Copy Number Variants (CNVs), two tools are employed in tandem. First, Control-FREEC is run with default parameters. This generates pileup files that are used by Sequenza, primarily to jointly estimate contamination and ploidy. These values are then used to run Control-FREEC a second time for improved performance.
In general, the tumor-only pipeline is a stripped-down version of the tumor-normal pipeline. We only run Mutect2, Mutect, and VarDict for somatic variant detection, with the same PON and filtering as described above for the tumor-normal pipeline.
Workflow diagram of the XAVIER pipeline: the pipeline is composed of a series of data processing steps to trim, align, and recalibrate reads prior to calling variants. These data processing steps closely follow GATK's best practices for cleaning up raw alignments. The pipeline also includes a series of comprehensive quality-control steps.
Removes variants with certain flags from VarDict: (1) Germline: detected in the germline sample (passes all quality parameters); (2) LikelyLOH: detected in the germline but either lost in the tumor OR at 20-80% in the germline but increased to 1-opt_V (95%) in the tumor; (3) AFDiff: detected in the tumor (passes quality parameters) and present in the germline but failing quality parameters there.
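As an illustration only (not the pipeline's actual implementation), this kind of flag-based filtering can be mimicked with awk on records whose INFO field carries a STATUS tag; the sample records below are made up, and the demo uses spaces where a real VCF is tab-delimited:

```shell
# Illustration with made-up records: drop calls whose STATUS is Germline,
# LikelyLOH, or AFDiff, keeping headers and all other records.
printf '%s\n' \
  '#CHROM POS ID REF ALT QUAL FILTER INFO' \
  'chr1 100 . A T 50 PASS STATUS=StrongSomatic' \
  'chr1 200 . G C 50 PASS STATUS=Germline' \
  'chr1 300 . T G 50 PASS STATUS=LikelyLOH' > /tmp/vardict_demo.vcf
# Field 8 is INFO; keep header lines and records not carrying an excluded flag
awk '/^#/ || $8 !~ /STATUS=(Germline|LikelyLOH|AFDiff)/' /tmp/vardict_demo.vcf
```

Only the header and the StrongSomatic record survive the filter.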
| Tool | Step | Parameter | Description |
| --- | --- | --- | --- |
| varscan | pileup | -d 100000 -q 15 -Q 15 | samtools mpileup arguments; max depth of 100,000; min mapping quality of 15; min base quality of 15 |
| varscan | calling | --strand-filter 0 | Do not ignore variants with >90% support on one strand |
| varscan | calling | --min-var-freq 0.01 | Minimum variant allele frequency threshold of 0.01 |
| varscan | calling | --output-vcf 1 | Output in VCF format |
| varscan | calling | --variants 1 | Report only variant (SNP/indel) positions |
| all | GATK SelectVariants | --exclude-filtered | Removes non-PASS variants |
| all | GATK SelectVariants | --discordance | Remove variants found in the supplied file (the panel-of-normals file) |
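The SelectVariants step can be sketched as follows; file names are hypothetical, and the flags are as documented for GATK4's SelectVariants, so treat this as an illustrative sketch rather than the pipeline's exact command.

```shell
# Hypothetical file names; keep only PASS records and drop any variant that
# also appears in the panel-of-normals file.
gatk SelectVariants \
    -V caller_merged.vcf.gz \
    --exclude-filtered \
    --discordance panel_of_normals.vcf.gz \
    -O caller_merged.pass.pon_filtered.vcf.gz
```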
The xavier executable is composed of several inter-related sub commands. Please see xavier -h for all available options.
This part of the documentation describes options and concepts for the xavier cache sub command in more detail. With minimal configuration, the cache sub command enables you to cache remote resources for the xavier pipeline. Caching remote resources allows the pipeline to run in an offline mode. The cache sub command can also be used to pull our pre-built reference bundles onto a new cluster or target system.
The cache sub command creates a local cache on the filesystem for resources hosted on DockerHub or AWS S3. These resources are normally pulled onto the filesystem when the pipeline runs; however, due to network issues or DockerHub pull rate limits, it may make sense to pull the resources once so a shared cache can be created and re-used. It is worth noting that a singularity cache cannot normally be shared across users. Singularity strictly enforces that its cache is owned by the user. To get around this issue, the cache sub command can be used to create local SIFs on the filesystem from images on DockerHub.
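For reference, a single SIF can be created from a DockerHub image with singularity pull; the image and paths below are stand-ins for illustration, not xavier's actual containers.

```shell
# Stand-in image and path: build a shareable SIF from a DockerHub image.
mkdir -p /data/$USER/SIFs
singularity pull /data/$USER/SIFs/ubuntu_22.04.sif docker://ubuntu:22.04
```

Because the resulting SIF is an ordinary file, its permissions (unlike the per-user singularity cache) can be opened up for other users.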
The XAVIER pipeline can be executed using either the graphical user interface (GUI) or the command line interface (CLI). The GUI offers a more interactive way for the user to provide input and adjust parameter settings. This part of the documentation describes how to run xavier using the GUI (with screenshots). See the Command Line tab to read more about the xavier executable and running the XAVIER pipeline using the CLI.
"},{"location":"usage/gui/#2-setting-up-xavier","title":"2. Setting up XAVIER","text":""},{"location":"usage/gui/#21-login-to-cluster","title":"2.1 Login to cluster","text":"
# Setup Step 1.) ssh into cluster's head node
# example below for Biowulf cluster
ssh -Y $USER@biowulf.nih.gov
"},{"location":"usage/gui/#22-grab-an-interactive-node","title":"2.2 Grab an interactive node","text":"
# Setup Step 2.) Please do not run XAVIER on the head node!
# Grab an interactive node first
sinteractive --time=12:00:00 --mem=8gb --cpus-per-task=4
NOTE: ccbrpipeliner is a custom module created on biowulf which contains various NGS data analysis pipelines developed, tested, and benchmarked by experts at CCBR.
# Setup Step 3.) Add ccbrpipeliner module
module purge # to reset the module environment
module load ccbrpipeliner
If the module was loaded correctly, the greetings message should be displayed.
[+] Loading ccbrpipeliner 5 ...
###########################################################################
CCBR Pipeliner
###########################################################################
"ccbrpipeliner" is a suite of end-to-end pipelines and tools
 Visit https://github.com/ccbr for more details.
 Pipelines are available on BIOWULF and FRCE.
 Tools are available on BIOWULF, HELIX and FRCE.

 The following pipelines/tools will be loaded in this module:

 RENEE v2.5 https://ccbr.github.io/RENEE/
 XAVIER v3.0 https://ccbr.github.io/XAVIER/
 CARLISLE v2.4 https://ccbr.github.io/CARLISLE/
 CHAMPAGNE v0.2 https://ccbr.github.io/CHAMPAGNE/
 CRUISE v0.1 https://ccbr.github.io/CRUISE/

 spacesavers2 v0.10 https://ccbr.github.io/spacesavers2/
 permfix v0.6 https://github.com/ccbr/permfix
###########################################################################
Thank you for using CCBR Pipeliner
###########################################################################
To run the XAVIER pipeline from the GUI, simply enter:
xavier_gui
and it will launch the XAVIER window.
Note: Please wait until the window created! message appears on the terminal.
"},{"location":"usage/gui/#32-folder-paths-and-reference-genomes","title":"3.2 Folder paths and reference genomes","text":"
To enter the location of the input folder containing FASTQ files and the location where the output folders should be created, either simply type the absolute paths
or use the Browse tab to choose the input and output directories
Next, from the drop down menu select the reference genome (hg38/mm10)
In case paired normal samples are unavailable, the XAVIER pipeline can be run in tumor-only mode, which does not require paired-sample information. However, in the absence of matching normal samples, CNV analysis is unavailable.
After all the information is filled out, press Submit.
If the pipeline detects no errors and the run was submitted, a new window appears showing the output of a "dry-run", which summarizes each step of the pipeline.
Click OK
A dialogue box will pop up to confirm submitting the job to SLURM.
Click Yes
The dry-run output will be displayed again and the master job will be submitted. An email notification will be sent out when the pipeline starts and ends.
The XAVIER GUI will ask whether to submit another job.
Click Yes to start again or No to close the XAVIER GUI.
Users can provide certain additional settings for the pipeline run, including running an additional step to correct strand orientation bias in Formalin-Fixed Paraffin-Embedded (FFPE) samples and providing a custom exome targets BED file. This file can be obtained from the manufacturer of the target capture kit that was used.
"},{"location":"usage/gui/#4-special-instructions-for-biowulf","title":"4. Special instructions for Biowulf","text":"
The XAVIER GUI natively uses the X11 Window System to run the XAVIER pipeline and display the graphics on a personal desktop or laptop. However, if running XAVIER on NIH's Biowulf cluster, the HPC staff recommends NoMachine (NX) for running graphics applications.
Please see details here on how to install and connect to Biowulf on your local computer using NoMachine.
Once connected to Biowulf using NX, right click to open a terminal connection
and start an interactive session.
Similar to the instructions above, load the ccbrpipeliner module and enter xavier_gui to launch the XAVIER GUI.
The xavier executable is composed of several inter-related sub commands. Please see xavier -h for all available options.
This part of the documentation describes options and concepts for the xavier run sub command in more detail. With minimal configuration, the run sub command enables you to start running the xavier pipeline.
Setting up the xavier pipeline is fast and easy! In its most basic form, xavier run only has four required inputs.
The synopsis for each command shows its parameters and their usage. Optional parameters are shown in square brackets.
A user must provide a list of FastQ or BAM files (globbing is supported) to analyze via the --input argument, an output directory to store results via the --output argument, an exome targets BED file for the samples' capture kit via the --targets argument, and a reference genome for alignment and annotation via the --genome argument.
You can always use the -h option for information on a specific command.
Each of the following arguments is required. Failure to provide a required argument will result in a non-zero exit code.
--input INPUT [INPUT ...]
Input FastQ or BAM file(s) to process. type: file(s)
One or more FastQ files can be provided. The pipeline does NOT support single-end WES data. Please provide either a set of FastQ files or a set of BAM files; the pipeline does NOT support processing a mixture of FastQ and BAM files. From the command line, each input file should be separated by a space. Globbing is supported! This makes selecting FastQ files easy. Input FastQ files should be gzipped.
Example: --input .tests/*.R?.fastq.gz
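To see what such a glob actually selects, remember that the shell expands it before xavier ever runs; a quick sanity check with hypothetical sample names:

```shell
# Hypothetical sample names: the shell expands the glob into a space-separated
# file list, which is exactly what --input receives.
demo=$(mktemp -d)
touch "$demo"/WT_S1.R1.fastq.gz "$demo"/WT_S1.R2.fastq.gz \
      "$demo"/KO_S2.R1.fastq.gz "$demo"/KO_S2.R2.fastq.gz
ls "$demo"/*.R?.fastq.gz | wc -l   # all four paired-end files match the glob
```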
--output OUTPUT
Path to an output directory. type: path
This location is where the pipeline will create all of its output files, also known as the pipeline's working directory. If the provided output directory does not exist, it will be initialized automatically.
Example: --output /data/$USER/WES_hg38
--runmode {init,dryrun,run}
Execution Process. type: string
The user should first initialize the pipeline folder by running --runmode init, then perform a dry-run to list all steps the pipeline will take with --runmode dryrun, and finally perform the full run with --runmode run.
Example: --runmode init THEN --runmode dryrun THEN --runmode run
--genome {hg38, custom.json}
Reference genome. type: string/file
This option defines the reference genome for your set of samples. On Biowulf, xavier comes bundled with pre-built reference files for human samples; however, the pipeline also accepts a pre-built resource bundle pulled with the cache sub command (coming soon). Currently, the pipeline only supports the human reference hg38; support for the mouse reference mm10 will be added soon.
Pre-built Option: Here is a list of available pre-built genomes on Biowulf: hg38.
Custom Option: For users running the pipeline outside of Biowulf, a pre-built resource bundle can be pulled with the cache sub command (coming soon). Please supply the custom reference JSON file generated by the cache sub command.
Example: --genome hg38 OR --genome /data/${USER}/hg38/hg38.json
--targets TARGETS
Exome targets BED file. type: file
This file can be obtained from the manufacturer of the target capture kit that was used.
Each of the following arguments is optional and does not need to be provided.
-h, --help
Display Help. type: boolean flag
Shows command's synopsis, help message, and an example command
Example: --help
--silent
Silence standard output. type: boolean flag
Reduces the amount of information directed to standard output when submitting master job to the job scheduler. Only the job id of the master job is returned.
Example: --silent
--mode {local,slurm}
Execution Method. type: string default: slurm
Defines the mode or method of execution. Valid mode options include: local or slurm.
local Local executions will run serially on the compute instance. This is useful for testing, debugging, or when a user does not have access to a high-performance computing environment.
slurm The slurm execution method will submit jobs to a cluster using a singularity backend. We recommend running xavier in this mode, as execution will be significantly faster in a distributed environment.
Example: --mode slurm
--job-name JOB_NAME
Set the name of the pipeline's master job. type: string default: pl:xavier
When submitting the pipeline to a job scheduler like SLURM, this option allows you to set the name of the pipeline's master job. By default, the name of the pipeline's master job is set to "pl:xavier".
--callers CALLERS [CALLERS ...]
List of variant callers to detect mutations. Please select one or more of the following options: [mutect2, mutect, strelka, vardict, varscan]. Defaults to using all variant callers.
Example: --callers mutect2 strelka varscan
--pairs PAIRS
Tumor normal pairs file. type: file
This tab-delimited file contains two columns with the names of tumor and normal pairs, one pair per line. The header of the file needs to be Tumor for the tumor column and Normal for the normal column. The base name of each sample should be listed in the pairs file. The base name of a given sample can be determined by removing the following extension from the sample's R1 FastQ file: .R1.fastq.gz. Contents of an example pairs file:
Normal Tumor
Sample4_CRL1622_S31 Sample10_ARK1_S37
Sample4_CRL1622_S31 Sample11_ACI_158_S38
Example: --pairs /data/$USER/pairs.tsv
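The base-name rule and the pairs-file layout can be checked with plain shell; the sample names below are the ones from the example above, and /tmp/pairs.tsv is just a scratch path.

```shell
# Derive a sample's base name by stripping the R1 FastQ extension,
# then write a minimal tab-delimited pairs file with the required header.
r1="Sample10_ARK1_S37.R1.fastq.gz"
base="${r1%.R1.fastq.gz}"
printf 'Normal\tTumor\n' > /tmp/pairs.tsv
printf '%s\t%s\n' "Sample4_CRL1622_S31" "$base" >> /tmp/pairs.tsv
cat /tmp/pairs.tsv
```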
--ffpe
Apply FFPE correction. type: boolean flag
Runs additional steps to correct strand orientation bias in Formalin-Fixed Paraffin-Embedded (FFPE) samples. Do NOT use this option with non-FFPE samples.
Example: --ffpe
--cnv
Call copy number variations (CNVs). type: boolean flag
CNVs will only be called from tumor-normal pairs. If this option is provided without a --pairs file, CNVs will NOT be called.
Example: --cnv
--singularity-cache SINGULARITY_CACHE
Overrides the $SINGULARITY_CACHEDIR environment variable. type: path default: --output OUTPUT/.singularity
Singularity will cache image layers pulled from remote registries. This ultimately speeds up pulling an image from DockerHub if an image layer already exists in the singularity cache directory. By default, the cache is set within the directory provided to the --output argument. Please note that this cache cannot be shared across users. Singularity strictly enforces that you own the cache directory and will return a non-zero exit code if you do not! See the --sif-cache option to create a shareable resource.
--sif-cache SIF_CACHE
Path where a local cache of SIFs is stored. type: path
Uses a local cache of SIFs on the filesystem. This SIF cache can be shared across users if permissions are set correctly. If a SIF does not exist in the SIF cache, the image will be pulled from DockerHub and a warning message will be displayed. The xavier cache sub command can be used to create a local SIF cache; please see xavier cache for more information. This option is extremely useful for avoiding DockerHub pull rate limits. It also removes any potential errors that could occur due to network issues or DockerHub being temporarily unavailable. We recommend running xavier with this option whenever possible.
Example: --sif-cache /data/$USER/SIFs
--threads THREADS
Max number of threads for each process. type: int default: 2
Max number of threads for each process. This option is most applicable when running the pipeline with --mode local. It is recommended to set this value to the maximum number of CPUs available on the host machine.
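When running with --mode local, a reasonable value for --threads can be taken from the host itself; nproc is part of GNU coreutils, and the echo line below is only an illustration of how the value would be passed.

```shell
# Query the number of CPUs available to this process and reuse it for --threads.
threads=$(nproc)
echo "suggested: --mode local --threads $threads"
```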
# Step 1.) Grab an interactive node
# Do not run on head node!
sinteractive --mem=8g --cpus-per-task=4
module purge
module load ccbrpipeliner

# Step 2A.) Initialize all resources to the output folder
xavier run --input .tests/*.R?.fastq.gz \
--output /data/$USER/xavier_hg38 \
--genome hg38 \
--targets Agilent_SSv7_allExons_hg38.bed \
--mode slurm \
--runmode init

# Step 2B.) Dry-run the pipeline
xavier run --input .tests/*.R?.fastq.gz \
--output /data/$USER/xavier_hg38 \
--genome hg38 \
--targets Agilent_SSv7_allExons_hg38.bed \
--mode slurm \
--runmode dryrun

# Step 2C.) Run the XAVIER pipeline
# The slurm mode will submit jobs to the cluster.
# It is recommended running xavier in this mode.
xavier run --input .tests/*.R?.fastq.gz \
--output /data/$USER/xavier_hg38 \
--genome hg38 \
--targets Agilent_SSv7_allExons_hg38.bed \
--mode slurm \
--runmode run
The xavier executable is composed of several inter-related sub commands. Please see xavier -h for all available options.
This part of the documentation describes options and concepts for xavier unlock sub command in more detail. With minimal configuration, the unlock sub command enables you to unlock a pipeline output directory.
If the pipeline fails ungracefully, it may be required to unlock the working directory before proceeding again. Snakemake will inform a user when it may be necessary to unlock a working directory with an error message stating: Error: Directory cannot be locked.
Please verify that the pipeline is not running before running this command. If the pipeline is currently running, the workflow manager will report that the working directory is locked. This is the default behavior of Snakemake, and it is normal. Do NOT run this command if the pipeline is still running! Please kill the master job and its child jobs prior to running this command.
Unlocking a xavier pipeline output directory is fast and easy! In its most basic form, xavier unlock only has one required input.
The synopsis for this command shows its parameters and their usage. Optional parameters are shown in square brackets.
A user must provide an output directory to unlock via the --output argument. After running the unlock sub command, you can resume the pipeline from where it left off by re-running it.
You can always use the -h option for information on a specific command.
Path to a previous run's output directory. This will remove the lock on the working directory. Please verify that the pipeline is not running before running this command. Example: --output /data/$USER/WES_hg38
# Step 0.) Grab an interactive node (do not run on head node)
sinteractive --mem=8g -N 1 -n 4
module purge
module load ccbrpipeliner

# Step 1.) Unlock a pipeline output directory
xavier unlock --output /data/$USER/xavier_hg38
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"XAVIER - eXome Analysis and Variant explorER \ud83d\udd2c","text":"
XAVIER - eXome Analysis and Variant explorER. XAVIER is an open-source, reproducible, and scalable solution for analyzing Whole Exome sequencing data. Its long-term goals: to accurately call germline and somatic variants, to infer CNVs, and to boldly annotate variants like no pipeline before!
Welcome to XAVIER's documentation! This guide is the main source of documentation for users that are getting started with the XAVIER pipeline.
The xavier pipeline is composed of several inter-related sub commands to set up and run the pipeline across different systems. Each of the available sub commands performs a different function:
xavier run: Run the XAVIER pipeline with your input files.
xavier unlock: Unlocks a previous run's output directory.
XAVIER is a comprehensive whole exome-sequencing pipeline following the Broad's set of best practices. It relies on technologies like Singularity1 to maintain the highest level of reproducibility. The pipeline consists of a series of data processing and quality-control steps orchestrated by Snakemake2, a flexible and scalable workflow management system, to submit jobs to a cluster or cloud provider.
The pipeline is compatible with data generated from Illumina short-read sequencing technologies. As input, it accepts a set of FastQ or BAM files and can be run locally on a compute instance, on-premise using a cluster, or on the cloud (feature coming soon!). A user can define the method or mode of execution. The pipeline can submit jobs to a cluster using a job scheduler like SLURM, or run on AWS using Tibanna (feature coming soon!). A hybrid approach ensures the pipeline is accessible to all users.
Before getting started, we highly recommend reading through the usage section of each available sub command.
For more information about issues or troubleshooting a problem, please check out our FAQ prior to opening an issue on GitHub.
This site is a living document, created for and by members like you. XAVIER is maintained by the members of CCBR and is improved by continuous feedback! We encourage you to contribute new content and make improvements to existing content via pull request to our GitHub repository.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Make sure you're keeping the changelog up-to-date during development. Ideally, every PR that includes a user-facing change (e.g. a new feature, bug fix, or any API change) should add a concise summary to the changelog with a link to the PR. Only approve or merge PRs that either update the changelog or have no user-facing changes.
"},{"location":"release-guide/#how-to-release-a-new-version-on-github","title":"How to release a new version on GitHub","text":"
Determine the new version number according to semantic versioning guidelines.
Update CHANGELOG.md:
Edit the heading for the development version to match the new version.
If needed, clean up the changelog -- fix any typos, optionally create subheadings for 'New features' and 'Bug fixes' if there are lots of changes, etc.
Update the version in src/__init__.py.
On GitHub, go to \"Releases\" and click \"Draft a new release\". https://github.com/CCBR/XAVIER/releases/new
Choose a tag: same as the version number.
Choose the target: most likely this should be the main branch, or a specific commit hash.
Set the title as the new version number, e.g. v3.0.2
Copy and paste the release notes from the CHANGELOG into the description box.
Check the box \"Set as the latest release\".
Click \"Publish release\".
Post release chores:
Add a new \"development version\" heading to the top of CHANGELOG.md.
Bump the version number in src/__init__.py to include -dev, e.g. v3.0.2-dev if you just released v3.0.2.
"},{"location":"release-guide/#how-to-install-a-release-on-biowulf","title":"How to install a release on biowulf","text":"
After releasing a new version on GitHub:
# go to the shared pipeline directory on biowulf
cd /data/CCBR_Pipeliner/Pipelines/XAVIER

# clone the new version tag (e.g. v3.0.2) to a hidden directory
git clone --depth 1 --branch v3.0.2 https://github.com/CCBR/XAVIER .v3.0.2

# change permissions for the new directory so anyone will be able to use the pipeline
chown -R :CCBR_Pipeliner .v3.0.2
chmod -R a+rX /data/CCBR_Pipeliner/Pipelines/XAVIER/.v3.0.2

# if needed, remove the old symlink for the minor version number
rm -i v3.0

# recreate the symlink to point to the new latest version
ln -s .v3.0.2 v3.0

# you can verify that the symlink points to the new version with readlink
readlink -f v3.0
Versions of the ccbrpipeliner module only specify the major and minor version of each pipeline. If the new pipeline release only increments the patch number, ccbrpipeliner will use it automatically after you update the symlink as above. If you need to release a new major or minor version of a pipeline on biowulf, contact Kelly or Vishal.
Verify that ccbrpipeliner uses the latest version with:
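The verification snippet appears to have been lost in extraction; a likely form, assuming the xavier executable reports its version via a standard flag:

```shell
# reload the module and check which pipeline version it provides
module load ccbrpipeliner
xavier --version
```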
Low-quality bases and adapter sequences are trimmed from the raw sequencing reads using Trimmomatic (v. 0.39)1. Trimmed reads are then aligned to the human hg38 reference genome using the BWA mapping software (v. 0.7.17)2. Duplicate reads are marked using Samblaster (v. 0.1.25)3 and sorted using samtools (v. 1.8). Finally, base quality score recalibration is performed as indicated in the GATK4 (v. 4.2.2.0) best practices4.
HaplotypeCaller from GATK4 (v. 4.2.2.0) is used to call germline variants, parallelized across chromosomes, and all samples in the cohort are jointly genotyped together4,5.
Somatic variant calling (SNPs and Indels) is performed using Mutect (v. 1.1.7)6, Mutect2 (GATK v. 4.2.0)7, Strelka2 (v. 2.9.0)8, and VarDict (v. 1.4)9 in tumor-normal mode. Variants from all callers are merged using the CombineVariants tool from GATK version 3.8-1. Genomic, functional and consequence annotations are added using Variant Effect Predictor (VEP v. 99)10 and converted to Mutation Annotation Format (MAF) using the vcf2maf tool (v. 1.6.16)11.
For Copy Number Variants (CNVs), Control-Freec (v. 11.6)12 is used to generate pileups, which are used as input for the R package 'sequenza' (v. 3.0.0)13. The complete Control-Freec workflow is then re-run using ploidy and cellularity estimates from 'sequenza'.
SOBDetector is a tool that scores variants based on strand-orientation bias, which can be a sign of DNA damage caused by fixation of tissue. This pipeline runs SOBDetector in a two-pass method. The first pass uses parameters provided with the software (calculated from publicly available data from TCGA), then cohort-specific bias metrics are computed from those results, and SOBDetector is re-run using these cohort-specific values.
"},{"location":"pipeline-details/methods/#quality-and-identity-metrics","title":"Quality and identity metrics","text":"
Ancestry and relatedness scores are generated using Somalier (v. 0.2.13)14. Contamination analyses are performed against viral and bacterial genomes from NCBI using Kraken2 (v. 2.1.2)15, as well as against mouse, human, and UniVec databases using FastQ Screen (v. 0.14.1)16. Sequence, mapping, and variant statistics are computed using FastQC (v. 0.11.9), Qualimap (v. 2.2.1)17, and SnpEff (v. 4.3t)18. All of these metrics are combined into an interactive HTML report using MultiQC (v. 1.11)19.
Bolger, A.M., M. Lohse, and B. Usadel, Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 2014. 30(15): p. 2114-20.
Li, H. and R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 2009. 25(14): p. 1754-60.
Faust, G.G. and I.M. Hall, SAMBLASTER: fast duplicate marking and structural variant read extraction. Bioinformatics, 2014. 30(17): p. 2503-5.
Van der Auwera, G.A. and B.D. O'Connor, Genomics in the cloud: using Docker, GATK, and WDL in Terra. First edition. 2020, Sebastopol, CA: O'Reilly Media.
Poplin, R., et al., Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv, 2018: p. 201178.
Cibulskis, K., et al., Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol, 2013. 31(3): p. 213-9.
Benjamin, D., et al., Calling Somatic SNVs and Indels with Mutect2. bioRxiv, 2019: p. 861054.
Kim, S., et al., Strelka2: fast and accurate calling of germline and somatic variants. Nat Methods, 2018. 15(8): p. 591-594.
Lai, Z., et al., VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res, 2016. 44(11): p. e108.
McLaren, W., et al., The Ensembl Variant Effect Predictor. Genome Biol, 2016. 17(1): p. 122.
Memorial Sloan Kettering Cancer Center. vcf2maf. 2013; Available from: https://github.com/mskcc/vcf2maf.
Boeva, V., et al., Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics, 2012. 28(3): p. 423-5.
Favero, F., et al., Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data. Ann Oncol, 2015. 26(1): p. 64-70.
Pedersen, B. somalier: extract informative sites, evaluate relatedness, and perform quality-control on BAM/CRAM/BCF/VCF/GVCF. 2018; Available from: https://github.com/brentp/somalier.
Wood, D.E., J. Lu, and B. Langmead, Improved metagenomic analysis with Kraken 2. Genome Biol, 2019. 20(1): p. 257.
Wingett, S.W. and S. Andrews, FastQ Screen: A tool for multi-genome mapping and quality control. F1000Res, 2018. 7: p. 1338.
Okonechnikov, K., A. Conesa, and F. Garcia-Alcalde, Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics, 2016. 32(2): p. 292-4.
Cingolani, P., et al., A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin), 2012. 6(2): p. 80-92.
Ewels, P., et al., MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 2016. 32(19): p. 3047-8.
Koster, J. and S. Rahmann, Snakemake: a scalable bioinformatics workflow engine. Bioinformatics, 2018. 34(20): p. 3600.
The output files and their locations are broken down here for the XAVIER pipeline. Pre-processing and germline variant calling steps are common but somatic variant calling is dependent on whether the pipeline was run in either (A) tumor-normal pair or (B) tumor-only analysis mode. All file locations are relative to the output directory specified during the job submission.
The output directory after a complete XAVIER run should look like:
xavier_output/
├── bams
├── cluster.json # cluster info for the run
├── config
├── config.json # config file for the run
├── fastqs
├── germline
├── indels.vcf.gz[.tbi] # raw germline INDELs
├── input_files
├── intervals.list
├── {sample1}-normal.R1.fastq.gz -> /path/to/{sample1}-normal.R1.fastq.gz
├── {sample1}-normal.R2.fastq.gz -> /path/to/{sample1}-normal.R2.fastq.gz
├── {sample1}-tumor.R1.fastq.gz -> /path/to/{sample1}-tumor.R1.fastq.gz
├── {sample1}-tumor.R2.fastq.gz -> /path/to/{sample1}-tumor.R2.fastq.gz
.
.
.
├── kickoff.sh
├── logfiles
├── QC
├── resources
├── snps.vcf.gz[.tbi] # raw germline SNPs
├── somatic_paired # in case of tumor-normal paired run
├── somatic_tumor_only # in case of tumor-only run
└── workflow
Below, we describe the folders containing the outputs obtained for all samples from the XAVIER pipeline.
The QC folder contains all the quality-control analyses performed at different steps of the pipeline for each sample, assessing sequencing quality before and after adapter trimming, microbial taxonomic composition, contamination, variant calling, etc. The final summary report and data are available in the finalQC folder. The MultiQC report also contains results from other analyses such as mapping statistics, ancestry, and relatedness. We recommend studying the MultiQC report first to get a bird's-eye view of the sequence data quality.
The bams folder contains two subfolders, chrom_split and final_bams. final_bams contains the final processed BAM files for each sample in the run, and chrom_split contains the sample BAM files split by chromosome.
This folder contains the output from the GATK Best Practices pipeline for germline variants, with a few alterations detailed below. Briefly, joint SNP and INDEL variant detection is conducted across all samples in a pipeline run using GATK HaplotypeCaller with default settings. Raw variants are then filtered on several GATK annotations: a strict set of criteria (QD < 2.0, FS > 60.0, MQ < 40.0, MQRankSum < -12.5, ReadPosRankSum < -8.0 for SNPs; QD < 2.0, FS > 200.0, ReadPosRankSum < -20.0 for INDELs) generates the 'combined.strictFilter.vcf'.
This call set is highly stringent, maximizing the true-positive rate at the expense of an elevated false-negative rate. It is intended primarily for population-genetic-scale analyses (e.g., burden tests, admixture, linkage/pedigree-based analysis) where false positives can be significantly confounding.
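As a sketch, the SNP criteria above correspond to a GATK4 VariantFiltration call along these lines (the file names are illustrative and the workflow's exact invocation may differ):

```shell
# hard-filter raw SNPs with the strict criteria listed above;
# failing records are flagged "strictFilter" in the FILTER column
gatk VariantFiltration \
    -V combined.snps.vcf.gz \
    --filter-name "strictFilter" \
    --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
    -O combined.snps.strictFilter.vcf.gz
```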
In the case of human sequence data, a basic analysis of sample relatedness and ancestry (e.g., % European, African, etc.) is also performed using Somalier.
The output folder looks like:
germline/
├── gVCFs
.
.
.
├── somalier # only for hg38 genome
└── VCF
The VCF folder contains the final filtered germline variants (SNPs and INDELs) for all samples combined. The folder also contains raw variants for each sample, raw variants for all samples combined, and the combined raw variants split by chromosome.
This folder contains the Snakemake log files and compute statistics for the XAVIER run. The log files (i.e., standard output and error) for each individual step are in the slurmfiles folder. These log files are important for diagnosing errors if the pipeline fails.
This workflow calls somatic SNPs and INDELs using multiple variant detection algorithms. For each of these tools, variants are called in a paired tumor-normal fashion, with default settings. See Pipeline Details for more information about the tools used and their parameter settings.
For each sample, the resulting VCF is fully annotated using VEP and converted to a MAF file using the vcf2maf tool. The resulting MAF files are found in the maf folder within each caller's results directory (i.e., mutect2_out, strelka_out, etc.). Individual sample MAF files are then merged and saved in the merged_somatic_variants directory.
For Mutect2, we use a panel of normals (PON) developed from the ExAC (excluding TCGA) dataset, filtered for variants with allele frequency < 0.001 in the general population, and also including an in-house set of blacklisted recurrent germline variants that are not found in any population databases.
For Copy Number Variants (CNVs), two tools are employed in tandem. First, Control-FREEC is run with default parameters; this generates pileup files that Sequenza uses, primarily for jointly estimating contamination and ploidy. These values are then used to run Control-FREEC a second time for improved performance.
In general, the tumor-only pipeline is a stripped-down version of the tumor-normal pipeline. Only Mutect2, Mutect, and VarDict are run for somatic variant detection, with the same PON and filtering as described above for the tumor-normal pipeline.
Workflow diagram of the XAVIER pipeline: the pipeline is composed of a series of data processing steps to trim, align, and recalibrate reads prior to calling variants. These data processing steps closely follow GATK's best practices for cleaning up raw alignments. The pipeline also includes a series of comprehensive quality-control steps.
Removes variants with certain flags from VarDict: (1) Germline: detected in the germline sample (passes all quality parameters); (2) LikelyLOH: detected in germline but either lost in the tumor, OR at 20-80% in germline but increased to 1-opt_V (95%); (3) AFDiff: detected in the tumor (passes quality parameters) and present in germline but did not pass quality parameters.
varscan
  pileup
    -d 100000 -q 15 -Q 15 : samtools mpileup arguments; max depth of 100,000; min mapping quality of 15; min base quality of 15
  calling
    --strand-filter 0 : do not ignore variants with >90% support on one strand
    --min-var-freq 0.01 : minimum variant allele frequency threshold of 0.01
    --output-vcf 1 : output in VCF format
    --variants 1 : report only variant (SNP/indel) positions
all callers
  GATK SelectVariants
    --exclude-filtered : removes non-PASS variants
    --discordance : removes variants found in the supplied file (same as the panel-of-normals file)
The xavier executable is composed of several inter-related sub commands. Please see xavier -h for all available options.
This part of the documentation describes options and concepts for xavier cache sub command in more detail. With minimal configuration, the cache sub command enables you to cache remote resources for the xavier pipeline. Caching remote resources allows the pipeline to run in an offline mode. The cache sub command can also be used to pull our pre-built reference bundles onto a new cluster or target system.
The cache sub command creates a local cache on the filesystem for resources hosted on DockerHub or AWS S3. These resources are normally pulled onto the filesystem when the pipeline runs; however, due to network issues or DockerHub pull rate limits, it may make sense to pull the resources once so a shared cache can be created and re-used. It is worth noting that a singularity cache cannot normally be shared across users: Singularity strictly enforces that its cache is owned by the user. To get around this issue, the cache sub command can be used to create local SIFs on the filesystem from images on DockerHub.
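Conceptually, what the cache sub command does for container images boils down to pulling each DockerHub image into a world-readable SIF file (the image name and file name below are illustrative, not the pipeline's actual images):

```shell
# pull a DockerHub image into a local SIF file; unlike the per-user
# singularity layer cache, a SIF file can be made readable by everyone
singularity pull example_tool.sif docker://ubuntu:22.04
chmod a+r example_tool.sif
```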
The XAVIER pipeline can be executed using either the graphical user interface (GUI) or the command line interface (CLI). The GUI offers a more interactive way for the user to provide input and adjust parameter settings. This part of the documentation describes how to run xavier using the GUI (with screenshots). See the Command Line tab to read more about the xavier executable and running the XAVIER pipeline from the CLI.
"},{"location":"usage/gui/#2-setting-up-xavier","title":"2. Setting up XAVIER","text":""},{"location":"usage/gui/#21-login-to-cluster","title":"2.1 Login to cluster","text":"
# Setup Step 1.) ssh into cluster's head node
# example below for Biowulf cluster
ssh -Y $USER@biowulf.nih.gov
"},{"location":"usage/gui/#22-grab-an-interactive-node","title":"2.2 Grab an interactive node","text":"
# Setup Step 2.) Please do not run XAVIER on the head node!
# Grab an interactive node first
sinteractive --time=12:00:00 --mem=8gb --cpus-per-task=4
NOTE: ccbrpipeliner is a custom module created on biowulf which contains various NGS data analysis pipelines developed, tested, and benchmarked by experts at CCBR.
# Setup Step 3.) Add ccbrpipeliner module
module purge # to reset the module environment
module load ccbrpipeliner
If the module was loaded correctly, a greeting message will be displayed.
[+] Loading ccbrpipeliner 5 ...
###########################################################################
                        CCBR Pipeliner
###########################################################################
    "ccbrpipeliner" is a suite of end-to-end pipelines and tools
    Visit https://github.com/ccbr for more details.
    Pipelines are available on BIOWULF and FRCE.
    Tools are available on BIOWULF, HELIX and FRCE.

    The following pipelines/tools will be loaded in this module:

    RENEE v2.5 https://ccbr.github.io/RENEE/
    XAVIER v3.0 https://ccbr.github.io/XAVIER/
    CARLISLE v2.4 https://ccbr.github.io/CARLISLE/
    CHAMPAGNE v0.2 https://ccbr.github.io/CHAMPAGNE/
    CRUISE v0.1 https://ccbr.github.io/CRUISE/

    spacesavers2 v0.10 https://ccbr.github.io/spacesavers2/
    permfix v0.6 https://github.com/ccbr/permfix
###########################################################################
Thank you for using CCBR Pipeliner
###########################################################################
To run the XAVIER pipeline from the GUI, simply enter:
xavier_gui
and it will launch the XAVIER window.
Note: Please wait until the 'window created!' message appears on the terminal.
"},{"location":"usage/gui/#32-folder-paths-and-reference-genomes","title":"3.2 Folder paths and reference genomes","text":"
To enter the location of the input folder containing FASTQ files and the location where the output folders should be created, either simply type the absolute paths
or use the Browse tab to choose the input and output directories
Next, from the drop-down menu, select the reference genome (hg38/mm10).
If paired normal samples are unavailable, the XAVIER pipeline can be run in tumor-only mode, which does not require paired-sample information. However, in the absence of matched normal samples, CNV analysis is also unavailable.
After all the information is filled out, press Submit.
If the pipeline detects no errors and the run was submitted, a new window appears with the output of a "dry-run", which summarizes each step of the pipeline.
Click OK
A dialogue box will pop up to confirm submitting the job to SLURM.
Click Yes
The dry-run output will be displayed again and the master job will be submitted. An email notification will be sent out when the pipeline starts and ends.
The XAVIER GUI will ask whether to submit another job.
Click Yes to start again or No to close the XAVIER GUI.
Users can provide additional settings for the pipeline run, including an optional step to correct strand orientation bias in Formalin-Fixed Paraffin-Embedded (FFPE) samples, and a custom exome targets BED file. The BED file can be obtained from the manufacturer of the target capture kit that was used.
"},{"location":"usage/gui/#4-special-instructions-for-biowulf","title":"4. Special instructions for Biowulf","text":"
The XAVIER GUI natively uses the X11 Window System to run the XAVIER pipeline and display the graphics on a personal desktop or laptop. However, when running XAVIER on NIH's Biowulf cluster, the HPC staff recommend NoMachine (NX) for running graphical applications.
Please see details here on how to install and connect to Biowulf on your local computer using NoMachine.
Once connected to Biowulf using NX, right click to open a terminal connection
and start an interactive session.
Similar to the instructions above, load the ccbrpipeliner module and enter xavier_gui to launch the XAVIER GUI.
The xavier executable is composed of several inter-related sub commands. Please see xavier -h for all available options.
This part of the documentation describes options and concepts for xavier run sub command in more detail. With minimal configuration, the run sub command enables you to start running xavier pipeline.
Setting up the xavier pipeline is fast and easy! In its most basic form, xavier run only has four required inputs.
The synopsis for each command shows its parameters and their usage. Optional parameters are shown in square brackets.
A user must provide a list of FastQ or BAM files to analyze via the --input argument (globbing is supported), an output directory to store results via the --output argument, an exome targets BED file for the samples' capture kit via the --targets argument, and a reference genome for alignment and annotation via the --genome argument.
You can always use the -h option for information on a specific command.
Each of the following arguments is required. Failure to provide a required argument will result in a non-zero exit code.
--input INPUT [INPUT ...]
Input FastQ or BAM file(s) to process. type: file(s)
One or more FastQ files can be provided. The pipeline does NOT support single-end WES data. Please provide either a set of FastQ files or a set of BAM files; the pipeline does NOT support processing a mixture of the two. From the command line, each input file should be separated by a space. Globbing is supported, which makes selecting FastQ files easy. Input FastQ files should be gzipped.
Example: --input .tests/*.R?.fastq.gz
--output OUTPUT
Path to an output directory. type: path
This location is where the pipeline will create all of its output files, also known as the pipeline's working directory. If the provided output directory does not exist, it will be initialized automatically.
Example: --output /data/$USER/WES_hg38
--runmode {init,dryrun,run}
Execution Process. type: string
The pipeline is run in three stages: first initialize the pipeline folder with --runmode init, then perform a dry-run listing all steps the pipeline will take with --runmode dryrun, and finally perform the full run with --runmode run.
Example: --runmode init THEN --runmode dryrun THEN --runmode run
--genome {hg38, custom.json}
Reference genome. type: string/file
This option defines the reference genome for your set of samples. On Biowulf, xavier comes bundled with pre-built reference files for human samples; it is also worth noting that the pipeline accepts a pre-built resource bundle pulled with the cache sub command (coming soon). Currently, the pipeline only supports the human reference hg38; support for the mouse reference mm10 will be added soon.
Pre-built option: here is the list of pre-built genomes available on Biowulf: hg38.
Custom option: for users running the pipeline outside of Biowulf, a pre-built resource bundle can be pulled with the cache sub command (coming soon). Please supply the custom reference JSON file generated by the cache sub command.
Example: --genome hg38 OR --genome /data/${USER}/hg38/hg38.json
--targets TARGETS
Exome targets BED file. type: file
This file can be obtained from the manufacturer of the target capture kit that was used.
Each of the following arguments is optional and does not need to be provided.
-h, --help
Display Help. type: boolean flag
Shows command's synopsis, help message, and an example command
Example: --help
--silent
Silence standard output. type: boolean flag
Reduces the amount of information directed to standard output when submitting master job to the job scheduler. Only the job id of the master job is returned.
Example: --silent
--mode {local,slurm}
Execution Method. type: string default: slurm
Defines the mode or method of execution. Valid mode options include: local or slurm.
local: local execution runs serially on the compute instance. This is useful for testing, debugging, or when a user does not have access to a high-performance computing environment.
slurm: the slurm execution method submits jobs to a cluster using a singularity backend. We recommend running xavier in this mode, as execution will be significantly faster in a distributed environment.
Example: --mode slurm
--job-name JOB_NAME
Set the name of the pipeline's master job. type: string default: pl:xavier
When submitting the pipeline to a job scheduler, like SLURM, this option allows you to set the name of the pipeline's master job. By default, the name of the pipeline's master job is set to "pl:xavier".
--callers CALLERS [CALLERS ...]
List of variant callers to detect mutations. Please select one or more of the following options: [mutect2, mutect, strelka, vardict, varscan]. Defaults to using all variant callers.
Example: --callers mutect2 strelka varscan
--pairs PAIRS
Tumor normal pairs file. type: file
This tab-delimited file contains two columns with the names of tumor and normal pairs, one pair per line. The header of the file needs to be Tumor for the tumor column and Normal for the normal column. The base name of each sample should be listed in the pairs file. The base name of a given sample can be determined by removing the following extension from the sample's R1 FastQ file: .R1.fastq.gz. Contents of an example pairs file:
Normal Tumor
Sample4_CRL1622_S31 Sample10_ARK1_S37
Sample4_CRL1622_S31 Sample11_ACI_158_S38
Example: --pairs /data/$USER/pairs.tsv
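A quick way to create such a file from the shell, using the sample names from the example above (note that the fields must be tab-separated):

```shell
# write a tab-delimited pairs file with the required Normal/Tumor header
printf 'Normal\tTumor\n' > pairs.tsv
printf 'Sample4_CRL1622_S31\tSample10_ARK1_S37\n' >> pairs.tsv
printf 'Sample4_CRL1622_S31\tSample11_ACI_158_S38\n' >> pairs.tsv
```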
--ffpe
Apply FFPE correction. type: boolean flag
Runs additional steps to correct strand orientation bias in Formalin-Fixed Paraffin-Embedded (FFPE) samples. Do NOT use this option with non-FFPE samples.
Example: --ffpe
--cnv
Call copy number variations (CNVs). type: boolean flag
CNVs will only be called from tumor-normal pairs. If this option is provided without providing a --pairs file, CNVs will NOT be called.
Example: --cnv
--singularity-cache SINGULARITY_CACHE
Overrides the $SINGULARITY_CACHEDIR environment variable. type: path default: --output OUTPUT/.singularity
Singularity will cache image layers pulled from remote registries. This ultimately speeds up pulling an image from DockerHub if an image layer already exists in the singularity cache directory. By default, the cache is set to a .singularity directory inside the --output directory. Please note that this cache cannot be shared across users: singularity strictly enforces that you own the cache directory and will return a non-zero exit code if you do not! See the --sif-cache option to create a shareable resource.
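The same override can be applied outside of xavier by exporting the environment variable that singularity itself reads before launching a run (the path below is illustrative):

```shell
# point singularity at a per-user cache directory before running the pipeline
export SINGULARITY_CACHEDIR=/data/$USER/.singularity
mkdir -p "$SINGULARITY_CACHEDIR"
```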
--sif-cache SIF_CACHE
Path where a local cache of SIFs is stored. type: path
Uses a local cache of SIFs on the filesystem. This SIF cache can be shared across users if permissions are set correctly. If a SIF does not exist in the SIF cache, the image will be pulled from DockerHub and a warning message will be displayed. The xavier cache sub command can be used to create a local SIF cache; please see xavier cache for more information. This option is extremely useful for avoiding DockerHub pull rate limits, and it also removes any potential errors that could occur due to network issues or DockerHub being temporarily unavailable. We recommend running xavier with this option whenever possible.
Example: --sif-cache /data/$USER/SIFs
--threads THREADS
Max number of threads for each process. type: int default: 2
Max number of threads for each process. This option is most applicable when running the pipeline with --mode local. We recommend setting this value to the maximum number of CPUs available on the host machine.
# Step 1.) Grab an interactive node
# Do not run on head node!
sinteractive --mem=8g --cpus-per-task=4
module purge
module load ccbrpipeliner

# Step 2A.) Initialize all resources to the output folder
xavier run --input .tests/*.R?.fastq.gz \
           --output /data/$USER/xavier_hg38 \
           --genome hg38 \
           --targets Agilent_SSv7_allExons_hg38.bed \
           --mode slurm \
           --runmode init

# Step 2B.) Dry-run the pipeline
xavier run --input .tests/*.R?.fastq.gz \
           --output /data/$USER/xavier_hg38 \
           --genome hg38 \
           --targets Agilent_SSv7_allExons_hg38.bed \
           --mode slurm \
           --runmode dryrun

# Step 2C.) Run the XAVIER pipeline
# The slurm mode will submit jobs to the cluster.
# It is recommended running xavier in this mode.
xavier run --input .tests/*.R?.fastq.gz \
           --output /data/$USER/xavier_hg38 \
           --genome hg38 \
           --targets Agilent_SSv7_allExons_hg38.bed \
           --mode slurm \
           --runmode run
The xavier executable is composed of several inter-related sub commands. Please see xavier -h for all available options.
This part of the documentation describes options and concepts for xavier unlock sub command in more detail. With minimal configuration, the unlock sub command enables you to unlock a pipeline output directory.
If the pipeline fails ungracefully, it may be necessary to unlock the working directory before proceeding again. Snakemake will inform you when this is required with an error message stating: Error: Directory cannot be locked.
Please verify that the pipeline is not running before using this command. If the pipeline is currently running, the workflow manager will report that the working directory is locked. This is the default behavior of Snakemake, and it is normal. Do NOT run this command if the pipeline is still running! Please kill the master job and its child jobs before running this command.
Unlocking a xavier pipeline output directory is fast and easy! In its most basic form, xavier unlock only has one required input.
The synopsis for this command shows its parameters and their usage. Optional parameters are shown in square brackets.
A user must provide an output directory to unlock via --output argument. After running the unlock sub command, you can resume the build or run pipeline from where it left off by re-running it.
You can always use the -h option for information on a specific command.
Path to a previous run's output directory. This will remove a lock on the working directory. Please verify that the pipeline is not running before running this command. Example: --output /data/$USER/WES_hg38
# Step 0.) Grab an interactive node (do not run on head node)
sinteractive --mem=8g -N 1 -n 4
module purge
module load ccbrpipeliner

# Step 1.) Unlock a pipeline output directory
xavier unlock --output /data/$USER/xavier_hg38
"}]}
\ No newline at end of file
diff --git a/dev/sitemap.xml.gz b/dev/sitemap.xml.gz
index 5e1784801ceb0e03dc7602820e975642cda18479..f8782822fec34543fd682bf2cc229662d46e9449 100644
GIT binary patch
delta 13
Ucmb=gXP58h;P|#+^F;Ov03kaC82|tP
delta 13
Ucmb=gXP58h;OJJ~IFY>q031LBF8}}l
diff --git a/dev/usage/cache/index.html b/dev/usage/cache/index.html
index 0197e51..5b0bfa4 100644
--- a/dev/usage/cache/index.html
+++ b/dev/usage/cache/index.html
@@ -1,4 +1,4 @@
- xavier cache - XAVIER Documentation
The xavier executable is composed of several inter-related sub commands. Please see xavier -h for all available options.
This part of the documentation describes options and concepts for xavier cache sub command in more detail. With minimal configuration, the cache sub command enables you to cache remote resources for the xavier pipeline. Caching remote resources allows the pipeline to run in an offline mode. The cache sub command can also be used to pull our pre-built reference bundles onto a new cluster or target system.
The cache sub command creates local cache on the filesysytem for resources hosted on DockerHub or AWS S3. These resources are normally pulled onto the filesystem when the pipeline runs; however, due to network issues or DockerHub pull rate limits, it may make sense to pull the resources once so a shared cache can be created and re-used. It is worth noting that a singularity cache cannot normally be shared across users. Singularity strictly enforces that its cache is owned by the user. To get around this issue, the cache subcommand can be used to create local SIFs on the filesystem from images on DockerHub.
The xavier executable is composed of several inter-related sub commands. Please see xavier -h for all available options.
This part of the documentation describes options and concepts for xavier cache sub command in more detail. With minimal configuration, the cache sub command enables you to cache remote resources for the xavier pipeline. Caching remote resources allows the pipeline to run in an offline mode. The cache sub command can also be used to pull our pre-built reference bundles onto a new cluster or target system.
The cache sub command creates local cache on the filesysytem for resources hosted on DockerHub or AWS S3. These resources are normally pulled onto the filesystem when the pipeline runs; however, due to network issues or DockerHub pull rate limits, it may make sense to pull the resources once so a shared cache can be created and re-used. It is worth noting that a singularity cache cannot normally be shared across users. Singularity strictly enforces that its cache is owned by the user. To get around this issue, the cache subcommand can be used to create local SIFs on the filesystem from images on DockerHub.