Add a description of the toolkit to the README #34

Merged
merged 34 commits into from
Aug 13, 2024
Commits
798a5ac
Add a description of the toolkit to the README
clintval Nov 22, 2023
e3e3dc1
Generate docs files
Mar 13, 2024
0ade11f
Merge remote-tracking branch 'origin/main' into cv_README
clintval May 10, 2024
531a122
Generate docs files
May 10, 2024
17e55e4
Fix up README after a review
clintval May 10, 2024
983dbb7
Fix up README after a review
clintval May 10, 2024
5a903a0
Remove outdated intro in Overview
clintval May 10, 2024
a62b275
Fixup a sentence
clintval May 10, 2024
a8cc755
Whitespace
clintval May 10, 2024
9304a02
Generate docs files
May 10, 2024
475c3e9
Generate docs files
May 10, 2024
c2ca29a
Remove duplicate .gitignore line
clintval May 10, 2024
0090746
Generate docs files
May 10, 2024
7067311
Small review fixups
clintval May 10, 2024
9a45075
Generate docs files
May 10, 2024
af562e6
docs: revise docs based on @msto review
clintval Jul 23, 2024
818d158
Generate docs files
Jul 23, 2024
a29d0ca
docs: small docs fixups for clarity and formatting
clintval Jul 23, 2024
c6a7e11
Generate docs files
Jul 23, 2024
8d80f2a
docs: one more pass at docs clarity!
clintval Jul 23, 2024
88fff29
chore: query group and template definition
clintval Jul 23, 2024
697b07d
docs: move reference down
clintval Jul 23, 2024
34e9faf
docs: do not repeat thyself
clintval Jul 23, 2024
ecf1df4
Generate docs files
Jul 23, 2024
09ead37
docs: little fixup
clintval Jul 23, 2024
45a17c1
Generate docs files
Jul 23, 2024
ef7f8f2
docs: formatting to be the same
clintval Jul 23, 2024
ac5b334
Generate docs files
Jul 23, 2024
f56f0e7
chore: header fixup
clintval Jul 23, 2024
2297324
chore: header fixup
clintval Jul 23, 2024
51557f2
Generate docs files
Jul 23, 2024
df8ead1
Generate docs files
Jul 23, 2024
3f52590
docs: suit review from @nh13
clintval Aug 13, 2024
ddd45aa
Generate docs files
Aug 13, 2024
73 changes: 70 additions & 3 deletions README.md
@@ -8,7 +8,7 @@

[bioconda-badge-link]: https://img.shields.io/conda/dn/bioconda/fgsv.svg?label=Bioconda
[bioconda-link]: http://bioconda.github.io/recipes/fgsv/README.html
[github-badge]: https://github.com/fulcrumgenomics/fgsv/actions/workflows/unittests.yaml/badge.svg
[github-badge]: https://github.com/fulcrumgenomics/fgsv/actions/workflows/unittests.yaml/badge.svg?branch=main
[github-link]: https://github.com/fulcrumgenomics/fgsv/actions/workflows/unittests.yaml
[scala-badge]: https://img.shields.io/badge/language-scala-c22d40.svg
[scala-link]: https://www.scala-lang.org/
@@ -17,8 +17,75 @@
[doi-badge]: https://zenodo.org/badge/454071954.svg
[doi-link]: https://zenodo.org/doi/10.5281/zenodo.10452647

Tools to find evidence for structural variation.
Tools to gather evidence for structural variation via breakpoint detection.

## Documentation

Documentation can be found in the [docs folder](docs/01_Introduction.md)
Documentation can be found in the [docs folder](docs/01_Introduction.md).

## Introduction to the `fgsv` Toolkit

The `fgsv` toolkit contains tools for effective structural variant debugging but is not meant to be used as a structural variant calling toolchain in and of itself.
Instead, it is better to think of `fgsv` as an effective breakpoint detection and structural variant exploration toolkit.

When describing structural variation, we use the term breakpoint to mean a junction between two loci and the term breakend to refer to one of the loci in a breakpoint.
Importantly, all point intervals (1-length) reported by this toolkit are 1-based inclusive from the perspective of the reference sequence.
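
As a minimal sketch of what "1-based inclusive" means for a point interval, the snippet below converts a reported position to the 0-based half-open form used by BED-like formats and back. The function names are hypothetical and are not part of `fgsv`.

```python
from typing import Tuple


def point_to_half_open(pos: int) -> Tuple[int, int]:
    """Convert a 1-based inclusive point to a 0-based half-open interval."""
    if pos < 1:
        raise ValueError("1-based positions start at 1")
    # The single base at 1-based position `pos` is covered by [pos - 1, pos).
    return (pos - 1, pos)


def half_open_to_point(start: int, end: int) -> int:
    """Convert a length-1, 0-based half-open interval back to a 1-based point."""
    if end - start != 1:
        raise ValueError("expected an interval of length 1")
    return end
```

For instance, the first base of a contig is position 1 in this toolkit's output and the interval `(0, 1)` in 0-based half-open coordinates.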

### `fgsv SvPileup`

Collates a pileup of putative structural variant supporting reads.

```console
fgsv SvPileup \
--input sample.bam \
--output sample.svpileup
```

The tool [`fgsv SvPileup`](https://github.com/fulcrumgenomics/fgsv/blob/main/docs/tools/SvPileup.md) takes a query-grouped BAM file as input and scans through each template one at a time, where a template is the full collection of reads and alignments from a single source molecule.
For example, a paired-end read may have an alignment per read: one alignment for read 1 and another alignment for read 2.

Primary and supplementary alignments for a template (see the [SAM Format Specification v1](https://samtools.github.io/hts-specs/SAMv1.pdf) for more information) are used to construct a “chain” of aligned sub-segments in a way that honors the logical ordering of sub-segments and their strandedness in relation to the reference sequence.
These aligned sub-segments in a chain relate to each other through typical alignment mechanisms like insertions and deletions but also contain information about the relative orientation of the sub-segment to the reference sequence and importantly, jumps between reference sequences such as translocations between chromosomes or contigs.

For each chain of aligned sub-segments per template, outlier jumps are collected where the minimum inter-segment distance within a read must be 100bp (by default) or greater, and the minimum inter-read distance across reads (e.g. between reads in a paired-end read) must be 1000bp (by default) or greater.
In the case where there is both evidence for a split-read alignment and inter-read jump, the split-read alignment evidence is favored since it gives a precise breakpoint.
At locations where these jumps occur, breakpoints are marked and the breakpoints are given a unique ID based on the positions of the breakends and the directionality of the left and right strands leading into each breakend.
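
The jump-collection step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the `SvPileup` implementation: the `Segment` type and function names are hypothetical, strands are assumed to be already normalized to a common template orientation, and any contig or strand change between consecutive sub-segments is treated as a jump. Only the two default distances (100bp within a read, 1000bp across reads) come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    """One aligned sub-segment of a template (hypothetical type for this sketch)."""
    contig: str
    start: int     # 1-based inclusive
    end: int       # 1-based inclusive
    strand: str    # "+" or "-", assumed normalized to template orientation
    read_num: int  # 1 or 2


def is_jump(a: Segment, b: Segment, min_intra: int = 100, min_inter: int = 1000) -> bool:
    """Return True if moving from segment a to segment b looks like an outlier jump."""
    # Assumption: a change of contig or strand is always a jump.
    if a.contig != b.contig or a.strand != b.strand:
        return True
    # Reference distance between the two segments (negative if they overlap).
    gap = max(b.start - a.end, a.start - b.end)
    # Use the within-read threshold for split reads, the across-read one otherwise.
    threshold = min_intra if a.read_num == b.read_num else min_inter
    return gap >= threshold


def find_jumps(chain: List[Segment]) -> List[Tuple[Segment, Segment]]:
    """Collect consecutive segment pairs in a chain that are separated by a jump."""
    return [(a, b) for a, b in zip(chain, chain[1:]) if is_jump(a, b)]
```

A split read whose two sub-segments sit 150bp apart on the same contig and strand would be reported here, while a read pair separated by 250bp would not.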

This process creates a collection of candidate breakpoint locations.
The output of this tool is a metrics file tabulating the breakpoints and a BAM file with each breakpoint-supporting alignment having custom tags that indicate which breakpoint the alignment supports.

### `fgsv AggregateSvPileup`

Merges nearby pileups of reads supporting putative breakpoints.

```console
fgsv AggregateSvPileup \
--bam sample.bam \
--input sample.svpileup.txt \
--output sample.svpileup.aggregate.txt
```

Because of variability in typical short-read alignments, evidence for a single breakpoint may span a few loci near the true breakend loci. For example, if the breakpoint only has inter-read evidence, then the breakpoint could occur anywhere within the unobserved bases between read 1 and read 2 in a pair. In other cases, due to sequence similarity or homology between the breakend loci, it is not always possible to locate the exact nucleotide where the breakends occur, and instead a plausible region may exist that supports either breakend locus.

The tool [`fgsv AggregateSvPileup`](https://github.com/fulcrumgenomics/fgsv/blob/main/docs/tools/AggregateSvPileup.md) is used to coalesce nearby breakpoints into one event if they appear to belong to one true breakpoint.
This polishing step preserves true positive breakpoint events and intends to reduce the number of false positive breakpoint events.

Adjacent breakpoints are only merged if their left breakends map to the same reference sequence, their right breakends map to the same reference sequence, the strandedness of the left and right aligned sub-segments is the same, and their left and right genomic breakend positions are both within a given length threshold.
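
The merge conditions listed above can be written as a single predicate. The sketch below is illustrative only: the `Breakpoint` type, the `can_merge` function, and the `max_dist` parameter are hypothetical names, and treating the threshold as inclusive is an assumption. Field names follow the `SvPileup` metrics columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Breakpoint:
    """A candidate breakpoint (hypothetical type mirroring the metrics columns)."""
    left_contig: str
    left_pos: int
    left_strand: str
    right_contig: str
    right_pos: int
    right_strand: str


def can_merge(a: Breakpoint, b: Breakpoint, max_dist: int) -> bool:
    """Return True if two breakpoints satisfy all of the merge conditions."""
    return (
        a.left_contig == b.left_contig
        and a.right_contig == b.right_contig
        and a.left_strand == b.left_strand
        and a.right_strand == b.right_strand
        # Both breakends must be close; the inclusive comparison is an assumption.
        and abs(a.left_pos - b.left_pos) <= max_dist
        and abs(a.right_pos - b.right_pos) <= max_dist
    )
```

Requiring every condition at once keeps breakpoints with the same positions but different strand configurations, such as an inversion and a deletion, from being collapsed into one event.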

One shortcoming of the existing behavior, which should be corrected at some point, is that intra-read breakpoint evidence is considered similarly to inter-pair breakpoint evidence even though intra-read breakpoint evidence often has nucleotide-level alignment resolution and inter-pair breakpoint evidence does not.

The output of this tool is a metrics file tabulating the coalesced breakpoints with all previous breakpoint IDs listed for the new breakpoint event and an estimation of the allele frequency of the event based on the alignments that support the breakpoint.

### `fgsv AggregateSvPileupToBedPE`

Converts the output of `AggregateSvPileup` to BEDPE.

```console
fgsv AggregateSvPileupToBedPE \
--input sample.svpileup.aggregate.txt \
--output sample.svpileup.aggregate.bedpe
```

The tool [`fgsv AggregateSvPileupToBedPE`](https://github.com/fulcrumgenomics/fgsv/blob/main/docs/tools/AggregateSvPileupToBedPE.md) is used to convert the output of `AggregateSvPileup` to the [BEDPE format](https://bedtools.readthedocs.io/en/latest/content/general-usage.html#bedpe-format) so that it can be viewed in [IGV](https://igv.org/) and other BEDPE-supporting genome browsers. For example:

![BEDPE in IGV](docs/img/fgsv-bedpe.png)
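
To give a feel for the conversion, here is a rough sketch of how one aggregated record might map onto the ten standard BEDPE columns. It is not the tool's implementation: the `to_bedpe` function is hypothetical, the `.` placeholder in the score column is an assumption, and the shift from 1-based inclusive to 0-based half-open coordinates follows the usual BED convention. The input keys are the column names of the aggregated metrics file.

```python
from typing import Dict


def to_bedpe(row: Dict[str, str]) -> str:
    """Format one aggregated breakpoint record as a tab-separated BEDPE line."""
    fields = [
        row["left_contig"],
        int(row["left_min_pos"]) - 1,   # 1-based inclusive start -> 0-based start
        int(row["left_max_pos"]),       # inclusive end equals half-open end
        row["right_contig"],
        int(row["right_min_pos"]) - 1,
        int(row["right_max_pos"]),
        row["id"],                      # BEDPE name column
        ".",                            # score placeholder (assumption)
        row["left_strand"],
        row["right_strand"],
    ]
    return "\t".join(str(f) for f in fields)
```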

pretty!

12 changes: 0 additions & 12 deletions docs/01_Introduction.md
@@ -6,15 +6,3 @@ The following sections will help you to get started.
* [Contributing](03_Contributing.md)
* [Metric Descriptions](04_Metrics.md)
* [Tools Descriptions](05_Tools.md)

## Overview

`fgsv` contains tools for gathering evidence for structural variants
from aligned reads. The `SvPileup` tool searches for split read mappings
and read pairs that map across breakpoints, emitting verbose information
similar to other "pileup" tools for small variant detection, but in this
case for structural variation detection. The `AggregateSvPileup` tool attempts
to aggregate information across "nearby" pileups, which is useful as
the genomic start and end of a breakpoint are often imprecise. The tools
aim to be as sensitive as possible in finding this evidence, but perform
neither structural variation calling nor genotyping.
12 changes: 6 additions & 6 deletions docs/04_Metrics.md
@@ -29,12 +29,12 @@ Aggregated cluster of breakpoint pileups
|id|String|Combined ID retaining the IDs of all constituent breakpoints|
|category|BreakpointCategory|Breakpoint category|
|left_contig|String|Contig name for left side of breakpoint|
|left_min_pos|Int|Minimum coordinate of left breakends (1-based)|
|left_max_pos|Int|Maximum coordinate of left breakends (1-based)|
|left_min_pos|Int|Minimum coordinate of left breakends (1-based inclusive)|
|left_max_pos|Int|Maximum coordinate of left breakends (1-based inclusive)|
|left_strand|Char|Strand at left breakends|
|right_contig|String|Contig name for right side of breakpoint|
|right_min_pos|Int|Minimum coordinate of right breakends (1-based)|
|right_max_pos|Int|Maximum coordinate of right breakends (1-based)|
|right_min_pos|Int|Minimum coordinate of right breakends (1-based inclusive)|
|right_max_pos|Int|Maximum coordinate of right breakends (1-based inclusive)|
|right_strand|Char|Strand at right breakends|
|split_reads|Int|Total number of split reads supporting the breakpoints in the cluster|
|read_pairs|Int|Total number of read pairs supporting the breakpoints in the cluster|
@@ -82,10 +82,10 @@ the only information comes from read-pairs and the breakpoint information should
|------|----|-----------|
|id|String|An ID assigned to the breakpoint that can be used to lookup supporting reads in the BAM.|
|left_contig|String|The contig or chromosome on which the left-hand side of the breakpoint exists.|
|left_pos|Int|The position (possibly imprecise) of the left-hand breakend (1-based).|
|left_pos|Int|The position (possibly imprecise) of the left-hand breakend (1-based, inclusive).|
|left_strand|Char|The strand of the left-hand breakend; sequence reads would traverse this strand in order to arrive at the breakend and transit into the right-hand side of the breakpoint.|
|right_contig|String|The contig or chromosome on which the right-hand side of the breakpoint exists.|
|right_pos|Int|The position (possibly imprecise) of the right-hand breakend (1-based).|
|right_pos|Int|The position (possibly imprecise) of the right-hand breakend (1-based, inclusive).|
|right_strand|Char|The strand of the right-hand breakend; sequence reads would continue reading onto this strand after transiting the breakpoint from the left breakend.|
|split_reads|Int|The number of templates/inserts with split-read alignments that identified this breakpoint.|
|read_pairs|Int|The number of templates/inserts with read-pair alignments (and without split-read alignments) that identified this breakpoint.|
Binary file added docs/img/fgsv-bedpe.png
2 changes: 1 addition & 1 deletion docs/tools/AggregateSvPileup.md
@@ -36,7 +36,7 @@ of the overlapping target regions are copied from the `SvPileup` input (if prese
The output file is a tab-delimited table with one record per aggregated cluster of pileups. Aggregated
pileups are reported with the minimum and maximum (inclusive) coordinates of all pileups in the cluster, a
possible putative structural variant event type supported by the pileups, and the sum of read support from all
pileups in the cluster. Positions in this file are 1-based positions.
pileups in the cluster. Positions in this file are 1-based inclusive positions.

## Arguments

2 changes: 1 addition & 1 deletion docs/tools/SvPileup.md
@@ -15,7 +15,7 @@ Two output files will be created:

1. `<output-prefix>.txt`: a tab-delimited file describing SV pileups, one line per breakpoint event. The returned
breakpoint will be canonicalized such that the "left" side of the breakpoint will have the lower (or equal to)
position on the genome vs. the "right" side. Positions in this file are 1-based positions.
position on the genome vs. the "right" side. Positions in this file are 1-based inclusive positions.
2. `<output-prefix>.bam`: a SAM/BAM file containing reads that contain SV breakpoint evidence annotated with SAM
tag.

2 changes: 1 addition & 1 deletion docs/tools/index.md
@@ -4,7 +4,7 @@ title: fgsv tools

# fgsv tools

The following tools are available in fgsv version 0.2.0-d603e95.
The following tools are available in fgsv version 0.2.0-c2ca29a.
## Breakpoint and SV Tools

Primary tools for calling and transforming breakpoints and SVs.