Pipeline Steps

Description of workflow for the PaintSHOP pipeline.

Overview

The pipeline consists of three stages:

  1. First, several reference files are created, including cleaned input files and various indices.

  2. Second, the probe design pipeline is run on each chromosome, making use of the reference files.

  3. Finally, a set of final output files is generated; these constitute the pipeline endpoints.

1. Generating reference files

Parsing the genome

The first step in the workflow is to parse the raw assembly FASTA file and discover chromosome names. Based on these names, some records are excluded from the bowtie2 and jellyfish indices. Records such as alternate haplotypes and fix patches largely duplicate canonical sequence, so leaving them in the indices would cause probes to be wrongly eliminated as non-specific based on alignment and k-mer count results. All records other than presumptive canonical chromosomes are excluded from probe design, according to the following table:

| Type | Identifier | Example | Included in indices | Probes designed |
|---|---|---|---|---|
| Canonical | * | chr1 | ✔️ | ✔️ |
| Unplaced | Un_ | chrUn_KI270386v1 | ✔️ | |
| Unlocalized | _random | chr1_KI270708v1_random | ✔️ | |
| Novel sequence | _alt | chr1_KZ115747v1_alt | | |
| Alt. haplotype | _hap | chr6_ssto_hap7 | | |
| Fix patch | _fix | chr1_KN538361v1_fix | | |

NOTE: Chromosomes not identified as one of these exceptions are presumed to be canonical chromosomes and treated as such. A record of observed chromosome names and their classifications is generated in the pipeline output directory at 01_reference_files/01_chrom_names/.
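The classification rules in the table can be sketched as a small lookup. This is a minimal illustration, not the pipeline's own code: the names `ChromClass` and `classify_chrom` are hypothetical, substring matching is an assumption, and the single check mark for unplaced and unlocalized records is read here as "included in indices".

```python
from typing import NamedTuple


class ChromClass(NamedTuple):
    kind: str            # row label from the table
    in_indices: bool     # included in bowtie2 / jellyfish indices
    design_probes: bool  # probes designed against this record


# (identifier, type, included in indices), following the table.
# Substring matching and the checking order are assumptions of this sketch.
_RULES = [
    ("_alt", "novel sequence", False),
    ("_hap", "alt. haplotype", False),
    ("_fix", "fix patch", False),
    ("_random", "unlocalized", True),
    ("Un_", "unplaced", True),
]


def classify_chrom(name: str) -> ChromClass:
    """Classify a FASTA record name using the identifiers in the table."""
    for identifier, kind, in_indices in _RULES:
        if identifier in name:
            # Non-canonical records never have probes designed.
            return ChromClass(kind, in_indices, False)
    # Anything that matches no exception is presumed canonical.
    return ChromClass("canonical", True, True)
```

For example, `classify_chrom("chr6_ssto_hap7")` yields an alt. haplotype record that is left out of both the indices and probe design.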

This step creates a filtered multi-FASTA file for building the bowtie2 and jellyfish indices, as well as an individual FASTA file for each canonical chromosome so that probes can be designed for each chromosome in parallel.

Parsing genome annotations

With canonical chromosomes discovered in the previous step, the provided annotations file is loaded and filtered to include only records where:

  • the seqid field is identical to one of the presumptive canonical chromosomes

  • the feature field is equal to exon

The remaining records constitute the set of annotations that will be intersected with DNA probes to design the isoform-resolved RNA FISH probes. These annotations are split by chromosome for parallel processing during downstream steps.
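The two filter conditions above can be sketched as follows. This assumes a tab-delimited, nine-column GTF/GFF-style file in which the seqid is the first column and the feature type is the third; the function name `filter_exons` is hypothetical and not taken from the pipeline.

```python
from collections import defaultdict


def filter_exons(lines, canonical):
    """Keep exon records on canonical chromosomes, grouped by chromosome."""
    by_chrom = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete nine-column record
        seqid, feature = fields[0], fields[2]
        if seqid in canonical and feature == "exon":
            by_chrom[seqid].append(fields)
    return dict(by_chrom)
```

Grouping by seqid while filtering gives the per-chromosome split in a single pass over the file.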

Isoform flattening

For RNA FISH probe design, when it is known which isoform(s) should be targeted, the annotation file is useful as is. However, it is often desirable to obtain RNA FISH probes for a particular target without specifying isoform information.

For instance, probes designed against an exon that appears only on a very rare isoform are unlikely to be useful against most of the transcripts for this target. To remedy this, this step implements an algorithm to collapse exon annotations to those segments shared by the maximal number of isoforms, when possible.

This step generates an additional annotation file with isoforms flattened to shared segments. Each of these annotation files is intersected with the DNA probe set to produce the corresponding (isoform-resolved or isoform-flattened) RNA probe set.
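The text does not spell out the flattening algorithm, so the following is only one possible reading of "segments shared by the maximal number of isoforms": a sweep over exon boundaries that keeps the stretches covered by the most isoforms of a gene. The function name, the input shape, half-open coordinates, and the assumption that exons within a single isoform do not overlap are all choices made for this sketch.

```python
def flatten_isoforms(isoform_exons):
    """Return (segments, depth): the stretches covered by the most isoforms.

    isoform_exons maps an isoform ID to a list of (start, end) exon
    intervals, treated here as half-open [start, end).
    """
    events = []
    for exons in isoform_exons.values():
        for start, end in exons:
            events.append((start, 1))   # an isoform starts covering here
            events.append((end, -1))    # and stops covering here
    events.sort()  # at equal positions, -1 sorts before +1

    # First pass: record the coverage depth of each stretch between events.
    stretches = []
    depth, prev = 0, None
    for pos, delta in events:
        if prev is not None and pos > prev and depth > 0:
            stretches.append((prev, pos, depth))
        depth += delta
        prev = pos

    if not stretches:
        return [], 0
    best = max(d for _, _, d in stretches)

    # Second pass: keep maximal-depth stretches, merging adjacent ones.
    # Depth counts isoforms; it does not track which isoforms they are.
    segments = []
    for start, end, d in stretches:
        if d != best:
            continue
        if segments and segments[-1][1] == start:
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments, best
```

With three isoforms whose exons are `[(100, 200), (300, 400)]`, `[(150, 200), (300, 350)]`, and `[(300, 400), (500, 600)]`, only the stretch from 300 to 350 is covered by all three, so that is the single segment kept.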

Building bowtie2 and jellyfish indices

After candidate probe sequences are mined from the genome, they are analyzed and scored for efficiency and specificity. As part of this process, candidate probe sequences are aligned to the reference genome using the Bowtie2 NGS aligner with very sensitive parameters. A k-mer frequency analysis is also performed using the jellyfish k-mer counter. Both of these tools require building an index from the genome before querying with candidate probe sequences.

Both indices are built from the filtered multi-fasta file generated upstream. When the pipeline is executed on a computing cluster, or when multiple cores are provided, these index-building steps run in parallel with the mining of candidate probe sequences.
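As a rough sketch of what the two index builds involve, the helpers below assemble `bowtie2-build` and `jellyfish count` command lines and run them side by side. The function names, file paths, k-mer length, and hash size are placeholders chosen for illustration; the values the pipeline actually uses are not stated here, and the pipeline's own scheduling may differ from this thread-based approach.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def bowtie2_build_cmd(fasta, prefix, threads=1):
    """Command line for building a bowtie2 index from a FASTA file."""
    return ["bowtie2-build", "--threads", str(threads), fasta, prefix]


def jellyfish_count_cmd(fasta, out_jf, k=18, hash_size="100M", threads=1):
    """Command line for counting k-mers; k and hash_size are placeholders."""
    return ["jellyfish", "count", "-m", str(k), "-s", hash_size,
            "-t", str(threads), "-o", out_jf, fasta]


def build_indices(fasta, bt2_prefix, jf_path, threads=1):
    """Run both index builds concurrently and wait for both to finish."""
    cmds = [
        bowtie2_build_cmd(fasta, bt2_prefix, threads),
        jellyfish_count_cmd(fasta, jf_path, threads=threads),
    ]
    with ThreadPoolExecutor(max_workers=len(cmds)) as pool:
        # check=True raises CalledProcessError if a tool exits non-zero.
        futures = [pool.submit(subprocess.run, cmd, check=True) for cmd in cmds]
        for future in futures:
            future.result()
```

Calling `build_indices` requires both tools to be installed and on the `PATH`; the two command builders can be inspected without running anything.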

2. Probe design pipeline

Mining candidate probes

Candidate probe sequences are mined from the genome using OligoMiner with "newBalance" parameter values. For more information on the mining of candidate probe sequences, see the OligoMiner publication.

Scoring candidate probes

After mining candidate probe sequences, a series of steps is performed to score candidate sequences for specificity. These steps are described in depth in the PaintSHOP pre-print.

Briefly, probes are aligned to the genome using bowtie2, pairwise alignments are reconstructed using sam2pairwise, and k-mer frequency is determined using jellyfish. A gradient boosting regression model implemented with XGBoost generates quantitative predictions about the likelihood of the candidates hybridizing with sequences other than their intended target in the genome using a thermodynamic partition function. These scores are aggregated into an on-target and off-target score for each probe in the set. Here is a schematic overview of the machine learning pipeline:

3. Generating output files

DNA probes

After the pipeline is run on each chromosome, DNA FISH probes exist as per-chromosome .tsv files. These files are merged into a single file which constitutes the DNA FISH probe .tsv output file. This file is also used in subsequent steps.
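The merge itself amounts to concatenating the per-chromosome files. A minimal sketch, assuming the files have no header row and each ends with a newline (the function name and file names are hypothetical):

```python
import shutil
from pathlib import Path


def merge_tsvs(chrom_files, merged_path):
    """Concatenate per-chromosome .tsv files, in the given order, into one file."""
    with open(merged_path, "wb") as out:
        for path in chrom_files:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)  # stream bytes without loading the file
    return Path(merged_path)
```

Passing the files in chromosome order keeps the merged output ordered the same way; if the real files carry a header row, it would need to be skipped for every file after the first.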

RNA probes

The merged DNA probes are intersected with both the isoform-resolved and isoform-flattened annotation files, which generates the two RNA probe sets. For more information on these probe sets, see the output file specification.
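Conceptually, the intersection pairs each DNA probe with the annotation segments it falls in. The sketch below requires a probe to lie entirely within a segment; whether the pipeline uses full containment or any overlap is not stated here, so that rule, the tuple layouts, and the function name are assumptions.

```python
def intersect_probes(probes, segments):
    """Pair each probe with every annotation segment that fully contains it.

    probes:   iterable of (chrom, start, end, sequence)
    segments: iterable of (chrom, start, end, label)
    """
    # Index segments by chromosome so each probe is compared only
    # against segments on its own chromosome.
    by_chrom = {}
    for chrom, start, end, label in segments:
        by_chrom.setdefault(chrom, []).append((start, end, label))

    hits = []
    for chrom, p_start, p_end, seq in probes:
        for s_start, s_end, label in by_chrom.get(chrom, []):
            if s_start <= p_start and p_end <= s_end:
                hits.append((chrom, p_start, p_end, seq, label))
    return hits
```

Running the same function once with the isoform-resolved segments and once with the isoform-flattened segments would give the two RNA probe sets described above.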

Zip archives

Each of the three completed probe sets is also compressed into a zip archive for convenience. These archives are the files offered as downloads in the PaintSHOP Resources repo. Together, the three compressed files contain the complete set of generated DNA and RNA FISH probes.