Pipeline Steps

Description of workflow for the PaintSHOP pipeline.

Overview

The pipeline consists of three stages:

First, several reference files are created, including cleaned input files and various indices.
Second, the probe design pipeline is run on each chromosome, making use of the reference files.
Finally, a set of output files is generated; these constitute the pipeline endpoints.

1. Generating reference files

Parsing the genome

The first step in the workflow is to parse the raw assembly fasta file and discover chromosome names. Based on these names, some records are excluded from bowtie2 and jellyfish indices to prevent wrongly eliminating probes as non-specific based on alignment and k-mer count results. All but presumptive canonical chromosomes are excluded from probe design, according to the following table:

Type	Identifier	Example	Included in indices	Probes designed
Canonical	*	chr1	✔️	✔️
Unplaced	Un_	chrUn_KI270386v1	✔️	❌
Unlocalized	_random	chr1_KI270708v1_random	✔️	❌
Novel sequence	_alt	chr1_KZ115747v1_alt	❌	❌
Alt. haplotype	_hap	chr6_ssto_hap7	❌	❌
Fix patch	_fix	chr1_KN538361v1_fix	❌	❌

NOTE: Chromosomes not identified as one of these exceptions are presumed to be canonical chromosomes and treated as such. A record of observed chromosome names and their classifications is generated in the pipeline output directory at 01_reference_files/01_chrom_names/.

This step creates a filtered multi-fasta file, from which the bowtie2 and jellyfish indices are built, as well as an individual fasta file for each canonical chromosome so that probes can be designed for each chromosome in parallel.
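The classification in the table above can be sketched as a simple lookup on the record name. This is a minimal illustration, not the pipeline's own code: the function name, the rule list, and the choice of plain substring matching are assumptions; only the identifiers and the included/excluded flags come from the table.

```python
# Identifier substrings and flags are taken from the table above.
# RULES and classify_chromosome are hypothetical names for illustration.
RULES = [
    # (type, identifier, included in indices, probes designed)
    ("unplaced", "Un_", True, False),
    ("unlocalized", "_random", True, False),
    ("novel sequence", "_alt", False, False),
    ("alt. haplotype", "_hap", False, False),
    ("fix patch", "_fix", False, False),
]


def classify_chromosome(name):
    """Return (type, in_indices, design_probes) for one fasta record name."""
    for kind, identifier, in_indices, design_probes in RULES:
        # Assumption: an identifier anywhere in the name marks the exception.
        if identifier in name:
            return kind, in_indices, design_probes
    # Names matching no exception are presumed canonical.
    return "canonical", True, True


names = [
    "chr1",
    "chrUn_KI270386v1",
    "chr1_KI270708v1_random",
    "chr1_KZ115747v1_alt",
    "chr6_ssto_hap7",
    "chr1_KN538361v1_fix",
]
for name in names:
    print(name, classify_chromosome(name))
```

Records flagged as included in the indices would go into the filtered multi-fasta file, and only those flagged for probe design would be written out as individual fasta files.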

Parsing genome annotations

With canonical chromosomes discovered in the previous step, the provided annotations file is loaded and filtered to include only records where:

the seqid field is identical to one of the presumptive canonical chromosomes
the feature field is equal to exon

The remaining records constitute the set of annotations that will be intersected with DNA probes to design the isoform-resolved RNA FISH probes. These annotations are split by chromosome for parallel processing during downstream steps.
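The two filter conditions and the per-chromosome split can be sketched as follows. This assumes a tab-separated, nine-column GTF/GFF-style file with the sequence ID in the first column and the feature type in the third; the function name and the sample records are hypothetical.

```python
from collections import defaultdict


def filter_exons(lines, canonical):
    """Keep exon records on canonical chromosomes, grouped by chromosome."""
    by_chrom = defaultdict(list)
    for line in lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete annotation record
        seqid, feature = fields[0], fields[2]
        if seqid in canonical and feature == "exon":
            by_chrom[seqid].append(fields)
    return dict(by_chrom)


# Made-up records for illustration only.
records = [
    "# example header",
    "chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id g1",
    "chr1\tsrc\tCDS\t120\t200\t.\t+\t0\tgene_id g1",
    "chr1_KZ115747v1_alt\tsrc\texon\t10\t90\t.\t+\t.\tgene_id g2",
    "chr2\tsrc\texon\t500\t650\t.\t-\t.\tgene_id g3",
]
exons = filter_exons(records, canonical={"chr1", "chr2"})
print({chrom: len(rows) for chrom, rows in exons.items()})
```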

Isoform flattening

For RNA FISH probe design, when it is known which isoform(s) should be targeted, the annotation file is useful as is. However, it is often desirable to obtain RNA FISH probes for a particular target without specifying isoform information.

For instance, probes designed against an exon that appears only in a very rare isoform are unlikely to be useful against most of the transcripts for this target. To remedy this, this step implements an algorithm that collapses exon annotations to the segments shared by the maximal number of isoforms, when possible.

This step generates an additional annotation file with isoforms flattened to shared segments. Each of these annotation files is intersected with the DNA probe set to produce the corresponding (isoform-resolved or isoform-flattened) RNA probe set.
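One way to picture the idea of flattening is a sweep over exon boundaries that counts how many isoforms cover each segment and keeps the segments with the highest count. The sketch below is an illustration of that idea only, not the pipeline's actual algorithm; the function name, the half-open coordinate convention, and the input layout are all assumptions.

```python
def flatten_isoforms(isoforms):
    """Return the segments covered by the largest number of isoforms.

    isoforms maps a transcript ID to a list of (start, end) exon intervals,
    half-open, with no overlaps within a single transcript (assumed).
    """
    events = []
    for exons in isoforms.values():
        for start, end in exons:
            events.append((start, 1))   # an isoform begins covering here
            events.append((end, -1))    # an isoform stops covering here
    events.sort()

    # Sweep left to right, recording each covered segment with its depth.
    segments = []
    depth = 0
    prev = None
    for pos, delta in events:
        if prev is not None and pos > prev and depth > 0:
            segments.append((prev, pos, depth))
        depth += delta
        prev = pos
    if not segments:
        return []

    # Keep only the segments shared by the maximal number of isoforms.
    best = max(d for _, _, d in segments)
    merged = []
    for start, end, d in segments:
        if d != best:
            continue
        if merged and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end)  # join touching segments
        else:
            merged.append((start, end))
    return merged


example = {
    "tx1": [(100, 200), (300, 400)],
    "tx2": [(150, 200), (300, 400)],
    "tx3": [(100, 200), (500, 600)],
}
print(flatten_isoforms(example))  # [(150, 200)]
```

In this example only the segment from 150 to 200 is present in all three transcripts, so it is the one retained; the exon at 500 to 600, found in a single isoform, is dropped.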

Building bowtie2 and jellyfish indices

After candidate probe sequences are mined from the genome, they are analyzed and scored for efficiency and specificity. As part of this process, candidate probe sequences are aligned to the reference genome using the Bowtie2 NGS aligner with very sensitive parameters. A k-mer frequency analysis is also performed using the jellyfish k-mer counter. Both of these tools require building an index from the genome before querying with candidate probe sequences.

Both indices are built from the filtered multi-fasta file generated upstream. When the pipeline is executed on a computing cluster, or when multiple cores are provided, these index-building steps run in parallel with the mining of candidate probe sequences.
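As a rough sketch, the two index builds could be launched side by side from Python. The command-line tools are real (`bowtie2-build` and `jellyfish count`), but the k-mer length, hash size, thread count, file names, and helper functions below are placeholder assumptions, and the thread pool is a stand-in for however the pipeline actually schedules its jobs.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def index_commands(fasta, prefix, kmer_len=18, hash_size="100M", threads=1):
    """Build the two index commands (parameter values are placeholders)."""
    bowtie2 = ["bowtie2-build", "--threads", str(threads), fasta, prefix]
    jellyfish = [
        "jellyfish", "count",
        "-m", str(kmer_len),   # k-mer length (assumed value)
        "-s", hash_size,       # initial hash size (assumed value)
        "-t", str(threads),
        "-o", f"{prefix}.jf",
        fasta,
    ]
    return [bowtie2, jellyfish]


def build_indices(commands):
    """Run the commands concurrently; raises if any of them fails."""
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        futures = [pool.submit(subprocess.run, cmd, check=True) for cmd in commands]
        for future in futures:
            future.result()


commands = index_commands("filtered_genome.fa", "genome_index")
for cmd in commands:
    print(" ".join(cmd))
# build_indices(commands)  # uncomment where bowtie2 and jellyfish are installed
```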

2. Probe design pipeline

Mining candidate probes

Candidate probe sequences are mined from the genome using OligoMiner with "newBalance" parameter values. For more information on the mining of candidate probe sequences, see the OligoMiner publication.

Scoring candidate probes

After mining candidate probe sequences, a series of steps is performed to score candidate sequences for specificity. These steps are described in depth in the PaintSHOP pre-print.

Briefly, probes are aligned to the genome using bowtie2, pairwise alignments are reconstructed using sam2pairwise, and k-mer frequency is determined using jellyfish. A gradient boosting regression model implemented with XGBoost generates quantitative predictions about the likelihood of the candidates hybridizing with sequences other than their intended target in the genome using a thermodynamic partition function. These scores are aggregated into an on-target and off-target score for each probe in the set. Here is a schematic overview of the machine learning pipeline:

3. Generating output files

DNA probes

After the pipeline is run on each chromosome, DNA FISH probes exist as per-chromosome .tsv files. These files are merged into a single file which constitutes the DNA FISH probe .tsv output file. This file is also used in subsequent steps.
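The merge amounts to concatenating the per-chromosome files in a fixed order. A minimal sketch, assuming the files carry no header row; the function name, file names, and column layout in the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def merge_probe_files(paths, out_path):
    """Concatenate per-chromosome TSV files in the order given."""
    n_rows = 0
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                for line in handle:
                    if not line.strip():
                        continue  # skip blank lines
                    # Make sure every row ends with exactly one newline.
                    out.write(line if line.endswith("\n") else line + "\n")
                    n_rows += 1
    return n_rows


# Demo with made-up rows (columns are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "chr1.tsv").write_text("chr1\t100\t130\tACGT\n")
    (tmp / "chr2.tsv").write_text("chr2\t500\t530\tTTGA\n")
    merged = tmp / "dna_probes.tsv"
    n_rows = merge_probe_files([tmp / "chr1.tsv", tmp / "chr2.tsv"], merged)
    merged_text = merged.read_text()

print(n_rows)
```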

RNA probes

The merged DNA probes are intersected with both the isoform-resolved and isoform-flattened annotation files, which generates the two RNA probe sets. For more information on these probe sets, see the output file specification.
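The intersection can be thought of as an interval test between probe coordinates and annotated segments. The sketch below keeps a probe only when it lies entirely inside an annotated segment on the same chromosome; that containment rule, the tuple layouts, and the function name are assumptions for illustration, and the pipeline's actual overlap criterion may differ.

```python
def intersect_probes(probes, segments):
    """Pair each probe with every annotated segment that fully contains it.

    probes:   list of (chrom, start, stop, sequence)
    segments: list of (chrom, start, stop, transcript_id)
    """
    by_chrom = {}
    for chrom, start, stop, transcript_id in segments:
        by_chrom.setdefault(chrom, []).append((start, stop, transcript_id))

    hits = []
    for chrom, p_start, p_stop, sequence in probes:
        for s_start, s_stop, transcript_id in by_chrom.get(chrom, []):
            # Assumed rule: the probe must fall entirely within the segment.
            if s_start <= p_start and p_stop <= s_stop:
                hits.append((chrom, p_start, p_stop, sequence, transcript_id))
    return hits


# Made-up coordinates for illustration.
probes = [
    ("chr1", 110, 140, "ACGT"),   # inside the tx1 exon
    ("chr1", 190, 220, "GGCA"),   # runs past the exon end
    ("chr2", 510, 540, "TTGA"),   # no annotation on chr2
]
segments = [("chr1", 100, 200, "tx1")]
rna_probes = intersect_probes(probes, segments)
print(rna_probes)
```

Running the same function once against the isoform-resolved segments and once against the isoform-flattened segments would yield the two RNA probe sets.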

Zip archives

Each of the three completed probe sets is also compressed into a zip archive for convenience. These are the files that end up as downloads in the PaintSHOP Resources repo. Together, the three compressed files contain the complete set of generated DNA and RNA FISH probes.
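The archiving step can be sketched with Python's standard `zipfile` module. The file names in the demo are hypothetical placeholders, not the pipeline's actual output names.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_probe_set(tsv_path):
    """Write a zip archive next to the TSV file and return its path."""
    tsv_path = Path(tsv_path)
    zip_path = tsv_path.with_suffix(".zip")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Store the file under its bare name rather than its full path.
        archive.write(tsv_path, arcname=tsv_path.name)
    return zip_path


# Demo with placeholder file names and contents.
probe_sets = [
    "dna_probes.tsv",
    "rna_probes_isoform_resolved.tsv",
    "rna_probes_isoform_flattened.tsv",
]
archived = []
with tempfile.TemporaryDirectory() as tmp:
    for name in probe_sets:
        path = Path(tmp) / name
        path.write_text("chr1\t100\t130\tACGT\n")
        zip_path = zip_probe_set(path)
        with zipfile.ZipFile(zip_path) as archive:
            archived.append(archive.namelist())

print(archived)
```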
