CRSSANT is an analysis pipeline for sequencing data produced using the PARIS assay described in Lu et al., Cell 2016. CRSSANT automates the process of grouping sequencing reads into duplex groups (DGs), extracts stem groups (SGs) from the duplex groups, and reports the potential stem structures that result from folding SGs using ViennaRNA's RNAfold software.
Briefly, the CRSSANT pipeline is as follows: First, CRSSANT uses network analysis methods to cluster PARIS sequencing reads into DGs. After DGs have been found, a percentile rule is used to filter out reads that have one or both arms that are longer than the some percentile arm length, aggregated over all reads in a gene, and over both arms in a read. The remaining reads are referred to as stem groups, or SGs. SGs are piped into ViennaRNA's RNAfold software, and each SG is checked to see if it forms a valid stem structure. Predicted stem structures are reported as lists of base pairs.
CRSSANT is written in Python and available as source code that you can download and run on yuor own machine
For more about the CRSSANT pipeline, please see the bioRxiv preprint by Fischer-Hwang et al..
Navigate to the latest release, right click on the source code, and save it to a known path/location, e.g. CRSSANT_path
. You will need Python version 3.6+ and the following Python packages. We recommend downloading the latest versions of these packages using the Ananconda/Bioconda package manager (follow instructions in links in parentheses):
- ViennaRNA v2.4.7+ (Anaconda Cloud link)
- ushuffle v1.2.2+ (Anaconda Cloud link)
- NetworkX v2.1+ (Anaconda Cloud link)
- NumPy (Anaconda Cloud link)
- scikit-learn (Anaconda Cloud link)
- SciPy (Anaconda Cloud link)
To run the CRSSANT pipeline using the Python source code, open a command line interface on your machine and make sure to prepend all pipeline commands with a call to Python on your platform, e.g.:
python CRSSANT_path/CRSSANT reads.sam reference.fa reference.bed
where files reads.sam
, reference.fa
, and reference.bed
include paths to the reads, reference sequence and reference gene files, respectively. See subsection Specifying pipeline parameters for how to specify non-default pipeline parameters using optional flags.
The input to CRSSANT is assumed to be a SAM file of aligned sequencing reads produced by the PARIS assay. The reads are further assumed to be mapped to the same genomic region (e.g. chromosome or mini genome). The SAM file may contain reads from different genes, but all genes must reside in only a single genomic region.
By default, the CRSSANT pipeline analyzes all reads in a SAM file. The pipeline uses the spectral clustering method to cluster reads into DGs with overlap threshold parameter of t_o=0.5
and eigenratio threshold of t_eig=5
, and uses 8 threads for parallel processing. The following parameters allow the user to run CRSSANT using options different from the default ones.
The CRSSANT pipeline automatically writes all results to the same path where the reads file is found, but an output path may be specified with the -out
flag:
CRSSANT_path/CRSSANT reads.sam reference.fa reference.bed -out output
The CRSSANT pipeline assumes that all reads in the input SAM file map to a single reference genome location. Reads that map to multiple locations within the same genomic region, or chimeric reads, can be specified using the chimeric reads file flag, -chimeric
. When using the -chimeric
flag, the reads.sam
file is assumed to contain only normally-aligned reads that were mapped to a single reference genome location, as indicated with an XG:i:0
tag. Using the -chimeric
flag will create and save a new SAM file with filename ending in _chimeric.sam
in the reads.sam
file path. All parsed chimeric reads added to this new file will contain a new chiastic group (XG) field designation, an XG:i:1
tag. The CRSSANT analysis pipeline is then run on the new file.
To specify a chimeric reads file, run with flag -chimeric
:
CRSSANT_path/CRSSANT reads.sam reference.fa reference.bed -chimeric chimeric.sam -out output
By default, CRSSANT analyzes all possible pairs of genes present in the SAM file. The user may also specify a particular pair of genes for analysis using the gene flag -genes
, e.g. -genes gene1,gene2
indicates that the CRSSANT pipeline should analyze only reads whose left arms map to gene1 and whose right arms map to gene2.
To run CRSSANT on a particular gene pair of interest, run with flag -genes
:
CRSSANT_path/CRSSANT reads.sam reference.fa reference.bed -genes g1,g2 -out output
The default spectral clustering method may be operated with different overlap threshold and eigenratio threshold parameters by specifying one or both with the flags t_o
and t_eig
, respectively. t_o
may be any float between 0 and 1, and t_eig
may be any positive number. Increasing t_o
tends to result in more DGs that contain fewer reads, and increasing t_eig
tends to result in fewer DGs containing more reads. For example, the following command runs CRSSANT with spectral clustering using overlap threshold 0.6 and eigenratio threshold 8:
CRSSANT_path/CRSSANT reads.sam reference.fa reference.bed -t_o 0.7 -t_eig 8 -out output
The user may also specify the cliques-finding method for clustering DGs by specifying the clustering flag cluster
with the cliques
option, e.g. -cluster cliques
. If the cliques-finding method is specified, t_o
may also be specified, and again may be any float between and 1. The eigenvalue threshold t_eig
is not used in the cliques-finding method. By default, for the cliques-finding method the overlap threshold is set to 0.1. For example, the following command runs CRSSANT on reads whose arms both map to gene1, and performs DG clustering with the cliques-finding method using overlap threshold 0.3:
CRSSANT_path/CRSSANT reads.sam reference.fa reference.bed -genes gene1,gene1 -cluster cliques -t_o 0.3 -out output
For more on these parameters, see the bioRxiv preprint referenced at the top of this README.
CRSSANT runs using a default of 8 threads in parallel. The user may specify a different number of threads with the -n
flag.
CRSSANT produces up to five output files with the following extensions added to the reads (or combined normal reads and chimeric reads) file(s) name. After the DG clustering step, CRSSANT verifies that the DGs do not contain any non-overlapping reads, i.e. any reads where the start position of its left arm is greater than or equal to the stop position of the right arm of any other read in the DG. If the DGs do not contain any non-overlapping reads, then the following output files ending in the following are written:
_CRSSANT.sam
: SAM file containing reads that were successfully assigned to DGs, plus DG and non-overlapping group (NG) annotations_CRSSANT_dg.bed
: BED file listing all duplex groups. The file header contains the following columns:
region DG start DG stop Group_ID_coverage # reads in DG - DG start DG start color 2 DG left arm length,DG right arm length DG left arm start,DG right arm start
where coverage
is defined as c / sqrt(a*b) and
- c = number of reads in a given DG
- a = number of reads overlapping the left arm of the DG
- b = number of reads overlapping the right arm of the DG
If DG were successfully clustered, the CRSSANT pipeline performs SG assembly for nine percentile threshold cutoffs: 10th percentile, 20th percentile, ..., up to and including the 90th percentile. After SG assembly for each percentile threshold, CRSSANT checks if there is a non-zero number of SGs that were assembled from the DGs. If there is at least one SG, then output files ending in the following are written:
_CRSSANT_sg_bp.bed
: BED file containing basepairs only for SGs that form valid stem structures. The file header contains the following columns:
region SG bp start SG bp stop SG
_CRSSANT_sg_arc.bed
: BED file containing arcs for all SGs. Arc start and stop positions are the means of the start and stop indices of left and right SG arms, respectively. The file header contains the following columns:
region SG arc start SG arc stop SG
_CRSSANT_sg.aux
: auxiliary file containing SG crosslinking and stem length information, and arm statistics for all DGs. If no SG was formed from the DG, or if an SG did not result in a valid structure, some of the following information is replaced with null information. The file header contains the following columns:
Group_ID num_reads UU_cl,UC_cl,num_basepairs L_start_min,L_start_max,L_start_std L_stop_min,L_stop_max,L_stop_std R_start_min,R_start_max,R_start_std R_stop_min,R_stop_max,R_stop_std
where
- SG Group IDs correspond to DG IDs
num_reads
is the number of reads in the SG, i.e. the number of reads in the DG that passed percentile filteringUU_cl
andUC_cl
are the number of staggered uridine and uridine-cytosine base pairs, respectively, in the SG stem structurestem_length
is the length of SG stem structure_min
,_max
,_std
are the minimum arm index, maximum arm index, and standard deviation of all start and stop indices of the SG arms
CRSSANT assumes that the reference.bed
file contains minimal gene information in the following 6-column format:
genomic region gene start gene stop gene name
where columns are separated by tabs. Some specifics on format:
genomic region
must match the gene naming format in thereference.fa
filegene start
andgene stop
must be integers
To see specifics on arguments for running CRSSANT, run
CRSSANT_path/CRSSANT -h
The help information may take ~10 seconds to load.
You can test CRSSANT using a collection of Homo sapiens ribosomal RNA (rRNA) test data that we have compiled:
- Download the compressed folder of test data and decompress using the command
tar -zxvf tests.tar.gz
or by double-clicking on the tar.gz file - Specify the path/location where results should be written, e.g.
output
Run CRSSANT on all rRNA genes in region hs54S:
CRSSANT_path/CRSSANT tests/hsrRNA_reads.sam tests/hsrRNA.fa tests/hsrRNA_gene.bed -out output
or analyze specific genes, e.g. only reads whose left arms map to gene 5.8S and whose right arms map to gene 28S:
CRSSANT_path/CRSSANT tests/hsrRNA_reads.sam tests/hsrRNA.fa tests/hsrRNA_gene.bed -genes 5.8S,28S -out output