5. Visualization with PAINTOR CANVIS
After running PAINTOR, you can use CANVIS (Correlation and Annotation Visualization) to produce a visual summary of the outputs. This documentation describes the necessary dependencies, file formats, and input parameters, and includes a sample script. For questions or bug reports, please contact [email protected].
The final visualization produced is composed of the following:
- Scatterplot of location versus posterior probabilities with a specified credible set
- Annotation bars
- Scatterplot of location versus -log10(pvalue)
- Correlation heat-map of LD matrix
These subplots are written to a single svg file and an html file, which can be opened in a browser or converted into other file formats.
Download the latest software into a target directory. The necessary libraries and dependencies are as follows:
- Python (2.7)
- numpy (1.12.0)
- scipy (0.15.1)
- matplotlib (2.0.0)*
- seaborn (0.7.1)
- pandas (0.15.2)
- svgutils
* Note that most matplotlib installations are still 1.X.X, but the 2.0.0 major release made some functionality incompatible with previous versions, so version 2.0.0 or higher is required.
The first 3 libraries are often included in Python distributions, whereas the latter three libraries can be easily installed using:
pip install seaborn
pip install pandas
pip install svgutils --user
Sample data is also provided in a folder, CANVIS_Sample.
- Locus file that contains Z-scores from population of interest [N+1 x F]
- LD Matrix file(s) [N x N]
- Annotation matrix file with annotation indicators of either 1 or 0 [N+1 x A]
Only the Locus file is mandatory, but CANVIS can plot using up to all three files.
The locus file should at minimum contain the Z-scores, posterior probabilities, and positions. Note that the headings for position ("pos") and posterior probability ("Posterior_Prob") must be exact, but the labels for the Z-score columns are up to the user, since the names of the Z-score columns are specified when running.
NOTE: Currently, the base pair column of the Locus file must have the heading, "pos", and the posterior probability heading must read "Posterior_Prob"
Example File:
chr pos rsid hdl.A0 hdl.A1 hdl.Zscore ldl.Zscore tc.Zscore tg.Zscore Posterior_Prob
chr4 3351705 rs2749778 T C -2.586514 3.742234 3.651452 2.98515 0.00371223
chr4 3351929 rs2749777 T C -1.071928 1.348749 1.702689 2.42911 3.42745e-07
chr4 3355035 rs2749776 A G -1.424898 1.16765 1.854778 2.604495 4.33626e-07
chr4 3355538 rs2749775 C T -1.5 0.965517 1.631579 2.603774 1.79114e-05
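Before running CANVIS, the header requirements above can be checked with a short script. This is a minimal sketch, not part of CANVIS: the helper name `check_locus_header` is hypothetical, and it assumes the locus file is whitespace-delimited as in the example.

```python
# Column names that the documentation says must match exactly.
REQUIRED = ("pos", "Posterior_Prob")


def check_locus_header(header_line, zscore_names):
    """Return the required column names that are missing from a locus-file header."""
    columns = header_line.split()
    wanted = list(REQUIRED) + list(zscore_names)
    return [name for name in wanted if name not in columns]


header = ("chr pos rsid hdl.A0 hdl.A1 hdl.Zscore ldl.Zscore "
          "tc.Zscore tg.Zscore Posterior_Prob")

# An empty list means every required column is present.
print(check_locus_header(header, ["tg.Zscore"]))
# "bmi.Zscore" is a made-up column name used to show a failing check.
print(check_locus_header(header, ["bmi.Zscore"]))
```

The Z-score names passed in should be the same ones later given to the --zscores flag.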
Each LD file contains a symmetric matrix of Pearson correlation coefficients, where entry i,j is the correlation between SNPs i and j (r_ij). White space must separate individual columns of the matrix. This file has no header. If using multiple populations, each population must have its own matrix file.
Example:
Population 1
1.0 0.5 0.5 0.2
0.5 1.0 0.3 0.1
0.5 0.3 1.0 0.9
0.2 0.1 0.9 1.0
Population 2
1.0 0.0 0.0 0.0
0.0 1.0 0.2 0.1
0.0 0.2 1.0 0.3
0.0 0.1 0.3 1.0
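The properties described above (square, symmetric, no header) can be verified before plotting. The following is a rough sketch using only the standard library. The function names `parse_ld` and `ld_problems` are hypothetical, and the check that the diagonal equals 1 is an added assumption based on the matrix holding correlation coefficients.

```python
def parse_ld(text):
    """Parse a header-less, whitespace-separated matrix into rows of floats."""
    return [[float(x) for x in line.split()] for line in text.strip().splitlines()]


def ld_problems(matrix, tol=1e-8):
    """Return a list of problems found; an empty list means the checks passed."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return ["matrix is not square"]
    problems = []
    for i in range(n):
        # Assumption: a SNP is perfectly correlated with itself.
        if abs(matrix[i][i] - 1.0) > tol:
            problems.append("diagonal entry %d is not 1" % i)
        for j in range(i + 1, n):
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                problems.append("entries (%d, %d) and (%d, %d) differ" % (i, j, j, i))
    return problems


# The Population 1 example from above.
population_1 = parse_ld("""
1.0 0.5 0.5 0.2
0.5 1.0 0.3 0.1
0.5 0.3 1.0 0.9
0.2 0.1 0.9 1.0
""")
print(ld_problems(population_1))
```

Indices in the messages are 0-based. The number of rows should also equal the number of SNPs in the locus file.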
The annotation file contains a matrix of annotations that are typically binary, with each entry either 0 or 1. The rows of the matrix correspond to SNPs at that locus and the columns represent unique annotations. For example, if the first column of the matrix represented a "coding" region, and entry [1,1] of the matrix was equal to 1, this would signify that SNP 1 falls within a coding region. The first line in the file must be a header identifying the annotations. Each annotation must have a unique identifier.
One can plot only certain annotations even if the file contains multiple columns of annotations. The user must specify which specific annotations will be plotted during input.
Example:
E066.H3K27ac.narrowPeak.Adult_Liver E066.H3K4me1.narrowPeak.Adult_Liver
0 0
0 0
1 1
1 1
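To illustrate how a subset of annotation columns might be selected by name, here is a small sketch. It is not CANVIS's own parsing code: `select_annotations` is a hypothetical helper that assumes a whitespace-delimited file with a header line, as in the example above.

```python
def select_annotations(text, wanted=None):
    """Return {annotation name: list of 0/1 values}, restricted to `wanted` if given."""
    lines = text.strip().splitlines()
    names = lines[0].split()
    if len(set(names)) != len(names):
        raise ValueError("annotation names must be unique")
    # With no explicit selection, fall back to every annotation in the header.
    chosen = names if wanted is None else list(wanted)
    missing = [name for name in chosen if name not in names]
    if missing:
        raise ValueError("annotations not found in header: %s" % ", ".join(missing))
    rows = [[int(value) for value in line.split()] for line in lines[1:]]
    return {name: [row[names.index(name)] for row in rows] for name in chosen}


sample = """
E066.H3K27ac.narrowPeak.Adult_Liver E066.H3K4me1.narrowPeak.Adult_Liver
0 0
0 0
1 1
1 1
"""
print(select_annotations(sample, ["E066.H3K4me1.narrowPeak.Adult_Liver"]))
```

Checking names against the header up front catches a misspelled annotation before any plotting is attempted.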
The following input parameters are specified at the command line:
--locus [-l]
path to file with fine-mapping locus
Mandatory file; if not included, program will exit
--zscores [-z]
names of the Z-score columns to be plotted (between 1 and 3)
Mandatory; if not included, program will exit
--annotations [-a]
path to the file with annotations
Optional; if not included, no annotation bars will be included
--specific_annotations [-s]
names of the specific annotations to be plotted (between 1 and 5; 3 max recommended)
if not specified, program will use all annotations listed in the heading of the file
--ld_name [-r]
path to file with ld matrix
Optional; if not included, no ld matrix will be plotted
recommended < 350 entries; if there are more entries, the program will not plot the LD matrix, but the matrix will still be used in calculations. Use --large_ld y to override
--threshold [-t]
threshold for credible set, a number (0,100); default: 0
--greyscale [-g]
y or n, flag for figure in greyscale; default: n
--output [-o]
desired name of output file [default: fig_final]
--interval [-i]
designated interval [default: all locations]
--large_ld [-L]
y or n, plots LD matrix even if it has more than 350 entries
--horizontal [-H]
y or n, plots LD matrix adjacent to graphs instead of underneath
Note (for --interval): if the left and/or right bound of the interval is out of bounds, or if there are no SNPs in the specified interval, it will be replaced with the minimum or maximum location, respectively.
You must include the locus path and z-scores name(s); all other flags are optional.
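The interval fallback described in the note above can be sketched as follows. This is one plausible reading of that note, not the logic taken from CANVIS.py: the helper `resolve_interval` is hypothetical, and it assumes that an interval containing no SNPs falls back to the full range of positions.

```python
def resolve_interval(positions, left=None, right=None):
    """Clamp a requested interval to the positions available in the locus file."""
    lowest, highest = min(positions), max(positions)
    # An omitted or out-of-bounds bound is replaced by the matching extreme.
    if left is None or left < lowest or left > highest:
        left = lowest
    if right is None or right > highest or right < lowest:
        right = highest
    # Assumption: an interval with no SNPs falls back to all locations.
    if not any(left <= p <= right for p in positions):
        left, right = lowest, highest
    return left, right


# Positions from the example locus file.
positions = [3351705, 3351929, 3355035, 3355538]
print(resolve_interval(positions, 3351800, 3355100))  # valid interval, kept as given
print(resolve_interval(positions, 1, 3355100))        # left bound is out of range
print(resolve_interval(positions, 3352000, 3355000))  # no SNPs fall inside
```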
Graph of location versus Posterior Probabilities. SNPs that form the credible set will be marked in red (or black in greyscale).
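For readers unfamiliar with credible sets, the sketch below shows one common construction: rank SNPs by posterior probability and add them until the requested percentage of the posterior mass is covered. Whether CANVIS uses exactly this rule is an assumption. The function `credible_set` is hypothetical, as is the choice to measure the threshold against the total posterior mass and to treat a threshold of 0 as "no credible set".

```python
def credible_set(posteriors, threshold):
    """Indices of the top-ranked SNPs whose posterior mass reaches `threshold` percent."""
    total = sum(posteriors)
    if threshold <= 0 or total <= 0:
        return []
    # Rank SNP indices from highest to lowest posterior probability.
    order = sorted(range(len(posteriors)), key=lambda i: posteriors[i], reverse=True)
    target = total * threshold / 100.0
    chosen, running = [], 0.0
    for i in order:
        chosen.append(i)
        running += posteriors[i]
        if running >= target:
            break
    return chosen


# Toy posterior probabilities (illustrative values, not PAINTOR output).
toy_posteriors = [0.125, 0.5, 0.125, 0.25]
print(credible_set(toy_posteriors, 75))
```

The returned indices are the SNPs that would be highlighted in the posterior probability plot under this rule.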
Annotation bars are currently plotted using discrete (0 or 1) classifications. Colored regions denote a 1, whereas white regions denote a 0. Multiple annotation bars can be plotted, but plotting at most 3 is recommended.
Plot of location versus -log10(pvalue). A dotted horizontal line denotes the p-value threshold of 5*10^-8. If an LD matrix is provided, the top SNP is denoted by a black diamond and the remaining points are shaded according to their correlation with the top SNP.
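Since the locus file stores Z-scores rather than p-values, the y-axis values have to be derived from them. A minimal sketch of that conversion follows, assuming a two-sided p-value under a standard normal distribution. That assumption and the helper name `neg_log10_p` are not taken from the CANVIS source.

```python
import math


def neg_log10_p(z):
    """Convert a Z-score to -log10 of its two-sided normal p-value."""
    # For a standard normal, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    if p == 0.0:
        # The p-value underflowed for a very large |z|.
        return float("inf")
    return -math.log10(p)


# Height of the dotted line for a threshold of 5*10^-8.
THRESHOLD_LINE = -math.log10(5e-8)

# tg.Zscore values from the first rows of the example locus file.
for z in (2.98515, 2.42911, 2.604495):
    print(neg_log10_p(z) > THRESHOLD_LINE)
```

None of the example Z-scores comes close to the threshold line, which sits a little above 7.3 on the -log10 scale.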
If an LD file is provided, a heatmap of the matrix will be plotted. If the matrix has more than 350 entries, the heatmap will not be produced unless the --large_ld flag is used. Up to 2 LD matrices can be plotted if using multiple populations. If 2 LD matrices are plotted, they will be displayed horizontally.
The output consists of an svg file and an html file with the plot embedded. Note that a run may take up to 60 seconds, with the time depending mostly on the size of the LD file.
#!/bin/bash
# A space before each trailing backslash keeps the arguments separate
# when the shell joins the continued lines.
python CANVIS.py \
    -l chr4.3473139.rs6831256.post.filt.300 \
    -z tg.Zscore \
    -r chr4.3473139.rs6831256.ld.filt.300 \
    -a chr4.3473139.rs6831256.annot.filt.300 \
    -s E066.H3K27ac.narrowPeak.Adult_Liver E066.H3K4me1.narrowPeak.Adult_Liver \
    -t 99 \
    -i 3381705 3507346
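The same invocation can also be assembled from Python, which avoids shell line-continuation pitfalls when CANVIS is called from a larger pipeline. This is only a sketch: `build_canvis_command` is a hypothetical helper, and it uses nothing beyond the short flags listed in the parameter section.

```python
def build_canvis_command(locus, zscores, ld=None, annotations=None,
                         specific_annotations=None, threshold=None, interval=None):
    """Assemble an argument list for CANVIS.py from the documented short flags."""
    cmd = ["python", "CANVIS.py", "-l", locus, "-z"] + list(zscores)
    if ld is not None:
        cmd += ["-r", ld]
    if annotations is not None:
        cmd += ["-a", annotations]
    if specific_annotations:
        cmd += ["-s"] + list(specific_annotations)
    if threshold is not None:
        cmd += ["-t", str(threshold)]
    if interval is not None:
        cmd += ["-i", str(interval[0]), str(interval[1])]
    return cmd


command = build_canvis_command(
    locus="chr4.3473139.rs6831256.post.filt.300",
    zscores=["tg.Zscore"],
    ld="chr4.3473139.rs6831256.ld.filt.300",
    annotations="chr4.3473139.rs6831256.annot.filt.300",
    specific_annotations=["E066.H3K27ac.narrowPeak.Adult_Liver",
                          "E066.H3K4me1.narrowPeak.Adult_Liver"],
    threshold=99,
    interval=(3381705, 3507346),
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.call(command)` from the directory that contains CANVIS.py and the input files.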