5. Visualization with PAINTOR CANVIS

Ruth Johnson edited this page Jul 11, 2017 · 9 revisions

Overview

After running PAINTOR, you can use CANVIS (Correlation and Annotation Visualization) to produce a visual summary of the outputs. This page documents the necessary dependencies, file formats, and input parameters, and includes a sample script. For questions or bug reports, please contact [email protected].

The final visualization produced is composed of the following:

  • Scatterplot of location versus posterior probabilities with a specified credible set
  • Annotation bars
  • Scatterplot of location versus -log10(pvalue)
  • Correlation heat-map of LD matrix

These subplots are combined into one SVG file and one HTML file, which can be opened in a browser or converted into other file formats.

Installation

Download the latest version of the software into your target directory. The necessary libraries and dependencies are as follows:

  • Python (2.7)
  • numpy (1.12.0)
  • scipy (0.15.1)
  • matplotlib (2.0.0)*
  • seaborn (0.7.1)
  • pandas (0.15.2)
  • svgutils

* Note that many matplotlib installations are still 1.X.X, but the 2.0 major release renders some functionality incompatible with previous versions, so version 2.0.0 or higher is required.

The first three libraries (numpy, scipy, and matplotlib) are often included in Python distributions, whereas the last three (seaborn, pandas, and svgutils) can be installed using:

pip install seaborn
pip install pandas
pip install svgutils --user

Sample data is also provided in a folder, CANVIS_Sample.

Inputs and File Formats

  1. Locus file that contains Z-scores from population of interest [N+1 x F]
  2. LD Matrix file(s) [N x N]
  3. Annotation matrix file with annotation indicators of either 1 or 0 [N+1 x A]

The only mandatory file is the Locus file, but CANVIS can plot using any combination of the three.

File Formats

Locus File

The locus file must contain, at a minimum, the Z-scores, posterior probabilities, and positions. The label for each Z-score column is up to the user, since the names of the Z-scores to plot are specified at run time.

NOTE: Currently, the base pair column of the Locus file must have the exact heading "pos", and the posterior probability column must have the exact heading "Posterior_Prob".

Example File:

chr pos rsid hdl.A0 hdl.A1 hdl.Zscore ldl.Zscore tc.Zscore tg.Zscore Posterior_Prob
chr4 3351705 rs2749778 T C -2.586514 3.742234 3.651452 2.98515 0.00371223
chr4 3351929 rs2749777 T C -1.071928 1.348749 1.702689 2.42911 3.42745e-07
chr4 3355035 rs2749776 A G -1.424898 1.16765 1.854778 2.604495 4.33626e-07
chr4 3355538 rs2749775 C T -1.5 0.965517 1.631579 2.603774 1.79114e-05
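The heading requirements above can be checked before running CANVIS. The following is a minimal sketch, not part of CANVIS itself: the function name `check_locus_header` is hypothetical, and it assumes the header is the first whitespace-separated line of the locus file, as in the example.

```python
# Headings that the text above says must match exactly.
REQUIRED_COLUMNS = ("pos", "Posterior_Prob")


def check_locus_header(header_line, zscore_names):
    """Return the required column names that are missing from the header.

    header_line  -- first line of the locus file
    zscore_names -- Z-score column names you plan to pass with --zscores
    """
    header = header_line.split()
    wanted = list(REQUIRED_COLUMNS) + list(zscore_names)
    return [name for name in wanted if name not in header]


# Header taken from the example file above.
example_header = (
    "chr pos rsid hdl.A0 hdl.A1 hdl.Zscore ldl.Zscore "
    "tc.Zscore tg.Zscore Posterior_Prob"
)
missing = check_locus_header(example_header, ["tg.Zscore"])
```

An empty list means every required heading is present; any names returned are the ones to fix in the file or in the `--zscores` argument.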

LD Matrix File

The LD file(s) contain a symmetric matrix of Pearson correlation coefficients, where entry i,j corresponds to the correlation between SNPs i and j (r_{i,j}). Whitespace must separate the individual columns of the matrix. This file has no header. If using multiple populations, each population must have its own matrix file.

Example: 

Population 1

1.0 0.5 0.5 0.2
0.5 1.0 0.3 0.1
0.5 0.3 1.0 0.9
0.2 0.1 0.9 1.0

Population 2

1.0 0.0 0.0 0.0
0.0 1.0 0.2 0.1
0.0 0.2 1.0 0.3
0.0 0.1 0.3 1.0
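As an illustration of the format, the sketch below parses a header-less, whitespace-separated matrix and checks that it is square and symmetric. It is a standalone check under those stated assumptions, not CANVIS code; the function names and the tolerance `tol` are hypothetical.

```python
def read_ld_matrix(lines):
    """Parse whitespace-separated rows of floats (the file has no header)."""
    return [[float(x) for x in line.split()] for line in lines if line.strip()]


def is_square_symmetric(matrix, tol=1e-8):
    """True if the matrix is N x N and entry [i][j] matches entry [j][i]."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False
    return all(
        abs(matrix[i][j] - matrix[j][i]) <= tol
        for i in range(n)
        for j in range(i + 1, n)
    )


# Rows copied from the Population 1 example above.
population_1 = read_ld_matrix([
    "1.0 0.5 0.5 0.2",
    "0.5 1.0 0.3 0.1",
    "0.5 0.3 1.0 0.9",
    "0.2 0.1 0.9 1.0",
])
ld_ok = is_square_symmetric(population_1)
```

When reading from disk, the same functions can be applied to the lines of an open file handle.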

Annotation Matrix File

The annotation file contains a matrix of annotations that are typically binary, represented by an entry of either 0 or 1. The rows of the matrix correspond to SNPs at that locus, and the columns represent unique annotations. For example, if the first column of the matrix represents a "coding" region and entry [1,1] of the matrix is equal to 1, then SNP 1 falls within a coding region. The first line in the file must be a header identifying the annotations. Each annotation must have a unique identifier.

One can plot only certain annotations even if the file contains multiple columns of annotations. The user must specify which specific annotations will be plotted during input.

Example:

E066.H3K27ac.narrowPeak.Adult_Liver E066.H3K4me1.narrowPeak.Adult_Liver
0 0
0 0
1 1
1 1
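To make the column selection concrete, here is a small sketch that pulls out only the requested annotation columns and enforces the rules stated above (unique header names, entries of 0 or 1). The function `select_annotations` is hypothetical and is not how CANVIS is known to implement this.

```python
def select_annotations(lines, wanted):
    """Return {annotation name: list of 0/1 ints} for the requested columns.

    lines  -- lines of the annotation file, header first
    wanted -- annotation names to keep, as they appear in the header
    """
    header = lines[0].split()
    if len(set(header)) != len(header):
        raise ValueError("annotation names must be unique")
    # list.index raises ValueError if a requested name is not in the header.
    index = {name: header.index(name) for name in wanted}
    columns = {name: [] for name in wanted}
    for line in lines[1:]:
        fields = line.split()
        for name in wanted:
            value = int(fields[index[name]])
            if value not in (0, 1):
                raise ValueError("annotation entries must be 0 or 1")
            columns[name].append(value)
    return columns


# Lines copied from the example above.
example_lines = [
    "E066.H3K27ac.narrowPeak.Adult_Liver E066.H3K4me1.narrowPeak.Adult_Liver",
    "0 0",
    "0 0",
    "1 1",
    "1 1",
]
selected = select_annotations(example_lines, ["E066.H3K4me1.narrowPeak.Adult_Liver"])
```

Each returned list has one entry per SNP, in file order, which is the shape an annotation bar needs.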

Command Line Flags

The following input parameters are specified at the command line:

--locus [-l]: path to the file with the fine-mapping locus. Mandatory; if not included, the program will exit.

--zscores [-z]: names of the specific Z-scores to be plotted (between 1 and 3). Mandatory; if not included, the program will exit.

--annotations [-a]: path to the file with annotations. Optional; if not included, no annotation bars will be plotted.

--specific_annotations [-s]: names of the specific annotations to be plotted (between 1 and 5; at most 3 recommended). If not specified, the program will use all annotations listed in the header of the file.

--ld_name [-r]: path to the file with the LD matrix. Optional; if not included, no LD matrix will be plotted. Fewer than 350 entries are recommended; if there are more entries, the program will not plot the LD matrix but will still use it in calculations. Use --large_ld y to override.

--threshold [-t]: threshold for the credible set, a number between 0 and 100; default: 0

--greyscale [-g]: y or n, flag for producing the figure in greyscale; default: n

--output [-o]: desired name of the output file; default: fig_final

--interval [-i]: designated interval; default: all locations

--large_ld [-L]: y or n, plots the LD matrix even if the range is > 350

--horizontal [-H]: y or n, plots the LD matrix adjacent to the graphs instead of underneath

Note: if the left and/or right bound of the interval is out of bounds, or if there are no SNPs in the specified interval, the affected bound is replaced with the minimum or maximum location, respectively.

You must include the locus path and Z-score name(s); all other flags are optional.

Output

Posterior Probability Plot

Graph of location versus Posterior Probabilities. SNPs that form the credible set will be marked in red (or black in greyscale).
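One common way to build a credible set is to take SNPs in decreasing order of posterior probability until their cumulative share of the total posterior mass reaches the threshold. The sketch below illustrates that idea only; whether CANVIS computes its credible set exactly this way is an assumption, and the function name `credible_set` is hypothetical.

```python
def credible_set(posteriors, threshold_percent):
    """Indices of the highest-posterior SNPs whose cumulative share of the
    total posterior mass first reaches threshold_percent (0 to 100)."""
    total = sum(posteriors)
    target = total * threshold_percent / 100.0
    # SNP indices sorted from highest to lowest posterior probability.
    order = sorted(range(len(posteriors)), key=lambda i: posteriors[i], reverse=True)
    chosen, running = [], 0.0
    for i in order:
        if running >= target:
            break
        chosen.append(i)
        running += posteriors[i]
    return chosen


# Made-up posterior probabilities, for illustration only.
toy_posteriors = [0.1, 0.6, 0.05, 0.25]
set_80 = credible_set(toy_posteriors, 80)
```

In this sketch a threshold of 0 selects no SNPs, which would be consistent with a default under which no credible set is marked.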

Annotation Bars

Annotation bars are currently plotted using discrete (0 or 1) classifications. Colored regions denote a 1, whereas white regions denote a 0. Multiple annotation bars can be plotted, but plotting at most 3 is recommended.

Scatterplot pvalues

Plot of location versus -log10(pvalue). A dotted horizontal line denotes the threshold of 5*10^-8. If an LD matrix is provided, the top SNP is denoted by a black diamond and the remaining points are shaded according to their correlation with the top SNP.
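Since the locus file stores Z-scores rather than p-values, the y-axis values have to be derived from them. A minimal sketch, assuming a two-sided p-value under a standard normal distribution (the exact conversion used by CANVIS is not stated here, and the names below are hypothetical):

```python
import math


def neg_log10_pvalue(z):
    """-log10 of the two-sided p-value for a standard normal Z-score.

    Uses p = erfc(|z| / sqrt(2)), which equals 2 * (1 - Phi(|z|)).
    For very large |z| the p-value underflows to 0 and log10 raises
    ValueError, so extreme scores need separate handling.
    """
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return -math.log10(p)


# Height of the dotted line for the 5*10^-8 threshold, about 7.30.
GENOME_WIDE_LINE = -math.log10(5e-8)

# A Z-score of about 1.96 corresponds to p of about 0.05.
example_value = neg_log10_pvalue(1.959964)
```

Points plotted above `GENOME_WIDE_LINE` correspond to p-values below 5*10^-8.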

LD Heatmap

If an LD file is provided, a heatmap of the matrix will be plotted. If the interval contains more than 350 entries, the heatmap will not be produced unless the --large_ld flag is used. Up to 2 LD matrices can be plotted if using multiple populations. If 2 LD matrices are plotted, they will be displayed side by side.

Output Files

The output consists of an SVG file and an HTML file with an embedded plot. Note that a run may take up to 60 seconds; the time depends mostly on the size of the LD file.

Example Script

#!/bin/bash

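# A space is needed before each trailing backslash; otherwise the shell
# joins the end of one line directly onto the start of the next.
python CANVIS.py \
-l chr4.3473139.rs6831256.post.filt.300 \
-z tg.Zscore \
-r chr4.3473139.rs6831256.ld.filt.300 \
-a chr4.3473139.rs6831256.annot.filt.300 \
-s E066.H3K27ac.narrowPeak.Adult_Liver E066.H3K4me1.narrowPeak.Adult_Liver \
-t 99 \
-i 3381705 3507346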

Example Figure