
3. Running Software and Suggested Pipeline

Gleb Kichaev edited this page Jan 11, 2017 · 6 revisions

Running software

Usage:

PAINTOR -input [input filename] -in [input directory] -out [output directory] -Zhead [Zscore header(s)] -LDname [LD suffix(es)] -annotations [annotation1,annotation2...] <other options>

OPTIONS: -flag Description [default setting]

-input (required) Filename of the input file containing the list of the fine-mapping loci [default: N/A]

-Zhead (required) The name(s) of the Zscore column in the header of the locus file (comma separated) [default: N/A]

-LDname (required) Suffix(es) for LD files (comma separated). Must be given in the same order as the Z-score headers passed to -Zhead [Default: N/A]

-annotations The names of the annotations to include in model (comma separated) [default: N/A]

-enumerate Enumerate all possible causal configurations; follow the flag with the maximum number of causal SNPs (e.g. -enumerate 3 considers up to 3 causals at each locus) [Default: not specified]

-in Input directory with all run files [default: ./ ]

-out Output directory where output will be written [default: ./ ]

-Gname Output Filename for enrichment estimates [default: Enrichment.Estimate]

-Lname Output Filename for the final sum of log bayes factors [default: Log.BayesFactor]

-RESname Suffix for output files of results [Default: results]

-ANname Suffix for annotation files [Default: annotations]

-MI Maximum iterations for algorithm to run [Default: 10]

-GAMinital Initialize the enrichment parameters to a pre-specified value (comma separated) [Default: 0,...,0]

-variance specify prior variance on the causal effect sizes scaled by sample size [Default: 30]

-num_samples specify number of samples to draw for each locus [Default: 1000000]

-set_seed specify an integer as a seed for random number generator [default: clock time at execution]

-max_causal specify the number of causals to pre-compute enrichments with [default: 2]

Example: Running PAINTOR defaults (single population)

By default, PAINTOR performs approximate inference using Importance Sampling with 1 million draws per locus (set with the -num_samples flag). It first infers the enrichment parameters by enumeration under the assumption of 2 causal variants per locus (change this with the -max_causal flag). It then performs one round of Importance Sampling to compute the posterior probability that each SNP is causal.

$ ./PAINTOR -input input.files -Zhead ZSCORE.P1 -LDname LD1 -in RunDirectory/ -out OutDirectory/ -annotations Coding,DHS
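Before launching a run like the one above, it can help to confirm that every file PAINTOR will look for is present. The sketch below is only illustrative and rests on an assumed layout that this page does not spell out: the locus list sits inside the run directory, and each locus named in it has a locus file `<locus>`, an LD file `<locus>.<LD suffix>`, and an annotation file `<locus>.<annotation suffix>`. The function name `check_paintor_inputs` is hypothetical.

```shell
#!/usr/bin/env bash
# Pre-flight check for a PAINTOR run directory (illustrative sketch).
# ASSUMED layout: for every locus listed in the -input file, the run directory
# contains <locus>, <locus>.<LD suffix> and <locus>.<annotation suffix>.

check_paintor_inputs() {
    local run_dir="$1" list_file="$2" ld_suffix="$3" ann_suffix="${4:-annotations}"
    local missing=0 locus f

    if [ ! -f "$run_dir/$list_file" ]; then
        echo "missing: $run_dir/$list_file" >&2
        return 1
    fi

    # Read one locus name per line; tolerate a missing trailing newline.
    while IFS= read -r locus || [ -n "$locus" ]; do
        [ -z "$locus" ] && continue          # skip blank lines
        for f in "$locus" "$locus.$ld_suffix" "$locus.$ann_suffix"; do
            if [ ! -f "$run_dir/$f" ]; then
                echo "missing: $run_dir/$f" >&2
                missing=$((missing + 1))
            fi
        done
    done < "$run_dir/$list_file"

    # Succeed only if nothing was missing.
    [ "$missing" -eq 0 ]
}
```

For the example above, `check_paintor_inputs RunDirectory input.files LD1 && ./PAINTOR ...` would start PAINTOR only when the check passes.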

Example: Running PAINTOR with full enumeration (single population)

For moderately sized loci and a reasonable number of causal variants, you can elect to run full enumeration, in which every possible model is considered. To do so, add the -enumerate [number of causals] flag to the command:

$ ./PAINTOR -input input.files -Zhead ZSCORE.P1 -LDname LD1 -in RunDirectory/ -out OutDirectory/ -enumerate 3 -annotations Coding,DHS

If loci contain fewer than 500 SNPs, running with full enumeration is reasonable and generally recommended, as it will likely give the most accurate results.
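To get a feel for why locus size matters, the sketch below counts how many configurations exist with at most k causal SNPs among m SNPs, namely the sum of C(m, i) for i = 0..k, and applies the 500-SNP rule of thumb to a locus file. Both helper names are hypothetical. Two assumptions are made: the count includes the "no causal SNP" configuration (PAINTOR's internal count may differ), and a locus file has one header line followed by one SNP per line.

```shell
#!/usr/bin/env bash
# Number of configurations with at most k causal SNPs among m SNPs:
#   sum_{i=0..k} C(m, i)   (the i = 0 term is the "no causal SNP" configuration)
count_configs() {
    awk -v m="$1" -v k="$2" 'BEGIN {
        total = 0; term = 1                    # term starts as C(m, 0)
        for (i = 0; i <= k && i <= m; i++) {
            total += term
            term = term * (m - i) / (i + 1)    # C(m, i+1) from C(m, i)
        }
        printf "%.0f\n", total
    }'
}

# Suggest "enumerate" or "sample" from the SNP count of one locus file.
# ASSUMPTION: one header line, then one SNP per non-empty line.
suggest_mode() {
    local locus_file="$1" threshold="${2:-500}" n_snps
    n_snps=$(awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$locus_file")
    if [ "$n_snps" -lt "$threshold" ]; then
        echo "enumerate"
    else
        echo "sample"
    fi
}
```

For example, `count_configs 500 3` prints 20833751, which shows how quickly the model space grows as either the locus size or the -enumerate value increases.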

Example: Running PAINTOR with full enumeration (two populations or traits)

$ ./PAINTOR -input input.files -Zhead  ZSCORE.P1,ZSCORE.P2 -LDname LD1,LD2 -in RunDirectory/ -out OutDirectory/ -enumerate 3 -annotations Coding,DHS

Approximate inference is also applicable for multi-population/multi-trait fine-mapping.
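Because the entries of -LDname must follow the order of the entries of -Zhead, a quick check of the two comma-separated lists can catch a mismatch before a long run. The following is a minimal sketch; the function name `pair_zscores_with_ld` is hypothetical and the check only compares the lists as given on the command line, not the contents of any file.

```shell
#!/usr/bin/env bash
# Print each Z-score header next to the LD suffix it will be paired with,
# or fail if the two comma-separated lists differ in length.
pair_zscores_with_ld() {
    awk -v z="$1" -v l="$2" 'BEGIN {
        nz = split(z, zs, ","); nl = split(l, lds, ",")
        if (nz != nl) {
            printf "error: %d Z-score header(s) but %d LD suffix(es)\n", nz, nl > "/dev/stderr"
            exit 1
        }
        for (i = 1; i <= nz; i++) print zs[i] " -> " lds[i]
    }'
}
```

Calling `pair_zscores_with_ld ZSCORE.P1,ZSCORE.P2 LD1,LD2` lists the two pairings so the order can be checked by eye; passing lists of different lengths returns a non-zero status.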

Suggested Pipeline

In order to determine which annotations are relevant to the phenotype being considered, we recommend running PAINTOR on each annotation independently.

Example: Pipeline for a pool of 100 annotations for a single population.

$ ./PAINTOR -input input.files -Zhead ZSCORE.P1 -LDname LD1 -in RunDirectory/ -out OutDirectory/ -enumerate 2 -Gname Enrich.Base -Lname BF.Base
$ ./PAINTOR -input input.files -Zhead ZSCORE.P1 -LDname LD1 -in RunDirectory/ -out OutDirectory/ -enumerate 2 -annotations A1 -Gname Enrich.A1 -Lname BF.A1
$ ./PAINTOR -input input.files -Zhead ZSCORE.P1 -LDname LD1 -in RunDirectory/ -out OutDirectory/ -enumerate 2 -annotations A2 -Gname Enrich.A2 -Lname BF.A2
$ ./PAINTOR -input input.files -Zhead ZSCORE.P1 -LDname LD1 -in RunDirectory/ -out OutDirectory/ -enumerate 2 -annotations A3 -Gname Enrich.A3 -Lname BF.A3
.
.
.
$ ./PAINTOR -input input.files -Zhead ZSCORE.P1 -LDname LD1 -in RunDirectory/ -out OutDirectory/ -enumerate 2 -annotations A100 -Gname Enrich.A100 -Lname BF.A100
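Rather than typing out one command per annotation, the same set of runs can be generated with a loop. The sketch below only prints the commands (a dry run), reusing the placeholder names from the example above (A1 ... A100, ZSCORE.P1, LD1, RunDirectory/, OutDirectory/); the function name `marginal_commands` is hypothetical.

```shell
#!/usr/bin/env bash
# Print (do not run) the baseline command plus one command per annotation A1..An.
marginal_commands() {
    local n="$1" i=1
    local common="./PAINTOR -input input.files -Zhead ZSCORE.P1 -LDname LD1 -in RunDirectory/ -out OutDirectory/ -enumerate 2"

    # Baseline model without annotations.
    echo "$common -Gname Enrich.Base -Lname BF.Base"

    # One marginal model per annotation, each with its own output names.
    while [ "$i" -le "$n" ]; do
        echo "$common -annotations A$i -Gname Enrich.A$i -Lname BF.A$i"
        i=$((i + 1))
    done
}
```

`marginal_commands 100 > commands.txt` writes the 101 commands to a file for inspection or for submission to a job scheduler, while `marginal_commands 100 | sh` would execute them one after another.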

After obtaining the marginal output for every annotation, prioritize annotations by their improvement in model fit over the baseline (no-annotation) run. From the top-ranked annotations, select a set (usually no more than 4 or 5) that are roughly uncorrelated with one another; a correlation matrix of the annotations is a convenient way to check this. Then use those annotations in a final model to compute trait-specific posterior probabilities for causality, as in the final run below.
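One way to rank the marginal runs is by how much each one raises the summed log Bayes factor relative to the baseline. The sketch below makes several assumptions that this page does not confirm: each -Lname file holds a single number on its first line, a larger value indicates a better fit, and the files are named BF.Base and BF.A* as in the example pipeline. The function name `rank_annotations` is hypothetical, and the ranking does not address correlation between annotations.

```shell
#!/usr/bin/env bash
# Rank annotations by gain in summed log Bayes factor over the baseline run.
# ASSUMPTIONS: each BF.* file contains one number on its first line, and a
# larger value means a better fit. Output: "<annotation><TAB><gain>", best first.
rank_annotations() {
    local out_dir="$1" base f name
    base=$(awk 'NR == 1 { print $1 }' "$out_dir/BF.Base")

    for f in "$out_dir"/BF.A*; do
        [ -f "$f" ] || continue              # no matching files: nothing to rank
        name=${f##*/BF.}                     # e.g. OutDirectory/BF.A5 -> A5
        awk -v b="$base" -v n="$name" \
            'NR == 1 { printf "%s\t%.4f\n", n, $1 - b }' "$f"
    done | sort -k2,2nr
}
```

`rank_annotations OutDirectory | head -n 10` would then show the ten annotations with the largest gain, as candidates to screen for correlation before the final model.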

Note: it is also possible to do model selection using alternative approaches such as stratified LD-score regression. This has the advantage of learning the relevant functional data from the entire genome rather than only from the significant GWAS risk loci.

Final run to obtain posteriors

$ ./PAINTOR -input input.files -Zhead ZSCORE.P1 -LDname LD1 -in RunDirectory/ -out OutDirectory/ -annotations A5,A20,A93 -Gname Enrich.Final -Lname BF.Final