Merge pull request #213 from CRI-iAtlas/germline

Germline Analysis module added
CRI-iAtlas · Feb 9, 2021 · 6385e25 · 6385e25
2 parents 7436d39 + 98d7e18
commit 6385e25
Show file tree

Hide file tree

Showing 26 changed files with 1,438 additions and 35 deletions.
diff --git a/configuration.yaml b/configuration.yaml
@@ -16,6 +16,10 @@ module_files:
   - "modules/cytokinenetwork.R"
   - "modules/cellimagemodule.R"
   - "modules/cnvmodule.R"
+  - "modules/germlinemodule.R"
+  - "modules/germlineheritability.R"
+  - "modules/germlinegwas.R"
+  - "modules/germlinerarevariants.R"
   - "modules/ioresponseoverviewmodule.R"
   - "modules/ioresponsesurvivalmodule.R"
   - "modules/ioresponse_distribution_module.R"
@@ -38,13 +42,15 @@ function_files:
   - "functions/mosaicplot.R"
   - "functions/barplot.R"
   - "functions/forestplot.R"
+  - "functions/manhattanplot.R"
   - "functions/subtype_classifier.R"
   - "functions/tablef_fun.R"
   - "functions/imageplot.R"
   - "functions/event_data_utils.R"
   - "functions/extracellnet_utils.R"
   - "functions/cellimage_utils.R"
   - "functions/iomodules_utils.R"
+  - "functions/germline_utils.R"
 
 page_files:
   - "pages/aboutpage.R"

diff --git a/data/MethodsText/germline-colocalization.md b/data/MethodsText/germline-colocalization.md
@@ -0,0 +1,22 @@
+**Title:** Germline Colocalization
+
+**Descriptions:** *Gene expression and splice quantitative trait locus analysis, and Colocalization:* We performed eQTL and sQTL analyses in TCGA and used summary statistics from the GTEx datasets to search for potential candidate genes. We excluded the HLA and IL17RA loci since SNPs at these loci are known eQTLs for genes that are part of the immune trait. For the significant and suggestive SNPs, we tested all genes within +/- 1MB for eQTL and all transcipts within +/- 500KB for sQTL. We used a shorter range for sQTLs with the assumption that SNPs affecting splicing are likely to act at a shorter distance. 
+
+
+*TCGA dataset:* RNA-seq gene expression and splicing data were downloaded from the NIH Genomics Data Commons (https://gdc.cancer.gov/about-data/publications/pancanatlas and https://gdc.cancer.gov/about-data/publications/PanCanAtlas-Splicing-2018 (Kahles et al., 2018). For the sQTL analysis, we considered the following splicing categories: 3’, 5’, exome skipping, intron retention, and mutually exclusive exon events quantified by the Percent Spliced In (PSI) (Kahles et al., 2018). Only splicing events with more than 800 non-missing observations (~10% of the total data) were considered. Association analyses between either gene expression or PSI and the imputed SNPs were performed using linear regression using age, sex, PC1-7, and cancer type as covariates. We calculated FDR for each SNP separately, under the assumption that the SNP was already either significant or suggestive, and thus we had to correct for each of the genes at the locus but not all of the other SNPs (Table S5). We then selected the SNP-gene expression (eQTL) or SNP-gene splicing (sQTL) pairs with FDR p < 0.1 for further colocalization analysis. 
+
+
+*GTEx dataset:* We downloaded all summary statistics for expression quantitative loci (eQTL - GTEx_Analysis_v8_eQTL_all_associations), and splicing quantitative loci (sQTL - GTEx_Analysis_v8_sQTL_all_associations) from GTEx project (https://console.cloud.google.com/storage/browser/gtex-resources) using the results from the latest version of the GTEx database (Version 8). For each SNP that had a genome-wide significant or suggestive association with one of the 33 immune traits by GWAS, we extracted all of the association statistics from the summary statistics for eQTL within +/- 1MB and for sQTLs within +/- 500 KB from all tissues in the GTEx summary statistics dataset. We then calculated FDR for each SNP, correcting for all of the genes at the locus across all tissues as we did for TCGA. For eQTL and/or sQTLs that had FDR p < 0.1, we pursued colocalization as below. TCGA GWAS summary statistics are annotated in Build 37, GTEx QTL summary stats are annotated in Build 38, when appropriate, liftover from Build 38 to 37 are provided using R/Bionconductor packages AnnotationHub (v2.12.1) (AH14150 chain file) and rtracklayer (v1.40.6). In the GTEx summary file (Tale S5) we annotated both Build 37 and Build 38 positions.
+
+
+*Colocalization analysis:* We performed colocalization posterior probability (CLPP) analysis using eCAVIAR (Hormozdiari et al., 2016) on both TCGA and GTEx results. eCAVIAR computes a posterior probability of causality based on association data and LD structure for the eQTL/sQTL and the trait GWAS and then calculates the joint probability of both of these being causal. It requires both summary statistics from GWAS and from the eQTL/sQTL analysis and the LD matrix of SNPs used in both analyses. For TCGA, we began with all SNPs that had FDR p < 0.1 with at least one gene and/or transcript and computed the eQTL and sQTL associations for the surrounding SNPs from the index SNP for that same gene/transcript using the same approach as outlined above. For GTEx, we began with SNP-gene expression or SNP-gene splicing pairs that met our FDR p < 0.1 criteria and extracted the eQTL and sQTL results for the surrounding SNPs from the summary results.  For the GWAS and TCGA analyses, we calculated the genotype correlation (r) at each locus from the genotype data. For the GTEx analysis, we downloaded the individual genotype data from dbGAP for GTEx participants (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v8.p2) and calculated genotypic correlation between SNPs in R. We then ran eCAVIAR separately for each FDR p < 0.1 eQTL and sQTL association from TCGA and GTEx, considering models at each locus that assume one or two causal variants. For each SNP-gene expression or SNP-gene splicing category pair, 200 SNPs (+/- 100 SNPs) around the index SNP were included in colocalization analysis. The CLPP of each SNP being causal was calculated, and also a regional CLPP by summing all 201 SNP CLPPs. We used a posterior probability of > 0.01 to consider plausible colocalization, including both the 1 and 2 locus model and considering the sum of the posterior probability SNPs in the colocalization results. 
+
+
+*Expanded region criteria for colocalization:* Since eCAVIAR identified multiple genes at the same locus for many loci that have plausible colocalization within a +/- 100 SNP boundary, we sought stronger evidence for colocalization at the loci where eCAVIAR found colocalization by examining an expanded region. We reasoned that a gene or transcript that is causal for the immune trait should not be more strongly associated with another SNP in the region that has little or no evidence of association with the immune trait. Therefore, for each gene or splice variant that had plausible colocalization by eCAVIAR, we performed an expanded region search (+/- 1MB for eQTLs and +/- 500KB for sQTLs) to see if we can identify one or more SNPs that had a stronger effect in the eQTL/sQTL analysis in the same tissue/dataset, which we called “counter-evidence” SNPs. If eCAVIAR produced plausible evidence of colocalization (posterior prob>0.01) and we could find no SNPs that met our counter-evidence criteria in the expanded region, we considered the expanded region evidence for colocalization as strong. If we did find SNPs that met our counter-evidence criteria in the expanded region, then we compared the significance level for the eQTL/sQTL association of the counter-evidence SNP vs. the eQTL/sQTL association with index SNP (associated with the immune trait). If the counter-evidence SNP association with eQTL or sQTL had a neg log10 p value that was less or equal than 1.5 higher than the index SNP (GWAS significant SNP for the immune trait), then we considered the expanded region evidence as intermediate. If the difference in -log10 p values was >1.5, we considered the expanded region analysis to be negative.
+ To visualize the colocalization in the expanded region, we generated plots that show the -log10 p QTL vs. -log10 p GWAS for all of the GWAS significant SNPs with CLPP > 0.01.  The plots included the association p values for all of the SNPs  at +/- 1MB for eQTL and at +/- 500KB for sQTL from the gene which had a CLPP > 0.01. These plots are available at Figshare (GTEX expanded region analysis plots: https://doi.org/10.6084/m9.figshare.13089341; ; TCGA expanded region analysis plots: https://doi.org/10.6084/m9.figshare.13090031. . We color-coded these plots with the LD, based on the LD matrix from the TCGA. Counter-SNPs are found in the top left corner of these plots (i.e. strong association with the eQTL or sQTL but no association with the immune trait). Conversely if there were no counter-SNPs, then the strongest SNPs for association with the immune trait were also the strongest SNPs for the association with the eQTL/sQTL. 
+
+
+[Reference Listing](https://doi.org/10.1016/j.immuni.2021.01.011)
+
+**Contributors:** Rosalyn Sayaman, Donglei Hu, Mohamad Saad, Elad Ziv, Davide Bedognetti
+
diff --git a/data/MethodsText/germline-gwas.md b/data/MethodsText/germline-gwas.md
@@ -0,0 +1,8 @@
+**Title:** Germline GWAS
+
+**Descriptions:** We selected each of the immune phenotypes that demonstrated nominally significant genome-wide heritability (N=33) for GWAS. GWAS was conducted on all of the genotyped SNPs that passed QC and all of the imputed SNPs that had imputation R2 > 0.5 and minor allele frequency > 0.005 in the 9,603 unrelated individuals (PLINK 1.9 identity by descent, IBD, pihat < 0.25). Minor allele frequencies were recalculated post-imputation for only the subset of 9,603 individuals (PLINK 1.9). Of the 39,127,678 SNPs available after imputation, 10,955,441 passed both imputation quality and frequency thresholds and thus were included in the association analysis. 
+GWAS was performed using PLINK 1.9. Immune phenotypes that were approximately normally distributed or normally distributed after stratification by covariates were tested for association with SNPs using linear regression with age, tumor type, sex and PC1-7 as covariates. Immune traits that were dichotomized for heritability analyses were analyzed using logistic regression models, with the same covariates. For each GWAS we also calculated the genome-wide inflation coefficient (lambda). We used the traditional cutoff of p < 5x10-8 as a cutoff for genome-wide significance and p < 1x10-6 to denote suggestive loci. Since we only selected the subset of phenotypes that was heritable and since many of the phenotypes were highly correlated, we did not correct the GWAS for the number of phenotypes analyzed. SNPs were annotated based on spanned genomic ranges (R v3.5.0, Bioconductor package GenomicRanges_1.32.6) with rsIDs (R v3.5.0, R package snplist_0.18.1, Bioconductor package SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20) and with genes within +/- 50KB, +/- 500KB, and +/- 1MB of the SNP (R v3.5.0, Bioconductor package biomaRt_2.36.1 using grch37.ensembl.org as host). Variant annotations for all genome-wide and suggestive SNPs were determined using the web interface of the Ensembl Variant Effect Predictor (VEP, https://grch37.ensembl.org/info/docs/tools/vep/index.html. All annotations were based on Homo sapiens (human) genome assembly GRCh37 (hg19) from Genome Reference Consortium. All association statistics for the GWAS are available at Figshare  https://figshare.com/articles/dataset/Sayaman_et_al_TCGA_Germline-Immune_GWAS_Summary_Statistics/13077920. 
+
+[Reference Listing](https://doi.org/10.1016/j.immuni.2021.01.011)
+
+**Contributors:** Rosalyn Sayaman, Elad Ziv
diff --git a/data/MethodsText/germline-heritability.md b/data/MethodsText/germline-heritability.md
@@ -0,0 +1,17 @@
+**Title:** Heritability
+
+**Descriptions:** We performed heritability analysis on 139 traits using a mixed-model approach implemented in genome-wide complex trait analysis (GCTA) genomic-relatedness-based
+restricted maximum-likelihood (GREML) method (Yang et al., 2010, 2011) to calculate the proportion of immune trait variation that is attributable to common genetic variants (Zaitlen and
+Kraft, 2012). Heritability analyses were conducted separately within each ancestry subgroup (NEuropean=7,813, NAfrican=863, NAsian=570, and NAmerican=209 individuals), which were derived from ancestry analysis.
+
+To reduce bias in the heritability estimates, we removed one of each pair of related individuals with Ajk > 0.05 (calculated from SNPs with MAF > 0.01) prior to running GREML. We calculated heritability using an unconstrained approach (allowing heritability to be < 0 - those values were truncated in iAtlas). We used the likelihood ratio test (LRT) implemented in GREML to test if heritability is different than zero for each of the immune traits analyzed and used Benjamini-Hochberg adjustment (Benjamini and Hochberg, 1995) to calculate the FDR. All heritability analyses were run with age, tumor type, sex and PC 1-7 as covariates.
+
+We also used GREML to determine whether there are any contextual factors that interact with genome-wide common variant effects, including the major immune subtypes as determined by Thorsson et al and somatic mutations (divided into tertiles and dichotomized at 10 MB). We implemented the gene x environment (GxE) feature calculation in the European in GREML. For
+those immune traits for which we found nominally significant (p < 0.05) interactions, we calculated heritability in each stratified subset, as well as with immune subtype as an additional
+covariate. For GxE calculations, the LRT tests the significance of the variance of GxE interaction effects.
+
+[Reference Listing](https://doi.org/10.1016/j.immuni.2021.01.011)
+
+**Contributors:** Rosalyn Sayaman, Elad Ziv
+
+
diff --git a/data/MethodsText/germline-rarevariants.md b/data/MethodsText/germline-rarevariants.md
@@ -0,0 +1,12 @@
+**Title:** Rare Variants
+
+**Descriptions:** For rare variant analysis, we focused on well-annotated, germline pathogenic or likely pathogenic cancer predisposition variants as previously defined (allele frequency in 1000 Genomes and ExAC (release r0.3.1) <  0.05%) (Huang et al., 2018).
+Exome files related to samples for which all the covariates (age, imputed sex, PC 1-7, and cancer type) and at least one immune trait was available were retained (N = 9,138). There were 832 pathogenic/likely pathogenic SNPs/Indels events with at least one copy of rare allele in the whole exome sequencing data, corresponding to 586 distinct pathogenic SNPs/Indels mapping to 99 genes.
+
+We performed a pathway burden analysis using selected pre-defined biological pathways such as DNA damage repair and onco-genic processes, pan-cancer and per cancer (Bailey et al., 2018; Huang et al., 2018; Knijnenburg et al., 2018). These pathways were manually curated to generate a list of mutually exclusive pathways. The only genes that were not collapsed into pathways were BRCA1 and BRCA2 for which a sufficient number of events across cancers exist. Overall, 21 genotypic variables were used for analyses (Figure S7A). In the pan-cancer analysis, we only included genes (BRCA1/BRCA2) or pathways with a number of events (mutations) greater than 4 across cancers, including a total of 90 genes. For each pathway, variants that fall within its selected set of genes were collapsed based on the presence or absence of any rare variant (i.e., 0 if no rare variant was present and 1 if there is at least one variant). We conducted regression analyses (linear or logistic, as done for GWAS) to assess the association between the pathways’ burden of rare variants and immune traits. Traits assessed in these analyses were the same as the ones used for heritability analyses, with the addition of the immune subtypes (C1, C2, etc.), DNA-alteration related metrics such as the mutational load, the neoantigen load, the degree of copy number alterations (Thorsson et al., 2018) and the MANTIS score (threshold = 0.4, Middha et al., 2017). All pan-cancer regression models included the following covariates: age, sex, cancer type, and PC1-7. 
+
+In the pan-cancer analysis, we used a Benjamini-Hochberg FDR (Benjamini and Hochberg, 1995) to correct for multiple hypothesis testing, accounting for all 21 genes and pathways tested and 154 phenotypes (139 immune traits, 9 DNA related metrics, and 6 immune subtypes). We used a cutoff of FDR p < 0.1 to identify significant gene/pathway-immune trait associations and a threshold of nominal p < 0.005 (FDR <= 0.25) to identify suggestive associations. We used a more permissive cut-off in these analyses than the ones used in the heritability and GWAS to reduce type II error due to the low number of events (germline mutations). 
+
+[Reference Listing](https://doi.org/10.1016/j.immuni.2021.01.011)
+
+**Contributors:** Mohamad Saad, Jessica Roelands, Davide Bedognetti
diff --git a/data/germline/colocalization_GTEX_df.feather b/data/germline/colocalization_GTEX_df.feather
diff --git a/data/germline/colocalization_TCGA_df.feather b/data/germline/colocalization_TCGA_df.feather
diff --git a/data/germline/germline_gwas.feather b/data/germline/germline_gwas.feather
diff --git a/data/germline/germline_heritability.feather b/data/germline/germline_heritability.feather
diff --git a/data/germline/germline_rare_variants.feather b/data/germline/germline_rare_variants.feather
diff --git a/data/markdown/germline_heritability.markdown b/data/markdown/germline_heritability.markdown
@@ -0,0 +1,7 @@
+Explore the percentage of variance of immune phenotype traits explained by common genetic variance across different ancestry groups.
+
+Heritability analyses were performed using genomic-relatedness-based restricted maximum-likelihood (GREML) and provide estimates of the proportion of phenotypic variance explained by the genetic variance, V(Genotype)/Vp. The analyses were conducted separately within each ancestral subgroup, which were derived from ancestry analysis using the genotype data. Each immune trait is assigned to a particular Category and a particular Module, as described in the manuscript.
+
+Select a subset criteria (*Subset by*) and a group of interest (*Show associated results for*) to generate a bar plot summarizing the V(Genotype)/Vp for the immune traits in that group  that have p-values lower than the selected p-value threshold. For a visualization of a single phenotype across different ancestry clusters, select 'Immune Feature' in *Subset by* and the trait of interest in the *Show associated results for* menu. 
+
+For the European ancestry cluster, it is also possible to visualize the percentage of variance of immune traits accounted for by interaction between germline genotypes and immune subtypes (G x Immune Subtype).
diff --git a/data/markdown/tilmap.markdown b/data/markdown/tilmap.markdown
@@ -1,6 +1,6 @@
 This module relates to the manuscript:
 [Saltz et al. Spatial Organization And Molecular Correlation Of Tumor-Infiltrating Lymphocytes Using Deep Learning On Pathology Images; Cell Reports 23, 181-193, April 3 2018.]
-(https://www.cell.com/cell-reports/fulltext/S2211-1247(18)30447-9)
+(https://www.cell.com/cell-reports/fulltext/S2211-1247(18\)30447-9)
 
 TCGA H&E digital pathology images were analyzed for tumor infiltrating lymphocytes (TILs) using deep learning.
 The analysis identifies small spatial regions on the slide image - patches - that are rich in TILs. 

diff --git a/functions/barplot.R b/functions/barplot.R
@@ -53,3 +53,67 @@ create_barplot <- function(
 
 }
 
+create_barplot_horizontal <- function(df,
+                                      x_col = "x",
+                                      y_col = "y",
+                                      error_col = "error",
+                                      key_col = NA,
+                                      color_col = NA,
+                                      label_col = NA,
+                                      order_by = NULL,
+                                      xlab = "",
+                                      ylab = "",
+                                      title = "",
+                                      showLegend = TRUE,
+                                      legendTitle = " ",
+                                      source_name = NULL,
+                                      bar_colors = NULL){
+  if(is.na(key_col)) key_col <- x_col
+  if(is.na(color_col)) color_col <- x_col
+  if(is.na(label_col)) label_col <- x_col
+
+  if (is.null(bar_colors)) {
+    bar_colors <- viridis::viridis_pal(option = "D")(dplyr::n_distinct(df[[color_col]]))
+  }
+  wrapr::let(
+    alias = c(
+      X = x_col,
+      Y = y_col,
+      KEY = key_col,
+      COLOR = color_col,
+      ERROR = error_col,
+      LABEL = label_col),
+    plotly::plot_ly(
+      df,
+      x = ~X,
+      y = ~Y,
+      color = ~COLOR,
+      text = ~LABEL,
+      key = ~KEY,
+      type = 'bar',
+      source = source_name,
+      colors = bar_colors,
+      orientation = 'h',
+      error_x = list(
+        visible = T,
+        type = "data",
+        array = ~ERROR,
+        color = 'black',
+        thickness = 1),
+      hoverinfo = 'text'
+    )) %>%
+    plotly::layout(
+      title = title,
+      xaxis = list(title = xlab),
+      yaxis = list(title = ylab,
+                   categoryorder = "array",
+                   categoryarray = ~order_by),
+      showlegend = showLegend,
+      legend = list(orientation = 'h', x = 0, y = 1,
+                    title = list(text = legendTitle)
+      )
+    ) %>%
+    format_plotly()
+
+}
+