Skip to content

Latest commit

 

History

History
executable file
·
59 lines (46 loc) · 5.17 KB

Methods.md

File metadata and controls

executable file
·
59 lines (46 loc) · 5.17 KB
layout title group order
default
Methods
in_local_navigation
2

Pipelines

Quick links to the data analysis pipelines:

Methods

Gene expression and splicing quantification

The gene expresson and splicing quantification pipelines is available from the kerimoff/rnaseq GitHub repository. The pipeline is based on the nf-core/rnaseq pipeline that we have modified to include the following quantification methods:

  • gene counts: RNA-seq reads were aligned to the GRCh38 reference genome using HISAT2 and reads overlapping GENCODE v30 transcript annoations were counted using featureCounts.
  • exon counts: DEXSeq was used to convert GENCODE v30 transcript annotations to non-overlapping exon annotations. Reads overlapping exons were counted using featureCounts.
  • transcript usage: Salmon was used to estimate the expression levels of all annotated transcripts in GENCODE v30.
  • transcription event usage: txrevise was used to convert Ensembl 96 transcript annotations to independent transcriptional events.Salmon was used to estimate the expression levels of all transcriptional events.

Molecular trait normalisation and quality control

The molecular trait normalisation and quality control workflow is available from the kerimoff/qtl_norm_qc GitHub repository.

Normalisation

Briefly, we use the following normalisation strategies:

  • gene counts: Conditional quantile normalisation with cqn using gene length and GC content as covariates.
  • exon counts: Conditional quantile normalisation with cqn using exon length and GC content as covariates.
  • transcript usage: Transcript usage is calculated by dividing the transcript expression estimates (TPM units) the total expression of all transcripts of the same gene. Transcript usage values (0...1 scale) are further standardised using inverse normal transformation.
  • transcript event usage: Transcript event usage is calculated by dividing the transcript event expression estimates (TPM units) by the total expression of all events of the same class (promoters, splicing events, 3'end events) within the same gene. Transcript event usage values (0...1 scale) are further standardised using inverse normal transformation.

Genotype imputation and quality control

For most datasets, we performed basic QC on the raw genotypes calls from genotyping arrays after which we imputed the remaining genotypes using Michigan Imputation Server and 1000 Genomes Phase 3 reference panel.

Before imputation

  • Exclude all variants with minor allele frequence (MAF) < 0.01, Hardy-Weinberg equilibrium p-value < 1e-6 and that were missing in more than 5% of the samples.

After imputation

  • Convert all genetic variants to GRCh38 coordinates with CrossMap.
  • Exclude variants with minor allele frequence (MAF) < 0.01 and imputation quality score (R2) < 0.4.
  • Check that the genotypes in the VCF file are concordant with those detected from the RNA-seq data using qtltools mbv command. Correct sample swaps and discard any samples with missing genotypes or high levels of cross-contamination.

Association testing

The association testing pipeline is available from kerimoff/qtlmap GitHub repository. The main analysis steps are:

  • Perform principal component analysis (PCA) of the genotype data with PLINK 1.9.
  • Perfrom PCA analysis of the molecular trait data (prcomp function in R).
  • Perform assocation testing with QTLtools.

In association testing, we use the following parameters:

  • Use first six principal components from the genotype data and first six principal components from the molecular trait data as covariates.
  • Test all variants that are +/- 1Mb from the start of the gene (as defined in Ensembl).

Reference annotatons