layout | title | group | order |
---|---|---|---|
default |
Methods |
in_local_navigation |
2 |
- RNA-seq quantification: kerimoff/rnaseq
- Normalisation and QC: kerimoff/qtl_norm_qc
- Genotype QC: kauralasoo/genotype_qc
- Association testing: kerimoff/qtlmap
The gene expresson and splicing quantification pipelines is available from the kerimoff/rnaseq GitHub repository. The pipeline is based on the nf-core/rnaseq pipeline that we have modified to include the following quantification methods:
- gene counts: RNA-seq reads were aligned to the GRCh38 reference genome using HISAT2 and reads overlapping GENCODE v30 transcript annoations were counted using featureCounts.
- exon counts: DEXSeq was used to convert GENCODE v30 transcript annotations to non-overlapping exon annotations. Reads overlapping exons were counted using featureCounts.
- transcript usage: Salmon was used to estimate the expression levels of all annotated transcripts in GENCODE v30.
- transcription event usage: txrevise was used to convert Ensembl 96 transcript annotations to independent transcriptional events.Salmon was used to estimate the expression levels of all transcriptional events.
The molecular trait normalisation and quality control workflow is available from the kerimoff/qtl_norm_qc GitHub repository.
Briefly, we use the following normalisation strategies:
- gene counts: Conditional quantile normalisation with cqn using gene length and GC content as covariates.
- exon counts: Conditional quantile normalisation with cqn using exon length and GC content as covariates.
- transcript usage: Transcript usage is calculated by dividing the transcript expression estimates (TPM units) the total expression of all transcripts of the same gene. Transcript usage values (0...1 scale) are further standardised using inverse normal transformation.
- transcript event usage: Transcript event usage is calculated by dividing the transcript event expression estimates (TPM units) by the total expression of all events of the same class (promoters, splicing events, 3'end events) within the same gene. Transcript event usage values (0...1 scale) are further standardised using inverse normal transformation.
For most datasets, we performed basic QC on the raw genotypes calls from genotyping arrays after which we imputed the remaining genotypes using Michigan Imputation Server and 1000 Genomes Phase 3 reference panel.
- Exclude all variants with minor allele frequence (MAF) < 0.01, Hardy-Weinberg equilibrium p-value < 1e-6 and that were missing in more than 5% of the samples.
- Convert all genetic variants to GRCh38 coordinates with CrossMap.
- Exclude variants with minor allele frequence (MAF) < 0.01 and imputation quality score (R2) < 0.4.
- Check that the genotypes in the VCF file are concordant with those detected from the RNA-seq data using qtltools mbv command. Correct sample swaps and discard any samples with missing genotypes or high levels of cross-contamination.
The association testing pipeline is available from kerimoff/qtlmap GitHub repository. The main analysis steps are:
- Perform principal component analysis (PCA) of the genotype data with PLINK 1.9.
- Perfrom PCA analysis of the molecular trait data (prcomp function in R).
- Perform assocation testing with QTLtools.
In association testing, we use the following parameters:
- Use first six principal components from the genotype data and first six principal components from the molecular trait data as covariates.
- Test all variants that are +/- 1Mb from the start of the gene (as defined in Ensembl).