This repo contains the scripts needed to run Quality Control (QC) and Normalisation on phenotype count matrices generated by the rnaseq pipeline.

eQTL-Catalogue/qtl_norm_qc

qtl_norm_qc

QC methods

To detect outliers we used PCA and MDS. To detect sample mislabeling we used sex-specific gene expression analysis.

PCA

PCA is a linear dimension reduction method that concentrates most of the variance of a multidimensional dataset in the first few principal components. This makes it possible to plot most of the variation and see whether any samples in the dataset are obvious outliers. PCA is one of the most commonly used procedures for summarizing a multivariate dataset and detecting outliers in a sample population.

MDS

MDS is an exploratory technique used to identify unrecognized dimensions of the dataset (Mugavin, 2008). MDS reduces a multidimensional dataset to relatively simple, easy-to-visualize structures, which helps us identify outliers once the result is plotted and analysed. In contrast to PCA, MDS is a non-linear dimension reduction method that uses the distances between each pair of samples and forces all of the data into fewer dimensions. We explored MDS outliers of phenotype count matrices by performing hierarchical clustering. TPM (Wagner et al. 2012) values were used on a log2-transformed scale (log2(0.1 + TPM)). Pearson correlation was used as the correlation measure, and the distance between samples was defined as distance = 1 - correlation.
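The transformation and distance described above can be written out explicitly. The notation here is an assumption made for illustration and is not taken from the repository: g indexes phenotypes (for example genes), i and j index samples, and the correlation is taken across phenotypes for each pair of samples.

```latex
x_{gi} = \log_2\left(0.1 + \mathrm{TPM}_{gi}\right)

r_{ij} = \frac{\sum_{g} (x_{gi} - \bar{x}_{i})(x_{gj} - \bar{x}_{j})}
              {\sqrt{\sum_{g} (x_{gi} - \bar{x}_{i})^{2}}\,\sqrt{\sum_{g} (x_{gj} - \bar{x}_{j})^{2}}}

d_{ij} = 1 - r_{ij}
```

Under this definition two samples with a Pearson correlation of 0.95 are at distance 0.05, identical profiles are at distance 0, and the distance can never exceed 2.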

We used the isoMDS function from the MASS R package (Cox and Cox 2000; Ripley 2007; Venables and Ripley 2002) to summarize the data in two dimensions (k=2).

Sex-specific Gene Expression

We generate a scatter plot with XIST gene (ENSG00000229807, found only in females) expression on the horizontal axis and Y chromosome gene expression (found only in males) on the vertical axis, and set the color of each sample according to its donor’s sex. A sample whose expression pattern disagrees with its recorded sex is a candidate for mislabeling.

MBV

MBV (Match BAM to VCF) is a quality control method that matches the aligned reads of each sample (BAM files) to the genotyped samples in a VCF file. The script outputs the best matches as a tab-separated table and a scatter plot for each sample.

Running the QC

To run featurecounts_qc, clone this GitHub repository to your local machine and change into the cloned folder:

git clone https://github.com/kerimoff/qtl_norm_qc.git
cd qtl_norm_qc

The normaliseCountMatrix.R script accepts the following parameters:

Mandatory QC parameters

--count_matrix or -c

Counts matrix file path. Tab separated file

--sample_meta or -s

Sample metadata file. Tab separated file

--phenotype_meta or -p

Phenotype metadata file. Tab separated file

Optional QC Parameters

--quant_method or -q

Quantification method. Possible values: gene_counts, leafcutter, txrevise, transcript_usage and exon_counts

Default Value: gene_counts

--outdir or -o

Path to the output directory

Default Value: ./normalised_results/

--name_of_study or -n

Custom name of the study. Optional. By default the study name is extracted from the sample metadata file; if this parameter is provided, it overrides that name.

--build_html

Flag to build plotly HTML plots

Default Value: FALSE

--mbvdir or -m

Path to the location where MBV quantification files are. Optional

An example QC run script can be found here

Running the Normalisation

Mandatory Normalisation parameters

--count_matrix or -c

Counts matrix file path. Tab separated file

--sample_meta or -s

Sample metadata file. Tab separated file

--phenotype_meta or -p

Phenotype metadata file. Tab separated file

Optional Normalisation Parameters

--quant_method or -q

Quantification method. Possible values: gene_counts, leafcutter, txrevise, transcript_usage and exon_counts

Default Value: gene_counts

--outdir or -o

Path to the output directory

Default Value: ./normalised_results/

--name_of_study or -n

Custom name of the study. Optional. By default the study name is extracted from the sample metadata file; if this parameter is provided, it overrides that name.

An example Normalisation run script can be found here
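As an illustration of what a normalisation run script could look like, here is a minimal bash sketch built only from the parameters listed above. It rests on several assumptions that are not confirmed by this README: that normaliseCountMatrix.R is the entry point and sits in the current directory, that Rscript is on the PATH, and that the three input paths (hypothetical placeholders) are replaced with your own tab-separated files. By default it only prints the command so it can be inspected first.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical input paths: replace with your own tab-separated files.
COUNT_MATRIX="${COUNT_MATRIX:-data/counts.tsv}"
SAMPLE_META="${SAMPLE_META:-data/sample_metadata.tsv}"
PHENOTYPE_META="${PHENOTYPE_META:-data/phenotype_metadata.tsv}"

# Assumption: normaliseCountMatrix.R is in the current directory.
# The flags and default values are the ones documented in this README.
cmd=(Rscript normaliseCountMatrix.R
  --count_matrix "$COUNT_MATRIX"
  --sample_meta "$SAMPLE_META"
  --phenotype_meta "$PHENOTYPE_META"
  --quant_method gene_counts
  --outdir ./normalised_results/)

# Print the command by default; set DRY_RUN=0 to actually execute it.
if [ "${DRY_RUN:-1}" = "1" ]; then
  printf '%s\n' "${cmd[*]}"
else
  "${cmd[@]}"
fi
```

Saved under a name of your choice (for example run_norm.sh, a hypothetical name), it can be previewed with `bash run_norm.sh` and executed with `DRY_RUN=0 bash run_norm.sh`.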

Using Software Container to perform QC and Normalisation

All the software needed is packaged in a Docker container and pushed to DockerHub.

Using Docker Container

All required dependencies are built into the Docker container.

Using the ready-to-use container (DockerHub)

No additional steps are required to use the pre-built container from DockerHub. When a container is run from the kerimoff/eqtlutils image, Docker checks whether the image already exists on the local computer and, if it does not, automatically tries to pull it from DockerHub.

Executing the script with Docker container

To execute the script we should first run the container.

docker run -idt -v "$(pwd)":/work_dir -w /work_dir --name qtl_norm_qc_cont kerimoff/eqtlutils bash

This will start our container (named qtl_norm_qc_cont) in detached mode and mount the current directory (qtl_norm_qc) to the /work_dir directory of the running container.

To check your container's status you can run

docker ps -a

You will see that there is a running container with the qtl_norm_qc_cont name (usually in first row)

To run the normalisation or QC, execute the bash script you created with the bash command inside the container:

docker exec -it qtl_norm_qc_cont bash run_fc_qc.sh

Using Singularity Container

It is straightforward to run the scripts with a Singularity container.

singularity exec -B /path/in/host/:/path/in/container/ docker://kerimoff/eqtlutils bash run_fc_qc.sh

Running this command will automatically download the kerimoff/eqtlutils image from DockerHub and run the run_fc_qc.sh script.

The -B flag binds a path on your host computer to a path inside the container, so make sure the data to be processed is reachable from the container.
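One way to check reachability before launching the container is to confirm that each input file lives under the host directory you intend to bind. The helper below is a hypothetical sketch, not part of this repository; is_under_bind is an invented name, and the check only covers the host side (paths used inside the script must still refer to the container side of the bind).

```shell
# Succeeds if FILE is located inside HOST_DIR, after resolving symlinks
# in both directory paths. Hypothetical helper, not part of this repository.
# Usage: is_under_bind HOST_DIR FILE
is_under_bind() {
  local host_dir file_dir
  host_dir="$(cd "$1" && pwd -P)" || return 1
  file_dir="$(cd "$(dirname "$2")" && pwd -P)" || return 1
  # Strip a trailing slash so that HOST_DIR=/ is handled as well.
  host_dir="${host_dir%/}"
  case "$file_dir/" in
    "$host_dir"/*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_under_bind /path/in/host /path/in/host/data/counts.tsv || echo "file is outside the bound directory"` prints the warning only when the file would not be visible through that bind.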
