This repo contains the scripts needed to run Quality Control (QC) and Normalisation on phenotype count matrices generated by the rnaseq pipeline.
- qtl_norm_qc
- Running the QC
- Running the Normalisation
- Using Software Container to perform QC and Normalisation
To detect outliers we used PCA and MDS. To detect sample mislabeling we used sex-specific gene expression analysis.
PCA is a linear dimension reduction method that aims to capture most of the variance of a multidimensional dataset in a small number of principal components. Plotting the first few components therefore shows most of the variation and reveals any samples that look like obvious outliers. PCA is one of the most commonly used procedures for summarizing a multivariate dataset and detecting outliers in a sample population.
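As a rough illustration of the idea (not the repo's R implementation), the sketch below finds the first principal component of a small made-up matrix using power iteration, in plain Python. The function name `leading_component` and the toy values are assumptions made for this example only.

```python
import math


def leading_component(samples, iterations=100):
    """Return (direction, scores) for the first principal component.

    samples: list of equal-length lists, one row per sample.
    """
    n_samples = len(samples)
    n_features = len(samples[0])

    # Centre every feature (column) on its mean.
    means = [sum(row[j] for row in samples) / n_samples for j in range(n_features)]
    centred = [[row[j] - means[j] for j in range(n_features)] for row in samples]

    # Start from the centred row with the largest norm.
    start = max(centred, key=lambda row: sum(x * x for x in row))
    norm = math.sqrt(sum(x * x for x in start))
    if norm == 0.0:
        raise ValueError("all samples are identical; there is no variance to explain")
    direction = [x / norm for x in start]

    # Power iteration: repeatedly apply X^T X and renormalise.
    for _ in range(iterations):
        scores = [sum(x * d for x, d in zip(row, direction)) for row in centred]
        updated = [
            sum(centred[i][j] * scores[i] for i in range(n_samples))
            for j in range(n_features)
        ]
        norm = math.sqrt(sum(x * x for x in updated))
        direction = [x / norm for x in updated]

    # Project every sample onto the final direction.
    scores = [sum(x * d for x, d in zip(row, direction)) for row in centred]
    return direction, scores


# Toy data (hypothetical): four similar samples and one that is very different.
toy_samples = [
    [1.0, 2.0, 1.5],
    [1.1, 2.1, 1.4],
    [0.9, 1.9, 1.6],
    [1.0, 2.0, 1.5],
    [6.0, 9.0, 0.2],
]
pc1_direction, pc1_scores = leading_component(toy_samples)
most_extreme = max(range(len(pc1_scores)), key=lambda i: abs(pc1_scores[i]))
print("sample with the most extreme PC1 score:", most_extreme)
```

In this toy input the last sample sits far from the others along the first component, which is the kind of pattern a PCA plot makes visible. Real analyses would normally rely on an established PCA routine rather than hand-written power iteration.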
MDS is an exploratory technique used to identify unrecognized dimensions of the dataset (Mugavin, 2008). MDS reduces a multidimensional dataset to relatively simple, easy-to-visualize structures, which helps us identify outliers after plotting and analysing them. In contrast to PCA, MDS is a non-linear dimension reduction method that uses the distances between each pair of samples and forces all of the data into fewer dimensions. We explored MDS outliers of phenotype count matrices by performing hierarchical clustering. TPM (Wagner et al. 2012) values were used on a log2-transformed scale (log2(0.1 + TPM)). Pearson correlation was used as the correlation measure, and the distance between samples was defined as distance = 1 - correlation.
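The transformation and distance described above can be sketched in a few lines of plain Python. This is only an illustration of the two formulas, log2(0.1 + TPM) and distance = 1 - correlation; the function names and the toy TPM values are hypothetical and are not taken from the repo's R scripts.

```python
import math


def log2_tpm(tpm_values, pseudocount=0.1):
    """Apply the log2(0.1 + TPM) transformation to a list of TPM values."""
    return [math.log2(pseudocount + value) for value in tpm_values]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sd_x * sd_y)


def correlation_distances(tpm_by_sample):
    """Return nested dicts with distance = 1 - Pearson correlation."""
    logged = {name: log2_tpm(values) for name, values in tpm_by_sample.items()}
    return {
        a: {b: 1.0 - pearson(logged[a], logged[b]) for b in logged}
        for a in logged
    }


# Toy TPM values for four genes in three samples (hypothetical numbers).
toy_tpm = {
    "a": [0.0, 5.0, 20.0, 100.0],
    "b": [0.1, 6.0, 18.0, 90.0],
    "c": [100.0, 20.0, 5.0, 0.0],
}
distances = correlation_distances(toy_tpm)
for name, row in distances.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

Samples with similar expression profiles end up close to 0, while samples with opposite profiles approach 2. A matrix like this is the kind of input that distance-based methods such as MDS or hierarchical clustering work from.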
We used the isoMDS function from the MASS R package (Cox and Cox 2000; Ripley 2007; Venables and Ripley 2002) to summarize the data into two dimensions (k=2).
We generate a scatter plot with XIST gene (ENSG00000229807, found only in females) expression on the horizontal axis and Y chromosome gene expression (found only in males) on the vertical axis, and set the color of each sample according to its donor’s sex.
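The plot is inspected visually, but the same reasoning can be written down as a simple rule. The sketch below is an assumption-laden illustration, not part of the repo: the field names (`sample_id`, `sex`, `xist`, `chr_y`), the expression units and the `min_expression` threshold are all hypothetical and would need to be chosen per dataset.

```python
def flag_possible_sex_mismatches(samples, min_expression=1.0):
    """Return IDs of samples whose expression pattern disagrees with the recorded sex.

    samples: list of dicts with the hypothetical keys
             'sample_id', 'sex' ('female' or 'male'), 'xist' and 'chr_y'.
    min_expression: hypothetical cut-off separating "expressed" from "not expressed".
    """
    flagged = []
    for sample in samples:
        xist_on = sample["xist"] >= min_expression
        chr_y_on = sample["chr_y"] >= min_expression

        if xist_on and not chr_y_on:
            inferred = "female"
        elif chr_y_on and not xist_on:
            inferred = "male"
        else:
            # Both or neither are expressed: leave for manual inspection.
            inferred = None

        if inferred is not None and inferred != sample["sex"]:
            flagged.append(sample["sample_id"])
    return flagged


# Toy metadata and expression values (hypothetical).
toy_samples = [
    {"sample_id": "S1", "sex": "female", "xist": 6.2, "chr_y": 0.0},
    {"sample_id": "S2", "sex": "male", "xist": 0.1, "chr_y": 5.4},
    {"sample_id": "S3", "sex": "male", "xist": 5.9, "chr_y": 0.2},
    {"sample_id": "S4", "sex": "female", "xist": 0.3, "chr_y": 0.1},
]
mismatches = flag_possible_sex_mismatches(toy_samples)
print("possible mislabeled samples:", mismatches)
```

Ambiguous samples (here S4, with neither signal present) are deliberately not flagged; they are the cases where looking at the scatter plot itself is most useful.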
MBV (Match BAM to VCF) is a quality control method that matches the aligned reads of each sample (BAM files) to the genotyped samples in a VCF file. The script generates the best matches as a tab-separated table and a scatter plot for each sample.
To run featurecounts_qc, clone (download) this GitHub repository to your local machine and navigate into the cloned folder:
git clone https://github.com/kerimoff/qtl_norm_qc.git
cd qtl_norm_qc
The normaliseCountMatrix.R script accepts the following parameters:
- Counts matrix file path. Tab-separated file
- Sample metadata file. Tab-separated file
- Phenotype metadata file. Tab-separated file
- Quantification method. Possible values: gene_counts, leafcutter, txrevise, transcript_usage and exon_counts. Default value: gene_counts
- Path to the output directory. Default value: ./normalised_results/
- Custom name of the study. Optional. By default the study name is extracted from the sample metadata file; this parameter overrides it if provided
- Flag to build plotly HTML plots. Default value: FALSE
- Path to the location of the MBV quantification files. Optional
An example script for running the QC can be found here
- Counts matrix file path. Tab-separated file
- Sample metadata file. Tab-separated file
- Phenotype metadata file. Tab-separated file
- Quantification method. Possible values: gene_counts, leafcutter, txrevise, transcript_usage and exon_counts. Default value: gene_counts
- Path to the output directory. Default value: ./normalised_results/
- Custom name of the study. Optional. By default the study name is extracted from the sample metadata file; this parameter overrides it if provided
An example script for running the Normalisation can be found here
All the required software is containerised in a Docker container and pushed to DockerHub
- Using Docker Container
- Using Singularity Container
All required dependencies are built into the Docker container.
No additional steps are required to use the pre-built container located on DockerHub. When a container is run with the kerimoff/eqtlutils tag, Docker checks the images on the local computer and, if the image does not exist there, automatically tries to pull it from DockerHub.
To execute the script we should first run the container.
docker run -idt -v "$(pwd)":/work_dir -w /work_dir --name qtl_norm_qc_cont kerimoff/eqtlutils bash
This will start our container (named qtl_norm_qc_cont) in detached mode and mount the current directory, qtl_norm_qc, to the /work_dir directory of the running container.
To check your container's status, run
docker ps -a
You will see a running container named qtl_norm_qc_cont (usually in the first row).
To execute the normalisation or QC, run the bash script you created with the bash command inside that container:
docker exec -it qtl_norm_qc_cont bash run_fc_qc.sh
It is straightforward to run the scripts with a Singularity container.
singularity exec -B /path/in/host/:/path/in/container/ docker://kerimoff/eqtlutils bash run_fc_qc.sh
Running this command will automatically download the kerimoff/eqtlutils image from DockerHub and run the run_fc_qc.sh script.
The -B flag binds a path on your host computer to a path inside the container, so make sure that the data to be processed is reachable from inside the container.