-
Notifications
You must be signed in to change notification settings - Fork 0
Home
The aim of the project is to duplicate certain parts of the paper by Zhang et. al.
-
Input: fastq.gz Output: fastq.gz
- FastQC for Illumina reads. (Remove parts of reads with low quality or adapters still present on the reads. ) The following steps were redundant because the data was already clean.
- (Preprocessing of Illumina HiSeq 2500 reads for removing low quality base-calls and adapters Trimmomatic (50 min/file) for Illumina reads)
- (Redo FastQC in order to ascertain that the eventual problems previously present have been solved.)
-
Input: fastq.gz Output: fastq.gz
(IGV for visualization and sanity checking of the genomes)
-
Long PacBio RSII SMRT reads are assembled (Canu 4,5 h).
-
Evaluation of PacBio and Illumina&NanoPore assemblies (MUMmerplot 5 min (reference genome comparison))
-
- Prokka (5 min) combines both structural and functional annotation of prokaryotic genomes by predicting and identifying genetic elements encoded in the genome.
- eggNOGmapper (functional annotation of already predicted sequences)
-
FastQC for reads quality control
Input: fastq.gz Output: fastq.gz
-
Preprocessing of 100 bp paired end Illumina HiSeq 2500 reads for removing low quality base-calls and adapters
Trimmomatic (50 min/file) for Illumina reads (For RNA_raw_data)
-
FastQC for reads quality control in order to ascertain that the preprocessing was successful
- Mapping the RNA-seq reads with BWA to the reference genome (SAMTools)
Output: Linked to SAMTools in order to get compressed BAM file
- Read counting with HTSeq (Python package) i.e. how many reads map to different genomic features (i.e. genes) or to specific contigs.
- Comparing the amount of reads between transcriptome in rich medium and in human heat-inactivated serum using DESeq2 (R package included in Bioconductor)
-
Identify essential genes for growth in human serum based on the Tn-Seq data-analysis.
-
Assembly of Illumina reads together with NanoPore (MinION) (SPAdes assembler 2h)
(Note: 1 kmer size, don't use --careful option and error correction)
-
Produce a ML phylogenetic tree
-
Evaluate antibiotic resistance potential by finding resistance genes with ResFinder
-
Find SNPs by performing variant calling
-
Assembly evaluation with QUAST
-
Plasmid identification.
-
Identifying possible base-calling and assembly errors by aligning short reads to assembled contigs
-
Implement a workflow with Snakemake. [1](#footnote1)
1: https://snakemake.readthedocs.io/en/stable/ (Not sure if this is feasible in the timeframe allocated to this course. )
Made symbolic links to subreads from PacBio. They were named:
PacBio_subread_1.fastq.gz PacBio_subread_2.fastq.gz PacBio_subread_3.fastq.gz PacBio_subread_4.fastq.gz PacBio_subread_5.fastq.gz PacBio_subread_6.fastq.gz
Made also symbolic links to RNA-seq raw data from Serum replicate 1 forward and reverse strands named:
RNA-Seq_Serum_replicate_1_forward.fastq.gz RNA-Seq_Serum_replicate_1_reverse.fastq.gz
The RNA-seq quality control was performed by logging in to a working node:
salloc -A g2018003 -p core -n 2 -t 02:30:00 --reservation=g2018003_6
and then performing the actual analysis with FastQC:
module load bioinfo-tools
module load FastQC
fastqc ./RNA-Seq_Serum_replicate_1_forward.fastq.gz
fastqc ./RNA-Seq_Serum_replicate_1_reverse.fastq.gz
The analysis resulted with forward and reverse reads summary quality reports.
The shell script used for this was pacbio_assembly.sh.
The procedure was similar to the previous quality control. It was run on a working node in commandline:
module load bioinfo-tools
module load FastQC
fastqc WGS_forward.fq.gz
fastqc WGS_reverse.fq.gz
The results on forward and reverse Illumina HiSeq 2500 data showed that no pre-processing was necessary.
The following sbatch script was run in spades to carry out the genome assembly: illumina_nanopore_assembly.sh.
As FastQC tends to yield skewed results with RNA-Seq data Trimmomatic 0.36 was run with default parameter values and ILLUMINACLIP with default values. The sbatch script for running it was rna-seq-trim_1.sh.
fastqc RNA-Seq_trim_paired_1.fg.gz
fastqc RNA-Seq_trim_paired_2.fg.gz
fastqc RNA-Seq_trim_unpaired_1.fg.gz
fastqc RNA-Seq_trim_unpaired_2.fg.gz
As a result some of the issues detected in the previous quality check were corrected as can be seen in the reads in which remained both forward and reverse reads survived the processing and the unpaired ones in which only the forward or the reverse read survived.
The sbatch script mummerplot_assembly_only_chromosomal.sh was utilised for evaluating the assembly of PacBio reads. In the script the query sequence is only the chromosome. This graph was produced by the script.
Figure 1. Here we can see that only short sequences in the reverse strand matched with the reference genome.
This script was used when applying the correction.
The resulting aln-pe.sam file was compressed to a .bam with SAMtools:
samtools view -bt ref.fa.fai aln-pe.sam > aln-pb.bam
samtools sort -T tmp -o aln-pb.sorted.bam aln-pb.bam
samtools index aln-pb.sorted.bam
The corrected Canu reads were assembled into a genome with pilon script.
For evaluating the corrected PacBio contigs quast_PacBio_corrected_Illumina.sh was run. The resulting quality report in figure 2 showed that the new assembly (denoted by pilon.fasta) was superior to the canu assembly in most parameters.
Figure 2. Quast analysis summary of with Illumina reads corrected PacBio assembly. We can see that the new corrected assembly is superior to the assembly produced by Canu.
In addition Mummer and mummerplot script was run on the assembled chromosome. Figure 3 resulted from the run.
Synteny was performed by first aligning one by one E. Faecium chromosomal genome to E. Faecalis and E. Casseliflavus after which the full Genbank genome files of E. Faecalis and E. Casseliflavus were uploaded with the alignments produced by CLI BLASTn in the previous stage.
formatdb -i efaecium_assemble_genome_chromosome.fasta -p F
blastall -i Enterococcus_casseliflavus_ec20.fasta -d pilon.fasta -o alignment_efaecium_e_casseliflavus_e20 -p blastn -m 8
blastall -i Enterococcus_faecalis_V583_ref_genome.fna -d pilon.fasta -o alignment_efaecium_e_faecalis_v583 -p blastn -m 8
The resulting synteny is depicted in Figure 3.
Figure 3. The synteny between E Casseliflavus (above), E. Faecium (middle) and E. Faecalis (below).
The functional and structural annotation were performed first with Prokka/1.12-12547ca and then with eggNOG-mapper. In eggNOG-mapper the bacteria were chosen to be prioritised both in HMM database and in Taxonomic Scope and the default values in prioritising coverage were used. The fasta input file to eggNOG-mapper was one of the output files from Prokka.
The RNA-seq reads were mapped to the assembled genome with BWA/0.7.17 and then sorted with samtools/1.6 using 6 scripts (one for each replicate).
The RNA-seq replicates from both BHI and Human Serum were listed by genes with htseq 0.9.1. This resulted in following reads counts.
Differential expression analysis between the two conditions for bacteria: BHI and Human Serum was performed with a script utilising deseq2 R-package. Figure 4 illustrates the 20 most differentially expressed genes (with their gene ID:s).
Figure 4. 20 most differentially expressed genes in BHI and human blood serum. Column Serum_Replicate_3 was dropped because it contained only zeros.
A simple blast search could be performed in Nucleotide BLAST. 6 plasmids could be identified. In the following table are presented plasmids which were identified from the Illumina corrected Canu assembly.
File name | Plasmid name | E-value | Q coverage |
---|---|---|---|
2.fasta | Enterococcus faecium strain E745 plasmid pl2 | 0.0 | 100% |
3.fasta | Enterococcus faecium strain E745 plasmid pl5 | 0.0 | 99% |
4.fasta | Enterococcus faecium strain E745 plasmid pl4 | 0.0 | 98% |
5.fasta | Enterococcus faecium strain E745 plasmid pl6 | 0.0 | 99% |
6.fasta | Enterococcus faecium strain E745 plasmid pl3 | 0.0 | 99% |
7_8.fasta | Enterococcus faecium strain E745 plasmid pl1 | 0.0 | 99% |
Table 1. The names of contigs from the corrected assembly, the names of identified plasmids and the alignment e-values and coverages of the alignments.