Home

Genome analysis - project wiki

Project plan

Aims

The aim of the project is to duplicate certain parts of the paper by Zhang et. al.

Execution

Basic analyses

Reads quality control

Input: fastq.gz Output: fastq.gz

FastQC for Illumina reads. (Remove parts of reads with low quality or adapters still present on the reads. ) The following steps were redundant because the data was already clean.
(Preprocessing of Illumina HiSeq 2500 reads for removing low quality base-calls and adapters Trimmomatic (50 min/file) for Illumina reads)
(Redo FastQC in order to ascertain that the eventual problems previously present have been solved.)

Genome sequencing - DNA assembly

Input: fastq.gz Output: fastq.gz

(IGV for visualization and sanity checking of the genomes)

Long PacBio RSII SMRT reads are assembled (Canu 4,5 h).
Evaluation of PacBio and Illumina&NanoPore assemblies (MUMmerplot 5 min (reference genome comparison))

Annotation of the genome
- Prokka (5 min) combines both structural and functional annotation of prokaryotic genomes by predicting and identifying genetic elements encoded in the genome.
- eggNOGmapper (functional annotation of already predicted sequences)
Synteny comparison with a closely related genome of E. faecalis
RNA-seq

FastQC for reads quality control

Input: fastq.gz Output: fastq.gz
Preprocessing of 100 bp paired end Illumina HiSeq 2500 reads for removing low quality base-calls and adapters

Trimmomatic (50 min/file) for Illumina reads (For RNA_raw_data)
FastQC for reads quality control in order to ascertain that the preprocessing was successful

Alignment of the reads from transcriptome to the genome

Mapping the RNA-seq reads with BWA to the reference genome (SAMTools)

Output: Linked to SAMTools in order to get compressed BAM file

Comparison of transcriptome in rich medium and in human heat-inactivated serum (Differential expression analysis)

Read counting with HTSeq (Python package) i.e. how many reads map to different genomic features (i.e. genes) or to specific contigs.
Comparing the amount of reads between transcriptome in rich medium and in human heat-inactivated serum using DESeq2 (R package included in Bioconductor)

Extra analyses

Identify essential genes for growth in human serum based on the Tn-Seq data-analysis.
Assembly of Illumina reads together with NanoPore (MinION) (SPAdes assembler 2h)

(Note: 1 kmer size, don't use --careful option and error correction)
Produce a ML phylogenetic tree
Evaluate antibiotic resistance potential by finding resistance genes with ResFinder
Find SNPs by performing variant calling
Assembly evaluation with QUAST
Plasmid identification.
Identifying possible base-calling and assembly errors by aligning short reads to assembled contigs
Implement a workflow with Snakemake. [1](#footnote1)

Data organization

dataorganisation3

1: https://snakemake.readthedocs.io/en/stable/ (Not sure if this is feasible in the timeframe allocated to this course. )

Reporting

19/4 2018

Symbolic links

Made symbolic links to subreads from PacBio. They were named:

PacBio_subread_1.fastq.gz PacBio_subread_2.fastq.gz PacBio_subread_3.fastq.gz PacBio_subread_4.fastq.gz PacBio_subread_5.fastq.gz PacBio_subread_6.fastq.gz

Made also symbolic links to RNA-seq raw data from Serum replicate 1 forward and reverse strands named:

RNA-Seq_Serum_replicate_1_forward.fastq.gz RNA-Seq_Serum_replicate_1_reverse.fastq.gz

Running quality control on RNA-seq raw data

The RNA-seq quality control was performed by logging in to a working node:

salloc -A g2018003 -p core -n 2 -t 02:30:00 --reservation=g2018003_6

and then performing the actual analysis with FastQC:

module load bioinfo-tools 
module load FastQC
fastqc ./RNA-Seq_Serum_replicate_1_forward.fastq.gz 
fastqc ./RNA-Seq_Serum_replicate_1_reverse.fastq.gz

The analysis resulted with forward and reverse reads summary quality reports.

21/4 2018

Running genome assembly for PacBio data

The shell script used for this was pacbio_assembly.sh.

25/4 2018

Running quality control on Illumina DNA data

The procedure was similar to the previous quality control. It was run on a working node in commandline:

module load bioinfo-tools 
module load FastQC
fastqc WGS_forward.fq.gz 
fastqc WGS_reverse.fq.gz

The results on forward and reverse Illumina HiSeq 2500 data showed that no pre-processing was necessary.

27/4 2018

Running genome assembly for Illumina and Oxford NanoPore (Minion) reads

The following sbatch script was run in spades to carry out the genome assembly: illumina_nanopore_assembly.sh.

Preprocessing the RNA-Seq raw data

As FastQC tends to yield skewed results with RNA-Seq data Trimmomatic 0.36 was run with default parameter values and ILLUMINACLIP with default values. The sbatch script for running it was rna-seq-trim_1.sh.

9/5 2018

Running a new quality check on Trimmomatic 0.36 preprocessed RNA-Seq raw data with FastQC

fastqc RNA-Seq_trim_paired_1.fg.gz 
fastqc RNA-Seq_trim_paired_2.fg.gz 
fastqc RNA-Seq_trim_unpaired_1.fg.gz 
fastqc RNA-Seq_trim_unpaired_2.fg.gz

As a result some of the issues detected in the previous quality check were corrected as can be seen in the reads in which remained both forward and reverse reads survived the processing and the unpaired ones in which only the forward or the reverse read survived.

Running assembly evaluation with MUMmer/3.9.4alpha and Mummerplot

The sbatch script mummerplot_assembly_only_chromosomal.sh was utilised for evaluating the assembly of PacBio reads. In the script the query sequence is only the chromosome. This graph was produced by the script.

Figure 1. Here we can see that only short sequences in the reverse strand matched with the reference genome.

14/5 2018

Correcting assembled PacBio contigs with Illumina reads using BWA (version 0.7.17)

This script was used when applying the correction.

The resulting aln-pe.sam file was compressed to a .bam with SAMtools:

samtools view -bt ref.fa.fai aln-pe.sam > aln-pb.bam
samtools sort -T tmp -o aln-pb.sorted.bam aln-pb.bam
samtools index aln-pb.sorted.bam

16/5 2018

Re-Assembly with pilon

The corrected Canu reads were assembled into a genome with pilon script.

Running a quality evaluation on the Illumina corrected PacBio assembly

For evaluating the corrected PacBio contigs quast_PacBio_corrected_Illumina.sh was run. The resulting quality report in figure 2 showed that the new assembly (denoted by pilon.fasta) was superior to the canu assembly in most parameters.

Figure 2. Quast analysis summary of with Illumina reads corrected PacBio assembly. We can see that the new corrected assembly is superior to the assembly produced by Canu.

In addition Mummer and mummerplot script was run on the assembled chromosome. Figure 3 resulted from the run.

Synteny of E. Faecium, E. Casseliflavus and E. Faecalis

Synteny was performed by first aligning one by one E. Faecium chromosomal genome to E. Faecalis and E. Casseliflavus after which the full Genbank genome files of E. Faecalis and E. Casseliflavus were uploaded with the alignments produced by CLI BLASTn in the previous stage.

formatdb -i efaecium_assemble_genome_chromosome.fasta -p F
blastall -i Enterococcus_casseliflavus_ec20.fasta -d pilon.fasta -o alignment_efaecium_e_casseliflavus_e20 -p blastn -m 8

blastall -i Enterococcus_faecalis_V583_ref_genome.fna -d pilon.fasta -o alignment_efaecium_e_faecalis_v583 -p blastn -m 8

The resulting synteny is depicted in Figure 3.

Figure 3. The synteny between E Casseliflavus (above), E. Faecium (middle) and E. Faecalis (below).

17/5

Functional and structural annotation

The functional and structural annotation were performed first with Prokka/1.12-12547ca and then with eggNOG-mapper. In eggNOG-mapper the bacteria were chosen to be prioritised both in HMM database and in Taxonomic Scope and the default values in prioritising coverage were used. The fasta input file to eggNOG-mapper was one of the output files from Prokka.

23/5

RNA-seq reads mapping of each replicate into the genome

The RNA-seq reads were mapped to the assembled genome with BWA/0.7.17 and then sorted with samtools/1.6 using 6 scripts (one for each replicate).

RNA-seq replicates reads listing by genes

The RNA-seq replicates from both BHI and Human Serum were listed by genes with htseq 0.9.1. This resulted in following reads counts.

RNA-seq reads differential expression analysis

Differential expression analysis between the two conditions for bacteria: BHI and Human Serum was performed with a script utilising deseq2 R-package. Figure 4 illustrates the 20 most differentially expressed genes (with their gene ID:s).

Figure 4. 20 most differentially expressed genes in BHI and human blood serum. Column Serum_Replicate_3 was dropped because it contained only zeros.

24/5

Plasmid identification

A simple blast search could be performed in Nucleotide BLAST. 6 plasmids could be identified. In the following table are presented plasmids which were identified from the Illumina corrected Canu assembly.

File name	Plasmid name	Q coverage
2.fasta	Enterococcus faecium strain E745 plasmid pl2	100%
3.fasta	Enterococcus faecium strain E745 plasmid pl5	99%
4.fasta	Enterococcus faecium strain E745 plasmid pl4	98%
5.fasta	Enterococcus faecium strain E745 plasmid pl6	99%
6.fasta	Enterococcus faecium strain E745 plasmid pl3	99%
7_8.fasta	Enterococcus faecium strain E745 plasmid pl1	99%

Table 1. The names of contigs from the corrected assembly, the names of identified plasmids and the alignment e-values and coverages of the alignments.