-
Notifications
You must be signed in to change notification settings - Fork 9
/
40-hts.Rmd
758 lines (550 loc) · 26.8 KB
/
40-hts.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
# High throughput sequencing {#sec-hts}
```{r, echo = F}
suppressPackageStartupMessages(library("tidyverse"))
library(tidyverse)
library(DT)
```
**Learning Objectives**
The aim of this chapter is to present high throughput sequencing by
going through one of its key applications: RNAseq. It is a highly
sensitive and accurate tool that has revolutionised the study of
transcriptomes. The objective is to give an overview of the
technology.
*RNA Sequencing Data: Hitchhiker's Guide to Expression
Analysis*[@Van_den_Berge:2019] is an interesting review describing
RNA-seq technology and data analysis.
We will pass through the different steps starting from the library
preparation and the sequencing process. We will explore the resulting
fastq files and the downstream processing pipeline including quality
control, alignment and counting steps.
## RNA-seq library preparation
The starting point of a RNAseq experiment is the cDNA library
preparation. RNA is first isolated from cells and purified to remove
ribosomal RNA which constitute the majority of total RNA. This can be
done by extracting poly(A) RNA or by depleting rRNA. Purified RNA is then
fragmented, reverse transcribed and adapters are ligated to both extremities
to allow for amplification and sequencing.
```{r, echo=FALSE, fig.align='center', out.width='60%'}
knitr::include_graphics("figs/Unstranded_lib_prep.png")
```
### Stranded versus unstranded libraries
The library preparation presented above corresponds to an **unstranded protocol**.
As both cDNA strands will be amplified for sequencing, about half of the
reads will be sequenced from the first cDNA strand, and half of the reads will
be sequenced from the second strand cDNA. But we won't know which strand was
sequenced! So we lose the information about the orientation of the transcript.
This can be avoided by using a **stranded protocol**.
In this case, dUTPs are added during the 2nd strand cDNA synthesis and the
newly synthetized cDNA strand is degradated after the ligation step.
This allows to infer the transcript orientation because only the first cDNA strand
will be sequenced. Indeed, if the sequence of the read corresponds to the
the sense strand of DNA, it means that the gene was originally transcribed in a
sense orientation. On the contrary, if the read is complementary to the sense
strand of DNA, it means that it was originally transcribed in antisense.
```{r, echo=FALSE, fig.align='center', out.width='100%'}
knitr::include_graphics("figs/stranded_vs_unstranded.png")
```
Strand-specificity leads to lower number of ambiguous reads.
## Sequencing
### SE vs PE sequencing
Once the library is ready, molecules from the library are sequenced in
a high throughput manner to obtain short sequences from one end
(single-end sequencing) or both ends (pair-end sequencing). Single-end
sequencing is cheaper, but Paired-end sequencing improves the ability
to localise the fragment in the genome and resolve mapping close to
repeat regions, resulting in less multimapping reads.
```{r, echo = FALSE, fig.align ='center', out.width = '100%'}
knitr::include_graphics("figs/SE_vs_PE.png")
```
### cluster amplification
The sequencing is done by a cluster amplification process.
The library preparation is hybridized to a flow cell coated with
immobilized oligonucleotides that serve as support to hold the DNA
strands in place during sequencing. The templates are copied from the
hybridized primers by 3’ extension using a high-fidelity DNA
polymerase. The original templates are denatured, leaving the copies
immobilized on the flow cell surface.
Immobilized DNA template copies are then amplified by bridge
amplification. The templates loop over to hybridize to adjacent
oligonucleotides. DNA polymerase copies the templates from the
hybridized oligonucleotides, forming dsDNA bridges, which are
denatured to form two ssDNA strands. These two strands loop over and
hybridize to adjacent oligonucleotides and are extended again to form
two new dsDNA loops. The process is repeated on each template by
cycles of denaturation and amplification to create millions of
individual, dense clonal clusters containing ~2,000 molecules.
Each cluster of dsDNA bridges is denatured, and the reverse strand is
removed by specific base cleavage, leaving the forward DNA strand. The
3’-ends of the DNA strands and flow cell-bound oligonucleotides are
blocked to prevent interference with the sequencing reaction. The
sequencing primer is hybridised to the Illumina adapter. Clusters are
ready for sequencing.
```{r, echo = FALSE, fig.align ='center', out.width = '100%'}
knitr::include_graphics("figs/cluster_amplification.png")
```
The sequencing process is done by synthesis. A polymerase adds a
fluorescently tagged dNTP to the DNA strand (only one base is able to
be added per round due to the fluorophore acting as a blocking group,
but the blocking group is reversible). Each of the four bases has a
unique fluorophore, and after each round, the machine records which
base was added. The fluorophore is then washed away and the process is
repeated.
For a dynamic view of the cluster amplification process, watch the
[Illumina video](https://www.youtube.com/watch?v=HMyCqWhwB8E).
## fastq files
The sequence data generated from the sequencer are received in text
files called `fastq` files. For a single-end run, one fastq file is
created for each sample. For a paired-end run, two separated fastq
files are generated, each containing sequences from one end.
What does a FASTQ file look like?
```{r message=FALSE, warning=FALSE, include=FALSE}
try(rWSBIM2122::prepare_shell())
```
```{r, echo = FALSE, fig.align ='center', out.width = '100%'}
knitr::include_graphics("figs/fastq_first_lines.png")
```
Each entry in a fastq file consists of 4 lines:
- A sequence identifier (by convention preceded by '@') with
information about the sequencing run
- The sequence
- A delimiter, always starting by the sign '+'
- The base call quality scores. These are Phred scores, encoded using
[ASCII characters](https://theasciicode.com.ar) to represent the
numerical quality scores. Each character is assigned a quality
score between 0 and 40. Quality scores represent the probability
that the corresponding nucleotide call is correct.
```{r, echo=FALSE, fig.align='center', out.width='100%'}
knitr::include_graphics("figs/phred_score.png")
```
`r msmbstyle::question_begin()`
What do you think of the quality of the last sequence presented in the
fastq file above?
`r msmbstyle::question_end()`
## Processing pipeline
The raw reads from fastq files will need to pass through a different
tools to yield ultimately a count table, representing a snapshot of
individual gene expression levels. The execution of this set of tools
in a specified order is commonly referred to as a pipeline.
```{r, echo=FALSE, fig.align='center', out.width='40%'}
knitr::include_graphics("figs/pipeline.png")
```
You can download the whole script that we will use to process the RNAseq data
by clicking
[here](https://raw.githubusercontent.com/UCLouvain-CBIO/WSBIM2122/master/data/script_RNAseq_processing.sh)
### Quality control
Of course, no one will analyse the sequences from the fastq file one
by one to check their quality... Rather, a software called
[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
can be used. Then, another program such as
[Trimmomatics](http://www.usadellab.org/cms/?page=trimmomatic) can
help to clean the reads.
- **FastQC**
[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
allows to analyse different features of the reads and generates a
report that includes several diagnostic plots to highlight any problem
the data may have.
FastQC is run from the command line:
```{r, eval = FALSE}
$ fastqc SampleName.fastq
```
`r msmbstyle::question_begin()`
Try to interpret the different plots of this
[fastQC report](./OD01_10_1_fastqc.html)
from a real RNAseq fastq file.
Another report corresponding to [bad
quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html)
data can be used as a point of comparison.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
**1. Per base sequence quality**
```{r, echo=FALSE, fig.align='center', out.width='80%'}
knitr::include_graphics("figs/fastqc_fig1.png")
```
This graph shows an overview of the range of quality values (Phred
scores) across all bases at each position. In this example, the sample
contains reads that are 150 bp long. For each position, a boxplot is
drawn with the median value, represented by the red line. The mean
quality is represented by the blue line. The background of the graph
divides the y-axis into very good quality scores (green), scores of
reasonable quality (orange), and reads of poor quality (red). Here
the quality of reads is very good.
This graph would for example represent low quality reads:
```{r, echo=FALSE, fig.align='center', out.width='80%'}
knitr::include_graphics("figs/fastqc_bad_fig1.png")
```
**2. Per tile sequence quality**:
```{r, echo=FALSE, fig.align='center', out.width='80%'}
knitr::include_graphics("figs/fastqc_fig2.png")
```
Flowcells are divided into tiles. The graph allows to look at the
quality scores from each tile across the reads to see if there is a
loss in quality associated with one part of the flowcell. The plot
shows the deviation from the average quality for each tile. Cold
colours being positions where the quality was at or above the average
for that base in the run, and hotter colours indicate that a tile had
worse qualities than other tiles for that base. Low scores are often
found around the edges, but hot spots can also occur in the middle
(due to an air bubble for example), however these should represent
only a small percentage of the total sequences. In this example no
tile shows poor quality.
This graph would for example show some lower quality tiles:
```{r, echo=FALSE, fig.align='center', out.width='80%'}
knitr::include_graphics("figs/fastqc_bad_fig2.png")
```
**3. Per sequence quality scores**
```{r, echo=FALSE, fig.align='center', out.width='80%'}
knitr::include_graphics("figs/fastqc_fig3.png")
```
It represents a density plot of average quality per read. The
distribution of average read quality should show a tight peak in the
upper range of the plot. It could report a subset of sequences with
low quality values, however these should represent only a small
percentage of the total sequences.
**4. Per base sequence content**
```{r, echo=FALSE, fig.align='center', out.width='80%'}
knitr::include_graphics("figs/fastqc_fig4.png")
```
It plots the proportion of each base position over all the
reads. Typically, we expect to see each base roughly 25% of the time
at each position so the lines in this plot should run parallel with
each other. Experience has shown that RNAseq libraries often have a
selection bias in around the first 12bp of each run. This is due to a
biased selection of random primers, but doesn't represent any
individually biased sequences. This is a typical plot obtained for
RNAseq reads.
When Analysing DNA, the figure could look like this:
```{r, echo=FALSE, fig.align='center', out.width='80%'}
knitr::include_graphics("figs/fastqc_bad_fig4.png")
```
**5. Per sequence GC content**
```{r, echo=FALSE, fig.align='center', out.width='80%'}
knitr::include_graphics("figs/fastqc_fig5.png")
```
It gives the GC distribution across the reads.
This plot makes more sense for whole genome shotgun sequencing,
as the overall GC content should reflect the GC content of the underlying genome.
(Note that since the software doesn't know the the GC content of the genome,
the theorical distribution is calculated from the observed data).
An unexpected distribution could indicate a contaminated library.
For RNA sequencing, this plot makes less sense, as over-represented sequences
might generate an unexpected distribution.
**6. Per base N content**
```{r, echo=FALSE, fig.align='center', out.width='80%'}
knitr::include_graphics("figs/fastqc_fig6.png")
```
If a sequencer is unable to make a base call with sufficient
confidence then it will replace the conventional base call with an
'N'. This graph plots the percent of times that ‘N’ occurs at a
position in all reads. If there was an increase at a particular
position, it could indicate that something went wrong during
sequencing.
**7. Sequence Length Distribution**
```{r, echo=FALSE, fig.align='center', out.width='80%'}
knitr::include_graphics("figs/fastqc_fig7.png")
```
The distribution of sequence lengths of all reads in the file.
Here we see that most reads are 150 pb long but some reads are shorter.
**8. Sequence Duplication Levels**:
```{r, echo=FALSE, fig.align='center', out.width='80%'}
knitr::include_graphics("figs/fastqc_fig8.png")
```
It gives the distribution of duplicated reads. The blue line
corresponds to the percentage of duplicated reads among total number
of reads. The red line corresponds to the percentage of duplicated
reads among total number of distinct reads.
In a highly diverse library from whole genome shotgun sequencing, almost
all the reads are expected to be unique, appearing only 1
time in the sequence data. A high level of duplication could indicate
biased PCR enrichment, or presence of contaminant.
In contrast, in RNAseq, there will likely be some very highly abundant
transcripts for which it is expected to observe duplicated reads. Here
the data was flagged as 'failed' by FastQC even though duplication was
expected in this case.
**9. Overrepresented sequences**
```{r, echo=FALSE, fig.align='center', out.width='80%'}
knitr::include_graphics("figs/fastqc_fig9.png")
```
FastQC lists here all of the sequence which make up more than 0.1% of
the total. Finding that a single sequence is highly overrepresented in
the set means that it is either highly biologically significant, or
that the library is biased or contaminated. Here it is of course not
relevant as it just corresponds to a sequence full of 'N'.
**10. Adapter Content**
```{r, echo=FALSE, fig.align='center', out.width='80%'}
knitr::include_graphics("figs/fastqc_fig10.png")
```
This graph indicates if and where adapter sequences occur in the reads.
If residual adapters were present we could see something like this:
```{r, echo=FALSE, fig.align='center', out.width='80%'}
knitr::include_graphics("figs/fastqc_bad_fig10.png")
```
`r msmbstyle::solution_end()`
- **Trimmomatics**
FastQC gives a global vision the quality of the sample. If it is not
optimal, it is advisable to use
[Trimmomatics](http://www.usadellab.org/cms/?page=trimmomatic)
software to filter poor quality reads and trim residual adapters.
Trimmomatics has a variety of options to clean the reads. Read the
[manual](http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf)
for more information. However, a command for Trimmomatics could look
something like the command below. In this case it would perform
adapter removal, and perform a scan of the read with a 10-base wide
sliding window, cutting when the average quality per base drops below
20.
```
$ java -jar trimmomatic.jar \
PE \ # paired-end
-threads 4 \ # number of threads
SampleName_1.fastq \ # fastq file for fw reads
SampleName_2.fastq \ # fastq file for rev reads
SampleName_1_clean.fastq \ # clean fastq for fw reads
SampleName_1_unpaired.fastq \ # fastq for unpaired remaining fw reads
SampleName_2_clean.fastq \ # clean fastq for rev reads
SampleName_2_unpaired.fastq \ # fastq for unpaired remaining rev reads
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \ # cut adapter from the read.
SLIDINGWINDOW:10:20 # sliding window trimming
```
`r msmbstyle::question_begin()`
How could you check that the trimming performed well?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
Once samples passed through Trimmomatics, they should perform better
on the quality tests run by FastQC. You could re-run FastQC on your
clean fastq files to see whether adapters were correctly removed and
if the per base sequence quality is higher after trimming...
```
$ fastqc SampleName_1_clean.fastq
```
`r msmbstyle::solution_end()`
## Alignment
```{r, echo=FALSE, fig.align='center', out.width='20%'}
knitr::include_graphics("figs/pipeline_mapping.png")
```
Next step is to map the reads to determine where in the genome the
reads originated from. Many different aligners are available such as
[HISAT2](https://ccb.jhu.edu/software/hisat2/manual.shtml),
[STAR](https://github.com/alexdobin/STAR),
[BWA](http://bio-bwa.sourceforge.net),
[Salmon](https://combine-lab.github.io/salmon/),
[Subread](https://sourceforge.net/projects/subread/)... There is no
gold standard, but some tools are better suited for particular NGS
analyses [@Baruzzo:2017]... In the case of RNAseq, it is important to
use an aligner able to handle spliced reads.
```{r, echo = FALSE, fig.align = 'center', out.width = '100%', purl = TRUE}
knitr::include_graphics("figs/spliced_alignment.png")
```
Here we show an example of what the command could look like when using
[HISAT2](http://daehwankimlab.github.io/hisat2/) aligner,
launched with default parameters. In this basic configuration,
mandatory parameters are of course the fastq file(s) and the sequence
of the genome on which the reads have to be aligned. In practical, the
genome's sequence is not given as it is but in an indexed form
[^Indexed_genome]. Of course many of other parameters can be fine
tuned (see the
[manual](http://daehwankimlab.github.io/hisat2/manual/) for more
details). The alignment output file data is a `sam` file (see below).
[^Indexed_genome]: Genome indexing allows the aligner to quickly find
potential alignment sites for query sequences in a genome, which saves
considerable time during alignment. Frequently used genome (human,
mouse, rat...) already have indexed versions that can be downloaded
directly. When working with another reference genome, a specific
indexed genome has to be built.
```
$ hisat2 \
-p 6 \ # number of threads
-x genome_index \ # index for the reference genome
-1 SampleName_1_clean.fastq \ # clean fastq_file for fw reads
-2 SampleName_2_clean.fastq \ # clean fastq_file for rev reads
-S SampleName.sam # sam file name
```
It is recommended to pay attention to the HISAT2 report to check the
alignment rate. For a PE-end alignment it could look like this:
```{r, echo = FALSE, fig.align = 'center', out.width = '100%', purl = TRUE}
knitr::include_graphics("figs/hisat2_report.png")
```
**SAM format**
The [SAM](https://genome.sph.umich.edu/wiki/SAM) format is a generic
format for storing large nucleotide sequence alignments.
In a SAM file, each entry gives the alignment information for a single read.
Each entry has 11 mandatories fields for essential mapping information
and a variable number of other fields for aligner specific information.
An example of few entries from a SAM
file is displayed below.
```{bash}
head wsbim2122_data/RNAseq_pipeline/example.sam
```
Field 1: Sequence ID
Field 2: [SAM Flag](https://broadinstitute.github.io/picard/explain-flags.html). It gives information about
the read (is it paired, aligned, is the mate read properly aligned...)
Field 3: Chromosome where alignment occurs
Field 4: Position of alignment
Field 5: Mapping quality score (read uniquely mapped (60), multiply mapped (1) or unmapped (0))
Field 6: [CIGAR](https://timd.one/blog/genomics/cigar.php)
string. It is a representation of alignment indicating positions of match/mismatch/deletion/insertion
compared to the reference.
Field 7: Reference name of the mate (Set to '=' if the mate’s reference sequence is the same as this alignment’s)
Field 8: Position of mate
Field 9: Inferred fragment length
Field 10: Read sequence (reverse-complemented if aligned to the reverse strand)
Field 11: Base quality (Phred score)
Field 12: Optional. See [hisat2 manual](http://daehwankimlab.github.io/hisat2/manual/) for more information.
**BAM format**
The BAM file is a compressed binary version of SAM file. SAM files can be
easily converted into BAM files using [samtools](http://www.htslib.org
software), and it is sometimes mandatory to sort the SAM/BAM by
chromosomal coordinates for downstream applications.
```
samtools view -Sb SampleName.sam | samtools sort > SampleName.bam
```
**Alignment visualisation**
Alignements described in BAM files can easily be viewed using
[IGV](https://software.broadinstitute.org/software/igv/) (Integrative Genomics Viewer),
a very useful interactive tool for the visual exploration of genomic data.
Visualising the alignments is not part of the processing pipeline
but sometimes it can be useful..
```{r, echo = FALSE, fig.align = 'center', out.width = '60%', purl = TRUE}
knitr::include_graphics("figs/pipeline_IGV.png")
```
IGV requires that BAM files have an associated index file[^Indexed_bam].
Samtools can be used to index the BAM file. The following command would create
the indexed SampleName.bam.bai file.
[^Indexed_bam]: Indexing a BAM file allows IGV to quickly extract alignments overlapping
particular genomic regions
```
samtools index SampleName.bam
```
Bam file can be loaded in IGV[^note_on_index].
Below is an example of visualisation of a bam file using IGV, focusing on CDKN1A gene.
[^note_on_index]: Note that the index file must reside in the same directory as the bam
file that it indexes, and it must have the same basename
```{r, echo = FALSE, fig.align = 'center', out.width = '100%', purl = TRUE}
knitr::include_graphics("figs/IGV.png")
```
Loading a BAM file creates 3 associated tracks:
- Coverage Track to view depth of coverage
- Splice Junction Track which provides an alternative view of reads spanning splice junctions
- Alignment Track to view individual aligned reads
## Counting
```{r, echo=FALSE, fig.align='center', out.width='20%'}
knitr::include_graphics("figs/pipeline_counting.png")
```
Now comes the quantification step. It can be performed with several
tools such as [htseq-count](https://pypi.org/project/HTSeq/) or
[FeatureCounts](http://bioinf.wehi.edu.au/featureCounts/). We will
describe the latter as it runs much faster. It was developed for
counting reads to genomic features such as genes, exons, promoters and
genomic bins. FeatureCounts takes as input SAM/BAM files and an
annotation file (such as a gtf file) containing chromosomal
coordinates of features.
The Gene Transfer Format (GTF) is a widely used format for storing
gene annotations. Human and mouse GTF files can be downloaded easily
from [Ensembl](http://www.ensembl.org/info/data/ftp/index.html) or
from the [UCSC table
browser](https://www.gencodegenes.org/human/). The first few lines of
gtf annotation file for human genome assembly GRCh38 look like this:
```{r, echo = FALSE, fig.align = 'center', out.width = '100%', purl = TRUE}
knitr::include_graphics("figs/gtf_example.png")
```
FeatureCounts will count the number of uniquely mapped reads assigned
to features (e.g. an exon) or meta-features (e.g. a gene) of the gtf
file. When summarizing reads at gene level, read counts obtained for
exons belonging to the same gene will be added up to yield the read
count for the corresponding gene. Note that an exon-spanning read
will be counted only once for the corresponding gene even if it
overlaps with more than one exon.
```{r, echo = FALSE, fig.align = 'center', out.width = '100%', purl = TRUE}
knitr::include_graphics("figs/featurecount_metafeature.png")
```
FeatureCounts could be launched with the following command. As usual,
many of other parameters can be fine-tuned (see the
[manual](http://manpages.ubuntu.com/manpages/bionic/man1/featureCounts.1.html)
for more details).
```
featureCounts \
-p \ # paired-end
--countReadPairs \ # fragments counted instead of reads
-a annotations.gtf \ # name of the annotation file
-t exon \ # feature type in GTF annotation
-g gene_id \ # Meta-features used in GTF annotation
-s 0 \ # strand-specificity
-o SampleName_counts.tsv \ # Name of the output file
SampleName.bam # bam file
```
Here is an example of featurecounts output file. The last column gives
the raw counts of each gene in that sample.
```{r, message = FALSE}
counts_result <- read_tsv(file = "wsbim2122_data/count_data/sample1_counts.tsv.gz")
datatable(head(counts_result))
```
`r msmbstyle::question_begin()`
Inspect the count table.
1. How many genes are tested ?
2. How many reads were assigned to genes?
3. Inspect the counts. What is the maximum value? How many genes have zero counts ?
4. Plot the distribution of counts. How does it look like?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
Inspect the count table.
1. How many genes are tested ?
```{r}
nrow(counts_result)
```
2. How many reads were assigned to genes?
```{r}
counts_result <- counts_result %>%
rename(counts = "../processed_data/bam/sample1.bam")
sum(counts_result$counts)
```
3. Inspect the counts. What is the maximum value? How many genes have zero counts ?
```{r}
summary(counts_result$counts)
counts_result %>%
filter(counts == 0) %>%
nrow()
```
4. Plot the distribution of counts. How does it look like?
```{r}
counts_result %>%
ggplot(aes(x = log2(counts + 1))) +
geom_histogram(bins = 20)
```
Note that a "pseudo-count" was added before the log2 transformation of the counts,
to avoid elimination of counts equal to zero.
`r msmbstyle::solution_end()`
## Preparing the count matrix
The aim of an RNAseq experiment is often to compare different types of
cells, or to compare treated cells to control cells, to see if some
genes are up- or down- regulated.
Once fastq files from each sample have been processed into raw counts,
results can be merged in a single count matrix, were each column represents a sample,
and each line represents a gene. This count matrix will be the starting point of a
differential expression analysis.
```{r, echo = F}
load("wsbim2122_data/deseq2/counts.rda")
head(counts)
```
`r msmbstyle::question_begin()`
Imagine you had initially 6 samples (3 control and 3 treated samples) that you have
processed independently giving the 6 counts files generated by FeatureCounts located
in the folder `wsbim_data/count_data/`. Use these files to create a count matrix.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r, message = FALSE}
samples <- list.files("wsbim2122_data/count_data",
pattern = "*tsv.gz",
full.names = TRUE)
samples
counts <- read_tsv(file = samples[1]) %>%
select(Geneid, ends_with('.bam'))
for (sample in samples[-1]){
tmp <- read_tsv(sample) %>%
select(ends_with('.bam'))
counts <- cbind(counts, tmp)
}
names(counts) <- sub(pattern = ".bam$", '', names(counts))
names(counts) <- sub(pattern = "../processed_data/bam/", '', names(counts))
head(counts)
```
`r msmbstyle::solution_end()`
As differential expression analysis addresses relatively complex statistical concepts,
it will be covered in the next chapter.