\section{Pol~\abbrsc{III} \abbr{chip}-sequencing quantifies \abbr{trna} gene expression}
\label{sec:chip}
\subsection{\abbr{chipseq} is a \abbr{dna} binding assay}
\define[\defineword{\abbr{chipseq}} \chip followed by high-throughput
sequencing]{\chipseq} is a family of assays based on high-throughput sequencing,
similar to \rnaseq, but pre-dating the latter \citep{Johnson:2007}. Unlike
\rnaseq, which quantifies the abundance of \rna transcripts in the cell,
\chipseq pinpoints loci of protein--\dna interaction for specific proteins that
can be targeted with an antibody. \chipseq has several distinct applications,
all of which rely on identifying and quantifying the sites at which specific
proteins bind \dna. The most common uses of \chipseq are:
\begin{enumerate*}
\item the identification of novel \tf binding sites by targeting specific,
known \tf[s], and
\item the profiling of histone modifications such as methylation or
acetylation, which reveal information about the transcriptional activity
of the proximal sequence (see \cref{sec:transcription})
\citep{Barski:2007}.
\end{enumerate*}
Briefly, a sample is prepared by cross-linking proteins to the \dna in solution
using formaldehyde, which ensures that transient interactions are captured
rather than lost during sample preparation. Next, sonication or
\define[\defineword{MNase} \dna-cleaving enzyme (nuclease) purified from
specific bacteria]{MNase} treatment is used to shear the \dna into smaller
fragments, some of which will have the protein of interest bound. These
fragments are purified using an antibody that recognises the protein of
interest as specifically and sensitively as possible. The cross-links are then
reversed and the remaining \dna fraction is purified again, size selected,
ligated to sequencing adapters, amplified and sequenced
(\cref{fig:chip-seq-workflow}) \citep{Park:2009}.
\textfig{chip-seq-workflow}{body}{0.75\textwidth}
{Typical \abbr{chipseq} workflow.}
{Starting from a sample with a target protein bound to \dna (\num{1}),
the protein is covalently cross-linked to the \dna (mainly with
formaldehyde, \num{2}). Next, the \dna is fragmented (mainly via sonication,
\num{3}) and the protein-bound fragments are immunoprecipitated with a
specific antibody targeting the protein of interest (\num{4}). The protein
is unlinked and the \dna purified (\num{5}). This is followed by sequencing
library preparation, sequencing and mapping of the read data back to the
reference (\num{6}). This workflow follows the general outline presented in
e.g.\ \citet{Landt:2012}.}
\subsection{Quantifying expression of \abbr{trna} genes}
A central aspect of this thesis is the investigation of genome-wide \trna gene
expression. \trna gene expression can be quantified via \pol3 \chipseq, using an
antibody that specifically recognises the \pol3 subunit \abbrsc{RPC1/155}, which
forms part of the active centre of \trna gene transcription
\citep{Ablasser:2009}. The reason for using this, at first glance indirect,
measure is that individual \trna genes are not identifiable by their sequence
alone: performing a multiple sequence alignment of \trna genes in \mmu reveals
that several \trna genes share the exact same sequence
(\cref{fig:trna-alignment}).
\textfloat{trna-alignment}{spill}
{\ifoddpage\else\raggedleft\fi%
\small\input{data/mm10-trna-cove-asn.tex}}
{Alignment of \xtrna{Asn} genes.}
{Parts of a multiple sequence alignment of \trna genes in \mmu generated
with COVE\@. Shown are the \trna genes coding for asparagine. Bases which
differ from the consensus sequence are highlighted in \primaryname{}.}
Consequently, to identify individual \trna genes and to quantify their
expression, we cannot use conventional \rnaseq, since the \rna reads covering
only the transcribed gene region are not uniquely mappable. Common strategies
for counting ambiguously mapping reads, as used in \name{ERANGE} by
\citet{Mortazavi:2008}, still require at least \emph{some} unambiguous
information to distinguish different genes that share reads, and multi-mapping
reads continue to pose challenges, as recent publications highlight
\citep{Kahles:2015}.
We solve this problem by extending the \trna gene body into the flanking
regions, which are not under \define[\defineword{purifying selection} selective
force that prevents mutations]{purifying selection}, and therefore not
conserved. As \pol3 \chipseq fragments cover the flanking regions as well as the
actual gene body (\cref{fig:trna-pol3-binding-profile}), we can use reads
uniquely mapping to the flanking regions to assign ambiguously mapped reads to
the appropriate regions (\cref{fig:trna-pol3-map-ambiguous-reads}).
\textfig{trna-pol3-binding-profile}{body}{\textwidth}
{\abbr{trna} \abbr{pol3} \abbr{chip} binding profile.}
{The shaded, bell-shaped area shows an idealised binding profile of \chipseq
data spanning the \trna gene with the A and B box highlighted, as well as
its flanking regions upstream and downstream of the gene body. Because the
profile extends into these flanking regions, reads mapping there help to
identify the individual gene.}
More precisely, when mapping the reads, we retain ambiguously mapping reads
rather than discarding them. We do, however, discard reads which are likely
\pcr duplicates, i.e.\ we keep only one copy of each set of identical reads in
the raw input data. This still leaves reads that cannot be uniquely assigned to
a given \trna gene (\cref{fig:trna-pol3-map-a}). To assign these reads, we use
the number of uniquely mapping reads in \trna genes’ flanking regions to
determine the most likely origin (\cref{fig:trna-pol3-map-b})
\citep{Kutter:2011}.
Formally, let \(i\) index the \trna gene loci, and let \(c_i\) be the count of
uniquely mapped reads in the flanking region of locus \(i\) (we used
\SI{\pm100}{bp}, which has been shown to work well in practice
\citep{Kutter:2011}). A multi-mapping read \(r\), which maps to a set \(T\) of
candidate \trna genes, can be allocated randomly to a target \trna gene
\(i \in T\) with probability
\begin{equation}
p_i = \begin{cases}
c_i \big/ \sum_{x \in T}c_x &
\text{if \(\sum_{x \in T}c_x \neq 0\),} \\
1 \big/ \lvert T \rvert & \text{otherwise.}
\end{cases}
\end{equation}
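As a worked illustration with purely hypothetical counts (these numbers are
assumptions chosen for exposition, not values taken from the data), consider a
read \(r\) that maps to two candidate genes, \(T = \{1, 2\}\), whose flanking
regions contain \(c_1 = 30\) and \(c_2 = 10\) uniquely mapped reads. The
allocation probabilities would then be
\begin{equation*}
p_1 = \frac{30}{30 + 10} = 0.75, \qquad
p_2 = \frac{10}{30 + 10} = 0.25,
\end{equation*}
whereas for \(c_1 = c_2 = 0\) the second case applies and
\(p_1 = p_2 = 1 \big/ \lvert T \rvert = 0.5\).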
\textfloat{trna-pol3-map-ambiguous-reads}{body}{%
\centering
\begingroup
\includegraphics[width=\textwidth]{trna-pol3-map-ambiguous-reads-1}
\vspace{-\baselineskip}
\subcaption{\label{fig:trna-pol3-map-a}Two candidate \trna genes to which a
read could map.}
\endgroup
\begingroup
\includegraphics[width=\textwidth]{trna-pol3-map-ambiguous-reads-2}
\vspace{-\baselineskip}
\subcaption{\label{fig:trna-pol3-map-b}Using the count data from the
flanking regions to infer the most likely mapping positions for
ambiguous reads.}
\endgroup}
{Mapping ambiguous \abbr{chip} reads.}
{\chip reads originating from \trna genes often cannot be mapped
unambiguously to a single \trna gene. Instead, information from the genes’
flanking regions is used to determine the most likely origin.}