-
Notifications
You must be signed in to change notification settings - Fork 0
/
pplacer_demo.html
395 lines (278 loc) · 14 KB
/
pplacer_demo.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
<!DOCTYPE html>
<html>
<head>
<meta http-eqiv='content-type' content='text/html;charset=utf-8'>
<title>pplacer_demo.sh</title>
<link rel=stylesheet href="src/docco.css">
</head>
<body>
<div id=container>
<div id=background></div>
<table cellspacing=0 cellpadding=0>
<thead>
<tr>
<th class=docs><h1>pplacer_demo.sh</h1></th>
<th class=code></th>
</tr>
</thead>
<tbody>
<tr><td class='docs'><p>This is a demonstration for the use of the pplacer suite of programs. It
covers the use of placement, visualization, classification, and comparison.
If you are looking at this file in a web browser after processing with
<a href="http://rtomayko.github.com/shocco/">shocco</a>, the left column will describe
what is going on in the right column.</p>
<p>It is assumed that java is available and that you have installed <code>pplacer</code>
and <code>guppy</code>. See the <a href="http://github.com/fhcrc/microbiome-demo">README</a>
for details.</p>
<p><strong>Download tutorial files:</strong></p>
<ul>
<li><a href="http://github.com/fhcrc/microbiome-demo/zipball/master">zip archive</a></li>
<li><a href="http://github.com/fhcrc/microbiome-demo/tarball/master">tar archive</a>
</td><td class=code><div class=highlight><pre>
<span class="c">#!/bin/bash -eu</span>
</pre></div></td></tr><tr><td class=docs></li>
</ul>
<h2>Getting set up (for this demo)</h2>
</td><td class=code><div class=highlight><pre>
</pre></div></td></tr><tr><td class=docs>
<p>We start with a couple of little functions to make this script run smoothly.
You can safely ignore them.</p>
</td><td class=code><div class=highlight><pre>
</pre></div></td></tr><tr><td class=docs>
<p>We have a little script function <code>aptx</code> to run archaeopteryx from within this
script (you can also open them directly from the archaeopteryx user interface
if you prefer).</p>
</td><td class=code><div class=highlight><pre>
aptx<span class="o">()</span> <span class="o">{</span>
java -jar bin/forester.jar -c bin/_aptx_configuration_file <span class="nv">$1</span>
<span class="o">}</span>
</pre></div></td></tr><tr><td class=docs>
<p>A little <code>pause</code> function to pause between steps.</p>
</td><td class=code><div class=highlight><pre>
pause<span class="o">()</span> <span class="o">{</span>
<span class="nb">echo</span> <span class="s2">"Please press return to continue..."</span>
<span class="nb">read</span>
<span class="o">}</span>
</pre></div></td></tr><tr><td class=docs>
<p>Make sure that <code>guppy</code> can be found.</p>
</td><td class=code><div class=highlight><pre>
which guppy > /dev/null 2>&1 <span class="o">||</span> <span class="o">{</span>
<span class="nb">echo</span> <span class="s2">"Couldn't find guppy. \</span>
<span class="s2">There is a download script in the bin directory for you to use."</span>
<span class="nb">exit </span>1
<span class="o">}</span>
</pre></div></td></tr><tr><td class=docs>
<p>Echo the commands to the terminal.</p>
</td><td class=code><div class=highlight><pre>
<span class="nb">set</span> -o verbose
</pre></div></td></tr><tr><td class=docs>
<h2>Phylogenetic placement</h2>
</td><td class=code><div class=highlight><pre>
</pre></div></td></tr><tr><td class=docs>
<p>This makes p4z1r2.json, which is a "place" file in JSON format. Place files
contain information about collections of phylogenetic placements on a tree.
You may notice that one of the arguments to this command is
<code>vaginal_16s.refpkg</code>, which is a "reference package". Reference packages are
simply an organized collection of files including a reference tree, reference
alignment, and taxonomic information. We have the beginnings of a
<a href="http://microbiome.fhcrc.org/apps/refpkg/">database</a> of reference packages
and some <a href="http://github.com/fhcrc/taxtastic">software</a> for putting them
together.</p>
</td><td class=code><div class=highlight><pre>
pplacer -c vaginal_16s.refpkg src/p4z1r36.fasta
pause
</pre></div></td></tr><tr><td class=docs>
<h2>Grand Unified Phylogenetic Placement Yanalyzer (guppy)</h2>
</td><td class=code><div class=highlight><pre>
</pre></div></td></tr><tr><td class=docs>
<p><code>guppy</code> is our Swiss army knife for phylogenetic placements. It has a lot of
different subcommands, which you can learn about with online help by invoking
the <code>--cmds</code> option.</p>
</td><td class=code><div class=highlight><pre>
guppy --cmds
pause
</pre></div></td></tr><tr><td class=docs>
<p>These subcommands are used by writing out the name of the subcommand like</p>
<pre><code>guppy SUBCOMMAND [options] [files]
</code></pre>
</td><td class=code><div class=highlight><pre>
</pre></div></td></tr><tr><td class=docs>
<p>For example, we can get help for the <code>fat</code> subcommand.</p>
</td><td class=code><div class=highlight><pre>
guppy fat --help
pause
</pre></div></td></tr><tr><td class=docs>
<h2>Visualization</h2>
</td><td class=code><div class=highlight><pre>
</pre></div></td></tr><tr><td class=docs>
<p>Now run <code>guppy fat</code> to make a phyloXML "fat tree" visualization, and run
archaeopteryx to look at it. Note that <code>fat</code> can be run without the reference
package specification, e.g.:</p>
<pre><code>guppy fat p4z1r36.json
</code></pre>
<p>but in that case there won't be any taxonomic information in the
visualizations.
<a href="http://matsen.fhcrc.org/pplacer/demo/p4z1r36.html">Here</a>
is an online version.</p>
</td><td class=code><div class=highlight><pre>
guppy fat -c vaginal_16s.refpkg p4z1r36.json
aptx p4z1r36.xml &
</pre></div></td></tr><tr><td class=docs>
<h2>Statistical comparison</h2>
</td><td class=code><div class=highlight><pre>
</pre></div></td></tr><tr><td class=docs>
<p><code>kr</code> is the command to calculate things using the
<a href="http://arxiv.org/abs/1005.1699">Kantorovich-Rubinstein (KR) metric</a>
which is a generalization of UniFrac. It simply takes in JSON placement files and
spits the matrix of distances between the corresponding samples.</p>
</td><td class=code><div class=highlight><pre>
guppy kr src/*.json
pause
</pre></div></td></tr><tr><td class=docs>
<p>The KR metric can be thought of as the amount of work it takes to move the
distribution of reads from one collection of samples to another along the
edges of the tree. This can be visualized by thickening the branches of the
tree in proportion to the number of reads transported along that branch. To
get such a visualization, we use guppy's <code>kr_heat</code> subcommand. The reference
package is included again to add in taxonomic annotation. Red indicates
movement towards the root and blue away from the root.
<a href="http://matsen.fhcrc.org/pplacer/demo/bv.heat.html">Here</a> is a version which
compares all of the vaginosis-positive samples with the negative ones.</p>
</td><td class=code><div class=highlight><pre>
guppy kr_heat -c vaginal_16s.refpkg/ src/p1z1r2.json src/p1z1r34.json
aptx p1z1r2.p1z1r34.heat.xml &
</pre></div></td></tr><tr><td class=docs>
<p>Phylogenetic placement data has a special structure, and we have developed
variants of classical ordination and clustering techniques, called "edge
principal components analysis" and "squash clustering" which leverage this
special structure. You can read more about these methods
<a href="http://matsen.fhcrc.org/papers/11MatsenEvansEdgeSquash.pdf">in our paper</a>.</p>
</td><td class=code><div class=highlight><pre>
</pre></div></td></tr><tr><td class=docs>
<h3>Edge principal components analysis</h3>
</td><td class=code><div class=highlight><pre>
</pre></div></td></tr><tr><td class=docs>
<p>With edge principal components analysis (edge PCA), it is possible to
visualize the principal component axes, and find differences between
samples which may only differ in terms of read distributions on closely
related taxa. <code>guppy pca</code> creates a tree file (here <code>pca_out.xml</code>) which
shows the principal component axes projected onto the tree.
<a href="http://matsen.fhcrc.org/pplacer/demo/pca.html">Here</a> are the first five
principal component axes for the full data set.</p>
</td><td class=code><div class=highlight><pre>
guppy pca --prefix pca_out -c vaginal_16s.refpkg src/*.json
aptx pca_out.xml &
</pre></div></td></tr><tr><td class=docs>
<p>The <code>pca_out.trans</code> file has the samples projected onto principal coordinate
axes. <a href="http://fhcrc.github.com/microbiome-demo/edge_pca.svg">Here</a> is the
corresponding figure for the full data set.</p>
</td><td class=code><div class=highlight><pre>
cat pca_out.trans
</pre></div></td></tr><tr><td class=docs>
<h3>Squash clustering</h3>
</td><td class=code><div class=highlight><pre>
</pre></div></td></tr><tr><td class=docs>
<p><code>guppy</code> can also do "squash clustering". Squash clustering is a type of
hierarchical clustering that is designed for use with phylogenetic
placements. In short, where UPGMA considers the average of distances between
samples, squash clustering considers the distances between averages of
samples. One nice thing about squash clustering is that you can see what the
internal nodes of the clustering tree signify. The clustering is done with
the <code>squash</code> subcommand, which makes a directory containing <code>cluster.tre</code>,
which is the clustering tree, and then a subdirectory <code>mass_trees</code> which
contain all of the mass averages for the internal nodes of the tree.</p>
</td><td class=code><div class=highlight><pre>
mkdir squash_out
guppy squash -c vaginal_16s.refpkg --out-dir squash_out src/*.json
aptx squash_out/cluster.tre &
</pre></div></td></tr><tr><td class=docs>
<p>We can look at <code>0006.phy.fat.xml</code>: the mass distribution for the internal
node number 6 in the clustering tree.</p>
</td><td class=code><div class=highlight><pre>
aptx squash_out/mass_trees/0006.phy.fat.xml &
</pre></div></td></tr><tr><td class=docs>
<h2>Classification</h2>
</td><td class=code><div class=highlight><pre>
</pre></div></td></tr><tr><td class=docs>
<p>Next we run guppy's <code>classify</code> subcommand to classify the reads. The columns
are as follows: read name, attempted rank for classification, actual rank for
classification, taxonomic identifier, and confidence. We use <code>head</code> here
just to get the first 30 lines so that you can look at them.</p>
</td><td class=code><div class=highlight><pre>
guppy classify -c vaginal_16s.refpkg p4z1r36.json
head -n 30 p4z1r36.class.tab
pause
</pre></div></td></tr><tr><td class=docs>
<p>We can quickly explore the classification results via SQL by importing them
into a SQLite3 database. We exit if SQLite3 is not available, and clean up in
case the script is getting run for the second time.</p>
</td><td class=code><div class=highlight><pre>
which sqlite3 > /dev/null 2>&1 <span class="o">||</span> <span class="o">{</span>
<span class="nb">echo</span> <span class="s2">"No sqlite3, so stopping here."</span>
<span class="nb">exit </span>0
<span class="o">}</span>
rm -f taxtable.db
</pre></div></td></tr><tr><td class=docs>
<p>Create a table containing the taxonomic names.</p>
</td><td class=code><div class=highlight><pre>
guppy taxtable -c vaginal_16s.refpkg --sqlite taxtable.db
</pre></div></td></tr><tr><td class=docs>
<p>Explore the taxonomic table itself, without reference to placements.</p>
</td><td class=code><div class=highlight><pre>
sqlite3 -header -column taxtable.db <span class="s2">"SELECT tax_name FROM taxa WHERE rank = 'phylum'"</span>
pause
</pre></div></td></tr><tr><td class=docs>
<p>For placement classifications, <code>guppy classify</code> can emit .sqlite
files, which contain SQL instructions for creating a table of
classification results in the database.</p>
</td><td class=code><div class=highlight><pre>
guppy classify --sqlite taxtable.db -c vaginal_16s.refpkg src/*.json
</pre></div></td></tr><tr><td class=docs>
<p>Now we can investigate placement classifications using SQL queries. Here we
ask for the lineage of a specific sequence.</p>
</td><td class=code><div class=highlight><pre>
sqlite3 -header taxtable.db <span class="s2">"</span>
<span class="s2">SELECT pc.rank,</span>
<span class="s2"> tax_name,</span>
<span class="s2"> likelihood</span>
<span class="s2">FROM placement_names AS pn</span>
<span class="s2"> JOIN placement_classifications AS pc USING (placement_id)</span>
<span class="s2"> JOIN taxa USING (tax_id)</span>
<span class="s2"> JOIN ranks USING (rank)</span>
<span class="s2">WHERE pc.rank = desired_rank</span>
<span class="s2"> AND pn.name = 'FUM0LCO01DX37Q'</span>
<span class="s2">ORDER BY rank_order</span>
<span class="s2">"</span>
pause
</pre></div></td></tr><tr><td class=docs>
<p>Here is another example, with somewhat less confidence in the
species-level classification result.</p>
</td><td class=code><div class=highlight><pre>
sqlite3 -header taxtable.db <span class="s2">"</span>
<span class="s2">SELECT pc.rank,</span>
<span class="s2"> tax_name,</span>
<span class="s2"> likelihood</span>
<span class="s2">FROM placement_names AS pn</span>
<span class="s2"> JOIN placement_classifications AS pc USING (placement_id)</span>
<span class="s2"> JOIN taxa USING (tax_id)</span>
<span class="s2"> JOIN ranks USING (rank)</span>
<span class="s2">WHERE pc.rank = desired_rank</span>
<span class="s2"> AND pn.name = 'FUM0LCO01A2HOA'</span>
<span class="s2">ORDER BY rank_order</span>
<span class="s2">"</span>
pause
</pre></div></td></tr><tr><td class=docs>
<p>That's it for the demo. For further information, please consult the
<a href="http://matsen.github.com/pplacer/">pplacer documentation</a>.</p>
</td><td class=code><div class=highlight><pre>
<span class="nb">echo</span> <span class="s2">"Thanks!"</span>
</td><td class=code><div class=highlight><pre>
</pre></div></td></tr><tr><td class=docs>
</pre></div></td></tr><tr><td class=docs>
</td><td class=code><div class=highlight><pre>
</pre></div></td></tr><tr><td class=docs></td><td class='code'></td></tr>
</tbody>
</table>
</div>
</body>
</html>