-
Notifications
You must be signed in to change notification settings - Fork 2
/
publication--03--greengenes.Rmd
243 lines (187 loc) · 9.85 KB
/
publication--03--greengenes.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
---
title: Graphing the Greengenes database
bibliography: bibliography.bibtex
---
```{r init, echo = FALSE, message = FALSE}
library(knitr)
if (! is.null(current_input())) { # if knitr is being used
knitr::read_chunk("settings.R")
} else {
source("settings.R")
}
```
```{r rendering_settings, include=FALSE}
```
## Requirements
**NOTE:** This analysis requires at least 10Gb of RAM to run.
It uses large files not included in the repository and many steps can take a few minutes to run.
## Parameters
### Analysis input/output
```{r io_settings}
```
### Analysis parameters
These settings are shared between the RDP, SILVA, and Greengenes analyses, since the results of those three analyeses are combined in one plot later.
This ensures that all of the graphs use the same color and size scales, instead of optimizing them for each data set, as would be done automatically otherwise.
```{r database_comparison_settings}
```
## Parse database
The greengenes database stores sequences in one file and taxonomy information in another and the order of the two files differ making parseing more difficult than the other databases.
Since taxonomy inforamtion is needed for creating the `taxmap` data structure, we will parse it first and add the sequence information on after.
The input files are not included in this repository because they are large, but you can download them from the Greengenes website here:
http://greengenes.lbl.gov/Download/
### Parse taxonomy file
```{r greengenes_taxonomy, cache = TRUE}
gg_taxonomy_path <- file.path(input_folder, "gg_13_5_taxonomy.txt")
gg_taxonomy <- readLines(gg_taxonomy_path)
print(gg_taxonomy[1:5])
```
Note that there are some ranks with no names.
These will be removed after parsing the file since they provide no information and an uniform-length taxonomy is not needed.
```{r}
library(metacoder)
```
```{r greengenes_taxonomy_parse, cache = TRUE}
# Parse taxonomy file
greengenes <- extract_tax_data(gg_taxonomy,
key = c(id = "info", "class"),
regex = "^([0-9]+)\t(.*)$",
class_sep = "; ",
class_regex = "^([a-z]{1})__(.*)$",
class_key = c(rank = "taxon_rank", "taxon_name"))
greengenes$data$class_data <- NULL
# Remove data for ranks with no information
greengenes <- filter_taxa(greengenes, taxon_names != "")
print(greengenes)
```
### Parse sequence file
Next we will parse the sequence file so we can add it to the `obs_data` table of the `greengenes` object.
```{r greengenes_sequence, cache = TRUE}
gg_sequence_path <- file.path(input_folder, "gg_13_5.fasta")
substr(readLines(gg_sequence_path, n = 10), 1, 100)
gg_seq <- ape::read.FASTA(gg_sequence_path)
```
### Integrating sequence and taxonomy
We will need to use the Greengenes ID to match up which sequence goes with which row since they are in different orders.
```{r greengenes_combine, cache = TRUE}
greengenes <- mutate_obs(greengenes, "sequence",
setNames(gg_seq[id], greengenes$data$tax_data$taxon_id))
```
## Subset
Next I will subset the taxa in the dataset (depending on parameter settings).
This can help make the graphs less cluttered and make it easier to compare databases.
```{r greengenes_subset, cache = TRUE}
if (! is.null(min_seq_count)) {
greengenes <- filter_taxa(greengenes, n_obs >= min_seq_count)
}
if (just_bacteria) {
greengenes <- filter_taxa(greengenes, taxon_names == "Bacteria", subtaxa = TRUE)
}
if (! is.null(max_taxonomy_depth)) {
greengenes <- filter_taxa(greengenes, n_supertaxa <= max_taxonomy_depth)
}
print(greengenes)
```
## Remove chloroplast sequences
These are not bacterial and will bias the *in silico* PCR results.
Note that the `invert` option makes it so taxa are included that *did not* pass the filter.
This is different than simply using `name != "Chloroplast", subtaxa = TRUE` since the effects of `invert` are applied after those of `subtaxa`.
```{r greengenes_rm_chloro, cache = TRUE}
greengenes <- filter_taxa(greengenes, taxon_names == "Chloroplast", subtaxa = TRUE, invert = TRUE)
print(greengenes)
```
## Plot whole database
Although graphing everything can be a bit overwhelming (`r length(taxon_names(greengenes))` taxa), it gives an intuitive feel for the complexity of the database:
```{r greengenes_plot_all, cache = TRUE}
greengenes_plot_all <- heat_tree(greengenes,
node_size = n_obs,
node_color = n_obs,
node_size_range = size_range * 2,
edge_size_range = size_range,
node_size_interval = all_size_interval,
edge_size_interval = all_size_interval,
node_color_interval = all_size_interval,
edge_color_interval = all_size_interval,
node_label = taxon_names,
node_label_max = label_max,
node_label_size_range = label_size_range,
node_color_axis_label = "Sequence count",
output_file = result_path("greengenes--all"))
print(greengenes_plot_all)
```
## PCR
Before doing the digital PCR, I will filter for only full length sequences, since shorter sequences might not have a primer binding site and the digital PCR would fail, not because of an inability of a real primer to bind to real DNA, but because of missing sequence information in the database.
Ideally, this kind of filtering for full length sequences would involve something like a multiple sequences alignment so we don't remove sequences that are actually full length, but just happen to be shorter than the cutoff of `r min_seq_length` naturally.
However, this method is easy and should work OK.
```{r greengenes_length_filter, cache = TRUE}
if (! is.null(min_seq_length)) {
greengenes <- filter_obs(greengenes, data = c("tax_data", "sequence"), drop_taxa = TRUE,
vapply(sequence, length, numeric(1)) >= min_seq_length)
}
print(greengenes)
```
Next I will conduct digital PCR with a new set of universal 16S primers [@walters2016improved], allowing for a maximum mismatch of `r max_mismatch`%.
```{r greengenes_pcr, cache = TRUE}
greengenes_pcr <- primersearch(greengenes, sequence,
forward = forward_primer,
reverse = reverse_primer,
mismatch = max_mismatch)
```
Now the object `greengenes_pcr` has all the information that `greengenes` has plus the results of the digital PCR.
Lets plot the whole database again, but coloring based on digital PCR success.
```{r greengenes_plot_pcr_all, cache = TRUE}
greengenes_plot_pcr_all <- heat_tree(greengenes_pcr,
node_size = seq_count,
node_size_range = size_range * 2,
edge_size_range = size_range,
node_size_interval = all_size_interval,
edge_size_interval = all_size_interval,
node_label = taxon_names,
node_color = prop_amplified,
node_color_range = pcr_success_color_scale,
node_color_trans = "linear",
node_label_max = label_max,
node_label_size_range = label_size_range,
edge_color_interval = c(0, 1),
node_color_interval = c(0, 1),
node_color_axis_label = "Proportion PCR success",
node_size_axis_label = "Sequence count",
output_file = result_path("greengenes--pcr_all"))
print(greengenes_plot_pcr_all)
```
Since these are universal bacterial primers, we would expect most of the database to amplify.
The plot shows this and very few sequences were not amplified.
If we wanted to look at only those sequences that did not amplify, we could filter taxa by PCR success and plot what remains.
I am including the supertaxa of those taxa that did not amplify since excluding them would split the tree into a cloud of fragments lacking taxonomic context.
```{r greengenes_plot_pcr_fail, cache = TRUE}
greengenes_plot_pcr_fail <- greengenes_pcr %>%
filter_taxa(prop_amplified < pcr_success_cutoff, supertaxa = TRUE) %>%
heat_tree(node_size = query_count - seq_count,
node_label = taxon_names,
node_color = prop_amplified,
node_size_range = size_range * 2,
edge_size_range = size_range,
node_size_interval = pcr_size_interval,
edge_size_interval = pcr_size_interval,
node_color_range = pcr_success_color_scale,
node_color_trans = "linear",
node_color_interval = c(0, 1),
edge_color_interval = c(0, 1),
node_label_size_range = label_size_range,
node_label_max = label_max,
node_color_axis_label = "Proportion PCR success",
node_size_axis_label = "Sequences not amplified",
output_file = result_path("greengenes--pcr_fail"))
print(greengenes_plot_pcr_fail)
```
## Save outputs for composite figure
Some results from this file will be combined with others from similar analyses to make a composite figure.
Below, the needed objects are saved so that they can be loaded by another Rmd file.
```{r greengenes_save, cache = TRUE}
save(file = file.path(output_folder, "greengenes_data.RData"),
greengenes_plot_all, greengenes_plot_pcr_fail)
```
## Software and packages used
```{r, cache = TRUE}
sessionInfo()
```
## References