publication--03--greengenes.Rmd

---
title: Graphing the Greengenes database
bibliography: bibliography.bibtex
---

```{r init, echo = FALSE, message = FALSE}
library(knitr)
if (! is.null(current_input())) { # if knitr is being used
  knitr::read_chunk("settings.R")
} else {
  source("settings.R")
}
```

```{r rendering_settings, include=FALSE}
```

## Requirements 

**NOTE:** This analysis requires at least 10Gb of RAM to run.
It uses large files not included in the repository and many steps can take a few minutes to run. 

## Parameters

### Analysis input/output

```{r io_settings}
```

### Analysis parameters

These settings are shared between the RDP, SILVA, and Greengenes analyses, since the results of those three analyeses are combined in one plot later. 
This ensures that all of the graphs use the same color and size scales, instead of optimizing them for each data set, as would be done automatically otherwise.

```{r database_comparison_settings}
```


## Parse database

The greengenes database stores sequences in one file and taxonomy information in another and the order of the two files differ making parseing more difficult than the other databases.
Since taxonomy inforamtion is needed for creating the `taxmap` data structure, we will parse it first and add the sequence information on after.
The input files are not included in this repository because they are large, but you can download them from the Greengenes website here: 

http://greengenes.lbl.gov/Download/

### Parse taxonomy file

```{r greengenes_taxonomy, cache = TRUE}
gg_taxonomy_path <- file.path(input_folder, "gg_13_5_taxonomy.txt")
gg_taxonomy <- readLines(gg_taxonomy_path)
print(gg_taxonomy[1:5])
```

Note that there are some ranks with no names. 
These will be removed after parsing the file since they provide no information and an uniform-length taxonomy is not needed.

```{r}
library(metacoder)
```

```{r greengenes_taxonomy_parse, cache = TRUE}
# Parse taxonomy file
greengenes <- extract_tax_data(gg_taxonomy,
                               key = c(id = "info", "class"),
                               regex = "^([0-9]+)\t(.*)$",
                               class_sep = "; ",
                               class_regex = "^([a-z]{1})__(.*)$",
                               class_key = c(rank = "taxon_rank", "taxon_name"))
greengenes$data$class_data <- NULL

# Remove data for ranks with no information
greengenes <- filter_taxa(greengenes, taxon_names != "")
print(greengenes)
```


### Parse sequence file

Next we will parse the sequence file so we can add it to the `obs_data` table of the `greengenes` object.

```{r greengenes_sequence, cache = TRUE}
gg_sequence_path <- file.path(input_folder, "gg_13_5.fasta")
substr(readLines(gg_sequence_path, n = 10), 1, 100)
gg_seq <- ape::read.FASTA(gg_sequence_path)
```

### Integrating sequence and taxonomy

We will need to use the Greengenes ID to match up which sequence goes with which row since they are in different orders.

```{r greengenes_combine, cache = TRUE}
greengenes <- mutate_obs(greengenes, "sequence",
                         setNames(gg_seq[id], greengenes$data$tax_data$taxon_id))
```


## Subset

Next I will subset the taxa in the dataset (depending on parameter settings). 
This can help make the graphs less cluttered and make it easier to compare databases. 

```{r greengenes_subset, cache = TRUE}
if (! is.null(min_seq_count)) {
  greengenes <- filter_taxa(greengenes, n_obs >= min_seq_count)
}
if (just_bacteria) {
  greengenes <- filter_taxa(greengenes, taxon_names == "Bacteria", subtaxa = TRUE)
}
if (! is.null(max_taxonomy_depth)) {
  greengenes <- filter_taxa(greengenes, n_supertaxa <= max_taxonomy_depth)
}
print(greengenes)
```


## Remove chloroplast sequences

These are not bacterial and will bias the *in silico* PCR results.
Note that the `invert` option makes it so taxa are included that *did not* pass the filter. 
This is different than simply using `name != "Chloroplast", subtaxa = TRUE` since the effects of `invert` are applied after those of `subtaxa`.

```{r greengenes_rm_chloro, cache = TRUE}
greengenes <- filter_taxa(greengenes, taxon_names == "Chloroplast", subtaxa = TRUE, invert = TRUE)
print(greengenes)
```


## Plot whole database

Although graphing everything can be a bit overwhelming (`r length(taxon_names(greengenes))` taxa), it gives an intuitive feel for the complexity of the database:

```{r greengenes_plot_all, cache = TRUE}
greengenes_plot_all <- heat_tree(greengenes, 
                                 node_size = n_obs,
                                 node_color = n_obs,
                                 node_size_range = size_range * 2,
                                 edge_size_range = size_range,
                                 node_size_interval = all_size_interval,
                                 edge_size_interval = all_size_interval,
                                 node_color_interval = all_size_interval,
                                 edge_color_interval = all_size_interval,
                                 node_label = taxon_names,
                                 node_label_max = label_max,
                                 node_label_size_range = label_size_range,
                                 node_color_axis_label = "Sequence count",
                                 output_file = result_path("greengenes--all"))
print(greengenes_plot_all)
```


## PCR

Before doing the digital PCR, I will filter for only full length sequences, since shorter sequences might not have a primer binding site and the digital PCR would fail, not because of an inability of a real primer to bind to real DNA, but because of missing sequence information in the database. 
Ideally, this kind of filtering for full length sequences would involve something like a multiple sequences alignment so we don't remove sequences that are actually full length, but just happen to be shorter than the cutoff of `r min_seq_length` naturally.
However, this method is easy and should work OK.

```{r greengenes_length_filter, cache = TRUE}
if (! is.null(min_seq_length)) {
  greengenes <- filter_obs(greengenes, data = c("tax_data", "sequence"), drop_taxa = TRUE,
                           vapply(sequence, length, numeric(1)) >= min_seq_length)
}
print(greengenes)
```

Next I will conduct digital PCR with a new set of universal 16S primers [@walters2016improved], allowing for a maximum mismatch of `r max_mismatch`%.

```{r greengenes_pcr, cache = TRUE}
greengenes_pcr <- primersearch(greengenes, sequence,
                               forward = forward_primer,
                               reverse = reverse_primer,
                               mismatch = max_mismatch)
```

Now the object `greengenes_pcr` has all the information that `greengenes` has plus the results of the digital PCR.
Lets plot the whole database again, but coloring based on digital PCR success.

```{r greengenes_plot_pcr_all, cache = TRUE}
greengenes_plot_pcr_all <- heat_tree(greengenes_pcr,
                                     node_size = seq_count,
                                     node_size_range = size_range * 2,
                                     edge_size_range = size_range,
                                     node_size_interval = all_size_interval,
                                     edge_size_interval = all_size_interval,
                                     node_label = taxon_names,
                                     node_color = prop_amplified,
                                     node_color_range =  pcr_success_color_scale,
                                     node_color_trans = "linear",
                                     node_label_max = label_max,
                                     node_label_size_range = label_size_range,
                                     edge_color_interval = c(0, 1),
                                     node_color_interval = c(0, 1),
                                     node_color_axis_label = "Proportion PCR success",
                                     node_size_axis_label = "Sequence count",
                                     output_file = result_path("greengenes--pcr_all"))
print(greengenes_plot_pcr_all)
```

Since these are universal bacterial primers, we would expect most of the database to amplify.
The plot shows this and very few sequences were not amplified.
If we wanted to look at only those sequences that did not amplify, we could filter taxa by PCR success and plot what remains. 
I am including the supertaxa of those taxa that did not amplify since excluding them would split the tree into a cloud of fragments lacking taxonomic context.

```{r greengenes_plot_pcr_fail, cache = TRUE}
greengenes_plot_pcr_fail <- greengenes_pcr %>%
  filter_taxa(prop_amplified < pcr_success_cutoff, supertaxa = TRUE) %>%
  heat_tree(node_size = query_count - seq_count,
            node_label = taxon_names,
            node_color = prop_amplified,
            node_size_range = size_range * 2,
            edge_size_range = size_range,
            node_size_interval = pcr_size_interval,
            edge_size_interval = pcr_size_interval,
            node_color_range =  pcr_success_color_scale,
            node_color_trans = "linear",
            node_color_interval = c(0, 1),
            edge_color_interval = c(0, 1),
            node_label_size_range = label_size_range,
            node_label_max = label_max,
            node_color_axis_label = "Proportion PCR success",
            node_size_axis_label = "Sequences not amplified",
            output_file = result_path("greengenes--pcr_fail"))
print(greengenes_plot_pcr_fail)
```

## Save outputs for composite figure

Some results from this file will be combined with others from similar analyses to make a composite figure.
Below, the needed objects are saved so that they can be loaded by another Rmd file.

```{r greengenes_save, cache = TRUE}
save(file = file.path(output_folder, "greengenes_data.RData"),
     greengenes_plot_all, greengenes_plot_pcr_fail)
```

## Software and packages used

```{r, cache = TRUE}
sessionInfo()
```


## References