Skip to content

the_vignette

Leonel Herrera-Alsina edited this page Aug 7, 2023 · 5 revisions

title: "Using LEMAD" author: "Leonel Herrera-Alsina" date: "r Sys.Date()" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Using LEMAD} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}

Introduction

Our Lineage Extinction Model of Ancestral Distribution (LEMAD) computes the likelihood of the modern distribution of species (given the parameters of the model) where lineage extinction is a fundamental part of the calculation. LEMAD generalizes the likelihood described in GeoSSE for any number of areas and under several sets of geographic assumptions that facilitate its use in Ancestral Distribution Estimation (ADE). In LEMAD, the species’ geographic distribution changes over time in a similar manner as an evolving discrete trait does in SSE models. Lineages expand and contract their geographic distribution by colonising new regions and going locally extinct from others; these processes take place along the branches of the phylogenetic tree. LEMAD assumes that rates of speciation and extinction are constant across regions. DIVA and DEC are popular methods for ADE and they differ in the set of geographic assumptions (see below); in a LEMAD framework, we can use both DIVA and DEC sets of assumptions (we call them DIVAevents and DECevents). DIVAevents assumes that widespread species can split their ranges (vicariance) in any combination regardless the number of areas where daughter lineages inhabit (e.g., a species presents in region A, B, C and D can split in AB-CD or A-BCD; widespread vicariance sensu Matzke 2013) while DECevents assumes that one of the daughter lineages will be present at a single region (e.g., ABCD species splits in A-BCD or B-ACD; narrow vicariance). For in-situ speciation and in contrast with DIVAevents, DECevents allows that widespread species have a given population diverging from the rest, becoming a different species which coexists with the parental lineage e.g., ABCD species produces ABCD and the new species is restricted to A (in-situ subset hereafter; sympatry subset sensu Matzke 2013).

It could be convenient to remove all the objects in memory and then load LEMAD:

rm(list = ls())
library(DDD)
library(lemad)
library(phytools)

Similar to the DIVA, DEC, BioGeoBears,LEMAD uses two input files: 1) a rooted, ultrametric tree in nexus format (for conversion of other formats to nexus, we refer to the documentation in package 'ape') and 2) a data file with geographic information. This file will have two columns, the first containing taxa names (which should match the names in the phylogeny) and the second a code for the region where each species inhabits (usually A,B,C, etc., but notice that 'NA' is a valid code too, if you are not sure about the distribution of that species). A comma-separated value file (.csv) generated in MsExcel works particularly well. The *.csv file can be loaded into R using the read.csv() function. In this vignette, we are loading R objects (using the function data()) instead of nexus trees or csv files as it was better to handle when building this vignette document. But, we provide a line commented out on how to the file uploading should be done:

# amazilia_distribution <- read.csv(file="humm_distribution.csv")
data(distributioninfo)
head(amazilia_distribution)

In this data set (here we see only the top lines of the data frame), hummingbird species can live in regions A, B or C or any combination of them. Note that first species is a multi-region one, and the proper way to indicate it is with "AB".

The second object we need is an ultrametric phylogenetic tree, that is rooted and has labeled tips. One can load it in R by using read.nexus(). In our example we load a prepared phylogeny named "phylo_Vign":

# amazilia_tree <- read.nexus("amazilia_phylo.nex") # example of how to load a tree
data("phylo_Vign2")

To run LEMAD it is important that tree tip labels agree with taxon names in the distribution file, but also that these are in the same order. To that end, we run the following piece of code prior to any analysis:

rownames(amazilia_distribution) <- amazilia_distribution$species
geiger:::name.check(amazilia_tree, amazilia_distribution, data.names=NULL)
amazilia_distribution$region <- toupper(amazilia_distribution$region)
amazilia_distribution <- amazilia_distribution[order(match(amazilia_distribution$species,amazilia_tree$tip.label)),]

(amazilia_distribution$species) == (amazilia_tree$tip.label)

The last line should produce a vector of only TRUE which confirms our phylogenic tree and species distribution object match each other. Now that our phylogenetic tree and species distribution are working fine, we declare these variables for simplicity:

phylotree_recons <- amazilia_tree
species_presence <- amazilia_distribution$region

If the letters that are coding the distribution of species is unsorted (e.g., BAC instead of ABC), we can run the following lines to solve it.

#for(ik in 1:length(species_presence)){
#  string_to_sort <- species_presence[ik]
#  species_presence[ik] <- paste(sort(unlist(strsplit(string_to_sort, ""))), collapse = "")
#}

We need to specify that our analysis will have three region. We do not have to specify all the multi-region combinations.

all_areas <- c("A","B","C")

Many datasets are not perfect and we have to learn to live with that. Because LEMAD considers the lineages that went extinct (i.e., missing branches) into the analysis, if our phylogenetic tree is incomplete because of sampling issues, the estimation of extinction can be off. Although LEMAD would not distinguish between missing branches due to extinction and missing branches due to sampling (e.g., it was not possible to obtain molecular information for all the extant species), we can help the analysis by specifying we are aware of sampling issues in our tree. In this example, our tree includes 80% of all the species living in region A, the 57% of species living in B and 83% of the species living in both. All species living region C are present in the tree, also all the species living in AC or BC, etc. We set the following list to set the regions that are incomplete (regions whose inhabitants are all present in the tree, do not have to be specified)

missing_spp_areas <- list()
missing_spp_areas[[1]] <- c("A","B","AB")
missing_spp_areas[[2]] <- c(0.8,0.57,0.83)

Notice that the order of the second list object should match that of the first list object. We need to set what is the maximum number of areas where a species can take at a given point in time. This means, the most widespread ancestor, could it be spread in A, B and C? or we limit that lineages can only live in two regions? In the case of this group of hummingbirds, there are few species that are indeed, present at the three regions at the present. So, it might be safe to assume that ancestors could achieve the same. In datasets with more regions, one should make a decision whether a species can be truly "cosmopolitan" or not.

num_max_multiregion <- 3

By doing that line, we are saying that lineages can be present in: "A" "B" "C" "AB" "AC" "BC" "ABC" Now, let's imagine that we have four regions (A,B,C,D) and we still think that the maximum number of regions a species can simultaneously be present at is 3. This makes that all the possible distributions are: "A" "B" "C" "D" "AB" "AC" "AD" "BC" "BD" "CD" "ABC" "ABD" "ACD" "BCD" "ABCD"

Imagine we have regions A,B,C,D and E, setting the maximum number of regions to occur at the same time has a major consequence in the total number of potential distributions: when num_max_multiregion = 2, we get 15 different distributions when num_max_multiregion = 3, we get 25 different distributions when num_max_multiregion = 4, we get 30 different distributions

The higher the number of potential distributions, the more equations have to be solved and the slower the analysis is. We cannot set num_max_multiregion to 1 because we need some (at least) two-region species because the model assumes there are vicariance events. LEMAD also offers the option to account for an region that it was part of the past distribution of a clade but it is not at the present time. Imagine Antarctica for instance which might have been key for the diversification process of a clade/radiation but at some point stopped being inhabited by species. This means that our clade is distributed in regions A, B and C but region D (currently empty, no modern species lives in D) might have been important. Notice that we are taking the model at very limit because there is no tree tip that contributes to the calculation of probabilities of ancestors living in region D. To run this analysis, we use the same dataset, the only thing that changes is the argument num_max_multiregion. For the analysis that includes an area with no modern species, we do: num_max_multiregion <- c(3,"D"). Similar to what we set up before, we are saying that the maximum number of regions that an ancestor can take is 3. And that, besides the regions A,B and C that we define in all_areas, we have region "D" too.

In LEMAD we can fit models where the extinction rate is fixed to a certain figure. This means that the provided rate will be plugged into the analysis. At the end, we will get a likelihood value for such a model. When fixing the model to a different extinction rate, the likelihood could be different. By comparing both results, we will know which model is supported better. Also, we can let the extinction rate be estimated during the calculation. To do that we use the argument:

#lineage_extinction <- "free"
# or
lineage_extinction <- 0.001

Before starting the optimization (Maximum likelihood search, which will find the combinations of rates that fit best), we can provide of "initial values". The search algorithm needs to start calculating the very first likelihood using a set of rates. From there, it will try rates that are slightly lower or higher, calculate the likelihood and check whether those likelihoods are better than the resulting one from the previous parameter combination. This makes the algorithm decide which direction to take.

initial_lambda <- c(0.01,0.01)
initial_disperextirpation <- 0.002

Those lines are saying that we want to start the ML search where speciation is 0.01 and the rate of colonizing and disappearing from regions is 0.002. Note that the initial lambda is a vector of length 2 because we have two types of speciation: in-situ and vicariance. We need to provide a value for each type of speciation. If we do not have a better initial guess, we do both 0.01. Again, these are only starting points. If we have no clue what a sensible starting point would be, we set both arguments to NULL and LEMAD will provide a standard starting point (the lambda will come from fitting a birth-death model to the phylogenetic tree). In principle, it does not matter where we start the search, it should always converge to the same value. LEMAD allows fitting models under the assumptions of DIVAevents or DECevets (see Introduction). Because they are different parameterizations of the same model, the comparison of their likelihoods is valid and straightforward so data will inform which set of assumptions is more likely. We need to use the variable:

DEC_events <- FALSE # to choose DIVAevents
# or when choosing DECevents:
# DEC_events <- TRUE 

There are cases where the user wishes to define his/her own matrix of dispersal and extirpation events. If you do not need this, please skip this part and jumo to "Finally, we can incorporate information .....". Let us first see what the default matrix is like, we do that with the following lines:

object_a <- prepare_full_lambdas_vicariance(all_areas,num_max_multiregion,1,2,DEC_events)

my_dispersal_extirpationMatrix <- lemad_prepare_q_matrix(
  object_a$all_area_combination,object_a$matrices_names,4,4)
my_dispersal_extirpationMatrix

If you are setting up an analysis that includes "a region with no modern species in", you might need to use

prepare_full_lambdas_vicariance(c(all_areas,"D"),num_max_multiregion,1,2,DEC_events)

so the area "D" is included too.
Now, we can call my_dispersal_extirpationMatrix and edit as we wish. Notice that we use the number 4 to indicate what movements/transitions are possible and 0 for those that are not. To edit, you could do:

# my_dispersal_extirpationMatrix[1,4] <- 0

Remember that the main diagonal should keep NAs. Please make sure that after editing, the following expression class(my_dispersal_extirpationMatrix) retrieves "matrix" "array". The last step is to do:

initial_disperextirpation <- my_dispersal_extirpationMatrix

In summary, initial_disperextirpation can 1) be NULL (lemad will take 1/5 of initial lambda to start the optimization), 2) be a number provided by the user (lemad will start the optimization from this value) or 3) be a dispersal/extirpation matrix defined by the user (in that case the optimization will start in a value 1/5 lambda).

Finally, we can incorporate information on the distribution of the common ancestor of the whole clade. So, LEMAD will compute the likelihood conditioned on the information we provided. To indicate that the very first ancestor of the clade was present in area A, We do:

# condition_on_origin <- A

Generally, we lack this information so we normally want LEMAD not to condition on a particular area. We do:

condition_on_origin <- NULL

We can specify what optimizer (i.e., algorithm to maximize the likelihood) lemad will use:

optimizer <- "simplex"

To launch the analysis:

output_vig <- lemad_analysis(
 phylotree_recons,
 species_presence,
 areas = all_areas,
 num_max_multiregion,
 condition_on_origin,
 DEC_events,
 missing_spp_areas = missing_spp_areas,
 lineage_extinction = lineage_extinction,
 initial_lambda = initial_lambda,
 initial_disperextirpation = initial_disperextirpation,
 optimizer = optimizer)

This vignette runs the analysis itself. You will see all the loglik values that are tried out during the optimization.That is what you normally see in your console when running an analysis. I will now show what the output is like.

output_vig$model_ml
output_vig$number_free_pars

With the information from "output model_ml" and "output number_free_pars" we can compare across models using AIC. We can see the rates of the model, those are the rates that maximize the likelihood.

output_vig$estimated_rates

In-situ speciation is estimated to be higher than vicariance. And, extinction is fixed at 0.001 as we set it up. output$ancestral_states will provide the probability of each ancestor (internal node) being at each region (or combination of regions), here is the first lines of such an object:

head(output_vig$ancestral_states)

Those are the probabilities for the internal nodes 67, 68 up to 72. Now, one can use a function to plot those values into our phylogenetic tree. Lemad provides a function to plot your results. It is by no means intended to be the ultimate plotting function, but it is useful to see right away what our results look like. In LEMAD, we use output_vig$ancestral_states for this. This object is a table where the rownames are the id for the nodes in our tree. In columns, it is shown the probability of each one being at each region in our system. We cab see that, for node with ID 67, the probability of being in area A is 0.268, being at B is 0.284 and so forth. Because sometimes the nodes in a tree are labelled in an unconventional way, I recommend to first plot our tree and see the node ID:

plot.phylo(phylotree_recons,show.tip.label=F)
nodelabels()

We can now see that the id for the root node is 67, which matches the first row in our output_vig$ancestral_states, so we can carry on. The plotting function will take the same objects that the ones used for the main analysis. We just need to add the colors we choose. The order of the colors should be in the order we have the all_areas vector.

colors_for_pie <-  c("blue","red","green")
probabilities_at_node <- output_vig$ancestral_states
plot_biogeo_reconst(probabilities_at_node,
                     num_max_multiregion,           
                                all_areas,
                                phylotree_recons,
                                species_presence,
                                colors_for_pie)

Note that when ancestors are reconstructed to be present in one area, the node will be plotted in a solid color matching "colors_for_pie".However, when the ancestor is reconstructed to have had a multi-area distribution, the node will like like a pie chart. If two colors are shown, it means that the ancestor was present in BOTH areas, NOT that there was the same probability of being present in either area.