Manual: example of usage
This documentation applies to version 0.0.2 (current stable release)
To generate new data, we perform several steps:
Samples are first checked for their distribution; the forward and inverse normalization functions are adjusted so that the data approximate a normal distribution. The samples are then clustered (clustering by a beta-diversity metric is also implemented). Within each cluster, for every pair of species, the R2 of the best generalized linear model (glm) and the probability of coexistence are calculated. The coexistence probability uses the formula P(coexistence) = P(A & B) / P(A | B), where A and B are logical arrays indicating the presence of each species in the samples, & denotes "both present" and | denotes "at least one present". The glm R2 is calculated only from samples in which both species occur. Other predictive approaches based on species clustering, including a naive Bayes classifier, are currently being implemented. By default, the Gaussian family is used for the generalized linear model. The resulting data are saved in the samovar database.
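The coexistence-probability formula above can be sketched in a few lines of R. This is an illustration of the formula, not the package's internal code; the function name is hypothetical.

```r
# Coexistence probability of two species from their presence patterns:
# P(A & B) / P(A | B), with A and B logical presence vectors over samples.
co_occurrence_prob <- function(A, B) {
  sum(A & B) / sum(A | B)
}

A <- c(TRUE, TRUE, FALSE, TRUE)   # species A present in samples 1, 2, 4
B <- c(TRUE, FALSE, FALSE, TRUE)  # species B present in samples 1, 4
co_occurrence_prob(A, B)          # 2 joint presences / 3 samples with either = 2/3
```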
Generation of new samples then begins using the resulting database. Initially, an abundance value is set for one of the species. At each iteration, a random value is drawn and compared with the maximum probability (not equal to 1) in the cluster's coexistence-probability table between the already-predicted and not-yet-predicted species. If the random value is larger, the species is removed from the generated sample and from the probability matrix; if it is smaller, the species is added to the sample. The output is a logical array indicating the presence of each species in the generated sample for the current cluster.
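The presence-sampling loop can be sketched as follows. This is a simplified, hypothetical illustration of the iteration described above, not the package's internals; the matrix and loop structure are assumptions.

```r
# Decide species presence by comparing a random draw with the maximum
# coexistence probability to any species already in the sample.
set.seed(1)
p_matrix <- matrix(c(1, 0.8, 0.8, 1), 2, 2,
                   dimnames = list(c("sp1", "sp2"), c("sp1", "sp2")))
present <- c(sp1 = TRUE)                      # initialized species
for (sp in "sp2") {
  p_max <- max(p_matrix[sp, names(present)[present]])  # here 0.8
  present[sp] <- runif(1) <= p_max            # keep species if draw <= p_max
}
present
```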
Then, based on the R2 matrix for all species of this cluster present in the generated sample, a directed generation graph is constructed, starting from the initialized species. At each iteration of the depth-first traversal, the maximum R2 value between species already in the graph and species still absent from it is found, and a new dependency edge is added. Based on the resulting graph, values are predicted for all species present in the generated sample; the output is a numeric array of normalized predicted species values. At the next stage, the maximum R2 value between the mean values of the predicted and not-yet-predicted clusters is found. The generated data are used for multivariate regression between species abundances in different clusters, and the predicted abundance of the species with the maximum R2 serves as the initialization for the next cluster. This scenario repeats until the presence, abundance, and probability of all species in all samples are determined.
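The greedy construction of the generation graph can be sketched like this. The R2 matrix and loop are a simplified hypothetical illustration of the traversal described above, not package code.

```r
# Repeatedly connect the absent species with the highest R^2 to a species
# already in the graph, starting from the initialized species "a".
r2 <- matrix(c(NA, .9, .2,
               .9, NA, .7,
               .2, .7, NA), 3, 3,
             dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
present <- "a"; absent <- c("b", "c"); edges <- list()
while (length(absent) > 0) {
  sub  <- r2[present, absent, drop = FALSE]
  idx  <- which(sub == max(sub, na.rm = TRUE), arr.ind = TRUE)[1, ]
  from <- present[idx["row"]]; to <- absent[idx["col"]]
  edges[[length(edges) + 1]] <- c(from, to)
  present <- c(present, to); absent <- setdiff(absent, to)
}
edges  # a -> b (R^2 = 0.9), then b -> c (R^2 = 0.7)
```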
At the last stage, the normalized species abundances are converted back to their original scale using the inverse normalization function.
If you want to use your own abundance file, skip this step
data <- GMrepo_type2data(mesh_ids, ...)
# or, to use a run list:
data <- GMrepo_run2data(runs, ...)
# Parameters:
# mesh_ids - character; MeSH IDs of all types to use from GMrepo
# runs - character; IDs of all runs to use from GMrepo
# number_to_process - numeric vector; number of runs per mesh ID. If not specified, all runs are downloaded
# experiment_type - character; download only runs of this experiment type from GMrepo. If not specified, all runs are downloaded
# test - logical. If TRUE, test data are used
# Output
# List:
# [1] data.frame; columns: NCBI taxon IDs, species names, all run IDs; rows: species; values: percentages of total abundance per sample, up to 100%
# [2] data.frame; columns: runID, disease
# [3] list;
# [3][runID] for each runID, a list of length 16 with metadata from GMrepo
Output - abundance matrix
At this stage, you can split your data into groups and filter it using metadata
To check the distribution and its variations, you can run:
data %>% res_trim (...) %>% plot_data_with_treshhold()
data %>% res_trim (...) %>% plot_data_n2amount(normalisation_function = log10)
# for test and similar data, you can use:
data %>% res_trim (...) %>% plot_data_n2amount(normalisation_function = function(x) log10(1/(1-log10(x)))*(1-2*log10(x)))
# Parameters:
# res_trim:
# treshhold_amount - numeric, from 0 to 1; minimal mean abundance across samples for a species to be kept in the pipeline
# treshhold_presence - numeric, from 0 to the number of samples; minimal number of samples a species must be present in to be kept in the pipeline
# plots:
# normalisation_function - function(x), used to normalise data
# split_n - number of entries to highlight on the plot
data <- data %>% res_trim (treshhold_amount = 10^(-4))
data_scaled <- data %>% res_normalize(normalisation_function = function(x) log10(1/(1-log10(x)))*(1-2*log10(x)) )
# Parameters:
# res_trim:
# treshhold_amount - numeric, from 0 to 1; minimal mean abundance across samples for a species to be kept in the pipeline
# treshhold_presence - numeric, from 0 to the number of samples; minimal number of samples a species must be present in to be kept in the pipeline
# res_normalize:
# normalisation_function - function(x), used to normalize data. log by default
# The reverse of this operation is res_unscale(X, data, data_scaled), which maps data_scaled(X) back to data(X)
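The normalisation function used above has no simple closed-form inverse, so a numerical inverse can illustrate how the unscaling step works in principle. This is a minimal sketch with `uniroot()`, not the package's `res_unscale()` implementation.

```r
# Forward normalisation from the pipeline above, and a numerical inverse.
fwd <- function(x) log10(1 / (1 - log10(x))) * (1 - 2 * log10(x))
inv <- function(y) {
  # fwd() is monotonic on (0, 1], so a root search recovers x from y
  uniroot(function(x) fwd(x) - y, interval = c(1e-12, 1), tol = 1e-12)$root
}

x <- 1e-3
y <- fwd(x)   # scaled value
inv(y)        # recovers x (approximately 1e-3)
```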
Visualize the clustering and determine the number of k-means clusters used to split the initial matrix. The pipeline should run faster with larger k-means values, but extreme numbers result in a loss of complexity
data_scaled %>% res_cluster_dendro
data_scaled %>% res_cluster_PCA2D
# Parameters:
# k_means - numeric, number of k-means to use
# speices - logical, TRUE by default; if FALSE, the distribution of runs is shown instead
Clustering and PCA results. In the PCA plot, each colour corresponds to a different cluster
samovar <- data_scaled %>% build_samovar(...)
# Input:
# data_scaled: scaled data frame; columns, samples; rows, species
# Parameters:
# k_means - numeric, number of k-means clusters to use
# inner_model - character, family of the generalized linear model used within each species cluster, passed to glm(). gaussian by default
# inter_model - character, family of the generalized linear model used between species clusters, passed to glm(). gaussian by default
# prepared - logical. Set TRUE if you followed the pipeline steps or have a data frame with all values above 0 and a normal distribution. FALSE by default. _to be implemented
# Output:
# List:
# [1] list
# [1][cluster] for each cluster: matrix of glm R squares within the cluster
# Currently calculated as:
# glm(sp1[value > 0] ~ sp2[value > 0], family = inner_model)
# [2] list
# [2][cluster] for each cluster: matrix of probabilities that each pair of species co-occurs.
# Currently calculated as:
# N(cases with both species) / N(cases with at least one species), i.e. W(a&b)/W(a|b). Bayesian analysis of co-occurrence is planned.
# [3] matrix of glm R squares between each pair of clusters
# Currently calculated as:
# glm(mean(cl1) ~ mean(cl2), family = inter_model). A multidimensional GLM is planned for this calculation.
# [4] list, samovar parameters
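The per-pair R2 calculation can be illustrated with a plain `glm()` fit. This is a standalone sketch of the idea (a deviance-based pseudo-R2 from a Gaussian glm on co-occurring samples), not the package's exact code; the variable names are invented.

```r
# R^2-like value from a Gaussian glm fit, using only samples where
# both species occur, mirroring the calculation described above.
set.seed(2)
sp1 <- runif(30)
sp2 <- 0.5 * sp1 + rnorm(30, sd = 0.05)
keep <- sp1 > 0 & sp2 > 0                      # both species present
fit  <- glm(sp1[keep] ~ sp2[keep], family = gaussian())
r2   <- 1 - fit$deviance / fit$null.deviance   # pseudo-R^2
r2
```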
new_reads <- boil(data, data_scaled, samovar, ...)
# Input
# if prepared = TRUE
# data: data.frame; columns, samples; rows, species
# data_scaled: scaled data.frame; columns, samples; rows, species
# samovar: list from the previous step
# if prepared = FALSE
# data: data.frame; columns, samples; rows, species
# Parameters:
# prepared - logical. Whether all data are already prepared
# reps - number of repeats
# ... - all parameters from the previous stages
See benchmarking for some validation tricks, or explore diversity differences using
vegan::vegdist()
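A minimal sketch of such a beta-diversity comparison with `vegan::vegdist()` followed by a principal coordinates analysis (the data here are random placeholders; requires the vegan package):

```r
# PCoA on Bray-Curtis distances to compare sample compositions.
library(vegan)

abund <- matrix(runif(40), nrow = 8)       # 8 samples x 5 species (placeholder)
d     <- vegdist(abund, method = "bray")   # Bray-Curtis beta diversity
pcoa  <- cmdscale(d, k = 2)                # classical MDS / PCoA
plot(pcoa, xlab = "PCo1", ylab = "PCo2")
```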
Evaluate results via beta-diversity PCA. See benchmarking, or feel free to contact us with any questions