diff --git a/doc/InvTraitR_documentation.Rmd b/doc/InvTraitR_documentation.Rmd deleted file mode 100644 index 268b609..0000000 --- a/doc/InvTraitR_documentation.Rmd +++ /dev/null @@ -1,234 +0,0 @@ ---- -title: "InvTraitR project documentation" -output: rmarkdown::html_vignette ---- - -### Forms of length-dry biomass models - -Allometric models generally take two different forms. The standard form is a power function with a multiplicative error structure (model1): - -Y = a x (X^b) x exp(error) - -This is the mode fit when the Y and X variables are log-transformed. For example, let's consider taking the natural logarithm of both sides of this equation: - -log(Y) = log(a) + log(X^b) + error - -This equations simplifies to: - -log(Y) = log(a) + (b x log(X)) + error - -This simplified version is the standard linear regression equation. However, given that we fit the model in log-log space with additive error, when we back-transform the error, it is no longer centered on zero. This means that we typically require a correction factor to obtain the appropriate back-transformed Y value. - -The second model form is a model that is fit directly using non-linear regression models (model2): - -Y = a x (X^b) + error - -When models are fit using this type of non-linear regression equation, then we do not need any corrections to get the predictions of Y on the natural scale. - -The vast majority of allometric equations in the literature take the form of model1. - - -### Preservation effects - -The recommended approach to generating length-dry biomass equations is to use fresh specimens or frozen specimens (Benke et al. 1999). This is because preservation in commonly used preservatives like formalin and ethanol can cause mass loss as lipids and other compounds may dissolve. - -Loss in dry mass due to preservation can be substantial. For example, Mahrlein et al. (2016) report dry weight loss of more than 20% due to preservation in ethanol. Similarly, Wetzel et al. (2005) report dry weight loss of more than 30% due to both formalin and ethanol preservation. This is in contrast to the common doctrine that preservation-induced mass loss is less prevalent when animals are preserved in formalin (Leuven et al. 1985). These numbers (20 or 30%) are significant and, therefore, we should do our best to account for them to the best of our ability. - -Generating equations or gathering test data that used different preservation methods can lead to a large additional source of error when estimating dry biomass. However, there is another issue with preservation methods. Usually, we want to estimate dry biomass for communities or species that were collected and preserved. However, individuals that are preserved in ethanol or formalin can also change their lengths. Thus, when these preserved individuals are measured for length, the measured lengths can be incorrect. - -We need to look carefully at which preservation methods were used to generate the equation and which preservation is used for the testing data or for the dataset where the length data come from. - -*What information do we need to extract?* - -Add preservation method information for each equation. I think I'm going to stick with three possibilities: - -+ unpreserved -+ formaldehyde -+ ethanol - -Add a preservation correction factor for equations generated with preserved specimens. For example, Dekanova et al. (2022) generated their length-dry biomass equations using formalin-preserved specimens. We need to add some correction factor to these equations that increases the dry biomass estimate to account for this. For example, Mahrlein et al. (2016) propose a 22% correction factor but we'll need to check how consistent this is. The point is that we'll need to add a correction factor to each equation that was generated using preserved specimens. This will be based on the preservation method used. - -In addition, we need to provide a length correction factor because preserved animals also change their length. Whether this length correction factor is required will depend on the preservation status of the equation: - -+ length preserved -> equation unpreserved (correct length upwards) -+ length unpreserved -> equation preserved (correct length downwards) - -The details of how to do this will depend and I'll need to figure it out. - -A nice comparison figure for the manuscript will be something like comparing accuracy of the method when accounting for preservation effects and when not accounting for preservation effects. - -I think this is actually very important because most people that use these equations generally don't think too much about the preservation effects. This makes sense given that it's a lot of extra work. But, if it only needs to be done once (by us) then it might improve these biomass estimations a lot. - -### Prediction error calculations - -A general note is that mean squared error (MSE) is calculated in the same way as residual mean square (RMS). As far as I can tell, RMS is the older term. In this mathematics tutorial, RMS and MSE are used interchangeably (https://web.maths.unsw.edu.au/~adelle/Garvan/Assays/GoodnessOfFit.html). Nonetheless, it is calculated as: - -MSE = (1/n)*sum( (y - y-hat)^2 ) - -So, it is the average of the squared residuals (in words). - -If we take the square root of this, we get the RMSE which is the root mean squared error. - -RMSE = sqrt(MSE) - -This is typically an estimate of the standard deviation of the regression. But, the denominator is not (1/n), it is (1/n-K) where k is the number of parameters the model is fit to. Thus, the standard deviation of the regression (s) is simply: - -MSE = sqrt( (1/n-K)*sum( (y - y-hat)^2 ) ) - -So, given that we are only dealing with models with two parameters, this small difference can probably be ignored and we can use RMSE as a measure of the standard deviation of the model. - - -### Back-transformation problems for log-linear models - -The problems with using log-linear models for prediction on the original scale of the response variable are well known (e.g. Marhrlein et al. 2016). Models of the form: - -+ Y = a * X^b - -Are typically transformed onto the log-scale as follows: - -+ log(Y) = log(a) + b*log(X) - -The problem with this transformation is that the errors become multiplicative and not additive on the log-scale. Thus, predictions of the expected value of Y become the geometric mean rather than the arithmetric which is always an underestimation. This means that predictions on the original-scale tend to be biased in such models. How does this bias work mathematically? Good explanation can be found here: https://timeseriesreasoning.com/contents/robust-linear-models/ - -When we fit the model on the log-scale, we get the expected value of log(y) given X which is just the fitted model coefficients multiplied by the predictor variables with some additive error. - -E(log(y)|X) = β_fitted*X + e - -We, however, want E(exp(log(y))|X). But, if we simply take the exponential of both sides of the equation, we get the following: - -exp(E(log(y)|X)) = exp(β_fitted*X + e) - -We solve this by taking the expectation of both sides E and after some simplification, we get to: - -E(y|X) = E(exp(β_fitted*X)) * E(exp(e)) - -This shows that to get the quantity we want (i.e. E(y|X)), we need to multiply the model prediction by expectation of the exponentiated errors. - -There are several correction factors that can be used to solve this issue. FreshInvTraitR uses three of these correction factors which depend on different sources of information (e.g. model fit: r2, maximimum and minimum y-values, residual mean square and the residuals). - -*Duan's smearing factor* - -The first correction that we use is the standard Duan's smearing factor. This correction factor requires data on the residuals to calculate and, therefore, it is typically not easy to extract from the literature. However, some papers report it and we directly digitised scatter plots and calculated it for some other equations. - -The smearing factor is calculated as E(exp(residuals)) or, if using a different logarithm base, E(base^(residuals)). - -*Bird and Prairie (1983)* - -The correction factor proposed by Bird and Prairie (1983) is to multiply the antilog of weight by exp(RMS/2) where RMS is the residual mean square. - -*Strimbu et al. (2018)* - -The vast majority of correction factors to deal with the log-transformation bias rely on knowing the standard deviation of the model residuals (Neyman and Scott 1960). However, when researchers compile equations from the literature, usually they do not have access to the standard deviation of the residuals. This means that the typical correction factors do not work in these cases. - -Strimbu et al. (2018) recently proposed a set of correction factors that work when access to the residuals of the model are not available. For FreshInvTraitR, we generally only have information on the equation, the range of X values that were used to fit the equation and the coefficient of determination (i.e. the r2 value). Using Strimbu et al.'s (2018) correction factors, this information tends to be sufficient. - -Specifically, Strimbu et al. (2018) propose two correction factors that we may be able to use on log-linear equations in our database: - -+ BC3 = exp( (0.5* (1 - r2)) * ( (1/log(a)) * (( log(ymax, b=a)-log(ymin, b=a) )/6))^2) - -a - base of the logarithm -ymax - maximum y value -ymin - minimum y value -r2 - coefficient of determination of the log-linear model - -The other correction factor that does not require r2 is: - -+ BC4 = exp( (1/(log(a)^2)) * (( log(ymax, b=a)-log(ymin, b=a) )^2/72) ) - -Using these correction factors, we can calculate the unbiased expected Y value as: - -+ unbiased Y(pred) = biased Y(pred) * ( BC3 | BC4 ) - -```{r} - -# let's test these correction factors - -e <- exp(1) - -# simulate some lengths -x <- runif(50, 2.3, 5.5) - -# simulate a y-variable from one of these equations -y <- 0.021*(x^2.315) + rnorm(n = 50, mean = 0, sd = 0.15) -plot(x, y) - -# remove anything less than zero -rem <- y>0 -x <- x[rem] -y <- y[rem] - -# check the relationship -plot(x, y) - -# lit a log-linear model -lm_log <- lm(log(y) ~ log(x)) -summary(lm_log) - -# get the r2 -r2 <- summary(lm_log)$r.squared -print(r2) - -# get the predictions -pred_log <- e^(predict(lm_log)) - -# plot the observed versus predicted values -plot(y, pred_log) -abline(a = 0, b = 1, col = "red") - -# try this with the correction factors -a <- e -ymax <- max(y) -ymin <- min(y) -BC <- exp( (0.5* (1 - r2)) * ( (1/(log(a))) * (( log(ymax, b=a)-log(ymin, b=a) )/6))^2) -print(BC) - -# try this with the true correction factor -BC <- exp(var(residuals(lm_log))/2) -print(BC) - -# do the bias correction and check the prediction -# pred_cor <- (10^(predict(lm_log))) * BC -pred_cor <- (e^predict(lm_log))*BC - -# plot the observed versus corrected predicted values -plot(y, pred_cor) -abline(a = 0, b = 1, col = "red") - -# calculate the mean squared error -round(mean((y - pred_cor)^2), 6) -round(mean((y - pred_log)^2), 6) - -# mean absolute error -round(mean(abs(y-pred_cor)), 6) -round(mean(abs(y-pred_log)), 6) - -# mean percentage error -mean((abs(y - pred_cor)/y)*100) -mean((abs(y - pred_log)/y)*100) -``` - - -### Standard error of the prediction - -Perhaps we can try to estimate it from some existing information but let's see how it goes: - -https://www.researchgate.net/post/Standard_error_of_estimate_not_reported_in_regression_studies_Can_you_calculate_this_without_knowing_the_residuals - -Estimating the standard deviation from the range: - -https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-14-135#Tab1 - - -### Removing bad equations - -Based on tests of the accuracy of the equations, we removed all equations with an r2 value less than 0.6. - -In addition, we removed equations with an r2 value of less than 0.75 if they had a sample size of less than or equal 20. - -An r2 of 0.75 does not sound that bad but the errors can be very large on a log-scale which these equations are typically fit on. - -We also removed the Daphnia equation from Rahkola et al. (1998), (reference 2) because it was performing very poorly on the test data. - - - -