The goal of this package is to provide functions automating time-consuming tasks in R: statistical testing & non-linear modelling/plotting. Run the following lines of code to install it on your computer:
### Usual installation ###
# Installation of the "aomisc" package (not hosted on CRAN) with the devtools::install_github function:
install.packages("devtools")
library(devtools)
install_github("OnofriAndreaPG/aomisc")
# Installation of the "automation" package and all its dependencies:
install_github("Eliot-RUIZ/automation") # Type multiple time on the Enter button to skip updates of package if asked
library(automation) # Loading
### In case of error during the installation, first run the lines below, and then re-run those above ###
# Fixing potential errors during installation of "aomisc" mentioning a version problem:
install.packages("installr")
library(installr)
updateR()
# Fixing errors mentionning the "curl" package:
install.packages("curl")
library(curl)
# Fixing errors mentionning the "mixOmics" package:
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (BiocManager::version() >= package_version('3.11')) {
  BiocManager::install("mixOmicsTeam/mixOmics")
} else {
  message('Please update to the latest Bioconductor (https://www.bioconductor.org/install/) ',
          'to install the stable GitHub version')
}
library(mixOmics)
The base R packages, as well as the numerous external packages hosted on CRAN and GitHub (around 80,000), provide an extraordinary diversity of functions implementing most statistical tests and associated operations. This powerful tool, widely used across the scientific community, allows a better understanding of the associations between variables (inter alia) in almost every specific case. However, it requires a solid knowledge of inferential statistics to be used fully and correctly. Indeed, finding the right statistical test with all its assumptions and associated computations, while working out how to run such an analysis in R, can be very tedious and complex. Even running a complete analysis requires many lines of code, which increases the risk of errors and decreases the time available for understanding the results. Finally, reporting all the results in a correct and compact format for a scientific report is also very cumbersome. For example, the official APA format requires removing the leading zero (0.1 -> .1), but only for numbers bounded between -1 and 1, reporting results with Greek letters (e.g. chi, phi, eta), and writing the p of p-value in italics (*p*)!
In my experience, doing all of that can take days of hard work when analysing a big dataset. Therefore, I decided to code a function running a complete analysis instantly and displaying the results in a format that can simply be copied and pasted into a report.
Currently, the function is fully operational for any type/number of factors (X) when the dependent variable (Y) is qualitative. I am working very hard on the quantitative part of the function, which should therefore be released soon.
To do that, the function first chooses the right analysis from around 50 different main tests, 20 different measures of association (e.g. effect size, odds ratio), and 15 different types of post-hoc analysis. I recommend checking the regularly updated decision tree (link might change) to get a better understanding of how the function works, and notably how it decides which test is the most appropriate, using the variable types and many different validity tests (around 15).
Updated link: http://www.xmind.net/m/G3nsU3
The decision tree can also be downloaded by clicking on the top right corner button and then "Download". It can then be opened in Xmind (the software must be downloaded first) to access the content of the boxes, and to see it in the text preview (a linear tree showing the content of the boxes). Each box is annotated with the associated piece of code, because I understand that checking the 3000 lines of code (not all on GitHub yet) can be very difficult, even if I made an effort on the appearance of the text (well spaced out and structured) and on the number of comments!
Using the auto_stats function is very simple: the user only has to fill the y argument with the name of the dependent variable, and possibly the x1 and x2 arguments with the names of the factors, while indicating in the data argument the name of the dataframe in which they can be found. If one ("first" or "second") or both ("both") factors (X) are repeated measures, the user must indicate it in the paired argument, the default being that the variables are independent ("none"). If the variables are paired, the user must provide the id argument with the name of the column of the dataframe containing the IDs of the subjects. In case of an entry error, I implemented multiple warning messages (displayed in red) covering most of the possible errors (I hope), to explain in detail what went wrong and rapidly rectify it.
### Current default values ###
auto_stats = function(data, y, x1 = NULL, x2 = NULL, paired = "none", id = NULL, digits = 3, apa = FALSE)
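As an illustration, here is a hypothetical call (the dataframe "survey" and its column names are invented for the example):
# Y = "Answer" (qualitative), X1 = "Treatment" (independent), X2 = "Session" (repeated measures):
auto_stats(survey, y = "Answer", x1 = "Treatment", x2 = "Session",
           paired = "second", id = "ID", digits = 3, apa = TRUE)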
After a few microseconds/seconds (depending on the analysis and the size of the dataframe), the auto_stats function will display the results in a convenient form, separated into multiple subsections varying with the type of analysis (e.g. TABLE only for qualitative Y). The MESSAGE(S) section is of major importance, since I implemented many different kinds of advice, general information, notes on the tests, etc., for more transparency.
Finally, the digits (number of decimals) and apa arguments, as their names indicate, serve to control the output. The latter only affects the MAIN TEST(S) and MEASURE(S) OF ASSOCIATION sections, since they are usually the only ones reported. For the main tests, the function will generate an "APA code" below the result, which just has to be pasted into an R Notebook (instructions below) to generate an appropriately formatted result. Since the Greek letters and mathematical symbols are characters, they can be copied/pasted into any kind of word processor. It will also round the measures of association to 3 decimals and remove the leading zero in front of numbers between -1 & 1.
Example: χ²(1, N = 5) = 0.8, p = .371, φ = .577
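For illustration only (this is not the package's internal code), here is a minimal sketch of the APA zero-stripping rule applied to a rounded number:
# Drop the leading zero only for numbers bounded between -1 and 1:
strip_leading_zero = function(x, digits = 3) {
  sub("^(-?)0\\.", "\\1.", format(round(x, digits)))
}
strip_leading_zero(0.371)   # ".371"
strip_leading_zero(-0.577)  # "-.577"
strip_leading_zero(2.667)   # "2.667" (unchanged, e.g. a chi-square value)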
While developing the code, I had to create datasets to test every single branch of the tree/code (21 in total). I stored them in the test_auto_stats function, designed to quickly provide the tests of interest. Besides being very useful when debugging the function, it can also serve to show how to use the function correctly, and notably what form the dataframe must take to perform an analysis (e.g. long format instead of wide).
This function is also intuitive to use, because you just have to specify the type of dependent variable, the number of factors, and whether they are repeated measures or not.
test_auto_stats(y = "qualitative", nb_x = 1, paired = "first")
Tests:
auto_stats(`QUALITATIVE: One paired X - Y & X1 with 2 levels`,
y = "Y", x1 = "X1", id = "ID", paired = "first")
auto_stats(`QUALITATIVE: One paired X - Y with more than 2 levels & X1 with 2 levels`,
y = "Y", x1 = "X1", id = "ID", paired = "first")
auto_stats(`QUALITATIVE: One paired X - Y with 2 levels & X1 with more than 2 levels`,
y = "Y", x1 = "X1", id = "ID", paired = "first")
auto_stats(`QUALITATIVE: One paired X - Y & X1 with more than 2 levels`,
y = "Y", x1 = "X1", id = "ID", paired = "first")
As you can see, multiple commands, each running one branch of the code in the desired section, are displayed. Meanwhile, the different datasets have been loaded into the Global Environment and can be seen in the top right corner in RStudio. Their names aim at being as explicit as possible about which branch of the tree/code they were designed to test.
Then, just copy the name of a dataset and run it to display the dataframe, or copy the entire command to get the final result:
auto_stats(`QUALITATIVE: One paired X - Y with 2 levels & X1 with more than 2 levels`,
y = "Y", x1 = "X1", id = "ID", paired = "first")
------------------------------ TABLE -----------------------------
During No Yes
After Before
No No 1 0
Yes 0 1
Yes No 0 1
Yes 0 0
---------------------- ASSUMPTION(S) TESTING ---------------------
Variable type: Qualitative dependent variable with two levels (Y)
One factor (X1) with more than two levels of repeated measures.
-------------------------- MAIN TEST(S) --------------------------
Asymptotic General Symmetry Test (N = 9): Z = 1, p-value = 0.577 (ns)
Cochran's Q test (N = 9): Q = 1, df = 2, p-value = 0.607 (ns)
------------------------ POST-HOC ANALYSIS -----------------------
Pairwise two-sample permutation symmetry tests (fdr adjustment method)
Comparison p-value
After - Before = 0 1.000
After - During = 0 0.476
Before - During = 0 0.476
------------------------------------------------------------------
Here are the results obtained when activating the apa argument on a different dataframe:
auto_stats(`QUALITATIVE: One paired X - Y with more than 2 levels & X1 with 2 levels`,
y = "Y", x1 = "X1", id = "ID", paired = "first", apa = TRUE)
------------------------------ TABLE -----------------------------
Before
After Maybe No Yes
Maybe 0 1 0
No 0 1 0
Yes 1 1 0
---------------------- ASSUMPTION(S) TESTING ---------------------
Variable type: Qualitative dependent variable with more than two levels (Y)
One factor (X1) with two levels of repeated measures.
-------------------------- MAIN TEST(S) --------------------------
Asymptotic General Symmetry Test (N = 8): Z = 1.414, p-value = 0.334 (ns)
APA code -> Asymptotic General Symmetry Test (N = 8): Z = 1.414, *p* = .334
Stuart-Maxwell Marginal Homogeneity Test (N = 8): X-squared = 2.667, df = 2, p-value = 0.264 (ns)
APA code -> Stuart-Maxwell Marginal Homogeneity Test: χ^2^(2, N = 8) = 2.667, *p* = .264
(Copy it + Paste it in an R Notebook + Click on Preview and save it + Click on Knit + Copy it anywhere else)
-------------------- MEASURE(S) OF ASSOCIATION -------------------
Cohen's g = 0.5 (Large effect)
APA code -> Cohen's *g* = .5
Odds ratio = 1 (Negligible effect)
APA code -> OR = 1
Comparison Odds-ratio Cohen's G
Maybe/Maybe : No/No Inf 0.5
Maybe/Maybe : Yes/Yes Inf 0.5
No/No : Yes/Yes Inf 0.5
------------------------ POST-HOC ANALYSIS -----------------------
Pairwise Symmetry Tests (fdr adjustment method):
Comparison p-value
Maybe/Maybe : No/No 1
Maybe/Maybe : Yes/Yes 1
No/No : Yes/Yes 1
------------------------------------------------------------------
We can also try the auto_stats function with two factors:
test_auto_stats(y = "qualitative", nb_x = 2, paired = "none") ### 7 different tests shown, I chose the 3rd one.
auto_stats(`QUALITATIVE: Two independent X - Y & X1 with 2 levels - Woolf test = NOT OK - Multiplicative model`,
y = "Y", x1 = "X1", x2 = "X2", paired = "none")
------------------------------ TABLE -----------------------------
X2 Group 1 Group 2 Group 3 Group 4 Group 5 Group 6
X1 Y
Female No 19 8 391 244 299 317
Yes 89 17 202 131 94 24
Male No 313 207 205 279 138 351
Yes 512 353 120 138 53 22
---------------------- ASSUMPTION(S) TESTING ---------------------
Variable type: Qualitative dependent variable (Y) with two levels
Two independent factors (X).
Woolf's Test (N = 4526): X-squared = 17.902, df = 5, p-value = 0.003 -> Not satisfied (p-value <= 0.05)
Note: Heterogeneity of odds ratios across levels of X2 (strata).
Likelihood Ratio Test on Logistic Regression model (N = 4526):
Deviance = 20.204, p-value = 0.001 -> Multiplicative model (significant difference)
Note: Deviance between additive (Y ~ X1 + X2) & multiplicative model (Y ~ X1 * X2)
-------------------------- MAIN TEST(S) --------------------------
Type III Likelihood Ratio Test on Logistic Regression model (N = 4526):
X1 -> X-squared = 19.054, df = 1, p-value = 0 (***)
X2 -> X-squared = 268.851, df = 5, p-value = 0 (***)
X1:X2 -> X-squared = 20.204, df = 5, p-value = 0.001 (**)
-------------------- MEASURE(S) OF ASSOCIATION -------------------
Odds ratio Magnitude
0.349 (Less odds)
0.803 (Less odds)
1.133 (Negligible effect)
0.921 (Less odds)
1.222 (Negligible effect)
0.828 (Less odds)
------------------------ POST-HOC ANALYSIS -----------------------
Groupewise Exact Fisher's Tests (fdr adjustment method):
Group p-value
Group 1 0.0001
Group 2 0.6770
Group 3 0.6770
Group 4 0.6770
Group 5 0.6770
Group 6 0.6770
------------------------------------------------------------------
Finally, we can increase the number of digits. Notice the message at the end of the analysis:
auto_stats(`QUALITATIVE: Two independent X - Y & X1 with 2 levels - Woolf test = OK`,
y = "Y", x1 = "X1", x2 = "X2", paired = "none", digits = 5)
------------------------------ TABLE -----------------------------
X2 Adult female Adult male Teenage boy Teenage girl
X1 Y
Group 1 No 1 1 1 1
Yes 0 0 0 0
Group 2 No 1 0 0 1
Yes 0 1 1 0
---------------------- ASSUMPTION(S) TESTING ---------------------
Variable type: Qualitative dependent variable (Y) with two levels
Two independent factors (X) and X1 has two levels.
Woolf's Test (N = 8): X-squared = 0.90521, df = 3, p-value = 0.82417 -> Satisfied (p-value > 0.05)
Note: Homogeneity of odds ratios across levels of X2 (strata).
-------------------------- MAIN TEST(S) --------------------------
Exact conditional test of independence (N = 8): S = 4, p-value = 0.5 (ns)
-------------------- MEASURE(S) OF ASSOCIATION -------------------
Odds ratio (Mantel-Haenszel estimate) = Inf 95% CI [0.18781, Inf]
Odds ratio Magnitude
1 (Negligible effect)
9 (Large effect)
9 (Large effect)
1 (Negligible effect)
------------------------ POST-HOC ANALYSIS -----------------------
Groupewise Exact Fisher's Tests (fdr adjustment method):
Group p-value
Adult female 1
Adult male 1
Teenage boy 1
Teenage girl 1
-------------------------- MESSAGE(S) ---------------------------
This test has been designed to know if there is an association between the 1st factor (X1)
and the dependant variable (Y). As the 2nd factor (X2) is used for adjustment, you should
change the position in the formula if this is the variable of main interest.
------------------------------------------------------------------
Finally, here is a table presenting the conventional rules of thumb used in *auto_stats* for interpreting the measures of association (⚠ interpretation varies with the field of study):
Measure of association | Precision | No effect | Negligible effect | Small effect | Medium effect | Large effect |
---|---|---|---|---|---|---|
Cohen's h | ø | h = 0 | h < 0.2 | 0.2 ≤ h < 0.5 | 0.5 ≤ h < 0.8 | h ≥ 0.8 |
Phi coefficient | ø | φ = 0 | φ < 0.1 | 0.1 ≤ φ < 0.3 | 0.3 ≤ φ < 0.5 | φ ≥ 0.5 |
Cramer's V | df = 2 | V = 0 | V < 0.07 | 0.07 ≤ V < 0.2 | 0.2 ≤ V < 0.35 | V ≥ 0.35 |
Cramer's V | df = 3 | V = 0 | V < 0.06 | 0.06 ≤ V < 0.17 | 0.17 ≤ V < 0.29 | V ≥ 0.29 |
Cramer's V | df = 4 | V = 0 | V < 0.05 | 0.05 ≤ V < 0.15 | 0.15 ≤ V < 0.25 | V ≥ 0.25 |
Cramer's V | df ≥ 5 (Warning if df > 5) | V = 0 | V < 0.05 | 0.05 ≤ V < 0.13 | 0.13 ≤ V < 0.22 | V ≥ 0.22 |
Cohen's g | ø | g = 0 | g < 0.05 | 0.05 ≤ g < 0.15 | 0.15 ≤ g < 0.25 | g ≥ 0.25 |
Odds ratio | Qualitative Y - Paired - N x N contingency tables | OR = 1 (OR < 1 -> "Less Odds") | OR < 1.22 | 1.22 ≤ OR < 1.86 | 1.86 ≤ OR < 3 | OR ≥ 3 |
Odds ratio | Qualitative Y - Independent | OR = 1 (OR < 1 -> "Less Odds") | OR < 1.55 | 1.55 ≤ OR < 2.8 | 2.8 ≤ OR < 5 | OR ≥ 5 |
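To make the lookup concrete, here is a minimal sketch (not the package's internal code) of how the Cramer's V rows above could be applied; df > 5 falls back to the df ≥ 5 thresholds:
interpret_cramers_v = function(v, df) {
  # Assumes df >= 2 (df = 1 corresponds to the phi coefficient row instead)
  cuts = list(`2` = c(0.07, 0.20, 0.35),
              `3` = c(0.06, 0.17, 0.29),
              `4` = c(0.05, 0.15, 0.25),
              `5` = c(0.05, 0.13, 0.22))[[as.character(min(df, 5))]]
  if (v == 0) "No effect"
  else if (v < cuts[1]) "Negligible effect"
  else if (v < cuts[2]) "Small effect"
  else if (v < cuts[3]) "Medium effect"
  else "Large effect"
}
interpret_cramers_v(0.25, df = 2)  # "Medium effect"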
Automation of the comparison of non-linear models using different estimators & automatic plotting of the non-linear models of interest: compare_nlm
In my experience, finding the best models and the best parameters for non-linear models in R can be very cumbersome. Indeed, few self-starters (i.e. functions that automatically determine "good" starting parameters for each model) are implemented in base R. It takes a long time to find a package offering the appropriate functions, even for simple models such as power laws. When you're not an expert in mathematics, even thinking of a model that could fit the data well enough can be complicated.
Searching for good starting parameters by trial and error is even more tedious, since even with values within a tenth of the optimal ones, the nls function (base R) sometimes doesn't converge after 50,000 iterations! Besides, once the right parameters have been found, it still takes a long time to calculate the model quality estimators and the confidence intervals around the coefficients. Finally, making a correct graph representing the model can be very hard in R.
To ease this procedure, I coded the function compare_nlm (see the commented code in French in the "R" file of this page) to compare many different models, compute their estimators, and finally plot them, for a dataset consisting of one quantitative dependent variable (Y) and one quantitative factor (X).
### Default values ###
compare_nlm = function(formula, data, digits = 2, arrange = c("AIC", "RMSE", "BIC"), increase = T, plot_model = NA, package = F)
After 4 to 5 s of computation (on a usual laptop), the function displays a table containing the best coefficients (1 to 7) of 112 non-linear and 2 linear models. The self-starters for all those models come from the packages "drc" and "aomisc", and compare_nlm internally uses the drm function (package "drc") to run the computation.
Sometimes, the convergence of a self-starter might fail (model not appropriate for the data), but the function will skip to the next model and print a message with the ID of the failing model (found at the end of the table).
compare_nlm(mpg ~ wt, mtcars)
Error in optim(startVec, opfct, hessian = TRUE, method = optMethod, control = list(maxit = maxIt, :
non-finite finite-difference value [2]
[1] "Convergence failed - Self-Starter ID : 59"
Error in optim(startVec, opfct, hessian = TRUE, method = optMethod, control = list(maxit = maxIt, :
non-finite finite-difference value [1]
[1] "Convergence failed - Self-Starter ID : 95"
The user can choose to order the table (arrange argument) according to the AIC, the BIC, the RMSE, and/or the number of coefficients.
The order of the estimators reflects their importance for the user: the dataframe is first ordered using the 1st estimator; then, for equal values of the 1st estimator, it uses the 2nd estimator, etc.
The models with the lowest estimators (i.e. the best models) are then placed at the top of the dataframe (Rank 1, Rank 2, etc.). However, the dataframe can also be sorted in descending order by setting the increase argument to FALSE. The sketch below reproduces this multi-key ordering.
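A minimal illustration of the multi-key sort with base R order(), on invented values:
df = data.frame(AIC = c(160.7, 160.6, 160.7), RMSE = c(2.5, 2.5, 2.4))
df[order(df$AIC, df$RMSE), ]    # ascending: ties on AIC are broken by RMSE (increase = TRUE)
df[order(-df$AIC, -df$RMSE), ]  # descending (increase = FALSE)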
Here is an example obtained with the mtcars dataset, easily exported using the kable function of the "knitr" package:
result = compare_nlm(mpg ~ wt, mtcars, digits = 1, increase = F, arrange = c("Nb_coeffs", "AIC"))
kable(head(result, 15), row.names = F)
Rank | ID | Function | RMSE | AIC | BIC | Nb_coeffs | Coeff_1 | Coeff_2 | Coeff_3 | Coeff_4 | Coeff_5 | Coeff_6 | Coeff_7 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 65 | twophase | 2.5 | 165.6 | 177.3 | 7 coeffs | b1: 8 | c1: 9.3 | d1: 15.8 | e1: 0.3 | b2: 3.1 | d2: 26.3 | e2: 2.7 |
2 | 15 | CRS.6 | 2.4 | 161.9 | 172.2 | 6 coeffs | b: 1.5 | c: 5.2 | d: 87.9 | e: 1.1 | f: -394.3 | g: -2.7 | |
3 | 61 | multi2 | 5.9 | 216.8 | 225.5 | 5 coeffs | b1: 1.6 | b2: 55.6 | b3: 46.4 | c: 17.7 | d: 20.1 | ||
4 | 93 | weibull2x | 5.3 | 209.6 | 218.4 | 5 coeffs | b: -5.1 | c: 20.9 | d: 11.8 | e: 4.4 | t0: 4.5 | ||
5 | 3 | baro5 | 2.5 | 161.5 | 170.2 | 5 coeffs | b1: 4 | b2: 2.3 | c: 7.8 | d: 34.1 | e: 2.9 | ||
6 | 13 | CRS.5b | 2.5 | 160.7 | 169.5 | 5 coeffs | b: 2.8 | c: 8.9 | d: -343.6 | e: 1.1 | f: 955.3 | ||
7 | 14 | CRS.5c | 2.5 | 160.7 | 169.5 | 5 coeffs | b: 2.8 | c: 8.7 | d: -826.6 | e: 1.1 | f: 2259.1 | ||
8 | 57 | ml4b | 2.5 | 160.7 | 169.5 | 5 coeffs | b: 2.8 | c: 8.9 | d: -343.6 | e: 1.1 | f: 955.3 | ||
9 | 58 | ml4c | 2.5 | 160.7 | 169.5 | 5 coeffs | b: 2.8 | c: 8.7 | d: -826.6 | e: 1.1 | f: 2259.1 | ||
10 | 5 | BC.5 | 2.5 | 160.6 | 169.4 | 5 coeffs | b: 3.3 | c: 8.5 | d: -163.1 | e: 1.1 | f: 170.6 | ||
11 | 7 | bcl4 | 2.5 | 160.6 | 169.4 | 5 coeffs | b: 3.3 | c: 8.5 | d: -163.1 | e: 1.1 | f: 170.6 | ||
12 | 8 | braincousens | 2.5 | 160.6 | 169.4 | 5 coeffs | b: 3.3 | c: 8.5 | d: -163.1 | e: 1.1 | f: 170.6 | ||
13 | 12 | CRS.5a | 2.5 | 160.6 | 169.4 | 5 coeffs | b: 2.6 | c: 8.7 | d: -294.2 | e: 0.9 | f: 792.2 | ||
14 | 56 | ml4a | 2.5 | 160.6 | 169.4 | 5 coeffs | b: 2.6 | c: 8.7 | d: -294.2 | e: 0.9 | f: 792.2 | ||
+99 other lines |
In R, the results are printed in a convenient way to allow maximum visibility for rapidly detecting the best model. They can also be displayed using the View function:
View(compare_nlm(mpg ~ wt, mtcars)) # Easier to read and to search a model
compare_nlm(mpg ~ wt, mtcars) # Compact format as you can see below
Rank ID Function AIC BIC RMSE Nb_coeffs Coeff_1 Coeff_2 Coeff_3 Coeff_4 Coeff_5 Coeff_6 Coeff_7
===== ==== ================= ======= ======= ====== ========== ============ =========== ============== =========== =========== ========== =========
1 71 UCRS.5c 156.46 165.26 2.31 5 coeffs b: -274.58 c: 93.75 d: 30.07 e: 2.26 f: 158.28
2 77 uml4c 156.46 165.26 2.31 5 coeffs b: -274.58 c: 93.75 d: 30.07 e: 2.26 f: 158.28
3 70 UCRS.5b 156.55 165.34 2.32 5 coeffs b: -249.34 c: 69.84 d: 30.07 e: 2.26 f: 89.31
4 76 uml4b 156.55 165.34 2.32 5 coeffs b: -249.34 c: 69.84 d: 30.07 e: 2.26 f: 89.31
5 69 UCRS.5a 157.11 165.91 2.34 5 coeffs b: -201.18 c: 66.46 d: 30.07 e: 2.26 f: 65.33
6 75 uml4a 157.11 165.91 2.34 5 coeffs b: -201.18 c: 66.46 d: 30.07 e: 2.26 f: 65.33
7 16 EXD.2 157.55 161.95 2.58 2 coeffs d: 49.66 e: 3.41
8 99 DRC.expoDecay 157.55 161.95 2.58 2 coeffs init: 49.66 k: 0.29
9 100 DRC.expoGrowth 157.55 161.95 2.58 2 coeffs init: 49.66 k: -0.29
10 103 DRC.logCurve 157.59 161.98 2.58 2 coeffs a: 39.26 b: -17.09
11 88 W2x.4 157.74 165.07 2.43 4 coeffs c: 30.87 d: 9.57 e: 1.65 t0: 1.83
12 107 DRC.poly2 158.05 163.91 2.52 3 coeffs a: 49.93 b: -13.38 c: 1.17
13 84 W2.3 158.17 164.03 2.53 3 coeffs b: -1.38 d: 37.03 e: 2.54
14 17 EXD.3 158.39 164.25 2.54 3 coeffs c: 6.78 d: 57.33 e: 2.27
15 94 DRC.asymReg 158.39 164.25 2.54 3 coeffs init: 57.34 m: 0.44 plateau: 6.78
16 109 DRC.SSasymp 158.39 164.25 2.54 3 coeffs Asym: 6.78 R0: 57.34 lrc: -0.82
17 30 l3 158.47 164.33 2.54 3 coeffs b: 1.51 d: 48.49 e: 2.42
18 36 LL.3 158.47 164.33 2.54 3 coeffs b: 1.51 d: 48.49 e: 2.42
19 41 LL2.3 158.47 164.33 2.54 3 coeffs b: 1.51 d: 48.5 e: 0.88
20 104 DRC.lorentz.3 158.61 164.48 2.55 3 coeffs b: 0.13 d: 102.25 e: -2.5
[ reached 'max' / getOption("max.print") -- omitted 93 rows ]
The confidence interval around the coefficients of the model of interest can easily be calculated using the confint.drc function in the "drc" package:
model = drm(mpg ~ wt, data = mtcars, fct = UCRS.5c())
confint(model, level = 0.95)
2.5 % 97.5 %
b:(Intercept) -2226.212509 1677.054628
c:(Intercept) 67.247223 120.259724
d:(Intercept) 27.957770 32.175538
e:(Intercept) 1.840954 2.688259
f:(Intercept) 103.096202 213.467964
The plot_model argument automatically plots 1 to 9 models with their RMSE when given their IDs, the partition of the graphical window being automatically adapted to the number of plots:
compare_nlm(mpg ~ wt, mtcars, plot_model = c(75,23,34,88,44,46))
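The adaptive partition can be sketched as follows (assumed logic, not compare_nlm's exact code):
n = 6                       # number of model IDs passed to plot_model (1 to 9)
rows = ceiling(sqrt(n))     # 6 plots -> 3 rows
cols = ceiling(n / rows)    #         -> 2 columns
par(mfrow = c(rows, cols))  # split the graphical window accordingly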
Information on a non-linear model (e.g. formula, coefficients) can be accessed directly for the 93 models of the package "drc" by typing its name preceded by a question mark:
?UCRS.5c
The name of the package containing each self-starter appears in front of the name of the model if the package argument is switched to TRUE.
compare_nlm(mpg ~ wt, mtcars, package = T)
# Part of the dataset obtained #
7 16 drc::EXD.2 157.55 161.95 2.58 2 coeffs d: 49.66 e: 3.41
8 99 aomisc::DRC.expoDecay 157.55 161.95 2.58 2 coeffs init: 49.66 k: 0.29
The names of the 21 remaining models (in the package "aomisc") are displayed below, and more information on them can be obtained by following this link1 and this link2:
In case of error, different messages will be printed to explain to the user what the problem was (e.g. a syntax error in the estimators, an error in the number of digits or in the IDs of the models).
The package "drc" already provides methods to calculate the confidence interval around the non-linear models created with its Self-Starters (predict.drc). However, this feature is not available for the model created from "aomisc" self-starters, though some of them are often the best models (e.g. linear, exponential, logarithmic). In addition, the predict.drc function only works with 72 Self-Starters, according to my own tests (see the Appendix at the end for the list).
The function ci_nlm was coded to generalise CI calculations to all models used in compare_nlm, but also to allow the user to plot the predictions.
### Default values ###
ci_nlm = function(formula, fct, data, method = "delta", level = 0.05, nb_boot = 200, expand_x = NA, keep_cols = NULL)
The predict.drc function computes the confidence interval using the Delta method, which is very fast. However, as its calculation is very complicated, I decided to use the equivalent Bootstrap method (Bertail, Boizot & Combris, 2003) for the moment, which is much simpler to code but slower to compute.
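Here is a minimal sketch of the Bootstrap approach (assumed logic, not ci_nlm's exact internals): the model is refitted on resampled rows, and quantiles of the predictions are taken at each observed X.
library(drc)
set.seed(42)
boot_preds = replicate(200, {
  resampled = mtcars[sample(nrow(mtcars), replace = TRUE), ]  # resample rows with replacement
  fit = try(drm(mpg ~ wt, data = resampled, fct = EXD.2()), silent = TRUE)
  if (inherits(fit, "try-error")) rep(NA_real_, nrow(mtcars))
  else as.numeric(predict(fit, newdata = mtcars))
})
ci = apply(boot_preds, 1, quantile, probs = c(0.025, 0.975), na.rm = TRUE)  # 95% CI per point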
A warning message appears if an "aomisc" self-starter is used with the Delta method, advising to switch method to "boot". Beforehand, it is really easy to check which package a self-starter comes from by switching package = FALSE to package = TRUE in the compare_nlm function (see above).
The latency varies with the self-starter used, the number of coefficients, and the number of iterations. However, it lies between 2 and 4 s for "aomisc" self-starters with the default value of 200 iterations. Since the drm function calculates many supplementary things for "drc" self-starters, the computation generally takes much longer.
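You can check these timings on your own machine with system.time (exact durations depend on your hardware):
system.time(ci_nlm(mpg ~ wt, fct = DRC.expoDecay(), data = mtcars, method = "boot"))  # "aomisc" self-starter
system.time(ci_nlm(mpg ~ wt, fct = EXD.2(), data = mtcars, method = "boot"))          # "drc" self-starter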
### DRC self-starter with the Delta method -> immediate result ###
result = ci_nlm(mpg ~ wt, fct = gaussian(), data = mtcars)
head(result)
Predictions Lower_CI Upper_CI
1 22.39356 20.79063 23.99648
2 20.68509 19.36893 22.00126
3 24.94405 22.85991 27.02818
4 18.81245 17.61229 20.01262
5 17.75349 16.52585 18.98113
6 17.66496 16.43295 18.89697
### DRC self-starter with the Bootstrap method (default = 200 iterations) -> 10s ###
result = ci_nlm(mpg ~ wt, fct = gaussian(), data = mtcars, method = "boot")
head(result)
Predictions Lower_CI Upper_CI
1 22.39356 21.30726 23.78245
2 20.68509 19.65370 21.97412
3 24.94405 23.84398 26.62455
4 18.81245 17.87121 19.96312
5 17.75349 16.74902 18.81757
6 17.66496 16.64668 18.71966
### AOMISC self-starter with the Bootstrap method (default = 200 iterations) -> 2s ###
result = ci_nlm(mpg ~ wt, fct = DRC.expoDecay(), data = mtcars, method = "boot")
head(result)
Predictions Lower_CI Upper_CI
1 23.01207 22.19429 24.00978
2 21.35223 20.59005 22.25067
3 25.13077 24.03569 26.36981
4 19.32383 18.49125 20.24617
5 18.08863 17.25334 19.06617
6 17.98273 17.14739 18.96891
The keep_cols argument is very useful for plotting, since it allows adding any other columns of the initial dataframe to the new one:
ci_nlm(mpg~ wt, fct = gaussian(), data = mtcars, keep_cols = c("all")) # To keep all the columns of the initial dataframe
result = ci_nlm(mpg~ wt, fct = gaussian(), data = mtcars, keep_cols = c("mpg", "wt")) # To keep only the columns of interest
head(result)
mpg wt Predictions Lower_CI Upper_CI
1 21.0 2.620 22.39356 20.79063 23.99648
2 21.0 2.875 20.68509 19.36893 22.00126
3 22.8 2.320 24.94405 22.85991 27.02818
4 21.4 3.215 18.81245 17.61229 20.01262
5 18.7 3.440 17.75349 16.52585 18.98113
6 18.1 3.460 17.66496 16.43295 18.89697
Plotting the model and its confidence interval is then really easy:
### Plotting in Base R ###
data = result[order(result$wt), ] # Dataframe must be ordered by X for plotting in Base R
plot(data$wt, data$mpg)
lines(data$wt, data$Predictions, lwd = 2)
lines(data$wt, data$Lower_CI, lty = 2)
lines(data$wt, data$Upper_CI, lty = 2)
### Plotting with the "ggplot" package ###
ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_point(size = 3) +
geom_ribbon(data = result, aes(x = wt, ymin = Lower_CI, ymax = Upper_CI), alpha = 0.5, fill = "grey") +
geom_line(data = result, aes(x = wt, y = Predictions), size = 1.2) +
theme_minimal()
Another argument allows the user to make predictions using the model. The user only has to fill the expand_x argument with two new limits of X incorporating the previous ones.
To ensure the accuracy of the confidence intervals, the data related to the previous X values is kept as is. The new X values are computed to fill the new bounds while being spaced by the same mean step as the previous values of X, so as not to artificially inflate the "density" of the measurements, which would shrink the confidence interval around the predictions.
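The expansion step can be sketched as follows (assumed logic, not ci_nlm's exact internals):
x = sort(mtcars$wt)
step = mean(diff(x))                             # mean spacing of the observed X
lower = rev(seq(min(x) - step, -6, by = -step))  # new points down to the lower bound
upper = seq(max(x) + step, 10, by = step)        # new points up to the upper bound
new_x = c(lower, x, upper)                       # original X values kept as is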
result = ci_nlm(mpg ~ wt, fct = gaussian(), data = mtcars, expand_x = c(-6,10))
ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_point(size = 3) +
geom_ribbon(data = result, aes(x = wt, ymin = Lower_CI, ymax = Upper_CI), alpha = 0.5, fill = "grey") +
geom_line(data = result, aes(x = wt, y = Predictions), size = 1.2) +
theme_minimal()
If keep_cols is activated, the selected columns are repeated as many times as necessary to fit the number of rows of the new dataframe. This is very useful when facetting in ggplot, for example.
result = ci_nlm(mpg~ wt, fct = gaussian(), data = mtcars, keep_cols = c("mpg","wt"), expand_x = c(-6,10))
tail(result)
mpg wt New_X Predictions Lower_CI Upper_CI
331 15.0 3.570 9.335000 5.679318 -9.279771 20.63841
341 21.4 2.780 9.457219 5.554060 -9.839035 20.94715
351 21.0 2.620 9.579438 5.431132 -10.397176 21.25944
361 21.0 2.875 9.701656 5.310468 -10.954134 21.57507
371 22.8 2.320 9.823875 5.192001 -11.509855 21.89386
381 21.4 3.215 9.946094 5.075669 -12.064286 22.21562
If this is a problem in your case, you can just run the function below: it replaces the repeated rows with NA in the selected columns.
replace_NA = function(prev_df, new_df, names) {
  # Rows beyond the original dataframe are the recycled ones: overwrite them with NA
  new_df[(nrow(prev_df) + 1) : nrow(new_df), names] = NA
  new_df
}
result_without_NA = replace_NA(mtcars, result, names = c("mpg", "wt"))
tail(result_without_NA)
mpg wt New_X Predictions Lower_CI Upper_CI
331 NA NA 9.335000 5.679318 -9.279771 20.63841
341 NA NA 9.457219 5.554060 -9.839035 20.94715
351 NA NA 9.579438 5.431132 -10.397176 21.25944
361 NA NA 9.701656 5.310468 -10.954134 21.57507
371 NA NA 9.823875 5.192001 -11.509855 21.89386
381 NA NA 9.946094 5.075669 -12.064286 22.21562
The performance of the Delta and Bootstrap methods is very similar when the number of iterations is sufficient (in general between 100 and 200). However, this is not the case when the X values are expanded: the CI generated by the Bootstrap method follows the fitted line, while it widens when approaching the limits with the Delta method. This could be due to the difference between the methods or to complementary calculations implemented in the "drc" package, but I still have not found the answer to this question and I would be glad if someone could share it with me. I hope it will be answered when I implement the Delta method for the "aomisc" functions.
I believe the Delta method is more accurate in this case, since it seems logical that the CI should widen when moving away from the measured points. A message pops up if the Bootstrap method is used with the expand_x argument to warn the user about that.
Names of the 21 self-starters of the "drc" package not working with predict.drc, and therefore requiring their CI to be calculated with the Bootstrap method:
baro5, CRS.6, NEC.2, NEC.3, NEC.4, twophase, UCRS.4a, UCRS.4b, UCRS.4c, UCRS.5a, UCRS.5b, UCRS.5c, uml3a, uml3b, uml3c, uml4a, uml4b, uml4c, W2x.3, W2x.4 & weibull2x.