The SEXIT (Sequential Effect eXistence and sIgnificance Test) framework #237
Replies: 41 comments
-
Note to future me: #339 made me think that it would be useful to have convenience functions (could be smart wrappers) for simple and straightforward tests (correlations, t-tests, and that kind of jazz).
-
Once the threshold for non-significance (i.e., the ROPE) and the one for a "large" or "moderate" effect are explicitly defined, the SEXIT framework does not make any interpretation: it does not label the effects, but just gives the three sequential probabilities as-is, along with a description of the posterior. It provides a lot of information about the posterior distribution (the probabilities of its different sections) in a clear and meaningful way. Examples of potential formulations:

Which one is the best?
-
I like this very much! The first one makes the most sense to me, as saying "the effect is large (0.02%)" seems confusing.
-
True, but it is also the longest :( I'm thinking about how it would fit into a manuscript where you have to report a handful of parameters for a dozen models ^^, so being compact is a plus.
-
Might these be good compromises? The information is nicely grouped and presented in a relatively concise way.
-
Then maybe: "There is a 99% probability that the effect of X is positive (median, 95% CI, Prsig = 0.97, Prlarge = 0.35)"
-
Or this one?
-
But then I have the feeling that the practical-significance and large-effect probabilities get kind of lost in the parentheses among the other indices.
-
Do you think it would be relevant to add a BF somewhere in there, or would it be redundant?
-
Though this one wouldn't be bad either: it puts the base information first and then the SEXIT stuff.
And in the case of models with few parameters (like a t-test or a correlation, etc.), one could just insert the thresholds:
And/or we could think about making the "the effect" part explicit:
*(In the case of non-standardized data, to give an idea of what it represents in terms of variance.) This could make reports of interactions clearer, for instance for X1:X2:
-
I like these 👆
Not redundant: a BF can definitely be added alongside the p-ROPE, as part of the "existence" testing, if priors have been specified. Just to complicate things more, you can also have multiple BFs corresponding to the multiple ps: 😅

library(rstanarm)
#> Loading required package: Rcpp
#> This is rstanarm version 2.21.1
#> - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
#> - Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
#> - For execution on a local, multicore CPU with excess RAM we recommend calling
#> options(mc.cores = parallel::detectCores())
library(bayestestR)
mtcars_Z <- effectsize::standardize(mtcars)
m <- stan_glm(mpg ~ cyl + am,
family = gaussian(),
data = mtcars_Z,
prior = normal(0, 1, 1),
refresh = 0)
m_prior <- unupdate(m)
#> Sampling priors, please wait...
# point
(b <- bayesfactor_parameters(m, m_prior, null = 0))
#> Loading required namespace: logspline
#> # Bayes Factor (Savage-Dickey density ratio)
#>
#> Parameter | BF
#> ----------------------
#> (Intercept) | 0.037
#> cyl | 7446.539
#> am | 0.791
#>
#> * Evidence Against The Null: [0]
plot(b) + ggplot2::coord_cartesian(xlim = c(-2, 2))

# rope range
(b <- bayesfactor_parameters(m, m_prior, null = c(-0.1, 0.1)))
#> # Bayes Factor (Null-Interval)
#>
#> Parameter | BF
#> ----------------------
#> (Intercept) | 0.013
#> cyl | 5081.259
#> am | 0.513
#>
#> * Evidence Against The Null: [-0.1, 0.1]
plot(b) + ggplot2::coord_cartesian(xlim = c(-2, 2))

# medium-large slope
(b <- bayesfactor_parameters(m, m_prior, null = c(-0.5, 0.5)))
#> # Bayes Factor (Null-Interval)
#>
#> Parameter | BF
#> -----------------------
#> (Intercept) | 2.810e-06
#> cyl | 32.139
#> am | 0.005
#>
#> * Evidence Against The Null: [-0.5, 0.5]
plot(b) + ggplot2::coord_cartesian(xlim = c(-2, 2))

Created on 2020-10-14 by the reprex package (v0.3.0)
-
1. In theory yes, as the essence of SEXIT is the sequential focus: the first thing to demonstrate is the direction; once this is clear, one can focus on significance/non-negligibility; and once that is clear, one can focus on whether it is big. So technically yes, if pd > 99.9%, one could omit it and move on to non-negligibility. That said, I think it's just clearer if the standards for reporting are consistent, i.e., if the reported indices are the same and in the same order. And it doesn't make it particularly heavy on the eyes IMO to have something like:
(i.e., to report the indices even if they are super high). After all, we do report p-values no matter what their value is. However, it could work not to report significance/strength when the direction is not certain enough. But then again, I think it's easier if people stick to reporting the same thing in the same way.

2. So you prefer something like:
Why not, but the repetition of "of probability that it is ..." bothers me a bit ^^ And TBH I reckon one could get used to a SEXIT reading very quickly, i.e., naturally and automatically checking the first probability, then moving on to the second if > 99.9%, then eventually moving on to the third. In the end, it's quite light on attention and memory, as only one piece of information (one value) is considered at a time.

3. We had this discussion somewhere, but yes, IMO it is important to state what the parameter corresponds to relative to its own variance, as this impacts the coefficient's scale, and therefore the p-sig and p-large.

sexit <- function(x, negligibility = 0.05, strong = 0.3) {
  if (median(x) < 0) {
    x <- -1 * x
    direction <- "negative"
  } else {
    direction <- "positive"
  }
  n <- length(x)
  pd <- insight::format_value(length(x[x > 0]) / n, as_percent = TRUE)
  psig <- insight::format_value(length(x[x > negligibility]) / n, as_percent = TRUE)
  pstrong <- insight::format_value(length(x[x > strong]) / n, as_percent = TRUE)
  paste0(pd, ", ", psig, ", ", pstrong, " of being ", direction, ", significant and strong")
}
df <- iris
x1 <- insight::get_parameters(rstanarm::stan_glm(Sepal.Length ~ Sepal.Width, data=df, refresh=0, iter=10000))[[2]]
sexit(x1)
#> [1] "92.15%, 86.52%, 31.13% of being negative, significant and strong"
df$Sepal.Width2 <- df$Sepal.Width / 100
x2 <- insight::get_parameters(rstanarm::stan_glm(Sepal.Length ~ Sepal.Width2, data=df, refresh=0, iter=10000))[[2]]
sexit(x2)
#> [1] "92.36%, 92.31%, 92.07% of being negative, significant and strong"

Created on 2020-10-15 by the reprex package (v0.3.0)

Another option is to standardize the significance and importance thresholds for each parameter by the scale of the variable it refers to, but then we fall back on the problem of standardized standardization, which requires full access to and knowledge of the parameter's type. By the time we have that, I think it would be easier to just add the information. Though this is more of a report issue than a SEXIT problem per se (as it applies to all regression models).
-
Maybe this can be trimmed down to (with the values up front):
-
I understand your intention, but it is still somewhat misleading. Assume that the standardized X2 is 1 (or -1); then the following sentences are all correct:

An increase of 1 in X2 (0.4 SD) has a 99%, 97% and 35% probability of having a positive, significant and large change (median, 95% CI) on the effect of X1.
An increase of 3.5 in X2 (0.4 SD) has a 99%, 97% and 35% probability of having a positive, significant and large change (median, 95% CI) on the effect of X1.
A decrease of 2 in X2 (0.4 SD) has a 99%, 97% and 35% probability of having a negative, significant and large change (median, 95% CI) on the effect of X1.

The "1" in this case is arbitrary, and therefore somewhat misleading. It implies that "1" is significant, large, etc., but what about 2, 3, 4, etc.?
-
tada 🎉 (this will facilitate my life in report: easystats/report#103)

library(rstanarm)
library(bayestestR)
#> Note: The default CI width might change in future versions (see https://github.com/easystats/bayestestR/issues/250).
#> To prevent any issues, please set it explicitly when using bayestestR functions, via the 'ci' argument.
model <- rstanarm::stan_glm(mpg ~ wt * cyl,
data = mtcars,
iter = 800, refresh = 0
)
s <- sexit(model)
s
#> # The thresholds beyond which the effect is considered as significant (i.e., non-negligible) and large are 0.30 and 1.81 (corresponding respectively to 0.05 and 0.30 of the outcome's SD).
#>
#> (Intercept) (Median = 52.98, 95% CI [41.89, 65.71]) has a 100.00% probability of being positive (> 0), 100.00% of being significant (> 0.30), and 100.00% of being large (> 1.81)
#> - wt (Median = -8.14, 95% CI [-13.23, -3.93]) has a 99.88% probability of being negative (< 0), 99.81% of being significant (< -0.30), and 99.31% of being large (< -1.81)
#> - cyl (Median = -3.60, 95% CI [-5.49, -1.70]) has a 100.00% probability of being negative (< 0), 99.88% of being significant (< -0.30), and 96.94% of being large (< -1.81)
#> - wt:cyl (Median = 0.74, 95% CI [0.14, 1.44]) has a 98.38% probability of being positive (> 0), 92.12% of being significant (> 0.30), and 0.00% of being large (> 1.81)
#>
#> Parameter   | Median |          95% CI | Existence (0) | Significance (0.30) | Large (1.81)
#> --------------------------------------------------------------------------------------------
#> (Intercept) |  52.98 |  [41.89, 65.71] |          1.00 |                1.00 |         1.00
#> wt          |  -8.14 | [-13.23, -3.93] |          1.00 |                1.00 |         0.99
#> cyl         |  -3.60 |  [-5.49, -1.70] |          1.00 |                1.00 |         0.97
#> wt:cyl      |   0.74 |   [0.14,  1.44] |          0.98 |                0.92 |         0.00
print(s, summary=TRUE)
#> # The thresholds beyond which the effect is considered as significant (i.e., non-negligible) and large are 0.30 and 1.81 (corresponding respectively to 0.05 and 0.30 of the outcome's SD).
#>
#> (Intercept) (Median = 52.98, 95% CI [41.89, 65.71]) has 100.00%, 100.00% and 100.00% probability of being positive (> 0), significant (> 0.30) and large (> 1.81)
#> - wt (Median = -8.14, 95% CI [-13.23, -3.93]) has 99.88%, 99.81% and 99.31% probability of being negative (< 0), significant (< -0.30) and large (< -1.81)
#> - cyl (Median = -3.60, 95% CI [-5.49, -1.70]) has 100.00%, 99.88% and 96.94% probability of being negative (< 0), significant (< -0.30) and large (< -1.81)
#> - wt:cyl (Median = 0.74, 95% CI [0.14, 1.44]) has 98.38%, 92.12% and 0.00% probability of being positive (> 0), significant (> 0.30) and large (> 1.81)

Created on 2020-10-26 by the reprex package (v0.3.0)

What do you think?
-
Looks good!
-
Shouldn't the values for significant and large be
-
I can add that.
Yeah, I thought about that. For the column names? Like
-
library(rstanarm)
library(bayestestR)
#> Note: The default CI width might change in future versions (see https://github.com/easystats/bayestestR/issues/250).
#> To prevent any issues, please set it explicitly when using bayestestR functions, via the 'ci' argument.
model <- rstanarm::stan_glm(mpg ~ wt * cyl,
data = mtcars,
iter = 800, refresh = 0
)
s <- sexit(model)
s
#> # Following the SEXIT framework, we report the median of the posterior distribution and its 95% CI (Highest Density Interval), along with the probability of direction (pd), the probability of significance and the probability of being large. The thresholds beyond which the effect is considered as significant (i.e., non-negligible) and large are 0.30 and 1.81 (corresponding respectively to 0.05 and 0.30 of the outcome's SD).
#>
#> (Intercept) (Median = 52.72, 95% CI [42.22, 64.87]) has a 100.00% probability of being positive (> 0), 100.00% of being significant (> 0.30), and 100.00% of being large (> 1.81)
#> - wt (Median = -8.16, 95% CI [-12.39, -3.76]) has a 99.94% probability of being negative (< 0), 99.94% of being significant (< -0.30), and 99.81% of being large (< -1.81)
#> - cyl (Median = -3.58, 95% CI [-5.40, -1.72]) has a 100.00% probability of being negative (< 0), 100.00% of being significant (< -0.30), and 96.75% of being large (< -1.81)
#> - wt:cyl (Median = 0.73, 95% CI [0.21, 1.42]) has a 99.25% probability of being positive (> 0), 92.50% of being significant (> 0.30), and 0.06% of being large (> 1.81)
#>
#> Parameter   | Median |          95% CI | Existence (0) | Significance (0.30) | Large (1.81)
#> --------------------------------------------------------------------------------------------
#> (Intercept) |  52.72 |  [42.22, 64.87] |          1.00 |                1.00 |         1.00
#> wt          |  -8.16 | [-12.39, -3.76] |          1.00 |                1.00 |         1.00
#> cyl         |  -3.58 |  [-5.40, -1.72] |          1.00 |                1.00 |         0.97
#> wt:cyl      |   0.73 |   [0.21,  1.42] |          0.99 |                0.92 |     6.25e-04
print(s, summary=TRUE)
#> # The thresholds beyond which the effect is considered as significant (i.e., non-negligible) and large are 0.30 and 1.81 (corresponding respectively to 0.05 and 0.30 of the outcome's SD).
#>
#> (Intercept) (Median = 52.72, 95% CI [42.22, 64.87]) has 100.00%, 100.00% and 100.00% probability of being positive (> 0), significant (> 0.30) and large (> 1.81)
#> - wt (Median = -8.16, 95% CI [-12.39, -3.76]) has 99.94%, 99.94% and 99.81% probability of being negative (< 0), significant (< -0.30) and large (< -1.81)
#> - cyl (Median = -3.58, 95% CI [-5.40, -1.72]) has 100.00%, 100.00% and 96.75% probability of being negative (< 0), significant (< -0.30) and large (< -1.81)
#> - wt:cyl (Median = 0.73, 95% CI [0.21, 1.42]) has 99.25%, 92.50% and 0.06% probability of being positive (> 0), significant (> 0.30) and large (> 1.81)

Created on 2020-10-26 by the reprex package (v0.3.0)
-
Yes, exactly.
-
library(rstanarm)
library(bayestestR)
#> Note: The default CI width might change in future versions (see https://github.com/easystats/bayestestR/issues/250).
#> To prevent any issues, please set it explicitly when using bayestestR functions, via the 'ci' argument.
model <- rstanarm::stan_glm(mpg ~ wt * cyl,
data = mtcars,
iter = 800, refresh = 0
)
s <- sexit(model)
s
#> # Following the SEXIT framework, we report the median of the posterior distribution and its 95% CI (Highest Density Interval), along with the probability of direction (pd), the probability of significance and the probability of being large. The thresholds beyond which the effect is considered as significant (i.e., non-negligible) and large are 0.30 and 1.81 (corresponding respectively to 0.05 and 0.30 of the outcome's SD).
#>
#> (Intercept) (Median = 52.58, 95% CI [39.64, 64.09]) has a 100.00% probability of being positive (> 0), 100.00% of being significant (> 0.30), and 100.00% of being large (> 1.81)
#> - wt (Median = -8.03, 95% CI [-12.57, -3.31]) has a 99.81% probability of being negative (< 0), 99.81% of being significant (< -0.30), and 99.44% of being large (< -1.81)
#> - cyl (Median = -3.57, 95% CI [-5.55, -1.52]) has a 100.00% probability of being negative (< 0), 99.88% of being significant (< -0.30), and 95.38% of being large (< -1.81)
#> - wt:cyl (Median = 0.72, 95% CI [0.09, 1.37]) has a 98.44% probability of being positive (> 0), 89.88% of being significant (> 0.30), and 0.56% of being large (> 1.81)
#>
#> Parameter | Median | 95% CI | Existence (> |0|) | Significance (> |0.30|) | Large (> |1.81|)
#> -------------------------------------------------------------------------------------------------------
#> (Intercept) | 52.58 | [39.64, 64.09] | 1.00 | 1.00 | 1.00
#> wt | -8.03 | [-12.57, -3.31] | 1.00 | 1.00 | 0.99
#> cyl | -3.57 | [-5.55, -1.52] | 1.00 | 1.00 | 0.95
#> wt:cyl | 0.72 | [0.09, 1.37] | 0.98 | 0.90 | 5.62e-03
print(s, summary=TRUE)
#> # The thresholds beyond which the effect is considered as significant (i.e., non-negligible) and large are 0.30 and 1.81 (corresponding respectively to 0.05 and 0.30 of the outcome's SD).
#>
#> (Intercept) (Median = 52.58, 95% CI [39.64, 64.09]) has 100.00%, 100.00% and 100.00% probability of being positive (> 0), significant (> 0.30) and large (> 1.81)
#> - wt (Median = -8.03, 95% CI [-12.57, -3.31]) has 99.81%, 99.81% and 99.44% probability of being negative (< 0), significant (< -0.30) and large (< -1.81)
#> - cyl (Median = -3.57, 95% CI [-5.55, -1.52]) has 100.00%, 99.88% and 95.38% probability of being negative (< 0), significant (< -0.30) and large (< -1.81)
#> - wt:cyl (Median = 0.72, 95% CI [0.09, 1.37]) has 98.44%, 89.88% and 0.56% probability of being positive (> 0), significant (> 0.30) and large (> 1.81)

Created on 2020-10-26 by the reprex package (v0.3.0)

Next step is to mention it in the vignettes, guidelines, and README, and then popularize it.
-
Some thoughts:
-
I believe that, to make this feature more prominent, it would be better to find a different acronym... otherwise, it will be underutilized.
-
We can have an alternative: one for sad people and one for fun people 🤷 Other acronyms are: Sequential Effect Existence and Significance Test: SEEST... So if you inSEEST, we can add that as an alias 😅
-
😂😂😂😂
-
I personally find the name clear and easy to remember, but think of what we needed to do when submitting the effectsize paper... :-|
-
How about Sequential Existence and Magnitude Inference Testing of Effects?
-
omg haha 💣
-
But this is nothing we need to decide for the next update, I'd say.
-
Here are some ideas that have been in the back of my mind about this thing I've been mentioning here and there, put down here to open the discussion and to guide the development of a possible future direction:
Motivation
There's a debate about which index is best and which one to report. This leads to the mindless reporting of all possible indices so that the reader will be satisfied, but often without the writer understanding or interpreting them. Indeed, it's hard to juggle so many indices at the same time, given their complicated definitions and subtle differences.
The focus should be on the intuitiveness and explicitness of the indices' interpretation (relying as little as possible on magic numbers or arbitrary rules), and on their practical meaningfulness and usefulness.
To that end, we suggest a system for describing parameters that is intuitive, easy to learn and apply, mathematically accurate, and useful for decision-making.
Theoretical idea
Ideas
This would translate into something like: give the pd; if pd > 99% (for example), then assess significance (ROPE). Also report the characteristics of the posterior (median & CI). Some points of debate:
Once we have assessed the direction of the effect, and said that it is, for instance, likely positive, it seems sound to investigate a ROPE defined on the positive side, i.e., the probability that the effect is bigger than a given threshold. But this creates a practical issue: say 99% of the posterior is positive, and 3% of the posterior is in the ROPE [-0.1, 0.1]. If we only consider the [0, 0.1] ROPE, what do we do with the 1% of the posterior that is negative? A more straightforward approach that solves this is to report not the percentage of the posterior that is in the ROPE, but the percentage of the posterior that is NOT in the ROPE (i.e., that is bigger). This inverted take on the ROPE (which could be renamed the Region of Practical Significance, ROPS) might also "speak" more to people. In the case above (pd = 99%, 3% in the ROPE [-0.1, 0.1]), we would say that there is a 99% probability that the effect is positive and a 97% probability that its size is significant (i.e., bigger than the negligibility threshold).
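To make the "inverse ROPE" idea concrete, here is a minimal sketch on raw posterior draws (the function name `p_rops`, the toy posterior, and the 0.1 threshold are all illustrative assumptions, not part of bayestestR):

```r
# Share of posterior draws beyond the negligibility threshold, on the side of
# the median (illustrative sketch of the "ROPS" idea, not a bayestestR function).
p_rops <- function(posterior, threshold = 0.1) {
  if (median(posterior) < 0) posterior <- -posterior  # fold onto the dominant side
  mean(posterior > threshold)
}

set.seed(1)
draws <- rnorm(10000, mean = 0.3, sd = 0.15)  # toy posterior
pd    <- mean(draws > 0)                      # probability of direction
p_sig <- p_rops(draws, threshold = 0.1)       # probability of practical significance
```

By construction, p_sig can never exceed pd, which matches the sequential reading: direction first, then non-negligibility.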
Another issue with the ROPE is its discrete, hard bounds. In the example above (threshold 0.1), 0.0999 is considered negligible and 0.10001 as non-negligible. Although practical, this seems to contradict the underlying probabilistic perspective. I have tried in the past (e.g.), then abandoned, the idea of a probabilistic definition of the ROPE. I am still not sure it makes any sense, but the idea is to define the ROPE as a distribution, e.g., a normal (mean 0, SD 0.1/3), in which the "weight" or density increases as we get closer to 0. What I tried to do with this distribution is to take the overlap between the posterior and the ROPE distribution, to account for the fact that 0.0001 is different from 0.099999. The overlap might not be the best approach, but it keeps coming to my mind as a way to soften the ROPE boundaries.
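A base-R sketch of that overlap idea (the function name, grid size, and SD are assumptions chosen for illustration; this is not how bayestestR computes anything):

```r
# "Soft" ROPE: overlapping coefficient between the posterior density and a
# normal null distribution N(0, 0.1/3), whose weight decays away from 0.
soft_rope_overlap <- function(posterior, rope_sd = 0.1 / 3) {
  lo <- min(posterior, -4 * rope_sd)
  hi <- max(posterior, 4 * rope_sd)
  d      <- density(posterior, from = lo, to = hi, n = 2048)  # kernel estimate
  d_null <- dnorm(d$x, mean = 0, sd = rope_sd)
  dx     <- d$x[2] - d$x[1]
  sum(pmin(d$y, d_null)) * dx  # overlapping coefficient, in [0, 1]
}

set.seed(2)
near_zero <- soft_rope_overlap(rnorm(5000, mean = 0, sd = 0.05))  # high overlap
far_away  <- soft_rope_overlap(rnorm(5000, mean = 2, sd = 0.20))  # near-zero overlap
```

Unlike a hard ROPE, this index changes smoothly as the posterior moves away from 0, instead of jumping at the 0.1 bound.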
Instead of interpreting (i.e., labelling) the effect size of the point estimate, it makes more sense to give the proportion of the posterior within each "category" (small, medium, large, etc.).
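A sketch of what that could look like, using placeholder cut-offs (0.1 / 0.3 / 0.5 are arbitrary here, not recommended guidelines, and `effectsize_shares` is a hypothetical name):

```r
# Proportion of posterior mass in each (absolute) effect-size category.
# Cut-offs are placeholders, not recommendations.
effectsize_shares <- function(posterior, cuts = c(0.1, 0.3, 0.5)) {
  bins <- cut(abs(posterior), breaks = c(0, cuts, Inf), include.lowest = TRUE,
              labels = c("negligible", "small", "medium", "large"))
  prop.table(table(bins))
}

set.seed(3)
shares <- effectsize_shares(rnorm(10000, mean = 0.4, sd = 0.05))
# most of the mass falls in the "medium" bin for this toy posterior
```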
Should we try to establish guidelines for actually interpreting the uncertainty, rather than just reporting the indices? For instance, the width of the standardized CI, or the SD, so that we could conclude something like "the estimation yielded precise parameters".
Thoughts?