Modern scientific research often aims to quantify the effects of multiple simultaneously operating processes in natural or human systems. Some examples from my own work in ecology and evolution consider the effects of herbivory and fertilization on standing biomass [@Gruner+2008]; the effects of bark, wood density, and fire on tree mortality [@brando_fire-induced_2012]; or the effects of taxonomic and genomic position on evolutionary rates [@ghenu_multicopy_2016]. This multifactorial approach [@mcgill_why_2016] complements, rather than replaces, the traditional hypothesis-testing or strong-inference framework [@platt_strong_1964;@fox_why_2016;@betini_why_2017].^[While there is much interesting debate over the best methods for gathering evidence to distinguish among two or more particular, intrinsically discrete hypotheses [@taper_evidential_2015], that is not the focus of this paper.] Such attempts to quantify the magnitude or importance of different processes also differ from predictive modeling, which dominates the fields of machine learning and artificial intelligence [@hastieElements2009]. Prediction and quantification of process strength are closely related --- if we can accurately predict outcomes over a range of conditions, then we can also predict the effects of changes in those conditions, and hence infer the strengths of processes, provided that the changes we are trying to predict are adequately reflected in our training data. However, predictive modelers are usually concerned primarily with predictions within the natural range of conditions, which may not give us enough information to make reliable inferences about processes. This paper focuses on statistical modeling for estimation and inference, rather than prediction.

A standard approach to analyzing multifactorial systems, particularly common in ecology, goes as follows: (1) Construct a full model that encompasses as many of the processes (and their interactions) as is feasible. (2) Fit the full model and make sure that it describes the data reasonably well (e.g. by examining model diagnostics and checking that the level of unexplained variation is not unacceptably large). (3) Construct possible submodels of the full model by setting subsets of parameters to zero. (4) Compute information-theoretic measures of quality, such as the Akaike or Bayesian/Schwarz information criteria, for every submodel. (5) Use multimodel averaging (MMA) to estimate model-averaged parameters and confidence intervals (CIs); possibly draw conclusions about the importance of different processes by summing the information-theoretic weights [@burnham_model_2002]. I argue that this approach, even when used sensibly as its proponents advise (e.g. with a reasonable number of candidate submodels), is a poor way to approach estimation and inference for multifactorial problems.
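
As a concrete illustration of steps (1)-(5), here is a minimal R sketch using the MuMIn package (which supplies dredge(), model.avg(), and the summed-weights function sw()); the data frame dd and the variables y, x1, and x2 are hypothetical placeholders rather than data from any of the analyses cited above.

```r
## Minimal sketch of the standard MMA workflow; dd, y, x1, and x2 are
## hypothetical placeholders.
library(MuMIn)

## (1)-(2): fit the full model and check that it is adequate
full <- lm(y ~ x1 * x2, data = dd, na.action = na.fail)  # na.fail is required by dredge()
plot(full)                  # standard regression diagnostics
summary(full)$r.squared     # how much variation is left unexplained?

## (3)-(4): enumerate all submodels and rank them by AICc
subs <- dredge(full)

## (5): model-average parameters and CIs; sum the Akaike weights per predictor
avg <- model.avg(subs)
confint(avg)
sw(subs)
```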

For example, suppose we want to understand the effects of ecosystem-level net primary productivity and fire intensity on species diversity (a simplified version of the analysis done in @moritzRole2023a). The model-comparison or model-averaging approach would construct five models: a null model with no effects of either productivity or fire, two single-factor models, an additive model, and a full model allowing for interactions between productivity and fire. We would then fit all of these models, model-average their parameters, and derive model-averaged confidence intervals.
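
Written out explicitly, with hypothetical variable names (diversity, npp, fire), the five candidate models and the averaging step might look like this:

```r
## The five candidate models for the productivity-fire example
## (diversity, npp, and fire are hypothetical variable names)
m_null <- lm(diversity ~ 1,          data = dd)  # no effects
m_npp  <- lm(diversity ~ npp,        data = dd)  # productivity only
m_fire <- lm(diversity ~ fire,       data = dd)  # fire only
m_add  <- lm(diversity ~ npp + fire, data = dd)  # additive
m_full <- lm(diversity ~ npp * fire, data = dd)  # with interaction

## model-average over the explicit model set and get averaged CIs
avg <- MuMIn::model.avg(list(m_null, m_npp, m_fire, m_add, m_full))
confint(avg)
```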

The goal of a multifactorial analysis is to tease apart the contributions of many processes, all of which we believe are affecting our study system to some degree. If our scientific questions are (something like) "How important is this factor, in an absolute sense or relative to other factors?" (or equivalently, "how much does a change in this factor change the system in absolute or relative terms?"), rather than "Which of these factors are having any effect at all on my system?", why are we working so hard to fit many models of which only one (the full model) incorporates all of the factors? If we do not have particular, a priori discrete hypotheses about our system (such as "process $A$ influences the outcome but process $B$ has no effect at all"), why does so much of our data-analytic effort go into various ways to test between, or combine and reconcile, multiple discrete models? In software development, this is called an "XY problem"^[http://www.perlmonks.org/?node=XY+Problem]: rather than thinking about the best way to solve our real problem $X$ (understanding multifactorial systems), we have gotten bogged down in the details of how to make a particular tool, $Y$ (multimodel approaches), provide the answers we need. Most critiques of MMA address technical concerns such as the influence of unobserved heterogeneity [@brewer_relative_2016], or criticize the misuse of information-theoretic methods by researchers [@mundryIssues2011;@cade_model_2015], but do not ask why we are comparing discrete models in the first place.

In contrast to averaging across discrete hypotheses, or treating a choice among discrete hypotheses as an end goal, fitting and comparing multiple models as a step in a null-hypothesis significance testing (NHST) procedure is defensible. In the biodiversity analysis described above, we might fit the full model and then assess the significance of individual terms by comparing the fit of the full model to models with those terms dropped (taking particular care with the interpretation of dropping a lower-level effect in models with interactions, e.g. see @bernhardtInterpretation1979). While much maligned, NHSTs are a useful part of data analysis --- not to decide whether we really think null hypotheses are false (they almost always are), but to see if we can distinguish signal from noise. Another interpretation is that NHSTs can test whether we can reliably determine the direction of effects --- that is, not whether the effect of a predictor on some process is zero, but whether we can tell unequivocally that it has a particular sign, positive or negative [@jones_sensible_2000; @dushoff_i_2019].
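
For the biodiversity example, this kind of one-step-reduced comparison is only a couple of lines of R (variable names are hypothetical, as above; drop1() respects marginality, so it tests only the interaction in a model that contains it):

```r
## NHST by comparing the full model to one-step-reduced models
m_full <- lm(diversity ~ npp * fire, data = dd)

drop1(m_full, test = "F")   # tests only the interaction (respects marginality)

## explicit nested comparison: does dropping the interaction hurt the fit?
m_add <- update(m_full, . ~ . - npp:fire)
anova(m_add, m_full)
```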

However, researchers using multimodel approaches are not fitting one-step-reduced models to test hypotheses; rather, they are fitting a wide range of submodels, typically in the hope that model choice or multimodel averaging will help them deal with insufficient data in a multifactorial world. If we had enough information (even Big Data doesn't always provide the information we need: @mengStatistical2018), we could fit only the full model, drawing our conclusions from the estimates and CIs with all of the factors considered simultaneously. But we nearly always have too many predictors, and not enough data; we don't want to overfit (which will inflate our CIs and $p$-values to the point where we can't tell anything for sure), but at the same time we are afraid of neglecting potentially important effects.

Stepwise regression, the original strategy for separating signals from noise, is now widely deprecated because it interferes with correct statistical inference [@harrell_regression_2001;@whittingham_why_2006;@romanoStepwise2005;@mundryStepwise2009]. Information-theoretic tools mitigate the instability of stepwise approaches, allow simultaneous comparison of many, possibly non-nested, models, and avoid the stigma of NHST. A further step forward, multimodel averaging [@burnham_model_2002], accounts for model uncertainty and avoids focusing on a single best model. Some forms of model averaging provide shrinkage estimators; averaging the strength of effects between models where they are included and models where they are absent adjusts the estimated effects toward zero [@cade_model_2015]. More recently, model averaging has experienced a backlash, as studies point out that multimodel averaging may run into trouble when variables are collinear and/or have differential levels of measurement error (@freckleton_dealing_2011); when we are careless about the meaning of main effects in the presence of interactions; when we average model parameters rather than model predictions [@cade_model_2015]; or when we use summed model weights to assess the relative importance of predictors (@galipaud_ecologists_2014, @cade_model_2015 [but cf. @zhang_model_2015]).
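
The shrinkage mechanism is easy to see when the model-averaged estimate is written out. In the common ("full" or "zero") averaging convention, submodels that exclude a predictor contribute an estimate of exactly zero:

$$
\hat{\beta}_{\text{avg}} = \sum_m w_m \hat{\beta}_m, \qquad
w_m = \frac{\exp(-\Delta_m/2)}{\sum_j \exp(-\Delta_j/2)},
$$

where $\Delta_m$ is the AIC difference between model $m$ and the best model, and $\hat{\beta}_m = 0$ for any submodel that omits the predictor; averaging in those zeroes is what pulls the estimate toward zero.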

@freckleton_dealing_2011 makes the point that model averaging will tend to shrink the estimates of multicollinear predictors toward each other, so that estimates of weak effects will be biased upward and estimates of strong effects biased downward. This is an unsurprising (in hindsight) consequence of shrinkage estimation. With other analytical methods such as lasso regression, or selection of a single best model by AIC, the weaker of two correlated predictors (more precisely, the one that appears weaker based on the available data) could be eliminated entirely, so that all of its effects are attributed to the stronger predictor. Researchers often make a case for dropping correlated terms in this way because collinearity of predictors inflates parameter uncertainty and complicates interpretation. However, others have repeatedly pointed out that collinearity is a problem of epistemic uncertainty --- we are simply missing the data that would tell us which combination of collinear factors really drives the system. The confidence intervals of parameters from a full model estimated by regression or maximum likelihood will correctly identify this uncertainty; modeling procedures that automatically drop collinear predictors (by model selection or sparsity-inducing penalization) not only fail to resolve the issue, but can lead to inaccurate predictions for new data [@grahamConfronting2003; @morrisseyMultiple2018; @fengCollinearity2019a; @vanhoveCollinearity2021]. A full model might (correctly) tell us that we cannot confidently assess whether productivity or fire decreases or increases species diversity, because their estimated effects are strongly correlated; however, by comparing the fit of the full model to one that drops both productivity and fire, we could conclude that their joint effect is highly significant.
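
A sketch of this contrast in R (hypothetical variable names as above): the full model reports honest, possibly wide, intervals for each collinear predictor, while a comparison against the model that drops both terms tests their joint effect.

```r
## Collinear predictors: individual CIs are wide, but the joint test can
## still be clearly significant (variable names hypothetical)
m_full <- lm(diversity ~ npp + fire, data = dd)
m_null <- lm(diversity ~ 1,          data = dd)

cov2cor(vcov(m_full))   # correlation between the coefficient estimates
confint(m_full)         # individual CIs, widened by collinearity
anova(m_null, m_full)   # joint F test of productivity and fire together
```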

In ecology, information criteria were introduced by applied ecologists who were primarily interested in making the best possible predictions to inform conservation and management; they were less concerned with inference or quantifying the strength of underlying processes [@BurnAnde98;@burnham_model_2002;@johnsonModel2004a]. Rather than using information criteria as tools to identify the best predictive model, or to obtain the best overall (model-averaged) predictions, most current users of information-theoretic methods use them either to quantify variable importance, or, by multimodel averaging, to have their cake and eat it too --- to avoid either over- or underfitting while quantifying effects in multifactorial systems. There are two problems with this approach, one conceptual and one practical.

The conceptual problem with model averaging reflects the original sin of unnecessarily discretizing a continuous model space. When we fit many different models as part of our analytical process (based on selection or averaging), the models are only means to an end; despite the claims of some information-theoretic modelers, we are not really using the submodels in support of the method of multiple working hypotheses as described by @chamberlinMethod1890. For example, Chamberlin argued that in teaching about the origin of the Great Lakes we should urge students "to conceive of three or more great agencies [pre-glacial erosion, glacial erosion, crust deformation] working successively or simultaneously, and to estimate how much was accomplished by each of these agencies." Chamberlin was not suggesting that we test which individual mechanism or combination of mechanisms fits the data best (in whatever sense), but instead that we acknowledge that the world is multifactorial. In a similar vein, @gelmanPhilosophy2013 advocate "continuous model expansion", creating models that include all components of interest (with appropriate Bayesian priors to constrain the overall complexity of the model) rather than selecting or averaging across discrete sets of models that incorporate subsets of the processes.

Here I am not concerned with whether 'truth' is included in our model set (it isn't), or how this matters to our inference [@bernardoBayesian1994; @barker_truth_2015]. I am claiming the opposite, that our full model --- while certainly not the true model --- is usually the closest thing we have to a true model. This claim seems to contradict the information-theoretic result that the best approximating model (i.e., the minimum-AIC model) is expected to be closest to the true (generating) model in a predictive sense (i.e., it has the smallest expected Kullback-Leibler distance) [@poncianoMultimodel2018]. However, the fact that excluding some processes allows the fitted model to better match observations does not mean that we should believe these processes have no effect on our system --- just that, with the available data, dropping terms will give us better predictions than keeping the full model. If we are primarily interested in prediction, or in comparing qualitatively different, possibly non-nested hypotheses [@luttbeg_comparing_2004], information-theoretic methods do match our goals well.
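
For reference, the connection between AIC and Kullback-Leibler distance runs through the familiar formula

$$
\mathrm{AIC} = -2 \log L(\hat{\theta} \mid y) + 2k,
$$

where $L(\hat{\theta} \mid y)$ is the maximized likelihood and $k$ is the number of estimated parameters; up to a constant that is the same for every model fitted to the same data, AIC estimates (twice) the expected Kullback-Leibler distance from the generating process to the fitted model, which is why the minimum-AIC model is the best predictive approximation rather than a statement about which processes are operating.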

The technical problem with model averaging is its computational inefficiency. Individual models can take minutes or hours to fit, and we may have to fit dozens or scores of submodels in the multimodel averaging process. There are efficient tools available for fitting "right-sized" models that avoid many of the technical problems of model averaging. Penalized methods such as ridge and lasso regression [@dahlgren_alternative_2010] are well known in some scientific fields; in a Bayesian setting, informative priors centered at zero have the same regularizing effect --- pushing weak effects toward zero and controlling model complexity (more or less synonymous with the shrinkage of estimates described above) [@lemoineMoving2019a]. Developed for optimal (predictive) fitting in models with many parameters, penalized models have well-understood statistical properties; they avoid the pitfalls of model-averaging correlated or nonlinear parameters; and, by avoiding the need to fit many submodels in the model-averaging process, they are much faster.^[Although they may require a computationally expensive cross-validation step in order to choose the degree of penalization.] Furthermore, penalized approaches underlie modern nonparametric methods such as additive models and Gaussian processes that allow models to expand indefinitely to match the available data [@rasmussenGaussian2005; @woodGeneralized2017].
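
A minimal glmnet sketch of the penalized alternative, again with hypothetical variable names; alpha = 1 gives the lasso and alpha = 0 gives ridge regression:

```r
## Penalized regression with glmnet (hypothetical variable names)
library(glmnet)

X <- model.matrix(~ npp * fire, data = dd)[, -1]  # drop the intercept column
y <- dd$diversity

fit_lasso <- glmnet(X, y, alpha = 1)   # lasso
fit_ridge <- glmnet(X, y, alpha = 0)   # ridge

## cross-validation to choose the penalty strength lambda
cv <- cv.glmnet(X, y, alpha = 1)
coef(cv, s = "lambda.1se")   # coefficients at a conservative choice of lambda
```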

Penalized models have their own challenges. A big advantage of information-theoretic methods is that, like wrapper methods for feature selection in machine learning [@chandrashekarSurvey2014], we can use model averaging as long as we can fit the component models and extract their log-likelihoods and numbers of parameters --- we never need to build new software. Although powerful computational tools exist for fitting penalized versions of linear and generalized linear models (e.g. the glmnet package for R), mixed models (glmmLasso), and quantile regressions [@koenkerQuantile2017], software for some more exotic models (e.g. zero-inflated models, models for censored data) may not be readily available. Fitting these models requires the user to choose the strength of penalization. This process is conveniently automated in tools like glmnet, but correctly assessing out-of-sample accuracy (and hence the correct level of penalization) is tricky for data that are correlated in space or time [@wenger_assessing_2012; @robertsCrossvalidation2016]. Penalization (or regularization) can also be done by imposing Bayesian priors on subsets of parameters [@chungNondegenerate2013], but this converts the choice of penalization strength into a similarly challenging choice of appropriate priors.
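
One partial remedy for correlated data, continuing the glmnet sketch above, is blocked cross-validation: assign whole spatial or temporal blocks to folds so that held-out observations are not near neighbours of the training data (the site column is a hypothetical blocking variable).

```r
## Blocked cross-validation for spatially/temporally structured data;
## dd$site is a hypothetical blocking variable
block_id <- as.integer(factor(dd$site))           # one fold per site
cv_block <- cv.glmnet(X, y, alpha = 1, foldid = block_id)
cv_block$lambda.min                               # penalty chosen by blocked CV
```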

Finally, frequentist inference (computing $p$-values and CIs) for parameters in penalized models --- one of the basic outputs we want from a statistical analysis of a multifactorial system --- is a current research problem; statisticians have proposed a variety of methods [@potscherConfidence2010a;@javanmard_confidence_2014;@lockhart_significance_2014;@taylorPostselection2018], but they typically make extremely strong asymptotic assumptions and are far from being standard options in software. Scientists should encourage their friends in statistics and computer science to build tools that make penalized approaches easier to use.

Statisticians derived confidence intervals for ridge regression long ago [@obenchain_classical_1977] --- surprisingly, they are identical to the confidence intervals one would have gotten from the full model without penalization! @wangInterval2013b similarly proved that model-averaging CIs derived as suggested by @hjortFrequentist2003 are asymptotically (i.e. for arbitrarily large data sets) equivalent to the CIs from the full model. Analytical and simulation studies [@turek2012model;@fletcher2012model;@turek2013frequentist;@turek2015comparison;@kabaila_model-averaged_2016;@dormann_model_2018] have shown that a variety of alternative methods for constructing CIs are overoptimistic, i.e. that they generate too-narrow confidence intervals with coverage lower than the nominal level. Simulations from several of the studies above show that MMA confidence intervals constructed according to the best known procedures typically include the true parameter values only about 80% or 90% of the time. In particular, @kabaila_model-averaged_2016 argue that it will be very difficult to construct model-averaged confidence intervals that take advantage of shrinkage while still achieving correct coverage. (The only examples I have been able to find of MMA confidence intervals with close to nominal coverage are from Chapter 5 of @burnham_model_2002.) In short, it seems difficult to find model-averaged confidence intervals that compete successfully with the standard confidence intervals based on the full model.
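
As a baseline for what nominal coverage means, here is a tiny simulation (with arbitrary illustrative numbers) checking that the full-model confidence interval for one coefficient covers the true value about 95% of the time; the model-averaged intervals discussed above fall short of this benchmark.

```r
## Coverage check for the full-model CI (arbitrary illustrative values)
set.seed(1)
beta <- c(1, 0.5, 0.2)                # intercept and two true effects
cover <- replicate(1000, {
  x1 <- rnorm(50); x2 <- rnorm(50)
  y  <- beta[1] + beta[2] * x1 + beta[3] * x2 + rnorm(50)
  ci <- confint(lm(y ~ x1 + x2))["x1", ]
  ci[1] < beta[2] && beta[2] < ci[2]
})
mean(cover)   # close to the nominal 0.95
```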

Free lunches do not exist in statistics, any more than anywhere else. We can use penalized approaches to improve prediction accuracy without having to sacrifice any input variables (by accepting some bias in exchange for reduced variance), but the only known ways to gain statistical power for testing hypotheses, or to narrow our uncertainty about our predictions, are to limit the scope of our models a priori [@harrell_regression_2001], to add information from pre-specified Bayesian priors (or equivalent regularization procedures), or to collect more data. @burnhamMultimodel2004b defined a "savvy" prior that reproduces the results of AIC-based multimodel averaging in a Bayesian framework, but it is a weak conceptual foundation for understanding multifactorial systems. Because it is a prior on discrete models, rather than on the magnitudes of the continuous parameters that describe the strength of different processes, it induces a spike-and-slab type prior on parameters that assigns a positive probability to the unrealistic case of a parameter being exactly zero; furthermore, the prior will depend on the particular set of models being considered.
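
Concretely, putting a prior over a set of discrete models that either include or exclude a parameter $\beta$ induces a marginal prior on $\beta$ of spike-and-slab form,

$$
p(\beta) = \pi \, \delta_0(\beta) + (1 - \pi) \, g(\beta),
$$

where $\delta_0$ is a point mass at zero (the spike, with prior probability $\pi$ that the process has exactly no effect) and $g$ is a continuous density (the slab) determined by the priors within the models that include $\beta$; both $\pi$ and $g$ change whenever the candidate model set changes.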

Multimodel averaging is probably most popular in ecology (in May 2024, Google Scholar returned $\approx$ 65,000 hits for "multimodel averaging" alone and 31,000 for "multimodel averaging ecology"). However, multifactorial systems --- and the problems of approaching inference through comparing and combining discrete models that consider artificially limited subsets of the processes we know are operating --- occur throughout the sciences of complexity, those involving biological and human processes. In psychology, economics, sociology, epidemiology, ecology, and evolution, every process that we can imagine has some influence on the outcomes that we observe. Pretending that some of these processes are completely absent can be a useful means to an inferential or computational end, but it is rarely what we actually believe about the system (although see @mundryIssues2011 for a counterargument). We should not let this useful pretense become our primary statistical focus.

If we have sensible scientific questions and good experimental designs, muddling through with existing techniques will often give reasonable results [@murtaugh_performance_2009]. But researchers should at least be aware that the roundabout statistical methods they currently use to understand multifactorial systems were designed for prediction, or for comparing discrete hypotheses, rather than for quantifying the relative strength of simultaneously operating processes. When prediction is the primary goal, penalized methods can work better (faster and with better-understood statistical properties) than multimodel averaging. When estimating the magnitude of effects or judging variable importance, penalized or Bayesian methods may be appropriate --- or we may have to go back to the difficult choice of focusing on a restricted number of variables for which we have enough data to fit and interpret the full model.

Acknowledgements

Thanks to Jonathan Dushoff for conversations on these topics over many years. Dana Karelus, Daniel Turek, and Jeff Walker provided useful input; Noam Ross encouraged me to finally submit the paper; Tara Bolker gave advice on straw men; three anonymous reviewers gave useful feedback. This work was supported by multiple NSERC Discovery grants. I used Microsoft Copilot to remind me of the term "epistemic uncertainty".

References