output

pdf_document	html_document
default	default

Super Learning with `sl3` {#sl3}

Rachael Phillips

Based on the sl3 R package by Jeremy Coyle, Nima Hejazi, Ivana Malenica, Rachael Phillips, and Oleg Sofrygin.

Updated: 2022-05-23

Learning Objectives {-}

By the end of this chapter you will be able to:

Select a performance metric that is optimized by the true prediction function and aligns with the intended use of the analysis in the real world.
Assemble a diverse set ("library") of learners to be considered in the super learner. In particular, you should be able to:

a. Customize a learner by modifying its tuning parameters. b. Create variations of the same base learner with different tuning parameter specifications. c. Couple screener(s) with learner(s) to create learners that consider as covariates a reduced, screener-selected subset of them.
Specify a meta-learner that optimizes the objective function of interest.
Justify the library and the meta-learner in terms of the prediction problem at hand, intended use of the analysis in the real world, statistical model, sample size, number of covariates, and outcome prevalence for discrete outcomes.
Interpret the fit for a super learner from the table of cross-validated risk estimates and the super learner coefficients.

Super Learner (SL) Introduction

A common task in data analysis is prediction, or using the observed data to learn a function that takes as input data on covariates/predictors and outputs a predicted value. Occasionally, the scientific question of interest lends itself to causal effect estimation. Even in these scenarios, where prediction is not in the forefront, prediction tasks are embedded in the procedure. For instance, in targeted minimum loss-based estimation (TMLE), predictive modeling is necessary for estimating outcome regressions and propensity scores.

There are various strategies that can be employed to model relationships from data, which we refer to interchangeably as "estimators", "algorithms", and "learners". For some data algorithms that can pick up on complex relationships in the data are necessary to adequately model it, and for other data parametric regression learners might fit the data reasonably well. It is generally impossible to know in advance which approach will be the best for a given data set and prediction problem.

The Super Learner (SL) solves the issue of selecting an algorithm, as it can consider many of them, from the simplest parametric regressions to the most complex machine learning algorithms (e.g., neural nets, support vector machines, etc). Additionally, it is proven to perform as well as possible in large samples, given the learners specified [@vdl2007super]. The SL represents an entirely pre-specified, data-adaptive, and theoretically grounded approach for predictive modeling. It has been shown to be adaptive and robust in a variety of applications, and in even in very small samples. Detailed descriptions outlining the SL procedure are widely available [@polley2010super; @naimi2018stacked]. Practical considerations for specifying the SL, including how to specify a rich and diverse library of learners, choose a performance metric for the SL, and specify a cross-validation (CV) scheme, are described in a pre-print article [@rvp2022super]. Here, we focus on introducing sl3, the standard tlverse software package for SL.

How to fit a SL with `sl3`

In this section, the core functionality for fitting any SL with sl3 is illustrated. In the sections that follow, additional sl3 functionality is presented.

Fitting any SL with sl3 consists of the following three steps:

Define the prediction task with make_sl3_Task.
Instantiate the SL with Lrnr_sl.
Fit the SL to the task with train.

WASH Benefits Study Example {-}

Using the WASH Benefits Bangladesh data, we are interested in predicting weight-for-height z-score whz from a set of predictors/covariates (i.e., variables measured before weight-for-height z-score). More information on this dataset is described in the "Meet the Data" chapter.

Preliminaries {-}

First, we need to load the data and relevant packages into the R session.

Load the data

We will use the fread function in the data.table R package to load the WASH Benefits example dataset:

library(data.table)
washb_data <- fread(
  paste0(
    "https://raw.githubusercontent.com/tlverse/tlverse-data/master/",
    "wash-benefits/washb_data.csv"
  ),
  stringsAsFactors = TRUE
)

Next, we will take a peek at the first few rows of our dataset:

head(washb_data)

whz	tr	fracode	month	aged	sex	momage	momedu	momheight	hfiacat	Nlt18	Ncomp	watmin	elec	walls	roof	asset_wardrobe	asset_table	asset_chair	asset_khat	asset_chouki	asset_tv	asset_bike	asset_mobile
0.00	Control	N05265	9	268	male	30	Primary (1-5y)	146.40	Food Secure	3	11	0	1	1	1	0	1	1	1	0	1	0	1
-1.16	Control	N05265	9	286	male	25	Primary (1-5y)	148.75	Moderately Food Insecure	2	4	0	1	1	1	0	1	0	1	1	0	0	1
-1.05	Control	N08002	9	264	male	25	Primary (1-5y)	152.15	Food Secure	1	10	0	0	1	1	0	0	1	0	1	0	0	1
-1.26	Control	N08002	9	252	female	28	Primary (1-5y)	140.25	Food Secure	3	5	0	1	1	1	1	1	1	1	0	0	1	1
-0.59	Control	N06531	9	336	female	19	Secondary (>5y)	150.95	Food Secure	2	7	0	1	1	1	1	1	1	1	1	0	0	1
-0.51	Control	N06531	9	304	male	20	Secondary (>5y)	154.20	Severely Food Insecure	0	3	1	1	1	1	0	0	0	0	1	0	0	1

Install the `sl3` software (as needed)

To install any package, we recommend first clearing the R workspace and then restarting the R session. In RStudio, this can be achieved by clicking the tab "Session" then "Clear Workspace", and then clicking "Session" again then "Restart R".

We can install sl3 using the function install_github provided in the devtools R package:

library(devtools)
install_github("tlverse/sl3")

Once the R package is installed, we recommend restarting the R session again.

Load the `sl3` software

Once sl3 is installed, we can load it like any other R package:

library(sl3)

1. Define the prediction task with `make_sl3_Task` {-}

The sl3_Task object defines the prediction task of interest. Recall that our task in this illustrative example is to use the WASH Benefits Bangladesh example dataset to learn a function of the covariates for predicting weight-for-height Z-score whz.

# create the task (i.e., use washb_data to predict outcome using covariates)
task <- make_sl3_Task(
  data = washb_data,
  outcome = "whz",
  covariates = c("tr", "fracode", "month", "aged", "sex", "momage", "momedu", 
                 "momheight", "hfiacat", "Nlt18", "Ncomp", "watmin", "elec", 
                 "floor", "walls", "roof", "asset_wardrobe", "asset_table", 
                 "asset_chair", "asset_khat", "asset_chouki", "asset_tv", 
                 "asset_refrig", "asset_bike", "asset_moto", "asset_sewmach", 
                 "asset_mobile")
)

# let's examine the task
task
An sl3 Task with 4695 obs and these nodes:
$covariates
 [1] "tr"              "fracode"         "month"           "aged"           
 [5] "sex"             "momage"          "momedu"          "momheight"      
 [9] "hfiacat"         "Nlt18"           "Ncomp"           "watmin"         
[13] "elec"            "floor"           "walls"           "roof"           
[17] "asset_wardrobe"  "asset_table"     "asset_chair"     "asset_khat"     
[21] "asset_chouki"    "asset_tv"        "asset_refrig"    "asset_bike"     
[25] "asset_moto"      "asset_sewmach"   "asset_mobile"    "delta_momage"   
[29] "delta_momheight"

$outcome
[1] "whz"

$id
NULL

$weights
NULL

$offset
NULL

$time
NULL

The sl3_Task keeps track of the roles the variables play in the prediction problem. Additional information relevant to the prediction task (such as observational-level weights, offset, id, CV folds) can also be specified in make_sl3_Task. The default CV fold structure in sl3 is V-fold CV (VFCV) with V=10 folds; if id is specified in the task then a clustered V=10 VFCV scheme is considered, and if the outcome type is binary or categorical then a stratified V=10 VFCV scheme is considered. Different CV schemes can be specified by inputting an origami folds object, as generated by the make_folds function in the origami R package. Refer to the documentation on origami's make_folds function for more information (e.g., in RStudio, by loading the origami R package and then inputting "?make_folds" in the Console). For more details on sl3_Task, refer to its documentation (e.g., by inputting "?sl3_Task" in R).

Tip: If you type task$ and then press the tab key (press tab twice if not in RStudio), you can view all of the active and public fields, and methods that can be accessed from the task$ object. This $ is like the key to access many internals of an object. In the next section, will see how we can use $ to dig into SL fit objects as well, to obtain predictions from an SL fit or candidate learners, examine an SL fit or its candidates, and summarize an SL fit.

2. Instantiate the SL with `Lrnr_sl`

In order to create Lrnr_sl we need to specify, at the minimum, a set of learners for the SL to consider as candidates. This set of algorithms is also commonly referred to as the "library". We might also specify the meta-learner, which is the algorithm that ensembles the learners, but this is optional since there are already defaults set up in sl3. See "Practical considerations for specifying a super learner" for step-by-step guidelines for tailoring the SL specification, including the library and meta-learner(s), to perform well for the prediction task at hand [@rvp2022super].

Learners have properties that indicate what features they support. We may use the sl3_list_properties() function to get a list of all properties supported by at least one learner:

sl3_list_properties()
 [1] "binomial"      "categorical"   "continuous"    "cv"           
 [5] "density"       "h2o"           "ids"           "importance"   
 [9] "offset"        "preprocessing" "sampling"      "screener"     
[13] "timeseries"    "weights"       "wrapper"

Since whz is a continuous outcome, we can identify the learners that support this outcome type with sl3_list_learners():

sl3_list_learners(properties = "continuous")
 [1] "Lrnr_arima"                     "Lrnr_bartMachine"              
 [3] "Lrnr_bayesglm"                  "Lrnr_bilstm"                   
 [5] "Lrnr_bound"                     "Lrnr_caret"                    
 [7] "Lrnr_cv_selector"               "Lrnr_dbarts"                   
 [9] "Lrnr_earth"                     "Lrnr_expSmooth"                
[11] "Lrnr_ga"                        "Lrnr_gam"                      
[13] "Lrnr_gbm"                       "Lrnr_glm"                      
[15] "Lrnr_glm_fast"                  "Lrnr_glm_semiparametric"       
[17] "Lrnr_glmnet"                    "Lrnr_glmtree"                  
[19] "Lrnr_grf"                       "Lrnr_gru_keras"                
[21] "Lrnr_gts"                       "Lrnr_h2o_glm"                  
[23] "Lrnr_h2o_grid"                  "Lrnr_hal9001"                  
[25] "Lrnr_HarmonicReg"               "Lrnr_hts"                      
[27] "Lrnr_lightgbm"                  "Lrnr_lstm_keras"               
[29] "Lrnr_mean"                      "Lrnr_multiple_ts"              
[31] "Lrnr_nnet"                      "Lrnr_nnls"                     
[33] "Lrnr_optim"                     "Lrnr_pkg_SuperLearner"         
[35] "Lrnr_pkg_SuperLearner_method"   "Lrnr_pkg_SuperLearner_screener"
[37] "Lrnr_polspline"                 "Lrnr_randomForest"             
[39] "Lrnr_ranger"                    "Lrnr_rpart"                    
[41] "Lrnr_rugarch"                   "Lrnr_screener_correlation"     
[43] "Lrnr_solnp"                     "Lrnr_stratified"               
[45] "Lrnr_svm"                       "Lrnr_tsDyn"                    
[47] "Lrnr_xgboost"

Now that we have an idea of some learners, let's instantiate a few of them. Below we instantiate Lrnr_glm and Lrnr_mean, a main terms generalized linear model (GLM) and a mean model, respectively.

lrn_glm <- Lrnr_glm$new()
lrn_mean <- Lrnr_mean$new()

For both of the learners created above, we just used the default tuning parameters. We can also customize a learner's tuning parameters to incorporate a diversity of different settings, and consider the same learner with different tuning parameter specifications.

Below, we consider the same base learner, Lrnr_glmnet (i.e., GLMs with elastic net regression), and create two different candidates from it: an L2-penalized/ridge regression and an L1-penalized/lasso regression.

# penalized regressions:
lrn_ridge <- Lrnr_glmnet$new(alpha = 0)
lrn_lasso <- Lrnr_glmnet$new(alpha = 1)

By setting alpha in Lrnr_glmnet above, we customized this learner's tuning parameter. When we instantiate Lrnr_hal9001 below we show how multiple tuning parameters (specifically, max_degreeand num_knots) can be modified at the same time.

Let's also instantiate some more learners that do not enforce relationships to be linear or monotonic, and to further diversify the set of candidates to include nonparametric learners, since up to this point all of the learners we've instantiated have been parametric.

# spline regressions:
lrn_polspline <- Lrnr_polspline$new()
lrn_earth <- Lrnr_earth$new()

# fast highly adaptive lasso (HAL) implementation
lrn_hal <- Lrnr_hal9001$new(max_degree = 2, num_knots = c(3,2), nfolds = 5)

# tree-based methods
lrn_ranger <- Lrnr_ranger$new()
lrn_xgb <- Lrnr_xgboost$new()

Let's also include a generalized additive model (GAM) and Bayesian GLM to further diversify the pool that we will consider as candidates in the SL.

lrn_gam <- Lrnr_gam$new()
lrn_bayesglm <- Lrnr_bayesglm$new()

Now that we've instantiated a set of learners, we need to put them together so the SL can consider them as candidates. In sl3, we do this by creating a so-called Stack of learners. A Stack is created in the same way we created the learners. This is because Stack is a learner itself; it has the same interface as all of the other learners. What makes a stack special is that it considers multiple learners at once: it can train them simultaneously, so that their predictions can be combined and/or compared.

stack <- Stack$new(
  lrn_glm, lrn_mean, lrn_ridge, lrn_lasso, lrn_polspline, lrn_earth, lrn_hal, 
  lrn_ranger, lrn_xgb, lrn_gam, lrn_bayesglm
)
stack
 [1] "Lrnr_glm_TRUE"                          
 [2] "Lrnr_mean"                              
 [3] "Lrnr_glmnet_NULL_deviance_10_0_100_TRUE"
 [4] "Lrnr_glmnet_NULL_deviance_10_1_100_TRUE"
 [5] "Lrnr_polspline_5"                       
 [6] "Lrnr_earth_2_3_backward_0_1_0_0"        
 [7] "Lrnr_hal9001_2_1_c(3, 2)_5"             
 [8] "Lrnr_ranger_500_TRUE_none_1"            
 [9] "Lrnr_xgboost_20_1"                      
[10] "Lrnr_gam_NULL_NULL_GCV.Cp"              
[11] "Lrnr_bayesglm_TRUE"

We can see that the names of the learners in the stack are long. This is because the default naming of a learner in sl3 is clunky: for each learner, every tuning parameter in sl3 is contained in the name. In the next section, "Naming learners", we show a few different ways for the user to name learners as they wish.

Now that we have instantiated a set of learners and stacked them together, we are ready to instantiate the SL. We will use the default meta-learner, which is non-negative least squares (NNLS) regression (Lrnr_nnls) for continuous outcomes, and we will still go ahead and specify it for illustrative purposes.

sl <- Lrnr_sl$new(learners = stack, metalearner = Lrnr_nnls$new())

3. Fit the SL to the task with `train`

The last step for fitting the SL to the prediction task is to call train and supply the task. We will also set a random number generator before calling train so the results are reproducible.

set.seed(4197)
sl_fit <- sl$train(task)

Summary {-}

In this section, the core functionality for fitting any SL with sl3 was illustrated. This consists of the following three steps:

Define the prediction task with make_sl3_Task.
Instantiate the SL with Lrnr_sl.
Fit the SL to the task with train.

This example was for demonstrative purposes only. See @rvp2022super for step-by-step guidelines for constructing a SL that is well-specified for the prediction task at hand.

Additional `sl3` Topics

Obtaining Predictions

We will draw on the fitted SL object from above, sl_fit, to obtain the SL's predicted whz value for each subject.

sl_preds <- sl_fit$predict(task)
head(sl_preds)
[1] -0.55036 -0.87149 -0.71521 -0.75967 -0.63134 -0.67732

We can also obtain predicted values from a candidate learner in the SL. Below we obtain predictions for the GLM learner.

glm_preds <- sl_fit$learner_fits$Lrnr_glm_TRUE$predict(task)
head(glm_preds)
[1] -0.72617 -0.93615 -0.70850 -0.64918 -0.70132 -0.84618

Below we visualize the observed values for whz and predicted whz values for SL, GLM and the mean.

# table of observed and predicted outcome values and arrange by observed values
df_plot <- data.table(
  Obs = washb_data[["whz"]], SL_Pred = sl_preds, GLM_Pred = glm_preds,
  Mean_Pred = sl_fit$learner_fits$Lrnr_mean$predict(task)
)
df_plot <- df_plot[order(df_plot$Obs), ]

head(df_plot)

Obs	SL_Pred	GLM_Pred	Mean_Pred
-4.67	-1.5371	-0.90956	-0.58608
-4.18	-1.2411	-0.63906	-0.58608
-4.17	-1.2141	-0.80981	-0.58608
-4.03	-1.5101	-0.89602	-0.58608
-3.95	-1.6594	-1.19523	-0.58608
-3.90	-1.3279	-0.98492	-0.58608

# melt the table so we can plot observed and predicted values
df_plot$id <- seq(1:nrow(df_plot))
df_plot_melted <- melt(
  df_plot, id.vars = "id",
  measure.vars = c("Obs", "SL_Pred", "GLM_Pred", "Mean_Pred")
)

library(ggplot2)
ggplot(df_plot_melted, aes(id, value, color = variable)) + 
  geom_point(size = 0.1) + 
  labs(x = "Subject id (ordered by increasing whz)", 
       y = "Weight-for-height z-score (whz)") +
  theme(legend.position = "bottom", legend.title = element_blank(),
        axis.text.x = element_blank(), axis.ticks.x = element_blank())

We can also obtain the cross-validated predictions for the candidate learners:

cv_preds <- sl_fit$fit_object$cv_fit$predict_fold(task, "validation")

head(cv_preds)

Lrnr_glm_TRUE	Lrnr_mean	Lrnr_glmnet_NULL_deviance_10_0_100_TRUE	Lrnr_glmnet_NULL_deviance_10_1_100_TRUE	Lrnr_polspline_5	Lrnr_earth_2_3_backward_0_1_0_0	Lrnr_hal9001_2_1_c(3, 2)_5	Lrnr_ranger_500_TRUE_none_1	Lrnr_xgboost_20_1	Lrnr_gam_NULL_NULL_GCV.Cp	Lrnr_bayesglm_TRUE
-0.74535	-0.59308	-0.69823	-0.69772	-0.72502	-0.71555	-0.69465	-0.72504	-0.78828	-0.72444	-0.74525
-0.94468	-0.58647	-0.82777	-0.79474	-0.84489	-0.83517	-0.85505	-0.73837	-0.59828	-0.93238	-0.94452
-0.64941	-0.59308	-0.69920	-0.72559	-0.71400	-0.60895	-0.70451	-0.51159	-0.64530	-0.61107	-0.64946
-0.62111	-0.58460	-0.62242	-0.65938	-0.65248	-0.69157	-0.68570	-0.56976	-0.46965	-0.59100	-0.62137
-0.76466	-0.58460	-0.66531	-0.70686	-0.70006	-0.69693	-0.67454	-0.68531	-0.65883	-0.79748	-0.76495
-0.88726	-0.57635	-0.80557	-0.72485	-0.71248	-0.47698	-0.67666	-0.79139	-0.69626	-0.91322	-0.88716

If we wanted to obtain predicted values for new data then we would need to create a new task from the new data. Also, the covariates in this new task must be identical to the covariates in the task for training. As an example, let's assume we have data washb_data_new for which we want to SL predictions.

washb_data_new$whz <- rep(NA, nrow(washb_data_new)) # create a fake outcome
pred_task <- make_sl3_Task(
  data = washb_data_new,
  outcome = "whz", 
  outcome_type = "continuous",
  covariates = c("tr", "fracode", "month", "aged", "sex", "momage", "momedu", 
                 "momheight", "hfiacat", "Nlt18", "Ncomp", "watmin", "elec", 
                 "floor", "walls", "roof", "asset_wardrobe", "asset_table", 
                 "asset_chair", "asset_khat", "asset_chouki", "asset_tv", 
                 "asset_refrig", "asset_bike", "asset_moto", "asset_sewmach", 
                 "asset_mobile")
)
sl_preds_new_task <- sl_fit$predict(pred_task)

Summarizing SL Fits

Coefficients

We can see how the meta-learner created a function of the learners in a few ways. In our illustrative example, we considered the default, NNLS meta-learner for continuous outcomes. For meta-learners that simply learn a weighted combination, we can examine their coefficients.

round(sl_fit$coefficients, 3)
                          Lrnr_glm_TRUE                               Lrnr_mean 
                                  0.000                                   0.000 
Lrnr_glmnet_NULL_deviance_10_0_100_TRUE Lrnr_glmnet_NULL_deviance_10_1_100_TRUE 
                                  0.076                                   0.000 
                       Lrnr_polspline_5         Lrnr_earth_2_3_backward_0_1_0_0 
                                  0.164                                   0.389 
             Lrnr_hal9001_2_1_c(3, 2)_5             Lrnr_ranger_500_TRUE_none_1 
                                  0.000                                   0.372 
                      Lrnr_xgboost_20_1               Lrnr_gam_NULL_NULL_GCV.Cp 
                                  0.000                                   0.000 
                     Lrnr_bayesglm_TRUE 
                                  0.000

We can also examine the coefficients by directly accessing the meta-learner's fit object.

metalrnr_fit <- sl_fit$fit_object$cv_meta_fit$fit_object
round(metalrnr_fit$coefficients, 3)
                          Lrnr_glm_TRUE                               Lrnr_mean 
                                  0.000                                   0.000 
Lrnr_glmnet_NULL_deviance_10_0_100_TRUE Lrnr_glmnet_NULL_deviance_10_1_100_TRUE 
                                  0.076                                   0.000 
                       Lrnr_polspline_5         Lrnr_earth_2_3_backward_0_1_0_0 
                                  0.164                                   0.389 
             Lrnr_hal9001_2_1_c(3, 2)_5             Lrnr_ranger_500_TRUE_none_1 
                                  0.000                                   0.372 
                      Lrnr_xgboost_20_1               Lrnr_gam_NULL_NULL_GCV.Cp 
                                  0.000                                   0.000 
                     Lrnr_bayesglm_TRUE 
                                  0.000

Direct access to the meta-learner fit object is also handy for more complex meta-learners (e.g., non-parametric meta-learners) that are not defined by a simple set of main terms regression coefficients.

Cross-validated predictive performance

We can obtain a table of the cross-validated (CV) predictive performance, i.e., the CV risk, for each learner included in the SL. Below, we use the squared error loss for the evaluation function, which equates to the mean squared error (MSE) as the metric to summarize predictive performance. The reason why we use the MSE is because it is a valid metric for estimating the conditional mean, which is what we're learning the prediction function for in the WASH Benefits example. For more information on selecting an appropriate performance metric, see @rvp2022super.

cv_risk_table <- sl_fit$cv_risk(eval_fun = loss_squared_error)

cv_risk_table[,c(1:3)]

learner	coefficients	MSE
Lrnr_glm_TRUE	0.00000	1.0224
Lrnr_mean	0.00000	1.0654
Lrnr_glmnet_NULL_deviance_10_0_100_TRUE	0.07582	1.0166
Lrnr_glmnet_NULL_deviance_10_1_100_TRUE	0.00000	1.0144
Lrnr_polspline_5	0.16359	1.0163
Lrnr_earth_2_3_backward_0_1_0_0	0.38900	1.0131
Lrnr_hal9001_2_1_c(3, 2)_5	0.00000	1.0176
Lrnr_ranger_500_TRUE_none_1	0.37159	1.0128
Lrnr_xgboost_20_1	0.00000	1.0793
Lrnr_gam_NULL_NULL_GCV.Cp	0.00000	1.0235
Lrnr_bayesglm_TRUE	0.00000	1.0224

Cross-validated SL

We can see from the CV risk table above that the SL is not contained. This is because we do not have a CV risk for the SL unless we cross-validate it or include it as a candidate in another SL; the latter is shown in the next subsection. Below we show how to obtain a cross-validated risk estimate for the SL using cv_sl. (Note: we set eval = FALSE for this part of the code as it was killing the build).

set.seed(569)
cv_sl_fit <- cv_sl(lrnr_sl = sl_fit, task = task, eval_fun = loss_squared_error)

cv_sl_fit$cv_risk[,c(1:3)]

Discrete SL

The discrete SL (dSL), is a SL that uses a winner-take-all meta-learner called the cross-validated selector. The dSL is therefore identical to the candidate with the best cross-validated performance; its predictions will be the same as this candidate’s predictions. The dSL meta-learner specification can be invoked in sl3 with Lrnr_cv_selector.

cv_selector <- Lrnr_cv_selector$new(eval_function = loss_squared_error)
stack_with_sl <- Stack$new(stack, sl)
dSL <- Lrnr_sl$new(learners = stack_with_sl, metalearner = cv_selector)

We can examine the CV risk for an ensemble SL (eSL), or multiple eSLs, by including it as a candidate in a dSL. An eSL is a SL that uses any parametric or non-parametric algorithm as its meta-learner. Therefore, the eSL is defined by a combination of multiple candidates; its predictions are defined by a combination of multiple candidates’ predictions. The sl object defined in the previous section, sl, is an eSL since it used NNLS as the meta-learner.

When we include an eSL as a candidate in the dSL, this allows the eSL’s CV performance to be compared to that from the other learners from which it was constructed. This is similar to calling CV SL, cv_sl, above. The difference between including the eSL in the dSL and calling cv_sl is that the former automates a procedure for the final SL to be the learner that achieved the best CV predictive performance, i.e., lowest CV risk. If the eSL performs better than any other candidate, the dSL will end up selecting it. As mentioned in @rvp2022super, "another advantage of this approach is that multiple eSLs that use more flexible meta-learner methods (e.g., non-parametric machine learning algorithms like HAL) can be evaluated simultaneously.

Default Data Pre-processing

In sl3 it is required that the analytic dataset (i.e., the dataset consisting of observations on an outcome and covariates) does not contain any missing values, and it does not contain character and factor covariates. In this subsection, we review the default functionality in sl3 that takes care of this internally; specifically, this data pre-processing occurs when make_sl3_Task is called.

Users can also perform any pre-processing before creating the sl3_Task (as needed) to bypass the default functionality discussed in the following. See @rvp2022super, section "Preliminaries: Analytic dataset pre-processing" for more information and general guidelines to follow for pre-processing of the analytic dataset, including considerations for pre-processing in high dimensional settings.

Recall that the sl3_Task object defines the prediction task of interest. Our task in the illustrative example from above was to use the WASH Benefits Bangladesh data to learn a function of the covariates for predicting weight-for-height Z-score whz. For more details on sl3_Task, refer to the documentation (e.g., by inputting "?sl3_Task" in R). We will instantiate the task in order to examine the pre-processing of washb_data.

# create the task (i.e., use washb_data to predict outcome using covariates)
task <- make_sl3_Task(
  data = washb_data,
  outcome = "whz",
  covariates = c("tr", "fracode", "month", "aged", "sex", "momage", "momedu", 
                 "momheight", "hfiacat", "Nlt18", "Ncomp", "watmin", "elec", 
                 "floor", "walls", "roof", "asset_wardrobe", "asset_table", 
                 "asset_chair", "asset_khat", "asset_chouki", "asset_tv", 
                 "asset_refrig", "asset_bike", "asset_moto", "asset_sewmach", 
                 "asset_mobile")
)
Warning in process_data(data, nodes, column_names = column_names, flag = flag, :
Missing covariate data detected: imputing covariates.

Imputation and missingness indicators

Notice the warning that appeared when we created the task above. (We muted this warning when we created the task in the previous section). This warning states that missing covariate data was detected and imputed. For each covariate column with missing values, sl3 uses the median to impute missing continuous covariates, and the mode to impute discrete (binary and categorical) covariates.

Also, for each covariate with missing values, an additional column indicating whether the value was imputed is incorporated. The so-called "missingness indicator" covariates can be helpful, as the pattern of covariate missingness might be informative for predicting the outcome.

Users are free to handle missingness in their covariate data before creating the sl3 task. In any case, we do recommend the inclusion of the missingness indicator as a covariate. Let's examine this in greater detail for completeness. It's also easier to see what's going on here by examining it with an example.

First, let's examine the missingness in the data:

# which columns have missing values, and how many observations are missing?
colSums(is.na(washb_data))
           whz             tr        fracode          month           aged 
             0              0              0              0              0 
           sex         momage         momedu      momheight        hfiacat 
             0             18              0             31              0 
         Nlt18          Ncomp         watmin           elec          floor 
             0              0              0              0              0 
         walls           roof asset_wardrobe    asset_table    asset_chair 
             0              0              0              0              0 
    asset_khat   asset_chouki       asset_tv   asset_refrig     asset_bike 
             0              0              0              0              0 
    asset_moto  asset_sewmach   asset_mobile 
             0              0              0

We can see that covariates momage and momheight have missing observations. Let's check out a few rows in the data with missing values:

some_rows_with_missingness <- which(!complete.cases(washb_data))[31:33]
# note: we chose 31:33 because missingness in momage & momheight is there
washb_data[some_rows_with_missingness, c("momage", "momheight")]
   momage momheight
1:     NA     153.2
2:     17        NA
3:     23        NA

When we called make_sl3_Task using washb_data with missing covariate values, momage and momheight were imputed with their respective medians (since they are continuous), and a missingness indicator (denoted by prefix "delta_") was added for each of them. See below:

task$data[some_rows_with_missingness,
          c("momage", "momheight", "delta_momage", "delta_momheight")]
   momage momheight delta_momage delta_momheight
1:     23     153.2            0               1
2:     17     150.6            1               0
3:     23     150.6            1               0
colSums(is.na(task$data))
             tr         fracode           month            aged             sex 
              0               0               0               0               0 
         momage          momedu       momheight         hfiacat           Nlt18 
              0               0               0               0               0 
          Ncomp          watmin            elec           floor           walls 
              0               0               0               0               0 
           roof  asset_wardrobe     asset_table     asset_chair      asset_khat 
              0               0               0               0               0 
   asset_chouki        asset_tv    asset_refrig      asset_bike      asset_moto 
              0               0               0               0               0 
  asset_sewmach    asset_mobile    delta_momage delta_momheight             whz 
              0               0               0               0               0

Indeed, we can see that washb_task$data has no missing values. The missingness indicators take a value of 0 when the observation was not in the original data and a value of 1 when the observation was in the original data.

If the data supplied to make_sl3_Task contains missing outcome values, then an error will be thrown. Missing outcomes in the data can easily be dropped when the task is created, by setting drop_missing_outcome = TRUE. In general, we do not recommend dropping missing outcomes during data pre-processing, unless the problem of interest is purely prediction. This is because complete case analyses are generally biased; it is typically unrealistic to assume the missingness is completely random and therefore unsafe to just drop the observations with missing outcomes. For instance, in the estimation of estimands that admit Targeted Minimum Loss-based Estimators (i.e., pathwise differentiable estimands, including most parameters arising in causal inference that do not violate positivity, and those reviewed in the following chapters), the missingness that should be reflected in the expression of the question of interest (e.g., what would have been the average effect of treatment with Drug A compared to standard of care under no loss to follow-up) is also incorporated in the estimation procedure. That is, the probability of loss to follow-up is a prediction function that is approximated (e.g., with SL) and incorporated that in the estimation of the target parameter and the inference / uncertainty quantification.

Character and categorical covariates

First any character covariates are converted to factors. Then all factor covariates are one-hot encoded, i.e., the levels of a factor become a set of binary indicators. For example, the factor cats and it's one-hot encoding are shown below:

cats <- c("calico", "tabby", "cow", "ragdoll", "mancoon", "dwarf", "calico")
cats <- factor(cats)
cats_onehot <- factor_to_indicators(cats)
cats_onehot
     cow dwarf mancoon ragdoll tabby
[1,]   0     0       0       0     0
[2,]   0     0       0       0     1
[3,]   1     0       0       0     0
[4,]   0     0       0       1     0
[5,]   0     0       1       0     0
[6,]   0     1       0       0     0
[7,]   0     0       0       0     0

The second value for cats was "tabby" so the second row of cats_onehot has value 1 under tabby. Every level of cats except for one is represented in the cats_onehot table. The first and last cats are "calico" so the first and last rows of cats_onehot are zero across all columns, to denote this level that does not appear explicitly in the table.

The learners in sl3 are trained to the object X in the task, or a sample of X for learners that use cross-validation. Let's check out the first six rows of our task's X object:

head(task$X)

fracode.N05265	fracode.N06531	fracode.N08002	month	aged	sex.male	momage	momedu.Primary..1.5y.	momedu.Secondary...5y.	momheight	hfiacat.Moderately.Food.Insecure	hfiacat.Severely.Food.Insecure	Nlt18	Ncomp	watmin	elec	walls	roof	asset_wardrobe	asset_table	asset_chair	asset_khat	asset_chouki	asset_tv	asset_bike	asset_mobile	delta_momage	delta_momheight
1	0	0	9	268	1	30	1	0	146.40	0	0	3	11	0	1	1	1	0	1	1	1	0	1	0	1	1	1
1	0	0	9	286	1	25	1	0	148.75	1	0	2	4	0	1	1	1	0	1	0	1	1	0	0	1	1	1
0	0	1	9	264	1	25	1	0	152.15	0	0	1	10	0	0	1	1	0	0	1	0	1	0	0	1	1	1
0	0	1	9	252	0	28	1	0	140.25	0	0	3	5	0	1	1	1	1	1	1	1	0	0	1	1	1	1
0	1	0	9	336	0	19	0	1	150.95	0	0	2	7	0	1	1	1	1	1	1	1	1	0	0	1	1	1
0	1	0	9	304	1	20	0	1	154.20	0	1	0	3	1	1	1	1	0	0	0	0	1	0	0	1	1	1

We can see that any character columns in the WASH Benefits dataset were converted to factors and all factors (tr, momedu, hfiacat and fracode) were one-hot encoded. We can also see that the missingness indicators reviewed above are the last two columns in task$X: delta_momage and delta_momage. The imputed momage and momheight are also in the task's X object.

Learner Documentation

Documentation for the learners and some of their tuning parameters can be found in the R session (e.g., to see Lrnr_glmnet's parameters, one could type "?Lrnr_glmnet" in RStudio's R console) or online at the sl3 Learners Reference. All of the learners in sl3 are simply wrappers around existing functions from other software packages in R. For example, sl3's Lrnr_xgboost is a learner in sl3 for fitting the XGBoost (eXtreme Gradient Boosting) algorithm. As described in the Lrnr_xgboost documentation, "this learner provides fitting procedures for xgboost models, using the xgboost package, via xgb.train". In general, the documentation in sl3 for a learner refers the reader to the original function and package that sl3 has wrapped a learner around. With that in mind, the sl3 learner documentation is a good first place to look up any learner, as it will show us exactly which package and function the learner is based on. However, any thorough investigation of a learner (such as a detailed explanation of all tuning parameters or how it models the data) typically involves referencing the original package. Continuing the example from above, this means that, while some information will be provided in Lrnr_xgboost documentation, such as learning that Lrnr_xgboost uses the xgboost package's xgb.train function, the deepest understanding of the XGBoost algorithm available in sl3 will come from referencing the xgboost R package and its xgb.train function.

Naming learners

Recall that our Stack from the example above had long names.

stack
 [1] "Lrnr_glm_TRUE"                          
 [2] "Lrnr_mean"                              
 [3] "Lrnr_glmnet_NULL_deviance_10_0_100_TRUE"
 [4] "Lrnr_glmnet_NULL_deviance_10_1_100_TRUE"
 [5] "Lrnr_polspline_5"                       
 [6] "Lrnr_earth_2_3_backward_0_1_0_0"        
 [7] "Lrnr_hal9001_2_1_c(3, 2)_5"             
 [8] "Lrnr_ranger_500_TRUE_none_1"            
 [9] "Lrnr_xgboost_20_1"                      
[10] "Lrnr_gam_NULL_NULL_GCV.Cp"              
[11] "Lrnr_bayesglm_TRUE"

Here, we show a few different ways for the user to name learners. The first way to name a learner is upon instantiation, as shown below:

lrn_glm <- Lrnr_glm$new(name = "GLM")

We can specify the name for any learner upon instantiating it. Above, we named the GLM learner "GLM".

Also, we can specify the names of the learners upon creation of the Stack:

learners_pretty_names <- c(
  "GLM" = lrn_glm, "Mean" = lrn_mean, "Ridge" = lrn_ridge, 
  "Lasso" = lrn_lasso, "Polspline" = lrn_polspline, "Earth" = lrn_earth, 
  "HAL" = lrn_hal, "RF" = lrn_ranger, "XGBoost" = lrn_xgb, "GAM" = lrn_gam, 
  "BayesGLM" = lrn_bayesglm
)
stack_pretty_names <- Stack$new(learners_pretty_names)
stack_pretty_names
 [1] "GLM"       "Mean"      "Ridge"     "Lasso"     "Polspline" "Earth"    
 [7] "HAL"       "RF"        "XGBoost"   "GAM"       "BayesGLM"

Defining Learners over Grid of Tuning Parameters

Customized learners can be created over a grid of tuning parameters. For highly data-adaptive learners that require careful tuning, it is oftentimes very helpful to consider different tuning parameter specifications. However, this is time consuming, so computational feasibility should be considered. Also, when the effective sample size is small, highly data-adaptive learners will likely not perform well since they typically require a lot of data to fit their models. See @rvp2022super for information on the effective sample size, and step-by-step guidelines for tailoring the SL specification to perform well for the prediction task at hand.

We show two ways to customize learners over a grid of tuning parameters. The first, "do-it-yourself" approach requires that the user or a collaborator has knowledge of the algorithm and their tuning parameters, so they can adequately specify a set of tuning parameters themselves. The second approach does not require the user to have specialized knowledge of an algorithm (although some understanding is still helpful); it uses the caret software to automatically select an "optimal" set of tuning parameters over a grid of them.

Do-it-yourself grid

Below, we show how we can create several variations of an XGBoost learner, Lrnr_xgboost, by hand. This example is just for demonstrative purposes; users should consult the documentation, and consider computational feasibility and their prediction task to specify an appropriate grid of tuning parameters for their task.

grid_params <- list(
  max_depth = c(3, 5, 8),
  eta = c(0.001, 0.1, 0.3),
  nrounds = 100
)
grid <- expand.grid(grid_params, KEEP.OUT.ATTRS = FALSE)

xgb_learners <- apply(grid, MARGIN = 1, function(tuning_params) {
  do.call(Lrnr_xgboost$new, as.list(tuning_params))
})
xgb_learners
[[1]]
[1] "Lrnr_xgboost_100_1_3_0.001"

[[2]]
[1] "Lrnr_xgboost_100_1_5_0.001"

[[3]]
[1] "Lrnr_xgboost_100_1_8_0.001"

[[4]]
[1] "Lrnr_xgboost_100_1_3_0.1"

[[5]]
[1] "Lrnr_xgboost_100_1_5_0.1"

[[6]]
[1] "Lrnr_xgboost_100_1_8_0.1"

[[7]]
[1] "Lrnr_xgboost_100_1_3_0.3"

[[8]]
[1] "Lrnr_xgboost_100_1_5_0.3"

[[9]]
[1] "Lrnr_xgboost_100_1_8_0.3"

In the example above, we considered every possible combination in the grid to create nine XGBoost learners. If we wanted to create custom names for each of these learners we could do that as well:

names(xgb_learners) <- c(
  "XGBoost_depth3_eta.001", "XGBoost_depth5_eta.001", "XGBoost_depth8_eta.001", 
  "XGBoost_depth3_eta.1", "XGBoost_depth5_eta.1", "XGBoost_depth8_eta.1", 
  "XGBoost_depth3_eta.3", "XGBoost_depth5_eta.3", "XGBoost_depth8_eta.3"
)

Automatic grid and selection with `caret`

We can use the Lrnr_caret to use the caret software. As described in the Lrnr_caret documentation, Lrnr_caret "uses the caret package's train function to automatically tune a predictive model". Below, we instantiate a neural network that will be automatically tuned with caret and we name the learner "NNET_autotune".

lrnr_nnet_autotune <- Lrnr_caret$new(method = "nnet", name = "NNET_autotune")

Learners with Interactions and `formula` Interface

As described in in @rvp2022super, if it’s known/possible that there are interactions among covariates then we can include learners that pick up on that explicitly (e.g., by including in the library a parametric regression learner with interactions specified in a formula) or implicitly (e.g., by including in the library tree-based algorithms that learn interactions empirically).

One way to define interaction terms among covariates in sl3 is with a formula. The argument exists in Lrnr_base, which is inherited by every learner in sl3; even though formula does not explicitly appear as a learner argument, it is via this inheritance. This implementation allows formula to be supplied to all learners, even those without native formula support. Below, we show how to specify a GLM learner that considers two-way interactions among all covariates.

lrnr_glm_interaction <- Lrnr_glm$new(formula = "~.^2")

As we can see from above, the general behavior of formulain R applies in sl3. See Details of formula in the stats R package for more details on this syntax (e.g,. in RStudio, type "?formula" in the Console and information will appear in the Help tab).

Screeners

One characteristic of a rich library of learners is that it is effective at handling covariates of high dimension. When there are many covariates in the data relative to the effective sample size (see Figure 1 Flowchart in @rvp2022super), candidate learners should be coupled with a range of so-called "screeners". A screener is simply a function that returns a subset of covariates. A screener is intended to be coupled with a candidate learner, to define a new candidate learner that considers the reduced set of screener-returned covariates as its covariates.

As stated in @rvp2022super, "covariate screening is essential when the dimensionality of the data is very large, and it can be practically useful in any SL or machine learning application. Screening of covariates that considers associations with the outcome must be cross validated to avoid biasing the estimate of an algorithm’s predictive performance". By including screener-learner couplings as additional candidates in the SL library, we are cross validating the screening of covariates. Covariates retained in each CV fold may vary.

A "range of screeners" is a set of screeners that exhibits varying degrees of dimension reduction and incorporates different fitting procedures (e.g., lasso-based screeners that retain covariates with non-zero coefficients, and importance-based screeners that retain the top $j$ most important covariates according to some importance metric. The current set of screeners available in sl3 is described in each part below.

We will see that, to define a screener and learner coupling in sl3, we need to create a Pipeline. A Pipeline is a set of learners to be fit sequentially, where the fit from one learner is used to define the task for the next learner.

Variable importance-based screeners

Variable importance-based screeners retain the top $j$ most important covariates according to some importance metric. This screener is provided by Lrnr_screener_importance in sl3 and the parameter $j$ (default is five) is provided by the user via the num_screen argument. The user also gets to choose the importance metric considered via the learner argument. Any learner with an importance method can be used in Lrnr_screener_importance; this currently includes the following:

sl3_list_learners(properties = "importance")
[1] "Lrnr_lightgbm"     "Lrnr_randomForest" "Lrnr_ranger"      
[4] "Lrnr_xgboost"

Let's consider screening covariates based on Lrnr_ranger variable importance ranking that selects the top ten most important covariates, according to ranger's "impurity_corrected" importance. We will couple this screener with Lrnr_glm to define a new learner that (1) selects the top ten most important covariates, according to ranger's "impurity_corrected" importance, and then (2) passes the screener-selected covariates to Lrnr_glm, so Lrnr_glm fits a model according to this reduced set of covariates. As mentioned above, this coupling establishes a new learner and requires defining a Pipeline. The Pipeline is sl3's way of going from (1) to (2).

ranger_with_importance <- Lrnr_ranger$new(importance = "impurity_corrected")
RFscreen_top10 <- Lrnr_screener_importance$new(
  learner = ranger_with_importance, num_screen = 10
)
RFscreen_top10_glm <- Pipeline$new(RFscreen_top10, lrn_glm)

We could even define the Pipeline for the entire Stack, so that every learner in it is fit to the screener-selected, reduced set of ten covariates.

RFscreen_top10_stack <- Pipeline$new(RFscreen_top10, stack)

Coefficient threshold-based screeners

Lrnr_screener_coefs provides screening of covariates based on the magnitude of their estimated coefficients in a (possibly regularized) GLM. The threshold (default = 1e-3) defines the minimum absolute size of the coefficients, and thus covariates, to be kept. Also, a max_retain argument can be optionally provided to restrict the number of selected covariates to be no more than max_retain.

Let's consider screening covariates with Lrnr_screener_coefs to select the variables with non-zero lasso regression coefficients. We will couple this screener with Lrnr_glm to define a new learner that (1) selects the covariates with non-zero lasso regression coefficients, and then (2) passes the screener-selected covariates to Lrnr_glm, so Lrnr_glm fits a model according to this reduced set of covariates. The structure is very similar to above.

lasso_screen <- Lrnr_screener_coefs$new(learner = lrn_lasso, threshold = 0)
lasso_screen_glm <- Pipeline$new(lasso_screen, lrn_glm)

We could even define the Pipeline for the entire Stack, so that every learner in it is fit to the lasso screener-selected, reduced set of covariates.

lasso_screen_stack <- Pipeline$new(lasso_screen, stack)

Correlation-based screeners

Lrnr_screener_correlation provides covariate screening procedures by running a test of correlation (Pearson default), and then selecting the (1) top ranked variables (default), or (2) the variables with a p-value lower than some user-specified threshold.

Let's consider screening covariates with Lrnr_screener_coefs. We will illustrate how to set up a pipeline with a Stack, which looks the same as previous examples. The Pipeline with a single learner also looks the same as previous examples.

# select top 10 most correlated covariates
corRank_screen <- Lrnr_screener_correlation$new(
  type = "rank", num_screen = 10
)
corRank_screen_stack <- Pipeline$new(corRank_screen, stack)

# select covariates with correlation p-value below 0.05, and a minimum of 3
corP_screen <- Lrnr_screener_correlation$new(
  type = "threshold", pvalue_threshold = 0.05, min_screen = 3
)
corP_screen_stack <- Pipeline$new(corP_screen, stack)

Augmented screeners

Augmented screeners are special in that they enforce certain covariates to always be included. That is, if a screener removes a "mandatory" covariate then Lrnr_screener_augment will reincorporate it before the learner(s) in the Pipeline are fit. An example of how to use this screener is included below. We assume aged and momage are covariates that must be kept in learner fitting.

keepme <- c("aged", "momage")
# using corRank_screen as an example, but any instantiated screener can be 
# supplied as screener.
corRank_screen_augmented <- Lrnr_screener_augment$new(
  screener = corRank_screen, default_covariates = keepme
)
corRank_screen_augmented_glm <- Pipeline$new(corRank_screen_augmented, lrn_glm)

Lrnr_screener_augment is useful when subject-matter experts feel strongly that certain covariate sets must be included, even under screening procedures.

Stack with range of screeners

Above, we mentioned that we'd like to consider a range of screeners to diversify the library. Here we show how we can create a new Stack from other learners stacks which includes learners with no screening, and learners coupled with various screeners.

screeners_stack <- Stack$new(stack, corP_screen_stack, corRank_screen_stack, 
                             lasso_screen_stack, RFscreen_top10_stack)

This screeners_stack could be inputted as learners in Lrnr_sl to define the SL that considers as candidates learners with no screening, and learners coupled with various screeners.

Advanced `sl3` Functionality {#sl3-advanced}

Variable Importance Measures

Variable importance can be interesting and informative. It can also be contradictory and confusing. Nevertheless, our collaborators tend to like it, so we created a variable importance function in sl3. The sl3 importance function returns a table with variables listed in decreasing order of importance (i.e., most important on the first row).

The measure of importance in sl3 is based on a ratio or difference of predictive performance between the SL fit with a removed or permuted covariate (or covariate grouping), and the SL fit with the observed covariate (or covariate grouping), across all of them. In this manner, the larger the ratio/difference in predictive performance, the more important the covariate (or covariate group) is in the SL prediction.

The intuition of this measure is that it calculates the predictive risk (e.g., MSE) of losing one covariate (or one group of covariates), while keeping everything else fixed, comparing this predictive risk to the one from the analytic dataset. If the ratio in predictive risks is one, or the difference is zero, then losing that covariate (group) had no impact, and it is thus not important according to this measure. This procedure is repeated across all of the covariates/groups. As stated above, we can remove each covariate (or covariate group) and refit the SL without it, or we just permute it (faster) and hope for this shuffling to distort any meaningful information that was present. This idea of permuting instead of removing saves a lot of time, and is also incorporated in randomForest variable importance measures. However, the permutation approach is more risky. The sl3 importance default is to remove each covariate and then refit. Below, we use the permute approach because it is so much faster.

Let's explore the sl3 variable importance measurements for sl_fit, the SL we fit above to the WASH Benefits example dataset. We will define a grouping of covariates based on household assets, so these covariates will be considered together in the importance.

assets <- c("asset_wardrobe", "asset_table", "asset_chair", "asset_khat",
            "asset_chouki", "asset_tv", "asset_refrig", "asset_bike", 
            "asset_moto", "asset_sewmach", "asset_mobile", "Nlt18", "Ncomp", 
            "watmin", "elec", "floor", "walls", "roof")
set.seed(983)
washb_varimp <- importance(
  fit = sl_fit, eval_fun = loss_squared_error, type = "permute", 
  covariate_groups = list("assets" = assets)
)

washb_varimp

covariate_group	MSE_difference
aged	0.0337
assets	0.0315
momedu	0.0109
month	0.0092
tr	0.0059
momheight	0.0019
hfiacat	0.0007
sex	0.0005
momage	0.0002
fracode	0.0001
delta_momheight	0.0000
delta_momage	0.0000

# plot variable importance
importance_plot(x = washb_varimp)

Conditional density estimation

In certain scenarios it may be useful to estimate the conditional density of a dependent variable, given predictors/covariates that precede it. In the context of causal inference, this arises most readily when working with continuous-valued treatments. Specifically, conditional density estimation (CDE) is necessary when estimating the treatment mechanism for a continuous-valued treatment, often called the generalized propensity score. Compared the classical propensity score (PS) for binary treatments (the conditional probability of receiving the treatment given covariates), $\mathbb{P}(A = 1 \mid W)$, the generalized PS is the conditional density of treatment $A$, given covariates $W$, $\mathbb{P}(A \mid W)$.

CDE often requires specialized approaches tied to very specific algorithmic implementations. To our knowledge, general and flexible algorithms for CDE have been proposed only sparsely in the literature. We have implemented two such approaches in sl3: a semiparametric CDE approach that makes certain assumptions about the constancy of (higher) moments of the underlying distribution, and second approach that exploits the relationship between the conditional hazard and density functions to allow CDE via pooled hazard regression. Both approaches are flexible in that they allow the use of arbitrary regression functions or machine learning algorithms for the estimation of nuisance quantities (the conditional mean or the conditional hazard, respectively). We elaborate on these two frameworks below. Importantly, per @dudoit2005asymptotics and related works, a loss function appropriate for density estimation is the negative log-density loss $L(\cdot) = -\log(p_n(\cdot))$.

Moment-restricted location-scale

This family of semiparametric CDE approaches exploits the general form $\rho(Y - \mu(X) / \sigma(X))$, where $Y$ is the dependent variable of interest (e.g., treatment $A$ in the PS), $X$ are the predictors (e.g., covariates $W$ in the PS), \rho$ is a specified marginal density function, and $\mu(X) = \E(Y \mid X)$ and $\sigma(X) = \E[(Y - \mu(X))^2 \mid X]$ are nuisance functions of the dependent variable that may be estimated flexibly. CDE procedures formulated within this framework may be characterized as belonging to a conditional location-scale family, that is, in which $p_n(Y \mid X) = \rho((Y - \mu_n(X)) / \sigma_n(X))$. While CDE with conditional location-scale families is not without potential disadvantages (e.g., the restriction on the density's functional form could lead to misspecification bias), this strategy is flexible in that it allows for arbitrary machine learning algorithms to be used in estimating the conditional mean of $Y$ given $X$, \mu(X) = \E(Y \mid X)$, and the conditional variance of $Y$ given $X$, $\sigma(X) = \E[(Y - \mu(X))^2 \mid X]$.

In settings with limited data, the additional structure imposed by the assumption that the target density belongs to a location-scale family may prove advantageous by smoothing over areas of low support in the data. However, in practice, it is impossible to know whether and when this assumption holds. This procedure is not a novel contribution of our own (and we have been unable to locate a formal description of it in the literature); nevertheless, we provide an informal algorithm sketch below. This algorithm considers access to $n$ independendent and identically distributed (i.i.d.) copies of an observed data random variable $O = (Y, X)$, an a priori-specified kernel function $\rho$, a candidate regression procedure $f_{\mu}$ to estimate $\mu(X)$, and a candidate regression procedure $f_{\sigma}$ to estimate $\sigma(X)$.

Estimate $\mu(X) = \E[Y \mid X]$, the conditional mean of $Y$ given $X$, by applying the regression estimator $f_{\mu}$, yielding $\hat{\mu}(X)$.
Estimate $\sigma(X) = \mathbb{V}[Y \mid X]$, the conditional variance of $Y$ given $X$, by applying the regression estimator $f_{\sigma}$, yielding $\hat{\sigma}^2(X)$. Note that this step involves only estimation of the conditional mean $\E[(Y - \hat{\mu}(X))^2 \mid X]$.
Estimate the one-dimensional density of $(Y - \hat{\mu}(X))^2 / \hat{\sigma}^2(X)$, using kernel smoothing to obtain $\hat{\rho}(Y)$.
Construct the estimated conditional density $p_n(Y \mid X) = \hat{\rho}((Y
- \hat{\mu}(X)) / \hat{\sigma}(X))$.

This algorithm sketch encompasses two forms of this CDE approach, which diverge at the second step above. To simplify the approach, one may elect to estimate only the conditional mean $\mu(X)$, leaving the conditional variance to be assumed constant (i.e., estimated simply as the marginal mean of the residuals $\E[(Y - \hat{\mu}(X))^2]$). This subclass of CDE approaches have homoscedastic error based on the variance assumption made. The conditional variance can instead by estimated as the conditional mean of the residuals $(Y - \hat{\mu}(X))^2$ given $X$, $\E[(Y - \hat{\mu}(X))^2 \mid X]$, where the candidate algorithm $f_{\sigma}$ is used to evaluate the expectation. Both approaches have been implemented in sl3, in the learner Lrnr_density_semiparametric. The mean_learner argument specifies $f_{\mu}$ and the optional var_learner argument specifies $f_{\sigma}$. We demonstrate CDE with this approach below.

# semiparametric density estimator with homoscedastic errors (HOSE)
hose_hal_lrnr <- Lrnr_density_semiparametric$new(
  mean_learner = Lrnr_hal9001$new()
)
# semiparametric density estimator with heteroscedastic errors (HESE)
hese_rf_glm_lrnr <- Lrnr_density_semiparametric$new(
  mean_learner = Lrnr_ranger$new()
  var_learner = Lrnr_glm$new()
)

# SL for the conditional treatment density
sl_dens_lrnr <- Lrnr_sl$new(
  learners = list(hose_hal_lrnr, hese_rf_glm_lrnr),
  metalearner = Lrnr_solnp_density$new()
)

Pooled hazard regression

Another approach for CDE available in sl3, and originally proposed in @diaz2011super, leverages the relationship between the (conditional) hazard and density functions. To develop their CDE framework, @diaz2011super proposed discretizing a continuous dependent variable $Y$ with support $\mathcal{Y}$ based on a number of bins $T$ and a binning procedure (e.g., cutting $\mathcal{Y}$ into $T$ bins of exactly the same length). The tuning parameter $T$ conceptually corresponds to the choice of bandwidth in classical kernel density estimation. Following discretization, each unit is represented by a collection of records, and the number of records representing a given unit depends on the rank of the bin (along the discretized support) into which the unit falls.

To take an example, an instantiation of this procedure might divide the support of $Y$ into, say, $T = 4$, bins of equal length (note this requires $T+1$ cut points): $[\alpha_1, \alpha_2), [\alpha_2, \alpha_3), [\alpha_3, \alpha_4), [\alpha_4, \alpha_5]$ (n.b., the rightmost interval is fully closed while the others are only partially closed). Next, an artificial, repeated measures dataset would be created in which each unit would be represented by up to $T$ records. To better see this structure, consider an individual unit $O_i = (Y_i, X_i)$ whose $Y_i$ value is within $[\alpha_3, \alpha_4)$, the third bin. This unit would be represented by three distinct records: ${Y_{ij}, X_{ij}}{j=1}^3$, where ${{Y{ij} = 0}{j=1}^2$, $Y{i3} = 1}$
and three exact copies of $X_i$, ${X_{ij}}_{j=1}^3$. This representation in terms of multiple records for the same unit allows for the conditional hazard probability of $Y_i$ falling in a given bin along the discretized support to be evaluated via standard binary regression techniques.

In fact, this proposal reformulates the binary regression problem into a corresponding set of hazard regressions: $\mathbb{P} (Y \in [\alpha_{t-1}, \alpha_t) \mid X) = \mathbb{P} (Y \in [\alpha_{t-1}, \alpha_t) \mid Y \geq \alpha_{t-1}, X) \times \prod_{j = 1}^{t -1} {1 - \mathbb{P} (Y \in [\alpha_{j-1}, \alpha_j) \mid Y \geq \alpha_{j-1}, X) }$. Here, the probability of $Y \in \mathcal{Y}$ falling in bin $[\alpha_{t-1}, \alpha_t)$ may be directly estimated via a binary regression procedure, by re-expressing the corresponding likelihood in terms of the likelihood of a binary variable in a dataset with this repeated measures structure. Finally, the hazard estimates can be mapped into density estimates by re-scaling the hazard estimates by the bin sizes $\lvert \alpha_t - \alpha_{t-1} \rvert$, that is, $p_{n, \alpha}(Y \mid X) = \mathbb{P}(Y \in [\alpha_{t-1}, \alpha_t) \mid X) / \lvert \alpha_t - \alpha_{t-1} \rvert$, for $\alpha_{t-1} \leq a < \alpha_t$. We provide an informal sketch of this algorithm below.

Apply a procedure to divide the observed support of $Y$, $\max(Y) - \min(Y)$, into $T$ bins: $[\alpha_1, \alpha_2), \ldots, [\alpha_{t-1}, \alpha_t), [\alpha_t, \alpha_{t+1}]$.
Expand the observed data into a repeated measures data structure, expressing each individual observation as a set of up to $T$ records, recording the observation ID alongside each such record. For a single unit $i$, the set of records takes the form ${Y_{ij}, X_{ij}}{j=1}^{T_i}$, where $X{ij}$ are constant in the index set $\mathcal{J}$, $Y_{ij}$ is a binary counting process that jumps from $0$ to $1$ at its final index (at the bin into which $Y_i$ falls), and $T_i \leq T$ indicates the bin along its support into which $Y_i$ falls.
Estimate the hazard probability, conditional on $X$, of bin membership $\mathbb{P}(Y_i \in [\alpha_{t-1}, \alpha_t) \mid X)$ using any binary regression estimator or appropriate machine learning algorithm.
Rescale the conditional hazard probability estimates to the conditional density scale by dividing the cumulative hazard by the width of the bin into which $X_i$ falls, for each observation $i = 1, \ldots, n$. If the support set is partitioned into bins of equal size (approximately $n/T$ samples in each bin), this amounts to rescaling by a constant. If the support set is partitioned into bins of equal range, then the rescaling might vary across bins.

A key element of this proposal is the flexibility to use any binary regression procedure or appropriate machine learning algorithm to estimate $\prob(Y \in [\alpha_{t-1}, \alpha_t) \mid X)$, facilitating the incorporation of flexible techniques like ensemble learning [@breiman1996stacked; @vdl2007super]. This extreme degree of flexibility integrates perfectly with the underlying design principles of sl3; however, we have not yet implemented this approach in its full generality. A version of this CDE approach, which limits the original proposal by replacing the use of arbitrary binary regression with the highly adaptive lasso (HAL) algorithm [@benkeser2016hal] is supported in the haldensify package [@hejazi2020haldensify] (the HAL implementation in haldensify is provided the [hal9001 package](https://github.com tlverse/hal9001) [@coyle2020hal9001; @hejazi2020hal9001]). This CDE algorithm that uses haldensify is incorporated as learner Lrnr_haldensify in sl3, as we demonstrate below.

# learners used for conditional densities for (g_n)
haldensify_lrnr <- Lrnr_haldensify$new(
  n_bins = c(5, 10)
)

Exercises {#sl3-exercises}

Predicting Myocardial Infarction with `sl3` {#sl3ex1}

Follow the steps below to predict myocardial infarction (mi) using the available covariate data. We thank Prof. David Benkeser at Emory University for making the this Cardiovascular Health Study (CHS) data accessible.

# load the data set
library(readr)
db_data <- url(
  paste0(
    "https://raw.githubusercontent.com/benkeser/sllecture/master/",
    "chspred.csv"
  )
)
chspred <- read_csv(file = db_data, col_names = TRUE)

Let's take a quick peek at the data:

waist	alcoh	hdl	smoke	ace	ldl	bmi	aspirin	gend	age	glu	ins	cysgfr	dm	fetuina	whr	hsed	race	logcystat	logtrig	logcrp	logcre	health	logkcal	sysbp
110.164	0.0000	66.497	0	1	114.216	27.997	0	0	73.518	159.931	70.3343	75.008	1	0.17516	1.16898	1	1	-0.34202	5.4063	2.01260	-0.67385	0	4.3926	177.135
89.976	0.0000	50.065	0	0	103.777	20.893	0	0	61.772	153.389	33.9695	82.743	1	0.57165	0.90114	0	0	-0.08465	4.8592	3.29328	-0.55509	1	6.2071	136.374
106.194	8.4174	40.506	0	0	165.716	28.455	1	1	72.931	121.715	-17.3017	74.699	0	0.35168	1.17971	0	1	-0.44511	4.5088	0.30132	-0.01152	0	6.7320	135.199
90.057	0.0000	36.175	0	0	45.203	23.961	0	0	79.119	53.969	11.7315	95.782	0	0.54391	1.13599	0	0	-0.48072	5.1832	3.02426	-0.57507	1	7.3972	139.018
78.614	2.9790	71.064	1	0	131.312	10.966	0	1	69.018	94.315	9.7112	72.711	0	0.49159	1.10276	1	0	0.31206	4.2190	-0.70568	0.00534	1	8.2779	88.047
91.659	0.0000	59.496	0	0	171.187	29.132	0	1	81.835	212.907	-28.2269	69.218	1	0.46215	0.95291	1	0	-0.28716	5.1773	0.97046	0.21268	1	5.9942	69.594

head(chspred)

Create an sl3 task, setting myocardial infarction mi as the outcome and using all available covariate data.
Make a library of seven relatively fast base learning algorithms (i.e., do not consider BART or HAL). Customize tuning parameters for one of your learners. Incorporate at least one screener-learner coupling.
Make the SL and train it on the task.
Print the SL fit results by adding $cv_risk(loss_squared_error) to your fit object.

Concluding Remarks

Super Learner (SL) is a general approach that can be applied to a diversity of estimation and prediction problems which can be defined by a loss function.
It would be straightforward to plug in the estimator returned by SL into the target parameter mapping.
- For example, suppose we are after the average treatment effect (ATE) of a binary treatment intervention: $\Psi_0 = E_{0,W}[E_0(Y|A=1,W) - E_0(Y|A=0,W)]$.
- We could use the SL that was trained on the original data (let's call this sl_fit) to predict the outcome for all subjects under each intervention. All we would need to do is take the average difference between the counterfactual outcomes under each intervention of interest.
- Considering $\Psi_0$ above, we would first need two $n$-length vectors of predicted outcomes under each intervention. One vector would represent the predicted outcomes under an intervention that sets all subjects to receive $A=1$, $Y_i|A_i=1,W_i$ for all $i=1,\ldots,n$. The other vector would represent the predicted outcomes under an intervention that sets all subjects to receive $A=0$, $Y_i|A_i=0,W_i$ for all $i=1,\ldots,n$.
- After obtaining these vectors of counterfactual predicted outcomes, all we would need to do is average and then take the difference in order to "plug-in" the SL estimator into the target parameter mapping.
- In sl3 and with our current ATE example, this could be achieved with mean(sl_fit$predict(A1_task)) - mean(sl_fit$predict(A0_task)); where A1_task$data would contain all 1's (or the level that pertains to receiving the treatment) for the treatment column in the data (keeping all else the same), and A0_task$data would contain all 0's (or the level that pertains to not receiving the treatment) for the treatment column in the data.
It's a worthwhile exercise to obtain the predicted counterfactual outcomes and create these counterfactual sl3 tasks. It's too biased; however, to plug the SL fit into the target parameter mapping, (e.g., calling the result of mean(sl_fit$predict(A1_task)) - mean(sl_fit$predict(A0_task)) the estimated ATE. We would end up with an estimator for the ATE that was optimized for estimation of the prediction function, and not the ATE!
At the end of the "analysis day", we want an estimator that is optimized for our target estimand of interest. We ultimately care about doing a good job estimating $\psi_0$. The SL is an essential step to help us get there. In fact, we will use the counterfactual predicted outcomes that were explained at length above. However, SL is not the end of the estimation procedure. Specifically, the Super Learner would not be an asymptotically linear estimator of the target estimand; and it is not an efficient substitution estimator. This begs the question, why is it so important for an estimator to possess these properties?
- An asymptotically linear estimator converges to the estimand a $\frac{1}{\sqrt{n}}$ rate, thereby permitting formal statistical inference (i.e., confidence intervals and $p$-values) [ADD REF].
- Substitution, or plug-in, estimators of the estimand are desirable because they respect both the local and global constraints of the statistical model (e.g., bounds), and have they have better finite-sample properties[ADD REF].
- An efficient estimator is optimal in the sense that it has the lowest possible variance, and is thus the most precise. An estimator is efficient if and only if is asymptotically linear with influence curve equal to the canonical gradient [ADD REF].
  - The canonical gradient is a mathematical object that is specific to the target estimand, and it provides information on the level of difficulty of the estimation problem [ADD REF]. Various canonical gradient are shown in the chapters that follow.
  - Practitioner's do not need to know how to calculate a canonical gradient in order to understand efficiency and use Targeted Maximum Likelihood Estimation (TMLE). Metaphorically, you do not need to be Yoda in order to be a Jedi.
TMLE is a general strategy that succeeds in constructing efficient and asymptotically linear plug-in estimators.
SL is fantastic for pure prediction, and for obtaining an initial estimate in the first step of TMLE, but we need the second step of TMLE to have the desirable statistical properties mentioned above.
In the chapters that follow, we focus on the targeted maximum likelihood estimator and the targeted minimum loss-based estimator, both referred to as TMLE.

Appendix

Exercise 1 Solution {#sl3ex1-sol}

Here is a potential solution to the sl3 Exercise 1 -- Predicting Myocardial Infarction with sl3.

db_data <- url(
  "https://raw.githubusercontent.com/benkeser/sllecture/master/chspred.csv"
)
chspred <- read_csv(file = db_data, col_names = TRUE)
data.table::setDT(chspred)

# make task
chspred_task <- make_sl3_Task(
  data = chspred,
  covariates = colnames(chspred)[-1],
  outcome = "mi"
)

# make learners
glm_learner <- Lrnr_glm$new()
lasso_learner <- Lrnr_glmnet$new(alpha = 1)
ridge_learner <- Lrnr_glmnet$new(alpha = 0)
enet_learner <- Lrnr_glmnet$new(alpha = 0.5)
# curated_glm_learner uses formula = "mi ~ smoke + beta"
curated_glm_learner <- Lrnr_glm_fast$new(covariates = c("smoke", "beta"))
mean_learner <- Lrnr_mean$new() # That is one mean learner!
glm_fast_learner <- Lrnr_glm_fast$new()
ranger_learner <- Lrnr_ranger$new()
svm_learner <- Lrnr_svm$new()
xgb_learner <- Lrnr_xgboost$new()

# screening
screen_cor <- make_learner(Lrnr_screener_correlation)
glm_pipeline <- make_learner(Pipeline, screen_cor, glm_learner)

# stack learners together
stack <- make_learner(
  Stack,
  glm_pipeline, glm_learner,
  lasso_learner, ridge_learner, enet_learner,
  curated_glm_learner, mean_learner, glm_fast_learner,
  ranger_learner, svm_learner, xgb_learner
)

# make and train SL
sl <- Lrnr_sl$new(
  learners = stack
)
sl_fit <- sl$train(chspred_task)
sl_fit$cv_risk(loss_squared_error)

Files

04-sl3.md

Latest commit

History