finalze, add tests

easystats · Aug 18, 2024 · 0050749 · 0050749
1 parent a9dcaf3
commit 0050749
Show file tree

Hide file tree

Showing 4 changed files with 139 additions and 49 deletions.
diff --git a/NEWS.md b/NEWS.md
@@ -1,3 +1,10 @@
+# datawizard 0.12.3
+
+CHANGES
+
+* `demean()` (and `degroup()`) now also work for nested designs, if argument
+  `nested = TRUE` and  `by` specifies more than one variable.
+
 # datawizard 0.12.2
 
 * Remove `htmltools` from `Suggests` in an attempt of fixing an error in CRAN
@@ -73,8 +80,8 @@ BREAKING CHANGES
 
 * The following arguments were deprecated in 0.5.0 and are now removed:
 
-  * in `data_to_wide()`: `colnames_from`, `rows_from`, `sep` 
-  * in `data_to_long()`: `colnames_to` 
+  * in `data_to_wide()`: `colnames_from`, `rows_from`, `sep`
+  * in `data_to_long()`: `colnames_to`
   * in `data_partition()`: `training_proportion`
 
 NEW FUNCTIONS
@@ -93,7 +100,7 @@ CHANGES
   argument, to compute weighted frequency tables. `include_na` allows to include
   or omit missing values from the table. Furthermore, a `by` argument was added,
   to compute crosstables (#479, #481).
-  
+
 # datawizard 0.9.1
 
 CHANGES
@@ -144,7 +151,7 @@ CHANGES
 
 * `unnormalize()` and `unstandardize()` now work with grouped data (#415).
 
-* `unnormalize()` now errors instead of emitting a warning if it doesn't have the 
+* `unnormalize()` now errors instead of emitting a warning if it doesn't have the
   necessary info (#415).
 
 BUG FIXES
@@ -167,7 +174,7 @@ BUG FIXES
 
 * Fixed issue in `data_filter()` where functions containing a `=` (e.g. when
   naming arguments, like `grepl(pattern, x = a)`) were mistakenly seen as
-  faulty syntax. 
+  faulty syntax.
 
 * Fixed issue in `empty_column()` for strings with invalid multibyte strings.
   For such data frames or files, `empty_column()` or `data_read()` no longer
@@ -204,14 +211,14 @@ CHANGES
 
 NEW FUNCTIONS
 
-* `rowid_as_column()` to complement `rownames_as_column()` (and to mimic 
-  `tibble::rowid_to_column()`). Note that its behavior is different from 
+* `rowid_as_column()` to complement `rownames_as_column()` (and to mimic
+  `tibble::rowid_to_column()`). Note that its behavior is different from
   `tibble::rowid_to_column()` for grouped data. See the Details section in the
   docs.
 
 * `data_unite()`, to merge values of multiple variables into one new variable.
 
-* `data_separate()`, as counterpart to `data_unite()`, to separate a single 
+* `data_separate()`, as counterpart to `data_unite()`, to separate a single
   variable into multiple new variables.
 
 * `data_modify()`, to create new variables, or modify or remove existing
@@ -234,7 +241,7 @@ BUG FIXES
 
 * `center()` and `standardize()` did not work for grouped data frames (of class
   `grouped_df`) when `force = TRUE`.
-  
+
 * The `data.frame` method of `describe_distribution()` returns `NULL` instead of
   an error if no valid variable were passed (for example a factor variable with
   `include_factors = FALSE`) (#421).
@@ -262,12 +269,12 @@ BUG FIXES
 
 # datawizard 0.7.0
 
-BREAKING CHANGES 
+BREAKING CHANGES
 
 * In selection patterns, expressions like `-var1:var3` to exclude all variables
   between `var1` and `var3` are no longer accepted. The correct expression is
   `-(var1:var3)`. This is for 2 reasons:
-  
+
   * to be consistent with the behavior for numerics (`-1:2` is not accepted but
     `-(1:2)` is);
   * to be consistent with `dplyr::select()`, which throws a warning and only
@@ -279,8 +286,8 @@ NEW FUNCTIONS
   or more variables into a new variable.
 
 * `mean_sd()` and `median_mad()` for summarizing vectors to their mean (or
-  median) and a range of one SD (or MAD) above and below.  
-  
+  median) and a range of one SD (or MAD) above and below.
+
 * `data_write()` as counterpart to `data_read()`, to write data frames into
   CSV, SPSS, SAS, Stata files and many other file types. One advantage over
   existing functions to write data in other packages is that labelled (numeric)
@@ -296,8 +303,8 @@ MINOR CHANGES
 
 * `data_rename()` gets a `verbose` argument.
 * `winsorize()` now errors if the threshold is incorrect (previously, it provided
-  a warning and returned the unchanged data). The argument `verbose` is now 
-  useless but is kept for backward compatibility. The documentation now contains   
+  a warning and returned the unchanged data). The argument `verbose` is now
+  useless but is kept for backward compatibility. The documentation now contains
   details about the valid values for `threshold` (#357).
 * In all functions that have arguments `select` and/or `exclude`, there is now
   one warning per misspelled variable. The previous behavior was to have only one
@@ -318,7 +325,7 @@ BUG FIXES
 * Fix unexpected warning in `convert_na_to()` when `select` is a list (#352).
 * Fixed issue with correct labelling of numeric variables with more than nine
   unique values and associated value labels.
-  
+
 
 # datawizard 0.6.5
 
@@ -350,7 +357,7 @@ NEW FUNCTIONS
 * `data_codebook()`: to generate codebooks of data frames.
 
 * New functions to deal with duplicates: `data_duplicated()` (keep all duplicates,
-  including the first occurrence) and `data_unique()` (returns the data, excluding 
+  including the first occurrence) and `data_unique()` (returns the data, excluding
   all duplicates except one instance of each, based on the selected method).
 
 MINOR CHANGES
@@ -360,15 +367,15 @@ MINOR CHANGES
 * The `include_bounds` argument in `normalize()` can now also be a numeric
   value, defining the limit to the upper and lower bound (i.e. the distance
   to 1 and 0).
-  
-* `data_filter()` now works with grouped data. 
+
+* `data_filter()` now works with grouped data.
 
 BUG FIXES
 
 * `data_read()` no longer prints message for empty columns when the data
   actually had no empty columns.
-  
- * `data_to_wide()` now drops columns that are not in `id_cols` (if specified), 
+
+ * `data_to_wide()` now drops columns that are not in `id_cols` (if specified),
   `names_from`, or `values_from`. This is the behaviour observed in `tidyr::pivot_wider()`.
 
 # datawizard 0.6.3
@@ -800,4 +807,3 @@ NEW FUNCTIONS
 # datawizard 0.1.0
 
 * First release.
-
diff --git a/R/demean.R b/R/demean.R
@@ -11,7 +11,7 @@
 #' @param x A data frame.
 #' @param select Character vector (or formula) with names of variables to select
 #'   that should be group- and de-meaned.
-#' @param by Character vector (or formula) with the name of the variable(s) that
+#' @param by Character vector (or formula) with the name of the variable that
 #'   indicates the group- or cluster-ID. For cross-classified or nested designs,
 #'   `by` can also identify two or more variables as group- or cluster-IDs. If
 #'   the data is nested and should be treated as such, set `nested = TRUE`. Else,
@@ -20,7 +20,8 @@
 #'
 #'   For nested designs, `by` can be:
 #'   - a character vector with the name of the variable that indicates the
-#'     levels, ordered from *highest* level to *lowest* (e.g. `by = c("L3", "L2"`).
+#'     levels, ordered from *highest* level to *lowest* (e.g.
+#'     `by = c("L4", "L3", "L2")`.
 #'   - a character vector with variable names in the format `by = "L4/L3/L2"`,
 #'     where the levels are separated by `/`.
 #'
@@ -47,7 +48,10 @@
 #' @return
 #' A data frame with the group-/de-meaned variables, which get the suffix
 #' `"_between"` (for the group-meaned variable) and `"_within"` (for the
-#' de-meaned variable) by default.
+#' de-meaned variable) by default. For cross-classified or nested designs,
+#' the name pattern of the group-meaned variables is the name of the centered
+#' variable followed by the name of the variable that indicates the related
+#' grouping level, e.g. `predictor_L3_between` and `predictor_L2_between`.
 #'
 #' @seealso If grand-mean centering (instead of centering within-clusters)
 #'   is required, see [`center()`]. See [`performance::check_heterogeneity_bias()`]
@@ -178,19 +182,30 @@
 #'
 #' @section De-meaning for cross-classified designs:
 #'
-#' `demean()` can also handle cross-classified designs, where the data has two
-#' or more groups at the higher (i.e. second) level. In such cases, the
+#' `demean()` can handle cross-classified designs, where the data has two or
+#' more groups at the higher (i.e. second) level. In such cases, the
 #' `by`-argument can identify two or more variables that represent the
 #'  cross-classified group- or cluster-IDs. The de-meaned variables for
 #' cross-classified designs are simply subtracting all group means from each
 #' individual value, i.e. _fully cluster-mean-centering_ (see _Guo et al. 2024_
 #' for details). Note that de-meaning for cross-classified designs is *not*
 #' equivalent to de-meaning of nested data structures from models with three or
-#' more levels. Set `nested = TRUE` to explicitly assume a nested design. for
+#' more levels. Set `nested = TRUE` to explicitly assume a nested design. For
 #' cross-classified designs, de-meaning is supposed to work for models like
 #' `y ~ x + (1|level3) + (1|level2)`, but *not* for models like
 #' `y ~ x + (1|level3/level2)`.
 #'
+#' @section De-meaning for nested designs:
+#'
+#' _Brincks et al. (2017)_ have suggested an algorithm to center variables for
+#' nested designs, which is implememented in `demean()`. For nested designs,
+#' set `nested = TRUE` *and* specify the variables that indicate the different
+#' levels in descending order in the `by` argument. E.g.,
+#' `by = c("level4", "level3, "level2")` assumes a model like
+#' `y ~ x + (1|level4/level3/level2)`. An alternative notation for the
+#' `by`-argument would be `by = c("level4/level3/level2")`, similar to the
+#' formula notation.
+#'
 #' @section Analysing panel data with mixed models using lme4:
 #'
 #' A description of how to translate the formulas described in *Bell et al. 2018*
@@ -200,35 +215,40 @@
 #' @references
 #'
 #'   - Bafumi J, Gelman A. 2006. Fitting Multilevel Models When Predictors
-#'   and Group Effects Correlate. In. Philadelphia, PA: Annual meeting of the
-#'   American Political Science Association.
+#'     and Group Effects Correlate. In. Philadelphia, PA: Annual meeting of the
+#'     American Political Science Association.
 #'
 #'   - Bell A, Fairbrother M, Jones K. 2019. Fixed and Random Effects
-#'   Models: Making an Informed Choice. Quality & Quantity (53); 1051-1074
+#'     Models: Making an Informed Choice. Quality & Quantity (53); 1051-1074
 #'
 #'   - Bell A, Jones K. 2015. Explaining Fixed Effects: Random Effects
-#'   Modeling of Time-Series Cross-Sectional and Panel Data. Political Science
-#'   Research and Methods, 3(1), 133–153.
+#'     Modeling of Time-Series Cross-Sectional and Panel Data. Political Science
+#'     Research and Methods, 3(1), 133–153.
+#'
+#'   - Brincks, A. M., Enders, C. K., Llabre, M. M., Bulotsky-Shearer, R. J.,
+#'     Prado, G., and Feaster, D. J. (2017). Centering Predictor Variables in
+#'     Three-Level Contextual Models. Multivariate Behavioral Research, 52(2),
+#'     149–163. https://doi.org/10.1080/00273171.2016.1256753
 #'
 #'   - Gelman A, Hill J. 2007. Data Analysis Using Regression and
-#'   Multilevel/Hierarchical Models. Analytical Methods for Social Research.
-#'   Cambridge, New York: Cambridge University Press
+#'     Multilevel/Hierarchical Models. Analytical Methods for Social Research.
+#'     Cambridge, New York: Cambridge University Press
 #'
 #'   - Giesselmann M, Schmidt-Catran, AW. 2020. Interactions in fixed
-#'   effects regression models. Sociological Methods & Research, 1–28.
-#'   https://doi.org/10.1177/0049124120914934
+#'     effects regression models. Sociological Methods & Research, 1–28.
+#'     https://doi.org/10.1177/0049124120914934
 #'
 #'   - Guo Y, Dhaliwal J, Rights JD. 2024. Disaggregating level-specific effects
-#'   in cross-classified multilevel models. Behavior Research Methods, 56(4),
-#'   3023–3057.
+#'     in cross-classified multilevel models. Behavior Research Methods, 56(4),
+#'     3023–3057.
 #'
 #'   - Heisig JP, Schaeffer M, Giesecke J. 2017. The Costs of Simplicity:
-#'   Why Multilevel Models May Benefit from Accounting for Cross-Cluster
-#'   Differences in the Effects of Controls. American Sociological Review 82
-#'   (4): 796–827.
+#'     Why Multilevel Models May Benefit from Accounting for Cross-Cluster
+#'     Differences in the Effects of Controls. American Sociological Review 82
+#'     (4): 796–827.
 #'
 #'   - Hoffman L. 2015. Longitudinal analysis: modeling within-person
-#'   fluctuation and change. New York: Routledge
+#'     fluctuation and change. New York: Routledge
 #'
 #' @examples
 #'

diff --git a/man/demean.Rd b/man/demean.Rd