Skip to content

Commit

Permalink
finalze, add tests
Browse files Browse the repository at this point in the history
  • Loading branch information
strengejacke committed Aug 18, 2024
1 parent a9dcaf3 commit 0050749
Show file tree
Hide file tree
Showing 4 changed files with 139 additions and 49 deletions.
50 changes: 28 additions & 22 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,10 @@
# datawizard 0.12.3

CHANGES

* `demean()` (and `degroup()`) now also work for nested designs, if argument
`nested = TRUE` and `by` specifies more than one variable.

# datawizard 0.12.2

* Remove `htmltools` from `Suggests` in an attempt of fixing an error in CRAN
Expand Down Expand Up @@ -73,8 +80,8 @@ BREAKING CHANGES

* The following arguments were deprecated in 0.5.0 and are now removed:

* in `data_to_wide()`: `colnames_from`, `rows_from`, `sep`
* in `data_to_long()`: `colnames_to`
* in `data_to_wide()`: `colnames_from`, `rows_from`, `sep`
* in `data_to_long()`: `colnames_to`
* in `data_partition()`: `training_proportion`

NEW FUNCTIONS
Expand All @@ -93,7 +100,7 @@ CHANGES
argument, to compute weighted frequency tables. `include_na` allows to include
or omit missing values from the table. Furthermore, a `by` argument was added,
to compute crosstables (#479, #481).

# datawizard 0.9.1

CHANGES
Expand Down Expand Up @@ -144,7 +151,7 @@ CHANGES

* `unnormalize()` and `unstandardize()` now work with grouped data (#415).

* `unnormalize()` now errors instead of emitting a warning if it doesn't have the
* `unnormalize()` now errors instead of emitting a warning if it doesn't have the
necessary info (#415).

BUG FIXES
Expand All @@ -167,7 +174,7 @@ BUG FIXES

* Fixed issue in `data_filter()` where functions containing a `=` (e.g. when
naming arguments, like `grepl(pattern, x = a)`) were mistakenly seen as
faulty syntax.
faulty syntax.

* Fixed issue in `empty_column()` for strings with invalid multibyte strings.
For such data frames or files, `empty_column()` or `data_read()` no longer
Expand Down Expand Up @@ -204,14 +211,14 @@ CHANGES

NEW FUNCTIONS

* `rowid_as_column()` to complement `rownames_as_column()` (and to mimic
`tibble::rowid_to_column()`). Note that its behavior is different from
* `rowid_as_column()` to complement `rownames_as_column()` (and to mimic
`tibble::rowid_to_column()`). Note that its behavior is different from
`tibble::rowid_to_column()` for grouped data. See the Details section in the
docs.

* `data_unite()`, to merge values of multiple variables into one new variable.

* `data_separate()`, as counterpart to `data_unite()`, to separate a single
* `data_separate()`, as counterpart to `data_unite()`, to separate a single
variable into multiple new variables.

* `data_modify()`, to create new variables, or modify or remove existing
Expand All @@ -234,7 +241,7 @@ BUG FIXES

* `center()` and `standardize()` did not work for grouped data frames (of class
`grouped_df`) when `force = TRUE`.

* The `data.frame` method of `describe_distribution()` returns `NULL` instead of
an error if no valid variable were passed (for example a factor variable with
`include_factors = FALSE`) (#421).
Expand Down Expand Up @@ -262,12 +269,12 @@ BUG FIXES

# datawizard 0.7.0

BREAKING CHANGES
BREAKING CHANGES

* In selection patterns, expressions like `-var1:var3` to exclude all variables
between `var1` and `var3` are no longer accepted. The correct expression is
`-(var1:var3)`. This is for 2 reasons:

* to be consistent with the behavior for numerics (`-1:2` is not accepted but
`-(1:2)` is);
* to be consistent with `dplyr::select()`, which throws a warning and only
Expand All @@ -279,8 +286,8 @@ NEW FUNCTIONS
or more variables into a new variable.

* `mean_sd()` and `median_mad()` for summarizing vectors to their mean (or
median) and a range of one SD (or MAD) above and below.
median) and a range of one SD (or MAD) above and below.

* `data_write()` as counterpart to `data_read()`, to write data frames into
CSV, SPSS, SAS, Stata files and many other file types. One advantage over
existing functions to write data in other packages is that labelled (numeric)
Expand All @@ -296,8 +303,8 @@ MINOR CHANGES

* `data_rename()` gets a `verbose` argument.
* `winsorize()` now errors if the threshold is incorrect (previously, it provided
a warning and returned the unchanged data). The argument `verbose` is now
useless but is kept for backward compatibility. The documentation now contains
a warning and returned the unchanged data). The argument `verbose` is now
useless but is kept for backward compatibility. The documentation now contains
details about the valid values for `threshold` (#357).
* In all functions that have arguments `select` and/or `exclude`, there is now
one warning per misspelled variable. The previous behavior was to have only one
Expand All @@ -318,7 +325,7 @@ BUG FIXES
* Fix unexpected warning in `convert_na_to()` when `select` is a list (#352).
* Fixed issue with correct labelling of numeric variables with more than nine
unique values and associated value labels.


# datawizard 0.6.5

Expand Down Expand Up @@ -350,7 +357,7 @@ NEW FUNCTIONS
* `data_codebook()`: to generate codebooks of data frames.

* New functions to deal with duplicates: `data_duplicated()` (keep all duplicates,
including the first occurrence) and `data_unique()` (returns the data, excluding
including the first occurrence) and `data_unique()` (returns the data, excluding
all duplicates except one instance of each, based on the selected method).

MINOR CHANGES
Expand All @@ -360,15 +367,15 @@ MINOR CHANGES
* The `include_bounds` argument in `normalize()` can now also be a numeric
value, defining the limit to the upper and lower bound (i.e. the distance
to 1 and 0).
* `data_filter()` now works with grouped data.

* `data_filter()` now works with grouped data.

BUG FIXES

* `data_read()` no longer prints message for empty columns when the data
actually had no empty columns.
* `data_to_wide()` now drops columns that are not in `id_cols` (if specified),

* `data_to_wide()` now drops columns that are not in `id_cols` (if specified),
`names_from`, or `values_from`. This is the behaviour observed in `tidyr::pivot_wider()`.

# datawizard 0.6.3
Expand Down Expand Up @@ -800,4 +807,3 @@ NEW FUNCTIONS
# datawizard 0.1.0

* First release.

62 changes: 41 additions & 21 deletions R/demean.R
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@
#' @param x A data frame.
#' @param select Character vector (or formula) with names of variables to select
#' that should be group- and de-meaned.
#' @param by Character vector (or formula) with the name of the variable(s) that
#' @param by Character vector (or formula) with the name of the variable that
#' indicates the group- or cluster-ID. For cross-classified or nested designs,
#' `by` can also identify two or more variables as group- or cluster-IDs. If
#' the data is nested and should be treated as such, set `nested = TRUE`. Else,
Expand All @@ -20,7 +20,8 @@
#'
#' For nested designs, `by` can be:
#' - a character vector with the name of the variable that indicates the
#' levels, ordered from *highest* level to *lowest* (e.g. `by = c("L3", "L2"`).
#' levels, ordered from *highest* level to *lowest* (e.g.
#' `by = c("L4", "L3", "L2")`.
#' - a character vector with variable names in the format `by = "L4/L3/L2"`,
#' where the levels are separated by `/`.
#'
Expand All @@ -47,7 +48,10 @@
#' @return
#' A data frame with the group-/de-meaned variables, which get the suffix
#' `"_between"` (for the group-meaned variable) and `"_within"` (for the
#' de-meaned variable) by default.
#' de-meaned variable) by default. For cross-classified or nested designs,
#' the name pattern of the group-meaned variables is the name of the centered
#' variable followed by the name of the variable that indicates the related
#' grouping level, e.g. `predictor_L3_between` and `predictor_L2_between`.
#'
#' @seealso If grand-mean centering (instead of centering within-clusters)
#' is required, see [`center()`]. See [`performance::check_heterogeneity_bias()`]
Expand Down Expand Up @@ -178,19 +182,30 @@
#'
#' @section De-meaning for cross-classified designs:
#'
#' `demean()` can also handle cross-classified designs, where the data has two
#' or more groups at the higher (i.e. second) level. In such cases, the
#' `demean()` can handle cross-classified designs, where the data has two or
#' more groups at the higher (i.e. second) level. In such cases, the
#' `by`-argument can identify two or more variables that represent the
#' cross-classified group- or cluster-IDs. The de-meaned variables for
#' cross-classified designs are simply subtracting all group means from each
#' individual value, i.e. _fully cluster-mean-centering_ (see _Guo et al. 2024_
#' for details). Note that de-meaning for cross-classified designs is *not*
#' equivalent to de-meaning of nested data structures from models with three or
#' more levels. Set `nested = TRUE` to explicitly assume a nested design. for
#' more levels. Set `nested = TRUE` to explicitly assume a nested design. For
#' cross-classified designs, de-meaning is supposed to work for models like
#' `y ~ x + (1|level3) + (1|level2)`, but *not* for models like
#' `y ~ x + (1|level3/level2)`.
#'
#' @section De-meaning for nested designs:
#'
#' _Brincks et al. (2017)_ have suggested an algorithm to center variables for
#' nested designs, which is implememented in `demean()`. For nested designs,
#' set `nested = TRUE` *and* specify the variables that indicate the different
#' levels in descending order in the `by` argument. E.g.,
#' `by = c("level4", "level3, "level2")` assumes a model like
#' `y ~ x + (1|level4/level3/level2)`. An alternative notation for the
#' `by`-argument would be `by = c("level4/level3/level2")`, similar to the
#' formula notation.
#'
#' @section Analysing panel data with mixed models using lme4:
#'
#' A description of how to translate the formulas described in *Bell et al. 2018*
Expand All @@ -200,35 +215,40 @@
#' @references
#'
#' - Bafumi J, Gelman A. 2006. Fitting Multilevel Models When Predictors
#' and Group Effects Correlate. In. Philadelphia, PA: Annual meeting of the
#' American Political Science Association.
#' and Group Effects Correlate. In. Philadelphia, PA: Annual meeting of the
#' American Political Science Association.
#'
#' - Bell A, Fairbrother M, Jones K. 2019. Fixed and Random Effects
#' Models: Making an Informed Choice. Quality & Quantity (53); 1051-1074
#' Models: Making an Informed Choice. Quality & Quantity (53); 1051-1074
#'
#' - Bell A, Jones K. 2015. Explaining Fixed Effects: Random Effects
#' Modeling of Time-Series Cross-Sectional and Panel Data. Political Science
#' Research and Methods, 3(1), 133–153.
#' Modeling of Time-Series Cross-Sectional and Panel Data. Political Science
#' Research and Methods, 3(1), 133–153.
#'
#' - Brincks, A. M., Enders, C. K., Llabre, M. M., Bulotsky-Shearer, R. J.,
#' Prado, G., and Feaster, D. J. (2017). Centering Predictor Variables in
#' Three-Level Contextual Models. Multivariate Behavioral Research, 52(2),
#' 149–163. https://doi.org/10.1080/00273171.2016.1256753
#'
#' - Gelman A, Hill J. 2007. Data Analysis Using Regression and
#' Multilevel/Hierarchical Models. Analytical Methods for Social Research.
#' Cambridge, New York: Cambridge University Press
#' Multilevel/Hierarchical Models. Analytical Methods for Social Research.
#' Cambridge, New York: Cambridge University Press
#'
#' - Giesselmann M, Schmidt-Catran, AW. 2020. Interactions in fixed
#' effects regression models. Sociological Methods & Research, 1–28.
#' https://doi.org/10.1177/0049124120914934
#' effects regression models. Sociological Methods & Research, 1–28.
#' https://doi.org/10.1177/0049124120914934
#'
#' - Guo Y, Dhaliwal J, Rights JD. 2024. Disaggregating level-specific effects
#' in cross-classified multilevel models. Behavior Research Methods, 56(4),
#' 3023–3057.
#' in cross-classified multilevel models. Behavior Research Methods, 56(4),
#' 3023–3057.
#'
#' - Heisig JP, Schaeffer M, Giesecke J. 2017. The Costs of Simplicity:
#' Why Multilevel Models May Benefit from Accounting for Cross-Cluster
#' Differences in the Effects of Controls. American Sociological Review 82
#' (4): 796–827.
#' Why Multilevel Models May Benefit from Accounting for Cross-Cluster
#' Differences in the Effects of Controls. American Sociological Review 82
#' (4): 796–827.
#'
#' - Hoffman L. 2015. Longitudinal analysis: modeling within-person
#' fluctuation and change. New York: Routledge
#' fluctuation and change. New York: Routledge
#'
#' @examples
#'
Expand Down
33 changes: 27 additions & 6 deletions man/demean.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading

0 comments on commit 0050749

Please sign in to comment.