Appropriate Statistical Test Question #589

delucalab · 2021-08-28T16:08:09Z

delucalab
Aug 28, 2021

Dear All,

I have collected data on the presence and number of lesions of 1) different types and 2) within different locations in a patient cohort. I looked at the following summary (http://htmlpreview.github.io/?https://github.com/strengejacke/mixed-models-snippets/blob/master/overview_modelling_packages.html) to come up with an approach and wanted to get some feedback. Ultimately, I want to perform the following:

1. Case-level approach to look for differences in the proportion of cases with lesions at the different locations (region 1, 2, and 3) and of the different lesion types (type 1, 2, and 3). To accomplish this, I thought I would fit a glmer (family=binomial) model, since the data are binary (lesion vs. not) at the nested levels of location and lesion type for each case. Would you agree with this method?
2. Lesion-level approach using the data I have on number of lesions (rather than just presence) to look for differences in the proportion of lesions that are found at particular locations (region 1, 2, and 3) and are of particular types (type 1, 2, and 3). To accomplish this, I thought I would fit a glmmTMB(ziformula, family=beta_family/betabinomial) model on the proportion of total lesions identified for each case that fall into the different categories (i.e., locations and types), since the data are proportions, include 0 and 1, and are nested within each case. Would this be the best method?

As always, thank you for your insights and help! Do not hesitate to let me know if I need to clarify my aims or the type of data any further.

bwiernik · 2021-08-28T16:55:01Z

bwiernik
Aug 28, 2021
Maintainer

It's best to think of both of these as "case-level". The difference is whether you dichotomize lesions into a binary variable (present or not) or leave them as a count variable (number of lesions).

I'll discuss model form in a minute. Let's first discuss your predictors. You have described 3 predictor variables: case, type, and location. You need to decide whether to model each of these as a fixed-effects predictor or as a random-effects predictor. To decide, ask yourself: are you interested in the specific values of the variables themselves (e.g., these cases, these locations), or do you want to treat these values as samples from a broader population and generalize to that population (e.g., to the population of potential cases, or of potential locations)? Another way to think about this: do you want to take extreme values for one of the cases/locations/types at face value, or do you want to regularize them a bit and pull them toward the overall mean (which is often reasonable)? If you want to generalize to a broader population or regularize values, model the variable as a random grouping factor. Otherwise, model it as a fixed factor.

For example, if you want to model case as a random factor but type and location as fixed factors, you could use the formula:
lesion ~ location + type + (1 | case)

If you want to model all 3 as random factors, then you could use:
lesion ~ (1 | location) + (1 | type) + (1 | case)

Both of the above formulations treat the three types of grouping factors as distinct (but correlated): cases may have predispositions toward more lesions generally, but not predispositions to specific types or locations of lesions.

If you want to consider predispositions toward specific types/locations across cases, you can add an interaction to your grouping structure:
lesion ~ (1 | location) + (1 | type) + (1 | case / location) + (1 | case / type)

Here, I've left in the direct effects of location and type to consider that these factors may have main effects in addition to their individual case-level effects. You could consider dropping those.
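The formulas above assume the data are in long format, with one row per case × location × type combination and the outcome in a single column. A minimal base-R sketch of that layout follows. The column names mirror the formulas, but the case IDs, the number of cases, and the counts are made up for illustration only.

```r
# Hypothetical long-format layout: one row per case x location x type.
set.seed(1)
dat <- expand.grid(
  case     = factor(paste0("case", 1:5)),
  location = factor(paste0("region", 1:3)),
  type     = factor(paste0("type", 1:3))
)

# Placeholder counts only; real data would replace this column.
dat$lesion <- rpois(nrow(dat), lambda = 1)

nrow(dat)                      # 5 cases * 3 locations * 3 types
table(dat$case, dat$location)  # rows per case-by-location cell (one per type)
```

In lme4-style syntax, `(1 | case / location)` is shorthand for `(1 | case) + (1 | case:location)`, so the nested term adds a separate intercept for each case-by-location cell. Because both nested terms in the formula above expand to include `(1 | case)`, it may be cleaner to spell the terms out as `(1 | case) + (1 | case:location) + (1 | case:type)`; treat that as a suggestion to check against your own model rather than a required change.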

Now, turning to your question about model family.

The most appropriate form, I would argue, is your (2): to model the number of lesions, which may include zero. For this approach, you likely want to choose a family that (1) reflects a count variable, (2) is flexible about the mean and variance of the lesion counts, and (3) also flexibly models the absence of any lesions. For this, I would recommend a zero-inflated negative binomial model, as it has all of these features. glmmTMB can fit this family of models; the call below puts random effects in the main count portion of the model and fixed effects in the zero-inflation portion.

glmmTMB(lesion ~ (1 | location) + (1 | type) + (1 | case), ziformula = ~ location + type, family = nbinom2())

If you want to model the zero-inflation with random effects as well, use brms.

brms::brm(brms::bf(lesion ~ (1 | location) + (1 | type) + (1 | case), zi ~ location + type), family = brms::zero_inflated_negbinomial())

Note that you can use this model to answer your first question (what predicts presence of any lesions versus none), and it does so better than a binomial model, because the binomial model treats any number of lesions greater than zero as the same.
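As a rough illustration of the zero-inflated negative binomial approach, here is a sketch that fits the fixed-effects variant of the formula (`location + type + (1 | case)`) to simulated data. The data, sample size, and effect sizes are all hypothetical stand-ins, and the sketch assumes the glmmTMB package is installed. Note that `ziformula` takes a one-sided formula, with no response on the left.

```r
library(glmmTMB)

# Simulated stand-in data (hypothetical): 60 cases x 3 locations x 3 types.
set.seed(42)
dat <- expand.grid(
  case     = factor(1:60),
  location = factor(paste0("region", 1:3)),
  type     = factor(paste0("type", 1:3))
)
case_effect <- rnorm(60, sd = 0.5)                  # case-level predisposition
mu <- exp(0.3 + case_effect[as.integer(dat$case)])  # expected count per row
counts <- rnbinom(nrow(dat), mu = mu, size = 2)     # overdispersed counts
extra_zero <- rbinom(nrow(dat), 1, 0.2) == 1        # structural (excess) zeros
dat$lesion <- ifelse(extra_zero, 0L, counts)

# Count part: fixed location/type effects plus a random intercept per case.
# Zero-inflation part: fixed effects only, given as a one-sided formula.
fit <- glmmTMB(
  lesion ~ location + type + (1 | case),
  ziformula = ~ location + type,
  family = nbinom2(),
  data = dat
)
summary(fit)
```

With only three locations and three types, treating them as fixed factors (as here) is often easier to estimate than random intercepts with three levels each, though which is preferable depends on the generalization you are after.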

delucalab · 2021-08-28T17:23:47Z

delucalab
Aug 28, 2021
Author

Thank you so much! So would the suggested model address the differences shown in the following tables? The first table is the proportion of cases with a particular lesion type/location, while the second is the proportion of lesions of a certain type/location. I think this is why I was trying to split the analyses into case-level vs. lesion-level, but I may just be confusing myself.

| Region | Type 1 | Type 2 | Type 3 |
| --- | --- | --- | --- |
| Region 1 | 22/102 (21.6%) | 51/102 (50.0%) | 42/102 (41.2%) |
| Region 2 | 17/116 (14.7%) | 38/116 (32.8%) | 44/116 (37.9%) |
| Region 3 | 14/106 (13.2%) | 35/106 (33.0%) | 33/106 (31.1%) |

| Region | Type 1 | Type 2 | Type 3 |
| --- | --- | --- | --- |
| Region 1 | 28/177 (15.8%) | 79/177 (44.6%) | 70/177 (39.5%) |
| Region 2 | 23/149 (15.4%) | 56/149 (37.6%) | 70/149 (47.0%) |
| Region 3 | 24/131 (18.3%) | 55/131 (42.0%) | 52/131 (39.7%) |

bwiernik · 2021-08-28T17:57:14Z

bwiernik
Aug 28, 2021
Maintainer

The issue is that the first table treats a case with 1 lesion and a case with 10 lesions identically. The model I suggest can make that comparison, but it doesn’t assume they are identical the way a binomial model would.
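A toy example of what dichotomizing discards, using made-up lesion counts for four hypothetical cases:

```r
# Hypothetical lesion counts for four cases in one region (types pooled).
lesions <- c(caseA = 0, caseB = 1, caseC = 10, caseD = 0)

# Case-level summary: dichotomize to "any lesion" vs. "none".
any_lesion <- lesions > 0
mean(any_lesion)   # proportion of cases with at least one lesion

# caseB (1 lesion) and caseC (10 lesions) look the same after dichotomizing.
any_lesion[["caseB"]] == any_lesion[["caseC"]]

# Count-level summary keeps the magnitude.
sum(lesions)             # total number of lesions
lesions / sum(lesions)   # each case's share of all lesions
```

A count model works from the `lesions` vector directly, so the difference between 1 and 10 lesions still informs the estimates.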

delucalab · 2021-08-28T22:11:05Z

delucalab
Aug 28, 2021
Author

That makes sense! Thank you! One additional question: What are the pros and cons of including a random effect in the zero-inflation part? Is there a rule of thumb for when you should?

bwiernik · 2021-08-28T22:54:43Z

bwiernik
Aug 28, 2021
Maintainer

Same arguments apply as I laid out for the mean function.

mattansb · 2021-08-31T10:35:47Z

mattansb
Aug 31, 2021
Maintainer

> The most appropriate form I would argue is your (2)--to model the number of lesions, which may include zero. For this approach, you likely want to choose a family that (1) reflects a count variable, (2) is flexible about the mean and variance for the lesion counts, and (3) also flexibly models the absence of any lesions. For this, I would recommend a zero-inflated negative binomial model

Just to clarify: negative-binomial models can themselves model zero counts. Zero-inflated models model excess zeros, that is, more zeros than are expected from the NB distribution alone (:
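The distinction can be made concrete with base R. The sketch below compares the probability of a zero under a plain negative binomial with the probability under a zero-inflated mixture. The parameter values are arbitrary and chosen only for illustration.

```r
mu    <- 1.5   # hypothetical mean of the count process
size  <- 2     # hypothetical NB dispersion (variance = mu + mu^2 / size)
pi_zi <- 0.2   # hypothetical probability of a structural ("excess") zero

# Zeros expected from the negative binomial alone.
p0_nb <- dnbinom(0, mu = mu, size = size)

# Zeros expected under zero inflation: structural zeros plus NB zeros.
p0_zinb <- pi_zi + (1 - pi_zi) * p0_nb

c(nb = p0_nb, zinb = p0_zinb)
```

If the observed share of zeros in the data is close to `p0_nb` for the fitted mean and dispersion, a plain negative binomial may already be adequate; a zero-inflation component is aimed at the case where the observed share is clearly higher.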

strengejacke · 2021-09-01T06:08:57Z

strengejacke
Sep 1, 2021
Maintainer

I have converted this issue into a discussion, seemed more appropriate to me.

delucalab · 2021-09-02T17:38:19Z

delucalab
Sep 2, 2021
Author

Thanks! I have just been thinking a bit about an extension to this analysis:

I'm interested in whether the number of lesions of particular types and at particular locations predicts age at death and disease duration. I was thinking about adding age, disease duration, and the interactions of these two variables with lesion type and location to the model we discussed above to answer this question. However, this seemed to make the model much more complicated, and I was hitting some convergence issues. If I use only age and disease duration as predictors, without location and lesion type, I do get a significant relationship with age, but I am curious whether it is driven by particular locations or lesion types. In addition, wouldn't the mixed model approach answer the reverse question, whether age and disease duration predict lesion number at particular locations and of particular types, rather than the question of interest? Therefore, I thought that I could run a separate analysis with age and disease duration as outcome variables instead, but I wasn't sure how to deal with predictors that would be clustered.

bwiernik Sep 3, 2021
Maintainer

The main thing I would be concerned about is the collinearity of the items. I think something like:

time_to_event ~ number_of_lesions + (number_of_lesions | region) + (number_of_lesions | type)

would make sense. This will somewhat regularize the effects of the many predictors and give clearer results than dropping them all into one single-level model.

delucalab Sep 3, 2021
Author

The thought about collinearity makes sense. So in the formula you indicated, are region and type treated as random effects rather than fixed effects? If so, would this formula allow me to derive hazard ratios for each type of lesion at each region (i.e., active lesions in region 1)? I assumed that I would need to include an interaction between region and type in order to accomplish that. Would I also need to include the term (1 | ID)?

Thanks again for helping me think through this since it is so new to me!

bwiernik Sep 3, 2021
Maintainer

Yes, these are random effects: the model estimates a random effect for each region and for each type. The estimate of the effect for a specific combination would be the sum of the two effects. If you want to freely estimate effects for each combination, either enter one random effect as (number_of_lesions | region/type) or, more simply, make a new variable that combines the information about both into one. You should control for the random effect of ID as well.
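One way to build such a combined variable in base R is `interaction()`. This is a minimal sketch; the data frame and the name `region_type` are hypothetical.

```r
# Hypothetical rows: one per region x type combination.
dat <- expand.grid(
  region = paste0("region", 1:3),
  type   = paste0("type", 1:3)
)

# One grouping factor with a level for every region-by-type combination.
dat$region_type <- interaction(dat$region, dat$type, sep = "_")

nlevels(dat$region_type)   # 3 regions * 3 types
levels(dat$region_type)
```

With such a variable, the random-slope term could be written as `(number_of_lesions | region_type)`, alongside `(1 | ID)` for the case-level clustering.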

delucalab Sep 3, 2021
Author

If I’m only including lesion number (irrespective of region and lesion type) as a fixed effect, without including region and lesion type as fixed effects, how can I tabulate hazard ratios for specific contrasts of interest? I’ve only ever used random effects for ID to control for the clustering of data, so I think that’s why I’m getting a bit lost.

bwiernik Sep 3, 2021
Maintainer

The estimate_grouplevel() function (or the coef() function) adds the fixed and random components together to get the estimated total slope for each group.
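To see how `coef()` combines the two components, here is a small sketch using lme4 and its bundled `sleepstudy` data as a stand-in, since the lesion data are not available here. It assumes the lme4 package is installed and that the thread's model behaves like an lme4-style mixed model; `Days` plays the role of `number_of_lesions` and `Subject` the role of the grouping factor.

```r
library(lme4)

# sleepstudy ships with lme4; it is only a stand-in for the lesion data.
m <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

fixed_slope   <- fixef(m)[["Days"]]          # population-level slope
random_slopes <- ranef(m)$Subject[, "Days"]  # per-group deviations from it
total_slopes  <- coef(m)$Subject[, "Days"]   # per-group total slope

# The total slope for each group is the fixed slope plus that group's deviation.
all.equal(total_slopes, fixed_slope + random_slopes)
```

The per-group totals in `coef(m)` are the quantities to tabulate when you want an estimated effect for each region, type, or combined grouping level.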
