Add method `thin_fit_result` to CLV models #393

ricardoV94 · 2023-10-12T12:39:58Z

Summary statistics in CLV models can be quite memory intensive and can fail with the standard 4k draws. Because the methods that compute them all assume there is a self.idata lying around (it is accessed in multiple places), the two hacky solutions would be to pass a thin argument to every method or to modify idata in place.

This PR instead adds a method that returns a copy of the CLV model with a thinned dataset. A user can then use the methods in this thinned model to obtain summary stats.
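The copy-instead-of-mutate design described above can be sketched with a toy class. This is a minimal illustration, not the pymc-marketing implementation: `ToyCLVModel`, its `posterior` layout, and `n_draws` are hypothetical names, and only the method name `thin_fit_result` comes from this PR. The `keep_every` parameter is likewise an assumption made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyCLVModel:
    """Hypothetical stand-in for a fitted CLV model (not the real class)."""

    # posterior[param][chain] holds the list of draws for that chain
    posterior: Dict[str, List[List[float]]] = field(default_factory=dict)

    def thin_fit_result(self, keep_every: int) -> "ToyCLVModel":
        """Return a new model that keeps every `keep_every`-th draw per chain."""
        if keep_every < 1:
            raise ValueError("keep_every must be a positive integer")
        # Slicing builds new lists, so the original draws are left untouched
        thinned = {
            name: [chain[::keep_every] for chain in chains]
            for name, chains in self.posterior.items()
        }
        return ToyCLVModel(posterior=thinned)

    def n_draws(self) -> int:
        """Draws per chain (assumes every parameter has the same shape)."""
        first_param = next(iter(self.posterior.values()))
        return len(first_param[0])


# 4 chains x 1000 draws of a single hypothetical parameter "r"
full = ToyCLVModel(posterior={"r": [[float(i) for i in range(1000)] for _ in range(4)]})
thin = full.thin_fit_result(keep_every=10)
```

Summary methods would then be called on `thin` while `full` stays available for diagnostics, which is what makes this approach less intrusive than threading a thin argument through every method.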

Closes #374

This PR also cleans up several internals in the CLVModel base-class.

TODO

Add test using customer_lifetime_value

📚 Documentation preview 📚: https://pymc-marketing--393.org.readthedocs.build/en/393/

codecov · 2023-10-12T13:16:10Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (30c91ee) 90.83% compared to head (80c48ab) 90.70%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #393      +/-   ##
==========================================
- Coverage   90.83%   90.70%   -0.13%     
==========================================
  Files          21       21              
  Lines        1920     1904      -16     
==========================================
- Hits         1744     1727      -17     
- Misses        176      177       +1

pymc_marketing/clv/models/basic.py

ColtAllen · 2023-11-11T08:23:35Z

The assert stuff is probably just splitting hairs, but maybe raise a UserWarning in fit() to suggest using this function when dealing with huge datasets and/or draw sizes? Say, draw size > 10k and customer_ids > 50k. That way users know to try this when encountering memory crashes.
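A rough sketch of what such a check could look like, using the thresholds mentioned in the comment. The helper name `warn_if_memory_heavy` is hypothetical, and reading "and/or" as "both thresholds exceeded" is an assumption made here; nothing like this is part of the PR.

```python
import warnings

# Thresholds taken from the suggestion above; illustrative, not tuned
MAX_DRAWS = 10_000
MAX_CUSTOMERS = 50_000


def warn_if_memory_heavy(n_draws: int, n_customers: int) -> bool:
    """Emit a UserWarning when both sizes exceed the suggested thresholds.

    Returns True if a warning was issued, False otherwise.
    """
    if n_draws > MAX_DRAWS and n_customers > MAX_CUSTOMERS:
        warnings.warn(
            "Large number of draws and customers: consider calling "
            "thin_fit_result before computing summary statistics.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```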

ricardoV94 · 2023-11-11T08:27:53Z

That could be useful. I wonder if we should do it in fit though? Or in the heavy functions like clv? Easier in fit but potentially more annoying as well.

ColtAllen · 2023-11-11T08:46:58Z

Where's it most likely to crash? I only suggested fit because draw size can be obtained from sampler_config.

ricardoV94 · 2023-11-11T11:38:36Z

We always know the size of the samples from the idata

ricardoV94 · 2023-11-15T14:16:19Z

> Where's it most likely to crash? I only suggested fit because draw size can be obtained from sampler_config.

It shouldn't be in fit, because there are not so many parameters in the model. Any problems will appear in functions that compute summary statistics per observations (most predictive methods, including the customer_lifetime_value one). It's also not model specific. I would keep it simple for now, and we revisit the warning idea if people keep being puzzled / don't find out naturally about the thin_fit_result method.
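A back-of-the-envelope calculation shows why per-observation summaries are where memory runs out. The numbers are assumptions for illustration: 4 chains of 1000 draws (one reading of "the standard 4k draws"), 50,000 customers (the figure floated earlier in this thread), and 8-byte float64 values. The function name is hypothetical.

```python
def summary_array_bytes(n_chains, n_draws, n_customers, bytes_per_value=8):
    """Size of one dense (chain, draw, customer) array of float64 values."""
    return n_chains * n_draws * n_customers * bytes_per_value


# One per-customer summary over the full posterior
full_bytes = summary_array_bytes(n_chains=4, n_draws=1000, n_customers=50_000)

# Same summary after keeping every 10th draw
thinned_bytes = summary_array_bytes(n_chains=4, n_draws=100, n_customers=50_000)

print(full_bytes)     # 1600000000 (about 1.6 GB)
print(thinned_bytes)  # 160000000 (about 160 MB)
```

The model parameters themselves have no customer dimension, so the posterior produced by fit stays small. The cost only appears once a quantity is evaluated for every customer and every draw, and intermediate arrays can multiply it further.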

ColtAllen · 2023-11-26T20:17:15Z

Sounds good. Maybe add a comment about using thin_fit_result in the docstrings so it shows up in the docs, and/or the CLV Quickstart? The latter can be its own PR, as there are plenty of other ways the Quickstart can be improved (e.g., an overview of the CLV modeling domains, use of the clv_summary function for data prep, etc.).

ColtAllen

Aside from the documentation suggestions, the only other remarks I have concern the reductions in testing coverage.

Are fit_result and expected_customer_lifetime_value fully covered when the --runslow tests are run?

Also, are there plans to expand the placeholder methods output_var, _generate_and_preprocess_model_data, and _data_setter? I'm not sure what the need is for those methods at the moment.

If you're chasing 100% coverage in beta_geo.py, you can adapt the following code:
https://github.com/pymc-labs/pymc-marketing/blob/main/tests/clv/models/test_pareto_nbd.py#L302

ricardoV94 · 2023-12-01T08:25:10Z

I think the coverage is just because the tests are not passing yet.

Regarding the weird methods, they are abstract methods that the base ModelBuilder class demands but that are not at all relevant for CLV.

ricardoV94 · 2023-12-01T08:53:38Z

@ColtAllen can I get your input again? I opened an issue for documenting the new feature in #448

ColtAllen

Did CodeCov not update for the new customer_lifetime_value test? I also made a note in #448 that it is not resolved by this PR, nor does it have to be.

As for the placeholder methods, I have some ideas on how to use them, but it would be CLV model-specific and probably outside the scope of this PR.

I think this is good to merge. Thanks for cleaning up the CLV base class!

tomthepeach · 2023-12-05T21:06:14Z

Awesome thanks guys!

ricardoV94 added enhancement New feature or request CLV labels Oct 12, 2023

ColtAllen mentioned this pull request Oct 23, 2023

Add posterior population sampling methods to ParetoNBD #401

Merged

ricardoV94 force-pushed the thin-clv-model branch from 0457137 to 387cc5e Compare November 1, 2023 14:31

ricardoV94 marked this pull request as ready for review November 1, 2023 14:33

ricardoV94 requested a review from ColtAllen November 1, 2023 14:33

ricardoV94 changed the title ~~Allow thining of CLV model results~~ Add method thin_fit_result to CLV models Nov 1, 2023

ColtAllen requested changes Nov 10, 2023

pymc_marketing/clv/models/basic.py (3 resolved review threads)

ricardoV94 requested a review from ColtAllen November 15, 2023 14:14

ColtAllen reviewed Nov 26, 2023

Cleanup base methods in CLVModel

bb3926d

ricardoV94 force-pushed the thin-clv-model branch from 387cc5e to 16ee6e8 Compare December 1, 2023 08:47

ricardoV94 mentioned this pull request Dec 1, 2023

Illustrate use of thin_fit_result in notebook #448

Open

ricardoV94 requested a review from ColtAllen December 1, 2023 08:55

ricardoV94 added 2 commits December 1, 2023 09:58

Add option to thin fit result in CLV models

d293c0b

Test customer_lifetime_value after thinning

80c48ab

ricardoV94 force-pushed the thin-clv-model branch from 16ee6e8 to 80c48ab Compare December 1, 2023 08:59

ColtAllen approved these changes Dec 5, 2023

ricardoV94 merged commit 372ec23 into pymc-labs:main Dec 5, 2023
12 checks passed
