Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add method thin_fit_result to CLV models #393

Merged
merged 3 commits into from
Dec 5, 2023

Conversation

ricardoV94
Copy link
Contributor

@ricardoV94 ricardoV94 commented Oct 12, 2023

CLV summary statistic models can be quite memory intensive and fail with the standard 4k draws. Because these methods are all built assuming there is a self.idata lying around (called in multiple places), the two hacky solutions would be to pass a thin argument to every method or to modify idata inplace.

This PR instead adds a method that returns a copy of the CLV model with a thinned dataset. A user can then use the methods in this thinned model to obtain summary stats.

Closes #374

This PR also cleans up several internals in the CLVModel base-class.

TODO

  • Add test using customer_lifetime_value

📚 Documentation preview 📚: https://pymc-marketing--393.org.readthedocs.build/en/393/

@ricardoV94 ricardoV94 added enhancement New feature or request CLV labels Oct 12, 2023
@codecov
Copy link

codecov bot commented Oct 12, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (30c91ee) 90.83% compared to head (80c48ab) 90.70%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #393      +/-   ##
==========================================
- Coverage   90.83%   90.70%   -0.13%     
==========================================
  Files          21       21              
  Lines        1920     1904      -16     
==========================================
- Hits         1744     1727      -17     
- Misses        176      177       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ricardoV94 ricardoV94 marked this pull request as ready for review November 1, 2023 14:33
@ricardoV94 ricardoV94 changed the title Allow thining of CLV model results Add method thin_fit_result to CLV models Nov 1, 2023
pymc_marketing/clv/models/basic.py Show resolved Hide resolved
pymc_marketing/clv/models/basic.py Show resolved Hide resolved
pymc_marketing/clv/models/basic.py Show resolved Hide resolved
@ColtAllen
Copy link
Collaborator

The assert stuff is probably just splitting hairs, but maybe raise a UserWarning in fit() to suggest using this function when dealing with huge datasets and/or draw sizes? Say, draw size > 10k and customer_ids > 50k. That way users know to try this when encountering memory crashes.

@ricardoV94
Copy link
Contributor Author

The assert stuff is probably just splitting hairs, but maybe raise a UserWarning in fit() to suggest using this function when dealing with huge datasets and/or draw sizes? Say, draw size > 10k and customer_ids > 50k. That way users know to try this when encountering memory crashes.

That could be useful. I wonder if we should do it in fit though? Or in the heavy functions like clv? Easier in fit but potentially more annoying as well.

@ColtAllen
Copy link
Collaborator

The assert stuff is probably just splitting hairs, but maybe raise a UserWarning in fit() to suggest using this function when dealing with huge datasets and/or draw sizes? Say, draw size > 10k and customer_ids > 50k. That way users know to try this when encountering memory crashes.

That could be useful. I wonder if we should do it in fit though? Or in the heavy functions like clv? Easier in fit but potentially more annoying as well.

Where's it most likely to crash? I only suggested fit because draw size can be obtained from sampler_config.

@ricardoV94
Copy link
Contributor Author

The assert stuff is probably just splitting hairs, but maybe raise a UserWarning in fit() to suggest using this function when dealing with huge datasets and/or draw sizes? Say, draw size > 10k and customer_ids > 50k. That way users know to try this when encountering memory crashes.

That could be useful. I wonder if we should do it in fit though? Or in the heavy functions like clv? Easier in fit but potentially more annoying as well.

Where's it most likely to crash? I only suggested fit because draw size can be obtained from sampler_config.

We always know the size of the samples from the idata

@ricardoV94
Copy link
Contributor Author

Where's it most likely to crash? I only suggested fit because draw size can be obtained from sampler_config.

It shouldn't be in fit, because there are not so many parameters in the model. Any problems will appear in functions that compute summary statistics per observations (most predictive methods, including the customer_lifetime_value one). It's also not model specific. I would keep it simple for now, and we revisit the warning idea if people keep being puzzled / don't find out naturally about the thin_fit_result method.

@ColtAllen
Copy link
Collaborator

Where's it most likely to crash? I only suggested fit because draw size can be obtained from sampler_config.

It shouldn't be in fit, because there are not so many parameters in the model. Any problems will appear in functions that compute summary statistics per observations (most predictive methods, including the customer_lifetime_value one). It's also not model specific. I would keep it simple for now, and we revisit the warning idea if people keep being puzzled / don't find out naturally about the thin_fit_result method.

Sounds good. Maybe add a comment about using thin_fit_result in the docstrings so it shows up in the docs, and/or the CLV Quickstart? The latter can be its own PR as there are plenty of other ways the Quickstart can be improved (ie an overview of the CLV modeling domains, use of the clv_summary function for data prep, etc.)

Copy link
Collaborator

@ColtAllen ColtAllen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from the documentation suggestions, the only other remarks I have concern the reductions in testing coverage.

Are fit_result and expected_customer_lifetime_valuefully covered when the --runslow tests are ran?

Also, are there plans to expand the placeholder methods output_var, _generate_and_preprocess_model_data, and _data_setter? I'm not sure what the need is for those methods at the moment.

If you're chasing 100% coverage in beta_geo.py, you can adapt the following code:
https://github.com/pymc-labs/pymc-marketing/blob/main/tests/clv/models/test_pareto_nbd.py#L302

@ricardoV94
Copy link
Contributor Author

I think the coverage is just because the tests are not passing yet.

Regarding the weird methods, they are abstract methods that the base ModelBuilder class demands but are not at all relevant for CLV

@ricardoV94
Copy link
Contributor Author

@ColtAllen can I get your input again? I opened an issue for documenting the new feature in #448

Copy link
Collaborator

@ColtAllen ColtAllen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did CodeCov not update for the new customer_lifetime_value test? I also made a note in #448 that it is not resolved by this PR, nor does it have to be.

As for the placeholder methods, I have some ideas on how to use them, but it would be CLV model-specific and probably outside the scope of this PR.

I think this is good to merge. Thanks for cleaning up the CLV base class!

@ricardoV94 ricardoV94 merged commit 372ec23 into pymc-labs:main Dec 5, 2023
12 checks passed
@tomthepeach
Copy link

Awesome thanks guys!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLV enhancement New feature or request
Projects
None yet
3 participants