Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unsupported dtype from PyMC sampler warnings with ClickHouseBackend #75

Merged
merged 7 commits into from
Dec 22, 2022

Conversation

michaelosthege
Copy link
Member

@michaelosthege michaelosthege commented Dec 20, 2022

This MR fixes some incompatibilities with NUTS sampling on recent PyMC versions.

Background

Since PyMC v4.3.0 or more specifically since pymc-devs/pymc#6192, the SamplerWarnings emitted by some step methods --- most importantly NUTS --- are no longer collected on the PyMC BaseTrace object, but rather as any other sampler stat.

This created an incompatibility with McBackend, where the ClickHouseBackend could not store these object-dtyped things into the database.

Solution

Instead of filtering/ignoring the "warning" stat, I went all the way to actually support storing it by pickling it, base64-encoding the bytes and storing the base64 encoded data as an ASCII string.

For this, the following changes were needed:

  • Supporting the storage of str dtypes in the NumPyBackend and ClickHouseBackend.
  • Tweaking the mcbackend.adapters.pymc.TraceBackend to encode/decode the "warning" stat such that the mcbackend.Chain only sees str data coming in/out.
  • Running the PyMC integration test with NUTS to provoke needing to store SamplerWarning.

Further changes along the way:

@codecov-commenter
Copy link

codecov-commenter commented Dec 20, 2022

Codecov Report

Base: 95.46% // Head: 95.65% // Increases project coverage by +0.19% 🎉

Coverage data is based on head (dfd4d38) compared to base (15b6082).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #75      +/-   ##
==========================================
+ Coverage   95.46%   95.65%   +0.19%     
==========================================
  Files           9        9              
  Lines         617      644      +27     
==========================================
+ Hits          589      616      +27     
  Misses         28       28              
Impacted Files Coverage Δ
mcbackend/adapters/pymc.py 94.32% <100.00%> (+0.72%) ⬆️
mcbackend/backends/clickhouse.py 100.00% <100.00%> (ø)
mcbackend/backends/numpy.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

In earlier versions of ClickHouse, the `UInt8` type had to be used,
but now this is causing trouble with NumPy bool arrays.
@michaelosthege
Copy link
Member Author

@lhelleckes @qacwnfq this was a somewhat tricky PR. If you have a chance to review that's cool, but don't feel obliged to do so.

Knowing that you two are short on time and not familiar with the codebase, I challenged myself to do the changes such that each commit passes the tests.

The Run PyMC test with NUTS is ❌ only because of decreasing coverage. Here I XFAILed the PyMC test to confirm that it triggers #73.

The Store PyMC SamplerWarning stats as str by pickling after reverts the XFAIL and makes no changes to backend implementation (only to the PyMC adapter).
This should confirm that the changes are well-isolated.

Copy link
Collaborator

@qacwnfq qacwnfq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@@ -142,7 +142,9 @@ def setup(
name,
str(self.var_dtypes[name]),
list(self.var_shapes[name]),
dims=list(self.model.RV_dims[name]) if name in self.model.RV_dims else [],
dims=list(self.model.named_vars_to_dims[name])
if name in self.model.named_vars_to_dims
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a little unorthodox to use a ternary operator to compute the default arg. Couldn't it be just None and if it is None then it is computed? Just a nitpick though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I did this because it's in a list comprehension, and there's no way to conditionally NOT pass the kwarg

# This 👇 is needed until PyMC provides shapes ahead of time.
undefined_ndim=True,
)
if statname == "warning":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meaning that if we would want to record warnings with hopsy we need to give the stat the name "warning"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for now this is only for stats named "warning", which also get special treatment by PyMC.

One could extend this to more arbitrary objects, but generally I think we should try to stick to standard data types and avoid objects, so I didn't want to add more flexibility than needed.

tune=3,
draws=5,
tune=30,
draws=50,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the rng set somewhere to guarantee a warning?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

within these steps

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, but with this few steps the NUTS is pretty much guaranteed to emit warnings.
And even if it occasionally does not, that should not break things.

@michaelosthege michaelosthege merged commit 408cca2 into main Dec 22, 2022
@michaelosthege michaelosthege deleted the issue-73 branch December 22, 2022 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants