Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fix for plotting order in plot_comparison (fixes mismatch between labels and plot) #731

Merged
merged 20 commits into from
Oct 3, 2023

Conversation

jt-lab
Copy link
Contributor

@jt-lab jt-lab commented Sep 28, 2023

A call like ...

bmb.interpret.plot_comparisons(
        model=model, 
        idata=trace, 
        contrast={"factor1": ["X", "Y"]},
        conditional={"factor2": ["C", "A", "B"], "factor3": [1, 2]},
)

currently leads to a mismatch of the order with which the levels of factor2 are plotted on the x axis and the x axis labels!

I have traced this down to .values.dtype.categories returning the names in sorted order which is eventually used for the labels. The actual plot is based on the user provided order.

This commit is only a quick fix which sorts the user-provided levels to match the order to avoid any false results. It would be better to give the user control about the order by using the order of the user provided list for the labels but a change like that seems to interact with further code which I don't fully understand.

Moreover, I'm not entirely sure what the benefit of the call above is compared to:

bmb.interpret.plot_comparisons(
        model=model, 
        idata=trace, 
        contrast={"factor1": ["X", "Y"]},
        conditional=["factor2", "factor3"],
)

I thought it was to give the user control over which levels are plotted (and in which order). But removing levels from the list does lead to errors (and the order is not applied). I might miss something, but if there is no additional functionality comapred to the list version, perhaps the dict version should be removed (for now).

I'm sorry for the many incremental commits just for fixing spaces and typos. I could not work with a local version but make change in githubs web interface (and test the code on a server).

Perhaps the rationale behind centering could be added to help user decide whether to turn it on or not. But I'm not entirely sure about this.
… character, or so.

Also deleted superflous space unrelated to the black's complaint
... for more informative output.
Cannot make changes locally at the moment. Hence so many commits ...
It's a bit annoying, I can only work from the web interface currently, but hopefully all typos are resolved now
@tomicapretto
Copy link
Collaborator

@jt-lab thanks for the contribution! Can I ask for an example of how this looked before and how it looks after the change? Thanks!

Bonus: If it comes with data so I can test on my end, even better!

@jt-lab
Copy link
Contributor Author

jt-lab commented Sep 28, 2023

@jt-lab thanks for the contribution! Can I ask for an example of how this looked before and how it looks after the change? Thanks!

Bonus: If it comes with data so I can test on my end, even better!

Here you go:
Code (note the two different orders of the levels in the conditional argument of plot_comparisons):

import pandas as pd
import bambi as bmb
from matplotlib.pylab import plt

df = pd.read_csv('simulated_data_order_prob.csv')

model = bmb.Model('p(correct, count) ~ factor1*factor2*factor3 + (factor3+factor2|individual)',
                  df, family='binomial')
trace = model.fit(tune=500, draws=1000, target_accept=0.7, init='adapt_diag')

f, ax = plt.subplots(2)
bmb.interpret.plot_comparisons(
    model=model,
    idata=trace,
    contrast={"factor2": ["X", "Y"]},
    conditional={"factor1": ["A", "B", "C"], "factor3": [1, 2]},
    ax=ax[0]
) 
bmb.interpret.plot_comparisons(
    model=model,
    idata=trace,
    contrast={"factor2": ["X", "Y"]},
    conditional={"factor1": ["C", "B", "A"], "factor3": [1, 2]},
    ax=ax[1]
)

Plot before the fix:
(e.g.: git+https://github.com/jt-lab/bambi.git@a072c84809fe62e6efe1d7818f3d5d28d81e00ed)
before

After the fix:
(e.g.: git+https://github.com/jt-lab/bambi.git)
after

Data set:
simulated_data_order_prob.csv
This is the same simulated data I also posted in question on the pymc discourse. I just added a further level for factor1 and matched the factor and level names with my original description of the problem.

@GStechschulte
Copy link
Collaborator

GStechschulte commented Sep 28, 2023

@jt-lab thanks for finding the subtle bug and opening a PR (good attention to detail)👍🏼. The difference between

# user provided values only (no default values are computed for "factor1" and "factor3")
bmb.interpret.plot_comparisons(
        model=model, 
        idata=trace, 
        contrast={"factor2": ["X", "Y"]},
        conditional={"factor1": ["C", "A", "B"], "factor3": [1, 2]},
)

and

# computes default values for "factor1" and "factor3"
bmb.interpret.plot_comparisons(
        model=model, 
        idata=trace, 
        contrast={"factor2": ["X", "Y"]},
        conditional=["factor1", "factor3"],
)

is that in the latter code snippet, since no values are passed for factor1 and factor3, default values are computed for these covariates. In the Default contrast and conditional values section of the plot_comparisons docs, these details are outlined.

Thanks!

Also, you are right. This is because in def _plot_categoric(...), to determine the x-axis values we call get_unique_levels(plot_data[main]) which in turn uses np.unique to return the unique elements. By default np.unique also sorts this array. Then, once ax.setxticklabels(main_levels) is called, the bug is introduced.

@jt-lab
Copy link
Contributor Author

jt-lab commented Sep 28, 2023

@GStechschulte, thanks for the explantions! I'm not sure I understand the purpose of the dictionary conditional argument. When I list all levels of the factors, I get the same as with the list version defaults. When I list a subset of the levels, I get errors.

Unrelated: When reading through the plot_comparisons doc, I noticed this sentence "Bambi uses the ordering (keys if dict and elements if list)...". Im not sure I understand correctly, but dictionaries don't gurantee any order of the items or keys. So this might also be a source of unexpected behavior. So the user might need to pass an OrderedDict to be safe. But maybe i'm missing something!

Many thanks again!

@GStechschulte
Copy link
Collaborator

GStechschulte commented Sep 30, 2023

I'm not sure I understand the purpose of the dictionary conditional argument. When I list all levels of the factors, I get the same as with the list version defaults

This is due to the data types of the data being used to fit the model.

When you pass the list ["factor1", "factor3"] for the conditional arg., factor1 is assigned as the "main" covariate and factor3 is assigned the "group" covariate. Then, since you don't specify any values to be associated with the factor covariates, a default value is computed for them. Default values for "main", "group", and "panel" are the following pseudocode:

if v == "main":
    
    if v == numeric:
        return np.linspace(v.min(), v.max(), 50)
    elif v == categorical:
        return np.unique(v)

elif v == "group":
    
    if v == numeric:
        return np.quantile(v, np.linspace(0, 1, 5))
    elif v == categorical:
        return np.unique(v)

elif v == "panel":
    
    if v == numeric:
        return np.quantile(v, np.linspace(0, 1, 5))
    elif v == categorical:
        return np.unique(v)

In your case, "main" and "group" are both categorical data types. Thus, the default values are the unique values.

When you pass a dict to the conditional arg., the keys of the dict are taken as the "main", "group", and "panel" covariates. Thus, {"factor1": ["A", "B", "C"], "factor3": [1, 2]} corresponds to {"main": "factor1", "group": "factor3"}. Since you passed values for the covariates, no default values are computed.

Then, since you ended up passing all the unique values for factor1, and factor3

df["factor1"].unique(), df["factor3"].unique()
(array(['A', 'B', 'C'], dtype=object), array([1, 2]))

as the dict values, the result of passing a list to conditional and the passing of the dict (stated above) results in the same data being generated. If you were to pass {"factor1": ["A", "B"], "factor3": [1]}, you would get a different result between the two.

Lastly, regarding

When I list a subset of the levels, I get errors.

This is due to a bug in the function def get_unique_levels(...) of utils.py. The code in the if block wasn't returning the unique levels, it was returning all levels. Thus, the reason for the shape error. Once the function returns unique levels

def get_unique_levels(x: np.ndarray) -> np.ndarray:
    return np.unique(x)

there are no more shape errors. As an aside, if factor3 should be a categorical data type, you should explicitly convert the data type (pd.Categorical(data["factor3"]), ordered=True) as such. Otherwise, Bambi and the interpret sub-package functions will treat it as numeric since it takes the values of [1, 2].

I hope this helps 👍🏼

@GStechschulte
Copy link
Collaborator

Should I make a separate PR incorporating @jt-lab commits and my local changes to fix the bugs? Or should @jt-lab keep this PR and then commit my changes to fix the bugs? @tomicapretto

@tomicapretto
Copy link
Collaborator

@GStechschulte if your fixes are related to what @jt-lab is doing here, I think @jt-lab could allow this PR to be edited by maintainers and then you could contribute here so everything is in place and we respect the original authorship. But if they are separate things, you could open a different PR.

But if you think that's too much, feel free to do something else.

@jt-lab
Copy link
Contributor Author

jt-lab commented Oct 1, 2023

@tomicapretto, how can I make it editable?

@tomicapretto
Copy link
Collaborator

I think it's already enabled

@GStechschulte
Copy link
Collaborator

I have pushed 3 commits to your branch. Now, the plots work with no errors. I also decided that your original commit to sort the dict values in ascending order is the most appropriate. In particular, the x-axis of graphs should be in ascending order $1, 2, 3,...,n$ or $A, B, C,...,n$.

Thanks a lot for raising the issue and for opening a PR 😄 It is much appreciated!

Below are a few of your examples:

f, ax = plt.subplots(2)
bmb.interpret.plot_comparisons(
    model=model,
    idata=trace,
    contrast={"factor2": ["X", "Y"]},
    conditional={"factor1": ["A", "B", "C"], "factor3": [1, 2]},
    ax=ax[0]
) 
bmb.interpret.plot_comparisons(
    model=model,
    idata=trace,
    contrast={"factor2": ["X", "Y"]},
    conditional={"factor1": ["C", "B", "A"], "factor3": [1, 2]},
    ax=ax[1]
)
plt.tight_layout();

image

bmb.interpret.plot_comparisons(
    model=model,
    idata=trace,
    contrast={"factor2": ["X", "Y"]},
    conditional={"factor1": ["A", "C"], "factor3": [1, 2]},
    fig_kwargs=dict(figsize=(7, 3))
);

image

bmb.interpret.plot_comparisons(
    model=model,
    idata=trace,
    contrast={"factor2": ["X", "Y"]},
    conditional=["factor1", "factor3"],
    fig_kwargs=dict(figsize=(7, 3))
);

image

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@tomicapretto
Copy link
Collaborator

@GStechschulte @jt-lab ok I struggled a little bit with git and the fact that we're working on @jt-lab's main branch (I used a different name locally). Nevertheless, it seems everything is working. I just made a small modification to the if-else structure, I think it's clearer now, and I also wrote a fix for the HSGP and the failing tests. We can merge once CI passes.

@GStechschulte
Copy link
Collaborator

@tomicapretto Ahh great! And thanks for the HSGP fix 👍🏼

@GStechschulte
Copy link
Collaborator

Mmmm. The test_plots.py passed locally before I pushed those three commits (Tomas's commits shouldn't cause them to fail). I will look into this evening.

@tomicapretto
Copy link
Collaborator

Mmmm. The test_plots.py passed locally before I pushed those three commits (Tomas's commits shouldn't cause them to fail). I will look into this evening.

The problem happens when we test this

plot_slopes(model, idata, wrt={"Days": 2}, conditional={"Subject": 308}

and it's because it's calling sorted(308), which cannot be sorted because it's not iterable. I'm fixing it now :)

@codecov-commenter
Copy link

Codecov Report

Merging #731 (8d7e161) into main (e53f8da) will decrease coverage by 0.01%.
Report is 4 commits behind head on main.
The diff coverage is 83.33%.

@@            Coverage Diff             @@
##             main     #731      +/-   ##
==========================================
- Coverage   89.56%   89.55%   -0.01%     
==========================================
  Files          44       44              
  Lines        3525     3524       -1     
==========================================
- Hits         3157     3156       -1     
  Misses        368      368              
Files Coverage Δ
bambi/backend/terms.py 96.65% <100.00%> (ø)
bambi/interpret/create_data.py 100.00% <100.00%> (ø)
bambi/interpret/utils.py 87.37% <100.00%> (-0.32%) ⬇️
bambi/models.py 79.22% <ø> (ø)
bambi/priors/scaler.py 96.70% <100.00%> (ø)
bambi/interpret/plot_types.py 73.39% <75.00%> (ø)
bambi/interpret/plotting.py 85.40% <83.33%> (+0.43%) ⬆️

... and 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@tomicapretto tomicapretto merged commit 0e4d4fc into bambinos:main Oct 3, 2023
4 checks passed
GStechschulte added a commit to GStechschulte/bambi that referenced this pull request Oct 3, 2023
… labels and plot) (bambinos#731)

Co-authored-by: GStechschulte <[email protected]>
Co-authored-by: Tomas Capretto <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants