Merge ebm with different subset of variables #564

Open
sadsquirrel369 opened this issue Jul 24, 2024 · 7 comments

@sadsquirrel369

I am fitting a model on a subset of variables with no interactions. I now want to fit interactions on a larger subset of variables and merge that model with the original one.

The merge_ebms method does not allow for this in its current form. Is there a smart way to combine the components of the two underlying models into a new, clean instance?

@paulbkoch
Collaborator

Hi @sadsquirrel369 -- This is supported, but it is currently a bit more complicated than it should be. In the future we want to support scikit-learn's warm_start functionality, which will make this simpler. Today, you need to do the following (see the sketch below the list):

  1. Make a dataframe or numpy array with a superset of all the features you'll need for both mains and interactions.
  2. Set interactions=0 and use exclude to drop any individual features that you don't want considered as mains.
  3. Fit the mains model.
  4. Use exclude to exclude all mains, and optionally any specific pairs you don't want considered. Set interactions to either a number for automatic detection or a list of the specific interactions. Call fit with the init_score parameter set to the mains model so that the pairs are boosted on top of the mains.
  5. Call merge_ebms on the two EBMs. There are more details to this, which are covered in our docs here: https://interpret.ml/docs/python/examples/custom-interactions.html
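
A minimal sketch of these steps, assuming a classifier, a DataFrame `X` holding the superset of columns, automatic detection of 10 pairs, and a hypothetical feature `"unwanted_main"` excluded from the mains; the exact argument values (including `exclude="mains"`) should be checked against the custom-interactions docs linked above:

```python
from interpret.glassbox import ExplainableBoostingClassifier, merge_ebms

# Steps 2-3: mains-only model. "unwanted_main" is a hypothetical feature
# we don't want as a main effect, so it is excluded here.
ebm_mains = ExplainableBoostingClassifier(interactions=0, exclude=["unwanted_main"])
ebm_mains.fit(X, y)

# Step 4: pairs-only model boosted on top of the mains via init_score.
# exclude="mains" drops all main terms; interactions=10 requests automatic
# detection of 10 pairs (a list of tuples works too).
ebm_pairs = ExplainableBoostingClassifier(exclude="mains", interactions=10)
ebm_pairs.fit(X, y, init_score=ebm_mains)

# Step 5: combine the two models into one.
ebm = merge_ebms([ebm_mains, ebm_pairs])
```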

@sadsquirrel369
Author

@paulbkoch Thanks for the prompt reply. So if I exclude variables (via the exclude parameter) when fitting the "mains" model, will all of the feature names still appear in model.feature_names_in_, irrespective of whether they are actually used by the model?

@paulbkoch
Collaborator

Hi @sadsquirrel369 -- Features that are excluded will be recorded in the model.feature_names_in_ attribute, but they will not be used for prediction. Anything that is used for prediction is called a "term" in EBMs. If you print the model.term_names_ you'll see a list of everything that is used for prediction. For some datatypes like numpy arrays there are no column names and features are determined by their index, so it's important in these cases that both the features used in mains and the features used in pairs are all in the same dataset, even if they are not used in the model.
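
As a quick illustration of the difference, a sketch assuming a fitted mains model where a hypothetical feature "weight" was excluded:

```python
from interpret.glassbox import ExplainableBoostingClassifier

# Hypothetical mains-only model with one excluded feature.
ebm_mains = ExplainableBoostingClassifier(interactions=0, exclude=["weight"])
ebm_mains.fit(X, y)

print(ebm_mains.feature_names_in_)  # "weight" is still recorded here
print(ebm_mains.term_names_)        # but there is no "weight" term, so it is not used for prediction
```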

@sadsquirrel369
Author

Thanks for the help!

@sadsquirrel369
Author

Hi @paulbkoch,

When trying to merge the mains model with an interaction model, I get this error:

Inconsistent bin types within a model:

```
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
/var/folders/3b/lp8_hqx917138jd8rxttmzjc0000gn/T/ipykernel_35823/3985698511.py in <module>
----> 1 merge_ebms([loaded_model, loaded_int_model])

/opt/homebrew/Caskroom/miniforge/base/lib/python3.9/site-packages/interpret/glassbox/ebm/merge_ebms.py in merge_ebms(models)
    392     for model in models:
    393         if any(len(set(map(type, bin_levels))) != 1 for bin_levels in model.bins_):
--> 394             raise Exception("Inconsistent bin types within a model.")
    395
    396         feature_bounds = getattr(model, "feature_bounds_", None)

Exception: Inconsistent bin types within a model.
```

It appears the issue stems from some variables used in the interactions not having bin definitions in the mains model, because they were excluded there. The merge works correctly when the variables are present in both the mains and interaction models. However, some variables are only beneficial in an interaction and not on their own (for example, in vehicle classification, the combination of weight and power can help identify different vehicle types).

@paulbkoch
Collaborator

Hi @sadsquirrel369 -- This is really interesting. It appears you have a model where one of the feature mains is considered categorical or continuous, but a pair using the same feature is treated as the opposite. Are you doing any re-merging, where you first merge a set of models and then merge that result again with some other models, or is it happening on the first merge when the mains and interaction models are combined?

You can probably avoid this error by explicitly setting the feature_types parameter on all calls to the ExplainableBoostingClassifier constructor, thereby ensuring they are identical in all models being merged. This is something we could handle better within merge_ebms, though. We can convert a feature from categorical into continuous during merges, but perhaps this isn't completely robust to more complicated scenarios involving pairs.
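
For instance, a minimal sketch of pinning feature_types so every model bins the features identically before merging (the column types below are hypothetical):

```python
from interpret.glassbox import ExplainableBoostingClassifier

# The same explicit list is passed to every constructor so the mains and
# interaction models agree on how each feature is binned.
feature_types = ["continuous", "continuous", "nominal", "ordinal"]

ebm_mains = ExplainableBoostingClassifier(feature_types=feature_types, interactions=0)
ebm_pairs = ExplainableBoostingClassifier(feature_types=feature_types, exclude="mains", interactions=10)
```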

@ANNIKADAHLMANN-8451

ANNIKADAHLMANN-8451 commented Oct 10, 2024

I'm currently encountering this same error when trying to merge two EBMs. I have ~10 features, and I'm wondering if there's a streamlined way to specify all of these feature types. I'm getting the inconsistent bins error on the second merge (basically, I'm trying to batch-train an EBM since my data is larger than what fits in memory). I specify the data types via the feature_types parameter using the snippet below:

dtypes = [
    "continuous" if d == "float64"
    else None if d == "int64"
    else "ordinal" if col in ordinal_types
    else "nominal"
    for d, col in zip(X_trn.dtypes, X_trn.columns)
]

And when I try to implement the workaround suggested in issue #576, I still get the same error:

# Copy any hyperparameters present on clf1 but missing on clf (workaround from #576).
for attr, val in clf1.get_params(deep=False).items():
    if not hasattr(clf, attr):
        setattr(clf, attr, val)
