[python-package] Introduce refit_tree_manual to Booster class #6617
base: master
Conversation
Thanks for your continued interest in LightGBM.
At first glance, I'm -1 on this proposal.
- I don't understand what you mean when you say that set_leaf_output() does not update "the predictions underneath". Can you elaborate?
- Couldn't this same behavior be achieved by using a custom objective function? The tree output values are in terms of the loss function. (A sketch of the custom-objective mechanism follows after this list.)
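To make the custom-objective suggestion concrete: the scikit-learn interface accepts a callable objective returning the per-sample gradient and hessian. The sketch below is an editorial illustration, not code from this PR; the gradient formulas mirror the gamma/log-link ones used later in this thread.

```python
import numpy as np
import lightgbm as lgb

# Sketch of a custom objective for the scikit-learn API. LightGBM calls
# it with the labels and the current raw scores, and expects the
# per-sample gradient and hessian of the loss w.r.t. the raw score.
def gamma_objective(y_true, raw_score):
    mu = np.exp(-raw_score)
    grad = 1 - y_true * mu
    hess = y_true * mu
    return grad, hess

# The callable replaces the built-in 'gamma' objective string.
model = lgb.LGBMRegressor(objective=gamma_objective)
```

Whether this mechanism can also express honest splits is exactly what the following replies discuss.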
Thank you for the fast response! Maybe I am missing something, but:
- If you try to refit the tree using set_leaf_output, the updated leaf values are not reflected in the training scores, so subsequent trees are built against stale predictions.
- For the case of debiasing it may indeed be possible (I am not sure how), but for honest splits (splitting on one dataset and creating the leaf values on another) I do not think it is remotely possible.

All in all, I think this change will introduce a clean way to manage the training of the tree with a minimal user-facing API. There is also the precedent of the existing refit functionality.
I can provide some context for this PR. :)

As @neNasko1 mentioned, we are interested in implementing debiasing and honest splits.

With certain objective functions, predictions can be biased. For example, with the gamma loss, the mean of the predictions can end up differing from the mean of the observed outcomes. We could refit the leaf values once training is over, but by then the tree structure will have been contaminated.

Honest splits involve using one data set to determine the splits and another to compute the leaf values. It is a form of regularization: if a tree overfits and produces a spurious split, the leaf values should still end up similar, because outcomes in the second data set shouldn't differ across the spurious split. Honest splits also produce interesting statistical properties for researchers (see Wager and Athey (2018)): the resulting predictions can be shown to be asymptotically consistent, unbiased and normally distributed. Here, again, we could refit leaf values after training, but the tree structure will have been contaminated.

In both cases, we need to adjust the leaf values of each tree after its construction. We then need to compute the raw scores, accounting for the updated values, before constructing the next tree. We're planning on performing the adjustment in downstream callbacks, minimising the disruption in LightGBM itself. However, we need to be able to modify leaf values after each iteration and to have those modifications reflected in the raw scores that LightGBM will use in the following iteration (hence this PR). If LightGBM already offers similar functionality, we'd obviously prefer to use that. :)
@lorentzenchr might be interested too.
Thanks very much for the detailed explanation. To set the right expectation: I personally will need to do some significant reading on these topics to be able to provide a thoughtful review. We are very cautious here with expansions of the public API (both in the Python package and the C API)... such expansions are relatively easy to make, but relatively difficult to take back (because removing a part of the public API is a user-facing breaking change). That same comment applies to the proposal in #6568 as well. Hopefully some other reviewer who's more knowledgeable about these topics will be able to offer a review.
@lbittarello I read https://arxiv.org/abs/1510.04342, and I read the asymptotic results as valid for random forests, not so much for boosted trees. I have had good experiences with a multiplicative (1-parameter) recalibration of non-canonical losses with a log link; see scikit-learn/scikit-learn#26311.
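For context, a one-parameter multiplicative recalibration can be sketched as follows; this is an editorial illustration of the general idea behind the linked scikit-learn issue, not code from it:

```python
import numpy as np

def recalibration_factor(y_true, y_pred):
    # Single multiplicative factor c with mean(c * y_pred) == mean(y_true).
    # For a log-link model, multiplying predictions by c is the same as
    # adding log(c) to the raw scores.
    return np.mean(y_true) / np.mean(y_pred)

# Usage sketch: fit c on the training data, then apply it everywhere.
# c = recalibration_factor(y_train, model.predict(X_train))
# calibrated = c * model.predict(X_test)
```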
Maybe it is my bad for not explaining in enough detail. This change enables both honest splits and debiasing (as different implementations) to be implemented as callbacks to training. Debiasing in the case of the gamma loss (implemented as a test) can be as simple as offsetting the leaf values of the newly built tree; see the sketch below. The honest-split use case is not connected to debiasing: it is rather a way of reducing overfitting, and it can be applied to all types of losses. Again, it is worth noting that if a similar mechanism already exists (being able to change the leaf outputs during training and have the scores for the training data updated in an efficient way), we will obviously prefer to use it instead of contributing changes.
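As a rough sketch of the debiasing idea (an editorial illustration, not the test shipped with this PR): with a log-link loss such as the gamma loss, adding a constant c to every leaf of the newest tree multiplies all predictions by exp(c), so c can be chosen to match the mean prediction to the mean label. The leaf_values_from_dump helper is hypothetical; refit_tree_manual is the method proposed in this PR.

```python
import numpy as np

def leaf_values_from_dump(booster, tree_id):
    # Hypothetical helper: read one tree's leaf values out of dump_model(),
    # ordered by leaf_index.
    def walk(node, out):
        if 'leaf_index' in node:
            out[node['leaf_index']] = node['leaf_value']
        else:
            walk(node['left_child'], out)
            walk(node['right_child'], out)
    out = {}
    walk(booster.dump_model()['tree_info'][tree_id]['tree_structure'], out)
    return np.array([out[i] for i in range(len(out))])

class DebiasCallback:
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __call__(self, env):
        booster = env.model
        # Mean prediction including the tree built in this iteration.
        pred = booster.predict(self.data, num_iteration=env.iteration + 1)
        # Shifting every leaf of the newest tree by c multiplies log-link
        # predictions by exp(c); pick c so the means match.
        c = np.log(np.mean(self.labels) / np.mean(pred))
        leaves = leaf_values_from_dump(booster, env.iteration)
        # Proposed API: overwrite the leaf values and propagate the change
        # to the cached training scores before the next iteration.
        booster.refit_tree_manual(env.iteration, leaves + c)
```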
True. LightGBM also supports random forests, though, and honest splits are still useful as a regularisation strategy for GBMs. :)
As @neNasko1 explained, this PR also enables debiasing, which is independent from honest splitting.
Since all written above is a bit abstract, let me illustrate by example the utility honest splits can bring to a trained model.

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
class HonestSplitCallback():
    def __init__(self, data, labels):
        # Holdout ("honest") sample used only to recompute leaf values.
        self.score = 0
        self.data = data
        self.labels = labels

    def _init(self, env):
        self.learning_rate = env.model.params["learning_rate"]

    def _get_gradients(self):
        # Gradient and hessian of the gamma loss with a log link,
        # evaluated at the current raw scores of the holdout sample.
        y = self.labels
        mu = np.exp(-self.score)
        gradient = 1 - y * mu
        hessian = y * mu
        return gradient, hessian

    def _internal_refit(self, env):
        booster = env.model
        gradient, hessian = self._get_gradients()
        # Leaf assignment of each holdout row in the tree just built.
        predicted_leaves = booster.predict(
            self.data,
            pred_leaf=True,
            start_iteration=env.iteration,
            num_iteration=1,
        )
        # Newton step per leaf, computed from the holdout sample only.
        sums = pd.DataFrame(
            {'grad': gradient, 'hess': hessian, 'leaf': predicted_leaves}
        ).groupby('leaf').sum()
        sums['refitted'] = -sums['grad'] / sums['hess'] * self.learning_rate
        # Leaves that receive no holdout rows keep an output of zero.
        refitted_leaves = np.zeros(env.model.dump_model()['tree_info'][env.iteration]['num_leaves'])
        refitted_leaves[sums.index.to_numpy()] = sums['refitted'].to_numpy()
        # Overwrite the tree's leaf values; the proposed method also
        # propagates the change to the cached training scores.
        booster.refit_tree_manual(
            env.iteration,
            refitted_leaves
        )
        # Keep our own raw scores for the holdout sample in sync.
        self.score += refitted_leaves[predicted_leaves]

    def __call__(self, env):
        if env.iteration == env.begin_iteration:
            self._init(env)
        self._internal_refit(env)
# French motor third-party liability claims data (freMTPL2) from OpenML.
df = pd.read_csv("https://www.openml.org/data/get_csv/20649148/freMTPL2freq.arff", quotechar="'")
df_sev = pd.read_csv("https://www.openml.org/data/get_csv/20649149/freMTPL2sev.arff", index_col=0)
df.rename(lambda x: x.replace('"', ''), axis='columns', inplace=True)
df['IDpol'] = df['IDpol'].astype(np.int64)
df.set_index('IDpol', inplace=True)
df = df.join(df_sev.groupby(level=0).sum(), how='left')
df.fillna(value={'ClaimAmount': 0, 'ClaimAmountCut': 0}, inplace=True)
labelencoder = LabelEncoder()
df['Area'] = labelencoder.fit_transform(df['Area'])
df['VehBrand'] = labelencoder.fit_transform(df['VehBrand'])
df['VehGas'] = labelencoder.fit_transform(df['VehGas'])
df['Region'] = labelencoder.fit_transform(df['Region'])
df = df[df['ClaimAmount'] > 0]
df_train = df[['Area', 'VehPower', 'VehAge', 'DrivAge', 'BonusMalus', 'VehBrand', 'VehGas', 'Density', 'Region']]
# Random three-way split: 'train' grows the trees, 'hsplit' is used by the
# callback to refit the leaves, 'test' is held out for evaluation.
sample_column = np.random.choice(['train', 'hsplit', 'test'], len(df))
honest_split_callback = HonestSplitCallback(
df_train[sample_column == 'hsplit'],
df['ClaimAmount'][sample_column == 'hsplit']
)
bst = lgb.LGBMRegressor(
learning_rate=0.05,
n_estimators=500,
num_leaves=31,
objective='gamma'
).fit(
df_train[sample_column == 'train'], df['ClaimAmount'][sample_column == 'train'],
eval_set=[
(df_train[sample_column == 'train'], df['ClaimAmount'][sample_column == 'train']),
(df_train[sample_column == 'hsplit'], df['ClaimAmount'][sample_column == 'hsplit']),
(df_train[sample_column == 'test'], df['ClaimAmount'][sample_column == 'test']),
],
eval_names=["train", "hsplit", "test"],
callbacks=[
honest_split_callback,
lgb.log_evaluation(period=50),
],
eval_metric='gamma',
) The example is rudimentary and only for illustration purposes but training with/without and examining the "test" dataset we can see the presence of overfitting in the vanilla training and the absence of it in the honest-splits one: [50] train's gamma: 8.47428 hsplit's gamma: 8.75784 test's gamma: 8.79983
[100] train's gamma: 8.39185 hsplit's gamma: 8.78157 test's gamma: 8.86788
[150] train's gamma: 8.35027 hsplit's gamma: 8.79986 test's gamma: 8.91633
[200] train's gamma: 8.31865 hsplit's gamma: 8.82251 test's gamma: 8.93402
[250] train's gamma: 8.29276 hsplit's gamma: 8.83342 test's gamma: 8.96648
[300] train's gamma: 8.2702 hsplit's gamma: 8.84719 test's gamma: 8.9856
[350] train's gamma: 8.2532 hsplit's gamma: 8.86033 test's gamma: 9.02807
[400] train's gamma: 8.23536 hsplit's gamma: 8.87122 test's gamma: 9.05208
[450] train's gamma: 8.22125 hsplit's gamma: 8.87913 test's gamma: 9.06393
[500] train's gamma: 8.20966 hsplit's gamma: 8.88448 test's gamma: 9.07705

With honest splits:

[50] train's gamma: 219.634 hsplit's gamma: 172.385 test's gamma: 210.213
[100] train's gamma: 23.926 hsplit's gamma: 19.7791 test's gamma: 23.0972
[150] train's gamma: 9.61772 hsplit's gamma: 8.99064 test's gamma: 9.47934
[200] train's gamma: 8.96602 hsplit's gamma: 8.60529 test's gamma: 8.86329
[250] train's gamma: 8.94747 hsplit's gamma: 8.59725 test's gamma: 8.83827
[300] train's gamma: 8.94775 hsplit's gamma: 8.59631 test's gamma: 8.83784
[350] train's gamma: 8.94703 hsplit's gamma: 8.59619 test's gamma: 8.83751
[400] train's gamma: 8.94677 hsplit's gamma: 8.59619 test's gamma: 8.83749
[450] train's gamma: 8.947 hsplit's gamma: 8.59616 test's gamma: 8.83751
[500] train's gamma: 8.94747 hsplit's gamma: 8.59613 test's gamma: 8.83755

Moreover, examining the partial dependence plots, we get "smoother" results:

import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(bst, df_train, df_train.columns)
plt.show()

Without honest splits: [partial dependence plots omitted]

Note: it seems like the name
@jameslamb I'm generally in favor of adding this feature. It does not add a new complex algorithm to this package but rather enables implementing more complex algorithms (such as honest splits) on top of LightGBM.
To this end, we are only exposing a little more of LightGBM's internals here. However, I would even argue that we're not exposing any "implementation details": the new "hook" for properly modifying outputs after each boosting iteration is an "algorithmic detail", i.e. a "semantic hook", rather than something odd that LightGBM introduces (not sure if it is clear what I mean 😄).
@@ -4912,6 +4912,37 @@ def refit(
         new_booster._network = self._network
         return new_booster

+    def refit_tree_manual(self, tree_id: int, values: np.ndarray) -> "Booster":
As a user it would not yet be clear to me how this function is different to set_leaf_output, i.e. why is this not just called set_leaf_outputs?
Do you propose changing the name of the function, or just writing better docs? I am a bit unsure: calling this function set_leaf_outputs would be a bit strange, as it does additional things beyond just updating the leaf values.
This PR strives to introduce a way to refit a tree manually, for example during a callback. Currently the only related functionality is set_leaf_output; however, it does not update the predictions underneath. Enabling this will allow users to implement regularisation methods that are out of scope for the library, e.g. honest splits. The provided test also includes a debiasing callback, which creates a model whose mean is the same as the dataset mean.
I am open to discussion of this feature, and I am wondering if there is already some work in a related direction! A minimal usage sketch follows below.
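For illustration, here is a minimal usage sketch of the proposed method (made-up data; the contrast with set_leaf_output reflects the description above):

```python
import numpy as np
import lightgbm as lgb

X = np.random.rand(200, 5)
y = np.random.rand(200) + 1.0
bst = lgb.train({'objective': 'gamma', 'num_leaves': 7},
                lgb.Dataset(X, y), num_boost_round=10)

# One value per leaf of tree 0.
num_leaves = bst.dump_model()['tree_info'][0]['num_leaves']
new_values = np.zeros(num_leaves)

# set_leaf_output(tree_id, leaf_id, value) changes one leaf but leaves the
# cached training scores untouched; the proposed method replaces all leaf
# values of a tree and updates the predictions underneath as well.
bst.refit_tree_manual(0, new_values)
```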