[python-package] Introduce refit_tree_manual to Booster class. #6617

Open · wants to merge 12 commits into base: master

Conversation

neNasko1 (Contributor)

This PR strives to introduce a way to refit a tree manually, for example during a callback. Currently, the only related functionality is set_leaf_output; however, it does not update the predictions underneath.

Enabling this will allow users to implement regularisation methods that are out of scope for the library, e.g. honest splits. The provided test also includes a debiasing callback, which creates a model whose mean is the same as the dataset mean.
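For reviewers who want to see the intended call shape before reading the diff, here is a minimal, purely illustrative sketch (the callback and its name are made up; refit_tree_manual is the method this PR adds):

import numpy as np
import lightgbm as lgb

def zero_out_newest_tree(env: "lgb.callback.CallbackEnv") -> None:
    # Purely illustrative: overwrite every leaf of the tree grown in this
    # iteration with 0.0. With the proposed API the cached training scores are
    # updated as well, so the next boosting round "sees" the change.
    booster = env.model
    num_leaves = booster.dump_model()["tree_info"][env.iteration]["num_leaves"]
    booster.refit_tree_manual(env.iteration, np.zeros(num_leaves))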

I am open to discussion of this feature and I am wondering if there is already some work in a related direction!

@jameslamb (Collaborator) left a comment

Thanks for your continued interest in LightGBM.

At first glance, I'm -1 on this proposal.

  1. I don't understand what you mean when you say that set_leaf_output() does not update "the predictions underneath". Can you elaborate?
  2. Couldn't this same behavior be achieved by using a custom objective function? The tree output values are in terms of the loss function.

@neNasko1 (Contributor, Author) commented Aug 16, 2024

Thank you for the fast response!

Maybe I am missing something but:

  1. I don't understand what you mean when you say that set_leaf_output() does not update "the predictions underneath". Can you elaborate?

If you try to refit the tree with set_leaf_output calls from a callback, those updates are essentially ignored in the next boosting rounds, because they do not update the score (i.e. via AddScore).
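To make the distinction concrete, here is a sketch of the two code paths inside a callback, assuming new_leaf_values has already been computed (the variable name is illustrative):

# Existing API: rewrites the stored leaf values of the newest tree one by one,
# but the cached per-row training scores are not touched, so the next boosting
# round still computes gradients against the old outputs.
for leaf_id, value in enumerate(new_leaf_values):
    booster.set_leaf_output(env.iteration, leaf_id, value)

# Proposed API: rewrites the leaf values and also updates the training scores
# (AddScore on the C++ side), so the next round builds on the new outputs.
booster.refit_tree_manual(env.iteration, new_leaf_values)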

  2. Couldn't this same behavior be achieved by using a custom objective function? The tree output values are in terms of the loss function.

For the case of debiasing it may indeed be possible (I am not sure how), but for honest splits (splitting on one dataset and computing the leaf values on another) I do not think it is remotely possible.

All in all, I think this change introduces a clean way to manage the training of the trees with a minimal user-facing API. There is also precedent in the existence of rollback_one_iter.

@lbittarello

I can provide some context for this PR. :)

As @neNasko1 mentioned, we are interested in implementing debiasing and honest splits.

With certain objective functions, predictions can be biased. For example, the "gamma" and "tweedie" inbuilt objectives typically undershoot. That occurs because they apply the exponential link function to the raw scores (which is not canonical in a GLM sense). In business contexts, bias can be unacceptable (say, you shouldn't systematically underestimate risk in insurance). While we could patch leaf values in one go at the end of training, the tree structure will have been contaminated by the bias in each iteration. We could also use a different objective function, but that has drawbacks too (poorer fit overall, loss of multiplicative structure, etc.).

Honest splits involve using one data set to determine the splits and another to compute the leaf values. It is a form of regularization: if a tree overfits and produces a spurious split, the leaf values should still end up similar, because outcomes in the second data set shouldn't differ across the spurious split. Honest splits also produce interesting statistical properties for researchers (see Wager and Athey (2018)): the resulting predictions can be shown to be asymptotically consistent, unbiased and normally distributed. Here, again, we could refit leaf values after training, but the tree structure will have been contaminated.

In both cases, we need to adjust the leaf values of each tree after its construction. We then need to compute the raw scores, accounting for the updated values, before constructing the next tree. We're planning on performing the adjustment in downstream callbacks, minimising the disruption in LightGBM itself. However, we need to be able to modify leaf values after each iteration and to have those modifications reflected in the raw scores that LightGBM will use in the following iteration (hence this PR).

If LightGBM already offers similar functionality, we'd obviously prefer to use that. :)

@lbittarello

@lorentzenchr might be interested too.

@jameslamb (Collaborator)

Thanks very much for the detailed explanation.

To set the right expectation, I personally will need to do some significant reading on these topics to be able to provide a thoughtful review. We are very cautious here with expansions of the public API (both in the Python package and the C API)... such expansions are relatively easy to make, but relatively difficult to take back (because removing a part of the public API is a user-facing breaking change). That same comment applies to the proposal in #6568 as well.

Hopefully some other reviewer who's more knowledgeable about these topics will be able to offer a review.

@lorentzenchr (Contributor)

@lbittarello I read https://arxiv.org/abs/1510.04342 and see the asymptotic results as valid for random forests, not so much for boosted trees.
The bias of, e.g., the Gamma deviance with a log-link indeed stems from the fact that the log-link is non-canonical for the Gamma deviance. IMHO, this won't change with honest trees:
even if the leaf values are unbiased, they live on the log scale, and the exponentiation of values introduces the bias (same for gradient boosting as for GLMs).

I have had good experiences with a multiplicative (1 parameter) recalibration of non-canonical losses with a log-link, see scikit-learn/scikit-learn#26311.
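For reference, the recalibration mentioned above boils down to a single multiplicative factor on the predicted means; a rough sketch (not the scikit-learn implementation itself, and the names are illustrative):

import numpy as np

def recalibrate_multiplicative(y: np.ndarray, mu: np.ndarray) -> np.ndarray:
    # One-parameter recalibration: rescale the predicted means so that their
    # sum matches the sum of the observed targets.
    # With a log-link this is equivalent to adding log(c) to the raw scores.
    c = y.sum() / mu.sum()
    return c * mu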

@neNasko1 (Contributor, Author) commented Aug 27, 2024

@lorentzenchr

Maybe it is my fault for not explaining in enough detail. This change enables both honest splits and debiasing (i.e. in different implementations) to be implemented as callbacks to training. Debiasing in the case of the gamma loss (implemented as a test) can be as simple as offsetting the raw scores after tree training.
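As a rough sketch of that offsetting step for a log-link objective (the names are illustrative; raw_score would be the model's raw predictions on the training data and leaf_values the newest tree's outputs):

import numpy as np

def debiasing_offset(y: np.ndarray, raw_score: np.ndarray, leaf_values: np.ndarray) -> np.ndarray:
    # With a log-link the mean prediction is mean(exp(raw_score)); adding the
    # same constant to every leaf of the newest tree shifts all raw scores by
    # that constant, so this offset makes the mean prediction match mean(y).
    # The result would then be committed with refit_tree_manual(iteration, ...).
    offset = np.log(y.mean() / np.exp(raw_score).mean())
    return leaf_values + offset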

The honest-split use case is not connected to debiasing; rather, it is a way of reducing overfitting and can be applied to all types of losses.

Again, it is worth noting that if a similar mechanism already exists (being able to change the leaf outputs during training and have the training-data scores updated efficiently), we will obviously prefer to use it instead of contributing changes.

@lbittarello

the asymptotic results as valid for random forests, not so much for boosted trees.

True. LightGBM also supports random forests, though, and honest splits are still useful as a regularisation strategy for GBMs. :)

The bias of, e.g., the Gamma deviance with a log-link indeed stems from the fact that the log-link is non-canonical for the Gamma deviance. IMHO, this won't change with honest trees

As @neNasko1 explained, this PR also enables debiasing, which is independent of honest splitting.

@neNasko1 (Contributor, Author) commented Aug 28, 2024

Since everything written above is a bit abstract, let me illustrate by example the utility honest splits can bring to a trained model.
This example uses the French Motor Claims datasets. Here is sample code that implements honest-split training, using a callback to refit after each training step:

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder

class HonestSplitCallback():
    """Refit each newly grown tree on a held-out ("honest") split via a callback."""

    def __init__(self, data, labels):
        self.score = 0      # running sum of the refitted leaf contributions on the honest split
        self.data = data
        self.labels = labels

    def _init(self, env):
        self.learning_rate = env.model.params["learning_rate"]

    def _get_gradients(self):
        # Gradient and hessian of the gamma deviance with log-link,
        # taken with respect to the raw score, evaluated on the honest split.
        y = self.labels
        mu = np.exp(-self.score)

        gradient = 1 - y * mu
        hessian = y * mu

        return gradient, hessian

    def _internal_refit(self, env):
        booster = env.model

        gradient, hessian = self._get_gradients()

        # Leaf index of every honest-split row in the tree grown this iteration.
        predicted_leaves = booster.predict(
            self.data,
            pred_leaf=True,
            start_iteration=env.iteration,
            num_iteration=1,
        )

        # Newton step per leaf, computed on the honest split and scaled by the learning rate.
        sums = pd.DataFrame({'grad': gradient, 'hess': hessian, 'leaf': predicted_leaves}).groupby('leaf').sum()
        sums['refitted'] = -sums['grad'] / sums['hess'] * self.learning_rate
        refitted_leaves = np.zeros(env.model.dump_model()['tree_info'][env.iteration]['num_leaves'])
        refitted_leaves[sums.index.to_numpy()] = sums['refitted'].to_numpy()

        # Overwrite the leaf values of the newest tree and propagate the change
        # into the cached training scores (the API proposed in this PR).
        booster.refit_tree_manual(
            env.iteration,
            refitted_leaves
        )
        self.score += refitted_leaves[predicted_leaves]

    def __call__(self, env):

        if env.iteration == env.begin_iteration:
            self._init(env)

        self._internal_refit(env)

df = pd.read_csv("https://www.openml.org/data/get_csv/20649148/freMTPL2freq.arff", quotechar="'")
df_sev = pd.read_csv("https://www.openml.org/data/get_csv/20649149/freMTPL2sev.arff", index_col=0)

df.rename(lambda x: x.replace('"', ''), axis='columns', inplace=True)
df['IDpol'] = df['IDpol'].astype(np.int64)
df.set_index('IDpol', inplace=True)

df = df.join(df_sev.groupby(level=0).sum(), how='left')
df.fillna(value={'ClaimAmount': 0, 'ClaimAmountCut': 0}, inplace=True)

labelencoder = LabelEncoder()

df['Area'] = labelencoder.fit_transform(df['Area'])
df['VehBrand'] = labelencoder.fit_transform(df['VehBrand'])
df['VehGas'] = labelencoder.fit_transform(df['VehGas'])
df['Region'] = labelencoder.fit_transform(df['Region'])

df = df[df['ClaimAmount'] > 0]

df_train = df[['Area', 'VehPower', 'VehAge', 'DrivAge', 'BonusMalus', 'VehBrand', 'VehGas', 'Density', 'Region']]
sample_column = np.random.choice(['train', 'hsplit', 'test'], len(df))

honest_split_callback = HonestSplitCallback(
    df_train[sample_column == 'hsplit'],
    df['ClaimAmount'][sample_column == 'hsplit']
)

bst = lgb.LGBMRegressor(
    learning_rate=0.05,
    n_estimators=500,
    num_leaves=31,
    objective='gamma'
).fit(
    df_train[sample_column == 'train'], df['ClaimAmount'][sample_column == 'train'],
    eval_set=[
        (df_train[sample_column == 'train'], df['ClaimAmount'][sample_column == 'train']),
        (df_train[sample_column == 'hsplit'], df['ClaimAmount'][sample_column == 'hsplit']),
        (df_train[sample_column == 'test'], df['ClaimAmount'][sample_column == 'test']),
    ],
    eval_names=["train", "hsplit", "test"],
    callbacks=[
        honest_split_callback,
        lgb.log_evaluation(period=50),
    ],
    eval_metric='gamma',
)

The example is rudimentary and only for illustration purposes, but training with and without the callback and examining the "test" dataset, we can see overfitting present in the vanilla training and absent in the honest-splits run:
log_evaluation(period=50) without honest splits:

[50]	train's gamma: 8.47428	hsplit's gamma: 8.75784	test's gamma: 8.79983
[100]	train's gamma: 8.39185	hsplit's gamma: 8.78157	test's gamma: 8.86788
[150]	train's gamma: 8.35027	hsplit's gamma: 8.79986	test's gamma: 8.91633
[200]	train's gamma: 8.31865	hsplit's gamma: 8.82251	test's gamma: 8.93402
[250]	train's gamma: 8.29276	hsplit's gamma: 8.83342	test's gamma: 8.96648
[300]	train's gamma: 8.2702	hsplit's gamma: 8.84719	test's gamma: 8.9856
[350]	train's gamma: 8.2532	hsplit's gamma: 8.86033	test's gamma: 9.02807
[400]	train's gamma: 8.23536	hsplit's gamma: 8.87122	test's gamma: 9.05208
[450]	train's gamma: 8.22125	hsplit's gamma: 8.87913	test's gamma: 9.06393
[500]	train's gamma: 8.20966	hsplit's gamma: 8.88448	test's gamma: 9.07705

log_evaluation(period=50) with honest splits:

[50]	train's gamma: 219.634	hsplit's gamma: 172.385	test's gamma: 210.213
[100]	train's gamma: 23.926	hsplit's gamma: 19.7791	test's gamma: 23.0972
[150]	train's gamma: 9.61772	hsplit's gamma: 8.99064	test's gamma: 9.47934
[200]	train's gamma: 8.96602	hsplit's gamma: 8.60529	test's gamma: 8.86329
[250]	train's gamma: 8.94747	hsplit's gamma: 8.59725	test's gamma: 8.83827
[300]	train's gamma: 8.94775	hsplit's gamma: 8.59631	test's gamma: 8.83784
[350]	train's gamma: 8.94703	hsplit's gamma: 8.59619	test's gamma: 8.83751
[400]	train's gamma: 8.94677	hsplit's gamma: 8.59619	test's gamma: 8.83749
[450]	train's gamma: 8.947	hsplit's gamma: 8.59616	test's gamma: 8.83751
[500]	train's gamma: 8.94747	hsplit's gamma: 8.59613	test's gamma: 8.83755

Moreover, examining the partial dependence plots, we get "smoother" results:

import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(bst, df_train, df_train.columns)
plt.show()

Without honest splits: [WithoutHSplits plot]
With honest splits: [WithHSplits plot]

Note: it seems like the name refit_tree_manual may be a bit misleading, as we are essentially overwriting the leaves and committing the changes to the training procedure (again, if this is possible to do with the already existing codebase, we will be happy to close the PR).

@borchero (Collaborator) left a comment

@jameslamb I'm generally in favor of adding this feature. It does not add a new complex algorithm to this package but rather enables implementing more complex algorithms (such as honest splits) on top of LightGBM.

To this end, we are only exposing a little more of LightGBM's internals here. However, I would even argue that we're not exposing any "implementation details": the new "hook" to properly modify outputs after each boosting iteration is an "algorithmic detail", i.e. a "semantic hook", rather than something odd that LightGBM introduces (not sure if it is clear what I mean 😄).

@@ -4912,6 +4912,37 @@ def refit(
        new_booster._network = self._network
        return new_booster

    def refit_tree_manual(self, tree_id: int, values: np.ndarray) -> "Booster":
A Collaborator commented on the diff:

As a user, it would not yet be clear to me how this function differs from set_leaf_output, i.e. why is it not just called set_leaf_outputs?

@neNasko1 (Contributor, Author) Sep 2, 2024

Do you propose changing the name of the function, or just writing better docs? I think calling this function set_leaf_outputs would be a bit strange, as it does more than just update the leaf values.

Outdated review threads (resolved): src/boosting/gbdt.cpp (×2), tests/python_package_test/test_engine.py