Aggregation Bug Fixed & Support Provided for Example Datasets #12

Open
wants to merge 10 commits into master
Conversation

@Garen-Wang Garen-Wang commented Jan 27, 2021

When running the experiments directly according to the sample, an error occurs because a dict is passed as the parameter of the `agg` function, a usage that pandas has deprecated.

Changing the parameter's type to a list solves this problem, so there is no longer any need to pin pandas==0.25.
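A minimal sketch of the fix, assuming the call looks roughly like the deprecated pattern below (the frame and column names here are illustrative, not the repository's actual code):

```python
import pandas as pd

# Toy frame standing in for the repo's data; names are illustrative.
df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})

# Deprecated in pandas 0.25 and removed in 1.0: a nested dict passed to
# agg for renaming, e.g.
#   df.groupby("group").agg({"value": {"total": "sum"}})
# now raises SpecificationError on current pandas.

# Passing a list of aggregation functions instead works on all recent
# pandas versions, so no version pin is needed:
result = df.groupby("group").agg({"value": ["sum", "mean"]})
```

The resulting columns form a MultiIndex such as `("value", "sum")`; if the old code relied on the renamed columns, `pd.NamedAgg` is the modern replacement for per-column renaming.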


Jan 29 update:

When running the experiments on the example datasets directly, feature importance and AUC are in fact not calculated correctly.

For instance, the AUC is always 0.5, and the feature importance for the heart dataset looks like this:

   feature_name  split  gain  gain_percent  split_percent  feature_score
0           age      0   0.0           NaN            NaN            NaN
1           sex      0   0.0           NaN            NaN            NaN
2    chest-pain      0   0.0           NaN            NaN            NaN
3    bp-resting      0   0.0           NaN            NaN            NaN
4   cholesterol      0   0.0           NaN            NaN            NaN
5    bs-fasting      0   0.0           NaN            NaN            NaN
6   ecg-resting      0   0.0           NaN            NaN            NaN
7        hr-max      0   0.0           NaN            NaN            NaN
8           eia      0   0.0           NaN            NaN            NaN
9       oldpeak      0   0.0           NaN            NaN            NaN
10    k-oldpeak      0   0.0           NaN            NaN            NaN
11      vessels      0   0.0           NaN            NaN            NaN
12         thal      0   0.0           NaN            NaN            NaN

The cause is one of LightGBM's parameters, min_data: it must be set to a value that suits the number of instances in our own dataset. After adjusting it, the feature importance becomes normal:

   feature_name  split        gain  gain_percent  split_percent  feature_score
7        hr-max     75  173.631106     11.917672      23.291925       0.198796
4   cholesterol     48   76.677835      5.263005      14.906832       0.120137
0           age     48   70.926419      4.868240      14.906832       0.118953
2    chest-pain     14  396.282254     27.199977       4.347826       0.112035
9       oldpeak     39  125.718957      8.629084      12.111801       0.110670
12         thal     14  290.041550     19.907839       4.347826       0.090158
11      vessels     20  174.509817     11.977985       6.211180       0.079412
3    bp-resting     27   37.454598      2.570804       8.385093       0.066408
1           sex     13   35.071794      2.407254       4.037267       0.035483
6   ecg-resting     10   17.174318      1.178809       3.105590       0.025276
8           eia      8   29.409258      2.018589       2.484472       0.023447
10    k-oldpeak      6   30.023399      2.060743       1.863354       0.019226
5    bs-fasting      0    0.000000      0.000000       0.000000       0.000000
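A sketch of the adjustment, assuming a plain parameter dict is passed to LightGBM (the instance count and the scaling heuristic are illustrative; `min_data` is LightGBM's documented alias of `min_data_in_leaf`):

```python
# Hypothetical training configuration; parameter names follow LightGBM's
# documented aliases. n_instances and the divisor are illustrative.
n_instances = 303  # e.g. the heart dataset

params = {
    "objective": "binary",
    "metric": "auc",
    # LightGBM's default min_data_in_leaf is 20. On a small dataset,
    # leaves may never reach that count, so the trees make no splits,
    # every feature importance stays 0, and predictions collapse to the
    # prior (AUC = 0.5). Scale it down with the dataset size instead:
    "min_data": max(1, n_instances // 30),
}
```

With `n_instances = 303` this yields `min_data = 10`, small enough for splits to occur while still limiting overfitting on tiny leaves.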

With these changes, the experiments can be extended to other datasets beyond the examples.

@Garen-Wang Garen-Wang changed the title bug fixed when aggregating Aggregation Bug Fixed & Support Provided for Example Datasets Jan 29, 2021