Reconsider balancing data by resampling? #323

Open
joelostblom opened this issue Dec 13, 2023 · 0 comments

@joelostblom
Collaborator

For the future, maybe we should reconsider the recommendation to rebalance data by duplicating observations in this section: https://python.datasciencebook.ca/pull317/classification1.html#balancing. Both this year and last, I have encountered students who find that the optimal K is 1 when they do this. From visualizing the data, it is impossible to see the reason: an exact copy of the data point they are predicting is hiding underneath it, so the single nearest neighbor is that copy. Maybe this can be avoided by sampling in a smarter way, but we introduce balancing in the first classification chapter, before any evaluation has been introduced. If we do a smarter resampling (I'm not sure what this would look like or if it is possible), we might have to move it later as a more advanced topic.
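A minimal sketch of how this can come about, using made-up data rather than the book's data set. The column names (`x`, `y`, `label`), the class sizes, and the choice of 5-fold cross-validation are all assumptions for illustration. It upsamples the rare class with replacement, counts how many rows end up having an exact copy, and prints cross-validated accuracy for a few values of K before and after balancing. The exact scores depend on the random data, so no particular output is claimed here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# Hypothetical imbalanced data: two overlapping classes, 100 vs. 15 points.
n_major, n_minor = 100, 15
centers = np.repeat([0.0, 1.0], [n_major, n_minor])
data = pd.DataFrame({
    "x": rng.normal(centers, 1.0),
    "y": rng.normal(centers, 1.0),
    "label": ["major"] * n_major + ["minor"] * n_minor,
})

# Balance by duplicating rare-class rows (sampling with replacement).
major = data[data["label"] == "major"]
minor = data[data["label"] == "minor"]
minor_upsampled = minor.sample(n=n_major, replace=True, random_state=1)
balanced = pd.concat([major, minor_upsampled], ignore_index=True)

# Rows that have at least one exact copy elsewhere in the balanced data.
n_copies = int(balanced.duplicated(keep=False).sum())
print(f"{n_copies} of {len(balanced)} rows have an exact copy")


def cv_accuracy(df, k):
    """Mean 5-fold cross-validated accuracy of a K-nearest neighbors classifier."""
    model = KNeighborsClassifier(n_neighbors=k)
    return cross_val_score(model, df[["x", "y"]], df["label"], cv=5).mean()


# Compare a few values of K on the original and the balanced data.
for k in (1, 5, 15):
    print(k, round(cv_accuracy(data, k), 3), round(cv_accuracy(balanced, k), 3))
```

Whenever a copy of a held-out rare-class point lands in the training fold, K=1 finds it at distance zero and predicts the right label, which would favor small K on the balanced data. A scatter plot does not reveal this because the copies are drawn exactly on top of each other.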
