Training & testing #12
@ssobel The issue with putting an optimizer (here, boosting) on top of another optimizer is that the validation set gets overfitted: the model overfits the training set heavily and the validation set to a large extent. This is also why you are seeing great performance on the validation set but poor performance on the test set. Unfortunately, there is no known theoretical workaround: it is bound to overfit the validation set. That is the theory, though, because practice says otherwise. In practice, we use nested cross-validation for such a scenario. This also means it gets computationally expensive (for instance, a 5x 5-fold cross-validation means 25 validation sets and 25 test sets... it parallelizes very well, linearly with the number of cores available, provided there is enough RAM). Try the following, a 5x 5-fold cross-validation:
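A minimal sketch of what such a nested 5x 5-fold loop can look like in R, under some assumptions: kfold() is taken to return a list of held-out row indices per fold, auc_fun is a placeholder for whatever AUC function you use (for example pROC::auc), and the ... stands for the remaining CascadeForest arguments from your original call.
# Sketch only: nested 5x 5-fold cross-validation = 25 (validation, test) evaluations
outer_folds <- kfold(labels_alltrain, k = 5)            # outer folds act as test sets
test_auc <- numeric(0)
for (o in seq_along(outer_folds)) {
  idx_test    <- outer_folds[[o]]
  d_rest      <- d_alltrain[-idx_test, ]
  labels_rest <- labels_alltrain[-idx_test]
  inner_folds <- kfold(labels_rest, k = 5)              # inner folds act as validation sets
  for (i in seq_along(inner_folds)) {
    idx_valid <- inner_folds[[i]]
    model <- CascadeForest(training_data     = d_rest[-idx_valid, ],
                           validation_data   = d_rest[idx_valid, ],
                           training_labels   = labels_rest[-idx_valid],
                           validation_labels = labels_rest[idx_valid], ...)
    preds    <- predict(model, d_alltrain[idx_test, ])  # score the untouched outer test fold
    test_auc <- c(test_auc, auc_fun(labels_alltrain[idx_test], preds))
  }
}
mean(test_auc)                                          # average over the 25 evaluations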
Hi Laurae, thank you for responding. I am not sure I completely understand; would you mind clarifying? Very much appreciated in advance. My data is:
If I am interpreting your message above correctly, then I should do this:
Then: folds <- kfold(train_labels, k = 5)
But can you help me understand how to set up the model(s)? I imagine something like the sketch below, but I am not sure.
model <- CascadeForest(training_data = ?,
Thanks again.
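For instance, the fold split can drive one CascadeForest model per fold, with the held-out fold used as the validation set. This is only a sketch: it assumes kfold() returns a list of held-out row indices, `train` is a placeholder name for the predictor table that goes with train_labels, and ... again stands for the remaining CascadeForest arguments.
# Sketch only: one CascadeForest model per fold, the fold itself serving as validation data
folds  <- kfold(train_labels, k = 5)
models <- vector("list", length(folds))
for (i in seq_along(folds)) {
  idx <- folds[[i]]                                      # rows held out for validation in fold i
  models[[i]] <- CascadeForest(training_data     = train[-idx, ],
                               validation_data   = train[idx, ],
                               training_labels   = train_labels[-idx],
                               validation_labels = train_labels[idx], ...)
}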
Hello Laurae, thanks for your earlier response to my question about emulating daForest.
This time I have a question somewhat related to validation_data=NULL #6, in that I want to make sure I understand how to properly do training & testing so as to avoid overfitting. I tried running CascadeForest and got excellent results on training and held-out validation data (where I knew the labels), but when I applied the model to test data (exclusive of my train & validation data, where I did not know the labels but the contest website gave me my score), the model did not perform very well. So, I believe I am overfitting.
Basically, I trained CascadeForest using d_train & d_valid like this:
CascadeForest(training_data = d_train,
validation_data = d_valid,
training_labels = labels_train,
validation_labels = labels_valid, ...)
Where: d_train & labels_train = predictor columns & known labels (65% of my total training data)
d_valid & labels_valid = predictor columns & known labels, exclusive of d_train (the other 35% of my total training data).
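For reference, such a split can be produced with a plain random sample. This is a sketch only, reusing the object names from the cross-validation call further down (d_alltrain, labels_alltrain); a stratified split would work just as well.
# Sketch only: a random 65% / 35% train/validation split of the full training data
set.seed(11111)
idx_train    <- sample(nrow(d_alltrain), size = floor(0.65 * nrow(d_alltrain)))
d_train      <- d_alltrain[idx_train, ]
d_valid      <- d_alltrain[-idx_train, ]
labels_train <- labels_alltrain[idx_train]
labels_valid <- labels_alltrain[-idx_train]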
My AUC was something like 0.96 both when I predicted on d_train and when I predicted on d_valid. That made me happy, so I then applied the predict function to d_test, which is exclusive of d_train and d_valid and where I don't know the true labels. I submitted my predictions to the contest website and got a 0.75 AUC, not nearly as good as 0.96.
So that made me think I should use cross-validation in CascadeForest, like this:
CascadeForest(training_data = d_alltrain,
validation_data = NULL,
training_labels = labels_alltrain,
validation_labels = NULL, ...)
Where: d_alltrain is all my training data (= 65% + 35% = 100%), and labels_alltrain is all my known labels for all my training data.
But I got the error noted in validation_data=NULL #6. I have not yet tried the fix you suggested for making those lines of code work for cross-validation, but is this the proper way to do cross-validation? And if the model then reports a good AUC (cross-validated on d_alltrain) and I apply it to d_test, is that the proper way to avoid overfitting, so that I should hope for a better score?
Thank you very much.