Training & testing #12

Open
ssobel opened this issue Jul 16, 2018 · 2 comments


ssobel commented Jul 16, 2018

Hello Laurae, thanks for your earlier response to my question about emulating daForest.

This time I have a question somewhat related to validation_data=NULL #6, in that I want to make sure I understand how to do training & testing properly so I can avoid overfitting. I ran CascadeForest and got excellent results on the training data and on held-out validation data (where I knew the labels), but when I applied the model to test data (exclusive of my train & validation data, where I did not know the labels but the contest website gave me my score), the model did not perform well. So I believe I am overfitting.

Basically, I trained CascadeForest using d_train & d_valid like this:

CascadeForest(training_data = d_train,
              validation_data = d_valid,
              training_labels = labels_train,
              validation_labels = labels_valid, ...)

Where:
  • d_train & labels_train = predictor columns & known labels (65% of my total training data)
  • d_valid & labels_valid = predictor columns & known labels, exclusive of d_train (the other 35% of my total training data)
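
For completeness, the split itself is nothing fancy: just a random 65% / 35% cut in base R, roughly like this (d_alltrain & labels_alltrain are my full training table and labels, defined further below; the seed is arbitrary):

set.seed(2018)  # arbitrary seed, only so the split is reproducible
idx_train <- sample(nrow(d_alltrain), size = floor(0.65 * nrow(d_alltrain)))

d_train      <- d_alltrain[idx_train, ]
labels_train <- labels_alltrain[idx_train]
d_valid      <- d_alltrain[-idx_train, ]
labels_valid <- labels_alltrain[-idx_train]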

My AUC was something like 0.96 when I predicted on d_train, and about the same when I predicted on d_valid. That made me happy, so I then applied the predict function to d_test, which is exclusive of d_train and d_valid and where I don't know the true labels. When I submitted those predictions to the contest website, I got an AUC of 0.75, nowhere near 0.96.
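
(To be explicit about how I measured that locally: something along these lines, where preds_train & preds_valid stand for the predicted probabilities on each set, and pROC is just one of several packages that can compute AUC.)

# preds_train / preds_valid = predicted probabilities on d_train / d_valid
auc_train <- as.numeric(pROC::roc(labels_train, preds_train)$auc)  # ~0.96
auc_valid <- as.numeric(pROC::roc(labels_valid, preds_valid)$auc)  # ~0.96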

So that made me think I should use cross-validation in CascadeForest, like this:

CascadeForest(training_data = d_alltrain,
              validation_data = NULL,
              training_labels = labels_alltrain,
              validation_labels = NULL, ...)

Where: d_alltrain is all of my training data (65% + 35% = 100%), and labels_alltrain is the known labels for all of that data.

But I got the error noted in validation_data=NULL #6. I have not yet tried the code fix you suggested there to make cross-validation work, but is this the proper way to do cross-validation? And if the cross-validated AUC on d_alltrain looks good and I then apply the model to d_test, is that the right way to avoid overfitting, so that I should hope for a better score?
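
(For clarity, by cross-validation I mean something equivalent to this manual 5-fold loop. It is only a sketch of the fold bookkeeping in base R, with the rest of the CascadeForest arguments going where the call is indicated.)

set.seed(2018)                               # arbitrary seed
n <- nrow(d_alltrain)
fold_id <- sample(rep(1:5, length.out = n))  # assign every row to one of 5 folds

cv_auc <- numeric(5)
for (k in 1:5) {
  tr <- which(fold_id != k)                  # 4 folds (~80%) used for training
  va <- which(fold_id == k)                  # 1 fold (~20%) used for validation
  # fit CascadeForest(training_data = d_alltrain[tr, ],
  #                   validation_data = d_alltrain[va, ],
  #                   training_labels = labels_alltrain[tr],
  #                   validation_labels = labels_alltrain[va], ...)
  # and set cv_auc[k] to the AUC of its predictions on d_alltrain[va, ]
}
# mean(cv_auc) would be the cross-validated estimate, before ever touching d_test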

Thank you very much.


Laurae2 commented Jul 16, 2018

@ssobel The issue with putting an optimizer (here, boosting) on top of another optimizer is that the validation set gets overfit (the model overfits the training set heavily and the validation set quite a lot too).

This is also why you are seeing great performance on the validation set but poor performance on the test set. Unfortunately, there is no known theoretical workaround for it: the model is bound to overfit the validation set. That is the theory, at least, because practice says otherwise.

In practice, we use nested cross-validation for such a scenario. This also means it gets computationally expensive (for instance, a 5x 5-fold cross-validation means 25 validations and 25 test evaluations; it parallelizes almost linearly with the number of cores available, provided there is enough RAM).

Try the following, a 5x 5-fold cross-validation (sketched in code after the list):

  • 64% train data
  • 16% validation data
  • 20% test data
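
In index terms, one layout that matches those percentages, using your d_alltrain / labels_alltrain names, is roughly the following. It is only a sketch of the fold bookkeeping; the CascadeForest call itself takes whatever other arguments you normally pass.

# 5 outer test folds (20% each); for each, an inner 5-fold split of the
# remaining 80% into 64% train / 16% validation (relative to all data)
set.seed(2018)                                 # arbitrary seed
n <- nrow(d_alltrain)
outer_id <- sample(rep(1:5, length.out = n))   # outer fold of every row

test_auc <- matrix(NA_real_, nrow = 5, ncol = 5)
for (i in 1:5) {
  test_idx <- which(outer_id == i)             # 20%: held-out test fold
  rest_idx <- which(outer_id != i)             # 80% left for train + validation
  inner_id <- sample(rep(1:5, length.out = length(rest_idx)))
  for (j in 1:5) {
    valid_idx <- rest_idx[inner_id == j]       # 16% of all data: validation
    train_idx <- rest_idx[inner_id != j]       # 64% of all data: training
    # fit CascadeForest(training_data = d_alltrain[train_idx, ],
    #                   validation_data = d_alltrain[valid_idx, ],
    #                   training_labels = labels_alltrain[train_idx],
    #                   validation_labels = labels_alltrain[valid_idx], ...),
    # predict on d_alltrain[test_idx, ] and store the AUC in test_auc[i, j]
  }
}
# mean(test_auc) is then the nested cross-validated estimate of test performance

The 25 (train, validation, test) triples are independent of each other, which is why it parallelizes so well.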


ssobel commented Jul 19, 2018

Hi Laurae, thank you for responding. I am not sure I completely understand; would you mind clarifying? Very much appreciated in advance.

My data is:

  • d_alltrain & labels_alltrain = all my training data & training labels (labels known)
  • d_alltest & labels_alltest = all my test data & test labels (labels unknown)

If I am interpreting your message above correctly, then I should do this:

  • train_data = a random 64% of my d_alltrain, with its labels defined as "train_labels"
  • validation_data = a random 16% of my d_alltrain (exclusive of train_data), with its labels defined as "validation_labels"
  • test_data = the remaining 20% of my d_alltrain (exclusive of train_data & validation_data), with its labels defined as "test_labels"

Then:

folds <- kfold(train_labels, k = 5)

But can you help me understand how to set up the model(s)?

model <- CascadeForest(training_data = ?,
                       validation_data = ?,
                       training_labels = ?,
                       validation_labels = ?,
                       folds = folds,
                       ...)

Thanks again.
