Hi. Great tutorial. Just a quick note on Session 11: when computing the mean cosine similarity to spam messages as a feature on the training data, you should exclude the observation itself from the list of spam messages:
# Mean cosine similarity of each training message to the spam messages,
# excluding the message itself when it is spam.
spam.indexes <- which(train$Label == "spam")
train.svd$SpamSimilarity <- rep(0.0, nrow(train.svd))
for (i in seq_len(nrow(train.svd))) {
  # Drop row i from the spam list so it is never compared with itself
  spam.indexesCV <- setdiff(spam.indexes, i)
  train.svd$SpamSimilarity[i] <- mean(train.similarities[i, spam.indexesCV])
}
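This fixes a data leakage problem that leads to over-fitting: without the exclusion, every spam message in the training set is averaged against itself (cosine similarity 1), which inflates the feature for spam rows in the training data only. The RF results on the test data with the updated feature are much better: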
# Drill-in on results
confusionMatrix(preds, test.svd$Label)
Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham  1445   32
      spam    2  192

               Accuracy : 0.98
                 95% CI : (0.972, 0.986)
    No Information Rate : 0.866
    P-Value [Acc > NIR] : < 2e-16

                  Kappa : 0.907
 Mcnemar's Test P-Value : 0.000000658

            Sensitivity : 0.999
            Specificity : 0.857
         Pos Pred Value : 0.978
         Neg Pred Value : 0.990
             Prevalence : 0.866
         Detection Rate : 0.865
   Detection Prevalence : 0.884
      Balanced Accuracy : 0.928

       'Positive' Class : ham
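To make the leak concrete, here is a minimal sketch on made-up data (the `labels` vector and the 4x4 `sims` matrix are hypothetical, not taken from the tutorial). It compares the naive feature, where spam rows are averaged against themselves, with the leave-one-out version proposed above:

```r
# Toy data (hypothetical): 4 training messages, rows 2 and 4 are spam.
labels <- c("ham", "spam", "ham", "spam")

# Symmetric cosine-similarity matrix; the diagonal is 1 because
# every message is perfectly similar to itself.
sims <- matrix(c(1.0, 0.2, 0.5, 0.1,
                 0.2, 1.0, 0.3, 0.6,
                 0.5, 0.3, 1.0, 0.2,
                 0.1, 0.6, 0.2, 1.0),
               nrow = 4, byrow = TRUE)

spam.indexes <- which(labels == "spam")

# Naive feature: spam rows include their own similarity of 1.
naive <- rowMeans(sims[, spam.indexes, drop = FALSE])

# Leave-one-out feature: drop the row's own index before averaging.
loo <- vapply(seq_len(nrow(sims)), function(i) {
  mean(sims[i, setdiff(spam.indexes, i)])
}, numeric(1))

print(round(cbind(naive, loo), 2))
```

In this toy case the ham rows are unchanged (0.15 and 0.25), while both spam rows drop from 0.8 to 0.6 once the self-similarity is removed. One caveat: if the training set contained only a single spam message, `setdiff` would return an empty vector for that row and `mean` would yield `NaN`, so that case would need separate handling.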