Cosine Similarity with spam message Feature Data Leakage #1

Open
blazysecon opened this issue May 29, 2018 · 0 comments

blazysecon commented May 29, 2018

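Hi. Great tutorial. Just a quick note on Session 11: when computing the mean cosine similarity to spam messages as a feature on the training data, you should exclude the observation itself from the list of spam messages. Otherwise every spam message averages in its own self-similarity of 1, which leaks the label into the feature: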

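# Mean cosine similarity of each training message to the spam messages,
# excluding the message itself so its self-similarity of 1 is not averaged in.
spam.indexes <- which(train$Label == "spam")
train.svd$SpamSimilarity <- rep(0.0, nrow(train.svd))
for(i in seq_len(nrow(train.svd))) {
    # Drop the current row from the spam index set (a no-op for ham rows)
    spam.indexesCV <- setdiff(spam.indexes, i)
    train.svd$SpamSimilarity[i] <- mean(train.similarities[i, spam.indexesCV])
}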
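
To make the effect concrete, here is a minimal sketch on a toy similarity matrix. The matrix values, the `sims`, `labels`, `leaky` and `loo` names, and the four-message setup are all made up for illustration and are not from the tutorial. It compares the feature computed with and without the self-exclusion:

```r
# Hypothetical 4x4 cosine-similarity matrix: messages 1-2 are spam, 3-4 are ham.
# The diagonal is 1 because each message is perfectly similar to itself.
sims <- matrix(c(1.0, 0.8, 0.1, 0.2,
                 0.8, 1.0, 0.3, 0.1,
                 0.1, 0.3, 1.0, 0.6,
                 0.2, 0.1, 0.6, 1.0),
               nrow = 4, byrow = TRUE)
labels <- c("spam", "spam", "ham", "ham")
spam.idx <- which(labels == "spam")

# Leaky version: spam rows average in their own self-similarity of 1.
leaky <- rowMeans(sims[, spam.idx, drop = FALSE])

# Leave-one-out version: drop the row's own index before averaging.
loo <- vapply(seq_len(nrow(sims)), function(i) {
  mean(sims[i, setdiff(spam.idx, i)])
}, numeric(1))

print(round(leaky, 2))
print(round(loo, 2))
```

In this toy case the ham rows are unchanged (0.20 and 0.15), while the two spam rows drop from 0.90 to 0.80 once the self-similarity is removed. That inflation only occurs on training rows, which is why a model trained on the leaky feature can look better in training than it performs on test data.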

This removes the data leakage that was causing the over-fitting. The random forest results on the test data with the updated feature are much better:

 # Drill-in on results
 confusionMatrix(preds, test.svd$Label)
Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham  1445   32
      spam    2  192
                                      
               Accuracy : 0.98          
                 95% CI : (0.972, 0.986)
    No Information Rate : 0.866         
    P-Value [Acc > NIR] : < 2e-16       
                                        
                  Kappa : 0.907         
 Mcnemar's Test P-Value : 0.000000658   
                                        
            Sensitivity : 0.999         
            Specificity : 0.857         
         Pos Pred Value : 0.978         
         Neg Pred Value : 0.990         
             Prevalence : 0.866         
         Detection Rate : 0.865         
   Detection Prevalence : 0.884         
      Balanced Accuracy : 0.928         
                                        
       'Positive' Class : ham           
                            