
RF Model building and evaluation 1

chriszhangpodo edited this page Nov 15, 2017 · 1 revision

Before Cross Validation

As we saw on the earlier page, before building the RF model you need to decide on mtry, ntree, and maxnodes.

Usually we can run cross-validation (CV) to search over these parameters and pick the best combination.

There are several ways to do the CV (we ignore maxnodes here because we do not want the trees to be pruned):
1 Use repeated CV with a nested for loop of three levels: level 1 loops over mtry, level 2 over ntree, level 3 over the CV folds
1.1 For each (mtry, ntree, fold) combination, print the AUC (area under the ROC curve)
1.2 The (mtry, ntree) pair with the highest average AUC is the winner

2 Use the rfcv function in the randomForest package to narrow down the mtry range
3 Combine 1 & 2 together

Cross Validation

If you are familiar with library(caret), you probably know it is quite handy for CV via trainControl; however, caret's RF is quite slow because it tunes the model automatically through tuneLength or tuneGrid.
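For reference, the caret route mentioned above can be sketched as below. This is a minimal, hedged example on the built-in iris data (the project's own data is not public), with an assumed grid of mtry values; the real grid would come from your data's width.

```r
# Minimal caret sketch of the trainControl / tuneGrid approach, on iris.
library(caret)

set.seed(42)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2)
grid <- expand.grid(mtry = c(2, 3, 4))  # assumed grid for this toy data

fit <- train(Species ~ ., data = iris, method = "rf",
             trControl = ctrl, tuneGrid = grid, ntree = 200)
print(fit$bestTune)  # mtry chosen by CV accuracy
```

This is exactly the convenience (and the cost) described above: caret fits one forest per grid point per resample, which is why it gets slow on larger data.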

This time we will write our own repeated CV in an R script and combine it with the rfcv method:

rfcv(data[,predictors],data[,outcomenames] , cv.fold=5, step=2, recursive=FALSE)

Then we can plot the CV error against the number of variables using with(result, plot(n.var, error.cv, log="x", type="o", lwd=2))

From the plot, the best range for mtry is roughly 16-40; let's cap it at 30 (from experience).
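The rfcv step above can be reproduced end to end on the built-in iris data, as a small self-contained sketch (iris stands in for the project's data; with only four predictors I keep rfcv's default step instead of the step = 2 used above):

```r
# rfcv runs nested CV, dropping the least important variables at each step,
# and reports the CV error for each number of variables tried.
library(randomForest)

set.seed(42)
result <- rfcv(iris[, 1:4], iris$Species, cv.fold = 5, recursive = FALSE)

print(result$error.cv)  # CV error keyed by number of variables

# the same plot as in the text: CV error vs. number of variables, log-x
with(result, plot(n.var, error.cv, log = "x", type = "o", lwd = 2))
```

The flat region of that curve is what justifies reading off a "best range" of variable counts, as done above.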

Then we use the CV loop to find the best (mtry, ntree) pair:

library(randomForest)  # randomForest()
library(ROCR)          # prediction(), performance()

cv <- 5
trainSet <- training1.1set5.2
cvDivider <- floor(nrow(trainSet)/(cv + 1))

for (mtry1 in c(20, 24, 28, 30)) {
  for (ntree1 in c(600, 700, 1000)) {
    for (fold in 1:cv) {
      # assign one chunk to the test set
      datatestIndex <- ((fold - 1) * cvDivider + 1):(fold * cvDivider)
      dataTest <- trainSet[datatestIndex, ]
      # all other chunks into the training set
      dataTrain <- trainSet[-datatestIndex, ]
      # train
      rf.cv.1 <- randomForest(x = dataTrain[, predictors1],
                              y = dataTrain[, outcomename1],
                              ntree = ntree1, mtry = mtry1, importance = TRUE)
      gc()
      # predict class probabilities on the held-out fold
      predictions <- predict(rf.cv.1, dataTest[, predictors1], type = "prob")

      # accuracy by ROC and AUC
      predictions.y <- predictions[, 2]
      pred.1 <- prediction(predictions.y, dataTest[, outcomename1])
      auc <- performance(pred.1, "auc")
      auc <- unlist(slot(auc, "y.values"))

      print(paste(mtry1, ntree1, fold, auc))
    }
  }
}

After running the 5-fold CV for each (mtry, ntree) pair, you get an AUC value for each fold, shown in the block below (the trailing number after each fifth fold is the average AUC for that pair):

[1] "20 600 1 0.94433400333366"    
[1] "20 600 2 0.948419703557847"    
[1] "20 600 3 0.95432480141218"   
[1] "20 600 4 0.957974475798549"   
[1] "20 600 5 0.950504506269593"   0.95111
[1] "20 700 1 0.943744484753407"   
[1] "20 700 2 0.948541236493272"   
[1] "20 700 3 0.95432480141218"   
[1] "20 700 4 0.957703801685283"   
[1] "20 700 5 0.951147384404389"   0.95112
[1] "20 1000 1 0.943969997058535"  
[1] "20 1000 2 0.949395649857475"  
[1] "20 1000 3 0.955168186721584"  
[1] "20 1000 4 0.957963452870858"   
[1] "20 1000 5 0.95045797413793"      0.9514
[1] "24 600 1 0.945001960976567"   
[1] "24 600 2 0.949790938798961"   
[1] "24 600 3 0.955106894184563"   
[1] "24 600 4 0.957467421124829"   
[1] "24 600 5 0.95112901645768"      0.951699
[1] "24 700 1 0.944646533973919"   
[1] "24 700 2 0.948962305148332"   
[1] "24 700 3 0.955615622241836"   
[1] "24 700 4 0.95733759553204"  
[1] "24 700 5 0.950340419278997"     0.95138
[1] "24 1000 1 0.944769095009315"   
[1] "24 1000 2 0.949729558528544"   
[1] "24 1000 3 0.95509708737864"   
[1] "24 1000 4 0.957681755829903"   
[1] "24 1000 5 0.950554711990595"   0.951566
[1] "28 600 1 0.945301009902932"   
[1] "28 600 2 0.949745517398852"   
[1] "28 600 3 0.954943856036088"  
[1] "28 600 4 0.957562953164805"   
[1] "28 600 5 0.951098403213166"   0.9517303
[1] "28 700 1 0.945144131777625"
[1] "28 700 2 0.949763931479976"
[1] "28 700 3 0.95479797979798"
[1] "28 700 4 0.957455173427395"
[1] "28 700 5 0.950706553683386"   0.951573
[1] "28 1000 1 0.945607412491422"
[1] "28 1000 2 0.949351456062775"
[1] "28 1000 3 0.95501005197607"
[1] "28 1000 4 0.95758009994121"
[1] "28 1000 5 0.950712676332288"  0.951652
[1] "30 600 1 0.945634375919208"
[1] "30 600 2 0.949103479770291"
[1] "30 600 3 0.954926694125723"
[1] "30 600 4 0.956449637468155"
[1] "30 600 5 0.950328173981192"   0.951288
[1] "30 700 1 0.946122168840082"
[1] "30 700 2 0.94963626051751"
[1] "30 700 3 0.955527360988525"
[1] "30 700 4 0.957910787771899"
[1] "30 700 5 0.949985305642633"  0.951836
[1] "30 1000 1 0.946100107853712"
[1] "30 1000 2 0.949649764177001"
[1] "30 1000 3 0.955088506423457"
[1] "30 1000 4 0.957995296884184"
[1] "30 1000 5 0.950323275862069"  0.951831

From the CV result, we can see that when mtry is 30 and ntree is 700 the average AUC is the highest (0.951836), so we can set mtry = 30 and ntree = 700. Maxnodes is not considered because we do not prune the trees (although sometimes setting a maxnodes limit improves the AUC and counters overfitting).
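Rather than eyeballing the printed lines for the best average, the per-fold AUCs can be collected and averaged in base R. A small sketch, using the first pair's five printed values (rounded) as hypothetical input:

```r
# Hypothetical log of per-fold AUCs, mirroring the print() output above
auc_log <- data.frame(
  mtry  = rep(20, 5),
  ntree = rep(600, 5),
  fold  = 1:5,
  auc   = c(0.9443, 0.9484, 0.9543, 0.9580, 0.9505)
)

# average AUC per (mtry, ntree) pair; the row with the max mean wins
avg <- aggregate(auc ~ mtry + ntree, data = auc_log, FUN = mean)
print(avg)  # -> one row: mtry 20, ntree 600, mean AUC 0.9511
```

With all twelve pairs logged this way, `avg[which.max(avg$auc), ]` picks the winning pair directly.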

Building Model

After the CV, we know that mtry = 30 and ntree = 700 will roughly give us a 0.95 AUC on the RF model, so we can proceed to model building.

We used about 7,600 rows of data for the CV; for training we would like to use more data, so this time I use 51,000 rows (after SMOTE re-balancing).

rf.fit <- randomForest(Churn_flag ~., data= training1, mtry = 30, ntree = 700, do.trace = T, importance = T)

  1. Churn_flag is the factor target for prediction
  2. do.trace = T prints the out-of-bag (OOB) error rate for both classes (Y and N) as trees are added, so we can track the training
  3. importance = T means variable importance will be assessed: the model records the mean decrease in accuracy when each variable's values are permuted
  4. we do not prune the trees, so no maxnodes is set
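The same call pattern can be tried on public data. A runnable sketch on a binarized iris, standing in for the project's training1 / Churn_flag (which aren't available), with small mtry/ntree values suited to a four-column toy set:

```r
library(randomForest)

set.seed(42)
iris2 <- iris
iris2$is_virginica <- factor(ifelse(iris$Species == "virginica", "Y", "N"))
iris2$Species <- NULL  # drop the original multi-class target

fit <- randomForest(is_virginica ~ ., data = iris2,
                    mtry = 2, ntree = 100,  # toy-sized, not the tuned 30/700
                    do.trace = 25,          # print OOB error every 25 trees
                    importance = TRUE)

print(fit$confusion)  # OOB confusion matrix with per-class error
importance(fit)       # mean decrease in accuracy / Gini per variable
```

With an integer, do.trace prints every that-many trees instead of every tree, which keeps the console readable on long runs.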

The OOB error trace looks like below (tree number, overall OOB error, then the per-class errors for N and Y):

Tree:  OOB    N      Y
334:   4.01%  4.62%  3.40%
335:   4.03%  4.64%  3.41%
336:   4.02%  4.63%  3.41%
337:   4.03%  4.65%  3.41%
338:   4.01%  4.64%  3.39%
339:   4.02%  4.64%  3.40%
340:   4.02%  4.64%  3.41%
341:   4.01%  4.62%  3.40%
.....

You can watch the tree performance while the model builds its trees: by around tree No. 300 the error has become slightly smaller and settled at a good level.

After the console run finishes, we can use rf.fit$confusion to see the confusion matrix

> rf.fit$confusion
      N     Y class.error
N 24383  1153  0.04515194
Y   880 24656  0.03446115

To be continued...