RF Model prediction and evaluation end


Continuing on from the model's confusion matrix:

      N     Y class.error
N 24383  1153  0.04515194
Y   880 24656  0.03446115

From the confusion matrix (rows are the actual classes, columns are the predicted classes), we could see that:

For class N:
Precision(N) = TP / (TP + FP) = 24383/25263 = 0.96517
Recall(N) = TP / (TP + FN) = 24383/25536 = 0.9548481 (which is 1 - class.error on N, since class.error is computed along the actual-class rows)

For class Y:
Precision(Y) = TP / (TP + FP) = 24656/25809 = 0.95533
Recall(Y) = TP / (TP + FN) = 24656/25536 = 0.96554 (which is 1 - class.error on Y)
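
These numbers can be checked directly in R; a minimal sketch, assuming the OOB confusion matrix above is stored as conf (actual classes in rows, predicted in columns):

conf <- matrix(c(24383, 1153, 880, 24656), nrow = 2, byrow = TRUE,
               dimnames = list(c("N","Y"), c("N","Y")))
# precision: correct predictions divided by the column (predicted) total
precision.N <- conf["N","N"] / sum(conf[,"N"])   # 24383/25263 = 0.96517
# recall: correct predictions divided by the row (actual) total
recall.N    <- conf["N","N"] / sum(conf["N",])   # 24383/25536 = 0.9548481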

As this confusion matrix comes from the OOB validation during model training, it only gives us an idea of how well the model performs in validation, i.e. how well our parameter settings work.

We still need to test the model on unseen data to measure its real performance.

Prediction with the Model

First we need to make sure the testing data is unseen data.

> fulldata2.1.3 <- fulldata2[-trainingportion,]
# remove the training rows from the full dataset to build an unseen data population
> columnname <- names(training1.3)
# get the training data set's column names, so the useless columns can be dropped from the test data as well
> fulldata2.1.3.1 <- fulldata2.1.3[,columnname]
# keep only the useful columns
> View(fulldata2.1.3.1)
> fulldata2.1.3.2 <- fulldata2.1.3.1[sample(nrow(fulldata2.1.3.1)),]
# shuffle the rows of the testing data, as the full data had been balanced to a 1:1 Y/N ratio with SMOTE
> View(fulldata2.1.3.2)
> fulldata2.1.3.3 <- fulldata2.1.3.2[sample(nrow(fulldata2.1.3.2),60000),]
# randomly sample 60,000 rows from the unseen data as the testing data set
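
Note that sample() is random, so the rows drawn will differ between runs; if reproducibility matters, a set.seed() call before the sampling steps (a sketch, the seed value is arbitrary) pins them down:

set.seed(123)  # fix the RNG so the shuffles and the 60,000-row draw are repeatable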

After we have our unseen testing data formatted properly, we could use the predict function to generate the predictions.

predictors2 <- names(fulldata2.1.3.3)[!names(fulldata2.1.3.3) %in% outcomename]
# get the predictors by removing the target column name from the data set
predict.fit1 <- predict(rf.fit1,fulldata2.1.3.3[,predictors2],type = "prob")
# predict on the unseen data using the rf.fit1 model with type = "prob" (refer to page 2); this gives us a probability for the N and Y classes

The results of the prediction are below:

> predict.fit1
                 N           Y
101320 0.935714286 0.064285714
351755 0.005714286 0.994285714
453079 0.511428571 0.488571429
37151  0.327142857 0.672857143
226657 0.000000000 1.000000000
270618 0.927142857 0.072857143
42087  0.001428571 0.998571429
124053 0.015714286 0.984285714
261578 0.855714286 0.144285714
233972 0.782857143 0.217142857
58831  0.551428571 0.448571429
419982 0.001428571 0.998571429
......

Let's see the distribution of the prediction probabilities on Y:
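
The original plot is an image in the wiki; a minimal base-R sketch that would draw a similar histogram (the breaks and labels are my choice, not from the original):

hist(predict.fit1[,"Y"], breaks = 50,
     main = "Distribution of predicted churn probability",
     xlab = "P(Y)")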

We could see that it has two modes and the median is approx. at 0.5.

  • If we take a sample from this probability distribution, and the sample size is large enough, the mean of the sample will still be approximately normally distributed, so we could estimate the range of the mean error rate on a prediction sample using X_bar ~ N(mu, sigma^2/n), where mu is the population mean and sigma^2/n is the population variance divided by the sample size n: that is the central limit theorem. A quick simulation of this is sketched below.
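
A quick way to see this on our data (a sketch, not from the original): repeatedly draw 500 predicted probabilities and look at the distribution of the sample means, which comes out roughly normal even though P(Y) itself is bimodal.

# 1000 sample means, each from a random draw of 500 predicted probabilities
sample.means <- replicate(1000, mean(sample(predict.fit1[,"Y"], 500)))
hist(sample.means, breaks = 40, main = "Sampling distribution of the mean of P(Y)")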

We turned the Y/N labels into 1/0 in the testing data set so we could calculate the ROC, AUC, precision, recall, and the best cut-off. (Usually, in a prediction, we treat values over 0.5 as 1 and values under 0.5 as 0, but sometimes that does not give us the best accuracy rate; we need to sort through all the predicted probabilities to see which cut-off gives us the best accuracy.)
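
The conversion could look like this (a sketch, not from the original; CHURN_FLAG_B is the binary column used with ROCR below, and outcomename was defined earlier):

# turn the Y/N factor into a 1/0 numeric label for ROCR
fulldata2.1.3.3$CHURN_FLAG_B <- ifelse(fulldata2.1.3.3[,outcomename] == "Y", 1, 0)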

Model evaluation test

We use library(ROCR) to calculate the ROC/AUC/ACC/CUT-OFF/PRECISION-RECALL.
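
If it has not been loaded yet, ROCR needs to be attached before the calls below (install.packages is only needed once):

# install.packages("ROCR")  # only needed the first time
library(ROCR)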

> pred.fit1 <- prediction(predict.fit1[,2],fulldata2.1.3.3$CHURN_FLAG_B)
# use the prediction function in ROCR to set up the calculation
> eval.fit1 <- performance(pred.fit1,"acc")
# get the accuracy performance
> plot(eval.fit1)
# plot the accuracy curve against the cut-off

We could see the accuracy curve as below:

In this accuracy curve, we could see that 0.5 is not the ideal threshold to determine whether a prediction is 1 or 0, so we need to find the best cut-off for the model.

> y.max.fit1 <- which.max(slot(eval.fit1,"y.values")[[1]])
# first find the position of the max y value in the accuracy curve vector (the accuracy rate at each cut-off value x); the [[1]] is needed because slot() returns a list
> acc.fit1 <- slot(eval.fit1,"y.values")[[1]][y.max.fit1]
> acc.fit1
[1] 0.9209
# the best accuracy

> cut.off.fit1 <- slot(eval.fit1,"x.values")[[1]][y.max.fit1]
# use that position to get the corresponding x value, which is the cut-off
> cut.off.fit1
   344169 
0.5471429 
# when the cut-off equals 0.5471429, we get the best accuracy rate

So with a threshold of 0.547, we get the best accuracy rate of 0.92, which is quite good for unseen data!
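
To actually use this threshold (a sketch, not from the original; predict.class1 is a hypothetical name), the hard class predictions and a test-set confusion matrix would look like:

# label a row as churn (1) when P(Y) is at or above the best cut-off
predict.class1 <- ifelse(predict.fit1[,"Y"] >= cut.off.fit1, 1, 0)
table(actual = fulldata2.1.3.3$CHURN_FLAG_B, predicted = predict.class1)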

Now we need to have a look at the ROC, to know whether this model's accuracy is similar when predicting both Y and N. See the pic below:

The code to generate the ROC is like below:

> roc.fit1 <- performance(pred.fit1,"tpr","fpr")
> plot(roc.fit1,colorize = T)
> abline(a=0,b=1)

We could see that the curve is fairly close to the top-left corner, and the AUC is:

> auc.fit1 <- performance(pred.fit1,"auc")
> auc.fit1.1 <- as.numeric(auc.fit1@y.values)
> auc.fit1.1
[1] 0.9652801

The AUC is 0.96528, quite close to our cross-validation AUC value of 0.978, not bad!

We could further have a look at the precision-recall curve:

> precision.recall.fit1 <- performance(pred.fit1,"prec","rec")
> plot(precision.recall.fit1)

We could see that the best precision and recall pair, using the cut-off of 0.547, is about 0.9 and 0.8.

We further use the precision curve on its own to pin down the precision value:

> precision.fit1 <- performance(pred.fit1,"prec")
> plot(precision.fit1)
> abline(v=0.547)
# mark the chosen cut-off, then bracket the precision at that cut-off with horizontal lines
> abline(h=0.93,v=0.547)
> abline(h=0.95,v=0.547)

Precision is about 0.95, not bad! Using the same method for recall, it is about 0.91, not bad as well!
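
With precision around 0.95 and recall around 0.91 at this cut-off, the implied F1 score (a quick worked calculation, not from the original) is:

prec <- 0.95; rec <- 0.91
f1 <- 2 * prec * rec / (prec + rec)
f1  # approx. 0.93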

Some final thoughts

Churn prediction is quite commonly used in sales analytics across different industries. Further analysis will continue on "what do we do once we have identified those high-churn clients".

Usually, there are a few options we could look into:

*1 Have a look at the high-churn-probability clients' characteristics, e.g. their contract price, their payment method, their industry, their subscription lifetime in months, their usage of particular products...

*2 Talk with the project managers and sales managers to gather which clients are the most important to them, and try to build those clients' profiles within the high-churn sample

*3 Look at the low-churn clients' profiles or the feature importance to learn which actions are efficient at retaining clients

*4 Make an action list by analyzing the points above, and keep monitoring the high-churn client sample

*5 Discuss the prediction frequency with the sales (project) managers, and try to build a customer prediction profile dashboard showing the churn probabilities

*6 Frequently review the churn rate and data, modifying and re-optimizing the model where possible

END