
RF Model building and evaluation


Some background before we build the model

1 What is Random Forest and how does it work

Random forest is a supervised learning algorithm that can be used in many situations. It has two significant features: "random" and "forest". Random forest is built on the decision tree concept.

Random refers to the way the data is sampled when building each single tree.
If there is an M x N dataset to be trained on, the algorithm first selects m (m < M) rows of observations with n (n < N) features to form the sample used to build the first tree; it then keeps drawing samples with replacement to build the remaining trees. Typically each tree is trained on about 2/3 of the data, and the remaining 1/3 (the out-of-bag data) is used to calculate the OOB error.
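To make the sampling step concrete, here is a toy base-R sketch of the idea, not the package's internal code; `data` and its `target` column are placeholder names, and note that the real randomForest implementation actually re-draws mtry candidate features at every split rather than once per tree:

```r
set.seed(42)
M <- nrow(data)
N <- ncol(data) - 1                               # features, excluding the target column
row_idx  <- sample(M, size = M, replace = TRUE)   # bootstrap rows, drawn with replacement
feat_idx <- sample(N, size = floor(sqrt(N)))      # a random subset of the features
in_bag <- data[row_idx, ]                         # ~2/3 of the distinct rows land in-bag
oob    <- data[-unique(row_idx), ]                # the other ~1/3 is out-of-bag
```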

Forest means that many trees (about 500 by default) get built by the algorithm; when a new case (observation) is fed into the model, each tree casts a vote. It is very democratic: whichever class gets the most votes is the class the case is assigned to. For classification, the case also gets a vote score, which serves as the predicted probability of each class.
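A toy illustration of the voting step, with made-up votes from five trees:

```r
tree_votes <- c("yes", "yes", "no", "yes", "no")  # hypothetical votes from 5 trees
vote_table <- table(tree_votes)
names(which.max(vote_table))                      # majority class: "yes"
vote_table / length(tree_votes)                   # vote scores: no 0.4, yes 0.6
```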

Because of these features, RF can also be treated as a bagging algorithm, which significantly reduces variance: the result comes from averaging the vote scores, and each tree is built on a bootstrap sample drawn with replacement, so the variance is reduced compared with a single tree.

2 What kinds of problems can it be used for

Classification (if the target is a factor) and regression (if it is not).
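For example, with the built-in iris data, the randomForest package picks the mode from the target's type:

```r
library(randomForest)
set.seed(1)
rf_cls <- randomForest(Species ~ ., data = iris)        # factor target
rf_cls$type                                             # "classification"
rf_reg <- randomForest(Sepal.Length ~ ., data = iris)   # numeric target
rf_reg$type                                             # "regression"
```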

3 Does Random Forest need cross-validation

Yes and no. Many articles point out that because the out-of-bag data already estimates the error while each tree is trained, the model's performance is already apparent once training finishes.

The purpose of CV is to validate model performance on different sample sets (to guard against overfitting or underfitting) and thereby find the best parameters. In a random forest, n features are randomly selected each time, so many different trees get built, which gives better performance than a single tree across different data. It is unlikely to overfit (for the same variance-reducing reason), so CV is not as necessary for RF. As more trees are built, the OOB error eventually stabilizes.
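You can watch this stabilization yourself: the fitted object's err.rate matrix stores the cumulative OOB error after each tree is added. A small sketch on iris:

```r
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
# the "OOB" column of err.rate holds the running out-of-bag error;
# the curve flattens out as more trees are added
plot(rf$err.rate[, "OOB"], type = "l",
     xlab = "number of trees", ylab = "OOB error")
```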

But on the other hand, if you want to compare RF with other algorithms, you had better use CV. And RF is certainly not absolutely immune to overfitting: you may want to tune mtry, ntree, and maxnodes (tree depth), as sketched below.
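For mtry in particular, the randomForest package ships a helper, tuneRF(), that searches by OOB error; a minimal sketch on iris:

```r
library(randomForest)
set.seed(1)
# try mtry values up/down by stepFactor until the OOB error
# stops improving by at least 'improve'
tuned <- tuneRF(x = iris[, -5], y = iris$Species,
                ntreeTry = 200, stepFactor = 1.5, improve = 0.01)
tuned   # the OOB error recorded for each mtry tried
```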

4 How we start our first model in R

Usually we can use two common packages to build an RF model in R (I will write a Python version as well, using scikit-learn, pandas, etc.).
library(caret) or library(randomForest). The difference between the two is that caret is a package containing many other classification algorithms, such as bagged/boosted trees (ada, etc.), Bayesian methods, NN, KNN, regression, and so on; caret also provides other toolkits like CV and createDataPartition.
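As a minimal caret sketch (assuming a data frame `data` with a `target` column; both are placeholder names):

```r
library(caret)
set.seed(1)
# createDataPartition() gives a stratified split; train() wraps the RF
# in 5-fold cross-validation
idx <- createDataPartition(data$target, p = 0.8, list = FALSE)
fit <- train(target ~ ., data = data[idx, ], method = "rf",
             trControl = trainControl(method = "cv", number = 5))
fit$bestTune   # the mtry value CV selected
```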

The randomForest package is an RF-only package, very focused on RF features.

In this case I use the randomForest package only, because I can use do.trace to monitor each tree being built in the R console, and I can also do a hold-out CV manually.

The model building code is easy:

`rffit <- randomForest(target ~ ., data = data, ntree = n, mtry = m, maxnodes = j, importance = TRUE, do.trace = TRUE)`

A few tips here: most algorithms in R can use the format xx(target ~ ., data = data) to train a model. ~ is the formula operator, meaning "predicted by what follows on the right-hand side", and "." means all the variables not yet used. Training can also use the matrix interface, which is a little quicker than the formula style; the matrix format looks like x = ..., y = ... (see the sketch below).
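A minimal sketch of the matrix-style call, again with placeholder `data`/`target` names; x must hold only the predictor columns:

```r
library(randomForest)
x <- data[, setdiff(names(data), "target")]   # predictors only
y <- data$target                              # the outcome
rffit2 <- randomForest(x = x, y = y, ntree = 500, importance = TRUE)
```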

As you can see above, we cannot start the model until we find the best parameters for RF, such as ntree, mtry, and maxnodes; one manual way to search for them follows.
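A hand-rolled search in the hold-out spirit mentioned earlier (`data`/`target` are placeholders and the grid values are arbitrary): hold out 30% of the data, fit a forest for each mtry/maxnodes combination, and keep the best.

```r
library(randomForest)
set.seed(1)
idx  <- sample(nrow(data), size = 0.7 * nrow(data))     # 70% training split
grid <- expand.grid(mtry = c(2, 4, 8), maxnodes = c(16, 64, 256))
grid$acc <- apply(grid, 1, function(p) {
  rf <- randomForest(target ~ ., data = data[idx, ], ntree = 300,
                     mtry = p["mtry"], maxnodes = p["maxnodes"])
  mean(predict(rf, data[-idx, ]) == data$target[-idx])  # hold-out accuracy
})
grid[which.max(grid$acc), ]                             # best combination found
```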

5 Random Forest regression vs classification

Before we get our hands dirty with RF, I would like to talk about RF classification versus regression. As you can see, the package can do both. If the target is a factor, RF builds an ensemble of classification trees; if it is numeric or continuous, RF uses regression trees.

There are many topics around decision trees and regression trees. Generally, classification trees use entropy and information gain to choose a node and split, while regression trees split on the smallest sum of squared errors (variance). (More to read at http://alumni.media.mit.edu/~tpminka/courses/36-350.2001/lectures/day19/)
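To make the two criteria concrete, here are hand-written toy versions (illustrative only, not the package's internals):

```r
entropy <- function(y) {                  # classification impurity
  p <- table(y) / length(y)
  p <- p[p > 0]                           # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}
info_gain <- function(y, split) {         # 'split' is a logical vector
  entropy(y) - (mean(split)  * entropy(y[split]) +
                mean(!split) * entropy(y[!split]))
}
sse <- function(y) sum((y - mean(y))^2)   # regression split criterion
info_gain(c("a", "a", "b", "b"), c(TRUE, TRUE, FALSE, FALSE))  # perfect split: 1
sse(c(1, 2, 3))                                                # 2
```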

Note: getting a probability from regression trees is different from using classification trees with predict(..., type = "prob"). If you use classification trees with predict(..., type = "prob"), the probabilities come from the vote scores; if your model's type is 'regression', the prediction is already a numeric vector, so you don't need to specify the type.

We will use classification trees and predict with type = "prob", since we want to use ROC and AUC to validate the model's performance; a sketch follows.
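A sketch of that evaluation using the pROC package (`rf` and `test` are placeholders for a fitted binary classification forest and a held-out data frame whose true labels sit in `target`):

```r
library(randomForest)
library(pROC)
probs   <- predict(rf, newdata = test, type = "prob")[, 2]  # P(second class)
roc_obj <- roc(response = test$target, predictor = probs)
auc(roc_obj)     # area under the ROC curve
plot(roc_obj)    # the ROC curve itself
```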
