
RF Model building and evaluation


Some notes before building our model

1 What is Random Forest and how does it work

Random Forest is a supervised learning algorithm that can be used in many cases. It has two significant features: "random" and "forest". Random Forest is built on the foundation of the decision tree concept.

Random refers to the way the data is selected when building each single tree.
If there is an M x N sized dataset to be trained in RF, the algorithm will first select m (m < M) rows of observations with n (n < N) features as the sample used to build the first tree. Then the algorithm keeps sampling with replacement (the bootstrap method) to generate new samples and keep building trees. Usually, each RF tree only uses about 2/3 of the data for training and leaves the remaining 1/3 (the out-of-bag data) to calculate the OOB error.
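As an illustration, here is a minimal sketch in R of how one bootstrap sample for a single tree could be drawn; the data frame `train` is an assumption for this example, not part of the wiki's code:

```r
# Minimal sketch of the bootstrap sample behind one tree.
# `train` is a hypothetical data frame with the target in one column.
set.seed(42)

M <- nrow(train)

# Draw M rows with replacement; on average only ~2/3 of the rows
# are unique, which matches the "2/3 of the data" mentioned above
boot_idx <- sample(M, size = M, replace = TRUE)
boot_sample <- train[boot_idx, ]

# The rows never drawn are the out-of-bag (OOB) set for this tree,
# used later to estimate the OOB error
oob_sample <- train[-unique(boot_idx), ]
```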

Forest means that a number of trees (say about 500) get built by the algorithm; when a new case (observation) is fed into the model, each tree gives a vote. It is very democratic: whichever class gets the most votes is the class the case is put into. For classification, the case also gets a vote score, which is the predicted probability of each class.
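With the randomForest package in R, the vote fractions behind a classification can be inspected directly; the iris data here is just a stand-in:

```r
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

# Each of the 500 trees votes for a class; type = "prob" returns
# the fraction of trees voting for each class per observation
head(predict(rf, iris, type = "prob"))

# The predicted class is simply the majority vote
head(predict(rf, iris, type = "response"))
```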

Because of these features, RF can also be treated as a bagging algorithm, which significantly reduces variance: the model's result comes from averaging the voting scores, and each tree is built on a bootstrap sample drawn with replacement, so the variance is reduced compared with a single tree.
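A rough way to see the variance reduction, using the standard bagging analysis (this formula is a supplement, not from the original text): if each of the B trees has variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Adding more trees (larger B) shrinks the second term toward zero, and the random feature selection lowers the correlation $\rho$ in the first term.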

2 What kinds of problems can it be used for

Classification and regression (regression is used when the target data type is not a factor)
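A small sketch of this behavior in R's randomForest, which picks the mode from the target's type; mtcars here is just a stand-in dataset:

```r
library(randomForest)

# Numeric target -> regression forest
rf_reg <- randomForest(mpg ~ ., data = mtcars)
print(rf_reg$type)   # "regression"

# Factor target -> classification forest
mtcars$am <- factor(mtcars$am)
rf_cls <- randomForest(am ~ ., data = mtcars)
print(rf_cls$type)   # "classification"
```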

3 Does Random Forest need cross-validation

Yes and no. A lot of articles point out that because the out-of-bag data already estimates the error of each tree while the model is trained, the performance is already apparent once training finishes.

The purpose of CV is to validate the model's performance on different sample sets (to check that the model is neither overfit nor underfit) so that we can find the best parameters. In Random Forest, because n features are randomly selected each time, many different trees get built, so it performs better than a single tree across different data and is unlikely to overfit (this is the same variance-reduction effect). Thus CV is not as necessary in RF. With more trees built, the OOB error will eventually stabilize, as the sketch below shows.
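One quick way to watch the OOB error stabilize, again with the randomForest package on a stand-in dataset:

```r
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 1000)

# err.rate holds the cumulative OOB error after each tree is added;
# plotting it shows the curve flattening out as trees accumulate
plot(rf$err.rate[, "OOB"], type = "l",
     xlab = "number of trees", ylab = "OOB error")
```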

But on the other hand, if you want to compare RF with other algorithms, you had better use CV. Also, RF is certainly not absolutely immune to overfitting: you may want to tune the RF through mtry, ntree and maxnodes (tree depth).
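For instance, the randomForest package ships a simple mtry tuner driven by OOB error; a CV-based grid search (e.g. via caret) would follow the same idea. Iris is again just a stand-in:

```r
library(randomForest)

set.seed(42)
# Search over mtry, accepting a step only if OOB error improves enough
tuned <- tuneRF(x = iris[, -5], y = iris$Species,
                ntreeTry = 500,      # trees grown per candidate mtry
                stepFactor = 1.5,    # multiply/divide mtry by this each step
                improve = 0.01,      # required relative OOB improvement
                trace = TRUE)
print(tuned)
```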

4 How do we start our first model

5 Random Forest regression vs classification
