Practical Machine Learning Project Write-up
--------------------------------------------------
We first examine the dataset to see which variables are available for the model. Many of the columns are descriptive statistics computed once per window of observations. These statistics (average, max, min, amplitude, skewness, kurtosis, standard deviation, variance) are populated for only a few rows, namely those with new_window=yes. We omit all of these summary-statistic columns from our predictors.
```{r message=FALSE}
# set up environment
options(warn=-1)
library(lattice)
library(caret)
library(rattle)
library(randomForest)
library(cvTools)
library(png)
library(grid)
library(rpart.plot)
library(RColorBrewer)
library(rpart)
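# read the raw data; stringsAsFactors=TRUE keeps classe and new_window as factors
# on R >= 4.0, where read.csv() no longer converts strings by default
pml.training <- read.csv("C:/SujoyRc/Temp/Coursera/pml-training.csv", stringsAsFactors=TRUE)
pml.testing <- read.csv("C:/SujoyRc/Temp/Coursera/pml-testing.csv", stringsAsFactors=TRUE)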
```
To keep the code simple, we create a new dataset without these columns. We also confirm that none of the remaining columns contain NAs, so no rows will be excluded from our analysis.
```{r message=FALSE}
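# name fragments that identify the per-window summary-statistic columns
cols_exclude_pattern<-c("max","min","skewness","kurtosis","avg","var","stddev","amplitude")
# keep only the columns whose names match none of the fragments
pml_use<-pml.training[ , !grepl(paste(cols_exclude_pattern,collapse="|"),names(pml.training))]
# count the NAs in each remaining column (every count should be zero)
colSums(is.na(pml_use))
# number of rows with new_window=yes versus new_window=no
table(pml_use$new_window)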
```
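The column filter above can be illustrated on a small example. The following is a minimal sketch on a made-up data frame (`toy` and its column names are hypothetical, not taken from the real dataset). It uses logical indexing with `grepl()`, which still behaves sensibly when no name matches, whereas negative indexing with an empty `grep()` result would select no columns at all.

```r
# Hypothetical toy data: two summary-statistic columns that are mostly NA
toy <- data.frame(
  roll_belt     = c(1.4, 1.5, 1.6),
  max_roll_belt = c(NA, NA, 2.0),
  avg_roll_arm  = c(NA, NA, 0.3),
  classe        = c("A", "B", "A")
)

# Build one regular expression from the name fragments to exclude
pattern <- paste(c("max", "avg"), collapse = "|")

# TRUE for every column whose name matches none of the fragments
keep <- !grepl(pattern, names(toy))
toy_use <- toy[, keep, drop = FALSE]

print(names(toy_use))          # remaining column names
print(colSums(is.na(toy_use))) # NA count per remaining column
```

Once the summary-statistic columns are gone, the remaining columns in this toy example contain no NAs, which is the same check we apply to the real data.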
One question is whether the distributions of the predictors differ between rows with new_window=yes and rows with new_window=no. We want to be confident that nothing about the new_window=yes rows requires excluding them or treating them specially. We plot the densities for both groups and see little visual difference between them.
```{r eval=FALSE}
transparentTheme(trans = .9)
featurePlot(x = pml_use[, 8:59],
            y = pml_use$new_window,
            plot = "density",
            ## Pass in options to xyplot() to
            ## make it prettier
            scales = list(x = list(relation="free"),
                          y = list(relation="free")),
            adjust = 1.5,
            pch = "|",
            layout = c(9, 6),
            auto.key = list(columns = 52))
```
```{r echo=FALSE}
img<-readPNG("C:/SujoyRc/Temp/Coursera/VariableDensities.png")
grid.raster(img)
```
To start our modelling, we split the data into training and testing sets. The supplied pml-testing file cannot be used to measure accuracy because it does not contain the response variable (strictly speaking, it is an out-of-sample set), so we hold out 30% of pml_use as our testing set.
```{r message=FALSE}
set.seed(12345)
inTrain<-createDataPartition(y=pml_use$classe,p=0.7,list=FALSE)
training<-pml_use[inTrain,]
testing<-pml_use[-inTrain,]
### Build a formula that lists all 52 predictor columns (8 to 59)
fmla<-as.formula(paste("classe ~",paste(names(pml_use[8:59]),collapse=" + ")))
```
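The split above samples within each class so that the class proportions stay similar in both sets. The following is a simplified base-R sketch of that idea of stratified sampling; it is not caret's actual implementation, and the labels `y` are made up for illustration.

```r
set.seed(12345)

# Hypothetical class labels: 50 of class A, 30 of class B, 20 of class C
y <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Row indices grouped by class
idx_by_class <- split(seq_along(y), y)

# Draw 70% of the rows of each class for training
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}), use.names = FALSE))

train_counts <- table(y[in_train])   # class counts in the training part
test_counts  <- table(y[-in_train])  # class counts in the held-out part
print(train_counts)
print(test_counts)
```

Each class contributes 70% of its rows to the training part, so a rare class cannot end up entirely on one side of the split by chance.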
We start with a classification tree (rpart) and examine its performance.
```{r eval=FALSE}
modFit<-train(fmla, method="rpart",data=training)
```
```{r echo=FALSE}
load("C:/SujoyRc/Temp/Coursera/modFit")
```
```{r fig.width=7, fig.height=6}
fancyRpartPlot(modFit$finalModel)
```
We then look at the model fit statistics for both the in-sample and the out-of-sample errors.
**IN-SAMPLE ERRORS**
```{r message=FALSE}
# confusionMatrix(data = predictions, reference = observed classes)
confusionMatrix(predict(modFit),training$classe)
```
**OUT OF SAMPLE ERRORS**
```{r message=FALSE}
# confusionMatrix(data = predictions, reference = observed classes)
confusionMatrix(predict(modFit,newdata=testing),testing$classe)
```
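As an aside on reading these matrices: the overall accuracy is the share of observations on the diagonal of the confusion matrix, and the error rate is one minus that share. The following is a minimal base-R sketch with made-up labels (`truth` and `pred` are hypothetical and unrelated to the models in this write-up).

```r
# Hypothetical observed classes and predictions for eight observations
lv    <- c("A", "B", "C")
truth <- factor(c("A", "A", "B", "B", "C", "C", "C", "A"), levels = lv)
pred  <- factor(c("A", "B", "B", "B", "C", "A", "C", "A"), levels = lv)

# Cross-tabulate predictions (rows) against observed classes (columns)
cm <- table(Prediction = pred, Reference = truth)
print(cm)

# Correct predictions lie on the diagonal
accuracy   <- sum(diag(cm)) / sum(cm)
error_rate <- 1 - accuracy
print(accuracy)    # 6 of 8 correct, i.e. 0.75
print(error_rate)  # 0.25
```

Computing this on the training rows gives the in-sample figure. Computing it on held-out rows gives the out-of-sample figure.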
For the rpart model, both the in-sample and the out-of-sample matrix show an accuracy of only about 60%, which is clearly not good enough. We therefore fit a random forest to the same data and plot its variable importance.
For convenience, we store the randomForest object held inside the train model in a separate variable, since we will use randomForest package functions for the variable importance plot and the cross-validation.
```{r eval=FALSE}
modFit_rf<-train(fmla, method="rf",data=training)
rf<-modFit_rf$finalModel
```
```{r echo=FALSE}
load("C:/SujoyRc/Temp/Coursera/rf_model")
rf<-modFit_rf$finalModel
```
We visualize the variable importance, measured as the mean decrease in node impurity, in the following plot.
```{r fig.width=7, fig.height=6 }
varImpPlot(rf,type=2,main="mean decrease in node impurity")
```
We then analyze the confusion matrices for the random forest.
**IN-SAMPLE ERRORS**
```{r message=FALSE}
# confusionMatrix(data = predictions, reference = observed classes)
confusionMatrix(predict(modFit_rf),training$classe)
```
The confusion matrix for the training dataset shows 100% accuracy, which raises the concern that the model is overfitted.
**OUT OF SAMPLE ERRORS**
```{r message=FALSE}
# confusionMatrix(data = predictions, reference = observed classes)
confusionMatrix(predict(modFit_rf,newdata=testing),testing$classe)
```
However, the confusion matrix for the testing dataset dispels that concern: the out-of-sample accuracy is sufficient.
We then run a cross-validation and observe the out-of-sample errors for forests using 52, 26, 13, 6, 3 and 1 variables.
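The variable counts listed above follow from the halving schedule that, as far as we understand it, rfcv uses with its default arguments (scale="log", step=0.5). The following is a minimal base-R sketch of that schedule under this assumption; it does not call the randomForest package, and the names `p`, `step` and `n_var` are ours.

```r
p    <- 52    # number of predictors (columns 8 to 59)
step <- 0.5   # assumed default: halve the number of variables each round

# Number of halvings before dropping below one variable
k <- floor(log(p, base = 1 / step))

# Candidate sizes, rounded to whole variables
n_var <- round(p * step^(0:(k - 1)))

# Drop repeated sizes and make sure the single-variable case is included
n_var <- n_var[c(TRUE, diff(n_var) != 0)]
if (!(1 %in% n_var)) n_var <- c(n_var, 1)

print(n_var)  # 52 26 13 6 3 1
```

The sizes agree with the values in error.cv, which holds one cross-validated error rate per size.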
```{r eval=FALSE }
cv_results<-rfcv(pml_use[,8:59],pml_use[,60])
cv_results$error.cv
```
```{r echo=FALSE }
load("C:/SujoyRc/Temp/Coursera/cv_results")
cv_results$error.cv
```
```{r fig.width=7, fig.height=6 }
with(cv_results, plot(n.var, error.cv, log="x", type="o", lwd=2))
```
The errors are very small and provide a good estimate of the out-of-sample error.
These results, all encouraging, lead us to choose the random forest as the model of choice for this problem.