PRACTICAL MACHINE LEARNING

INSTRUCTIONS

The goal of your project is to predict the manner in which they did the exercise. This is the "classe" variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.

INPUT DATA


In [1]:
# read the training data, treating both "NA" and empty strings as missing values
ExoData=read.csv(file="pml-training.csv", na.strings=c("NA",""), header=TRUE)
nrow(ExoData)
#str(ExoData)
dim(ExoData)


Out[1]:
19622
Out[1]:
  1. 19622
  2. 160

FEATURES

Having verified that the schemas of the training and testing sets are not identical, I decided to eliminate the columns consisting entirely of NA values, along with other extraneous columns.
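
The schema check itself is not shown above; the sketch below illustrates one way to compare the column names of the two files (it assumes pml-testing.csv is already in the working directory and reads it into a throwaway variable):


In [ ]:
# Sketch only: compare the schemas of the training and testing files.
rawTest=read.csv(file="pml-testing.csv", na.strings=c("NA",""), header=TRUE)
setdiff(names(ExoData), names(rawTest))   # columns present only in the training set
setdiff(names(rawTest), names(ExoData))   # columns present only in the testing set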


In [2]:
b=sapply(ExoData, function(x) sum(is.na(x)))      # count NA values per column
FullData=subset(ExoData, select=c(which(!b>0)))   # keep only columns with no NAs
#str(FullData)
dim(FullData)


Out[2]:
  1. 19622
  2. 60

which leaves us with only 60 variables to train on.


In [4]:
smartData=FullData
smartData=smartData[,colSums(smartData != 0) != 0]   # drop any columns that are entirely zero
dim(smartData)
s=sapply(smartData, function(x) sum(is.na(x)))        # confirm no NA values remain
#str(smartData)


Out[4]:
  1. 19622
  2. 60
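
Since the zero-column check above removes nothing, an alternative worth noting is caret's nearZeroVar(), which flags constant and near-constant columns; a minimal sketch, not run as part of this analysis:


In [ ]:
# Sketch only: flag near-zero-variance columns with caret.
library(caret)
nzv <- nearZeroVar(FullData, saveMetrics=TRUE)
rownames(nzv)[nzv$nzv]   # names of any near-zero-variance columns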

RANDOM FOREST MODEL

We fit a random forest model, using cross-validation to control overfitting:


In [5]:
library(caret)
library(mlbench)
set.seed(3)


Loading required package: lattice
Loading required package: ggplot2

In [6]:
tData=smartData
tData$cvtd_timestamp=NULL # with this column in, there was a factor level mismatch with the final validation set
tData$new_window=NULL     # removed for the same reason
dim(tData)
trainIdx=createDataPartition(tData$classe, p = .75, list=FALSE)
trainD=tData[trainIdx,]
testD=tData[-trainIdx,]
x <- trainD[,-58]   # predictors (classe is column 58)
y <- trainD[,58]    # outcome: classe


Out[6]:
  1. 19622
  2. 58
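
An alternative to dropping cvtd_timestamp and new_window, sketched below, would be to force the final validation set's factor levels to match the training data's (svData is the cleaned validation frame built further down; whether the timestamp column is worth keeping at all is a separate question):


In [ ]:
# Sketch only: align factor levels instead of dropping the columns.
# smartData still contains both columns; svData is the cleaned validation
# frame created later in this notebook.
for (col in c("cvtd_timestamp", "new_window")) {
  svData[[col]] <- factor(svData[[col]], levels = levels(smartData[[col]]))
}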

In [7]:
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)


Loading required package: foreach
Loading required package: iterators

CROSS VALIDATION

We use 10-fold cross-validation to control overfitting:


In [8]:
fitControl <- trainControl(method = "cv",
                           number = 10,
                           allowParallel = TRUE)

In [9]:
fit <- train(x, y, method="rf", trControl = fitControl)  # data= is unnecessary when x and y are supplied
plot(fit)


Loading required package: randomForest
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:ggplot2’:

    margin
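
After training, the parallel cluster should be shut down and sequential processing restored; the per-fold resampling results can also be inspected to see where accuracy peaks. A short sketch:


In [ ]:
# Shut down the parallel cluster and return to sequential processing.
stopCluster(cluster)
registerDoSEQ()

# Cross-validated accuracy for each candidate mtry value, plus per-fold results.
fit$results
fit$resample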


In [11]:
rf.pred=predict(fit, testD[,-58])          # predict on the held-out test partition
confusionMat=table(rf.pred, testD[,58])    # confusion table: predictions vs. actual classe
confusionMat


Out[11]:
       
rf.pred    A    B    C    D    E
      A 1395    0    0    0    0
      B    0  949    1    0    0
      C    0    0  854    0    0
      D    0    0    0  804    0
      E    0    0    0    0  901


SUSPICIOUSLY HIGH ACCURACY

The results appear almost too good to be true, which would normally suggest overfitting. However, we use cross-validation to select the model, and accuracy appears to peak above 0.9998 at around 30 variables. We were unable to find an explanation pointing to overfitting and had to conclude that the high performance on the test set is probably due to the data being artificial and not containing enough noise.
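
The expected out-of-sample error can be estimated directly from the held-out test partition. A sketch using the confusion table above (caret's confusionMatrix() reports the same accuracy together with a confidence interval):


In [ ]:
# Estimate out-of-sample accuracy and error from the held-out test partition.
oosAccuracy <- sum(diag(confusionMat)) / sum(confusionMat)
oosError <- 1 - oosAccuracy
c(accuracy = oosAccuracy, error = oosError)

# caret's confusionMatrix() gives accuracy, kappa and a confidence interval.
confusionMatrix(rf.pred, testD[,58])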


In [10]:
# apply the same cleaning steps to the 20-case validation file
vData=read.csv(file="pml-testing.csv", na.strings=c("NA",""), header=TRUE)
b=sapply(vData, function(x) sum(is.na(x)))
fvData=subset(vData, select=c(which(!b>0)))   # keep only columns with no NAs
#str(vData)
#dim(vData)
svData=fvData
svData=svData[,colSums(svData != 0) != 0]     # drop any columns that are entirely zero
dim(svData)
s=sapply(svData, function(x) sum(is.na(x)))
#str(svData)


Out[10]:
  1. 20
  2. 60

PREDICTION RESULTS

Below are the prediction results for the 20 test cases:


In [12]:
validationData=svData
validationData$cvtd_timestamp=NULL # drop the same two columns removed before training
validationData$new_window=NULL
#nrow(validationData)
#str(validationData)
#dim(validationData)
validation.pred=predict(fit, validationData)
validation.pred


Out[12]:
  1. A
  2. A
  3. A
  4. A
  5. A
  6. A
  7. A
  8. A
  9. A
  10. A
  11. A
  12. A
  13. A
  14. A
  15. A
  16. A
  17. A
  18. A
  19. A
  20. A
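
For submission, each of the 20 predictions is typically written to its own text file; a minimal sketch (the helper name pml_write_files is illustrative, not something defined earlier in this notebook):


In [ ]:
# Sketch only: write each prediction to its own file for submission.
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
  }
}
pml_write_files(as.character(validation.pred))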
