The goal of this project is to predict the manner in which the participants performed the exercise, encoded as the "classe" variable in the training set; any of the other variables may be used as predictors. This report describes how the model was built, how cross-validation was used, what the expected out-of-sample error is, and why these choices were made. The fitted model is then used to predict 20 separate test cases.
In [1]:
# Read the training data, treating empty strings as missing values
ExoData <- read.csv(file="pml-training.csv", na.strings=c("NA",""), header=TRUE)
dim(ExoData)
Out[1]:
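A quick look at the outcome we are asked to predict can help before any cleaning; this is an optional check that was not part of the original run:
In [ ]:
# Distribution of the outcome variable "classe" across its five levels
table(ExoData$classe)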
In [2]:
# Count missing values per column and keep only the fully observed columns
b <- sapply(ExoData, function(x) sum(is.na(x)))
FullData <- ExoData[, b == 0]
dim(FullData)
Out[2]:
This leaves us with only 60 variables to train on.
In [4]:
# Drop any columns that consist entirely of zeros
smartData <- FullData
smartData <- smartData[, colSums(smartData != 0) != 0]
dim(smartData)
# Sanity check: no missing values should remain in any column
s <- sapply(smartData, function(x) sum(is.na(x)))
Out[4]:
In [5]:
library(caret)
library(mlbench)
set.seed(3)
In [6]:
tData <- smartData
# Drop cvtd_timestamp and new_window: with these in, there was a factor level
# mismatch with the final validation set
tData$cvtd_timestamp <- NULL
tData$new_window <- NULL
dim(tData)
# Split the training file 75/25 into a training set and a hold-out test set
trainIdx <- createDataPartition(tData$classe, p = .75, list = FALSE)
trainD <- tData[trainIdx, ]
testD  <- tData[-trainIdx, ]
x <- trainD[, -58]   # predictors
y <- trainD[, 58]    # outcome: classe
Out[6]:
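Since later cells index the outcome by position (column 58), a small defensive check, added here as a sketch, confirms that the hard-coded index still points at "classe":
In [ ]:
# Guard against the hard-coded column index drifting out of sync with the data
stopifnot(identical(names(trainD)[58], "classe"))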
In [7]:
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
In [8]:
fitControl <- trainControl(method = "cv",
                           number = 10,
                           allowParallel = TRUE)
In [9]:
# Random forest tuned with 10-fold cross-validation (x/y interface, so no data argument is needed)
fit <- train(x, y, method = "rf", trControl = fitControl)
plot(fit)
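The worker cluster started in In [7] is never shut down in the original run; a minimal cleanup step, using the cluster object created above (registerDoSEQ() comes from the foreach package that doParallel loads), would be:
In [ ]:
# Release the worker processes and fall back to sequential execution
stopCluster(cluster)
registerDoSEQ()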
In [11]:
# Predict on the hold-out test set and cross-tabulate predictions against the true classe
rf.pred <- predict(fit, testD[, -58])
confusionMat <- table(rf.pred, testD[, 58])
confusionMat
Out[11]:
The results look almost too accurate to be true, which would normally suggest overfitting. However, the model is tuned with 10-fold cross-validation, and the cross-validated accuracy peaks above 0.9998 at roughly 30 randomly selected predictors per split. We were unable to find evidence of overfitting and concluded that the high performance on the test set is probably due to the data being artificial and containing very little noise.
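To put a number on the expected out-of-sample error, the hold-out confusion matrix from In [11] can be summarised directly; this is a small sketch that was not part of the original run:
In [ ]:
# Estimated out-of-sample accuracy and error rate from the hold-out test set
oos.accuracy <- sum(diag(confusionMat)) / sum(confusionMat)
oos.error    <- 1 - oos.accuracy
c(accuracy = oos.accuracy, error = oos.error)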
In [10]:
# Apply the same cleaning steps to the 20-case validation file
vData <- read.csv(file="pml-testing.csv", na.strings=c("NA",""), header=TRUE)
b <- sapply(vData, function(x) sum(is.na(x)))
fvData <- vData[, b == 0]                        # keep only fully observed columns
svData <- fvData
svData <- svData[, colSums(svData != 0) != 0]    # drop all-zero columns
dim(svData)
s <- sapply(svData, function(x) sum(is.na(x)))   # sanity check: no NAs should remain
Out[10]:
In [12]:
# Drop the same two columns as in the training data, then predict the 20 validation cases
validationData <- svData
validationData$cvtd_timestamp <- NULL
validationData$new_window <- NULL
validation.pred <- predict(fit, validationData)
validation.pred
Out[12]:
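For readability, the 20 predictions can be paired with their case identifiers; this sketch assumes the validation file carries the usual problem_id column and was not part of the original run:
In [ ]:
# Pair each predicted classe with its problem_id from the validation file
setNames(as.character(validation.pred), svData$problem_id)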