The source for the training data is:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The source for the test data is:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
Many thanks to the authors for providing this dataset for free use. http://groupware.les.inf.puc-rio.br/har
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises.
Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.
In this project we attempt to predict the classe outcome for the testing dataset linked above, using models built from the training dataset.
Our first step will be to load the necessary libraries for our analysis.
Our second step will be to read the data and partition the training dataset into 60% training and 40% validation.
Next we need to clean the data by removing NA-dominated variables and other anomalies, reducing it to a workable size.
Our third step is to implement several machine learning algorithms, namely Decision Tree, Random Forest and Support Vector Machine, to predict the testing dataset.
Our fourth step is to assess the performance and accuracy of these methods.
Our fifth and final step is to use the algorithm with the highest accuracy to predict the test dataset provided.
In [21]:
library(caret)         # For training models and applying machine learning algorithms
library(ggplot2)       # For plotting
library(rpart)         # Decision trees
library(rpart.plot)    # Plotting rpart trees
library(rattle)        # fancyRpartPlot() for nicer tree plots
library(randomForest)  # Random Forest
library(e1071)         # Support Vector Machine
library(dplyr)         # Data manipulation (arrange, etc.)
set.seed(111)          # For reproducibility
In [22]:
# We will use url0 for the training dataset and url1 for the testing dataset
url0 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# Download and read datasets
train <- read.csv(url(url0))
test <- read.csv(url(url1))
In [23]:
# Partition the training data into 60% training and 40% validation (dimensions are checked below).
# Ensure the outcome is a factor so the classifiers and confusionMatrix() treat it as categorical
# (recent versions of read.csv no longer convert strings to factors automatically).
train$classe <- as.factor(train$classe)
trainIndex <- createDataPartition(train$classe, p = 0.60, list = FALSE)
trainData <- train[ trainIndex, ]
validData <- train[-trainIndex, ]
In [24]:
dim(trainData)
In [25]:
dim(validData)
In [26]:
# Both dim() calls above report 160 variables, many of which are dominated by NA records.
# Drop any variable in which 10% or more of the records are NA; in this dataset those
# columns are over 90% NA, so essentially no information is lost.
keep.cols <- colMeans(is.na(trainData)) < 0.1
trainData <- trainData[, keep.cols]
validData <- validData[, keep.cols]  # same selection, so both sets keep identical columns
dim(trainData)
dim(validData)
# Remove the first five columns (row index, user name and timestamps), which are identifiers
# rather than sensor measurements and are not needed in this analysis.
trainData <- trainData[, -(1:5)]
validData <- validData[, -(1:5)]
dim(trainData)
dim(validData)
# Remove all variables with near-zero variance
near.zero.var <- nearZeroVar(trainData)
trainData <- trainData[, -near.zero.var]
validData <- validData[, -near.zero.var]
dim(trainData)
dim(validData)
We have now reduced the number of variables from 160 to 54. Since trainData and validData retain the same set of variables, we can apply our prediction algorithms to both without further adjustment.
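As a quick sanity check (an added sketch, not part of the original steps), we can confirm that the two partitions expose exactly the same predictors in the same order before fitting any models.
In [ ]:
# Added sanity check: both partitions should have identical column names
identical(names(trainData), names(validData))  # expected to return TRUE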
In [27]:
# Decision tree (rpart) trained through caret, then evaluated on the validation set
mod.train.dt <- train(classe ~ ., method = "rpart", data = trainData)
mod.predict.dt <- predict(mod.train.dt, validData)
cm.dt <- confusionMatrix(mod.predict.dt, validData$classe)
print(mod.train.dt$finalModel)
In [28]:
fancyRpartPlot(mod.train.dt$finalModel,cex=.5,under.cex=1,shadow.offset=0)
In [29]:
# Random Forest: 200 trees, 3 variables tried at each split, progress printed every 25 trees.
# The out-of-bag (OOB) samples provide an internal error estimate, so no separate
# cross-validation argument is needed (randomForest() does not take one).
mod.train.rf <- randomForest(classe ~ ., data = trainData, mtry = 3, ntree = 200, do.trace = 25)
In [30]:
# Evaluate the Random Forest on the validation set
mod.predict.rf <- predict(mod.train.rf, validData)
cm.rf <- confusionMatrix(mod.predict.rf, validData$classe)
In [31]:
# Variable importance according to the Random Forest
imp.rf <- importance(mod.train.rf)
# Keep the variable names as a column, since dplyr::arrange() drops row names
imp.rf.df <- data.frame(Variable = rownames(imp.rf), MeanDecreaseGini = imp.rf[, "MeanDecreaseGini"], row.names = NULL)
imp.rf.arranged <- arrange(imp.rf.df, desc(MeanDecreaseGini))
head(imp.rf.arranged, 15)
In [32]:
varImpPlot(mod.train.rf, n.var = 15, sort = TRUE, main = "Variable Importance", lcolor = "blue", bg = "purple")
In [33]:
# Support Vector Machine (e1071 defaults), evaluated on the validation set
mod.train.svm <- svm(classe ~ ., data = trainData)
mod.predict.svm <- predict(mod.train.svm, validData)
cm.svm <- confusionMatrix(mod.predict.svm, validData$classe)
In [34]:
# Collect the accuracy of each model on the validation set and rank the algorithms
a.dt <- cm.dt$overall[1]
a.rf <- cm.rf$overall[1]
a.svm <- cm.svm$overall[1]
cm.dataframe <- data.frame(Algorithm = c("Decision Tree", "Random Forest", "Support Vector Machine"),
                           Index = c("dt", "rf", "svm"),
                           Accuracy = c(a.dt, a.rf, a.svm))
cm.dataframe <- arrange(cm.dataframe, desc(Accuracy))
cm.dataframe
We can clearly see that Random Forest has the highest accuracy at ~ 99.4%, followed by Support Vector Machine at ~ 94.6%. Decision Tree gave us the lowest accuracy at ~ 47.7%.
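For a quick visual comparison, the accuracies gathered in cm.dataframe can also be plotted with ggplot2 (already loaded above); this is a small added sketch, not part of the original analysis.
In [ ]:
# Added sketch: bar chart of validation accuracy per algorithm
ggplot(cm.dataframe, aes(x = reorder(Algorithm, Accuracy), y = Accuracy)) +
    geom_col(fill = "steelblue") +
    coord_flip() +
    labs(x = "Algorithm", y = "Validation accuracy", title = "Validation accuracy by algorithm")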
In [35]:
# In-sample error rate: 1 minus the Random Forest accuracy on the validation set
InSampError.rf <- (1 - unname(a.rf)) * 100
InSampError.rf
We can see that the in-sample error is approximately 0.6%.
In [36]:
# Out-of-sample error rate: the OOB estimate reported by the Random Forest
print(mod.train.rf)
We can see that the OOB (Out of Bag) or out-of-sample error estimate of the Random Forest is 0.53%, which is consistent with the confusion matrix.
It is worth noting that the Random Forest's OOB estimate does not require a separate cross-validation step to reduce bias.
The in-sample error above is actually slightly higher than the OOB estimate, which is somewhat counterintuitive. It might be due to variance in the estimation of the two error rates, or to mild overfitting. Nonetheless, the prediction in the next section shows the model to be highly accurate.
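The same OOB figure can also be extracted programmatically from the fitted forest; this is a small added sketch relying on the err.rate component that randomForest stores for classification models.
In [ ]:
# Added sketch: the last row of err.rate holds the OOB error after all 200 trees
oob.error.pct <- mod.train.rf$err.rate[mod.train.rf$ntree, "OOB"] * 100
oob.error.pct  # should match the OOB estimate printed by print(mod.train.rf)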
In [37]:
# Final prediction: apply the Random Forest model to the 20 test cases
fp.rf <- predict(mod.train.rf, newdata = test)
fp.rf
Random Forest, the most accurate of the three models, is therefore used to predict the testing dataset, and it correctly predicted all 20 cases.
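For easier review, the predicted classes can be paired with the case identifiers; this is a small added sketch that assumes the testing file keeps its problem_id column (as pml-testing.csv does).
In [ ]:
# Added sketch: tabulate each test case ID alongside its predicted classe
data.frame(problem_id = test$problem_id, predicted = fp.rf)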