Multi-class classification. Dataset description: https://archive.ics.uci.edu/ml/datasets/Iris
In [59]:
library(caret)
In [60]:
# attach the iris dataset to the environment
data(iris)
# rename the dataset
dataset <- iris
We need a way to know whether the models we create are any good. We are going to hold back some data that the algorithms will not get to see, and we will use this data to get a second, independent idea of how accurate the best model might actually be. We will split the loaded dataset in two: 80% will be used to train our models and 20% will be held back as a validation dataset.
In [61]:
# create a list of 80% of the rows in the original dataset we can use for training
validation_index <- createDataPartition(dataset$Species, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- dataset[-validation_index,]
# use the remaining 80% of the data for training and testing the models
dataset <- dataset[validation_index,]
You now have training data in the dataset variable and a validation set, to be used later, in the validation variable. Note that we replaced our dataset variable with the 80% sample of the dataset. This keeps the rest of the code simpler and more readable.
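As an optional sanity check (not part of the original walkthrough), we can confirm that createDataPartition produced a stratified split, i.e. that both subsets keep roughly the same class proportions:
# optional: the split is stratified by Species, so both subsets should
# keep roughly the same class proportions
prop.table(table(dataset$Species))
prop.table(table(validation$Species))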
Dimensions of the dataset: how many rows and columns does it have?
In [62]:
dim(dataset)
List the types of each attribute. This gives us an idea of how to better summarize the data we have, and of the transforms we might need to apply to prepare the data before modeling it.
In [63]:
sapply(dataset, class)
In [64]:
# take a peek at the first few rows of the data (head() shows 6 by default)
head(dataset)
List the levels of the class. This is a multi-class (multinomial) classification problem; if there were only two levels, it would be a binary classification problem.
In [65]:
levels(dataset$Species)
Class distribution: a look at the number of instances (rows) that belong to each class, both as an absolute count and as a percentage. Each class has the same number of instances (40, or about 33% of the training data).
In [66]:
# summarize the class distribution
percentage <- prop.table(table(dataset$Species)) * 100
cbind(freq=table(dataset$Species), percentage=percentage)
Statistical summary: this includes the mean, the min and max values, as well as some percentiles.
In [67]:
# summarize attribute distributions
summary(dataset)
We can see that all of the numerical attributes use the same units (centimeters) and have similar ranges, within [0, 8] centimeters.
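If you want to confirm the ranges programmatically rather than reading them off the summary, a one-liner such as the following (an optional extra, not in the original workflow) will do it:
# optional: the min and max of each numeric attribute
sapply(dataset[,1:4], range)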
In [68]:
# a) Univariate plots
# split the data into input attributes (x) and the output class (y)
x <- dataset[,1:4]   # the four numeric measurements, selected by column index
y <- dataset[,5]     # the Species class label
# aside: arbitrary columns can also be selected by number, e.g. df[,c(1:3,6,9:12)]
Given that the input variables are numeric, we can create box and whisker plots of each.
In [69]:
# boxplots for numeric
par(mfrow=c(1,4))
for(i in 1:4) {
  boxplot(x[,i], main=names(x)[i])
}
In [70]:
# barplot for class breakdown
plot(y)
# This confirms what we learned in the last section: the instances are evenly distributed across the three classes
Multivariate plots look at the interactions between the variables. Let’s look at scatter plots of all pairs of attributes, with the points colored by class. In addition, because the scatter plots show that the points for each class are generally separate, we can draw ellipses around them.
We can see some clear relationships between the input attributes (trends) and between the attributes and the class values (ellipses).
In [71]:
# scatterplot matrix (plot="ellipse" requires the ellipse package to be installed)
featurePlot(x=x, y=y, plot="ellipse")
We can also look at box and whisker plots of each input variable again, but this time broken down into separate plots for each class. This can help to tease out obvious linear separations between the classes. This is useful as it shows that there are clearly different distributions of the attributes for each class value.
In [72]:
# box and whisker plots for each attribute
featurePlot(x=x, y=y, plot="box")
Next we can get an idea of the distribution of each attribute, again broken down by class value like the box and whisker plots. Sometimes histograms are good for this, but in this case we will use probability density plots to give nice smooth lines for each distribution.
Like the boxplots, we can see the difference in distribution of each attribute by class value. We can also see the Gaussian-like distribution (bell curve) of each attribute.
In [73]:
# density plots for each attribute by class value
scales <- list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x, y=y, plot="density", scales=scales)
Now it is time to create some models of the data and estimate their accuracy on unseen data.
Set up the test harness to use 10-fold cross-validation to estimate accuracy. This will split our dataset into 10 parts, train on 9 and test on 1, and repeat for all combinations of train-test splits.
We are using the metric of Accuracy to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). We will be using the metric variable when we build and evaluate each model in the next section.
In [74]:
# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
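To make the Accuracy metric concrete, here is a tiny illustrative calculation on made-up vectors (hypothetical data, purely to show the formula caret is applying; it is not part of the modeling pipeline):
# illustrative only: accuracy = correct predictions / total instances * 100
actual    <- factor(c("setosa", "setosa", "versicolor", "virginica"))
predicted <- factor(c("setosa", "versicolor", "versicolor", "virginica"), levels=levels(actual))
mean(predicted == actual) * 100   # 3 of 4 correct = 75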
We don’t know which algorithms would be good on this problem or what configurations to use. We do get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results. Let’s evaluate 5 different algorithms: Linear Discriminant Analysis (LDA), Classification and Regression Trees (CART), k-Nearest Neighbors (KNN), Support Vector Machines (SVM) with a radial kernel, and Random Forest (RF).
This is a good mixture of simple linear (LDA), nonlinear (CART, KNN) and complex nonlinear methods (SVM, RF). We reset the random number seed before each run to ensure that each algorithm is evaluated using exactly the same data splits, which makes the results directly comparable.
In [75]:
# a) linear algorithms
# LDA
set.seed(7)
fit.lda <- train(Species~., data=dataset, method="lda", metric=metric, trControl=control)
# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(Species~., data=dataset, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(Species~., data=dataset, method="knn", metric=metric, trControl=control)
# c) advanced algorithms
# SVM
set.seed(7)
fit.svm <- train(Species~., data=dataset, method="svmRadial", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(Species~., data=dataset, method="rf", metric=metric, trControl=control)
We now have 5 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate. We can report on the accuracy of each model by first creating a list of the models, gathering resample statistics and using the summary function on the result.
In [76]:
# d) compare algorithms
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
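If you want to go beyond eyeballing the table, caret can also summarize the pairwise differences between the resampled results (an optional extra beyond the original tutorial):
# optional: pairwise differences in accuracy and kappa between the models
diffs <- diff(results)
summary(diffs)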
In the table of results, we can see the distribution of both the Accuracy and Kappa of the models. Let’s just focus on Accuracy for now.
We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10-fold cross-validation).
In [77]:
dotplot(results)
The results for just the LDA model can be summarized.
In [78]:
# summarize Best Model
# Output estimated accuracy of a model.
print(fit.lda)
This gives a nice summary of what was used to train the model and the mean and standard deviation (SD) accuracy achieved, specifically 97.5% accuracy +/- 4%.
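If you prefer to pull those numbers out programmatically rather than reading the printed summary, the train object stores the resampled metrics in its results element (a small optional extra):
# optional: mean accuracy and its standard deviation across the 10 folds
fit.lda$results[, c("Accuracy", "AccuracySD")]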
The LDA algorithm created the most accurate model. Now we want to get an idea of its accuracy on our validation set. This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak; either would give an overly optimistic estimate. We can run the LDA model directly on the validation set and summarize the results in a confusion matrix.
In [79]:
set.seed(7)
predictions <- predict(fit.lda, newdata=validation)
confusionMatrix(predictions, validation$Species)
We can see that the accuracy is 100%. The validation dataset is small, but this result is within our expected margin of 97.5% +/- 4%, suggesting we may have an accurate and reliable model.
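If you want the headline numbers from the confusion matrix as values rather than printed output, they are available in the returned object's overall element (again optional):
# optional: extract overall accuracy and kappa from the confusion matrix object
cm <- confusionMatrix(predictions, validation$Species)
cm$overall[c("Accuracy", "Kappa")]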