This is an introduction to Decision Trees using the R language. Using the iris dataset, we'll build a tree that identifies the Species of a flower from the width and length of its Sepal and Petal.
For this purpose we'll use the caret library together with the rpart package (recursive partitioning for classification, regression and survival trees), which caret wraps via method="rpart".
We'll also need the rattle package to do some fancy plots of the fitted model.
Let's start!
In [1]:
library(datasets) # provides the iris dataset
library(caret)    # featurePlot(), train(), createDataPartition(), ...
library(rattle)   # fancy plotting of rpart trees
In [2]:
indata <- datasets::iris # work on a copy of the built-in iris data
In [3]:
head(indata)
Out[3]:
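  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa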
In [5]:
pairs(~ ., data = indata)
Or we can make it a bit prettier with caret's featurePlot:
In [7]:
featurePlot(x = indata[, 1:4],
            y = indata$Species,
            plot = "pairs",
            ## Add a key at the top
            auto.key = list(columns = 3))
One can produce the same plot but with an ellipse drawn around each species cluster.
In [8]:
featurePlot(x = indata[, 1:4],
            y = indata$Species,
            plot = "ellipse",
            ## Add a key at the top
            auto.key = list(columns = 3))
Now let's draw some density plots:
In [9]:
featurePlot(x = indata[, 1:4],
            y = indata$Species,
            plot = "density",
            ## Pass in options to xyplot() to
            ## make it prettier
            scales = list(x = list(relation = "free"),
                          y = list(relation = "free")),
            adjust = 1.5,
            pch = "|",
            layout = c(4, 1),
            auto.key = list(columns = 3))
... and some box plots:
In [10]:
featurePlot(x = indata[, 1:4],
            y = indata$Species,
            plot = "box",
            ## Pass in options to bwplot()
            scales = list(y = list(relation = "free"),
                          x = list(rot = 90)),
            layout = c(4, 1),
            auto.key = list(columns = 2))
In [11]:
set.seed(1987) # fix the RNG so the random split below is reproducible
To perform the split, I will sample 60% of the integers from 1 to nrow(indata); since there are 150 rows, that gives 90 integers. These 90 random integers will be the row indices of my training sample, and the remaining rows will form my test sample.
Let's skip cross-validation and validation samples for the moment...
In [12]:
train_indx <- sample(nrow(indata), floor(nrow(indata) * 0.6))
# nrow(indata) = 150; floor(nrow(indata) * 0.6) = 90
In [13]:
train_indx
Out[13]:
In [14]:
train_sample <- indata[train_indx, ]
In [16]:
test_sample <- indata[-train_indx, ]
For example, my test sample now holds:
In [17]:
test_sample
Out[17]:
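As an aside, caret itself offers createDataPartition() for this job; it draws a stratified split that preserves the class proportions of Species. A minimal sketch (the *2 names are hypothetical, chosen so we don't clobber the objects above):
In [ ]:
## Stratified 60/40 split with caret, as an alternative to the manual sample()
train_indx2 <- createDataPartition(indata$Species, p = 0.6, list = FALSE)
train_sample2 <- indata[train_indx2, ]
test_sample2 <- indata[-train_indx2, ]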
Now, to train the model, we'll use caret's train function.
It needs a formula object specifying which variable is the response (Y) and which are the predictors (X's), in the form Y ~ X1 + X2 + ...
We also need to pass the name of the model as a string. To find out which model names are included in the caret package, simply run:
In [18]:
names(getModelInfo())
Out[18]:
So first let's create the formula
In [19]:
formula <- as.formula(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width)
And train the model by running:
In [20]:
tr <- train(formula, data = train_sample, method = "rpart")
# rpart settings such as minsplit = 2, minbucket = 1, cp = 0.001, maxdepth = 8
# can also be supplied here; see the sketch below.
Additional settings/parameters of the model can be set in the train function. For example, minsplit in the rpart model is the minimum number of observations a node must contain for a split to be attempted. Similarly, minbucket is the minimum number of observations any leaf node must have, and cp is the complexity parameter: any split that does not decrease the overall lack of fit by a factor of cp is not attempted. Finally, maxdepth limits the depth of the tree (the root node counting as depth 0).
N.B. The complexity parameter is used to control the size of the decision tree and to select an optimal tree size. It is useful when you want to inspect the values of cp for various tree sizes. The default value is 0.01.
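Here is a minimal sketch of how such settings could be passed, assuming train() forwards a control object built with rpart.control() to the underlying rpart() call (tr_tuned is a hypothetical name):
In [ ]:
library(rpart) # for rpart.control()
## Grow a deliberately deep tree by relaxing the stopping rules;
## note that caret still tunes cp, so it may override the value given here.
tr_tuned <- train(formula, data = train_sample, method = "rpart",
                  control = rpart.control(minsplit = 2, minbucket = 1,
                                          cp = 0.001, maxdepth = 8))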
Printing the tr object gives us some information about the model:
In [29]:
print(tr)
In [27]:
summary(tr)
In [26]:
post(tr$finalModel, file = "")
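Since we loaded rattle at the start, we can also draw a prettier rendering of the same tree with its fancyRpartPlot() function:
In [ ]:
## rattle's fancier rendering of the fitted rpart tree
fancyRpartPlot(tr$finalModel)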
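Finally, to put the held-out test sample to use, we can score it and compare the predictions against the true species. A minimal sketch using predict() on the train object and caret's confusionMatrix() (preds is a hypothetical name):
In [ ]:
## Predict the species of the held-out rows and summarise the accuracy
preds <- predict(tr, newdata = test_sample)
confusionMatrix(preds, test_sample$Species)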