In [12]:
# helloMachine.r
# MWL, Lecture 1
# Author(s): [Phil Snyder]
In [13]:
data(iris) # load data (dataset "iris" comes with your R installation)
In [14]:
"
R doesn't support multi-line comments, but we can get away with
benignly passing it a string instead.
"
Out[14]:
In [15]:
iris
Out[15]:
In [16]:
help(topic='iris', package='datasets')
Out[16]:
In [17]:
names(iris)
Out[17]:
In [18]:
class(iris)
Out[18]:
In [19]:
str(iris) # str := structure
In [20]:
levels(iris$Species)
Out[20]:
In [21]:
?levels
Out[21]:
In [22]:
nrow(iris)
Out[22]:
In [23]:
ncol(iris)
Out[23]:
In [24]:
iris[1:3,] # pay attention to where the comma is!
Out[24]:
In [25]:
iris[,1:3]
Out[25]:
In [26]:
iris$Sepal.Length
Out[26]:
In [27]:
iris[,'Sepal.Length']
Out[27]:
In [60]:
setosa <- iris[iris$Species == "setosa",] # get rows from iris where Species == "setosa"
linearModel <- lm(formula = Sepal.Length ~ Sepal.Width, data = setosa) # ~ := "as explained by"
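For a quick look at the fit (not shown in the original notebook), summary() reports the estimated coefficients along with their standard errors and the R-squared:
summary(linearModel) # coefficients, residual summary, R-squared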
In [61]:
linearModel
Out[61]:
In [62]:
names(linearModel)
Out[62]:
In [63]:
linearModel$call
Out[63]:
In [64]:
plot(formula = Sepal.Length ~ Sepal.Width, data = setosa)
abline(linearModel)
In [65]:
# sapply is like a map function
squaredResiduals <- sapply(linearModel$residuals, function(x) return(x^2))
squaredError <- sum(squaredResiduals) / length(squaredResiduals) # i.e., the mean squared error
squaredError
Out[65]:
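As an aside (an equivalent one-liner, not from the original notebook), this quantity is just the mean of the squared residuals, so it can also be computed directly:
mean(resid(linearModel)^2) # same value as squaredError above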
In [66]:
# What if we remove the potential outlier at (2.3, 4.5)?
setosa[42,]
Out[66]:
In [81]:
setosaMinusOutlier <- setosa[-42,]
fixedLinearModel <- lm(Sepal.Length ~ Sepal.Width, setosaMinusOutlier)
plot(Sepal.Length ~ Sepal.Width, setosaMinusOutlier)
abline(fixedLinearModel)
In [82]:
fixedSquaredResiduals <- sapply(fixedLinearModel$residuals, function(x) return(x^2))
fixedSquaredError <- sum(fixedSquaredResiduals) / length(fixedSquaredResiduals)
fixedSquaredError
Out[82]:
Not much improvement in this case.
When is removing outliers justified? What if we had seen a significant decrease in the squared error? Is Sepal.Length ~ Sepal.Width a smart thing to model, or are there hidden variables (e.g., mean amount of daily sunshine, precipitation levels, exposure to wind, ...)?
In [83]:
# now for some classification
flowers <- subset(iris, Species == "setosa" | Species == "virginica")
# equivalently flowers <- iris[iris$Species == "setosa" | iris$Species == "virginica",]
# R has different "OR" operators: | is vectorized (element-wise over logical vectors),
# while || works only on single, length-one logicals. Be sure to use a single bar | here.
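A quick illustration of the difference (toy values, not part of the lecture):
c(TRUE, FALSE) | c(FALSE, FALSE) # element-wise: TRUE FALSE
TRUE || FALSE                    # single logical, as in an if() condition: TRUE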
In [84]:
# Here our data has two classes, hence we are performing "binary classification".
plot(flowers[,1:4], pch=sapply(flowers$Species, substr, 1, 1))
In [85]:
flowers$Species <- as.factor(flowers$Species)
In this case the conversion is unnecessary because class(flowers$Species) is already "factor", but in general we need to make our categorical response variables factors for classification.
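For example (a toy character vector, not from the lecture data), a character label column has to be converted before it can serve as a classification response:
labels <- c("cat", "dog", "dog", "cat")
class(labels)            # "character"
class(as.factor(labels)) # "factor"
levels(as.factor(labels)) # "cat" "dog"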
In [86]:
flowers <- droplevels(flowers) # need to drop "versicolor" from Species levels since it's empty
dotchart(flowers$Petal.Length, pch=sapply(flowers$Species, substr, 1, 1))
Notice how the data is linearly separable. (In a dot chart the y-axis is just the row index, so only the horizontal position matters.)
In [87]:
# 1-d case
library(MASS)
oneLinearClass <- lda(Species ~ Petal.Length, flowers)
lda stands for Linear Discriminant Analysis. How it works is not important for now; just know that it attempts to draw a point/line/plane/hyperplane separating our two classes.
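As a rough illustration of where that separating point lands in the 1-d case (a sketch that assumes equal class priors, which holds here since both classes have 50 rows, and the shared variance LDA assumes), the boundary sits at the midpoint of the two class means:
classMeans <- tapply(flowers$Petal.Length, flowers$Species, mean) # one mean per class
boundary <- mean(classMeans) # midpoint of the two class means
boundary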
In [88]:
onePredictions <- predict(oneLinearClass, flowers)
onePredictionResults <- table(onePredictions$class == flowers$Species) / length(flowers$Species)
onePredictionResults
Out[88]:
Wow! Zero error! Who would've thunk it?
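Another way to view the same result (not in the original notebook) is a confusion matrix, which cross-tabulates predicted against actual classes:
table(predicted = onePredictions$class, actual = flowers$Species)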
In [89]:
# 2-d case
plot(Sepal.Length ~ Sepal.Width, flowers, pch=sapply(flowers$Species, substr, 1, 1))
In [90]:
twoLinearClass <- lda(Species ~ Sepal.Length + Sepal.Width, flowers)
twoPredictions <- predict(twoLinearClass, flowers)
twoPredictionResults <- table(twoPredictions$class == flowers$Species) / length(flowers$Species)
twoPredictionResults
Out[90]:
Here LDA is not perfect, even though our data is (barely) linearly separable. This is because we violated the LDA assumption that the two classes have the same covariance matrix. (More on this in later lectures.)
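One way to see the violated assumption (an illustrative check, not part of the lecture): compute each class's covariance matrix for these two features and note that they differ.
by(flowers[, c("Sepal.Length", "Sepal.Width")], flowers$Species, cov) # per-class covariance matrices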
Now let's try the entire iris dataset, including Species == "versicolor", in n dimensions. Here n = 4, since we have 4 features (we don't include our "Species" label).
In [91]:
labelMapper <- function(s) { # this is just to help us plot the data in the next line
    if (s == "setosa") return(1)
    if (s == "versicolor") return(2)
    if (s == "virginica") return(3)
}
plot(iris[,1:4], pch=sapply(iris$Species, labelMapper))
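Since Species is already a factor, the same 1/2/3 codes can be obtained without a helper (an equivalent shortcut, not in the original notebook):
plot(iris[,1:4], pch=as.integer(iris$Species)) # factor levels map to 1, 2, 3 in level order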
In [92]:
nLinearClass <- lda(Species ~ ., iris) # Species ~ . := Species as explained by "everything"
nPredictions <- predict(nLinearClass, iris)
nPredictionResults <- table(nPredictions$class == iris$Species) / length(iris$Species)
nPredictionResults
Out[92]:
This is actually pretty good, especially since all we're doing is drawing a hyperplane through our 4-dimensional data!
BUT, as we'll see next lecture, we have done something egregious that has given us a false sense of confidence about the accuracy of our predictor...
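As a preview (a minimal sketch of one possible fix, assuming the issue is that we predicted on the same rows we trained on; the variable names here are hypothetical):
set.seed(1) # hypothetical seed, for reproducibility only
trainIdx <- sample(nrow(iris), floor(0.7 * nrow(iris))) # hold out ~30% of the rows
heldOutModel <- lda(Species ~ ., iris[trainIdx,])        # fit only on the training rows
heldOutPreds <- predict(heldOutModel, iris[-trainIdx,])  # predict only on the unseen rows
mean(heldOutPreds$class == iris$Species[-trainIdx])      # accuracy on held-out data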