13. Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.
In [3]:
.libPaths("/Library/Frameworks/R.framework/Versions/3.4/Resources/library")
.libPaths() #Added R Studio's Package Path to Anaconda's Path
library(ISLR)
library(MASS)
In [4]:
head(Boston)
Let us create a response variable _highcrime which is 1 when the crime rate crim is above the median and 0 when below.
In [4]:
df = data.frame(matrix(0, ncol=1, nrow=nrow(Boston)))
colnames(df) = "high_crim"
new_Boston = cbind(df, Boston)
new_Boston$high_crim[new_Boston$crim > median(new_Boston$crim)] = 1
head(new_Boston, 20)
I first created a new data.frame initialized with the shape $ num \space samples \times 1 $ using the matrix()
function. Each entry was initialized to 0. The column name of the data.frame df
was changed to high crim. We then perform a column bind to prepend the existing Boston data frame with the new column entry. The value of high crim is then modified based on the values in crim.
We now have to perform the following steps:
Lets us take a look at the data from it's summary
In [87]:
summary(new_Boston)
From the get go, I noticed some high leverage samples because of the crim summary. $75\% $ of the samples have a values less than $3.6$. Yet the maximum goes up to 88.97! We see a similar effect in zn. In the case of age, there is a significant difference between the minimum and the 1st quantile mark.
Let us determine the dependence of this new class high crim with the other features
In [97]:
par(mfrow=c(2,2))
boxplot(crim~high_crim, data=new_Boston, xlab="High Crime?", ylab="crim")
boxplot(zn~high_crim, data=new_Boston, xlab="High Crime?", ylab="zn")
boxplot(indus~high_crim, data=new_Boston, xlab="High Crime?", ylab="indus")
boxplot(chas~high_crim, data=new_Boston, xlab="High Crime?", ylab="chas")
boxplot(nox~high_crim, data=new_Boston, xlab="High Crime?", ylab="nox")
boxplot(rm~high_crim, data=new_Boston, xlab="High Crime?", ylab="rm")
boxplot(age~high_crim, data=new_Boston, xlab="High Crime?", ylab="age")
boxplot(dis~high_crim, data=new_Boston, xlab="High Crime?", ylab="dis")
par(mfrow=c(2,3))
boxplot(rad~high_crim, data=new_Boston, xlab="High Crime?", ylab="rad")
boxplot(tax~high_crim, data=new_Boston, xlab="High Crime?", ylab="tax")
boxplot(ptratio~high_crim, data=new_Boston, xlab="High Crime?", ylab="ptratio")
boxplot(black~high_crim, data=new_Boston, xlab="High Crime?", ylab="black")
boxplot(lstat~high_crim, data=new_Boston, xlab="High Crime?", ylab="lstat")
boxplot(medv~high_crim, data=new_Boston, xlab="High Crime?", ylab="medv")
We'll only consider predictors where we see some significant separation. In our case, we exclude rm, chas.
Let us now split the data into train and test sets. We first shuffle the data using the sample function. Since the sample
function wants a vector of numbers, we shuffle numbers from 1 to the number of tuples. We then access the wor numbers of these tuples.
In [5]:
set.seed(1)
shuffled_idx = sample(nrow(new_Boston), nrow(new_Boston))
train_idx = shuffled_idx[1:(0.7*length(shuffled_idx))]
test_idx = shuffled_idx[(0.7*length(shuffled_idx)+1):length(shuffled_idx)]
train = new_Boston[train_idx,-c(5, 7)]
test = new_Boston[test_idx,-c(5, 7)]
dim(train)
dim(test)
In [141]:
model = glm(high_crim~. -crim -high_crim, train, family="binomial")
predictions = predict(model, test[,-1] ,type="response") # It should be test[,-c(1,2)]. Why it's working with _crim_?
preds = data.frame(matrix(0, nrow=length(predictions), ncol=1))
colnames(preds) = "preds"
new_frame = data.frame(cbind(predictions, preds))
new_frame$preds[new_frame$predictions > 0.5] = 1
predicted_labels = test[,1]
actual_labels = new_frame[,2]
confusion = table(predicted_labels, actual_labels)[2:1, 2:1]
confusion
Let us use our performance function written for question 10.
In [142]:
f_beta_measure = function(precision, recall, beta){
return ( 1/((beta*(1/precision))+ ((1-beta)*(1/recall))))
}
get_accuracy_scores = function(conf){
TP = conf[1,1]
FP = conf[1,2]
FN = conf[2,1]
TN = conf[2,2]
simple_accuracy = (TP+TN)/(TP+FP+TN+FN)
precision = TP/(TP+FP)
recall = TP/(TP+FN)
f = f_beta_measure(precision, recall, beta=0.5)
return (c(simple_accuracy, precision, recall, f))
}
get_accuracy_scores(confusion)
We have some pretty good accuracy scores. Simple Accuracy here is $92.71\%$, giving us a test error of $7.19\%$
We now train an LDA model without crim rate.
In [164]:
model = lda(high_crim~. -crim -high_crim, train)
preds = predict(model, test[,-1])
predicted_labels = preds$class
actual_labels = test[,1]
confusion = table(predicted_labels, actual_labels)[2:1, 2:1]
confusion
In [165]:
get_accuracy_scores(confusion)
So we get an accuracy of $86.75\%$, or a training error of $13.25\%$
But let us now exclude every other column, and use crim as the only covariate.
In [195]:
model = lda(high_crim~crim, train)
preds = predict(model, test[,1:2])
predicted_labels = preds$class
actual_labels = test[,1]
confusion = table(predicted_labels, actual_labels)[2:1, 2:1]
confusion
NOTE: It looks like the accuracy actually fell even though we only consider crim. It is surprising because high crim was created from crim. I expected a $ 0\% $ error rate. There doesn't seem to be unbalanced training data.
In [198]:
dim(train[train$high_crim < median(new_Boston$high_crim),])
dim(train[train$high_crim >= median(new_Boston$high_crim),])
We now train a QDA model without crime rate.
In [200]:
model = qda(high_crim~. -crim -high_crim, train)
preds = predict(model, test[,-1])
predicted_labels = preds$class
actual_labels = test[,1]
confusion = table(predicted_labels, actual_labels)[2:1, 2:1]
confusion
But what if crim is the only covariate?
In [201]:
model = qda(high_crim~crim, train)
preds = predict(model, test[,1:2])
predicted_labels = preds$class
actual_labels = test[,1]
confusion = table(predicted_labels, actual_labels)[2:1, 2:1]
confusion
In [202]:
get_accuracy_scores(confusion)
This works as intended. We have a very high performance in high crim as this was created from crim. From the results, we have $0$ false positives, giving us $100 \%$ Precision.
In [6]:
library(class)
In [13]:
X_train = train[,-1]
y_train = train[,1]
X_test = test[,-1]
y_test = test[,1]
tab = data.frame(matrix(ncol=4, nrow=0))
for (k in 1:100){
preds = knn(X_train, X_test, y_train, k=k)
conf = table(preds, y_test)[2:1, 2:1]
scores = get_accuracy_scores(conf)
tab = rbind(tab, scores)
}
colnames(tab) = c("Accuracy","Precision", "Recall", "F")
rownames(tab) = paste("", 1:nrow(tab)) # A little work around to visually display row numbers
head(tab)
In [20]:
x = 1:100
y = tab$Accuracy
plot(x=x, y=y, xlab="K", ylab="Simple Accuracy", type="line")
It is interesting to note the highest simaple accuracy is at k = 3.
In [21]:
tab[tab$Accuracy == max(tab$Accuracy),]
Let us take crim rate as the only covariate.
In [35]:
X_train = as.data.frame(train[,2])
y_train = train[,1]
X_test = as.data.frame(test[,2])
y_test = as.factor(test[,1])
tab = data.frame(matrix(ncol=4, nrow=0))
for (k in 1:100){
preds = knn(X_train, X_test, y_train, k=k)
conf = table(preds, y_test)[2:1, 2:1]
scores = get_accuracy_scores(conf)
tab = rbind(tab, scores)
}
colnames(tab) = c("Accuracy","Precision", "Recall", "F")
rownames(tab) = paste("", 1:nrow(tab)) # A little work around to visually display row numbers
head(tab)
We actually get a perfect predictions for all types of performance measures when $ \textit{K = 1,2, and 3}$!