13. Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.


In [3]:
.libPaths("/Library/Frameworks/R.framework/Versions/3.4/Resources/library")
.libPaths() #Added R Studio's Package Path to Anaconda's Path
library(ISLR)
library(MASS)


  1. '/Library/Frameworks/R.framework/Versions/3.4/Resources/library'
  2. '/Users/Ajay/anaconda/lib/R/library'

In [4]:
head(Boston)


  crim     zn  indus  chas  nox    rm     age   dis     rad  tax  ptratio  black   lstat  medv
  0.00632  18  2.31   0     0.538  6.575  65.2  4.0900  1    296  15.3     396.90  4.98   24.0
  0.02731   0  7.07   0     0.469  6.421  78.9  4.9671  2    242  17.8     396.90  9.14   21.6
  0.02729   0  7.07   0     0.469  7.185  61.1  4.9671  2    242  17.8     392.83  4.03   34.7
  0.03237   0  2.18   0     0.458  6.998  45.8  6.0622  3    222  18.7     394.63  2.94   33.4
  0.06905   0  2.18   0     0.458  7.147  54.2  6.0622  3    222  18.7     396.90  5.33   36.2
  0.02985   0  2.18   0     0.458  6.430  58.7  6.0622  3    222  18.7     394.12  5.21   28.7

Let us create a response variable high_crim, which is 1 when the crime rate crim is above the median and 0 when it is below.


In [4]:
df = data.frame(matrix(0, ncol=1, nrow=nrow(Boston)))
colnames(df) = "high_crim"
new_Boston = cbind(df, Boston)
new_Boston$high_crim[new_Boston$crim > median(new_Boston$crim)] = 1
head(new_Boston, 20)


  high_crim  crim     zn    indus  chas  nox    rm     age    dis     rad  tax  ptratio  black   lstat  medv
  0          0.00632  18.0  2.31   0     0.538  6.575  65.2   4.0900  1    296  15.3     396.90  4.98   24.0
  0          0.02731   0.0  7.07   0     0.469  6.421  78.9   4.9671  2    242  17.8     396.90  9.14   21.6
  0          0.02729   0.0  7.07   0     0.469  7.185  61.1   4.9671  2    242  17.8     392.83  4.03   34.7
  0          0.03237   0.0  2.18   0     0.458  6.998  45.8   6.0622  3    222  18.7     394.63  2.94   33.4
  0          0.06905   0.0  2.18   0     0.458  7.147  54.2   6.0622  3    222  18.7     396.90  5.33   36.2
  0          0.02985   0.0  2.18   0     0.458  6.430  58.7   6.0622  3    222  18.7     394.12  5.21   28.7
  0          0.08829  12.5  7.87   0     0.524  6.012  66.6   5.5605  5    311  15.2     395.60  12.43  22.9
  0          0.14455  12.5  7.87   0     0.524  6.172  96.1   5.9505  5    311  15.2     396.90  19.15  27.1
  0          0.21124  12.5  7.87   0     0.524  5.631  100.0  6.0821  5    311  15.2     386.63  29.93  16.5
  0          0.17004  12.5  7.87   0     0.524  6.004  85.9   6.5921  5    311  15.2     386.71  17.10  18.9
  0          0.22489  12.5  7.87   0     0.524  6.377  94.3   6.3467  5    311  15.2     392.52  20.45  15.0
  0          0.11747  12.5  7.87   0     0.524  6.009  82.9   6.2267  5    311  15.2     396.90  13.27  18.9
  0          0.09378  12.5  7.87   0     0.524  5.889  39.0   5.4509  5    311  15.2     390.50  15.71  21.7
  1          0.62976   0.0  8.14   0     0.538  5.949  61.8   4.7075  4    307  21.0     396.90  8.26   20.4
  1          0.63796   0.0  8.14   0     0.538  6.096  84.5   4.4619  4    307  21.0     380.02  10.26  18.2
  1          0.62739   0.0  8.14   0     0.538  5.834  56.5   4.4986  4    307  21.0     395.62  8.47   19.9
  1          1.05393   0.0  8.14   0     0.538  5.935  29.3   4.4986  4    307  21.0     386.85  6.58   23.1
  1          0.78420   0.0  8.14   0     0.538  5.990  81.7   4.2579  4    307  21.0     386.75  14.67  17.5
  1          0.80271   0.0  8.14   0     0.538  5.456  36.6   3.7965  4    307  21.0     288.99  11.69  20.2
  1          0.72580   0.0  8.14   0     0.538  5.727  69.5   3.7965  4    307  21.0     390.95  11.28  18.2

I first created a new data frame df of shape $n_{\text{samples}} \times 1$ using the matrix() function, with every entry initialized to 0. Its single column was renamed high_crim. We then use cbind() to prepend this new column to the existing Boston data frame, and finally set high_crim to 1 for every row whose crim value exceeds the median.
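The same logic can also be written more compactly. Below is just a sketch (the name new_Boston_alt is purely for illustration); as.integer() turns the TRUE/FALSE comparison directly into 1/0:

new_Boston_alt = Boston
new_Boston_alt$high_crim = as.integer(Boston$crim > median(Boston$crim))   # 1 if above the median crime rate, else 0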

We now have to perform the following steps:

  • Variable Selection: Determine which covariates affect the response variable.
  • Split Data: Generate Training and Testing Data.
  • Model Selection: We will use Logistic Regression, LDA, QDA and KNN.

Variable Selection

Let us take a look at the data via its summary.


In [87]:
summary(new_Boston)


   high_crim        crim                zn             indus      
 Min.   :0.0   Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46  
 1st Qu.:0.0   1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19  
 Median :0.5   Median : 0.25651   Median :  0.00   Median : 9.69  
 Mean   :0.5   Mean   : 3.61352   Mean   : 11.36   Mean   :11.14  
 3rd Qu.:1.0   3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10  
 Max.   :1.0   Max.   :88.97620   Max.   :100.00   Max.   :27.74  
      chas              nox               rm             age        
 Min.   :0.00000   Min.   :0.3850   Min.   :3.561   Min.   :  2.90  
 1st Qu.:0.00000   1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02  
 Median :0.00000   Median :0.5380   Median :6.208   Median : 77.50  
 Mean   :0.06917   Mean   :0.5547   Mean   :6.285   Mean   : 68.57  
 3rd Qu.:0.00000   3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08  
 Max.   :1.00000   Max.   :0.8710   Max.   :8.780   Max.   :100.00  
      dis              rad              tax           ptratio     
 Min.   : 1.130   Min.   : 1.000   Min.   :187.0   Min.   :12.60  
 1st Qu.: 2.100   1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40  
 Median : 3.207   Median : 5.000   Median :330.0   Median :19.05  
 Mean   : 3.795   Mean   : 9.549   Mean   :408.2   Mean   :18.46  
 3rd Qu.: 5.188   3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20  
 Max.   :12.127   Max.   :24.000   Max.   :711.0   Max.   :22.00  
     black            lstat            medv      
 Min.   :  0.32   Min.   : 1.73   Min.   : 5.00  
 1st Qu.:375.38   1st Qu.: 6.95   1st Qu.:17.02  
 Median :391.44   Median :11.36   Median :21.20  
 Mean   :356.67   Mean   :12.65   Mean   :22.53  
 3rd Qu.:396.23   3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :396.90   Max.   :37.97   Max.   :50.00  

From the get-go, I noticed some high-leverage samples in the crim summary: $75\%$ of the samples have values below roughly $3.7$, yet the maximum reaches $88.98$! We see a similar effect in zn. In the case of age, there is a large gap between the minimum and the 1st quartile.
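To quantify that skew, one can look directly at the upper quantiles of crim (a quick check; output not shown):

quantile(new_Boston$crim, probs = c(0.50, 0.75, 0.90, 0.99, 1.00))   # most of the mass sits near zero, the tail stretches to ~89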

Let us examine how the new class high_crim depends on the other features.


In [97]:
par(mfrow=c(2,2))
boxplot(crim~high_crim, data=new_Boston, xlab="High Crime?", ylab="crim")
boxplot(zn~high_crim, data=new_Boston, xlab="High Crime?", ylab="zn")
boxplot(indus~high_crim, data=new_Boston, xlab="High Crime?", ylab="indus")
boxplot(chas~high_crim, data=new_Boston, xlab="High Crime?", ylab="chas")
boxplot(nox~high_crim, data=new_Boston, xlab="High Crime?", ylab="nox")
boxplot(rm~high_crim, data=new_Boston, xlab="High Crime?", ylab="rm")
boxplot(age~high_crim, data=new_Boston, xlab="High Crime?", ylab="age")
boxplot(dis~high_crim, data=new_Boston, xlab="High Crime?", ylab="dis")

par(mfrow=c(2,3))
boxplot(rad~high_crim, data=new_Boston, xlab="High Crime?", ylab="rad")
boxplot(tax~high_crim, data=new_Boston, xlab="High Crime?", ylab="tax")
boxplot(ptratio~high_crim, data=new_Boston, xlab="High Crime?", ylab="ptratio")
boxplot(black~high_crim, data=new_Boston, xlab="High Crime?", ylab="black")
boxplot(lstat~high_crim, data=new_Boston, xlab="High Crime?", ylab="lstat")
boxplot(medv~high_crim, data=new_Boston, xlab="High Crime?", ylab="medv")


We'll only consider predictors that show clear separation between the two classes in the boxplots. In our case, we exclude rm and chas.
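One way to back up the visual impression with numbers is to compare the class-wise means of each predictor; columns whose means barely differ between the two classes are the natural candidates to drop. A quick sketch (output not shown):

aggregate(. ~ high_crim, data = new_Boston[, -2], FUN = mean)   # column 2 is crim itself, which we leave out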

Split our Data

Let us now split the data into training and test sets. We first shuffle the row indices with the sample() function: since sample() works on a vector of numbers, we shuffle the integers from 1 to the number of rows and then use these shuffled indices to pick out the corresponding rows.
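For reference, the same 70/30 idea can also be written with an explicit floor() so the cut point is an integer. This is only a sketch with hypothetical names (train_idx2, test_idx2), not the split actually used below:

n = nrow(new_Boston)
train_idx2 = sample(n, size = floor(0.7 * n))   # 70% of the row indices for training
test_idx2 = setdiff(seq_len(n), train_idx2)     # the remaining rows for testing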


In [5]:
set.seed(1)
shuffled_idx = sample(nrow(new_Boston), nrow(new_Boston))   # random permutation of the row indices

train_idx = shuffled_idx[1:(0.7*length(shuffled_idx))]                        # first ~70% of the shuffled rows
test_idx = shuffled_idx[(0.7*length(shuffled_idx)+1):length(shuffled_idx)]    # the remaining rows

train = new_Boston[train_idx, -c(5, 7)]   # drop chas (column 5) and rm (column 7)
test = new_Boston[test_idx, -c(5, 7)]

dim(train)
dim(test)


  1. 354
  2. 13
  1. 151
  2. 13

Training our Model

Logistic Regression


In [141]:
model = glm(high_crim~. -crim -high_crim, train, family="binomial")
predictions = predict(model, test[,-1], type="response") # predict() only uses variables named in the model formula, so the crim column still present in test is simply ignored
preds = data.frame(matrix(0, nrow=length(predictions), ncol=1))
colnames(preds) = "preds"

new_frame = data.frame(cbind(predictions, preds))
new_frame$preds[new_frame$predictions > 0.5] = 1

predicted_labels = new_frame$preds
actual_labels = test[,1]

confusion = table(predicted_labels, actual_labels)[2:1, 2:1]
confusion


                actual_labels
predicted_labels  1  0
               1 69  4
               0  7 71

Let us use our performance function written for question 10.
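For reference, f_beta_measure computes a weighted harmonic mean of precision $P$ and recall $R$ (note this weighting is not the textbook $F_\beta$ definition); with $\beta = 0.5$ it reduces to the familiar $F_1$ score:

$$ F^{w}_{\beta} = \frac{1}{\frac{\beta}{P} + \frac{1-\beta}{R}}, \qquad F^{w}_{0.5} = \frac{2PR}{P+R} = F_1 $$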


In [142]:
# Weighted harmonic mean of precision and recall; with beta = 0.5 this is the usual F1 score
f_beta_measure = function(precision, recall, beta){
    return ( 1/((beta*(1/precision))+ ((1-beta)*(1/recall))))
}

# Expects a confusion matrix with predicted labels on the rows and actual labels
# on the columns, both ordered (1, 0)
get_accuracy_scores = function(conf){
    TP = conf[1,1]   # predicted 1, actually 1
    FP = conf[1,2]   # predicted 1, actually 0
    FN = conf[2,1]   # predicted 0, actually 1
    TN = conf[2,2]   # predicted 0, actually 0
    
    simple_accuracy = (TP+TN)/(TP+FP+TN+FN)
    precision = TP/(TP+FP)
    recall = TP/(TP+FN)
    f = f_beta_measure(precision, recall, beta=0.5)
    return (c(simple_accuracy, precision, recall, f))
}

get_accuracy_scores(confusion)


  1. 0.927152317880795
  2. 0.945205479452055
  3. 0.907894736842105
  4. 0.926174496644295

We have some pretty good accuracy scores. The simple accuracy here is $92.71\%$, giving us a test error of about $7.3\%$.
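Equivalently, the test error can be read off directly from the predicted and actual labels without going through the confusion matrix. A quick sketch:

mean(new_frame$preds != test$high_crim)   # proportion of misclassified test observations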

Linear Discriminant Analysis

We now train an LDA model, excluding crim from the predictors.


In [164]:
model = lda(high_crim~. -crim -high_crim, train)
preds = predict(model, test[,-1])

predicted_labels = preds$class
actual_labels = test[,1]

confusion = table(predicted_labels, actual_labels)[2:1, 2:1]
confusion


                actual_labels
predicted_labels  1  0
               1 57  1
               0 19 74

In [165]:
get_accuracy_scores(confusion)


  1. 0.867549668874172
  2. 0.982758620689655
  3. 0.75
  4. 0.850746268656716

So we get an accuracy of $86.75\%$, or a test error of $13.25\%$.

But let us now exclude every other predictor and use crim as the only covariate.


In [195]:
model = lda(high_crim~crim, train)
preds = predict(model, test[,1:2]) 

predicted_labels = preds$class
actual_labels = test[,1]

confusion = table(predicted_labels, actual_labels)[2:1, 2:1]
confusion


                actual_labels
predicted_labels  1  0
               1 35  0
               0 41 75

NOTE: The accuracy actually fell even though crim is the only predictor. This is surprising because high_crim was created directly from crim, so I expected a $0\%$ error rate. Let us first rule out imbalanced training data; a likely explanation follows that check.


In [198]:
# Count the training rows in each class (high_crim is 0/1, so its median is 0.5)
dim(train[train$high_crim < median(new_Boston$high_crim),])
dim(train[train$high_crim >= median(new_Boston$high_crim),])


  1. 177
  2. 13
  1. 177
  2. 13
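The classes are perfectly balanced (177 rows each), so imbalance is not the issue. A more plausible explanation is the extreme right skew of crim noted earlier: LDA models crim within each class as Gaussian with a pooled variance, so the huge class-1 outliers inflate both the class-1 mean and the pooled variance and push the linear decision boundary well above the median, producing many false negatives. A quick look at the class-conditional location and spread of crim in the training set illustrates this (a sketch; output not shown):

tapply(train$crim, train$high_crim, mean)   # per-class means of crim that LDA uses
tapply(train$crim, train$high_crim, sd)     # per-class spread, badly inflated by the outliers in class 1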

Quadratic Discriminant Analysis

We now train a QDA model, again excluding crim from the predictors.


In [200]:
model = qda(high_crim~. -crim -high_crim, train)
preds = predict(model, test[,-1])

predicted_labels = preds$class
actual_labels = test[,1]

confusion = table(predicted_labels, actual_labels)[2:1, 2:1]
confusion


                actual_labels
predicted_labels  1  0
               1 54  1
               0 22 74

But what if crim is the only covariate?


In [201]:
model = qda(high_crim~crim, train)
preds = predict(model, test[,1:2]) 

predicted_labels = preds$class
actual_labels = test[,1]

confusion = table(predicted_labels, actual_labels)[2:1, 2:1]
confusion


                actual_labels
predicted_labels  1  0
               1 71  0
               0  5 75

In [202]:
get_accuracy_scores(confusion)


  1. 0.966887417218543
  2. 1
  3. 0.934210526315789
  4. 0.965986394557823

This works as intended. Performance is very high because high_crim was created directly from crim. From the results, we have $0$ false positives, giving us $100\%$ precision, and the simple accuracy is about $96.7\%$.
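To see what QDA has learned from crim alone, one can inspect the fitted per-class means stored in the model object (a quick sketch; output not shown). Unlike LDA, QDA also estimates a separate variance for each class, which lets its boundary sit much closer to the median despite the skewed class-1 values:

model$means   # per-class mean of crim estimated by qda()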

KNN


In [6]:
library(class)

In [13]:
X_train = train[,-1]   # note: only high_crim is dropped here, so crim itself is still among the KNN features
y_train = train[,1] 

X_test = test[,-1]
y_test = test[,1]

tab = data.frame(matrix(ncol=4, nrow=0))

# Fit KNN for k = 1..100 and record accuracy, precision, recall, and F for each k
for (k in 1:100){
    preds = knn(X_train, X_test, y_train, k=k)
    conf = table(preds, y_test)[2:1, 2:1]
    scores = get_accuracy_scores(conf)
    tab = rbind(tab, scores)
}

colnames(tab) = c("Accuracy","Precision", "Recall", "F")
rownames(tab) = paste("", 1:nrow(tab)) # A little work around to visually display row numbers

head(tab)


    Accuracy   Precision  Recall     F
1   0.9205298  0.9444444  0.8947368  0.9189189
2   0.9006623  0.9178082  0.8815789  0.8993289
3   0.9271523  0.9710145  0.8815789  0.9241379
4   0.9205298  0.9571429  0.8815789  0.9178082
5   0.9139073  0.9436620  0.8815789  0.9115646
6   0.9072848  0.9305556  0.8815789  0.9054054

In [20]:
x = 1:100
y = tab$Accuracy

plot(x=x, y=y, xlab="K", ylab="Simple Accuracy", type="l")   # type = "l" draws a line plot

It is interesting to note that the highest simple accuracy occurs at $k = 3$.


In [21]:
tab[tab$Accuracy == max(tab$Accuracy),]


    Accuracy   Precision  Recall     F
3   0.9271523  0.9710145  0.8815789  0.9241379

Let us now take crim as the only covariate.


In [35]:
X_train = as.data.frame(train[,2])
y_train = train[,1]

X_test = as.data.frame(test[,2])
y_test = as.factor(test[,1])

tab = data.frame(matrix(ncol=4, nrow=0))

for (k in 1:100){
    preds = knn(X_train, X_test, y_train, k=k)
    conf = table(preds, y_test)[2:1, 2:1]
    scores = get_accuracy_scores(conf)
    tab = rbind(tab, scores)
}

colnames(tab) = c("Accuracy","Precision", "Recall", "F")
rownames(tab) = paste("", 1:nrow(tab)) # A little work around to visually display row numbers

head(tab)


    Accuracy   Precision  Recall     F
1   1.0000000  1          1.0000000  1.0000000
2   1.0000000  1          1.0000000  1.0000000
3   1.0000000  1          1.0000000  1.0000000
4   0.9933775  1          0.9868421  0.9933775
5   0.9933775  1          0.9868421  0.9933775
6   0.9933775  1          0.9868421  0.9933775

We actually get perfect predictions on every performance measure when $K = 1, 2,$ and $3$!
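This is exactly what we should expect: high_crim is a deterministic threshold of crim, so with crim as the only feature the nearest neighbours of a test point almost always fall on the same side of the median. Thresholding crim directly reproduces the labels, as the following sketch shows:

mean(as.integer(test$crim > median(new_Boston$crim)) == test$high_crim)   # equals 1 by construction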