Data Science

Classification I: K-Nearest Neighbors

Instructor: Alessandro Gagliardi
TA: Kevin Perko

Last Time:

  1. R
    1. History of R
    2. Comparing Python and R
  2. Machine Learning
    1. What is Machine learning?
    2. Machine Learning Problems

Questions?

Agenda

  1. Project Discussion
  2. Readiness Assessment
  3. Classification Problems
  4. Building Effective Classifiers
  5. Lab: the KNN Classification Model

Project Discussion

Readiness Assessment

Teams:

(team pairings shown on slide)

Fill in the Blank

         ???                   ???
 ???     regression            classification
 ???     dimension reduction   clustering

Classification Problems

Goals

  • Clarify what a classification problem is
  • Understand what data looks like for a classification problem
  • Review supervised learning
  • Determine steps required for classification

Example Data


In [1]:
%load_ext rmagic

In [2]:
%%R 
head(iris, 10)


   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa

Here we have 4 independent variables and a class label (note: not a dependent variable here!). Q: What kind of data is the "Species" label? Quantitative or qualitative? Q: Given these features and label, what do you think we will be trying to predict?

What does "supervised" mean?

It means the training data come with known labels: we already know the classes we are trying to teach the model to predict.


In [3]:
%%R 
summary(iris)


  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

How does a classification problem work?

Input data, output predicted labels.

What steps does a classification problem require?

  1. Split the dataset
  2. Train the model
  3. Test the model
  4. Make predictions with new data

[Diagram: splitting the data into training and test sets, then producing predictions as output]
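
A minimal sketch of those four steps in R, assuming the iris data from above and the class package's knn() as the classifier (the model we'll meet later in this lecture):

In [ ]:
%%R
# Sketch of the four steps, assuming the class package's knn() as the classifier
library(class)
set.seed(1)
train_idx <- sample(nrow(iris), 100)                             # 1. split the dataset
train <- iris[train_idx, 1:4]
test  <- iris[-train_idx, 1:4]
pred  <- knn(train, test, cl = iris$Species[train_idx], k = 5)   # 2-3. train & test
mean(pred == iris$Species[-train_idx])                           # proportion correct on the test set
# step 4: call knn() again with brand-new measurements in place of `test`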

Building Effective Classifiers

Goals:

  • Determine types of prediction errors
  • Explain why we use training and test sets
  • Explain what "overfitting" and "underfitting" mean
  • Describe the basic process of n-fold cross validation

Life with errors

aka "why predictive analytics is hard"

Types of errors we'll run into

  • Training error
  • Generalization error
  • Out-of-sample (new data) error

Note that each of these needs to be estimated; the out-of-sample (OOS) error is ultimately the one we want to predict.

Training error

What happens if we have no test data and only use a training set?

Q: Why should we use training & test sets?

Thought experiment:
Suppose instead, we train our model using the entire dataset

Q: How low can we push the training error?

  • We can make the model arbitrarily complex (effectively “memorizing” the entire training set).

A: Down to zero!

This phenomenon is called overfitting.

[Figure: illustration of overfitting. Source: Data Analysis with Open Source Tools, by Philipp K. Janert, O’Reilly Media, 2011]

[Figures: an overfit model vs. a natural fit. Source: dtreg.com]

The most significant reason to prefer a model that generalizes over one that "overfits" is cost. Notice how much less complicated the natural fit is than the "overfit" in this example, at the price of only minor errors; for us as data scientists, that trade is usually worth it. Note that at times we may also "underfit", generalizing our data too much. Let's strive to find the right fit for our data.
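
A minimal sketch of the "down to zero" claim (toy data made up for illustration, not from the slides): fitting polynomials of increasing degree to a small sample drives the training error toward zero, even though the high-degree fit is just memorizing noise.

In [ ]:
%%R
# Toy illustration: training error shrinks as model complexity grows
set.seed(1)
x <- runif(10)
y <- sin(2 * pi * x) + rnorm(10, sd = 0.2)
for (degree in c(1, 3, 9)) {
  fit <- lm(y ~ poly(x, degree))
  cat("degree", degree, "- training RMSE:", sqrt(mean(resid(fit)^2)), "\n")
}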

Q: Why should we use training & test sets?


A: Training error is not a good estimate of OOS error.

Generalization Error

How well does the model generalize to unseen data?

Generalization Error

Try training the model on a subset of the data:

Does the error remain the same with different training data?

Generalization Error

NO!

Generalization Error

  • Generalization error gives a high variance estimate of OOS error
  • Insufficient training can lead to incorrect model selection

Generalization Error

Something is still missing!

Q: How can we do better?

Thought experiment:
Different train/test splits will give us different generalization errors.

Q: What if we did a bunch of these and took the average?

A: Now you’re talking!

A: Cross-validation.

Cross Validation

Assessing a model using different subsets of the data for training and testing

Examples of Cross Validation

  • N-fold cross validation
  • Random sub-sampling validation
  • Leave-one-out cross validation

N-fold cross validation

  1. Randomly split the dataset into n equal groups
  2. Use group 1 as the test set & the union of the other groups as the training set
  3. Find the generalization error
  4. Repeat steps 2-3, using a different group as the test set at each iteration
  5. Take the average generalization error as an estimate of OOS error (see the sketch below)
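
A minimal sketch of these five steps in R, assuming n = 5 and, again, the class package's knn() on iris:

In [ ]:
%%R
# Sketch of n-fold cross validation (n = 5) for knn on iris
library(class)
set.seed(1)
n <- 5
folds <- sample(rep(1:n, length.out = nrow(iris)))       # 1. random, roughly equal groups
errors <- sapply(1:n, function(i) {
  test <- folds == i                                     # 2. group i is the test set
  pred <- knn(iris[!test, 1:4], iris[test, 1:4],
              cl = iris$Species[!test], k = 5)
  mean(pred != iris$Species[test])                       # 3. generalization error for this fold
})                                                       # 4. repeated for every fold
mean(errors)                                             # 5. average as the estimate of OOS error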

Features of n-fold cross-validation:

  1. More accurate estimate of OOS prediction error.

  2. More efficient use of data than single train/test split.

    • Each record in our dataset is used for both training and testing.
  3. Presents tradeoff between efficiency and computational expense.
    • 10-fold CV is 10x more expensive than a single train/test split
  4. Can be used for model selection (see the sketch below).
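
For example, a sketch of using the cross-validated error to choose k for kNN (a hypothetical model-selection run; the candidate values of k are picked arbitrarily):

In [ ]:
%%R
# Sketch: choosing k for knn by 5-fold cross-validated error
library(class)
set.seed(1)
folds <- sample(rep(1:5, length.out = nrow(iris)))
cv_error <- function(k) {
  mean(sapply(1:5, function(i) {
    test <- folds == i
    pred <- knn(iris[!test, 1:4], iris[test, 1:4], cl = iris$Species[!test], k = k)
    mean(pred != iris$Species[test])
  }))
}
sapply(c(1, 3, 5, 7, 9), cv_error)    # pick the k with the lowest average error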

k-Nearest Neighbor (KNN) Classification

Goals

  • Define KNN Classification
  • Practice KNN Classification using Training and Test Set data

What is KNN Classification?

KNN is a non-parametric, lazy learning algorithm.

non-parametric:

makes no assumptions about the underlying data distribution.

lazy:

does not build a generalized model from the training data; all of the work happens at prediction time.

Parametric vs. Non-parametric

Say you have the following dataset:


In [4]:
%%R
c(0.001, 23.4, 17.3, 26.8, 32.8, 31.3, 34.5, 7352.3)


[1]    0.001   23.400   17.300   26.800   32.800   31.300   34.500 7352.300

Because of one outlier (7352.3), standard summary statistics like the mean and standard deviation (sd) become misleading.


In [5]:
%%R
mean(c(0.001, 23.4, 17.3, 26.8, 32.8, 31.3, 34.5, 7352.3))


[1] 939.8001

In [6]:
%%R
sd(c(0.001, 23.4, 17.3, 26.8, 32.8, 31.3, 34.5, 7352.3))


[1] 2591.065

Non-parametric

Ignore the actual values and compare only relative positions, i.e. the ranks.

Imagine the following are the distances to eight points:

0.001, 23.4, 17.3, 26.8, 32.8, 31.3, 34.5, 7352.3

is treated the same as

1, 3, 2, 4, 6, 5, 7, 8
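
In R this rank transformation is just rank() (a quick check, not from the slides):

In [ ]:
%%R
rank(c(0.001, 23.4, 17.3, 26.8, 32.8, 31.3, 34.5, 7352.3))
# 1 3 2 4 6 5 7 8 -- the outlier is simply "farthest"; how far no longer matters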

Suppose we want to predict the color of the gray dot.

  • Pick value for k
  • Find colors of k nearest neighbors
  • Assign the most common color to the gray dot (see the sketch below)
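
A minimal hand-rolled sketch of that vote in R (point coordinates and colors are made up for illustration):

In [ ]:
%%R
# Toy KNN vote for a single new point (k = 3)
pts    <- data.frame(x = c(1, 2, 2, 6, 7, 8), y = c(1, 1, 2, 5, 6, 5))
colors <- c("blue", "blue", "blue", "red", "red", "red")
new    <- c(x = 2.5, y = 1.5)                                 # the "gray dot"
k      <- 3
dists  <- sqrt((pts$x - new["x"])^2 + (pts$y - new["y"])^2)   # Euclidean distance to each point
nearest <- order(dists)[1:k]                                  # indices of the k nearest neighbors
names(which.max(table(colors[nearest])))                      # most common color among them: "blue"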

Distance

  • How we measure distance is very important in KNN.
  • By default, we use Euclidean distance: $d = \sqrt{\sum(x_i - p_i)^2}$
  • Any problems with this?

What if one dimension is scaled much differently than another? (Say height in feet and weight in pounds)


In [7]:
%%R
d <- data.frame(heights=c(5.42, 5.67, 5.75, 4.83), weights=c(120, 150, 110, 90))
plot(d, pch=16)
text(d$heights, y = d$weights + 2, labels = seq(4))
points(x=c(4.9), y=c(101), col=2, pch=16)
text(4.9, y = 103, labels = "?", col=2)


Which black dot is closest to the red one?


In [8]:
%%R
sqrt((4.9 - d$heights)^2 + (101 - d$weights)^2)


[1] 19.00711 49.00605  9.04005 11.00022

Moral:

Scale your data first!

$$ x_{scaled} = \frac{x - \bar{x}}{\sigma_x} $$

In [9]:
%%R
d2 <- data.frame(scale(d))
sqrt((((4.9 - mean(d$heights)) / sd(d$heights)) - d2$heights)^2 + 
     (((101 - mean(d$weights)) / sd(d$weights)) - d2$weights)^2)


[1] 1.4625924 2.6954840 2.0741460 0.4710603

...or use a different distance metric (depending on the data)

Define: Imputation

Lab

  1. Look at tinyurl.com/iris-knn
  2. Try it on your own categorical data.
    1. Hold out a portion of your dataset.
    2. See how well kNN can predict the held-out labels.
    3. If it does poorly, why?
      • Scaling?
      • Nonlinear relationships?

Next Time:

NAÏVE BAYES CLASSIFICATION