Data Science

Classification I: K-Nearest Neighbors

Instructor: Alessandro Gagliardi
TA: Kevin Perko

Last Time:

  1. R
    1. History of R
    2. Comparing Python and R
  2. Machine Learning
    1. What is Machine learning?
    2. Machine Learning Problems

Questions?

Agenda

  1. Project Discussion
  2. Readiness Assessment
  3. Classification Problems
  4. Building Effective Classifiers
  5. Lab: the KNN Classification Model

Project Discussion

Readiness Assessment

Teams:

(team pairings shown on slide)

Fill in the Blank

         ???                   ???
 ???     regression            classification
 ???     dimension reduction   clustering

Classification Problems

Goals

  • Clarify what a classification problem is
  • Understand what data looks like for a classification problem
  • Review supervised learning
  • Determine steps required for classification

Example Data


In [1]:
%load_ext rmagic

In [2]:
%%R 
head(iris, 10)


   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa

Here we have 4 independent variables and a class label (note: not a dependent variable here!). Q: What kind of data is the "Species" label? Quantitative or qualitative? Q: Given these features and label, what do you think we will be trying to predict?

What does "supervised" mean?

It means the training data come with known labels: we already know the classes we are trying to teach the model to predict.


In [3]:
%%R 
summary(iris)


  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

How does a classification problem work?

Input data, output predicted labels.

What steps does a classification problem require?

  1. Split the dataset
  2. Train the model
  3. Test the model
  4. Make predictions with new data

[Diagram: splitting the data into training and test sets, then producing predictions as output]
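
A minimal sketch of those four steps in R, assuming the iris data from above and the class package's knn() as the classifier (the model we'll meet later in this lecture):

In [ ]:
%%R
# Sketch of the four steps, assuming the class package's knn() as the classifier
library(class)
set.seed(1)
train_idx <- sample(nrow(iris), 100)                             # 1. split the dataset
train <- iris[train_idx, 1:4]
test  <- iris[-train_idx, 1:4]
pred  <- knn(train, test, cl = iris$Species[train_idx], k = 5)   # 2-3. train & test
mean(pred == iris$Species[-train_idx])                           # proportion correct on the test set
# step 4: call knn() again with brand-new measurements in place of `test`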

Building Effective Classifiers

Goals:

  • Determine types of prediction errors
  • Explain why we use training and test sets
  • Explain what "overfitting" and "underfitting" mean
  • Describe the basic process of n-fold cross validation

Life with errors

aka "why predictive analytics is hard"

Types of errors we'll run into

  • Training error
  • Generalization error
  • Out-of-sample (new data) error

Note that each of these needs to be estimated; the out-of-sample (OOS) error is ultimately the one we want to predict.

Training error

What happens if we have no test data and only use a training set?

Q: Why should we use training & test sets?

Thought experiment:
Suppose instead, we train our model using the entire dataset

Q: How low can we push the training error?

  • We can make the model arbitrarily complex (effectively “memorizing” the entire training set).

A: Down to zero!

This phenomenon is called overfitting.

[Figure: illustration of overfitting. Source: Data Analysis with Open Source Tools, by Philipp K. Janert, O’Reilly Media, 2011]

[Figures: an overfit model vs. a natural fit. Source: dtreg.com]

The most significant reason to prefer a model that generalizes over one that "overfits" is cost. Notice how much less complicated the natural fit is than the "overfit" in this example, at the price of only minor errors; for us as data scientists, that trade is usually worth it. Note that at times we may also "underfit", generalizing our data too much. Let's strive to find the right fit for our data.
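
A minimal sketch of the "down to zero" claim (toy data made up for illustration, not from the slides): fitting polynomials of increasing degree to a small sample drives the training error toward zero, even though the high-degree fit is just memorizing noise.

In [ ]:
%%R
# Toy illustration: training error shrinks as model complexity grows
set.seed(1)
x <- runif(10)
y <- sin(2 * pi * x) + rnorm(10, sd = 0.2)
for (degree in c(1, 3, 9)) {
  fit <- lm(y ~ poly(x, degree))
  cat("degree", degree, "- training RMSE:", sqrt(mean(resid(fit)^2)), "\n")
}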

Q: Why should we use training & test sets?


A: Training error is not a good estimate of OOS error.

Generalization Error

How well does the model generalize to unseen data?

Generalization Error

Try training the model on a subset of the data:

Does the error remain the same with different training data?

Generalization Error

NO!

Generalization Error

  • Generalization error gives a high variance estimate of OOS error
  • Insufficient training can lead to incorrect model selection

Generalization Error

Something is still missing!

Q: How can we do better?

Thought experiment:
Different train/test splits will give us different generalization errors.

Q: What if we did a bunch of these and took the average?

A: Now you’re talking!

A: Cross-validation.

Cross Validation

Assessing a model using different subsets of the data for training and testing

Examples of Cross Validation

  • N-fold cross validation
  • Random sub-sampling validation
  • Leave-one-out cross validation

N-fold cross validation

  1. Randomly split the dataset into n equal groups
  2. Use group 1 as the test set & the union of the other groups as the training set
  3. Find the generalization error
  4. Repeat steps 2-3, using a different group as the test set at each iteration
  5. Take the average generalization error as an estimate of OOS error (see the sketch below)
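
A minimal sketch of these five steps in R, assuming n = 5 and, again, the class package's knn() on iris:

In [ ]:
%%R
# Sketch of n-fold cross validation (n = 5) for knn on iris
library(class)
set.seed(1)
n <- 5
folds <- sample(rep(1:n, length.out = nrow(iris)))       # 1. random, roughly equal groups
errors <- sapply(1:n, function(i) {
  test <- folds == i                                     # 2. group i is the test set
  pred <- knn(iris[!test, 1:4], iris[test, 1:4],
              cl = iris$Species[!test], k = 5)
  mean(pred != iris$Species[test])                       # 3. generalization error for this fold
})                                                       # 4. repeated for every fold
mean(errors)                                             # 5. average as the estimate of OOS error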

Features of n-fold cross-validation:

  1. More accurate estimate of OOS prediction error.

  2. More efficient use of data than single train/test split.

    • Each record in our dataset is used for both training and testing.
  3. Presents tradeoff between efficiency and computational expense.
    • 10-fold CV is 10x more expensive than a single train/test split
  4. Can be used for model selection (see the sketch below).
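
For example, a sketch of using the cross-validated error to choose k for kNN (a hypothetical model-selection run; the candidate values of k are picked arbitrarily):

In [ ]:
%%R
# Sketch: choosing k for knn by 5-fold cross-validated error
library(class)
set.seed(1)
folds <- sample(rep(1:5, length.out = nrow(iris)))
cv_error <- function(k) {
  mean(sapply(1:5, function(i) {
    test <- folds == i
    pred <- knn(iris[!test, 1:4], iris[test, 1:4], cl = iris$Species[!test], k = k)
    mean(pred != iris$Species[test])
  }))
}
sapply(c(1, 3, 5, 7, 9), cv_error)    # pick the k with the lowest average error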

k-Nearest Neighbor (KNN) Classification

Goals

  • Define KNN Classification
  • Practice KNN Classification using Training and Test Set data

What is KNN Classification?

KNN is a non-parametric, lazy learning algorithm.

non-parametric:

makes no assumptions about the underlying data distribution.

lazy:

does not build a generalized model from the training data; all of the work happens at prediction time.

Parametric vs. Non-parametric

Say you have the following dataset:


In [4]:
%%R
c(0.001, 23.4, 17.3, 26.8, 32.8, 31.3, 34.5, 7352.3)


[1]    0.001   23.400   17.300   26.800   32.800   31.300   34.500 7352.300

Because of one outlier (7352.3), standard summary statistics like the mean and standard deviation (sd) become misleading.


In [5]:
%%R
mean(c(0.001, 23.4, 17.3, 26.8, 32.8, 31.3, 34.5, 7352.3))


[1] 939.8001

In [6]:
%%R
sd(c(0.001, 23.4, 17.3, 26.8, 32.8, 31.3, 34.5, 7352.3))


[1] 2591.065

Non-parametric

Ignore the actual values and compare only relative positions, i.e. the ranks.

Imagine the following are the distances to eight points:

0.001, 23.4, 17.3, 26.8, 32.8, 31.3, 34.5, 7352.3

is treated the same as

1, 3, 2, 4, 6, 5, 7, 8
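
In R this rank transformation is just rank() (a quick check, not from the slides):

In [ ]:
%%R
rank(c(0.001, 23.4, 17.3, 26.8, 32.8, 31.3, 34.5, 7352.3))
# 1 3 2 4 6 5 7 8 -- the outlier is simply "farthest"; how far no longer matters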

Suppose we want to predict the color of the gray dot.

  • Pick value for k
  • Find colors of k nearest neighbors
  • Assign the most common color to the gray dot (see the sketch below)
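
A minimal hand-rolled sketch of that vote in R (point coordinates and colors are made up for illustration):

In [ ]:
%%R
# Toy KNN vote for a single new point (k = 3)
pts    <- data.frame(x = c(1, 2, 2, 6, 7, 8), y = c(1, 1, 2, 5, 6, 5))
colors <- c("blue", "blue", "blue", "red", "red", "red")
new    <- c(x = 2.5, y = 1.5)                                 # the "gray dot"
k      <- 3
dists  <- sqrt((pts$x - new["x"])^2 + (pts$y - new["y"])^2)   # Euclidean distance to each point
nearest <- order(dists)[1:k]                                  # indices of the k nearest neighbors
names(which.max(table(colors[nearest])))                      # most common color among them: "blue"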

Distance

  • How we measure distance is very important in KNN.
  • By default, we use Euclidean distance: $d = \sqrt{\sum(x_i - p_i)^2}$
  • Any problems with this?

What if one dimension is scaled much differently than another? (Say height in feet and weight in pounds)


In [7]:
%%R
d <- data.frame(heights=c(5.42, 5.67, 5.75, 4.83), weights=c(120, 150, 110, 90))
plot(d, pch=16)
text(d$heights, y = d$weights + 2, labels = seq(4))
points(x=c(4.9), y=c(101), col=2, pch=16)
text(4.9, y = 103, labels = "?", col=2)


Which black dot is closest to the red one?


In [8]:
%%R
sqrt((4.9 - d$heights)^2 + (101 - d$weights)^2)


[1] 19.00711 49.00605  9.04005 11.00022

Moral:

Scale your data first!

$$ x_{scaled} = \frac{x - \bar{x}}{\sigma_x} $$

In [9]:
%%R
d2 <- data.frame(scale(d))
sqrt((((4.9 - mean(d$heights)) / sd(d$heights)) - d2$heights)^2 + 
     (((101 - mean(d$weights)) / sd(d$weights)) - d2$weights)^2)


[1] 1.4625924 2.6954840 2.0741460 0.4710603

...or use a different distance metric (depending on the data)

Define: Imputation

Lab

  1. Look at tinyurl.com/iris-knn
  2. Try it on your own categorical data.
    1. Hold out a portion of your dataset.
    2. See how well kNN can predict the held-out labels.
    3. If it does poorly, why?
      • Scaling?
      • Nonlinear relationships?

Next Time:

NAÏVE BAYES CLASSIFICATION