Classification

Iris Data Set

We will be examining the famous Iris flower data set.

The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by Sir Ronald Fisher in 1936 as an example of discriminant analysis.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor), for 150 samples in total. Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres.

Here's a picture of the three different Iris types:


In [2]:
# The Iris Setosa
from IPython.display import Image
url = 'http://upload.wikimedia.org/wikipedia/commons/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg'
Image(url,width=300, height=300)


Out[2]:

In [3]:
# The Iris Versicolor
from IPython.display import Image
url = 'http://upload.wikimedia.org/wikipedia/commons/4/41/Iris_versicolor_3.jpg'
Image(url,width=300, height=300)


Out[3]:

In [4]:
# The Iris Virginica
from IPython.display import Image
url = 'http://upload.wikimedia.org/wikipedia/commons/9/9f/Iris_virginica.jpg'
Image(url,width=300, height=300)


Out[4]:

The iris dataset contains measurements for 150 iris flowers from three different species.

The three classes in the Iris dataset:

Iris-setosa (n=50)
Iris-versicolor (n=50)
Iris-virginica (n=50)

The four features of the Iris dataset:

sepal length in cm
sepal width in cm
petal length in cm
petal width in cm

Here's a picture describing the petals and the sepals.

Multi-Class Classification with Sci-kit Learn

We will learn how to use multi-class classification with SciKit Learn to separate data into multiple classes.

We will first use SciKit Learn to implement a strategy known as one-vs-all (sometimes called one-vs-rest) to perform multi-class classification. This method works by fitting a separate binary logistic regression for each possible class (that class versus everything else). Whichever class's model predicts with the highest confidence is the class assigned to that data point. For a great visual explanation of this, here is Andrew Ng's quick explanation of how one-vs-rest works, followed below by a small code sketch of the same idea.


In [5]:
# Andrew Ng's visual Explanation for Multiclass Classification
from IPython.display import YouTubeVideo
YouTubeVideo("Zj403m-fjqg")


Out[5]:
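
To make the one-vs-rest idea concrete before we dig into the data, here is a minimal, self-contained sketch using scikit-learn's OneVsRestClassifier wrapper. This is just an illustration of the strategy, not part of the analysis below; the plain LogisticRegression we use later applies the same one-vs-rest scheme by default (as its multi_class='ovr' setting shows).

In [ ]:
# A minimal one-vs-rest sketch: one binary logistic regression per class
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

iris_sketch = load_iris()
X_sketch, Y_sketch = iris_sketch.data, iris_sketch.target

# OneVsRestClassifier fits one LogisticRegression per class (that class vs. the rest)
# and predicts the class whose binary model is most confident.
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X_sketch, Y_sketch)

print(len(ovr.estimators_))       # 3 fitted binary classifiers, one per species
print(ovr.predict(X_sketch[:5]))  # predicted classes for the first five flowers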

Data Formatting

Let's go ahead and start with our imports.


In [6]:
# Data Imports
import numpy as np
import pandas as pd
from pandas import Series,DataFrame

# Plot imports
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

%matplotlib inline

Load the data set from Sci Kit Learn


In [7]:
from sklearn import linear_model
from sklearn.datasets import load_iris

# Import the data
iris = load_iris()

# Grab features (X) and the Target (Y)
X = iris.data

Y = iris.target

# Show the Built-in Data Description
print iris.DESCR


Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

Let's put the data into a pandas DataFrame.


In [8]:
# Grab data
iris_data = DataFrame(X,columns=['Sepal Length','Sepal Width','Petal Length','Petal Width'])

# Grab Target
iris_target = DataFrame(Y,columns=['Species'])

If we look at the iris_target data, we'll notice that the species are still encoded as 0, 1, or 2. Let's go ahead and use apply() with a small naming function to convert these numeric codes into flower names.


In [9]:
def flower(num):
    ''' Takes in numerical class, returns flower name'''
    if num == 0:
        return 'Setosa'
    elif num == 1:
        return 'Versicolour'
    else:
        return 'Virginica'

# Apply
iris_target['Species'] = iris_target['Species'].apply(flower)

Let's look at the targets:


In [10]:
iris_target.head()


Out[10]:
Species
0 Setosa
1 Setosa
2 Setosa
3 Setosa
4 Setosa

In [11]:
# Create a combined Iris DataSet
iris = pd.concat([iris_data,iris_target],axis=1)

# Preview all data
iris.head()


Out[11]:
Sepal Length Sepal Width Petal Length Petal Width Species
0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa

Data Visualization Analysis

Let's do some quick visualizations of the data. We can actually get a very broad, quick bird's-eye view with seaborn's pairplot.


In [12]:
# First a pairplot of all the different features
sns.pairplot(iris,hue='Species',size=2)


Out[12]:
<seaborn.axisgrid.PairGrid at 0x115964650>

Awesome! With this pairplot we can actually begin to see the grouping between the 3 different Iris types. Note that the "Species" column isn't plotted as an axis, since it is categorical; it is only used to color the points.

A quick look at this visualization shows that the Setosa type has the most distinct features of the three.
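
To zoom in on a single feature, a quick boxplot works well. Here is a small sketch (assuming a seaborn version that supports the x/y/data keyword interface; it uses the combined iris DataFrame defined above) that makes the separation of Setosa's petal lengths very clear:

In [ ]:
# A closer look at a single feature: petal length grouped by species
sns.boxplot(x='Species', y='Petal Length', data=iris)
plt.title('Petal Length by Species')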

Multi-Class Classification with Sci Kit Learn

Let's go ahead and start using SciKit Learn to perform a Multi-Class Classification using Logistic Regression Techniques.

We already have X and Y defined as the Data Features and Target so let's go ahead and continue with those arrays. We will then have to split the data into Testing and Training sets. I'll pass a test_size argument to have the testing data be 40% of the total data set. I'll also pass a random seed number.


In [15]:
# Import SciKit Learn Log Reg
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split

# Create a Logistic Regression Class object
logreg = LogisticRegression()

# Split the data into Training and Testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4,random_state=3)

# Train the model with the training set
logreg.fit(X_train, Y_train)


Out[15]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Now that we've trained our model with a training set, let's test our accuracy with the testing set. We'll make a prediction using our model and then check its accuracy.


In [17]:
# Import testing metrics from SciKit Learn
from sklearn import metrics

# Prediction from X_test
Y_pred = logreg.predict(X_test)

#Check accuracy
print metrics.accuracy_score(Y_test,Y_pred)


0.933333333333

Our model had about 93% accuracy. (Note: this could change from run to run due to the random splitting.) Should we trust this level of accuracy? I encourage you to figure out ways to intuitively understand this result. Try looking at the pairplot again and check how separated the data features were to begin with. Also try changing the test_size parameter and check how that affects the outcome (a small sketch of this follows below). In conclusion, given how clean the data is and how separated some of the features are, we should expect pretty high accuracy.
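
To get a feel for that last suggestion, here is a small sketch (reusing the X, Y, train_test_split, LogisticRegression, and metrics names already defined above) that re-fits the model for a few different test_size values and prints each accuracy:

In [ ]:
# Re-fit the model for several test/train splits to see how the accuracy moves
for size in [0.2, 0.3, 0.4, 0.5]:
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=size, random_state=3)
    model = LogisticRegression()
    model.fit(X_tr, Y_tr)
    acc = metrics.accuracy_score(Y_te, model.predict(X_te))
    print('test_size=%s --> accuracy=%.3f' % (size, acc))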

Now let's see how to use K-Nearest Neighbors to implement Multi-Class Classification!

K-Nearest Neighbors

Let's start with a basic overview of the K-Nearest Neighbors algorithm. The premise of the algorithm is actually quite simple: given a new object to classify, find the training points that are "nearest" to it in the feature space and assign it the class most common among those neighbors. This "nearness" is measured with a distance metric, usually the Euclidean distance.
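
Concretely, the Euclidean distance between two flowers is just the square root of the sum of squared differences of their four measurements. Here is a tiny sketch with numpy (the values below are only example measurement vectors):

In [ ]:
# Euclidean distance between two example measurement vectors
a = np.array([5.1, 3.5, 1.4, 0.2])
b = np.array([6.3, 3.3, 6.0, 2.5])

print(np.sqrt(np.sum((a - b) ** 2)))   # by hand
print(np.linalg.norm(a - b))           # same result via numpy's norm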

The k-nearest neighbor (kNN) algorithm is very well explained in the following two videos. The first one is a quick overall explanation and the second one is an MIT OpenCourseWare lecture on the topic. I encourage you to check them both out.


In [18]:
# Short Explanation
from IPython.display import YouTubeVideo
YouTubeVideo('UqYde-LULfs')


Out[18]:

In [19]:
# MIT Lecture
YouTubeVideo('09mb78oiPkA')


Out[19]:

For a quick explanation in this Notebook Lecture, we can demonstrate the concept with an example. Take for instance the following diagram:


In [20]:
Image('http://bdewilde.github.io/assets/images/2012-10-26-knn-concept.png',width=400, height=300)


Out[20]:

Imagine we have two classes in our training set, A and B, and a new data point from our testing data to classify, represented as a red star. We simply expand outward from the new point in the feature space until we have captured its k nearest training points. In the figure above you can see how the assignment differs for various k values. An important thing to note: for a binary classification using this method, we should choose an odd number for k to avoid a tied vote between the two classes.
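
To make the voting idea concrete, here is a rough from-scratch sketch (an illustration only, not scikit-learn's implementation; it reuses the X_train, Y_train, and X_test arrays defined earlier) that classifies a single new point by majority vote among its k nearest training points:

In [ ]:
# A rough from-scratch kNN vote for a single new point (illustration only)
from collections import Counter

def knn_predict_one(X_tr, Y_tr, new_point, k):
    ''' Predict the class of new_point by majority vote of its k nearest neighbors '''
    # Euclidean distance from new_point to every training point
    dists = np.sqrt(((X_tr - new_point) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among those neighbors' classes
    return Counter(Y_tr[nearest]).most_common(1)[0][0]

# Classify the first point in the testing set with k=6
print(knn_predict_one(X_train, Y_train, X_test[0], 6))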

kNN with SciKit Learn

Let's go ahead and see the kNN algorithm in action with SciKit Learn and our Iris dataset!


In [21]:
#Import from SciKit Learn
from sklearn.neighbors import KNeighborsClassifier

# We'll first start with k=6

# Create a kNN classifier instance with k=6
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the data
knn.fit(X_train,Y_train)

# Run a prediction
Y_pred = knn.predict(X_test)

# Check Accuracy against the Testing Set
print metrics.accuracy_score(Y_test,Y_pred)


0.95

Looks like using k=6 got us around 95% accuracy. Let's see what happens if we reduce that value to k=1; that means the single closest training point in the feature space determines the class each testing point joins.


In [22]:
# Create a kNN classifier instance with k=1
knn = KNeighborsClassifier(n_neighbors = 1)

# Fit the data
knn.fit(X_train,Y_train)

# Run a prediction
Y_pred = knn.predict(X_test)

# Check Accuracy against the Testing Set
print metrics.accuracy_score(Y_test,Y_pred)


0.966666666667

Looks like using k=1 got us around 96% accuracy. How about we cycle through various k values and find the optimal one?


In [23]:
# Test k values 1 through 20
k_range = range(1, 21)

# Set an empty list
accuracy = []

# Repeat above process for all k values and append the result
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, Y_train)
    Y_pred = knn.predict(X_test)
    accuracy.append(metrics.accuracy_score(Y_test, Y_pred))

Now let's plot the results!


In [24]:
plt.plot(k_range, accuracy)
plt.xlabel('K value for kNN')
plt.ylabel('Testing Accuracy')


Out[24]:
<matplotlib.text.Text at 0x11a942250>

Interesting! Try changing how SciKit Learn splits the training and testing data sets and re-run this analysis. What changed? A small sketch to get you started follows below.
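
As a starting point for that experiment, here is a small sketch (reusing the X, Y, k_range, and other names already defined in this notebook) that wraps the k sweep in a helper so you can compare accuracy curves for different random_state values:

In [ ]:
# Compare the k sweep across two different train/test splits
def knn_accuracy_curve(seed):
    ''' Return the testing accuracy for each k in k_range, for one particular split '''
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.4, random_state=seed)
    scores = []
    for k in k_range:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_tr, Y_tr)
        scores.append(metrics.accuracy_score(Y_te, knn.predict(X_te)))
    return scores

plt.plot(k_range, knn_accuracy_curve(3), label='random_state=3')
plt.plot(k_range, knn_accuracy_curve(42), label='random_state=42')
plt.xlabel('K value for kNN')
plt.ylabel('Testing Accuracy')
plt.legend()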

We've learned how to perform Multi-Class Classification using two great techniques, Logistic Regression and k-Nearest Neighbors.

Here are several more resources for you to explore:

1.) Wikipedia on Multiclass Classification

2.) MIT Lecture Slides on MultiClass Classification

3.) Sci Kit Learn Documentation

4.) DataRobot on Classification Techniques


In [ ]: