Iris Dataset

From Wikipedia:

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".

Pandas

Pandas is a library, modeled after the R data frame API, that enables quick exploration and processing of heterogeneous data.

One of the many great things about pandas is that it has many functions for reading data, including functions for grabbing data from the internet. In the cell below, we grab the data from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Iris), which provides it as a CSV file without headers.


In [ ]:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

df = pd.read_csv(url,names=['sepal_length',
                            'sepal_width',
                            'petal_length',
                            'petal_width',
                            'species'])
df.head()
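
A couple of optional checks (a minimal sketch using the df loaded above) confirm that we have 150 rows, four numeric measurements, and three species with 50 rows each:


In [ ]:
# Optional quick exploration of the frame we just loaded
print(df['species'].value_counts())   # 50 rows per species
df.describe()                         # summary statistics for the numeric columns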

read_html

Wikipedia has the same dataset as an HTML table at https://en.wikipedia.org/wiki/Iris_flower_data_set. Let's use pandas.read_html (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html) to grab the data directly from Wikipedia.

You might have to run the following command first:

conda install html5lib BeautifulSoup4 lxml

In [ ]:
df_w = pd.read_html('https://en.wikipedia.org/wiki/Iris_flower_data_set',header=0)[0]
df_w.head()
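
Note that the Wikipedia table uses its own column names (and may include an extra index column), so df_w won't line up with df exactly. If you want the two frames to match, a rename along these lines works; the keys below are a guess at the current Wikipedia headers, so check df_w.columns first and adjust.


In [ ]:
print(df_w.columns)

# Hypothetical rename to match the UCI column names used above
# (adjust the keys to whatever df_w.columns actually shows)
df_w = df_w.rename(columns={'Sepal length': 'sepal_length',
                            'Sepal width': 'sepal_width',
                            'Petal length': 'petal_length',
                            'Petal width': 'petal_width',
                            'Species': 'species'})
df_w.head()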

Plotting

Let's plot sepal_length versus petal_length using matplotlib.


In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.scatter(df.sepal_length, df.petal_length)
plt.xlabel('sepal_length')
plt.ylabel('petal_length')

It would be nice to encode by color and plot all combinations of values, but this isn't easy with matplotlib. Instead, let's use seaborn (conda install seaborn).


In [ ]:
import seaborn as sns

sns.pairplot(df,vars=['sepal_length',
                    'sepal_width',
                    'petal_length',
                    'petal_width'],hue='species')

In [ ]:
sns.swarmplot(x="species", y="petal_length", data=df)

In [ ]:
from pandas.plotting import radviz
radviz(df, "species")

Exercise

Visit the seaborn documentation (https://seaborn.pydata.org/) and make two new plots with this Iris dataset using seaborn functions we haven't used above.


In [ ]:
## Plot 1 Here
sns.violinplot(x="species", y="petal_length", data=df)

In [ ]:
## Plot 2 Here
# interactplot is no longer available in recent seaborn releases;
# a scatter plot with a continuous hue shows a similar relationship
sns.scatterplot(x="petal_length", y="petal_width", hue="sepal_width", data=df)

Classification

Let's say that we are an amateur botanist and we'd like to determine the species of an Iris in our front yard, but all we have available to make that classification is this dataset and a ruler.

Approach

This is a classic machine learning / classification problem where we want to use a collection of "labeled" data to help us sort through new data that we receive. In this case, the new data is a set of four measurements for a flower in our yard.

Because we have labeled data, this is a "supervised learning" problem. If we did not know which species each point in the dataset belonged to, we could still use machine learning for "unsupervised learning".

Let's reimport the data using scikit-learn.


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm

iris = datasets.load_iris()
X = iris.data
y = iris.target.astype(float)


# keep only the first two features (sepal length and width) and drop species 0 (setosa)
X = X[y != 0, :2]
y = y[y != 0]

X,y, X.shape

Try Different Classifiers


In [ ]:
# fit the model
for fig_num, kernel in enumerate(('linear', 'rbf', 'poly')):
    clf = svm.SVC(kernel=kernel, gamma=10)
    clf.fit(X, y)

    plt.figure(fig_num)
    plt.clf()
    plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10)

    plt.axis('tight')
    x_min = X[:, 0].min()
    x_max = X[:, 0].max()
    y_min = X[:, 1].min()
    y_max = X[:, 1].max()

    XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
    Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(XX.shape)
    plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
    plt.contour(XX, YY, Z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'],
                levels=[-.5, 0, .5])

    plt.title(kernel)
plt.show()
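
With a fitted classifier in hand, we can come back to the original question: what species is the flower in our yard? The measurements below are made up for illustration (two numbers, because we trained on sepal length and sepal width only); clf here is whichever classifier was fit last in the loop above.


In [ ]:
# Classify a hypothetical flower from the yard: [sepal_length, sepal_width] in cm.
# These measurements are invented for illustration.
new_flower = np.array([[6.1, 2.9]])
predicted = clf.predict(new_flower)
print(predicted, iris.target_names[int(predicted[0])])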

Which Classifier is Best?

First, let's predict the species from the measurements, using the last classifier fit in the loop above. Because the classifier is clearly not perfect, we expect some misclassifications.


In [ ]:
y_pred = clf.predict(X)

print(y,y_pred)

Inaccuracy Score

Because we only have two classes (labeled 1 and 2), we can measure accuracy by taking the mean of the absolute difference between the true and predicted labels. This value is the percentage of the time we are inaccurate, so a lower score is better.


In [ ]:
for kernel in ('linear', 'rbf', 'poly'):
    clf = svm.SVC(kernel=kernel, gamma=10)
    clf.fit(X, y)
    y_pred = clf.predict(X)
    print(kernel,np.mean(np.abs(y-y_pred))*100,'%')
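
Note that we are scoring each classifier on the same data it was trained on, which tends to be optimistic. A fairer comparison holds out part of the data; here is a minimal sketch using scikit-learn's train_test_split (the 0.3 test fraction and random_state are arbitrary choices):


In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for kernel in ('linear', 'rbf', 'poly'):
    clf = svm.SVC(kernel=kernel, gamma=10)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # inaccuracy on data the classifier has never seen
    print(kernel, np.mean(np.abs(y_test - y_pred)) * 100, '%')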

Exercise

In the above code we excluded species==0 and we only classified based on the sepal dimensions. Complete the following:

  • Copy the code cells from above and exclude species==1
  • Copy the code cells from above and use the petal dimensions for classification

For each case, use the inaccuracy score to see how well the classification works.


In [ ]:
## species==1


iris = datasets.load_iris()
X = iris.data
y = iris.target.astype(float)

# keep only two features and keep only two species
X = X[y != 1, :2] # changed here
y = y[y != 1] # changed here

# fit the model
for fig_num, kernel in enumerate(('linear', 'rbf', 'poly')):
    clf = svm.SVC(kernel=kernel, gamma=10)
    clf.fit(X, y)

    plt.figure(fig_num)
    plt.clf()
    plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10)

    plt.axis('tight')
    x_min = X[:, 0].min()
    x_max = X[:, 0].max()
    y_min = X[:, 1].min()
    y_max = X[:, 1].max()

    XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
    Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(XX.shape)
    plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
    plt.contour(XX, YY, Z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'],
                levels=[-.5, 0, .5])

    plt.title(kernel)
    
    y_pred = clf.predict(X)
    print(kernel,np.mean(np.abs(y-y_pred))*100,'%')
plt.show()

In [ ]:
## petals

iris = datasets.load_iris()
X = iris.data
y = iris.target.astype(float)

# keep only two features and keep only two species
X = X[y != 0, 2:] # changed here
y = y[y != 0] 

# fit the model
for fig_num, kernel in enumerate(('linear', 'rbf', 'poly')):
    clf = svm.SVC(kernel=kernel, gamma=10)
    clf.fit(X, y)

    plt.figure(fig_num)
    plt.clf()
    plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10)

    plt.axis('tight')
    x_min = X[:, 0].min()
    x_max = X[:, 0].max()
    y_min = X[:, 1].min()
    y_max = X[:, 1].max()

    XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
    Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(XX.shape)
    plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
    plt.contour(XX, YY, Z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'],
                levels=[-.5, 0, .5])

    plt.title(kernel)
    
    y_pred = clf.predict(X)
    print(kernel,np.mean(np.abs(y-y_pred))*100,'%')
plt.show()

Clustering

Instead of using the labels, we could ignore them and do blind clustering on the dataset. Let's try that with scikit-learn.


In [ ]:
from sklearn.cluster import KMeans, DBSCAN

iris = datasets.load_iris()
X = iris.data
y = iris.target.astype(float)
estimators = {'k_means_iris_3': KMeans(n_clusters=3),
              'k_means_iris_8': KMeans(n_clusters=8),
              'dbscan_iris_1': DBSCAN(eps=1)}

for name, est in estimators.items():
    est.fit(X)
    labels = est.labels_
    df[name] = labels
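
Before plotting, it's worth checking what each estimator actually produced; in particular, DBSCAN labels points it considers noise as -1, and the number of clusters it finds depends on eps. A quick sketch using the columns we just added:


In [ ]:
# Cluster sizes for each estimator (DBSCAN may use -1 for noise points)
for name in estimators:
    print(name)
    print(df[name].value_counts(), '\n')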

Visualize Clusters

Now let's visualize how we did. We'd hope that the cluster colors would be as well-separated as the original data labels.


In [ ]:
sns.pairplot(df,vars=['sepal_length',
                    'sepal_width',
                    'petal_length',
                    'petal_width'],hue='dbscan_iris_1')

Accuracy

The plot looks good, but it isn't clear how good the labels are until we compare them with the true labels.


In [ ]:
from sklearn.metrics import homogeneity_score

for name, est in estimators.items():
    print('completeness', name, homogeneity_score(df[name],df['species']))
    print('homogeneity', name, homogeneity_score(df['species'],df[name]))
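
Homogeneity and completeness are asymmetric (completeness_score(truth, pred) equals homogeneity_score(pred, truth), which is why the arguments are swapped above). Another common single-number comparison is the adjusted Rand index; a minimal sketch:


In [ ]:
from sklearn.metrics import adjusted_rand_score

for name in estimators:
    # 1.0 means the clustering matches the species labels exactly; near 0 means random
    print(name, adjusted_rand_score(df['species'], df[name]))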

Exercise

Visit http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html and add two more clustering algorithms of your choice to the comparisons above.


In [ ]:
## Algo One
from sklearn.cluster import AgglomerativeClustering, Birch
iris = datasets.load_iris()
X = iris.data
y = iris.target.astype(float)
estimators = {'k_means_iris_3': KMeans(n_clusters=3),
              'k_means_iris_8': KMeans(n_clusters=8),
              'dbscan_iris_1': DBSCAN(eps=1),
              'AgglomerativeClustering': AgglomerativeClustering(n_clusters=3),
              'Birch': Birch()}

for name, est in estimators.items():
    est.fit(X)
    labels = est.labels_
    df[name] = labels



name='Birch'
    
sns.pairplot(df,vars=['sepal_length',
                    'sepal_width',
                    'petal_length',
                    'petal_width'],hue=name)
print('completeness', name, homogeneity_score(df[name],df['species']))
print('homogeneity', name, homogeneity_score(df['species'],df[name]))

In [ ]:
## Algo Two

name='AgglomerativeClustering'
    
sns.pairplot(df,vars=['sepal_length',
                    'sepal_width',
                    'petal_length',
                    'petal_width'],hue=name)
print('completeness', name, homogeneity_score(df[name],df['species']))
print('homogeneity', name, homogeneity_score(df['species'],df[name]))