Fisher's Iris data set is a collection of measurements commonly used to illustrate example algorithms. It is popular because it has multiple dimensions, enough samples to support most basic statistics, and measurements that most people can understand.

Here, I will use the iris data set to discuss some basic machine learning ideas. I will begin with some visuals to help understand the data, then compute some descriptive statistics to characterize it. I will conclude with a demonstration of a Support Vector Machine (SVM) classifier.

I start with some standard imports, then load the iris data and shape it into a Pandas dataframe for easier manipulation.


In [53]:
import matplotlib.pyplot as plt
from sklearn import datasets, svm
from sklearn.decomposition import PCA
import seaborn as sns
import pandas as pd
import numpy as np

# import some data to play with
iris = datasets.load_iris()
dfX = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
dfY = pd.DataFrame(iris.target, columns=['species'])
dfX['species'] = dfY['species']  # attach the integer class labels as a column
print(dfX.head())

sns.pairplot(dfX, hue="species")


   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2        0
1           4.9          3.0           1.4          0.2        0
2           4.7          3.2           1.3          0.2        0
3           4.6          3.1           1.5          0.2        0
4           5.0          3.6           1.4          0.2        0
Out[53]:
<seaborn.axisgrid.PairGrid at 0x7f4d25133d68>

You should see that Species 0 (setosa) is readily distinguishable from Species 1 (versicolor) and Species 2 (virginica).
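
As a quick aside (my own addition, which also puts the PCA import above to use), projecting all four measurements onto their first two principal components tells the same story; this is a minimal sketch, assuming scikit-learn's default PCA settings are fine for a rough look.


In [ ]:
# Project the four measurements onto two principal components;
# Species 0 (setosa) separates cleanly even in this compressed view.
pca = PCA(n_components=2)
proj = pca.fit_transform(iris.data)
plt.scatter(proj[:, 0], proj[:, 1], c=iris.target, cmap=plt.cm.coolwarm,
            s=20, edgecolors='k')
plt.xlabel('first principal component')
plt.ylabel('second principal component')
plt.show()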

I will now demonstrate how to calculate various descriptive statistics.

Before showing the code, I want to remind readers of the pitfall of relying entirely on descriptive statistics: Anscombe's quartet is a collection of four data sets, each consisting of eleven points, and all four have nearly identical descriptive statistics yet look very different when plotted.
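
Seaborn happens to ship the quartet among its example datasets (fetched on first use), so it is easy to verify this claim yourself; the snippet below is a small sketch of that check, not part of the iris analysis itself.


In [ ]:
# Group the quartet by data set: the means and standard deviations are
# nearly identical, yet the four scatterplots look completely different.
ans = sns.load_dataset("anscombe")
print(ans.groupby('dataset')[['x', 'y']].agg(['mean', 'std']))
sns.lmplot(x='x', y='y', col='dataset', col_wrap=2, data=ans, ci=None)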


In [48]:
# find and print the mean, median, and two-sigma (~95%, assuming normality) intervals
grouped = dfX.groupby('species')

print('mean')
print(grouped.mean())

print('median')
print(grouped.median())

print('two-σ interval')
dfX_high = grouped.mean() + 2 * grouped.std()
dfX_low = grouped.mean() - 2 * grouped.std()
df = pd.DataFrame()
for C in dfX_high.columns:
    df[C + '_hilo'] = dfX_high[C].astype(str) + '_' + dfX_low[C].astype(str)

print(df)


mean
         sepal_length  sepal_width  petal_length  petal_width
species                                                      
0               5.006        3.418         1.464        0.244
1               5.936        2.770         4.260        1.326
2               6.588        2.974         5.552        2.026
median
         sepal_length  sepal_width  petal_length  petal_width
species                                                      
0                 5.0          3.4          1.50          0.2
1                 5.9          2.8          4.35          1.3
2                 6.5          3.0          5.55          2.0
two-σ interval
                   sepal_length_hilo             sepal_width_hilo  \
species                                                             
0        5.71097937443_4.30102062557  4.18004879591_2.65595120409   
1        6.96834229413_4.90365770587  3.39759664676_2.14240335324   
2        7.85975918655_5.31624081345  3.61899327635_2.32900672365   

                   petal_length_hilo                petal_width_hilo  
species                                                               
0        1.81102231887_1.11697768113  0.458419006163_0.0295809938366  
1        5.19982195448_3.32017804552    1.72150536001_0.930494639991  
2        6.65578939133_4.44821060867     2.57530011127_1.47669988873  

We can see from both the scatterplots and the hi/lo intervals that petal length alone is sufficient to discriminate Species 0 from the other two.
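
To make that concrete, here is a quick check; the 2.5 cm cutoff is just a value I read off the intervals above, not anything canonical.


In [ ]:
# Hypothetical single-feature rule: every Species 0 sample falls below
# the 2.5 cutoff on petal length, and every other sample falls above it.
print(pd.crosstab(dfX['petal_length'] < 2.5, dfX['species']))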

I will conclude by demonstrating how to use an SVM to predict species. I start with some helper functions.


In [0]:
def make_meshgrid(x, y, h=.02):
    """Create a mesh of points to plot in

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional

    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy
  
def plot_contours(ax, clf, xx, yy, **params):
    """Plot the decision boundaries for a classifier.

    Parameters
    ----------
    ax: matplotlib axes object
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out

And we can now instantiate and train a model.


In [58]:
X = iris.data[:, :2]  # use only sepal length/width so the boundary can be plotted in 2-D
y = iris.target       # 1-D array of integer class labels (setosa=0, versicolor=1, virginica=2)
C = 1.0  # SVM regularization parameter
clf = svm.SVC(kernel='rbf', gamma=0.7, C=C)
clf = clf.fit(X, y)

title = 'SVC with RBF kernel'
fig, ax = plt.subplots()


X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)

plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xlabel('Sepal length')
ax.set_ylabel('Sepal width')
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(title)

plt.show()



We see that our algorithm is nearly perfect at predicting Species 0, but it has some trouble discriminating between Species 1 and 2. This could be addressed by training on more of the feature dimensions, or by switching to a different kernel or learning algorithm.
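
As a closing sketch (my addition), here is one way to try the first suggestion: train the same RBF SVM on all four measurements and score it on a held-out split. The exact accuracy will vary with the random split.


In [ ]:
# Train on all four features and check held-out accuracy; with more
# dimensions, the confusion between Species 1 and 2 is much reduced.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)
clf4 = svm.SVC(kernel='rbf', gamma=0.7, C=1.0).fit(X_train, y_train)
print('held-out accuracy: %.3f' % clf4.score(X_test, y_test))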