Fisher's Iris data set is a collection of measurements commonly used to illustrate example algorithms. It is popular because it has multiple dimensions, enough samples to perform most basic statistics, and measurements that most people can readily understand.
Here, I will use the iris data set to walk through some basic machine learning ideas. I will begin with some visuals to help understand the data, then compute descriptive statistics to better characterize it. I will conclude with a demonstration of a Support Vector Machine (SVM) classifier.
I start with some standard imports, load the iris data, and shape it into a Pandas DataFrame for easier manipulation.
In [53]:
import matplotlib.pyplot as plt
from sklearn import datasets, svm
from sklearn.decomposition import PCA
import seaborn as sns
import pandas as pd
import numpy as np
# import some data to play with
iris = datasets.load_iris()
dfX = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
dfY = pd.DataFrame(iris.target, columns=['species'])
dfX['species'] = dfY  # attach the integer species label to the feature frame
print(dfX.head())
sns.pairplot(dfX, hue="species")
Out[53]: (output: the first five rows of the dataframe, followed by a pairplot of the four measurements colored by species)
In the pairplot you should see that Species 0 (setosa) is easily distinguishable from Species 1 (versicolor) and Species 2 (virginica).
I will now demonstrate how to calculate various descriptive statistics.
Before showing the code, I want to remind readers of the pitfall of relying entirely on descriptive statistics. Anscombe's quartet is a collection of four data sets of eleven points each; all four have nearly identical simple descriptive statistics, yet they look very different when plotted.
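To make this concrete, here is a minimal sketch (my addition, not part of the original analysis) using the copy of Anscombe's quartet bundled with seaborn; note that sns.load_dataset fetches the data over the network.

# Not part of the iris analysis: seaborn ships Anscombe's quartet as a sample dataset.
anscombe = sns.load_dataset("anscombe")
# Nearly identical means and standard deviations for every set...
print(anscombe.groupby("dataset")[["x", "y"]].agg(["mean", "std"]))
# ...but the scatterplots look nothing alike.
sns.lmplot(data=anscombe, x="x", y="y", col="dataset", col_wrap=2, ci=None)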
In [48]:
# find and print mean, median, and mean ± 2σ (~95%) intervals
print('mean')
print(dfX.groupby('species').mean())
print('median')
print(dfX.groupby('species').median())
print('two-σ interval')
dfX_high = dfX.groupby('species').mean() + 2*dfX.groupby('species').std()
dfX_low = dfX.groupby('species').mean() - 2*dfX.groupby('species').std()
df = pd.DataFrame()
for C in dfX_high.columns:
    df[C + '_hilo'] = dfX_high[C].astype(str) + '_' + dfX_low[C].astype(str)
print(df)
We can see from both the scatterplots and the high/low intervals that petal length alone is sufficient to discriminate Species 0 from the other two.
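As a quick numerical check of that claim (this cell is my addition, not part of the original analysis), the setosa petal lengths sit well below those of the other two species:

# Quick check: the largest setosa petal length is smaller than the smallest
# petal length of the other two species, so a single threshold separates Species 0.
print(dfX[dfX['species'] == 0]['petal_length'].max())   # setosa maximum
print(dfX[dfX['species'] != 0]['petal_length'].min())   # minimum of the other two species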
I will conclude by demonstrating how to use an SVM to predict the species. I start with some helper functions.
In [0]:
def make_meshgrid(x, y, h=.02):
    """Create a mesh of points to plot in.

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional

    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy
def plot_contours(ax, clf, xx, yy, **params):
    """Plot the decision boundaries for a classifier.

    Parameters
    ----------
    ax: matplotlib axes object
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out
And we can now instantiate and train a model.
In [58]:
X = iris.data[:, :2]  # use only sepal length and sepal width so the result can be plotted in 2D
y = iris.target
C = 1.0  # SVM regularization parameter
clf = svm.SVC(kernel='rbf', gamma=0.7, C=C)
clf = clf.fit(X, y)
title = 'SVC with RBF kernel'
fig, ax = plt.subplots()
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)
plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xlabel('Sepal length')
ax.set_ylabel('Sepal width')
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(title)
plt.show()
We see that our model is nearly perfect at predicting Species 0, but it has some trouble discriminating between Species 1 and 2. This can be addressed by training on more of the available features, or by trying a different kernel or classifier.
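As a rough sketch of the first option (my addition, not the original code), one could train on all four measurements and estimate accuracy with cross-validation, here using cross_val_score from sklearn.model_selection and keeping the same RBF kernel and hyperparameters:

# Train on all four measurements and estimate accuracy with 5-fold cross-validation.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(svm.SVC(kernel='rbf', gamma=0.7, C=1.0),
                         iris.data, iris.target, cv=5)
print(scores.mean())

The exact scores depend on the folds, but this gives a more honest estimate of performance than fitting and plotting on the training data alone.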