From Wikipedia:
The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".
Pandas is a library modeled after the R dataframe API that enables quick exploration and processing of heterogeneous data.
One of the many great things about pandas is that it has many functions for grabbing data--including functions for grabbing data from the internet. In the cell below, we grab the data from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Iris), which hosts it as a CSV file (without headers).
In [ ]:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
df = pd.read_csv(url, names=['sepal_length',
                             'sepal_width',
                             'petal_length',
                             'petal_width',
                             'species'])
df.head()
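Once loaded, a DataFrame makes the "quick exploration" promised above a one-liner each. A minimal sketch, run here on a tiny hand-made sample so it works offline (the measurements are illustrative, not the real data); the same calls apply to the full `df` from the cell above:

```python
import pandas as pd

# A tiny sample in the same shape as the full dataset
df_sample = pd.DataFrame({
    'sepal_length': [5.1, 7.0, 6.3, 4.9, 6.4],
    'sepal_width':  [3.5, 3.2, 3.3, 3.0, 3.2],
    'petal_length': [1.4, 4.7, 6.0, 1.4, 4.5],
    'petal_width':  [0.2, 1.4, 2.5, 0.2, 1.5],
    'species': ['setosa', 'versicolor', 'virginica', 'setosa', 'versicolor'],
})

# Summary statistics for each numeric column
print(df_sample.describe())

# How many rows per species (50 each in the full dataset)
print(df_sample['species'].value_counts())

# Mean petal length per species: a one-line groupby
print(df_sample.groupby('species')['petal_length'].mean())
```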
Wikipedia has the same dataset as an HTML table at https://en.wikipedia.org/wiki/Iris_flower_data_set. Let's use pandas.read_html (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html) to grab the data directly from Wikipedia.
You might have to run the following command first:
conda install html5lib BeautifulSoup4 lxml
In [ ]:
df_w = pd.read_html('https://en.wikipedia.org/wiki/Iris_flower_data_set',header=0)[0]
df_w.head()
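The headers in the Wikipedia table won't match the names we gave the UCI frame, so a rename step is usually needed before the two can be compared. A minimal sketch, where the header strings below are illustrative assumptions (inspect `df_w.columns` to see the real ones):

```python
import pandas as pd

# A stand-in for what read_html might return; the real headers may differ,
# so check df_w.columns before renaming.
df_w = pd.DataFrame({'Sepal length': [5.1], 'Sepal width': [3.5],
                     'Petal length': [1.4], 'Petal width': [0.2],
                     'Species': ['I. setosa']})

# Lower-case the headers and replace spaces with underscores
# so they match the UCI frame's column names
df_w.columns = df_w.columns.str.lower().str.replace(' ', '_')
print(df_w.columns.tolist())
```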
In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(df.sepal_length, df.petal_length)
plt.xlabel('sepal_length')
plt.ylabel('petal_length')
It would be nice to color the points by species and plot every pair of measurements, but that takes a fair amount of work in matplotlib. Instead, let's use seaborn (conda install seaborn).
In [ ]:
import seaborn as sns
sns.pairplot(df, vars=['sepal_length',
                       'sepal_width',
                       'petal_length',
                       'petal_width'], hue='species')
In [ ]:
sns.swarmplot(x="species", y="petal_length", data=df)
In [ ]:
from pandas.plotting import radviz
radviz(df, "species")
Visit the seaborn gallery (https://seaborn.pydata.org/) and make two new plots of this Iris dataset using seaborn functions we haven't used above.
In [ ]:
## Plot 1 Here
In [ ]:
## Plot 2 Here
Let's say that we are an amateur botanist and we'd like to determine the species of an Iris in our front yard, but all we have available to make that classification is this dataset and a ruler.
This is a classic machine learning / classification problem, where we want to use a collection of "labeled" data to help us sort through new data that we receive. In this case, the new data is a set of four measurements for a flower in our yard.
Because we have labeled data, this is a "supervised learning" problem. If we did not know which species each point in the dataset belonged to, we could still use machine learning for "unsupervised learning".
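Before diving into support vector machines below, here is a minimal sketch of the whole supervised workflow, using scikit-learn's bundled copy of the dataset and a k-nearest-neighbors classifier (an assumption for illustration; this notebook uses SVMs); the new flower's measurements are made up:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Hold out a test set so we can estimate accuracy on unseen flowers
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print('test accuracy:', clf.score(X_test, y_test))

# Classify a new flower measured with our ruler (made-up values)
new_flower = [[5.0, 3.4, 1.5, 0.2]]
print('predicted species:', iris.target_names[clf.predict(new_flower)[0]])
```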
Let's reimport the data using scikit-learn.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm
iris = datasets.load_iris()
X = iris.data
y = iris.target.astype(float)
# keep only two features and keep only two species
X = X[y != 0, :2]
y = y[y != 0]
X, y, X.shape
In [ ]:
# fit the model
for fig_num, kernel in enumerate(('linear', 'rbf', 'poly')):
    clf = svm.SVC(kernel=kernel, gamma=10)
    clf.fit(X, y)

    plt.figure(fig_num)
    plt.clf()
    plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10)
    plt.axis('tight')

    # Evaluate the decision function on a grid covering the data
    x_min, x_max = X[:, 0].min(), X[:, 0].max()
    y_min, y_max = X[:, 1].min(), X[:, 1].max()
    XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
    Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(XX.shape)
    plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
    plt.contour(XX, YY, Z, colors=['k', 'k', 'k'],
                linestyles=['--', '-', '--'], levels=[-.5, 0, .5])
    plt.title(kernel)
plt.show()
In [ ]:
y_pred = clf.predict(X)
print(y,y_pred)
In [ ]:
for kernel in ('linear', 'rbf', 'poly'):
    clf = svm.SVC(kernel=kernel, gamma=10)
    clf.fit(X, y)
    y_pred = clf.predict(X)
    print(kernel, np.mean(np.abs(y - y_pred)) * 100, '%')
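One caveat: above we predict on the same points we trained on, which understates the true error. A sketch of the same comparison with a held-out test set (the 30% split and random_state are arbitrary choices):

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
mask = iris.target != 0
X = iris.data[mask, :2]              # sepal dimensions, species 1 and 2 only
y = iris.target[mask].astype(float)

# Hold out 30% of the flowers so error is measured on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

errors = {}
for kernel in ('linear', 'rbf', 'poly'):
    clf = svm.SVC(kernel=kernel, gamma=10)
    clf.fit(X_train, y_train)
    train_err = np.mean(np.abs(y_train - clf.predict(X_train))) * 100
    test_err = np.mean(np.abs(y_test - clf.predict(X_test))) * 100
    errors[kernel] = (train_err, test_err)
    print('%s: train error %.1f%%, test error %.1f%%'
          % (kernel, train_err, test_err))
```

The gap between train and test error is a quick check for overfitting, which matters here since gamma=10 makes the rbf and poly kernels quite flexible.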
In the above code we excluded species==0 and classified based only on the sepal dimensions. Complete the following:
- Repeat the classification, this time excluding species==1 instead of species==0.
- Repeat the classification using the petal dimensions instead of the sepal dimensions.
For each case, use the inaccuracy score to see how well the classification works.
In [ ]:
## species==1
In [ ]:
## petals
In [ ]:
from sklearn.cluster import KMeans, DBSCAN
iris = datasets.load_iris()
X = iris.data
y = iris.target.astype(float)
estimators = {'k_means_iris_3': KMeans(n_clusters=3),
              'k_means_iris_8': KMeans(n_clusters=8),
              'dbscan_iris_1': DBSCAN(eps=1)}
for name, est in estimators.items():
    est.fit(X)
    labels = est.labels_
    df[name] = labels
In [ ]:
sns.pairplot(df, vars=['sepal_length',
                       'sepal_width',
                       'petal_length',
                       'petal_width'], hue='dbscan_iris_1')
In [ ]:
from sklearn.metrics import homogeneity_score
for name, est in estimators.items():
    # completeness(true, pred) == homogeneity(pred, true), so swapping
    # the arguments gives the completeness score
    print('completeness', name, homogeneity_score(df[name], df['species']))
    print('homogeneity', name, homogeneity_score(df['species'], df[name]))
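The argument swap above relies on completeness being homogeneity with the true and predicted labels exchanged. scikit-learn also provides `completeness_score` directly, and we can check the two agree on a small synthetic labeling (the labels here are made up):

```python
from sklearn.metrics import completeness_score, homogeneity_score

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred_labels = [0, 0, 1, 1, 1, 1, 2, 2, 2]

# completeness(true, pred) is identical to homogeneity(pred, true)
c_direct = completeness_score(true_labels, pred_labels)
c_swapped = homogeneity_score(pred_labels, true_labels)
print(c_direct, c_swapped)
```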
Visit http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html and add two more clustering algorithms of your choice to the comparison above.
In [ ]:
## Algo One
In [ ]:
## Algo Two