Fisher's Iris Data Set is a well-known data set that has become a common test case in machine learning. Each row in the data set consists of four numeric values: petal length, petal width, sepal length, and sepal width. The row also records the type of iris flower (one of three: Iris setosa, Iris versicolor, or Iris virginica).
According to Lichman [1],
"One class is linearly separable from the other 2; the latter are NOT linearly separable from each other".
The measurements cluster by type and can be analysed to distinguish between the types of iris flower, or to predict the type from its measurements (petal length, petal width, sepal length, and sepal width) [2]; a minimal classification sketch follows the data-loading cell below.
References:
[1] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[2] True, J. (2015). Watson Analytics use case: the Iris data set. IBM Watson Analytics blog [https://www.ibm.com/communities/analytics/watson-analytics-blog/watson-analytics-use-case-the-iris-data-set/].
In [1]:
import numpy as np
# Load the data from the CSV file, reading each column into a separate variable for ease of use.
sepal_length, sepal_width, petal_length, petal_width = np.genfromtxt('../data/IRIS.csv', delimiter=',', usecols=(0,1,2,3), unpack=True, dtype=float)
iris_class = np.genfromtxt('../data/IRIS.csv', delimiter=',', usecols=(4), unpack=True, dtype=str)
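As a quick illustration of the prediction claim in the introduction, a nearest-neighbour classifier can be trained on these measurements. This is a minimal sketch, not part of the original analysis: it assumes scikit-learn is installed, and the 30% test split and k=3 are arbitrary choices.
In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Stack the four measurement columns into one feature matrix.
features = np.column_stack((sepal_length, sepal_width, petal_length, petal_width))
# Hold out 30% of the rows to estimate accuracy on unseen flowers.
X_train, X_test, y_train, y_test = train_test_split(features, iris_class, test_size=0.3, random_state=0)
# Classify each test flower by the majority class of its 3 nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.score(X_test, y_test))  # fraction of test flowers classified correctly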
In [2]:
import matplotlib.pyplot as plt
# Plot Sepal Length on the x-axis and Sepal Width on the y-axis; complete with labels.
# Scale graph to a bigger size
plt.rcParams['figure.figsize'] = (14.0, 6.0)
# Set title
plt.title('Iris Data Set: Sepal Measurements', fontsize=16)
# plot scatter graph
plt.scatter(sepal_length, sepal_width)
# Add labels
plt.xlabel('Sepal Length', fontsize=14)
plt.ylabel('Sepal Width', fontsize=14)
# Output Graph
plt.show()
In [3]:
# https://matplotlib.org/users/legend_guide.html
import matplotlib.patches as mp
# https://stackoverflow.com/questions/27318906/python-scatter-plot-with-colors-corresponding-to-strings
colours = {'Iris-setosa': 'r', 'Iris-versicolor': 'g', 'Iris-virginica': 'b'}
# Colour each point according to its class; the legend is built manually below.
plt.scatter(sepal_length, sepal_width, c=[colours[i] for i in iris_class])
# Add title
plt.title('Iris Setosa, Versicolor, and Virginica: Sepal Measurements', fontsize=16)
# Add labels
plt.xlabel('Sepal Length', fontsize=14)
plt.ylabel('Sepal Width', fontsize=14)
# https://matplotlib.org/api/patches_api.html
plt.legend(handles = [mp.Patch(color=colour, label=label) for label, colour in [('Iris Setosa', 'r'), ('Iris Versicolor', 'g'), ('Iris Virginica', 'b')]])
plt.show()
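An alternative approach, sketched below: plotting each class with its own scatter call lets plt.legend() collect the labels automatically, with no manual patch construction (same data and colours as above).
In [ ]:
# One scatter call per class, so plt.legend() can build itself.
for species, colour in colours.items():
    mask = iris_class == species
    plt.scatter(sepal_length[mask], sepal_width[mask], c=colour, label=species)
plt.title('Iris Setosa, Versicolor, and Virginica: Sepal Measurements', fontsize=16)
plt.xlabel('Sepal Length', fontsize=14)
plt.ylabel('Sepal Width', fontsize=14)
plt.legend()
plt.show()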
In [4]:
import seaborn as sns
import pandas as pd
# Prepare data with pandas DataFrame for seaborn usage.
df = pd.DataFrame(dict(zip(['Sepal Length', 'Sepal Width','Petal Length', 'Petal Width', 'Iris Class'], [sepal_length, sepal_width, petal_length, petal_width, iris_class])))
df
Out[4]:
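A quick numeric summary is a useful sanity check on the loaded columns; this small addition is not part of the original notebook.
In [ ]:
# Count, mean, standard deviation, and quartiles for the four measurement columns.
df.describe()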
In [5]:
# Adapted from: https://seaborn.pydata.org/examples/scatterplot_matrix.html
%matplotlib inline
sns.pairplot(df, hue="Iris Class")
Out[5]:
In [6]:
# Reset size after seaborn
plt.rcParams['figure.figsize'] = (14.0, 6.0)
# https://github.com/emerging-technologies/emerging-technologies.github.io/blob/master/notebooks/simple-linear-regression.ipynb
# Calculate the best values for m and c.
m, c = np.polyfit(petal_length, petal_width, 1)
# Plot petal measurements for the full data set
plt.scatter(petal_length, petal_width, marker='o', label='Data Set')
# Plot best fit line
plt.plot(petal_length, m * petal_length + c, 'forestgreen', label='Best fit line')
# Add title
plt.title('Iris Data Set: Petal Measurements', fontsize=16)
# Add labels
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.legend()
# Print graph
plt.show()
In [7]:
# Calculate the R-squared value for the full data set: the square of the
# correlation coefficient between petal length and petal width.
np.corrcoef(petal_length, petal_width)[0][1]**2
Out[7]:
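For simple linear regression, the squared correlation coefficient equals the coefficient of determination, so the value above can be cross-checked from the residuals of the fitted line. A sketch, reusing the m and c computed earlier:
In [ ]:
# R-squared as 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((petal_width - (m * petal_length + c))**2)
ss_tot = np.sum((petal_width - np.mean(petal_width))**2)
1 - ss_res / ss_tot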
In [8]:
# https://stackoverflow.com/questions/27947487/is-zip-the-most-efficient-way-to-combine-arrays-with-respect-to-memory-in-nump
# Combine the columns into a single array. Note that mixing the float columns
# with the string class column casts every value to a string.
iris_data = np.column_stack((sepal_length, sepal_width, petal_length, petal_width, iris_class))
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.in1d.html
# Filter Data with 'Iris-setosa' & transpose after
filter_setosa = (iris_data[np.in1d(iris_data[:,4],'Iris-setosa')]).transpose()
# https://stackoverflow.com/questions/3877491/deleting-rows-in-numpy-array
# https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.chararray.astype.html
# Prepare data - delete the row of class labels and convert the remaining strings back to floats
setosa_data = (np.delete(filter_setosa, (4), axis=0)).astype(float)
setosa_data
Out[8]:
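Since column_stack casts everything to strings, the round-trip through strings can be avoided entirely by applying a boolean mask to the original float arrays. A sketch of that alternative, producing the same array under a hypothetical name:
In [ ]:
# Filter the float columns directly; no string conversion needed.
setosa_mask = iris_class == 'Iris-setosa'
setosa_data_alt = np.array([sepal_length[setosa_mask], sepal_width[setosa_mask],
                            petal_length[setosa_mask], petal_width[setosa_mask]])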
In [15]:
# https://github.com/emerging-technologies/emerging-technologies.github.io/blob/master/notebooks/simple-linear-regression.ipynb
# Calculate the best values for m and c.
m, c = np.polyfit(setosa_data[2], setosa_data[3], 1)
# Plot Setosa measurements
plt.scatter(setosa_data[2], setosa_data[3], marker='o', label='Iris Setosa')
# Plot best fit line
plt.plot(setosa_data[2], m * setosa_data[2] + c, 'forestgreen', label='Best fit line')
# Add title
plt.title('Iris Setosa: Petal Measurements', fontsize=16)
# Add labels
plt.xlabel('Petal Length', fontsize=14)
plt.ylabel('Petal Width', fontsize=14)
plt.legend()
# Print graph
plt.show()
In [10]:
# Calculate the R-squared value for the Setosa data using numpy.
np.corrcoef(setosa_data[2], setosa_data[3])[0][1]**2
Out[10]:
Gradient descent is an iterative approximation technique: we start with a guess for the values we wish to estimate and repeatedly adjust that guess in the direction that reduces the cost.
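Here the cost being minimized is the sum of squared residuals of the line, and the updates below follow its partial derivatives:

$$C(m, c) = \sum_i (y_i - m x_i - c)^2$$

$$\frac{\partial C}{\partial m} = -2 \sum_i x_i (y_i - m x_i - c), \qquad \frac{\partial C}{\partial c} = -2 \sum_i (y_i - m x_i - c)$$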
In [11]:
# Calculate the partial derivative of cost with respect to m while treating c as a constant.
def gradient_descent_m(x, y, m, c):
    return -2.0 * np.sum(x * (y - m * x - c))

# Calculate the partial derivative of cost with respect to c while treating m as a constant.
def gradient_descent_c(x, y, m, c):
    return -2.0 * np.sum(y - m * x - c)
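As a sanity check, each analytic partial derivative can be compared against a central finite difference of the cost. A small sketch; the cost helper and the step size h = 1e-6 are illustrative choices, not part of the original notebook:
In [ ]:
# Central finite-difference check of the analytic gradient at (m, c) = (1, 1).
def cost(x, y, m, c):
    return np.sum((y - m * x - c)**2)

h = 1e-6
x, y = setosa_data[2], setosa_data[3]
numeric_m = (cost(x, y, 1.0 + h, 1.0) - cost(x, y, 1.0 - h, 1.0)) / (2 * h)
print(numeric_m, gradient_descent_m(x, y, 1.0, 1.0))  # the two values should closely agree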
In [12]:
eta = 0.0001
g_m, g_c = 1.0, 1.0
change = True
# Iterate the partial derivative updates until the outcomes no longer change
while change:
    g_m_new = g_m - eta * gradient_descent_m(setosa_data[2], setosa_data[3], g_m, g_c)
    g_c_new = g_c - eta * gradient_descent_c(setosa_data[2], setosa_data[3], g_m, g_c)
    if g_m == g_m_new and g_c == g_c_new:
        change = False
    else:
        g_m, g_c = g_m_new, g_c_new
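The exact-equality stopping test above works because the updates eventually shrink below floating-point resolution; if convergence ever stalled, an explicit tolerance and iteration cap are a common safeguard. A variant sketch (the 1e-12 tolerance and the cap are arbitrary choices, and the hypothetical names g_m2 and g_c2 avoid clobbering the results above):
In [ ]:
# Variant: stop when both updates fall below a tolerance, or after a fixed number of steps.
g_m2, g_c2 = 1.0, 1.0
for _ in range(1000000):
    g_m_new = g_m2 - eta * gradient_descent_m(setosa_data[2], setosa_data[3], g_m2, g_c2)
    g_c_new = g_c2 - eta * gradient_descent_c(setosa_data[2], setosa_data[3], g_m2, g_c2)
    if abs(g_m_new - g_m2) < 1e-12 and abs(g_c_new - g_c2) < 1e-12:
        break
    g_m2, g_c2 = g_m_new, g_c_new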
To the human eye it is difficult to see any difference between the least-squares best fit line and the line approximated by gradient descent.
In [16]:
# Plot Setosa measurements
plt.scatter(setosa_data[2], setosa_data[3], marker='o', label='Iris Setosa')
# Plot best fit line according to Gradient Descent
plt.plot(setosa_data[2], g_m * setosa_data[2] + g_c, 'forestgreen', label='Best fit line: Gradient Descent')
# Add title
plt.title('Iris Setosa: Petal Measurements', fontsize=16)
# Add labels
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.legend()
# Print graph
plt.show()
However, the results of the two techniques do in fact differ: the values of m and c agree only to roughly eleven decimal places. Gradient descent therefore produced an adequate approximation of the least-squares solution, although not an exact match.
In [17]:
print("BEST LINE: m: %20.16f c: %20.16f" % (m, c))
print()
print("GRADIENT DESCENT: m: %20.16f c: %20.16f" % (g_m, g_c))