From the Wikipedia article “Iris flower data set”
The Iris flower data set or Fisher’s Iris data set is a multivariate data set introduced by Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis". It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the variation of Iris flowers.
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres.
To download this data into an Excel spreadsheet, click on Fisher's Iris
In [1]:
# numpy efficiently deals with numerical multi-dimensional arrays.
import numpy as np
# matplotlib is a plotting library, and pyplot is its easy-to-use module.
import matplotlib.pyplot as pl
# This just sets the default plot size to be bigger.
pl.rcParams['figure.figsize'] = (16.0, 8.0)
In [2]:
# Load data
OriginalData = np.loadtxt("iris.csv",str, delimiter=",", skiprows=1, unpack=True)
# Transposing the data
data = OriginalData.transpose()
# get each column
sepal_length = sepal_width = petal_length = petal_width = species =[]
sepal_length , sepal_width , petal_length , petal_width , species = OriginalData
In [3]:
# Create the plot.
pl.scatter(sepal_length, sepal_width, c='black')
# Set some properties for the plot.
pl.xlabel('sepal_length')
pl.ylabel('sepal_width')
# Show the plot.
pl.title('relationship between sepal Length and width')
pl.show()
In [4]:
# copy of data as numpy.array
_data=np.array(data)
# replase '_data', column 'species' to colors
# adapt from : https://stackoverflow.com/questions/20355311/how-to-replace-values-in-a-numpy-array-based-on-another-column
_data[_data[:,4] == 'setosa',4] = 'red'
_data[_data[:,4] == 'virginica',4] = 'blue'
_data[_data[:,4] == 'versicolor',4] = 'green'
# create a color and label set
colors = np.array(_data)[:,4]
labels = np.array(data)[:,4]
# Create the plot.
pl.scatter(sepal_length, sepal_width, c=colors, label=labels)
# match label for colors
# adap from : https://stackoverflow.com/questions/44164111/setting-a-legend-matching-the-colours-in-pyplot-scatter/44164349#44164349
import matplotlib.patches as mpatches
handles = [mpatches.Patch(color=colour, label=label) for label, colour in [('setosa', 'red'), ('virginica', 'blue'), ('versicolor', 'green')]]
pl.legend(handles=handles)
# Set some properties for the plot.
pl.xlabel('sepal_length')
pl.ylabel('sepal_width')
# Show the plot
pl.show()
Use the seaborn library to create a scatterplot matrix of all five variables.
In [5]:
# use pandas reload iris.csv
import pandas
iris = pandas.read_csv('iris.csv')
# use seaborn library show plots
import seaborn
seaborn.pairplot(iris, hue="species")
pl.show()
cost = lambda m,c: np.sum([(petal_width[i] - m * petal_length[i] - c)**2 for i in range(petal_width.size)])
minimising the cost:
$$ Cost(m, c) = \sum_i (y_i - mx_i - c)^2 $$
In [6]:
# here, just use the numpy.polyfit function to get the best values for m and c.
petal_length = np.array(petal_length).astype(np.float)
petal_width = np.array(petal_width).astype(np.float)
m,c=np.polyfit(petal_length, petal_width, 1)
print("m is %8.6f and c is %6.6f." % (m, c))
plot the data points in a scatter plot with the best fit line shown.
In [7]:
# Create the plot.
pl.plot(petal_length, m * petal_length + c, 'b-')
pl.plot(petal_length, petal_width, 'k.')
# Set some properties for the plot.
pl.xlabel('petal_length')
pl.ylabel('petal_width')
# Show the plot.
pl.title('relationship between petal Length and width')
pl.show()
In [8]:
# here, just use the numpy.corrcoef function to get the same value.
rsq = np.corrcoef(petal_length, petal_width)[0][1]**2
print("The R-squared value is %6.4f" % rsq)
In [9]:
# get a copy of data
data_ = np.array(data)
# get setosa data set
# adapt from: https://stackoverflow.com/questions/23359886/selecting-rows-in-numpy-ndarray-based-on-the-value-of-two-columns
setosa = []
setosa = data_[np.where(data_[:,4] == 'setosa')]
setosa_petal_length = np.array(setosa[:,2]).astype(np.float)
setosa_petal_width = np.array(setosa[:,3]).astype(np.float)
# pick model y = m * x + c
# get best m & c
m,c=np.polyfit(setosa_petal_length, setosa_petal_width, 1)
print("m is %8.6f and c is %6.6f." % (m, c))
# Create the plot.
pl.plot(setosa_petal_length, m * setosa_petal_length + c, 'b-')
pl.plot(setosa_petal_length, setosa_petal_width, 'k.')
pl.xlabel('setosa_petal_length')
pl.ylabel('setosa_petal_width')
pl.show()
In [10]:
# again, use numpy.corrcoef function to get R-squared.
rsq = np.corrcoef(setosa_petal_length, setosa_petal_width)[0][1]**2
print("The R-squared value is %6.4f" % rsq)
Use gradient descent to approximate the best fit line for the petal length and petal width setosa values. Compare the outputs to your calculations above.
gradient descent: In gradient descent, we select a random guess of a parameter and iteratively improve that guess.
For instance, we might pick $1.0$ as our initial guess for $m$ and then create a for
loop to iteratively improve the value of $m$.
The way we improve $m$ is to first take the partial derivative of our cost function with respect to $m$.
Calculate the partial derivatives
formula: $$ \begin{align} Cost(m, c) &= \sum_i (y_i - mx_i - c)^2 \\[1cm] \frac{\partial Cost}{\partial m} &= \sum 2(y_i - m x_i -c)(-x_i) \\ &= -2 \sum x_i (y_i - m x_i -c) \\[0.5cm] \frac{\partial Cost}{\partial c} & = \sum 2(y_i - m x_i -c)(-1) \\ & = -2 \sum (y_i - m x_i -c) \\ \end{align} $$
Code the partial derivatives
In [11]:
def grad_m(x, y, m, c):
return -2.0 * np.sum(x * (y - m * x - c))
In [12]:
def grad_c(x, y, m , c):
return -2.0 * np.sum(y - m * x - c)
Iterate
run our gradient descent algorithm. For mm , we keep replacing its value with m−ηgrad_m(x,y,m,c)m−ηgrad_m(x,y,m,c) until it doesn't change. For cc , we keep replacing its value with c−ηgrad_c(x,y,m,c)c−ηgrad_c(x,y,m,c) until it doesn't change.
In [13]:
eta = 0.0001
m, c = 1.0, 1.0
change = True
while change:
mnew = m - eta * grad_m(setosa_petal_length, setosa_petal_width, m, c)
cnew = c - eta * grad_c(setosa_petal_length, setosa_petal_width, m, c)
if m == mnew and c == cnew:
change = False
else:
m, c = mnew, cnew
print("m: %20.16f c: %20.16f" % (m, c))
Compare to values of $m$ & $c$ we got earlier:
m is 0.189262 and c is -0.033080.
(almost equivalent)
In [14]:
# Show the plot.
pl.plot(setosa_petal_length, m * setosa_petal_length + c, 'b-')
pl.plot(setosa_petal_length, setosa_petal_width, 'k.')
pl.xlabel('setosa_petal_length')
pl.ylabel('setosa_petal_width')
pl.show()