This problem sheet relates to the Iris data set and uses Jupyter, NumPy and pyplot. Problems are labelled 1 to 10.
import numpy as np
# Adapted from
filename = 'data.csv'
# Columns 0-3 hold the four numeric measurements; column 4 holds the species name
sLen, sWid, pLen, pWid = np.genfromtxt(filename, delimiter=',', usecols=(0,1,2,3), unpack=True, dtype=float)
spec = np.genfromtxt(filename, delimiter=',', usecols=(4), unpack=True, dtype=str)
# Print the first ten rows
for i in range(10):
    print('{0:.1f} {1:.1f} {2:.1f} {3:.1f} {4:s}'.format(sLen[i], sWid[i], pLen[i], pWid[i], spec[i]))
The Iris data set was introduced by Ronald Fisher in 1936 and contains 50 samples from each of three species of Iris: Iris setosa, Iris virginica and Iris versicolor. Each row holds five values in this order: sepal length, sepal width, petal length, petal width, species classification. A raw copy of the data set can be found here.
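As a quick check of the row structure described above, the following minimal sketch parses a few rows in that five-column layout and counts the samples per species. The inline rows in `SAMPLE` and the helper name `count_species` are illustrative assumptions standing in for the real file, and the sketch uses only the standard library so that it runs on its own. Applied to the full data set, each species should be counted 50 times.

```python
import csv
import io
from collections import Counter

# Illustrative rows in the layout described above (an assumption standing in for data.csv):
# sepal length, sepal width, petal length, petal width, species
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

def count_species(text):
    """Count how many rows belong to each species (the fifth column)."""
    rows = csv.reader(io.StringIO(text))
    # Skip empty rows, such as a trailing blank line at the end of the file
    return Counter(row[4] for row in rows if row)

counts = count_species(SAMPLE)
for name, n in sorted(counts.items()):
    print(name, n)
```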
import matplotlib.pyplot as pl
pl.rcParams['figure.figsize'] = (14, 6) # Adapted from gradient descent notebook:
pl.scatter(sLen, sWid, marker='.')
pl.title('Scatter Diagram of Sepal Width vs Length', fontsize=14)
pl.xlabel('Sepal Length')
pl.ylabel('Sepal Width')
import matplotlib.patches as mpatches
pl.rcParams['figure.figsize'] = (14,6)
# Colour related to type adapted from
colours = {'Iris-setosa': 'red', 'Iris-versicolor': 'green', 'Iris-virginica': 'blue'}
# One colour per point, looked up from its species; the legend is built from custom handles below
pl.scatter(sLen, sWid, c=[colours[s] for s in spec], marker=".")
pl.title('Scatter Diagram of Sepal Width vs Length', fontsize=14)
pl.xlabel('Sepal Length')
pl.ylabel('Sepal Width')
# Custom handles adapted from
# Build one legend patch per species directly from the colours dictionary
handles = [mpatches.Patch(color=colour, label=label) for label, colour in colours.items()]
pl.legend(handles=handles, loc=2, frameon=True)
Use Seaborn to create a scatterplot matrix of all five variables (sepal length, sepal width, petal length, petal width, species classification).
Note: needs work, dataframe working but sb plot isn't. Will do other questions and come back to this if there's time.
# Seaborn scatterplot adapted from
import seaborn as sb
# Load the data - Iris included in Seaborn's github repo for csv files here:
data = sb.load_dataset("iris")
# Plot data, base the colour of points on species
sb.pairplot(data, hue="species")
# Conversions adapted from
# Adapted from
w = pLen
d = pWid
w_avg = np.mean(w)
d_avg = np.mean(d)
w_zero = w - w_avg
d_zero = d - d_avg
m = np.sum(w_zero * d_zero) / np.sum(w_zero * w_zero)
c = d_avg - m * w_avg
# Graph labels etc
pl.rcParams['figure.figsize'] = (14,6)
pl.title('Petal Measurements', fontsize=14)
pl.xlabel('Petal Length')
pl.ylabel('Petal Width')
pl.scatter(w, d, marker='.', label='Data Set')
pl.plot(w, m * w + c, 'r', label='Best Fit Line')
pl.legend(loc=2, frameon=True)
# Adapted from
rsq = 1.0 - (np.sum((d - m * w - c)**2)/np.sum((d - d_avg)**2))
print("R-squared: {0:.6f}".format(rsq))
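For reference, the quantities computed in the code above can be written out as follows, with $w_i$ the petal lengths, $d_i$ the petal widths, and $\bar{w}$, $\bar{d}$ their means. This is the usual simple least-squares fit and coefficient of determination; the summation index $i$ is notation added here and does not appear in the code.

```latex
m = \frac{\sum_i (w_i - \bar{w})(d_i - \bar{d})}{\sum_i (w_i - \bar{w})^2},
\qquad
c = \bar{d} - m\,\bar{w}

R^2 = 1 - \frac{\sum_i (d_i - m\,w_i - c)^2}{\sum_i (d_i - \bar{d})^2}
```

An $R^2$ close to 1 means the line accounts for most of the variation in petal width.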
# Adding arrays as columns adapted from
data = np.column_stack((sLen, sWid, pLen, pWid, spec))
# Setosa data -> 0 - 49 in data set. (Definitely better ways of doing this but works for now, will change if there's time)
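spLen, spWid = [], []
for index, row in enumerate(data):
    # Petal info contained in cols 2 & 3
    # For each row, append column 2 to spLen array and column 3 to spWid array
    # (column_stack stored every value as a string, so convert back to float)
    spLen.append(float(row[2]))
    spWid.append(float(row[3]))
    if index == 49:
        # Last setosa row reached
        break
# Convert to arrays so the arithmetic below is element-wise
spLen, spWid = np.array(spLen), np.array(spWid)
# Calculate best values for m and c
m, c = np.polyfit(spLen, spWid, 1)
y = m * spLen + c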
# Graph labels etc
pl.rcParams['figure.figsize'] = (16,8)
pl.title('Iris Setosa Petal Measurements', fontsize=14)
pl.xlabel('Petal Length')
pl.ylabel('Petal Width')
pl.scatter(spLen, spWid, label = 'Iris Setosa') # Plot the data points
pl.plot(spLen, y, 'r', label = 'Best Fit Line') # Plot the line
pl.legend(loc=2, frameon=True)
orM = m
orC = c
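# R-squared for the setosa fit, measured against the setosa petal data rather than the full data set
sx, sy = np.asarray(spLen, dtype=float), np.asarray(spWid, dtype=float)
rsq = 1.0 - (np.sum((sy - m * sx - c)**2)/np.sum((sy - np.mean(sy))**2))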
print("R-squared: {0:.6f}".format(rsq))
w = np.array(spLen)
d = np.array(spWid)
print("Original \t\tm: %20.16f c: %20.16f" % (orM, orC))
# Adapted from Gradient Descent worksheet -
# Partial derivatives with respect to m and c
def grad_m(x, y, m, c):
    return -2.0 * np.sum(x * (y - m * x - c))
def grad_c(x, y, m, c):
    return -2.0 * np.sum(y - m * x - c)
# Set up variables
eta = 0.0001 # Learning rate: the step size used for each update of m and c
gdm, gdc = 1.0, 1.0 # Initial guesses for GD m and c
change = True
while change:
    mnew = gdm - eta * grad_m(w, d, gdm, gdc)
    cnew = gdc - eta * grad_c(w, d, gdm, gdc)
    if gdm == mnew and gdc == cnew:
        # Calculations no longer changing, stop the loop
        change = False
    gdm, gdc = mnew, cnew
# - End adapted from Gradient Descent worksheet -
print("Gradient desc \t\tm: %20.16f c: %20.16f" % (gdm, gdc))
# Graph labels etc
pl.rcParams['figure.figsize'] = (16,8)
pl.title('Iris Setosa Best Fit Line using Gradient Descent', fontsize=14)
pl.xlabel('Petal Length')
pl.ylabel('Petal Width')
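# Line equation is y = m * x + c; convert to an array so the arithmetic is element-wise
y = gdm * np.asarray(spLen, dtype=float) + gdc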
pl.scatter(spLen, spWid, label = 'Iris Setosa')
pl.plot(spLen, y, 'g', label='Best Fit Line using Gradient Descent')
pl.legend(loc=2, frameon=True) # Show the labels set above
As we can see above, there is only a very slight difference between the values of m and c generated using polyfit and those generated using the gradient descent method. The difference is so small that the two lines would look identical if plotted on separate graphs; compare the graph in problem 8, which used polyfit to get the best fit line.
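The comparison above can be reproduced on a small example. The sketch below is a minimal illustration, not the notebook's code: it uses a few made-up points (`xs`, `ys`) and plain Python instead of numpy. It also stops when both updates fall below a tolerance `tol` rather than waiting for exact floating-point equality. All names and values here are assumptions chosen for illustration.

```python
# Made-up points lying close to a straight line (illustrative only, not Iris data)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.4, 2.1, 2.4, 3.1, 3.5]

def closed_form(xs, ys):
    """Least-squares slope and intercept from the analytic formulas."""
    n = len(xs)
    x_avg = sum(xs) / n
    y_avg = sum(ys) / n
    sxy = sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
    sxx = sum((x - x_avg) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_avg - m * x_avg

def gradient_descent(xs, ys, eta=0.01, tol=1e-12, max_iter=1_000_000):
    """Minimise the sum of squared residuals by repeated gradient steps."""
    m, c = 1.0, 1.0  # initial guesses, as in the notebook
    for _ in range(max_iter):
        # Partial derivatives of the cost with respect to m and c
        grad_m = -2.0 * sum(x * (y - m * x - c) for x, y in zip(xs, ys))
        grad_c = -2.0 * sum(y - m * x - c for x, y in zip(xs, ys))
        m_new = m - eta * grad_m
        c_new = c - eta * grad_c
        # Stop once both parameters have effectively stopped moving
        if abs(m_new - m) < tol and abs(c_new - c) < tol:
            return m_new, c_new
        m, c = m_new, c_new
    return m, c

m_cf, c_cf = closed_form(xs, ys)
m_gd, c_gd = gradient_descent(xs, ys)
print("Closed form   m: %20.16f c: %20.16f" % (m_cf, c_cf))
print("Gradient desc m: %20.16f c: %20.16f" % (m_gd, c_gd))
```

A tolerance-based stop usually finishes sooner than an exact-equality test, and the `max_iter` cap guards against a learning rate that is too large to converge.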