Problem-set-Jupyter-Pyplot-and-Numpy

Write a note about the data set

Fisher's Iris Data Set is a well-known data set that has become a common test case in machine learning. Each row in the data set comprises four numeric values: petal length, petal width, sepal length and sepal width. Each row also contains the type of iris flower (one of three: Iris setosa, Iris versicolor, or Iris virginica).

According to Lichman [1],

"One class is linearly separable from the other 2; the latter are NOT linearly separable from each other".

The types cluster together, so the measurements (petal length, petal width, sepal length and sepal width) can be analysed to distinguish or predict the type of an iris flower [2].
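As a rough illustration of what "linearly separable" means, the sketch below checks whether a single cut-off on petal length splits a few Setosa rows from a few Virginica rows. The values are copied from the rows displayed in this notebook's outputs; the `threshold` value is a hypothetical choice made by eye, and Versicolor is left out because its rows are elided in the displayed output. It illustrates the idea only and does not prove the claim for the full data set.

```python
import numpy as np

# Petal lengths copied from the rows displayed in this notebook's outputs.
setosa_petal_length = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5])
virginica_petal_length = np.array([5.7, 4.9, 6.7, 4.9, 5.7, 6.0, 4.8, 4.9, 5.6, 5.8])

# Hypothetical cut-off chosen by eye; it is not part of the original analysis.
threshold = 2.5

# In one dimension a linear separator is just a threshold: every value of one
# group lies below it and every value of the other group lies above it.
separable = bool(setosa_petal_length.max() < threshold < virginica_petal_length.min())
print("Separated by petal length <", threshold, ":", separable)
```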

References:

[1] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[2] True, Joseph - Content Data Scientist (2015). IBM Watson Analytics [https://www.ibm.com/communities/analytics/watson-analytics-blog/watson-analytics-use-case-the-iris-data-set/].

Get and load the data


In [1]:
import numpy as np

# Load in data from csv file.
sepal_length, sepal_width, petal_length, petal_width = np.genfromtxt('../data/IRIS.csv', delimiter=',', usecols=(0,1,2,3), unpack=True, dtype=float)
iris_class = np.genfromtxt('../data/IRIS.csv', delimiter=',', usecols=(4), unpack=True, dtype=str)

# Loaded the columns into separate variables for ease of use.
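To see what `unpack=True` and `usecols` do without needing the CSV file on disk, the following sketch runs the same two `np.genfromtxt` calls on an in-memory sample. The three rows are taken from the DataFrame output shown in this notebook; the column order in the file (sepal length, sepal width, petal length, petal width, class) is an assumption inferred from the variable names in the cell above, and `sample_csv` is a hypothetical stand-in for `../data/IRIS.csv`.

```python
from io import StringIO

import numpy as np

# Hypothetical in-memory stand-in for the CSV file (assumed column order).
sample_csv = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "6.9,3.2,5.7,2.3,Iris-virginica\n"
)

# unpack=True transposes the result, so each column becomes its own array.
sepal_length, sepal_width, petal_length, petal_width = np.genfromtxt(
    StringIO(sample_csv), delimiter=',', usecols=(0, 1, 2, 3), unpack=True, dtype=float)

# The class names are text, so they are read separately as strings.
iris_class = np.genfromtxt(StringIO(sample_csv), delimiter=',', usecols=4, dtype=str)

print(sepal_length)
print(iris_class)
```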

Create a simple plot


In [2]:
import matplotlib.pyplot as plt

# Plot Sepal Length on the x-axis and Sepal Width on the y-axis; complete with labels.

# Scale graph to a bigger size
plt.rcParams['figure.figsize'] = (14.0, 6.0)

# Set title
plt.title('Iris Data Set: Sepal Measurements', fontsize=16)

# plot scatter graph
plt.scatter(sepal_length, sepal_width)

# Add labels
plt.xlabel('Sepal Length', fontsize=14)
plt.ylabel('Sepal Width', fontsize=14)

# Output Graph
plt.show()


Create a more complex plot


In [3]:
# https://matplotlib.org/users/legend_guide.html
import matplotlib.patches as mp

# https://stackoverflow.com/questions/27318906/python-scatter-plot-with-colors-corresponding-to-strings
colours = {'Iris-setosa': 'r', 'Iris-versicolor': 'g', 'Iris-virginica': 'b'}

# Colour each point by its class; the legend is built separately from patches below.
plt.scatter(sepal_length, sepal_width, c=[colours[i] for i in iris_class])

# Add title
plt.title('Iris Setosa, Versicolor, and Virginica: Sepal Measurements', fontsize=16)

# Add labels
plt.xlabel('Sepal Length', fontsize=14)
plt.ylabel('Sepal Width', fontsize=14)

# https://matplotlib.org/api/patches_api.html
legend_entries = [('Iris Setosa', 'r'), ('Iris Versicolor', 'g'), ('Iris Virginica', 'b')]
plt.legend(handles=[mp.Patch(color=colour, label=label) for label, colour in legend_entries])
plt.show()


Use seaborn


In [4]:
import seaborn as sns
import pandas as pd 

# Prepare data with pandas DataFrame for seaborn usage.
df = pd.DataFrame(dict(zip(['Sepal Length', 'Sepal Width','Petal Length', 'Petal Width', 'Iris Class'], [sepal_length, sepal_width, petal_length, petal_width, iris_class])))
df


Out[4]:
Iris Class Petal Length Petal Width Sepal Length Sepal Width
0 Iris-setosa 1.4 0.2 5.1 3.5
1 Iris-setosa 1.4 0.2 4.9 3.0
2 Iris-setosa 1.3 0.2 4.7 3.2
3 Iris-setosa 1.5 0.2 4.6 3.1
4 Iris-setosa 1.4 0.2 5.0 3.6
5 Iris-setosa 1.7 0.4 5.4 3.9
6 Iris-setosa 1.4 0.3 4.6 3.4
7 Iris-setosa 1.5 0.2 5.0 3.4
8 Iris-setosa 1.4 0.2 4.4 2.9
9 Iris-setosa 1.5 0.1 4.9 3.1
10 Iris-setosa 1.5 0.2 5.4 3.7
11 Iris-setosa 1.6 0.2 4.8 3.4
12 Iris-setosa 1.4 0.1 4.8 3.0
13 Iris-setosa 1.1 0.1 4.3 3.0
14 Iris-setosa 1.2 0.2 5.8 4.0
15 Iris-setosa 1.5 0.4 5.7 4.4
16 Iris-setosa 1.3 0.4 5.4 3.9
17 Iris-setosa 1.4 0.3 5.1 3.5
18 Iris-setosa 1.7 0.3 5.7 3.8
19 Iris-setosa 1.5 0.3 5.1 3.8
20 Iris-setosa 1.7 0.2 5.4 3.4
21 Iris-setosa 1.5 0.4 5.1 3.7
22 Iris-setosa 1.0 0.2 4.6 3.6
23 Iris-setosa 1.7 0.5 5.1 3.3
24 Iris-setosa 1.9 0.2 4.8 3.4
25 Iris-setosa 1.6 0.2 5.0 3.0
26 Iris-setosa 1.6 0.4 5.0 3.4
27 Iris-setosa 1.5 0.2 5.2 3.5
28 Iris-setosa 1.4 0.2 5.2 3.4
29 Iris-setosa 1.6 0.2 4.7 3.2
... ... ... ... ... ...
120 Iris-virginica 5.7 2.3 6.9 3.2
121 Iris-virginica 4.9 2.0 5.6 2.8
122 Iris-virginica 6.7 2.0 7.7 2.8
123 Iris-virginica 4.9 1.8 6.3 2.7
124 Iris-virginica 5.7 2.1 6.7 3.3
125 Iris-virginica 6.0 1.8 7.2 3.2
126 Iris-virginica 4.8 1.8 6.2 2.8
127 Iris-virginica 4.9 1.8 6.1 3.0
128 Iris-virginica 5.6 2.1 6.4 2.8
129 Iris-virginica 5.8 1.6 7.2 3.0
130 Iris-virginica 6.1 1.9 7.4 2.8
131 Iris-virginica 6.4 2.0 7.9 3.8
132 Iris-virginica 5.6 2.2 6.4 2.8
133 Iris-virginica 5.1 1.5 6.3 2.8
134 Iris-virginica 5.6 1.4 6.1 2.6
135 Iris-virginica 6.1 2.3 7.7 3.0
136 Iris-virginica 5.6 2.4 6.3 3.4
137 Iris-virginica 5.5 1.8 6.4 3.1
138 Iris-virginica 4.8 1.8 6.0 3.0
139 Iris-virginica 5.4 2.1 6.9 3.1
140 Iris-virginica 5.6 2.4 6.7 3.1
141 Iris-virginica 5.1 2.3 6.9 3.1
142 Iris-virginica 5.1 1.9 5.8 2.7
143 Iris-virginica 5.9 2.3 6.8 3.2
144 Iris-virginica 5.7 2.5 6.7 3.3
145 Iris-virginica 5.2 2.3 6.7 3.0
146 Iris-virginica 5.0 1.9 6.3 2.5
147 Iris-virginica 5.2 2.0 6.5 3.0
148 Iris-virginica 5.4 2.3 6.2 3.4
149 Iris-virginica 5.1 1.8 5.9 3.0

150 rows × 5 columns


In [5]:
# Adapted from: https://seaborn.pydata.org/examples/scatterplot_matrix.html

%matplotlib inline
sns.pairplot(df, hue="Iris Class")


Out[5]:
<seaborn.axisgrid.PairGrid at 0x198449fe668>

Fit a line


In [6]:
# Reset size after seaborn
plt.rcParams['figure.figsize'] = (14.0, 6.0)

# https://github.com/emerging-technologies/emerging-technologies.github.io/blob/master/notebooks/simple-linear-regression.ipynb
# Calculate the best values for m and c.
m, c = np.polyfit(petal_length, petal_width, 1)

# Plot petal measurements for the whole data set
plt.scatter(petal_length, petal_width, marker='o', label='Data Set')

# Plot best fit line 
plt.plot(petal_length, m * petal_length + c, 'forestgreen', label='Best fit line')

# Add title
plt.title('Iris Data Set: Petal Measurements', fontsize=16)

# Add labels
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.legend()

# Print graph
plt.show()


Calculate the R-squared value


In [7]:
# Calculate the R-squared value for our data set using numpy.
np.corrcoef(petal_length, petal_width)[0][1]**2


Out[7]:
0.92690122792200302
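For a straight-line least-squares fit, squaring the correlation coefficient gives the same number as the more general definition of R-squared, one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this on ten rows copied from the DataFrame output above (five Setosa, five Virginica); the variable names are illustrative and the small sample will not reproduce the value computed on the full data set.

```python
import numpy as np

# Petal length (x) and petal width (y) for ten rows shown in the output above.
x = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 5.7, 4.9, 6.7, 4.9, 5.7])
y = np.array([0.2, 0.2, 0.2, 0.2, 0.2, 2.3, 2.0, 2.0, 1.8, 2.1])

# Least-squares line, as in the 'Fit a line' cell.
m, c = np.polyfit(x, y, 1)

# R-squared from its definition: 1 - SS_res / SS_tot.
ss_res = np.sum((y - (m * x + c)) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# R-squared as the squared correlation coefficient, as in the cell above.
r_squared_corr = np.corrcoef(x, y)[0][1] ** 2

print(r_squared, r_squared_corr)
```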

Fit another line


In [8]:
# https://stackoverflow.com/questions/27947487/is-zip-the-most-efficient-way-to-combine-arrays-with-respect-to-memory-in-nump
# Combine arrays
iris_data = np.column_stack((sepal_length, sepal_width, petal_length, petal_width,iris_class))

# https://docs.scipy.org/doc/numpy/reference/generated/numpy.in1d.html
# Filter Data with 'Iris-setosa' & transpose after
filter_setosa = (iris_data[np.in1d(iris_data[:,4],'Iris-setosa')]).transpose()

# https://stackoverflow.com/questions/3877491/deleting-rows-in-numpy-array
# https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.chararray.astype.html
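# Prepare data - delete the row of class names and convert the remaining values to float
# (the np.float alias was removed in newer NumPy releases, so the built-in float is used)
setosa_data = (np.delete(filter_setosa, (4), axis=0)).astype(float)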
setosa_data


Out[8]:
array([[ 5.1,  4.9,  4.7,  4.6,  5. ,  5.4,  4.6,  5. ,  4.4,  4.9,  5.4,
         4.8,  4.8,  4.3,  5.8,  5.7,  5.4,  5.1,  5.7,  5.1,  5.4,  5.1,
         4.6,  5.1,  4.8,  5. ,  5. ,  5.2,  5.2,  4.7,  4.8,  5.4,  5.2,
         5.5,  4.9,  5. ,  5.5,  4.9,  4.4,  5.1,  5. ,  4.5,  4.4,  5. ,
         5.1,  4.8,  5.1,  4.6,  5.3,  5. ],
       [ 3.5,  3. ,  3.2,  3.1,  3.6,  3.9,  3.4,  3.4,  2.9,  3.1,  3.7,
         3.4,  3. ,  3. ,  4. ,  4.4,  3.9,  3.5,  3.8,  3.8,  3.4,  3.7,
         3.6,  3.3,  3.4,  3. ,  3.4,  3.5,  3.4,  3.2,  3.1,  3.4,  4.1,
         4.2,  3.1,  3.2,  3.5,  3.1,  3. ,  3.4,  3.5,  2.3,  3.2,  3.5,
         3.8,  3. ,  3.8,  3.2,  3.7,  3.3],
       [ 1.4,  1.4,  1.3,  1.5,  1.4,  1.7,  1.4,  1.5,  1.4,  1.5,  1.5,
         1.6,  1.4,  1.1,  1.2,  1.5,  1.3,  1.4,  1.7,  1.5,  1.7,  1.5,
         1. ,  1.7,  1.9,  1.6,  1.6,  1.5,  1.4,  1.6,  1.6,  1.5,  1.5,
         1.4,  1.5,  1.2,  1.3,  1.5,  1.3,  1.5,  1.3,  1.3,  1.3,  1.6,
         1.9,  1.4,  1.6,  1.4,  1.5,  1.4],
       [ 0.2,  0.2,  0.2,  0.2,  0.2,  0.4,  0.3,  0.2,  0.2,  0.1,  0.2,
         0.2,  0.1,  0.1,  0.2,  0.4,  0.4,  0.3,  0.3,  0.3,  0.2,  0.4,
         0.2,  0.5,  0.2,  0.2,  0.4,  0.2,  0.2,  0.2,  0.2,  0.4,  0.1,
         0.2,  0.1,  0.2,  0.2,  0.1,  0.2,  0.2,  0.3,  0.3,  0.2,  0.6,
         0.4,  0.3,  0.2,  0.2,  0.2,  0.2]])

In [15]:
# https://github.com/emerging-technologies/emerging-technologies.github.io/blob/master/notebooks/simple-linear-regression.ipynb
# Calculate the best values for m and c.
m, c = np.polyfit(setosa_data[2], setosa_data[3], 1)

# Plot Setosa measurements 
plt.scatter(setosa_data[2],setosa_data[3],marker='o', label='Iris Setosa')

# Plot best fit line 
plt.plot(setosa_data[2], m * setosa_data[2] + c, 'forestgreen', label='Best fit line')

# Add title
plt.title('Iris Setosa: Petal Measurements', fontsize=16)

# Add labels
plt.xlabel('Petal Length', fontsize=14)
plt.ylabel('Petal Width', fontsize=14)

plt.legend()
# Print graph
plt.show()


Calculate the R-squared value


In [10]:
# Calculate the R-squared value for the Setosa data using numpy.
np.corrcoef(setosa_data[2], setosa_data[3])[0][1]**2


Out[10]:
0.09382472022283582

Use gradient descent

Gradient descent is an iterative approximation technique. We start with a guess for the values we wish to approximate (here m and c) and repeatedly improve that guess by moving it a small step against the gradient of the cost.


In [11]:
# Calculate the partial derivative of cost with respect to m while treating c as a constant.
def gradient_descent_m(x, y, m, c):
  return -2.0 * np.sum(x * (y - m * x - c))

# Calculate the partial derivative of cost with respect to c while treating m as a constant.
def gradient_descent_c(x, y, m , c):
  return -2.0 * np.sum(y - m * x - c)
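The two functions above appear to be the partial derivatives of a sum-of-squared-residuals cost. That cost is not written out in the notebook, so treating it as the cost being minimised is an assumption; under that assumption the derivation is:

```latex
\begin{aligned}
\mathrm{Cost}(m, c) &= \sum_i \left( y_i - m x_i - c \right)^2 \\
\frac{\partial\,\mathrm{Cost}}{\partial m} &= -2 \sum_i x_i \left( y_i - m x_i - c \right) \\
\frac{\partial\,\mathrm{Cost}}{\partial c} &= -2 \sum_i \left( y_i - m x_i - c \right)
\end{aligned}
```

Each gradient descent step then moves m and c against these derivatives, scaled by the learning rate eta: m becomes m - eta * dCost/dm and c becomes c - eta * dCost/dc.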

In [12]:
eta = 0.0001
g_m, g_c = 1.0, 1.0
change = True

# Repeat the update step until neither m nor c changes between iterations
while change:
  g_m_new = g_m - eta * gradient_descent_m(setosa_data[2], setosa_data[3], g_m, g_c)
  g_c_new = g_c - eta * gradient_descent_c(setosa_data[2], setosa_data[3], g_m, g_c)
  if g_m == g_m_new and g_c == g_c_new:
    change = False
  else:
    g_m, g_c = g_m_new, g_c_new
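The loop above stops only when both values are bit-for-bit unchanged, which can take a very large number of iterations. A common variation, sketched here as an alternative rather than as the notebook's method, stops once the update is smaller than a tolerance and caps the number of iterations. The names `fit_line`, `tol` and `max_iter` are hypothetical, the larger `eta=0.01` is an assumption chosen so the small sample converges quickly, and the data are the first ten Setosa rows shown earlier.

```python
import numpy as np

def grad_m(x, y, m, c):
    # Partial derivative of the squared-error cost with respect to m.
    return -2.0 * np.sum(x * (y - m * x - c))

def grad_c(x, y, m, c):
    # Partial derivative of the squared-error cost with respect to c.
    return -2.0 * np.sum(y - m * x - c)

def fit_line(x, y, eta=0.01, tol=1e-12, max_iter=200_000):
    # Hypothetical helper: gradient descent with a tolerance-based stop.
    m, c = 1.0, 1.0
    for iteration in range(max_iter):
        m_new = m - eta * grad_m(x, y, m, c)
        c_new = c - eta * grad_c(x, y, m, c)
        converged = abs(m_new - m) < tol and abs(c_new - c) < tol
        m, c = m_new, c_new
        if converged:
            break
    return m, c, iteration + 1

# Petal length and petal width of the first ten Setosa rows.
x = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5])
y = np.array([0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1])

gd_m, gd_c, steps = fit_line(x, y)
ref_m, ref_c = np.polyfit(x, y, 1)
print("gradient descent:", gd_m, gd_c, "after", steps, "steps")
print("np.polyfit:      ", ref_m, ref_c)
```

The tolerance trades accuracy for run time: a looser `tol` stops earlier but leaves the result further from the least-squares values.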

To the human eye it is difficult to see a difference between the best fit line and the best fit line approximated by gradient descent.


In [16]:
# Plot Setosa measurements 
plt.scatter(setosa_data[2],setosa_data[3],marker='o', label='Iris Setosa')

# Plot best fit line according to Gradient Descent
plt.plot(setosa_data[2], g_m * setosa_data[2] + g_c, 'forestgreen', label='Best fit line: Gradient Descent')

# Add title
plt.title('Iris Setosa: Petal Measurements', fontsize=16)

# Add labels
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')

plt.legend()
# Print graph
plt.show()


However, the results from the two techniques are in fact different. The values of m and c agree to eleven decimal places and differ only beyond that, so gradient descent produced a very close approximation of the best fit line rather than an exact match.


In [17]:
print("BEST LINE:  m: %20.16f  c: %20.16f" % (m, c))
print()
print("GRADIENT DESCENT:  m: %20.16f  c: %20.16f" % (g_m, g_c))


BEST LINE:  m:   0.1892624728850328  c:  -0.0330802603036879

GRADIENT DESCENT:  m:   0.1892624728849683  c:  -0.0330802603035933
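The size of the gap can be checked directly from the printed values. The sketch below hard-codes the four numbers from the output above (the variable names are illustrative, not the notebook's own) and prints the absolute differences, which fall below 1e-12.

```python
# Values copied from the printed output above.
best_m, best_c = 0.1892624728850328, -0.0330802603036879
gd_m, gd_c = 0.1892624728849683, -0.0330802603035933

# Absolute difference between the least-squares fit and gradient descent.
diff_m = abs(best_m - gd_m)
diff_c = abs(best_c - gd_c)

print("difference in m: %.3e" % diff_m)
print("difference in c: %.3e" % diff_c)
```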