Exercise: "Human learning" with iris data

Question: Can you predict the species of an iris using petal and sepal measurements?

  1. Read the iris data into a pandas DataFrame, including column names.
  2. Gather some basic information about the data.
  3. Use sorting, split-apply-combine, and/or visualization to look for differences between species.
  4. Write down a set of rules that could be used to predict species based on iris measurements.

BONUS: Define a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data, and check the accuracy of your predictions.


In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

# display plots in the notebook
%matplotlib inline

# increase default figure and font sizes for easier viewing
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

Task 1

Read the iris data into a pandas DataFrame, including column names. Name the dataframe iris.


In [ ]:
# define a list of column names (as strings)
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# define the URL from which to retrieve the data (as a string)
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

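# retrieve the CSV file and add the column names.  Name the dataframe iris
# NOTE: iris.data has no header row, so header=0 discards the first record and
# leaves 149 rows (the outputs in this notebook reflect that); pass header=None
# to keep all 150 records
iris = pd.read_csv(url, sep=",", names=col_names, header=0)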

print(iris)
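
The effect of the header argument is easy to check on a small inline sample. The sketch below is illustrative only: the three records are in the same header-less layout as iris.data, and the names sample, dropped, and kept are made up for this example.

```python
import io
import pandas as pd

# Three header-less records in the same layout as iris.data (illustrative sample).
sample = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# header=0 treats the first line as a header row and replaces it with `names`,
# so that record never reaches the DataFrame.
dropped = pd.read_csv(io.StringIO(sample), names=col_names, header=0)

# header=None treats every line as data.
kept = pd.read_csv(io.StringIO(sample), names=col_names, header=None)

print(len(dropped), len(kept))
```

Comparing the two lengths is a cheap sanity check whenever names= is combined with a file whose first line may or may not be a header.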

Task 2

Gather some basic information about the data such as:

  • shape
  • head
  • data types of the columns
  • describe
  • counts of the values in the column species
  • count the nulls

In [57]:
iris.shape


Out[57]:
(149, 7)

In [ ]:
iris.head()

In [ ]:
iris.dtypes

In [ ]:
iris.describe()

In [ ]:
iris.species.value_counts()

In [ ]:
iris.isnull().sum()

Task 3

Use sorting, split-apply-combine, and/or visualization to look for differences between species.

sorting


In [ ]:
# Sort the rows by petal_width and display them as a NumPy array

iris.sort_values("petal_width").values

split-apply-combine


In [ ]:
# Find the mean of sepal_length grouped by species
iris.groupby("species").sepal_length.mean()

In [ ]:
# Find the mean of all numeric columns grouped by species
iris.groupby("species").mean()

In [ ]:
# Get the describe information for all numeric columns grouped by species
iris.groupby("species").describe()
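
Group means can hide overlap between species, so it may also help to look at the per-group minimum and maximum of a feature. The sketch below uses a tiny made-up DataFrame (the labels and values are not iris data) to show the pattern; the same agg call should work on iris.groupby("species").

```python
import pandas as pd

# Made-up measurements for two placeholder groups (illustration only).
toy = pd.DataFrame({
    'species': ['species_a', 'species_a', 'species_b', 'species_b'],
    'petal_width': [0.2, 0.4, 1.3, 1.5],
})

# One row per group, with the smallest and largest petal_width in each.
ranges = toy.groupby('species').petal_width.agg(['min', 'max'])
print(ranges)

# If one group's max is below the other's min, a single cutoff separates them.
separable = bool(ranges.loc['species_a', 'max'] < ranges.loc['species_b', 'min'])
print(separable)
```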

visualization


In [ ]:
# Generate a histogram of petal_width grouped by species
plt.style.use('bmh')
iris.hist(column="petal_width", by="species")

#iris.groupby('species').petal_width.plot(kind='hist')

In [ ]:
# Display a box plot of petal_width grouped by species
plt.style.use("fivethirtyeight")
iris.boxplot(column="petal_width", by="species")

In [ ]:
# Display box plot of all numeric columns grouped by species
#iris.groupby("species").plot(kind="boxplot") # not a box plot, but does give line plots broken out by species
iris.boxplot(by='species')

In [62]:
# map species to a numeric value so that plots can be colored by species
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})

print(iris.species_num)
# alternative method; the explicit mapping is preferable because it documents which integer each species gets
#iris['species_num'] = iris.species.factorize()[0]


0      0
1      0
2      0
3      0
4      0
5      0
6      0
7      0
8      0
9      0
10     0
11     0
12     0
13     0
14     0
15     0
16     0
17     0
18     0
19     0
20     0
21     0
22     0
23     0
24     0
25     0
26     0
27     0
28     0
29     0
      ..
119    2
120    2
121    2
122    2
123    2
124    2
125    2
126    2
127    2
128    2
129    2
130    2
131    2
132    2
133    2
134    2
135    2
136    2
137    2
138    2
139    2
140    2
141    2
142    2
143    2
144    2
145    2
146    2
147    2
148    2
Name: species_num, dtype: int64
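
Another practical difference between the two approaches shows up when a label is misspelled. The following sketch uses a short made-up Series (the last label is deliberately wrong) and is only meant to illustrate the behavior of map and factorize.

```python
import pandas as pd

# Small illustrative Series; the last label is deliberately misspelled.
species = pd.Series(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-virginca'])

codes = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}

# map() turns any label missing from the dict into NaN, so bad labels are easy to count.
mapped = species.map(codes)
print(mapped.isnull().sum())

# factorize() assigns codes in order of first appearance, so the typo quietly gets its own code.
factorized, uniques = species.factorize()
print(list(factorized), list(uniques))
```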

In [ ]:
# Generate a scatter plot of petal_length vs petal_width colored by species
iris.plot(kind="scatter", x="petal_length", y="petal_width", c="species_num", colormap="brg")

In [ ]:
# Generate a scatter matrix of all features colored by species.  Make the figure size 12x10
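# (scatter_matrix lives in pd.plotting in current pandas; the top-level pd.scatter_matrix alias was removed)
pd.plotting.scatter_matrix(iris.drop("species_num", axis=1), c=iris.species_num, figsize=(12, 10))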

Task 4

Decide on a set of rules that could be used to predict species based on iris measurements.


In [ ]:
# Define a new feature that represents petal area ("feature engineering")
iris["petal_area"] = iris.petal_length * iris.petal_width

In [ ]:
# Display summary statistics of petal_area grouped by species
# (in pandas 0.20+ this already returns one row per species, so no unstack() is needed)
iris.groupby("species").petal_area.describe()

In [ ]:
# Display a box plot of petal_area grouped by species
iris.boxplot(column="petal_area", by="species")

Predicting setosa will be straightforward, since every Iris-setosa petal_area is below 2 and the other iris species have petal_area values above 2. Iris-versicolor and Iris-virginica are harder to tell apart because some of their petal_area values overlap. Let's look at that overlap in more detail.


In [ ]:
# Show only dataframe rows with a petal_area between 7 and 9
iris[(iris.petal_area > 7) & (iris.petal_area < 9)].sort_values('petal_area')

My set of rules for predicting species:

  • if petal_area < 2, then "setosa"
  • else if petal_area < 7.5, then "versicolor"
  • else "virginica"
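
A cutoff such as 7.5 can be sanity-checked by counting how many values land on the wrong side of a few candidate cutoffs. The sketch below is a rough illustration in plain Python: count_errors is a hypothetical helper, and the two lists hold made-up petal areas, not values from the iris data.

```python
def count_errors(cutoff, low_group, high_group):
    """Count values that fall on the wrong side of `cutoff`."""
    wrong_low = sum(1 for v in low_group if v >= cutoff)   # should be below the cutoff
    wrong_high = sum(1 for v in high_group if v < cutoff)  # should be at or above it
    return wrong_low + wrong_high

# Made-up petal areas for illustration only (not taken from the iris data).
low_group = [4.2, 5.6, 6.3, 7.2, 8.6]
high_group = [7.8, 8.1, 9.5, 11.3, 12.0]

# Try a few candidate cutoffs and keep the one with the fewest errors.
candidates = [6.5, 7.5, 8.5]
errors = {c: count_errors(c, low_group, high_group) for c in candidates}
best = min(errors, key=errors.get)
print(errors, best)
```

On the real data, the two lists would be the petal_area values of Iris-versicolor and Iris-virginica.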

Bonus

Define a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data, and check the accuracy of your predictions.


In [76]:
# Define a function that given a row of data, returns a predicted species_num (0/1/2)
def classify_species(row):
    # petal area = petal_length * petal_width (accessed by label rather than by position)
    petal_area = row["petal_length"] * row["petal_width"]

    if petal_area < 2:
        prediction = "setosa"
    elif petal_area < 7.5:
        prediction = "versicolor"
    else:
        prediction = "virginica"

    # map the predicted name back to the integer code used in species_num
    species_codes = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
    return species_codes[prediction]
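
The same thresholds can also be applied to a whole column at once instead of row by row. The sketch below is one possible approach using pd.cut, shown on made-up petal areas that include the two boundary values; it is an alternative illustration, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Illustrative petal areas (made up), covering each rule branch and both boundary values.
petal_area = pd.Series([0.28, 2.0, 5.6, 7.5, 11.3])

# right=False makes each bin closed on the left: [-inf, 2), [2, 7.5), [7.5, inf),
# which matches "< 2", "< 7.5", and "else" in the rules.
pred = pd.cut(
    petal_area,
    bins=[-np.inf, 2, 7.5, np.inf],
    labels=[0, 1, 2],
    right=False,
).astype(int)
print(pred.tolist())
```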

In [73]:
# Print the first row
iris.loc[0,:]


Out[73]:
sepal_length              4.9
sepal_width                 3
petal_length              1.4
petal_width               0.2
species           Iris-setosa
species_num                 0
y_pred_species         setosa
Name: 0, dtype: object

In [74]:
# Print the last row
iris.loc[148,:]


Out[74]:
sepal_length                 5.9
sepal_width                    3
petal_length                 5.1
petal_width                  1.8
species           Iris-virginica
species_num                    2
y_pred_species         verginica
Name: 148, dtype: object

In [77]:
# Test the function on the first and last rows
print(classify_species(iris.loc[0,:]))
print(classify_species(iris.loc[148,:]))


0
2

In [78]:
# Make predictions for all rows and store them in the DataFrame
iris["y_pred_species"] = iris.apply(classify_species, axis=1)

In [83]:
# Calculate the fraction of correct predictions (the mean of a boolean Series,
# which avoids hard-coding the row count)
(iris.species_num == iris.y_pred_species).mean()


Out[83]:
0.97315436241610742
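
A single accuracy number does not show which species get confused with each other. One way to see that is a cross-tabulation of actual against predicted codes; the sketch below uses short made-up Series (not the notebook's results) to show the pattern.

```python
import pandas as pd

# Made-up actual and predicted codes for illustration (0/1/2, as in species_num).
actual = pd.Series([0, 0, 1, 1, 1, 2, 2, 2], name='actual')
predicted = pd.Series([0, 0, 1, 1, 2, 2, 2, 1], name='predicted')

# Rows are actual codes, columns are predicted codes; off-diagonal cells are mistakes.
table = pd.crosstab(actual, predicted)
print(table)

# Overall fraction of correct predictions.
accuracy = (actual == predicted).mean()
print(accuracy)
```

With the notebook's DataFrame, the equivalent call would be pd.crosstab(iris.species_num, iris.y_pred_species).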
