Exercise: "Human learning" with iris data

Question: Can you predict the species of an iris using petal and sepal measurements?

  1. Read the iris data into a Pandas DataFrame, including column names.
  2. Gather some basic information about the data.
  3. Use sorting, split-apply-combine, and/or visualization to look for differences between species.
  4. Write down a set of rules that could be used to predict species based on iris measurements.

BONUS: Define a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data, and check the accuracy of your predictions.

import pandas as pd
import matplotlib.pyplot as plt

# display plots in the notebook
%matplotlib inline

# increase default figure and font sizes for easier viewing
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

Task 1

Read the iris data into a pandas DataFrame, including column names. Name the dataframe iris.

# define a list of column names (as strings)
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# define the URL from which to retrieve the data (as a string)
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

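# retrieve the CSV file and add the column names.  Name the dataframe iris
# note: header=0 treats the first line of the file as a header row; iris.data has no
# header row, so the first record is discarded (149 rows remain). Use header=None to keep all 150.
iris = pd.read_csv(url, sep=",", names=col_names, header=0)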
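
The effect of the `header` argument can be checked without a network connection. The sketch below parses a three-line in-memory stand-in for a headerless CSV file (the `sample` string and the names `with_header` and `no_header` are assumptions for illustration, not part of the exercise) and compares `header=0` with `header=None`.

```python
import io
import pandas as pd

col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# hypothetical stand-in for the first lines of a headerless CSV file
sample = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# header=0: the first line is consumed as a header row, then replaced by names
with_header = pd.read_csv(io.StringIO(sample), names=col_names, header=0)

# header=None: every line is kept as data
no_header = pd.read_csv(io.StringIO(sample), names=col_names, header=None)

print(with_header.shape)
print(no_header.shape)
```

Comparing the two shapes shows whether a record was lost to the header row.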


Task 2

Gather some basic information about the data such as:

  • shape
  • head
  • data types of the columns
  • describe
  • counts of the values in the column species
  • count the nulls

(149, 7)
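
One way to gather the items listed above is sketched here. To keep it self-contained, it runs on a six-row illustrative frame named `iris_sample` (an assumption standing in for the full `iris` DataFrame); the same calls apply to `iris` itself.

```python
import pandas as pd

# small illustrative sample standing in for the full iris DataFrame
iris_sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal_width':  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'petal_length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'petal_width':  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    'species': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
                'Iris-versicolor', 'Iris-virginica', 'Iris-virginica'],
})

print(iris_sample.shape)       # (number of rows, number of columns)
print(iris_sample.head())      # first five rows
print(iris_sample.dtypes)      # data type of each column
print(iris_sample.describe())  # summary statistics of the numeric columns

# counts of each value in the species column
species_counts = iris_sample.species.value_counts()
print(species_counts)

# number of nulls in each column
null_counts = iris_sample.isnull().sum()
print(null_counts)
```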

Task 3

Use sorting, split-apply-combine, and/or visualization to look for differences between species.


# Sort the values in the petal_width column and display them



# Find the mean of sepal_length grouped by species

# Find the mean of all numeric columns grouped by species

# Get the describe information for all numeric columns grouped by species
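
The commented prompts above could be filled in along these lines. This is a minimal sketch on a six-row illustrative frame (`iris_sample` is an assumed stand-in for `iris`, and the result names are hypothetical); grouping by `species` excludes that column from the aggregation, so only numeric columns remain.

```python
import pandas as pd

# small illustrative sample standing in for the full iris DataFrame
iris_sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal_width':  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'petal_length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'petal_width':  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    'species': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
                'Iris-versicolor', 'Iris-virginica', 'Iris-virginica'],
})

# sort the values in the petal_width column and display them
sorted_widths = iris_sample.petal_width.sort_values()
print(sorted_widths.values)

# mean of sepal_length grouped by species
mean_sepal_length = iris_sample.groupby('species').sepal_length.mean()
print(mean_sepal_length)

# mean of all numeric columns grouped by species
means = iris_sample.groupby('species').mean()
print(means)

# describe information for all numeric columns grouped by species
summary = iris_sample.groupby('species').describe()
print(summary)
```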


# Generate a histogram of petal_width grouped by species
iris.hist(column="petal_width", by="species")


# Display a box plot of petal_width grouped by species
iris.boxplot(column="petal_width", by="species")

# Display box plot of all numeric columns grouped by species
iris.boxplot(by="species")
# note: iris.groupby("species").plot() is not a box plot; it draws line plots broken out by species

# map species to a numeric value so that plots can be colored by species
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})

# alternative method; the explicit mapping above is preferable because it also documents the integer codes
#iris['species_num'] = iris.species.factorize()[0]

# Generate a scatter plot of petal_length vs petal_width colored by species
iris.plot(kind="scatter", x="petal_length", y="petal_width", c="species_num", colormap="brg")

# Generate a scatter matrix of all features colored by species.  Make the figure size 12x10
pd.plotting.scatter_matrix(iris.drop("species_num", axis=1), c=iris.species_num, figsize=(12, 10))

Task 4

Decide on a set of rules that could be used to predict species based on iris measurements.

# Define a new feature that represents petal area ("feature engineering")
iris["petal_area"] = iris.petal_length * iris.petal_width

# Display a describe of petal_area grouped by species

# Display a box plot of petal_area grouped by species
iris.boxplot(column="petal_area", by="species")
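
A numeric view of the same comparison is sketched below: the engineered feature is summarised per species, and the per-species minimum and maximum show where ranges overlap. The six-row `iris_sample` frame and the names `area_summary` and `area_range` are assumptions for illustration.

```python
import pandas as pd

# small illustrative sample standing in for the full iris DataFrame
iris_sample = pd.DataFrame({
    'petal_length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'petal_width':  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    'species': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
                'Iris-versicolor', 'Iris-virginica', 'Iris-virginica'],
})

# engineered feature: petal area
iris_sample['petal_area'] = iris_sample.petal_length * iris_sample.petal_width

# describe of petal_area grouped by species
area_summary = iris_sample.groupby('species').petal_area.describe()
print(area_summary)

# smallest and largest petal_area per species, to compare ranges directly
area_range = iris_sample.groupby('species').petal_area.agg(['min', 'max'])
print(area_range)
```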

Predicting setosa will be straightforward, since every Iris-setosa petal_area is below 2 and the other species have petal_area values above 2. But what about Iris-versicolor and Iris-virginica? Some of their petal_area values overlap, so let's look at that overlap in more detail.

# Show only dataframe rows with a petal_area between 7 and 9
iris[(iris.petal_area > 7) & (iris.petal_area < 9)].sort_values('petal_area')

My set of rules for predicting species:

  • if petal_area < 2, predict "setosa"
  • else if petal_area < 7.5, predict "versicolor"
  • otherwise, predict "virginica"


Bonus

Define a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data, and check the accuracy of your predictions.

# Define a function that, given a row of data, returns a predicted species_num (0/1/2)
def classify_species(row):
    # petal area = petal_length * petal_width
    petal_area = row["petal_length"] * row["petal_width"]
    if petal_area < 2:
        prediction = "setosa"
    elif petal_area < 7.5:
        prediction = "versicolor"
    else:
        prediction = "virginica"
    # map the species names back to their integer codes
    factorize = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
    return factorize[prediction]

# Print the first row

sepal_length              4.9
sepal_width                 3
petal_length              1.4
petal_width               0.2
species           Iris-setosa
species_num                 0
y_pred_species         setosa
Name: 0, dtype: object

# Print the last row

sepal_length                 5.9
sepal_width                    3
petal_length                 5.1
petal_width                  1.8
species           Iris-virginica
species_num                    2
y_pred_species         verginica
Name: 148, dtype: object

# Test the function on the first and last rows
print(classify_species(iris.loc[0, :]))
print(classify_species(iris.loc[148, :]))


# Make predictions for all rows and store them in the DataFrame
iris["y_pred_species"] = [classify_species(row) for index, row in iris.iterrows()]

# Calculate the percentage of correct predictions
(iris.species_num == iris.y_pred_species).mean()
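
As a possible alternative to calling a function row by row, the same thresholds can be applied to whole columns at once, and a cross-tabulation shows which species are confused with each other. This is a sketch only: the six-row `sample` frame is an illustrative assumption, `np.select` is swapped in for the if/elif/else logic, and the accuracy computed here says nothing about accuracy on the real data.

```python
import numpy as np
import pandas as pd

# small illustrative sample standing in for the full iris DataFrame
sample = pd.DataFrame({
    'petal_length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'petal_width':  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    'species_num':  [0, 0, 1, 1, 2, 2],
})

area = sample.petal_length * sample.petal_width

# np.select uses the first condition that is True for each row,
# mirroring the if / elif / else order of the rules
sample['y_pred_species'] = np.select([area < 2, area < 7.5], [0, 1], default=2)

# fraction of rows where the prediction matches the true code
accuracy = (sample.species_num == sample.y_pred_species).mean()
print(accuracy)

# rows: true species code, columns: predicted species code
confusion = pd.crosstab(sample.species_num, sample.y_pred_species)
print(confusion)
```

Off-diagonal counts in the cross-tabulation correspond to misclassified rows, which is more informative than a single accuracy figure when two species overlap.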


