Exercise: "Human learning" with iris data

Question: Can you predict the species of an iris using petal and sepal measurements?

Read the iris data into a Pandas DataFrame, including column names.
Gather some basic information about the data.
Use sorting, split-apply-combine, and/or visualization to look for differences between species.
Write down a set of rules that could be used to predict species based on iris measurements.

BONUS: Define a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data, and check the accuracy of your predictions.



In [ ]:

    
import pandas as pd
import matplotlib.pyplot as plt

# display plots in the notebook
%matplotlib inline

# increase default figure and font sizes for easier viewing
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

Task 1

Read the iris data into a pandas DataFrame, including column names. Name the dataframe iris.



In [ ]:

    
# define a list of column names (as strings)
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# define the URL from which to retrieve the data (as a string)
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

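# retrieve the CSV file and add the column names.  Name the dataframe iris
# NOTE: iris.data has no header row, so header=0 treats the first record as a
# header and replaces it with col_names (149 rows load instead of 150);
# header=None would keep all 150 records.
iris = pd.read_csv(url, sep=",", names=col_names, header=0)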

print(iris)

Task 2

Gather some basic information about the data such as:

shape
head
data types of the columns
describe
counts of the values in the column species
count the nulls



In [57]:

    
iris.shape









    Out[57]:





(149, 7)



In [ ]:

    
iris.head()



In [ ]:

    
iris.dtypes



In [ ]:

    
iris.describe()



In [ ]:

    
iris.species.value_counts()



In [ ]:

    
iris.isnull().sum()

Task 3

Use sorting, split-apply-combine, and/or visualization to look for differences between species.

sorting



In [ ]:

    
# Sort the DataFrame by petal_width and display the sorted rows as a NumPy array

iris.sort_values("petal_width").values

split-apply-combine



In [ ]:

    
# Find the mean of sepal_length grouped by species
iris.groupby("species").sepal_length.mean()



In [ ]:

    
# Find the mean of all numeric columns grouped by species
iris.groupby("species").mean()



In [ ]:

    
# Get the describe information for all numeric columns grouped by species
iris.groupby("species").describe()

visualization



In [ ]:

    
# Generate a histogram of petal_width grouped by species
plt.style.use('bmh')
iris.hist(column="petal_width", by="species")

#iris.groupby('species').petal_width.plot(kind='hist')



In [ ]:

    
# Display a box plot of petal_width grouped by species
plt.style.use("fivethirtyeight")
iris.boxplot(column="petal_width", by="species")



In [ ]:

    
# Display box plot of all numeric columns grouped by species
#iris.groupby("species").plot(kind="boxplot") # not a box plot, but does give you line plots broken out by species
iris.boxplot(by='species')



In [62]:

    
# map species to a numeric value so that plots can be colored by species
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})

print(iris.species_num)
# alternative method, I like the mapping better since it also documents what integer mappings are
#iris['species_num'] = iris.species.factorize()[0]









    



0      0
1      0
2      0
3      0
4      0
5      0
6      0
7      0
8      0
9      0
10     0
11     0
12     0
13     0
14     0
15     0
16     0
17     0
18     0
19     0
20     0
21     0
22     0
23     0
24     0
25     0
26     0
27     0
28     0
29     0
      ..
119    2
120    2
121    2
122    2
123    2
124    2
125    2
126    2
127    2
128    2
129    2
130    2
131    2
132    2
133    2
134    2
135    2
136    2
137    2
138    2
139    2
140    2
141    2
142    2
143    2
144    2
145    2
146    2
147    2
148    2
Name: species_num, dtype: int64
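
The commented-out `factorize` alternative can still document its integer mapping, because `factorize()` returns the unique labels alongside the codes. The sketch below uses a short stand-in Series rather than the real `species` column; the `mapping` dictionary is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd

# A short stand-in for the species column (not the full iris data)
species = pd.Series(["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                     "Iris-virginica", "Iris-versicolor"])

# factorize() returns integer codes plus the unique labels;
# codes are assigned in order of first appearance
codes, uniques = species.factorize()

# Rebuild an explicit label -> code dictionary for documentation
mapping = {label: code for code, label in enumerate(uniques)}

print(codes)
print(mapping)
```

Because codes follow first appearance, this only matches the explicit `map` dictionary when the species first appear in the order setosa, versicolor, virginica.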



In [ ]:

    
# Generate a scatter plot of petal_length vs petal_width colored by species
iris.plot(kind="scatter", x="petal_length", y="petal_width", c="species_num", colormap="brg")



In [ ]:

    
# Generate a scatter matrix of all features colored by species.  Make the figure size 12x10
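# (scatter_matrix lives in pd.plotting; the top-level pd.scatter_matrix was removed from pandas)
pd.plotting.scatter_matrix(iris.drop("species_num", axis=1), c=iris.species_num, figsize=(12,10))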

Task 4

Decide on a set of rules that could be used to predict species based on iris measurements.



In [ ]:

    
# Define a new feature that represents petal area ("feature engineering")
iris["petal_area"] = iris.petal_length * iris.petal_width



In [ ]:

    
# Display a describe of petal_area grouped by species
# (in current pandas, describe() on a grouped column already returns one row per species)
iris.groupby("species").petal_area.describe()



In [ ]:

    
# Display a box plot of petal_area grouped by species
iris.boxplot(column="petal_area", by="species")

Predicting setosa is straightforward: every Iris-setosa petal_area is below 2, while every petal_area for the other two species is above 2. Iris-versicolor and Iris-virginica are harder to separate, because some of their petal_area values overlap. Let's look at that overlap in more detail.



In [ ]:

    
# Show only dataframe rows with a petal_area between 7 and 9
iris[(iris.petal_area > 7) & (iris.petal_area < 9)].sort_values('petal_area')
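
Another way to see the overlap is to compare the minimum and maximum petal_area of each species. This is a minimal sketch on a small hand-made DataFrame: the values in `toy` are illustrative and are not the real iris measurements, and `ranges_overlap` is a hypothetical helper. With the real data, the same `groupby(...).agg(...)` call should apply to `iris` directly.

```python
import pandas as pd

# Illustrative toy rows only (NOT the real iris measurements)
toy = pd.DataFrame({
    "species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
    "petal_length": [1.4, 1.6, 4.0, 5.0, 4.8, 6.0],
    "petal_width": [0.2, 0.4, 1.2, 1.6, 1.6, 2.2],
})
toy["petal_area"] = toy.petal_length * toy.petal_width

# Smallest and largest petal_area per species
ranges = toy.groupby("species").petal_area.agg(["min", "max"])
print(ranges)

def ranges_overlap(a, b):
    """Return True if the [min, max] petal_area ranges of species a and b intersect."""
    return bool(ranges.loc[a, "min"] <= ranges.loc[b, "max"]
                and ranges.loc[b, "min"] <= ranges.loc[a, "max"])

print(ranges_overlap("Iris-setosa", "Iris-versicolor"))
print(ranges_overlap("Iris-versicolor", "Iris-virginica"))
```

Species whose ranges do not intersect can be split with a single threshold; overlapping ranges mean any threshold will misclassify some rows.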

My set of rules for predicting species:

if petal_area < 2
then "setosa"
else if petal_area < 7.5
then "versicolor"
else "virginica"

Bonus

Define a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data, and check the accuracy of your predictions.



In [76]:

    
# Define a function that given a row of data, returns a predicted species_num (0/1/2)
def classify_species(row):
    # petal area = petal_length * petal_width (label-based lookup instead of positions 2 and 3)
    petal_area = row['petal_length'] * row['petal_width']
    
    if petal_area < 2:
        prediction = "setosa"
    elif petal_area < 7.5: 
        prediction = "versicolor"
    else: 
        prediction = "virginica"
    
    factorize = {'setosa':0, 'versicolor':1, 'virginica':2}    #need to map the strings back to their factors
    return factorize[prediction]



In [73]:

    
# Print the first row
iris.loc[0,:]









    Out[73]:





sepal_length              4.9
sepal_width                 3
petal_length              1.4
petal_width               0.2
species           Iris-setosa
species_num                 0
y_pred_species         setosa
Name: 0, dtype: object



In [74]:

    
# Print the last row
iris.loc[148,:]









    Out[74]:





sepal_length                 5.9
sepal_width                    3
petal_length                 5.1
petal_width                  1.8
species           Iris-virginica
species_num                    2
y_pred_species         verginica
Name: 148, dtype: object



In [77]:

    
# Test the function on the first and last rows
print(classify_species(iris.loc[0,:]))
print(classify_species(iris.loc[148,:]))

0
2



In [78]:

    
# Make predictions for all rows and store them in the DataFrame
iris["y_pred_species"] = [classify_species(row) for index, row in iris.iterrows()]
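
Row-by-row iteration works, but the same rules can also be applied to a whole column at once. The following is a sketch of one vectorized alternative using `pd.cut`, not the approach used in this notebook. The `petal_area` values are illustrative only, and the names `labels`, `predicted`, and `predicted_num` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Illustrative petal_area values only (not taken from the iris data)
petal_area = pd.Series([0.28, 2.0, 7.49, 7.5, 10.0])

# Bin edges mirror the rules: < 2, then < 7.5, else everything larger.
# right=False makes each bin closed on the left: [low, high)
labels = ["setosa", "versicolor", "virginica"]
predicted = pd.cut(petal_area, bins=[-np.inf, 2, 7.5, np.inf],
                   labels=labels, right=False)

# cat.codes yields 0/1/2 in the order given by `labels`
predicted_num = predicted.cat.codes

print(pd.DataFrame({"petal_area": petal_area,
                    "species": predicted,
                    "species_num": predicted_num}))
```

With `right=False`, a petal_area of exactly 2 or 7.5 falls into the higher bin, which matches the strict `<` comparisons in the rules.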



In [83]:

    
# Calculate the fraction of correct predictions (accuracy)
(iris.species_num == iris.y_pred_species).mean()









    Out[83]:





0.97315436241610742
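
A single accuracy number does not show which species get confused with each other. One way to see that is a cross-tabulation of actual against predicted labels. This is a minimal sketch with hypothetical label values, not the notebook's results. On the real data, the same call should work as `pd.crosstab(iris.species_num, iris.y_pred_species)`.

```python
import pandas as pd

# Hypothetical labels (0=setosa, 1=versicolor, 2=virginica), not real results
actual = pd.Series([0, 0, 1, 1, 1, 2, 2, 2], name="actual")
predicted = pd.Series([0, 0, 1, 1, 2, 2, 2, 1], name="predicted")

# Rows are actual species, columns are predicted species;
# off-diagonal cells count the misclassified rows
confusion = pd.crosstab(actual, predicted)
print(confusion)

# Accuracy is the fraction of rows where the two labels agree
accuracy = (actual == predicted).mean()
print(accuracy)
```

The diagonal of the table holds the correct predictions, so a perfect classifier would leave every off-diagonal cell at zero.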


