# Exercise: "Human learning" with iris data

Question: Can you predict the species of an iris using petal and sepal measurements?

1. Read the iris data into a Pandas DataFrame, including column names.
2. Gather some basic information about the data.
3. Use sorting, split-apply-combine, and/or visualization to look for differences between species.
4. Write down a set of rules that could be used to predict species based on iris measurements.

BONUS: Define a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data, and check the accuracy of your predictions.

``````

In [ ]:

import pandas as pd
import matplotlib.pyplot as plt

# display plots in the notebook
%matplotlib inline

# increase default figure and font sizes for easier viewing
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

``````

Read the iris data into a pandas DataFrame, including column names. Name the dataframe iris.

``````

In [ ]:

# define a list of column names (as strings)
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# define the URL from which to retrieve the data (as a string)
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

import pandas as pd

# retrieve the CSV file and add the column names.  Name the dataframe iris

print(iris)

``````

Gather some basic information about the data such as:

• shape
• data types of the columns
• describe
• counts of the values in the column species
• count the nulls
``````

In [57]:

iris.shape

``````
``````

Out[57]:

(149, 7)

``````
``````

In [ ]:

``````
``````

In [ ]:

iris.dtypes

``````
``````

In [ ]:

iris.describe()

``````
``````

In [ ]:

iris.species.value_counts()

``````
``````

In [ ]:

iris.isnull().sum()

``````

Use sorting, split-apply-combine, and/or visualization to look for differences between species.

### sorting

``````

In [ ]:

# Sort the values in the petal_width column and display them

iris.sort_values("petal_width").values

``````

### split-apply-combine

``````

In [ ]:

# Find the mean of sepal_length grouped by species
iris.groupby("species").sepal_length.mean()

``````
``````

In [ ]:

# Find the mean of all numeric columns grouped by species
iris.groupby("species").mean()

``````
``````

In [ ]:

# Get the describe information for all numeric columns grouped by species
iris.groupby("species").describe()

``````

### visualization

``````

In [ ]:

# Generate a histogram of petal_width grouped by species
plt.style.use('bmh')
iris.hist(column="petal_width", by="species")

#iris.groupby('species').petal_width.plot(kind='hist')

``````
``````

In [ ]:

# Display a box plot of petal_width grouped by species
plt.style.use("fivethirtyeight")
iris.boxplot(column="petal_width", by="species")

``````
``````

In [ ]:

# Display box plot of all numeric columns grouped by species
#iris.groupby("species").plot(kind="boxplot") #not a box plot but does givie you plots of lines broken out by species
iris.boxplot(by='species')

``````
``````

In [62]:

# map species to a numeric value so that plots can be colored by species
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})

print(iris.species_num)
# alternative method, I like the mapping better since it also documents what integer mappings are
#iris['species_num'] = iris.species.factorize()[0]

``````
``````

0      0
1      0
2      0
3      0
4      0
5      0
6      0
7      0
8      0
9      0
10     0
11     0
12     0
13     0
14     0
15     0
16     0
17     0
18     0
19     0
20     0
21     0
22     0
23     0
24     0
25     0
26     0
27     0
28     0
29     0
..
119    2
120    2
121    2
122    2
123    2
124    2
125    2
126    2
127    2
128    2
129    2
130    2
131    2
132    2
133    2
134    2
135    2
136    2
137    2
138    2
139    2
140    2
141    2
142    2
143    2
144    2
145    2
146    2
147    2
148    2
Name: species_num, dtype: int64

``````
``````

In [ ]:

# Generate a scatter plot of petal_length vs petal_width colored by species
iris.plot(kind="scatter", x="petal_length", y="petal_width", c="species_num", colormap="brg")

``````
``````

In [ ]:

# Generate a scatter matrix of all features colored by species.  Make the figure size 12x10
pd.scatter_matrix(iris.drop("species_num", axis=1), c=iris.species_num, figsize=(12,10))

``````

Decide on a set of rules that could be used to predict species based on iris measurements.

``````

In [ ]:

# Define a new feature that represents petal area ("feature engineering")
iris["petal_area"] = iris.petal_length * iris.petal_width

``````
``````

In [ ]:

# Display a describe of petal_area grouped by species
iris.groupby("species").petal_area.describe().unstack()

``````
``````

In [ ]:

# Display a box plot of petal_area grouped by species
iris.boxplot(column="petal_area", by="species")

``````

Predicting setosa will be straightforward since all our Iris-setosa pedal_areas are < 2 and the other Iris species have petal_areas larger than 2. But what about the petal_areas of Iris-versicolor and Iris-virginica? Some of their petal_area values overlap. Let's look at that overlap in more detail.

``````

In [ ]:

# Show only dataframe rows with a petal_area between 7 and 9
iris[(iris.petal_area > 7) & (iris.petal_area < 9)].sort_values('petal_area')

``````

My set of rules for predicting species:

• if petal_area < 2
• then "setsosa"
• elseif petal_area < 7.5
• then "versicolor"
• else "virginica"

## Bonus

Define a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data, and check the accuracy of your predictions.

``````

In [76]:

# Define a function that given a row of data, returns a predicted species_num (0/1/2)
def classify_species(row):
petal_area = (row[2] * row[3]) #define petal area, petal_length * petal_width

if petal_area < 2:
prediction = "setosa"
elif petal_area < 7.5:
prediction = "versicolor"
else:
prediction = "virginica"

factorize = {'setosa':0, 'versicolor':1, 'virginica':2}    #need to map the strings back to their factors
return factorize[prediction]

``````
``````

In [73]:

# Print the first row
iris.loc[0,:]

``````
``````

Out[73]:

sepal_length              4.9
sepal_width                 3
petal_length              1.4
petal_width               0.2
species           Iris-setosa
species_num                 0
y_pred_species         setosa
Name: 0, dtype: object

``````
``````

In [74]:

# Print the last row
iris.loc[148,:]

``````
``````

Out[74]:

sepal_length                 5.9
sepal_width                    3
petal_length                 5.1
petal_width                  1.8
species           Iris-virginica
species_num                    2
y_pred_species         verginica
Name: 148, dtype: object

``````
``````

In [77]:

# Test the function on the first and last rows
print classify_species(iris.loc[0,:])
print classify_species(iris.loc[148,:])

``````
``````

0
2

``````
``````

In [78]:

# Make predictions for all rows and store them in the DataFrame
iris["y_pred_species"] = [classify_species(row) for index, row in iris.iterrows()]

``````
``````

In [83]:

# Calculate the percentage of correct predictions
sum(iris.species_num == iris.y_pred_species) / 149.

``````
``````

Out[83]:

0.97315436241610742

``````
``````

In [ ]:

``````