Question: Can you predict the species of an iris using petal and sepal measurements?
BONUS: Define a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data, and check the accuracy of your predictions.
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
# display plots in the notebook
%matplotlib inline
# increase default figure and font sizes for easier viewing
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14
In [ ]:
# define a list of column names (as strings)
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
# define the URL from which to retrieve the data (as a string)
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
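# retrieve the CSV file and add the column names. Name the dataframe iris
# note: this file has no header row, so header=0 discards the first data row
# and leaves 149 rows; header=None would keep all 150
iris = pd.read_csv(url, names=col_names, header=0)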
print(iris)
In [57]:
iris.shape
Out[57]:
In [ ]:
iris.head()
In [ ]:
iris.dtypes
In [ ]:
iris.describe()
In [ ]:
iris.species.value_counts()
In [ ]:
iris.isnull().sum()
In [ ]:
# Sort the values in the petal_width column and display them
iris.petal_width.sort_values().values
In [ ]:
# Find the mean of sepal_length grouped by species
iris.groupby("species").sepal_length.mean()
In [ ]:
# Find the mean of all numeric columns grouped by species
iris.groupby("species").mean()
In [ ]:
# Get the describe information for all numeric columns grouped by species
iris.groupby("species").describe()
In [ ]:
# Generate a histogram of petal_width grouped by species
plt.style.use('bmh')
iris.hist(column="petal_width", by="species")
#iris.groupby('species').petal_width.plot(kind='hist')
In [ ]:
# Display a box plot of petal_width grouped by species
plt.style.use("fivethirtyeight")
iris.boxplot(column="petal_width", by="species")
In [ ]:
# Display box plot of all numeric columns grouped by species
#iris.groupby("species").plot(kind="boxplot") # not a box plot, but it does give you line plots broken out by species
iris.boxplot(by='species')
In [62]:
# map species to a numeric value so that plots can be colored by species
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})
print(iris.species_num)
# alternative method; I prefer the explicit mapping because it also documents which integer each species gets
#iris['species_num'] = iris.species.factorize()[0]
In [ ]:
# Generate a scatter plot of petal_length vs petal_width colored by species
iris.plot(kind="scatter", x="petal_length", y="petal_width", c="species_num", colormap="brg")
In [ ]:
# Generate a scatter matrix of all features colored by species. Make the figure size 12x10
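# scatter_matrix lives in pd.plotting (pandas 0.20+); the old top-level pd.scatter_matrix was removed
pd.plotting.scatter_matrix(iris.drop("species_num", axis=1), c=iris.species_num, figsize=(12,10))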
In [ ]:
# Define a new feature that represents petal area ("feature engineering")
iris["petal_area"] = iris.petal_length * iris.petal_width
In [ ]:
# Display a describe of petal_area grouped by species
# in pandas 0.20+ describe() on a grouped column already returns one row per species
iris.groupby("species").petal_area.describe()
In [ ]:
# Display a box plot of petal_area grouped by species
iris.boxplot(column="petal_area", by="species")
Predicting setosa will be straightforward, since every Iris-setosa petal_area is below 2 and the other two species have petal_area values above 2. But what about Iris-versicolor and Iris-virginica? Some of their petal_area values overlap. Let's look at that overlap in more detail.
In [ ]:
# Show only dataframe rows with a petal_area between 7 and 9
iris[(iris.petal_area > 7) & (iris.petal_area < 9)].sort_values('petal_area')
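One way to choose the versicolor/virginica cutoff instead of eyeballing the rows above is to try several candidates and keep the one with the fewest mistakes. This is only a sketch: the helper names `accuracy_at_cutoff` and `best_cutoff` are my own, and the toy values are made up for illustration and are not taken from the iris data.

```python
def accuracy_at_cutoff(areas, labels, cutoff):
    """Fraction classified correctly when area < cutoff means
    versicolor (1) and anything else means virginica (2)."""
    predictions = [1 if area < cutoff else 2 for area in areas]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / float(len(predictions))


def best_cutoff(areas, labels, candidates):
    """Return the candidate cutoff with the highest accuracy
    (the first one wins if several tie)."""
    return max(candidates, key=lambda c: accuracy_at_cutoff(areas, labels, c))


# Hypothetical toy values, NOT real iris measurements
toy_areas = [5.0, 6.5, 7.2, 8.0, 7.4, 8.5, 9.6, 11.0]
toy_labels = [1, 1, 1, 1, 2, 2, 2, 2]
toy_candidates = [7.0, 7.5, 8.0, 8.5, 9.0]

chosen = best_cutoff(toy_areas, toy_labels, toy_candidates)
print(chosen, accuracy_at_cutoff(toy_areas, toy_labels, chosen))

# With the notebook's DataFrame, something like this should work:
# subset = iris[iris.species_num > 0]
# best_cutoff(subset.petal_area, subset.species_num, [7 + 0.1 * i for i in range(21)])
```

Keep in mind that a cutoff tuned on the same rows it is scored on will look a bit better than it would on new flowers.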
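My set of rules for predicting species: if petal_area is less than 2, predict setosa; otherwise, if petal_area is less than 7.5, predict versicolor; otherwise, predict virginica.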
In [76]:
# Define a function that given a row of data, returns a predicted species_num (0/1/2)
def classify_species(row):
    # petal area = petal_length * petal_width, looked up by column name
    petal_area = row['petal_length'] * row['petal_width']
    if petal_area < 2:
        prediction = 'setosa'
    elif petal_area < 7.5:
        prediction = 'versicolor'
    else:
        prediction = 'virginica'
    # map the species name back to its species_num code
    factorize = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
    return factorize[prediction]
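The same two thresholds can also be applied to a whole column of petal areas at once rather than row by row. A minimal sketch, assuming `pd.cut` with half-open bins; the helper name `classify_by_petal_area` is mine, and the sample values are hypothetical ones picked to sit on and around the boundaries.

```python
import pandas as pd


def classify_by_petal_area(petal_area):
    """Map petal areas to species_num codes (0/1/2) in one step.

    right=False makes each bin [low, high), so the result matches
    the strict '<' comparisons: area < 2 -> 0, area < 7.5 -> 1, else 2.
    """
    bins = [-float("inf"), 2, 7.5, float("inf")]
    codes = pd.cut(petal_area, bins=bins, labels=[0, 1, 2], right=False)
    return codes.astype(int)


# Hypothetical areas chosen to test the boundaries, not real iris rows
sample_areas = pd.Series([0.28, 2.0, 7.49, 7.5, 12.0])
sample_codes = classify_by_petal_area(sample_areas)
print(sample_codes.tolist())
```

In the notebook this could be called as `classify_by_petal_area(iris.petal_length * iris.petal_width)`.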
In [73]:
# Print the first row
iris.loc[0,:]
Out[73]:
In [74]:
# Print the last row
iris.loc[148,:]
Out[74]:
In [77]:
# Test the function on the first and last rows
print(classify_species(iris.loc[0,:]))
print(classify_species(iris.loc[148,:]))
In [78]:
# Make predictions for all rows and store them in the DataFrame
iris["y_pred_species"] = [classify_species(row) for index, row in iris.iterrows()]
In [83]:
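# Calculate the fraction of correct predictions (accuracy)
# the mean of a boolean Series is the share of True values, so no hard-coded row count is needed
(iris.species_num == iris.y_pred_species).mean()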
Out[83]:
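A single accuracy number does not say which species get mixed up. A cross-tabulation of actual vs. predicted codes shows that. This is a sketch using `pd.crosstab`; the toy labels below are hypothetical and are not results from the iris data.

```python
import pandas as pd

# Hypothetical actual and predicted species codes, for illustration only
actual = pd.Series([0, 0, 1, 1, 1, 2, 2, 2], name="actual")
predicted = pd.Series([0, 0, 1, 1, 2, 2, 2, 1], name="predicted")

# Rows are the true species, columns are the predictions;
# everything off the diagonal is a mistake
confusion = pd.crosstab(actual, predicted)
print(confusion)

# Accuracy is the diagonal total divided by the number of rows
diagonal_total = sum(confusion.loc[k, k] for k in confusion.index)
toy_accuracy = diagonal_total / float(len(actual))
print(toy_accuracy)
```

With the notebook's columns, the equivalent call would be `pd.crosstab(iris.species_num, iris.y_pred_species)`.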