In [1]:
%matplotlib inline
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris = pd.read_csv(url, header=None, names=col_names)
In [2]:
iris.head()
Out[2]:
How did we (as humans) predict the species of an iris?
We assumed that if an unknown iris has measurements similar to those of previously seen irises, then it most likely belongs to the same species as those similar irises.
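This similarity idea can be made concrete with distances. As a rough sketch (not part of the original notebook), the single closest previously seen iris to an unknown flower can be found by computing Euclidean distances over the four measurements; the "unknown" values below are simply the hypothetical iris used later in this lesson:

import numpy as np
# hypothetical unknown iris: sepal_length, sepal_width, petal_length, petal_width
unknown = np.array([3, 5, 4, 2])
measurements = iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].values
# Euclidean distance from the unknown iris to every labeled iris
distances = np.sqrt(((measurements - unknown) ** 2).sum(axis=1))
# species of the single closest labeled iris (1-nearest-neighbor done by hand)
iris.species.iloc[distances.argmin()]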
In [3]:
import matplotlib.pyplot as plt
# increase default figure and font sizes for easier viewing
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['font.size'] = 14
# create a custom colormap
from matplotlib.colors import ListedColormap
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
In [4]:
# map each iris species to a number
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})
In [5]:
# box plot of all numeric columns grouped by species
iris.drop('species_num', axis=1).boxplot(by='species', rot=45)
Out[5]:
In [6]:
# create a scatter plot of PETAL LENGTH versus PETAL WIDTH and color by SPECIES
iris.plot(kind='scatter', x='petal_length', y='petal_width', c='species_num', colormap=cmap_bold)
Out[6]:
In [7]:
# create a scatter plot of SEPAL LENGTH versus SEPAL WIDTH and color by SPECIES
iris.plot(kind='scatter', x='sepal_length', y='sepal_width', c='species_num', colormap=cmap_bold)
Out[7]:
Question: What's the "best" value for K in this case?
Answer: The value which produces the most accurate predictions on unseen data. We want to create a model that generalizes!
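One way to answer that empirically, sketched here only as a preview of the evaluation ideas discussed later, is to score several candidate values of K with cross-validation; the X_k and y_k names are local to this sketch:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
X_k = iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y_k = iris['species']
# mean 5-fold cross-validated accuracy for a few candidate values of K
for k in [1, 5, 15, 50]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_k, y_k, cv=5)
    print(k, scores.mean())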
In [8]:
iris.head()
Out[8]:
In [9]:
# store feature matrix in "X"
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = iris[feature_cols]
In [10]:
# alternative ways to create "X"
X = iris.drop(['species', 'species_num'], axis=1)
X = iris.loc[:, 'sepal_length':'petal_width']
X = iris.iloc[:, 0:4]
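A quick sanity check (not in the original notebook) that the alternatives really do produce the same feature matrix:

# all four constructions should yield identical DataFrames
X_a = iris[feature_cols]
X_b = iris.drop(['species', 'species_num'], axis=1)
X_c = iris.loc[:, 'sepal_length':'petal_width']
X_d = iris.iloc[:, 0:4]
print(X_a.equals(X_b) and X_a.equals(X_c) and X_a.equals(X_d))  # should print True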
In [11]:
# store response vector in "y"
y = iris.species_num
In [12]:
# check X's type
print(type(X))
print(type(X.values))
In [13]:
# check y's type
print(type(y))
print(type(y.values))
In [14]:
# check X's shape (n = number of observations, p = number of features)
print(X.shape)
In [15]:
# check y's shape (single dimension with length n)
print(y.shape)
Step 1: Import the class you plan to use
In [16]:
from sklearn.neighbors import KNeighborsClassifier
Step 2: "Instantiate" the "estimator"
In [17]:
# make an instance of a KNeighborsClassifier object
knn = KNeighborsClassifier(n_neighbors=1)
type(knn)
Out[17]:
In [18]:
print(knn)
Step 3: Fit the model with data (aka "model training")
In [19]:
knn.fit(X, y)
Out[19]:
Step 4: Predict the response for a new observation
In [20]:
knn.predict([[3, 5, 4, 2]])
Out[20]:
In [21]:
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(X_new)
Out[21]:
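Depending on the scikit-learn version, passing plain Python lists to a model that was fitted on a DataFrame may emit a feature-names warning. One way to avoid it (a sketch; X_new_df is just a local name) is to wrap the new observations in a DataFrame with the same columns:

# give the new observations the same column names the model was fitted with
X_new_df = pd.DataFrame(X_new, columns=feature_cols)
knn.predict(X_new_df)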
In [22]:
# instantiate the model (using the value K=5)
knn = KNeighborsClassifier(n_neighbors=5)
# fit the model with data
knn.fit(X, y)
# predict the response for new observations
knn.predict(X_new)
Out[22]:
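The predictions come back as class numbers. They can be translated back to species names using the same ordering as the mapping in In [4] (a small convenience sketch; species_names is local to it):

# index 0/1/2 matches the species_num mapping defined earlier
species_names = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
pred = knn.predict(X_new)
print([species_names[p] for p in pred])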
Question: Which model produced the correct predictions for the two unknown irises?
Answer: We don't know, because these are out-of-sample observations, meaning that we don't know the true response values. Our goal with supervised learning is to build models that generalize to out-of-sample data. However, we can't truly measure how well our models will perform on out-of-sample data.
Question: Does that mean that we have to guess how well our models are likely to do?
Answer: Thankfully, no. In the next class, we'll discuss model evaluation procedures, which allow us to use our existing labeled data to estimate how well our models are likely to perform on out-of-sample data. These procedures will help us to tune our models and choose between different types of models.
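As a preview of those procedures (a sketch only, with arbitrary test_size and random_state values), one approach is to hold out part of the labeled data, fit on the rest, and measure accuracy on the held-out portion:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# hold out 25% of the labeled data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
knn_eval = KNeighborsClassifier(n_neighbors=5)
knn_eval.fit(X_train, y_train)
print(accuracy_score(y_test, knn_eval.predict(X_test)))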
In [23]:
# calculate predicted probabilities of class membership
knn.predict_proba(X_new)
Out[23]:
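For KNN, these probabilities are simply the fraction of the K nearest neighbors that belong to each class. That can be checked directly (a sketch, assuming the K=5 model fitted above) by inspecting the neighbors:

# indices of the 5 nearest training observations for each new iris
distances, indices = knn.kneighbors(X_new)
# class labels of those neighbors; the per-class vote fractions match predict_proba
print(y.values[indices])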
Advantages of KNN:
- Simple to understand and explain
- Model training is fast
- Can be used for classification or regression

Disadvantages of KNN:
- Must store all of the training data in order to make predictions
- Prediction can be slow when the number of observations is large
- Sensitive to irrelevant features and to the scale of the features
- Predictive accuracy is generally not competitive with the best supervised learning methods
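Because the distance calculation is sensitive to feature scale, a common remedy (shown here only as a sketch; scaled_knn is a local name) is to standardize the features before fitting, for example with a scikit-learn pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# standardize each feature to zero mean and unit variance before computing distances
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X, y)
scaled_knn.predict(X_new)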