In [33]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import mglearn
from IPython.display import display
%matplotlib inline
In [34]:
import numpy as np
x = np.array([[1,2,3],[4,5,6]])
print("x:\n{}".format(x))
In [35]:
from scipy import sparse
# Create a 2D NumPy array with a diagonal of ones, and zeros everywhere else (aka an identity matrix).
eye = np.eye(4)
print("NumPy array:\n{}".format(eye))
In [36]:
# Convert the NumPy array to a SciPy sparse matrix in CSR format.
# The CSR format stores a sparse m × n matrix M in row form using three (one-dimensional) arrays (A, IA, JA).
# Only the nonzero entries are stored.
# http://www.scipy-lectures.org/advanced/scipy_sparse/csr_matrix.html
sparse_matrix = sparse.csr_matrix(eye)
print("\nSciPy sparse CSR matrix:\n{}".format(sparse_matrix))
Usually it isn't possible to create dense representations of sparse data (they won't fit in memory), so we need to create sparse representations directly.
Here is a way to create the same sparse matrix as before using the COO format:
In [37]:
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
print("COO representation:\n{}".format(eye_coo))
More details on SciPy sparse matrices can be found in the SciPy Lecture Notes.
In [38]:
# %matplotlib inline -- the default, just displays the plot in the browser.
# %matplotlib notebook -- provides an interactive environment for the plot.
import matplotlib.pyplot as plt
# Generate a sequence of numbers from -10 to 10 with 100 steps (points) in between.
x = np.linspace(-10, 10, 100)
# Create a second array using sine.
y = np.sin(x)
# The plot function makes a line chart of one array against another.
plt.plot(x, y, marker="x")
plt.title("Simple line plot of a sine function using matplotlib")
plt.show()
Here is a small example of creating a pandas DataFrame using a Python dictionary.
In [39]:
import pandas as pd
from IPython.display import display
# Create a simple dataset of people
data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location': ["New York", "Paris", "Berlin", "London"],
        'Age': [24, 13, 53, 33]}
data_pandas = pd.DataFrame(data)
# IPython.display allows for "pretty printing" of dataframes in the Jupyter notebooks.
display(data_pandas)
There are several possible ways to query this table.
Here is one example:
In [40]:
# Select all rows that have an age column greater than 30:
display(data_pandas[data_pandas.Age > 30])
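The same rows can be selected in other ways as well; for example, with the query method or with loc and a boolean mask (equivalent alternatives, shown here only for comparison):
# Same selection using a query expression string.
display(data_pandas.query("Age > 30"))
# Same selection using .loc with an explicit boolean mask.
display(data_pandas.loc[data_pandas["Age"] > 30])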
The mglearn package is a library of utility functions written specifically for this book, so that the code listings don't become too cluttered with details of plotting and data loading.
The mglearn library can be found at the author's GitHub repository, and can be installed with the command pip install mglearn.
All of the code in this book will assume the following imports:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import mglearn
from IPython.display import display
We also assume that you will run the code in a Jupyter Notebook with the %matplotlib notebook or %matplotlib inline magic enabled to show plots.
If you are not using the notebook or these magic commands, you will have to call plt.show() to actually show any of the figures.
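For example, a minimal standalone script (no notebook magics assumed) would end with an explicit plt.show():
# Plotting from a plain Python script rather than a notebook.
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-10, 10, 100)
plt.plot(x, np.sin(x))
plt.show()  # opens the figure window; blocks until it is closed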
In [41]:
# Make sure your dependencies are similar to the ones in the book.
import sys
print("Python version: {}".format(sys.version))
import pandas as pd
print("pandas version: {}".format(pd.__version__))
import matplotlib
print("matplotlib version: {}".format(matplotlib.__version__))
import numpy as np
print("NumPy version: {}".format(np.__version__))
import scipy as sp
print("SciPy version: {}".format(sp.__version__))
import IPython
print("IPython version: {}".format(IPython.__version__))
import sklearn
print("scikit-learn version: {}".format(sklearn.__version__))
The data we will use for this example is the Iris dataset, which is a commonly used dataset in machine learning and statistics tutorials.
The Iris dataset is included in scikit-learn in the datasets module.
We can load it by calling the load_iris function.
In [42]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()
iris_dataset
Out[42]:
The iris_dataset object that is returned by load_iris is a Bunch object, which is very similar to a dictionary.
It contains keys and values:
In [43]:
print("Keys of iris_dataset: \n{}".format(iris_dataset.keys()))
The value of the key DESCR is a short description of the dataset.
In [44]:
print(iris_dataset['DESCR'][:193] + "\n...")
The value of the key target_names is an array of strings, containing the species of flower that we want to predict.
In [45]:
print("Target names: {}".format(iris_dataset['target_names']))
The value of feature_names is a list of strings, giving the description of each feature:
In [46]:
print("Feature names: \n{}".format(iris_dataset['feature_names']))
The data itself is contained in the target and data fields.
data contains the numeric measurements of sepal length, sepal width, petal length, and petal width in a NumPy array:
In [47]:
print("Type of data: {}".format(type(iris_dataset['data'])))
The rows in the data array correspond to flowers, while the columns represent the four measurements that were taken for each flower.
In [48]:
print("Shape of data: {}".format(iris_dataset['data'].shape))
The shape of the data array is the number of samples (flowers) multiplied by the number of features (properties, e.g. sepal width).
Here are the feature values for the first five samples:
In [49]:
print("First five rows of data:\n{}".format(iris_dataset['data'][:5]))
The data tells us that all of the first five flowers have a petal width of 0.2 cm and that the first flower has the longest sepal, at 5.1 cm.
The target array contains the species of each of the flowers that were measured, also as a NumPy array:
In [50]:
print("Type of target: {}".format(type(iris_dataset['target'])))
target is a one-dimensional array, with one entry per flower:
In [51]:
print("Shape of target: {}".format(iris_dataset['target'].shape))
The species are encoded as integers from 0 to 2:
In [52]:
print("Target:\n{}".format(iris_dataset['target']))
The meanings of the numbers are given by the iris_dataset['target_names'] array:
0 means setosa, 1 means versicolor, and 2 means virginica.
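As a quick sanity check, NumPy's integer (fancy) indexing can decode the targets back into species names (assuming iris_dataset is loaded as above):
# Use the integer targets as indices into the target_names array.
decoded = iris_dataset['target_names'][iris_dataset['target']]
print("First five decoded targets: {}".format(decoded[:5]))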
We want to build a machine learning model from this data that can predict the species of iris for a new set of measurements.
To assess the model's performance, we show it new data for which we have labels.
This is usually done by splitting the labeled data into training data and test data.
scikit-learn contains a function called train_test_split that shuffles the data and splits it for you (the default is 75% train and 25% test).
In scikit-learn, data is usually denoted with a capital X, while labels are denoted by a lowercase y.
This is inspired by the standard formulation f(x)=y in mathematics, where x is the input to a function and y is the output.
Following more conventions from mathematics, we use a capital X because the data is a two-dimensional array (a matrix) and a lowercase y because the target is a one-dimensional array (a vector).
Let's call train_test_split on our data and assign the outputs using this nomenclature:
In [53]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
iris_dataset['data'], iris_dataset['target'], random_state=0)
# The random_state parameter gives the pseudorandom number generator a fixed (set) seed.
# Setting the seed allows us to obtain reproducible results from randomized procedures.
print("X_train: \n{}".format(X_train))
print("X_test: \n{}".format(X_test))
print("y_train: \n{}".format(y_train))
print("y_test: \n{}".format(y_test))
The output of the train_test_split function is X_train, X_test, y_train, and y_test, which are all NumPy arrays.
X_train contains 75% of the rows in the dataset, and X_test contains the remaining 25%.
In [54]:
print("X_train shape: \n{}".format(X_train.shape))
print("y_train shape: \n{}".format(y_train.shape))
In [55]:
print("X_test shape: \n{}".format(X_test.shape))
print("y_test shape: \n{}".format(y_test.shape))
Before building a machine learning model, it is often a good idea to inspect the data.
One of the best ways to inspect data is to visualize it.
In the example below we will be building a type of scatter plot known as a pair plot.
The data points are colored according to the species the iris belongs to.
To create the plot, we first convert the NumPy array into a pandas DataFrame.
pandas has a function to create pair plots called scatter_matrix.
The diagonal of this matrix is filled with histograms of each feature.
In [56]:
# Create dataframe from data in X_train.
# Label the columns using the strings in iris_dataset.feature_names.
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# Create a scatter matrix from the dataframe, color by y_train.
pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)
Out[56]:
From the plots, we can see that the three classes seem to be relatively well separated using the sepal and petal measurements.
This means that a machine learning model will likely be able to learn to separate them.
There are many classification algorithms in scikit-learn that we could use; here we're going to use the k-nearest neighbors classifier.
The k in k-nearest neighbors refers to the number of nearest neighbors that are used to predict the label of a new data point.
We can consider any fixed number k of neighbors; the default for sklearn.neighbors.KNeighborsClassifier is 5, but we're going to keep things simple and use k=1.
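To make the idea concrete, here is a minimal NumPy sketch of a 1-nearest-neighbor prediction; this illustrates the concept only, and is not how scikit-learn implements it internally:
# Predict the label of one new point by copying the label of the
# closest training point (Euclidean distance), i.e. k = 1.
def predict_1nn(X_train, y_train, x_new):
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    return y_train[np.argmin(distances)]
# e.g. predict_1nn(X_train, y_train, np.array([5, 2.9, 1, 0.2]))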
All machine learning models in scikit-learn are implemented in their own classes, which are called Estimator classes.
The k-nearest neighbors classification algorithm is implemented in the KNeighborsClassifier class in the neighbors module.
More information about nearest neighbors classification, along with a worked example, can be found in the scikit-learn documentation.
Before we can use the model, we need to instantiate the class into an object.
This is when we will set any parameters of the model, the most important of which is the number of neighbors, which we will set to 1:
In [57]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
The knn object encapsulates the algorithm that will be used to build the model from the training data, as well as the algorithm to make predictions on new data points.
It will also hold the information that the algorithm has extracted from the training data.
In the case of KNeighborsClassifier, it will just store the training set.
To build the model on the training set, we call the fit method of the knn object, which takes as arguments the NumPy array X_train containing the training data and the NumPy array y_train of the corresponding training labels:
In [58]:
knn.fit(X_train, y_train)
Out[58]:
The fit method returns the knn object itself (and modifies it in place), so we get a string representation of our classifier.
The representation shows us which parameters were used in creating the model.
Nearly all of them are the default values, but you can also find n_neighbors=1, which is the parameter that we passed.
Most models in scikit-learn have many parameters, but the majority of them are either speed optimizations or for very special use cases.
The important parameters will be covered in Chapter 2.
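If you want to list every parameter of an estimator programmatically rather than reading the string representation, scikit-learn estimators provide a get_params method:
# List all parameters of the knn estimator as a dictionary.
print(knn.get_params())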
Now we can make predictions using this model on new data which isn't labeled.
Let's use an example iris with a sepal length of 5 cm, sepal width of 2.9 cm, petal length of 1 cm, and petal width of 0.2 cm.
We can put this data into a NumPy array with shape (1, 4), that is, the number of samples (1) times the number of features (4):
In [59]:
X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape: \n{}".format(X_new.shape))
Note that we made the measurements of this single flower into a row in a two-dimensional NumPy array.
scikit-learn always expects two-dimensional arrays for the data.
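If your measurements start out as a one-dimensional array, a common idiom is reshape(1, -1), which adds the required sample dimension (a small sketch, equivalent to the cell above):
# A single flower as a 1-D array of four measurements...
measurements = np.array([5, 2.9, 1, 0.2])
# ...reshaped into the (1, n_features) 2-D array scikit-learn expects.
X_new = measurements.reshape(1, -1)
print("X_new.shape: {}".format(X_new.shape))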
Now, to make a prediction, we call the predict method of the knn object:
In [60]:
prediction = knn.predict(X_new)
print("Prediction: \n{}".format(prediction))
print("Predicted target name: \n{}".format(
iris_dataset['target_names'][prediction]))
Our model predicts that this new iris belongs to class 0, meaning its species is setosa.
How do we know whether we can trust our model?
We don't know the correct species of this sample, which is the whole point of building the model.
This is where the test set that we created earlier comes into play.
The test data wasn't used to build the model, but we do know what the correct species is for each iris in the test set.
Therefore, we can make a prediction for each iris in the test data and compare it against its label (the known species).
We can measure how well the model works by computing the accuracy, which is the fraction of flowers for which the correct species was predicted:
In [61]:
y_pred = knn.predict(X_test)
print("Test set predictions: \n{}".format(y_pred))
In [62]:
print("Test set score: \n{:.2f}".format(np.mean(y_pred == y_test)))
We can also use the score method of the knn object, which will compute the test set accuracy for us:
In [63]:
print("Test set score: \n{:.2f}".format(knn.score(X_test, y_test)))
For this model, the test set accuracy is about 0.97, which means that we made the correct prediction for 97% of the irises in the test set.
In later chapters we will discuss how we can improve performance, and what caveats there are in tuning a model.
Here is a summary of the code needed for the whole training and evaluation procedure:
In [64]:
X_train, X_test, y_train, y_test = train_test_split(
iris_dataset['data'], iris_dataset['target'], random_state=0)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print("Test set score: \n{:.2f}".format(knn.score(X_test, y_test)))
This snippet contains the core code for applying any machine learning algorithm using scikit-learn.
The fit, predict, and score methods are the common interface to supervised models in scikit-learn, and with the concepts introduced in this chapter, you can apply these models to many machine learning tasks.
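Because the interface is shared, swapping in a different model usually changes only the line that instantiates the estimator; here is a sketch using LogisticRegression as an example:
# The same fit/score workflow with a different estimator.
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print("Test set score: {:.2f}".format(logreg.score(X_test, y_test)))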
In the next chapter, we will go into more depth about the different kinds of supervised models in scikit-learn and how to apply them successfully.