This notebook contains an excerpt from the Python Data Science Handbook by Jake VanderPlas; the content is available on GitHub.
The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!
Python machine learning
Once you understand the basic use and syntax of Scikit-Learn for one type of model, switching to a new model or algorithm is very straightforward.
A solid understanding of these API elements will form the foundation for understanding the deeper practical discussion of machine learning algorithms and approaches.
This section provides an overview of the Scikit-Learn API.
Machine learning is about creating models from data, so we begin by discussing how data can be represented in a form that Scikit-Learn understands. The best way to think about data within Scikit-Learn is in terms of tables. A basic table is a two-dimensional grid of data, in which the rows represent individual elements of the dataset and the columns represent quantities related to each of those elements.
For example, consider the Iris dataset, famously analyzed by Ronald Fisher in 1936.
We can download this dataset in the form of a Pandas DataFrame
using the seaborn library:
In [39]:
import seaborn as sns
sns.set_context("talk", font_scale=1.5)
iris = sns.load_dataset('iris')
iris.head()
Out[39]:
The number of rows is referred to as n_samples, and the number of columns as n_features.

This table layout means the information can be thought of as a two-dimensional numerical array or matrix, which we will call the features matrix. By convention, this features matrix is stored in a variable named X, has shape [n_samples, n_features], and is most often contained in a NumPy array or a Pandas DataFrame.
The samples (i.e., rows) always refer to the individual objects described by the dataset.
The features (i.e., columns) always refer to the distinct observations that describe each sample in a quantitative manner.
In addition to the feature matrix X, we also generally work with a label or target array, which by convention we will usually call y. The target array is usually one-dimensional, with length n_samples, and is generally contained in a NumPy array or a Pandas Series. While some Scikit-Learn estimators do handle multiple target values in the form of a two-dimensional [n_samples, n_targets] target array, we will primarily be working with the common case of a one-dimensional target array.
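As a minimal illustrative sketch of these conventions (using small synthetic arrays rather than the Iris data, so the names here are hypothetical), the shapes line up as follows:

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(5, 3)              # features matrix: 5 samples, 3 features -> shape [n_samples, n_features]
y = rng.randint(0, 2, size=5)   # target array: one label per sample -> length n_samples

print(X.shape)   # (5, 3)
print(y.shape)   # (5,)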
The target array is usually the quantity we want to predict from the data: in statistical terms, it is the dependent variable.
For example, in the preceding data we may wish to construct a model that can predict the species of flower based on the other measurements; in this case, the species column would be considered the target array.
With this target array in mind, we can use Seaborn (see Visualization With Seaborn) to conveniently visualize the data:
In [95]:
%matplotlib inline
import seaborn as sns; sns.set()
sns.set_context("talk", font_scale=1)
sns.pairplot(iris, hue='species', size=1.5);  # newer seaborn releases use height= instead of size=
For use in Scikit-Learn, we will extract the features matrix and target array from the DataFrame, which we can do using some of the Pandas DataFrame operations discussed in Chapter 3:
In [40]:
X_iris = iris.drop('species', axis=1)
X_iris.shape
Out[40]:
(150, 4)
In [41]:
X_iris[:3]
Out[41]:
In [42]:
y_iris = iris['species']
y_iris.shape
Out[42]:
(150,)
In [9]:
y_iris[:3]
Out[9]:
To summarize, the expected layout of features and target values is visualized in the following diagram:
With this data properly formatted, we can move on to consider the estimator API of Scikit-Learn:
Guiding principles outlined in the Scikit-Learn API paper:
Consistency: All objects share a common interface drawn from a limited set of methods, with consistent documentation.
Inspection: All specified parameter values are exposed as public attributes.
Limited object hierarchy: Only algorithms are represented by Python classes; datasets are represented in standard formats (NumPy arrays, Pandas DataFrames, SciPy sparse matrices) and parameter names use standard Python strings.
Composition: Many machine learning tasks can be expressed as sequences of more fundamental algorithms, and Scikit-Learn makes use of this wherever possible.
Sensible defaults: When models require user-specified parameters, the library defines an appropriate default value.
In practice, these principles make Scikit-Learn very easy to use, once the basic principles are understood.
Most commonly, the steps in using the Scikit-Learn estimator API are as follows:
1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with the desired values.
3. Arrange the data into a features matrix and target vector, following the discussion above.
4. Fit the model to your data by calling the fit() method of the model instance.
5. Apply the model to new data: for supervised learning, we often predict labels for unknown data using the predict() method; for unsupervised learning, we often transform or infer properties of the data using the transform() or predict() method.
We will now step through several simple examples of applying supervised and unsupervised learning methods.
In [98]:
import matplotlib.pyplot as plt
import numpy as np
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
plt.scatter(x, y)
plt.xlabel('x', fontsize=30)
plt.ylabel('y', fontsize=30);
With this data in place, we can use the recipe outlined earlier. Let's walk through the process:
In [44]:
from sklearn.linear_model import LinearRegression
Note that other more general linear regression models exist as well; you can read more about them in the sklearn.linear_model
module documentation.
An important point is that a class of model is not the same as an instance of a model.
Once we have decided on our model class, there are still some options open to us. Depending on the model class we are working with, we might need to answer one or more questions, such as whether to fit for the intercept, whether to normalize the data, or how many model components to use. These are examples of the important choices that must be made once the model class is selected.
We will explore how you can quantitatively motivate the choice of hyperparameters in Hyperparameters and Model Validation.
For our linear regression example, we can instantiate the LinearRegression
class and specify that we would like to fit the intercept using the fit_intercept
hyperparameter:
In [47]:
model = LinearRegression(fit_intercept=True)
model
#help(LinearRegression)
Out[47]:
Keep in mind that when the model is instantiated, the only action is the storing of these hyperparameter values.
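As a quick supplementary check of this point (not part of the original text), inspecting a freshly instantiated model shows only the stored hyperparameters, with no fitted attributes yet:

from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)

# Only the hyperparameters are stored at this stage; fitted attributes
# such as coef_ and intercept_ do not exist until fit() is called.
print(model.get_params())        # the stored hyperparameter values
print(hasattr(model, 'coef_'))   # False before fitting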
Previously we detailed the Scikit-Learn data representation, which requires a two-dimensional features matrix and a one-dimensional target array.
Here our target variable y is already in the correct form (a length-n_samples array), but we need to massage the data x into a matrix of size [n_samples, n_features]. In this case, this amounts to a simple reshaping of the one-dimensional array:
In [49]:
X = x[:, np.newaxis]
X.shape
Out[49]:
In [50]:
model.fit(X, y)
Out[50]:
This fit() command causes a number of model-dependent internal computations to take place, and the results of these computations are stored in model-specific attributes that the user can explore. In Scikit-Learn, by convention all model parameters that were learned during the fit() process have trailing underscores; for example, in this linear model we have the following:
In [16]:
# This parameter holds the slope of the simple linear fit to the data.
model.coef_
Out[16]:
In [17]:
# This parameter holds the intercept of the simple linear fit to the data.
model.intercept_
Out[17]:
Comparing with the data definition, we see that the fitted parameters are very close to the input slope of 2 and intercept of -1.
One question that frequently comes up regards the uncertainty in such internal model parameters.
In general, Scikit-Learn does not provide tools to draw conclusions from internal model parameters themselves: interpreting model parameters is much more a statistical modeling question than a machine learning question.
If you would like to dive into the meaning of fit parameters within the model, other tools are available, including the Statsmodels Python package.
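For example, here is a minimal sketch (assuming the statsmodels package is installed) of how one might obtain standard errors and confidence intervals for the same linear fit:

import numpy as np
import statsmodels.api as sm

# Recreate the synthetic data used above
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)

# Ordinary least squares with an explicit intercept column
results = sm.OLS(y, sm.add_constant(x)).fit()
print(results.summary())   # reports standard errors and confidence intervals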
Once the model is trained, the main task of supervised machine learning is to evaluate it based on what it says about new data that was not part of the training set.
In Scikit-Learn, this can be done using the predict()
method.
For the sake of this example, our "new data" will be a grid of x values, and we will ask what y values the model predicts:
In [51]:
xfit = np.linspace(-1, 11)
xfit
Out[51]:
As before, we need to coerce these x values into a [n_samples, n_features]
features matrix, after which we can feed it to the model:
In [52]:
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)
Finally, let's visualize the results by plotting first the raw data, and then this model fit:
In [20]:
plt.scatter(x, y)
plt.plot(xfit, yfit);
Typically the efficacy of the model is evaluated by comparing its results to some known baseline, as we will see in the next example.
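As one quick quantitative check (a supplementary note, not from the original text), Scikit-Learn regressors also provide a score() method returning the coefficient of determination R² on the given data; here it is computed on the training data from the cells above:

# Assumes model, X, and y from the cells above.
# R^2 near 1 indicates the line explains most of the variance.
print(model.score(X, y))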
Question: given a model trained on a portion of the Iris data, how well can we predict the remaining labels?
For this task, we will use an extremely simple generative model known as Gaussian naive Bayes, which proceeds by assuming each class is drawn from an axis-aligned Gaussian distribution. Because it is fast and has no hyperparameters to choose, Gaussian naive Bayes is often a good baseline for classification, before exploring whether improvements can be found through more sophisticated models.
In [53]:
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris,
                                                random_state=1)
With the data arranged, we can follow our recipe to predict the labels:
In [59]:
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB() # 2. instantiate model
model.fit(Xtrain, ytrain) # 3. fit model to data
y_model = model.predict(Xtest) # 4. predict on new data
Finally, we can use the accuracy_score
utility to see the fraction of predicted labels that match their true value:
In [60]:
print(*zip(ytest, y_model))
In [61]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)
Out[61]:
With an accuracy topping 97%, we see that even this very naive classification algorithm is effective for this particular dataset!
As an example of an unsupervised learning problem, let's look at reducing the dimensionality of the Iris data so as to visualize it more easily.
The task of dimensionality reduction is to ask whether there is a suitable lower-dimensional representation that retains the essential features of the data.
Dimensionality reduction is often used as an aid to visualizing data:
Here we will use principal component analysis (PCA; see In Depth: Principal Component Analysis), a fast linear dimensionality reduction technique. We will ask the model to return two components, that is, a two-dimensional representation of the data.
In [62]:
from sklearn.decomposition import PCA # 1. Choose the model class
model = PCA(n_components=2) # 2. Instantiate the model with hyperparameters
model.fit(X_iris) # 3. Fit to data. Notice y is not specified!
X_2D = model.transform(X_iris) # 4. Transform the data to two dimensions
To plot the results, let's insert them into the original Iris DataFrame and use Seaborn's lmplot to show them:
In [100]:
sns.set_context("talk", font_scale=1.5)
iris['PCA1'] = X_2D[:, 0]
iris['PCA2'] = X_2D[:, 1]
sns.lmplot(data=iris, x="PCA1", y="PCA2", hue='species', fit_reg=False);  # keyword arguments required in newer seaborn
In the two-dimensional representation, the species are fairly well separated, even though the PCA algorithm had no knowledge of the species labels!
This suggests that a relatively straightforward classification algorithm will probably be effective on this dataset.
Let's next look at applying clustering to the Iris data.
A clustering algorithm attempts to find distinct groups of data without reference to any labels.
We will use a powerful clustering method called a Gaussian mixture model (GMM), which attempts to model the data as a collection of Gaussian blobs.
We can fit the Gaussian mixture model as follows:
In [64]:
from sklearn.mixture import GaussianMixture       # 1. Choose the model class (named GMM in older scikit-learn)
model = GaussianMixture(n_components=3,
                        covariance_type='full')   # 2. Instantiate the model with hyperparameters
model.fit(X_iris)                                 # 3. Fit to data. Notice y is not specified!
y_gmm = model.predict(X_iris)                     # 4. Determine cluster labels
As before, we will add the cluster label to the Iris DataFrame and use Seaborn to plot the results:
In [78]:
sns.plotting_context()
Out[78]:
In [101]:
sns.set_context("talk", font_scale=1.5)
iris['cluster'] = y_gmm
sns.lmplot(data=iris, x="PCA1", y="PCA2", hue='species',
           col='cluster', fit_reg=False);
By splitting the data by cluster number, we see how well the GMM algorithm has recovered the underlying labels, even though it was given no label information.
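As a quick supplementary check (not in the original text), cross-tabulating the cluster assignments against the true species makes this recovery explicit:

import pandas as pd

# Assumes the 'cluster' and 'species' columns added in the cells above
print(pd.crosstab(iris['species'], iris['cluster']))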
To demonstrate these principles on a more interesting problem, let's consider one piece of the optical character recognition problem: identifying hand-written digits. In the wild, this problem involves both locating and identifying characters in an image.
Here we'll take a shortcut and use Scikit-Learn's set of pre-formatted digits, which is built into the library.
In [28]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.images.shape
Out[28]:
(1797, 8, 8)
The images data is a three-dimensional array: 1,797 samples, each consisting of an 8×8 grid of pixels.
Let's visualize the first hundred of these:
In [102]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))

for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')
In order to work with this data within Scikit-Learn, we need a two-dimensional [n_samples, n_features] representation, which we obtain by treating each pixel in the image as a feature and flattening each 8×8 image into a length-64 array. The features and targets are available as the data and target attributes of the digits dataset, respectively:
In [107]:
X = digits.data
X.shape
Out[107]:
(1797, 64)
In [108]:
X
Out[108]:
In [110]:
y = digits.target
y.shape
Out[110]:
(1797,)
In [111]:
y
Out[111]:
We see here that there are 1,797 samples and 64 features.
We'd like to visualize our points within the 64-dimensional parameter space, but it is difficult to effectively visualize points in such a high-dimensional space; instead, we will reduce the dimensionality to two using an unsupervised method.
Here, we'll make use of a manifold learning algorithm called Isomap (see In-Depth: Manifold Learning), and transform the data to two dimensions:
In [32]:
from sklearn.manifold import Isomap
iso = Isomap(n_components=2)
iso.fit(digits.data)
data_projected = iso.transform(digits.data)
data_projected.shape
Out[32]:
(1797, 2)
We see that the projected data is now two-dimensional. Let's plot this data to see if we can learn anything from its structure:
In [112]:
plt.scatter(data_projected[:, 0], data_projected[:, 1], c=digits.target,
edgecolor='none', alpha=0.5,
cmap=plt.cm.get_cmap('nipy_spectral', 10))
plt.colorbar(label='digit label', ticks=range(10))
plt.clim(-0.5, 9.5);
This plot gives us some good intuition into how well the various digits are separated in the larger 64-dimensional space. Overall, the different groups appear to be fairly well separated in the parameter space, which suggests that even a very straightforward supervised classification algorithm should perform suitably on this data. Let's give it a try.
In [117]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
In [118]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(Xtrain, ytrain)
y_model = model.predict(Xtest)
Now that the model has made its predictions, we can gauge its accuracy by comparing the true values of the test set to the predictions:
In [119]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)
Out[119]:
With even this extremely simple model, we find about 80% accuracy for classification of the digits!
However, this single number doesn't tell us where we've gone wrong; one nice way to explore this is to use the confusion matrix, which we can compute with Scikit-Learn and plot with Seaborn:
In [123]:
from sklearn.metrics import confusion_matrix
sns.set_context("notebook", font_scale=1.7)
mat = confusion_matrix(ytest, y_model)
sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value');
This shows us where the mislabeled points tend to be.
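As a small supplementary step (using the mat array computed in the cell above), we can also extract the single most frequent confusion programmatically:

import numpy as np

off_diag = mat.copy()
np.fill_diagonal(off_diag, 0)    # ignore the correct predictions on the diagonal
true_label, pred_label = np.unravel_index(off_diag.argmax(), off_diag.shape)
print("most frequent confusion: true", true_label, "predicted as", pred_label,
      "(", off_diag[true_label, pred_label], "times)")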
Another way to gain intuition into the characteristics of the model is to plot the inputs again with their predicted labels, using green for correct labels and red for incorrect ones:
In [124]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))

test_images = Xtest.reshape(-1, 8, 8)

for i, ax in enumerate(axes.flat):
    ax.imshow(test_images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(y_model[i]),
            transform=ax.transAxes,
            color='green' if (ytest[i] == y_model[i]) else 'red')
Examining this subset of the data, we can gain insight regarding where the algorithm might be not performing optimally.
To go beyond our 80% classification rate, we might move to a more sophisticated algorithm such as support vector machines, random forests, or another classification approach.
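As an illustrative sketch (not part of the original text), the very same import/instantiate/fit/predict pattern carries over to a more sophisticated model such as a random forest, reusing the Xtrain/Xtest split from above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

model = RandomForestClassifier(n_estimators=100, random_state=0)   # 1-2. choose model class & instantiate
model.fit(Xtrain, ytrain)                                          # 3. fit model to data
y_model = model.predict(Xtest)                                     # 4. predict on new data
print(accuracy_score(ytest, y_model))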
In this section we have covered the essential features of the Scikit-Learn data representation and the estimator API.
Regardless of the type of estimator, the same import/instantiate/fit/predict pattern holds.
Armed with this information about the estimator API, you can explore the Scikit-Learn documentation and begin trying out various models on your data.
In the next section, we will explore perhaps the most important topic in machine learning: how to select and validate your model.