This notebook was put together by [Jake Vanderplas](http://www.vanderplas.com) for PyCon 2014. Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_pycon2014/).
When using scikit-learn, it is important to have a handle on how data are represented.
By the end of this section you should:

- know the internal data representation of scikit-learn,
- know how to use scikit-learn's dataset loaders to load example data, and
- know how to visualize that data with matplotlib.
Machine learning is about creating models from data: for that reason, we'll start by discussing how data can be represented in order to be understood by the computer. Along with this, we'll build on our matplotlib examples from the previous section and show some examples of how to visualize data.
Most machine learning algorithms implemented in scikit-learn expect data to be stored in a two-dimensional array or matrix. The arrays can be either numpy arrays, or in some cases scipy.sparse matrices.
The size of the array is expected to be `[n_samples, n_features]`:

- **n_samples**: the number of samples. Each sample is an item to process (e.g. to classify); it might be a document, an image, a sound clip, or a row in a database or CSV file.
- **n_features**: the number of features, i.e. the distinct traits used to describe each sample in a quantitative manner.

The number of features must be fixed in advance. However, it can be very high-dimensional (e.g. millions of features), with most of them being zeros for a given sample. This is a case where `scipy.sparse` matrices can be useful, in that they are much more memory-efficient than numpy arrays.
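To get a feel for the savings, here is a minimal sketch (the shapes and values are invented for illustration, and assume scipy is installed):

import numpy as np
from scipy import sparse

# a mostly-zero matrix: 1000 hypothetical samples, 10000 hypothetical features
X_dense = np.zeros((1000, 10000))
X_dense[0, 0] = 1.0  # a single nonzero entry

X_sparse = sparse.csr_matrix(X_dense)

print(X_dense.nbytes)  # 80000000 bytes: every entry is stored explicitly
# the sparse version stores only the nonzero entry plus index bookkeeping
print(X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes)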
Data in scikit-learn is represented as a feature matrix and a label vector:

$$
{\rm feature~matrix:~~~} {\bf X} = \left[
\begin{matrix}
x_{11} & x_{12} & \cdots & x_{1D} \\
x_{21} & x_{22} & \cdots & x_{2D} \\
x_{31} & x_{32} & \cdots & x_{3D} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \cdots & x_{ND} \\
\end{matrix}
\right]
$$

$$
{\rm label~vector:~~~} {\bf y} = [y_1, y_2, y_3, \cdots, y_N]
$$

Here there are $N$ samples and $D$ features.
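In numpy terms, this is simply a 2D array paired with a 1D array. A minimal sketch, with made-up values:

import numpy as np

# a toy feature matrix with N = 3 samples and D = 2 features
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [4.7, 3.2]])
# one label per sample
y = np.array([0, 0, 1])

print(X.shape)  # (3, 2) -> [n_samples, n_features]
print(y.shape)  # (3,)   -> n_samples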
In [1]:
from IPython.core.display import Image, display
display(Image(filename='images/iris_setosa.jpg'))
print "Iris Setosa\n"
display(Image(filename='images/iris_versicolor.jpg'))
print "Iris Versicolor\n"
display(Image(filename='images/iris_virginica.jpg'))
print "Iris Virginica"
If we want to design an algorithm to recognize iris species, what might the features be? What might the labels be?
Remember: we need a 2D data array of size `[n_samples, n_features]`, and a 1D label array of size `n_samples`.

- What would the `n_samples` refer to?
- What might the `n_features` refer to?

Remember that there must be a fixed number of features for each sample, and feature number `i` must be a similar kind of quantity for each sample.
Scikit-learn has a very straightforward set of data on these iris species. The data consist of the following:
Features in the Iris dataset:

- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm

Target classes to predict:

- Iris Setosa
- Iris Versicolour
- Iris Virginica
scikit-learn embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:
In [2]:
from sklearn.datasets import load_iris
iris = load_iris()
The result is a `Bunch` object, which is basically an enhanced dictionary containing the data.

Note that Bunch objects are not required for performing learning in scikit-learn; they are simply a convenient container for the numpy arrays which are required.
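Since a Bunch is essentially a dictionary with attribute access, both access styles should hand back the very same array. A quick check:

# dictionary access and attribute access refer to the same numpy array
print(iris['data'] is iris.data)  # True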
In [3]:
iris.keys()
Out[3]:
In [4]:
n_samples, n_features = iris.data.shape
print(n_samples, n_features)
print(iris.data[0])
In [5]:
print(iris.data.shape)
print(iris.target.shape)
In [6]:
print(iris.target)
In [7]:
print(iris.target_names)
This data is four-dimensional, but we can visualize two of the dimensions at a time using a simple scatter plot:
In [8]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
def plot_iris_projection(x_index, y_index):
    # this formatter will label the colorbar with the correct target names
    formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])
    plt.scatter(iris.data[:, x_index], iris.data[:, y_index],
                c=iris.target)
    plt.colorbar(ticks=[0, 1, 2], format=formatter)
    plt.xlabel(iris.feature_names[x_index])
    plt.ylabel(iris.feature_names[y_index])

plot_iris_projection(2, 3)
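Because the column indices are arguments, any other pair of features can be plotted the same way, for example:

plot_iris_projection(0, 1)  # sepal length vs. sepal width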
Scikit-learn makes a number of example datasets available. They come in three flavors:

- **Packaged data**: small datasets shipped with scikit-learn, loaded with `sklearn.datasets.load_*`
- **Downloadable data**: larger datasets downloaded from the web on first use, fetched with `sklearn.datasets.fetch_*`
- **Generated data**: synthetic datasets created from models, built with `sklearn.datasets.make_*`
You can explore the available dataset loaders, fetchers, and generators using IPython's tab-completion functionality. After importing the `datasets` submodule from `sklearn`, type `datasets.load_` + TAB, `datasets.fetch_` + TAB, or `datasets.make_` + TAB to see a list of available functions.
In [9]:
from sklearn import datasets
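If you are not in IPython, one unofficial way to get a similar list is to filter the submodule's namespace. A minimal sketch:

# names of all packaged-data loaders exposed by the datasets submodule
loaders = [name for name in dir(datasets) if name.startswith('load_')]
print(loaders)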
The data downloaded using the `fetch_*` functions are stored locally in a subdirectory of your home directory. You can use the following to determine where it is:
In [10]:
from sklearn.datasets import get_data_home
get_data_home()
Out[10]:
In [11]:
!ls $HOME/scikit_learn_data/
Be warned: many of these datasets are quite large and can take a long time to download (especially on conference wifi)!

If you start a download within the IPython notebook and want to kill it, you can use IPython's "kernel interrupt" feature, available in the menu or via the keyboard shortcut `Ctrl-m i`.

You can press `Ctrl-m h` for a list of all IPython keyboard shortcuts.
Now we'll take a look at another dataset, one where we have to put a bit more thought into how to represent the data. We can explore the data in a similar manner as above:
In [12]:
from sklearn.datasets import load_digits
digits = load_digits()
In [13]:
digits.keys()
Out[13]:
In [14]:
n_samples, n_features = digits.data.shape
print(n_samples, n_features)
In [15]:
print(digits.data[0])
print(digits.target)
The target here is just the digit represented by the data. The data is an array of length 64... but what does this data mean?
There's a clue in the fact that we have two versions of the data array: `data` and `images`. Let's take a look at them:
In [16]:
print(digits.data.shape)
print(digits.images.shape)
We can see that they're related by a simple reshaping:
In [17]:
print(np.all(digits.images.reshape((1797, 64)) == digits.data))
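The relationship works in the other direction as well; a quick sketch recovering the image stack from the flat data matrix:

# reshape each 64-element row back into an 8x8 image
images_again = digits.data.reshape((1797, 8, 8))
print(np.all(images_again == digits.images))  # True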
Aside... numpy and memory efficiency:
You might wonder whether duplicating the data is a problem. In this case, the memory overhead is very small. Even though the arrays are different shapes, they point to the same memory block, which we can see by doing a bit of digging into the guts of numpy:
In [18]:
print(digits.data.__array_interface__['data'])
print(digits.images.__array_interface__['data'])
The long integer here is a memory address: the fact that the two are the same tells us that the two arrays are simply views of the same underlying data.
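numpy also provides a convenience function for this kind of check, saving the dig through `__array_interface__` (note it conservatively tests whether the arrays' memory bounds overlap):

# True if the two arrays can overlap in memory
print(np.may_share_memory(digits.data, digits.images))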
Let's visualize the data. It's a little bit more involved than the simple scatter plot we used above, but we can do it rather tersely.
In [19]:
# set up the figure
fig = plt.figure(figsize=(6, 6)) # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
# plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))
We see now what the features mean. Each feature is a real-valued quantity representing the darkness of a pixel in an 8x8 image of a hand-written digit.
Even though each sample has data that is inherently two-dimensional, the data matrix flattens this 2D data into a single vector, which can be contained in one row of the data matrix.
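To make the flattening concrete, compare one image with its corresponding row of the data matrix:

# the first 8x8 image, flattened row by row, equals the first row of the data matrix
print(np.all(digits.images[0].ravel() == digits.data[0]))  # True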
One dataset often used as an example of a simple nonlinear dataset is the S-curve:
In [20]:
from sklearn.datasets import make_s_curve
data, colors = make_s_curve(n_samples=1000)
print(data.shape)
print(colors.shape)
In [21]:
from mpl_toolkits.mplot3d import Axes3D
ax = plt.axes(projection='3d')
ax.scatter(data[:, 0], data[:, 1], data[:, 2], c=colors)
ax.view_init(10, -60)
This example is typically used with a class of unsupervised learning methods known as manifold learning. We'll explore unsupervised learning in detail later in the tutorial.
Here we'll take a moment for you to explore the datasets yourself. Later on we'll be using the Olivetti faces dataset. Take a moment to fetch the data (about 1.4MB), and visualize the faces. You can copy the code used to visualize the digits above, and modify it for this data.
In [22]:
from sklearn.datasets import fetch_olivetti_faces
In [23]:
# fetch the faces data
In [24]:
# Use a script like above to plot the faces image data.
# hint: plt.cm.bone is a good colormap for this data
In [25]:
# Uncomment the following to load the solution to this exercise
# %load solutions/02_faces.py