For the intro portion of this tutorial, we'll be loading several dataset examples. Scikit-learn has methods to access several datasets: we'll explore two of these here.
The machine learning community often uses the simple Iris flower dataset, where each row of the dataset (or CSV file) is a set of measurements of an individual iris flower. Each sample in this dataset is described by 4 features and can belong to one of three target classes:
Features in the Iris dataset:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
Target classes to predict:
- Iris setosa
- Iris versicolor
- Iris virginica
scikit-learn
embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:
In [ ]:
from sklearn.datasets import load_iris
iris = load_iris()
The features of each sample flower are stored in the data
attribute of the dataset:
In [ ]:
n_samples, n_features = iris.data.shape
print(n_samples)
print(n_features)
print(iris.data[0])
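The loader also exposes the names of the features. As a quick sketch, we can pair each feature name with its value for the first sample:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# pair each feature name with the corresponding value of the first sample
for name, value in zip(iris.feature_names, iris.data[0]):
    print(name, value)
```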
The information about the class of each sample is stored in the target
attribute of the dataset:
In [ ]:
len(iris.data) == len(iris.target)
In [ ]:
iris.target
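The targets are integer class labels. As a quick check, we can count how many samples fall into each class with NumPy's bincount:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# count the number of samples belonging to each of the three classes
print(np.bincount(iris.target))  # 50 samples per class
```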
The names of the classes are stored in the target_names
attribute of the dataset:
In [ ]:
list(iris.target_names)
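The integer targets can be translated to class names by indexing target_names with the target array; a minimal sketch:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# map each integer label to its class name
names = iris.target_names[iris.target]
print(names[:3])  # the first samples all belong to the 'setosa' class
```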
Datasets that scikit-learn downloads from the web (via its fetch_* functions) are stored locally, within a subdirectory of your home directory. You can use the following to determine where it is:
In [ ]:
from sklearn.datasets import get_data_home
get_data_home()
Take a moment now to examine this directory: any datasets you download will be cached there. You may also be curious about other datasets which are available. These can be found in sklearn.datasets
.
In [ ]:
from sklearn import datasets
You can see which datasets are available by using ipython's tab-completion feature. Simply type
datasets.fetch_
or
datasets.load_
and then press the tab key. This will give you a drop-down menu which lists all the datasets that can be fetched.
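Outside of IPython, you can list the same loaders programmatically; a quick sketch using Python's built-in dir():

```python
from sklearn import datasets

# list the loader and fetcher functions exposed by sklearn.datasets
loaders = [name for name in dir(datasets) if name.startswith('load_')]
fetchers = [name for name in dir(datasets) if name.startswith('fetch_')]
print(loaders)
print(fetchers)
```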
In [ ]:
Be warned: many of these datasets are quite large! If you start a download and you want to kill it, you can use ipython's "kernel interrupt" feature, available in the menu or using the shortcut Ctrl-m i
.
(You can press Ctrl-m h
for a list of all ipython
keyboard shortcuts).
Now we'll take a look at another dataset, one where we have to put a bit more thought into how to represent the data.
In [ ]:
from sklearn.datasets import load_digits
digits = load_digits()
n_samples, n_features = digits.data.shape
print(n_samples, n_features)
Let's take a look at the data. As with the iris data, we can access the information as follows:
In [ ]:
print(digits.data[0])
print(digits.target)
Each sample has 64 features, representing a hand-written digit. We can plot the images these features represent to gain more insight.
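The loader also exposes the samples as 8x8 arrays in the images attribute; as a quick sketch, each row of data is just the flattened image:

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# each 64-element feature vector is the flattened 8x8 image
assert np.all(digits.images[0].flatten() == digits.data[0])
print(digits.images[0].shape)  # (8, 8)
```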
We want to plot figures using pylab: we'll use the following command to make sure the figures appear in-line (this only works within ipython notebook):
In [ ]:
%pylab inline
We can access the digits data in the same way as the iris data above. Let's plot a sample of the digits
In [ ]:
import pylab as pl

# set up the figure
fig = pl.figure(figsize=(8, 8))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(100):
    ax = fig.add_subplot(10, 10, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.data[i].reshape((8, 8)), cmap=pl.cm.binary)
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]), bbox=dict(facecolor='white', edgecolor='none', pad=1))
Notice that we are representing each two-dimensional array of pixels as a single vector. This data representation is a very important aspect of machine learning. All of the algorithms in scikit-learn accept data in a matrix format, of size [n_samples
$\times$ n_features]
.
With the digits data, we saw above that n_samples = 1797
, and n_features = 64
: one integer-valued feature for each pixel.
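To double-check this [n_samples $\times$ n_features] layout, here is a quick sketch verifying the shapes of both datasets we loaded:

```python
from sklearn.datasets import load_iris, load_digits

# both datasets expose a 2D data matrix of shape (n_samples, n_features)
iris = load_iris()
digits = load_digits()
print(iris.data.shape)    # (150, 4)
print(digits.data.shape)  # (1797, 64)
```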