Loading Datasets with scikit-learn

For the introductory portion of this tutorial, we'll load several example datasets. Scikit-learn provides helper functions for accessing a number of datasets; we'll explore two of them here.

Loading Iris Data

The machine learning community often uses a simple flower dataset where each row in the dataset (or CSV file) is a set of measurements of an individual iris flower. Each sample in this dataset is described by 4 features and can belong to one of three target classes:

  • Features in the Iris dataset:

    1. sepal length in cm
    2. sepal width in cm
    3. petal length in cm
    4. petal width in cm
  • Target classes to predict:

    1. Iris Setosa
    2. Iris Versicolour
    3. Iris Virginica

scikit-learn embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:


In [ ]:
from sklearn.datasets import load_iris
iris = load_iris()
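The object returned by load_iris is a dictionary-like container (a Bunch). As a quick orientation, and noting that the exact set of keys can vary between scikit-learn versions, you can list its fields:


In [ ]:
# The dataset object behaves like a dictionary; list its fields
print(sorted(iris.keys()))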

The features of each sample flower are stored in the data attribute of the dataset:


In [ ]:
n_samples, n_features = iris.data.shape
print(n_samples)
print(n_features)
print(iris.data[0])

The information about the class of each sample is stored in the target attribute of the dataset:


In [ ]:
len(iris.data) == len(iris.target)

In [ ]:
iris.target
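Since target is just an array of integer labels, a handy sanity check is to count how many samples fall into each class. This is a small sketch using numpy (already a scikit-learn dependency); np.bincount counts the occurrences of each integer:


In [ ]:
import numpy as np

# Count the number of samples in each of the three classes
print(np.bincount(iris.target))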

The names of the classes are stored in the target_names attribute:


In [ ]:
list(iris.target_names)
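Because the entries of target index into target_names, you can translate the integer labels into class names with numpy fancy indexing. A minimal sketch, shown here for the first few samples:


In [ ]:
# Translate integer labels into class names (first five samples)
print(iris.target_names[iris.target][:5])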

Small datasets like iris are bundled with scikit-learn itself; larger datasets fetched from the web are stored locally, within a subdirectory of your home directory. You can use the following to determine where that directory is:


In [ ]:
from sklearn.datasets import get_data_home
get_data_home()
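As a small aside using only the standard library, you can also peek at what (if anything) has been downloaded so far; on a fresh installation this listing will typically be empty:


In [ ]:
import os
from sklearn.datasets import get_data_home

# List whatever has been downloaded into the data home so far (may be empty)
print(os.listdir(get_data_home()))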

Take note of this directory: any datasets you fetch from the web later will be downloaded there. You may also be curious about the other datasets which are available. These can be found in sklearn.datasets.


In [ ]:
from sklearn import datasets

You can see which datasets are available by using ipython's tab-completion feature. Simply type

datasets.fetch_

or

datasets.load_

and then press the tab key. This will give you a drop-down menu which lists all the datasets that can be fetched.
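If you prefer not to rely on tab completion, the same information can be obtained programmatically. This is just a sketch that filters the names exported by sklearn.datasets; the exact output depends on your scikit-learn version:


In [ ]:
from sklearn import datasets

# Loaders for small bundled datasets
print([name for name in dir(datasets) if name.startswith('load_')])
# Fetchers for larger datasets downloaded on demand
print([name for name in dir(datasets) if name.startswith('fetch_')])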


In [ ]:

Be warned: many of these datasets are quite large! If you start a download and you want to kill it, you can use ipython's "kernel interrupt" feature, available in the menu or using the shortcut Ctrl-m i.

(You can press Ctrl-m h for a list of all ipython keyboard shortcuts).

Loading Digits Data

Now we'll take a look at another dataset, one where we have to put a bit more thought into how to represent the data.


In [ ]:
from sklearn.datasets import load_digits
digits = load_digits()

n_samples, n_features = digits.data.shape
print(n_samples, n_features)

Let's take a look at the data. As with the iris data, we can access the information as follows:


In [ ]:
print(digits.data[0])
print(digits.target)

Each sample has 64 features, representing the pixels of an 8x8 image of a hand-written digit. We can plot the images these features represent to gain more insight.
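As a quick sketch before plotting, you can reshape one flat 64-element row back into its 8x8 pixel grid; load_digits also exposes the same data pre-shaped in an images attribute:


In [ ]:
# Reshape the first flat sample back into its 8x8 pixel grid
print(digits.data[0].reshape((8, 8)))
# The unflattened images are also available directly
print(digits.images[0].shape)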

We want to plot figures using pylab: we'll use the following command to make sure the figures appear in-line (this only works within ipython notebook):


In [ ]:
%pylab inline

We can access the digits data in the same way as the iris data above. Let's plot a sample of the digits:


In [ ]:
import pylab as pl

# set up the figure
fig = pl.figure(figsize=(8, 8))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(100):
    ax = fig.add_subplot(10, 10, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.data[i].reshape((8, 8)), cmap=pl.cm.binary)
    
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]), bbox=dict(facecolor='white', edgecolor='none', pad=1))

Notice that we are representing each two-dimensional array of pixels as a single vector. This data representation is a very important aspect of machine learning. All of the algorithms in scikit-learn accept data in a matrix format, of size [n_samples $\times$ n_features].

With the digits data, we saw above that n_samples = 1797, and n_features = 64: one integer-valued feature for each pixel.
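To make the connection between images and the feature matrix concrete, here is a small sketch (using numpy) comparing the two representations: flattening each 8x8 image gives one 64-feature row of the data matrix.


In [ ]:
import numpy as np

# Each 8x8 image flattens into one 64-feature row of the data matrix
print(digits.images.shape)
print(digits.data.shape)
print(np.array_equal(digits.images.reshape((len(digits.images), -1)), digits.data))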