Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations. If you find this content useful, please consider supporting the work by buying the book!
Thanks to the SciPy community, there are tons of resources out there for getting your hands on some data.
A particularly useful resource comes in the form of the sklearn.datasets package of scikit-learn. This package comes pre-installed with some small datasets that do not require downloading any files from external websites. These include, among others, the Iris flower dataset and a small handwritten-digits dataset.
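As a quick illustration of a bundled dataset, the following minimal sketch loads the small digits dataset through load_digits, one of the loader functions in sklearn.datasets. The variable name digits is chosen here for illustration only.

```python
from sklearn.datasets import load_digits

# Load the small handwritten-digits dataset bundled with scikit-learn
# (the load_* functions read from files shipped with the package).
digits = load_digits()

# Each row of digits.data is a flattened 8 x 8 image;
# digits.target holds one label per image.
print(digits.data.shape)    # (1797, 64)
print(digits.target.shape)  # (1797,)
```

Data and labels come back in two separate containers, the same layout used for the larger downloaded datasets.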
Even better, scikit-learn allows you to download datasets directly from external repositories, such as the machine learning database at http://mldata.org.
For example, to download the MNIST dataset of handwritten digits, simply type:
In [1]:
from sklearn import datasets
In [2]:
mnist = datasets.fetch_mldata('MNIST original')
Note that this might take a while, depending on your internet connection.
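Newer scikit-learn releases no longer ship fetch_mldata. In that case, a rough equivalent can be sketched with fetch_openml, which pulls the same MNIST data from the OpenML repository. The helper name load_mnist_from_openml is hypothetical and introduced only for this example; 'mnist_784' is assumed to be the dataset name on OpenML.

```python
from sklearn.datasets import fetch_openml


def load_mnist_from_openml():
    """Fetch MNIST from OpenML (hypothetical helper; needs internet access)."""
    # 'mnist_784' is the assumed OpenML dataset name; as_frame=False asks for
    # NumPy arrays rather than a pandas DataFrame.
    return fetch_openml('mnist_784', version=1, as_frame=False)
```

Calling mnist = load_mnist_from_openml() should again give an object with data and target attributes. Labels from this source may arrive as strings rather than numbers, so a conversion such as mnist.target.astype(int) might be needed.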
The MNIST database contains a total of 70,000 examples of handwritten digits (28x28 pixel images, labeled from 0 to 9). Data and labels are delivered in two separate containers, which we can inspect as follows:
In [3]:
mnist.data.shape
Out[3]:
(70000, 784)
In [4]:
mnist.target.shape
Out[4]:
(70000,)
Here, we can see that mnist.data contains 70,000 images of 28 x 28 = 784 pixels each.
Labels are stored in mnist.target, where there is only one label per image.
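To make the 28 x 28 = 784 relationship concrete, here is a minimal sketch that turns one flattened row back into a two-dimensional image. It uses a synthetic array as a stand-in for a row of mnist.data, and the names flat_image and image are illustrative.

```python
import numpy as np

# Synthetic stand-in for a single row of mnist.data (not real MNIST pixels).
flat_image = np.arange(784)

# Reshape the 784-element row back into a 28 x 28 grid.
image = flat_image.reshape(28, 28)
print(image.shape)  # (28, 28)

# NumPy reshapes in row-major order: grid element (r, c) is
# element r * 28 + c of the flat row.
print(image[1, 0] == flat_image[28])  # True
```

With the real data, the same reshape would be applied to a row such as mnist.data[0], assuming the pixels are stored row by row.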
We can further inspect the values of all targets, but we don't want to print all 70,000 of them. Instead, we are interested in seeing all distinct target values, which is easy to do with NumPy:
In [5]:
import numpy as np
In [6]:
np.unique(mnist.target)
Out[6]:
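If the number of examples per label is also of interest, np.unique can report it through its return_counts argument. The sketch below uses a small made-up label array in place of mnist.target.

```python
import numpy as np

# Small made-up label array standing in for mnist.target.
labels = np.array([3, 0, 3, 9, 0, 3])

# return_counts=True also reports how often each distinct value occurs.
values, counts = np.unique(labels, return_counts=True)
print(values)  # [0 3 9]
print(counts)  # [2 3 1]
```

Applied to the real labels, this would show how evenly the ten digit classes are represented.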