In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

The problem: classification of hand-written digits

The folder data/ contains 3 datasets:

  • train_images.npy contains 2000 images of hand-written digits (28x28 pixels) that have been previously identified by humans
  • train_labels.npy contains the 2000 corresponding labels (integers from 0 to 9)
  • test_images.npy contains 1000 unlabeled images of hand-written digits

The aim is to automatically attribute the right label to the unlabeled digits.

Let us have a look at the images and corresponding labels:


In [ ]:
# Load the data
train_images = np.load('./data/train_images.npy')
train_labels = np.load('./data/train_labels.npy')
test_images = np.load('./data/test_images.npy')

# Define function to have a look at the data
def show_random_digit( images, labels=None ):
    """"Show a random image out of `images`, 
    with the corresponding label if available"""
    i = np.random.randint(len(images))
    image = images[i].reshape((28, 28))
    plt.imshow( image, cmap='Greys' )
    if labels is not None:
        plt.title('Label: %d' %labels[i])

Let us first have a look at the images from the training set (labeled images).


In [ ]:
show_random_digit( train_images, train_labels )

We can also have a look at the images from the test set (unlabeled images).


In [ ]:
show_random_digit( test_images )

Creating a Python script that predicts the labels

In order to predict the labels, we will use the nearest neighbor method:
For each of the images in the test set, we will search for the most similar image in the training set, and return the corresponding label.

Since the actual code for the nearest neighbor method is not of interest here, it has already been implemented in the folder classification/, and we will simply use it here as a function call.

Note: The implementation in classification/ is naive and inefficient. scikit-learn has a much more efficient implementation, but it has built-in parallel functionality, which would not be suitable for this tutorial.
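For reference, here is a minimal sketch of what a brute-force nearest-neighbor predictor could look like. It assumes each image is stored as a flattened row of pixels (as suggested by the reshape in show_random_digit above); the actual code in classification/ may differ in its details.


In [ ]:
# Hedged sketch of a brute-force nearest-neighbor predictor
# (for illustration only; not the implementation in classification/)
import numpy as np

def nearest_neighbor_sketch( test_images, train_images, train_labels ):
    """For each test image, return the label of the closest
    training image (squared Euclidean distance on the pixels)."""
    test_labels = np.empty( len(test_images), dtype=train_labels.dtype )
    for i, image in enumerate( test_images ):
        # Distance of this test image to every training image
        distances = np.sum( (train_images - image)**2, axis=1 )
        test_labels[i] = train_labels[ np.argmin(distances) ]
    return test_labels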


In [ ]:
from classification import nearest_neighbor_prediction
nearest_neighbor_prediction?

In addition, instead of performing the calculation directly in the Jupyter notebook, we will write a Python script that performs the calculation, and we will execute this script from the terminal.
NB: This may seem odd, but it will make more sense when we compare this code with the corresponding mpi4py code.

Below, the cell magic %%file allows us to write a Python script and save it as serial_script.py, without having to leave the notebook and open a text editor.


In [ ]:
%%file serial_script.py

import numpy as np
from classification import nearest_neighbor_prediction

# Load data
train_images = np.load('./data/train_images.npy')
train_labels = np.load('./data/train_labels.npy')
test_images = np.load('./data/test_images.npy')

# Predict the test labels and save them to a file
test_labels = nearest_neighbor_prediction( test_images, train_images, train_labels )
np.save('data/test_labels_serial.npy', test_labels )

In the cell below, the character ! allows us to run a command line as if we were in a terminal.
The cell magic %%time times the execution.


In [ ]:
%%time 
! python serial_script.py

The execution takes a substantial amount of time: the nearest neighbor method is simple but computationally expensive.


Checking the results

The script saved the predicted labels in test_labels_serial.npy.
Let us have a look at it, and check that the predicted labels match the images.


In [ ]:
test_labels = np.load('./data/test_labels_serial.npy')

In [ ]:
show_random_digit( test_images, test_labels )

Towards a parallel implementation

Up to now, the prediction of the labels was done on a single core.
However, since the single-core execution is time consuming, it is desirable to execute the code on several cores.

Because the prediction of each label is independent, the parallelization is conceptually trivial: the set of 1000 unlabeled images should be split between the different cores, so that each core only takes care of a fraction of the unlabeled images.

Let us see how to implement this with mpi4py here.
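As a rough preview of the idea, here is a minimal, hedged sketch of how the unlabeled images could be split across MPI ranks. The file name mpi_preview.py is just a placeholder; the sketch only distributes the images and reports the split, and the full prediction code is developed below.


In [ ]:
%%file mpi_preview.py

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Load the unlabeled images and keep only this rank's share
test_images = np.load('./data/test_images.npy')
my_images = np.array_split( test_images, size )[rank]
print('Rank %d out of %d handles %d images' %(rank, size, len(my_images)))

Such a script would be launched from the terminal (or with ! from the notebook) with e.g. mpirun -np 4 python mpi_preview.py.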