MNIST Dataset & Database

In the MNIST tutorial we use an lmdb database. You can also use leveldb or minidb by changing the database type reference where the tutorial reads from the databases.

Dataset:

You can download the raw MNIST dataset, decompress (gunzip or unzip) the image and label files, and make the database yourself.
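If the raw files arrive gzip-compressed, the decompression step can be done from Python with the standard library alone. The following is a minimal sketch, assuming the archives carry a `.gz` suffix (for example `train-images-idx3-ubyte.gz`); the helper name `gunzip_file` is hypothetical and not part of Caffe2.

```python
import gzip
import shutil


def gunzip_file(src_path, dst_path=None):
    """Decompress a gzip file and return the path of the extracted file."""
    if dst_path is None:
        # Assumption: the output name is the input name without ".gz".
        if src_path.endswith(".gz"):
            dst_path = src_path[:-3]
        else:
            dst_path = src_path + ".out"
    # Stream the data in chunks so large files are not loaded into memory at once.
    with gzip.open(src_path, "rb") as f_in, open(dst_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst_path
```

Calling `gunzip_file` once per archive (train images, train labels, test images, test labels) would leave the four uncompressed files that make_mnist_db expects as `--image_file` and `--label_file` inputs.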

Databases:

We provide a few database formats for you to try with the MNIST tutorial. The default is lmdb.

  • MNIST-nchw-lmdb - contains both the train and test lmdb MNIST databases in NCHW format
  • MNIST-nchw-leveldb - contains both the train and test leveldb MNIST databases in NCHW format
  • MNIST-nchw-minidb - contains both the train and test minidb MNIST databases in NCHW format

Tools:

make_mnist_db

If you prefer LevelDB, you can use Caffe2's make_mnist_db binary to generate leveldb databases. This binary is found in /caffe2/build/caffe2/binaries/ or, depending on your OS and installation, in /usr/local/bin/.

Here is an example call to make_mnist_db:

./make_mnist_db --channel_first --db leveldb --image_file ~/Downloads/train-images-idx3-ubyte --label_file ~/Downloads/train-labels-idx1-ubyte --output_file ~/caffe2/caffe2/python/tutorials/tutorial_data/mnist/mnist-train-nchw-leveldb

./make_mnist_db --channel_first --db leveldb --image_file ~/Downloads/t10k-images-idx3-ubyte --label_file ~/Downloads/t10k-labels-idx1-ubyte --output_file ~/caffe2/caffe2/python/tutorials/tutorial_data/mnist/mnist-test-nchw-leveldb

Note: leveldb can deadlock if more than one user attempts to open the same database at the same time. This is why the Python below deletes LOCK files if it finds them.

TODO: it would be great to extend this binary to create other database formats

Python script

You can use the Python in the code block below to download the dataset with DownloadResource, call the make_mnist_db binary, and generate your database with GenerateDB.

The DownloadResource function can also download and extract a database for you.

Downloads and extracts the MNIST dataset

The sample function below will download and extract the dataset for you.

DownloadResource("http://download.caffe2.ai/datasets/mnist/mnist.zip", data_folder)

Downloads and extracts the lmdb databases of MNIST images - both test and train

DownloadResource("http://download.caffe2.ai/databases/mnist-lmdb.zip", data_folder)

(Re)generates the leveldb database (it can get locked with multi-user setups or abandoned threads). Requires the download of the dataset (mnist.zip); see above.

GenerateDB(image_file_train, label_file_train, "mnist-train-nchw-leveldb")
GenerateDB(image_file_test, label_file_test, "mnist-test-nchw-leveldb")

import os
import subprocess

def DownloadResource(url, path):
    '''Downloads resources from s3 by url and unzips them to the provided path'''
    import io, requests, zipfile
    print("Downloading... {} to {}".format(url, path))
    r = requests.get(url, stream=True)
    r.raise_for_status()  # stop on an HTTP error instead of unzipping an error page
    # r.content is bytes, so wrap it in BytesIO (the StringIO module is Python 2 only)
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall(path)
    print("Completed download and extraction.")


def GenerateDB(image, label, name):
    '''Calls the make_mnist_db binary to generate a leveldb from a mnist dataset'''
    name = os.path.join(data_folder, name)
    print('DB: ', name)
    if not os.path.exists(name):
        # Pass the arguments as a list so paths containing spaces are handled correctly
        syscall = ["/usr/local/bin/make_mnist_db", "--channel_first", "--db", "leveldb",
                   "--image_file", image, "--label_file", label, "--output_file", name]
        # print("Creating database with: ", syscall)
        subprocess.call(syscall)
    else:
        print("Database exists already. Delete the folder if you have issues/corrupted DB, then rerun this.")
        if os.path.exists(os.path.join(name, "LOCK")):
            # print("Deleting the pre-existing lock file")
            os.remove(os.path.join(name, "LOCK"))

            
current_folder = os.path.join(os.path.expanduser('~'), 'caffe2_notebooks')
data_folder = os.path.join(current_folder, 'tutorial_data', 'mnist')

# Downloads and extracts the lmdb databases of MNIST images - both test and train
if not os.path.exists(os.path.join(data_folder,"mnist-train-nchw-lmdb")):
    DownloadResource("http://download.caffe2.ai/databases/mnist-lmdb.zip", data_folder)

# Downloads and extracts the MNIST data set
if not os.path.exists(os.path.join(data_folder, "train-images-idx3-ubyte")):
    DownloadResource("http://download.caffe2.ai/datasets/mnist/mnist.zip", data_folder)

# (Re)generate the leveldb database (it can get locked with multi-user setups or abandoned threads)
# Requires the download of the dataset (mnist.zip) - see DownloadResource above.
# You also need to change references in the MNIST tutorial code where you train or test from lmdb to leveldb
image_file_train = os.path.join(data_folder, "train-images-idx3-ubyte")
label_file_train = os.path.join(data_folder, "train-labels-idx1-ubyte")
image_file_test = os.path.join(data_folder, "t10k-images-idx3-ubyte")
label_file_test = os.path.join(data_folder, "t10k-labels-idx1-ubyte")
GenerateDB(image_file_train, label_file_train, "mnist-train-nchw-leveldb")
GenerateDB(image_file_test, label_file_test, "mnist-test-nchw-leveldb")

Code Changes for Other DBs

If you choose a format other than lmdb, you will need to change a couple of lines of code. When you call AddInput on a model created with ModelHelper, you pass in the db parameter with the database path and the db_type parameter with the database type. You need to update both of these values. Since you create two networks, one for training and one for testing, you need to update the code for both.

Default code using lmdb

train_model = model_helper.ModelHelper(name="mnist_train", arg_scope=arg_scope)
data, label = AddInput(
    train_model, batch_size=64,
    db=os.path.join(data_folder, 'mnist-train-nchw-lmdb'),
    db_type='lmdb')

Updated code using leveldb

train_model = model_helper.ModelHelper(name="mnist_train", arg_scope=arg_scope)
data, label = AddInput(
    train_model, batch_size=64,
    db=os.path.join(data_folder, 'mnist-train-nchw-leveldb'),
    db_type='leveldb')
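Because the train and test networks have to switch together, it can help to derive both the path and the db_type from a single setting. The following is a minimal sketch, assuming the database names follow the `mnist-<split>-nchw-<type>` pattern used above; `mnist_db_args` is a hypothetical helper, not a Caffe2 function.

```python
import os

# Database formats mentioned in this tutorial.
SUPPORTED_DB_TYPES = ("lmdb", "leveldb", "minidb")


def mnist_db_args(data_folder, split, db_type="lmdb"):
    """Return the db and db_type keyword arguments for one MNIST split."""
    if split not in ("train", "test"):
        raise ValueError("split must be 'train' or 'test', got %r" % (split,))
    if db_type not in SUPPORTED_DB_TYPES:
        raise ValueError("unsupported db_type: %r" % (db_type,))
    # Assumption: databases are named mnist-<split>-nchw-<db_type>.
    name = "mnist-{}-nchw-{}".format(split, db_type)
    return {"db": os.path.join(data_folder, name), "db_type": db_type}
```

With this in place, a call such as `AddInput(train_model, batch_size=64, **mnist_db_args(data_folder, 'train', 'leveldb'))` would pick up both values at once, and the test network could use the same helper with `'test'`.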
