In the MNIST tutorial we use an lmdb database. You can also use leveldb or even minidb by changing the type reference when you get ready to read from the databases.
You can download the raw MNIST dataset, g/unzip the dataset and labels, and make the database yourself.
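If you take the manual route, the decompression step can also be done from Python. The following is a minimal sketch using only the standard library. The helper name gunzip_file and the file names in the usage note are illustrative assumptions, not part of the tutorial code.

```python
import gzip
import shutil


def gunzip_file(src_path, dst_path):
    """Decompress the gzip file at src_path into dst_path (hypothetical helper)."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Stream in chunks so the whole file is never held in memory.
        shutil.copyfileobj(src, dst)
    return dst_path
```

Assuming the images were downloaded as train-images-idx3-ubyte.gz, calling gunzip_file("train-images-idx3-ubyte.gz", "train-images-idx3-ubyte") would write the uncompressed file next to it.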
We provide a few database formats for you to try with the MNIST tutorial. The default is lmdb.
If you like LevelDB, you can use Caffe2's make_mnist_db binary to generate leveldb databases. This binary is found in /caffe2/build/caffe2/binaries/ or, depending on your OS and installation, in /usr/local/bin/.
Here is an example call to make_mnist_db:
./make_mnist_db --channel_first --db leveldb --image_file ~/Downloads/train-images-idx3-ubyte --label_file ~/Downloads/train-labels-idx1-ubyte --output_file ~/caffe2/caffe2/python/tutorials/tutorial_data/mnist/mnist-train-nchw-leveldb
./make_mnist_db --channel_first --db leveldb --image_file ~/Downloads/t10k-images-idx3-ubyte --label_file ~/Downloads/t10k-labels-idx1-ubyte --output_file ~/caffe2/caffe2/python/tutorials/tutorial_data/mnist/mnist-test-nchw-leveldb
Note that leveldb can get deadlocked if more than one user attempts to open the database at the same time. This is why the Python below contains logic to delete LOCK files if they are found.
TODO: it would be great to extend this binary to create other database formats
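For the LOCK file cleanup noted above, the idea can be isolated into a small function. This is a sketch only: remove_stale_lock is a hypothetical helper name, and it assumes that no other process currently has the database open, since removing the file in that situation would not be safe.

```python
import os


def remove_stale_lock(db_dir):
    """Delete db_dir/LOCK if it exists; return True when a file was removed."""
    lock_path = os.path.join(db_dir, "LOCK")
    if os.path.exists(lock_path):
        # Assumption: no other process currently has this database open.
        os.remove(lock_path)
        return True
    return False
```

Returning a boolean makes it easy to log whether a leftover lock was actually found before training starts.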
You can use the Python in the code block below to download the dataset with DownloadResource, call the make_mnist_db binary, and generate your database with GenerateDB.
The DownloadResource function can also download and extract a database for you.
Downloads and extracts the MNIST dataset
The sample function below will download and extract the dataset for you.
DownloadResource("http://download.caffe2.ai/datasets/mnist/mnist.zip", data_folder)
Downloads and extracts the lmdb databases of MNIST images - both test and train
DownloadResource("http://download.caffe2.ai/databases/mnist-lmdb.zip", data_folder)
(Re)generate the leveldb database (it can get locked with multi-user setups or abandoned threads)
Requires the download of the dataset (mnist.zip) - see above.
GenerateDB(image_file_train, label_file_train, "mnist-train-nchw-leveldb")
GenerateDB(image_file_test, label_file_test, "mnist-test-nchw-leveldb")
import os


def DownloadResource(url, path):
    '''Downloads resources from s3 by url and unzips them to the provided path'''
    import io
    import zipfile
    import requests
    print("Downloading... {} to {}".format(url, path))
    r = requests.get(url, stream=True)
    # Fail early on HTTP errors instead of trying to unzip an error page
    r.raise_for_status()
    # io.BytesIO wraps the downloaded bytes on both Python 2 and Python 3
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall(path)
    print("Completed download and extraction.")


def GenerateDB(image, label, name):
    '''Calls the make_mnist_db binary to generate a leveldb from a mnist dataset'''
    name = os.path.join(data_folder, name)
    print("DB: {}".format(name))
    if not os.path.exists(name):
        syscall = "/usr/local/bin/make_mnist_db --channel_first --db leveldb --image_file " + image + " --label_file " + label + " --output_file " + name
        # print("Creating database with: " + syscall)
        os.system(syscall)
    else:
        print("Database exists already. Delete the folder if you have issues/corrupted DB, then rerun this.")
        if os.path.exists(os.path.join(name, "LOCK")):
            # print("Deleting the pre-existing lock file")
            os.remove(os.path.join(name, "LOCK"))

current_folder = os.path.join(os.path.expanduser('~'), 'caffe2_notebooks')
data_folder = os.path.join(current_folder, 'tutorial_data', 'mnist')

# Downloads and extracts the lmdb databases of MNIST images - both test and train
if not os.path.exists(os.path.join(data_folder, "mnist-train-nchw-lmdb")):
    DownloadResource("http://download.caffe2.ai/databases/mnist-lmdb.zip", data_folder)

# Downloads and extracts the MNIST data set
if not os.path.exists(os.path.join(data_folder, "train-images-idx3-ubyte")):
    DownloadResource("http://download.caffe2.ai/datasets/mnist/mnist.zip", data_folder)

# (Re)generate the leveldb database (it can get locked with multi-user setups or abandoned threads)
# Requires the download of the dataset (mnist.zip) - see DownloadResource above.
# You also need to change references in the MNIST tutorial code where you train or test from lmdb to leveldb
image_file_train = os.path.join(data_folder, "train-images-idx3-ubyte")
label_file_train = os.path.join(data_folder, "train-labels-idx1-ubyte")
image_file_test = os.path.join(data_folder, "t10k-images-idx3-ubyte")
label_file_test = os.path.join(data_folder, "t10k-labels-idx1-ubyte")
GenerateDB(image_file_train, label_file_train, "mnist-train-nchw-leveldb")
GenerateDB(image_file_test, label_file_test, "mnist-test-nchw-leveldb")
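GenerateDB builds its command as one concatenated string, which can break if a path contains spaces. As an alternative sketch, the same make_mnist_db flags shown above can be passed as an argument list to the standard subprocess module. The helper names make_mnist_db_args and run_make_mnist_db are hypothetical and are not part of Caffe2.

```python
import subprocess


def make_mnist_db_args(binary, image_file, label_file, output_file, db="leveldb"):
    """Build the argument list for make_mnist_db (hypothetical helper)."""
    return [
        binary,
        "--channel_first",
        "--db", db,
        "--image_file", image_file,
        "--label_file", label_file,
        "--output_file", output_file,
    ]


def run_make_mnist_db(args):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.check_call(args)
```

Because each path is its own list element, no shell quoting is needed, and a failed run surfaces as an exception rather than an ignored return code.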
If you choose to use a format other than lmdb, you will need to change a couple of lines of code. After you create the model with ModelHelper, you call AddInput and pass in the db parameter with a path and the db_type parameter with the type of database. You need to update both of these values. Since you create two networks, one for training and one for testing, you need to update the code for both of them.
Default code using lmdb
train_model = model_helper.ModelHelper(name="mnist_train", arg_scope=arg_scope)
data, label = AddInput(
    train_model, batch_size=64,
    db=os.path.join(data_folder, 'mnist-train-nchw-lmdb'),
    db_type='lmdb')
Updated code using leveldb
train_model = model_helper.ModelHelper(name="mnist_train", arg_scope=arg_scope)
data, label = AddInput(
    train_model, batch_size=64,
    db=os.path.join(data_folder, 'mnist-train-nchw-leveldb'),
    db_type='leveldb')
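Since the path and the type have to change together in both the training and the testing network, one option is to derive both from a single setting. The sketch below assumes the folder naming pattern used in this tutorial (mnist-train-nchw-lmdb, mnist-test-nchw-leveldb, and so on). mnist_db_location is a hypothetical helper, not a Caffe2 function.

```python
import os


def mnist_db_location(data_folder, split, db_type="lmdb"):
    """Return (db_path, db_type) for split 'train' or 'test' (hypothetical helper)."""
    if split not in ("train", "test"):
        raise ValueError("split must be 'train' or 'test', got {!r}".format(split))
    # Folder names follow the pattern used in this tutorial:
    # mnist-<split>-nchw-<db_type>
    db_name = "mnist-{}-nchw-{}".format(split, db_type)
    return os.path.join(data_folder, db_name), db_type
```

The returned pair could then be passed straight through as the db and db_type arguments of AddInput, so switching formats means changing one string rather than four.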