In this tutorial we show how to combine NDArray and Symbol to train a neural network from scratch. This mixed programming flavor is one of the unique features that set MXNet apart from other frameworks; the "MX" in MXNet often stands for "mixed".
Note that mx.module already provides all of the functionality implemented here, so this tutorial is mainly for users who want to build things from scratch.
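For reference, here is a minimal sketch of the high-level mx.mod.Module workflow that this tutorial re-implements by hand. The data iterator train_iter is an assumption (any mx.io iterator would do), and net is the network symbol defined in the next cell:

import mxnet as mx
# A sketch of the high-level API this tutorial rebuilds from scratch.
# `train_iter` is a placeholder for an mx.io data iterator you already have,
# and `net` is the network symbol defined below.
mod = mx.mod.Module(symbol=net, context=mx.cpu(),
                    data_names=['data'], label_names=['out_label'])
mod.fit(train_iter,
        optimizer='sgd',
        optimizer_params={'learning_rate': 0.1},
        num_epoch=10)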
We will use a two-layer perceptron as the example to show the idea; note that the code applies to other objective functions, such as deep convolutional neural networks, as well. We first define the network:
In [1]:
import mxnet as mx
num_classes = 10
net = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data=net, name='fc1', num_hidden=128)
net = mx.sym.Activation(data=net, name='relu1', act_type="relu")
net = mx.sym.FullyConnected(data=net, name='fc2', num_hidden=num_classes)
net = mx.sym.SoftmaxOutput(data=net, name='out')
mx.viz.plot_network(net)
Out[1]:
The free variables include the weights and biases of the two fully connected layers (fc1 and fc2), the input data, and the label for the softmax output out. We can list all these variables' names with list_arguments:
In [2]:
print(net.list_arguments())
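Given MXNet's default naming convention, the output should look roughly like ['data', 'fc1_weight', 'fc1_bias', 'fc2_weight', 'fc2_bias', 'out_label'] (this listing is an expectation, not captured output). Note in particular the automatically created label variable out_label, which we will feed with labels during training.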
To run forward and backward passes, we first need to bind data to all the free variables. We could create all the NDArrays and bind them ourselves, as we did in the Symbol tutorial, but there is also a function named simple_bind that simplifies this procedure: it first infers the shapes of all free variables from the provided data shape, then allocates and binds the arrays, which can be accessed through the arg_arrays attribute of the returned executor.
In [3]:
num_features = 100
batch_size = 100
ex = net.simple_bind(ctx=mx.cpu(), data=(batch_size, num_features))
args = dict(zip(net.list_arguments(), ex.arg_arrays))
for name in args:
    print(name, args[name].shape)
By changing ctx to mx.gpu(), we can have the arrays allocated on a GPU instead:
In [4]:
ex = net.simple_bind(ctx=mx.gpu(), data=(batch_size, num_features))
args = dict(zip(net.list_arguments(), ex.arg_arrays))
for name in args:
    print(name, args[name].shape, args[name].context)
Then we initialize the weights with random values and the biases with zeros.
In [5]:
for name in args:
    data = args[name]
    if 'weight' in name:
        data[:] = mx.random.uniform(-0.1, 0.1, data.shape)
    if 'bias' in name:
        data[:] = 0
Before training, we generate a synthetic dataset:
In [8]:
import numpy as np
import matplotlib.pyplot as plt
class ToyData:
    def __init__(self, num_classes, num_features):
        self.num_classes = num_classes
        self.num_features = num_features
        self.mu = np.random.rand(num_classes, num_features)
        self.sigma = np.ones((num_classes, num_features)) * 0.1

    def get(self, num_samples):
        num_cls_samples = int(num_samples / self.num_classes)
        x = np.zeros((num_samples, self.num_features))
        y = np.zeros((num_samples, ))
        for i in range(self.num_classes):
            cls_samples = np.random.normal(self.mu[i,:], self.sigma[i,:], (num_cls_samples, self.num_features))
            x[i*num_cls_samples:(i+1)*num_cls_samples] = cls_samples
            y[i*num_cls_samples:(i+1)*num_cls_samples] = i
        return x, y

    def plot(self, x, y):
        colors = ['r', 'b', 'g', 'c', 'y']
        for i in range(self.num_classes):
            cls_x = x[y == i]
            plt.scatter(cls_x[:,0], cls_x[:,1], color=colors[i%5], s=1)
        plt.show()

toy_data = ToyData(num_classes, num_features)
x, y = toy_data.get(1000)
toy_data.plot(x, y)
Finally we can start the training. Here we use plain minibatch stochastic gradient descent with a fixed learning rate. Every 10 iterations we print the accuracy.
In [9]:
# @@@ AUTOTEST_OUTPUT_IGNORED_CELL
learning_rate = 0.1
final_acc = 0
for i in range(100):
    x, y = toy_data.get(batch_size)
    args['data'][:] = x
    args['out_label'][:] = y
    ex.forward(is_train=True)
    ex.backward()
    for weight, grad in zip(ex.arg_arrays, ex.grad_arrays):
        weight[:] -= learning_rate * (grad / batch_size)
    if i % 10 == 0:
        # fraction of correctly classified samples in this batch
        acc = float((mx.nd.argmax_channel(ex.outputs[0]).asnumpy() == y).sum()) / y.shape[0]
        final_acc = acc
        print('iteration %d, accuracy %f' % (i, acc))
assert final_acc > 0.95, "Low training accuracy."
In this section we showed how to use the imperative NDArray and the symbolic Symbol together to implement a complete training algorithm: the former handles data feeding and weight updates, while the latter defines the objective function, which benefits from the heavy optimizations applied to Symbol and from automatic differentiation.
In the NDArray tutorial we mentioned that the backend system can automatically parallelize computations. This feature makes developing parallel programs in MXNet as easy as writing serial ones.
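As a rough illustration (assuming at least one GPU is available), the two matrix products below are issued on different devices; because the engine is asynchronous, they can run concurrently, and we only block when the results are actually needed:

import mxnet as mx
# Two independent matrix products issued on different devices.
a = mx.nd.ones((1000, 1000), ctx=mx.cpu())
b = mx.nd.ones((1000, 1000), ctx=mx.gpu())
c = mx.nd.dot(a, a)   # queued on the CPU
d = mx.nd.dot(b, b)   # queued on the GPU
c.wait_to_read()      # block only when we need the result
d.wait_to_read()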
Here we show how to develop a training program that uses multiple devices, such as GPUs and CPUs, with data parallelism. In MXNet, a device is a computation resource with its own memory. It can be a single GPU chip: run nvidia-smi to list the available units. Usually a physical GPU card contains only a single GPU chip, but some cards have more than one; for example, each Tesla K80 contains two GK210 chips. All CPUs, in contrast, are presented as a single device, mx.cpu(), because they share the same main memory. Here is a figure (from Nvidia) showing the memory structure and how data is communicated between devices.
Assume each iteration trains a minibatch of size $n$. With data parallelism, we split this batch across all available devices according to their computational power. Each device computes the gradients on its share of the batch, and these gradients are then merged.
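As a small worked example of the partition rule used below (the numbers are purely illustrative): with a batch of 100 samples and two devices whose relative powers are 1 and 5, we get

batch_size = 100.0
devs_power = [1, 5]   # relative computational power of the two devices
workloads = [int(round(batch_size / sum(devs_power) * p)) for p in devs_power]
print(workloads)      # -> [17, 83]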
Now we extend the above training program to multiple devices. The new function accepts a network, the input data shape, a function that returns a batch of data, a list of devices, and their relative computational power.
In [16]:
def train(network, data_shape, data, devs, devs_power):
    # partition the batch across the devices according to their power
    batch_size = float(data_shape[0])
    workloads = [int(round(batch_size/sum(devs_power)*p)) for p in devs_power]
    print('workload partition: ', list(zip(devs, workloads)))
    # create an executor for each device
    exs = [network.simple_bind(ctx=d, data=tuple([p]+data_shape[1:])) for d, p in zip(devs, workloads)]
    args = [dict(zip(network.list_arguments(), ex.arg_arrays)) for ex in exs]
    # initialize the weights on dev 0
    for name in args[0]:
        arr = args[0][name]
        if 'weight' in name:
            arr[:] = mx.random.uniform(-0.1, 0.1, arr.shape)
        if 'bias' in name:
            arr[:] = 0
    # run 50 iterations
    learning_rate = 0.1
    acc = 0
    for i in range(50):
        # broadcast the weights from dev 0 to all devices
        for j in range(1, len(devs)):
            for name, src, dst in zip(network.list_arguments(), exs[0].arg_arrays, exs[j].arg_arrays):
                if 'weight' in name or 'bias' in name:
                    src.copyto(dst)
        # get data
        x, y = data()
        for j in range(len(devs)):
            # partition and assign data
            idx = range(sum(workloads[:j]), sum(workloads[:j+1]))
            args[j]['data'][:] = x[idx,:].reshape(args[j]['data'].shape)
            args[j]['out_label'][:] = y[idx].reshape(args[j]['out_label'].shape)
            # forward and backward
            exs[j].forward(is_train=True)
            exs[j].backward()
            # sum the gradients onto dev 0
            if j > 0:
                for name, src, dst in zip(network.list_arguments(), exs[j].grad_arrays, exs[0].grad_arrays):
                    if 'weight' in name or 'bias' in name:
                        dst += src.as_in_context(dst.context)
        # update the weights on dev 0
        for weight, grad in zip(exs[0].arg_arrays, exs[0].grad_arrays):
            weight[:] -= learning_rate * (grad / batch_size)
        # monitor
        if i % 10 == 0:
            pred = np.concatenate([mx.nd.argmax_channel(ex.outputs[0]).asnumpy() for ex in exs])
            acc = (pred == y).sum() / batch_size
            print('iteration %d, accuracy %f' % (i, acc))
    return acc
Now we can train the previous network using both the CPU and a GPU. It should give results similar to using the CPU alone.
In [ ]:
# @@@ AUTOTEST_OUTPUT_IGNORED_CELL
batch_size = 100
acc = train(net, [batch_size, num_features], lambda : toy_data.get(batch_size), [mx.cpu(), mx.gpu()], [1, 5])
assert acc > 0.95, "Low training accuracy."
Note that this network is too small to show any performance benefit from moving to multiple devices. We therefore consider a slightly more complex network: LeNet-5 for handwritten digit recognition. We first define the network.
In [17]:
def lenet():
    data = mx.sym.Variable('data')
    # first conv
    conv1 = mx.sym.Convolution(data=data, kernel=(5,5), num_filter=20)
    tanh1 = mx.sym.Activation(data=conv1, act_type="tanh")
    pool1 = mx.sym.Pooling(data=tanh1, pool_type="max",
                           kernel=(2,2), stride=(2,2))
    # second conv
    conv2 = mx.sym.Convolution(data=pool1, kernel=(5,5), num_filter=50)
    tanh2 = mx.sym.Activation(data=conv2, act_type="tanh")
    pool2 = mx.sym.Pooling(data=tanh2, pool_type="max",
                           kernel=(2,2), stride=(2,2))
    # first fullc
    flatten = mx.sym.Flatten(data=pool2)
    fc1 = mx.sym.FullyConnected(data=flatten, num_hidden=500)
    tanh3 = mx.sym.Activation(data=fc1, act_type="tanh")
    # second fullc
    fc2 = mx.sym.FullyConnected(data=tanh3, num_hidden=10)
    # loss
    lenet = mx.sym.SoftmaxOutput(data=fc2, name='out')
    return lenet

mx.viz.plot_network(lenet(), shape={'data':(128,1,28,28)})
Out[17]:
Next we prepare the MNIST dataset.
In [ ]:
from sklearn.datasets import fetch_mldata
import numpy as np
import matplotlib.pyplot as plt
class MNIST:
    def __init__(self):
        mnist = fetch_mldata('MNIST original')
        p = np.random.permutation(mnist.data.shape[0])
        self.X = mnist.data[p]
        self.Y = mnist.target[p]
        self.pos = 0

    def get(self, batch_size):
        p = self.pos
        self.pos += batch_size
        return self.X[p:p+batch_size,:], self.Y[p:p+batch_size]

    def reset(self):
        self.pos = 0

    def plot(self):
        for i in range(10):
            plt.subplot(1, 10, i+1)
            plt.imshow(self.X[i].reshape((28,28)), cmap='Greys_r')
            plt.axis('off')
        plt.show()

mnist = MNIST()
mnist.plot()
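Note: on newer scikit-learn releases fetch_mldata has been removed; a commonly used replacement, sketched below under the assumption that the 'mnist_784' OpenML dataset is used, is fetch_openml:

from sklearn.datasets import fetch_openml
# Rough replacement for fetch_mldata('MNIST original'); on recent scikit-learn
# pass as_frame=False to get plain numpy arrays instead of a DataFrame.
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, Y = mnist.data, mnist.target.astype(np.float64)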
We first train LeNet on a single GPU.
In [ ]:
# @@@ AUTOTEST_OUTPUT_IGNORED_CELL
import time
batch_size = 1024
shape = [batch_size, 1, 28, 28]
mnist.reset()
tic = time.time()
acc = train(lenet(), shape, lambda:mnist.get(batch_size), [mx.gpu(),], [1,])
assert acc > 0.8, "Low training accuracy."
print('time for training LeNet on a single GPU: %f sec' % (time.time() - tic))
Then we try multiple GPUs. The following code requires 4 GPUs.
In [ ]:
# @@@ AUTOTEST_OUTPUT_IGNORED_CELL
for ndev in (2, 4):
    mnist.reset()
    tic = time.time()
    acc = train(lenet(), shape, lambda:mnist.get(batch_size),
                [mx.gpu(i) for i in range(ndev)], [1]*ndev)
    assert acc > 0.9, "Low training accuracy."
    print('time for training LeNet on %d GPUs: %f sec' % (
        ndev, time.time() - tic))
As can be seen, using more GPUs speeds up training. The speedup is not perfect because this network is still fairly simple: we cannot fully hide the communication cost across GPUs by pipelining computation and communication. We observed better results with state-of-the-art networks. The following figure shows the speedup of three ImageNet winners using 8 Nvidia Tesla M40 GPUs.