The DyNet package is intended for training and using neural networks, and is particularly suited for applications with dynamically changing network structures. It is a Python wrapper for the DyNet C++ package.
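In neural network packages there are generally two modes of operation:
- Static networks, in which a network is built once and then fed different inputs and outputs.
- Dynamic networks, in which a new network is built for each training example, sharing parameters with the networks built for the other examples.
We will describe both of these modes.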
The main piece of DyNet is the ComputationGraph, which is what essentially defines a neural network. The ComputationGraph is composed of expressions, which relate to the inputs and outputs of the network, as well as the Parameters of the network. The parameters are the things in the network that are optimized over time, and all of the parameters sit inside a ParameterCollection. There are trainers (for example SimpleSGDTrainer) that are in charge of setting the parameter values.
We will not be using the ComputationGraph directly, but it is there in the background, as a singleton object. When dynet is imported, a new ComputationGraph is created. We can then reset the computation graph to a new state by calling renew_cg().
The life-cycle of a DyNet program is:
1. Create a ParameterCollection, and populate it with Parameters.
2. Renew the computation graph, and create an Expression representing the network (the network will include the Expressions for the Parameters defined in the parameter collection).
3. Optimize the model for the objective of the network.
As an example, consider a model for solving the "xor" problem. The network has two inputs, which can be 0 or 1, and a single output which should be the xor of the two inputs. We will model this as a multi-layer perceptron with a single hidden layer.
Let $x = x_1, x_2$ be our input. We will have a hidden layer of 8 nodes, and an output layer of a single node. The activation on the hidden layer will be a $\tanh$. Our network will then be:
$\sigma(V(\tanh(Wx+b)))$
Where $W$ is an $8 \times 2$ matrix, $V$ is a $1 \times 8$ matrix, and $b$ is an 8-dim vector.
We want the output to be either 0 or 1, so we take the output layer to be the logistic-sigmoid function, $\sigma(x)$, which takes values between $-\infty$ and $+\infty$ and returns numbers in $(0,1)$.
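To make the shapes in the formula concrete, here is a minimal plain-Python sketch of the same forward computation. It does not use DyNet; the helper names (sigmoid, mlp_forward), the random initialization range and the fixed seed are assumptions made only for illustration.

```python
import math
import random

def sigmoid(z):
    """Logistic sigmoid: maps a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(W, b, V, x):
    """Compute sigma(V * tanh(W x + b)) for one input vector x."""
    # hidden layer: one tanh unit per row of W (8 rows, 2 columns)
    hidden = [math.tanh(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
              for row, b_i in zip(W, b)]
    # output layer: V is a single row of 8 weights, so the result is a scalar
    pre_activation = sum(v_i * h_i for v_i, h_i in zip(V[0], hidden))
    return sigmoid(pre_activation)

random.seed(0)  # hypothetical fixed seed, only so repeated runs agree
W = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(8)]  # 8 x 2
b = [random.uniform(-1, 1) for _ in range(8)]                      # 8-dim
V = [[random.uniform(-1, 1) for _ in range(8)]]                    # 1 x 8

# evaluate the untrained network on all four xor inputs
outputs = {x: mlp_forward(W, b, V, x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
for x, out in outputs.items():
    print(x, out)
```

With random weights the four outputs carry no meaning yet; the point is only that a $1 \times 8$ matrix $V$ applied to the 8-dim hidden vector yields the single scalar that the sigmoid squashes.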
We will begin by defining the model and the computation graph.
In [1]:
# we assume that the dynet module is in your path.
# OUTDATED: we also assume that LD_LIBRARY_PATH includes a pointer to where libcnn_shared.so is.
import dynet as dy
In [2]:
# create a parameter collection and add the parameters.
m = dy.ParameterCollection()
pW = m.add_parameters((8,2))
pV = m.add_parameters((1,8))
pb = m.add_parameters((8))
dy.renew_cg() # new computation graph. not strictly needed here, but good practice.
# associate the parameters with cg Expressions
W = dy.parameter(pW)
V = dy.parameter(pV)
b = dy.parameter(pb)
In [3]:
#b[1:-1].value()
b.value()
Out[3]:
The first block creates a parameter collection and populates it with parameters.
The second block creates a computation graph and adds the parameters to it, transforming them into Expressions.
The need to distinguish model parameters from "expressions" will become clearer later.
We now make use of the W, V and b expressions in order to create the complete expression for the network.
In [4]:
x = dy.vecInput(2) # an input vector of size 2. Also an expression.
output = dy.logistic(V*(dy.tanh((W*x)+b)))
In [5]:
# we can now query our network
x.set([0,0])
output.value()
Out[5]:
In [6]:
# we want to be able to define a loss, so we need an input expression to work against.
y = dy.scalarInput(0) # this will hold the correct answer
loss = dy.binary_log_loss(output, y)
In [7]:
x.set([1,0])
y.set(0)
print(loss.value())
y.set(1)
print(loss.value())
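The two printed values differ because the loss depends on the target as well as on the network output. As a rough illustration, the sketch below computes the standard binary cross-entropy, $-(y \log p + (1-y)\log(1-p))$, in plain Python. The function defined here is our own stand-in and is not DyNet's implementation; the probabilities are hypothetical values chosen for the example.

```python
import math

def binary_log_loss(p, y):
    """Cross-entropy between a predicted probability p and a 0/1 target y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# an undecided prediction is penalized equally for either target
loss_if_target_0 = binary_log_loss(0.5, 0)
loss_if_target_1 = binary_log_loss(0.5, 1)

# a confident prediction is cheap when right and expensive when wrong
confident_right = binary_log_loss(0.9, 1)
confident_wrong = binary_log_loss(0.9, 0)

print(loss_if_target_0, loss_if_target_1)
print(confident_right, confident_wrong)
```

The loss approaches 0 as the prediction approaches the target, which is why minimizing it pushes the output towards the correct label.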
In [8]:
trainer = dy.SimpleSGDTrainer(m)
To use the trainer, we need to:
- call the forward_scalar method of ComputationGraph. This will run a forward pass through the network, calculating all the intermediate values until the last one (loss, in our case), and then convert the value to a scalar. The final output of our network must be a single scalar value. However, if we do not care about the value, we can just use cg.forward() instead of cg.forward_scalar().
- call the backward method of ComputationGraph. This will run a backward pass from the last node, calculating the gradients with respect to minimizing the last expression (in our case we want to minimize the loss). The gradients are stored in the parameter collection, and we can now let the trainer take care of the optimization step.
- call trainer.update() to optimize the values with respect to the latest gradients.
In the code below, these passes are triggered through the expression itself: loss.value() runs the forward pass and loss.backward() runs the backward pass.
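The three steps above can be sketched without DyNet for a much smaller model: a single logistic unit trained with plain gradient descent. Everything in this block (the function names, the starting weights, the learning rate of 0.1) is a hypothetical stand-in chosen for illustration, and the update rule shown is vanilla SGD rather than a description of what any particular DyNet trainer does internally.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, bias, x, y):
    """Forward pass: return (prediction, loss) for one example."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return p, loss

def backward(p, x, y):
    """Backward pass: gradients of the loss w.r.t. the weights and the bias."""
    # for a sigmoid followed by binary log loss, d(loss)/d(pre-activation) = p - y
    delta = p - y
    return [delta * xi for xi in x], delta

def update(w, bias, grad_w, grad_bias, learning_rate=0.1):
    """Plain SGD step: move each parameter against its gradient."""
    new_w = [wi - learning_rate * gi for wi, gi in zip(w, grad_w)]
    return new_w, bias - learning_rate * grad_bias

w, bias = [0.5, -0.5], 0.0   # hypothetical starting parameters
x, y = [1, 0], 1             # one training example

p, loss_before = forward(w, bias, x, y)        # 1. forward
grad_w, grad_bias = backward(p, x, y)          # 2. backward
w, bias = update(w, bias, grad_w, grad_bias)   # 3. update
_, loss_after = forward(w, bias, x, y)

print("loss before step:", loss_before)
print("loss after step:", loss_after)
```

The split into three functions mirrors the division of labour in the text: the graph computes values and gradients, while the trainer only decides how to turn gradients into new parameter values.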
In [9]:
x.set([1,0])
y.set(1)
loss_value = loss.value() # this performs a forward pass through the network.
print("the loss before step is:", loss_value)
# now do an optimization step
loss.backward() # compute the gradients
trainer.update()
# see how it affected the loss:
loss_value = loss.value(recalculate=True) # recalculate=True means "don't use precomputed value"
print("the loss after step is:", loss_value)
The optimization step indeed made the loss decrease. We now need to run this in a loop.
To this end, we will create a training set, and iterate over it.
For the xor problem, the training instances are easy to create.
In [10]:
def create_xor_instances(num_rounds=2000):
    questions = []
    answers = []
    for _ in range(num_rounds):
        for x1 in 0,1:
            for x2 in 0,1:
                answer = 0 if x1==x2 else 1
                questions.append((x1,x2))
                answers.append(answer)
    return questions, answers

questions, answers = create_xor_instances()
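As a quick sanity check of what the generator produces, the sketch below repeats the function (so that it runs on its own) with a small, arbitrarily chosen num_rounds and inspects the first round.

```python
def create_xor_instances(num_rounds=2000):
    """Return parallel lists of (x1, x2) inputs and their xor labels."""
    questions = []
    answers = []
    for _ in range(num_rounds):
        for x1 in 0, 1:
            for x2 in 0, 1:
                questions.append((x1, x2))
                answers.append(0 if x1 == x2 else 1)
    return questions, answers

# 3 rounds is an arbitrary small number, chosen only to keep the output short
questions, answers = create_xor_instances(num_rounds=3)
first_round = list(zip(questions[:4], answers[:4]))

print(len(questions))   # four instances per round
print(first_round)
```

Each round contributes the same four input pairs in the same order, so the training set is simply the xor truth table repeated num_rounds times.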
We now feed each question / answer pair to the network, and try to minimize the loss.
In [11]:
total_loss = 0
seen_instances = 0
for question, answer in zip(questions, answers):
    x.set(question)
    y.set(answer)
    seen_instances += 1
    total_loss += loss.value()
    loss.backward()
    trainer.update()
    if (seen_instances > 1 and seen_instances % 100 == 0):
        print("average loss is:", total_loss / seen_instances)
Our network is now trained. Let's verify that it indeed learned the xor function:
In [12]:
x.set([0,1])
print("0,1", output.value())
x.set([1,0])
print("1,0", output.value())
x.set([0,0])
print("0,0", output.value())
x.set([1,1])
print("1,1", output.value())
In case we are curious about the parameter values, we can query them:
In [13]:
W.value()
Out[13]:
In [14]:
V.value()
Out[14]:
In [15]:
b.value()
Out[15]:
In [16]:
import dynet as dy

# define the parameters
m = dy.ParameterCollection()
pW = m.add_parameters((8,2))
pV = m.add_parameters((1,8))
pb = m.add_parameters((8))
# renew the computation graph
dy.renew_cg()
# add the parameters to the graph
W = dy.parameter(pW)
V = dy.parameter(pV)
b = dy.parameter(pb)
# create the network
x = dy.vecInput(2) # an input vector of size 2.
output = dy.logistic(V*(dy.tanh((W*x)+b)))
# define the loss with respect to an output y.
y = dy.scalarInput(0) # this will hold the correct answer
loss = dy.binary_log_loss(output, y)

# create training instances
def create_xor_instances(num_rounds=2000):
    questions = []
    answers = []
    for _ in range(num_rounds):
        for x1 in 0,1:
            for x2 in 0,1:
                answer = 0 if x1==x2 else 1
                questions.append((x1,x2))
                answers.append(answer)
    return questions, answers

questions, answers = create_xor_instances()

# train the network
trainer = dy.SimpleSGDTrainer(m)
total_loss = 0
seen_instances = 0
for question, answer in zip(questions, answers):
    x.set(question)
    y.set(answer)
    seen_instances += 1
    total_loss += loss.value()
    loss.backward()
    trainer.update()
    if (seen_instances > 1 and seen_instances % 100 == 0):
        print("average loss is:", total_loss / seen_instances)
Dynamic networks are very similar to static ones, but instead of creating the network once and then calling "set" for each training example to change the inputs, we just create a new network for each training example.
We present an example below. While the value of this may not be clear in the xor example, the dynamic approach is very convenient for networks whose structure is not fixed, such as recurrent or recursive networks.
In [17]:
import dynet as dy

# create training instances, as before
def create_xor_instances(num_rounds=2000):
    questions = []
    answers = []
    for _ in range(num_rounds):
        for x1 in 0,1:
            for x2 in 0,1:
                answer = 0 if x1==x2 else 1
                questions.append((x1,x2))
                answers.append(answer)
    return questions, answers

questions, answers = create_xor_instances()

# create a network for the xor problem given input and output
def create_xor_network(pW, pV, pb, inputs, expected_answer):
    dy.renew_cg() # new computation graph
    W = dy.parameter(pW) # add parameters to graph as expressions
    V = dy.parameter(pV)
    b = dy.parameter(pb)
    x = dy.vecInput(len(inputs))
    x.set(inputs)
    y = dy.scalarInput(expected_answer)
    output = dy.logistic(V*(dy.tanh((W*x)+b)))
    loss = dy.binary_log_loss(output, y)
    return loss

m2 = dy.ParameterCollection()
pW = m2.add_parameters((8,2))
pV = m2.add_parameters((1,8))
pb = m2.add_parameters((8))
trainer = dy.SimpleSGDTrainer(m2)

seen_instances = 0
total_loss = 0
for question, answer in zip(questions, answers):
    loss = create_xor_network(pW, pV, pb, question, answer)
    seen_instances += 1
    total_loss += loss.value()
    loss.backward()
    trainer.update()
    if (seen_instances > 1 and seen_instances % 100 == 0):
        print("average loss is:", total_loss / seen_instances)
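The xor network has the same shape for every example, so it does not show why rebuilding the network per example is useful. The plain-Python sketch below (not DyNet code) hints at the recurrent case mentioned above: the number of steps in the computation depends on the length of each input sequence, while the parameters are shared across all of them. The parameter names and values, the update formula and the example sequences are all hypothetical.

```python
import math

# Shared parameters (hypothetical values); every per-example network reuses them.
params = {"w_in": 0.8, "w_rec": 0.5, "bias": 0.0}

def build_and_run(sequence, params):
    """Build a fresh chain of tanh steps, one per input item, and evaluate it."""
    state = 0.0
    num_steps = 0
    for item in sequence:
        # each step is a new node whose input is the previous step's output
        state = math.tanh(params["w_in"] * item
                          + params["w_rec"] * state
                          + params["bias"])
        num_steps += 1
    return state, num_steps

# sequences of different lengths produce computations of different depths
examples = [[1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0, 1.0]]
results = [build_and_run(seq, params) for seq in examples]

for seq, (state, num_steps) in zip(examples, results):
    print(len(seq), num_steps, state)
```

In a static setup the longest possible chain would have to be fixed in advance; building the computation per example lets ordinary Python control flow decide its shape.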