Theano, Lasagne

and why they matter

got no lasagne?

Install the bleeding edge version from here:

Warming up

  • Implement a function that computes the sum of squares of numbers from 0 to N-1
  • Use numpy or python
  • numpy.arange(N) gives an array of the numbers 0 to N-1

In [ ]:
import numpy as np
def sum_squares(N):
    return <student.Implement_me()>
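
If you want something to test your warm-up function against, the sum of squares has a well-known closed form. The sketch below is plain python; the name `sum_squares_closed_form` is made up here and is not part of the assignment.

```python
def sum_squares_closed_form(N):
    """Sum of k**2 for k = 0 .. N-1, via the textbook formula."""
    return (N - 1) * N * (2 * N - 1) // 6

# compare against a straightforward loop for a few values of N
for N in [1, 2, 4, 10]:
    assert sum_squares_closed_form(N) == sum(k ** 2 for k in range(N))
```

Your numpy version should return the same numbers for the same N.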

In [ ]:

theano teaser

Doing the very same thing

In [ ]:
import theano
import theano.tensor as T

In [ ]:
#I am going to be the function's parameter
N = T.scalar("a dimension",dtype='int32')

#I am a recipe for producing the sum of squares of arange(N), given N
result = (T.arange(N)**2).sum()

#Compiling the recipe of computing "result" given N
sum_function = theano.function(inputs = [N],outputs=result)

In [ ]:

How does it work?

If you're currently in the classroom, chances are I am explaining this wall of text right now

  • 1 You define the inputs of your future function;
  • 2 You write a recipe for some transformation of the inputs;
  • 3 You compile it;
  • You have just got a function!
  • The gobbledegooky version: you define a function as a symbolic computation graph.
  • There are two main kinds of entities: "Inputs" and "Transformations"
  • Both can be numbers, vectors, matrices, tensors, etc.
  • Both can be integers, floats or booleans (uint8) of various sizes.
  • An input is a placeholder for function parameters.
    • N from the example above
  • Transformations are recipes for computing something given inputs and other transformations
    • (T.arange(N)**2).sum() is 3 sequential transformations of N
    • Theano mirrors most of numpy's vector syntax
    • You can almost always go with replacing "np.function" with "T.function" aka "theano.tensor.function"
      • np.mean -> T.mean
      • np.arange -> T.arange
      • np.cumsum -> T.cumsum
      • and so on.
      • builtin operations also work that way
      • np.arange(10).mean() -> T.arange(10).mean()
      • Once upon a blue moon the functions have different names or locations (e.g. T.extra_ops)
        • Ask us or google it
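
To make the "inputs, recipes, compile" idea more tangible, here is a toy sketch in plain python. It is only an illustration of the concept and is not how Theano is implemented; the names `Node`, `Input`, `Transformation` and `compile_function` are all made up for this example.

```python
# A toy "symbolic graph"; every name here is hypothetical
class Node(object):
    """Anything that can appear in the graph."""
    def __add__(self, other):
        return Transformation(lambda a, b: a + b, self, other)
    def __mul__(self, other):
        return Transformation(lambda a, b: a * b, self, other)
    def __pow__(self, power):
        return Transformation(lambda a, b: a ** b, self, power)

class Input(Node):
    """A placeholder: it has a name, but no value."""
    def __init__(self, name):
        self.name = name
    def evaluate(self, values):
        return values[self.name]

class Transformation(Node):
    """A recipe: remembers the operation and its arguments, computes nothing yet."""
    def __init__(self, op, *args):
        self.op = op
        self.args = args
    def evaluate(self, values):
        computed = [a.evaluate(values) if isinstance(a, Node) else a
                    for a in self.args]
        return self.op(*computed)

def compile_function(inputs, output):
    """Turn a recipe into an ordinary python function of the inputs."""
    def function(*args):
        values = dict((inp.name, arg) for inp, arg in zip(inputs, args))
        return output.evaluate(values)
    return function

# 1. define inputs, 2. write a recipe, 3. compile
a = Input("a")
b = Input("b")
recipe = (a + b) ** 2 * 3        # nothing is computed on this line
toy_function = compile_function([a, b], recipe)
result = toy_function(1, 2)      # only now (1 + 2) ** 2 * 3 is evaluated
```

Note that `recipe` itself is just an object describing the computation, which is why printing a symbolic expression never shows a number.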

Still confused? We're gonna fix that.

In [ ]:
example_input_scalar = T.scalar("scalar input",dtype='float32')

example_input_tensor = T.tensor4("four dimensional tensor input") #dtype = theano.config.floatX by default
#don't worry, we won't need the tensor

input_vector = T.vector("", dtype='int32') # vector of integers

In [ ]:

#transformation: elementwise multiplication
double_the_vector = input_vector*2

#elementwise cosine
elementwise_cosine = T.cos(input_vector)

#difference between squared vector and vector itself
vector_squares = input_vector**2 - input_vector

In [ ]:
#Practice time:
#create two vectors of dtype float32
my_vector = student.init_float32_vector()
my_vector2 = student.init_one_more_such_vector()

In [ ]:
#Write a transformation(recipe):
#(vec1)*(vec2) / (sin(vec1) +1)
my_transformation = student.implementwhatwaswrittenabove()

In [ ]:
print my_transformation
#it's okay that it isn't a number: it is a symbolic expression


  • So far we were using "symbolic" variables and transformations
    • Defining the recipe for computation, but not computing anything
  • To use the recipe, one should compile it

In [ ]:
inputs = [<two vectors that my_transformation depends on>]
outputs = [<What do we compute (can be a list of several transformation)>]

# The next lines compile a function that takes two vectors and computes your transformation
my_function = theano.function(
    inputs, outputs,
    allow_input_downcast=True #automatic type casting for input parameters (e.g. float64 -> float32)
)

In [ ]:
#using the function with lists:
print "using python lists:"
print my_function([1,2,3],[4,5,6])

#Or using numpy arrays:
#btw, the 'float' (float64) array is cast to the second parameter's dtype, which is float32
print "using numpy arrays:"
print my_function(np.arange(10),
                  np.linspace(5,6,10,dtype='float'))


  • Compilation can take a while for big functions
  • To avoid waiting, one can evaluate transformations without compiling
  • Without compilation, the code runs slower, so consider reducing input size

In [ ]:
#a dictionary of inputs
my_function_inputs = {
    my_vector:[1,2,3],
    my_vector2:[4,5,6]
}

# evaluate my_transformation
# has to match the compiled function's output
print my_transformation.eval(my_function_inputs)

# can compute transformations on the fly
print "add 2 vectors", (my_vector + my_vector2).eval(my_function_inputs)

#!WARNING! if your transformation only depends on some inputs,
#do not provide the rest of them
print "vector's shape:", my_vector.shape.eval({
        my_vector:[1,2,3]
    })

  • When debugging, one would generally want to reduce the computation complexity. For example, if you are about to feed a neural network a batch of 1000 samples, consider taking just the first 2.
  • If you really want to debug a graph of high computational complexity, you could just as well compile it (e.g. with optimizer='fast_compile')

Do It Yourself

[2 points max]

In [ ]:
# Quest #1 - implement a function that computes a mean squared error of two input vectors
# Your function has to take 2 vectors and return a single number


compute_mse =<student.compile_function()>

In [ ]:
# Tests
from sklearn.metrics import mean_squared_error

for n in [1,5,10,10**3]:
    elems = [np.arange(n),np.arange(n,0,-1), np.zeros(n)]
    for el in elems:
        for el_2 in elems:
            true_mse = np.array(mean_squared_error(el,el_2))
            my_mse = compute_mse(el,el_2)
            if not np.allclose(true_mse,my_mse):
                print 'Wrong result:'
                print 'mse(%s,%s)'%(el,el_2)
                print "should be: %f, but your function returned %f"%(true_mse,my_mse)
                raise ValueError("Something went wrong")

print "All tests passed"

Shared variables

  • The inputs and transformations only exist when the function is called

  • Shared variables always stay in memory like global variables

    • Shared variables can be included into a symbolic graph
    • They can be set and evaluated using special methods
      • but they can't change value arbitrarily during symbolic graph computation
      • we'll cover that later;
  • Hint: such variables are a perfect place to store network parameters
    • e.g. weights or some metadata

In [ ]:
#creating shared variable
shared_vector_1 = theano.shared(np.ones(10,dtype='float64'))

In [ ]:
#evaluating shared variable (outside symbolic graph)
print "initial value",shared_vector_1.get_value()

# within a symbolic graph you use them just like any other input or transformation, no get_value needed

In [ ]:
#setting new value
shared_vector_1.set_value( np.arange(5) )

#getting that new value
print "new value", shared_vector_1.get_value()

#Note that the vector changed shape
#This is entirely allowed... unless your graph is hard-wired to work with some fixed shape

Your turn

In [ ]:
# Write a recipe (transformation) that computes an elementwise transformation of shared_vector and input_scalar
#Compile as a function of input_scalar

input_scalar = T.scalar('coefficient',dtype='float32')

scalar_times_shared = <student.write_recipe()>

shared_times_n = <student.compile_function()>

In [ ]:
print "shared:", shared_vector_1.get_value()

print "shared_times_n(5)",shared_times_n(5)

print "shared_times_n(-0.5)",shared_times_n(-0.5)

In [ ]:
#Changing value of vector 1 (output should change)
shared_vector_1.set_value([-1,0,1])
print "shared:", shared_vector_1.get_value()

print "shared_times_n(5)",shared_times_n(5)

print "shared_times_n(-0.5)",shared_times_n(-0.5)

T.grad - why theano matters

  • Theano can compute derivatives and gradients automatically
  • Derivatives are computed symbolically, not numerically


  • You can only compute the gradient of a scalar transformation with respect to one or several scalar, vector (or tensor) transformations or inputs.
  • A transformation has to have float32 or float64 dtype throughout the whole computation graph
    • a derivative with respect to an integer makes no mathematical sense
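
Since the derivatives are symbolic, a cheap way to sanity-check them is to compare against a numeric estimate. Below is a small plain-python sketch of a central difference; the helper name `numeric_derivative` and the step size `eps` are assumptions made for this example, not something Theano provides under that name.

```python
def numeric_derivative(f, x, eps=1e-5):
    """Central-difference estimate of df/dx at the point x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

def square(x):
    return x ** 2

# the exact derivative of x**2 is 2*x, so at x = 3 the estimate should be close to 6
estimate = numeric_derivative(square, 3.0)
assert abs(estimate - 6.0) < 1e-6
```

Once you have compiled a function from T.grad, you can pass the corresponding compiled "value" function as `f` and check that the two roughly agree at a few points.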

In [ ]:
my_scalar = T.scalar(name='input',dtype='float64')

scalar_squared = T.sum(my_scalar**2)

#a derivative of scalar_squared by my_scalar
derivative = T.grad(scalar_squared,my_scalar)

fun = theano.function([my_scalar],scalar_squared)
grad = theano.function([my_scalar],derivative)

In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline

x = np.linspace(-3,3)
x_squared = [fun(x_i) for x_i in x]
x_squared_der = [grad(x_i) for x_i in x]

plt.plot(x, x_squared,label="x^2")
plt.plot(x, x_squared_der, label="derivative")
plt.legend()

Why that rocks

In [ ]:
my_vector = T.vector('my_vector',dtype='float64')

#Compute the gradient of the next weird function over my_scalar and my_vector
#warning! Trying to understand the meaning of that function may result in permanent brain damage

weird_psychotic_function = ((my_vector+my_scalar)**(1+T.var(my_vector)) +1./T.arcsinh(my_scalar)).mean()/(my_scalar**2 +1) + 0.01*T.sin(2*my_scalar**1.5)*(T.sum(my_vector)* my_scalar**2)*T.exp((my_scalar-4)**2)/(1+T.exp((my_scalar-4)**2))*(1.-(T.exp(-(my_scalar-4)**2))/(1+T.exp(-(my_scalar-4)**2)))**2

der_by_scalar,der_by_vector = <student.compute_grad_over_scalar_and_vector()>

compute_weird_function = theano.function([my_scalar,my_vector],weird_psychotic_function)
compute_der_by_scalar = theano.function([my_scalar,my_vector],der_by_scalar)

In [ ]:
#Plotting your derivative
vector_0 = [1,2,3]

scalar_space = np.linspace(0,7)

y = [compute_weird_function(x,vector_0) for x in scalar_space]
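y_der_by_scalar = [compute_der_by_scalar(x,vector_0) for x in scalar_space]

plt.plot(scalar_space,y,label='function')
plt.plot(scalar_space,y_der_by_scalar,label='derivative')
plt.legend()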

Almost done - Updates

  • updates are a way of changing shared variables after a function call.

  • technically it's a dictionary {shared_variable : a recipe for new value} which has to be provided when the function is compiled
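
The order of events matters here: outputs and new values are both computed from the old shared values, and only then are the shared variables overwritten. The toy sketch below mimics that in plain python. It is not Theano's API; `shared_storage`, `make_function` and `times_x` are names invented for this illustration.

```python
# Toy model of "updates"; all names are hypothetical
shared_storage = {"w": [1.0, 2.0, 3.0]}   # plays the role of a shared variable

def make_function(output_recipe, updates):
    """output_recipe and every update recipe are plain functions of (storage, x)."""
    def function(x):
        # the output and the new values are both computed from the OLD shared values
        output = output_recipe(shared_storage, x)
        new_values = dict((name, recipe(shared_storage, x))
                          for name, recipe in updates.items())
        # only after that are the shared values overwritten
        shared_storage.update(new_values)
        return output
    return function

def times_x(storage, x):
    return [w_i * x for w_i in storage["w"]]

compute_and_save_toy = make_function(times_x, updates={"w": times_x})

first = compute_and_save_toy(2)    # returns w*2 and stores it back into w
second = compute_and_save_toy(2)   # w was already doubled, so this is the original w*4
```

Calling the function twice gives different outputs because the stored value changed in between, which is exactly what you want from a training step.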

That's how it works:

In [ ]:
# Multiply shared vector by a number and save the product back into shared vector

inputs = [input_scalar]
outputs = [scalar_times_shared] #return vector times scalar

my_updates = {
    shared_vector_1:scalar_times_shared #and write this same result back into shared_vector_1
}

compute_and_save = theano.function(inputs, outputs, updates=my_updates)

In [ ]:

#initial shared_vector_1
print "initial shared value:" ,shared_vector_1.get_value()

# evaluating the function (shared_vector_1 will be changed)
print "compute_and_save(2) returns",compute_and_save(2)

#evaluate new shared_vector_1
print "new shared value:" ,shared_vector_1.get_value()

Logistic regression example

[ 4 points max]

Implement the regular logistic regression training algorithm


  • Weights fit in as a shared variable
  • X and y are potential inputs
  • Compile 2 functions:
    • train_function(X,y) - returns error and computes weights' new values (through updates)
    • predict_fun(X) - just computes probabilities ("y") given data

We shall train on a two-class MNIST-like dataset (the 8x8 digits that ship with scikit-learn)

  • please note that target y are {0,1} and not {-1,1} as in some formulae
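
For reference, this is what the logistic loss looks like for {0,1} targets, written in numpy rather than theano. Treat it as a sketch to compare numbers against, not as the solution: the names `sigmoid`, `log_loss` and the toy data are assumptions made up for this example, and the symbolic version with shared weights is still yours to write.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(X, y, weights):
    """Mean logistic loss for targets y in {0, 1}."""
    p = sigmoid(X.dot(weights))                 # predicted P(y = 1)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# tiny made-up data: 4 samples, 2 features
X_toy = np.array([[0., 1.], [1., 0.], [1., 1.], [0., 0.]])
y_toy = np.array([1, 0, 1, 0])

# with zero weights every prediction is 0.5, so the loss equals log(2)
loss_at_zero = log_loss(X_toy, y_toy, np.zeros(2))
# weights that favour the second feature fit this toy data better
loss_better = log_loss(X_toy, y_toy, np.array([0., 5.]))
```

A loss of about log(2) at zero weights is a useful first check for your own implementation.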

In [ ]:
from sklearn.datasets import load_digits
mnist = load_digits(n_class=2)

X,y = mnist.data, mnist.target

print "y [shape - %s]:"%(str(y.shape)),y[:10]

print "X [shape - %s]:"%(str(X.shape))
print X[:3]
print y[:10]

In [ ]:
# inputs and shareds
shared_weights = <student.code_me()>
input_X = <student.code_me()>
input_y = <student.code_me()>

In [ ]:
predicted_y = <predicted probabilities for input_X>
loss = <logistic loss (scalar, mean over sample)>

grad = <gradient of loss over model weights>

updates = {
    shared_weights: <new weights after gradient step>
}

In [ ]:
train_function = <compile function that takes X and y, returns log loss and updates weights>
predict_function = <compile function that takes X and computes probabilities of y>

In [ ]:
from sklearn.model_selection import train_test_split #sklearn.cross_validation in old scikit-learn versions
X_train,X_test,y_train,y_test = train_test_split(X,y)

In [ ]:
from sklearn.metrics import roc_auc_score

for i in range(5):
    loss_i = train_function(X_train,y_train)
    print "loss at iter %i:%.4f"%(i,loss_i)
    print "train auc:",roc_auc_score(y_train,predict_function(X_train))
    print "test auc:",roc_auc_score(y_test,predict_function(X_test))

print "resulting weights:"
print shared_weights.get_value()


[basic part 4 points max] Your ultimate task for this week is to build your first neural network [almost] from scratch and in pure theano.

This time you will solve the same digit recognition problem, but at a larger scale

  • images are now 28x28
  • 10 different digits
  • 50k samples

Note that you are not required to build 152-layer monsters here. A 2-layer (one hidden, one output) NN should already give you an edge over logistic regression.

[bonus score] If you've already beaten logistic regression with a two-layer net, but your enthusiasm still ain't gone, you can try improving the test accuracy even further! The milestones would be 95%/97.5%/98.5% accuracy on the test set.

SPOILER! At the end of the notebook you will find a few tips and frequently made mistakes. If you feel mighty enough to shoot yourself in the foot without external assistance, we encourage you to do so, but if you encounter any insurmountable issues, please do look there before mailing us.

In [ ]:
from mnist import load_dataset

#[down]loading the original MNIST dataset.
#Please note that you should only train your NN on _train sample,
# _val can be used to evaluate out-of-sample error, compare models or perform early-stopping
# _test should be hidden under a rock until final evaluation... But we both know it is near impossible to catch you evaluating on it.
X_train,y_train,X_val,y_val,X_test,y_test = load_dataset()

print X_train.shape,y_train.shape

In [ ]:

In [ ]:
<here you could just as well create computation graph>

In [ ]:
<this may or may not be a good place to evaluate loss and updates>

In [ ]:
<here one could compile all the required functions>

In [ ]:
<this may be a perfect cell to write a training&evaluation loop in>

In [ ]:
<predict & evaluate on test here, right? No cheating pls.>


I did such and such, that did that cool thing and my stupid NN bloated out that stuff. Finally, I did that thingy and felt like LeCun. That cool article and that kind of weed helped me so much (if any).


Recommended pipeline

  • Adapt logistic regression from previous assignment to classify some number against others (e.g. zero vs nonzero)
  • Generalize it to multiclass logistic regression.
    • Either try to remember lecture 0 or google it.
    • Instead of weight vector you'll have to use matrix (feature_id x class_id)
    • softmax (exp over sum of exps) can be implemented manually or as T.nnet.softmax (stable)
    • probably better to use STOCHASTIC gradient descent (minibatch)
      • in which case sample should probably be shuffled (or use random subsamples on each iteration)
  • Add a hidden layer. Now your logistic regression uses hidden neurons instead of inputs.

    • Hidden layer uses the same math as output layer (ex-logistic regression), but uses some nonlinearity (sigmoid) instead of softmax
    • You need to train both layers, not just output layer :)
    • Do not initialize layers with zeros (due to symmetry effects). Gaussian noise with a small sigma will do.
    • 50 hidden neurons and a sigmoid nonlinearity will do for a start. Many ways to improve.
    • In the ideal case this totals to 2 .dot's, 1 softmax and 1 sigmoid
    • make sure this neural network works better than logistic regression
  • Now's the time to try improving the network. Consider layers (size, neuron count), nonlinearities, optimization methods, initialization - whatever you want, but please avoid convolutions for now.
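
Two of the tips above are easy to get subtly wrong, so here is a numpy sketch of both: a softmax that stays finite for large inputs, and a shuffled minibatch iterator. The function names `softmax` and `iterate_minibatches`, the batch size and the random seed are all choices made for this example; the theano counterpart (e.g. T.nnet.softmax) is what you would use inside the graph.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; logits has shape (n_samples, n_classes)."""
    # subtracting the row maximum does not change the result,
    # but keeps np.exp from overflowing
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that cover the sample once."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

# large logits would overflow a naive exp, the shifted version is fine
probs = softmax(np.array([[1000., 1000.], [0., np.log(3.)]]))

# made-up data: 10 samples with 3 features, labels are just the row indices
rng = np.random.RandomState(0)
X_demo = np.arange(30, dtype='float32').reshape(10, 3)
y_demo = np.arange(10)
batches = list(iterate_minibatches(X_demo, y_demo, batch_size=4, rng=rng))
```

With 10 samples and a batch size of 4 the last batch is smaller than the others; whether to keep or drop it is up to you.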

In [ ]: