Going deeper with Tensorflow

In this seminar, we're going to play with Tensorflow and see how it helps us build deep learning models.

If you're running this notebook outside the course environment, you'll need to install tensorflow:

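  • pip install tensorflow should install cpu-only TF on Linux & Mac OS (note that the code below uses the TF 1.x graph API, e.g. tf.placeholder and tf.InteractiveSession)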
  • If you want GPU support from the outset, see the TF install page

In [1]:
import tensorflow as tf
gpu_options = tf.GPUOptions(allow_growth=True, per_process_gpu_memory_fraction=0.1)
s = tf.InteractiveSession(config=tf.ConfigProto(gpu_options=gpu_options))

Warming up

For starters, let's implement a python function that computes the sum of squares of numbers from 0 to N-1.

  • Use numpy or python
  • An array of numbers 0 to N-1 - numpy.arange(N)

In [2]:
import numpy as np
def sum_squares(N):
    v = np.arange(N)
    return np.sum(v**2)

In [3]:
%%time
sum_squares(10**8)


CPU times: user 392 ms, sys: 746 ms, total: 1.14 s
Wall time: 1.14 s
Out[3]:
662921401752298880
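
A note on the number printed above: the exact sum of squares for N = 10**8 is about 3.3e23, which does not fit into a 64-bit integer. Assuming numpy used its default int64 here, the printed value is the exact answer wrapped around modulo 2**64. A minimal check in plain Python, whose ints don't overflow, using the standard closed-form formula (the helper name exact_sum_squares is ours, not part of the seminar):

```python
def exact_sum_squares(n):
    # closed form for 0**2 + 1**2 + ... + (n-1)**2, exact with Python ints
    return (n - 1) * n * (2 * n - 1) // 6

exact = exact_sum_squares(10**8)
wrapped = exact % 2**64  # the part that survives in 64 bits

print(exact)    # 333333328333333350000000
print(wrapped)  # 662921401752298880
```

The tensorflow version further down uses int64 as well, which is why it prints the same wrapped number.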

Tensorflow teaser

Doing the very same thing


In [4]:
#I'm gonna be your function parameter
N = tf.placeholder('int64', name="input_to_your_function")

#I am a recipe for producing the sum of squares of range(N), given N
result = tf.reduce_sum((tf.range(N)**2))

In [5]:
%%time
#example of computing the same as sum_squares
print(result.eval({N:10**8}))


662921401752298880
CPU times: user 707 ms, sys: 380 ms, total: 1.09 s
Wall time: 849 ms

How does it work?

  1. define placeholders where you'll send inputs;
  2. make symbolic graph: a recipe for mathematical transformation of those placeholders;
  3. compute outputs of your graph with particular values for each placeholder
    • output.eval({placeholder:value})
    • s.run(output, {placeholder:value})
  • So far there are two main entities: "placeholder" and "transformation"
  • Both can be numbers, vectors, matrices, tensors, etc.
  • Both can be int32/64, floats or booleans (uint8) of various size.

  • You can define new transformations as an arbitrary operation on placeholders and other transformations

    • tf.reduce_sum(tf.range(N)**2) are 3 sequential transformations of placeholder N
    • There's a tensorflow symbolic version for almost every numpy function
      • a+b, a/b, a**b, ... behave just like in numpy
      • np.mean -> tf.reduce_mean
      • np.arange -> tf.range
      • np.cumsum -> tf.cumsum
      • If you can't find the op you need, see the docs.
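
To make the "recipe" idea concrete, here is a toy sketch in plain Python of the three steps above: create a placeholder, build a graph of transformations, then evaluate it with a fed value. The Node class and the toy_* helpers are hypothetical names made up for this illustration; this is not how tensorflow is implemented, only the same idea in miniature.

```python
class Node:
    """Toy symbolic node: it stores a recipe, not a value."""
    def __init__(self, fn=None, inputs=()):
        self.fn = fn          # None marks a placeholder
        self.inputs = inputs

    def eval(self, feed):
        if self in feed:      # value supplied by the caller
            return feed[self]
        if self.fn is None:
            raise ValueError("placeholder was not fed a value")
        args = [node.eval(feed) for node in self.inputs]
        return self.fn(*args)

def toy_range(n):
    return Node(lambda k: list(range(k)), (n,))

def toy_square(v):
    return Node(lambda xs: [x * x for x in xs], (v,))

def toy_reduce_sum(v):
    return Node(sum, (v,))

# step 1: a placeholder
toy_N = Node()
# step 2: a recipe; nothing is computed at this point
toy_result = toy_reduce_sum(toy_square(toy_range(toy_N)))
# step 3: feed a value and compute
print(toy_result.eval({toy_N: 5}))  # 0 + 1 + 4 + 9 + 16 = 30
```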

Still confused? We're gonna fix that.


In [6]:
#Default placeholder that can be arbitrary float32 scalar, vector, matrix, etc.
arbitrary_input = tf.placeholder('float32')

#Input vector of arbitrary length
input_vector = tf.placeholder('float32',shape=(None,))

#Input vector that _must_ have 10 elements and integer type
fixed_vector = tf.placeholder('int32',shape=(10,))

#Matrix of arbitrary n_rows and 15 columns (e.g. a minibatch of your data table)
input_matrix = tf.placeholder('float32',shape=(None,15))

#You can generally use None whenever you don't need a specific shape
input1 = tf.placeholder('float64',shape=(None,100,None))
input2 = tf.placeholder('int32',shape=(None,None,3,224,224))
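
In a placeholder shape, None means that the dimension may have any size, while the number of dimensions stays fixed (a placeholder created without a shape accepts anything). A plain-Python sketch of that rule; shape_matches is a hypothetical helper written for this note, not a tensorflow function:

```python
def shape_matches(spec, actual):
    """True if `actual` fits `spec`, where None in spec means 'any size'."""
    if len(spec) != len(actual):
        return False  # the rank must agree
    return all(s is None or s == a for s, a in zip(spec, actual))

print(shape_matches((None, 15), (32, 15)))  # True
print(shape_matches((None, 15), (32, 16)))  # False: 16 columns instead of 15
print(shape_matches((None,), (7, 1)))       # False: a matrix is not a vector
```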

In [7]:
#elementwise multiplication
double_the_vector = input_vector*2

#elementwise cosine
elementwise_cosine = tf.cos(input_vector)

#difference between squared vector and vector itself
vector_squares = input_vector**2 - input_vector

In [8]:
#Practice time: create two vectors of type float32
my_vector =  tf.placeholder('float32', name="my_vector")
my_vector2 = tf.placeholder('float32', name="my_vector2")

In [9]:
#Write a transformation(recipe):
#(vec1)*(vec2) / (sin(vec1) +1)
my_transformation = my_vector * my_vector2 / (tf.math.sin(my_vector) + 1)

In [10]:
print(my_transformation)
#it's okay, it's a symbolic graph


Tensor("truediv:0", dtype=float32)

In [11]:
#
dummy = np.arange(5).astype('float32')

my_transformation.eval({my_vector:dummy,my_vector2:dummy[::-1]})


Out[11]:
array([0.       , 1.6291324, 2.0950115, 2.6289961, 0.       ],
      dtype=float32)
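
As a sanity check, the same formula can be evaluated without tensorflow. A small sketch in plain Python (the variable names are ours) that should agree with the array above up to float32 rounding:

```python
import math

vec1 = [0.0, 1.0, 2.0, 3.0, 4.0]
vec2 = vec1[::-1]

# (vec1)*(vec2) / (sin(vec1) + 1), element by element
expected = [a * b / (math.sin(a) + 1) for a, b in zip(vec1, vec2)]
print([round(v, 4) for v in expected])
```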

Visualizing graphs

It's often useful to visualize the computation graph when debugging or optimizing. Interactive visualization is where tensorflow really shines as compared to other frameworks.

There's a special instrument for that, called Tensorboard. You can launch it from console:

tensorboard --logdir=/tmp/tboard --port=7007

If you're pathologically afraid of consoles, try this:

os.system("tensorboard --logdir=/tmp/tboard --port=7007 &")

(but don't tell anyone we taught you that)


In [12]:
# launch tensorboard the ugly way, uncomment if you need that
import os
#!killall tensorboard
#os.system("tensorboard --logdir=/tmp/tboard --port=7007 &")

# show graph to tensorboard
writer = tf.summary.FileWriter("/tmp/tboard", graph=tf.get_default_graph())
writer.close()

One basic function of tensorboard is drawing graphs. Once you've run the cell above, go to localhost:7007 in your browser and switch to the Graphs tab in the top bar.

Here's what you should see:

Tensorboard also allows you to draw graphs (e.g. learning curves), record images & audio and play flash games. This is useful when monitoring learning progress and catching some training issues.

One researcher said:

If you spent last four hours of your worktime watching as your algorithm prints numbers and draws figures, you're probably doing deep learning wrong.

You can read more on tensorboard usage here

Do It Yourself

[2 points max]


In [13]:
# Quest #1 - implement a function that computes a mean squared error of two input vectors
# Your function has to take 2 vectors and return a single number

v1 = tf.placeholder('float32', name='v1')
v2 = tf.placeholder('float32', name='v2')
mse = tf.math.reduce_sum(tf.math.square(v1 - v2)) / tf.dtypes.cast(tf.size(v1), tf.float32)

compute_mse = lambda vector1, vector2: mse.eval({v1:vector1, v2:vector2})

In [14]:
# Tests
from sklearn.metrics import mean_squared_error

for n in [1,5,10,10**3]:
    
    elems = [np.arange(n),np.arange(n,0,-1), np.zeros(n),
             np.ones(n),np.random.random(n),np.random.randint(100,size=n)]
    
    for el in elems:
        for el_2 in elems:
            true_mse = np.array(mean_squared_error(el,el_2))
            my_mse = compute_mse(el,el_2)
            if not np.allclose(true_mse,my_mse):
                print('Wrong result:')
                print('mse(%s,%s)' % (el,el_2))
                print("should be: %f, but your function returned %f" % (true_mse,my_mse))
                raise ValueError("Something is wrong")

print("All tests passed")


All tests passed

Variables

Inputs and transformations have no value outside a function call. That's inconvenient if you want your model to have parameters (e.g. network weights) that are always present but can change their value over time.

Tensorflow solves this with tf.Variable objects.

  • You can assign variable a value at any time in your graph
  • Unlike placeholders, there's no need to explicitly pass values to variables when s.run(...)-ing
  • You can use variables the same way you use transformations

In [15]:
#creating shared variable
shared_vector_1 = tf.Variable(initial_value=np.ones(5))

In [16]:
#initialize variable(s) with initial values
s.run(tf.global_variables_initializer())

#evaluating shared variable (outside symbolic graph)
print("initial value", s.run(shared_vector_1))

# within the symbolic graph you use them just like any other input or transformation, no "get value" needed


initial value [1. 1. 1. 1. 1.]

In [17]:
#setting new value
s.run(shared_vector_1.assign(np.arange(5)))

#getting that new value
print("new value", s.run(shared_vector_1))


new value [0. 1. 2. 3. 4.]

tf.gradients - why graphs matter

  • Tensorflow can compute derivatives and gradients automatically using the computation graph
  • Gradients are computed as a product of elementary derivatives via chain rule:
$$ {\partial f(g(x)) \over \partial x} = {\partial f(g(x)) \over \partial g(x)}\cdot {\partial g(x) \over \partial x} $$

It can get you the derivative of any graph as long as it knows how to differentiate elementary operations
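
A worked instance of the chain rule above, as a sketch in plain Python: take g(x) = x^2 and f(u) = sin(u), multiply the two elementary derivatives, and compare with a finite-difference estimate. The function names are ours and the finite difference is only a sanity check, not what tensorflow does internally.

```python
import math

def g(x):
    return x ** 2

def f(u):
    return math.sin(u)

def chain_rule_derivative(x):
    # df/dg evaluated at g(x), times dg/dx
    return math.cos(g(x)) * (2 * x)

def numeric_derivative(func, x, eps=1e-6):
    # central finite difference
    return (func(x + eps) - func(x - eps)) / (2 * eps)

x0 = 0.7
analytic = chain_rule_derivative(x0)
numeric = numeric_derivative(lambda x: f(g(x)), x0)
print(analytic, numeric)
```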


In [18]:
my_scalar = tf.placeholder('float32')

scalar_squared = my_scalar**2

#a derivative of scalar_squared by my_scalar
derivative = tf.gradients(scalar_squared, my_scalar)[0]



In [19]:
import matplotlib.pyplot as plt
%matplotlib inline

x = np.linspace(-3,3)
x_squared, x_squared_der = s.run([scalar_squared,derivative],
                                 {my_scalar:x})

plt.plot(x, x_squared,label="x^2")
plt.plot(x, x_squared_der, label="derivative")
plt.legend();


Why that rocks


In [20]:
my_vector = tf.placeholder('float32',[None])

#Compute the gradient of the next weird function over my_scalar and my_vector
#warning! Trying to understand the meaning of that function may result in permanent brain damage

weird_psychotic_function = tf.reduce_mean((my_vector+my_scalar)**(1+tf.nn.moments(my_vector,[0])[1]) + 1./ tf.atan(my_scalar))/(my_scalar**2 + 1) + 0.01*tf.sin(2*my_scalar**1.5)*(tf.reduce_sum(my_vector)* my_scalar**2)*tf.exp((my_scalar-4)**2)/(1+tf.exp((my_scalar-4)**2))*(1.-(tf.exp(-(my_scalar-4)**2))/(1+tf.exp(-(my_scalar-4)**2)))**2

der_by_scalar = tf.gradients(weird_psychotic_function, my_scalar)[0]
der_by_vector = tf.gradients(weird_psychotic_function, my_vector)[0]

In [21]:
#Plotting your derivative
scalar_space = np.linspace(1, 7, 100)

y = [s.run(weird_psychotic_function, {my_scalar:x, my_vector:[1, 2, 3]})
     for x in scalar_space]

plt.plot(scalar_space, y, label='function')

y_der_by_scalar = [s.run(der_by_scalar, {my_scalar:x, my_vector:[1, 2, 3]})
     for x in scalar_space]

plt.plot(scalar_space, y_der_by_scalar, label='derivative')
plt.grid()
plt.legend();


Almost done - optimizers

While you can perform gradient descent by hand with automatic grads from above, tensorflow also has some optimization methods implemented for you. Recall momentum & rmsprop?
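
As a reminder, here is a sketch of the classic momentum update done by hand in plain Python, on a noise-free version of the loss used in the next cell (mean squared difference to the target [1, 2]). It follows the common formulation velocity = momentum * velocity + grad; the exact behaviour of tf.train.MomentumOptimizer should be checked against its docs, and all names here are ours.

```python
y_true = [1.0, 2.0]
y_guess = [0.0, 0.0]
velocity = [0.0, 0.0]
learning_rate, momentum = 0.01, 0.9

def loss_and_grad(guess):
    n = len(guess)
    diffs = [g - t for g, t in zip(guess, y_true)]
    loss = sum(d * d for d in diffs) / n
    grad = [2 * d / n for d in diffs]  # d(mean squared error) / d(guess)
    return loss, grad

initial_loss, _ = loss_and_grad(y_guess)
for step in range(500):
    _, grad = loss_and_grad(y_guess)
    # accumulate a running direction, then step along it
    velocity = [momentum * v + g for v, g in zip(velocity, grad)]
    y_guess = [y - learning_rate * v for y, v in zip(y_guess, velocity)]
final_loss, _ = loss_and_grad(y_guess)

print(initial_loss, final_loss, y_guess)
```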


In [22]:
y_guess = tf.Variable(np.zeros(2,dtype='float32'))
y_true = tf.range(1,3,dtype='float32')

loss = tf.reduce_mean((y_guess - y_true + tf.random_normal([2]))**2) 

optimizer = tf.train.MomentumOptimizer(0.01,0.9).minimize(loss,var_list=y_guess)

#same, but more detailed:
#updates = [[tf.gradients(loss,y_guess)[0], y_guess]]
#optimizer = tf.train.MomentumOptimizer(0.01,0.9).apply_gradients(updates)

In [23]:
from IPython.display import clear_output

s.run(tf.global_variables_initializer())

guesses = [s.run(y_guess)]

for _ in range(100):
    s.run(optimizer)
    guesses.append(s.run(y_guess))
    
    clear_output(True)
    plt.plot(*zip(*guesses),marker='.')
    plt.scatter(*s.run(y_true),c='red')
    plt.show()


Logistic regression example

Implement the regular logistic regression training algorithm

Tips:

  • Use a shared variable for weights
  • X and y are potential inputs
  • Compile 2 functions:
    • train_function(X, y) - returns error and computes weights' new values (through updates)
    • predict_function(X) - just computes probabilities ("y") given data

We shall train on a two-class subset of the scikit-learn digits dataset (small 8x8 MNIST-like images)

  • please note that target y are {0,1} and not {-1,1} as in some formulae
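
For reference, with targets in {0,1} the logistic loss (mean over the sample) is usually written as below, where $p_i$ is the predicted probability for sample $x_i$ and $w$ is the weight vector. The second line is the equivalent form for targets $\tilde y_i \in \{-1,1\}$ found in some formulae. This is the standard textbook formulation without a bias term, given as a reminder rather than as the required solution:

$$ p_i = \sigma(w^\top x_i) = {1 \over 1 + e^{-w^\top x_i}}, \qquad L(w) = -{1 \over n}\sum_{i=1}^{n}\left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right] $$

$$ L(w) = {1 \over n}\sum_{i=1}^{n}\log\left(1 + e^{-\tilde y_i\, w^\top x_i}\right), \qquad \tilde y_i = 2y_i - 1 $$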

In [26]:
from sklearn.datasets import load_digits
mnist = load_digits(n_class=2)

X,y = mnist.data, mnist.target

print("y [shape - %s]:" % (str(y.shape)), y[:10])
print("X [shape - %s]:" % (str(X.shape)))


y [shape - (360,)]: [0 1 0 1 0 1 0 0 1 1]
X [shape - (360, 64)]:

In [27]:
print('X:\n',X[:3,:10])
print('y:\n',y[:10])
plt.imshow(X[0].reshape([8,8]))


X:
 [[ 0.  0.  5. 13.  9.  1.  0.  0.  0.  0.]
 [ 0.  0.  0. 12. 13.  5.  0.  0.  0.  0.]
 [ 0.  0.  1.  9. 15. 11.  0.  0.  0.  0.]]
y:
 [0 1 0 1 0 1 0 0 1 1]
Out[27]:
<matplotlib.image.AxesImage at 0x7f34106ed978>

In [29]:
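# inputs and shareds
# one weight per feature (pixel), not per sample
weights = tf.Variable(np.zeros(X.shape[1], dtype='float32'))
# None lets us feed train and test sets of different sizes
input_X = tf.placeholder(tf.float32, shape=(None, X.shape[1]))
input_y = tf.placeholder(tf.float32, shape=(None,))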

In [ ]:
predicted_y = <predicted probabilities for input_X>
loss = <logistic loss (scalar, mean over sample)>

optimizer = <optimizer that minimizes loss>

In [ ]:
train_function = <compile function that takes X and y, returns log loss and updates weights>
predict_function = <compile function that takes X and computes probabilities of y>

In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [ ]:
from sklearn.metrics import roc_auc_score

for i in range(5):
    <run optimizer operation>
    loss_i = <compute loss at iteration i>
    
    print("loss at iter %i:%.4f" % (i, loss_i))
    
    print("train auc:",roc_auc_score(y_train, predict_function(X_train)))
    print("test auc:",roc_auc_score(y_test, predict_function(X_test)))

    
print("resulting weights:")
plt.imshow(s.run(weights).reshape(8, -1))
plt.colorbar();

Bonus: my1stNN

Your ultimate task for this week is to build your first neural network [almost] from scratch and pure tensorflow.

This time you will solve the same digit recognition problem, but at a larger scale

  • images are now 28x28
  • 10 different digits
  • 50k samples

Note that you are not required to build 152-layer monsters here. A 2-layer (one hidden, one output) NN should already give you an edge over logistic regression.

[bonus score] If you've already beaten logistic regression with a two-layer net, but enthusiasm still ain't gone, you can try improving the test accuracy even further! The milestones would be 95%/97.5%/98.5% accuracy on test set.

SPOILER! At the end of the notebook you will find a few tips and frequently made mistakes. If you feel mighty enough to shoot yourself in the foot without external assistance, we encourage you to do so, but if you encounter any insurmountable issues, please do look there before mailing us.


In [ ]:
from mnist import load_dataset

#[down]loading the original MNIST dataset.
#Please note that you should only train your NN on _train sample,
# _val can be used to evaluate out-of-sample error, compare models or perform early-stopping
# _test should be hidden under a rock until final evaluation... But we both know it is near impossible to catch you evaluating on it.
X_train,y_train,X_val,y_val,X_test,y_test = load_dataset()

print (X_train.shape,y_train.shape)

In [ ]:
plt.imshow(X_train[0,0])

In [ ]:
<here you could just as well create computation graph>

In [ ]:
<this may or may not be a good place to define loss and optimizer>

In [ ]:
<this may be a perfect cell to write a training&evaluation loop in>

In [ ]:
<predict & evaluate on test here, right? No cheating pls.>








SPOILERS!

Recommended pipeline

  • Adapt logistic regression from previous assignment to classify some number against others (e.g. zero vs nonzero)
  • Generalize it to multiclass logistic regression.
    • Either try to remember lecture 0 or google it.
    • Instead of weight vector you'll have to use matrix (feature_id x class_id)
    • softmax (exp over sum of exps) can be implemented manually or as tf.nn.softmax (stable)
    • probably better to use STOCHASTIC gradient descent (minibatch)
      • in which case sample should probably be shuffled (or use random subsamples on each iteration)
  • Add a hidden layer. Now your logistic regression uses hidden neurons instead of inputs.

    • Hidden layer uses the same math as output layer (ex-logistic regression), but uses some nonlinearity (sigmoid) instead of softmax
    • You need to train both layers, not just output layer :)
    • Do not initialize layers with zeros (due to symmetry effects). Gaussian noise with a small sigma will do.
    • 50 hidden neurons and a sigmoid nonlinearity will do for a start. Many ways to improve.
    • In the ideal case this totals to 2 matrix multiplications (tf.matmul), 1 softmax and 1 sigmoid
    • make sure this neural network works better than logistic regression
  • Now's the time to try improving the network. Consider layers (size, neuron count), nonlinearities, optimization methods, initialization - whatever you want, but please avoid convolutions for now.
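
On the softmax tip above: "exp over sum of exps" overflows for large inputs if implemented naively, which is what the "(stable)" remark refers to. A minimal sketch of the usual fix in plain Python, subtracting the maximum before exponentiating (the function is written for this note and handles a single vector, not a batch):

```python
import math

def softmax(logits):
    # subtracting the max leaves the result unchanged but keeps exp() in range
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# math.exp(1000.0) alone would raise OverflowError
probs = softmax([1000.0, 1001.0, 1002.0])
print(probs, sum(probs))
```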