In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
This notebook is adapted from notes that I took when learning TensorFlow. It provides a slow, thorough introduction to the fundamentals of TensorFlow, answering questions like: What exactly is TensorFlow? Why do we need it? How does the computation graph work? What are operations, variables, placeholders, and sessions?
It assumes you know a good amount of Python. It starts from scratch with TensorFlow, but does not teach numpy
or linear algebra; you won't necessarily need those skills to follow this tutorial, but understanding numpy
basics, and how to "vectorize" your code, is an important skill for data scientists. Feel free to ask about resources for this if you're curious!
What is TensorFlow and what exactly does it do? Why do we need it? Why can't we do deep learning with pure Python or numpy
? To answer these questions, you need to understand the problems TensorFlow trying to solve.
Python is what's known as a dynamically typed programming language. This means that while your program is running, it stores in memory (alongside each variable) information about a variable's type. When you run x + y
in Python, for example, Python's runtime looks up x
's type, y
's type, then figures out how to add those two things together, or throws an error if it doesn't know how. (For example, try adding a list and a string.) This is called a runtime type check.
This automatic bookkeeping makes Python a pleasure to use, but also leads to inefficiencies. If we store a long list of numbers, for instance, we must allocate memory not just for the data itself but for each number's metadata (type information). If we then want to sum the list, using a loop, for instance, we need to do type checks for every addition operation we perform. This makes pure Python nearly unuseable for processing large datasets.
That's where numpy
comes in. In pure Python, a list is a Python object that holds other Python objects, each with its own metadata. The numpy
package, on the other hand, exposes a new kind of list, called a numpy
array, that holds values all of the same type. Since a numpy
array must hold values only of one type, we can store the metadata once for the whole array, instead of separately for each element.
Furthermore, since numpy
array elements are all of one type, they are all guaranteed to be the same size in memory, which allows us to store them more compactly and access them more quickly. (In pure Python, if you stored all elements "next to each other" in memory, it would be costly to calculate, say, where in memory the 100th item started, as this would depend on the sizes of each previous object. So Python actually stores elements all over the place in memory, then keeps an "index" of the memory locations of each element of the array. This means to sum 100 elements, Python needs to look in the index 100 times and go all over your RAM to find the numbers you want to add. Numpy just stores the 100 items in a row, and since they're all the same size, it's easy to calculate where the 5th or 100th or 1000th item is pretty much instantly.)
In addition to this compact array type, numpy
provides a number of operations, implemented in C, that manipulate these arrays, taking advantage of their compact representation. Arrays can be multidimensional, so when we talk about operations on arrays, that includes what you might think of as matrix operations (like fast matrix multiplication) too.
numpy
is wonderful, and enables Python programmers to work efficiently with vast amounts of data. It is the foundation of higher-level packages like pandas
and scipy
. But numpy
's design choices make certain tradeoffs; tensorflow
makes a different set of choices and accordingly has a different set of tradeoffs.
Even though single numpy
operations are blazing-fast, when composing numpy
operations, things can get a bit slower, because between each operation, you still have to go back and forth between Python's intepreter and numpy
's C code. Especially when inspecting numpy
values using, say, print
, or logging them to files, you incur a cost, because a full translation must be made between the numpy
value you are converting and the corresponding pure Python type that can interact with other Python code.
The second downside of numpy
really applies only to deep learning applications. The classic algorithm for training a deep model is (stochastic) gradient descent using backpropagation, which requires taking lots of derivatives. Before TensorFlow and other similar libraries, programmers manually (i.e., using pen and paper) did the calculus, deriving the symbolic gradient of the function to be minimized, then writing special code to take partial derivatives at an arbitrary input point. This is mechanical work that a computer should be able to do automatically. But numpy
's structure does not provide an easy way of computing these derivatives automatically. Why? Automatically computing the derivative of some formula requires having some representation of that formula in memory. But when you run numpy
operations, they simply execute and return their results; no trace is left of the steps used to get from first input to final output. There is no easy way to go back and compute derivatives later on in a program.
TensorFlow solves both these problems with the idea of a computation graph. Unlike in numpy
, where data goes back and forth between Python and C for each operation, in TensorFlow, the programmer creates, in pure Python, an object-oriented representation of the entire computation she wishes to perform. This representation is called a "graph," in the graph theory sense: the nodes in the graph are operations, and the edges represent the data flowing from one operation to the next. Building the graph is like writing down a formula. No data is actually being processed. As such, all of TensorFlow's graph-building functions are lightweight and fast, simply creating a description of computation in memory. Once the graph is complete, it is sent to TensorFlow's low-level execution algorithm. That algorithm, written (like much of numpy
) in C and C++, performs all the requested operations at once, then returns any values of interest (as specified by the user) back to the Python world.
Because an entire graph of computation is processed at once, there is much less shuttling back and forth between one representation and another. And because the computation graph is essentially an in-memory record of every step used to compute each value in your program, TensorFlow is able to do the necessary calculus automatically, computing gradients based on the structure of that graph.
Let's get started by importing TensorFlow.
In [2]:
import tensorflow as tf
Even before you call your first TensorFlow function, a lot is going on behind the scenes. For example, an empty default graph object is created. (In order to make it easier to hit the ground running with TensorFlow, and to make using the library less verbose, Google has laced the system with global state. In order to fully understand what's happening in your program, you need to know that these state variables, like "current graph" and "current scope" and "current session", exist. As we'll soon see, most of the functions you'll call in TensorFlow operate by quietly accessing this hidden state.)
We can access the default graph explicitly using tf.get_default_graph
:
In [3]:
g = tf.get_default_graph()
g
Out[3]:
It is currently empty. We can check this fact by listing the operations (nodes) in the graph:
In [4]:
g.get_operations()
Out[4]:
Let's start adding some operations to g
. An operation is a node of the computation graph. It contains only some light metadata, like "I am an addition operation, and my inputs come from these two other operations." Although Python Operation
objects don't actually do anything, we usually think of an operation in terms of what it will cause the execution engine to do after the graph has been completely built and is handed over to TensorFlow to run.
Every operation takes in some number of inputs (0 or more), and produces 0 or more outputs. Its outputs can become the inputs to other operations. Executing an operation can also have side effects, like printing something to the console, recording data to a file, or modifying a variable in memory. Again, all this computation happens after the graph has been completely built. The Operation object itself is simply a description of computation that will take place.
Perhaps the simplest operation we can create is constant
. It has zero inputs and one output. When we create a constant operation, we define what that constant output will be (this is stored as metadata on the Operation object we create). TensorFlow's tf.constant
function creates a constant operation and adds it to the default graph:
In [5]:
tf.constant(3.14)
g.get_operations()
Out[5]:
g
now has a Const
operation! Note that tf.constant
affected the graph g
, even though we didn't explicitly say we wanted the constant operation to be added to g
. It is possible to add operations to a specific, non-default graph, but most of the time, we add directly to the default graph, using functions like tf.constant
. In fact, we generally don't even call get_default_graph
to give g
a name; we just use it implicitly.
Let's examine the constant operation we created. We can use the inputs
and outputs
attributes of the operation to confirm that there really are zero inputs and one output.
In [6]:
const_operation = g.get_operations()[0]
len(const_operation.inputs), len(const_operation.outputs)
Out[6]:
Those inputs and outputs are of type Tensor
.
In [7]:
const_tensor = const_operation.outputs[0]
const_tensor
Out[7]:
A Tensor
is a lightweight Python object that represents a piece of data flowing along the edges of our graph. That data can be a scalar, a vector, a matrix, or a higher-dimensional array. (The dimensionality of a tensor t
is accessible as an attribute: t.shape
. Here, a shape of ()
, Python's empty tuple, means that t
is a scalar.) The tensor's data is not actually stored inside the Tensor object, and in fact does not exist in Python at all. t
does not know it will refer to the value 3.14
. A Tensor is just a lightweight way to reference a piece of data that will be computed by TensorFlow's execution engine.
All tensors are the outputs of some operation already in the graph. We use Tensors for two things:
To see this in action, let's create another constant operation, then an addition operation that consumes the output tensors of each constant and sums them. First, we call tf.constant
to create our second constant. As a convenience, the tf.constant
function actually returns the output tensor of the constant operation it creates:
In [8]:
another_const_tensor = tf.constant(1.414)
another_const_tensor
Out[8]:
In [9]:
g.get_operations()
Out[9]:
Now, there are two operations in the graph. TensorFlow has named them 'Const'
and 'Const_1'
, but you can also give them names yourself by passing a name
keyword argument to the tf.constant
function. Tensors are named following the formula op_name:output_number
, so Const_1:0
means "the 0th output of the Const_1 operation."
Let's create an addition operation to add the two tensors together, using tf.add
. Again, this creates an operation in the default graph, and returns a reference to the operation's output -- a tensor.
In [10]:
sum_tensor = tf.add(const_tensor, another_const_tensor)
sum_tensor
Out[10]:
In [11]:
g.get_operations()
Out[11]:
In [12]:
add_op = g.get_operations()[2]
len(add_op.inputs), len(add_op.outputs)
Out[12]:
It should make sense that add_op
has two inputs and one output.
In order to execute your computation graph, you need to create a Session
object. A Session object stores a reference to the graph to execute, some configuration information (determining, for instance, how much memory TensorFlow should allocate on your GPU), and storage for stateful components of your computation graph (which must be kept around across multiple executions of the graph). We haven't talked about any "stateful components" yet, so it's okay if that piece doesn't make sense yet. We'll see Variables, the simplest stateful component, very soon.
You can create a session using the tf.Session
constructor. It can take in a graph
parameter, but if you omit it, the session will use the default graph.
In [13]:
# This piece is only necessary so as not to use up an ungodly amount of GPU memory:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# This is the actual code creating the session. You can omit the config arg
# if you have no configuration to do.
sess = tf.Session(config=config)
In [14]:
sess.graph == g
Out[14]:
The run
method of the Session
class is used to send your graph to the computation engine and execute it. As an argument, we pass a tensor we'd like to "fetch" the value of. Based on which tensor we'd like to compute, TensorFlow will calculate a path through the graph, and then execute only the parts of the graph necessary to compute our desired tensor.
In [15]:
sess.run(sum_tensor), sess.run(const_tensor), sess.run(another_const_tensor)
Out[15]:
Above, we call sess.run
three times, which invokes the execution engine three times. There is no memoization; each time you call sess.run
, everything is computed anew. Because of this, if you want to fetch more than one tensor, it's more efficient to fetch them all in one go, by passing a list to sess.run
. You can also pass a dictionary, a tuple, a named tuple, or nested combinations of these data structures.
In [16]:
sess.run([sum_tensor, const_tensor, another_const_tensor])
Out[16]:
In [17]:
sess.run({'a': const_tensor, 'b': another_const_tensor, '[a,b,a+b]': [const_tensor, another_const_tensor, sum_tensor]})
Out[17]:
In the last example, TensorFlow created a copy of the fetches
data structure, but with each tensor replaced by its actual value, computed during execution.
When you're done with a session, you should close it, to free any resources it's keeping track of:
In [18]:
sess.close()
Sessions are more powerful than this, but to understand why, we need to talk about placeholders and variables.
Our program above is a bit rigid. If we wanted to perform the exact same addition calculation on different inputs, we would need to create a whole new trio of operations; we have no power to abstract. Placeholders fix this. A placeholder operation, just like a constant, has 0 inputs and 1 output. However, instead of fixing the output value when you define the operation, we pass the placeholder's value to sess.run
when executing the graph. This allows us to execute the same graph multiple times with different placeholder values.
To add a placeholder operation to the default graph, we use tf.placeholder
, passing in the type of the value we'd like the placeholder to hold. Valid types include tf.int8
, tf.bool
, tf.float32
, and a whole lot more. See the documentation for a complete list. Like tf.constant
, tf.placeholder
returns a tensor (the output of the placeholder). We'll make a graph for adding two floats.
In [19]:
a = tf.placeholder(tf.float32)
a
Out[19]:
In [20]:
b = tf.placeholder(tf.float32)
flexible_sum_tensor = tf.add(a, b)
In [21]:
g.get_operations()
Out[21]:
When we call sess.run
, we now pass in a second argument, feed_dict
. This is a Python dictionary in which the keys are placeholder tensors (i.e., the outputs of placeholder operations) and the values are numbers, numpy arrays, or Python lists. (Numbers, lists, and numpy arrays can all be converted automatically into a data format compatible with TensorFlow's execution engine.) Note that the keys of feed_dict
are the actual tensor objects, not strings.
In [22]:
sess = tf.Session(config=config)
sess.run(flexible_sum_tensor, feed_dict={a: 1., b: 2.})
Out[22]:
In [23]:
sess.run(flexible_sum_tensor, feed_dict={a: [1.], b: [2.]})
Out[23]:
In [24]:
sess.run(flexible_sum_tensor, feed_dict={a: [[1., 2.], [3., 4.]], b: [[5., 6.], [7., 8.]]})
Out[24]:
In [25]:
sess.close()
A few things to note:
shape
argument to tf.placeholder
.)numpy
world.As an aside, the Python arithmetic operators ($+$, $-$, $/$, $*$) are overridden for the Tensor
type. Evaluating the expression $a + b$, where $a$ and $b$ are tensors, has the side effect of adding an Add
operation to the default graph with $a$ and $b$ as inputs. The result of evaluating $a + b$ is the output tensor of that add operation. This makes it easy to add many nodes quickly. For instance,
In [26]:
distance_from_origin = tf.sqrt((a * a) + (b * b))
g.get_operations()
Out[26]:
In [27]:
distance_from_origin
Out[27]:
In [28]:
sess = tf.Session(config=config)
sess.run(distance_from_origin, feed_dict={a: 3., b: 4.})
Out[28]:
In [29]:
sess.close()
Another trick: these last three lines can be condensed to two, using Python's with
feature.
In [30]:
with tf.Session(config=config) as sess:
print(sess.run(distance_from_origin, feed_dict={a: 9., b: 12.}))
The session is closed automatically at the end of the with
block.
Note that trying to call sess.run(distance_from_origin)
without feeding in a
and b
will result in an error. Furthermore, placeholder values are not persistent across multiple calls to sess.run
, so even if you've previously provided values for a
and b
in prior calls to sess.run
, these are no longer in memory anywhere by the time you make your next sess.run
call.
One final trick for convenience. Since our graph is getting a bit crowded, we may want to clear it:
In [31]:
tf.reset_default_graph()
g.get_operations()
Out[31]:
The operations are all still there! That's because reset_default_graph
doesn't delete operations, it just creates a new graph and makes it the default. g
still refers to the old graph. We can fix this (and let the old graph be garbage-collected) by reassigning g:
In [32]:
g = tf.get_default_graph()
g.get_operations()
Out[32]:
Like constants and placeholders, variable operations take 0 inputs and produce 1 output; the big difference is that a variable is mutable and persistent across runs of your graph (within a session). Whereas a constant's value is fixed when creating the graph, and a placeholder's value is set anew each time you call sess.run
, a variable's value is set or changed while the graph is running, by side-effect-ful "assign" operations, and remembered even after sess.run
is finished. (That memory is in the Session object, which manages stateful components like variables. Calling sess.close
is necessary to free that memory.)
Let's use tf.Variable
to add a variable to our new graph, initialized to 42.
In [33]:
x = tf.Variable(42)
In [34]:
# summarize the operations now in the graph:
[(op, "{} inputs and {} output".format(len(op.inputs), len(op.outputs))) for op in g.get_operations()]
Out[34]:
Wow -- four operations were just added to the graph! Let's go through them one by one:
Variable/initial_value
, it is actually just a constant operation. Its output is a tensor that will evaluate to 42.int32_ref
, a mutable int32. You can use an int32_ref
basically anywhere you can use an int32
. But you can also use it as an input to an assign operation, which is up next.assign
op. It takes two inputs, a ref
to a mutable value (almost always the output of a Variable
op), and a value, then assigns the variable to that value. This assign
op takes its input from the Variable/initial_value
constant and the Variable
ref just discussed. When executed, it has the effect of initializing our variable to to 42. The output is another int32_ref
tensor, referring to the same mutable storage as the output from (2).identity
operation. It takes the variable ref as input and outputs a non-ref tf.int32
. This isn't necessary in most cases, where it is ok to use the tf.int32_ref
directly.$x$, the return value of the tf.Variable
constructor, is a Variable object, not a tensor, but in practice, you can use it as if it were a tensor. It will either be interpreted as the tensor output of operation (2) above -- the int32_ref
pointing to the variable's current value -- or of operation (4), depending on context.
Let's start a session and play with our variable.
In [35]:
sess = tf.Session(config=config)
In fact, let's start a second session as well. One of the roles sessions play is to keep track of variable values across executions of the graph, and this will help us visualize that.
In [36]:
sess2 = tf.Session(config=config)
If we attempt to get the value of the variable, we will get an error:
In [37]:
sess.run(x)
"Attempting to use uninitialized value Variable." In order to initialize the variable, we actually have to run the assign op that was created for us. Recall that this was the third operation in the graph.
If we want to run an operation just for its side effect, not to fetch its output, sess.run
does support passing in an op
directly anywhere you would normally pass a tensor (as part of the fetches
data structure). Here, we'll just pass it standalone:
In [38]:
sess.run(g.get_operations()[2])
In [39]:
sess.run(x)
Out[39]:
But in sess2
, $x$ is still not initialized, and running sess2.run(x)
still gives an error:
In [40]:
sess2.run(x)
Let's fix that.
In [41]:
sess2.run(x.initializer), sess2.run(x)
Out[41]:
Notice:
initializer
property that refers to the Assign
operation that gives them their initial value.None
-- even if the operation has an output.Let's return to a blank slate.
In [42]:
sess.close()
sess2.close()
In [43]:
tf.reset_default_graph()
g = tf.get_default_graph()
In machine learning, we typically compute some loss function (or error) that quantifies how poorly our model is capturing the patterns in the data. In many models, this loss function is differentiable, which means we can compute partial derivatives of our loss function with respect to the parameters of our model. These derivatives essentially tell us: as each parameter changes, how does it affect the loss? Given this information, we have an easy way to make the model better: perturb the parameters slightly in the direction that causes the loss to go down, according to the partial derivatives. We can do this repeatedly, computing new partial derivatives after each step, until we get to a local minimum. This technique is called "gradient descent."
One benefit of using a computation graph is that Tensorflow can automatically calculate these derivatives for us. More accurately, it can augment our computation graph with operations that compute any derivatives we'd like. To do this, we use the tf.gradients
function. It takes in two arguments: ys
and xs
. ys
is a tensor or list of tensors we'd like to calculate the derivatives of. xs
is a tensor or list of tensors we'd like to calculate those derivatives with respect to. When called, tf.gradients
traverses the computation graph backward from ys
to xs
, adding for each operation along that path one or more supplemental operations for computing the gradient. These individual-operation gradients can be composed using the chain rule (which tf.gradients
also takes care of). The return value of tf.gradients
is a handle to a tensor that represents the answer, $\frac{dy}{dx}$, at the end of that chain.
This will hopefully become clearer with a simple example. Let's create a small graph:
In [44]:
a = tf.constant(5, name='a')
b = tf.constant(7, name='b')
eqn = a + 2 * b
g.get_operations()
Out[44]:
Notice that Tensorflow has added a constant node, mul/x
, to represent the constant 2. Other than that, this should look as expected: an op for a
and for b
, a multiplication, and an addition. (Remember, Python's operations +
and *
have been overridden for the tensor type, and they now have side effects of adding operations to the default graph!)
Let's now calculate the derivative of eqn
with respect to b
.
In [45]:
grad = tf.gradients(eqn, b)
g.get_operations()
Out[45]:
As you can see, a lot of new nodes were added for gradient calculation. The output of the last op listed above will be our derivative, $\frac{d\text{eqn}}{db}$. tf.gradients
returns that tensor, so we don't have to grab it explicitly:
In [46]:
grad
Out[46]:
We can now execute the graph:
In [47]:
sess = tf.Session(config=config)
sess.run(grad)
Out[47]:
As expected, we get 2
, the partial derivative of eqn
with respect to b
at the point (a=5, b=7)
(at any point, actually, but TensorFlow is computing it at this specific point).
Let's look at a couple variations. If we pass multiple xs
, we get a gradient back instead of just a partial derivative -- this is a list of partial derivatives.
In [48]:
gradient_wrt_both = tf.gradients(eqn, [a, b])
sess.run(gradient_wrt_both)
Out[48]:
Let's create a second equation.
In [49]:
eqn2 = a * b + b
In [50]:
gradient_of_eqn2 = tf.gradients(eqn2, [a,b])
sess.run(gradient_of_eqn2)
Out[50]:
Remember that a = 5
and b = 7
, which is why we get the values we did above. Although you can think of what Tensorflow does as a kind of symbolic differentiation (it constructs a formula, i.e. computation graph, for computing a derivative), it always calculates gradients at a specific point.
If we pass in more than one tensor as the first argument, Tensorflow computes the sum of the derivatives of each of these tensors with respect to the x values, $\sum_{y \in \text{ys}}\frac{dy}{dx}$. In other words, computing tf.gradients([eqn, eqn2], ...)
below is not too different from tf.gradients(eqn + eqn2, ...)
, except that the latter actually adds ops to the graph to compute eqn + eqn2
, whereas the former adds ops to sum the gradients.
In [51]:
gradient_with_two_ys = tf.gradients([eqn, eqn2], [a,b])
sess.run(gradient_with_two_ys)
Out[51]:
Here, [8, 8] == [7, 6] + [1, 2]
.
In [52]:
sess.close()
In [53]:
tf.reset_default_graph()
g = tf.get_default_graph()
Why do we care about gradients? As mentioned above, being able to compute them so easily opens up the possibility of automatically implementing gradient descent, an algorithm that follows derivatives downhill to find a local minimum of some function. In machine learning, we typically want to find the model parameters that minimize some loss, or error, function that measures how bad our model is.
TensorFlow provides amazing tools for doing just that, but to make sure we understand how they work, we'll implement a very simple gradient descent algorithm ourselves. We will be doing something very simple: finding the $x$ that minimizes $x^2$. (Yes, you can do this analytically, or by thinking about it for a second. But this example will bring together some of the ideas we've been discussing about variables and gradients.)
Let's start building our simple computational graph. It will consist of a variable, $x$, which is initialized to 5. It will also contain a "loss function," $x^2$.
In [54]:
x = tf.Variable(5.0)
loss = x * x
g.get_operations()
Out[54]:
Two start with, we'll calculate our derivative manually, without even using tf.gradients
. The derivative of $x^2$ at a point is just $2x$:
In [55]:
dloss_dx = 2. * x
We will now create an operation that changes $x$ based on the current derivative of the loss. We can do this using the tf.assign
function, which creates an operation that assigns a variable to a new value. (The Assign operation has two inputs, the reference to the variable's mutable storage, and the new value we wish to assign. As mentioned in the Variables section, we can pass a Variable object as the first input and TensorFlow will substitute the correct tensor. The Assign operation has one output, the tensor for which is returned by tf.assign
, which is the new value of the variable after assignment.)
Because the derivative points in the direction in which our loss is growing fastest, we will subtract $x$'s derivative from $x$. (After all, we want the loss to shrink, not grow.) But rather than subtract the entire derivative, we will subtract a fraction of it. (We generally try not to take large steps in gradient descent; just because the derivative points us in some direction now, doesn't mean that going arbitrarily far in that direction will keep bringing us down the hill. $x^2$ provides a good example: if we are currently at $x=5$, the derivative is $10$, but going $-10$ steps to $-5$ completely steps over the valley we are trying to reach at $x=0$! By multiplying our derivative by a small step size of, say, 0.1, we can avoid this fate and progress slowly but almost-surely to our goal.)
In [56]:
step_size = 0.1
new_x_value = x - dloss_dx * step_size
assign_new_x = tf.assign(x, new_x_value)
Our graph is in place. In order to run our algorithm, we need to:
sess.run(x.initializer)
.assign_new_x
operation. (Note: assign_new_x
is actually a tensor, the output of the assignment operation. Evaluating this tensor will cause the operation to run. If we really wanted to run just the op, we could use assign_new_x.op
; remember that every tensor knows which op it is the output of.)It is essential that we do all this inside of a single session, because that session is what will keep track of the current value of x
.
In [57]:
with tf.Session(config=config) as sess:
# Initialize x:
sess.run(x.initializer)
# We will fetch the following tensors each iteration
to_fetch = {'x': x, 'loss': loss, 'derivative': assign_new_x}
for i in range(100):
fetched = sess.run(to_fetch)
# every tenth step, print our progress
if i % 10 == 0 or i+1==100:
print("Iter {}: {}".format(i, fetched))
As you can see, the value of loss
gets closer and closer to 0 as time goes on.
Now let's replace our manual derivative calculation with automatic differentiation, and convince ourselves that nothing changes:
In [58]:
tf.reset_default_graph()
g = tf.get_default_graph()
x = tf.Variable(5.0)
loss = x * x
# tf.gradients is called "gradients" and not "derivative" for a reason: it
# returns a _list_ of partial derivatives, even if you only pass in one x.
# Pull out first element (in our case, the list only has one element).
dloss_dx = tf.gradients(loss, x)[0]
new_x_value = x - dloss_dx * step_size
assign_new_x = tf.assign(x, new_x_value)
In [59]:
with tf.Session(config=config) as sess:
# Initialize x:
sess.run(x.initializer)
# We will fetch the following tensors each iteration
to_fetch = {'x': x, 'loss': loss, 'derivative': assign_new_x}
for i in range(100):
fetched = sess.run(to_fetch)
# every tenth step, print our progress
if i % 10 == 0 or i+1==100:
print("Iter {}: {}".format(i, fetched))
The algorithm we have used here is vanilla gradient descent. TensorFlow actually comes with a whole family of optimizers that we can just plug into our model. We do not need to call tf.gradients
or manually assign a variable at all; TensorFlow can create that whole portion of our computational graph.
To do this, we use one of TensorFlow's premade Optimizer classes. An Optimizer object keeps track of some parameters of the optimization algorithm (like step size, also called learning rate), and has a minimize
method that can be used to automatically construct the portions of the computation graph that perform gradient computation and gradient descent. Here's how our example from above looks with TensorFlow's GradientDescentOptimizer
:
In [60]:
tf.reset_default_graph()
g = tf.get_default_graph()
x = tf.Variable(5.0)
loss = x * x
# Create the optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
# Add optimize nodes to the graph
minimize_loss = optimizer.minimize(loss)
minimize_loss
now refers to an operation that runs a single step of gradient descent, updating x
. (By default optimizer.minimize
assumes you want to run gradient descent on every Variable
in the computation graph. If you want to change this, you can pass in a var_list
argument, specifying exactly which variables should be updated.) Note that before, assign_new_x
referred to a tensor; minimize_loss
refers to an actual operation, and sess.run
ning it will return None
. Still, it is working, and produces the exact same values as our previous two attempts:
In [61]:
with tf.Session(config=config) as sess:
# Initialize x:
sess.run(x.initializer)
# We will fetch the following tensors each iteration
to_fetch = {'x': x, 'loss': loss, 'train_op': minimize_loss}
for i in range(100):
fetched = sess.run(to_fetch)
# every tenth step, print our progress
if i % 10 == 0 or i+1==100:
print("Iter {}: {}".format(i, fetched))
There are other optimizers that implement clever variations of gradient descent. For example, here's the popular Adam optimizer:
In [62]:
adam_optimizer = tf.train.AdamOptimizer(learning_rate=0.1)
minimize_loss = adam_optimizer.minimize(loss)
g.get_operations()
Out[62]:
Notice that some of the operations Adam has added are Variables! This is because Adam keeps track of certain statistics across iterations, to implement a sort of "momentum." Because of this, before we can run our minimize_loss operation, we need to make sure Adam's variables are initialized. Rather than painstakingly initialize each one, we can use tf.global_variables_initializer()
to add an op to the graph that initializes all variables created so far.
In [63]:
initialize_all = tf.global_variables_initializer()
In [64]:
with tf.Session(config=config) as sess:
# Initialize ALL variables
sess.run(initialize_all)
# We will fetch the following tensors each iteration
to_fetch = {'x': x, 'loss': loss, 'train_op': minimize_loss}
for i in range(100):
fetched = sess.run(to_fetch)
# every tenth step, print our progress
if i % 10 == 0 or i+1==100:
print("Iter {}: {}".format(i, fetched))
As you can see, for our extremely simple problem, Adam hurts more than it helps. Its momentum feature means that the big derivatives from early on in training still have an effect even as we get close to the valley, causing us to overshoot the minimum. Running this for more than 100 iterations would eventually bring us back to 0. (An image that is sometimes used to explain momentum-based optimization approaches vs. typical gradient descent is that if normal gradient descent can be thought of as a man slowly and cautiously walking downhill, momentum-based approaches are better understood as a heavy ball rolling its way downhill. The ball will likely overshoot the bottom then have to roll back down, taking a while to settle. The benefit is that the ball can roll straight past small local minima, and stay immune to certain types of pathological terrains.)
So, how do we use all this for machine learning?
At this point, you're likely ready to move onto any other TensorFlow tutorial you find on the Internet (or in this Github repo). But in case you want to stay here, we'll work our way through a simple example: we'll create a fake dataset of temperatures and number of hospital visits in a fictional city over a number of months. (We assume that colder temperatures lead to more hospital visits.) We'll then fit a linear model to this data.
The goal of this exercise is for you to see how placeholders, variables, and gradient descent come together to enable fitting a model to some data. Although we'll use a linear model here -- i.e., we assume that the data y can be understood as a linear function of the input $x$ -- the process is the exact same when using a nonlinear model, like a neural network.
Let's begin by creating some fake data for us to train on. We'll make 1000 samples. We generate temperatures according to a normal distribution using numpy
, then we generate hospital visit numbers according to the formula, 1000 - 5*temps
, plus some random noise.
In [65]:
import numpy as np
temps = np.random.normal(55, 20, 1000)
random_noise = np.random.normal(0, 100, 1000)
hosp_visits = 1000 - 5 * temps + random_noise
plt.plot(temps[:200], hosp_visits[:200], "o")
Out[65]:
It's common to perform some data cleaning operations on our input data before attempting to use a machine learning algorithm. We'll do that here, subtracting the (empirical) mean and dividing by the standard deviation of the temperature:
In [66]:
normalized_temps = (temps - np.mean(temps)) / np.std(temps)
Now, let's divide the data into training, validation, and test sets.
In [67]:
train_X, train_y = normalized_temps[:800], hosp_visits[:800]
valid_X, valid_y = normalized_temps[800:900], hosp_visits[800:900]
test_X, test_y = normalized_temps[900:], hosp_visits[900:]
Finally, we'll create a "batch generator" for the training set. The following function is a Python generator function; instead of returning a value, it continuously yields new batches of data. When we call batch_generator
, Python creates a generator iterator, which we here call training_generator
, that we can use with Python's next
function.
In [68]:
def batch_generator(X, y, batch_size):
total_batches = len(X) // batch_size
current_batch = 0
while True:
start = batch_size * current_batch
end = start + batch_size
yield (X[start:end], y[start:end])
current_batch = (current_batch + 1) % total_batches
training_generator = batch_generator(train_X, train_y, batch_size=100)
# Later, call next(training_generator) to get a new batch of the form (X, y)
Now it's time to build our model. Let's first get a new, empty graph to work with.
In [69]:
tf.reset_default_graph()
g = tf.get_default_graph()
We then set up the major quantities in our model as follows:
X
and the hospital visit numbers y
, are represented with placeholders. This is so we can fill these values with our actual data at execution time. (You may wonder: why not just put all the data in as a constant? Typically, rather than use Gradient Descent, we use Stochastic Gradient Descent, which means that instead of taking gradients of the loss computed on all our data every iteration, we feed in a small "batch" of data to the graph each iteration of training. This is more efficient, lets us handle large datasets, and provably converges to a local minimum just like normal Gradient Descent. To use this technique, we need placeholders: each time we call sess.run
, we'll pass in different data.)y
values based on the X
values, according to our model. The loss will be computed based on this prediction and its divergence from the true y
.
In [70]:
X = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
m = tf.Variable(0.)
b = tf.Variable(0.)
predicted_y = X * m + b
To compute the loss -- a quantity measuring how bad our model is -- we use sum-of-squares formula: for each data point in the current batch of data, we subtract the real y
from our predicted_y
and square the difference; then we take the average of all these. (Taking their sum would work just as well, but by taking the average, we get a number that doesn't depend on the amount of data in a batch, which can be useful for human interpretation and for comparing models with different batch sizes.)
In [71]:
avg_loss = tf.reduce_mean(tf.squared_difference(predicted_y, y))
Now it's time to actually create our optimization op. We'll use the basic GradientDescentOptimizer
here, with a learning rate of 0.0005
. (How did we come by this number? Trying different numbers and seeing how the losses looked on the validation set. Feel free to play with this a bit more.)
In [72]:
train_one_step = tf.train.GradientDescentOptimizer(learning_rate=0.0005).minimize(avg_loss)
Finally, it's time for training. Let's add an op to the graph that initializes all variables, then start a session and run the training code.
In [73]:
init_all_vars = tf.global_variables_initializer()
In [74]:
with tf.Session(config=config) as sess:
sess.run(init_all_vars)
for i in range(5000):
X_batch, y_batch = next(training_generator)
feed_dict = {X: X_batch, y: y_batch}
_, loss, m_pred, b_pred = sess.run([train_one_step, avg_loss, m, b], feed_dict=feed_dict)
if i % 500 == 0:
validation_feed_dict = {X: valid_X, y: valid_y}
valid_loss = sess.run(avg_loss, feed_dict=validation_feed_dict)
print("Iter {}: training loss = {}, validation loss = {}, m={}, b={}".format(i, loss, valid_loss, m_pred, b_pred))
test_feed_dict = {X: test_X, y: test_y}
m_pred, b_pred, loss = sess.run([m, b, avg_loss], test_feed_dict)
print("m: {}, b: {}, test loss: {}".format(m_pred, b_pred, loss))
Because the standard deviation of the noise in our data is 100, we expect the average loss to be no better than $100^2$, which is exactly what we see here. The recovered slope and y-intercept (-100, 729) may look strange to you, but remember that we cleaned our data, which changed the meaning of these parameters. Since we divided temperature values by the standard deviation, approximately 20, a slope of -100 is actually a slope of -5 per degree. And since we subtracted the mean temperature, 55, a y-intercept of 729 is actually a y-intercept of 1004 (that is, $729 + m * \text{mean} = 729 + 5 * 55$).