In [ ]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
The automatic differentiation guide includes everything required to calculate gradients. This guide focuses on deeper, less common features of the tf.GradientTape API.
In [ ]:
import tensorflow as tf
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams['figure.figsize'] = (8, 6)
In the automatic differentiation guide you saw how to control which variables and tensors are watched by the tape while building the gradient calculation.
The tape also has methods to manipulate the recording.
If you wish to stop recording gradients, you can use GradientTape.stop_recording() to temporarily suspend recording.
This may be useful to reduce overhead if you do not wish to differentiate a complicated operation in the middle of your model. This could include calculating a metric or an intermediate result:
In [ ]:
x = tf.Variable(2.0)
y = tf.Variable(3.0)
with tf.GradientTape() as t:
    x_sq = x * x
    with t.stop_recording():
        y_sq = y * y
    z = x_sq + y_sq
grad = t.gradient(z, {'x': x, 'y': y})
print('dz/dx:', grad['x'])  # 2*x => 4
print('dz/dy:', grad['y'])  # None: y * y ran while recording was stopped
If you wish to start over entirely, use reset(). Simply exiting the gradient tape block and restarting is usually easier to read, but you can use reset when exiting the tape block is difficult or impossible.
In [ ]:
x = tf.Variable(2.0)
y = tf.Variable(3.0)
reset = True
with tf.GradientTape() as t:
    y_sq = y * y
    if reset:
        # Throw out all the tape recorded so far
        t.reset()
    z = x * x + y_sq
grad = t.gradient(z, {'x': x, 'y': y})
print('dz/dx:', grad['x'])  # 2*x => 4
print('dz/dy:', grad['y'])  # None: y * y was recorded before the reset
In [ ]:
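x = tf.Variable(2.0)
y = tf.Variable(3.0)
with tf.GradientTape() as t:
    y_sq = y**2
    # tf.stop_gradient blocks the gradient through y_sq only;
    # the tape keeps recording everything else.
    z = x**2 + tf.stop_gradient(y_sq)
grad = t.gradient(z, {'x': x, 'y': y})
print('dz/dx:', grad['x'])  # 2*x => 4
print('dz/dy:', grad['y'])  # None: the only path to y goes through stop_gradient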
In some cases, you may want to control exactly how gradients are calculated rather than using the default. These situations include writing a new op, and modifying a value (for example, using tf.clip_by_value or tf.math.round) without modifying the gradient.
For writing a new op, you can use tf.RegisterGradient to set up your own. See that page for details. (Note that the gradient registry is global, so change it with caution.)
For the other cases, you can use tf.custom_gradient.
Here is an example that applies tf.clip_by_norm to the intermediate gradient.
In [ ]:
# Establish an identity operation, but clip during the gradient pass
@tf.custom_gradient
def clip_gradients(y):
    def backward(dy):
        return tf.clip_by_norm(dy, 0.5)
    return y, backward
v = tf.Variable(2.0)
with tf.GradientTape() as t:
    output = clip_gradients(v * v)
# Calls "backward", which clips the upstream gradient from 1.0 to 0.5,
# so the result is 0.5 * 2*v = 2 instead of the unclipped 4.
print(t.gradient(output, v))
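The arithmetic behind that result can be checked without TensorFlow. The sketch below is plain Python; `clip_by_norm_scalar` is a hypothetical helper that mirrors what tf.clip_by_norm does for a single scalar (rescale the value when its magnitude exceeds the limit). It is not the library's implementation.

```python
def clip_by_norm_scalar(value, clip_norm):
    """Scalar stand-in for tf.clip_by_norm: rescale when |value| > clip_norm."""
    norm = abs(value)
    if norm <= clip_norm:
        return value
    return value * clip_norm / norm

v = 2.0
upstream = 1.0                                # seed gradient d(output)/d(output)
clipped = clip_by_norm_scalar(upstream, 0.5)  # backward() clips 1.0 to 0.5
local = 2.0 * v                               # d(v * v)/dv = 4.0
grad = clipped * local                        # chain rule
print(grad)  # 2.0
```

The clipping acts on the gradient flowing into the custom op, not on the final gradient with respect to `v`.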
See the tf.custom_gradient decorator for more details.
In [ ]:
x0 = tf.constant(0.0)
x1 = tf.constant(0.0)
with tf.GradientTape() as tape0, tf.GradientTape() as tape1:
    tape0.watch(x0)
    tape1.watch(x1)
    y0 = tf.math.sin(x0)
    y1 = tf.nn.sigmoid(x1)
    y = y0 + y1
    ys = tf.reduce_sum(y)
In [ ]:
tape0.gradient(ys, x0).numpy() # cos(x) => 1.0
In [ ]:
tape1.gradient(ys, x1).numpy() # sigmoid(x1)*(1-sigmoid(x1)) => 0.25
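The two expected values in the comments can be sanity-checked with central finite differences in plain Python. The helper names `sigmoid` and `central_diff` are illustrative, and the step size is an arbitrary choice.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def central_diff(f, x, h=1e-5):
    """Approximate df/dx at x with a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# ys = sin(x0) + sigmoid(x1), so each partial derivative involves one term only.
d_ys_dx0 = central_diff(math.sin, 0.0)  # should be close to cos(0) = 1.0
d_ys_dx1 = central_diff(sigmoid, 0.0)   # should be close to 0.5 * (1 - 0.5) = 0.25
print(round(d_ys_dx0, 6), round(d_ys_dx1, 6))
```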
Operations inside of the GradientTape context manager are recorded for automatic differentiation. If gradients are computed in that context, then the gradient computation is recorded as well. As a result, the exact same API works for higher-order gradients. For example:
In [ ]:
x = tf.Variable(1.0)  # Create a TensorFlow variable initialized to 1.0
with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        y = x * x * x
    # Compute the gradient inside the outer `t2` context manager
    # which means the gradient computation is differentiable as well.
    dy_dx = t1.gradient(y, x)
d2y_dx2 = t2.gradient(dy_dx, x)
print('dy_dx:', dy_dx.numpy())  # 3 * x**2 => 3.0
print('d2y_dx2:', d2y_dx2.numpy())  # 6 * x => 6.0
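As a cross-check on the two printed values, the sketch below differentiates the same cubic twice with central finite differences in plain Python. It mirrors the nesting idea (the second derivative is the derivative of the first), but it is a numerical approximation, not automatic differentiation. The function names and step size are illustrative.

```python
def f(x):
    return x * x * x

def first_derivative(x, h=1e-4):
    return (f(x + h) - f(x - h)) / (2.0 * h)

def second_derivative(x, h=1e-4):
    # Differentiate the derivative, like taking t2's gradient of t1's result.
    return (first_derivative(x + h) - first_derivative(x - h)) / (2.0 * h)

dy_dx = first_derivative(1.0)     # close to 3 * x**2 = 3.0
d2y_dx2 = second_derivative(1.0)  # close to 6 * x = 6.0
print(round(dy_dx, 4), round(d2y_dx2, 4))
```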
While the nested-tape pattern does give you the second derivative of a scalar function, it does not generalize to produce a Hessian matrix, since GradientTape.gradient only computes the gradient of a scalar. To construct a Hessian, see the Hessian example under the Jacobian section.
Nested calls to GradientTape.gradient are a good pattern when you are calculating a scalar from a gradient, and then the resulting scalar acts as a source for a second gradient calculation, as in the following example.
Many models are susceptible to "adversarial examples". This collection of techniques modifies the model's input to confuse the model's output. The simplest implementation takes a single step along the gradient of the output with respect to the input, known as the "input gradient".
One technique to increase robustness to adversarial examples is input gradient regularization, which attempts to minimize the magnitude of the input gradient. If the input gradient is small, then the change in the output should be small too.
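Below is a naive implementation of input gradient regularization. The implementation:
1. Calculates the gradient of the output with respect to the input, using an inner tape.
2. Calculates the magnitude of that input gradient.
3. Calculates the gradient of that magnitude with respect to the model.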
In [ ]:
x = tf.random.normal([7, 5])
layer = tf.keras.layers.Dense(10, activation=tf.nn.relu)
In [ ]:
with tf.GradientTape() as t2:
    # The inner tape only takes the gradient with respect to the input,
    # not the variables.
    with tf.GradientTape(watch_accessed_variables=False) as t1:
        t1.watch(x)
        y = layer(x)
        out = tf.reduce_sum(y**2)
    # 1. Calculate the input gradient.
    g1 = t1.gradient(out, x)
    # 2. Calculate the magnitude of the input gradient.
    g1_mag = tf.norm(g1)
# 3. Calculate the gradient of the magnitude with respect to the model.
dg1_mag = t2.gradient(g1_mag, layer.trainable_variables)
In [ ]:
[var.shape for var in dg1_mag]
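To see the three steps with numbers, the sketch below replaces the Dense layer with a hypothetical one-weight model, out = (w * x)**2, where each derivative can be written by hand. It is plain Python and only illustrates the arithmetic, not the tape mechanics. The values of `w` and `x` are made up.

```python
w, x = 3.0, 2.0

def input_gradient(w, x):
    # 1. d(out)/dx for out = (w * x)**2 is 2 * w**2 * x.
    return 2.0 * w**2 * x

def input_gradient_magnitude(w, x):
    # 2. For a scalar, the norm is the absolute value.
    return abs(input_gradient(w, x))

# 3. d|g|/dw written by hand: |g| = 2 * w**2 * |x|, so d|g|/dw = 4 * w * |x|.
analytic = 4.0 * w * abs(x)

# Cross-check step 3 with a central finite difference in w.
h = 1e-6
numeric = (input_gradient_magnitude(w + h, x)
           - input_gradient_magnitude(w - h, x)) / (2.0 * h)
print(analytic, round(numeric, 4))
```

Step 3 is a derivative of a derivative, which is why the TensorFlow version needs the outer tape to record the inner tape's gradient call.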
All the previous examples took the gradients of a scalar target with respect to some source tensor(s).
The Jacobian matrix represents the gradients of a vector-valued function. Each row contains the gradient of one of the vector's elements.
The GradientTape.jacobian method allows you to efficiently calculate a Jacobian matrix.
Note that:
Like gradient: The sources argument can be a tensor or a container of tensors.
Unlike gradient: The target tensor must be a single tensor.
As a first example, here is the Jacobian of a vector-target with respect to a scalar-source.
In [ ]:
x = tf.linspace(-10.0, 10.0, 200+1)
delta = tf.Variable(0.0)
with tf.GradientTape() as tape:
    y = tf.nn.sigmoid(x+delta)
dy_dx = tape.jacobian(y, delta)
When you take the Jacobian with respect to a scalar, the result has the shape of the target, and gives the gradient of each element with respect to the source:
In [ ]:
print(y.shape)
print(dy_dx.shape)
In [ ]:
plt.plot(x.numpy(), y, label='y')
plt.plot(x.numpy(), dy_dx, label='dy/dx')
plt.legend()
_ = plt.xlabel('x')
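The "one derivative per target element" structure can also be reproduced by hand. The sketch below is plain Python and uses the closed-form derivative of the sigmoid, s * (1 - s), in place of the tape. The variable names are illustrative.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# 201 evenly spaced points from -10 to 10, like tf.linspace(-10.0, 10.0, 200+1).
xs = [-10.0 + 20.0 * i / 200 for i in range(201)]
delta = 0.0

ys = [sigmoid(x + delta) for x in xs]
# d sigmoid(x + delta) / d delta = s * (1 - s), one entry per element of y.
dy_ddelta = [s * (1.0 - s) for s in ys]

print(len(ys), len(dy_ddelta))  # the Jacobian has as many entries as the target
print(max(dy_ddelta))           # the peak sits at x = 0
```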
Whether the input is scalar or tensor, GradientTape.jacobian efficiently calculates the gradient of each element of the target with respect to each element of the source(s).
For example, the output of this layer has a shape of (7, 10):
In [ ]:
x = tf.random.normal([7, 5])
layer = tf.keras.layers.Dense(10, activation=tf.nn.relu)
with tf.GradientTape(persistent=True) as tape:
    y = layer(x)
y.shape
And the layer's kernel's shape is (5, 10):
In [ ]:
layer.kernel.shape
The shape of the Jacobian of the output with respect to the kernel is those two shapes concatenated together:
In [ ]:
j = tape.jacobian(y, layer.kernel)
j.shape
If you sum over the target's dimensions, you're left with the gradient of the sum that would have been calculated by GradientTape.gradient:
In [ ]:
g = tape.gradient(y, layer.kernel)
print('g.shape:', g.shape)
j_sum = tf.reduce_sum(j, axis=[0, 1])
delta = tf.reduce_max(abs(g - j_sum)).numpy()
assert delta < 1e-3
print('delta:', delta)
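The same identity can be checked by hand on a smaller, purely linear layer. The sketch below is plain Python with nested lists and assumes y = x @ W with no activation or bias (a simplification of the Dense layer above), so the Jacobian entries are known in closed form: dy[b][o] / dW[i][k] equals x[b][i] when o == k, and 0 otherwise. The input values are made up.

```python
x = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]  # shape (batch=3, ins=2)
n_batch, n_in, n_out = 3, 2, 2

# Jacobian of y = x @ W with respect to W, indexed j[b][o][i][k].
j = [[[[x[b][i] if o == k else 0.0
        for k in range(n_out)]
       for i in range(n_in)]
      for o in range(n_out)]
     for b in range(n_batch)]

# Sum over the target's dimensions (b, o), like tf.reduce_sum(j, axis=[0, 1]).
j_sum = [[sum(j[b][o][i][k] for b in range(n_batch) for o in range(n_out))
          for k in range(n_out)]
         for i in range(n_in)]

# Gradient of sum(y) with respect to W[i][k] is the sum over b of x[b][i].
g = [[sum(x[b][i] for b in range(n_batch)) for _ in range(n_out)]
     for i in range(n_in)]

print(j_sum == g, j_sum)
```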
While tf.GradientTape doesn't give an explicit method for constructing a Hessian matrix, it's possible to build one using the GradientTape.jacobian method.
Note: For a model with N parameters, the Hessian matrix contains N**2 entries. For this and other reasons it is not practical for most models. This example is included more as a demonstration of how to use the GradientTape.jacobian method, and is not an endorsement of direct Hessian-based optimization.
A Hessian-vector product can be calculated efficiently with nested tapes, and is a much more efficient approach to second-order optimization.
In [ ]:
x = tf.random.normal([7, 5])
layer1 = tf.keras.layers.Dense(8, activation=tf.nn.relu)
layer2 = tf.keras.layers.Dense(6, activation=tf.nn.relu)
with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        x = layer1(x)
        x = layer2(x)
        loss = tf.reduce_mean(x**2)
    g = t1.gradient(loss, layer1.kernel)
h = t2.jacobian(g, layer1.kernel)
In [ ]:
print(f'layer1.kernel.shape: {layer1.kernel.shape}')
print(f'h.shape: {h.shape}')
To use this Hessian for a Newton's method step, you would first flatten out its axes into a matrix, and flatten out the gradient into a vector:
In [ ]:
n_params = tf.reduce_prod(layer1.kernel.shape)
g_vec = tf.reshape(g, [n_params, 1])
h_mat = tf.reshape(h, [n_params, n_params])
The Hessian matrix should be symmetric:
In [ ]:
def imshow_zero_center(image, **kwargs):
    lim = tf.reduce_max(abs(image))
    plt.imshow(image, vmin=-lim, vmax=lim, cmap='seismic', **kwargs)
    plt.colorbar()
In [ ]:
imshow_zero_center(h_mat)
The Newton's method update step is shown below.
In [ ]:
eps = 1e-3
eye_eps = tf.eye(h_mat.shape[0])*eps
In [ ]:
# X(k+1) = X(k) - (∇²f(X(k)))^-1 @ ∇f(X(k))
# h_mat = ∇²f(X(k))
# g_vec = ∇f(X(k))
update = tf.linalg.solve(h_mat + eye_eps, g_vec)
# Reshape the update and apply it to the variable.
_ = layer1.kernel.assign_sub(tf.reshape(update, layer1.kernel.shape))
While this is relatively simple for a single tf.Variable, applying this to a non-trivial model would require careful concatenation and slicing to produce a full Hessian across multiple variables.
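For intuition about the damped Newton update above, here is a plain-Python sketch on a hypothetical two-parameter quadratic, f(p) = 0.5 * p^T H p - b^T p, whose Hessian is the constant matrix H. The 2x2 system is solved with Cramer's rule in place of tf.linalg.solve, and the values of H, b, and the starting point are made up for illustration.

```python
H = [[4.0, 1.0],
     [1.0, 3.0]]  # symmetric, positive definite
b = [1.0, 2.0]
p = [0.0, 0.0]    # current parameters
eps = 1e-3        # damping, playing the role of eye_eps

def gradient(p):
    # grad f(p) = H p - b
    return [H[0][0] * p[0] + H[0][1] * p[1] - b[0],
            H[1][0] * p[0] + H[1][1] * p[1] - b[1]]

def solve_2x2(a, rhs):
    # Cramer's rule for the 2x2 linear system a @ u = rhs.
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(rhs[0] * a[1][1] - a[0][1] * rhs[1]) / det,
            (a[0][0] * rhs[1] - rhs[0] * a[1][0]) / det]

g = gradient(p)
damped = [[H[0][0] + eps, H[0][1]],
          [H[1][0], H[1][1] + eps]]
update = solve_2x2(damped, g)
p = [p[0] - update[0], p[1] - update[1]]  # p <- p - (H + eps*I)^-1 g

print(p, gradient(p))
```

On a quadratic, one undamped Newton step lands exactly on the minimizer H^-1 b. The small `eps` moves the result slightly away from it in exchange for a better-conditioned solve.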
In some cases, you want to take the Jacobian of each of a stack of targets with respect to a stack of sources, where the Jacobians for each target-source pair are independent.
For example, here the input x is shaped (batch, ins) and the output y is shaped (batch, outs).
In [ ]:
x = tf.random.normal([7, 5])
layer1 = tf.keras.layers.Dense(8, activation=tf.nn.elu)
layer2 = tf.keras.layers.Dense(6, activation=tf.nn.elu)
with tf.GradientTape(persistent=True, watch_accessed_variables=False) as tape:
    tape.watch(x)
    y = layer1(x)
    y = layer2(y)
y.shape
The full Jacobian of y with respect to x has a shape of (batch, outs, batch, ins), even if you only want (batch, outs, ins).
In [ ]:
j = tape.jacobian(y, x)
j.shape
If the gradients of each item in the stack are independent, then every (batch, batch) slice of this tensor is a diagonal matrix:
In [ ]:
imshow_zero_center(j[:, 0, :, 0])
_ = plt.title('A (batch, batch) slice')
In [ ]:
def plot_as_patches(j):
    # Reorder axes so the diagonals will each form a contiguous patch.
    j = tf.transpose(j, [1, 0, 3, 2])
    # Pad in between each patch.
    lim = tf.reduce_max(abs(j))
    j = tf.pad(j, [[0, 0], [1, 1], [0, 0], [1, 1]],
               constant_values=-lim)
    # Reshape to form a single image.
    s = j.shape
    j = tf.reshape(j, [s[0]*s[1], s[2]*s[3]])
    imshow_zero_center(j, extent=[-0.5, s[2]-0.5, s[0]-0.5, -0.5])
plot_as_patches(j)
_ = plt.title('All (batch, batch) slices are diagonal')
To get the desired result you can sum over the duplicate batch dimension, or else select the diagonals using tf.einsum.
In [ ]:
j_sum = tf.reduce_sum(j, axis=2)
print(j_sum.shape)
j_select = tf.einsum('bxby->bxy', j)
print(j_select.shape)
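Why both options agree is easiest to see on a case where the Jacobian is known by hand. The sketch below is plain Python and assumes a purely linear map y = x @ W applied row by row (a simplification of the two ELU layers), so dy[b][o] / dx[c][i] equals W[i][o] when b == c and 0 otherwise. The sizes and weights are made up.

```python
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]  # shape (ins=2, outs=3)
n_batch, n_in, n_out = 4, 2, 3

# Full Jacobian, indexed j[b][o][c][i] with shape (batch, outs, batch, ins).
j = [[[[W[i][o] if b == c else 0.0
        for i in range(n_in)]
       for c in range(n_batch)]
      for o in range(n_out)]
     for b in range(n_batch)]

# Option 1: sum over the duplicate batch axis, like tf.reduce_sum(j, axis=2).
j_sum = [[[sum(j[b][o][c][i] for c in range(n_batch))
           for i in range(n_in)]
          for o in range(n_out)]
         for b in range(n_batch)]

# Option 2: keep only the b == c diagonal, like tf.einsum('bxby->bxy', j).
j_select = [[[j[b][o][b][i]
              for i in range(n_in)]
             for o in range(n_out)]
            for b in range(n_batch)]

print(j_sum == j_select)
print(len(j_sum), len(j_sum[0]), len(j_sum[0][0]))  # (batch, outs, ins)
```

Because every off-diagonal (b != c) entry is zero, the sum adds nothing to the diagonal entry, so the two reductions return the same values.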
It would be much more efficient to do the calculation without the extra dimension in the first place. The GradientTape.batch_jacobian method does exactly that.
In [ ]:
jb = tape.batch_jacobian(y, x)
jb.shape
In [ ]:
error = tf.reduce_max(abs(jb - j_sum))
assert error < 1e-3
print(error.numpy())
Caution: GradientTape.batch_jacobian only verifies that the first dimension of the source and target match. It doesn't check that the gradients are actually independent. It's up to the user to ensure they only use batch_jacobian where it makes sense. For example, adding a layers.BatchNormalization layer destroys the independence, since it normalizes across the batch dimension:
In [ ]:
x = tf.random.normal([7, 5])
layer1 = tf.keras.layers.Dense(8, activation=tf.nn.elu)
bn = tf.keras.layers.BatchNormalization()
layer2 = tf.keras.layers.Dense(6, activation=tf.nn.elu)
with tf.GradientTape(persistent=True, watch_accessed_variables=False) as tape:
    tape.watch(x)
    y = layer1(x)
    y = bn(y, training=True)
    y = layer2(y)
j = tape.jacobian(y, x)
print(f'j.shape: {j.shape}')
In [ ]:
plot_as_patches(j)
_ = plt.title('These slices are not diagonal')
_ = plt.xlabel("Don't use `batch_jacobian`")
In this case batch_jacobian still runs and returns something with the expected shape, but its contents have an unclear meaning.
In [ ]:
jb = tape.batch_jacobian(y, x)
print(f'jb.shape: {jb.shape}')
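A small hand-computed case shows why the contents are hard to interpret. The sketch below is plain Python and uses mean-centering across the batch, y[b] = x[b] - mean(x), as a simplified stand-in for the normalization step (it is not everything layers.BatchNormalization computes). Its Jacobian is dy[b]/dx[c] = (1 if b == c else 0) - 1/n, so the off-diagonal entries are nonzero and the two reductions used earlier no longer agree.

```python
n = 4  # batch size, chosen arbitrarily

# Jacobian of y[b] = x[b] - mean(x) with respect to x[c].
j = [[(1.0 if b == c else 0.0) - 1.0 / n for c in range(n)]
     for b in range(n)]

diagonal = [j[b][b] for b in range(n)]  # what selecting the diagonal keeps
row_sums = [sum(row) for row in j]      # what summing over the batch axis gives

print(diagonal)  # 1 - 1/n for every example
print(row_sums)  # 0 for every example: cross-example terms cancel the diagonal
```

When examples are coupled like this, "the per-example Jacobian" is no longer a well-defined slice of the full Jacobian, which is why a result with the right shape can still be misleading.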