In [1]:
# %load /Users/facai/Study/book_notes/preconfig.py
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
sns.set(font='SimHei', font_scale=2.5)
plt.rcParams['axes.grid'] = False
#import numpy as np
#import pandas as pd
#pd.options.display.max_rows = 20
#import sklearn
#import itertools
#import logging
#logger = logging.getLogger()
#from IPython.display import SVG
def show_image(filename, figsize=None, res_dir=True):
    if figsize:
        plt.figure(figsize=figsize)
    if res_dir:
        filename = './res/{}'.format(filename)
    plt.imshow(plt.imread(filename))
In [2]:
show_image('fig10_2.png', figsize=(12, 5))
two major advantages of unfolding the graph: the model always has the same input size regardless of the sequence length, and the same transition function $f$ with the same parameters is applied at every time step.
the unfolded graph also gives an explicit computational graph for computing gradients.
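A minimal NumPy sketch of the shared-parameter recurrence $h^t = f(h^{t-1}, x^t; \theta)$ behind the unfolded graph (all sizes and weights here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))   # hidden-to-hidden weights, shared across all t
U = rng.normal(size=(3, 2))   # input-to-hidden weights, shared across all t
b = np.zeros(3)

def step(h_prev, x_t):
    # the same transition function f(h, x; theta) at every time step
    return np.tanh(W @ h_prev + U @ x_t + b)

h = np.zeros(3)
xs = rng.normal(size=(5, 2))  # a sequence of 5 inputs
for x_t in xs:                # the unfolded graph: one copy of `step` per step
    h = step(h, x_t)
print(h.shape)  # (3,)
```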
In [6]:
# A.
show_image('fig10_3.png', figsize=(10, 8))
The total loss for a given sequence of $x$ values paired with a sequence of $y$ values would be just the sum of the losses over all the time steps:
\begin{align} &L \left ( \{x^1, \cdots, x^\tau\}, \{y^1, \cdots, y^\tau\} \right ) \\ &= \sum_t L^t \\ &= - \sum_t \log p_{\text{model}} \left ( y^t \, | \, \{x^1, \cdots, x^t\} \right ) \\ \end{align}So the back-propagation algorithm needs $O(\tau)$ running time, moving right to left through the graph, and also $O(\tau)$ memory to store the intermediate states. => back-propagation through time (BPTT): powerful but also expensive to train
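A forward-pass sketch of this summed loss (sizes and weights hypothetical): every hidden state must be kept for the backward pass, which is where the $O(\tau)$ memory cost of BPTT comes from.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_h, n_x, n_y = 4, 3, 2, 5               # hypothetical sizes
W = rng.normal(size=(n_h, n_h)); U = rng.normal(size=(n_h, n_x))
V = rng.normal(size=(n_y, n_h)); b = np.zeros(n_h); c = np.zeros(n_y)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = rng.normal(size=(T, n_x))
ys = rng.integers(0, n_y, size=T)            # target class at each step

# forward pass: O(tau) time, O(tau) memory for the stored hidden states
h, hs, loss = np.zeros(n_h), [], 0.0
for t in range(T):
    h = np.tanh(W @ h + U @ xs[t] + b)
    hs.append(h)                             # kept for the backward pass of BPTT
    p = softmax(V @ h + c)
    loss += -np.log(p[ys[t]])                # L = sum_t L^t = -sum_t log p_model(y^t | ...)
```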
In [2]:
# B.
show_image('fig10_4.png', figsize=(10, 8))
advantage of eliminating hidden-to-hidden recurrence: it decouples all the time steps (during training, the model's prediction at step $t-1$ is replaced by the ground truth $y^{t-1}$ when feeding step $t$) => teacher forcing
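A sketch of teacher forcing for a network whose only recurrence is output-to-hidden (shapes and the `one_hot` helper are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n_h, n_y = 3, 4
W = rng.normal(size=(n_h, n_y))     # the previous OUTPUT feeds the hidden state
b = np.zeros(n_h)
V = rng.normal(size=(n_y, n_h)); c = np.zeros(n_y)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def one_hot(k):
    v = np.zeros(n_y)
    v[k] = 1.0
    return v

ys = rng.integers(0, n_y, size=5)   # ground-truth output sequence

# teacher forcing: feed the ground truth y^{t-1}, not the model's own
# prediction, so the time steps decouple during training
loss = 0.0
prev = one_hot(ys[0])
for t in range(1, len(ys)):
    h = np.tanh(W @ prev + b)
    p = softmax(V @ h + c)
    loss += -np.log(p[ys[t]])
    prev = one_hot(ys[t])           # ground truth, not argmax(p)
```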
In [4]:
show_image('fig10_6.png', figsize=(10, 8))
disadvantage: when the network is later run open-loop (outputs fed back as inputs), the inputs it sees at test time can be quite different from those seen during training.
In [7]:
# C.
show_image('fig10_5.png', figsize=(10, 8))
In [8]:
show_image("formula_gradient.png", figsize=(12, 8))
In [9]:
show_image('fig10_7.png', figsize=(10, 8))
RNNs obtain the same full connectivity as above, but with an efficient parametrization, as illustrated below:
In [10]:
show_image('fig10_8.png', figsize=(10, 8))
determining the length of the sequence:
In [13]:
# fixed-length vector x, share R
show_image('fig10_9.png', figsize=(10, 8))
In [14]:
# variable-length sequence
show_image('fig10_10.png', figsize=(10, 8))
restriction: the length of both sequences $x$ and $y$ must be the same.
In [8]:
show_image("fig10_11.png", figsize=(10, 12))
In [7]:
show_image("fig10_12.png", figsize=(10, 8))
$C$: context, a vector or sequence of vectors that summarize the input sequence $X$.
limitation: if $C$ is too small, it cannot properly summarize a long sequence.
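The bottleneck is easy to see in a sketch (hypothetical sizes): however long the input sequence, the encoder compresses it into a context $C$ with a fixed number of components.

```python
import numpy as np

rng = np.random.default_rng(3)
n_h, n_x = 3, 2
We = rng.normal(size=(n_h, n_h))
Ue = rng.normal(size=(n_h, n_x))

# encoder: fold the whole input sequence into a single context vector C
h = np.zeros(n_h)
for x_t in rng.normal(size=(50, n_x)):   # a long input sequence (50 steps)
    h = np.tanh(We @ h + Ue @ x_t)
C = h                                    # fixed-size bottleneck: only n_h numbers
print(C.shape)  # (3,)
```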
In [11]:
show_image("fig10_13.png", figsize=(12, 15))
In [13]:
show_image("fig10_14.png", figsize=(5, 12))
basic problem: gradients => vanish / explode (when propagated over many stages)
\begin{align} h^t &= (W^t)^T h^0 \\ W &= Q^T \Lambda Q \\ h^t &= Q^T \Lambda^t Q h^0 \end{align}In $\Lambda^t$, eigenvalues with magnitude less than 1 decay towards 0, while those greater than 1 explode; the largest eigenvalue dominates.
=> any component of $h^0$ that is not aligned with the largest eigenvector is eventually discarded.
The problem is particular to RNNs <= they share the same weights at every time step.
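The eigenvalue argument can be checked numerically. This sketch builds a symmetric $W$ with chosen (hypothetical) eigenvalues $0.5$, $0.9$, $1.1$ and applies it 100 times:

```python
import numpy as np

rng = np.random.default_rng(4)
# symmetric W with orthogonal eigenvector matrix Q and eigenvalues lam
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
lam = np.array([0.5, 0.9, 1.1])
W = Q @ np.diag(lam) @ Q.T

h0 = rng.normal(size=3)
h = h0.copy()
for _ in range(100):        # h^t = (W^t)^T h^0
    h = W.T @ h

# components along eigenvectors with |lambda| < 1 vanish (0.5^100, 0.9^100 -> 0);
# the component along lambda = 1.1 dominates, growing like 1.1^100 ≈ 1.4e4
coeffs = Q.T @ h            # coordinates of h^t in the eigenbasis
print(np.abs(coeffs))
```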
The remaining sections describe approaches to overcoming the problem:
design a model that operates at multiple time scales => handle both the near past and the distant past => mitigate the long-term dependency problem
In [3]:
show_image("fig10_16.png", figsize=(10, 15))
GRU: gated recurrent units
a single gating unit simultaneously controls the forgetting factor and the decision to update the state unit
\begin{equation} h_i^t = u_i^{(t-1)} h_i^{(t-1)} + (1 - u_i^{(t-1)}) \sigma \left ( b_i + \sum_j U_{i, j} x_j^{(t-1)} + \sum_j W_{i, j} r_j^{(t-1)} h_j^{(t-1)} \right ) \end{equation}
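A minimal NumPy GRU cell following the update equation above, with the update gate $u$ acting as the forgetting factor and the reset gate $r$ gating which parts of $h$ feed the candidate state (parameter names `U_*`, `W_*`, `b_*` are hypothetical, one triple per gate):

```python
import numpy as np

rng = np.random.default_rng(5)
n_h, n_x = 3, 2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

U_u, W_u, b_u = rng.normal(size=(n_h, n_x)), rng.normal(size=(n_h, n_h)), np.zeros(n_h)
U_r, W_r, b_r = rng.normal(size=(n_h, n_x)), rng.normal(size=(n_h, n_h)), np.zeros(n_h)
U_h, W_h, b_h = rng.normal(size=(n_h, n_x)), rng.normal(size=(n_h, n_h)), np.zeros(n_h)

def gru_step(h, x):
    u = sigmoid(b_u + U_u @ x + W_u @ h)           # update gate u: forgetting factor
    r = sigmoid(b_r + U_r @ x + W_r @ h)           # reset gate r: which parts of h to read
    cand = sigmoid(b_h + U_h @ x + W_h @ (r * h))  # candidate state, per the formula
    return u * h + (1.0 - u) * cand                # convex combination controlled by u

h = np.zeros(n_h)
for x_t in rng.normal(size=(10, n_x)):
    h = gru_step(h, x_t)
print(h.shape)  # (3,)
```

Note that the single gate $u$ both scales the old state and weights the new candidate, which is the "simultaneous" control the text describes.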
In [6]:
show_image("fig10_18.png", figsize=(8, 12))