Basic Neural Networks

Brett Naul

UC Berkeley

About me

- cesium library for machine learning w/ time series data

- PTF/ZTF Marshal 2.0

- Deep learning / neural networks for time series data

Why neural networks?

- General set of models for solving wide variety of problems

- State of the art for many machine learning tasks

  • Computer vision
  • Speech recognition
  • Natural language processing
  • Probably something else cool since the beginning of this talk

- Hype hype hype

(Extremely) Brief history

- Inspired by neurobiology, can trace back to the 1940s(!)

(Extremely) Brief history

- For a long time, thought to have only a few niche applications

  • Handwritten digit recognition (zip codes)

- LeNet

(Extremely) Brief history

- Recent (2010-?) explosion in popularity

  • Increases in volume of data
  • Computing power / GPU computing
  • Better optimization procedures

- ImageNet: rapid increase in accuracy

Simple neural network

- "Perceptron"

- Graph structure: nodes represent values, edges functions

- Just a specific structure for defining a function $f: \mathbf{x} \mapsto \mathbf{y}$

Simple(st) neural network

- Linear regression

  • Let edges represent linear functions
  • Equivalent to linear regression model

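As a concrete (if contrived) sketch with toy data and made-up weights: a single layer of linear edges in numpy is just a matrix product plus a bias.

import numpy as np

# Toy data: 100 samples, 3 input nodes (features)
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])      # one weight per edge
y = X @ w_true + 0.3                     # bias on the output node

# "Network" parameters and forward pass = linear regression prediction
w = np.zeros(3); b = 0.0
y_hat = X @ w + b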

Simple neural network: linear classification

- Edges still linear functions

- Apply sigmoid function to output

- Logistic regression!
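
Continuing the toy sketch above: the same linear edges, with a sigmoid squashing the output to a probability, is exactly logistic regression.

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

logits = X @ w_true + 0.3          # reuse the made-up weights from the sketch above
p = sigmoid(logits)                # P(y = 1 | x): logistic regression
y_pred = (p > 0.5).astype(int)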

Simple neural network: multiclass classification

- Edges still linear functions

- Multiple outputs

- Apply softmax function (normalized exponentials) to output

- Multiclass classification!
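
A rough sketch of the multiclass version, with a hypothetical weight matrix mapping the 3 toy inputs to 4 classes:

def softmax(z):
    # Exponentiate and normalize so each row of class scores sums to 1
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

W = rng.normal(size=(3, 4))        # 3 input nodes -> 4 output nodes (classes)
b = np.zeros(4)
probs = softmax(X @ W + b)         # one probability per class per sample
y_pred = probs.argmax(axis=1)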

Activation functions

- Biological motivation

- Practical significance: introduce non-linearities

- Universal approximation theorem: with enough hidden units, one non-linear hidden layer can approximate any continuous function

- Different choices: sigmoid, tanh, ReLU


In [1]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('poster')

def sigmoid(x): return 1 / (1 + np.exp(-x))

# Plot the three most common activation functions side by side
fig, ax = plt.subplots(1, 3, figsize=(15, 6))
x = np.linspace(-4, 4, 501)
ax[0].plot(x, sigmoid(x)); ax[0].set_title("Sigmoid")
ax[1].plot(x, np.tanh(x)); ax[1].set_title("Tanh")
ax[2].plot(x, np.maximum(x, 0)); ax[2].set_title("ReLU");


Multiple layers

- Non-linear activation functions allow representation of complex functions

- More layers -> more complexity
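
A minimal sketch of stacking layers on the toy data from before (hidden size chosen arbitrarily). Note that without the ReLU, the two linear layers would collapse into a single linear map.

H = 16                                    # hidden layer size (arbitrary)
W1 = rng.normal(size=(3, H)) * 0.1; b1 = np.zeros(H)
W2 = rng.normal(size=(H, 1)) * 0.1; b2 = np.zeros(1)

hidden = np.maximum(X @ W1 + b1, 0)       # linear edges + ReLU activation
y_hat = hidden @ W2 + b2                  # second linear layer -> output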

Training neural networks

- Input training data + loss function

- Single-layer network: explicit functional form

- Minimize loss in closed form (as function of weights+bias)
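
For the toy linear model above with a squared-error loss, the optimal weights have a closed form (the normal equations); numpy's least-squares solver gives them in one call.

# Append a constant column so the bias is fitted along with the weights
X_aug = np.column_stack([X, np.ones(len(X))])
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_opt, b_opt = theta[:-1], theta[-1]      # recovers w_true and the 0.3 bias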

Training neural networks: multiple layers

- Deeper networks: can't compute optimal weights in closed form

- Alternative: numerical optimization

  • Make initial guess
  • Find direction in which loss is decreasing (compute gradient)
  • Take a step in that direction
  • Repeat
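
A minimal sketch of that loop for the toy squared-error problem from earlier (learning rate and iteration count picked arbitrarily); the cell below visualizes a single such step on a 2D surface.

w = np.zeros(3); b = 0.0
lr = 0.1                                      # step size ("learning rate")
for _ in range(500):
    y_hat = X @ w + b                         # current guess
    grad_w = 2 * X.T @ (y_hat - y) / len(y)   # gradient of the mean squared error
    grad_b = 2 * np.mean(y_hat - y)
    w -= lr * grad_w                          # step in the direction of decreasing loss
    b -= lr * grad_b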

In [2]:
from mpl_toolkits.mplot3d import Axes3D

def f(x, y):
    # Toy 2D loss surface to descend
    return (1 - x / 2 + x ** 5 + y ** 3) * np.exp(-x ** 2 - y ** 2)

def df(x, y, h=1e-3):
    # Finite-difference approximation of the gradient of f
    return np.r_[(f(x + h, y) - f(x, y)) / h,
                 (f(x, y + h) - f(x, y)) / h]

n = 256
x = np.linspace(-3, 3, n)
y = np.linspace(-2, 2, n)
X, Y = np.meshgrid(x, y)

fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(2, 1, 1, projection='3d', azim=-90)
ax.plot_surface(X, Y, f(X, Y), cmap='inferno')

# Starting point and the result of a single gradient-descent step
x0 = -0.55; y0 = -0.1
step = 1.0
ax = fig.add_subplot(2, 1, 2)
plt.contourf(X, Y, f(X, Y), levels=12, alpha=0.75, cmap='inferno')
plt.contour(X, Y, f(X, Y), levels=12, colors='black', linewidths=0.5)
ax.scatter(x0, y0, s=160., c='w', edgecolors='k', linewidths=2.5)
ax.scatter(x0 - step * df(x0, y0)[0], y0 - step * df(x0, y0)[1],
           s=160., c='w', edgecolors='k', linewidths=2.5)
ax.arrow(x0, y0, *(-step * df(x0, y0)), linewidth=2.5, head_width=0.1, color='k',
         length_includes_head=True);


Training neural networks: multiple layers

- For complex networks, even writing down the gradient is hard

- Key idea: backpropagation

- Use chain rule to compute derivatives one edge at a time

Training neural networks: multiple layers

- "Forward pass": compute $\mathbf{y}=f(\mathbf{x})$ by traversing graph starting from input

- "Backward pass": compute $\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \nabla f(\mathbf{x})$ by traversing graph starting from output and applying chain rule at each step

Backpropagation issues

- Frameworks such as TensorFlow, Theano, etc. mostly eliminate the need to compute gradients manually

- Still important to understand basics for debugging

  • Andrej Karpathy: "leaky abstraction"

- Vanishing gradients

- Exploding gradients


In [3]:
# Re-plot the activation functions: note the flat regions where gradients vanish
fig, ax = plt.subplots(1, 3, figsize=(15, 6))
x = np.linspace(-4, 4, 501)
ax[0].plot(x, sigmoid(x)); ax[0].set_title("Sigmoid")
ax[1].plot(x, np.tanh(x)); ax[1].set_title("Tanh")
ax[2].plot(x, np.maximum(x, 0)); ax[2].set_title("ReLU");
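
A crude numeric illustration of the vanishing-gradient problem (ignoring weights entirely): the sigmoid's derivative is at most 1/4, and backpropagation multiplies one such factor per layer.

def dsigmoid(x):
    s = sigmoid(x)            # sigmoid defined in the first cell above
    return s * (1 - s)

grad = 1.0
for layer in range(10):
    grad *= dsigmoid(1.0)     # one chain-rule factor per sigmoid layer
print(grad)                   # ~1e-7 after only 10 layers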


Exercise time!

"Advanced" neural networks

- Beyond perceptrons...

Types of neural networks

- "Neural Network Zoo" (Asimov Institute)

Hype

Hype...?

  • "Creating a deep learning model is, ironically, a highly manual process. Training a model takes a long time, and even for the top practitioners, it is a hit or miss affair where you don’t know whether it will work until the end. No mature tools exist to ensure models train successfully, or to ensure that the original set up is done appropriately for the data." -- J. Howard (Fast.ai; http://www.fast.ai/2016/10/07/fastai-launch/)

Convolutional neural networks

- Convolution: a sliding-window product of two functions

- Not even remotely a new idea: signal processing, differential equations, etc.

- Two-dimensional convolution: move a filter around an image and compute activations at each location

- Preserves local structure of image
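
A naive sketch of a 2D convolution on a made-up "image" (strictly speaking this is cross-correlation, i.e. the filter is not flipped, which is also what deep learning libraries compute); the toy random generator is reused from the earlier sketches.

def conv2d(image, kernel):
    # Slide the filter over the image and sum the elementwise products
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge_filter = np.array([[-1., -1., -1.],    # hand-coded edge-detection kernel
                        [-1.,  8., -1.],
                        [-1., -1., -1.]])
image = rng.normal(size=(8, 8))             # stand-in for a real image
activations = conv2d(image, edge_filter)    # 6x6 map of filter responses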

Convolutional neural networks

Convolutional filter demo

http://setosa.io/ev/image-kernels

Convolutional neural networks

- Convolutional filters for image processing aren't new either

- Key idea of deep learning: learn filters instead of hand-coding

- Humans not as good at designing features as we'd like to believe...

Convolutional neural networks

- Learned filters (AlexNet):

Convolutional neural networks

- Good: small number of parameters can identify complicated patterns

- Bad: output of convolutional network is large (image size x # of filters)

- (One) solution: pooling
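
A sketch of 2x2 max pooling applied to the filter responses from the convolution sketch above: keep only the largest activation in each 2x2 block, shrinking the output by a factor of 4.

def max_pool_2x2(feature_map):
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]    # drop odd edge rows/columns
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

pooled = max_pool_2x2(activations)                   # 6x6 -> 3x3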

Convolutional neural networks

Common approach: convolutional -> pooling -> repeat -> fully connected

Example: VGGNet

INPUT: [224x224x3]        memory:  224*224*3=150K   weights: 0
CONV3-64: [224x224x64]  memory:  224*224*64=3.2M   weights: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64]  memory:  224*224*64=3.2M   weights: (3*3*64)*64 = 36,864
POOL2: [112x112x64]  memory:  112*112*64=800K   weights: 0
CONV3-128: [112x112x128]  memory:  112*112*128=1.6M   weights: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128]  memory:  112*112*128=1.6M   weights: (3*3*128)*128 = 147,456
POOL2: [56x56x128]  memory:  56*56*128=400K   weights: 0
CONV3-256: [56x56x256]  memory:  56*56*256=800K   weights: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256]  memory:  56*56*256=800K   weights: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256]  memory:  56*56*256=800K   weights: (3*3*256)*256 = 589,824
POOL2: [28x28x256]  memory:  28*28*256=200K   weights: 0
CONV3-512: [28x28x512]  memory:  28*28*512=400K   weights: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512]  memory:  28*28*512=400K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512]  memory:  28*28*512=400K   weights: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512]  memory:  14*14*512=100K   weights: 0
CONV3-512: [14x14x512]  memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]  memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]  memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512]  memory:  7*7*512=25K  weights: 0
FC: [1x1x4096]  memory:  4096  weights: 7*7*512*4096 = 102,760,448
FC: [1x1x4096]  memory:  4096  weights: 4096*4096 = 16,777,216
FC: [1x1x1000]  memory:  1000 weights: 4096*1000 = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
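
A quick sanity check of the weight counts in the listing above (bias terms omitted, as in the listing): a 3x3 convolution's parameter count depends only on the filter size and the channel counts, not on the image size.

def conv_weights(in_channels, out_channels, k=3):
    return k * k * in_channels * out_channels

print(conv_weights(3, 64))       # 1,728       (first CONV3-64)
print(conv_weights(64, 64))      # 36,864      (second CONV3-64)
print(conv_weights(512, 512))    # 2,359,296   (the CONV3-512 layers)
print(7 * 7 * 512 * 4096)        # 102,760,448 (first fully connected layer)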

Recurrent neural networks

- Mostly discussed image data so far

- What about other types? e.g. sequence data?

Recurrent neural networks

- Designed for processing sequence data

- Loop through input sequence

- At each step:

  • Compute function of input + neuron internal state
  • Update internal state
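
A minimal sketch of a "vanilla" recurrent step on a made-up sequence (sizes and weight scales are arbitrary):

T, d_in, d_h = 20, 4, 8                       # sequence length, input size, state size
Wx = rng.normal(size=(d_in, d_h)) * 0.1       # input-to-state weights
Wh = rng.normal(size=(d_h, d_h)) * 0.1        # state-to-state weights
b_h = np.zeros(d_h)

xs = rng.normal(size=(T, d_in))               # toy input sequence
h = np.zeros(d_h)                             # initial internal state
for t in range(T):
    # Compute a function of the current input and the internal state, then update the state
    h = np.tanh(xs[t] @ Wx + h @ Wh + b_h)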

Recurrent neural networks

- Cell state gets updated at each step

- Long term dependencies are problematic

- Various other types of network designed to mitigate this (LSTM, GRU, etc.)

For details see http://colah.github.io/posts/2015-08-Understanding-LSTMs/.

Recurrent neural networks

- Excellent performance for text analysis

"Unreasonable Effectiveness of Recurrent Neural Networks" http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Deep learning frameworks

- Python(-compatible)

  • TensorFlow (Google)
  • Theano (Université de Montréal)
  • CNTK (Microsoft)
  • MXNet (Amazon+Baidu+...)
  • Keras (frontend for TensorFlow + Theano)
  • Caffe (UC Berkeley)

- Lua(...?)

  • Torch (Facebook)

Deep learning frameworks

- The above frameworks all heavily rely on GPU computing

- Recall fundamental operation of neural network training: backpropagation

  • Compute gradient of loss for each sample w/ respect to all model parameters
  • Update parameters via gradient descent (or something similar)
  • Lots of simple parallel operations -> exactly what GPUs excel at

- Computing the gradient over the entire dataset at once is usually too expensive (the data won't all fit in memory)

  • "Minibatch gradient descent"

Exercise time again!