Understanding the vanishing gradient problem through visualization

There are reasons why deep neural networks now work very well, even though few people get promising results simply by making their networks deeper:

  • Computational power and data have grown tremendously; more complex models and faster computers make training them feasible.
  • Researchers have come to understand the difficulties associated with training a deep model.

In this tutorial, we would like to share some insights into the techniques that researchers find useful when training a deep model, using MXNet and its visualization tool -- TensorBoard.

Let's recap some of the relevant issues in training a deep model:

  • Weight initialization. If you initialize the network with small random weights and look at the gradients flowing down from the top layer, you will find they get smaller and smaller, so the first layers barely change because the gradients are too small to make a significant update. Without a chance to learn the first layers effectively, it's impossible to train a good deep model.
  • Nonlinear activation. When sigmoid or tanh is used as the activation function, the gradients likewise get smaller and smaller. Recall the formulas for the parameter update and the gradient, sketched below.
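For reference, here is a minimal sketch of the two formulas (our notation, not from the original code), where L is the loss, \eta the learning rate, and a_k the activations of layer k:

w \leftarrow w - \eta \, \frac{\partial L}{\partial w}

\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial a_n} \left( \prod_{k=2}^{n} \frac{\partial a_k}{\partial a_{k-1}} \right) \frac{\partial a_1}{\partial W_1}

Each factor \partial a_k / \partial a_{k-1} carries the derivative of the activation function; for sigmoid that derivative is at most 0.25, so the product, and therefore the update to the first layer's weights W_1, shrinks roughly geometrically with depth.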

Experiment Setting

Here we create a simple MLP for the MNIST dataset and visualize the learning process through loss/accuracy curves and gradient distributions, while varying the initialization and activation settings.

General Setting

We adopt an MLP as our model and run our experiments on the MNIST dataset. We'll visualize the weights and gradients of a layer using Monitor in MXNet and the histogram view in TensorBoard.

Network Structure

Here's the network structure:

def get_mlp(acti="relu"):
    """
    multi-layer perceptron
    """
    data = mx.symbol.Variable('data')
    fc   = mx.symbol.FullyConnected(data = data, name='fc', num_hidden=512)
    act  = mx.symbol.Activation(data = fc, name='act', act_type=acti)
    fc0  = mx.symbol.FullyConnected(data = act, name='fc0', num_hidden=256)
    act0 = mx.symbol.Activation(data = fc0, name='act0', act_type=acti)
    fc1  = mx.symbol.FullyConnected(data = act0, name='fc1', num_hidden=128)
    act1 = mx.symbol.Activation(data = fc1, name='act1', act_type=acti)
    fc2  = mx.symbol.FullyConnected(data = act1, name = 'fc2', num_hidden = 64)
    act2 = mx.symbol.Activation(data = fc2, name='act2', act_type=acti)
    fc3  = mx.symbol.FullyConnected(data = act2, name='fc3', num_hidden=32)
    act3 = mx.symbol.Activation(data = fc3, name='act3', act_type=acti)
    fc4  = mx.symbol.FullyConnected(data = act3, name='fc4', num_hidden=16)
    act4 = mx.symbol.Activation(data = fc4, name='act4', act_type=acti)
    fc5  = mx.symbol.FullyConnected(data = act4, name='fc5', num_hidden=10)
    mlp  = mx.symbol.SoftmaxOutput(data = fc5, name = 'softmax')
    return mlp

As you might have noticed, we intentionally add more layers than usual, since the vanishing gradient problem becomes more severe as the network goes deeper.

Weight Initialization

For weight initialization we compare two schemes: uniform and Xavier.

if args.init == 'uniform':
    init = mx.init.Uniform(0.1)
elif args.init == 'xavier':
    init = mx.init.Xavier(factor_type="in", magnitude=2.34)

Note that we intentionally choose a small, near-zero range for the uniform initialization.
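To get a feel for the difference between the two schemes, here is a small side calculation (not part of the training script, and based on MXNet's documented Xavier behaviour, so worth double-checking against your version): Xavier(factor_type="in", magnitude=m) draws weights uniformly from [-c, c] with c = sqrt(m / fan_in), so the range adapts to each layer's input size, whereas Uniform(0.1) uses the fixed range [-0.1, 0.1] everywhere.

import math

# Assumed behaviour: Xavier(rnd_type='uniform', factor_type='in', magnitude=m)
# samples each weight from [-c, c] with c = sqrt(m / fan_in).
for fan_in in (784, 32):   # first hidden layer vs. a narrow later layer (fc4)
    c = math.sqrt(2.34 / fan_in)
    print('fan_in = %3d -> Xavier range ~ +/- %.3f' % (fan_in, c))
print('Uniform(0.1)  -> fixed range    +/- 0.100')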

Activation Function

We compare two different activation functions: sigmoid and ReLU.

# acti = "sigmoid" or "relu"
act = mx.symbol.Activation(data=fc, name='act', act_type=acti)
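To see why this choice matters, here is a quick back-of-the-envelope numpy sketch (ours, not part of the training script): the sigmoid derivative never exceeds 0.25, so stacking several sigmoid layers multiplies the backpropagated signal by a small factor at every step, while ReLU passes the gradient through unchanged on its active paths.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 1001)
print('max sigmoid derivative :', (sigmoid(x) * (1 - sigmoid(x))).max())  # 0.25
print('max relu derivative    :', 1.0)

# Best-case attenuation contributed by the activation derivatives alone
# (ignoring the weight matrices) after stacking n layers.
for n in (1, 4, 8):
    print('%d layers: sigmoid <= %.2e, relu (active path) = %.1f' % (n, 0.25 ** n, 1.0))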

Logging with TensorBoard and Monitor

To monitor the weights and gradients of this network under different settings, we use MXNet's Monitor for logging and TensorBoard for visualization.

Usage

Here's a code snippet from train_model.py:

import mxnet as mx
import numpy as np
from tensorboard import summary
from tensorboard import FileWriter

# where to keep your TensorBoard logging file
logdir = './logs/'
summary_writer = FileWriter(logdir)

# mx.mon.Monitor's callback
def get_gradient(g):
    # flatten the gradient NDArray into a 1-D numpy array
    grad = g.asnumpy().flatten()
    # log it to TensorBoard as a histogram summary
    s = summary.histogram('fc_backward_weight', grad)
    summary_writer.add_summary(s)
    # Monitor prints the returned value: the gradient's RMS norm
    return mx.nd.norm(g) / np.sqrt(g.size)

# Log once per epoch; the pattern matches the gradient passed to the
# first fully-connected layer's weight.
mon = mx.mon.Monitor(int(args.num_examples / args.batch_size), get_gradient,
                     pattern='fc_backward_weight')

# training
model.fit(
        X                  = train,
        eval_data          = val,
        eval_metric        = eval_metrics,
        kvstore            = kv,
        monitor            = mon,
        epoch_end_callback = checkpoint)

# close summary_writer
summary_writer.close()
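If you also want the loss/accuracy curves mentioned in the experiment setting, a small batch-end callback can log them as scalar summaries. This is a hypothetical sketch, not part of train_model.py, and it assumes the same tensorboard package also exposes summary.scalar:

# Hypothetical helper (not in train_model.py): log the running training metric
# to TensorBoard after every batch, reusing the summary_writer created above.
def log_metrics_to_tensorboard(param):
    if param.eval_metric is None:
        return
    names, values = param.eval_metric.get()
    if isinstance(names, str):          # a single metric returns plain scalars
        names, values = [names], [values]
    for name, value in zip(names, values):
        s = summary.scalar(name, value)                  # assumes summary.scalar exists
        summary_writer.add_summary(s, global_step=param.nbatch)

# then pass it to fit() alongside the monitor, e.g.
# model.fit(..., monitor=mon, batch_end_callback=log_metrics_to_tensorboard)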

In [3]:
import sys
sys.path.append('./mnist/')
from train_mnist import *



What to expect?

If a setting suffers from the vanishing gradient problem, the gradients passed back from the top layers should be very close to zero, and the weights of the network should barely change/update.

Uniform and Sigmoid

# Uniform and sigmoid
args = parse_args('uniform', 'uniform_sigmoid')
data_shape = (784, )
net = get_mlp("sigmoid")

# train
train_model.fit(args, net, get_iterator(data_shape))

As you can see, the values of fc_backward_weight are extremely close to zero and hardly change across batches:

2017-01-07 15:44:38,845 Node[0] Batch:       1 fc_backward_weight             5.1907e-07    
2017-01-07 15:44:38,846 Node[0] Batch:       1 fc_backward_weight             4.2085e-07    
2017-01-07 15:44:38,847 Node[0] Batch:       1 fc_backward_weight             4.31894e-07   
2017-01-07 15:44:38,848 Node[0] Batch:       1 fc_backward_weight             5.80652e-07

2017-01-07 15:45:50,199 Node[0] Batch:    4213 fc_backward_weight             5.49988e-07   
2017-01-07 15:45:50,200 Node[0] Batch:    4213 fc_backward_weight             5.89305e-07   
2017-01-07 15:45:50,201 Node[0] Batch:    4213 fc_backward_weight             3.71941e-07   
2017-01-07 15:45:50,202 Node[0] Batch:    4213 fc_backward_weight             8.05085e-07

You might wonder why there are 4 different fc_backward_weight values for each batch: we train with 4 CPUs, so the gradient is reported once per device.

Uniform and ReLU


In [3]:
# Uniform and ReLU
args = parse_args('uniform', 'uniform_relu')
data_shape = (784, )
net = get_mlp("relu")

# train
train_model.fit(args, net, get_iterator(data_shape))


2017-03-09 20:35:06,282 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus=None, init='uniform', kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, name='uniform_relu', network='mlp', num_epochs=10, num_examples=60000, save_model_prefix=None)
./mnist/train_model.py:93: DeprecationWarning: mxnet.model.FeedForward has been deprecated. Please use mxnet.mod.Module instead.
  **model_args)
/home/baomingkun/mxnet/python/mxnet/model.py:516: DeprecationWarning: Calling initializer with init(str, NDArray) has been deprecated.please use init(mx.init.InitDesc(...), NDArray) instead.
  self.initializer(k, v)
2017-03-09 20:35:07,689 Node[0] Start training with [cpu(0), cpu(1), cpu(2), cpu(3)]
2017-03-09 20:35:09,517 Node[0] Batch:       1 fc_backward_weight             0.000190986	
2017-03-09 20:35:09,518 Node[0] Batch:       1 fc_backward_weight             0.000166287	
2017-03-09 20:35:09,519 Node[0] Batch:       1 fc_backward_weight             0.000202455	
2017-03-09 20:35:09,520 Node[0] Batch:       1 fc_backward_weight             0.00020052	
2017-03-09 20:36:12,389 Node[0] Epoch[0] Resetting Data Iterator
2017-03-09 20:36:12,392 Node[0] Epoch[0] Time cost=64.690
2017-03-09 20:36:15,195 Node[0] Epoch[0] Validation-accuracy=0.593349
2017-03-09 20:36:15,196 Node[0] Epoch[0] Validation-top_k_accuracy_5=0.961839
2017-03-09 20:36:17,003 Node[0] Batch:     469 fc_backward_weight             0.062034	
2017-03-09 20:36:17,004 Node[0] Batch:     469 fc_backward_weight             0.0879578	
2017-03-09 20:36:17,004 Node[0] Batch:     469 fc_backward_weight             0.0543833	
2017-03-09 20:36:17,005 Node[0] Batch:     469 fc_backward_weight             0.0748588	
2017-03-09 20:37:14,620 Node[0] Epoch[1] Resetting Data Iterator
2017-03-09 20:37:14,621 Node[0] Epoch[1] Time cost=59.424
2017-03-09 20:37:17,500 Node[0] Epoch[1] Validation-accuracy=0.913261
2017-03-09 20:37:17,501 Node[0] Epoch[1] Validation-top_k_accuracy_5=0.988982
2017-03-09 20:37:19,510 Node[0] Batch:     937 fc_backward_weight             0.035328	
2017-03-09 20:37:19,511 Node[0] Batch:     937 fc_backward_weight             0.0796854	
2017-03-09 20:37:19,511 Node[0] Batch:     937 fc_backward_weight             0.0719981	
2017-03-09 20:37:19,512 Node[0] Batch:     937 fc_backward_weight             0.0603604	
2017-03-09 20:38:20,607 Node[0] Epoch[2] Resetting Data Iterator
2017-03-09 20:38:20,610 Node[0] Epoch[2] Time cost=63.104
2017-03-09 20:38:24,131 Node[0] Epoch[2] Validation-accuracy=0.934595
2017-03-09 20:38:24,132 Node[0] Epoch[2] Validation-top_k_accuracy_5=0.991587
2017-03-09 20:38:26,040 Node[0] Batch:    1405 fc_backward_weight             0.04173	
2017-03-09 20:38:26,041 Node[0] Batch:    1405 fc_backward_weight             0.0424923	
2017-03-09 20:38:26,042 Node[0] Batch:    1405 fc_backward_weight             0.00410111	
2017-03-09 20:38:26,043 Node[0] Batch:    1405 fc_backward_weight             0.0284391	
2017-03-09 20:39:23,479 Node[0] Epoch[3] Resetting Data Iterator
2017-03-09 20:39:23,481 Node[0] Epoch[3] Time cost=59.348
2017-03-09 20:39:26,661 Node[0] Epoch[3] Validation-accuracy=0.944010
2017-03-09 20:39:26,670 Node[0] Epoch[3] Validation-top_k_accuracy_5=0.990986
2017-03-09 20:39:28,751 Node[0] Batch:    1873 fc_backward_weight             0.0261623	
2017-03-09 20:39:28,752 Node[0] Batch:    1873 fc_backward_weight             0.0385912	
2017-03-09 20:39:28,753 Node[0] Batch:    1873 fc_backward_weight             0.00650257	
2017-03-09 20:39:28,754 Node[0] Batch:    1873 fc_backward_weight             0.0793627	
2017-03-09 20:40:26,347 Node[0] Epoch[4] Resetting Data Iterator
2017-03-09 20:40:26,349 Node[0] Epoch[4] Time cost=59.665
2017-03-09 20:40:29,447 Node[0] Epoch[4] Validation-accuracy=0.948217
2017-03-09 20:40:29,449 Node[0] Epoch[4] Validation-top_k_accuracy_5=0.992788
2017-03-09 20:40:31,304 Node[0] Batch:    2341 fc_backward_weight             0.0250681	
2017-03-09 20:40:31,305 Node[0] Batch:    2341 fc_backward_weight             0.0261628	
2017-03-09 20:40:31,305 Node[0] Batch:    2341 fc_backward_weight             0.00609674	
2017-03-09 20:40:31,310 Node[0] Batch:    2341 fc_backward_weight             0.0240946	
2017-03-09 20:41:33,076 Node[0] Epoch[5] Resetting Data Iterator
2017-03-09 20:41:33,077 Node[0] Epoch[5] Time cost=63.626
2017-03-09 20:41:38,163 Node[0] Epoch[5] Validation-accuracy=0.950421
2017-03-09 20:41:38,164 Node[0] Epoch[5] Validation-top_k_accuracy_5=0.994191
2017-03-09 20:41:40,050 Node[0] Batch:    2809 fc_backward_weight             0.0283313	
2017-03-09 20:41:40,051 Node[0] Batch:    2809 fc_backward_weight             0.0718007	
2017-03-09 20:41:40,052 Node[0] Batch:    2809 fc_backward_weight             0.00180333	
2017-03-09 20:41:40,053 Node[0] Batch:    2809 fc_backward_weight             0.0210647	
2017-03-09 20:42:41,195 Node[0] Epoch[6] Resetting Data Iterator
2017-03-09 20:42:41,196 Node[0] Epoch[6] Time cost=63.031
2017-03-09 20:42:43,956 Node[0] Epoch[6] Validation-accuracy=0.956230
2017-03-09 20:42:43,957 Node[0] Epoch[6] Validation-top_k_accuracy_5=0.993990
2017-03-09 20:42:45,830 Node[0] Batch:    3277 fc_backward_weight             0.0261316	
2017-03-09 20:42:45,831 Node[0] Batch:    3277 fc_backward_weight             0.12104	
2017-03-09 20:42:45,832 Node[0] Batch:    3277 fc_backward_weight             0.000944085	
2017-03-09 20:42:45,832 Node[0] Batch:    3277 fc_backward_weight             0.125992	
2017-03-09 20:43:39,207 Node[0] Epoch[7] Resetting Data Iterator
2017-03-09 20:43:39,209 Node[0] Epoch[7] Time cost=55.251
2017-03-09 20:43:41,654 Node[0] Epoch[7] Validation-accuracy=0.955929
2017-03-09 20:43:41,655 Node[0] Epoch[7] Validation-top_k_accuracy_5=0.994091
2017-03-09 20:43:43,715 Node[0] Batch:    3745 fc_backward_weight             0.0842549	
2017-03-09 20:43:43,716 Node[0] Batch:    3745 fc_backward_weight             0.0456381	
2017-03-09 20:43:43,717 Node[0] Batch:    3745 fc_backward_weight             0.00103301	
2017-03-09 20:43:43,717 Node[0] Batch:    3745 fc_backward_weight             0.14231	
2017-03-09 20:44:44,473 Node[0] Epoch[8] Resetting Data Iterator
2017-03-09 20:44:44,474 Node[0] Epoch[8] Time cost=62.819
2017-03-09 20:44:47,457 Node[0] Epoch[8] Validation-accuracy=0.959836
2017-03-09 20:44:47,458 Node[0] Epoch[8] Validation-top_k_accuracy_5=0.993089
2017-03-09 20:44:49,309 Node[0] Batch:    4213 fc_backward_weight             0.0262557	
2017-03-09 20:44:49,309 Node[0] Batch:    4213 fc_backward_weight             0.00164993	
2017-03-09 20:44:49,310 Node[0] Batch:    4213 fc_backward_weight             0.000262939	
2017-03-09 20:44:49,311 Node[0] Batch:    4213 fc_backward_weight             0.0247553	
2017-03-09 20:45:48,824 Node[0] Epoch[9] Resetting Data Iterator
2017-03-09 20:45:48,833 Node[0] Epoch[9] Time cost=61.373
2017-03-09 20:45:53,051 Node[0] Epoch[9] Validation-accuracy=0.959936
2017-03-09 20:45:53,052 Node[0] Epoch[9] Validation-top_k_accuracy_5=0.995393

Even though we start from the same "poor" uniform initialization, the model still converges quickly with a proper activation function, and the gradient magnitudes are markedly larger than in the sigmoid case:

2017-01-07 15:54:12,286 Node[0] Batch:       1 fc_backward_weight             0.000267409   
2017-01-07 15:54:12,287 Node[0] Batch:       1 fc_backward_weight             0.00031988    
2017-01-07 15:54:12,288 Node[0] Batch:       1 fc_backward_weight             0.000306785   
2017-01-07 15:54:12,289 Node[0] Batch:       1 fc_backward_weight             0.000347533

2017-01-07 15:55:25,936 Node[0] Batch:    4213 fc_backward_weight             0.0226081 
2017-01-07 15:55:25,937 Node[0] Batch:    4213 fc_backward_weight             0.0039793 
2017-01-07 15:55:25,937 Node[0] Batch:    4213 fc_backward_weight             0.0306151 
2017-01-07 15:55:25,938 Node[0] Batch:    4213 fc_backward_weight             0.00818676

Xavier and Sigmoid


In [4]:
# Xavier and sigmoid
args = parse_args('xavier', 'xavier_sigmoid')
data_shape = (784, )
net = get_mlp("sigmoid")

# train
train_model.fit(args, net, get_iterator(data_shape))


2017-01-07 15:59:10,021 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus=None, init='xavier', kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, name='xavier_sigmoid', network='mlp', num_epochs=10, num_examples=60000, save_model_prefix=None)
2017-01-07 15:59:13,291 Node[0] [Deprecation Warning] mxnet.model.FeedForward has been deprecated. Please use mxnet.mod.Module instead.
2017-01-07 15:59:13,299 Node[0] Start training with [cpu(0), cpu(1), cpu(2), cpu(3)]
2017-01-07 15:59:15,909 Node[0] Batch:       1 fc_backward_weight             9.27798e-06	
2017-01-07 15:59:15,909 Node[0] Batch:       1 fc_backward_weight             8.58008e-06	
2017-01-07 15:59:15,910 Node[0] Batch:       1 fc_backward_weight             8.96261e-06	
2017-01-07 15:59:15,911 Node[0] Batch:       1 fc_backward_weight             7.33611e-06	
2017-01-07 15:59:20,779 Node[0] Epoch[0] Resetting Data Iterator
2017-01-07 15:59:20,780 Node[0] Epoch[0] Time cost=7.433
2017-01-07 15:59:21,086 Node[0] Epoch[0] Validation-accuracy=0.105769
2017-01-07 15:59:21,087 Node[0] Epoch[0] Validation-top_k_accuracy_5=0.509115
2017-01-07 15:59:23,778 Node[0] Batch:     469 fc_backward_weight             6.76125e-06	
2017-01-07 15:59:23,779 Node[0] Batch:     469 fc_backward_weight             6.54805e-06	
2017-01-07 15:59:23,780 Node[0] Batch:     469 fc_backward_weight             6.80302e-06	
2017-01-07 15:59:23,782 Node[0] Batch:     469 fc_backward_weight             7.39115e-06	
2017-01-07 15:59:29,174 Node[0] Epoch[1] Resetting Data Iterator
2017-01-07 15:59:29,175 Node[0] Epoch[1] Time cost=8.087
2017-01-07 15:59:29,477 Node[0] Epoch[1] Validation-accuracy=0.105769
2017-01-07 15:59:29,477 Node[0] Epoch[1] Validation-top_k_accuracy_5=0.504507
2017-01-07 15:59:32,143 Node[0] Batch:     937 fc_backward_weight             5.83071e-06	
2017-01-07 15:59:32,144 Node[0] Batch:     937 fc_backward_weight             5.59626e-06	
2017-01-07 15:59:32,145 Node[0] Batch:     937 fc_backward_weight             5.776e-06	
2017-01-07 15:59:32,147 Node[0] Batch:     937 fc_backward_weight             6.28738e-06	
2017-01-07 15:59:37,783 Node[0] Epoch[2] Resetting Data Iterator
2017-01-07 15:59:37,784 Node[0] Epoch[2] Time cost=8.305
2017-01-07 15:59:38,085 Node[0] Epoch[2] Validation-accuracy=0.105769
2017-01-07 15:59:38,086 Node[0] Epoch[2] Validation-top_k_accuracy_5=0.510216
2017-01-07 15:59:41,031 Node[0] Batch:    1405 fc_backward_weight             4.951e-06	
2017-01-07 15:59:41,032 Node[0] Batch:    1405 fc_backward_weight             4.72836e-06	
2017-01-07 15:59:41,033 Node[0] Batch:    1405 fc_backward_weight             4.8514e-06	
2017-01-07 15:59:41,034 Node[0] Batch:    1405 fc_backward_weight             5.26915e-06	
2017-01-07 15:59:47,042 Node[0] Epoch[3] Resetting Data Iterator
2017-01-07 15:59:47,043 Node[0] Epoch[3] Time cost=8.957
2017-01-07 15:59:47,424 Node[0] Epoch[3] Validation-accuracy=0.105769
2017-01-07 15:59:47,425 Node[0] Epoch[3] Validation-top_k_accuracy_5=0.509014
2017-01-07 15:59:50,295 Node[0] Batch:    1873 fc_backward_weight             4.22193e-06	
2017-01-07 15:59:50,296 Node[0] Batch:    1873 fc_backward_weight             4.03044e-06	
2017-01-07 15:59:50,297 Node[0] Batch:    1873 fc_backward_weight             4.11877e-06	
2017-01-07 15:59:50,298 Node[0] Batch:    1873 fc_backward_weight             4.45402e-06	
2017-01-07 15:59:56,082 Node[0] Epoch[4] Resetting Data Iterator
2017-01-07 15:59:56,083 Node[0] Epoch[4] Time cost=8.653
2017-01-07 15:59:56,378 Node[0] Epoch[4] Validation-accuracy=0.105769
2017-01-07 15:59:56,379 Node[0] Epoch[4] Validation-top_k_accuracy_5=0.509014
2017-01-07 15:59:58,837 Node[0] Batch:    2341 fc_backward_weight             3.64564e-06	
2017-01-07 15:59:58,838 Node[0] Batch:    2341 fc_backward_weight             3.48901e-06	
2017-01-07 15:59:58,839 Node[0] Batch:    2341 fc_backward_weight             3.55765e-06	
2017-01-07 15:59:58,840 Node[0] Batch:    2341 fc_backward_weight             3.82692e-06	
2017-01-07 16:00:03,458 Node[0] Epoch[5] Resetting Data Iterator
2017-01-07 16:00:03,459 Node[0] Epoch[5] Time cost=7.078
2017-01-07 16:00:03,790 Node[0] Epoch[5] Validation-accuracy=0.105769
2017-01-07 16:00:03,791 Node[0] Epoch[5] Validation-top_k_accuracy_5=0.509014
2017-01-07 16:00:06,406 Node[0] Batch:    2809 fc_backward_weight             3.19336e-06	
2017-01-07 16:00:06,407 Node[0] Batch:    2809 fc_backward_weight             3.06777e-06	
2017-01-07 16:00:06,409 Node[0] Batch:    2809 fc_backward_weight             3.12543e-06	
2017-01-07 16:00:06,410 Node[0] Batch:    2809 fc_backward_weight             3.34344e-06	
2017-01-07 16:00:12,052 Node[0] Epoch[6] Resetting Data Iterator
2017-01-07 16:00:12,053 Node[0] Epoch[6] Time cost=8.261
2017-01-07 16:00:12,352 Node[0] Epoch[6] Validation-accuracy=0.107472
2017-01-07 16:00:12,353 Node[0] Epoch[6] Validation-top_k_accuracy_5=0.509014
2017-01-07 16:00:14,968 Node[0] Batch:    3277 fc_backward_weight             2.83478e-06	
2017-01-07 16:00:14,969 Node[0] Batch:    3277 fc_backward_weight             2.73443e-06	
2017-01-07 16:00:14,970 Node[0] Batch:    3277 fc_backward_weight             2.78607e-06	
2017-01-07 16:00:14,971 Node[0] Batch:    3277 fc_backward_weight             2.9644e-06	
2017-01-07 16:00:20,252 Node[0] Epoch[7] Resetting Data Iterator
2017-01-07 16:00:20,253 Node[0] Epoch[7] Time cost=7.899
2017-01-07 16:00:20,541 Node[0] Epoch[7] Validation-accuracy=0.105970
2017-01-07 16:00:20,542 Node[0] Epoch[7] Validation-top_k_accuracy_5=0.512620
2017-01-07 16:00:23,036 Node[0] Batch:    3745 fc_backward_weight             2.54587e-06	
2017-01-07 16:00:23,037 Node[0] Batch:    3745 fc_backward_weight             2.46527e-06	
2017-01-07 16:00:23,038 Node[0] Batch:    3745 fc_backward_weight             2.51372e-06	
2017-01-07 16:00:23,039 Node[0] Batch:    3745 fc_backward_weight             2.66109e-06	
2017-01-07 16:00:27,410 Node[0] Epoch[8] Resetting Data Iterator
2017-01-07 16:00:27,411 Node[0] Epoch[8] Time cost=6.868
2017-01-07 16:00:27,718 Node[0] Epoch[8] Validation-accuracy=0.105970
2017-01-07 16:00:27,719 Node[0] Epoch[8] Validation-top_k_accuracy_5=0.512620
2017-01-07 16:00:30,358 Node[0] Batch:    4213 fc_backward_weight             2.30903e-06	
2017-01-07 16:00:30,359 Node[0] Batch:    4213 fc_backward_weight             2.24373e-06	
2017-01-07 16:00:30,360 Node[0] Batch:    4213 fc_backward_weight             2.29058e-06	
2017-01-07 16:00:30,361 Node[0] Batch:    4213 fc_backward_weight             2.41351e-06	
2017-01-07 16:00:35,874 Node[0] Epoch[9] Resetting Data Iterator
2017-01-07 16:00:35,875 Node[0] Epoch[9] Time cost=8.156
2017-01-07 16:00:36,182 Node[0] Epoch[9] Validation-accuracy=0.105970
2017-01-07 16:00:36,183 Node[0] Epoch[9] Validation-top_k_accuracy_5=0.512620
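Note that Xavier initialization alone does not rescue the sigmoid network: the gradients stay on the order of 1e-06 throughout training and the validation accuracy never moves away from chance level (about 0.106).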

Visualization

Now start using TensorBoard:

tensorboard --logdir=logs/
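By default TensorBoard serves on port 6006, so open http://localhost:6006 in your browser and look at the histogram view to compare the fc_backward_weight distributions across the different settings.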