Week 4. Training Issues

In this part, we will formally set up a simple but powerful classification network, to recogize 0-9 nubmers in MNIST dataset.

Yep, we will build a classification network and train from scratch.

We would introduce some techniques to improve your train model performance.

This part is designed and completed by Jiaxin Zhuang( zhuangjx5@mail2.sysu.edu.cn ) and Feifei Xue(xueff@mail2.sysu.edu.cn), if you have some questions about this part and you think there are still some things to do, dont't hesitate to email us or add our wechat.

Outline

Outline
1. Required modules ( If you use your own computer, Just pip install it ! )
2. Common Setup
classificatioon network
1. short introdution of MNIST
2. Define a convolutional network
Training
1. Including that define a model, loss function, metric, data-augmentation for training data
2. Pre-set hyper-parameters
3. Initialize model parameters
4. repeat over certain number of epochs
  1. Shuffle whole training data
  2. For each mini-batch data
    1. load mini-batch data
    2. compute gradient of loss over parameters
    3. update parameters with gradient descent
5. save model
Training advanced
1. l2_norm
2. dropout
3. batch_normalization
4. data augmentation
Visualizatio of training and validation phase
1. add tensorboardX to writer summary into tensorboard
2. download your file in local
3. run tensorboard in pc and open http://localhost:6666 to browse the tensorboard
Gradient
1. Gradient vanishing
2. Gradient exploding



In [1]:

    
%load_ext autoreload
%autoreload 2

1.1 Required Module

numpy: NumPy is the fundamental package for scientific computing in Python.

pytorch: End-to-end deep learning platform.

torchvision: This package consists of popular datasets, model architectures, and common image transformations for computer vision.

tensorflow: An open source machine learning framework.

tensorboard: A suite of visualization tools to make training easier to understand, debug, and optimize TensorFlow programs.

tensorboardX: Tensorboard for Pytorch.

matplotlib: It is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

1.2 Common Setup



In [1]:

    
# Load all necessary modules here, for clearness
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
# from torchvision.datasets import MNIST
import torchvision
from torchvision import transforms
from torch.optim import lr_scheduler
# from tensorboardX import SummaryWriter
from collections import OrderedDict
import matplotlib.pyplot as plt
# from tqdm import tqdm



In [2]:

    
# Whether to put data in GPU according to GPU is available or not 
# cuda = torch.cuda.is_available() 
#  In case the default gpu does not have enough space, you can choose which device to use
# torch.cuda.set_device(device) # device: id

# Since gpu in lab is not enough for your guys, we prefer to cpu computation
cuda = torch.device('cpu')

2. Classfication Model

Ww would define a simple Convolutional Neural Network to classify MNIST

2.1 Short indroduction of MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems.

The MNIST database contains 60,000 training images and 10,000 testing images. Each class has 5000 traning images and 1000 test images.

Each image is 32x32.

And they look like images below.

2.2 Define A FeedForward Neural Network

We would fefine a FeedForward Neural Network with 3 hidden layers.

Each layer is followed a activation function, we would try sigmoid and relu respectively.

For simplicity, each hidden layer has the equal neurons.

In reality, however, we would apply different amount of neurons in different hidden layers.

2.2.1 Activation Function

There are many useful activation function and you can choose one of them to use. Usually we use relu as our network function.

2.2.1.1 ReLU

Applies the rectified linear unit function element-wise

\begin{equation} ReLU(x) = max(0, x) \end{equation}

2.2.1.2 Sigmoid

Applies the element-wise function:

\begin{equation} Sigmoid(x)=\frac{1}{1+e^{-x}} \end{equation}

2.2.2 Network's Input and output

Inputs: For every batch

[batchSize, channels, height, width] -> [B,C,H,W]

Outputs: prediction scores of each images, eg. [0.001, 0.0034 ..., 0.3]

[batchSize, classes]

Network Structure

    Inputs                Linear/Function        Output
    [128, 1, 28, 28]   -> Linear(28*28, 100) -> [128, 100]  # first hidden lyaer
                       -> ReLU               -> [128, 100]  # relu activation function, may sigmoid
                       -> Linear(100, 100)   -> [128, 100]  # second hidden lyaer
                       -> ReLU               -> [128, 100]  # relu activation function, may sigmoid
                       -> Linear(100, 100)   -> [128, 100]  # third hidden lyaer
                       -> ReLU               -> [128, 100]  # relu activation function, may sigmoid
                       -> Linear(100, 10)    -> [128, 10]   # Classification Layer



In [3]:

    
class FeedForwardNeuralNetwork(nn.Module):
    """
    Inputs                Linear/Function        Output
    [128, 1, 28, 28]   -> Linear(28*28, 100) -> [128, 100]  # first hidden lyaer
                       -> ReLU               -> [128, 100]  # relu activation function, may sigmoid
                       -> Linear(100, 100)   -> [128, 100]  # second hidden lyaer
                       -> ReLU               -> [128, 100]  # relu activation function, may sigmoid
                       -> Linear(100, 100)   -> [128, 100]  # third hidden lyaer
                       -> ReLU               -> [128, 100]  # relu activation function, may sigmoid
                       -> Linear(100, 10)    -> [128, 10]   # Classification Layer                                                          
   """
    def __init__(self, input_size, hidden_size, output_size, activation_function='RELU'):
        super(FeedForwardNeuralNetwork, self).__init__()
        self.use_dropout = False
        self.use_bn = False
        self.hidden1 = nn.Linear(input_size, hidden_size)  # Linear function 1: 784 --> 100 
        self.hidden2 = nn.Linear(hidden_size, hidden_size) # Linear function 2: 100 --> 100
        self.hidden3 = nn.Linear(hidden_size, hidden_size) # Linear function 3: 100 --> 100
        # Linear function 4 (readout): 100 --> 10
        self.classification_layer = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(p=0.5) # Drop out with prob = 0.5
        self.hidden1_bn = nn.BatchNorm1d(hidden_size) # Batch Normalization 
        self.hidden2_bn = nn.BatchNorm1d(hidden_size)
        self.hidden3_bn = nn.BatchNorm1d(hidden_size)
        
        # Non-linearity
        if activation_function == 'SIGMOID':
            self.activation_function1 = nn.Sigmoid()
            self.activation_function2 = nn.Sigmoid()
            self.activation_function3 = nn.Sigmoid()
        elif activation_function == 'RELU':
            self.activation_function1 = nn.ReLU()
            self.activation_function2 = nn.ReLU()
            self.activation_function3 = nn.ReLU()
        
    def forward(self, x):
        """Defines the computation performed at every call.
           Should be overridden by all subclasses.
        Args:
            x: [batch_size, channel, height, width], input for network
        Returns:
            out: [batch_size, n_classes], output from network
        """
        
        x = x.view(x.size(0), -1) # flatten x in [128, 784]
        out = self.hidden1(x)
        out = self.activation_function1(out) # Non-linearity 1
        if self.use_bn == True:
            out = self.hidden1_bn(out)
        out = self.hidden2(out)
        out = self.activation_function2(out)
        if self.use_bn == True:
            out = self.hidden2_bn(out)
        out = self.hidden3(out)
        if self.use_bn == True:
            out = self.hidden3_bn(out)
        out = self.activation_function3(out)
        if self.use_dropout == True:
            out = self.dropout(out)
        out = self.classification_layer(out)
        return out
    
    def set_use_dropout(self, use_dropout):
        """Whether to use dropout. Auxiliary function for our exp, not necessary.
        Args:
            use_dropout: True, False
        """
        self.use_dropout = use_dropout
        
    def set_use_bn(self, use_bn):
        """Whether to use batch normalization. Auxiliary function for our exp, not necessary.
        Args:
            use_bn: True, False
        """
        self.use_bn = use_bn
        
    def get_grad(self):
        """Return average grad for hidden2, hidden3. Auxiliary function for our exp, not necessary.
        """
        hidden2_average_grad = np.mean(np.sqrt(np.square(self.hidden2.weight.grad.detach().numpy())))
        hidden3_average_grad = np.mean(np.sqrt(np.square(self.hidden3.weight.grad.detach().numpy())))
        return hidden2_average_grad, hidden3_average_grad

3. Training

We would define training function here. Additionally, hyper-parameters, loss function, metric would be included here too.

3.1 Pre-set hyper-parameters

setting hyperparameters like below

hyper paprameters include following part

learning rate: usually we start from a quite bigger lr like 1e-1, 1e-2, 1e-3, and slow lr as epoch moves.
n_epochs: training epoch must set large so model has enough time to converge. Usually, we will set a quite big epoch at the first training time.
batch_size: usually, bigger batch size mean's better usage of GPU and model would need less epoches to converge. And the exponent of 2 is used, eg. 2, 4, 8, 16, 32, 64, 128. 256.



In [5]:

    
### Hyper parameters

batch_size = 128 # batch size is 128
n_epochs = 5 # train for 5 epochs
learning_rate = 0.01 # learning rate is 0.01
input_size = 28*28 # input image has size 28x28
hidden_size = 100 # hidden neurons is 100 for each layer
output_size = 10 # classes of prediction
l2_norm = 0 # not to use l2 penalty
dropout = False # not to use
get_grad = False # not to obtain grad



In [6]:

    
# create a model object
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm)

3.2 Initialize model parameters

Pytorch provide default initialization (uniform intialization) for linear layer. But there is still some useful intialization method.

Read more about initialization from this link

    torch.nn.init.normal_
    torch.nn.init.uniform_
    torch.nn.init.constant_
    torch.nn.init.eye_
    torch.nn.init.xavier_uniform_
    torch.nn.init.xavier_normal_
    torch.nn.init.kaiming_uniform_



In [4]:

    
def show_weight_bias(model):
    """Show some weights and bias distribution every layers in model. 
       !!YOU CAN READ THIS CODE LATER!! 
    """
    # Create a figure and a set of subplots
    fig, axs = plt.subplots(2,3, sharey=False, tight_layout=True)
    
    # weight and bias for every hidden layer
    h1_w = model.hidden1.weight.detach().numpy().flatten()
    h1_b = model.hidden1.bias.detach().numpy().flatten()
    h2_w = model.hidden2.weight.detach().numpy().flatten()
    h2_b = model.hidden2.bias.detach().numpy().flatten()
    h3_w = model.hidden3.weight.detach().numpy().flatten()
    h3_b = model.hidden3.bias.detach().numpy().flatten()
    
    axs[0,0].hist(h1_w)
    axs[0,1].hist(h2_w)
    axs[0,2].hist(h3_w)
    axs[1,0].hist(h1_b)
    axs[1,1].hist(h2_b)
    axs[1,2].hist(h3_b)
    
    # set title for every sub plots
    axs[0,0].set_title('hidden1_weight')
    axs[0,1].set_title('hidden2_weight')
    axs[0,2].set_title('hidden3_weight')
    axs[1,0].set_title('hidden1_bias')
    axs[1,1].set_title('hidden2_bias')
    axs[1,2].set_title('hidden3_bias')



In [8]:

    
# Show default initialization for every hidden layer by pytorch
# it's uniform distribution 
show_weight_bias(model)









    



C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\figure.py:2366: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.
  warnings.warn("This figure includes Axes that are not compatible "



In [5]:

    
# If you want to use other intialization method, you can use code below
# and define your initialization below

def weight_bias_reset(model):
    """Custom initialization, you can use your favorable initialization method.
    """
    for m in model.modules():
        if isinstance(m, nn.Linear):
            # initialize linear layer with mean and std
            mean, std = 0, 0.1 
            
            # Initialization method
            torch.nn.init.normal_(m.weight, mean, std)
            torch.nn.init.normal_(m.bias, mean, std)
            
#             Another way to initialize
#             m.weight.data.normal_(mean, std)
#             m.bias.data.normal_(mean, std)



In [10]:

    
weight_bias_reset(model) # reset parameters for each hidden layer
show_weight_bias(model) # show weight and bias distribution, normal distribution now.

作业1

使用 torch.nn.init.constant_, torch.nn.init.xavieruniform, torch.nn.kaiming_uniform_去重写初始化函数，使用对应函数初始化模型，并且使用show_weight_bias显示模型隐藏层的参数分布。此处应该有6个cell作答。



In [11]:

    
def weight_bias_reset_constant(model):
    """Constant initalization
    """
    for m in model.modules():
        if isinstance(m, nn.Linear):
            val = 0.1
            torch.nn.init.constant_(m.weight, val)
            torch.nn.init.constant_(m.bias, val)



In [12]:

    
weight_bias_reset_constant(model)
show_weight_bias(model)



In [13]:

    
def weight_bias_reset_xavier_uniform(model):
    """xaveir_uniform, gain=1
    """
    for m in model.modules():
        if isinstance(m, nn.Linear):
            gain = 1
            torch.nn.init.xavier_uniform_(m.weight, gain)
            # torch.nn.init.xavier_uniform_(m.bias, gain)



In [14]:

    
weight_bias_reset_xavier_uniform(model)
show_weight_bias(model)



In [15]:

    
def weight_bias_reset_kaiming_uniform(model):
    """kaiming_uniform, a=0, mode='fan_in', non_linearity='relu'
    """
    for m in model.modules():
        if isinstance(m, nn.Linear):
            a = 0
            torch.nn.init.kaiming_uniform_(m.weight, a=a, mode='fan_in', nonlinearity='relu')
            # torch.nn.init.kaiming_uniform_(m.bias, a=a, mode='fan_in', nonlinearity='relu')



In [16]:

    
weight_bias_reset_kaiming_uniform(model)
show_weight_bias(model)

3.3 Repeat over certain numbers of epoch

Shuffle whole training data

shuffle
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True, **kwargs)

For each mini-batch data

load mini-batch data

for batch_idx, (data, target) in enumerate(train_loader): \
  ...

compute gradient of loss over parameters

output = net(data) # make prediction
loss = loss_fn(output, target)  # compute loss 
loss.backward() # compute gradient of loss over parameters

update parameters with gradient descent

optimzer.step() # update parameters with gradient descent

3.3.1 Shuffle whole traning data

3.3.1.1 Data Loading

Please pay attention to data augmentation.

Read more data augmentation method from this link.

torchvision.transforms.RandomVerticalFlip
torchvision.transforms.RandomHorizontalFlip
...



In [19]:

    
# define method of preprocessing data for evaluating

train_transform = transforms.Compose([
    transforms.ToTensor(), # Convert a PIL Image or numpy.ndarray to tensor.
    # Normalize a tensor image with mean 0.1307 and standard deviation 0.3081
    transforms.Normalize((0.1307,), (0.3081,))
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])



In [20]:

    
# use MNIST provided by torchvision

# torchvision.datasets provide MNIST dataset for classification

train_dataset = torchvision.datasets.MNIST(root='./data', 
                            train=True, 
                            transform=train_transform,
                            download=True)

test_dataset = torchvision.datasets.MNIST(root='./data', 
                           train=False, 
                           transform=test_transform,
                           download=False)



In [21]:

    
# pay attention to this, train_dataset doesn't load any data
# It just defined some method and store some message to preprocess data
train_dataset









    Out[21]:





Dataset MNIST
    Number of datapoints: 60000
    Split: train
    Root Location: ./data
    Transforms (if any): Compose(
                             ToTensor()
                             Normalize(mean=(0.1307,), std=(0.3081,))
                         )
    Target Transforms (if any): None



In [22]:

    
# Data loader. 

# Combines a dataset and a sampler, 
# and provides single- or multi-process iterators over the dataset.

train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=False)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)



In [21]:

    
# functions to show an image

def imshow(img):
    """show some imgs in datasets
        !!YOU CAN READ THIS CODE LATER!! """
    
    npimg = img.numpy() # convert tensor to numpy
    plt.imshow(np.transpose(npimg, (1, 2, 0))) # [channel, height, width] -> [height, width, channel]
    plt.show()



In [22]:

    
# get some random training images by batch

dataiter = iter(train_loader)
images, labels = dataiter.next() # get a batch of images

# show images
imshow(torchvision.utils.make_grid(images))









    



Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).

3.3.2 & 3.3.3 compute gradient of loss over parameters & update parameters with gradient descent



In [28]:

    
def train(train_loader, model, loss_fn, optimizer, get_grad=False):
    """train model using loss_fn and optimizer. When thid function is called, model trains for one epoch.
    Args:
        train_loader: train data
        model: prediction model
        loss_fn: loss function to judge the distance between target and outputs
        optimizer: optimize the loss function
        get_grad: True, False
    Returns:
        total_loss: loss
        average_grad2: average grad for hidden 2 in this epoch
        average_grad3: average grad for hidden 3 in this epoch
    """
    
    # set the module in training model, affecting module e.g., Dropout, BatchNorm, etc.
    model.train()
    
    total_loss = 0
    grad_2 = 0.0 # store sum(grad) for hidden 3 layer
    grad_3 = 0.0 # store sum(grad) for hidden 3 layer
    
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad() # clear gradients of all optimized torch.Tensors'
        outputs = model(data) # make predictions 
        loss = loss_fn(outputs, target) # compute loss 
        total_loss += loss.item() # accumulate every batch loss in a epoch
        loss.backward() # compute gradient of loss over parameters 
        
        if get_grad == True:
            g2, g3 = model.get_grad() # get grad for hiddern 2 and 3 layer in this batch
            grad_2 += g2 # accumulate grad for hidden 2
            grad_3 += g3 # accumulate grad for hidden 2
            
        optimizer.step() # update parameters with gradient descent 
            
    average_loss = total_loss / batch_idx # average loss in this epoch
    average_grad2 = grad_2 / batch_idx # average grad for hidden 2 in this epoch
    average_grad3 = grad_3 / batch_idx # average grad for hidden 3 in this epoch
    
    return average_loss, average_grad2, average_grad3



In [8]:

    
def evaluate(loader, model, loss_fn):
    """test model's prediction performance on loader.  
    When thid function is called, model is evaluated.
    Args:
        loader: data for evaluation
        model: prediction model
        loss_fn: loss function to judge the distance between target and outputs
    Returns:
        total_loss
        accuracy
    """
    
    # context-manager that disabled gradient computation
    with torch.no_grad():
        
        # set the module in evaluation mode
        model.eval()
        
        correct = 0.0 # account correct amount of data
        total_loss = 0  # account loss
        
        for batch_idx, (data, target) in enumerate(loader):
            outputs = model(data) # make predictions 
            # return the maximum value of each row of the input tensor in the 
            # given dimension dim, the second return vale is the index location
            # of each maxium value found(argmax)
            _, predicted = torch.max(outputs, 1)
            # Detach: Returns a new Tensor, detached from the current graph.
            #The result will never require gradient.
            correct += (predicted == target).sum().detach().numpy()
            loss = loss_fn(outputs, target)  # compute loss 
            total_loss += loss.item() # accumulate every batch loss in a epoch
            
        accuracy = correct*100.0 / len(loader.dataset) # accuracy in a epoch
        
    return total_loss, accuracy

Define function fit and use train_epoch and test_epoch



In [26]:

    
def fit(train_loader, val_loader, model, loss_fn, optimizer, n_epochs, get_grad=False):
    """train and val model here, we use train_epoch to train model and 
    val_epoch to val model prediction performance
    Args: 
        train_loader: train data
        val_loader: validation data
        model: prediction model
        loss_fn: loss function to judge the distance between target and outputs
        optimizer: optimize the loss function
        n_epochs: training epochs
        get_grad: Whether to get grad of hidden2 layer and hidden3 layer
    Returns:
        train_accs: accuracy of train n_epochs, a list
        train_losses: loss of n_epochs, a list
    """
    
    grad_2 = [] # save grad for hidden 2 every epoch
    grad_3 = [] # save grad for hidden 3 every epoch
    
    train_accs = [] # save train accuracy every epoch
    train_losses = [] # save train loss every epoch
    
    for epoch in range(n_epochs): # train for n_epochs 
        # train model on training datasets, optimize loss function and update model parameters 
        train_loss, average_grad2, average_grad3 = train(train_loader, model, loss_fn, optimizer, get_grad)
        
        # evaluate model performance on train dataset
        _, train_accuracy = evaluate(train_loader, model, loss_fn)
        message = 'Epoch: {}/{}. Train set: Average loss: {:.4f}, Accuracy: {:.4f}'.format(epoch+1, \
                                                                n_epochs, train_loss, train_accuracy)
        print(message)
    
        # save loss, accuracy, grad
        train_accs.append(train_accuracy)
        train_losses.append(train_loss)
        grad_2.append(average_grad2)
        grad_3.append(average_grad3)
    
        # evaluate model performance on val dataset
        val_loss, val_accuracy = evaluate(val_loader, model, loss_fn)
        message = 'Epoch: {}/{}. Validation set: Average loss: {:.4f}, Accuracy: {:.4f}'.format(epoch+1, \
                                                                n_epochs, val_loss, val_accuracy)
        print(message)
        
        
    # Whether to get grad for showing
    if get_grad == True:
        fig, ax = plt.subplots() # add a set of subplots to this figure
        ax.plot(grad_2, label='Gradient for Hidden 2 Layer') # plot grad 2 
        ax.plot(grad_3, label='Gradient for Hidden 3 Layer') # plot grad 3 
        plt.ylim(top=0.004)
        # place a legend on axes
        legend = ax.legend(loc='best', shadow=True, fontsize='x-large')
    
    return train_accs, train_losses



In [26]:

    
def show_curve(ys, title):
    """plot curlve for Loss and Accuacy
    
    !!YOU CAN READ THIS LATER, if you are interested
    
    Args:
        ys: loss or acc list
        title: Loss or Accuracy
    """
    x = np.array(range(len(ys)))
    y = np.array(ys)
    plt.plot(x, y, c='b')
    plt.axis()
    plt.title('{} Curve:'.format(title))
    plt.xlabel('Epoch')
    plt.ylabel('{} Value'.format(title))
    plt.show()

作业 2

运行一下fit函数，根据结束时候训练集的accuracy，回答：模型是否训练到过拟合。
使用提供的show_curve函数，画出训练的时候loss和accuracy的变化

Hints: 因为jupyter对变量有上下文关系，模型，优化器需要重新声明。可以使用以下代码进行重新定义模型和优化器。注意到此处用的是默认初始化。



In [27]:

    
### Hyper parameters

batch_size = 128 # batch size is 128
n_epochs = 5 # train for 5 epochs
learning_rate = 0.01 # learning rate is 0.01
input_size = 28*28 # input image has size 28x28
hidden_size = 100 # hidden neurons is 100 for each layer
output_size = 10 # classes of prediction
l2_norm = 0 # not to use l2 penalty
dropout = False # not to use
get_grad = False # not to obtain grad

# declare a model
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm)



In [28]:

    
train_accs, train_losses = fit(train_loader, test_loader, model, loss_fn, optimizer, n_epochs, get_grad)









    



Epoch: 1/5. Train set: Average loss: 1.7793, Accuracy: 77.2800
Epoch: 1/5. Validation set: Average loss: 64.9456, Accuracy: 78.3500
Epoch: 2/5. Train set: Average loss: 0.5580, Accuracy: 87.4033
Epoch: 2/5. Validation set: Average loss: 32.5510, Accuracy: 87.6100
Epoch: 3/5. Train set: Average loss: 0.3719, Accuracy: 89.9333
Epoch: 3/5. Validation set: Average loss: 26.0567, Accuracy: 89.8500
Epoch: 4/5. Train set: Average loss: 0.3165, Accuracy: 91.1150
Epoch: 4/5. Validation set: Average loss: 22.9635, Accuracy: 91.2200
Epoch: 5/5. Train set: Average loss: 0.2831, Accuracy: 92.0483
Epoch: 5/5. Validation set: Average loss: 20.7507, Accuracy: 92.1900

模型没有训练到过拟合，观察上面训练数据，随着代数增多，测试集的正确率并没有下降。



In [29]:

    
show_curve(train_accs, 'accuracy')
show_curve(train_losses, 'loss')

作业 3

将n_epochs设为10，观察模型是否能在训练集上达到过拟合，使用show_curve作图。
当希望模型在5个epoch内在训练集上达到过拟合，可以通过适当调整learning rate来实现。选择一个合适的learing rate，训练模型，并且使用show_curve作图，验证你的learning rate

Hints: 因为jupyter对变量有上下文关系，模型，优化器需要重新声明。可以使用以下代码进行重新定义模型和优化器。注意到此处用的是默认初始化。



In [11]:

    
### Hyper parameters

batch_size = 128 # batch size is 128
n_epochs = 5 # train for 5 epochs
learning_rate = 0.01 # learning rate is 0.01
input_size = 28*28 # input image has size 28x28
hidden_size = 100 # hidden neurons is 100 for each layer
output_size = 10 # classes of prediction
l2_norm = 0 # not to use l2 penalty
dropout = False # not to use
get_grad = False # not to obtain grad

# declare a model
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm)



In [31]:

    
# 3.1 Train
n_epochs = 10
train_accs, train_losses = fit(train_loader, test_loader, model, loss_fn, optimizer, n_epochs, get_grad)









    



Epoch: 1/10. Train set: Average loss: 1.8809, Accuracy: 73.4550
Epoch: 1/10. Validation set: Average loss: 77.8742, Accuracy: 74.1300
Epoch: 2/10. Train set: Average loss: 0.6316, Accuracy: 86.3283
Epoch: 2/10. Validation set: Average loss: 35.4519, Accuracy: 86.8600
Epoch: 3/10. Train set: Average loss: 0.4034, Accuracy: 89.2550
Epoch: 3/10. Validation set: Average loss: 27.6977, Accuracy: 89.4800
Epoch: 4/10. Train set: Average loss: 0.3335, Accuracy: 90.7450
Epoch: 4/10. Validation set: Average loss: 23.7937, Accuracy: 90.9700
Epoch: 5/10. Train set: Average loss: 0.2933, Accuracy: 91.8183
Epoch: 5/10. Validation set: Average loss: 21.2545, Accuracy: 91.8400
Epoch: 6/10. Train set: Average loss: 0.2639, Accuracy: 92.6317
Epoch: 6/10. Validation set: Average loss: 19.2905, Accuracy: 92.7200
Epoch: 7/10. Train set: Average loss: 0.2396, Accuracy: 93.3200
Epoch: 7/10. Validation set: Average loss: 17.6487, Accuracy: 93.3800
Epoch: 8/10. Train set: Average loss: 0.2189, Accuracy: 93.8767
Epoch: 8/10. Validation set: Average loss: 16.2273, Accuracy: 93.9700
Epoch: 9/10. Train set: Average loss: 0.2012, Accuracy: 94.3717
Epoch: 9/10. Validation set: Average loss: 15.0385, Accuracy: 94.3900
Epoch: 10/10. Train set: Average loss: 0.1859, Accuracy: 94.7533
Epoch: 10/10. Validation set: Average loss: 14.0058, Accuracy: 94.8900

观察数据可以发现其实10代也没有过拟合。



In [32]:

    
# 3.1 show_curve
show_curve(train_accs, 'accuracy')
show_curve(train_losses, 'loss')



In [36]:

    
# 3.2 Train
batch_size = 128 # batch size is 128
n_epochs = 5 # train for 5 epochs
learning_rate = 0.7
input_size = 28*28 # input image has size 28x28
hidden_size = 100 # hidden neurons is 100 for each layer
output_size = 10 # classes of prediction
l2_norm = 0 # not to use l2 penalty
dropout = False # not to use
get_grad = False # not to obtain grad

# declare a model
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm) 
train_accs, train_losses = fit(train_loader, test_loader, model, loss_fn, optimizer, n_epochs, get_grad)









    



Epoch: 1/5. Train set: Average loss: 1.0305, Accuracy: 87.6967
Epoch: 1/5. Validation set: Average loss: 37.3559, Accuracy: 87.7400
Epoch: 2/5. Train set: Average loss: 0.3130, Accuracy: 76.4033
Epoch: 2/5. Validation set: Average loss: 61.8865, Accuracy: 76.2900
Epoch: 3/5. Train set: Average loss: 0.2946, Accuracy: 84.5883
Epoch: 3/5. Validation set: Average loss: 37.8307, Accuracy: 84.0200
Epoch: 4/5. Train set: Average loss: 0.2273, Accuracy: 92.8750
Epoch: 4/5. Validation set: Average loss: 20.9198, Accuracy: 92.8200
Epoch: 5/5. Train set: Average loss: 0.1853, Accuracy: 92.4583
Epoch: 5/5. Validation set: Average loss: 24.2679, Accuracy: 91.5800



In [37]:

    
# 3.2 show_curve
show_curve(train_accs, 'accuracy')
show_curve(train_losses, 'loss')

3.4 save model

Pytorch provide two kinds of method to save model. We recommmend the method which only saves parameters. Because it's more feasible and dont' rely on fixed model.

When saving parameters, we not only save learnable parameters in model, but also learnable parameters in optimizer.

A common PyTorch convention is to save models using either a .pt or .pth file extension.

Read more abount save load from this link



In [38]:

    
# show parameters in model

# Print model's state_dict
print("Model's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

# Print optimizer's state_dict
print("\nOptimizer's state_dict:")
for var_name in optimizer.state_dict():
    print(var_name, "\t", optimizer.state_dict()[var_name])









    



Model's state_dict:
hidden1.weight 	 torch.Size([100, 784])
hidden1.bias 	 torch.Size([100])
hidden2.weight 	 torch.Size([100, 100])
hidden2.bias 	 torch.Size([100])
hidden3.weight 	 torch.Size([100, 100])
hidden3.bias 	 torch.Size([100])
classification_layer.weight 	 torch.Size([10, 100])
classification_layer.bias 	 torch.Size([10])
hidden1_bn.weight 	 torch.Size([100])
hidden1_bn.bias 	 torch.Size([100])
hidden1_bn.running_mean 	 torch.Size([100])
hidden1_bn.running_var 	 torch.Size([100])
hidden1_bn.num_batches_tracked 	 torch.Size([])
hidden2_bn.weight 	 torch.Size([100])
hidden2_bn.bias 	 torch.Size([100])
hidden2_bn.running_mean 	 torch.Size([100])
hidden2_bn.running_var 	 torch.Size([100])
hidden2_bn.num_batches_tracked 	 torch.Size([])
hidden3_bn.weight 	 torch.Size([100])
hidden3_bn.bias 	 torch.Size([100])
hidden3_bn.running_mean 	 torch.Size([100])
hidden3_bn.running_var 	 torch.Size([100])
hidden3_bn.num_batches_tracked 	 torch.Size([])

Optimizer's state_dict:
state 	 {}
param_groups 	 [{'lr': 0.7, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [1266842026920, 1266842027064, 1266842028576, 1266848983440, 1266845398288, 1266845400304, 1266849141168, 1266849140880, 1266849143976, 1266841813280, 1266841816160, 1266841814720, 1266841813352, 1266841815656]}]



In [13]:

    
# save model

save_path = './model.pt'
torch.save(model.state_dict(), save_path)



In [14]:

    
# load parameters from files
saved_parametes = torch.load(save_path)
print(saved_parametes)









    



OrderedDict([('hidden1.weight', tensor([[ 0.0258, -0.0232,  0.0018,  ..., -0.0117,  0.0356,  0.0333],
        [-0.0275,  0.0047, -0.0284,  ...,  0.0333, -0.0012, -0.0011],
        [ 0.0064,  0.0039,  0.0199,  ..., -0.0211, -0.0345,  0.0220],
        ...,
        [ 0.0256, -0.0188,  0.0104,  ..., -0.0167, -0.0339, -0.0092],
        [ 0.0056,  0.0050, -0.0245,  ..., -0.0089, -0.0229,  0.0224],
        [-0.0259,  0.0204,  0.0046,  ...,  0.0008, -0.0026, -0.0019]])), ('hidden1.bias', tensor([ 0.0098,  0.0252,  0.0284,  0.0171, -0.0210, -0.0213, -0.0348,  0.0310,
         0.0163,  0.0286, -0.0064, -0.0343, -0.0140, -0.0217,  0.0248,  0.0037,
        -0.0192, -0.0235, -0.0194, -0.0332, -0.0027, -0.0279,  0.0189, -0.0216,
         0.0013,  0.0047, -0.0175, -0.0333, -0.0147, -0.0054,  0.0027, -0.0028,
        -0.0065, -0.0100, -0.0231, -0.0238,  0.0065,  0.0224, -0.0174,  0.0313,
        -0.0070,  0.0087, -0.0135,  0.0139,  0.0169,  0.0287, -0.0211, -0.0246,
         0.0313, -0.0278,  0.0281,  0.0186,  0.0089,  0.0340,  0.0254,  0.0204,
         0.0049, -0.0176,  0.0020,  0.0348,  0.0214,  0.0295, -0.0297,  0.0002,
         0.0190,  0.0231, -0.0300, -0.0169,  0.0333,  0.0215, -0.0342,  0.0140,
        -0.0004, -0.0340, -0.0111, -0.0090, -0.0200, -0.0075,  0.0177, -0.0333,
         0.0357,  0.0244, -0.0191, -0.0203, -0.0259, -0.0309,  0.0027, -0.0299,
         0.0353, -0.0339, -0.0169, -0.0084,  0.0290,  0.0276,  0.0262, -0.0353,
         0.0157,  0.0071, -0.0044, -0.0052])), ('hidden2.weight', tensor([[ 0.0269, -0.0167, -0.0503,  ..., -0.0823, -0.0126, -0.0932],
        [-0.0662,  0.0524, -0.0240,  ...,  0.0445, -0.0296, -0.0347],
        [-0.0487,  0.0245, -0.0673,  ..., -0.0293,  0.0974,  0.0637],
        ...,
        [-0.0424,  0.0274,  0.0135,  ..., -0.0450, -0.0588,  0.0282],
        [ 0.0582,  0.0370,  0.0106,  ...,  0.0253,  0.0836, -0.0638],
        [-0.0394,  0.0104,  0.0761,  ...,  0.0141, -0.0320, -0.0427]])), ('hidden2.bias', tensor([ 0.0592,  0.0070,  0.0953, -0.0603, -0.0486,  0.0936, -0.0671,  0.0228,
        -0.0144, -0.0165,  0.0458, -0.0189, -0.0674, -0.0577,  0.0151, -0.0036,
        -0.0950,  0.0609, -0.0809,  0.0651,  0.0810,  0.0141,  0.0606,  0.0192,
         0.0121, -0.0034,  0.0347,  0.0026,  0.0181, -0.0831,  0.0515, -0.0516,
         0.0032,  0.0698,  0.0186, -0.0882,  0.0729, -0.0439,  0.0625, -0.0366,
         0.0818,  0.0723, -0.0210, -0.0835, -0.0800, -0.0324, -0.0341,  0.0776,
        -0.0115,  0.0528,  0.0285,  0.0975,  0.0554,  0.0867, -0.0578,  0.0194,
        -0.0977, -0.0650, -0.0717,  0.0830, -0.0516,  0.0120, -0.0323, -0.0824,
         0.0944, -0.0278, -0.0458,  0.0786,  0.0473,  0.0612,  0.0550,  0.0272,
         0.0797,  0.0641, -0.0068,  0.0940, -0.0314, -0.0158,  0.0527, -0.0832,
         0.0398,  0.0368, -0.0210,  0.0382, -0.0280, -0.0868, -0.0499,  0.0526,
         0.0970,  0.0554,  0.0864, -0.0112,  0.0247, -0.0424, -0.0817,  0.0410,
        -0.0927,  0.0870, -0.0637,  0.0483])), ('hidden3.weight', tensor([[ 0.0268,  0.0198, -0.0751,  ...,  0.0947, -0.0966,  0.0163],
        [ 0.0066, -0.0110, -0.0298,  ..., -0.0354,  0.0100,  0.0854],
        [ 0.0517, -0.0663, -0.0751,  ..., -0.0386,  0.0075,  0.0254],
        ...,
        [ 0.0311, -0.0384, -0.0566,  ...,  0.0943, -0.0168,  0.0129],
        [-0.0512, -0.0139,  0.0433,  ...,  0.0156,  0.0435, -0.0221],
        [-0.0760, -0.0418, -0.0374,  ..., -0.0043,  0.0582,  0.0111]])), ('hidden3.bias', tensor([ 0.0349,  0.0234,  0.0599,  0.0161,  0.0529, -0.0551,  0.0415,  0.0310,
        -0.0258, -0.0735,  0.0944,  0.0875, -0.0597,  0.0270, -0.0719,  0.0573,
        -0.0967, -0.0745, -0.0459,  0.0608,  0.0548, -0.0920,  0.0659,  0.0042,
        -0.0886,  0.0294, -0.0596, -0.0192,  0.0294, -0.0750,  0.0359,  0.0726,
         0.0204,  0.0065,  0.0735,  0.0534, -0.0510, -0.0781, -0.0323,  0.0005,
         0.0420,  0.0213, -0.0977,  0.0307, -0.0627, -0.0820, -0.0844,  0.0867,
        -0.0606, -0.0474,  0.0449,  0.0266,  0.0715,  0.0042,  0.0101, -0.0684,
         0.0621, -0.0996, -0.0281, -0.0435, -0.0767, -0.0258, -0.0638, -0.0750,
         0.0994, -0.0425,  0.0683,  0.0729, -0.0614, -0.0906,  0.0444, -0.0830,
         0.0234, -0.0572,  0.0712,  0.0902, -0.0351, -0.0851, -0.0601, -0.0636,
        -0.0270,  0.0111, -0.0426, -0.0483, -0.0586, -0.0443, -0.0701, -0.0761,
         0.0519, -0.0247,  0.0853, -0.0334, -0.0920, -0.0924,  0.0622, -0.0949,
        -0.0571, -0.0188,  0.0506,  0.0207])), ('classification_layer.weight', tensor([[ 0.0999, -0.0330,  0.0919, -0.0937, -0.0072,  0.0293, -0.0499,  0.0157,
          0.0629, -0.0096, -0.0544, -0.0802,  0.0011, -0.0965, -0.0324, -0.0182,
          0.0313,  0.0034,  0.0172, -0.0669,  0.0538,  0.0515, -0.0762,  0.0836,
         -0.0795, -0.0113,  0.0629,  0.0821, -0.0882, -0.0119,  0.0638,  0.0296,
         -0.0360, -0.0444,  0.0888,  0.0173, -0.0512,  0.0804, -0.0544,  0.0265,
          0.0226,  0.0667,  0.0557, -0.0267,  0.0818,  0.0339,  0.0517,  0.0652,
         -0.0621,  0.0072,  0.0189, -0.0817, -0.0283, -0.0399, -0.0774, -0.0075,
         -0.0306, -0.0092,  0.0651,  0.0624, -0.0225,  0.0829,  0.0414, -0.0426,
          0.0601, -0.0780,  0.0424, -0.0604,  0.0284,  0.0661, -0.0931, -0.0641,
          0.0549,  0.0005, -0.0698, -0.0408,  0.0301,  0.0605,  0.0960, -0.0673,
          0.0171, -0.0929, -0.0159, -0.0836, -0.0985,  0.0608,  0.0685, -0.0220,
          0.0022, -0.0987, -0.0760, -0.0300, -0.0438,  0.0257,  0.0070,  0.0621,
          0.0525,  0.0088, -0.0464, -0.0109],
        [-0.0593, -0.0421, -0.0296, -0.0293,  0.0293,  0.0439,  0.0524, -0.0911,
         -0.0491,  0.0555, -0.0649,  0.0402,  0.0715, -0.0666,  0.0569, -0.0456,
         -0.0027, -0.0116,  0.0611,  0.0450, -0.0215,  0.0414, -0.0866,  0.0606,
          0.0627,  0.0043,  0.0208, -0.0204, -0.0139,  0.0760, -0.0848, -0.0396,
         -0.0021, -0.0320, -0.0297,  0.0315,  0.0603, -0.0121,  0.0255, -0.0377,
         -0.0930, -0.0628, -0.0586,  0.0305, -0.0497,  0.0413, -0.0473,  0.0607,
         -0.0243, -0.0826,  0.0318,  0.0284,  0.0998, -0.0629, -0.0336, -0.0872,
         -0.0045,  0.0698,  0.0081, -0.0181, -0.0010,  0.0131, -0.0251,  0.0597,
         -0.0127,  0.0365,  0.0559,  0.0749, -0.0190, -0.0330, -0.0504,  0.0501,
          0.0263,  0.0757,  0.0242,  0.0245,  0.0173,  0.0600, -0.0878, -0.0653,
          0.0674, -0.0002,  0.0873, -0.0573,  0.0101,  0.0845, -0.0784,  0.0264,
          0.0914,  0.0440, -0.0262, -0.0931, -0.0565, -0.0795,  0.0342, -0.0718,
         -0.0621,  0.0366,  0.0424, -0.0188],
        [-0.0162,  0.0346,  0.0007, -0.0966,  0.0413,  0.0098, -0.0605, -0.0688,
         -0.0394,  0.0486, -0.0076, -0.0737,  0.0423,  0.0636,  0.0385,  0.0794,
          0.0106,  0.0083,  0.0708,  0.0591, -0.0904, -0.0332, -0.0394, -0.0420,
         -0.0462, -0.0221,  0.0781, -0.0371, -0.0377,  0.0083, -0.0512, -0.0327,
          0.0513, -0.0063,  0.0805,  0.0858,  0.0800, -0.0399, -0.0488,  0.0980,
         -0.0976,  0.0557, -0.0493, -0.0974,  0.0150,  0.0718, -0.0817,  0.0690,
         -0.0039, -0.0060,  0.0543, -0.0540, -0.0919, -0.0382,  0.0281,  0.0150,
         -0.0078,  0.0234, -0.0899, -0.0707,  0.0235, -0.0469,  0.0910,  0.0927,
          0.0217,  0.0408,  0.0972, -0.0064,  0.0643,  0.0966,  0.0466,  0.0169,
          0.0091,  0.0042, -0.0580,  0.0729, -0.0692, -0.0325, -0.0148, -0.0154,
         -0.0678,  0.0331,  0.0373, -0.0432,  0.0674, -0.0658,  0.0281,  0.0623,
          0.0563, -0.0562,  0.0460, -0.0227,  0.0133, -0.0573,  0.0674, -0.0108,
         -0.0550,  0.0768,  0.0031, -0.0644],
        [-0.0078,  0.0437,  0.0681, -0.0738,  0.0152,  0.0411, -0.0573,  0.0640,
         -0.0641, -0.0048, -0.0783,  0.0868,  0.0270, -0.0452,  0.0837, -0.0793,
          0.0027,  0.0782,  0.0025, -0.0106, -0.0377, -0.0991, -0.0683, -0.0768,
          0.0099, -0.0782, -0.0200, -0.0233, -0.0635,  0.0562, -0.0055, -0.0406,
          0.0154,  0.0741, -0.0826, -0.0298, -0.0346, -0.0863, -0.0861, -0.0474,
         -0.0887, -0.0665, -0.0604, -0.0763, -0.0781,  0.0171,  0.0557,  0.0373,
          0.0291, -0.0601, -0.0553,  0.0334, -0.0690, -0.0104,  0.0318, -0.0584,
          0.0182,  0.0528, -0.0258, -0.0181, -0.0862,  0.0061,  0.0868,  0.0813,
          0.0499,  0.0667,  0.0647, -0.0078,  0.0028, -0.0587,  0.0548, -0.0746,
         -0.0652,  0.0468, -0.0217,  0.0880,  0.0783,  0.0668, -0.0023, -0.0134,
         -0.0865,  0.0570,  0.0781, -0.0857, -0.0286,  0.0625, -0.0006, -0.0763,
         -0.0212, -0.0597,  0.0687,  0.0953, -0.0809,  0.0788, -0.0147, -0.0004,
         -0.0142,  0.0668, -0.0869,  0.0304],
        [ 0.0217, -0.0809,  0.0611, -0.0535, -0.0005,  0.0247,  0.0819, -0.0560,
          0.0678,  0.0032, -0.0574, -0.0395,  0.0732, -0.0426,  0.0461, -0.0425,
         -0.0018, -0.0938,  0.0871,  0.0940, -0.0589,  0.0929,  0.0326,  0.0755,
          0.0698, -0.0433, -0.0665, -0.0833, -0.0145,  0.0989,  0.0729, -0.0405,
         -0.0392, -0.0570,  0.0435, -0.0607,  0.0135,  0.0050,  0.0562, -0.0702,
         -0.0638, -0.0077,  0.0871, -0.0300,  0.0676, -0.0226, -0.0796,  0.0659,
          0.0244,  0.0273, -0.0039,  0.0164, -0.0179, -0.0431, -0.0850, -0.0924,
         -0.0942, -0.0046, -0.0348,  0.0507,  0.0678, -0.0052, -0.0495, -0.0197,
          0.0627,  0.0812,  0.0084, -0.0836,  0.0594,  0.0798, -0.0657, -0.0852,
          0.0957, -0.0937,  0.0038,  0.0043, -0.0952, -0.0675,  0.0496, -0.0886,
         -0.0384, -0.0230,  0.0768,  0.0993,  0.0292, -0.0231, -0.0329, -0.0511,
         -0.0500, -0.0050, -0.0131, -0.0615, -0.0381,  0.0440,  0.0836, -0.0812,
         -0.0229, -0.0507, -0.0257,  0.0769],
        [ 0.0122,  0.0233, -0.0284, -0.0904, -0.0181, -0.0030,  0.0180,  0.0276,
          0.0581,  0.0201, -0.0386, -0.0265,  0.0390,  0.0727, -0.0418, -0.0565,
         -0.0634,  0.0989,  0.0549,  0.0613, -0.0131, -0.0086,  0.0198, -0.0440,
          0.0171,  0.0725,  0.0874,  0.0506,  0.0244,  0.0519,  0.0818,  0.0716,
         -0.0484, -0.0416,  0.0604, -0.0445, -0.0591, -0.0819, -0.0007, -0.0205,
          0.0442,  0.0178,  0.0503,  0.0514, -0.0133,  0.0865,  0.0917, -0.0953,
          0.0186, -0.0964, -0.0152,  0.0342,  0.0864, -0.0413, -0.0749,  0.0250,
          0.0024,  0.0523, -0.0692,  0.0094,  0.0501,  0.0433,  0.0285,  0.0609,
          0.0110, -0.0924, -0.0290, -0.0141, -0.0755,  0.0117,  0.0770, -0.0042,
         -0.0882,  0.0074, -0.0164,  0.0207, -0.0104, -0.0791,  0.0258, -0.0125,
          0.0133, -0.0963, -0.0562, -0.0139, -0.0454, -0.0964, -0.0757,  0.0467,
         -0.0576,  0.0017, -0.0376, -0.0121,  0.0060,  0.0229,  0.0645, -0.0041,
          0.0990,  0.0617, -0.0846,  0.0634],
        [ 0.0278,  0.0922,  0.0573, -0.0491, -0.0507,  0.0700,  0.0762, -0.0189,
         -0.0771,  0.0171,  0.0676,  0.0770, -0.0255, -0.0756,  0.0687, -0.0205,
         -0.0989,  0.0632, -0.0794,  0.0313,  0.0773,  0.0360,  0.0821, -0.0606,
          0.0738,  0.0887,  0.0387,  0.0356, -0.0625,  0.0735,  0.0581,  0.0891,
          0.0843, -0.0639,  0.0472, -0.0790,  0.0919,  0.0697,  0.0893, -0.0754,
         -0.0450,  0.0656,  0.0014,  0.0830, -0.0343, -0.0342, -0.0828, -0.0698,
         -0.0338,  0.0516,  0.0759,  0.0012, -0.0863,  0.0071,  0.0026, -0.0456,
         -0.0868,  0.0452, -0.0117,  0.0123, -0.0595, -0.0123,  0.0908,  0.0386,
         -0.0237,  0.0488, -0.0312,  0.0567,  0.0903, -0.0303,  0.0800,  0.0048,
          0.0213,  0.0564, -0.0492,  0.0013, -0.0353,  0.0380, -0.0363,  0.0450,
          0.0432, -0.0073, -0.0096, -0.0067, -0.0547,  0.0976, -0.0679,  0.0644,
          0.0461,  0.0356,  0.0514,  0.0574, -0.0588,  0.0070, -0.0702, -0.0119,
          0.0469, -0.0897,  0.0059,  0.0710],
        [-0.0569, -0.0580,  0.0950,  0.0261,  0.0199,  0.0075, -0.0742, -0.0741,
         -0.0016, -0.0968,  0.0022,  0.0067, -0.0547, -0.0121,  0.0197,  0.0031,
         -0.0159, -0.0429, -0.0434, -0.0114,  0.0366, -0.0072,  0.0111, -0.0537,
          0.0765,  0.0465,  0.0190, -0.0521, -0.0592,  0.0156,  0.0583, -0.0603,
          0.0103, -0.0224, -0.0165, -0.0542, -0.0197, -0.0198,  0.0428,  0.0004,
          0.0954, -0.0315,  0.0590, -0.0052,  0.0810, -0.0907,  0.0533, -0.0159,
         -0.0210,  0.0466,  0.0877,  0.0252,  0.0814, -0.0547,  0.0793,  0.0108,
         -0.0546, -0.0297,  0.0288, -0.0645,  0.0746,  0.0036, -0.0932,  0.0743,
         -0.0392,  0.0633, -0.0489, -0.0548, -0.0350, -0.0783,  0.0061,  0.0109,
         -0.0888,  0.0533, -0.0490,  0.0929,  0.0808,  0.0182, -0.0692, -0.0578,
          0.0206,  0.0472, -0.0144, -0.0076,  0.0609,  0.0112,  0.0800, -0.0447,
         -0.0599, -0.0947,  0.0179, -0.0841, -0.0340, -0.0213, -0.0607,  0.0661,
          0.0400, -0.0156,  0.0955,  0.0149],
        [ 0.0440, -0.0734, -0.0455,  0.0447, -0.0768, -0.0439, -0.0895,  0.0398,
          0.0552, -0.0228,  0.0745, -0.0040, -0.0553,  0.0257,  0.0624,  0.0556,
          0.0709, -0.0792, -0.0627,  0.0941, -0.0729,  0.0988, -0.0146,  0.0798,
          0.0721, -0.0825, -0.0531,  0.0756, -0.0460, -0.0211,  0.0951,  0.0329,
          0.0125, -0.0693, -0.0977, -0.0203,  0.0767,  0.0714, -0.0439, -0.0786,
         -0.0467, -0.0538, -0.0354, -0.0078, -0.0010,  0.0358,  0.0802, -0.0442,
          0.0781, -0.0548, -0.0205, -0.0482, -0.0718,  0.0170,  0.0689, -0.0280,
         -0.0738,  0.0265,  0.0620, -0.0961, -0.0664, -0.0123, -0.0428,  0.0448,
         -0.0674,  0.0843, -0.0316, -0.0658, -0.0895, -0.0977,  0.0581,  0.0530,
          0.0723,  0.0291,  0.0401,  0.0898,  0.0238,  0.0188,  0.0174, -0.0084,
          0.0348,  0.0688,  0.0532, -0.0724,  0.0398,  0.0491, -0.0138, -0.0088,
          0.0993,  0.0153, -0.0429,  0.0477,  0.0339,  0.0970,  0.0480,  0.0254,
         -0.0635, -0.0199, -0.0032, -0.0939],
        [-0.0933,  0.0052,  0.0916,  0.0297, -0.0359, -0.0384, -0.0779,  0.0442,
          0.0042,  0.0020,  0.0017,  0.0337, -0.0746,  0.0956,  0.0787, -0.0066,
          0.0601, -0.0256,  0.0986, -0.0069,  0.0835, -0.0594, -0.0561, -0.0460,
          0.0802,  0.0581, -0.0684, -0.0005,  0.0625, -0.0339,  0.0106,  0.0919,
         -0.0054,  0.0877,  0.0816, -0.0957, -0.0228, -0.0003,  0.0676,  0.0232,
          0.0174,  0.0314,  0.0896, -0.0134, -0.0496,  0.0221, -0.0814,  0.1000,
         -0.0899, -0.0759, -0.0860,  0.0397, -0.0226, -0.0938,  0.0948, -0.0186,
         -0.0265, -0.0048, -0.0230,  0.0374, -0.0603,  0.0673,  0.0463, -0.0323,
          0.0243,  0.0272, -0.0914, -0.0894, -0.0599, -0.0455,  0.0723,  0.0504,
         -0.0427,  0.0443,  0.0850,  0.0159,  0.0227, -0.0325,  0.0324,  0.0667,
         -0.0186, -0.0308, -0.0905,  0.0660,  0.0077, -0.0925, -0.0997, -0.0911,
         -0.0871,  0.0165, -0.0616, -0.0222, -0.0515, -0.0197, -0.0619, -0.0817,
          0.0138, -0.0629, -0.0115,  0.0221]])), ('classification_layer.bias', tensor([ 0.0042,  0.0854, -0.0021,  0.0236, -0.0088,  0.0362,  0.0319, -0.0139,
        -0.0840,  0.0676])), ('hidden1_bn.weight', tensor([0.3303, 0.1530, 0.4894, 0.0333, 0.3025, 0.2013, 0.9865, 0.9697, 0.7924,
        0.8207, 0.4707, 0.5570, 0.6802, 0.8253, 0.8100, 0.8153, 0.5605, 0.0151,
        0.6982, 0.4166, 0.0589, 0.7718, 0.2630, 0.1117, 0.4796, 0.8429, 0.8209,
        0.2309, 0.3231, 0.6622, 0.6709, 0.4836, 0.8513, 0.9036, 0.1443, 0.2100,
        0.6374, 0.4794, 0.4892, 0.5651, 0.4934, 0.7259, 0.3749, 0.0639, 0.1535,
        0.2868, 0.5568, 0.0284, 0.0931, 0.9997, 0.3133, 0.5693, 0.8812, 0.3149,
        0.6502, 0.6609, 0.6166, 0.5628, 0.6978, 0.2779, 0.2199, 0.6082, 0.7619,
        0.4476, 0.7589, 0.1980, 0.4659, 0.7656, 0.7482, 0.1789, 0.5198, 0.4987,
        0.2993, 0.3912, 0.4700, 0.1714, 0.9863, 0.5004, 0.3511, 0.0591, 0.1343,
        0.1484, 0.3613, 0.6602, 0.0777, 0.2051, 0.3520, 0.6841, 0.7000, 0.1150,
        0.9655, 0.1546, 0.8524, 0.4086, 0.9189, 0.4478, 0.5603, 0.9001, 0.9168,
        0.7823])), ('hidden1_bn.bias', tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.])), ('hidden1_bn.running_mean', tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.])), ('hidden1_bn.running_var', tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])), ('hidden1_bn.num_batches_tracked', tensor(0)), ('hidden2_bn.weight', tensor([0.7493, 0.9214, 0.9472, 0.9243, 0.8332, 0.8463, 0.4678, 0.4960, 0.1752,
        0.4570, 0.1298, 0.3417, 0.0792, 0.0747, 0.3442, 0.0376, 0.6607, 0.2929,
        0.2193, 0.8552, 0.2669, 0.9112, 0.9030, 0.8778, 0.7961, 0.1492, 0.4633,
        0.7622, 0.3114, 0.6397, 0.5136, 0.8160, 0.2155, 0.9784, 0.7383, 0.0043,
        0.6815, 0.9142, 0.9557, 0.5914, 0.1865, 0.6271, 0.1805, 0.5324, 0.9511,
        0.9795, 0.3186, 0.3581, 0.4183, 0.7610, 0.8776, 0.1946, 0.4580, 0.6802,
        0.3041, 0.6565, 0.6393, 0.0467, 0.6928, 0.7857, 0.5894, 0.1432, 0.3318,
        0.0910, 0.7382, 0.2964, 0.5349, 0.6373, 0.0262, 0.4608, 0.6420, 0.4851,
        0.9226, 0.4673, 0.7099, 0.4537, 0.6070, 0.3125, 0.7148, 0.1355, 0.0331,
        0.3455, 0.3887, 0.2448, 0.4032, 0.7863, 0.5776, 0.0539, 0.2255, 0.8319,
        0.3218, 0.4303, 0.8619, 0.2611, 0.6131, 0.3324, 0.0570, 0.1367, 0.6644,
        0.1867])), ('hidden2_bn.bias', tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.])), ('hidden2_bn.running_mean', tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.])), ('hidden2_bn.running_var', tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])), ('hidden2_bn.num_batches_tracked', tensor(0)), ('hidden3_bn.weight', tensor([2.8238e-01, 3.4066e-01, 3.9813e-01, 5.8191e-01, 8.3979e-02, 9.9005e-01,
        4.2584e-02, 6.9372e-01, 2.0077e-01, 3.5931e-01, 7.8329e-02, 6.5821e-01,
        6.5080e-01, 1.1287e-01, 3.8611e-01, 3.4630e-01, 3.2333e-01, 8.6465e-01,
        2.1532e-01, 2.7145e-02, 2.2934e-01, 9.3514e-01, 4.9618e-01, 4.5476e-01,
        2.1973e-02, 3.6897e-02, 5.6621e-01, 8.6553e-01, 4.9917e-01, 2.2922e-01,
        7.3066e-02, 3.4956e-01, 7.1112e-01, 6.2052e-01, 5.5618e-01, 2.1682e-01,
        4.2153e-01, 5.0732e-01, 2.8934e-01, 1.7915e-01, 4.8453e-01, 5.7194e-01,
        9.4652e-01, 1.4297e-01, 2.4173e-01, 6.0499e-01, 2.2689e-02, 5.2565e-01,
        8.1428e-02, 8.5968e-01, 2.0033e-01, 9.1389e-02, 7.3871e-01, 4.4835e-01,
        3.7537e-01, 8.7001e-01, 4.0890e-01, 4.2064e-01, 6.7415e-01, 3.4865e-01,
        7.6585e-02, 4.7201e-04, 7.8155e-02, 1.7354e-01, 9.0434e-01, 1.1520e-02,
        4.9015e-01, 3.5558e-01, 8.7879e-01, 8.1560e-01, 5.6668e-02, 2.7647e-01,
        9.7132e-01, 9.2903e-01, 7.7654e-01, 2.4732e-01, 7.1677e-01, 5.1995e-01,
        6.7501e-02, 6.3600e-01, 8.3822e-01, 1.1168e-01, 6.9491e-01, 4.6589e-01,
        2.9681e-01, 6.2065e-01, 7.2516e-01, 4.3853e-01, 1.3577e-01, 2.8497e-01,
        7.5760e-02, 7.0469e-01, 7.5912e-01, 4.2590e-01, 4.8499e-01, 7.5445e-01,
        5.9169e-01, 8.1267e-01, 6.6282e-01, 7.5266e-01])), ('hidden3_bn.bias', tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.])), ('hidden3_bn.running_mean', tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.])), ('hidden3_bn.running_var', tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])), ('hidden3_bn.num_batches_tracked', tensor(0))])



In [15]:

    
# initailze model by saved parameters
new_model = FeedForwardNeuralNetwork(input_size, hidden_size, output_size)
new_model.load_state_dict(saved_parametes)

作业 4

使用 test_epoch 函数，预测new_model在test_loader上的accuracy和loss



In [23]:

    
# test your model prediction performance

new_test_loss, new_test_accuracy = evaluate(test_loader, new_model, loss_fn)
message = 'Average loss: {:.4f}, Accuracy: {:.4f}'.format(new_test_loss, new_test_accuracy)
print(message)









    



Average loss: 182.0958, Accuracy: 10.5100

4. Training Advanced

4.1 l2_norm

we could minimize the regularization term below by use $weight\_decay$ in SGD optimizer \begin{equation} L\_norm = {\sum_{i=1}^{m}{\theta_{i}^{2}}} \end{equation}

set l2_norm=0.01, let's train and see



In [24]:

    
### Hyper parameters
batch_size = 128
n_epochs = 5
learning_rate = 0.01
input_size = 28*28
hidden_size = 100
output_size = 10
l2_norm = 0.01 # use l2 penalty
get_grad = False

# declare a model
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm)



In [29]:

    
train_accs, train_losses = fit(train_loader, test_loader, model, loss_fn, optimizer, n_epochs, get_grad)









    



Epoch: 1/5. Train set: Average loss: 1.8436, Accuracy: 73.9700
Epoch: 1/5. Validation set: Average loss: 70.7305, Accuracy: 74.9600
Epoch: 2/5. Train set: Average loss: 0.6281, Accuracy: 85.8167
Epoch: 2/5. Validation set: Average loss: 36.5876, Accuracy: 86.1100
Epoch: 3/5. Train set: Average loss: 0.4182, Accuracy: 88.9500
Epoch: 3/5. Validation set: Average loss: 28.9424, Accuracy: 89.0400
Epoch: 4/5. Train set: Average loss: 0.3529, Accuracy: 90.2733
Epoch: 4/5. Validation set: Average loss: 25.6781, Accuracy: 90.1900
Epoch: 5/5. Train set: Average loss: 0.3201, Accuracy: 91.0383
Epoch: 5/5. Validation set: Average loss: 23.6667, Accuracy: 91.0800

作业 5

思考正则项在loss中占比的影响。使用 l2_norm = 1, 训练模型

Hints: 因为jupyter对变量有上下文关系，模型，优化器需要重新声明。可以使用以下代码进行重新定义模型和优化器。注意到此处用的是默认初始化。



In [30]:

    
# Hyper parameters
batch_size = 128
n_epochs = 5
learning_rate = 0.01
input_size = 28*28
hidden_size = 100
output_size = 10
l2_norm = 1 # use l2 penalty
get_grad = False

# declare a model
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm)



In [31]:

    
train_accs, train_losses = fit(train_loader, test_loader, model, loss_fn, optimizer, n_epochs, get_grad)









    



Epoch: 1/5. Train set: Average loss: 2.3070, Accuracy: 11.2367
Epoch: 1/5. Validation set: Average loss: 181.8875, Accuracy: 11.3500
Epoch: 2/5. Train set: Average loss: 2.3073, Accuracy: 11.2367
Epoch: 2/5. Validation set: Average loss: 181.8871, Accuracy: 11.3500
Epoch: 3/5. Train set: Average loss: 2.3073, Accuracy: 11.2367
Epoch: 3/5. Validation set: Average loss: 181.8871, Accuracy: 11.3500
Epoch: 4/5. Train set: Average loss: 2.3073, Accuracy: 11.2367
Epoch: 4/5. Validation set: Average loss: 181.8871, Accuracy: 11.3500
Epoch: 5/5. Train set: Average loss: 2.3073, Accuracy: 11.2367
Epoch: 5/5. Validation set: Average loss: 181.8871, Accuracy: 11.3500

4.2 dropout

During training, randomly zeroes some of the elements of the input tensor with probability p using samples from a Bernoulli distribution.

Each channel will be zeroed out independently on every forward call.

Hints: 因为jupyter对变量有上下文关系，模型，优化器需要重新声明。可以使用以下代码进行重新定义模型和优化器。注意到此处用的是默认初始化。



In [32]:

    
### Hyper parameters
batch_size = 128
n_epochs = 5
learning_rate = 0.01
input_size = 28*28
hidden_size = 100
output_size = 10
l2_norm = 0 # without using l2 penalty
get_grad = False

# declare a model
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm)



In [33]:

    
# Set dropout to True and probability = 0.5
model.set_use_dropout(True)



In [34]:

    
train_accs, train_losses = fit(train_loader, test_loader, model, loss_fn, optimizer, n_epochs, get_grad)









    



Epoch: 1/5. Train set: Average loss: 1.9050, Accuracy: 75.4100
Epoch: 1/5. Validation set: Average loss: 76.3831, Accuracy: 76.5800
Epoch: 2/5. Train set: Average loss: 0.7758, Accuracy: 86.8933
Epoch: 2/5. Validation set: Average loss: 35.0280, Accuracy: 87.4100
Epoch: 3/5. Train set: Average loss: 0.5237, Accuracy: 89.2383
Epoch: 3/5. Validation set: Average loss: 27.6756, Accuracy: 89.6300
Epoch: 4/5. Train set: Average loss: 0.4375, Accuracy: 90.5500
Epoch: 4/5. Validation set: Average loss: 24.0231, Accuracy: 90.8000
Epoch: 5/5. Train set: Average loss: 0.3886, Accuracy: 91.5400
Epoch: 5/5. Validation set: Average loss: 21.4149, Accuracy: 91.8100

4.3 batch_normalization

Batch normalization is a technique for improving the performance and stability of artificial neural networks

\begin{equation} y=\frac{x-E[x]}{\sqrt{Var[x]+\epsilon}} * \gamma + \beta, \end{equation}

$\gamma$ and $\beta$ are learnable parameters

Hints: 因为jupyter对变量有上下文关系，模型，优化器需要重新声明。可以使用以下代码进行重新定义模型和优化器。注意到此处用的是默认初始化。



In [35]:

    
### Hyper parameters
batch_size = 128
n_epochs = 5
learning_rate = 0.01
input_size = 28*28
hidden_size = 100
output_size = 10
l2_norm = 0 # without using l2 penalty
get_grad = False

# declare a model
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm)



In [36]:

    
model.set_use_bn(True)



In [37]:

    
model.use_bn









    Out[37]:





True



In [38]:

    
train_accs, train_losses = fit(train_loader, test_loader, model, loss_fn, optimizer, n_epochs, get_grad)









    



Epoch: 1/5. Train set: Average loss: 1.0816, Accuracy: 91.3100
Epoch: 1/5. Validation set: Average loss: 37.5112, Accuracy: 91.7500
Epoch: 2/5. Train set: Average loss: 0.3474, Accuracy: 94.7317
Epoch: 2/5. Validation set: Average loss: 19.5819, Accuracy: 94.6000
Epoch: 3/5. Train set: Average loss: 0.2171, Accuracy: 95.9817
Epoch: 3/5. Validation set: Average loss: 14.2652, Accuracy: 95.6400
Epoch: 4/5. Train set: Average loss: 0.1636, Accuracy: 96.8250
Epoch: 4/5. Validation set: Average loss: 11.7539, Accuracy: 96.1100
Epoch: 5/5. Train set: Average loss: 0.1323, Accuracy: 97.3433
Epoch: 5/5. Validation set: Average loss: 10.3684, Accuracy: 96.3100

4.4 data augmentation

data augmentation can be more complicated to gain a better generalization on test dataset



In [51]:

    
# only add random horizontal flip
train_transform_1 = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(), # Convert a PIL Image or numpy.ndarray to tensor.
    # Normalize a tensor image with mean and standard deviation
    transforms.Normalize((0.1307,), (0.3081,))
])

# only add random crop
train_transform_2 = transforms.Compose([
    transforms.RandomCrop(size=[28,28], padding=4),
    transforms.ToTensor(), # Convert a PIL Image or numpy.ndarray to tensor.
    # Normalize a tensor image with mean and standard deviation
    transforms.Normalize((0.1307,), (0.3081,))
])

# add random horizontal flip and random crop
train_transform_3 = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(size=[28,28], padding=4),
    transforms.ToTensor(), # Convert a PIL Image or numpy.ndarray to tensor.
    # Normalize a tensor image with mean and standard deviation
    transforms.Normalize((0.1307,), (0.3081,))
])



In [40]:

    
# reload train_loader using trans

train_dataset_1 = torchvision.datasets.MNIST(root='./data', 
                            train=True, 
                            transform=train_transform_1,
                            download=False)

train_loader_1 = torch.utils.data.DataLoader(dataset=train_dataset_1, 
                                           batch_size=batch_size, 
                                           shuffle=True)



In [42]:

    
print(train_dataset_1)









    



Dataset MNIST
    Number of datapoints: 60000
    Split: train
    Root Location: ./data
    Transforms (if any): Compose(
                             RandomHorizontalFlip(p=0.5)
                             ToTensor()
                             Normalize(mean=(0.1307,), std=(0.3081,))
                         )
    Target Transforms (if any): None



In [43]:

    
### Hyper parameters
batch_size = 128
n_epochs = 5
learning_rate = 0.01
input_size = 28*28
hidden_size = 100
output_size = 10
l2_norm = 0 # without using l2 penalty
get_grad = False

# declare a model
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm)



In [44]:

    
train_accs, train_losses = fit(train_loader_1, test_loader, model, loss_fn, optimizer, n_epochs, get_grad)









    



Epoch: 1/5. Train set: Average loss: 2.0889, Accuracy: 62.0867
Epoch: 1/5. Validation set: Average loss: 110.2829, Accuracy: 62.6000
Epoch: 2/5. Train set: Average loss: 0.8733, Accuracy: 78.4167
Epoch: 2/5. Validation set: Average loss: 51.4453, Accuracy: 78.9400
Epoch: 3/5. Train set: Average loss: 0.6015, Accuracy: 81.9600
Epoch: 3/5. Validation set: Average loss: 42.2355, Accuracy: 82.5000
Epoch: 4/5. Train set: Average loss: 0.5208, Accuracy: 84.5533
Epoch: 4/5. Validation set: Average loss: 36.6755, Accuracy: 84.9000
Epoch: 5/5. Train set: Average loss: 0.4663, Accuracy: 86.0500
Epoch: 5/5. Validation set: Average loss: 33.3696, Accuracy: 86.7800

作业 6

使用提供的train_transform_2, train_transform_3，重新加载train_loader，并且使用fit进行训练

Hints: 因为jupyter对变量有上下文关系，模型，优化器需要重新声明。注意到此处用的是默认初始化。



In [49]:

    
# train_transform_2
batch_size = 128
train_dataset_2 = torchvision.datasets.MNIST(root='./data', 
                            train=True, 
                            transform=train_transform_2,
                            download=False)
train_loader_2 = torch.utils.data.DataLoader(dataset=train_dataset_2, 
                                           batch_size=batch_size, 
                                           shuffle=True)
n_epochs = 5
learning_rate = 0.01
input_size = 28*28
hidden_size = 100
output_size = 10
l2_norm = 0 # without using l2 penalty
get_grad = False

# declare a model
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm) 
train_accs, train_losses = fit(train_loader_2, test_loader, model, loss_fn, optimizer, n_epochs, get_grad)









    



Epoch: 1/5. Train set: Average loss: 2.2871, Accuracy: 28.6200
Epoch: 1/5. Validation set: Average loss: 174.5120, Accuracy: 34.5300
Epoch: 2/5. Train set: Average loss: 2.0614, Accuracy: 38.5500
Epoch: 2/5. Validation set: Average loss: 124.8742, Accuracy: 49.4800
Epoch: 3/5. Train set: Average loss: 1.5858, Accuracy: 53.0250
Epoch: 3/5. Validation set: Average loss: 88.4804, Accuracy: 68.1300
Epoch: 4/5. Train set: Average loss: 1.2063, Accuracy: 67.6167
Epoch: 4/5. Validation set: Average loss: 66.4097, Accuracy: 76.6300
Epoch: 5/5. Train set: Average loss: 0.8979, Accuracy: 75.0117
Epoch: 5/5. Validation set: Average loss: 50.8784, Accuracy: 81.4400



In [52]:

    
# train_transform_3
batch_size = 128
train_dataset_3 = torchvision.datasets.MNIST(root='./data', 
                            train=True, 
                            transform=train_transform_3,
                            download=False)
train_loader_3 = torch.utils.data.DataLoader(dataset=train_dataset_3, 
                                           batch_size=batch_size, 
                                           shuffle=True)
n_epochs = 5
learning_rate = 0.01
input_size = 28*28
hidden_size = 100
output_size = 10
l2_norm = 0 # without using l2 penalty
get_grad = False

# declare a model
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm) 
train_accs, train_losses = fit(train_loader_3, test_loader, model, loss_fn, optimizer, n_epochs, get_grad)









    



Epoch: 1/5. Train set: Average loss: 2.2908, Accuracy: 26.0033
Epoch: 1/5. Validation set: Average loss: 174.8120, Accuracy: 31.9400
Epoch: 2/5. Train set: Average loss: 2.1295, Accuracy: 33.1017
Epoch: 2/5. Validation set: Average loss: 137.3742, Accuracy: 46.2500
Epoch: 3/5. Train set: Average loss: 1.7426, Accuracy: 44.8083
Epoch: 3/5. Validation set: Average loss: 102.0611, Accuracy: 61.7700
Epoch: 4/5. Train set: Average loss: 1.4166, Accuracy: 57.8800
Epoch: 4/5. Validation set: Average loss: 81.5938, Accuracy: 68.2600
Epoch: 5/5. Train set: Average loss: 1.1065, Accuracy: 67.5600
Epoch: 5/5. Validation set: Average loss: 65.6093, Accuracy: 75.0400

5. Visualizatio of training and validation phase

We could use tensorboard to visualize our training and test phase. You could find example here

6. Gradient explosion and vanishing

We have embedded code which shows grad for hidden2 and hidden3 layer. By observing their grad changes, we can see whether gradient is normal or not.

For plot grad changes, you need to set get_grad=True in fit function



In [53]:

    
### Hyper parameters
batch_size = 128
n_epochs = 15
learning_rate = 0.01
input_size = 28*28
hidden_size = 100
output_size = 10
l2_norm = 0 # use l2 penalty
get_grad = True

# declare a model
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm)



In [54]:

    
fit(train_loader, test_loader, model, loss_fn, optimizer, n_epochs, get_grad)









    



Epoch: 1/15. Train set: Average loss: 1.6968, Accuracy: 78.9350
Epoch: 1/15. Validation set: Average loss: 58.5215, Accuracy: 79.2600
Epoch: 2/15. Train set: Average loss: 0.5273, Accuracy: 87.8183
Epoch: 2/15. Validation set: Average loss: 31.1567, Accuracy: 88.2600
Epoch: 3/15. Train set: Average loss: 0.3674, Accuracy: 89.9267
Epoch: 3/15. Validation set: Average loss: 25.3640, Accuracy: 90.4100
Epoch: 4/15. Train set: Average loss: 0.3131, Accuracy: 91.1333
Epoch: 4/15. Validation set: Average loss: 22.4276, Accuracy: 91.5200
Epoch: 5/15. Train set: Average loss: 0.2785, Accuracy: 92.0250
Epoch: 5/15. Validation set: Average loss: 20.2820, Accuracy: 92.2100
Epoch: 6/15. Train set: Average loss: 0.2514, Accuracy: 92.7400
Epoch: 6/15. Validation set: Average loss: 18.5566, Accuracy: 92.8200
Epoch: 7/15. Train set: Average loss: 0.2292, Accuracy: 93.3267
Epoch: 7/15. Validation set: Average loss: 17.1110, Accuracy: 93.4900
Epoch: 8/15. Train set: Average loss: 0.2104, Accuracy: 93.8750
Epoch: 8/15. Validation set: Average loss: 15.8754, Accuracy: 93.8100
Epoch: 9/15. Train set: Average loss: 0.1945, Accuracy: 94.3883
Epoch: 9/15. Validation set: Average loss: 14.8164, Accuracy: 94.1700
Epoch: 10/15. Train set: Average loss: 0.1807, Accuracy: 94.7867
Epoch: 10/15. Validation set: Average loss: 13.9327, Accuracy: 94.5900
Epoch: 11/15. Train set: Average loss: 0.1687, Accuracy: 95.1533
Epoch: 11/15. Validation set: Average loss: 13.1597, Accuracy: 94.8700
Epoch: 12/15. Train set: Average loss: 0.1581, Accuracy: 95.4600
Epoch: 12/15. Validation set: Average loss: 12.4931, Accuracy: 95.1500
Epoch: 13/15. Train set: Average loss: 0.1488, Accuracy: 95.6933
Epoch: 13/15. Validation set: Average loss: 11.9081, Accuracy: 95.3700
Epoch: 14/15. Train set: Average loss: 0.1402, Accuracy: 95.9233
Epoch: 14/15. Validation set: Average loss: 11.3730, Accuracy: 95.5000
Epoch: 15/15. Train set: Average loss: 0.1324, Accuracy: 96.1600
Epoch: 15/15. Validation set: Average loss: 10.8900, Accuracy: 95.7300






    Out[54]:





([78.935,
  87.81833333333333,
  89.92666666666666,
  91.13333333333334,
  92.025,
  92.74,
  93.32666666666667,
  93.875,
  94.38833333333334,
  94.78666666666666,
  95.15333333333334,
  95.46,
  95.69333333333333,
  95.92333333333333,
  96.16],
 [1.6967720274105031,
  0.5273040285349911,
  0.36739778456588584,
  0.31308586588209003,
  0.2784546338833677,
  0.2514480789168141,
  0.2291561434379755,
  0.21044628407296717,
  0.19451218298198575,
  0.1807335936026568,
  0.16873590495739865,
  0.158139585573067,
  0.14875695638310832,
  0.14024000589011443,
  0.13244163799378225])

6.1.1 Gradient Vanishing

Set learning=e-10



In [59]:

    
### Hyper parameters
batch_size = 128
n_epochs = 15
learning_rate = 1e-10
input_size = 28*28
hidden_size = 100
output_size = 10
l2_norm = 0 # use l2 penalty
get_grad = True

# declare a model
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm)



In [60]:

    
fit(train_loader, test_loader, model, loss_fn, optimizer, n_epochs, get_grad=get_grad)









    



Epoch: 1/15. Train set: Average loss: 2.3112, Accuracy: 9.9867
Epoch: 1/15. Validation set: Average loss: 182.2560, Accuracy: 9.9100
Epoch: 2/15. Train set: Average loss: 2.3112, Accuracy: 9.9867
Epoch: 2/15. Validation set: Average loss: 182.2560, Accuracy: 9.9100
Epoch: 3/15. Train set: Average loss: 2.3112, Accuracy: 9.9867
Epoch: 3/15. Validation set: Average loss: 182.2560, Accuracy: 9.9100
Epoch: 4/15. Train set: Average loss: 2.3112, Accuracy: 9.9867
Epoch: 4/15. Validation set: Average loss: 182.2560, Accuracy: 9.9100
Epoch: 5/15. Train set: Average loss: 2.3112, Accuracy: 9.9867
Epoch: 5/15. Validation set: Average loss: 182.2560, Accuracy: 9.9100
Epoch: 6/15. Train set: Average loss: 2.3112, Accuracy: 9.9867
Epoch: 6/15. Validation set: Average loss: 182.2560, Accuracy: 9.9100
Epoch: 7/15. Train set: Average loss: 2.3112, Accuracy: 9.9867
Epoch: 7/15. Validation set: Average loss: 182.2560, Accuracy: 9.9100
Epoch: 8/15. Train set: Average loss: 2.3112, Accuracy: 9.9867
Epoch: 8/15. Validation set: Average loss: 182.2560, Accuracy: 9.9100
Epoch: 9/15. Train set: Average loss: 2.3112, Accuracy: 9.9867
Epoch: 9/15. Validation set: Average loss: 182.2560, Accuracy: 9.9100
Epoch: 10/15. Train set: Average loss: 2.3112, Accuracy: 9.9867
Epoch: 10/15. Validation set: Average loss: 182.2560, Accuracy: 9.9100
Epoch: 11/15. Train set: Average loss: 2.3112, Accuracy: 9.9867
Epoch: 11/15. Validation set: Average loss: 182.2560, Accuracy: 9.9100
Epoch: 12/15. Train set: Average loss: 2.3112, Accuracy: 9.9867
Epoch: 12/15. Validation set: Average loss: 182.2560, Accuracy: 9.9100
Epoch: 13/15. Train set: Average loss: 2.3112, Accuracy: 9.9867
Epoch: 13/15. Validation set: Average loss: 182.2560, Accuracy: 9.9100
Epoch: 14/15. Train set: Average loss: 2.3112, Accuracy: 9.9867
Epoch: 14/15. Validation set: Average loss: 182.2560, Accuracy: 9.9100
Epoch: 15/15. Train set: Average loss: 2.3112, Accuracy: 9.9867
Epoch: 15/15. Validation set: Average loss: 182.2560, Accuracy: 9.9100






    Out[60]:





([9.986666666666666,
  9.986666666666666,
  9.986666666666666,
  9.986666666666666,
  9.986666666666666,
  9.986666666666666,
  9.986666666666666,
  9.986666666666666,
  9.986666666666666,
  9.986666666666666,
  9.986666666666666,
  9.986666666666666,
  9.986666666666666,
  9.986666666666666,
  9.986666666666666],
 [2.3111575093024816,
  2.3111575093024816,
  2.3111575093024816,
  2.3111575093024816,
  2.311157509811923,
  2.311157509811923,
  2.311157509811923,
  2.311157509811923,
  2.311157509811923,
  2.3111575103213644,
  2.3111575103213644,
  2.3111575103213644,
  2.3111575103213644,
  2.3111575103213644,
  2.311157509811923])

6.1.2 Gradient Explosion

6.1.2.1 learning rate

set learning rate = 10



In [61]:

    
### Hyper parameters
batch_size = 128
n_epochs = 15
learning_rate = 10
input_size = 28*28
hidden_size = 100
output_size = 10
l2_norm = 0 # not to use l2 penalty
get_grad = True

# declare a model
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm)



In [62]:

    
fit(train_loader, test_loader, model, loss_fn, optimizer, n_epochs, get_grad=True)









    



C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:80: RuntimeWarning: overflow encountered in square
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:81: RuntimeWarning: overflow encountered in square






    



Epoch: 1/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 1/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 2/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 2/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 3/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 3/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 4/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 4/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 5/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 5/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 6/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 6/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 7/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 7/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 8/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 8/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 9/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 9/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 10/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 10/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 11/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 11/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 12/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 12/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 13/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 13/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 14/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 14/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 15/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 15/15. Validation set: Average loss: nan, Accuracy: 9.8000






    Out[62]:





([9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666],
 [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])

6.1.2.2 normalization for input data

6.1.2.3 unsuitable weight initialization



In [63]:

    
### Hyper parameters
batch_size = 128
n_epochs = 15
learning_rate = 1
input_size = 28*28
hidden_size = 100
output_size = 10
l2_norm = 0 # not to use l2 penalty
get_grad = True

# declare a model
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm)



In [64]:

    
# reset parameters as 10
def wrong_weight_bias_reset(model):
    """Using normalization with mean=0, std=1 to initialize model's parameter
    """
    for m in model.modules():
        if isinstance(m, nn.Linear):
            # initialize linear layer with mean and std
            mean, std = 0, 1 
            
            # Initialization method
            torch.nn.init.normal_(m.weight, mean, std)
            torch.nn.init.normal_(m.bias, mean, std)



In [65]:

    
wrong_weight_bias_reset(model)
show_weight_bias(model)









    



C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\figure.py:2366: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.
  warnings.warn("This figure includes Axes that are not compatible "



In [66]:

    
fit(train_loader, test_loader, model, loss_fn, optimizer, n_epochs, get_grad=True)









    



C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:80: RuntimeWarning: overflow encountered in square
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:81: RuntimeWarning: overflow encountered in square






    



Epoch: 1/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 1/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 2/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 2/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 3/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 3/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 4/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 4/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 5/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 5/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 6/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 6/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 7/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 7/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 8/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 8/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 9/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 9/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 10/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 10/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 11/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 11/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 12/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 12/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 13/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 13/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 14/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 14/15. Validation set: Average loss: nan, Accuracy: 9.8000
Epoch: 15/15. Train set: Average loss: nan, Accuracy: 9.8717
Epoch: 15/15. Validation set: Average loss: nan, Accuracy: 9.8000






    Out[66]:





([9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666,
  9.871666666666666],
 [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])

Week 4. Training Issues

Outline

1.1 Required Module

1.2 Common Setup

2. Classfication Model

2.1 Short indroduction of MNIST

2.2 Define A FeedForward Neural Network

2.2.1 Activation Function

2.2.1.1 ReLU

2.2.1.2 Sigmoid

2.2.2 Network's Input and output

3. Training

3.1 Pre-set hyper-parameters

3.2 Initialize model parameters

作业1

3.3 Repeat over certain numbers of epoch

3.3.1 Shuffle whole traning data

3.3.1.1 Data Loading

3.3.2 & 3.3.3 compute gradient of loss over parameters & update parameters with gradient descent

作业 2

作业 3

3.4 save model

作业 4

4. Training Advanced

4.1 l2_norm

作业 5

4.2 dropout

4.3 batch_normalization

4.4 data augmentation

作业 6

5. Visualizatio of training and validation phase

6. Gradient explosion and vanishing

6.1.1 Gradient Vanishing

6.1.2 Gradient Explosion

6.1.2.1 learning rate

6.1.2.2 normalization for input data

6.1.2.3 unsuitable weight initialization

References