CSCI 5622 Fall 2017 HW#3

Brian McKean

1. Back Propagation (35pts)

In this homework you’ll implement a feed-forward neural network for classifying handwritten digits. Your tasks are to implement back propagation to compute the parameter derivatives for SGD and to add L2 regularization to SGD. First, make sure your code works on a small dataset (tinyTOY.pkl.gz) before moving on to a lower-resolution version of MNIST (tinyMNIST.pkl.gz).

1.1 Programming questions (20 pts)

Finish nn.py. --- see code

  1. Finish the backprop function to compute the gradients of the weights and biases.
  2. Finish the SGD train function to add L2 regularization (a sketch of both steps follows this list).
  3. Add code to test on the tinyMNIST dataset.
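
For reference, a minimal NumPy sketch of both pieces, assuming a fully connected network with sigmoid activations and a quadratic cost; the function and variable names here are illustrative and not necessarily those used in nn.py.

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Per-example gradients of a quadratic cost for a sigmoid network.

    weights[i] has shape (n_out, n_in); biases[i] has shape (n_out,).
    """
    # Forward pass, caching pre-activations (zs) and activations.
    activation = x
    activations = [x]
    zs = []
    for W, b in zip(weights, biases):
        z = W @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)

    # Backward pass: start from the output-layer error and work backwards.
    nabla_w = [np.zeros_like(W) for W in weights]
    nabla_b = [np.zeros_like(b) for b in biases]
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_w[-1] = np.outer(delta, activations[-2])
    nabla_b[-1] = delta
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        nabla_w[-l] = np.outer(delta, activations[-l - 1])
        nabla_b[-l] = delta
    return nabla_w, nabla_b

def sgd_update(weights, biases, batch, eta, lam, n):
    """One SGD step over a mini-batch with L2 (weight-decay) regularization.

    eta is the learning rate, lam the L2 coefficient, n the training-set size.
    """
    grad_w = [np.zeros_like(W) for W in weights]
    grad_b = [np.zeros_like(b) for b in biases]
    for x, y in batch:
        nw, nb = backprop(x, y, weights, biases)
        grad_w = [g + d for g, d in zip(grad_w, nw)]
        grad_b = [g + d for g, d in zip(grad_b, nb)]
    m = len(batch)
    # The L2 term shrinks the weights toward zero; biases are not regularized.
    weights = [(1 - eta * lam / n) * W - (eta / m) * g for W, g in zip(weights, grad_w)]
    biases = [b - (eta / m) * g for b, g in zip(biases, grad_b)]
    return weights, biases
```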

1.2 Analysis (15 points)

1. What is the structure of your neural network (for both the tinyTOY and tinyMNIST datasets)? Show the dimensions of the input layer, hidden layer, and output layer.

tinyTOY dataset

  • Input layer: 2
  • Hidden layer: 30
  • Output layer: 2

tinyMNIST dataset

  • Input layer: 196
  • Hidden layer: 30
  • Output layer: 10

2. What is the role of the hidden layer size on train and test accuracy (plot accuracy vs. size of hidden layer using the tinyMNIST dataset)?

[Plot: train and test accuracy vs. hidden-layer size on tinyMNIST]

The number of nodes in the hidden layer improves accuracy up to a point. In this analysis the best accuracy is achieved with a 30-node hidden layer, and larger hidden layers do not significantly increase test accuracy.

Increasing the number of nodes too far causes the test accuracy to fluctuate up and down, indicating overfitting.

A larger hidden layer also reaches its best accuracy in fewer epochs, as shown below.

3. How does the number of epochs affect train and test accuracy (plot accuracy vs. epochs using the tinyMNIST dataset)?

[Plot: train and test accuracy vs. number of epochs on tinyMNIST]

Increasing the number of epochs increases test accuracy. With more hidden nodes, higher test accuracies are achieved in fewer epochs.

2. Keras CNN (35pts)

Here, you will use the Conv2D layer in Keras to build a convolutional neural network for the MNIST dataset. The input data is the same as the MNIST dataset in HW1, so you need to reshape the vector of each image into a matrix for use with Conv2D. You need to build your model using the layers provided by Keras and achieve an accuracy higher than 98.5%.

2.1 Programming questions (20pts)

Finish CNN.py to build a CNN model, then train and improve your model to achieve 98.5% accuracy on the MNIST dataset. (Hint: use one-hot encoding for the labels, the input to the final Dense layer needs to be flattened, try a Dropout layer to improve your model, and don’t give up.)

  1. Reshape your MNIST data.
  2. Finish init function to construct your model.
  3. Finish train function and fit to your training data.

see code
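
For reference, a minimal Keras Sequential sketch matching the first model summary below; the preprocessing, dropout rates, activations, and optimizer here are assumptions rather than the exact contents of CNN.py.

```
import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from keras.utils import to_categorical

# Reshape flat 784-element vectors into 28x28x1 images and one-hot encode the labels.
def preprocess(x, y):
    x = x.reshape(-1, 28, 28, 1).astype('float32') / 255.0
    return x, to_categorical(y, 10)

# One 32-filter 5x5 convolution, 2x2 max pooling, dropout, a 1000-unit dense layer,
# and a 10-way softmax output, as in the summary below.
model = Sequential([
    Conv2D(32, (5, 5), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Flatten(),
    Dense(1000, activation='relu'),
    Dropout(0.5),
    Dense(10, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.fit(x_train, y_train, batch_size=128, epochs=100, validation_data=(x_test, y_test))
```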

2.2 Analysis (15pts)

1. Point out at least three layer types you used in your model. Explain what they are used for.

Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 24, 24, 32)        832       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 32)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 12, 12, 32)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 4608)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1000)              4609000   
_________________________________________________________________
dropout_2 (Dropout)          (None, 1000)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 10)                10010     

=================================================================
Total params: 4,619,842
Trainable params: 4,619,842
Non-trainable params: 0

  • Convolution layer -- slides 5x5 kernels (filters) over the image with a stride of 1
  • Max pooling layer -- takes a single sample (the maximum) from each 2x2 region
  • Dropout -- randomly drops the previous layer's outputs on each pass, allowing more nodes to influence the result
  • Flatten layer -- reduces the result to a 1D array
  • Dense -- a layer of fully connected nodes
  • Final Dense layer -- prepares the data in the form desired for output, with 10 classes

2. How did you improve your model for higher accuracy?

- I added dropout to decrease the chances of overfitting.
- Then I increased the epochs from 50 to 100 with a single 32-filter convolution layer and got the desired result.



- I also tried a second convolutional layer with 64 filters and was able to get to the desired result in 50 epochs. See results below.


Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 24, 24, 32)        832       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 32)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 12, 12, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 8, 8, 64)          51264     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 4, 4, 64)          0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 4, 4, 64)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 1024)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1000)              1025000   
_________________________________________________________________
dropout_3 (Dropout)          (None, 1000)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 10)                10010
=================================================================
Total params: 1,087,106
Trainable params: 1,087,106
Non-trainable params: 0

3. Try different activation functions and batch sizes. Show the corresponding accuracy.
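
A sketch of how such a sweep could be run, assuming the preprocessed MNIST arrays (x_train, y_train, x_test, y_test) from the sketch above; the activations, batch sizes, and epoch count here are illustrative choices, not the exact settings I used.

```
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

def build_model(activation):
    """Rebuild the CNN above with a configurable activation function."""
    model = Sequential([
        Conv2D(32, (5, 5), activation=activation, input_shape=(28, 28, 1)),
        MaxPooling2D(pool_size=(2, 2)),
        Dropout(0.25),
        Flatten(),
        Dense(1000, activation=activation),
        Dropout(0.5),
        Dense(10, activation='softmax'),
    ])
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Sweep activations and batch sizes, recording the best validation accuracy of each run.
for activation in ['relu', 'tanh', 'sigmoid']:
    for batch_size in [32, 64, 128, 256]:
        history = build_model(activation).fit(
            x_train, y_train, batch_size=batch_size, epochs=20,
            validation_data=(x_test, y_test), verbose=0)
        print(activation, batch_size, max(history.history['val_acc']))
```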


3. Keras RNN (30pts)

Here you will use Keras to build an RNN model for sentiment analysis. You should use word embeddings and an LSTM to finish LSTM.py. You will test your model on the IMDB dataset and are expected to achieve an accuracy higher than 90%.

3.1 Programming questions (15pts)

Finish the LSTM.py to build an RNN model. Use word embeddings as the first layer and use LSTM for sequential prediction.

  1. Preprocess the data for the LSTM (it requires sequences of the same length).
  2. Finish the init function to construct your model.
  3. Finish the train function and fit it to your training data (a sketch follows this list).
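
For reference, a minimal Keras sketch of this kind of model, sized to match the run reported next (dict_size=10000, example_length=512, embedding_length=128, 64 LSTM units); the dropout rate, optimizer, and loss are assumptions rather than the exact contents of LSTM.py.

```
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Embedding, Dropout, LSTM, Dense
from keras.preprocessing.sequence import pad_sequences

dict_size, example_length, embedding_length = 10000, 512, 128

# Load the IMDB reviews as sequences of integer word indices and pad/truncate
# them so every example has the same length.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=dict_size)
x_train = pad_sequences(x_train, maxlen=example_length)
x_test = pad_sequences(x_test, maxlen=example_length)

# Embedding -> Dropout -> LSTM -> Dense, matching the layer summary below.
model = Sequential([
    Embedding(dict_size, embedding_length, input_length=example_length),
    Dropout(0.2),
    LSTM(64),
    Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.fit(x_train, y_train, batch_size=4, epochs=15, validation_data=(x_test, y_test))
```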

Here is one of my better results:

Using TensorFlow backend.
dict_size=10000, example_length=512, embedding_length=128,  batch_size=4, epochs=15
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 512, 128)          1280000
_________________________________________________________________
dropout_1 (Dropout)          (None, 512, 128)          0
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                49408
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65
=================================================================

25000/25000 [==============================] - 5350s - loss: 0.4669 - acc: 0.7792 - val_loss: 0.3705 - val_acc: 0.8202
Epoch 2/15
25000/25000 [==============================] - 5300s - loss: 0.2429 - acc: 0.9028 - val_loss: 0.2671 - val_acc: 0.8930
Epoch 3/15
25000/25000 [==============================] - 5296s - loss: 0.1613 - acc: 0.9393 - val_loss: 0.2935 - val_acc: 0.8885
Epoch 4/15
25000/25000 [==============================] - 5309s - loss: 0.1064 - acc: 0.9622 - val_loss: 0.3304 - val_acc: 0.8858

3.2 Analysis (15pts)

1. What is the purpose of the embedding layer? (Hint: think about the input and the output).

The embedding layer builds word vectors based on the data you have. The input is a numerical index representing each word, and the output is a dense vector that encodes relationships between words.
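
A small sketch of the shapes involved; the vocabulary size, sequence length, and embedding width here are illustrative.

```
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# Map integer word indices (0..9999) to dense 128-dimensional vectors.
model = Sequential([Embedding(input_dim=10000, output_dim=128, input_length=512)])

batch = np.random.randint(0, 10000, size=(4, 512))  # 4 reviews, each 512 word indices
vectors = model.predict(batch)
print(vectors.shape)  # (4, 512, 128): each index is replaced by a learned 128-dim vector
```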

2. What is the effect of the hidden dimension size in LSTM?

More LSTM units take longer to train and test, and the accuracy rises over more epochs before plateauing.

Here are a variety of runs I did to check how changing parameters affects the test accuracy.

The highest test accuracy I achieved was with a smaller number of LSTM units (64), as in the run shown above.


Columns: run, dict_size, example_len, batch_size, embedding_len, lstm_units, best accuracy, epoch of best accuracy (of 3)

Changing example length        
0   5000    128 32  64  128 0.8624  2
1   5000    256 32  64  128 0.8676  2
2   5000    512 32  64  128 0.8664  2
3   5000    768 32  64  128 0.8671  3
4   5000    1024    32  64  128 0.8682  2
Changing dict size
5   1000    512 32  64  128 0.8490  3
6   2500    512 32  64  128 0.8622  2
7   5000    512 32  64  128 0.8707  3
8   7500    512 32  64  128 0.8613  3
9   10000   512 32  64  128 0.8519  3
Changing embedding length
10  5000    512 32  16  128 0.8571  3
11  5000    512 32  32  128 0.8608  3
12  5000    512 32  64  128 0.8659  3
13  5000    512 32  128 128 0.8715  3
14  5000    512 32  256 128 0.8744  2
Changing batch size
15  5000    512 16  64  128 0.8535  3
16  5000    512 32  64  128 0.8525  3
17  5000    512 64  64  128 0.8532  1
18  5000    512 128 64  128 0.8712  2
19  5000    512 256 64  128 0.8792  2


Changing LSTM units

20  5000    512 32  64  64  0.8687  2
21  5000    512 32  64  96  0.8616  3
22  5000    512 32  64  128 0.8739  2
23  5000    512 32  64  192 0.8058  1
24  5000    512 32  64  256 0.8758  3

3. Replace LSTM with GRU and compare their performance.

GRU performance was about the same as LSTM with the same setup.
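
For reference, a minimal sketch of the swap, sized to match the GRU summary below; the dropout rate and optimizer are assumptions.

```
from keras.models import Sequential
from keras.layers import Embedding, Dropout, GRU, Dense

# Same architecture as the LSTM model; only the recurrent layer changes.
model = Sequential([
    Embedding(20000, 128, input_length=512),
    Dropout(0.2),
    GRU(128),                       # previously LSTM(128); a GRU has one fewer gate
    Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```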



Here is an example
dict_size=20000, example_length=512, embedding_length=128,  batch_size=32, epochs=15

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 512, 128)          2560000
_________________________________________________________________
dropout_1 (Dropout)          (None, 512, 128)          0
_________________________________________________________________
gru_1 (GRU)                  (None, 128)               98688
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129
=================================================================

25000/25000 [==============================] - 1311s - loss: 0.5039 - acc: 0.7541 - val_loss: 0.3929 - val_acc: 0.8264
Epoch 2/15
25000/25000 [==============================] - 1256s - loss: 0.3221 - acc: 0.8674 - val_loss: 0.3383 - val_acc: 0.8573
Epoch 3/15
25000/25000 [==============================] - 1459s - loss: 0.1999 - acc: 0.9250 - val_loss: 0.2929 - val_acc: 0.8843
Epoch 4/15
25000/25000 [==============================] - 1303s - loss: 0.1188 - acc: 0.9581 - val_loss: 0.3674 - val_acc: 0.8635
Epoch 5/15
25000/25000 [==============================] - 1376s - loss: 0.0713 - acc: 0.9763 - val_loss: 0.4488 - val_acc: 0.8546
Epoch 6/15
25000/25000 [==============================] - 1477s - loss: 0.0456 - acc: 0.9848 - val_loss: 0.4656 - val_acc: 0.8605
Epoch 7/15
25000/25000 [==============================] - 1281s - loss: 0.0335 - acc: 0.9884 - val_loss: 0.5412 - val_acc: 0.8617
Epoch 8/15
25000/25000 [==============================] - 1112s - loss: 0.0203 - acc: 0.9938 - val_loss: 0.6372 - val_acc: 0.8548
Epoch 9/15
25000/25000 [==============================] - 1105s - loss: 0.0192 - acc: 0.9938 - val_loss: 0.7120 - val_acc: 0.8537
Epoch 10/15
25000/25000 [==============================] - 1117s - loss: 0.0120 - acc: 0.9960 - val_loss: 0.6570 - val_acc: 0.8589
Epoch 11/15
25000/25000 [==============================] - 1092s - loss: 0.0127 - acc: 0.9957 - val_loss: 0.7253 - val_acc: 0.8550
Epoch 12/15
25000/25000 [==============================] - 1121s - loss: 0.0082 - acc: 0.9972 - val_loss: 0.7711 - val_acc: 0.8578
Epoch 13/15
25000/25000 [==============================] - 1091s - loss: 0.0079 - acc: 0.9970 - val_loss: 0.8059 - val_acc: 0.8573
Epoch 14/15
25000/25000 [==============================] - 1122s - loss: 0.0055 - acc: 0.9984 - val_loss: 0.8002 - val_acc: 0.8556
Epoch 15/15
25000/25000 [==============================] - 1179s - loss: 0.0058 - acc: 0.9982 - val_loss: 0.8742 - val_acc: 0.8565


And another GRU example
dict_size=20000, example_length=512, embedding_length=128,  batch_size=32, epochs=15
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 512, 128)          2560000
_________________________________________________________________
dropout_1 (Dropout)          (None, 512, 128)          0
_________________________________________________________________
gru_1 (GRU)                  (None, 128)               98688
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129
=================================================================
25000/25000 [==============================] - 1305s - loss: 0.5115 - acc: 0.7475 - val_loss: 0.3811 - val_acc: 0.8444
Epoch 2/15
25000/25000 [==============================] - 1227s - loss: 0.3575 - acc: 0.8498 - val_loss: 0.3368 - val_acc: 0.8539
Epoch 3/15
25000/25000 [==============================] - 1456s - loss: 0.2065 - acc: 0.9202 - val_loss: 0.2895 - val_acc: 0.8784
Epoch 4/15
25000/25000 [==============================] - 1316s - loss: 0.1224 - acc: 0.9551 - val_loss: 0.3285 - val_acc: 0.8743
Epoch 5/15
25000/25000 [==============================] - 1372s - loss: 0.0720 - acc: 0.9762 - val_loss: 0.4011 - val_acc: 0.8704
Epoch 6/15
25000/25000 [==============================] - 1472s - loss: 0.0430 - acc: 0.9853 - val_loss: 0.4521 - val_acc: 0.8607
Epoch 7/15
25000/25000 [==============================] - 1308s - loss: 0.0301 - acc: 0.9898 - val_loss: 0.5105 - val_acc: 0.8640
Epoch 8/15
25000/25000 [==============================] - 1114s - loss: 0.0180 - acc: 0.9946 - val_loss: 0.6100 - val_acc: 0.8609
Epoch 9/15
25000/25000 [==============================] - 1098s - loss: 0.0144 - acc: 0.9956 - val_loss: 0.6792 - val_acc: 0.8538
Epoch 10/15
25000/25000 [==============================] - 1116s - loss: 0.0123 - acc: 0.9959 - val_loss: 0.6622 - val_acc: 0.8522
Epoch 11/15
25000/25000 [==============================] - 1097s - loss: 0.0108 - acc: 0.9962 - val_loss: 0.7350 - val_acc: 0.8570
Epoch 12/15
25000/25000 [==============================] - 1119s - loss: 0.0116 - acc: 0.9959 - val_loss: 0.7120 - val_acc: 0.8472
Epoch 13/15
25000/25000 [==============================] - 1096s - loss: 0.0063 - acc: 0.9980 - val_loss: 0.7980 - val_acc: 0.8534
Epoch 14/15
25000/25000 [==============================] - 1119s - loss: 0.0054 - acc: 0.9981 - val_loss: 0.8704 - val_acc: 0.8454
Epoch 15/15
25000/25000 [==============================] - 1184s - loss: 0.0061 - acc: 0.9978 - val_loss: 0.9434 - val_acc: 0.8522
25000/25000 [==============================] - 359s

Extra credit (5pts): Try to use pretrained word embeddings to initialize the embedding layer and see how that changes the performance.

I did not do the extra credit; I spent all my time this week on the LSTM trying to get to 90%.

