In this homework you'll implement a feed-forward neural network for classifying handwritten digits. Your tasks are to implement backpropagation to compute the parameter derivatives for SGD and to add L2 regularization to SGD. First, make sure your code works on a small dataset (tinyTOY.pkl.gz) before moving on to a lower-resolution version of MNIST (tinyMNIST.pkl.gz).
Finish nn.py. --- see code
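For reference, here is a minimal numpy sketch of a single SGD step with backpropagation and an L2 penalty on the weights; the shapes and variable names are illustrative and do not necessarily match nn.py.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(x, y, W1, b1, W2, b2, lr=0.1, lam=1e-4):
    """One SGD update on a single example (x: input vector, y: one-hot label)."""
    # forward pass: sigmoid hidden layer, softmax output
    a1 = sigmoid(W1 @ x + b1)
    z2 = W2 @ a1 + b2
    p = np.exp(z2 - z2.max())
    p /= p.sum()

    # backward pass: softmax + cross-entropy gives (p - y) at the output
    d2 = p - y
    dW2 = np.outer(d2, a1) + lam * W2        # + derivative of (lam/2) * ||W2||^2
    db2 = d2
    d1 = (W2.T @ d2) * a1 * (1.0 - a1)       # backprop through the sigmoid
    dW1 = np.outer(d1, x) + lam * W1
    db1 = d1

    # SGD update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2

# tiny usage example with random data: 2-d input, 4 hidden units, 3 classes
rng = np.random.RandomState(0)
W1, b1 = 0.1 * rng.randn(4, 2), np.zeros(4)
W2, b2 = 0.1 * rng.randn(3, 4), np.zeros(3)
W1, b1, W2, b2 = sgd_step(rng.randn(2), np.array([0.0, 1.0, 0.0]), W1, b1, W2, b2)
```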
[Plots: results on the tinyTOY dataset and tinyMNIST]
The number of nodes in the hidden layer improves accuracy only up to a point. In this analysis the best accuracy is achieved with a 30-node hidden layer, and larger hidden layers do not significantly increase test accuracy.
Increasing the number of nodes too far causes the test accuracy to fluctuate up and down, which suggests overfitting.
A larger hidden layer also reaches its best accuracy in fewer epochs, as shown below.
[Plot: test accuracy vs. epochs for different hidden-layer sizes]
More epochs increase test accuracy, and with more hidden nodes the higher test accuracies are achieved in fewer epochs.
Here, you will use the Conv2D layer in Keras to build a convolutional neural network for the MNIST dataset. The input dataset is the same as the MNIST dataset in HW1, so you need to reshape each image vector into a matrix for Conv2D. Build your model using the layers provided by Keras and achieve an accuracy higher than 98.5%.
Finish CNN.py to build a CNN model, then train and improve it to reach 98.5% accuracy on the MNIST dataset. (Hint: use one-hot encoding for the labels, flatten the input to the final Dense layer, try a Dropout layer to improve your model, and don't give up.)
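Below is a minimal Keras sketch of the pieces the hint describes (reshape to 28x28x1, one-hot labels, Conv2D, MaxPooling, Dropout, Flatten, Dense). The layer sizes happen to mirror the summary that follows, but the optimizer, batch size, and epoch count are placeholders rather than my final settings.

```python
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from keras.utils import to_categorical

# the HW1 data arrives as flat 784-vectors; Conv2D wants (height, width, channels)
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255.0
y_train = to_categorical(y_train, 10)   # one-hot labels for categorical_crossentropy
y_test = to_categorical(y_test, 10)

model = Sequential([
    Conv2D(32, (5, 5), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Dropout(0.25),
    Flatten(),                            # flatten before the final Dense layers
    Dense(1000, activation='relu'),
    Dropout(0.5),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=128, epochs=20,
          validation_data=(x_test, y_test))
```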
see code
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 24, 24, 32) 832
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 32) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 12, 12, 32) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 4608) 0
_________________________________________________________________
dense_1 (Dense) (None, 1000) 4609000
_________________________________________________________________
dropout_2 (Dropout) (None, 1000) 0
_________________________________________________________________
dense_2 (Dense) (None, 10) 10010
=================================================================
Total params: 4,619,842
Trainable params: 4,619,842
Non-trainable params: 0
- I added dropout to decrease the chance of overfitting.
- Then I increased the epochs from 50 to 100 with a single 32-filter convolutional layer and got the desired result.
[Plot: accuracy vs. epochs for the single-convolution-layer model]
- I also tried a second convolutional layer with 64 filters and was able to get the desired result in 50 epochs. See results below.
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 24, 24, 32) 832
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 32) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 12, 12, 32) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 8, 8, 64) 51264
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 4, 4, 64) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 4, 4, 64) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 1024) 0
_________________________________________________________________
dense_1 (Dense) (None, 1000) 1025000
_________________________________________________________________
dropout_3 (Dropout) (None, 1000) 0
_________________________________________________________________
dense_2 (Dense) (None, 10) 10010
=================================================================
Total params: 1,087,106
Trainable params: 1,087,106
Non-trainable params: 0
[Plot: accuracy vs. epochs for the two-convolution-layer model]
#### 3. Try different activation functions and batch sizes. Show the corresponding accuracy.
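A sketch of how such a sweep could be run, reusing the preprocessed x_train/y_train/x_test/y_test arrays from above; the particular activations and batch sizes listed here are just examples.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

def build_model(activation):
    # same architecture as above, with the hidden activation as a parameter
    model = Sequential([
        Conv2D(32, (5, 5), activation=activation, input_shape=(28, 28, 1)),
        MaxPooling2D((2, 2)),
        Dropout(0.25),
        Flatten(),
        Dense(1000, activation=activation),
        Dropout(0.5),
        Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

for activation in ['relu', 'tanh', 'sigmoid']:
    for batch_size in [32, 128, 512]:
        model = build_model(activation)
        model.fit(x_train, y_train, batch_size=batch_size, epochs=10, verbose=0)
        loss, acc = model.evaluate(x_test, y_test, verbose=0)
        print('activation=%s batch_size=%d test_acc=%.4f' % (activation, batch_size, acc))
```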
[Results: test accuracy for different activation functions and batch sizes]
Finish the LSTM.py to build an RNN model. Use word embeddings as the first layer and use LSTM for sequential prediction.
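Here is a minimal sketch of the Embedding/LSTM setup on the IMDB data; the hyperparameters roughly mirror the runs reported below, but the optimizer, batch size, and epoch count are placeholders.

```python
from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Embedding, Dropout, LSTM, Dense

dict_size, example_length, embedding_length = 10000, 512, 128

# keep the dict_size most frequent words; pad/truncate every review to example_length
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=dict_size)
x_train = sequence.pad_sequences(x_train, maxlen=example_length)
x_test = sequence.pad_sequences(x_test, maxlen=example_length)

model = Sequential([
    Embedding(dict_size, embedding_length, input_length=example_length),
    Dropout(0.2),
    LSTM(64),                        # last hidden state summarizes the whole review
    Dense(1, activation='sigmoid'),  # binary sentiment prediction
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=3,
          validation_data=(x_test, y_test))
```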
Here is one of my better results:
Using TensorFlow backend.
dict_size=10000, example_length=512, embedding_length=128, batch_size=4, epochs=15
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 512, 128) 1280000
_________________________________________________________________
dropout_1 (Dropout) (None, 512, 128) 0
_________________________________________________________________
lstm_1 (LSTM) (None, 64) 49408
_________________________________________________________________
dense_1 (Dense) (None, 1) 65
=================================================================
Epoch 1/15
25000/25000 [==============================] - 5350s - loss: 0.4669 - acc: 0.7792 - val_loss: 0.3705 - val_acc: 0.8202
Epoch 2/15
25000/25000 [==============================] - 5300s - loss: 0.2429 - acc: 0.9028 - val_loss: 0.2671 - val_acc: 0.8930
Epoch 3/15
25000/25000 [==============================] - 5296s - loss: 0.1613 - acc: 0.9393 - val_loss: 0.2935 - val_acc: 0.8885
Epoch 4/15
25000/25000 [==============================] - 5309s - loss: 0.1064 - acc: 0.9622 - val_loss: 0.3304 - val_acc: 0.8858
The embedding layer builds word vectors based on the training data. The input is a numerical index for each word, and the output is a vector that encodes relationships between words.
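A toy illustration of the shapes involved (the word indices and dictionary size below are made up): the layer turns a batch of integer word ids into a batch of dense vectors, which start random and are learned during training.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# a batch of two "reviews", each five word indices long (made-up indices)
word_ids = np.array([[4, 17, 256, 9, 0],
                     [88, 3, 3, 1024, 7]])

emb = Sequential([Embedding(input_dim=10000, output_dim=128, input_length=5)])
vectors = emb.predict(word_ids)
print(vectors.shape)   # (2, 5, 128): one 128-dimensional vector per word index
```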
More LSTM units take longer to train and test, and the accuracy rises over more epochs before plateauing.
Here is a variety of runs I did to check how changing the parameters affects test accuracy. The highest test accuracy I achieved was with a smaller number of LSTM units (64).
Columns: run, dict_size, example_len, batch_size, embedding_len, lstm_units, best test accuracy, and the epoch (out of 3) in which that best accuracy was reached.

| run | dict_size | example_len | batch_size | embedding_len | lstm_units | Best Accuracy | Best Epoch (of 3) |
|---|---|---|---|---|---|---|---|
| *Changing example length* | | | | | | | |
| 0 | 5000 | 128 | 32 | 64 | 128 | 0.8624 | 2 |
| 1 | 5000 | 256 | 32 | 64 | 128 | 0.8676 | 2 |
| 2 | 5000 | 512 | 32 | 64 | 128 | 0.8664 | 2 |
| 3 | 5000 | 768 | 32 | 64 | 128 | 0.8671 | 3 |
| 4 | 5000 | 1024 | 32 | 64 | 128 | 0.8682 | 2 |
| *Changing dict size* | | | | | | | |
| 5 | 1000 | 512 | 32 | 64 | 128 | 0.8490 | 3 |
| 6 | 2500 | 512 | 32 | 64 | 128 | 0.8622 | 2 |
| 7 | 5000 | 512 | 32 | 64 | 128 | 0.8707 | 3 |
| 8 | 7500 | 512 | 32 | 64 | 128 | 0.8613 | 3 |
| 9 | 10000 | 512 | 32 | 64 | 128 | 0.8519 | 3 |
| *Changing embedding length* | | | | | | | |
| 10 | 5000 | 512 | 32 | 16 | 128 | 0.8571 | 3 |
| 11 | 5000 | 512 | 32 | 32 | 128 | 0.8608 | 3 |
| 12 | 5000 | 512 | 32 | 64 | 128 | 0.8659 | 3 |
| 13 | 5000 | 512 | 32 | 128 | 128 | 0.8715 | 3 |
| 14 | 5000 | 512 | 32 | 256 | 128 | 0.8744 | 2 |
| *Changing batch size* | | | | | | | |
| 15 | 5000 | 512 | 16 | 64 | 128 | 0.8535 | 3 |
| 16 | 5000 | 512 | 32 | 64 | 128 | 0.8525 | 3 |
| 17 | 5000 | 512 | 64 | 64 | 128 | 0.8532 | 1 |
| 18 | 5000 | 512 | 128 | 64 | 128 | 0.8712 | 2 |
| 19 | 5000 | 512 | 256 | 64 | 128 | 0.8792 | 2 |
| *Changing LSTM units* | | | | | | | |
| 20 | 5000 | 512 | 32 | 64 | 64 | 0.8687 | 2 |
| 21 | 5000 | 512 | 32 | 64 | 96 | 0.8616 | 3 |
| 22 | 5000 | 512 | 32 | 64 | 128 | 0.8739 | 2 |
| 23 | 5000 | 512 | 32 | 64 | 192 | 0.8058 | 1 |
| 24 | 5000 | 512 | 32 | 64 | 256 | 0.8758 | 3 |
GRU performance was about the same as LSTM with the same setup. Here is an example:
dict_size=20000, example_length=512, embedding_length=128, batch_size=32, epochs=15
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 512, 128) 2560000
_________________________________________________________________
dropout_1 (Dropout) (None, 512, 128) 0
_________________________________________________________________
gru_1 (GRU) (None, 128) 98688
_________________________________________________________________
dense_1 (Dense) (None, 1) 129
=================================================================
Epoch 1/15
25000/25000 [==============================] - 1311s - loss: 0.5039 - acc: 0.7541 - val_loss: 0.3929 - val_acc: 0.8264
Epoch 2/15
25000/25000 [==============================] - 1256s - loss: 0.3221 - acc: 0.8674 - val_loss: 0.3383 - val_acc: 0.8573
Epoch 3/15
25000/25000 [==============================] - 1459s - loss: 0.1999 - acc: 0.9250 - val_loss: 0.2929 - val_acc: 0.8843
Epoch 4/15
25000/25000 [==============================] - 1303s - loss: 0.1188 - acc: 0.9581 - val_loss: 0.3674 - val_acc: 0.8635
Epoch 5/15
25000/25000 [==============================] - 1376s - loss: 0.0713 - acc: 0.9763 - val_loss: 0.4488 - val_acc: 0.8546
Epoch 6/15
25000/25000 [==============================] - 1477s - loss: 0.0456 - acc: 0.9848 - val_loss: 0.4656 - val_acc: 0.8605
Epoch 7/15
25000/25000 [==============================] - 1281s - loss: 0.0335 - acc: 0.9884 - val_loss: 0.5412 - val_acc: 0.8617
Epoch 8/15
25000/25000 [==============================] - 1112s - loss: 0.0203 - acc: 0.9938 - val_loss: 0.6372 - val_acc: 0.8548
Epoch 9/15
25000/25000 [==============================] - 1105s - loss: 0.0192 - acc: 0.9938 - val_loss: 0.7120 - val_acc: 0.8537
Epoch 10/15
25000/25000 [==============================] - 1117s - loss: 0.0120 - acc: 0.9960 - val_loss: 0.6570 - val_acc: 0.8589
Epoch 11/15
25000/25000 [==============================] - 1092s - loss: 0.0127 - acc: 0.9957 - val_loss: 0.7253 - val_acc: 0.8550
Epoch 12/15
25000/25000 [==============================] - 1121s - loss: 0.0082 - acc: 0.9972 - val_loss: 0.7711 - val_acc: 0.8578
Epoch 13/15
25000/25000 [==============================] - 1091s - loss: 0.0079 - acc: 0.9970 - val_loss: 0.8059 - val_acc: 0.8573
Epoch 14/15
25000/25000 [==============================] - 1122s - loss: 0.0055 - acc: 0.9984 - val_loss: 0.8002 - val_acc: 0.8556
Epoch 15/15
25000/25000 [==============================] - 1179s - loss: 0.0058 - acc: 0.9982 - val_loss: 0.8742 - val_acc: 0.8565
And another GRU example
dict_size=20000, example_length=512, embedding_length=128, batch_size=32, epochs=15
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 512, 128) 2560000
_________________________________________________________________
dropout_1 (Dropout) (None, 512, 128) 0
_________________________________________________________________
gru_1 (GRU) (None, 128) 98688
_________________________________________________________________
dense_1 (Dense) (None, 1) 129
=================================================================
Epoch 1/15
25000/25000 [==============================] - 1305s - loss: 0.5115 - acc: 0.7475 - val_loss: 0.3811 - val_acc: 0.8444
Epoch 2/15
25000/25000 [==============================] - 1227s - loss: 0.3575 - acc: 0.8498 - val_loss: 0.3368 - val_acc: 0.8539
Epoch 3/15
25000/25000 [==============================] - 1456s - loss: 0.2065 - acc: 0.9202 - val_loss: 0.2895 - val_acc: 0.8784
Epoch 4/15
25000/25000 [==============================] - 1316s - loss: 0.1224 - acc: 0.9551 - val_loss: 0.3285 - val_acc: 0.8743
Epoch 5/15
25000/25000 [==============================] - 1372s - loss: 0.0720 - acc: 0.9762 - val_loss: 0.4011 - val_acc: 0.8704
Epoch 6/15
25000/25000 [==============================] - 1472s - loss: 0.0430 - acc: 0.9853 - val_loss: 0.4521 - val_acc: 0.8607
Epoch 7/15
25000/25000 [==============================] - 1308s - loss: 0.0301 - acc: 0.9898 - val_loss: 0.5105 - val_acc: 0.8640
Epoch 8/15
25000/25000 [==============================] - 1114s - loss: 0.0180 - acc: 0.9946 - val_loss: 0.6100 - val_acc: 0.8609
Epoch 9/15
25000/25000 [==============================] - 1098s - loss: 0.0144 - acc: 0.9956 - val_loss: 0.6792 - val_acc: 0.8538
Epoch 10/15
25000/25000 [==============================] - 1116s - loss: 0.0123 - acc: 0.9959 - val_loss: 0.6622 - val_acc: 0.8522
Epoch 11/15
25000/25000 [==============================] - 1097s - loss: 0.0108 - acc: 0.9962 - val_loss: 0.7350 - val_acc: 0.8570
Epoch 12/15
25000/25000 [==============================] - 1119s - loss: 0.0116 - acc: 0.9959 - val_loss: 0.7120 - val_acc: 0.8472
Epoch 13/15
25000/25000 [==============================] - 1096s - loss: 0.0063 - acc: 0.9980 - val_loss: 0.7980 - val_acc: 0.8534
Epoch 14/15
25000/25000 [==============================] - 1119s - loss: 0.0054 - acc: 0.9981 - val_loss: 0.8704 - val_acc: 0.8454
Epoch 15/15
25000/25000 [==============================] - 1184s - loss: 0.0061 - acc: 0.9978 - val_loss: 0.9434 - val_acc: 0.8522
25000/25000 [==============================] - 359s
Extra credit (5 pts): Try to use pretrained word embeddings to initialize the embedding layer and see how that changes the performance.
I did not do the extra credit; I spent all my time this week on the LSTM trying to get to 90%.
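For reference only, a rough sketch of the general approach: build an index-to-vector matrix from a pretrained file (the GloVe file name and dimensions here are placeholders) and pass it to the Embedding layer as its initial weights.

```python
import numpy as np
from keras.datasets import imdb
from keras.layers import Embedding

dict_size, embedding_length, example_length = 10000, 100, 512
word_index = imdb.get_word_index()   # word -> frequency rank

# fill an index->vector matrix from a GloVe text file (one word + floats per line);
# imdb.load_data reserves indices 0-2, so a word of rank r appears in the data as r + 3
embedding_matrix = np.zeros((dict_size, embedding_length))
with open('glove.6B.100d.txt', encoding='utf-8') as f:    # placeholder file name
    for line in f:
        parts = line.split()
        word, vec = parts[0], np.asarray(parts[1:], dtype='float32')
        rank = word_index.get(word)
        if rank is not None and rank + 3 < dict_size:
            embedding_matrix[rank + 3] = vec

# initialize the Embedding layer with the pretrained matrix; trainable=False freezes it
embedding_layer = Embedding(dict_size, embedding_length,
                            weights=[embedding_matrix],
                            input_length=example_length,
                            trainable=False)
```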