This homework requires sending **multiple** files; please do not forget to include all of them when sending to the TA. The list of files:

- This notebook
- HW3_Modules.ipynb
- HW3_differentiation.ipynb

```python
%matplotlib inline
from time import time, sleep
import numpy as np
import matplotlib.pyplot as plt
from IPython import display
```

Implement everything in `HW3_Modules.ipynb`. Read all the comments thoughtfully to ease the pain. Please try not to change the prototypes.

Do not forget that each module should return AND store `output` and `gradInput`.

The typical assumption is that `module.backward` is always executed after `module.forward`, so `output` is already stored; this is useful for `SoftMax`.
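To illustrate the "return AND store" convention, here is a minimal sketch of a softmax layer that caches `output` in `forward` and reuses it in `backward`. The class name `SoftMaxSketch` is hypothetical and is not the class from `HW3_Modules.ipynb`; the `backward(input, gradOutput)` signature is assumed from how `net.backward` is called in the training loop.

```python
import numpy as np


class SoftMaxSketch:
    """Hypothetical stand-in for a module; not the class from HW3_Modules.ipynb."""

    def __init__(self):
        self.output = None
        self.gradInput = None

    def forward(self, input):
        # Subtract the row-wise max for numerical stability.
        shifted = input - input.max(axis=1, keepdims=True)
        exp = np.exp(shifted)
        # Store the result on the module AND return it.
        self.output = exp / exp.sum(axis=1, keepdims=True)
        return self.output

    def backward(self, input, gradOutput):
        # Reuse the output cached by forward instead of recomputing it.
        s = self.output
        dot = (gradOutput * s).sum(axis=1, keepdims=True)
        self.gradInput = s * (gradOutput - dot)
        return self.gradInput


layer = SoftMaxSketch()
x = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
probs = layer.forward(x)
grad = layer.backward(x, np.ones_like(probs))
```

Because the softmax outputs of each row sum to one, a constant `gradOutput` should produce a zero `gradInput`, which makes a quick sanity check for this kind of layer.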

```python
"""
    --------------------------------------
    -- Tech note
    --------------------------------------
    Inspired by torch I would use

    np.multiply, np.add, np.divide, np.subtract instead of *,+,/,-
    for better memory handling

    Suppose you allocated a variable

        a = np.zeros(...)

    So, instead of

        a = b + c  # will be reallocated, GC needed to free

    I would go for:

        np.add(b,c,out = a) # puts result in `a`

    But it is completely up to you.
"""
%run HW3_Modules.ipynb
```
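The difference described in the tech note can be observed directly. The following is a small sketch with made-up arrays; the variable names are illustrative only.

```python
import numpy as np

b = np.array([1.0, 2.0, 3.0])
c = np.array([10.0, 20.0, 30.0])

a = np.zeros(3)
buffer_before = a          # second reference to the same array object

a = b + c                  # allocates a new array and rebinds the name `a`
rebound = a is not buffer_before

a = buffer_before          # go back to the preallocated buffer
np.add(b, c, out=a)        # writes the sum into the existing buffer
reused = a is buffer_before

print(rebound, reused)
```

With `out=`, the preallocated buffer receives the result, so no new array has to be allocated for it on each step.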

Optimizer is implemented for you.

```python
def sgd_momentum(x, dx, config, state):
    """
        This is a very ugly implementation of sgd with momentum
        just to show an example how to store old grad in state.

        config:
            - momentum
            - learning_rate
        state:
            - old_grad
    """
    # x and dx have complex structure, old dx will be stored in a simpler one
    state.setdefault('old_grad', {})

    i = 0
    for cur_layer_x, cur_layer_dx in zip(x, dx):
        for cur_x, cur_dx in zip(cur_layer_x, cur_layer_dx):
            cur_old_grad = state['old_grad'].setdefault(i, np.zeros_like(cur_dx))
            np.add(config['momentum'] * cur_old_grad, config['learning_rate'] * cur_dx, out=cur_old_grad)
            cur_x -= cur_old_grad
            i += 1
```
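The update rule used by this optimizer can be tried in isolation. Below is a minimal sketch on a toy objective $f(x) = \frac{1}{2}x^2$, whose gradient is $x$ itself. The helper `momentum_step` is hypothetical and only mirrors the two in-place lines of `sgd_momentum` for a single parameter array.

```python
import numpy as np


def momentum_step(x, dx, velocity, learning_rate=0.1, momentum=0.9):
    # velocity <- momentum * velocity + learning_rate * dx  (in place)
    np.add(momentum * velocity, learning_rate * dx, out=velocity)
    # x <- x - velocity  (in place, so the caller's array is updated)
    x -= velocity


x = np.array([5.0])
velocity = np.zeros_like(x)
for _ in range(200):
    dx = x.copy()  # gradient of f(x) = 0.5 * x**2 is x itself
    momentum_step(x, dx, velocity)

print(x)
```

The parameter oscillates around the minimum at zero while the amplitude shrinks, which is the typical behaviour of momentum on a quadratic.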

```python
# Generate some data
N = 500

X1 = np.random.randn(N, 2) + np.array([2, 2])
X2 = np.random.randn(N, 2) + np.array([-2, -2])

Y = np.concatenate([np.ones(N), np.zeros(N)])[:, None]
Y = np.hstack([Y, 1 - Y])

X = np.vstack([X1, X2])
plt.scatter(X[:, 0], X[:, 1], c=Y[:, 0], edgecolors='none')
```

Define a **logistic regression** for debugging.

```python
net = Sequential()
net.add(Linear(2, 2))
net.add(SoftMax())

criterion = ClassNLLCriterion()

print(net)

# Test something like that then

# net = Sequential()
# net.add(Linear(2, 4))
# net.add(ReLU())
# net.add(Linear(4, 2))
# net.add(SoftMax())
```

Start with batch_size = 1000 to make sure every step lowers the loss, then try stochastic version.

```python
# Optimizer params
optimizer_config = {'learning_rate': 1e-1, 'momentum': 0.9}
optimizer_state = {}

# Looping params
n_epoch = 20
batch_size = 128
```

```python
# batch generator
def get_batches(dataset, batch_size):
    X, Y = dataset
    n_samples = X.shape[0]

    # Shuffle at the start of epoch
    indices = np.arange(n_samples)
    np.random.shuffle(indices)

    for start in range(0, n_samples, batch_size):
        end = min(start + batch_size, n_samples)

        batch_idx = indices[start:end]

        yield X[batch_idx], Y[batch_idx]
```

Basic training loop. Examine it.

```python
loss_history = []

for i in range(n_epoch):
    for x_batch, y_batch in get_batches((X, Y), batch_size):

        net.zeroGradParameters()

        # Forward
        predictions = net.forward(x_batch)
        loss = criterion.forward(predictions, y_batch)

        # Backward
        dp = criterion.backward(predictions, y_batch)
        net.backward(x_batch, dp)

        # Update weights
        sgd_momentum(net.getParameters(),
                     net.getGradParameters(),
                     optimizer_config,
                     optimizer_state)

        loss_history.append(loss)

    # Visualize
    display.clear_output(wait=True)
    plt.figure(figsize=(8, 6))

    plt.title("Training loss")
    plt.xlabel("#iteration")
    plt.ylabel("loss")
    plt.plot(loss_history, 'b')
    plt.show()

    print('Current loss: %f' % loss)
```

```python
import os
from sklearn.datasets import fetch_openml

# Fetch MNIST dataset and create a local copy.
if os.path.exists('mnist.npz'):
    with np.load('mnist.npz', 'r') as data:
        X = data['X']
        y = data['y']
else:
    # fetch_openml returns the labels as strings, so cast them to integers.
    mnist = fetch_openml('mnist_784', version=1, as_frame=False)
    X, y = mnist.data / 255.0, mnist.target.astype(int)
    np.savez('mnist.npz', X=X, y=y)
```

One-hot encode the labels first.

```python
# Your code goes here. ################################################
```
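One possible way to do the one-hot encoding with plain NumPy indexing is sketched below. The helper name `one_hot` and its parameters are assumptions made for illustration; they are not part of the provided modules.

```python
import numpy as np


def one_hot(labels, n_classes):
    """Return an (n_samples, n_classes) matrix with a single 1.0 per row."""
    labels = np.asarray(labels).astype(int)
    encoded = np.zeros((labels.shape[0], n_classes))
    # Fancy indexing: row i gets a 1.0 in column labels[i].
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded


example = one_hot([3, 0, 9], n_classes=10)
print(example.shape)
```

For MNIST this would be called with the label vector `y` and `n_classes=10`, giving targets in the same layout as `Y` in the toy example above.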

- **Compare** the `ReLU`, `ELU`, `LeakyReLU`, and `SoftPlus` activation functions. It would be better to pick the best optimizer params for each of them, but that is overkill for now. Use an architecture of your choice for the comparison.
- **Try** inserting `BatchMeanSubtraction` between the `Linear` module and the activation functions.
- Plot the losses from both the activation function comparison and the `BatchMeanSubtraction` comparison on one plot. Please find a scale (log?) at which the lines are distinguishable, and do not forget to name the axes; the plot should be good-looking.
- Hint: logloss for MNIST should be around 0.5.

```python
# Your code goes here. ################################################
```

Does `BatchMeanSubtraction` help?

```python
# Your answer goes here. ################################################
```

**Finally**, use all your knowledge to build a super cool model on this dataset, do not forget to split dataset into train and validation. Use **dropout** to prevent overfitting, play with **learning rate decay**. You can use **data augmentation** such as rotations, translations to boost your score. Use your knowledge and imagination to train a model.
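A train/validation split can be done with a shuffled index array. The sketch below uses a tiny synthetic dataset; the function name `train_val_split`, the 20% validation fraction, and the fixed seed are all assumptions chosen for illustration.

```python
import numpy as np


def train_val_split(X, Y, val_fraction=0.2, seed=0):
    """Shuffle sample indices once and cut them into validation and train parts."""
    rng = np.random.RandomState(seed)
    indices = rng.permutation(X.shape[0])
    n_val = int(X.shape[0] * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return (X[train_idx], Y[train_idx]), (X[val_idx], Y[val_idx])


# Hypothetical stand-in data: 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
Y_demo = np.arange(10)
(train_X, train_Y), (val_X, val_Y) = train_val_split(X_demo, Y_demo)
print(train_X.shape, val_X.shape)
```

Splitting by index keeps each row of `X` aligned with its label, and the validation part is never seen by the optimizer.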

```python
# Your code goes here. ################################################
```

Print here your accuracy. It should be around 90%.

```python
# Your answer goes here. ################################################
```

This part is **OPTIONAL**; you may skip it. It will not be scored, but it is easy and interesting.

Now we are going to build a cool model called an autoencoder. The aim is simple: **encode** the data to a lower dimensional representation. Why? Well, if we can **decode** this representation back to the original data with a "small" reconstruction loss, then we can store only the compressed representation, saving memory. But the most important thing is that we can reuse the trained autoencoder for classification.


Now implement an autoencoder:

Build it such that the dimensionality inside the autoencoder changes like this:

$$784 \text{ (data)} \to 512 \to 256 \to 128 \to 30 \to 128 \to 256 \to 512 \to 784$$

Use **MSECriterion** to score the reconstruction. Use **BatchMeanSubtraction** between **Linear** and **ReLU**. You may not use nonlinearity in bottleneck layer.

You may train it for 9 epochs with batch size = 256 and initial lr = 0.1, dropping by a factor of 2 every 3 epochs. The reconstruction loss should be about 6.0 and the visual quality should already be decent. Do not spend time on changing the architecture; the alternatives are more or less the same.
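The learning rate schedule described here (start at 0.1, halve every 3 epochs) can be written as a small function. The name `step_decay` and its parameter names are hypothetical.

```python
def step_decay(epoch, initial_lr=0.1, drop_factor=2.0, epochs_per_drop=3):
    """Learning rate for a zero-based epoch under a step schedule."""
    return initial_lr / drop_factor ** (epoch // epochs_per_drop)


# Learning rates for the 9 epochs mentioned above.
schedule = [step_decay(epoch) for epoch in range(9)]
print(schedule)
```

Inside a training loop, one way to apply it is to set `optimizer_config['learning_rate'] = step_decay(i)` at the start of epoch `i`.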

```python
# Your code goes here. ################################################
```

```python
# Extract inner representation for train and validation,
# you should get (n_samples, 30) matrices
# Your code goes here. ################################################

# Now build a logistic regression or small classification net
cnet = Sequential()
cnet.add(Linear(30, 10))  # 10 outputs, one per MNIST digit class
cnet.add(SoftMax())

# Learn the weights
# Your code goes here. ################################################

# Now chop off decoder part
# (you may need to implement `remove` method for Sequential container)
# Your code goes here. ################################################

# And add learned layers on top.
autoenc.add(cnet[0])
autoenc.add(cnet[1])

# Now optimize whole model
# Your code goes here. ################################################
```

Fit PCA on the *train set*, then plot the original image, the autoencoder reconstruction, and the PCA reconstruction side by side for 10 samples from the *validation set*.
You will probably need the following snippet to make the autoencoder examples look comparable.
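For the PCA side of the comparison, a minimal NumPy sketch based on the SVD is shown below. The helper names `pca_fit` and `pca_reconstruct` are hypothetical, and random data stands in for MNIST; a library implementation such as scikit-learn's `PCA` would serve equally well.

```python
import numpy as np


def pca_fit(X_train, n_components):
    """Return the training mean and the top principal directions."""
    mean = X_train.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_reconstruct(X, mean, components):
    """Project onto the components, map back, and clip to the pixel range."""
    codes = (X - mean) @ components.T
    return np.clip(codes @ components + mean, 0, 1)


rng = np.random.RandomState(0)
X_train_demo = rng.rand(100, 20)  # hypothetical stand-in for MNIST train rows
X_val_demo = rng.rand(10, 20)     # hypothetical stand-in for validation rows
mean, components = pca_fit(X_train_demo, n_components=5)
reconstruction = pca_reconstruct(X_val_demo, mean, components)
print(reconstruction.shape)
```

Clipping both reconstructions to `[0, 1]` keeps them in the same range as the normalized input pixels, so the images are displayed on the same intensity scale.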

```python
# np.clip(prediction,0,1)
#
# Your code goes here. ################################################
```