```
In [1]:
```import xor_network

The first issue that I has was that I was trying to output a single scalar whose value could be thresholded to determine whether the network should return TRUE or FALSE. It turns out loss functions for this are much more complicated than if I had instead treated the XOR problem as a classification task with one output per possible label ('TRUE', 'FALSE'). This is the approach I have implemented here.

Another issue I encountered at first was that I was using too few hidden nodes. I originally thought that such a simple problem would only need a couple nodes in a single hidden layer to implement. However, such small networks were extremely slow to converge. This is exemplified in the Architectures section.

Lastly, when I was using small batch sizes (<= 5 examples), and randomly populating the batches, the network would sometimes fail to converge, probably because the batches didn't contain all the possible examples.

I tried ReLU, sigmoid, and tanh activation functions. I only successfully uses a softmax cross-entropy loss function.

The results for the different activation functions can be seen by running the block below. The sigmoid function consistently takes the longest to converge. I'm unsure why tanh does significantly better than sigmoid.

```
In [2]:
```batch_size = 100
num_steps = 10000
num_hidden = 7
num_hidden_layers = 2
learning_rate = 0.2
xor_network.run_network(batch_size, num_steps, num_hidden, num_hidden_layers, learning_rate, False, 'sigmoid')

```
```

```
In [5]:
```xor_network.run_network(batch_size, num_steps, num_hidden, num_hidden_layers, learning_rate, False, 'tanh')

```
```

```
In [4]:
```xor_network.run_network(batch_size, num_steps, num_hidden, num_hidden_layers, learning_rate, False, 'relu')

```
```

The results for several different architectures can be seen by running the code below. Since there is no reading from disk, each iteration takes almost exactly the same amount of time. Therefore, I will report "how long it takes" in number of iterations rather than in time.

```
In [11]:
```# Network with 2 hidden layers of 5 nodes
xor_network.run_network(batch_size, num_steps, 5, 2, learning_rate, False, 'relu')

```
```

```
In [12]:
```# Network with 5 hidden layers of 2 nodes each
num_steps = 3000 # (so it doesn't go on forever)
xor_network.run_network(batch_size, num_steps, 2, 5, learning_rate, False, 'relu')

```
```

**Conclusion from the above:** With the number of parameters held constant, a deeper network does not necessarily perform better than a shallower one. I am guessing this is because fewer nodes in a layer means that the network can keep around less information from layer to layer.

```
In [14]:
```xor_network.run_network(batch_size, num_steps, 3, 5, learning_rate, False, 'relu')

```
```

**Conclusion from the above:** Indeed, the problem is not the number of layers, but the number of nodes in each layer.

```
In [20]:
```# This is the minimum number of nodes I can use to consistently get convergence with Gradient Descent.
xor_network.run_network(batch_size, num_steps, 5, 1, learning_rate, False, 'relu')

```
```

```
In [31]:
```# If I switch to using Adam Optimizer, I can get down to 2 hidden nodes and consistently have convergence.
xor_network.run_network(batch_size, num_steps, 2, 1, learning_rate, True, 'relu')

```
```