Homework 2

Daphne Ippolito


In [1]:
import xor_network

What issues did you have?

The first issue I had was that I was trying to output a single scalar whose value could be thresholded to determine whether the network should return TRUE or FALSE. It turns out loss functions for this are much more complicated than if I had instead treated the XOR problem as a classification task with one output per possible label ('TRUE', 'FALSE'). This is the approach I have implemented here.
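
To make this concrete, here is a minimal NumPy sketch of the two-output formulation (illustrative only, not the actual xor_network code): each example gets an integer label over ('FALSE', 'TRUE'), and the loss is softmax cross-entropy over two logits.

import numpy as np

# Illustrative sketch of the classification formulation; not the xor_network code.
def softmax_cross_entropy(logits, labels):
    # logits: (batch, 2) scores; labels: (batch,) integer class ids (0=FALSE, 1=TRUE)
    shifted = logits - logits.max(axis=1, keepdims=True)   # subtract max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])              # 0 = FALSE, 1 = TRUE
W = np.random.randn(2, 2) * 0.1         # a single linear layer, just to exercise the loss
print(softmax_cross_entropy(x @ W, y))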

Another issue I encountered at first was that I was using too few hidden nodes. I originally thought that such a simple problem could be solved with only a couple of nodes in a single hidden layer. However, such small networks were extremely slow to converge. This is exemplified in the Architectures section.

Lastly, when I was using small batch sizes (<= 5 examples) and randomly populating the batches, the network would sometimes fail to converge, probably because individual batches didn't contain all of the possible examples.
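
One fix for the small-batch problem would be to seed every batch with all four XOR cases and fill only the remaining slots randomly. The helper below is hypothetical (it is not part of xor_network), just to illustrate the idea.

import numpy as np

# Hypothetical batch sampler, not part of xor_network: every batch contains all
# four XOR cases, and any remaining slots are filled by sampling with replacement,
# so even a tiny batch can never miss an input pattern entirely.
XOR_INPUTS = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
XOR_LABELS = np.array([0, 1, 1, 0])

def make_batch(batch_size, rng=np.random):
    assert batch_size >= 4, 'need room for all four XOR cases'
    extra = rng.randint(0, 4, size=batch_size - 4)
    idx = np.concatenate([np.arange(4), extra])
    return XOR_INPUTS[idx], XOR_LABELS[idx]

xs, ys = make_batch(5)   # even a batch of 5 covers every case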

Which activation functions did you try? Which loss functions?

I tried ReLU, sigmoid, and tanh activation functions. The only loss function I used successfully was softmax cross-entropy.

The results for the different activation functions can be seen by running the blocks below. The sigmoid activation consistently takes the longest to converge. I'm unsure why tanh does significantly better than sigmoid.
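
For reference, here are the three activations written out explicitly (independent of the xor_network internals), applied to the same pre-activation values:

import numpy as np

# ReLU clips negatives to zero, sigmoid squashes to (0, 1), and tanh squashes
# to (-1, 1) and is zero-centered.
def relu(z):    return np.maximum(0.0, z)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3, 3, 7)
for name, fn in [('relu', relu), ('sigmoid', sigmoid), ('tanh', np.tanh)]:
    print(name, np.round(fn(z), 2))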


In [2]:
batch_size = 100
num_steps = 10000
num_hidden = 7
num_hidden_layers = 2
learning_rate = 0.2

xor_network.run_network(batch_size, num_steps, num_hidden, num_hidden_layers, learning_rate, False, 'sigmoid')


Adding hidden layer with 7 nodes
Adding hidden layer with 7 nodes
Step 50: loss = 0.61, batch acc = 0.67, test acc = 0.67 (0.004 sec)
Step 100: loss = 0.52, batch acc = 0.72, test acc = 0.72 (0.004 sec)
Step 150: loss = 0.45, batch acc = 0.73, test acc = 0.73 (0.003 sec)
Step 200: loss = 0.37, batch acc = 0.71, test acc = 0.71 (0.004 sec)
Step 250: loss = 0.28, batch acc = 1.00, test acc = 1.00 (0.004 sec)
Test accuracy is perfect after 250 iterations. Quitting.

In [5]:
xor_network.run_network(batch_size, num_steps, num_hidden, num_hidden_layers, learning_rate, False, 'tanh')


Adding hidden layer with 7 nodes
Adding hidden layer with 7 nodes
Step 50: loss = 0.07, batch acc = 1.00, test acc = 1.00 (0.003 sec)
Test accuracy is perfect after 50 iterations. Quitting.

In [4]:
xor_network.run_network(batch_size, num_steps, num_hidden, num_hidden_layers, learning_rate, False, 'relu')


Adding hidden layer with 7 nodes
Adding hidden layer with 7 nodes
Step 50: loss = 0.10, batch acc = 1.00, test acc = 1.00 (0.004 sec)
Test accuracy is perfect after 50 iterations. Quitting.

What architectures did you try? What were the different results? How long did it take?

The results for several different architectures can be seen by running the code below. Since there is no reading from disk, each iteration takes almost exactly the same amount of time. Therefore, I will report "how long it takes" in number of iterations rather than in time.
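
The individual cells below can also be thought of as one sweep. The sketch here just loops over (nodes per hidden layer, number of hidden layers) pairs using the same run_network call and the variables already defined above; the actual runs and their outputs follow.

# Sketch of a sweep over the architectures examined below.
for nodes_per_layer, n_layers in [(5, 2), (2, 5), (3, 5), (5, 1)]:
    print('--- %d hidden layer(s) of %d nodes ---' % (n_layers, nodes_per_layer))
    xor_network.run_network(batch_size, num_steps, nodes_per_layer, n_layers,
                            learning_rate, False, 'relu')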


In [11]:
# Network with 2 hidden layers of 5 nodes
xor_network.run_network(batch_size, num_steps, 5, 2, learning_rate, False, 'relu')


Adding hidden layer with 5 nodes
Adding hidden layer with 5 nodes
Step 50: loss = 0.44, batch acc = 0.75, test acc = 0.75 (0.003 sec)
Step 100: loss = 0.19, batch acc = 1.00, test acc = 1.00 (0.003 sec)
Test accuracy is perfect after 100 iterations. Quitting.

In [12]:
# Network with 5 hidden layers of 2 nodes each
num_steps = 3000 # (so it doesn't go on forever)
xor_network.run_network(batch_size, num_steps, 2, 5, learning_rate, False, 'relu')


Adding hidden layer with 2 nodes
Adding hidden layer with 2 nodes
Adding hidden layer with 2 nodes
Adding hidden layer with 2 nodes
Adding hidden layer with 2 nodes
Step 50: loss = 0.56, batch acc = 0.75, test acc = 0.75 (0.005 sec)
Step 100: loss = 0.53, batch acc = 0.78, test acc = 0.78 (0.005 sec)
Step 150: loss = 0.53, batch acc = 0.78, test acc = 0.78 (0.005 sec)
Step 200: loss = 0.59, batch acc = 0.72, test acc = 0.72 (0.004 sec)
Step 250: loss = 0.53, batch acc = 0.78, test acc = 0.78 (0.006 sec)
Step 300: loss = 0.57, batch acc = 0.74, test acc = 0.74 (0.005 sec)
Step 350: loss = 0.52, batch acc = 0.79, test acc = 0.79 (0.005 sec)
Step 400: loss = 0.60, batch acc = 0.72, test acc = 0.72 (0.006 sec)
Step 450: loss = 0.55, batch acc = 0.76, test acc = 0.76 (0.005 sec)
Step 500: loss = 0.53, batch acc = 0.78, test acc = 0.78 (0.006 sec)
Step 550: loss = 0.58, batch acc = 0.73, test acc = 0.73 (0.005 sec)
Step 600: loss = 0.58, batch acc = 0.73, test acc = 0.73 (0.004 sec)
Step 650: loss = 0.59, batch acc = 0.72, test acc = 0.72 (0.005 sec)
Step 700: loss = 0.54, batch acc = 0.77, test acc = 0.77 (0.005 sec)
Step 750: loss = 0.50, batch acc = 0.81, test acc = 0.81 (0.005 sec)
Step 800: loss = 0.60, batch acc = 0.72, test acc = 0.72 (0.005 sec)
Step 850: loss = 0.68, batch acc = 0.64, test acc = 0.64 (0.005 sec)
Step 900: loss = 0.56, batch acc = 0.75, test acc = 0.75 (0.004 sec)
Step 950: loss = 0.53, batch acc = 0.78, test acc = 0.78 (0.005 sec)
Step 1000: loss = 0.56, batch acc = 0.75, test acc = 0.75 (0.005 sec)
Step 1050: loss = 0.54, batch acc = 0.77, test acc = 0.77 (0.006 sec)
Step 1100: loss = 0.60, batch acc = 0.72, test acc = 0.72 (0.004 sec)
Step 1150: loss = 0.58, batch acc = 0.73, test acc = 0.73 (0.005 sec)
Step 1200: loss = 0.62, batch acc = 0.70, test acc = 0.70 (0.005 sec)
Step 1250: loss = 0.57, batch acc = 0.74, test acc = 0.74 (0.005 sec)
Step 1300: loss = 0.54, batch acc = 0.77, test acc = 0.77 (0.006 sec)
Step 1350: loss = 0.58, batch acc = 0.73, test acc = 0.73 (0.005 sec)
Step 1400: loss = 0.59, batch acc = 0.72, test acc = 0.72 (0.005 sec)
Step 1450: loss = 0.55, batch acc = 0.76, test acc = 0.76 (0.006 sec)
Step 1500: loss = 0.50, batch acc = 0.81, test acc = 0.81 (0.003 sec)
Step 1550: loss = 0.59, batch acc = 0.73, test acc = 0.73 (0.004 sec)
Step 1600: loss = 0.54, batch acc = 0.77, test acc = 0.77 (0.006 sec)
Step 1650: loss = 0.53, batch acc = 0.78, test acc = 0.78 (0.005 sec)
Step 1700: loss = 0.55, batch acc = 0.76, test acc = 0.76 (0.004 sec)
Step 1750: loss = 0.54, batch acc = 0.77, test acc = 0.77 (0.006 sec)
Step 1800: loss = 0.53, batch acc = 0.78, test acc = 0.78 (0.006 sec)
Step 1850: loss = 0.56, batch acc = 0.75, test acc = 0.75 (0.006 sec)
Step 1900: loss = 0.50, batch acc = 0.81, test acc = 0.81 (0.006 sec)
Step 1950: loss = 0.55, batch acc = 0.76, test acc = 0.76 (0.005 sec)
Step 2000: loss = 0.60, batch acc = 0.71, test acc = 0.71 (0.004 sec)
Step 2050: loss = 0.51, batch acc = 0.80, test acc = 0.80 (0.004 sec)
Step 2100: loss = 0.55, batch acc = 0.76, test acc = 0.76 (0.006 sec)
Step 2150: loss = 0.53, batch acc = 0.78, test acc = 0.78 (0.005 sec)
Step 2200: loss = 0.54, batch acc = 0.77, test acc = 0.77 (0.004 sec)
Step 2250: loss = 0.54, batch acc = 0.77, test acc = 0.77 (0.003 sec)
Step 2300: loss = 0.62, batch acc = 0.70, test acc = 0.70 (0.006 sec)
Step 2350: loss = 0.60, batch acc = 0.72, test acc = 0.72 (0.006 sec)
Step 2400: loss = 0.55, batch acc = 0.76, test acc = 0.76 (0.006 sec)
Step 2450: loss = 0.56, batch acc = 0.75, test acc = 0.75 (0.006 sec)
Step 2500: loss = 0.59, batch acc = 0.73, test acc = 0.73 (0.005 sec)
Step 2550: loss = 0.56, batch acc = 0.75, test acc = 0.75 (0.007 sec)
Step 2600: loss = 0.64, batch acc = 0.68, test acc = 0.68 (0.009 sec)
Step 2650: loss = 0.64, batch acc = 0.68, test acc = 0.68 (0.005 sec)
Step 2700: loss = 0.49, batch acc = 0.82, test acc = 0.82 (0.005 sec)
Step 2750: loss = 0.56, batch acc = 0.75, test acc = 0.75 (0.006 sec)
Step 2800: loss = 0.54, batch acc = 0.77, test acc = 0.77 (0.006 sec)
Step 2850: loss = 0.55, batch acc = 0.76, test acc = 0.76 (0.004 sec)
Step 2900: loss = 0.54, batch acc = 0.77, test acc = 0.77 (0.004 sec)
Step 2950: loss = 0.65, batch acc = 0.67, test acc = 0.67 (0.005 sec)
After 3000 iterations, the network has still not converged. Something must be very wrong.

Conclusion from the above: With the total number of hidden nodes held constant (ten), a deeper network does not necessarily perform better than a shallower one. I am guessing this is because fewer nodes per layer means the network can carry less information from layer to layer.


In [14]:
xor_network.run_network(batch_size, num_steps, 3, 5, learning_rate, False, 'relu')


Adding hidden layer with 3 nodes
Adding hidden layer with 3 nodes
Adding hidden layer with 3 nodes
Adding hidden layer with 3 nodes
Adding hidden layer with 3 nodes
Step 50: loss = 0.46, batch acc = 0.76, test acc = 0.76 (0.003 sec)
Step 100: loss = 0.07, batch acc = 1.00, test acc = 1.00 (0.005 sec)
Test accuracy is perfect after 100 iterations. Quitting.

Conclusion from the above: Indeed, the problem is not the number of layers, but the number of nodes in each layer.


In [20]:
# This is the minimum number of nodes I can use to consistently get convergence with Gradient Descent.
xor_network.run_network(batch_size, num_steps, 5, 1, learning_rate, False, 'relu')


Adding hidden layer with 5 nodes
Step 50: loss = 0.20, batch acc = 1.00, test acc = 1.00 (0.003 sec)
Test accuracy is perfect after 50 iterations. Quitting.

In [31]:
# If I switch to the Adam optimizer, I can get down to 2 hidden nodes and still consistently get convergence.
xor_network.run_network(batch_size, num_steps, 2, 1, learning_rate, True, 'relu')


Adding hidden layer with 2 nodes
Step 50: loss = 0.00, batch acc = 1.00, test acc = 1.00 (0.003 sec)
Test accuracy is perfect after 50 iterations. Quitting.
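
As a sanity check on the two-hidden-node result, the classic hand-built XOR network with two ReLU hidden units is shown below (a sketch independent of xor_network, written with a single scalar output rather than the two-logit formulation used above). It shows that two hidden units are enough to represent XOR exactly, so the trouble plain gradient descent has at that size is presumably optimization, not capacity.

import numpy as np

# Hand-constructed XOR with two ReLU hidden units:
#   h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1), y = h1 - 2*h2
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])    # input -> hidden weights
b1 = np.array([0.0, -1.0])     # hidden biases
w2 = np.array([1.0, -2.0])     # hidden -> output weights

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
h = np.maximum(0.0, x @ W1 + b1)
print(h @ w2)                   # [0. 1. 1. 0.] -- exactly XOR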