We ran 200 different training processes, each one learning to mimic the function multiply(n, m) over all input pairs between 1 and 10, with each process using a different number of units in the hidden layer. Each Neural Network Regressor was trained for 100,000 steps. Before running this, my hypothesis was that the best number of hidden units would be the square of the largest number in the input range. For example, since these inputs range from 1 to 10, my hypothesis was that the best number of hidden units is 10^2, or 100.
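A minimal sketch of one such run, assuming a scikit-learn `MLPRegressor` stands in for the Neural Network Regressor (the exact architecture, optimizer, and step count from the original experiment are assumptions here, and far fewer iterations are used for brevity):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# All input pairs (n, m) with 1 <= n, m <= 10 and their products.
X = np.array([(n, m) for n in range(1, 11) for m in range(1, 11)], dtype=float)
y = np.array([n * m for n, m in X])

def train_run(hidden_units, seed=0):
    """Train one regressor with the given hidden-layer size; return its MSE."""
    model = MLPRegressor(
        hidden_layer_sizes=(hidden_units,),
        solver="lbfgs",     # full-batch optimizer; works well on 100 points
        max_iter=20000,
        random_state=seed,
    )
    model.fit(X, y)
    pred = model.predict(X)
    return float(np.mean((pred - y) ** 2))

# Compare the hypothesized size (100) against a couple of alternatives.
errors = {h: train_run(h) for h in (10, 100, 200)}
```

Sweeping `hidden_units` over 1 to 200 and plotting `errors` reproduces the shape of the analysis described here.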
The graph below displays the results of this analysis.
This tells me that my hypothesis was very wrong, which made me wonder why having 200 hidden units performed so well. All of the errors were calculated in the cell below.
In [ ]: