We ran 200 different training processes, each one learning to mimic the function multiply(n, m) over all input pairs between 1 and 10, with each process using a different number of units in the hidden layer. Each Neural Network Regressor was trained for 100,000 steps. Before running this, my hypothesis was that the best number of hidden units would be the square of the largest number in the input range. For example, since these inputs range from 1 to 10, my hypothesis was that the best number of hidden units is 10^2, or 100.
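A minimal sketch of one such run, assuming a scikit-learn `MLPRegressor` stands in for the Neural Network Regressor (the exact architecture, optimizer, and step count from the original experiment are assumptions here, and far fewer iterations are used for brevity):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# All input pairs (n, m) with 1 <= n, m <= 10 and their products.
X = np.array([(n, m) for n in range(1, 11) for m in range(1, 11)], dtype=float)
y = np.array([n * m for n, m in X])

def train_run(hidden_units, seed=0):
    """Train one regressor with the given hidden-layer size; return its MSE."""
    model = MLPRegressor(
        hidden_layer_sizes=(hidden_units,),
        solver="lbfgs",     # full-batch optimizer; works well on 100 points
        max_iter=20000,
        random_state=seed,
    )
    model.fit(X, y)
    pred = model.predict(X)
    return float(np.mean((pred - y) ** 2))

# Compare the hypothesized size (100) against a couple of alternatives.
errors = {h: train_run(h) for h in (10, 100, 200)}
```

Sweeping `hidden_units` over 1 to 200 and plotting `errors` reproduces the shape of the analysis described here.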
The graph below displays the results of this analysis.
This tells me that my hypothesis was very wrong, which made me wonder why having 200 hidden units performed so well. All of the errors were calculated in the cell below.
In [ ]: