In [2]:
import plots, processing
import numpy as np
Neural networks are very popular nowadays, with many libraries offering very efficient ways of using them. Many problems can be solved with a neural network, but finding the right one, the most efficient and the most accurate, is complex and time consuming. Finding the optimal hyperparameters of a neural network can be difficult: people usually optimise them by hand, using their knowledge of neural networks. The effort required to find an optimal or quasi-optimal configuration is so high that people often settle for non-optimised networks. In this report we compare a few solutions proposed to optimise these parameters. We applied the solutions to the MNIST dataset [6].
Our case study goes through the following steps:
Last, we list the references and tools used for this project in the Reference and acknowledgements section.
For this project we used the well-known MNIST dataset [6], composed of labeled images representing the digits 0 to 9. The training set contains 60 000 images, and the test set and the validation set contain 10 000 images each. We do not use any data augmentation.
We chose to implement the model using Keras [3]. We used both TensorFlow [2] and Theano [1] as the underlying backend.
The basic model is a logistic regression with 784 inputs and 10 outputs, using categorical cross-entropy as the loss and SGD as the optimizer.
We chose to fix the number of epochs to 100 and the batch size to 200. This choice was made to simplify the problem and to force the network to find a solution in a reasonable time.
All remaining parameters are left to the optimisation.
One MNIST training took 1-2 minutes on our computers.
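As a reference, here is a minimal sketch of this baseline, assuming a Keras 2-style API; the learning rate and the way the validation set is obtained are placeholder assumptions, since the remaining hyperparameters are exactly what we optimise.
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.datasets import mnist
from keras.utils import to_categorical

# Load MNIST and flatten the 28x28 images into 784-dimensional vectors.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
y_train = to_categorical(y_train, 10)

# Logistic regression: a single dense layer with a softmax over the 10 classes.
model = Sequential([Dense(10, input_dim=784, activation="softmax")])
model.compile(loss="categorical_crossentropy",
              optimizer=SGD(lr=0.01),  # placeholder learning rate
              metrics=["accuracy"])

# Fixed training budget: 100 epochs, batch size 200; the last 10 000 training
# images are held out here as a validation set (placeholder split).
model.fit(x_train, y_train, epochs=100, batch_size=200,
          validation_split=10000.0 / 60000.0)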
For this project, we chose to optimise the network (trained with stochastic gradient descent) over the following parameters:
The bounds on the number of layers and neurons are a strict policy, as they control the capacity of the model. The other ranges, however, are more of a general guideline: they are there to guide the algorithms toward "relevant values". When an algorithm did not behave well close to zero or with too large values, we would adjust them a little. This is a tradeoff between the maximum accuracy achievable and the convergence of the algorithm.
The measure for one set of hyperparameters is the validation accuracy after training on the full training set (100 epochs).
For a purely theoretical approach, we would need to run each algorithm with a fixed budget of MNIST trainings (e.g. 200) multiple times, to account for the probabilistic nature of the algorithm. We would also need to repeat this operation on a number of problems other than MNIST. We clearly do not have this kind of computing power.
Running a specific algorithm also requires adjusting a couple of its own parameters (hopefully fewer than for training MNIST), or even small redefinitions of the problem as stated previously. This kind of intervention is hard to account for at a theoretical level.
Therefore, we will compare the different algorithms from the perspective of an individual trying to optimize the training of his neural network. The individual can make some obvious adjustments but expects the optimisation algorithm to do the heavy lifting.
The most natural idea after brute force is to explore random search. Any algorithm performing worse than this baseline cannot be retained as a good solution. We used the following marginal probability distributions to sample for our experiments:
The number of layers has been fixed to 2, to compare better with hyperMads and hyperBayes, which we did not have time to run on more than three layers. The conclusion would not change though.
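For illustration, the random-search loop is essentially the following sketch; the marginal distributions written here are placeholders (not the exact ones we used), and train_mnist is a hypothetical helper that trains the Keras model for 100 epochs and returns the validation accuracy.
rng = np.random.RandomState(0)

def sample_configuration():
    # Placeholder marginal distributions, one per hyperparameter.
    return {
        "n_couches": 2,                                       # fixed to 2 hidden layers
        "noeuds": [int(rng.uniform(10, 500)) for _ in range(2)],
        "learning_rate": float(10 ** rng.uniform(-4, 0)),     # log-uniform
        "moment": float(rng.uniform(0, 1)),
        "decay": float(10 ** rng.uniform(-6, -1)),
        "reg_l1": float(10 ** rng.uniform(-6, -1)),
        "reg_l2": float(10 ** rng.uniform(-6, -1)),
        "nesterov": bool(rng.randint(2)),
        "activation": rng.choice(["sigmoid", "tanh", "relu"]),
    }

best_params, best_acc = None, 0.0
for _ in range(200):                       # budget of full MNIST trainings
    params = sample_configuration()
    acc = train_mnist(params)              # hypothetical helper: 100 epochs, returns val. accuracy
    if acc > best_acc:
        best_params, best_acc = params, acc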
In [2]:
accRand = processing.accuracy("results/hyperRand.json")
plots.summary(accRand, .8)
plots.accuracy(accRand)
With only ten solutions better than 80%, and a maximum accuracy of 91.73%, random search is a bad algorithm for finding good solutions. One could argue that the solutions found could be used to initialize another algorithm, but all the other algorithms already have their own way of generating and exploiting randomness.
In a way, the results really depend on the probability distributions given for the parameters. An extreme case would be a distribution with little to no variance around an optimal solution. This distribution would perform well but would require a dramatic amount of knowledge from the individual running the algorithm.
The strengths of this algorithm are that it is easy to set up and easy to parallelize (no synchronization steps).
For this approach we've re-implemented the solution proposed by [9].
The idea of NN designing NN is to learn a neural network, the RSM, that tries to predict the performance of a network given its hyperparameters. The difficult point of this approach is to generate points on which the RSM can learn, then to refine it on promising regions to find the optimal solution. The sampling of training points is essential.
# Initialisation: sample some random hyperparameters, train MNIST on them,
# and train the RSM on the results.
a = 1e-4
while n_iteration < max_iter:
    sample a random set of hyperparameters
    predict its performance p with the RSM
    if p > max_performance:
        with probability (1 - a):
            train MNIST with these hyperparameters
            add the result to the RSM training set
            retrain the RSM
            update max_performance if necessary
        # otherwise do nothing
    else:
        with probability a:
            train MNIST with these hyperparameters
            add the result to the RSM training set
            retrain the RSM
            update max_performance if necessary
        # otherwise do nothing
return the best configuration found
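The RSM itself can be any regressor mapping a hyperparameter vector to a predicted accuracy. Below is a minimal sketch of such a network; the architecture is our own assumption, not necessarily the one used in [9].
from keras.models import Sequential
from keras.layers import Dense

def build_rsm(n_params=10):
    # Small regressor: hyperparameter vector in, predicted validation accuracy out.
    rsm = Sequential([
        Dense(32, input_dim=n_params, activation="relu"),
        Dense(32, activation="relu"),
        Dense(1, activation="sigmoid"),   # the accuracy lies in [0, 1]
    ])
    rsm.compile(loss="mse", optimizer="adam")
    return rsm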
The previous algorithm is very dependent on the sampling of the hyperparameters to test. The sampling method is a variation of the Metropolis–Hastings algorithm: the first samples are random, the following ones are drawn from a Gaussian distribution around the previous one. This method enables a local search around the best solution found, but also allows the exploration of the rest of the search space.
We began by using continuous hyperparameter values whenever possible, but this approach got stuck in a local minimum and was not able to explore the entire space to find a good solution. This bad result is understandable because of the meaning of the values: a learning rate of 0.1 is very similar to a learning rate of 0.13, but a learning rate of 0.001 is quite different from a learning rate of 0.031, and the Gaussian distribution does not make any difference between these two cases. We then followed what was done in the article and discretized the search space. This way we were able to counter the previous problem, but we lost some precision in the obtained solution.
Here are the possible values of the parameters:
values = np.array([
    [0, 1, 2, 3],                                   # n_couches (number of hidden layers)
    range(10, 500, 10), range(10, 500, 10), range(10, 500, 10),  # couches (neurons per layer)
    [0.001, 0.002, 0.004, 0.008, 0.016, 0.03, 0.06, 0.012, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8],  # learning rate
    [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1],  # reg_l1
    [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1],  # reg_l2
    [0.001, 0.002, 0.004, 0.008, 0.016, 0.03, 0.06, 0.012, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8],  # moment (momentum)
    [.0, 0.001, 0.002, 0.004, 0.008, 0.016, 0.03, 0.06, 0.012, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8],  # decay
    [0, 1],                                         # nesterov
    [0, 1, 2]                                       # activation
], dtype=object)                                    # rows have different lengths, hence an object array
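The neighbour-sampling step can then be sketched as follows: each parameter moves to a nearby index of its discretized grid, drawn from a Gaussian centred on the current index (the step size sigma below is an assumption, not the exact value we used).
def sample_neighbour(current_idx, values, sigma=1.5, rng=np.random):
    new_idx = []
    for i, grid in enumerate(values):
        # Gaussian step in index space, clipped to the boundaries of the grid.
        idx = int(round(rng.normal(current_idx[i], sigma)))
        new_idx.append(min(max(idx, 0), len(grid) - 1))
    return new_idx

# The candidate configuration is then [grid[i] for grid, i in zip(values, new_idx)].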
In [4]:
accNN = np.load('../hyperLearn/hist_res.npy')
plots.accuracy(accNN)
plots.summary(accNN,.9)
The RSM algorithm quickly finds a good result (96% at iteration 9), and the final result is comparable to those of the other algorithms.
The algorithm manages to search in interesting areas most of the time, especially at the beginning. Once it finds a good solution, it starts looking for other good solutions elsewhere in the search space.
This algorithm is limited by the discretization of the search space.
MADS [7] stands for Mesh Adaptive Direct Search. It is an algorithm designed for blackbox optimization, that is to say, for finding the minimum of a function about which we have little information. The function may be expensive to compute (it can take days), may not be smooth, and may even fail to compute. The algorithm uses an adaptive mesh to search the solution space.
We used the NOMAD and OPAL software [4][8], built at GERAD, to look for solutions. The following choices were made:
{"activation": "relu",
"nesterov": True}
{"learnnig_rate": 1e-3,
"reg_l1": 1e-4,
"reg_l2": 1e-4,
"moment": 1e-2,
"decay": 1e-6,
"noeuds": [200, 200, 0] # neurons
}
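NOMAD typically drives a blackbox executable that reads the evaluation point from a file and prints the objective value. A sketch of such a wrapper is shown below, assuming the same hypothetical train_mnist helper and our own ordering of the variables.
# blackbox.py -- sketch of a NOMAD-style blackbox wrapper.
import sys

def main():
    # NOMAD passes the evaluation point as a whitespace-separated file.
    with open(sys.argv[1]) as f:
        x = [float(v) for v in f.read().split()]
    learning_rate, reg_l1, reg_l2, moment, decay, n1, n2, n3 = x
    params = {"activation": "relu", "nesterov": True,   # fixed categorical variables
              "learning_rate": learning_rate, "reg_l1": reg_l1, "reg_l2": reg_l2,
              "moment": moment, "decay": decay,
              "noeuds": [int(n1), int(n2), int(n3)]}
    acc = train_mnist(params)        # hypothetical helper returning the validation accuracy
    print(1.0 - acc)                 # NOMAD minimizes, so we return 1 - accuracy

if __name__ == "__main__":
    main()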
In [4]:
accMads = processing.accuracy("results/hyperMads.json")
plots.summary(accMads, .9)
print("Accuracy at iteration 9: {}".format(accMads[8]))
plots.accuracy(accMads)
The algorithm quickly found a good solution (96.29% at iteration 9), but we can notice that it was lucky on its first guess (94.22% accuracy).
Using MDS on the parameters used for the search (normalized by the size of the search space), we notice two things. First, all the good solutions were found close to each other. Second, even in this small region, the accuracy is pretty noisy.
In [5]:
varMads = []
for p in processing.params("results/hyperMads.json"):
    noeuds = p["noeuds"] + [0]*(3 - p["n_couches"])
    varMads.append([p["decay"], p["learning_rate"],
                    p["moment"], p["reg_l1"], p["reg_l2"],
                    noeuds[0], noeuds[1], noeuds[2]])
varMads = np.array(varMads)
plots.mds_accuracy(varMads / [1,1,1,1,1,500, 500, 500], accMads)
The function to optimize has a structure that is difficult to study, and MADS fails to find new solutions. A good example of this is that only 22 training points have strictly more than 1 neuron on the third layer.
To improve performance, we could make use of surrogate functions, namely functions that behave the same way as the original function but do not necessarily approximate it. Such a function could be obtained by reducing the maximum number of epochs, for instance, as sketched below.
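For instance, with the same hypothetical train_mnist helper, now taking the number of epochs as an argument:
def surrogate(params, epochs=10):
    # Cheap surrogate: same training, far fewer epochs. It does not approximate
    # the final accuracy, but it tends to rank configurations similarly, which
    # is all MADS needs from a surrogate.
    return 1.0 - train_mnist(params, epochs=epochs)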
Bayesian optimization is another type of blackbox optimization for continuous functions, which models the function with a Gaussian process [10]. It is also possible to use neural networks to represent those distributions [11], but this was not explored here.
We used Spearmint [5] to run our optimization. As previously, we could not properly set the categorical variables and decided to use the following fixed values:
{"activation": "relu",
"nesterov": True}
After a couple of failed attempts (or at least slow starts), we decided to restrict the possible hyperparameter values a little, to ranges between 1e-4 and .5, between 1e-6 and .1, and between 1e-4 and .1.
With this configuration, we got very good solutions:
In [6]:
accBayes3 = processing.accuracy("results/hyperBayes3.json")
plots.summary(accBayes3, .9)
plots.accuracy(accBayes3)
We want to check whether these solutions are close to one another; therefore, we apply MDS (normalized by the possible search space).
In [7]:
varBayes3 = []
for p in processing.params("results/hyperBayes3.json"):
    noeuds = p["noeuds"]  # neurons per layer for this configuration
    varBayes3.append([p["learning_rate"], p["decay"], p["moment"],
                      p["reg_l1"], p["reg_l2"], noeuds[0], noeuds[1]])
varBayes3 = np.array(varBayes3)
plots.mds_accuracy(varBayes3/ [.5,.1,.1,.1,.1,100, 100], accBayes3)
The algorithm is able to find good solutions throughout an important part of the solution space.
Next we tried to remove the limitation on the number of neurons per layer and enlarged it to 2 layers of 100 to 500 neurons. The density of good solutions is well below that of the previous run, and the best solution is a little worse.
In [8]:
accBayes5 = processing.accuracy("results/hyperBayes5.json")
plots.summary(accBayes5, .9)
plots.accuracy(accBayes5)
The different methods are difficult to compare because each of them needed its own type of adjustments. Furthermore, every algorithm has a stochastic aspect (often more important at the beginning of the run) that can dramatically change the performance. Making statistical sense of the comparison would require running a huge number of simulations.
However, there are several conclusions that we can draw:
These algorithms make it possible to run a large number of tests to optimise the parameters. For this project we managed to test 100 to 500 configurations per algorithm, whereas fewer than 50 are feasible manually.
A first idea would be to run all the algorithms, with multiple random initializations of the parameters, independently and in parallel. This would allow us to get the best out of all the algorithms.
Another idea: since Bayesian optimization re-estimates the function distribution at every step from the previous observations, it would be easy to incorporate results from the other algorithms into this search. The other algorithms could then run again, starting from the best solution predicted by the Bayesian optimization. The same idea can be applied to the neural network algorithm.
Below, we show an example of our idea using only MADS as a side algorithm. To simplify, we will assume that one iteration on any algorithm takes exactly the same time.
This scenario enables us to provide a lot of points to the Bayesian optimization, as well as to use different algorithms to refine the solutions.
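A rough sketch of this orchestration is given below; the helper names (run_mads, run_bayesian_optimization, best_predicted), the loop constants n_rounds and k, and the synchronisation details are our own assumptions.
history = []                                    # (params, accuracy) pairs shared by all algorithms

for round_ in range(n_rounds):
    # Side algorithm: a few MADS iterations, started from the best point
    # predicted so far by the Bayesian model.
    history += run_mads(start=best_predicted(history), iterations=k)

    # Master algorithm: the Bayesian optimization re-fits its Gaussian process
    # on everything observed so far, including the MADS evaluations.
    history += run_bayesian_optimization(observations=history, iterations=k)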
We could add NN designing NN as a side algorithm, as a replacement for the Bayesian optimization, or even as another master algorithm.
We did not have time to implement this approach, but we believe that, like bagging, it could mitigate the weaknesses of each algorithm.
Our experimentation would not have been possible without the use of these great tools:
[6] LeCun, Y., Cortes, C., & Burges, C. J. (1998). The MNIST database of handwritten digits.
[7] Audet, C., & Dennis Jr, J. E. (2006). Mesh adaptive direct search algorithms for constrained optimization. SIAM Journal on optimization, 17(1), 188-217.
[8] Audet, C., Dang, C. K., & Orban, D. (2011). Algorithmic parameter optimization of the DFO method with the OPAL framework. In Software Automatic Tuning (pp. 255-274). Springer New York.
[9] Smithson, S. C., Yang, G., Gross, W. J., & Meyer, B. H. (2016). Neural networks designing neural networks: multi-objective hyper-parameter optimization. arXiv preprint arXiv:1611.02120.
[10] Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems (pp. 2951-2959).
[11] Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., ... & Adams, R. P. (2015, February). Scalable Bayesian Optimization Using Deep Neural Networks. In ICML (pp. 2171-2180).