The genetic algorithm (GA) is a simple optimisation algorithm that mimics evolution. It starts with a random population of elements and evaluates how fit for survival each one is. Elements are then randomly selected in pairs to mate and produce a new generation of offspring. The selection ensures that elements with higher fitness are more likely to be chosen for mating, increasing the chances that the offspring retain the characteristics that made their parents fit in the first place. Random mutations can also occur.
If an optimisation problem is too complex to be solved via gradient methods, a GA can still give an answer.
In this tutorial we are going to use GA to find a set of parameters that fit a signal.
More details about the GA are given in the following sections.
In [ ]:
# IMPORTS
import os
import math
import random
import numpy
from GA import GAEngine
import matplotlib.pyplot as plt
The $i$-th element of the genetic population represents a candidate signal $W_i(t) = \sum_{k=1}^n A_k \cos\left( 2\pi k t\right)$. Thus, an element can be represented by the list of amplitudes of its wave components: $\mathbf{e} = \{ A_1, A_2, \dots, A_n \}$.
The training set is a collection of $\left( t,W(t) \right)$ points corresponding to the signal we are trying to fit.
In [ ]:
# set a decent population size to work with
populationSize = 100
# set the number of wave components
nWaves = 5
# limit amplitudes to [-2, 2]
scale = 2.0
# create a GA engine
ga = GAEngine(popSize=populationSize, dnaSize=nWaves, scale=scale)
ga.tau = 0.1
# print the first 3 elements to see what we are dealing with!
print(ga.population[0:3])
# load a training set
trainT = numpy.load("./data/ga-train-T.npy") # time values
trainY = numpy.load("./data/ga-train-Y.npy") # corresponding signal values
# plot the training set
plt.plot(trainT, trainY, 'o')
plt.show()
After the population is evaluated, all elements are sorted by descending fitness (best fit first). Pairs of elements are then selected at random by their index, using an exponential distribution, and mixed to produce an offspring population. The simplest mixing function is one that assigns each DNA component of the offspring from either parent at random (uniform crossover). This function is already defined in the GAEngine; a possible sketch of it is shown below.
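As an illustration, a uniform crossover could look like the following. This is only a sketch of the idea, not the actual GAEngine implementation; the function name Mix and the representation of the DNA as numpy arrays are assumptions.
In [ ]:
def Mix(parentA, parentB):
    # uniform crossover: each DNA component comes from
    # either parent with equal probability
    mask = numpy.random.rand(len(parentA)) < 0.5
    return numpy.where(mask, parentA, parentB)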
We can control the PDF for the selection using one parameter:
In [ ]:
ga.tau = 1
selected = ga.TrySelection(10000)
plt.hist(selected, populationSize, density=True)
plt.xlabel('element index')
plt.ylabel('Probability')
plt.show()
Large values of $\tau$ cause elements with low fitness to be almost never selected. This makes the GA "converge" faster, but the population quickly becomes uniform and evolution effectively halts (premature convergence).
Small values of $\tau$ give all elements a nearly equal chance regardless of their fitness. This slows down the evolution.
You will have to find a good compromise!
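For reference, selecting an index from an exponential distribution could be sketched as follows. This is an assumption about how the engine behaves internally, not the actual GAEngine code.
In [ ]:
def SelectIndex(populationSize, tau):
    # the population is sorted by fitness, so low indices are the best elements;
    # larger tau concentrates the sampling on them
    i = int(numpy.random.exponential(scale=1.0/tau))
    return min(i, populationSize - 1) # clip to a valid index (a real implementation might resample instead)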
In [ ]:
# set the mutation rate
ga.mutationRate = 0.01 # on average, 1% of the offspring will receive a random mutation
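A mutation step might look like the sketch below: each offspring has a small chance of having one amplitude replaced by a new random value. Again, this is an assumed illustration, not the GAEngine implementation.
In [ ]:
def Mutate(element, mutationRate, scale):
    # with probability mutationRate, overwrite one random
    # DNA component with a fresh value in [-scale, scale]
    if random.random() < mutationRate:
        k = random.randrange(len(element))
        element[k] = random.uniform(-scale, scale)
    return element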
The fitness function takes in the element descriptor (its list of amplitudes), performs the calculations necessary to estimate how well it reproduces the signal, and returns its fitness.
It is actually easier to calculate its badness (the mean square error against the training set) and flip the sign. This way, a perfect element will have fitness 0, while any mismatch with the training set will give a negative value.
In [ ]:
def EvaluateFitness(element):
    # prepare a 0 array
    y = numpy.zeros(len(trainT))
    freq = 1 # starting frequency
    for amp in element: # loop through the amplitudes
        y += amp * numpy.cos(2*numpy.pi*freq*trainT) # add the wave
        freq += 1 # increment the frequency for the next wave
    # compute square error
    y = (y-trainY)*(y-trainY)
    return -numpy.mean(y)
# tell the engine to use this evaluation function
ga.Evaluate = EvaluateFitness
In [ ]:
# number of generations to compute
nGens = 30
stats = numpy.zeros((nGens,3))
# start the evolution loop!
for g in range(nGens):
    stats[g] = ga.Evolve() # returns (best, average, worst) fitness of the generation
In [ ]:
# evolution plot
plt.plot(stats[:,0],'o-', label='best')
plt.plot(stats[:,1],'o-', label='average')
plt.plot(stats[:,2],'o-', label='worst')
plt.legend()
plt.show()
In [ ]:
# best element info
print("best model amplitudes:", ga.best)
print("best model fitness:", ga.bestFit)
t = numpy.arange(0, 1, 0.01)
y = numpy.zeros_like(t)
freq = 1 # starting frequency
for amp in ga.best: # loop through the amplitudes
    y += amp * numpy.cos(2*numpy.pi*freq*t) # add the wave
    freq += 1 # increment the frequency for the next wave
plt.plot(trainT, trainY, 'o', label='training points')
plt.plot(t, y, '-', label='best model')
plt.legend()
plt.show()
Do not be fooled by the speed of this example... in real applications, training an AI with GA can take up to several weeks on supercomputers.
You can see one in action here! Try to beat it!
Train neural networks to output the energy of a molecule using ACSFs as input. There should be one network for each atom type and each ACSF contribution. For example, all H atoms will have 5 NNs that process the ACSFs calculated for that atom with respect to H, C, N, O, or F neighbours. Additionally, there will be NNs for the 3-body parts: H-HH, H-HC, H-HN, ... 15 NNs in total (no need for both H-HC and H-CH, the pair is unordered). Each atom type should thus have 5 + 15 = 20 NNs, giving a total of 100 NNs and lots of parameters to train!
Some serious python hacking is required!
In [ ]:
# load structures and energies from the database
# the file is data/structures.xyz and it contains lots of molecules in the following xyz format:
#
# natoms_mol1
# energy_mol1
# type1 x y z
# type2 x y z
# ...
# natoms_mol2
# energy_mol2
# type1 x y z
# ...
#
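A minimal parser for this format could look like the sketch below. The function name LoadXYZ is my own choice, and returning ase.Atoms objects is an assumption made because DScribe accepts them; adapt as needed.
In [ ]:
from ase import Atoms

def LoadXYZ(filename):
    # returns a list of ase.Atoms objects and a list of energies
    molecules, energies = [], []
    with open(filename) as f:
        lines = [l for l in f if l.strip()] # skip blank lines
    i = 0
    while i < len(lines):
        natoms = int(lines[i])
        energies.append(float(lines[i+1]))
        symbols, positions = [], []
        for line in lines[i+2 : i+2+natoms]:
            parts = line.split()
            symbols.append(parts[0])
            positions.append([float(x) for x in parts[1:4]])
        molecules.append(Atoms(symbols=symbols, positions=positions))
        i += natoms + 2
    return molecules, energies

molecules, energies = LoadXYZ("data/structures.xyz")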
In [ ]:
# use DScribe to compute ACSFs
# ...
# the ACSFs should be divided into their logical contributions (x-H, x-C, ... x-HH, x-HC, ...)
# keep track of the length of each contribution
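A possible starting point with DScribe is sketched below; the G2/G4 parameter values are placeholders, and depending on your DScribe version the cutoff argument is named r_cut or rcut. The exact ordering of the species blocks in the output should be checked against the DScribe documentation before splitting it into contributions.
In [ ]:
from dscribe.descriptors import ACSF

species = ["H", "C", "N", "O", "F"]
acsf = ACSF(
    species=species,
    r_cut=6.0, # named rcut in older DScribe versions
    g2_params=[[1.0, 1.0]], # placeholder 2-body parameters
    g4_params=[[1.0, 1.0, 1.0]], # placeholder 3-body parameters
)
# one row of features per atom in the molecule
features = acsf.create(molecules[0])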
In [ ]:
from sklearn.neural_network import MLPRegressor
# setup model networks for all possible atoms
# there are only H, C, N, O, F atoms in the database
#
# H -> H (2-body)
# H -> C (2-body)
# ...
# H -> H-H (3-body)
# H -> H-C (3-body) (no need for H -> C-H, it is symmetric!)
# ...
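One way to organise the 100 networks is a dictionary keyed by atom type and contribution; the key convention and the hidden layer size below are arbitrary choices for this sketch.
In [ ]:
from itertools import combinations_with_replacement

species = ["H", "C", "N", "O", "F"]
models = {}
for center in species:
    # 2-body contributions: one NN per neighbour species
    for other in species:
        models[(center, other)] = MLPRegressor(hidden_layer_sizes=(10,))
    # 3-body contributions: one NN per unordered species pair
    for pair in combinations_with_replacement(species, 2):
        models[(center, pair)] = MLPRegressor(hidden_layer_sizes=(10,))

print(len(models)) # 5 * (5 + 15) = 100 networks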
In [ ]:
# sklearn network weights and neuron biases are stored in
# model.coefs_ and model.intercepts_
# these are assigned only after calling model.fit(x, y); it does not matter if the training succeeded,
# as long as x,y have the correct shape
#
# the GA DNA is the concatenation of all these numbers
#
# calculate the total length of the DNA
# ...
# write a function that converts the DNA (list of numbers) into a set of model.coefs_ and
# model.intercepts_ for a sklearn model
# ...
# write an evaluation function for the fitness of a GA element
# ...
# evolve the population
# ...
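Putting the pieces together, the DNA length and the DNA-to-model conversion could be sketched as follows, assuming every model has already been fitted once on dummy data of the correct shape so that coefs_ and intercepts_ exist. The function names are my own.
In [ ]:
def DNALength(model):
    # total number of weights and biases in a fitted sklearn MLP
    return sum(w.size for w in model.coefs_) + sum(b.size for b in model.intercepts_)

def DNAToModel(dna, model):
    # copy a flat array of numbers into model.coefs_ and model.intercepts_
    dna = numpy.asarray(dna)
    i = 0
    for w in model.coefs_:
        w[...] = dna[i : i+w.size].reshape(w.shape)
        i += w.size
    for b in model.intercepts_:
        b[...] = dna[i : i+b.size].reshape(b.shape)
        i += b.size

totalDNALength = sum(DNALength(m) for m in models.values())
print(totalDNALength)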