In this notebook we will machine-learn the relationship between descriptors of molecules and their total energy using kernel regression.
KRR is a one of the ML methods to perform regression (fitting).
In practice, we use kernel function $d$ that measures the distance between two data points. With the function we build the kernel matrix:
\begin{equation} K_{ij} = d(M_{i}, M_{j}) - \alpha I_{ij} \end{equation}for all pairs of samples $M_i$ and $M_j$ in the training set. $I_{ij}$ is the identity matrix and $\alpha$ is a hyperparameter chosen between 0 and 1.
When we want to predict the output of a new sample or set of samples that were not in the training set, we need to first evaluate the distance between them and the training samples:
\begin{equation} D_{ij} = d(T_{i}, M_{j}) \end{equation}where $T_i$ is one of the new samples. The output prediction $\mathbf{y_T}$ for the test set $T$ is obtained by:
\begin{equation} \mathbf{y_T} = \mathbf{y_M} \cdot K^{-1} \cdot D^T \end{equation}where $\mathbf{y_M}$ is the vector of known outputs for the training set $M$.
The energy of ~134k molecules was calculated at the CCSD level.
In [ ]:
# INITIAL DEFINITIONS
from pyKRR import KRRsolver # import our KRR solver object
import numpy, random
import numpy, math, random
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy.sparse import load_npz
Let's pick a descriptor. Allowed types are:
In [ ]:
# TYPE is the descriptor type
TYPE = "cnt"
#show descriptor details
print("\nDescriptor details")
desc = open("./data/descriptor."+TYPE.split('.')[0]+".txt","r").readlines()
for l in desc: print(l.strip())
print(" ")
and load the databases with the descriptors (input) and the correct charge densities (output). Databases are quite big, so we can decide how many samples to use for training.
In [ ]:
# load input/output data
trainIn = load_npz("./data/energy.input."+TYPE+".npz").toarray()
trainOut = numpy.load("./data/energy.output.npy")
trainIn = trainIn.astype(dtype=numpy.float64, casting='safe')
# decide how many samples to take from the database
samples = min(trainIn.shape[0], 1000) # change the number here!
vsamples = min(trainIn.shape[0]-samples,1000)
print("training samples: "+str(samples))
print("validation samples: "+str(vsamples))
# split between training and validation
validIn = trainIn[samples:samples+vsamples]
validOut = trainOut[samples:samples+vsamples]
trainIn = trainIn[0:samples]
trainOut = trainOut[0:samples]
# show the first few descriptors
print("\nDescriptors for the first 5 molecules:")
print(trainIn[0:5])
In the training phase we use the kernel function $d$ to measure the distance between all pairs of molecules in the dataset. The results for the kernel matrix:
\begin{equation} K_{ij} = d\left(M_{i}, M_{j}\right) - \alpha I_{ij} \end{equation}$I_{ij}$ is the identity matrix and $\alpha$ is a regularisation hyperparameter $\in \left[0,1\right]$. The matrix is then inverted.
In [ ]:
# create a new solver
solver = KRRsolver()
# set the regularisation hyperparameter
solver.alpha = 0.1
# call its training function with the training inputs and outputs
# WARNING: building the kernel matrix is O(N^2)
# WARNING: inverting the kernel matris is O(N^3)
# Keep the training set small
solver.Train(trainIn, trainOut)
The evaluation function computes the distance between each validation sample $T_i$ and the training ones $M_i$:
$$ D_{ij} = d(T_{i}, M_{j}) $$The predicted energies $E\left(T\right)$ are then given by:
$$ E\left( T \right) = E\left( M \right) \cdot K^{-1} \cdot D^T $$
In [ ]:
# get a list of predicted outputs for the validation inputs
predict = solver.Evaluate(validIn)
print("Mean Abs Error (validation): " + str((numpy.abs(predict-validOut)).mean()))
# do the regression plot
plt.plot(validOut, predict, 'o')
plt.plot([numpy.min(validOut),numpy.max(validOut)], [numpy.min(validOut),numpy.max(validOut)], '-')
plt.xlabel('correct output')
plt.ylabel('KRR output')
plt.show()
# check the distribution of energies in the training set
plt.hist(validOut, bins=20, density=True, alpha=0.5, facecolor='gray')
plt.hist(trainOut, bins=20, density=True, alpha=0.2, facecolor='red')
plt.xlabel("Energy [H]")
plt.ylabel("Probability")
plt.show()
In [ ]:
In [ ]:
In [ ]: