In this notebook, we will recreate some key results from Ling et al. IMMI (2017). We will show that the errors produced from the Random Forest implemented in lolo are well-calibrated and that the uncertainties can be used with Sequential Learning to quickly find optimal materials within a search space.
Note: This notebook requires you to install matminer, establish an account with Citrination to get an API key (see the Quickstart), and set the key as an environment variable named CITRINE_KEY.
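For example (a minimal sketch; replace the placeholder with your own key), the variable can be set from within Python before the retrieval client is created:
import os
# Provide your Citrination API key via the CITRINE_KEY environment variable
# (placeholder value shown; use the key from your Citrination account)
os.environ['CITRINE_KEY'] = 'your-api-key-here'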
In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
from matminer.data_retrieval.retrieve_Citrine import CitrineDataRetrieval
from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers import composition as cf
from lolopy.learners import RandomForestRegressor
from sklearn.model_selection import KFold
from pymatgen.core import Composition
from scipy.stats import norm
import pandas as pd
import numpy as np
import os
Set the random seed
In [2]:
np.random.seed(8)
In [3]:
cdr = CitrineDataRetrieval()
In [4]:
data = cdr.get_dataframe(criteria={'data_set_id': 150888}, print_properties_options=False)
Convert the composition and the target variable ($ZT$) from strings
In [5]:
def get_composition(c):
    """Attempt to parse a composition, returning None on failure"""
    try:
        return Composition(c)
    except Exception:
        return None
In [6]:
data['composition'] = data['chemicalFormula'].apply(get_composition)
data['ZT'] = pd.to_numeric(data['ZT'], errors='coerce')
In [7]:
data.reset_index(drop=True, inplace=True)
Every dataset except the steel fatigue dataset uses the composition-based features of Ward et al.
In [8]:
f = MultipleFeaturizer([cf.Stoichiometry(), cf.ElementProperty.from_preset("magpie"),
                        cf.ValenceOrbital(props=['avg']), cf.IonProperty(fast=True)])
In [9]:
X = np.array(f.featurize_many(data['composition']))
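As a quick sanity check (a small addition, assuming the cells above have run), you can confirm the size of the resulting feature matrix:
# Each composition maps to a fixed-length Ward et al. feature vector
print('Generated {} features for {} entries'.format(X.shape[1], X.shape[0]))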
As described in the Ling paper, ideally calibrated uncertainty estimates should have a particular relationship with the errors of a machine learning model. Specifically, the distribution of $r(x)/\sigma(x)$, where $r(x)$ is the residual of the prediction and $\sigma(x)$ is the uncertainty of the prediction for entry $x$, should be a Gaussian with zero mean and unit standard deviation.
In [10]:
model = RandomForestRegressor()
Get the errors from 8-fold cross-validation
In [11]:
y = data['ZT'].values
In [12]:
y_resid = []
y_uncer = []
for train_id, test_id in KFold(8, shuffle=True).split(X):
    model.fit(X[train_id], y[train_id])
    yf_pred, yf_std = model.predict(X[test_id], return_std=True)
    y_resid.extend(yf_pred - y[test_id])
    y_uncer.extend(yf_std)
Plot the normalized residuals ($r(x)/\sigma(x)$) against the normal distribution
In [13]:
fig, axs = plt.subplots(1, 2, sharey=True)
x = np.linspace(-8, 8, 50)
# Plot the RF uncertainty
resid = np.divide(y_resid, y_uncer)
axs[0].hist(resid, x, density=True)
axs[0].set_title('With Lolo Uncertainty Estimates')
# Plot assuming constant errors
resid = np.divide(y_resid, np.sqrt(np.power(y_resid, 2).mean()))
axs[1].hist(resid, x, density=True)
axs[1].set_title('Assuming Constant Error')
for ax in axs:
    ax.plot(x, norm.pdf(x), 'k--', lw=0.75)
    ax.set_xlabel('Normalized Residual')
axs[0].set_ylabel('Probability Density')
fig.set_size_inches(6.5, 2)
fig.tight_layout()
Here, we compare the error distribution using the Lolo uncertainty estimates (left) and the assumption that all entries have the same error (right). The normalized residuals based on the uncertainty estimates follow a distribution much closer to the unit normal, which means that, as expected, the estimates better capture which predictions will have larger errors.
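As a quick numerical complement to the plot (a small sketch, not part of the original analysis, using the y_resid and y_uncer lists from above), well-calibrated estimates should give normalized residuals with mean near 0 and standard deviation near 1:
# Summary statistics of the normalized residuals; values near 0 and 1
# indicate well-calibrated uncertainty estimates
normed = np.divide(y_resid, y_uncer)
print('Mean: {:.2f}, Std Dev: {:.2f}'.format(np.mean(normed), np.std(normed)))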
One important use of model uncertainties is to guide which experiments to perform in order to find optimal materials with as few experiments/computations as possible. As described in the Ling paper (and other nice articles), when you can perform experiments sequentially it is not always best to pick the experiment that the model predicts to have the best properties. Rather, it can be better to pick entries with large uncertainties that, once tested and added to the training set, will improve the model's predictions for subsequent experiments.
Here, we demonstrate one approach for picking experiments: Maximum Likelihood of Improvement (MLI). In contrast to picking the material with the best predicted properties (an approach we refer to as Maximum Expected Improvement (MEI)), the MLI approach picks the material with the highest likelihood of being better than the best material in the training set - a measure that uses both the predicted value and the uncertainty.
In [14]:
in_train = np.zeros(len(data), dtype=bool)
in_train[np.random.choice(len(data), 10, replace=False)] = True
print('Picked {} training entries'.format(in_train.sum()))
In [15]:
# Make sure the best material is not already in the training set
assert not np.isclose(max(y), max(y[in_train]))
Make the predictions
In [16]:
model.fit(X[in_train], y[in_train])
y_pred, y_std = model.predict(X[~in_train], return_std=True)
For MEI, we pick the entry with the highest predicted value. For MLI, we pick the material with the highest probability of being better than any material in the training set. As we assume the predictions to be normally distributed, this probability can be computed from the Z-score $Z = (y - y^*)/\sigma$, where $y^*$ is the maximum $y$ in the training set. Formally, the probability is the cumulative distribution function of the normal distribution evaluated at the Z-score. For our purposes, we can rank materials by the Z-score alone because the probability is a monotonic function of it (stated simply: the material with the highest probability will have the highest Z-score).
In [17]:
mei_selection = np.argmax(y_pred)
mli_selection = np.argmax(np.divide(y_pred - np.max(y[in_train]), y_std))
In [18]:
print('Predicted ZT of material #{} selected based on MEI: {:.2f} +/- {:.2f}'.format(mei_selection, y_pred[mei_selection], y_std[mei_selection]))
print('Predicted ZT of material #{} selected based on MLI: {:.2f} +/- {:.2f}'.format(mli_selection, y_pred[mli_selection], y_std[mli_selection]))
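If you want the actual probability of improvement rather than just the ranking (an illustrative addition, not needed for the selection itself), pass the Z-score through the normal CDF:
# Probability that the MLI pick beats the best entry in the training set
z = (y_pred[mli_selection] - np.max(y[in_train])) / y_std[mli_selection]
print('Likelihood of improvement for MLI pick: {:.1%}'.format(norm.cdf(z)))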
For this particular iteration, the MEI and MLI strategies pick the same material. Depending on the random seed of this notebook and that used by lolo, you may see that the material picked by MLI has a lower predicted $ZT$ but a higher variance. According to the logic behind MLI, picking that entry will (1) yield a higher likelihood of finding a well-performing material and (2) lead to an improved model.
In [19]:
n_steps = 32
In [20]:
all_inds = set(range(len(y)))
In [21]:
random_train = [set(np.where(in_train)[0].tolist())]
In [22]:
for i in range(n_steps):
    # Get the current train set and search space
    train_inds = set(random_train[-1])  # Last iteration
    search_inds = sorted(all_inds.difference(train_inds))
    # Pick an entry at random
    train_inds.add(np.random.choice(search_inds))
    # Add it to the list of training sets
    random_train.append(train_inds)
In [23]:
mei_train = [set(np.where(in_train)[0].tolist())]
In [24]:
for i in range(n_steps):
    # Get the current train set and search space
    train_inds = sorted(mei_train[-1])  # Last iteration
    search_inds = sorted(all_inds.difference(train_inds))
    # Pick the entry with the largest predicted value
    model.fit(X[train_inds], y[train_inds])
    y_pred = model.predict(X[search_inds])
    train_inds.append(search_inds[np.argmax(y_pred)])
    # Add it to the list of training sets
    mei_train.append(set(train_inds))
In [25]:
mli_train = [set(np.where(in_train)[0].tolist())]
In [26]:
for i in range(n_steps):
    # Get the current train set and search space
    train_inds = sorted(mli_train[-1])  # Last iteration
    search_inds = sorted(all_inds.difference(train_inds))
    # Pick the entry with the highest likelihood of improvement
    model.fit(X[train_inds], y[train_inds])
    y_pred, y_std = model.predict(X[search_inds], return_std=True)
    train_inds.append(search_inds[np.argmax(np.divide(y_pred - np.max(y[train_inds]), y_std))])
    # Add it to the list of training sets
    mli_train.append(set(train_inds))
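The three acquisition loops above share the same scaffolding and differ only in how the next entry is chosen. As an optional refactor (a sketch, not part of the original notebook; run_campaign and select are hypothetical names), the common driver could be factored out:
def run_campaign(select, n_steps=n_steps):
    """Run sequential learning; select(train_inds, search_inds) returns
    the position of the chosen entry within search_inds (hypothetical API)."""
    train_sets = [set(np.where(in_train)[0].tolist())]
    for _ in range(n_steps):
        train_inds = sorted(train_sets[-1])
        search_inds = sorted(all_inds.difference(train_inds))
        train_inds.append(search_inds[select(train_inds, search_inds)])
        train_sets.append(set(train_inds))
    return train_sets

def mei_select(train_inds, search_inds):
    # MEI: choose the entry with the largest predicted value
    model.fit(X[train_inds], y[train_inds])
    return np.argmax(model.predict(X[search_inds]))
For example, run_campaign(mei_select) would reproduce the MEI loop above; the random and MLI strategies only need their own select functions.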
Plot the results
In [27]:
fig, ax = plt.subplots()
for train_inds, label in zip([random_train, mei_train, mli_train], ['Random', 'MEI', 'MLI']):
    ax.plot(np.arange(len(train_inds)), [max(y[list(t)]) for t in train_inds], label=label)
ax.set_xlabel('Number of New Experiments')
ax.set_ylabel('Best $ZT$ Found')
fig.set_size_inches(3.5, 2)
ax.legend()
fig.tight_layout()
For this particular case, we find that the MLI strategy finds the best material more quickly than the Random or MEI approaches. In Ling 2017, the authors evaluate the performance of these strategies over many iterations and find that, on average, MLI finds the optimal materials at least as quickly as any other approach.
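To attach numbers to the curves (a small addition using the training-set lists built above):
# Best ZT found by each strategy after all sequential-learning steps
for train_sets, label in zip([random_train, mei_train, mli_train], ['Random', 'MEI', 'MLI']):
    print('{}: best ZT found = {:.2f}'.format(label, max(y[list(train_sets[-1])])))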