Inference with SMURFF

In this notebook we continue from the first example. After running a training session in SMURFF again, we take a closer look at how to use SMURFF for making predictions.

To make predictions, we recall that a predicted value of the tensor model is given by a tensor contraction of all latent matrices. Specifically, the prediction for element $\hat{Y}_{ijk}$ of a rank-3 tensor is given by

$$ \hat{Y}_{ijk} = \sum_{d=1}^D u^{(1)}_{d,i} u^{(2)}_{d,j} u^{(3)}_{d,k} + \text{mean} $$

Since a matrix is a rank-2 tensor, the prediction for a matrix element is given by:

$$ \hat{Y}_{ij} = \sum_{d=1}^D u^{(1)}_{d,i} u^{(2)}_{d,j} + \text{mean} $$

These inner products are computed by SMURFF automatically, as we will see below.
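
As a quick illustration of the rank-2 formula, here is a minimal numpy sketch with small made-up latent matrices and an illustrative mean value (nothing in this cell comes from SMURFF itself):


In [ ]:
import numpy as np

D, N, M = 4, 3, 5              # made-up sizes: latent dimensions, rows, columns
U1 = np.random.randn(D, N)     # latent matrix for the rows,    u^(1)
U2 = np.random.randn(D, M)     # latent matrix for the columns, u^(2)
mean = 6.5                     # illustrative global mean

# Yhat[i, j] = sum_d U1[d, i] * U2[d, j] + mean
Yhat = np.einsum("di,dj->ij", U1, U2) + mean
print(Yhat.shape)              # (N, M)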

Saving models

We run a Macau training session using side information (ecfp) from the ChEMBL dataset. We make sure every sample is saved (save_freq = 1), so that we can load the model afterwards. This run takes a few minutes.


In [ ]:
import smurff
import os

ic50_train, ic50_test, ecfp = smurff.load_chembl()

os.makedirs("ic50-macau", exist_ok=True)
trainSession = smurff.MacauSession(
                       Ytrain     = ic50_train,
                       Ytest      = ic50_test,
                       side_info  = [ecfp, None],
                       num_latent = 16,
                       burnin     = 200,
                       nsamples   = 10,
                       save_freq  = 1,
                       save_prefix= "ic50-macau",
                       verbose    = 1,)

predictions = trainSession.run()
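
The run method returns the predictions made on the test matrix during sampling. As a quick sanity check we can compute their RMSE; this assumes the smurff.calc_rmse helper is available in your SMURFF version:


In [ ]:
rmse = smurff.calc_rmse(predictions)
print("test RMSE:", rmse)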

Saved files

The saved files are indexed in a root ini-file; in this case the root ini-file is ic50-macau/root.ini. It lists all the information saved for this training run. For example:

[options]
options = ic50-save-options.ini

[steps]
sample_step_10 = sample-10-step.ini
sample_step_20 = sample-20-step.ini
sample_step_30 = sample-30-step.ini
sample_step_40 = sample-40-step.ini

Each step ini-file contains the matrices saved in the step:

[models]
num_models = 2
model_0 = sample-50-U0-latents.ddm
model_1 = sample-50-U1-latents.ddm
[predictions]
pred = sample-50-predictions.csv
pred_state = sample-50-predictions-state.ini
[priors]
num_priors = 2
prior_0 = sample-50-F0-link.ddm
prior_1 = sample-50-F1-link.ddm
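
Since these are plain INI files, they can be inspected with Python's standard configparser module. The sketch below only relies on the layout shown above (file and section names are taken from the example listing, not guaranteed by every SMURFF version):


In [ ]:
import configparser

# read the root file and list every saved sampling step it references
root = configparser.ConfigParser()
root.read("ic50-macau/root.ini")

for name, step_file in root["steps"].items():
    print(name, "->", step_file)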

Making predictions from a TrainSession

The easiest way to make predictions is from an existing TrainSession:


In [ ]:
predictor = trainSession.makePredictSession()
print(predictor)

Once we have a PredictSession, there are several ways to make predictions:

  • From a sparse matrix
  • For all possible elements in the matrix (the complete $U \times V$)
  • For a single point in the matrix
  • Using only side-information

Predict all elements

We can make predictions for all rows $\times$ columns in our matrix


In [ ]:
p = predictor.predict_all()
print(p.shape) # p is a numpy array of size: (num samples) x (num rows) x (num columns)
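
Since p contains one full matrix per posterior sample, a natural next step is to average over the first axis for point predictions and to use the spread across samples as an uncertainty estimate. This is plain numpy on the array described above:


In [ ]:
import numpy as np

p_mean = p.mean(axis=0)   # posterior mean prediction per (row, column)
p_std  = p.std(axis=0)    # spread across the samples, per element
print(p_mean.shape, p_std.shape)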

Predict element in a sparse matrix

We can make predictions for a sparse matrix, for example our ic50_test matrix:


In [ ]:
p = predictor.predict_some(ic50_test)
print(len(p),"predictions") # p is a list of Predictions
print("predictions 1:", p[0])

Predict just one element

Or just one element. Let's predict the first element of our ic50_test matrix:


In [ ]:
from scipy.sparse import find
(i,j,v) = find(ic50_test)
p = predictor.predict_one((i[0],j[0]),v[0])
print(p)

And plot the histogram of predictions for this element.


In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt

# Plot a histogram of the samples.
plt.subplot(111)
plt.hist(p.pred_all, bins=10, density=True, label="histogram of predictions")
plt.plot(p.val, 1., 'ro', markersize=5, label='actual value')
plt.legend()
plt.title('Histogram of ' + str(len(p.pred_all)) + ' predictions')
plt.show()

Make predictions using side information

We can make predictions for rows/columns not in our train matrix, using only side info:


In [ ]:
import numpy as np
from scipy.sparse import find

(i,j,v) = find(ic50_test)
row_side_info = ecfp.tocsr().getrow(i[0])
p = predictor.predict_one((row_side_info,j[0]),v[0])
print(p)

Accessing the saved model itself

The latent matrices for all samples are stored in the PredictSession as numpy arrays:


In [ ]:
# print the U matrices for all samples
for i,s in enumerate(predictor.samples):
    print("sample", i, ":", [ (m, u.shape) for m,u in enumerate(s.latents) ])

This allows us to compute predictions for arbitrary slices of the matrix or tensor using numpy.einsum:


In [ ]:
sample1 = predictor.samples[0]
(U1, U2) = sample1.latents

## predict the slice Y[7, : ] from sample 1
Yhat_7x = np.einsum(U1[:,7], [0], U2, [0, 2])

## predict the slice Y[:, 0:10] from sample 1
Yhat_x10 = np.einsum(U1, [0, 1], U2[:,0:10], [0, 2])

The two examples above each give a matrix (rank-2 tensor) as a result. It is advised to make predictions using all samples and to average them, as sketched below.
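
A minimal sketch of that averaging, reusing the einsum expression above for every saved sample (the variable names here are ours, not part of the SMURFF API):


In [ ]:
import numpy as np

# average the prediction of the slice Y[7, :] over all saved samples
slices = [np.einsum(s.latents[0][:, 7], [0], s.latents[1], [0, 2])
          for s in predictor.samples]
Yhat_7x_mean = np.mean(slices, axis=0)
print(Yhat_7x_mean.shape)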

Making predictions from a saved run

One can also create a PredictSession from a saved root ini-file:


In [ ]:
import smurff

predictor = smurff.PredictSession("ic50-macau/root.ini")
print(predictor)
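
From here on this predictor behaves exactly like the one created with makePredictSession above: predict_all, predict_some, predict_one and the samples attribute are available in the same way.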
