Estimating Phosim CPU Times:

What can we learn from Twinkles Run 1?

Twinkles Run 1 is 1227 visits, a subset of the observations of a Deep Drilling Field from the kraken-1042 Opsim simulation. Each of the visits is simulated as a 30-second observation with Phosim. Together, the 1227 runs of Phosim required about 4 CPU years on the SLAC batch farm. The CPU time required per visit varied widely. Some required only a few hours. More than 100 visits (representing about one third of the total CPU time) never completed because they reached the 5-day CPU time limit for the batch hosts, and produced no output.

Studying the dependence of the CPU time on the various inputs to the Phosim simulations has several motivations. Certainly there's no sense in investing CPU time in a run that cannot finish within 5 CPU days. Run 1 was also relatively modest in terms of resource requirements: essentially all 1227 simulations ran in parallel on the batch farm. For Twinkles 1 and 2, which will be much larger, the batch farm will not be able to run all of the simulations in parallel, and the effective throughput could well be set by the individual simulations that take the most CPU time, if they are allowed to hog the available hosts. Twinkles 1 and 2 might run at NERSC, which has more hosts but reportedly a 2-day CPU limit per host. Absent some solution involving checkpointing, we may need to make judicious adjustments to the Phosim inputs for these runs. Also, John Peterson has pointed out that we should not assume that all of the inputs to Phosim for Run 1 were sensible.

For this study, a number of Phosim inputs were extracted from the instance catalogs for the runs, and combined with execution time information from the batch farm. The columns of the resulting csv file are described here.
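
The extraction itself is not shown in this Notebook. Roughly speaking, it amounts to parsing selected header keywords from each instance catalog and joining them to the batch-farm accounting records. The sketch below is only an illustration of that bookkeeping, not the actual script; the file names, column names, and the join on obshistid are assumptions.

In [ ]:
# Hypothetical sketch of assembling the metadata csv: one row per visit from
# the instance catalogs, joined to the batch-farm CPU accounting on obshistid.
import pandas as pd

catalog_meta = pd.read_csv("instance_catalog_metadata.csv")  # expmjd, filter, airmass, ...
batch_info = pd.read_csv("batch_accounting.csv")             # cputime, hostname, runlimit
run1_metadata = catalog_meta.merge(batch_info, on="obshistid", how="inner")
run1_metadata.to_csv("run1_metadata_v2.csv", index=False)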

This Notebook is based on Phil Marshall's machine learning example.


In [ ]:
# For pretty plotting
# !pip install --upgrade seaborn

In [ ]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
%pylab inline
import seaborn as sns
sns.set()
import copy
from __future__ import print_function

Reading the data

  • This is a regression problem: predicting the CPU time in terms of various metadata related to the runs. In machine-learning terminology these metadata are called "features," and the CPU time will be referred to as the "response variable."

  • Read in the data. Here just selected columns are being read from the file; they may or may not be the most relevant. Also, the filter and hostname columns would need to be converted to numerical values in order to be used (a possible conversion is sketched after the next cell), and the runlimit column probably needs special treatment because it is clearly related to large CPU times.


In [ ]:
run1meta = pd.read_csv("http://www.slac.stanford.edu/~digel/lsst/run1_metadata_v2.csv",
                       index_col=0,
                       usecols=["obshistid", "expmjd", "filter", "rotskypos", "altitude",
                                "rawseeing", "airmass", "sunalt", "moonalt", "moonphase",
                                "dist2moon", "cputime", "runlimit", "hostname"])

# Omit the two runs that have not actually finished running (or been 
# terminated at the CPU time limit).  These are flagged with -999.0 in
# the input file.  Also omit the runs that reached the CPU limit:
run1meta = run1meta[(run1meta["cputime"] > 0.) & (run1meta["runlimit"] == 0)]

# Response variable: log10 of the CPU time
cputime = np.log10(run1meta["cputime"])

# Features or attributes: the visit metadata
run1meta_features = copy.copy(run1meta)
del run1meta_features["cputime"]
del run1meta_features["runlimit"]

run1meta_features.head()
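
As noted above, the filter and hostname columns are only usable as features if they are numeric. A minimal sketch of one possible conversion follows, in case they are not already stored as integer codes in the csv (the dtype checks make this a no-op otherwise); the filter-to-code mapping is an assumption.

In [ ]:
# Map the LSST filter names to 0-5 and factorize the host names, but only if
# these columns actually contain strings (they may already be numeric codes).
filter_codes = {"u": 0, "g": 1, "r": 2, "i": 3, "z": 4, "y": 5}
if run1meta_features["filter"].dtype == object:
    run1meta_features["filter"] = run1meta_features["filter"].map(filter_codes)
if run1meta_features["hostname"].dtype == object:
    run1meta_features["hostname"] = pd.factorize(run1meta_features["hostname"])[0]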

In [ ]:
print(run1meta_features.keys())

In [ ]:
bins = hist(cputime.values, bins=100) ; xlabel("log$_{10}$(CPU time / s)") ; ylabel("N")

The distribution of CPU time has a very long tail, plus a number of runs piled up at the 5-day CPU time limit.

Let's plot all the features, colored by the (log) CPU time, to look for structure.


In [ ]:
import matplotlib as mpl
import matplotlib.cm as cm

# Truncate the color scale at log10(cputime) = 6 just to keep some contrast.
norm = mpl.colors.Normalize(vmin=min(cputime.values), vmax=6)
cmap = cm.jet_r
m = cm.ScalarMappable(norm=norm, cmap=cmap)

# Plot everything against everything else:
rez = pd.scatter_matrix(run1meta_features,alpha=0.2,figsize=[15,15],color=m.to_rgba(cputime.values))

Now we have our machine learning inputs and outputs:


In [ ]:
X = run1meta_features.values  # Data: the matrix of visit metadata features
y = cputime.values # Target: log10(CPU time)

In [ ]:
print("Design matrix shape =", X.shape)
print("Response variable vector shape =", y.shape)

In [ ]:
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

Linear Regression

Let's follow the same procedure as in the SciKit-Learn tutorial we just went through:


In [ ]:
from sklearn import linear_model
linear = linear_model.LinearRegression()

# Fit the model, using all the attributes:
linear.fit(X_train, y_train)

# Do the prediction on the test data:
y_lr_pred = linear.predict(X_test)

# How well did we do?
from sklearn.metrics import mean_squared_error
mse_linear = mean_squared_error(y_test,y_lr_pred)
r2_linear = linear.score(X_test, y_test)
print("Linear regression: MSE = ",mse_linear)
print("R2 score =",r2_linear)

In [ ]:
plot(y_test,y_lr_pred - y_test,'o',alpha=0.5)
title("Linear Regression Residuals - MSE = %.2f" % mse_linear)
xlabel("CPU Time")
ylabel("Residual")
ylim(-1,1)
hlines(0,min(y_test),max(y_test),color="red")

Just how bad is this? Here's the MSE from guessing the average log10(CPU time) of the training set for all new visits:


In [ ]:
print("Naive MSE", ((1./len(y_train))*(y_train - y_train.mean())**2).sum())
print("Linear regression: MSE = ",mse_linear)

In [ ]:
mean_squared_error?

k-Nearest Neighbor (KNN) Regression

Now let's try a different kind of model: a non-parametric one.

"Regression based on k-nearest neighbors. The target is predicted by local interpolation of the targets associated of the nearest neighbors in the training set."

Question:

What underlying model is implied by the KNN algorithm? How many hidden parameters does it have?


In [ ]:
from sklearn import neighbors
from sklearn import preprocessing

X_scaled_knn = preprocessing.scale(X) # Many methods work better on scaled X.

KNN = neighbors.KNeighborsRegressor(5)

X_train_knn, X_test_knn, y_train_knn, y_test_knn = train_test_split(X_scaled_knn, y)

KNN.fit(X_train_knn,y_train_knn)

In [ ]:
y_knn_pred = KNN.predict(X_test_knn)
mse_knn = mean_squared_error(y_test_knn,y_knn_pred)
r2_knn = KNN.score(X_test_knn, y_test_knn)
print("MSE (KNN) =", mse_knn)
print("R2 score (KNN) =",r2_knn)
print("cf.")
print("MSE (linear regression) = ",mse_linear)
print("R2 score (linear regression) =",r2_linear)

In [ ]:
plot(y_test_knn, y_knn_pred - y_test_knn,'o',alpha=0.5)
#plot(y_test, y_lr_pred - y_test,'x')
title("k-NN Residuals - MSE = %.2f" % mse_knn)
xlabel("CPU Time")
ylabel("Residual")
ylim(-1,1)
hlines(0,min(y_test),max(y_test),color="red")

Tuning the KNN Model

  • Let's vary the control parameters of the KNN model, to see how good we can make our predictions.

  • We can see our options in the model repr:

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_neighbors=5, p=2, weights='uniform')

  • Let's first make a "validation curve" to investigate one parameter: the number of nearest neighbors averaged over.

In [ ]:
# We'll vary the number of neighbors used:
param_name = "n_neighbors"
param_range = np.array([1,2,4,8,16,32,64])

# And we'll need a cv iterator:
from sklearn.cross_validation import ShuffleSplit
shuffle_split = ShuffleSplit(len(X_scaled_knn), 10, test_size=0.4)

# Compute our cv scores for a range of the no. of neighbors:
from sklearn.learning_curve import validation_curve
training_scores, validation_scores = validation_curve(KNN, X_scaled_knn, y,
                                                      param_name=param_name,
                                                      param_range=param_range, 
                                                      cv=shuffle_split, scoring='r2')

In [ ]:
def plot_validation_curve(param_name,parameter_values, training_scores, validation_scores):
    training_scores_mean = np.mean(training_scores, axis=1)
    training_scores_std = np.std(training_scores, axis=1)
    validation_scores_mean = np.mean(validation_scores, axis=1)
    validation_scores_std = np.std(validation_scores, axis=1)

    plt.fill_between(parameter_values, training_scores_mean - training_scores_std,
                     training_scores_mean + training_scores_std, alpha=0.1, color="r")
    plt.fill_between(parameter_values, validation_scores_mean - validation_scores_std,
                     validation_scores_mean + validation_scores_std, alpha=0.1, color="g")
    plt.plot(parameter_values, training_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(parameter_values, validation_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.ylim(validation_scores_mean.min() - .1, training_scores_mean.max() + .1)
    plt.xlabel(param_name)
    plt.legend(loc="best")

In [ ]:
plot_validation_curve(param_name, param_range, training_scores, validation_scores)


We see an optimum n_neighbors, where the cross-validation score peaks. The score is low for both very small and very large n_neighbors: the former overfits to the training sample, while the latter averages over too much of the data.
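
One way to read off the location of the peak directly from the scores computed above:

In [ ]:
# The n_neighbors value at which the mean cross-validation score is highest.
best_idx = np.argmax(validation_scores.mean(axis=1))
print("Cross-validation score peaks at n_neighbors =", param_range[best_idx])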

Model tuning with GridSearchCV

  • Now, let's see if we can do better by varying some other KNN options as well - in a grid search.

In [ ]:
param_grid = {'n_neighbors': np.array([1,2,4,8,16,32,64]),
                  'weights': ['uniform','distance'],
                       'p' : np.array([1,2])}

np.set_printoptions(suppress=True)
print(param_grid)

In [ ]:
from sklearn.grid_search import GridSearchCV
KNN_tuned = GridSearchCV(KNN, param_grid, verbose=3)

In [ ]:
KNN_tuned.fit(X_train_knn, y_train_knn)

In [ ]:
y_knn_tuned_pred = KNN_tuned.predict(X_test_knn)

mse_knn_tuned = mean_squared_error(y_test_knn,y_knn_tuned_pred)
r2_knn_tuned = KNN_tuned.score(X_test_knn, y_test_knn)

print("MSE (tuned KNN) =", mse_knn_tuned)
print("R2 score (tuned KNN) =",r2_knn_tuned)
print("cf.")
print("MSE (KNN) = ",mse_knn)
print("R2 score (KNN) =",r2_knn)

Which are the best KNN control parameters we found?


In [ ]:
KNN_tuned.best_params_

Hmm. This value of n_neighbors is inconsistent with the peak in the cross-validation score in the validation curve plot.

This is probably because the tuned KNN works with a smaller training set (GridSearchCV further divides the original training set into internal training and validation folds), while the validation curve above was computed over the entire input set.
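
An optional way to check this explanation is to recompute the validation curve using only the training split that GridSearchCV saw, and see whether the peak moves. This is just a sketch; it reuses the helpers defined above.

In [ ]:
# Validation curve over n_neighbors, restricted to the KNN training split.
nn_range = np.array([1, 2, 4, 8, 16, 32, 64])
shuffle_split_train = ShuffleSplit(len(X_train_knn), 10, test_size=0.4)
training_scores_train, validation_scores_train = validation_curve(
    KNN, X_train_knn, y_train_knn, param_name="n_neighbors",
    param_range=nn_range, cv=shuffle_split_train, scoring='r2')
plot_validation_curve("n_neighbors", nn_range,
                      training_scores_train, validation_scores_train)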

Generalization Error

Notice that all the above tuning happened while training on a single split (X_train and y_train).

It's possible that that particular fold prefers a slightly different set of parameters than a different one - so to assess our generalization error, we need a further level of cross-validation.

We can do this by passing a GridSearchCV model to the cross validation score calculator. This will take a few moments, as the grid search is carried out for each CV fold...


In [ ]:
from sklearn.cross_validation import cross_val_score

R2 = cross_val_score(KNN_tuned, X_scaled_knn, y, cv=shuffle_split, scoring='r2')

In [ ]:
meanR2,errR2 = np.mean(R2),np.std(R2)
print("Mean score:",meanR2,"+/-",errR2)

Notes

  • Optimizing over control parameters (or hyperparameters) with grid search cross-validation is a form of model selection.
  • When presented with new metadata samples, and asked to predict the target response variables (CPU time), we'll need a trained machine that has not been over-fitted to the training data.
  • Minimizing and estimating the generalization error is a way to reduce the risk of getting this prediction wrong.
  • Let's finish off our CPU time machine-learning algorithm.

In [ ]:
KNNbest = KNN_tuned.best_estimator_
# KNNbest.fit(X_train_knn, y_train_knn)
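
Since the point of all this is to apply the trained machine to new visit metadata later on, it may be worth persisting the tuned model to disk. A minimal sketch using joblib follows; the file name is hypothetical, and newer versions of scikit-learn use the standalone joblib package rather than sklearn.externals.

In [ ]:
# Save the tuned KNN model so it can be reloaded in a later session.
from sklearn.externals import joblib
joblib.dump(KNNbest, "knn_cputime_model.pkl")
# Later: KNNbest = joblib.load("knn_cputime_model.pkl")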

How will we actually use this tuned model? Like this:


In [ ]:
j = 34
one_pretend_run = X_test_knn[j,:].reshape(1, -1)  # predict() expects a 2D array, even for one sample
cpupredicted = KNNbest.predict(one_pretend_run)[0]
cpuactual = y_test_knn[j]
print("True log10(CPU time) cf. KNN prediction:",cpuactual,cpupredicted)

In [ ]:
cpuactual = y_test_knn
cpupredicted = KNNbest.predict(X_test_knn)

plot(cpuactual, cpupredicted,'o',alpha=0.5)
title("KNNbest performance")
xlabel("CPU Time")
ylabel("Predicted CPU Time")
lims = [3.5,6]
#lims= [0, 4.5e5]
xlim(lims)
ylim(lims)
plot(lims, lims, ':k')

Random Forest Regression

Now try the random forest regression.


In [ ]:
from sklearn.ensemble import RandomForestRegressor

RF = RandomForestRegressor()
RF.fit(X_train,y_train)

In [ ]:
y_pred_RF = RF.predict(X_test)


mse_RF = mean_squared_error(y_test,y_pred_RF)
r2_RF = RF.score(X_test, y_test)

print("MSE (RF) =", mse_RF)
print("R2 score (RF) =",r2_RF)
print("cf.")
print("MSE (tuned KNN) =", mse_knn_tuned)
print("R2 score (tuned KNN) =",r2_knn_tuned)
print("cf.")
print("MSE (KNN) =", mse_knn)
print("R2 score (KNN) =",r2_knn)
print("cf.")
print("MSE (linear regression) = ",mse_linear)
print("R2 score (linear regression) =",r2_linear)

In [ ]:
print(run1meta_features.keys())
print(RF.feature_importances_)
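
The raw importance values are easier to digest as a sorted bar chart. This is a sketch, assuming the order of run1meta_features.keys() matches the columns of the design matrix X:

In [ ]:
# Plot the Random Forest feature importances, sorted from least to most important.
importances = RF.feature_importances_
order = np.argsort(importances)
names = np.array(list(run1meta_features.keys()))[order]
pos = np.arange(len(order))
plt.barh(pos, importances[order], align='center')
plt.yticks(pos, names)
plt.xlabel("Relative importance")
plt.title("Random Forest feature importances")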

In [ ]:
plot(y_test, y_pred_RF - y_test,'o',alpha=0.5)
title("RF Residuals - MSE = %.2f" % mse_RF)
xlabel("CPU Time")
ylabel("Residual")
ylim(-1,1)
hlines(0,min(y_test),max(y_test),color="red")

In [ ]:
plot(y_test,y_lr_pred - y_test,'o',alpha=0.3, color= 'r', label= "Linear MSE = %.2f" % mse_linear)
plot(y_test_knn, y_knn_pred - y_test_knn,'o',alpha=0.3, color= 'b', label= "KNN MSE = %.2f" % mse_knn)
plot(y_test, y_pred_RF - y_test,'o',alpha=0.3, color= 'g', label= "RF MSE = %.2f" % mse_RF)
title("Regression Residuals")
legend()
xlabel("CPU Time")
ylabel("Residual")
#ylim(-3,2)
hlines(0,min(y_test),max(y_test),color="red")
fig = plt.gcf()
fig.set_size_inches(10, 7)

Plot the validation curve for RF over n_estimators space


In [ ]:
# We'll vary the number of trees in the forest:
param_name = "n_estimators"
param_range = np.array([1, 2, 4, 16, 64, 128, 256, 512])

# And we'll need a cv iterator:
shuffle_split = ShuffleSplit(len(X), 10, test_size=0.4)

# Compute our cv scores for a range of the no. of estimators:
training_scores_RF, validation_scores_RF = validation_curve(RF, X, y,
                                                          param_name=param_name,
                                                          param_range=param_range, 
                                                          cv=shuffle_split, scoring='r2')

In [ ]:
plot_validation_curve(param_name, param_range, training_scores_RF, validation_scores_RF)

Run grid search, as we did before with KNN


In [ ]:
# run a grid search with Random Forests over n_estimators
param_grid_RF = {'n_estimators': np.array([16,32,64, 128, 256, 512])}

np.set_printoptions(suppress=True)
print(param_grid_RF)

In [ ]:
RF_tuned = GridSearchCV(RF, param_grid_RF, verbose=3)

RF_tuned.fit(X_train, y_train)

In [ ]:
y_RF_tuned_pred = RF_tuned.predict(X_test)

mse_RF_tuned = mean_squared_error(y_test,y_RF_tuned_pred)
r2_RF_tuned = RF_tuned.score(X_test, y_test)


print("MSE (tuned RF) =", mse_RF_tuned)
print("R2 score (tuned RF) =",r2_RF_tuned)
print("cf.")

print("MSE (RF) =", mse_RF)
print("R2 score (RF) =",r2_RF)

In [ ]:
RF_tuned.best_params_

In [ ]:
RFbest = RF_tuned.best_estimator_

In [ ]:
cpuactual_RF = y_test
cpupredicted_RF = RFbest.predict(X_test)

plot(cpuactual_RF, cpupredicted_RF,'o',alpha=0.5)
title("RFbest performance")
xlabel("CPU Time")
ylabel("Predicted CPU Time")
lims = [3.5,6]
#lims = [0,4e5]
xlim(lims)
ylim(lims)
plot(lims, lims, ':k')

In [ ]:
plot(cpuactual, cpupredicted,'o',alpha=0.2, label= 'KNNBest', color='b')
plot(cpuactual_RF, cpupredicted_RF,'o',alpha=0.3, label= 'RFBest', color= 'g')

title("Best performance")
xlabel("CPU Time")
ylabel("Predicted CPU Time")
lims = [3.5,6]
#lims = [0,4e5]
xlim(lims)
ylim(lims)
plot(lims, lims, ':k')
legend(loc= 0)
fig = plt.gcf()
fig.set_size_inches(10, 7)

In [ ]:
# color code the plot, based on the host name
hostname= np.array(run1meta_features["hostname"].values)

# host name: h, b, k, f, d
uniqHostName= np.unique(hostname)
realnames_host = ["b","d","f","h","k"]
colors= ['r', 'g', 'b', 'k', 'c', 'm']

In [ ]:
# Get the indices in the full array y that correspond to the test sample.
# (This matches on the value of y, so it assumes the log10 CPU times are unique.)
matchInd= []
for j in y_test:
    matchInd.append(np.where(y == j)[0])

In [ ]:
# get the host names for the data in test sample
host_test= hostname[matchInd]

In [ ]:
for i in range(len(uniqHostName)):
    ind= np.where(host_test == uniqHostName[i])[0]
    plot(cpuactual_RF[ind], cpupredicted_RF[ind], 'o' ,alpha=0.5, label= realnames_host[i], color= colors[i])
title("RF Best performance: Colors are HostNames")
xlabel("CPU Time")
ylabel("Predicted CPU Time")
lims = [3.5,6]
#lims = [0,4e5]
xlim(lims)
ylim(lims)
plot(lims, lims, ':k')
legend(loc= 0)
fig = plt.gcf()
fig.set_size_inches(10, 7)

In [ ]:
for i in range(len(uniqHostName)):
    ind= np.where(host_test == uniqHostName[i])[0]
    plot(cpuactual[ind], cpupredicted[ind], 'o' ,alpha=0.5, label= realnames_host[i], color= colors[i])
title("KNN Best performance: Colors are HostNames")
xlabel("CPU Time")
ylabel("Predicted CPU Time")
lims = [3.5,6]
#lims = [0,4e5]
xlim(lims)
ylim(lims)
plot(lims, lims, ':k')
legend(loc= 0)
fig = plt.gcf()
fig.set_size_inches(10, 7)

In [ ]:
# color code the plot, based on the filter
filters= np.array(run1meta_features["filter"].values)

# filter codes correspond to the LSST bands u, g, r, i, z, y
uniqFilters= np.unique(filters)
realnames = ["u","g","r","i","z", "y"]

# get the filters for the data in test sample
filters_test= filters[matchInd]

In [ ]:
for i in range(len(uniqFilters)):
    ind= np.where(filters_test == uniqFilters[i])[0]
    plot(cpuactual_RF[ind], cpupredicted_RF[ind], 'o' ,alpha=0.5, label= realnames[i], color= colors[i])
title("RF Best performance: Colors are Filters")
xlabel("CPU Time")
ylabel("Predicted CPU Time")
lims = [3.5,6]
#lims = [0,4e5]
xlim(lims)
ylim(lims)
plot(lims, lims, ':k')
legend(loc= 0)
fig = plt.gcf()
fig.set_size_inches(10, 7)

In [ ]:
# color code the plot, based on moonalt
moonalt= np.array(run1meta_features["moonalt"].values)

# get the moonalt entries for the data in test sample
moonalt_test= moonalt[matchInd].flatten()

In [ ]:
norm = mpl.colors.Normalize(vmin= min(moonalt_test), vmax= max(moonalt_test))
cmap = cm.jet_r
m = cm.ScalarMappable(norm=norm, cmap=cmap)

In [ ]:
plt.scatter(cpuactual_RF, cpupredicted_RF ,alpha=0.5, c= m.to_rgba(moonalt_test))
title("RF Best performance. Bluer -> Higher MoonAlt", fontsize= 16)
xlabel("log$_{10}$(Actual CPU Time)", fontsize= 16)
ylabel("log$_{10}$(Predicted CPU Time)", fontsize= 16)
lims = [3.5,6]
#lims = [0,4e5]
xlim(lims)
ylim(lims)
plot(lims, lims, ':k')
fig = plt.gcf()
fig.set_size_inches(10, 7)
