In this question, we fit a model that predicts the torques a robot needs to apply in order to make its arm reach a desired point in space. The data was collected from a SARCOS robot arm with $7$ degrees of freedom. The input vector $x \in \mathbb{R}^{21}$ encodes the desired position, velocity, and acceleration of the $7$ joints. The output vector $y \in \mathbb{R}^7$ encodes the torques that should be applied to the joints to reach that point. The mapping from $x$ to $y$ is highly nonlinear.
The data is available at http://www.gaussianprocess.org/gpml/data/. The first step is to read it in.
In [1]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time
from util import read_all_sarcos_data, TARGET_COLUMN
training_data, test_data = read_all_sarcos_data()
training_data.head()
Out[1]:
Let's standardize our data by making each column have mean $0$ and unit variance.
In [2]:
from sklearn import preprocessing
def standardize_data(data):
    feature_column_names = [column_name for column_name in data.columns if 'x' in column_name]
    features = data[feature_column_names]
    # only care about the first output for this problem
    target = data[TARGET_COLUMN]
    X = features.values
    Y = target.values
    # only standardize the inputs; leave Y alone
    X_scaled = preprocessing.scale(X, axis = 0, with_mean = True, with_std = True, copy = True)
    return X_scaled, Y

# standardize the training and test data together, then split them back apart
all_data = pd.concat([training_data, test_data], axis = 0)
X, Y = standardize_data(all_data)
X_scaled, Y_scaled = X[:len(training_data)], Y[:len(training_data)]
X_scaled_test, Y_scaled_test = X[len(training_data):], Y[len(training_data):]
In [3]:
from sklearn import linear_model
## since data is already normalized
linearRegressor = linear_model.LinearRegression(fit_intercept = True, normalize = False)
linearRegressor.fit(X_scaled, Y_scaled)
linearRegressor.score(X_scaled_test, Y_scaled_test)
Out[3]:
According to the problem, we should be able to get a standardized mean squared error (SMSE) of 0.075. We define
\begin{equation} \mathrm{SMSE} = \left(\frac{1}{N_{\text{test}}}\sum_{i=1}^{N_{\text{test}}}\left(y_i - \hat{y}_i\right)^2\right)/\sigma^2, \end{equation}
where $\sigma^2$ is the variance of the training targets,
\begin{equation} \sigma^2 = \frac{1}{N_{\text{train}}}\sum_{i=1}^{N_{\text{train}}}\left(y_i - \bar{y}\right)^2. \end{equation}
In [4]:
def compute_mse(prediction, labels):
    return np.mean((labels - prediction)**2)

def compute_model_mse(model, X, y):
    predicted_y = model.predict(X)
    return compute_mse(predicted_y, y)

def log_mse(mse):
    print('MSE: {:.10f}'.format(mse))
    # SMSE divides by the variance of the training targets
    print('SMSE: {:.10f}'.format(mse/np.var(Y_scaled, ddof = 0)))
test_mse = compute_model_mse(linearRegressor, X_scaled_test, Y_scaled_test)
log_mse(test_mse)
Looks like we got the correct result. Yay!
Next, we'll try using a radial basis function (RBF) network. First, we need to do some K-means clustering. We'll pick the number of clusters with the qualitative kink, knee, or elbow method (three different names for the same method). The book says to use cross-validation, but that's nonsense since the book itself says that using cross-validation is not possible for non-probabilistic methods. I suppose we could use a Gaussian mixture model if we wanted to do cross-validation; a sketch of what that might look like is below.
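Purely as an aside, here's a minimal sketch of cross-validating $k$ with a Gaussian mixture model, whose score method returns a held-out log-likelihood. None of this is used below, and the candidate values of $k$ are illustrative.

from sklearn import mixture, model_selection

def gmm_cv_log_likelihood(k, X, n_splits=3):
    # GaussianMixture.score returns the average log-likelihood, so cross_val_score
    # gives us the held-out log-likelihood for each fold.
    gmm = mixture.GaussianMixture(n_components=k, covariance_type='diag')
    return model_selection.cross_val_score(gmm, X, cv=n_splits).mean()

# e.g., gmm_scores = {k: gmm_cv_log_likelihood(k, X_scaled) for k in (25, 50, 100)}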
For the k-means approach, we'll fit the model with various values of $k$ and look at the reconstruction error.
In [6]:
from sklearn import cluster
from joblib import Parallel, delayed
k_range = np.arange(1, 107, 3)
def score_k(k):
    kMeansModel = cluster.KMeans(n_clusters = k, init = 'k-means++', algorithm = 'elkan', precompute_distances = True)
    kMeansModel.fit(X_scaled)
    # KMeans.score returns the negative sum of squared distances to the nearest cluster center,
    # so negate it to get the reconstruction error on the test set
    return -kMeansModel.score(X_scaled_test)
start = time.perf_counter() ## time how long it takes
reconstruction_error = Parallel(n_jobs = 2, backend='threading')(delayed(score_k)(k) for k in k_range)
print('Time elapsed: ' + str(time.perf_counter() - start))
reconstruction_error
Out[6]:
Let's plot the reconstruction error to find the kink.
In [7]:
reconstruction_error = np.array(reconstruction_error, dtype = np.float64)
fig = plt.figure(figsize=(10,6))
ax = plt.gca()
ax.plot(k_range, reconstruction_error, '.k-')
ax.grid(True)
ax.set_ylabel('Reconstruction error')
ax.set_xlabel('$k$ - the number of clusters')
ax.set_title('Test Data Reconstruction Error versus $k$')
ax.get_yaxis().set_major_formatter(
matplotlib.ticker.FuncFormatter(lambda y, pos : format(int(y), ',')))
fig.show()
Qualitatively, $100$ looks good to me. Now, let's fit our RBF network, which is a very simple neural network: one hidden layer of $100$ units with the RBF activation function, followed by a linear regression on that hidden layer.
In [8]:
n_clusters = 100
kMeansModel = cluster.KMeans(n_clusters = n_clusters, init = 'k-means++', algorithm = 'elkan', precompute_distances = True)
kMeansModel.fit(X_scaled)
prototypes = kMeansModel.cluster_centers_
The RBF kernel is
\begin{equation} \kappa(\mathbf{x}, \mathbf{x}^\prime) = \exp\left(-\frac{\lVert\mathbf{x}-\mathbf{x}^\prime\rVert^2}{2\sigma^2}\right), \end{equation}
so we'll perform our linear regression with design matrix $\Phi$, whose rows are
\begin{equation} \phi(\mathbf{x}) = \begin{pmatrix} \kappa(\mathbf{x}, \boldsymbol\mu_1) & \kappa(\mathbf{x}, \boldsymbol\mu_2) & \cdots & \kappa(\mathbf{x}, \boldsymbol\mu_{k}) \end{pmatrix}, \end{equation}
where $\boldsymbol\mu_j$ for $j = 1,2,\ldots,k$ are our cluster centers. The bandwidth $\sigma^2$ is a parameter that we'll have to choose with cross-validation.
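As an aside, computing $\Phi$ row by row with np.apply_along_axis, as the class below does, is easy to read but slow for a dataset this size. An equivalent vectorized version, sketched here with scipy.spatial.distance.cdist (which the class below does not use), would be:

import numpy as np
from scipy.spatial.distance import cdist

def rbf_design_matrix(X, mu, sigma2):
    # squared Euclidean distances between every row of X and every cluster center
    squared_distances = cdist(X, mu, metric='sqeuclidean')
    return np.exp(-squared_distances / (2.0 * sigma2))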
In [9]:
from sklearn import model_selection
class RBFNetwork:
    """
    Implements an RBF network as a special case of linear regression.
    """
    def __init__(self, mu, sigma2):
        self.linearRegressor = linear_model.LinearRegression(fit_intercept = True, normalize = False)
        self.mu = mu
        self.sigma2 = sigma2

    def fit(self, X, y):
        Phi = self.rbf_transform(X, self.mu, self.sigma2)
        self.linearRegressor.fit(Phi, y)
        return self

    def predict(self, X):
        Phi = self.rbf_transform(X, self.mu, self.sigma2)
        return self.linearRegressor.predict(Phi)

    def score(self, X, y):
        # negative mean squared error, so that higher is better for GridSearchCV
        predicted_y = self.predict(X)
        return -np.mean((y - predicted_y)**2)

    def get_params(self, deep = True):
        return {'mu': self.mu, 'sigma2': self.sigma2}

    def set_params(self, **params):
        if 'mu' in params:
            self.mu = params['mu']
        if 'sigma2' in params:
            self.sigma2 = params['sigma2']
        return self

    @staticmethod
    def rbf(x1, x2, sigma2):
        deltaX = x1 - x2
        distX = np.sum(deltaX*deltaX)
        return np.exp(-distX/2/sigma2)

    @staticmethod
    def phi(x, mu, sigma2):
        return np.apply_along_axis(func1d = RBFNetwork.rbf, axis = 1, arr = mu, x2 = x, sigma2 = sigma2)

    @staticmethod
    def rbf_transform(X, mu, sigma2):
        return np.apply_along_axis(func1d = RBFNetwork.phi, axis = 1, arr = X, mu = mu, sigma2 = sigma2)
bandwidths = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
gridSearchCV = model_selection.GridSearchCV(RBFNetwork(mu = prototypes, sigma2 = 1024),
                                            param_grid = {'sigma2': bandwidths},
                                            cv = model_selection.KFold(n_splits = 5, shuffle = True, random_state = 2016),
                                            return_train_score = True,
                                            n_jobs = 1) ## parallel jobs are broken for a reason that I don't feel like debugging
gridSearchCV.fit(X_scaled, Y_scaled)
cv_results = pd.DataFrame({'bandwidth': bandwidths,
                           'mean_test_score': gridSearchCV.cv_results_['mean_test_score'],
                           'std_test_score': gridSearchCV.cv_results_['std_test_score'],
                           'mean_train_score': gridSearchCV.cv_results_['mean_train_score'],
                           'std_train_score': gridSearchCV.cv_results_['std_train_score']})
cv_results
Out[9]:
In [10]:
fig = plt.figure(figsize=(10, 6))
ax = fig.gca()
test_error_bars = ax.errorbar(cv_results['bandwidth'], cv_results['mean_test_score'], cv_results['std_test_score'],
linestyle='-', marker='.', color='b', capsize=5)
train_error_bars = ax.errorbar(cv_results['bandwidth'], cv_results['mean_train_score'], cv_results['std_train_score'],
linestyle='-', marker='.', color='r', capsize=5)
ax.grid(True)
ax.legend([test_error_bars, train_error_bars], ['Test Set', 'Training Set'], loc='lower right')
ax.set_xscale('log', basex = 2)
ax.set_xlabel('Bandwidth ($\sigma^2$)')
ax.set_ylabel('Score (negative MSE)')
ax.set_title('Score versus Bandwidth')
fig.show()
It looks like the best model has a bandwidth of about $2^{8}$. Let's compute its SMSE.
In [11]:
rBFNetwork = RBFNetwork(mu = prototypes, sigma2 = 2**8)
rBFNetwork.fit(X_scaled, Y_scaled)
-rBFNetwork.score(X_scaled_test, Y_scaled_test)/np.var(Y_scaled, ddof=0)
Out[11]:
It seems we can only get down to an SMSE of 0.043. The next step is to try a neural network, which has more parameters to choose: in particular, the number of hidden units, the number of layers, and the $L^2$ regularization parameter.
Since I work at Google, I thought I would demonstrate constructing a neural network in TensorFlow. Because I'm doing this with the intent of learning the details of the models, I'm going to eschew the high-level APIs and construct the graph mostly by hand. Much of the code is imported from sarco_train.py.
First, we'll set up our features, which need to be normalized so that each input has mean $0$ and variance $1$.
In [6]:
import tensorflow as tf
from sarco_train import create_feature_columns
feature_columns = create_feature_columns(pd.concat([training_data, test_data]))
def get_feature_data(data, feature_columns):
    graph = tf.Graph()
    with graph.as_default():
        features = {column: tf.constant(column_data, dtype=tf.float32)
                    for column, column_data in data.to_dict('list').items()}
        input_layer = tf.feature_column.input_layer(
            features=features, feature_columns=feature_columns)
    with tf.Session(graph=graph) as sess:
        input_data = sess.run(input_layer)
    return input_data, data[TARGET_COLUMN].astype(np.float32)

training_input_data, training_target_data = get_feature_data(training_data, feature_columns)
test_input_data, test_target_data = get_feature_data(test_data, feature_columns)
Everything in TensorFlow is a graph, which we build layer by layer. Here, we extract the features from the training and test data and apply the feature columns to both datasets.
Now, we can think about the inference layer. To keep things simple, I'll only use two hidden layers. The weights and biases are initialized with noise, and ReLU is used as the activation function. The first layer has 70 units, and the second layer has 50 units. Since we have an input of 21 features, the total number of weight parameters in a fully connected network is $21 \times 70 + 70 \times 50 + 50 \times 1 = 5{,}020$, and we also have $70 + 50 + 1 = 121$ bias parameters. The numbers of units were chosen arbitrarily; I was too lazy to implement a cross-validation strategy.
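The inference function itself lives in sarco_train.py and isn't reproduced here. Purely as an illustration (not the actual code from sarco_train.py), a hand-rolled two-hidden-layer ReLU network with a single linear output could look roughly like this:

def two_layer_relu_network(inputs, hidden1_units, hidden2_units):
    """Illustrative only: two hidden ReLU layers followed by a single linear output."""
    def dense(x, n_in, n_out, name):
        with tf.variable_scope(name):
            # initialize weights with noise and biases at a small constant
            weights = tf.get_variable('weights', shape=(n_in, n_out),
                                      initializer=tf.truncated_normal_initializer(stddev=0.1))
            biases = tf.get_variable('biases', shape=(n_out,),
                                     initializer=tf.constant_initializer(0.1))
            return tf.matmul(x, weights) + biases
    n_features = inputs.shape[1].value
    hidden1 = tf.nn.relu(dense(inputs, n_features, hidden1_units, 'hidden1'))
    hidden2 = tf.nn.relu(dense(hidden1, hidden1_units, hidden2_units, 'hidden2'))
    return tf.squeeze(dense(hidden2, hidden2_units, 1, 'output'), axis=1)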
For the loss, we minimize mean squared error. This is how we'll compare models.
In [7]:
from sarco_train import inference, HIDDEN1_UNITS, HIDDEN2_UNITS

eval_graph = tf.Graph()
with eval_graph.as_default():
    input = tf.placeholder(dtype=tf.float32, shape=(None, len(feature_columns)))
    prediction = inference(input, HIDDEN1_UNITS, HIDDEN2_UNITS)
    labels = tf.placeholder(dtype=tf.float32, shape=(None,))
    mse = tf.losses.mean_squared_error(labels=labels, predictions=prediction, reduction=tf.losses.Reduction.MEAN)
    tf.summary.scalar('mse', mse)
    summary_op = tf.summary.merge_all()

# Reset the evaluation process
last_checkpoint_summarized = None
Training the model is done in sarco_train.py. The script was run with the command:
python3 sarco_train.py --restart --log_level=INFO --log_dir /Users/phillypham/tmp/sarco/logs
Input is fetched from a queue to avoid blocking the thread doing the training, and 1,024 examples are processed in each training step. In addition to minimizing mean squared error, we add $L^1$ and $L^2$ regularization to deal with the large number of parameters. The scale parameter for $L^1$ is $1.0$, and the scale parameter for $L^2$ is $14.0$. The variant of gradient descent that we use is the Adam optimizer with a learning rate of $0.01$.
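To make that concrete, here is a rough sketch (not the actual code in sarco_train.py) of how the regularized loss and the Adam update could be wired together, with weight_variables standing in for the list of weight matrices:

def build_train_op(prediction, labels, weight_variables, l1_scale=1.0, l2_scale=14.0, learning_rate=0.01):
    # data-fitting term
    mse = tf.losses.mean_squared_error(labels=labels, predictions=prediction)
    # L1 and L2 penalties on the weight matrices (biases are typically left unregularized)
    l1_penalty = l1_scale * tf.add_n([tf.reduce_sum(tf.abs(w)) for w in weight_variables])
    l2_penalty = l2_scale * tf.add_n([tf.nn.l2_loss(w) for w in weight_variables])
    loss = mse + l1_penalty + l2_penalty
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    return optimizer.minimize(loss, global_step=tf.train.get_or_create_global_step())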
While the training script is running, this code runs in a separate process to periodically evaluate the model. We calculate mean squared error against the entire training set and entire test set. We can see these plotted by running tensorboard --logdir=/Users/phillypham/tmp/sarco/logs and visiting TensorBoard.
Since we're using a MonitoredSession, we can continue training from where we left off by running
python3 sarco_train.py --log_level=INFO --log_dir /Users/phillypham/tmp/sarco/logs
and restarting the evaluation code below.
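The resume-from-checkpoint behavior comes essentially for free from the monitored session. Purely as a sketch (the actual training loop in sarco_train.py, with its queue input and hooks, isn't shown here), the core of such a loop looks like:

def train_loop(train_op, log_dir, checkpoint_interval_secs):
    # MonitoredTrainingSession restores the latest checkpoint in log_dir if one exists
    # and periodically writes new checkpoints while training runs.
    with tf.train.MonitoredTrainingSession(
            checkpoint_dir=log_dir,
            save_checkpoint_secs=checkpoint_interval_secs) as sess:
        while not sess.should_stop():
            sess.run(train_op)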
In [8]:
from sarco_train import CHECKPOINT_INTERVAL_SECS

CHECKPOINT_DIR = '/Users/phillypham/tmp/sarco/logs'
EVAL_DIR = CHECKPOINT_DIR + '/eval'

with eval_graph.as_default():
    saver = tf.train.Saver(tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES))
    training_writer = tf.summary.FileWriter(EVAL_DIR + '/training', eval_graph)
    test_writer = tf.summary.FileWriter(EVAL_DIR + '/test', eval_graph)
    with tf.Session() as sess:
        def write_summary(step):
            def write_mse(writer, features, target):
                summary = sess.run(summary_op, feed_dict={input: features, labels: target})
                writer.add_summary(summary, step)
                writer.flush()
            tf.logging.info('Writing summary for step {:d}'.format(step))
            write_mse(test_writer, test_input_data, test_target_data)
            write_mse(training_writer, training_input_data, training_target_data)
        while True:
            checkpoint_state = tf.train.get_checkpoint_state(CHECKPOINT_DIR)
            checkpoint_path = checkpoint_state.model_checkpoint_path
            # Stop running when no more new checkpoints are being generated.
            if checkpoint_path == last_checkpoint_summarized:
                break
            saver.restore(sess, checkpoint_path)
            global_step = int(checkpoint_path.split('/')[-1].split('-')[-1])
            write_summary(global_step)
            last_checkpoint_summarized = checkpoint_path
            time.sleep(CHECKPOINT_INTERVAL_SECS)
        tf.logging.info('Training stopped.')
In TensorBoard, I've been summarizing loss and the mean squared error on checkpoints as the model is training. Here's the loss plotted.
You can see that the model has more or less converged by 50,000 training steps. Because we're sampling batches of data, there's a lot of noise.
The loss includes the regularization penalty, so when evaluating the model on the test data, it may be more helpful to look at mean squared error. The various mean squared errors are:
Of course, this isn't very helpful because of how noisy the MSE of the training batches is. It's more helpful to look at just the MSEs from the evaluation process.
After the model converged, there's a lot of random noise, and we may be overfitting by training this long. The MSEs on the training and test sets stay pretty close together, however. It looks like the lowest MSE is obtained by the model at step 188,939. Let's calculate the SMSE for this model.
In [10]:
SELECTED_MODEL = CHECKPOINT_DIR + '/model.ckpt-188939'
with eval_graph.as_default():
    with tf.Session() as sess:
        saver.restore(sess, SELECTED_MODEL)
        test_mse = sess.run(mse, feed_dict={input: test_input_data, labels: test_target_data})
        log_mse(test_mse)
Nice. The neural network achieves the best standardized mean squared error. Of course, training this model took the longest, and the details of why it works are already obscure after only two layers.