Class 5: Backpropagation.
In [1]:
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split

# Encode text values to dummy variables (i.e. [1,0,0],[0,1,0],[0,0,1] for red,green,blue)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name, x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)

# Encode text values to indexes (i.e. [1],[2],[3] for red,green,blue).
def encode_text_index(df, name):
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_

# Encode a numeric column as z-scores
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()
    if sd is None:
        sd = df[name].std()
    df[name] = (df[name] - mean) / sd

# Convert all missing values in the specified column to the median
def missing_median(df, name):
    med = df[name].median()
    df[name] = df[name].fillna(med)

# Convert a Pandas dataframe to the x,y inputs that TensorFlow needs
def to_xy(df, target):
    result = []
    for x in df.columns:
        if x != target:
            result.append(x)
    # find out the type of the target column. Is it really this hard? :(
    target_type = df[target].dtypes
    target_type = target_type[0] if hasattr(target_type, '__iter__') else target_type
    # Encode to int for classification, float otherwise. TensorFlow likes 32 bits.
    if target_type in (np.int64, np.int32):
        # Classification
        return df.as_matrix(result).astype(np.float32), df.as_matrix([target]).astype(np.int32)
    else:
        # Regression
        return df.as_matrix(result).astype(np.float32), df.as_matrix([target]).astype(np.float32)

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)
In [12]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Plot a confusion matrix.
# cm is the confusion matrix, names are the names of the classes.
def plot_confusion_matrix(cm, names, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(names))
    plt.xticks(tick_marks, names, rotation=45)
    plt.yticks(tick_marks, names)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Plot an ROC curve. pred - the predictions, y - the expected output.
def plot_roc(pred, y):
    fpr, tpr, _ = roc_curve(y, pred)
    roc_auc = auc(fpr, tpr)
    plt.figure()
    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC)')
    plt.legend(loc="lower right")
    plt.show()

# Plot a regression lift chart. pred - the predictions, y - the expected output.
def chart_regression(pred, y):
    t = pd.DataFrame({'pred': pred.flatten(), 'y': y.flatten()})
    t.sort_values(by=['y'], inplace=True)
    plt.plot(t['y'].tolist(), label='expected')
    plt.plot(t['pred'].tolist(), label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()
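As a usage sketch (the labels here are made up), plot_confusion_matrix pairs naturally with scikit-learn's confusion_matrix:

from sklearn.metrics import confusion_matrix
y_true = [0, 0, 1, 1, 2, 2]             # made-up expected classes
y_pred = [0, 1, 1, 1, 2, 0]             # made-up predicted classes
cm = confusion_matrix(y_true, y_pred)   # 3x3 matrix of counts
plot_confusion_matrix(cm, ['red', 'green', 'blue'])
plt.show()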
Backpropagation is the primary means by which a neural network's weights are determined during training. Backpropagation works by calculating a weight change amount ($v_t$) for every weight ($\theta$) in the neural network. This value is subtracted from the weight according to the following equation:
$ \theta_t = \theta_{t-1} - v_t $
This process is repeated for every iteration ($t$). How the weight change is calculated depends on the training algorithm. Classic backpropagation simply calculates a gradient ($\nabla$) for every weight with respect to the neural network's error function ($J$). The gradient is scaled by a learning rate ($\eta$).
$ v_t = \eta \nabla_{\theta_{t-1}} J(\theta_{t-1}) $
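As a minimal NumPy sketch of these two equations (the numbers are arbitrary, not from the notebook):

import numpy as np

eta = 0.1                        # learning rate
theta = np.array([0.5, -0.3])    # weights, theta_{t-1}
grad = np.array([0.2, -0.1])     # gradient of J at theta_{t-1}
v = eta * grad                   # v_t = eta * gradient
theta = theta - v                # theta_t = theta_{t-1} - v_t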
The learning rate is an important concept for backpropagation training. Setting the learning rate can be complex:

- Too small a learning rate will usually converge to a reasonable solution, but very slowly.
- Too large a learning rate will either fail outright or converge to a higher error than a smaller learning rate would.

Common values for the learning rate are 0.1, 0.01, 0.001, etc.
Gradients: (figure omitted)
The following link, from the book, shows how a simple neural network is trained with backpropagation.
Momentum adds another term to the calculation of $v_t$:
$ v_t = \eta \nabla_{\theta_{t-1}} J(\theta_{t-1}) + \lambda v_{t-1} $
Like the learning rate, momentum adds another training parameter that scales its effect. Momentum backpropagation has two training parameters: learning rate ($\eta$) and momentum ($\lambda$). Momentum simply adds the scaled value of the previous weight change amount ($v_{t-1}$) to the current weight change amount ($v_t$).
This has the effect of adding additional force behind the direction a weight was moving, which might allow the weight to escape a local minimum.
A very common value for momentum is 0.9.
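A sketch of the momentum update on a toy quadratic error (the starting values are illustrative):

import numpy as np

eta, lam = 0.1, 0.9              # learning rate (eta) and momentum (lambda)
theta = np.array([0.5, -0.3])    # weights
v = np.zeros_like(theta)         # previous weight change v_{t-1}, starts at zero
for _ in range(10):
    grad = 2 * theta             # toy gradient: J(theta) = sum(theta^2)
    v = eta * grad + lam * v     # v_t = eta*grad + lambda*v_{t-1}
    theta = theta - v            # theta_t = theta_{t-1} - v_t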
How often should the weights of a neural network be updated? Gradients can be calculated for each training set element. These gradients can also be summed together into batches, with the weights updated once per batch.
Because the batch size is smaller than the complete training set size, it may take several batches to make it completely through the training set. You may have noticed TensorFlow reporting both steps and epochs when a neural network is trained:
Step #100, epoch #7, avg. train loss: 23.02969
Step #200, epoch #15, avg. train loss: 2.67576
Step #300, epoch #23, avg. train loss: 1.33839
Step #400, epoch #30, avg. train loss: 0.86830
Step #500, epoch #38, avg. train loss: 0.67166
Step #600, epoch #46, avg. train loss: 0.54569
Step #700, epoch #53, avg. train loss: 0.47544
Step #800, epoch #61, avg. train loss: 0.39358
Step #900, epoch #69, avg. train loss: 0.36052
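The step and epoch counts in a log like this are related by simple batch arithmetic; a short sketch with hypothetical row counts:

n_train = 416                              # hypothetical number of training rows
batch_size = 32
steps_per_epoch = n_train // batch_size    # 13 weight updates per pass over the data
print(100 // steps_per_epoch)              # step 100 lands in roughly epoch 7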
Stochastic gradient descent (SGD) is currently one of the most popular neural network training algorithms. It works very similarly to Batch/Mini-Batch training, except that the batches are made up of a random set of training elements.
This leads to very irregular convergence of the error during training.
Because the neural network is trained on a random sample of the complete training set each time, the error does not make a smooth transition downward. However, the error usually does go down.
Advantages of SGD include:

- Each update uses only a small sample of the data, so individual weight updates are cheap to compute.
- The noise introduced by random sampling can help the weights escape local minima.

A NumPy sketch of the random-batch idea follows.
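This is a self-contained sketch of SGD on a toy linear-regression problem (all data here is synthetic, invented for the example):

import numpy as np

rng = np.random.RandomState(42)
x_all = rng.randn(400, 3)                        # synthetic data: 400 rows, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y_all = x_all.dot(true_w) + 0.1 * rng.randn(400)

w = np.zeros(3)
eta, batch_size = 0.05, 32
for step in range(500):
    idx = rng.choice(len(x_all), batch_size, replace=False)  # random sample each step
    xb, yb = x_all[idx], y_all[idx]
    grad = 2.0 * xb.T.dot(xb.dot(w) - yb) / batch_size       # MSE gradient on the batch
    w = w - eta * grad                           # noisy step, but downward on average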
One problem with simple backpropagation training algorithms is that they are highly sensitive to the learning rate and momentum. This is difficult because:

- A learning rate that works well for one problem is often a poor choice for another.
- The best learning rate can change as training progresses; a value that works early on may be too large near convergence.
Other training techniques:

- Resilient propagation (Rprop) - uses only the sign of the gradient, letting each weight adapt at its own rate.
- Nesterov accelerated gradient - looks ahead along the momentum direction before evaluating the gradient.
- Adagrad - maintains a per-weight learning rate that adapts based on past gradients.
- Adadelta and RMSProp - keep a decaying average of squared gradients to normalize each update.
Adam is the first training algorithm you should try; it is very effective. Kingma and Ba (2014) introduced the Adam update rule, which derives its name from the adaptive moment estimates that it uses. Adam estimates the first (mean) and second (variance) moments of the gradient to determine the weight corrections. It begins with an exponentially decaying average of past gradients ($m$):
$ m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t $
This average accomplishes a similar goal as the classic momentum update; however, its value is calculated automatically based on the current gradient ($g_t$). The update rule then calculates the second moment ($v_t$):
$ v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 $
The values $m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively. However, they will have a strong bias towards zero in the initial training cycles. The first moment’s bias is corrected as follows.
$ \hat{m}_t = \frac{m_t}{1-\beta^t_1} $
Similarly, the second moment is also corrected:
$ \hat{v}_t = \frac{v_t}{1-\beta_2^t} $
These bias-corrected first and second moment estimates are applied to the ultimate Adam update rule, as follows:
$ \theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon} \hat{m}_t $
Adam is very tolerant of the initial learning rate ($\eta$) and the other training parameters. Kingma and Ba (2014) propose default values of 0.9 for $\beta_1$, 0.999 for $\beta_2$, and $10^{-8}$ for $\epsilon$.
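A compact NumPy sketch of the full Adam update on a toy gradient ($\beta_1$, $\beta_2$, and $\epsilon$ are the defaults above; 0.001 is the paper's suggested step size):

import numpy as np

eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
theta = np.array([0.5, -0.3])
m = np.zeros_like(theta)                  # first moment estimate
v = np.zeros_like(theta)                  # second moment estimate
for t in range(1, 1001):
    g = 2 * theta                         # toy gradient: J(theta) = sum(theta^2)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)            # bias-corrected first moment
    v_hat = v / (1 - beta2**t)            # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)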
The following image shows how each of these algorithms trains (image credit: its original author).
In [15]:
%matplotlib inline
from matplotlib.pyplot import figure, show
from numpy import arange
import tensorflow.contrib.learn as skflow
import pandas as pd
import os
import numpy as np
import tensorflow as tf
from sklearn import metrics
from scipy.stats import zscore
import matplotlib.pyplot as plt

path = "./data/"
filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])

# Create the feature vector
missing_median(df, 'horsepower')
df.drop('name', axis=1, inplace=True)
encode_numeric_zscore(df, 'horsepower')
encode_numeric_zscore(df, 'weight')
encode_numeric_zscore(df, 'cylinders')
encode_numeric_zscore(df, 'displacement')
encode_numeric_zscore(df, 'acceleration')
encode_text_dummy(df, 'origin')

# Encode to a 2D matrix for training
x, y = to_xy(df, 'mpg')

# Split into train/test
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42)

# Create a deep neural network with 3 hidden layers of 50, 25, 10
regressor = skflow.TensorFlowDNNRegressor(
    hidden_units=[50, 25, 10],
    batch_size=32,
    #momentum=0.9,
    optimizer='SGD',
    learning_rate=0.01,
    steps=5000)

# Early stopping
early_stop = skflow.monitors.ValidationMonitor(x_test, y_test,
    early_stopping_rounds=200, print_steps=50)

# Fit/train the neural network
regressor.fit(x_train, y_train, monitor=early_stop)

# Measure RMSE error. RMSE is common for regression.
pred = regressor.predict(x_test)
score = np.sqrt(metrics.mean_squared_error(pred, y_test))
print("Final score (RMSE): {}".format(score))

# Plot the chart
chart_regression(pred, y_test)