Class 3: Training a Neural Network
Neural networks require their input to be a fixed number of columns. This is very similar to spreadsheet data. This input must be completely numeric.
It is important to represent the data in a way that the neural network can train from it. In class 6, we will see even more ways to preprocess data. For now, we will look at several of the most basic ways to transform data for a neural network.
Before we look at specific ways to preprocess data, it is important to consider the four basic types of data defined by Stanley Smith Stevens. These are commonly referred to as the levels of measurement: nominal, ordinal, interval, and ratio.
The following code contains several useful functions to encode the feature vector for various types of data. Encoding data:
Ordinal values can be encoded as dummy variables or as an index. Later we will see a more advanced means of encoding.
Dealing with missing data:
Creating the final feature vector:
Other utility functions:
In [4]:
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Encode text values to dummy variables (i.e. [1,0,0],[0,1,0],[0,0,1] for red,green,blue)
def encode_text_dummy(df,name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name,x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)
# Encode text values to indexes (i.e. [0],[1],[2] for blue,green,red; LabelEncoder sorts the classes).
# Returns the class names, in index order.
def encode_text_index(df,name):
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_
# Encode a numeric column as zscores
def encode_numeric_zscore(df,name,mean=None,sd=None):
    if mean is None:
        mean = df[name].mean()
    if sd is None:
        sd = df[name].std()
    df[name] = (df[name]-mean)/sd
# Convert all missing values in the specified column to the median
def missing_median(df, name):
    med = df[name].median()
    df[name] = df[name].fillna(med)
# Convert a Pandas dataframe to the x,y inputs that TensorFlow needs
def to_xy(df,target):
    # Accept either a single column name or a list of names as the target,
    # so that the target column is never left among the features by mistake.
    targets = [target] if isinstance(target, str) else list(target)
    result = [x for x in df.columns if x not in targets]
    # Find out the type of the (first) target column.
    target_type = df[targets[0]].dtype
    # Encode to int for classification, float otherwise. TensorFlow likes 32 bits.
    if target_type in (np.int64, np.int32):
        # Classification
        return df[result].values.astype(np.float32),df[targets].values.astype(np.int32)
    else:
        # Regression
        return df[result].values.astype(np.float32),df[targets].values.astype(np.float32)
# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)
# Regression chart, we will see more of this chart in the next class.
def chart_regression(pred,y):
    # Use the y argument (not a global y_test) so the function works anywhere
    t = pd.DataFrame({'pred' : pred.flatten(), 'y' : y.flatten()})
    t.sort_values(by=['y'],inplace=True)
    a = plt.plot(t['y'].tolist(),label='expected')
    b = plt.plot(t['pred'].tolist(),label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()
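As a quick sanity check on what these encodings produce, here is a minimal, self-contained sketch that applies the same three ideas (median fill, z-score, dummy variables) to a tiny made-up DataFrame. The column names and values are illustrative assumptions, not part of the course data, and the pandas calls are used directly rather than through the helper functions.

```python
import numpy as np
import pandas as pd

# Tiny made-up dataset (hypothetical values, for illustration only)
demo = pd.DataFrame({
    'weight': [2000.0, 3000.0, np.nan, 4000.0],
    'color': ['red', 'green', 'blue', 'red'],
})

# Missing value -> column median (the median of 2000, 3000, 4000 is 3000)
demo['weight'] = demo['weight'].fillna(demo['weight'].median())

# Z-score: subtract the mean, divide by the (sample) standard deviation
demo['weight_z'] = (demo['weight'] - demo['weight'].mean()) / demo['weight'].std()

# Dummy variables: one 0/1 column per distinct text value
dummies = pd.get_dummies(demo['color']).astype(int)

print(demo)
print(dummies)
```

After the z-score step the column has a mean of 0 and a standard deviation of 1, and each row of the dummy table contains exactly one 1.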
Overfitting occurs when a neural network is trained to the point that it begins to memorize rather than generalize.
It is important to segment the original dataset into several datasets:
There are several different ways that these sets can be constructed. The following programs demonstrate some of these.
The first method is a training and validation set. The training data are used to train the neural network until the score on the validation set no longer improves. This attempts to stop at a near-optimal training point. This method only gives accurate "out of sample" predictions for the validation set, which is usually 20% or so of the data. The predictions for the training data will be overly optimistic, as these were the data that the neural network was trained on.
In [5]:
import os
import pandas as pd
from sklearn.cross_validation import train_test_split
import tensorflow.contrib.learn as skflow
import numpy as np
path = "./data/"
filename = os.path.join(path,"iris.csv")
df = pd.read_csv(filename,na_values=['NA','?'])
# Encode feature vector
encode_numeric_zscore(df,'petal_w')
encode_numeric_zscore(df,'petal_l')
encode_numeric_zscore(df,'sepal_w')
encode_numeric_zscore(df,'sepal_l')
species = encode_text_index(df,"species")
num_classes = len(species)
# Create x & y for training
# Create the x-side (feature vectors) of the training
x, y = to_xy(df,'species')
# Split into train/test
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.25, random_state=45)
# as much as I would like to use 42, it gives a perfect result, and a boring confusion matrix!
# Create a deep neural network with 3 hidden layers of 20, 10, 5
classifier = skflow.TensorFlowDNNClassifier(hidden_units=[20, 10, 5], n_classes=num_classes,
    steps=10000)
# Early stopping
early_stop = skflow.monitors.ValidationMonitor(x_test, y_test,
early_stopping_rounds=200, print_steps=50, n_classes=num_classes)
# Fit/train neural network
classifier.fit(x_train, y_train, monitor=early_stop)
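The early stopping rule used in the cell above can be sketched independently of TensorFlow. The function below is a simplified assumption about how such a monitor behaves: it stops once the validation loss has not improved for a given number of checks. The name `early_stopping_step` and the loss values are hypothetical, and this is not the actual ValidationMonitor implementation.

```python
def early_stopping_step(val_losses, patience):
    """Return (stop_step, best_step) for a sequence of validation losses.

    Training stops at the first step where the loss has failed to improve
    on the best value seen so far for `patience` consecutive checks.
    """
    best_loss = float("inf")
    best_step = 0
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember where it happened
            best_loss = loss
            best_step = step
        elif step - best_step >= patience:
            # No improvement for `patience` checks: stop here
            return step, best_step
    # Ran out of steps without triggering early stopping
    return len(val_losses) - 1, best_step


# Hypothetical validation losses: improve, then drift upward
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
stop_step, best_step = early_stopping_step(losses, patience=3)
print("Stopped at step {}, best step was {}".format(stop_step, best_step))
```

With these values the best loss occurs at step 2 and training stops at step 5, three checks later.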
Accuracy is the fraction of rows where the neural network correctly predicted the target class. Accuracy is only used for classification, not regression.
$ \text{accuracy} = \frac{\text{number correct}}{N} $
Where $N$ is the size of the evaluated set (training or validation). Higher accuracy numbers are desired.
In [7]:
from sklearn import metrics
# Evaluate success using accuracy
pred = classifier.predict(x_test)
score = metrics.accuracy_score(y_test, pred)
print("Accuracy score: {}".format(score))
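The accuracy formula amounts to counting matches and dividing by the number of rows. The sketch below reproduces it with NumPy on a small made-up set of labels; the function name `accuracy` and the label values are assumptions for illustration.

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Flatten both arrays so (N,) and (N,1) shapes compare element by element
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    # Fraction of rows where the predicted class equals the expected class
    return float(np.mean(y_true == y_pred))

expected = [0, 1, 2, 2, 1]    # hypothetical true classes
predicted = [0, 1, 1, 2, 1]   # hypothetical predictions (one mistake)
print("Accuracy: {}".format(accuracy(expected, predicted)))
```

Four of the five rows match, so the result is 0.8.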
Accuracy is like a final exam with no partial credit. However, neural networks can predict a probability of each of the target classes. Neural networks will give high probabilities to predictions that are more likely. Log loss is an error metric that penalizes confidence in wrong answers. Lower log loss values are desired.
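For a scikit-learn style classifier there are two ways to get a prediction: predict returns the predicted class for each row, and predict_proba returns the probability of each class.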
The following code shows the output of predict_proba:
In [8]:
pred = classifier.predict_proba(x_test)
np.set_printoptions(precision=4)
print("Numpy array of predictions")
print(pred[0:5])
print("As percent probability")
print((pred[0:5]*100).astype(int))
score = metrics.log_loss(y_test, pred)
print("Log loss score: {}".format(score))
For binary classification, log loss is calculated as follows:
$ \text{log loss} = -\frac{1}{N}\sum_{i=1}^N {( {y}_i\log(\hat{y}_i) + (1 - {y}_i)\log(1 - \hat{y}_i))} $
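To make the binary form of the formula concrete, here is a small NumPy sketch. The function name `binary_log_loss` and the clipping constant `eps` are assumptions added for illustration; clipping simply avoids taking the log of exactly 0 or 1.

```python
import numpy as np

def binary_log_loss(y_true, y_prob, eps=1e-15):
    # y_true holds 0/1 labels, y_prob the predicted probability of class 1
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    # Average of the per-row penalties from the formula above
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))

# The true class is 1 in both cases; only the confidence differs
print("Confident and right:", binary_log_loss([1], [0.9]))
print("Confident and wrong:", binary_log_loss([1], [0.1]))
```

The first value is roughly 0.105 and the second roughly 2.303, which shows how heavily a confident wrong answer is penalized.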
The log function is useful for penalizing wrong answers. The following code demonstrates the utility of the log function:
In [22]:
%matplotlib inline
from matplotlib.pyplot import figure, show
from numpy import arange, sin, pi
t = arange(0.0, 5.0, 0.00001)
#t = arange(1.0, 5.0, 0.00001) # computer scientists
#t = arange(0.0, 1.0, 0.00001) # data scientists
fig = figure(1,figsize=(12, 10))
ax1 = fig.add_subplot(211)
ax1.plot(t, np.log(t))
ax1.grid(True)
ax1.set_ylim((-8, 1.5))
ax1.set_xlim((-0.1, 2))
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('log(x)')
show()
Regression results are evaluated differently than classification. Consider the following code that trains a neural network for the MPG dataset.
In [23]:
import tensorflow.contrib.learn as skflow
from sklearn.cross_validation import train_test_split
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
# create feature vector
missing_median(df, 'horsepower')
df.drop('name',1,inplace=True)
encode_numeric_zscore(df, 'horsepower')
encode_numeric_zscore(df, 'weight')
encode_numeric_zscore(df, 'cylinders')
encode_numeric_zscore(df, 'displacement')
encode_numeric_zscore(df, 'acceleration')
encode_text_dummy(df, 'origin')
# Encode to a 2D matrix for training
x,y = to_xy(df,['mpg'])
# Split into train/test
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.20, random_state=42)
# Create a deep neural network with 3 hidden layers of 50, 25, 10
regressor = skflow.TensorFlowDNNRegressor(hidden_units=[50, 25, 10], steps=5000)
# Early stopping
early_stop = skflow.monitors.ValidationMonitor(x_test, y_test,
early_stopping_rounds=200, print_steps=50)
# Fit/train neural network
regressor.fit(x_train, y_train, monitor=early_stop)
The mean squared error (MSE) is the average of the squared differences between the prediction ($\hat{y}$) and the expected ($y$). Because the errors are squared, MSE is expressed in the square of the target's units, which makes individual values hard to interpret. If the MSE has decreased for a model, that is good, but beyond this there is not much more you can determine. Low MSE values are desired.
$ \text{MSE} = \frac{1}{n} \sum_{i=1}^n \left(\hat{y}_i - y_i\right)^2 $
In [24]:
pred = regressor.predict(x_test)
# Measure MSE error.
score = metrics.mean_squared_error(pred,y_test)
print("Final score (MSE): {}".format(score))
In [25]:
# Measure RMSE error. RMSE is common for regression.
score = np.sqrt(metrics.mean_squared_error(pred,y_test))
print("Final score (RMSE): {}".format(score))
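Both scores can be reproduced by hand, which also shows why RMSE is easier to read: taking the square root returns the error to the units of the target (MPG here). The numbers below are made up for illustration and are not taken from the auto-mpg data.

```python
import numpy as np

y_expected = np.array([3.0, 5.0, 2.0])   # hypothetical expected values
y_predicted = np.array([2.5, 5.0, 4.0])  # hypothetical predictions

# MSE: average of the squared differences
mse = np.mean((y_predicted - y_expected) ** 2)

# RMSE: square root of the MSE, in the same units as the target
rmse = np.sqrt(mse)

print("MSE:  {:.4f}".format(mse))
print("RMSE: {:.4f}".format(rmse))
```

The squared errors are 0.25, 0 and 4, so the MSE is 4.25 / 3 (about 1.4167) and the RMSE is about 1.1902.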
Cross validation uses a number of folds, and multiple models, to generate out of sample predictions on the entire dataset. It is important to note that there will be one model (neural network) for each fold. Each model contributes part of the final out-of-sample prediction.
For new data, which is data not present in the training set, predictions from the fold models can be handled in several ways.
The following code trains the MPG dataset using a 5-fold cross validation. The expected performance of a neural network, of the type trained here, would be the score for the generated out-of-sample predictions.
In [14]:
import tensorflow.contrib.learn as skflow
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore
from sklearn.cross_validation import KFold
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
filename_write = os.path.join(path,"auto-mpg-out-of-sample.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
# create feature vector
missing_median(df, 'horsepower')
df.drop('name',1,inplace=True)
encode_numeric_zscore(df, 'horsepower')
encode_numeric_zscore(df, 'weight')
encode_numeric_zscore(df, 'cylinders')
encode_numeric_zscore(df, 'displacement')
encode_numeric_zscore(df, 'acceleration')
encode_text_dummy(df, 'origin')
# Shuffle
np.random.seed(42)
df = df.reindex(np.random.permutation(df.index))
df.reset_index(inplace=True, drop=True)
# Encode to a 2D matrix for training
x,y = to_xy(df,['mpg'])
# Cross validate
kf = KFold(len(x), n_folds=5)
oos_y = []
oos_pred = []
fold = 1
for train, test in kf:
    print("Fold #{}".format(fold))
    fold+=1
    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]
    # Create a deep neural network with 3 hidden layers of 10, 20, 10
    regressor = skflow.TensorFlowDNNRegressor(hidden_units=[10, 20, 10], steps=500)
    # Early stopping
    early_stop = skflow.monitors.ValidationMonitor(x_test, y_test,
        early_stopping_rounds=200, print_steps=50)
    # Fit/train neural network
    regressor.fit(x_train, y_train, monitor=early_stop)
    # Add the predictions to the oos prediction list
    pred = regressor.predict(x_test)
    oos_y.append(y_test)
    oos_pred.append(pred)
    # Measure this fold's error (RMSE)
    score = np.sqrt(metrics.mean_squared_error(pred,y_test))
    print("Fold score (RMSE): {}".format(score))
# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
score = np.sqrt(metrics.mean_squared_error(oos_pred,oos_y))
print("Final, out of sample score (RMSE): {}".format(score))
# Write the cross-validated prediction
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
oosDF = pd.concat( [df, oos_y, oos_pred],axis=1 )
oosDF.to_csv(filename_write,index=False)
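The fold mechanics can be sketched without TensorFlow. In the example below, a least-squares linear fit stands in for the neural network, and the data are synthetic; both are assumptions made so the sketch runs on its own. The helper `kfold_indices` is a hypothetical name, not a scikit-learn function.

```python
import numpy as np

def kfold_indices(n_rows, n_folds):
    # Split the row indices into n_folds consecutive chunks; each chunk
    # takes one turn as the test set while the rest form the training set.
    folds = np.array_split(np.arange(n_rows), n_folds)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.RandomState(42)
x = rng.normal(size=(50, 3))
y = x @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

oos_pred = np.empty_like(y)
for train_idx, test_idx in kfold_indices(len(x), 5):
    # One model per fold; a linear least-squares fit stands in for the network
    coef, *_ = np.linalg.lstsq(x[train_idx], y[train_idx], rcond=None)
    # Each fold's model predicts only the rows it never saw
    oos_pred[test_idx] = x[test_idx] @ coef

rmse = np.sqrt(np.mean((oos_pred - y) ** 2))
print("Out-of-sample RMSE over all rows: {:.4f}".format(rmse))
```

Every row receives exactly one prediction, and that prediction comes from a model that was not trained on the row, which is what makes the combined score an out-of-sample estimate.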
If you have a considerable amount of data, it is valuable to set aside a holdout set before you cross validate. This holdout set will be the final evaluation before you make use of your model for its real-world use.
The following program makes use of a holdout set, and then still cross validates.
In [27]:
import tensorflow.contrib.learn as skflow
from sklearn.cross_validation import train_test_split
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore
from sklearn.cross_validation import KFold
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
filename_write = os.path.join(path,"auto-mpg-holdout.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
# create feature vector
missing_median(df, 'horsepower')
df.drop('name',1,inplace=True)
encode_numeric_zscore(df, 'horsepower')
encode_numeric_zscore(df, 'weight')
encode_numeric_zscore(df, 'cylinders')
encode_numeric_zscore(df, 'displacement')
encode_numeric_zscore(df, 'acceleration')
encode_text_dummy(df, 'origin')
# Shuffle
np.random.seed(42)
df = df.reindex(np.random.permutation(df.index))
df.reset_index(inplace=True, drop=True)
# Encode to a 2D matrix for training
x,y = to_xy(df,['mpg'])
# Keep a 10% holdout
x_main, x_holdout, y_main, y_holdout = train_test_split(
x, y, test_size=0.10)
# Cross validate
kf = KFold(len(x_main), n_folds=5)
oos_y = []
oos_pred = []
fold = 1
for train, test in kf:
    print("Fold #{}".format(fold))
    fold+=1
    x_train = x_main[train]
    y_train = y_main[train]
    x_test = x_main[test]
    y_test = y_main[test]
    # Create a deep neural network with 3 hidden layers of 10, 20, 10
    regressor = skflow.TensorFlowDNNRegressor(hidden_units=[10, 20, 10], steps=500)
    # Early stopping
    early_stop = skflow.monitors.ValidationMonitor(x_test, y_test,
        early_stopping_rounds=200, print_steps=50)
    # Fit/train neural network
    regressor.fit(x_train, y_train, monitor=early_stop)
    # Add the predictions to the OOS prediction list
    pred = regressor.predict(x_test)
    oos_y.append(y_test)
    oos_pred.append(pred)
    # Measure this fold's error (RMSE)
    score = np.sqrt(metrics.mean_squared_error(pred,y_test))
    print("Fold score (RMSE): {}".format(score))
# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
score = np.sqrt(metrics.mean_squared_error(oos_pred,oos_y))
print()
print("Cross-validated score (RMSE): {}".format(score))
# Evaluate on the holdout set (using the model trained on the last fold)
holdout_pred = regressor.predict(x_holdout)
score = np.sqrt(metrics.mean_squared_error(holdout_pred,y_holdout))
print("Holdout score (RMSE): {}".format(score))
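The order of operations matters here: the holdout rows are removed first, and only the remaining rows are divided into folds. The index-only sketch below illustrates that ordering with NumPy; the row count and variable names are assumptions, and no model is trained.

```python
import numpy as np

rng = np.random.RandomState(42)
n_rows = 100                       # hypothetical dataset size
indices = rng.permutation(n_rows)  # shuffle before splitting

# Step 1: set aside 10% of the rows as the holdout
n_holdout = int(n_rows * 0.10)
holdout_idx = indices[:n_holdout]
main_idx = indices[n_holdout:]

# Step 2: cross validate on the remaining rows only
folds = np.array_split(main_idx, 5)
for i, fold in enumerate(folds, start=1):
    print("Fold #{}: {} test rows".format(i, len(fold)))

print("Holdout rows kept for the final evaluation: {}".format(len(holdout_idx)))
```

Because the holdout indices never appear in any fold, the holdout score is not influenced by early stopping or by any choice made during cross validation.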
Kaggle is a platform for competitive data science. Competitions are posted onto Kaggle by companies seeking the best model for their data. Competing in a Kaggle competition is quite a bit of work; I've competed in one Kaggle competition.
Kaggle awards "tiers", such as:
Your tier is based on your performance in past competitions.
To compete in Kaggle you simply provide predictions for a dataset that they post. You do not need to submit any code. Your prediction output will place you onto the leaderboard of a competition.
An original dataset is sent to Kaggle by the company. From this dataset, Kaggle posts public data that includes "train" and "test". For the "train" data, the outcomes (y) are provided. For the "test" data, no outcomes are provided. Your submission file contains your predictions for the "test" data. When you submit your results, Kaggle will calculate a score on part of your prediction data. They do not publish what part of the submission data is used for the public and private leaderboard scores (this is a secret to prevent overfitting). While the competition is still running, Kaggle publishes the public leaderboard ranks. Once the competition ends, the private leaderboard is revealed to designate the true winners. Due to overfitting, there is sometimes an upset in positions when the final private leaderboard is revealed.
In [15]:
%matplotlib inline
from matplotlib.pyplot import figure, show
from numpy import arange
import tensorflow.contrib.learn as skflow
import pandas as pd
import os
import numpy as np
import tensorflow as tf
from sklearn import metrics
from scipy.stats import zscore
import matplotlib.pyplot as plt
path = "./data/"
filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])
# create feature vector
missing_median(df, 'horsepower')
df.drop('name',1,inplace=True)
encode_numeric_zscore(df, 'horsepower')
encode_numeric_zscore(df, 'weight')
encode_numeric_zscore(df, 'cylinders')
encode_numeric_zscore(df, 'displacement')
encode_numeric_zscore(df, 'acceleration')
encode_text_dummy(df, 'origin')
# Encode to a 2D matrix for training
x,y = to_xy(df,['mpg'])
# Split into train/test
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.25, random_state=42)
# Create a deep neural network with 3 hidden layers of 50, 25, 10
regressor = skflow.TensorFlowDNNRegressor(
hidden_units=[50, 25, 10],
batch_size = 32,
optimizer='SGD',
learning_rate=0.01,
steps=5000)
# Early stopping
early_stop = skflow.monitors.ValidationMonitor(x_test, y_test,
early_stopping_rounds=200, print_steps=50)
# Fit/train neural network
regressor.fit(x_train, y_train, monitor=early_stop)
# Measure RMSE error. RMSE is common for regression.
pred = regressor.predict(x_test)
score = np.sqrt(metrics.mean_squared_error(pred,y_test))
print("Final score (RMSE): {}".format(score))
# Plot the chart
chart_regression(pred,y_test)
Finding the right set of hyperparameters can be a large task. Often computational power is thrown at this job. The scikit-learn grid search can make use of your computer's CPU cores to try every combination of a defined set of hyperparameter values to see which gets the best score.
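The idea of trying every combination can be sketched with `itertools.product`. This is only an illustration of how the search space is enumerated, assuming a grid of three learning rates and three batch sizes; it is not how scikit-learn implements the search internally, and no models are trained.

```python
import itertools

# Assumed example grid: each key maps to the values to try
param_grid = {
    'learning_rate': [0.1, 0.01, 0.001],
    'batch_size': [8, 16, 32],
}

# Cartesian product of all value lists: one dict per combination
keys = list(param_grid.keys())
combos = [dict(zip(keys, values))
          for values in itertools.product(*param_grid.values())]

print("Number of combinations: {}".format(len(combos)))
for combo in combos:
    print(combo)
```

Three values for each of two hyperparameters gives nine combinations. The count multiplies with every hyperparameter added, which is why grid searches become expensive quickly.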
The following code shows how many CPU cores are available to Python:
In [16]:
import multiprocessing
print("Your system has {} cores.".format(multiprocessing.cpu_count()))
The following code performs a grid search over the combinations of hyperparameters that you specify. The search can be spread across the available cores through n_jobs; as written, the code uses a single job and leaves the call to multiprocessing.cpu_count() commented out.
In [17]:
%matplotlib inline
from matplotlib.pyplot import figure, show
from numpy import arange
import tensorflow.contrib.learn as skflow
import pandas as pd
import os
import numpy as np
import tensorflow as tf
from sklearn import metrics
from scipy.stats import zscore
from sklearn.grid_search import GridSearchCV
import multiprocessing
import time
from sklearn.cross_validation import train_test_split
import matplotlib.pyplot as plt
def main():
    path = "./data/"
    filename_read = os.path.join(path,"auto-mpg.csv")
    df = pd.read_csv(filename_read,na_values=['NA','?'])
    start_time = time.time()
    # create feature vector
    missing_median(df, 'horsepower')
    df.drop('name',1,inplace=True)
    encode_numeric_zscore(df, 'horsepower')
    encode_numeric_zscore(df, 'weight')
    encode_numeric_zscore(df, 'cylinders')
    encode_numeric_zscore(df, 'displacement')
    encode_numeric_zscore(df, 'acceleration')
    encode_text_dummy(df, 'origin')
    # Encode to a 2D matrix for training
    x,y = to_xy(df,['mpg'])
    # Split into train/test
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.25, random_state=42)
    # The hyperparameters specified here will be searched. Every combination.
    param_grid = {
        'learning_rate': [0.1, 0.01, 0.001],
        'batch_size': [8, 16, 32]
    }
    # Create a deep neural network. The hyperparameters specified here remain fixed.
    model = skflow.TensorFlowDNNRegressor(
        hidden_units=[50, 25, 10],
        batch_size = 32,
        optimizer='SGD',
        steps=5000)
    # Early stopping
    early_stop = skflow.monitors.ValidationMonitor(x_test, y_test,
        early_stopping_rounds=200, print_steps=50)
    # Startup grid search
    threads = 1 #multiprocessing.cpu_count()
    print("Using {} cores.".format(threads))
    regressor = GridSearchCV(model, verbose=True, n_jobs=threads,
        param_grid=param_grid,fit_params={'monitor':early_stop})
    # Fit/train neural network
    regressor.fit(x_train, y_train)
    # Measure RMSE error. RMSE is common for regression.
    pred = regressor.predict(x_test)
    score = np.sqrt(metrics.mean_squared_error(pred,y_test))
    print("Final score (RMSE): {}".format(score))
    print("Final options: {}".format(regressor.best_params_))
    # Plot the chart
    chart_regression(pred,y_test)
    elapsed_time = time.time() - start_time
    print("Elapsed time: {}".format(hms_string(elapsed_time)))

# Allow windows to multi-thread (unneeded on advanced OS's)
# See: https://docs.python.org/2/library/multiprocessing.html
if __name__ == '__main__':
    main()
The best combination of hyperparameters is displayed.
It is also possible to conduct a random search. The random search is similar to the grid search, except that the entire search space is not used. Rather, random points in the search space are tried. For a random search you must specify the number of hyperparameter iterations (n_iter) to try.
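The sampling idea behind a random search can be sketched with the standard library alone. The function name `sample_params`, the seed, and the ranges below are assumptions for illustration; the sketch only draws candidate settings and does not train or score any model.

```python
import random

def sample_params(n_iter, seed=42):
    # Draw n_iter random points from the search space instead of
    # enumerating every combination.
    rng = random.Random(seed)
    learning_rates = [0.1, 0.01, 0.001]
    samples = []
    for _ in range(n_iter):
        samples.append({
            # Pick one of the listed learning rates at random
            'learning_rate': rng.choice(learning_rates),
            # Pick an integer batch size from 4 up to (not including) 32
            'batch_size': rng.randrange(4, 32),
        })
    return samples

candidates = sample_params(n_iter=10)
for candidate in candidates:
    print(candidate)
```

The cost of the search is set by n_iter rather than by the size of the search space, so wide ranges such as the batch size above stay affordable.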
In [35]:
%matplotlib inline
from matplotlib.pyplot import figure, show
from numpy import arange
import tensorflow.contrib.learn as skflow
import pandas as pd
import os
import numpy as np
import tensorflow as tf
from sklearn import metrics
from scipy.stats import zscore
from scipy.stats import randint as sp_randint
from sklearn.grid_search import RandomizedSearchCV
import multiprocessing
import time
from sklearn.cross_validation import train_test_split
import matplotlib.pyplot as plt
def main():
    path = "./data/"
    filename_read = os.path.join(path,"auto-mpg.csv")
    df = pd.read_csv(filename_read,na_values=['NA','?'])
    start_time = time.time()
    # create feature vector
    missing_median(df, 'horsepower')
    df.drop('name',1,inplace=True)
    encode_numeric_zscore(df, 'horsepower')
    encode_numeric_zscore(df, 'weight')
    encode_numeric_zscore(df, 'cylinders')
    encode_numeric_zscore(df, 'displacement')
    encode_numeric_zscore(df, 'acceleration')
    encode_text_dummy(df, 'origin')
    # Encode to a 2D matrix for training
    x,y = to_xy(df,['mpg'])
    # Split into train/test
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.25, random_state=42)
    # The hyperparameters specified here will be searched. A random sample will be searched.
    param_dist = {
        'learning_rate': [0.1, 0.01, 0.001],
        'batch_size': sp_randint(4, 32),
    }
    model = skflow.TensorFlowDNNRegressor(
        hidden_units=[50, 25, 10],
        batch_size = 32,
        optimizer='SGD',
        steps=5000)
    # Early stopping
    early_stop = skflow.monitors.ValidationMonitor(x_test, y_test,
        early_stopping_rounds=200, print_steps=50)
    # Random search
    threads = 1 #multiprocessing.cpu_count()
    print("Using {} cores.".format(threads))
    regressor = RandomizedSearchCV(model, verbose=True, n_iter = 10,
        n_jobs=threads, param_distributions=param_dist,
        fit_params={'monitor':early_stop})
    # Fit/train neural network
    regressor.fit(x_train, y_train)
    # Measure RMSE error. RMSE is common for regression.
    pred = regressor.predict(x_test)
    score = np.sqrt(metrics.mean_squared_error(pred,y_test))
    print("Final score (RMSE): {}".format(score))
    print("Final options: {}".format(regressor.best_params_))
    # Plot the chart
    chart_regression(pred,y_test)
    elapsed_time = time.time() - start_time
    print("Elapsed time: {}".format(hms_string(elapsed_time)))

# Allow windows to multi-thread (unneeded on advanced OS's)
# See: https://docs.python.org/2/library/multiprocessing.html
if __name__ == '__main__':
    main()