Class 10: Recurrent and LSTM Networks
In [3]:
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Encode text values to dummy variables (i.e. [1,0,0],[0,1,0],[0,0,1] for red,green,blue)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name, x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)

# Encode text values to indexes (i.e. [1],[2],[3] for red,green,blue).
def encode_text_index(df, name):
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_

# Encode a numeric column as z-scores
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()
    if sd is None:
        sd = df[name].std()
    df[name] = (df[name] - mean) / sd

# Convert all missing values in the specified column to the median
def missing_median(df, name):
    med = df[name].median()
    df[name] = df[name].fillna(med)

# Convert a Pandas dataframe to the x,y inputs that TensorFlow needs
def to_xy(df, target):
    result = []
    for x in df.columns:
        if x != target:
            result.append(x)
    # find out the type of the target column. Is it really this hard? :(
    target_type = df[target].dtypes
    target_type = target_type[0] if hasattr(target_type, '__iter__') else target_type
    print(target_type)
    # Encode to int for classification, float otherwise. TensorFlow likes 32 bits.
    if target_type in (np.int64, np.int32):
        # Classification
        return df.as_matrix(result).astype(np.float32), df.as_matrix([target]).astype(np.int32)
    else:
        # Regression
        return df.as_matrix(result).astype(np.float32), df.as_matrix([target]).astype(np.float32)

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

# Regression chart, we will see more of this chart in the next class.
def chart_regression(pred, y):
    t = pd.DataFrame({'pred': pred.flatten(), 'y': y.flatten()})  # use the y passed in, not a global
    t.sort_values(by=['y'], inplace=True)
    plt.plot(t['y'].tolist(), label='expected')
    plt.plot(t['pred'].tolist(), label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()
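As a quick check of these helpers, here is a small sketch on a hypothetical toy data frame (the column names color, size, and price are made up for illustration; the sketch assumes the same older pandas API that to_xy uses):
In [ ]:
# Hypothetical toy data frame to exercise the helper functions above.
demo_df = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green'],  # text column
    'size': [10.0, 12.5, None, 11.0],            # numeric column with a missing value
    'price': [1.0, 2.0, 3.0, 4.0]                # regression target
})
missing_median(demo_df, 'size')                # fill the missing value with the median
encode_numeric_zscore(demo_df, 'size')         # convert size to z-scores
classes = encode_text_index(demo_df, 'color')  # text values become integer indexes
demo_x, demo_y = to_xy(demo_df, 'price')       # split into TensorFlow-ready matrices
print(classes)
print(demo_x.shape, demo_y.shape)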
Previously we trained neural networks with an input matrix ($x$) and an expected output ($y$). The rows of $x$ were training examples and the columns were the input values (features). The definition of $x$ will now be expanded, while $y$ stays the same.
Dimensions of the training set ($x$): the first axis indexes the training examples (sequences), the second indexes the members of each sequence, and the third indexes the features (like input neurons).
Previously, we might take a single stock price as input to predict whether we should buy (1), sell (-1), or hold (0).
In [ ]:
x = [
[32],
[41],
[39],
[20],
[15]
]
y = [
1,
-1,
0,
-1,
1
]
print(x)
print(y)
This is essentially building a CSV file from scratch. To see it as a data frame, use the following:
In [ ]:
from IPython.display import display, HTML
import pandas as pd
import numpy as np
x = np.array(x)
print(x[:,0])
df = pd.DataFrame({'x':x[:,0], 'y':y})
display(df)
You might want to put volume in with the stock price.
In [ ]:
x = [
[32,1383],
[41,2928],
[39,8823],
[20,1252],
[15,1532]
]
y = [
1,
-1,
0,
-1,
1
]
print(x)
print(y)
Again, this is very similar to what we did before. The following shows it as a data frame.
In [ ]:
from IPython.display import display, HTML
import pandas as pd
import numpy as np
x = np.array(x)
print(x[:,0])
df = pd.DataFrame({'price':x[:,0], 'volume':x[:,1], 'y':y})
display(df)
Now we get to sequence format. We want to predict something over a sequence, so the data format needs to add a dimension. A maximum sequence length must be specified, but the individual sequences can be of any length.
In [ ]:
x = [
[[32,1383],[41,2928],[39,8823],[20,1252],[15,1532]],
[[35,8272],[32,1383],[41,2928],[39,8823],[20,1252]],
[[37,2738],[35,8272],[32,1383],[41,2928],[39,8823]],
[[34,2845],[37,2738],[35,8272],[32,1383],[41,2928]],
[[32,2345],[34,2845],[37,2738],[35,8272],[32,1383]],
]
y = [
1,
-1,
0,
-1,
1
]
print(x)
print(y)
Even if there is only one feature (price), the 3rd dimension must be used:
In [ ]:
x = [
[[32],[41],[39],[20],[15]],
[[35],[32],[41],[39],[20]],
[[37],[35],[32],[41],[39]],
[[34],[37],[35],[32],[41]],
[[32],[34],[37],[35],[32]],
]
y = [
1,
-1,
0,
-1,
1
]
print(x)
print(y)
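In practice the raw sequences might not all reach the maximum length. A common convention (not shown in the data above; this is just a sketch) is to pad shorter sequences with zeros up to the maximum size:
In [ ]:
# Hypothetical sketch: pad variable-length sequences with zeros up to the maximum length.
import numpy as np

MAX_SEQUENCE_SIZE = 5
ragged = [
    [[32], [41], [39]],              # length 3
    [[35], [32], [41], [39]],        # length 4
    [[37], [35], [32], [41], [39]]   # length 5
]

padded = np.zeros((len(ragged), MAX_SEQUENCE_SIZE, 1), dtype=np.float32)
for i, seq in enumerate(ragged):
    padded[i, :len(seq), :] = seq    # copy the sequence; the remainder stays zero
print(padded.shape)                  # (3, 5, 1)
print(padded)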
So far the neural networks that we’ve examined have always had forward connections. The input layer always connects to the first hidden layer. Each hidden layer always connects to the next hidden layer. The final hidden layer always connects to the output layer. This manner of connecting layers is the reason that these networks are called “feedforward.” Recurrent neural networks are not so rigid, as backward connections are also allowed. A recurrent connection links a neuron in a layer to either a previous layer or the neuron itself. Most recurrent neural network architectures maintain state in the recurrent connections; feedforward neural networks don’t maintain any state. A recurrent neural network’s state acts as a sort of short-term memory for the neural network. Consequently, a recurrent neural network will not always produce the same output for a given input.
Recurrent neural networks do not force the connections to flow only from one layer to the next, from input layer to output layer. A recurrent connection occurs when a connection is formed between a neuron and either itself or a neuron in an earlier layer. Recurrent connections can never target the input neurons or the bias neurons.
The processing of recurrent connections can be challenging. Because the recurrent links create endless loops, the neural network must have some way to know when to stop; a neural network that entered an endless loop would not be useful. To prevent endless loops, the calculation of the recurrent connections must be bounded in some way. One such approach, described next, is the context neuron.
We refer to neural networks that use context neurons as a simple recurrent network (SRN). The context neuron is a special neuron type that remembers its input and provides that input as its output the next time that we calculate the network. For example, if we gave a context neuron 0.5 as input, it would output 0. Context neurons always output 0 on their first call. However, if we gave the context neuron a 0.6 as input, the output would be 0.5. We never weight the input connections to a context neuron, but we can weight the output from a context neuron just like any other connection in a network.
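A minimal sketch (plain Python, just for illustration) of this behavior:
In [ ]:
# Minimal sketch of a context neuron: it outputs its previous input, starting at 0.
class ContextNeuron:
    def __init__(self):
        self.state = 0.0          # first call always outputs 0

    def compute(self, value):
        output = self.state       # return what was remembered
        self.state = value        # remember the new input for the next call
        return output

cn = ContextNeuron()
print(cn.compute(0.5))  # 0.0
print(cn.compute(0.6))  # 0.5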
Context neurons allow us to calculate a neural network in a single feedforward pass. Context neurons usually occur in layers. A layer of context neurons will always have the same number of context neurons as neurons in its source layer, as demonstrated here:
As you can see from the above layer, two hidden neurons that are labeled hidden 1 and hidden 2 directly connect to the two context neurons. The dashed lines on these connections indicate that these are not weighted connections. These weightless connections are never dense. If these connections were dense, hidden 1 would be connected to both hidden 1 and hidden 2. However, the direct connection simply joins each hidden neuron to its corresponding context neuron. The two context neurons form dense, weighted connections to the two hidden neurons. Finally, the two hidden neurons also form dense connections to the neurons in the next layer. The two context neurons would form two connections to a single neuron in the next layer, four connections to two neurons, six connections to three neurons, and so on.
You can combine context neurons with the input, hidden, and output layers of a neural network in many different ways. In the next two sections, we explore two common SRN architectures.
In 1990, Elman introduced a neural network that provides pattern recognition to time series. This neural network type has one input neuron for each stream that you are using to predict. There is one output neuron for each time slice you are trying to predict. A single hidden layer is positioned between the input and output layers. A layer of context neurons takes its input from the hidden layer output and feeds back into the same hidden layer. Consequently, the context layer always has the same number of neurons as the hidden layer, as demonstrated here:
The Elman neural network is a good general-purpose architecture for simple recurrent neural networks. You can pair any reasonable number of input neurons to any number of output neurons. Using normal weighted connections, the two context neurons are fully connected with the two hidden neurons. The two context neurons receive their state from the two non-weighted connections (dashed lines) from each of the two hidden neurons.
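Below is a rough numpy sketch of an Elman-style forward pass with two inputs, two hidden neurons (and matching context neurons), and one output. The weights are random placeholders and the biases are omitted for brevity; this is an illustration of the data flow, not a trained network.
In [ ]:
# Rough sketch of an Elman SRN forward pass (random placeholder weights, biases omitted).
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.RandomState(42)
W_xh = rng.randn(2, 2) * 0.1       # input -> hidden (weighted)
W_ch = rng.randn(2, 2) * 0.1       # context -> hidden (weighted)
W_hy = rng.randn(2, 1) * 0.1       # hidden -> output (weighted)
context = np.zeros(2)              # context neurons start out at 0

sequence = [np.array([0.3, 0.1]), np.array([-0.2, 0.4])]
for x_t in sequence:
    hidden = sigmoid(x_t.dot(W_xh) + context.dot(W_ch))  # hidden sees input + context
    y_t = sigmoid(hidden.dot(W_hy))                      # output for this time slice
    context = hidden.copy()        # non-weighted copy: context remembers the hidden output
    print(y_t)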
Backpropagation through time works by unfolding the SRN to become a regular neural network. To unfold the SRN, we construct a chain of neural networks equal to how far back in time we wish to go. We start with a neural network that contains the inputs for the current time, known as t. Next we replace the context with the entire neural network, up to the context neuron’s input. We continue for the desired number of time slices and replace the final context neuron with a 0. The following diagram shows an unfolded Elman neural network for two time slices.
As you can see, there are inputs for both t (current time) and t-1 (one time slice in the past). The bottom neural network stops at the hidden neurons because you don’t need everything beyond the hidden neurons to calculate the context input. The bottom network structure becomes the context to the top network structure. Of course, the bottom structure would have had a context as well that connects to its hidden neurons. However, because the output neuron above does not contribute to the context, only the top network (current time) has one.
Some useful resources on LSTM/recurrent neural networks.
Long Short-Term Memory (LSTM) is a type of recurrent unit that is often used with deep neural networks. In TensorFlow, LSTM can be thought of as a layer type that can be combined with other layer types, such as dense. LSTM makes use of two transfer function types internally.
The first type of transfer function is the sigmoid. This transfer function type is used to form the gates inside of the unit. The sigmoid transfer function is given by the following equation:
$$ \text{S}(t) = \frac{1}{1 + e^{-t}} $$
The second type of transfer function is the hyperbolic tangent (tanh) function. This function is used to scale the output of the LSTM, similarly to how other transfer functions have been used in this course.
The graphs for these functions are shown here:
In [ ]:
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import math
def sigmoid(x):
    a = []
    for item in x:
        a.append(1/(1+math.exp(-item)))
    return a

def f2(x):
    a = []
    for item in x:
        a.append(math.tanh(item))
    return a
x = np.arange(-10., 10., 0.2)
y1 = sigmoid(x)
y2 = f2(x)
print("Sigmoid")
plt.plot(x,y1)
plt.show()
print("Hyperbolic Tangent(tanh)")
plt.plot(x,y2)
plt.show()
Both of these two functions compress their output to a specific range. For the sigmoid function, this range is 0 to 1. For the hyperbolic tangent function, this range is -1 to 1.
LSTM maintains an internal state and produces an output. The following diagram shows an LSTM unit over three time slices: the current time slice (t), as well as the previous (t-1) and next (t+1) slice:
The values $\hat{y}$ are the output from the unit, the values $x$ are the input to the unit, and the values $c$ are the context values. Both the output and context values are always fed to the next time slice. The context values allow the LSTM to carry state forward from one time slice to the next.
LSTM is made up of three gates: the forget gate ($f$), the input gate ($i$), and the output gate ($o$).
Mathematically, the above diagram can be thought of as the following:
These are vector values.
First, calculate the forget gate value ($f_t$). This gate determines whether the short-term memory is forgotten. The value $b$ is a bias, just like the bias neurons we saw before, except that LSTM has a bias for every gate: $b_f$, $b_i$, and $b_o$.
$$ f_t = S(W_f \cdot [\hat{y}_{t-1}, x_t] + b_f) $$
Next, calculate the input gate value. This gate's value determines what will be remembered.
$$ i_t = S(W_i \cdot [\hat{y}_{t-1},x_t] + b_i) $$
Calculate a candidate context value (a value that might be remembered). This value is called $\tilde{C}_t$.
$$ \tilde{C}_t = \tanh(W_C \cdot [\hat{y}_{t-1},x_t]+b_C) $$
Determine the new context ($C_t$). Do this by keeping the candidate context ($\tilde{C}_t$) to the degree allowed by the input gate ($i_t$), and keeping the previous context ($C_{t-1}$) to the degree allowed by the forget gate ($f_t$).
$$ C_t = f_t \cdot C_{t-1}+i_t \cdot \tilde{C}_t $$
Calculate the output gate ($o_t$):
$$ o_t = S(W_o \cdot [\hat{y}_{t-1},x_t] + b_o ) $$
Calculate the actual output ($\hat{y}_t$):
$$ \hat{y}_t = o_t \cdot \tanh(C_t) $$
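To make the equations concrete, here is a minimal numpy sketch of a single LSTM step that follows the formulas above. The weights are random placeholders and the sizes are arbitrary; it is an illustration, not TensorFlow's implementation.
In [ ]:
# Minimal numpy sketch of one LSTM step, following the equations above.
import numpy as np

def S(v):                              # sigmoid, as defined earlier
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.RandomState(0)
input_size, hidden_size = 1, 3
concat = hidden_size + input_size      # size of [y_{t-1}, x_t]

W_f, b_f = rng.randn(hidden_size, concat) * 0.1, np.zeros(hidden_size)
W_i, b_i = rng.randn(hidden_size, concat) * 0.1, np.zeros(hidden_size)
W_C, b_C = rng.randn(hidden_size, concat) * 0.1, np.zeros(hidden_size)
W_o, b_o = rng.randn(hidden_size, concat) * 0.1, np.zeros(hidden_size)

def lstm_step(x_t, y_prev, C_prev):
    z = np.concatenate([y_prev, x_t])      # [y_{t-1}, x_t]
    f_t = S(W_f.dot(z) + b_f)              # forget gate
    i_t = S(W_i.dot(z) + b_i)              # input gate
    C_tilde = np.tanh(W_C.dot(z) + b_C)    # candidate context
    C_t = f_t * C_prev + i_t * C_tilde     # new context
    o_t = S(W_o.dot(z) + b_o)              # output gate
    y_t = o_t * np.tanh(C_t)               # unit output
    return y_t, C_t

y, C = np.zeros(hidden_size), np.zeros(hidden_size)
for x in [np.array([0.5]), np.array([0.1]), np.array([-0.3])]:
    y, C = lstm_step(x, y, C)
    print(y)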
In [47]:
import numpy as np
import pandas
import tensorflow as tf
from sklearn import metrics
from tensorflow.models.rnn import rnn, rnn_cell
from tensorflow.contrib import skflow
SEQUENCE_SIZE = 6
HIDDEN_SIZE = 20
NUM_CLASSES = 4
def char_rnn_model(X, y):
    byte_list = skflow.ops.split_squeeze(1, SEQUENCE_SIZE, X)
    cell = rnn_cell.LSTMCell(HIDDEN_SIZE)
    _, encoding = rnn.rnn(cell, byte_list, dtype=tf.float32)
    return skflow.models.logistic_regression(encoding, y)

classifier = skflow.TensorFlowEstimator(model_fn=char_rnn_model, n_classes=NUM_CLASSES,
                                        steps=100, optimizer='Adam', learning_rate=0.01,
                                        continue_training=True)
The following code trains on a data set (x) with a maximum sequence size of 6 (columns) and 6 training elements (rows).
In [48]:
x = [
[[0],[1],[1],[0],[0],[0]],
[[0],[0],[0],[2],[2],[0]],
[[0],[0],[0],[0],[3],[3]],
[[0],[2],[2],[0],[0],[0]],
[[0],[0],[3],[3],[0],[0]],
[[0],[0],[0],[0],[1],[1]]
]
x = np.array(x,dtype=np.float32)
y = np.array([1,2,3,2,3,1])
classifier.fit(x, y)
Out[48]:
In [49]:
test = [[[0],[0],[0],[0],[3],[3]]]
test = np.array(test)
classifier.predict(test)
Out[49]:
In [14]:
# How to read data from the stock market.
from IPython.display import display, HTML
import pandas.io.data as web
import datetime
start = datetime.datetime(2014, 1, 1)
end = datetime.datetime(2014, 12, 31)
f=web.DataReader('tsla', 'yahoo', start, end)
display(f)
In [15]:
import numpy as np
prices = f.Close.pct_change().tolist() # to percent changes
prices = prices[1:] # skip the first, no percent change
SEQUENCE_SIZE = 5
x = []
y = []
for i in range(len(prices)-SEQUENCE_SIZE-1):
    #print(i)
    window = prices[i:(i+SEQUENCE_SIZE)]
    after_window = prices[i+SEQUENCE_SIZE]
    window = [[p] for p in window]  # each sequence member is a 1-feature list
    #print("{} - {}".format(window,after_window))
    x.append(window)
    y.append(after_window)
x = np.array(x)
print(len(x))
In [16]:
from tensorflow.contrib import skflow
from tensorflow.models.rnn import rnn, rnn_cell
import tensorflow as tf
HIDDEN_SIZE = 20
def char_rnn_model(X, y):
    byte_list = skflow.ops.split_squeeze(1, SEQUENCE_SIZE, X)
    cell = rnn_cell.LSTMCell(HIDDEN_SIZE)
    _, encoding = rnn.rnn(cell, byte_list, dtype=tf.float32)
    return skflow.models.linear_regression(encoding, y)

regressor = skflow.TensorFlowEstimator(model_fn=char_rnn_model, n_classes=1,
                                       steps=100, optimizer='Adam', learning_rate=0.01,
                                       continue_training=True)
regressor.fit(x, y)
Out[16]:
In [17]:
# Try an in-sample prediction
from sklearn import metrics
# Measure RMSE error. RMSE is common for regression.
pred = regressor.predict(x)
score = np.sqrt(metrics.mean_squared_error(pred,y))
print("Final score (RMSE): {}".format(score))
In [19]:
# Try out of sample
start = datetime.datetime(2015, 1, 1)
end = datetime.datetime(2015, 12, 31)
f=web.DataReader('tsla', 'yahoo', start, end)
import numpy as np
prices = f.Close.pct_change().tolist() # to percent changes
prices = prices[1:] # skip the first, no percent change
SEQUENCE_SIZE = 5
x = []
y = []
for i in range(len(prices)-SEQUENCE_SIZE-1):
    window = prices[i:(i+SEQUENCE_SIZE)]
    after_window = prices[i+SEQUENCE_SIZE]
    window = [[p] for p in window]  # each sequence member is a 1-feature list
    x.append(window)
    y.append(after_window)
x = np.array(x)
# Measure RMSE error. RMSE is common for regression.
pred = regressor.predict(x)
score = np.sqrt(metrics.mean_squared_error(pred,y))
print("Out of sample score (RMSE): {}".format(score))
In [40]:
import os
import pandas as pd
from sklearn.cross_validation import train_test_split
import tensorflow.contrib.learn as skflow
import numpy as np
from sklearn import metrics
path = "./data/"
filename = os.path.join(path,"t81_558_train.csv")
train_df = pd.read_csv(filename)
train_df.drop('id',1,inplace=True)
train_x, train_y = to_xy(train_df,'outcome')
train_x, test_x, train_y, test_y = train_test_split(
    train_x, train_y, test_size=0.25, random_state=42)
# Create a deep neural network with 3 hidden layers of 50, 25, 10
regressor = skflow.TensorFlowDNNRegressor(hidden_units=[50, 25, 10], steps=5000)
# Early stopping
early_stop = skflow.monitors.ValidationMonitor(test_x, test_y,
    early_stopping_rounds=200, print_steps=50)
# Fit/train neural network
regressor.fit(train_x, train_y, monitor=early_stop)
# Measure RMSE error. RMSE is common for regression.
pred = regressor.predict(test_x)
score = np.sqrt(metrics.mean_squared_error(pred,test_y))
print("Final score (RMSE): {}".format(score))
####################
# Build submit file
####################
from IPython.display import display, HTML
filename = os.path.join(path,"t81_558_test.csv")
submit_df = pd.read_csv(filename)
ids = submit_df.Id
submit_df.drop('Id',1,inplace=True)
submit_x = submit_df.as_matrix()
pred_submit = regressor.predict(submit_x)
submit_df = pd.DataFrame({'Id': ids, 'outcome': pred_submit[:,0]})
submit_filename = os.path.join(path,"t81_558_jheaton_submit.csv")
submit_df.to_csv(submit_filename, index=False)
display(submit_df)
The following code uses a random forest to rank the importance of features. This can be used both to rank the original features and any new ones created.
In [41]:
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestRegressor
# Build a forest and compute the feature importances
forest = RandomForestRegressor(n_estimators=50,
                               random_state=0, verbose=True)
print("Training random forest")
forest.fit(train_x, train_y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]
# Print the feature ranking
#train_df.drop('outcome',1,inplace=True)
bag_cols = train_df.columns.values
print("Feature ranking:")
for f in range(train_x.shape[1]):
    print("{}. {} ({})".format(f + 1, bag_cols[indices[f]], importances[indices[f]]))
The following code uses engineered features.
In [45]:
import os
import pandas as pd
from sklearn.cross_validation import train_test_split
import tensorflow.contrib.learn as skflow
import numpy as np
from sklearn import metrics
path = "./data/"
filename = os.path.join(path,"t81_558_train.csv")
train_df = pd.read_csv(filename)
train_df.drop('id',1,inplace=True)
#train_df.drop('g',1,inplace=True)
#train_df.drop('e',1,inplace=True)
train_df.insert(0, "a-b", train_df.a - train_df.b)
#display(train_df)
train_x, train_y = to_xy(train_df,'outcome')
train_x, test_x, train_y, test_y = train_test_split(
    train_x, train_y, test_size=0.25, random_state=42)
# Create a deep neural network with 3 hidden layers of 50, 25, 10
regressor = skflow.TensorFlowDNNRegressor(hidden_units=[50, 25, 10], steps=5000)
# Early stopping
early_stop = skflow.monitors.ValidationMonitor(test_x, test_y,
    early_stopping_rounds=200, print_steps=50)
# Fit/train neural network
regressor.fit(train_x, train_y, monitor=early_stop)
# Measure RMSE error. RMSE is common for regression.
pred = regressor.predict(test_x)
score = np.sqrt(metrics.mean_squared_error(pred,test_y))
print("Final score (RMSE): {}".format(score))
# foxtrot bravo
# charlie alpha
In [ ]: