Class 10: Recurrent and LSTM Networks
In [3]:
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Encode text values to dummy variables (i.e. [1,0,0],[0,1,0],[0,0,1] for red,green,blue)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name, x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)

# Encode text values to indexes (i.e. [1],[2],[3] for red,green,blue).
def encode_text_index(df, name):
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_

# Encode a numeric column as z-scores
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()
    if sd is None:
        sd = df[name].std()
    df[name] = (df[name] - mean) / sd

# Convert all missing values in the specified column to the median
def missing_median(df, name):
    med = df[name].median()
    df[name] = df[name].fillna(med)

# Convert a Pandas dataframe to the x,y inputs that TensorFlow needs
def to_xy(df, target):
    result = []
    for x in df.columns:
        if x != target:
            result.append(x)
    # find out the type of the target column. Is it really this hard? :(
    target_type = df[target].dtypes
    target_type = target_type[0] if hasattr(target_type, '__iter__') else target_type
    print(target_type)
    # Encode to int for classification, float otherwise. TensorFlow likes 32 bits.
    if target_type in (np.int64, np.int32):
        # Classification
        return df.as_matrix(result).astype(np.float32), df.as_matrix([target]).astype(np.int32)
    else:
        # Regression
        return df.as_matrix(result).astype(np.float32), df.as_matrix([target]).astype(np.float32)

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

# Regression chart, we will see more of this chart in the next class.
def chart_regression(pred, y):
    t = pd.DataFrame({'pred': pred.flatten(), 'y': y.flatten()})  # use the y passed in, not a global
    t.sort_values(by=['y'], inplace=True)
    plt.plot(t['y'].tolist(), label='expected')
    plt.plot(t['pred'].tolist(), label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()
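As a quick check of these helpers, here is a small sketch on a hypothetical toy data frame (the column names color, size, and price are made up for illustration; the sketch assumes the same older pandas API that to_xy uses):
In [ ]:
# Hypothetical toy data frame to exercise the helper functions above.
demo_df = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green'],  # text column
    'size': [10.0, 12.5, None, 11.0],            # numeric column with a missing value
    'price': [1.0, 2.0, 3.0, 4.0]                # regression target
})
missing_median(demo_df, 'size')                # fill the missing value with the median
encode_numeric_zscore(demo_df, 'size')         # convert size to z-scores
classes = encode_text_index(demo_df, 'color')  # text values become integer indexes
demo_x, demo_y = to_xy(demo_df, 'price')       # split into TensorFlow-ready matrices
print(classes)
print(demo_x.shape, demo_y.shape)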
Previously we trained neural networks with an input matrix ($x$) and an expected output ($y$). The rows of $x$ were training examples and the columns were the input values (features). The definition of $x$ will now be expanded, while $y$ stays the same.
Dimensions of the training set ($x$): the first axis indexes the training examples (sequences), the second indexes the members of each sequence, and the third indexes the features (like input neurons).
Previously, we might take a single stock price as input to predict whether we should buy (1), sell (-1), or hold (0).
In [ ]:
x = [
[32],
[41],
[39],
[20],
[15]
]
y = [
1,
-1,
0,
-1,
1
]
print(x)
print(y)
This is essentially building a CSV file from scratch. To see it as a data frame, use the following:
In [ ]:
from IPython.display import display, HTML
import pandas as pd
import numpy as np
x = np.array(x)
print(x[:,0])
df = pd.DataFrame({'x':x[:,0], 'y':y})
display(df)
You might want to put volume in with the stock price.
In [ ]:
x = [
[32,1383],
[41,2928],
[39,8823],
[20,1252],
[15,1532]
]
y = [
1,
-1,
0,
-1,
1
]
print(x)
print(y)
Again, this is very similar to what we did before. The following shows it as a data frame.
In [ ]:
from IPython.display import display, HTML
import pandas as pd
import numpy as np
x = np.array(x)
print(x[:,0])
df = pd.DataFrame({'price':x[:,0], 'volume':x[:,1], 'y':y})
display(df)
Now we get to sequence format. We want to predict something over a sequence, so the data format needs to add a dimension. A maximum sequence length must be specified, but the individual sequences can be of any length.
In [ ]:
x = [
[[32,1383],[41,2928],[39,8823],[20,1252],[15,1532]],
[[35,8272],[32,1383],[41,2928],[39,8823],[20,1252]],
[[37,2738],[35,8272],[32,1383],[41,2928],[39,8823]],
[[34,2845],[37,2738],[35,8272],[32,1383],[41,2928]],
[[32,2345],[34,2845],[37,2738],[35,8272],[32,1383]],
]
y = [
1,
-1,
0,
-1,
1
]
print(x)
print(y)
Even if there is only one feature (price), the 3rd dimension must be used:
In [ ]:
x = [
[[32],[41],[39],[20],[15]],
[[35],[32],[41],[39],[20]],
[[37],[35],[32],[41],[39]],
[[34],[37],[35],[32],[41]],
[[32],[34],[37],[35],[32]],
]
y = [
1,
-1,
0,
-1,
1
]
print(x)
print(y)
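In practice the raw sequences might not all reach the maximum length. A common convention (not shown in the data above; this is just a sketch) is to pad shorter sequences with zeros up to the maximum size:
In [ ]:
# Hypothetical sketch: pad variable-length sequences with zeros up to the maximum length.
import numpy as np

MAX_SEQUENCE_SIZE = 5
ragged = [
    [[32], [41], [39]],              # length 3
    [[35], [32], [41], [39]],        # length 4
    [[37], [35], [32], [41], [39]]   # length 5
]

padded = np.zeros((len(ragged), MAX_SEQUENCE_SIZE, 1), dtype=np.float32)
for i, seq in enumerate(ragged):
    padded[i, :len(seq), :] = seq    # copy the sequence; the remainder stays zero
print(padded.shape)                  # (3, 5, 1)
print(padded)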
So far the neural networks that we’ve examined have always had forward connections. The input layer always connects to the first hidden layer. Each hidden layer always connects to the next hidden layer. The final hidden layer always connects to the output layer. This manner of connecting layers is the reason that these networks are called “feedforward.” Recurrent neural networks are not so rigid, as backward connections are also allowed. A recurrent connection links a neuron in a layer to either a previous layer or the neuron itself. Most recurrent neural network architectures maintain state in the recurrent connections; feedforward neural networks don’t maintain any state. A recurrent neural network’s state acts as a sort of short-term memory for the neural network. Consequently, a recurrent neural network will not always produce the same output for a given input.
Recurrent neural networks do not force the connections to flow only from one layer to the next, from input layer to output layer. A recurrent connection occurs when a connection is formed between a neuron and either itself or a neuron in an earlier layer. Recurrent connections can never target the input neurons or the bias neurons.
The processing of recurrent connections can be challenging. Because the recurrent links create endless loops, the neural network must have some way to know when to stop; a neural network that entered an endless loop would not be useful. To prevent endless loops, the calculation of the recurrent connections must be bounded in some way. One such approach, described next, is the context neuron.
We refer to neural networks that use context neurons as a simple recurrent network (SRN). The context neuron is a special neuron type that remembers its input and provides that input as its output the next time that we calculate the network. For example, if we gave a context neuron 0.5 as input, it would output 0. Context neurons always output 0 on their first call. However, if we gave the context neuron a 0.6 as input, the output would be 0.5. We never weight the input connections to a context neuron, but we can weight the output from a context neuron just like any other connection in a network.
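A minimal sketch (plain Python, just for illustration) of this behavior:
In [ ]:
# Minimal sketch of a context neuron: it outputs its previous input, starting at 0.
class ContextNeuron:
    def __init__(self):
        self.state = 0.0          # first call always outputs 0

    def compute(self, value):
        output = self.state       # return what was remembered
        self.state = value        # remember the new input for the next call
        return output

cn = ContextNeuron()
print(cn.compute(0.5))  # 0.0
print(cn.compute(0.6))  # 0.5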
Context neurons allow us to calculate a neural network in a single feedforward pass. Context neurons usually occur in layers. A layer of context neurons will always have the same number of context neurons as neurons in its source layer, as demonstrated here:
As you can see from the above layer, two hidden neurons that are labeled hidden 1 and hidden 2 directly connect to the two context neurons. The dashed lines on these connections indicate that these are not weighted connections. These weightless connections are never dense. If these connections were dense, hidden 1 would be connected to both hidden 1 and hidden 2. However, the direct connection simply joins each hidden neuron to its corresponding context neuron. The two context neurons form dense, weighted connections to the two hidden neurons. Finally, the two hidden neurons also form dense connections to the neurons in the next layer. The two context neurons would form two connections to a single neuron in the next layer, four connections to two neurons, six connections to three neurons, and so on.
You can combine context neurons with the input, hidden, and output layers of a neural network in many different ways. In the next two sections, we explore two common SRN architectures.
In 1990, Elman introduced a neural network that provides pattern recognition to time series. This neural network type has one input neuron for each stream that you are using to predict. There is one output neuron for each time slice you are trying to predict. A single hidden layer is positioned between the input and output layers. A layer of context neurons takes its input from the hidden layer output and feeds back into the same hidden layer. Consequently, the context layer always has the same number of neurons as the hidden layer, as demonstrated here:
The Elman neural network is a good general-purpose architecture for simple recurrent neural networks. You can pair any reasonable number of input neurons to any number of output neurons. Using normal weighted connections, the two context neurons are fully connected with the two hidden neurons. The two context neurons receive their state from the two non-weighted connections (dashed lines) from each of the two hidden neurons.
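Below is a rough numpy sketch of an Elman-style forward pass with two inputs, two hidden neurons (and matching context neurons), and one output. The weights are random placeholders and the biases are omitted for brevity; this is an illustration of the data flow, not a trained network.
In [ ]:
# Rough sketch of an Elman SRN forward pass (random placeholder weights, biases omitted).
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.RandomState(42)
W_xh = rng.randn(2, 2) * 0.1       # input -> hidden (weighted)
W_ch = rng.randn(2, 2) * 0.1       # context -> hidden (weighted)
W_hy = rng.randn(2, 1) * 0.1       # hidden -> output (weighted)
context = np.zeros(2)              # context neurons start out at 0

sequence = [np.array([0.3, 0.1]), np.array([-0.2, 0.4])]
for x_t in sequence:
    hidden = sigmoid(x_t.dot(W_xh) + context.dot(W_ch))  # hidden sees input + context
    y_t = sigmoid(hidden.dot(W_hy))                      # output for this time slice
    context = hidden.copy()        # non-weighted copy: context remembers the hidden output
    print(y_t)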
Backpropagation through time works by unfolding the SRN to become a regular neural network. To unfold the SRN, we construct a chain of neural networks equal to how far back in time we wish to go. We start with a neural network that contains the inputs for the current time, known as t. Next we replace the context with the entire neural network, up to the context neuron’s input. We continue for the desired number of time slices and replace the final context neuron with a 0. The following diagram shows an unfolded Elman neural network for two time slices.
As you can see, there are inputs for both t (current time) and t-1 (one time slice in the past). The bottom neural network stops at the hidden neurons because you don’t need everything beyond the hidden neurons to calculate the context input. The bottom network structure becomes the context to the top network structure. Of course, the bottom structure would have had a context as well that connects to its hidden neurons. However, because the output neuron above does not contribute to the context, only the top network (current time) has one.
Some useful resources on LSTM/recurrent neural networks.
Long Short-Term Memory (LSTM) is a type of recurrent unit that is often used with deep neural networks. In TensorFlow, LSTM can be thought of as a layer type that can be combined with other layer types, such as dense. LSTM makes use of two transfer function types internally.
The first type of transfer function is the sigmoid. This transfer function type is used to form the gates inside of the unit. The sigmoid transfer function is given by the following equation:
$$ \text{S}(t) = \frac{1}{1 + e^{-t}} $$
The second type of transfer function is the hyperbolic tangent (tanh) function. This function is used to scale the output of the LSTM, similarly to how other transfer functions have been used in this course.
The graphs for these functions are shown here:
In [ ]:
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import math
def sigmoid(x):
    a = []
    for item in x:
        a.append(1/(1+math.exp(-item)))
    return a

def f2(x):
    a = []
    for item in x:
        a.append(math.tanh(item))
    return a
x = np.arange(-10., 10., 0.2)
y1 = sigmoid(x)
y2 = f2(x)
print("Sigmoid")
plt.plot(x,y1)
plt.show()
print("Hyperbolic Tangent(tanh)")
plt.plot(x,y2)
plt.show()
Both of these two functions compress their output to a specific range. For the sigmoid function, this range is 0 to 1. For the hyperbolic tangent function, this range is -1 to 1.
LSTM maintains an internal state and produces an output. The following diagram shows an LSTM unit over three time slices: the current time slice (t), as well as the previous (t-1) and next (t+1) slice:
The values $\hat{y}$ are the output from the unit, the values $x$ are the input to the unit, and the values $c$ are the context values. Both the output and context values are always fed to the next time slice. The context values allow the LSTM to carry state forward from one time slice to the next.
LSTM is made up of three gates: the forget gate ($f$), the input gate ($i$), and the output gate ($o$).
Mathematically, the above diagram can be thought of as the following:
These are vector values.
First, calculate the forget gate value ($f_t$). This gate determines whether the short-term memory is forgotten. The value $b$ is a bias, just like the bias neurons we saw before, except that LSTM has a bias for every gate: $b_f$, $b_i$, and $b_o$.
$$ f_t = S(W_f \cdot [\hat{y}_{t-1}, x_t] + b_f) $$
Next, calculate the input gate value. This gate's value determines what will be remembered.
$$ i_t = S(W_i \cdot [\hat{y}_{t-1},x_t] + b_i) $$
Calculate a candidate context value (a value that might be remembered). This value is called $\tilde{C}_t$.
$$ \tilde{C}_t = \tanh(W_C \cdot [\hat{y}_{t-1},x_t]+b_C) $$
Determine the new context ($C_t$). Do this by keeping the candidate context ($\tilde{C}_t$) to the degree allowed by the input gate ($i_t$), and keeping the previous context ($C_{t-1}$) to the degree allowed by the forget gate ($f_t$).
$$ C_t = f_t \cdot C_{t-1}+i_t \cdot \tilde{C}_t $$
Calculate the output gate ($o_t$):
$$ o_t = S(W_o \cdot [\hat{y}_{t-1},x_t] + b_o ) $$
Calculate the actual output ($\hat{y}_t$):
$$ \hat{y}_t = o_t \cdot \tanh(C_t) $$
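To make the equations concrete, here is a minimal numpy sketch of a single LSTM step that follows the formulas above. The weights are random placeholders and the sizes are arbitrary; it is an illustration, not TensorFlow's implementation.
In [ ]:
# Minimal numpy sketch of one LSTM step, following the equations above.
import numpy as np

def S(v):                              # sigmoid, as defined earlier
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.RandomState(0)
input_size, hidden_size = 1, 3
concat = hidden_size + input_size      # size of [y_{t-1}, x_t]

W_f, b_f = rng.randn(hidden_size, concat) * 0.1, np.zeros(hidden_size)
W_i, b_i = rng.randn(hidden_size, concat) * 0.1, np.zeros(hidden_size)
W_C, b_C = rng.randn(hidden_size, concat) * 0.1, np.zeros(hidden_size)
W_o, b_o = rng.randn(hidden_size, concat) * 0.1, np.zeros(hidden_size)

def lstm_step(x_t, y_prev, C_prev):
    z = np.concatenate([y_prev, x_t])      # [y_{t-1}, x_t]
    f_t = S(W_f.dot(z) + b_f)              # forget gate
    i_t = S(W_i.dot(z) + b_i)              # input gate
    C_tilde = np.tanh(W_C.dot(z) + b_C)    # candidate context
    C_t = f_t * C_prev + i_t * C_tilde     # new context
    o_t = S(W_o.dot(z) + b_o)              # output gate
    y_t = o_t * np.tanh(C_t)               # unit output
    return y_t, C_t

y, C = np.zeros(hidden_size), np.zeros(hidden_size)
for x in [np.array([0.5]), np.array([0.1]), np.array([-0.3])]:
    y, C = lstm_step(x, y, C)
    print(y)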
In [47]:
import numpy as np
import pandas
import tensorflow as tf
from sklearn import metrics
from tensorflow.models.rnn import rnn, rnn_cell
from tensorflow.contrib import skflow
SEQUENCE_SIZE = 6
HIDDEN_SIZE = 20
NUM_CLASSES = 4
def char_rnn_model(X, y):
    byte_list = skflow.ops.split_squeeze(1, SEQUENCE_SIZE, X)
    cell = rnn_cell.LSTMCell(HIDDEN_SIZE)
    _, encoding = rnn.rnn(cell, byte_list, dtype=tf.float32)
    return skflow.models.logistic_regression(encoding, y)

classifier = skflow.TensorFlowEstimator(model_fn=char_rnn_model, n_classes=NUM_CLASSES,
                                        steps=100, optimizer='Adam', learning_rate=0.01,
                                        continue_training=True)
The following code trains on a data set (x) with a maximum sequence size of 6 (columns) and 6 training elements (rows).
In [48]:
x = [
[[0],[1],[1],[0],[0],[0]],
[[0],[0],[0],[2],[2],[0]],
[[0],[0],[0],[0],[3],[3]],
[[0],[2],[2],[0],[0],[0]],
[[0],[0],[3],[3],[0],[0]],
[[0],[0],[0],[0],[1],[1]]
]
x = np.array(x,dtype=np.float32)
y = np.array([1,2,3,2,3,1])
classifier.fit(x, y)
Out[48]:
In [49]:
test = [[[0],[0],[0],[0],[3],[3]]]
test = np.array(test)
classifier.predict(test)
Out[49]:
In [14]:
# How to read data from the stock market.
from IPython.display import display, HTML
import pandas.io.data as web
import datetime
start = datetime.datetime(2014, 1, 1)
end = datetime.datetime(2014, 12, 31)
f=web.DataReader('tsla', 'yahoo', start, end)
display(f)
In [15]:
import numpy as np
prices = f.Close.pct_change().tolist() # to percent changes
prices = prices[1:] # skip the first, no percent change
SEQUENCE_SIZE = 5
x = []
y = []
for i in range(len(prices)-SEQUENCE_SIZE-1):
    #print(i)
    window = prices[i:(i+SEQUENCE_SIZE)]
    after_window = prices[i+SEQUENCE_SIZE]
    window = [[p] for p in window]  # each sequence member is a 1-feature list
    #print("{} - {}".format(window,after_window))
    x.append(window)
    y.append(after_window)
x = np.array(x)
print(len(x))
In [16]:
from tensorflow.contrib import skflow
from tensorflow.models.rnn import rnn, rnn_cell
import tensorflow as tf
HIDDEN_SIZE = 20
def char_rnn_model(X, y):
    byte_list = skflow.ops.split_squeeze(1, SEQUENCE_SIZE, X)
    cell = rnn_cell.LSTMCell(HIDDEN_SIZE)
    _, encoding = rnn.rnn(cell, byte_list, dtype=tf.float32)
    return skflow.models.linear_regression(encoding, y)

regressor = skflow.TensorFlowEstimator(model_fn=char_rnn_model, n_classes=1,
                                       steps=100, optimizer='Adam', learning_rate=0.01,
                                       continue_training=True)
regressor.fit(x, y)
Out[16]:
In [17]:
# Try an in-sample prediction
from sklearn import metrics
# Measure RMSE error. RMSE is common for regression.
pred = regressor.predict(x)
score = np.sqrt(metrics.mean_squared_error(pred,y))
print("Final score (RMSE): {}".format(score))
In [19]:
# Try out of sample
start = datetime.datetime(2015, 1, 1)
end = datetime.datetime(2015, 12, 31)
f=web.DataReader('tsla', 'yahoo', start, end)
import numpy as np
prices = f.Close.pct_change().tolist() # to percent changes
prices = prices[1:] # skip the first, no percent change
SEQUENCE_SIZE = 5
x = []
y = []
for i in range(len(prices)-SEQUENCE_SIZE-1):
    window = prices[i:(i+SEQUENCE_SIZE)]
    after_window = prices[i+SEQUENCE_SIZE]
    window = [[p] for p in window]  # each sequence member is a 1-feature list
    x.append(window)
    y.append(after_window)
x = np.array(x)
# Measure RMSE error. RMSE is common for regression.
pred = regressor.predict(x)
score = np.sqrt(metrics.mean_squared_error(pred,y))
print("Out of sample score (RMSE): {}".format(score))
In [40]:
import os
import pandas as pd
from sklearn.cross_validation import train_test_split
import tensorflow.contrib.learn as skflow
import numpy as np
from sklearn import metrics
path = "./data/"
filename = os.path.join(path,"t81_558_train.csv")
train_df = pd.read_csv(filename)
train_df.drop('id',1,inplace=True)
train_x, train_y = to_xy(train_df,'outcome')
train_x, test_x, train_y, test_y = train_test_split(
    train_x, train_y, test_size=0.25, random_state=42)
# Create a deep neural network with 3 hidden layers of 50, 25, 10
regressor = skflow.TensorFlowDNNRegressor(hidden_units=[50, 25, 10], steps=5000)
# Early stopping
early_stop = skflow.monitors.ValidationMonitor(test_x, test_y,
    early_stopping_rounds=200, print_steps=50)
# Fit/train neural network
regressor.fit(train_x, train_y, monitor=early_stop)
# Measure RMSE error. RMSE is common for regression.
pred = regressor.predict(test_x)
score = np.sqrt(metrics.mean_squared_error(pred,test_y))
print("Final score (RMSE): {}".format(score))
####################
# Build submit file
####################
from IPython.display import display, HTML
filename = os.path.join(path,"t81_558_test.csv")
submit_df = pd.read_csv(filename)
ids = submit_df.Id
submit_df.drop('Id',1,inplace=True)
submit_x = submit_df.as_matrix()
pred_submit = regressor.predict(submit_x)
submit_df = pd.DataFrame({'Id': ids, 'outcome': pred_submit[:,0]})
submit_filename = os.path.join(path,"t81_558_jheaton_submit.csv")
submit_df.to_csv(submit_filename, index=False)
display(submit_df)
The following code uses a random forest to rank the importance of features. This can be used both to rank the original features and any new ones created.
In [41]:
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestRegressor
# Build a forest and compute the feature importances
forest = RandomForestRegressor(n_estimators=50,
                               random_state=0, verbose=True)
print("Training random forest")
forest.fit(train_x, train_y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]
# Print the feature ranking
#train_df.drop('outcome',1,inplace=True)
bag_cols = train_df.columns.values
print("Feature ranking:")
for f in range(train_x.shape[1]):
    print("{}. {} ({})".format(f + 1, bag_cols[indices[f]], importances[indices[f]]))
The following code uses engineered features.
In [45]:
import os
import pandas as pd
from sklearn.cross_validation import train_test_split
import tensorflow.contrib.learn as skflow
import numpy as np
from sklearn import metrics
path = "./data/"
filename = os.path.join(path,"t81_558_train.csv")
train_df = pd.read_csv(filename)
train_df.drop('id',1,inplace=True)
#train_df.drop('g',1,inplace=True)
#train_df.drop('e',1,inplace=True)
train_df.insert(0, "a-b", train_df.a - train_df.b)
#display(train_df)
train_x, train_y = to_xy(train_df,'outcome')
train_x, test_x, train_y, test_y = train_test_split(
    train_x, train_y, test_size=0.25, random_state=42)
# Create a deep neural network with 3 hidden layers of 50, 25, 10
regressor = skflow.TensorFlowDNNRegressor(hidden_units=[50, 25, 10], steps=5000)
# Early stopping
early_stop = skflow.monitors.ValidationMonitor(test_x, test_y,
    early_stopping_rounds=200, print_steps=50)
# Fit/train neural network
regressor.fit(train_x, train_y, monitor=early_stop)
# Measure RMSE error. RMSE is common for regression.
pred = regressor.predict(test_x)
score = np.sqrt(metrics.mean_squared_error(pred,test_y))
print("Final score (RMSE): {}".format(score))
# foxtrot bravo
# charlie alpha
In [ ]: