Time Series Exercise

Follow along with the instructions in bold. Watch the solutions video if you get stuck!

The Data

Source: https://datamarket.com/data/set/22ox/monthly-milk-production-pounds-per-cow-jan-62-dec-75#!ds=22ox&display=line

Monthly milk production: pounds per cow. Jan 62 - Dec 75

Import numpy, pandas, and matplotlib


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Use pandas to read the monthly-milk-production.csv file and set index_col='Month'


In [2]:
data = pd.read_csv("./data/monthly-milk-production.csv", index_col = 'Month')

Check out the head of the dataframe


In [3]:
data.head()


Out[3]:
                     Milk Production
Month
1962-01-01 01:00:00            589.0
1962-02-01 01:00:00            561.0
1962-03-01 01:00:00            640.0
1962-04-01 01:00:00            656.0
1962-05-01 01:00:00            727.0

Make the index a time series by using:

data.index = pd.to_datetime(data.index)

In [4]:
data.index = pd.to_datetime(data.index)
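
As a small illustration of what this conversion does, here is a sketch on a toy frame. The rows are copied from the head() output above, while the names csv_text, toy, and toy_parsed are made up for the example. It also shows one possible alternative, assuming you would rather parse the dates while reading: passing parse_dates=True to pd.read_csv.

```python
import io
import pandas as pd

# Toy CSV text mirroring the first rows shown above (illustrative only).
csv_text = """Month,Milk Production
1962-01-01 01:00:00,589.0
1962-02-01 01:00:00,561.0
1962-03-01 01:00:00,640.0
"""

# Option 1: read the index as plain strings, then convert it afterwards.
toy = pd.read_csv(io.StringIO(csv_text), index_col="Month")
toy.index = pd.to_datetime(toy.index)
print(type(toy.index).__name__)

# Option 2: ask read_csv to parse the index column as dates directly.
toy_parsed = pd.read_csv(io.StringIO(csv_text), index_col="Month", parse_dates=True)
print(type(toy_parsed.index).__name__)
```

Either way, the index ends up as a DatetimeIndex, which is what lets data.plot() draw a proper time axis.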

Plot out the time series data.


In [5]:
data.plot()


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fe9a48f908>

Train Test Split

Let's attempt to predict a year's worth of data. (12 months or 12 steps into the future)

Create a train/test split using indexing (hint: use .head(), .tail(), or .iloc[]). We don't want a random train/test split; we want the test set to be the last 12 months of data, with everything before it as the training set.


In [6]:
data.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 168 entries, 1962-01-01 01:00:00 to 1975-12-01 01:00:00
Data columns (total 1 columns):
Milk Production    168 non-null float64
dtypes: float64(1)
memory usage: 2.6 KB

In [7]:
training_set = data.head(156)

In [8]:
test_set = data.tail(12)
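
The same split can also be written with .iloc, which avoids hard-coding 156. The sketch below uses a synthetic stand-in for the milk data (168 monthly rows with made-up values); the names toy, train_part, and test_part are assumptions for the example only.

```python
import numpy as np
import pandas as pd

# Stand-in for the milk data: 168 monthly rows starting Jan 1962 (synthetic values).
months = pd.date_range("1962-01-01", periods=168, freq="MS")
toy = pd.DataFrame({"Milk Production": np.linspace(550.0, 950.0, 168)}, index=months)

test_size = 12

# Chronological split: everything except the last 12 rows, then the last 12 rows.
train_part = toy.iloc[:-test_size]
test_part = toy.iloc[-test_size:]

print(len(train_part), len(test_part))
# Every training date comes before every test date, so nothing from the future leaks in.
print(train_part.index.max() < test_part.index.min())
```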

Scale the Data

Use sklearn.preprocessing to scale the data using the MinMaxScaler. Remember to only fit_transform on the training data, then transform the test data. You shouldn't fit on the test data as well, otherwise you are assuming you would know about future behavior!


In [9]:
from sklearn.preprocessing import MinMaxScaler

In [10]:
scaler = MinMaxScaler()

In [11]:
training_set = scaler.fit_transform(training_set)

In [12]:
test_set_scaled = scaler.transform(test_set)
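
To see why fitting only on the training data matters, here is a minimal NumPy sketch of the min-max arithmetic, (x - min) / (max - min), which is what MinMaxScaler computes with its default settings. The numbers are toy values, not the milk data, and minmax_transform is a hypothetical helper written for this example.

```python
import numpy as np

# Toy values, not the real milk data.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[25.0], [40.0]])

# "Fit": learn the minimum and maximum from the training data only.
data_min = train.min(axis=0)
data_max = train.max(axis=0)

def minmax_transform(values):
    """Scale values with the training min and max (hypothetical helper)."""
    return (values - data_min) / (data_max - data_min)

train_scaled = minmax_transform(train)
test_scaled = minmax_transform(test)

print(train_scaled.ravel())  # training data lands in [0, 1]
print(test_scaled.ravel())   # test data can fall outside [0, 1]
```

A test value above the training maximum scales to more than 1. That is expected: the scaler only knows about the range it saw during training.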

Batch Function

We'll need a function that can feed batches of the training data. The steps are listed as comments inside the function. Remember to reference the previous batch method from the lecture for hints. Try to fill out the function template below. This is a pretty hard step, so feel free to reference the solutions!


In [13]:
def next_batch(training_data, batch_size, steps):
    """
    INPUT: Data, Batch Size, Time Steps per batch
    OUTPUT: A tuple of y time series results: y[:,:-1] and y[:,1:]
    NOTE: batch_size is accepted but not used; each call returns one sequence.
    """
    
    # STEP 1: Use np.random.randint to set a random starting point index for the batch.
    # Remember that each batch needs to have the same number of steps in it.
    # This means you should limit the starting point to len(data)-steps
    random_start = np.random.randint(0, len(training_data) - steps)
    
    # STEP 2: Now that you have a starting index, index the data from
    # random_start to random_start + steps + 1, then reshape it to (1, steps+1).
    # Create Y data for time series in the batches
    y_batch = np.array(training_data[random_start : random_start + steps + 1]).reshape(1, steps+1)
    
    # STEP 3: Return the batches. You'll have two batches to return: y[:,:-1] and y[:,1:]
    # Reshape them into tensors for the RNN with .reshape(-1,steps,1)
    return y_batch[:, :-1].reshape(-1, steps, 1), y_batch[:, 1:].reshape(-1, steps, 1)
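
The function above always returns a single sequence, which matches the batch size of 1 used later. If you want to experiment with larger batches, one possible variant is sketched below. It is not the solution's method: next_batch_multi and its rng argument are hypothetical names, and it assumes a single-column series.

```python
import numpy as np

def next_batch_multi(training_data, batch_size, steps, rng=None):
    """Hypothetical variant of next_batch that returns batch_size sequences."""
    rng = np.random.default_rng() if rng is None else rng
    series = np.asarray(training_data, dtype=float).reshape(-1)

    # One random start per sequence; each window needs steps + 1 points.
    starts = rng.integers(0, len(series) - steps, size=batch_size)
    windows = np.stack([series[s : s + steps + 1] for s in starts])

    # Inputs are the first `steps` points; targets are the same window shifted by one.
    X = windows[:, :-1].reshape(batch_size, steps, 1)
    y = windows[:, 1:].reshape(batch_size, steps, 1)
    return X, y

# Toy series 0, 1, 2, ... so the one-step shift is easy to see.
toy = np.arange(30.0)
X_demo, y_demo = next_batch_multi(toy, batch_size=4, steps=12, rng=np.random.default_rng(0))
print(X_demo.shape, y_demo.shape)
print(X_demo[0, :3, 0], y_demo[0, :3, 0])
```

On the toy series every target is exactly one larger than its input, which confirms that y is X shifted one step into the future.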

Setting Up The RNN Model

Import TensorFlow


In [14]:
import tensorflow as tf

The Constants

Define the constants in a single cell. You'll need the following (in parentheses are the values I used in my solution, but you can play with some of these):

  • Number of Inputs (1)
  • Number of Time Steps (12)
  • Number of Neurons per Layer (100)
  • Number of Outputs (1)
  • Learning Rate (0.03)
  • Number of Iterations for Training (4000)
  • Batch Size (1)

In [15]:
num_inputs = 1

num_time_steps = 12

num_neurons = 100

num_outputs = 1

learning_rate = 0.03

num_train_iter = 4000

batch_size = 1

Create placeholders for X and y. (You can change the variable names if you want.) The shape for these placeholders should be [None, num_time_steps, num_inputs] and [None, num_time_steps, num_outputs]. Each batch is cut from a window of num_time_steps + 1 points: X holds the first num_time_steps values and y holds the same window shifted one step ahead, because we are training the RNN to predict one point into the future at every step of the input sequence.


In [16]:
X = tf.placeholder(tf.float32, [None, num_time_steps, num_inputs])
y = tf.placeholder(tf.float32, [None, num_time_steps, num_outputs])

Now create the RNN layer. You have complete freedom over this: use tf.contrib.rnn and choose anything you want (OutputProjectionWrapper, BasicRNNCell, BasicLSTMCell, MultiRNNCell, GRUCell, etc.). Keep in mind that not every combination will work well! (If in doubt, the solutions used an OutputProjectionWrapper around a BasicLSTMCell with ReLU activation.)


In [17]:
cell = tf.contrib.rnn.OutputProjectionWrapper(tf.contrib.rnn.BasicLSTMCell(num_units = num_neurons, activation = tf.nn.relu), output_size = num_outputs)


WARNING:tensorflow:From c:\programdata\anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\datasets\base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.

Now pass the cell variable into tf.nn.dynamic_rnn, along with your first placeholder (X)


In [18]:
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype = tf.float32)

Loss Function and Optimizer

Create a mean squared error loss function and use an AdamOptimizer to minimize it. Remember to pass in your learning rate.


In [19]:
# MSE
loss = tf.reduce_mean(tf.square(outputs - y))
optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate)
train = optimizer.minimize(loss)
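
For reference, the loss above is just the average of the squared differences over every batch, time step, and output. A tiny NumPy sketch with made-up numbers (the arrays predictions and targets are assumptions for the example) shows the same arithmetic outside TensorFlow:

```python
import numpy as np

# Made-up values with shape (batch, time steps, outputs) = (1, 2, 1).
predictions = np.array([[[0.5], [0.7]]])
targets = np.array([[[0.4], [0.9]]])

# Same formula as tf.reduce_mean(tf.square(outputs - y)).
mse = np.mean(np.square(predictions - targets))
print(mse)
```

The two errors are 0.1 and -0.2, so the squared errors are 0.01 and 0.04, and their mean is about 0.025.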

Initialize the global variables


In [20]:
init = tf.global_variables_initializer()

Create an instance of tf.train.Saver()


In [21]:
saver = tf.train.Saver()

Session

Run a tf.Session that trains on the batches created by your next_batch function. Also add a loss evaluation every 100 training iterations. Remember to save your model after you are done training.


In [22]:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction = 0.75)

In [23]:
with tf.Session(config = tf.ConfigProto(gpu_options = gpu_options)) as sess:
    # Initialize all variables
    sess.run(init)
    
    for iteration in range(num_train_iter):
        X_batch, Y_batch = next_batch(training_set, batch_size, num_time_steps)
        
        sess.run(train, feed_dict = {X: X_batch, y: Y_batch})
        
        if iteration % 100 == 0:
            mse = loss.eval(feed_dict = {X: X_batch, y: Y_batch})
            print(iteration, "\tMSE:", mse)
    
    # Save Model for Later
    saver.save(sess, "./checkpoints/ex_time_series_model")


0 	MSE: 0.14541869
100 	MSE: 0.02841618
200 	MSE: 0.008366213
300 	MSE: 0.010956082
400 	MSE: 0.009047647
500 	MSE: 0.008949796
600 	MSE: 0.007822753
700 	MSE: 0.00971632
800 	MSE: 0.00853831
900 	MSE: 0.010831367
1000 	MSE: 0.009773464
1100 	MSE: 0.006342268
1200 	MSE: 0.006229364
1300 	MSE: 0.010549121
1400 	MSE: 0.006783081
1500 	MSE: 0.009549073
1600 	MSE: 0.013671103
1700 	MSE: 0.0051058605
1800 	MSE: 0.008275347
1900 	MSE: 0.008504177
2000 	MSE: 0.008999887
2100 	MSE: 0.006180187
2200 	MSE: 0.0066117123
2300 	MSE: 0.0080993455
2400 	MSE: 0.008517283
2500 	MSE: 0.0030064553
2600 	MSE: 0.008333419
2700 	MSE: 0.00840407
2800 	MSE: 0.007698992
2900 	MSE: 0.004447857
3000 	MSE: 0.0072046854
3100 	MSE: 0.007451559
3200 	MSE: 0.006269134
3300 	MSE: 0.006170534
3400 	MSE: 0.003233603
3500 	MSE: 0.00175663
3600 	MSE: 0.0044156048
3700 	MSE: 0.004216565
3800 	MSE: 0.001972133
3900 	MSE: 0.0045218114

Predicting Future (Test Data)

Show the test_set (the last 12 months of your original complete data set)


In [24]:
test_set


Out[24]:
                     Milk Production
Month
1975-01-01 01:00:00            834.0
1975-02-01 01:00:00            782.0
1975-03-01 01:00:00            892.0
1975-04-01 01:00:00            903.0
1975-05-01 01:00:00            966.0
1975-06-01 01:00:00            937.0
1975-07-01 01:00:00            896.0
1975-08-01 01:00:00            858.0
1975-09-01 01:00:00            817.0
1975-10-01 01:00:00            827.0
1975-11-01 01:00:00            797.0
1975-12-01 01:00:00            843.0

Now we want to attempt to predict these 12 months of data, using only the training data we had. To do this we will feed in a seed training_instance of the last 12 months of the training_set of data to predict 12 months into the future. Then we will be able to compare our generated 12 months to our actual true historical values from the test set!

Generative Session

NOTE: Recall that our model is really only trained to predict 1 time step ahead, asking it to generate 12 steps is a big ask, and technically not what it was trained to do! Think of this more as generating new values based off some previous pattern, rather than trying to directly predict the future. You would need to go back to the original model and train the model to predict 12 time steps ahead to really get a higher accuracy on the test data. (Which has its limits due to the smaller size of our data set)

Fill out the session code below to generate 12 months of data based off the last 12 months of data from the training set. The hardest part about this is adjusting the arrays with their shapes and sizes. Reference the lecture for hints.


In [25]:
with tf.Session() as sess:
    
    # Use your Saver instance to restore your saved rnn time series model
    saver.restore(sess, "./checkpoints/ex_time_series_model")

    # Create the generative seed from the last 12 months of the scaled training
    # set. training_set is already a NumPy array here, so slice it with [-12:].
    train_seed = list(training_set[-12:])
    
    # Generate 12 new values: feed the last num_time_steps points to the model,
    # then append the final predicted step to the seed list.
    for iteration in range(12):
        X_batch = np.array(train_seed[-num_time_steps:]).reshape(1, num_time_steps, 1)
        y_pred = sess.run(outputs, feed_dict={X: X_batch})
        train_seed.append(y_pred[0, -1, 0])


INFO:tensorflow:Restoring parameters from ./checkpoints/ex_time_series_model
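
If the array shapes in that loop are hard to follow, it can help to run the same loop without TensorFlow. In the sketch below, fake_model is a hypothetical stand-in for sess.run(outputs, ...): it returns an array of the same shape, (1, steps, 1), filled with running means instead of real predictions. Only the windowing and appending logic is the point here.

```python
import numpy as np

num_time_steps = 12

def fake_model(window):
    """Hypothetical stand-in for the RNN: output at step t is the mean of inputs up to t."""
    steps = window.shape[1]
    counts = np.arange(1, steps + 1).reshape(1, steps, 1)
    return np.cumsum(window, axis=1) / counts

# Toy scaled history standing in for the last 12 training values.
seed = list(np.linspace(0.0, 1.0, num_time_steps))

for _ in range(12):
    # Always feed the most recent num_time_steps values, shaped (1, steps, 1).
    X_batch = np.array(seed[-num_time_steps:]).reshape(1, num_time_steps, 1)
    y_pred = fake_model(X_batch)
    # Keep only the last time step: that is the one-step-ahead prediction.
    seed.append(y_pred[0, -1, 0])

generated = np.array(seed[num_time_steps:])
print(len(seed), generated.shape)
```

After the loop the list holds the 12 seed values followed by 12 generated values, which is the same layout as train_seed in the session above.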

Show the result of the predictions.


In [26]:
train_seed


Out[26]:
[array([0.66105769]),
 array([0.54086538]),
 array([0.80769231]),
 array([0.83894231]),
 array([1.]),
 array([0.94711538]),
 array([0.85336538]),
 array([0.75480769]),
 array([0.62980769]),
 array([0.62259615]),
 array([0.52884615]),
 array([0.625]),
 0.6263535,
 0.5815746,
 0.7504245,
 0.62584615,
 0.8299959,
 0.86410713,
 0.8700878,
 0.86114335,
 0.6898574,
 0.6607598,
 0.59247124,
 0.5458949]

Grab the portion of the results that are the generated values and apply inverse_transform on them to turn them back into milk production value units (lbs per cow). Also reshape the results to be (12,1) so we can easily add them to the test_set dataframe.


In [27]:
results = scaler.inverse_transform(np.array(train_seed[12:]).reshape(12, 1))
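
Under the hood, inverse_transform simply undoes the min-max formula: value = scaled * (max - min) + min. The sketch below checks this by hand for two of the generated values. The training minimum and maximum used here (553 and 969) are an assumption inferred from the numbers printed in this notebook, not values read from the scaler itself.

```python
import numpy as np

# Assumed training min and max, inferred from the outputs shown in this notebook.
data_min, data_max = 553.0, 969.0

# First and last generated values from the scaled predictions above.
scaled = np.array([0.6263535, 0.5458949]).reshape(-1, 1)

# Undo the min-max scaling to get back to pounds per cow.
pounds = scaled * (data_max - data_min) + data_min
print(pounds.ravel())
```

The results should agree, up to float rounding, with the first and last entries of the Generated column (813.563049 and 780.092285).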

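Create a new column on the test_set called "Generated" and set it equal to the generated results. You may get a SettingWithCopyWarning here because test_set was taken as a slice of data; feel free to ignore it. (Creating test_set with data.tail(12).copy() avoids the warning.)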


In [28]:
test_set['Generated'] = results


c:\programdata\anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

View the test_set dataframe.


In [29]:
test_set


Out[29]:
                     Milk Production   Generated
Month
1975-01-01 01:00:00            834.0  813.563049
1975-02-01 01:00:00            782.0  794.935059
1975-03-01 01:00:00            892.0  865.176636
1975-04-01 01:00:00            903.0  813.351990
1975-05-01 01:00:00            966.0  898.278259
1975-06-01 01:00:00            937.0  912.468567
1975-07-01 01:00:00            896.0  914.956543
1975-08-01 01:00:00            858.0  911.235596
1975-09-01 01:00:00            817.0  839.980713
1975-10-01 01:00:00            827.0  827.876038
1975-11-01 01:00:00            797.0  799.468018
1975-12-01 01:00:00            843.0  780.092285
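
Besides eyeballing the two columns, you can summarize the gap with a couple of error metrics. This is an optional extra, not part of the original exercise. The sketch copies the values from the table above into plain NumPy arrays (the names actual and generated are made up for the example) and computes the mean absolute error and root mean squared error.

```python
import numpy as np

# Values copied from the test_set table above (lbs per cow, Jan to Dec 1975).
actual = np.array([834.0, 782.0, 892.0, 903.0, 966.0, 937.0,
                   896.0, 858.0, 817.0, 827.0, 797.0, 843.0])
generated = np.array([813.563049, 794.935059, 865.176636, 813.351990,
                      898.278259, 912.468567, 914.956543, 911.235596,
                      839.980713, 827.876038, 799.468018, 780.092285])

errors = actual - generated
mae = np.mean(np.abs(errors))          # average size of the miss
rmse = np.sqrt(np.mean(errors ** 2))   # penalizes large misses more heavily
worst_month = int(np.argmax(np.abs(errors))) + 1  # 1 = January

print(f"MAE:  {mae:.1f} lbs per cow")
print(f"RMSE: {rmse:.1f} lbs per cow")
print("Month with the largest miss:", worst_month)
```

Your own numbers will differ from run to run, since training starts from random weights and random batches.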

Plot out the two columns for comparison.


In [30]:
test_set.plot()


Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fe998a8a20>

Great Job!

Play around with the parameters and RNN layers. Does a faster learning rate with more steps improve the model? What about GRU or BasicRNN units? What if you train the original model to predict not just one time step into the future, but three instead? Lots of stuff to add on here!