Time series prediction, end-to-end

This notebook illustrates several models for predicting the next value of a time series:

  1. Linear
  2. DNN
  3. CNN
  4. RNN
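Whichever model is used, "find the next value" means treating the leading part of each sequence as input and its final value as the target. The sketch below shows that split with NumPy. The helper name `split_features_label` and the choice of a single output value are assumptions made for illustration, not something taken from model.py.

```python
import numpy as np

seq_len = 50     # mirrors the notebook's SEQ_LEN setting
n_outputs = 1    # assumption: only the final value is predicted


def split_features_label(seq, n_outputs=1):
    """Split one series into (inputs, target).

    The inputs are every value except the last n_outputs values;
    the target is the last n_outputs values.
    """
    seq = np.asarray(seq, dtype=np.float64)
    return seq[:-n_outputs], seq[-n_outputs:]


# A noiseless sinusoid stands in for one simulated series.
series = np.sin(np.arange(seq_len) * 0.3)
features, label = split_features_label(series, n_outputs)
print(features.shape, label.shape)  # (49,) (1,)
```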

# Change these to try this notebook out
BUCKET = "cloud-training-demos-ml"
PROJECT = "cloud-training-demos"
REGION = "us-central1"
SEQ_LEN = 50

import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION
os.environ['SEQ_LEN'] = str(SEQ_LEN)
os.environ['TFVERSION'] = "1.13"

Simulate some time-series data

The data is essentially a set of sinusoids with random amplitudes and frequencies, plus uniform noise.

import tensorflow as tf

import numpy as np
import seaborn as sns

def create_time_series():
    freq = (np.random.random()*0.5) + 0.1  # 0.1 to 0.6
    ampl = np.random.random() + 0.5  # 0.5 to 1.5
    noise = [np.random.random()*0.3 for i in range(SEQ_LEN)]  # 0 to 0.3, uniformly distributed
    x = np.sin(np.arange(0,SEQ_LEN) * freq) * ampl + noise
    return x

flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]
for i in range(0, 5):
    sns.tsplot( create_time_series(), color=flatui[i%len(flatui)] );  # 5 series

def to_csv(filename, N):
    with open(filename, 'w') as ofp:
        for lineno in range(0, N):
            seq = create_time_series()
            line = ",".join(map(str, seq))
            ofp.write(line + '\n')

import os
try:
    os.makedirs("data/sines")
except OSError:
    pass  # the directory already exists
to_csv("data/sines/train-1.csv", 1000)  # 1000 sequences
to_csv("data/sines/valid-1.csv", 250)

!head -5 data/sines/*-1.csv
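To check the files from Python rather than with head, something along these lines could be used. `load_sequences` is a hypothetical helper, not part of the notebook or of sinemodel/; it only assumes the comma-separated layout that to_csv writes.

```python
import numpy as np


def load_sequences(filename):
    """Read a CSV written by to_csv into an array of shape
    (number of sequences, sequence length)."""
    # ndmin=2 keeps a file with a single row two-dimensional.
    return np.loadtxt(filename, delimiter=",", ndmin=2)
```

With the settings above, load_sequences("data/sines/train-1.csv").shape should come out as (1000, 50): one row per sequence and SEQ_LEN values per row.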

Train model locally

Make sure the code works as intended.

The model.py and task.py files containing the model code are in sinemodel/

Complete the TODOs in model.py before proceeding!

Once you've completed the TODOs, set --model below to the appropriate model (linear, dnn, cnn, rnn, rnn2 or rnnN) and run it locally for a few steps to test the code.

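DATADIR=$(pwd)/data/sines     # where to_csv wrote the files above
OUTDIR=$(pwd)/trained/sines   # assumed local output directory; change as needed
rm -rf $OUTDIR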
gcloud ml-engine local train \
    --module-name=sinemodel.task \
    --package-path=${PWD}/sinemodel \
    -- \
    --train_data_path="${DATADIR}/train-1.csv" \
    --eval_data_path="${DATADIR}/valid-1.csv"  \
    --output_dir=${OUTDIR} \
    --model=linear --train_steps=10 --sequence_length=$SEQ_LEN
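For a rough idea of what the linear option computes, here is a NumPy-only stand-in: an ordinary least-squares fit from the first SEQ_LEN - 1 values of each sequence to its last value. It is a sketch under stated assumptions, not the code in model.py. The function names are hypothetical, the series generator is re-implemented with a seeded random state so the run is repeatable, and the closed-form fit replaces whatever optimizer the real model uses.

```python
import numpy as np


def make_series(seq_len, rng):
    """Noisy sinusoid with the same parameter ranges as create_time_series."""
    freq = rng.uniform(0.1, 0.6)
    ampl = rng.uniform(0.5, 1.5)
    noise = rng.uniform(0.0, 0.3, size=seq_len)
    return np.sin(np.arange(seq_len) * freq) * ampl + noise


def add_bias(inputs):
    """Append a column of ones so the fit includes an intercept."""
    return np.hstack([inputs, np.ones((len(inputs), 1))])


def fit_linear(data):
    """Least-squares weights mapping all but the last value to the last value."""
    weights, *_ = np.linalg.lstsq(add_bias(data[:, :-1]), data[:, -1], rcond=None)
    return weights


def predict_linear(weights, data):
    """Predict the last value of each row from the values before it."""
    return add_bias(data[:, :-1]) @ weights


rng = np.random.RandomState(0)  # fixed seed, for repeatability only
train = np.stack([make_series(50, rng) for _ in range(1000)])
valid = np.stack([make_series(50, rng) for _ in range(250)])

weights = fit_linear(train)
errors = predict_linear(weights, valid) - valid[:, -1]
valid_rmse = np.sqrt(np.mean(errors ** 2))
print("validation RMSE: {:.3f}".format(valid_rmse))
```

The number it prints is only a sanity reference for this stand-in; the TensorFlow models are trained differently and need not match it.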

Cloud ML Engine

Now train on Cloud ML Engine with more data.

import shutil
shutil.rmtree(path = "data/sines", ignore_errors = True)
os.makedirs("data/sines")  # recreate the directory that was just removed
for i in range(0,10):
    to_csv("data/sines/train-{}.csv".format(i), 1000)  # 1000 sequences
    to_csv("data/sines/valid-{}.csv".format(i), 250)

gsutil -m rm -rf gs://${BUCKET}/sines/*
gsutil -m cp data/sines/*.csv gs://${BUCKET}/sines

for MODEL in linear dnn cnn rnn rnn2; do
    OUTDIR=gs://${BUCKET}/sinewaves/${MODEL}  # assumed output location; change as needed
    JOBNAME=sines_${MODEL}_$(date -u +%y%m%d_%H%M%S)
    gsutil -m rm -rf $OUTDIR
    gcloud ml-engine jobs submit training $JOBNAME \
        --region=$REGION \
        --module-name=sinemodel.task \
        --package-path=${PWD}/sinemodel \
        --job-dir=$OUTDIR \
        --staging-bucket=gs://$BUCKET \
        --scale-tier=BASIC_GPU \
        --runtime-version=$TFVERSION \
        -- \
        --train_data_path="gs://${BUCKET}/sines/train*.csv" \
        --eval_data_path="gs://${BUCKET}/sines/valid*.csv"  \
        --output_dir=$OUTDIR \
        --train_steps=3000 --sequence_length=$SEQ_LEN --model=$MODEL
done

Monitor training with TensorBoard

Use this cell to launch TensorBoard. If TensorBoard appears blank, try refreshing after 5 minutes.

from google.datalab.ml import TensorBoard

for pid in TensorBoard.list()["pid"]:
    TensorBoard().stop(pid)  # actually stop the instance before reporting it
    print("Stopped TensorBoard with pid {}".format(pid))


Complete the table below with your own results, then compare them with the results in the solution notebook.

Model     Sequence length    # of steps    Minutes    RMSE
linear    50                 3000          -          -
dnn       50                 3000          -          -
cnn       50                 3000          -          -
rnn       50                 3000          -          -
rnn2      50                 3000          -          -
rnnN      50                 3000          -          -
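
The RMSE column is the root-mean-square error between predicted and actual next values on the evaluation data. If you want to recompute it from raw predictions, a minimal version could look like this; the function name and the sample numbers are purely illustrative.

```python
import numpy as np


def rmse(predictions, targets):
    """Root-mean-square error between two equally shaped sequences."""
    predictions = np.asarray(predictions, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    return float(np.sqrt(np.mean((predictions - targets) ** 2)))


# Made-up values, only to show the call.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```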

