Time Series Prediction with BQML and AutoML

Objectives

  1. Learn how to use BQML to create a classification time-series model using CREATE MODEL.
  2. Learn how to use BQML to create a linear regression time-series model.
  3. Learn how to use AutoML Tables to build a time series model from data in BigQuery.

Set up environment variables and load necessary libraries


In [2]:
PROJECT = "your-gcp-project-here" # REPLACE WITH YOUR PROJECT NAME
REGION = "us-central1" # REPLACE WITH YOUR BUCKET REGION e.g. us-central1

In [3]:
import os

os.environ["PROJECT"] = PROJECT
os.environ["REGION"] = REGION

In [4]:
%%bash
sudo python3 -m pip freeze | grep google-cloud-bigquery==1.6.1 || \
sudo python3 -m pip install google-cloud-bigquery==1.6.1


google-cloud-bigquery==1.6.1

Create the dataset


In [8]:
from google.api_core.exceptions import Conflict
from google.cloud import bigquery

bq = bigquery.Client(project=PROJECT)


def create_dataset():
    dataset = bigquery.Dataset(bq.dataset("stock_market"))
    try:
        bq.create_dataset(dataset)  # Will fail if dataset already exists.
        print("Dataset created")
    except Conflict:
        print("Dataset already exists")


def create_features_table():
    error = None
    try:
        bq.query('''
        CREATE TABLE stock_market.eps_percent_change_sp500
        AS
        SELECT *
        FROM `asl-ml-immersion.stock_market.eps_percent_change_sp500`
        ''').to_dataframe()
    except Exception as e:
        error = str(e)
    if error is None:
        print('Table created')
    elif 'Already Exists' in error:
        print('Table already exists.')
    else:
        raise Exception('Table was not created: ' + error)

create_dataset()
create_features_table()


Dataset already exists
Table already exists.

Review the dataset

In the previous lab we created the data we will use for modeling and saved it as a table in BigQuery. Let's examine that table again to confirm that everything is as we expect. Then we will build a model on this table using BigQuery ML.


In [42]:
%%bigquery --project $PROJECT
#standardSQL
SELECT
  *
FROM
  stock_market.eps_percent_change_sp500
LIMIT
  10


Out[42]:
symbol Date Open Close tomorrow_close tomo_close_m_close close_MIN_prior_5_days close_MIN_prior_20_days close_MIN_prior_260_days close_MAX_prior_5_days ... days_on_market scaled_change s_p_scaled_change normalized_change company industry direction consensus_EPS reported_EPS surprise
0 A 2007-05-14 36.46 35.91 37.78 1.87 1.002785 0.957115 0.754107 1.042607 ... 1880 0.052075 -0.001304 0.053379 Agilent Technologies Inc Health Care UP 0.44 0.43 -2.27
1 A 2008-07-01 35.04 35.57 35.36 -0.21 0.999157 0.999157 0.819511 1.038516 ... 2166 -0.005904 -0.018204 0.012300 Agilent Technologies Inc Health Care UP 0.52 0.46 -11.54
2 A 2008-07-01 35.04 35.57 35.36 -0.21 0.999157 0.999157 0.819511 1.038516 ... 2166 -0.005904 -0.018204 0.012300 Agilent Technologies Inc Health Care UP 0.48 0.46 -4.17
3 A 2005-11-14 32.80 32.90 34.50 1.60 1.000000 0.929179 0.620061 1.004255 ... 1506 0.048632 -0.003850 0.052482 Agilent Technologies Inc Health Care UP 0.37 0.38 2.70
4 A 2003-05-19 16.10 15.51 16.04 0.53 1.047066 0.985171 0.699549 1.077369 ... 877 0.034172 -0.001129 0.035301 Agilent Technologies Inc Health Care UP -0.15 -0.15 0.00
5 A 2011-11-15 37.96 38.25 38.58 0.33 0.943007 0.862484 0.768627 1.013072 ... 3018 0.008627 -0.016616 0.025244 Agilent Technologies Inc Health Care UP 0.81 0.84 3.70
6 A 2007-11-15 33.71 33.70 36.72 3.02 0.990801 0.990801 0.910682 1.037982 ... 2010 0.089614 0.005230 0.084384 Agilent Technologies Inc Health Care UP 0.52 0.53 1.92
7 A 2004-08-12 21.78 19.68 20.52 0.84 1.106199 1.106199 1.045732 1.161585 ... 1188 0.042683 0.001477 0.041206 Agilent Technologies Inc Health Care UP 0.28 0.30 7.14
8 A 2008-07-01 35.04 35.57 35.36 -0.21 0.999157 0.999157 0.819511 1.038516 ... 2166 -0.005904 -0.018204 0.012300 Agilent Technologies Inc Health Care UP 0.44 0.35 -20.45
9 A 2009-11-13 27.97 28.61 29.37 0.76 0.914715 0.864733 0.434813 0.959804 ... 2513 0.026564 0.014468 0.012097 Agilent Technologies Inc Health Care UP 0.23 0.32 39.13

10 rows × 29 columns

Using BQML

Create classification model for direction

To create a model:

  1. Use CREATE MODEL and provide a destination table for the resulting model. Alternatively, we can use CREATE OR REPLACE MODEL, which allows overwriting an existing model.
  2. Use OPTIONS to specify the model type (linear_reg or logistic_reg). There are many more options we could specify, such as regularization and learning rate, but we'll accept the defaults.
  3. Provide the query which fetches the training data.

Have a look at Step Two of this tutorial to see another example.

The query will take about two minutes to complete.

We'll start by creating a classification model to predict the direction of each stock.

We'll take a random split using the symbol value. With about 500 different values, using ABS(MOD(FARM_FINGERPRINT(symbol), 15)) = 1 will give about 30 distinct symbol values, which corresponds to about 171,000 examples. After taking 70% of those for training, we will be building a model on about 120,000 training examples.
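
As a sanity check, here is a minimal sketch (using the bq client created earlier) that counts how many symbols and examples land in each split under this hashing scheme; the bucket boundaries mirror the WHERE clauses used below.

# Sketch: count distinct symbols and examples per split. The first hash
# keeps roughly 1/15 of the symbols; the second maps each kept symbol to
# a residue in [0, 1500) that is carved into 70% train / 15% eval / 15% test.
split_counts = bq.query('''
SELECT
  CASE
    WHEN ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) <= 15 * 70 THEN 'train'
    WHEN ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) <= 15 * 85 THEN 'eval'
    ELSE 'test'
  END AS split,
  COUNT(DISTINCT symbol) AS num_symbols,
  COUNT(*) AS num_examples
FROM
  `stock_market.eps_percent_change_sp500`
WHERE
  tomorrow_close IS NOT NULL
  AND ABS(MOD(FARM_FINGERPRINT(symbol), 15)) = 1
GROUP BY
  split
''').to_dataframe()
print(split_counts)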


In [43]:
%%bigquery --project $PROJECT
#standardSQL
CREATE OR REPLACE MODEL
  stock_market.direction_model OPTIONS(model_type = "logistic_reg",
    input_label_cols = ["direction"]) AS
  -- query to fetch training data
SELECT
  symbol,
  Date,
  Open,
  close_MIN_prior_5_days,
  close_MIN_prior_20_days,
  close_MIN_prior_260_days,
  close_MAX_prior_5_days,
  close_MAX_prior_20_days,
  close_MAX_prior_260_days,
  close_AVG_prior_5_days,
  close_AVG_prior_20_days,
  close_AVG_prior_260_days,
  close_STDDEV_prior_5_days,
  close_STDDEV_prior_20_days,
  close_STDDEV_prior_260_days,
  direction
FROM
  `stock_market.eps_percent_change_sp500`
WHERE
  tomorrow_close IS NOT NULL
  AND ABS(MOD(FARM_FINGERPRINT(symbol), 15)) = 1
  AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) <= 15 * 70


Out[43]:

Get training statistics and examine training info

After creating our model, we can evaluate its performance using the ML.EVALUATE function. With this command, we can find the precision, recall, accuracy, F1-score, and AUC of our classification model.


In [44]:
%%bigquery --project $PROJECT
#standardSQL
SELECT
  *
FROM
  ML.EVALUATE(MODEL `stock_market.direction_model`,
    (
    SELECT
      symbol,
      Date,
      Open,
      close_MIN_prior_5_days,
      close_MIN_prior_20_days,
      close_MIN_prior_260_days,
      close_MAX_prior_5_days,
      close_MAX_prior_20_days,
      close_MAX_prior_260_days,
      close_AVG_prior_5_days,
      close_AVG_prior_20_days,
      close_AVG_prior_260_days,
      close_STDDEV_prior_5_days,
      close_STDDEV_prior_20_days,
      close_STDDEV_prior_260_days,
      direction
    FROM
      `stock_market.eps_percent_change_sp500`
    WHERE
      tomorrow_close IS NOT NULL
      AND ABS(MOD(FARM_FINGERPRINT(symbol), 15)) = 1
      AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) > 15 * 70
      AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) <= 15 * 85))


Out[44]:
precision recall accuracy f1_score log_loss roc_auc
0 0.373016 0.376404 0.417178 0.366026 1.118389 0.513714

We can also examine the training statistics collected by BigQuery. To view training results, we use the ML.TRAINING_INFO function.


In [45]:
%%bigquery --project $PROJECT
#standardSQL
SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `stock_market.direction_model`)
ORDER BY iteration


Out[45]:
training_run iteration loss eval_loss learning_rate duration_ms
0 0 0 0.323281 0.360413 0.2 2644
1 0 1 0.258986 0.355889 0.4 2904

Compare to simple benchmark

Another way to assess the performance of our model is to compare it against a simple benchmark: the accuracy we would get from the naive strategy of always predicting the majority class. For the training dataset, the majority class is 'STAY'. With the following query we can see how this naive strategy would perform on the eval set.


In [46]:
%%bigquery --project $PROJECT
#standardSQL
WITH
  eval_data AS (
  SELECT
    symbol,
    Date,
    Open,
    close_MIN_prior_5_days,
    close_MIN_prior_20_days,
    close_MIN_prior_260_days,
    close_MAX_prior_5_days,
    close_MAX_prior_20_days,
    close_MAX_prior_260_days,
    close_AVG_prior_5_days,
    close_AVG_prior_20_days,
    close_AVG_prior_260_days,
    close_STDDEV_prior_5_days,
    close_STDDEV_prior_20_days,
    close_STDDEV_prior_260_days,
    direction
  FROM
    `stock_market.eps_percent_change_sp500`
  WHERE
    tomorrow_close IS NOT NULL
    AND ABS(MOD(FARM_FINGERPRINT(symbol), 15)) = 1
    AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) > 15 * 70
    AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) <= 15 * 85)
SELECT
  direction,
  (COUNT(direction)* 100 / (
    SELECT
      COUNT(*)
    FROM
      eval_data)) AS percentage
FROM
  eval_data
GROUP BY
  direction


Out[46]:
direction percentage
0 UP 28.834356
1 DOWN 27.607362
2 STAY 43.558282

So the naive strategy of always guessing the majority class 'STAY' would have an accuracy of about 0.4356 on the eval dataset, slightly better than our BQML model's accuracy of 0.4172.
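
To see what the model itself is predicting on the eval split, we can call ML.PREDICT. Below is a minimal sketch using the bq client from earlier; for a logistic_reg model, BQML names the output column predicted_<label>, and extra input columns (such as the true direction) are passed through.

# Sketch: inspect a few model predictions next to the true labels on the
# eval split. ML.PREDICT passes unrecognized input columns through, so we
# can select the true direction alongside predicted_direction.
predictions = bq.query('''
SELECT
  symbol,
  Date,
  direction,
  predicted_direction
FROM
  ML.PREDICT(MODEL `stock_market.direction_model`,
    (
    SELECT
      *
    FROM
      `stock_market.eps_percent_change_sp500`
    WHERE
      tomorrow_close IS NOT NULL
      AND ABS(MOD(FARM_FINGERPRINT(symbol), 15)) = 1
      AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) > 15 * 70
      AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) <= 15 * 85))
LIMIT 10
''').to_dataframe()
print(predictions)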

Create regression model for normalized change

We can also use BigQuery to train a regression model to predict the normalized change for each stock. To do this in BigQuery we need only change the OPTIONS when calling CREATE OR REPLACE MODEL. This will give us a more precise prediction rather than just predicting if the stock will go up, down, or stay the same. Thus, we can treat this problem as either a regression problem or a classification problem, depending on the business needs.


In [47]:
%%bigquery --project $PROJECT
#standardSQL
CREATE OR REPLACE MODEL
  stock_market.price_model OPTIONS(model_type = "linear_reg",
    input_label_cols = ["normalized_change"]) AS
  -- query to fetch training data
SELECT
  symbol,
  Date,
  Open,
  close_MIN_prior_5_days,
  close_MIN_prior_20_days,
  close_MIN_prior_260_days,
  close_MAX_prior_5_days,
  close_MAX_prior_20_days,
  close_MAX_prior_260_days,
  close_AVG_prior_5_days,
  close_AVG_prior_20_days,
  close_AVG_prior_260_days,
  close_STDDEV_prior_5_days,
  close_STDDEV_prior_20_days,
  close_STDDEV_prior_260_days,
  normalized_change
FROM
  `stock_market.eps_percent_change_sp500`
WHERE
  normalized_change IS NOT NULL
  AND ABS(MOD(FARM_FINGERPRINT(symbol), 15)) = 1
  AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) <= 15 * 70


Out[47]:

Just as before, we can examine the evaluation metrics for our regression model and the training statistics in BigQuery.


In [48]:
%%bigquery --project $PROJECT
#standardSQL
SELECT
  *
FROM
  ML.EVALUATE(MODEL `stock_market.price_model`,
    (
    SELECT
      symbol,
      Date,
      Open,
      close_MIN_prior_5_days,
      close_MIN_prior_20_days,
      close_MIN_prior_260_days,
      close_MAX_prior_5_days,
      close_MAX_prior_20_days,
      close_MAX_prior_260_days,
      close_AVG_prior_5_days,
      close_AVG_prior_20_days,
      close_AVG_prior_260_days,
      close_STDDEV_prior_5_days,
      close_STDDEV_prior_20_days,
      close_STDDEV_prior_260_days,
      normalized_change
    FROM
      `stock_market.eps_percent_change_sp500`
    WHERE
      normalized_change IS NOT NULL
      AND ABS(MOD(FARM_FINGERPRINT(symbol), 15)) = 1
      AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) > 15 * 70
      AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) <= 15 * 85))


Out[48]:
mean_absolute_error mean_squared_error mean_squared_log_error median_absolute_error r2_score explained_variance
0 0.017972 0.000887 0.000958 0.011666 -0.115487 -0.104581

In [49]:
%%bigquery --project $PROJECT
#standardSQL
SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `stock_market.price_model`)
ORDER BY iteration


Out[49]:
training_run iteration loss eval_loss learning_rate duration_ms
0 0 0 0.003059 0.0028 0.2 2173
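
Note the negative r2_score in the evaluation above: on the eval set the model explains less variance than simply predicting the mean. To mirror the benchmark comparison we did for classification, here is a minimal sketch (using the bq client from earlier) that computes the mean squared error of a naive baseline that always predicts zero normalized change, for comparison with the model's mean_squared_error.

# Sketch: eval-set mean squared error of always predicting 0 normalized
# change, to put the regression model's mean_squared_error in context.
baseline = bq.query('''
SELECT
  AVG(POW(normalized_change, 2)) AS naive_mean_squared_error
FROM
  `stock_market.eps_percent_change_sp500`
WHERE
  normalized_change IS NOT NULL
  AND ABS(MOD(FARM_FINGERPRINT(symbol), 15)) = 1
  AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) > 15 * 70
  AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) <= 15 * 85
''').to_dataframe()
print(baseline)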

Train a Time Series model using AutoML Tables

Step 1. Launch AutoML

Within the GCP console, navigate to Tables in the console menu.

Click Enable API if the API is not already enabled.

Click GET STARTED.

Step 2. Create a Dataset

Select New Dataset and give it a name like stock_market and click Create Dataset. In the section on Importing data, select the option to import your data from a BigQuery Table. Fill in the details for your project, the dataset ID, and the table ID.

Step 3. Import the Data

Once you have created the dataset, you can import the data. This will take a few minutes.

Step 4. Train the model

Once the data has been imported into the dataset, you can examine the Schema of your data, Analyze the properties and values of the features, and ultimately Train the model. Here you can also set the label column and features for training the model. Since we are building a classification model, we'll use direction as our target column.

Under the Train tab, click Train Model. You can choose the features to use when training. Select the same features as we used above.

Step 5. Evaluate your model

Training can take many hours, but once it is complete you can inspect the evaluation metrics of your model. Since this is a classification task, we can also adjust the threshold and explore how different thresholds affect the evaluation metrics. On the same page, we can explore the feature importance of the various features used in the model and view the confusion matrix for our model predictions.

Step 6. Predict with the trained model

Once the model is done training, navigate to the Models page and deploy the model so we can test prediction.

When making predictions, you can run batch prediction jobs by specifying a BigQuery table or CSV file, or you can do online prediction for a single instance.
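
For reference, here is a hedged sketch of kicking off a batch prediction job from Python with the AutoML Tables client library (assuming the TablesClient from google-cloud-automl), rather than through the console. The display name "stock_market_direction" is a hypothetical stand-in for whatever you named your model in Step 4.

# Hedged sketch: batch prediction with the AutoML Tables Python client.
# Assumes google-cloud-automl is installed; "stock_market_direction" is a
# hypothetical display name, so use the name you gave your model in Step 4.
from google.cloud import automl_v1beta1 as automl

automl_client = automl.TablesClient(project=PROJECT, region=REGION)

response = automl_client.batch_predict(
    model_display_name="stock_market_direction",  # hypothetical name
    bigquery_input_uri=f"bq://{PROJECT}.stock_market.eps_percent_change_sp500",
    bigquery_output_uri=f"bq://{PROJECT}",
)
response.result()  # Blocks until the batch prediction job completes.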

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License