In [2]:
PROJECT = "your-gcp-project-here" # REPLACE WITH YOUR PROJECT NAME
REGION = "us-central1" # REPLACE WITH YOUR BUCKET REGION e.g. us-central1
In [3]:
import os
os.environ["PROJECT"] = PROJECT
os.environ["REGION"] = REGION
In [4]:
%%bash
sudo python3 -m pip freeze | grep google-cloud-bigquery==1.6.1 || \
sudo python3 -m pip install google-cloud-bigquery==1.6.1
In [8]:
from google.cloud import bigquery

bq = bigquery.Client(project=PROJECT)


def create_dataset():
    """Create the stock_market dataset if it doesn't already exist."""
    dataset = bigquery.Dataset(bq.dataset("stock_market"))
    try:
        bq.create_dataset(dataset)  # Will fail if dataset already exists.
        print("Dataset created")
    except Exception:
        print("Dataset already exists")


def create_features_table():
    """Copy the features table from the source dataset."""
    error = None
    try:
        bq.query('''
        CREATE TABLE stock_market.eps_percent_change_sp500
        AS
        SELECT *
        FROM `asl-ml-immersion.stock_market.eps_percent_change_sp500`
        ''').to_dataframe()
    except Exception as e:
        error = str(e)
    if error is None:
        print('Table created')
    elif 'Already Exists' in error:
        print('Table already exists.')
    else:
        raise Exception('Table was not created.')


create_dataset()
create_features_table()
In [42]:
%%bigquery --project $PROJECT
#standardSQL
SELECT
*
FROM
stock_market.eps_percent_change_sp500
LIMIT
10
Out[42]:
direction
To create a model we use CREATE MODEL and provide a destination table for the resulting model. Alternatively, we can use CREATE OR REPLACE MODEL, which allows overwriting an existing model. We use the OPTIONS clause to specify the model type (linear_reg or logistic_reg). There are many more options we could specify, such as regularization and learning rate, but we'll accept the defaults. Have a look at Step Two of this tutorial to see another example.
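For illustration, here is a sketch of what a more heavily configured model could look like, run through the bq client created earlier. The option names (l1_reg, l2_reg, learn_rate_strategy, learn_rate, max_iterations) are standard BQML options, but the numeric values and the model name direction_model_tuned are made-up placeholders, not tuned recommendations:
In [ ]:
# Sketch only: a CREATE MODEL statement with extra training options spelled
# out. Values are illustrative placeholders, and the feature list is truncated.
bq.query("""
CREATE OR REPLACE MODEL stock_market.direction_model_tuned
OPTIONS(model_type = 'logistic_reg',
        input_label_cols = ['direction'],
        l1_reg = 0.1,                      -- L1 regularization strength
        l2_reg = 0.1,                      -- L2 regularization strength
        learn_rate_strategy = 'constant',  -- needed to set learn_rate directly
        learn_rate = 0.1,
        max_iterations = 10) AS
SELECT symbol, Date, Open, direction  -- in practice, the full feature list below
FROM `stock_market.eps_percent_change_sp500`
WHERE tomorrow_close IS NOT NULL
""").result()  # blocks until training finishes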
We'll start by creating a classification model to predict the direction of each stock. The query will take about two minutes to complete.
We'll take a random split using the symbol value. With about 500 different values, using ABS(MOD(FARM_FINGERPRINT(symbol), 15)) = 1 will give 30 distinct symbol values, which corresponds to about 171,000 examples. After taking 70% for training, we will be building a model on about 110,000 training examples.
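To see why these modulo filters act like a reproducible random split, here is a small Python sketch that mimics the logic, using hashlib as a stand-in for FARM_FINGERPRINT (a different hash, so the selected symbols will differ, but the proportions are the point):
In [ ]:
import hashlib

def fake_fingerprint(s):
    # Stand-in for BigQuery's FARM_FINGERPRINT: a deterministic signed
    # 64-bit hash of a string (different hash function, same idea).
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big", signed=True)

symbols = [f"SYM{i}" for i in range(500)]  # pretend S&P 500 tickers

# ABS(MOD(fp, 15)) = 1 keeps a reproducible ~1/15 slice of the symbols.
# For BigQuery's truncated MOD, ABS(MOD(x, n)) equals abs(x) % n in Python.
sampled = [s for s in symbols if abs(fake_fingerprint(s)) % 15 == 1]

# Of those, ABS(MOD(fp, 1500)) <= 1050 lands on 70 of the 100 possible
# residues, i.e. roughly a 70% training split; > 1050 and <= 1275 then
# carves out the next ~15% for evaluation.
train = [s for s in sampled if abs(fake_fingerprint(s)) % 1500 <= 1050]

print(f"{len(sampled)} symbols sampled, {len(train)} in the training split")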
In [43]:
%%bigquery --project $PROJECT
#standardSQL
CREATE OR REPLACE MODEL
stock_market.direction_model OPTIONS(model_type = "logistic_reg",
input_label_cols = ["direction"]) AS
-- query to fetch training data
SELECT
symbol,
Date,
Open,
close_MIN_prior_5_days,
close_MIN_prior_20_days,
close_MIN_prior_260_days,
close_MAX_prior_5_days,
close_MAX_prior_20_days,
close_MAX_prior_260_days,
close_AVG_prior_5_days,
close_AVG_prior_20_days,
close_AVG_prior_260_days,
close_STDDEV_prior_5_days,
close_STDDEV_prior_20_days,
close_STDDEV_prior_260_days,
direction
FROM
`stock_market.eps_percent_change_sp500`
WHERE
tomorrow_close IS NOT NULL
AND ABS(MOD(FARM_FINGERPRINT(symbol), 15)) = 1
AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) <= 15 * 70
Out[43]:
After creating our model, we can evaluate its performance using the ML.EVALUATE
function. With this command, we can find the precision, recall, accuracy, F1-score, and AUC of our classification model.
In [44]:
%%bigquery --project $PROJECT
#standardSQL
SELECT
*
FROM
ML.EVALUATE(MODEL `stock_market.direction_model`,
(
SELECT
symbol,
Date,
Open,
close_MIN_prior_5_days,
close_MIN_prior_20_days,
close_MIN_prior_260_days,
close_MAX_prior_5_days,
close_MAX_prior_20_days,
close_MAX_prior_260_days,
close_AVG_prior_5_days,
close_AVG_prior_20_days,
close_AVG_prior_260_days,
close_STDDEV_prior_5_days,
close_STDDEV_prior_20_days,
close_STDDEV_prior_260_days,
direction
FROM
`stock_market.eps_percent_change_sp500`
WHERE
tomorrow_close IS NOT NULL
AND ABS(MOD(FARM_FINGERPRINT(symbol), 15)) = 1
AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) > 15 * 70
AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) <= 15 * 85))
Out[44]:
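The same metrics can also be pulled into Python with the client created earlier, which is convenient for logging or comparing runs. A sketch (note that when the input subquery is omitted, ML.EVALUATE should report metrics on the rows BigQuery automatically held out during training):
In [ ]:
# Fetch the evaluation metrics as a pandas DataFrame instead of using the
# %%bigquery magic. No input subquery: evaluate on the auto-reserved split.
metrics = bq.query("""
SELECT *
FROM ML.EVALUATE(MODEL `stock_market.direction_model`)
""").to_dataframe()
print(metrics.T)  # single row of metrics; transpose for readability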
We can also examine the training statistics collected by BigQuery. To view the training results we use the ML.TRAINING_INFO
function.
In [45]:
%%bigquery --project $PROJECT
#standardSQL
SELECT
*
FROM
ML.TRAINING_INFO(MODEL `stock_market.direction_model`)
ORDER BY iteration
Out[45]:
Another way to assess the performance of our model is to compare it with a simple benchmark. We can do this by seeing what kind of accuracy we would get using the naive strategy of always predicting the majority class. For the training dataset, the majority class is 'STAY'. With the following query we can see how this naive strategy would perform on the eval set.
In [46]:
%%bigquery --project $PROJECT
#standardSQL
WITH
eval_data AS (
SELECT
symbol,
Date,
Open,
close_MIN_prior_5_days,
close_MIN_prior_20_days,
close_MIN_prior_260_days,
close_MAX_prior_5_days,
close_MAX_prior_20_days,
close_MAX_prior_260_days,
close_AVG_prior_5_days,
close_AVG_prior_20_days,
close_AVG_prior_260_days,
close_STDDEV_prior_5_days,
close_STDDEV_prior_20_days,
close_STDDEV_prior_260_days,
direction
FROM
`stock_market.eps_percent_change_sp500`
WHERE
tomorrow_close IS NOT NULL
AND ABS(MOD(FARM_FINGERPRINT(symbol), 15)) = 1
AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) > 15 * 70
AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) <= 15 * 85)
SELECT
direction,
(COUNT(direction)* 100 / (
SELECT
COUNT(*)
FROM
eval_data)) AS percentage
FROM
eval_data
GROUP BY
direction
Out[46]:
So, the naive strategy of always guessing the majority class would have an accuracy of 0.5509 on the eval dataset, just below our BQML model.
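This comparison is easy to script. Here is a sketch that computes the majority-class baseline directly from the eval-set label counts, reusing the bq client and the same eval-set filters as the query above:
In [ ]:
# Always predicting the most common label gives accuracy equal to that
# label's share of the eval set, which is the baseline quoted above.
counts = bq.query("""
SELECT direction, COUNT(*) AS n
FROM `stock_market.eps_percent_change_sp500`
WHERE tomorrow_close IS NOT NULL
  AND ABS(MOD(FARM_FINGERPRINT(symbol), 15)) = 1
  AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) > 15 * 70
  AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) <= 15 * 85
GROUP BY direction
""").to_dataframe()
baseline = counts["n"].max() / counts["n"].sum()
print(f"Majority-class baseline accuracy: {baseline:.4f}")  # ~0.5509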
normalized change
We can also use BigQuery to train a regression model to predict the normalized change of each stock. To do this in BigQuery we need only change the OPTIONS when calling CREATE OR REPLACE MODEL
. This will give us a more precise prediction, rather than just predicting whether the stock will go up, down, or stay the same. Thus, we can treat this problem as either a regression problem or a classification problem, depending on the business needs.
In [47]:
%%bigquery --project $PROJECT
#standardSQL
CREATE OR REPLACE MODEL
stock_market.price_model OPTIONS(model_type = "linear_reg",
input_label_cols = ["normalized_change"]) AS
-- query to fetch training data
SELECT
symbol,
Date,
Open,
close_MIN_prior_5_days,
close_MIN_prior_20_days,
close_MIN_prior_260_days,
close_MAX_prior_5_days,
close_MAX_prior_20_days,
close_MAX_prior_260_days,
close_AVG_prior_5_days,
close_AVG_prior_20_days,
close_AVG_prior_260_days,
close_STDDEV_prior_5_days,
close_STDDEV_prior_20_days,
close_STDDEV_prior_260_days,
normalized_change
FROM
`stock_market.eps_percent_change_sp500`
WHERE
normalized_change IS NOT NULL
AND ABS(MOD(FARM_FINGERPRINT(symbol), 15)) = 1
AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) <= 15 * 70
Out[47]:
Just as before, we can examine the evaluation metrics for our regression model and examine the training statistics in BigQuery.
In [48]:
%%bigquery --project $PROJECT
#standardSQL
SELECT
*
FROM
ML.EVALUATE(MODEL `stock_market.price_model`,
(
SELECT
symbol,
Date,
Open,
close_MIN_prior_5_days,
close_MIN_prior_20_days,
close_MIN_prior_260_days,
close_MAX_prior_5_days,
close_MAX_prior_20_days,
close_MAX_prior_260_days,
close_AVG_prior_5_days,
close_AVG_prior_20_days,
close_AVG_prior_260_days,
close_STDDEV_prior_5_days,
close_STDDEV_prior_20_days,
close_STDDEV_prior_260_days,
normalized_change
FROM
`stock_market.eps_percent_change_sp500`
WHERE
normalized_change IS NOT NULL
AND ABS(MOD(FARM_FINGERPRINT(symbol), 15)) = 1
AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) > 15 * 70
AND ABS(MOD(FARM_FINGERPRINT(symbol), 15 * 100)) <= 15 * 85))
Out[48]:
In [49]:
%%bigquery --project $PROJECT
#standardSQL
SELECT
*
FROM
ML.TRAINING_INFO(MODEL `stock_market.price_model`)
ORDER BY iteration
Out[49]:
Click Enable API if the API is not enabled.
Click GET STARTED.
Once the data has been imported into the dataset, you can examine the schema of your data, analyze the properties and values of the features, and ultimately train the model. Here you can also determine the label column and features for training the model. Since we are building a classification model, we'll use direction
as our target column.
Under the Train
tab, click Train Model. You can choose the features to use when training; select the same features as we used above.
Training can take many hours, but once training is complete you can inspect the evaluation metrics of your model. Since this is a classification task, you can also adjust the threshold and explore how different thresholds affect your evaluation metrics. On the same page, you can explore the feature importance of the various features used in the model and view the confusion matrix for your model's predictions.
Once the model is done training, navigate to the Models page and deploy the model so we can test predictions.
When making predictions, you can run batch prediction jobs by specifying a BigQuery table or CSV file, or you can do online prediction for a single instance.
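As a rough sketch, a batch prediction call through the AutoML Tables Python client might look like the cell below. The model display name and table URIs are placeholders, and this client library has evolved over time, so treat it as an outline and check the current documentation:
In [ ]:
from google.cloud import automl_v1beta1

# Placeholder names; substitute your own model display name and tables.
client = automl_v1beta1.TablesClient(project=PROJECT, region="us-central1")
operation = client.batch_predict(
    model_display_name="stock_market_direction",  # hypothetical model name
    bigquery_input_uri="bq://your-project.stock_market.eps_percent_change_sp500",
    bigquery_output_uri="bq://your-project.stock_market",  # results dataset
)
operation.result()  # block until the batch prediction job finishes
print("Batch prediction complete")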
Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.