LAB 3a: BigQuery ML Model Baseline.

Learning Objectives

  1. Create baseline model with BQML
  2. Evaluate baseline model
  3. Calculate RMSE of baseline model

Introduction

In this notebook, we will create a baseline model to predict the weight of a baby before it is born. We will use BigQuery ML to build a linear babyweight prediction model with the base features and no feature engineering, yet.

We will create a baseline model with BQML, evaluate our baseline model, and calculate the its RMSE.

Load necessary libraries

Check that the Google BigQuery library is installed and if not, install it.


In [ ]:
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst

In [1]:
%%bash
sudo pip freeze | grep google-cloud-bigquery==1.6.1 || \
sudo pip install google-cloud-bigquery==1.6.1


google-cloud-bigquery==1.6.1

Verify tables exist

Run the following cells to verify that we previously created the dataset and data tables. If not, go back to lab 1b_prepare_data_babyweight to create them.


In [2]:
%%bigquery
-- LIMIT 0 is a free query; this allows us to check that the table exists.
SELECT * FROM babyweight.babyweight_data_train
LIMIT 0


Out[2]:
weight_pounds is_male mother_age plurality gestation_weeks

In [3]:
%%bigquery
-- LIMIT 0 is a free query; this allows us to check that the table exists.
SELECT * FROM babyweight.babyweight_data_eval
LIMIT 0


Out[3]:
weight_pounds is_male mother_age plurality gestation_weeks

Create the baseline model

Next, we'll create a linear regression baseline model with no feature engineering. We'll use this to compare our later, more complex models against.

Train the "Baseline Model".

When creating a BQML model, you must specify the model type (in our case linear regression) and the input label (weight_pounds). Note also that we are using the training data table as the data source and we don't need BQML to split the data because we have already split it ourselves.


In [4]:
%%bigquery
CREATE OR REPLACE MODEL
    babyweight.baseline_model

OPTIONS (
    MODEL_TYPE="LINEAR_REG",
    INPUT_LABEL_COLS=["weight_pounds"],
    DATA_SPLIT_METHOD="NO_SPLIT") AS

SELECT
    weight_pounds,
    is_male,
    mother_age,
    plurality,
    gestation_weeks
FROM
    babyweight.babyweight_data_train


Out[4]:

REMINDER: The query takes several minutes to complete. After the first iteration is complete, your model (baseline_model) appears in the navigation panel of the BigQuery web UI. Because the query uses a CREATE MODEL statement to create a model, you do not see query results.

You can observe the model as it's being trained by viewing the Model stats tab in the BigQuery web UI. As soon as the first iteration completes, the tab is updated. The stats continue to update as each iteration completes.

Once the training is done, visit the BigQuery Cloud Console and look at the model that has been trained. Then, come back to this notebook.

Evaluate the baseline model

Even though BigQuery can automatically split the data it is given, and training on only a part of the data and using the rest for evaluation, to compare with our custom models later we wanted to decide the split ourselves so that it is completely reproducible.

NOTE: The results are also displayed in the BigQuery Cloud Console under the Evaluation tab.


In [5]:
%%bigquery
-- Information from model training
SELECT * FROM ML.TRAINING_INFO(MODEL babyweight.baseline_model)


Out[5]:
training_run iteration loss eval_loss learning_rate duration_ms
0 0 0 1.139961 None None 34025

Get evaluation statistics for the baseline_model.

After creating your model, you evaluate the performance of the regressor using the ML.EVALUATE function. The ML.EVALUATE function evaluates the predicted values against the actual data.


In [6]:
%%bigquery
SELECT
    *
FROM
    ML.EVALUATE(MODEL babyweight.baseline_model,
    (
    SELECT
        weight_pounds,
        is_male,
        mother_age,
        plurality,
        gestation_weeks
    FROM
        babyweight.babyweight_data_eval
    ))


Out[6]:
mean_absolute_error mean_squared_error mean_squared_log_error median_absolute_error r2_score explained_variance
0 0.829811 1.139579 0.02041 0.674778 0.345083 0.345085

Resource for an explanation of the Regression Metrics.

Write a SQL query to find the RMSE of the evaluation data

Since this is regression, we typically use the RMSE, but natively this is not in the output of our evaluation metrics above. However, we can simply take the SQRT() of the mean squared error of our loss metric from evaluation of the baseline_model to get RMSE.


In [7]:
%%bigquery
SELECT
    SQRT(mean_squared_error) AS rmse
FROM
    ML.EVALUATE(MODEL babyweight.baseline_model,
    (
    SELECT
        weight_pounds,
        is_male,
        mother_age,
        plurality,
        gestation_weeks
    FROM
        babyweight.babyweight_data_eval
    ))


Out[7]:
rmse
0 1.06751

Lab Summary:

In this lab, we created a baseline model with BQML, evaluated our baseline model, and calculated the RMSE of our baseline model.

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License


In [ ]: