Learning Objectives
In this notebook, we will create a baseline model to predict the weight of a baby before it is born. We will use BigQuery ML to build a linear babyweight prediction model with the base features and no feature engineering, yet.
We will create a baseline model with BQML, evaluate our baseline model, and calculate the its RMSE.
Check that the Google BigQuery library is installed and if not, install it.
In [ ]:
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst
In [1]:
%%bash
sudo pip freeze | grep google-cloud-bigquery==1.6.1 || \
sudo pip install google-cloud-bigquery==1.6.1
Run the following cells to verify that we previously created the dataset and data tables. If not, go back to lab 1b_prepare_data_babyweight to create them.
In [2]:
%%bigquery
-- LIMIT 0 is a free query; this allows us to check that the table exists.
SELECT * FROM babyweight.babyweight_data_train
LIMIT 0
Out[2]:
In [3]:
%%bigquery
-- LIMIT 0 is a free query; this allows us to check that the table exists.
SELECT * FROM babyweight.babyweight_data_eval
LIMIT 0
Out[3]:
When creating a BQML model, you must specify the model type (in our case linear regression) and the input label (weight_pounds). Note also that we are using the training data table as the data source and we don't need BQML to split the data because we have already split it ourselves.
In [4]:
%%bigquery
CREATE OR REPLACE MODEL
babyweight.baseline_model
OPTIONS (
MODEL_TYPE="LINEAR_REG",
INPUT_LABEL_COLS=["weight_pounds"],
DATA_SPLIT_METHOD="NO_SPLIT") AS
SELECT
weight_pounds,
is_male,
mother_age,
plurality,
gestation_weeks
FROM
babyweight.babyweight_data_train
Out[4]:
REMINDER: The query takes several minutes to complete. After the first iteration is complete, your model (baseline_model) appears in the navigation panel of the BigQuery web UI. Because the query uses a CREATE MODEL statement to create a model, you do not see query results.
You can observe the model as it's being trained by viewing the Model stats tab in the BigQuery web UI. As soon as the first iteration completes, the tab is updated. The stats continue to update as each iteration completes.
Once the training is done, visit the BigQuery Cloud Console and look at the model that has been trained. Then, come back to this notebook.
Even though BigQuery can automatically split the data it is given, and training on only a part of the data and using the rest for evaluation, to compare with our custom models later we wanted to decide the split ourselves so that it is completely reproducible.
NOTE: The results are also displayed in the BigQuery Cloud Console under the Evaluation tab.
In [5]:
%%bigquery
-- Information from model training
SELECT * FROM ML.TRAINING_INFO(MODEL babyweight.baseline_model)
Out[5]:
In [6]:
%%bigquery
SELECT
*
FROM
ML.EVALUATE(MODEL babyweight.baseline_model,
(
SELECT
weight_pounds,
is_male,
mother_age,
plurality,
gestation_weeks
FROM
babyweight.babyweight_data_eval
))
Out[6]:
Resource for an explanation of the Regression Metrics.
Since this is regression, we typically use the RMSE, but natively this is not in the output of our evaluation metrics above. However, we can simply take the SQRT() of the mean squared error of our loss metric from evaluation of the baseline_model to get RMSE.
In [7]:
%%bigquery
SELECT
SQRT(mean_squared_error) AS rmse
FROM
ML.EVALUATE(MODEL babyweight.baseline_model,
(
SELECT
weight_pounds,
is_male,
mother_age,
plurality,
gestation_weeks
FROM
babyweight.babyweight_data_eval
))
Out[7]:
Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License
In [ ]: