Learning Objectives
In this notebook, we will create multiple linear models to predict the weight of a baby before it is born, using increasing levels of feature engineering using BigQuery ML. If you need a refresher, you can go back and look how we made a baseline model in the previous notebook BQML Baseline Model.
We will create and evaluate a linear model using BigQuery's ML.FEATURE_CROSS, create and evaluate a linear model using BigQuery's ML.FEATURE_CROSS and ML.BUCKETIZE, and create and evaluate a linear model using BigQuery's ML.TRANSFORM.
Check that the Google BigQuery library is installed and if not, install it.
In [1]:
%%bash
sudo pip freeze | grep google-cloud-bigquery==1.6.1 || \
sudo pip install google-cloud-bigquery==1.6.1
Run the following cells to verify that we previously created the dataset and data tables. If not, go back to lab 1b_prepare_data_babyweight to create them.
In [2]:
%%bigquery
-- LIMIT 0 is a free query; this allows us to check that the table exists.
SELECT * FROM babyweight.babyweight_data_train
LIMIT 0
Out[2]:
In [3]:
%%bigquery
-- LIMIT 0 is a free query; this allows us to check that the table exists.
SELECT * FROM babyweight.babyweight_data_eval
LIMIT 0
Out[3]:
BigQuery ML now has ML.FEATURE_CROSS, a pre-processing clause that performs a feature cross with syntax ML.FEATURE_CROSS(STRUCT(features), degree) where features are comma-separated categorical columns and degree is highest degree of all combinations.
In [4]:
%%bigquery
CREATE OR REPLACE MODEL
babyweight.model_1
OPTIONS (
MODEL_TYPE="LINEAR_REG",
INPUT_LABEL_COLS=["weight_pounds"],
L2_REG=0.1,
DATA_SPLIT_METHOD="NO_SPLIT") AS
SELECT
weight_pounds,
is_male,
mother_age,
plurality,
gestation_weeks,
ML.FEATURE_CROSS(
STRUCT(
is_male,
plurality)
) AS gender_plurality_cross
FROM
babyweight.babyweight_data_train
Out[4]:
In [5]:
%%bigquery
SELECT
*
FROM
ML.EVALUATE(MODEL babyweight.model_1,
(
SELECT
weight_pounds,
is_male,
mother_age,
plurality,
gestation_weeks,
ML.FEATURE_CROSS(
STRUCT(
is_male,
plurality)
) AS gender_plurality_cross
FROM
babyweight.babyweight_data_eval
))
Out[5]:
In [6]:
%%bigquery
SELECT
SQRT(mean_squared_error) AS rmse
FROM
ML.EVALUATE(MODEL babyweight.model_1,
(
SELECT
weight_pounds,
is_male,
mother_age,
plurality,
gestation_weeks,
ML.FEATURE_CROSS(
STRUCT(
is_male,
plurality)
) AS gender_plurality_cross
FROM
babyweight.babyweight_data_eval
))
Out[6]:
Bucketize is a pre-processing function that creates "buckets" (e.g bins) - e.g. it bucketizes a continuous numerical feature into a string feature with bucket names as the value with syntax ML.BUCKETIZE(feature, split_points) with split_points being an array of numerical points to determine bucket bounds.
In [7]:
%%bigquery
CREATE OR REPLACE MODEL
babyweight.model_2
OPTIONS (
MODEL_TYPE="LINEAR_REG",
INPUT_LABEL_COLS=["weight_pounds"],
L2_REG=0.1,
DATA_SPLIT_METHOD="NO_SPLIT") AS
SELECT
weight_pounds,
is_male,
mother_age,
plurality,
gestation_weeks,
ML.FEATURE_CROSS(
STRUCT(
is_male,
ML.BUCKETIZE(
mother_age,
GENERATE_ARRAY(15, 45, 1)
) AS bucketed_mothers_age,
plurality,
ML.BUCKETIZE(
gestation_weeks,
GENERATE_ARRAY(17, 47, 1)
) AS bucketed_gestation_weeks
)
) AS crossed
FROM
babyweight.babyweight_data_train
Out[7]:
Let's now retrieve the training statistics and evaluate the model.
In [8]:
%%bigquery
SELECT * FROM ML.TRAINING_INFO(MODEL babyweight.model_2)
Out[8]:
We now evaluate our model on our eval dataset:
In [9]:
%%bigquery
SELECT
*
FROM
ML.EVALUATE(MODEL babyweight.model_2,
(
SELECT
weight_pounds,
is_male,
mother_age,
plurality,
gestation_weeks,
ML.FEATURE_CROSS(
STRUCT(
is_male,
ML.BUCKETIZE(
mother_age,
GENERATE_ARRAY(15, 45, 1)
) AS bucketed_mothers_age,
plurality,
ML.BUCKETIZE(
gestation_weeks,
GENERATE_ARRAY(17, 47, 1)
) AS bucketed_gestation_weeks
)
) AS crossed
FROM
babyweight.babyweight_data_eval))
Out[9]:
Let's select the mean_squared_error from the evaluation table we just computed and square it to obtain the rmse.
In [10]:
%%bigquery
SELECT
SQRT(mean_squared_error) AS rmse
FROM
ML.EVALUATE(MODEL babyweight.model_2,
(
SELECT
weight_pounds,
is_male,
mother_age,
plurality,
gestation_weeks,
ML.FEATURE_CROSS(
STRUCT(
is_male,
ML.BUCKETIZE(
mother_age,
GENERATE_ARRAY(15, 45, 1)
) AS bucketed_mothers_age,
plurality,
ML.BUCKETIZE(
gestation_weeks,
GENERATE_ARRAY(17, 47, 1)
) AS bucketed_gestation_weeks
)
) AS crossed
FROM
babyweight.babyweight_data_eval))
Out[10]:
Let's apply the TRANSFORM clause to the model_3 and run the query.
In [11]:
%%bigquery
CREATE OR REPLACE MODEL
babyweight.model_3
TRANSFORM(
weight_pounds,
is_male,
mother_age,
plurality,
gestation_weeks,
ML.FEATURE_CROSS(
STRUCT(
is_male,
ML.BUCKETIZE(
mother_age,
GENERATE_ARRAY(15, 45, 1)
) AS bucketed_mothers_age,
plurality,
ML.BUCKETIZE(
gestation_weeks,
GENERATE_ARRAY(17, 47, 1)
) AS bucketed_gestation_weeks
)
) AS crossed
)
OPTIONS (
MODEL_TYPE="LINEAR_REG",
INPUT_LABEL_COLS=["weight_pounds"],
L2_REG=0.1,
DATA_SPLIT_METHOD="NO_SPLIT") AS
SELECT
*
FROM
babyweight.babyweight_data_train
Out[11]:
Let's retrieve the training statistics:
In [12]:
%%bigquery
SELECT * FROM ML.TRAINING_INFO(MODEL babyweight.model_3)
Out[12]:
We now evaluate our model on our eval dataset:
In [13]:
%%bigquery
SELECT
*
FROM
ML.EVALUATE(MODEL babyweight.model_3,
(
SELECT
*
FROM
babyweight.babyweight_data_eval
))
Out[13]:
Let's select the mean_squared_error from the evaluation table we just computed and square it to obtain the rmse.
In [14]:
%%bigquery
SELECT
SQRT(mean_squared_error) AS rmse
FROM
ML.EVALUATE(MODEL babyweight.model_3,
(
SELECT
*
FROM
babyweight.babyweight_data_eval
))
Out[14]:
Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License
In [ ]: