In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
This tutorial shows how to train and evaluate a Binary Classification model using BigQuery ML (BQML), we also will use this model to serve predictions.
The first two parts of the tutorial walk through extracting the data, preparing it before you train your model.
The last part of the tutorial digs into the training code used for this model, with a particular focus on requirements for making it compatible with BigQuery ML.
This tutorial uses the United States Census Income Dataset provided by the UC Irvine Machine Learning Repository. This dataset contains information about people from a 1994 Census database, including age, education, marital status, occupation, and whether they make more than $50,000 a year.
Dataset now exists in BigQuery Public Datasets:
`bigquery-public-data.ml_datasets.census_adult_income`
The goal is to train a Binary Classification model using BigQuery ML that predicts whether a person makes more than $50,000 a year (target label) based on other Census information about the person (features).
This tutorial focuses more on using this model with BigQuery ML than on the design of the model itself. However, it's always important to think about potential problems and unintended consequences when building machine learning systems. See the Machine Learning Crash Course exercise about fairness to learn about sources of bias in the Census dataset, as well as machine learning fairness more generally.
You must do several things before you can train a model in BigQuery ML:
This tutorial uses billable components of Cloud Platform, including:
Learn about Google BigQuery pricing and Cloud Storage pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.
If you are using Colab or AI Platform Notebooks, your environment already meets all the requirements to run this notebook. You can skip this step.
Otherwise, make sure your environment meets this notebook's requirements. You need the following:
The Google Cloud guide to Setting up a Python development environment and the Jupyter installation guide provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:
Install virtualenv and create a virtual environment that uses Python 3.
Activate that environment. Run pip install jupyter
in a shell to install
Jupyter.
Run jupyter notebook
in a shell to launch Jupyter.
Open this notebook in the Jupyter Notebook Dashboard.
The following steps are required, regardless of your notebook environment.
BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.
Enter your project ID in the cell below.
In [0]:
PROJECT_ID = "[your-gpc-project]" #@param {type:"string"}
! gcloud config set project $PROJECT_ID
If you are using AI Platform Notebooks, your environment is already authenticated. Skip this step.
If you are using Colab, run the cell below and follow the instructions when prompted to authenticate your account via oAuth.
Otherwise, follow these steps:
In the GCP Console, go to the Create service account key page.
From the Service account drop-down list, select New service account.
In the Service account name field, enter a name.
From the Role drop-down list, select BigQuery > BigQuery Admin and Storage > Storage Object Admin.
Click Create. A JSON file that contains your key downloads to your computer.
Enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS
variable in the cell below.
In [0]:
import sys
# If you are running this notebook in Colaboratory, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Google Cloud Storage bucket and lets us submit training jobs and prediction
# requests.
if 'google.colab' in sys.modules:
from google.colab import auth as google_auth
google_auth.authenticate_user()
# If you are running this notebook locally, replace the string below with the
# path to your service account key and run this cell to authenticate your GCP
# account.
else:
% env GOOGLE_APPLICATION_CREDENTIALS '/path/to/your/service-account-key.json'
In [0]:
# Data processing
import pandas as pd
# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
# BigQuery API
from google.cloud import bigquery
# Show software versions
print(__import__('sys').version)
print(pd.__version__)
In [0]:
# These are the features in the dataset.
# Dataset information: https://archive.ics.uci.edu/ml/datasets/census+income
# which exists now in: `bigquery-public-data.ml_datasets.census_adult_income`
_COLUMNS = [
'age', 'workclass', 'functional_weight', 'education', 'education_num',
'marital_status', 'occupation', 'relationship', 'race', 'sex',
'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
'income_bracket'
]
# This is the label (target) we want to predict.
_LABEL_COLUMN = 'income_bracket'
In [0]:
"""
These are columns we will not use as features for training. There are many
reasons not to use certain attributes of data for training. Perhaps their
values are noisy or inconsistent, or perhaps they encode bias that we do not
want our model to learn. For a deep dive into the features of this Census
dataset and the challenges they pose, see the Introduction to ML Fairness
notebook:
("https://colab.research.google.com/github/google/eng-edu/blob/master/ml/cc/"
"exercises/intro_to_fairness.ipynb")
"""
UNUSED_COLUMNS = ['functional_weight', 'education', 'sex']
The Census datasets contains both numbers and strings we need to convert string data into numbers to be able to train the model.
BigQuery ML supports handling categorical data:
For all non-numeric columns other than TIMESTAMP, BigQuery ML performs a one-hot encoding transformation. This transformation generates a separate feature for each unique value in the column.
Learn more about feature engineering and bias in data.
In [0]:
client = bigquery.Client(location='US', project=PROJECT_ID)
In [0]:
# Dataset and table information
dataset_name = 'census_tutorial'
# Create BigQuery dataset
dataset = client.create_dataset(dataset_name)
In this case we will split our data in 80/10/10 for training, validation and testing.
https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning
For machine learning, you want repeatable sampling of the data you have in BigQuery. To get the validation data: change the < 8 in the query above to = 8, and for testing data, change it to = 9. This way, you get 10% of samples in validation and 10% in testing.
In [0]:
# This query will process 4.8 MB when run:
query = """
SELECT
age,
workclass,
functional_weight,
education,
education_num,
marital_status,
occupation,
relationship,
race,
sex,
capital_gain,
capital_loss,
hours_per_week,
native_country,
income_bracket
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
"""
dataset = client.query(query).to_dataframe()
In [0]:
query = """
SELECT
age,
workclass,
functional_weight,
education,
education_num,
marital_status,
occupation,
relationship,
race,
sex,
capital_gain,
capital_loss,
hours_per_week,
native_country,
income_bracket
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
WHERE
MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) < 8
"""
train_dataset = client.query(query).to_dataframe()
In [0]:
query = """
SELECT
age,
workclass,
functional_weight,
education,
education_num,
marital_status,
occupation,
relationship,
race,
sex,
capital_gain,
capital_loss,
hours_per_week,
native_country,
income_bracket
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
WHERE
MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) = 8
"""
eval_dataset = client.query(query).to_dataframe()
In [0]:
query = """
SELECT
age,
workclass,
functional_weight,
education,
education_num,
marital_status,
occupation,
relationship,
race,
sex,
capital_gain,
capital_loss,
hours_per_week,
native_country,
income_bracket
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
WHERE
MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) = 9
"""
test_dataset = client.query(query).to_dataframe()
In [0]:
len(dataset), len(train_dataset), len(eval_dataset), len(test_dataset)
Pandas provides the .corr
method. Is used to find the pairwise correlation of all columns in the dataframe. Any na
values are automatically excluded. For any non-numeric data type columns in the dataframe it is ignored.
The term correlation refers to a mutual relationship or association between quantities.
The closer ρ is to 1, the more an increase in one variable associates with an increase in the other. On the other hand, the closer ρ is to -1, the increase in one variable would result in a decrease in the other. Note that if X and Y are independent, then ρ is close to 0, but not vice versa! In other words, Pearson correlation can be small even if there is a strong relationship between two variables
In [0]:
hmap = dataset.corr(method='pearson')
plt.subplots(figsize=(12, 9))
sns.heatmap(hmap, vmax=0.8 ,annot=True, cmap="BrBG", square=True)
When we look at the numerical features they do not have a strong correlation. The numerical attributes have a significant number of unique values:
functional_weight
, has more than 28,000 unique values for a set of ~32000 values. This may indicate that this feature might not be a significant predictor.
In [0]:
# Visualize interactions
sns.pairplot(dataset, hue='income_bracket')
In this plot you can see the different samples and their values, this is important when you are doing feature engineering.
In [0]:
# Count of >50K & <=50K
sns.countplot(dataset['income_bracket'])
There is an imbalance in the proportion of labels, with 24720 values for <=50K, and 7841 for => 50K. however, 24% of the total should be sufficient to determine class through patterns. In other cases techniques like, boosting can be applied.
In [0]:
# Checking Empty records (NULL) OR (?) and their percentage overall
query = """
SELECT
COUNTIF(workclass IS NULL
OR LTRIM(workclass) LIKE '?') AS workclass,
ROUND(COUNTIF(workclass IS NULL
OR LTRIM(workclass) LIKE '?') / COUNT(workclass) * 100)
AS workclass_percentage,
COUNTIF(occupation IS NULL
OR LTRIM(occupation) LIKE '?') AS occupation,
ROUND(COUNTIF(occupation IS NULL
OR LTRIM(occupation) LIKE '?') / COUNT(occupation) * 100)
AS occupation_percentage,
COUNTIF(native_country IS NULL
OR LTRIM(native_country) LIKE '?') AS native_country,
ROUND(COUNTIF(native_country IS NULL
OR LTRIM(native_country) LIKE '?') / COUNT(native_country) * 100)
AS native_country_percentage
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
"""
client.query(query).to_dataframe()
In [0]:
# Checking for workclass values.
query = """
SELECT
workclass,
COUNT(workclass) AS total_workclass
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY workclass
ORDER BY total_workclass DESC
"""
client.query(query).to_dataframe()
In [0]:
# Checking for occupation values.
query = """
SELECT
occupation,
COUNT(occupation) AS total_occupation
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY occupation
ORDER BY total_occupation DESC
"""
client.query(query).to_dataframe()
In [0]:
# Checking for native_country values.
query = """
SELECT
native_country,
COUNT(native_country) AS total_native_country
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY native_country
ORDER BY total_native_country DESC
"""
client.query(query).to_dataframe()
After running the SQL queries above we have found:
workclass
has almost 70% instances of Private, the Unknown (?) can be imputed with this value.
native_country
, 90% of the instances are United States
which can be used to impute for the Unknown (?) values. Same cannot be said about occupation
column as the values are more distributed.
occupation
the missing values account for 6% of the instances, it might be possible to remove these instances without replacement.Country:
In [0]:
# Checking for capital_gain
# Total records: 32561
query = """
SELECT
capital_gain,
income_bracket,
COUNT(capital_gain) AS capital_gain_records
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY capital_gain, income_bracket
ORDER BY capital_gain_records DESC
"""
client.query(query).to_dataframe()
In [0]:
# Checking for capital_loss
# Total records: 32561
query = """
SELECT
capital_loss,
income_bracket,
COUNT(capital_loss) AS capital_loss_records
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY capital_loss, income_bracket
ORDER BY capital_loss_records DESC
"""
client.query(query).to_dataframe()
Let's take a look at the following numerical attributes:
capital_gain
and capital_loss
each have close to 100 unique values, although the majority of their instances have zero values.
capital_gain
has 72% instances with zero values for less than 50K and 19% instances with zero values for >50K.
capital_loss
has 73% instances with zero values for less than 50K and 21% instances with zero values for >50K.
This implies that capital_gain
or capital_loss
will not make significant predictors either.
In [0]:
# Checking for education
query = """
SELECT
education,
education_num
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY education, education_num
ORDER BY education_num
"""
client.query(query).to_dataframe()
education
and education_number
are indicators of the same attribute, and are fully correlated with direct mapping, it makes sense to remove one of them during feature selection.
In [0]:
# Explore Education Num vs Income
g = sns.catplot(x="education_num", y="income_bracket", data=dataset,kind="bar",
height = 6,palette = "muted")
g.despine(left=True)
g = g.set_ylabels(">50K probability")
In [0]:
# Checking for marital_status
query = """
SELECT
marital_status
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY marital_status
"""
client.query(query).to_dataframe()
In [0]:
# Checking for relationship
query = """
SELECT
relationship
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY relationship
"""
client.query(query).to_dataframe()
As relationship and marital status are features that describe similar status,
we may be able to drop relationship
and keep marital_status
and create a new feature with new values.
We will start Model training using the Public Dataset:
The CREATE MODEL
clause is used to create and train the model named census_tutorial.census_model
.
`CREATE OR REPLACE MODEL`
Creates and trains a model and replaces an existing model with the same name in the specified dataset. Documentation here
`OPTIONS(model_type='logistic_reg', input_label_cols=['income_bracket'])`
Clause indicates that you are creating a logistic regression model. This option creates a logistic regression model or a multiclass logistic regression model. For logistic regression models, the label column must contain only two distinct values. When you create a multiclass logistic regression model, specify training data that contains more than two unique labels.
model_type: logistic_reg creates a logistic regression model or a multiclass logistic regression model.
auto_class_weights: By default, the training data used to create a multiclass logistic regression model is unweighted. If the labels in the training data are imbalanced, the model may learn to predict the most popular class of labels more heavily, which may not be desired. Class weights can be used to balance the class labels and can be used for logistic and multiclass logistic regressions. If set to true, the weights for each class are calculated in inverse proportion to the frequency of that class. To balance every class, use the following formula: TOTAL_INPUT_ROWS / (INPUT_ROWS_FOR_CLASS_N * NUMBER_OF_UNIQUE_CLASSES)
data_split_method: The method to split input data into training and evaluation sets. Training data is used to train the model. Evaluation data is used to avoid overfitting via early stopping. The default value is auto_split.
input_label_cols: The label column name(s) in the training data. input_label_cols accepts an array of strings, but only one array element is supported for linear_reg and logistic_reg models. If input_label_cols is unspecified, the column named "label" in the training data is used. If neither exists, the query fails.
max_iterations: The maximum number of training iterations (steps). The default value is 20.
When you use a CREATE MODEL
statement, the size of the model must be 90 MB or less or the query fails. Generally, if all categorical variables are short strings, a total feature cardinality (model dimension) of 5-10 million is supported. The dimensionality is dependent on the cardinality and length of the string variables.
When you use a CREATE MODEL
statement, the label column cannot contain NULL values. If the label column contains NULL values, the query fails.
1) For all numeric columns, BigQuery ML standardizes and centers the column at zero before passing it into training.
2) In SQL query we skip the following columns for training data:
['functional_weight', 'education', 'sex', 'relationship']
3) Categorical features will be converted to numerical by BQML.
4) The Unknown values in workclass
and native_country
are replaced with Private and United States respectively.
The instances with Unknown values for occupation
are removed.
5) Drop relationship
and use marital_status
in a simplified manner. (You can use Chi Square which is commonly used for testing relationships between categorical variables (martial_status vs relationship).
6) Duplicates in the train set are removed.
Result of the model creation will be an Empty DataFrame
this is normal.
In [0]:
# Train a BQML model
train_query = """
CREATE OR REPLACE MODEL `census_tutorial.census_model`
OPTIONS (
model_type='logistic_reg',
auto_class_weights=true,
data_split_method='no_split',
input_label_cols=['income_bracket'],
max_iterations=15) AS
SELECT
age,
CASE
WHEN workclass IS NULL THEN 'Private'
WHEN LTRIM(workclass) LIKE '?' THEN 'Private'
ELSE workclass
END AS workclass,
CASE
WHEN native_country IS NULL THEN 'United States'
WHEN LTRIM(native_country) LIKE '?' THEN 'United States'
ELSE native_country
END AS native_country,
CASE
WHEN LTRIM(marital_status) IN
(
'Never-married',
'Divorced',
'Separated',
'Widowed'
) THEN 'Single'
WHEN LTRIM(marital_status) IN
(
'Married-civ-spouse',
'Married-spouse-absent',
'Married-AF-spouse'
) THEN 'Married'
ELSE NULL
END AS marital_status,
education_num,
occupation,
race,
hours_per_week,
income_bracket
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
WHERE
MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) < 8
AND (occupation IS NOT NULL OR LTRIM(occupation) NOT LIKE '?%')
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9
"""
client.query(train_query)
A machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss. This process is called empirical risk minimization.
Loss is the penalty for a bad prediction, a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.
In [0]:
training_info = """
SELECT
training_run,
iteration,
loss,
eval_loss,
duration_ms,
learning_rate
FROM
ML.TRAINING_INFO(MODEL `census_tutorial.census_model`)
ORDER BY iteration ASC
"""
client.query(training_info).to_dataframe()
In [0]:
# Perform model evaluation
query_evaluate = """
SELECT
precision,
recall,
accuracy,
f1_score,
log_loss,
roc_auc
FROM ML.EVALUATE (MODEL `census_tutorial.census_model`,
(
SELECT
age,
CASE
WHEN workclass IS NULL THEN 'Private'
WHEN LTRIM(workclass) LIKE '?' THEN 'Private'
ELSE workclass
END AS workclass,
CASE
WHEN native_country IS NULL THEN 'United States'
WHEN LTRIM(native_country) LIKE '?' THEN 'United States'
ELSE native_country
END AS native_country,
CASE
WHEN LTRIM(marital_status) IN
(
'Never-married',
'Divorced',
'Separated',
'Widowed'
) THEN 'Single'
WHEN LTRIM(marital_status) IN
(
'Married-civ-spouse',
'Married-spouse-absent',
'Married-AF-spouse'
) THEN 'Married'
ELSE NULL
END AS marital_status,
education_num,
occupation,
race,
hours_per_week,
income_bracket
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
WHERE
MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) = 8
AND (occupation IS NOT NULL OR LTRIM(occupation) NOT LIKE '?%')
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9
))
"""
evaluation_job = client.query(query_evaluate).to_dataframe()
In [0]:
evaluation_job
In [0]:
# Perform model evaluation
query_roc_curve = """
SELECT
threshold,
recall,
false_positive_rate,
true_positives,
false_positives,
true_negatives,
false_negatives
FROM
ML.ROC_CURVE(MODEL `census_tutorial.census_model`,
(
SELECT
age,
CASE
WHEN workclass IS NULL THEN 'Private'
WHEN LTRIM(workclass) LIKE '?' THEN 'Private'
ELSE workclass
END AS workclass,
CASE
WHEN native_country IS NULL THEN 'United States'
WHEN LTRIM(native_country) LIKE '?' THEN 'United States'
ELSE native_country
END AS native_country,
CASE
WHEN LTRIM(marital_status) IN
(
'Never-married',
'Divorced',
'Separated',
'Widowed'
) THEN 'Single'
WHEN LTRIM(marital_status) IN
(
'Married-civ-spouse',
'Married-spouse-absent',
'Married-AF-spouse'
) THEN 'Married'
ELSE NULL
END AS marital_status,
education_num,
occupation,
race,
hours_per_week,
income_bracket
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
WHERE
MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) = 8
AND (occupation IS NOT NULL OR LTRIM(occupation) NOT LIKE '?%')
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9))
"""
roc_curve_job = client.query(query_roc_curve).to_dataframe()
roc_curve_job
In [0]:
# Visualize History for Accuracy.
plt.title('BigQuery ML Model accuracy')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.02])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
fpr = roc_curve_job['false_positive_rate']
tpr = roc_curve_job['recall']
plt.plot(fpr, tpr, color='darkorange')
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.show()
Now you will use your model to predict outcomes. The following query uses the ML.PREDICT. The query returns these columns:
Note: Normally for prediction, your label will be empty, in this example we use it to be able to compare the model result vs expected label.
In [0]:
query_prediction = """
SELECT
income_bracket,
predicted_income_bracket,
predicted_income_bracket_probs
FROM
ML.PREDICT(MODEL `census_tutorial.census_model`,
(
SELECT
age,
CASE
WHEN workclass IS NULL THEN 'Private'
WHEN LTRIM(workclass) LIKE '?' THEN 'Private'
ELSE workclass
END AS workclass,
CASE
WHEN native_country IS NULL THEN 'United States'
WHEN LTRIM(native_country) LIKE '?' THEN 'United States'
ELSE native_country
END AS native_country,
CASE
WHEN LTRIM(marital_status) IN
(
'Never-married',
'Divorced',
'Separated',
'Widowed'
) THEN 'Single'
WHEN LTRIM(marital_status) IN
(
'Married-civ-spouse',
'Married-spouse-absent',
'Married-AF-spouse'
) THEN 'Married'
ELSE NULL
END AS marital_status,
education_num,
occupation,
race,
hours_per_week,
income_bracket
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
WHERE
MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) = 9
AND occupation IS NOT NULL AND LTRIM(occupation) NOT LIKE '?%'
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9
))
"""
predictions = client.query(query_prediction).to_dataframe()
We will see a few predictions and the expected result:
In [0]:
predictions[['income_bracket', 'predicted_income_bracket']].head()
In [0]:
predictions['predicted_income_bracket_probs'].head()
In this case the first prediction probability is ~ 0.53, which corresponds to >50K
.
As you can see is very easy to load data into BigQuery and create a model to start training and serving using SQL language only.
Count the number of correct predictions:
In [0]:
_count = predictions['income_bracket'].str.strip().str.lower() == \
predictions['predicted_income_bracket'].str.strip().str.lower()
# Group predictions:
_count.value_counts(normalize=True)
# Note: Set normalize=False to see grouped results.