Copyright 2019 Google LLC



In [0]:

    
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Getting started: Training and Prediction with BigQuery ML

Overview

This tutorial shows how to train and evaluate a Binary Classification model using BigQuery ML (BQML), we also will use this model to serve predictions.

The first two parts of the tutorial walk through extracting the data, preparing it before you train your model.

The last part of the tutorial digs into the training code used for this model, with a particular focus on requirements for making it compatible with BigQuery ML.

Dataset

This tutorial uses the United States Census Income Dataset provided by the UC Irvine Machine Learning Repository. This dataset contains information about people from a 1994 Census database, including age, education, marital status, occupation, and whether they make more than $50,000 a year.

Dataset now exists in BigQuery Public Datasets:

`bigquery-public-data.ml_datasets.census_adult_income`

Goal

The goal is to train a Binary Classification model using BigQuery ML that predicts whether a person makes more than $50,000 a year (target label) based on other Census information about the person (features).

This tutorial focuses more on using this model with BigQuery ML than on the design of the model itself. However, it's always important to think about potential problems and unintended consequences when building machine learning systems. See the Machine Learning Crash Course exercise about fairness to learn about sources of bias in the Census dataset, as well as machine learning fairness more generally.

Before you begin

You must do several things before you can train a model in BigQuery ML:

Set up a Google Cloud Platform (GCP) project with billing and the necessary APIs enabled. Take a look here for a Free Trial
Set up your development environment. (Skip this step if you're using Colaboratory.)
Authenticate your GCP account in this notebook.
Create a Google Cloud Storage bucket to store your training data.

Costs

This tutorial uses billable components of Cloud Platform, including:

Google BigQuery
BigQuery ML

Learn about Google BigQuery pricing and Cloud Storage pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.

Set up your development environment

If you are using Colab or AI Platform Notebooks, your environment already meets all the requirements to run this notebook. You can skip this step.

Otherwise, make sure your environment meets this notebook's requirements. You need the following:

The Google Cloud SDK
Python 3
virtualenv
Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to Setting up a Python development environment and the Jupyter installation guide provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:

Install and initialize the Cloud SDK.
Install Python 3.
Install virtualenv and create a virtual environment that uses Python 3.
Activate that environment. Run pip install jupyter in a shell to install Jupyter.
Run jupyter notebook in a shell to launch Jupyter.
Open this notebook in the Jupyter Notebook Dashboard.

Set up your GCP project

The following steps are required, regardless of your notebook environment.

Select or create a GCP project.
Make sure that billing is enabled for your project.
BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.
Enter your project ID in the cell below.



In [0]:

    
PROJECT_ID = "[your-gpc-project]" #@param {type:"string"}
! gcloud config set project $PROJECT_ID

Authenticate your GCP account

If you are using AI Platform Notebooks, your environment is already authenticated. Skip this step.

If you are using Colab, run the cell below and follow the instructions when prompted to authenticate your account via oAuth.

Otherwise, follow these steps:

In the GCP Console, go to the Create service account key page.
From the Service account drop-down list, select New service account.
In the Service account name field, enter a name.
From the Role drop-down list, select BigQuery > BigQuery Admin and Storage > Storage Object Admin.
Click Create. A JSON file that contains your key downloads to your computer.
Enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below.



In [0]:

    
import sys

# If you are running this notebook in Colaboratory, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Google Cloud Storage bucket and lets us submit training jobs and prediction
# requests.

if 'google.colab' in sys.modules:
  from google.colab import auth as google_auth
  google_auth.authenticate_user()

# If you are running this notebook locally, replace the string below with the
# path to your service account key and run this cell to authenticate your GCP
# account.
else:
  % env GOOGLE_APPLICATION_CREDENTIALS '/path/to/your/service-account-key.json'

ML Workflow using a BigQuery model

In this section we will build a Keras model from scratch. We will perform the following steps:

Data preparation
Model creation
Model training
Model evaluation
Model serving

We will create the model and export it to serve requests in BigQuery ML.

Import libraries

Import supporting modules:



In [0]:

    
# Data processing
import pandas as pd

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# BigQuery API
from google.cloud import bigquery 

# Show software versions
print(__import__('sys').version)
print(pd.__version__)

Define Constants



In [0]:

    
# These are the features in the dataset.
# Dataset information: https://archive.ics.uci.edu/ml/datasets/census+income
# which exists now in: `bigquery-public-data.ml_datasets.census_adult_income`
_COLUMNS = [
    'age', 'workclass', 'functional_weight', 'education', 'education_num',
    'marital_status', 'occupation', 'relationship', 'race', 'sex',
    'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
    'income_bracket'
]

# This is the label (target) we want to predict.
_LABEL_COLUMN = 'income_bracket'

Unused Features and Features that are Biased

These columns can be skipped when using SQL when creating the training table.



In [0]:

    
"""
These are columns we will not use as features for training. There are many
reasons not to use certain attributes of data for training. Perhaps their
values are noisy or inconsistent, or perhaps they encode bias that we do not
want our model to learn. For a deep dive into the features of this Census
dataset and the challenges they pose, see the Introduction to ML Fairness
notebook: 
("https://colab.research.google.com/github/google/eng-edu/blob/master/ml/cc/"
"exercises/intro_to_fairness.ipynb")
"""

UNUSED_COLUMNS = ['functional_weight', 'education', 'sex']

Data preparation

Process Numerical and Categorical columns

The Census datasets contains both numbers and strings we need to convert string data into numbers to be able to train the model.

BigQuery ML supports handling categorical data:

For all non-numeric columns other than TIMESTAMP, BigQuery ML performs a one-hot encoding transformation. This transformation generates a separate feature for each unique value in the column.

Learn more about feature engineering and bias in data.

Create BigQuery client

Create a BigQuery client to bundle configuration needed for API requests.



In [0]:

    
client = bigquery.Client(location='US', project=PROJECT_ID)

Create BigQuery dataset

We will create a dataset called census_tutorial,inside this dataset we will create our BQML model.



In [0]:

    
# Dataset and table information
dataset_name = 'census_tutorial'

# Create BigQuery dataset
dataset = client.create_dataset(dataset_name)

Extract Training and Evaluation dataframes:

In this case we will split our data in 80/10/10 for training, validation and testing.

https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning

For machine learning, you want repeatable sampling of the data you have in BigQuery. To get the validation data: change the < 8 in the query above to = 8, and for testing data, change it to = 9. This way, you get 10% of samples in validation and 10% in testing.



In [0]:

    
# This query will process 4.8 MB when run:
query = """
SELECT
    age,
    workclass,
    functional_weight,
    education,
    education_num,
    marital_status,
    occupation,
    relationship,
    race,
    sex,
    capital_gain,
    capital_loss,
    hours_per_week,
    native_country,
    income_bracket
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`
"""
dataset = client.query(query).to_dataframe()

Training dataset



In [0]:

    
query = """
SELECT
    age,
    workclass,
    functional_weight,
    education,
    education_num,
    marital_status,
    occupation,
    relationship,
    race,
    sex,
    capital_gain,
    capital_loss,
    hours_per_week,
    native_country,
    income_bracket
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`
WHERE
  MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) < 8
"""
train_dataset = client.query(query).to_dataframe()

Evaluation dataset



In [0]:

    
query = """
SELECT
    age,
    workclass,
    functional_weight,
    education,
    education_num,
    marital_status,
    occupation,
    relationship,
    race,
    sex,
    capital_gain,
    capital_loss,
    hours_per_week,
    native_country,
    income_bracket
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`
WHERE
  MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) = 8
"""
eval_dataset = client.query(query).to_dataframe()

Testing dataset



In [0]:

    
query = """
SELECT
    age,
    workclass,
    functional_weight,
    education,
    education_num,
    marital_status,
    occupation,
    relationship,
    race,
    sex,
    capital_gain,
    capital_loss,
    hours_per_week,
    native_country,
    income_bracket
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`
WHERE
  MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) = 9
"""
test_dataset = client.query(query).to_dataframe()

DataFrame Length

List length of training and testing data. This will validate that dataset contains data before we start processing it.



In [0]:

    
len(dataset), len(train_dataset), len(eval_dataset), len(test_dataset)

Visualize data

Pandas provides the .corr method. Is used to find the pairwise correlation of all columns in the dataframe. Any na values are automatically excluded. For any non-numeric data type columns in the dataframe it is ignored.

The term correlation refers to a mutual relationship or association between quantities.

The closer ρ is to 1, the more an increase in one variable associates with an increase in the other. On the other hand, the closer ρ is to -1, the increase in one variable would result in a decrease in the other. Note that if X and Y are independent, then ρ is close to 0, but not vice versa! In other words, Pearson correlation can be small even if there is a strong relationship between two variables



In [0]:

    
hmap = dataset.corr(method='pearson')
plt.subplots(figsize=(12, 9))
sns.heatmap(hmap, vmax=0.8 ,annot=True, cmap="BrBG", square=True)

When we look at the numerical features they do not have a strong correlation. The numerical attributes have a significant number of unique values:

functional_weight, has more than 28,000 unique values for a set of ~32000 values.

This may indicate that this feature might not be a significant predictor.



In [0]:

    
# Visualize interactions
sns.pairplot(dataset, hue='income_bracket')

In this plot you can see the different samples and their values, this is important when you are doing feature engineering.



In [0]:

    
# Count of >50K & <=50K
sns.countplot(dataset['income_bracket'])

There is an imbalance in the proportion of labels, with 24720 values for <=50K, and 7841 for => 50K. however, 24% of the total should be sufficient to determine class through patterns. In other cases techniques like, boosting can be applied.

Feature Engineering

Numerical data analysis

We will do a numeric data analysis to check on the different features and the correlation with income bracket.

In the categorical variables, there are three attributes with unknown/missing values:

workclass (6%)
occupation (6%)
native_country (2%)



In [0]:

    
# Checking Empty records (NULL) OR (?) and their percentage overall
query = """
SELECT
  COUNTIF(workclass IS NULL 
    OR LTRIM(workclass) LIKE '?') AS workclass,
  ROUND(COUNTIF(workclass IS NULL 
    OR LTRIM(workclass) LIKE '?') / COUNT(workclass) * 100) 
    AS workclass_percentage,
  COUNTIF(occupation IS NULL 
    OR LTRIM(occupation) LIKE '?') AS occupation,  
  ROUND(COUNTIF(occupation IS NULL 
    OR LTRIM(occupation) LIKE '?') / COUNT(occupation) * 100) 
    AS occupation_percentage,
  COUNTIF(native_country IS NULL 
    OR LTRIM(native_country) LIKE '?') AS native_country,
  ROUND(COUNTIF(native_country IS NULL 
    OR LTRIM(native_country) LIKE '?') / COUNT(native_country) * 100) 
    AS native_country_percentage
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`
"""
client.query(query).to_dataframe()



In [0]:

    
# Checking for workclass values.
query = """
SELECT
  workclass,
  COUNT(workclass) AS total_workclass
FROM 
  `bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY workclass
ORDER BY total_workclass DESC
"""
client.query(query).to_dataframe()



In [0]:

    
# Checking for occupation values.
query = """
SELECT
  occupation,
  COUNT(occupation) AS total_occupation
FROM 
  `bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY occupation
ORDER BY total_occupation DESC
"""
client.query(query).to_dataframe()



In [0]:

    
# Checking for native_country values.
query = """
SELECT
  native_country,
  COUNT(native_country) AS total_native_country
FROM 
  `bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY native_country
ORDER BY total_native_country DESC
"""
client.query(query).to_dataframe()

After running the SQL queries above we have found:

workclass has almost 70% instances of Private, the Unknown (?) can be imputed with this value.
native_country, 90% of the instances are United States which can be used to impute for the Unknown (?) values. Same cannot be said about occupation column as the values are more distributed.

occupation the missing values account for 6% of the instances, it might be possible to remove these instances without replacement.

Country:

Hong: Refers to Hong Kong
South: Probably refers to South Korea



In [0]:

    
# Checking for capital_gain
# Total records: 32561
query = """
SELECT
  capital_gain,
  income_bracket,
  COUNT(capital_gain) AS capital_gain_records
FROM 
  `bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY capital_gain, income_bracket  
ORDER BY capital_gain_records DESC
"""
client.query(query).to_dataframe()



In [0]:

    
# Checking for capital_loss
# Total records: 32561
query = """
SELECT
  capital_loss,
  income_bracket,
  COUNT(capital_loss) AS capital_loss_records
FROM 
  `bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY capital_loss, income_bracket
ORDER BY capital_loss_records DESC
"""
client.query(query).to_dataframe()

Let's take a look at the following numerical attributes:

capital_gain and capital_loss each have close to 100 unique values, although the majority of their instances have zero values.
capital_gain has 72% instances with zero values for less than 50K and 19% instances with zero values for >50K.
capital_loss has 73% instances with zero values for less than 50K and 21% instances with zero values for >50K.

This implies that capital_gain or capital_loss will not make significant predictors either.



In [0]:

    
# Checking for education
query = """
SELECT
  education,
  education_num      
FROM 
  `bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY education, education_num
ORDER BY education_num
"""
client.query(query).to_dataframe()

education and education_number are indicators of the same attribute, and are fully correlated with direct mapping, it makes sense to remove one of them during feature selection.



In [0]:

    
# Explore Education Num vs Income
g = sns.catplot(x="education_num", y="income_bracket", data=dataset,kind="bar", 
                height = 6,palette = "muted")
g.despine(left=True)
g = g.set_ylabels(">50K probability")



In [0]:

    
# Checking for marital_status
query = """
SELECT
  marital_status            
FROM 
  `bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY marital_status
"""
client.query(query).to_dataframe()



In [0]:

    
# Checking for relationship
query = """
SELECT
  relationship            
FROM 
  `bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY relationship
"""
client.query(query).to_dataframe()

As relationship and marital status are features that describe similar status, we may be able to drop relationship and keep marital_status and create a new feature with new values.

Model training

We will start Model training using the Public Dataset:

The CREATE MODEL clause is used to create and train the model named census_tutorial.census_model.

`CREATE OR REPLACE MODEL`

Creates and trains a model and replaces an existing model with the same name in the specified dataset. Documentation here

`OPTIONS(model_type='logistic_reg', input_label_cols=['income_bracket'])`

Clause indicates that you are creating a logistic regression model. This option creates a logistic regression model or a multiclass logistic regression model. For logistic regression models, the label column must contain only two distinct values. When you create a multiclass logistic regression model, specify training data that contains more than two unique labels.

model_type: logistic_reg creates a logistic regression model or a multiclass logistic regression model.
auto_class_weights: By default, the training data used to create a multiclass logistic regression model is unweighted. If the labels in the training data are imbalanced, the model may learn to predict the most popular class of labels more heavily, which may not be desired. Class weights can be used to balance the class labels and can be used for logistic and multiclass logistic regressions. If set to true, the weights for each class are calculated in inverse proportion to the frequency of that class. To balance every class, use the following formula: TOTAL_INPUT_ROWS / (INPUT_ROWS_FOR_CLASS_N * NUMBER_OF_UNIQUE_CLASSES)
data_split_method: The method to split input data into training and evaluation sets. Training data is used to train the model. Evaluation data is used to avoid overfitting via early stopping. The default value is auto_split.
input_label_cols: The label column name(s) in the training data. input_label_cols accepts an array of strings, but only one array element is supported for linear_reg and logistic_reg models. If input_label_cols is unspecified, the column named "label" in the training data is used. If neither exists, the query fails.
max_iterations: The maximum number of training iterations (steps). The default value is 20.

When you use a CREATE MODEL statement, the size of the model must be 90 MB or less or the query fails. Generally, if all categorical variables are short strings, a total feature cardinality (model dimension) of 5-10 million is supported. The dimensionality is dependent on the cardinality and length of the string variables.

When you use a CREATE MODEL statement, the label column cannot contain NULL values. If the label column contains NULL values, the query fails.

Standardize data for Model Convergence

1) For all numeric columns, BigQuery ML standardizes and centers the column at zero before passing it into training.

2) In SQL query we skip the following columns for training data:

['functional_weight', 'education', 'sex', 'relationship']

3) Categorical features will be converted to numerical by BQML.

4) The Unknown values in workclass and native_country are replaced with Private and United States respectively. The instances with Unknown values for occupation are removed.

5) Drop relationship and use marital_status in a simplified manner. (You can use Chi Square which is commonly used for testing relationships between categorical variables (martial_status vs relationship).

6) Duplicates in the train set are removed.

Result of the model creation will be an Empty DataFrame this is normal.



In [0]:

    
# Train a BQML model
train_query = """
CREATE OR REPLACE MODEL `census_tutorial.census_model`
  OPTIONS (
      model_type='logistic_reg',
      auto_class_weights=true,
      data_split_method='no_split',
      input_label_cols=['income_bracket'],
      max_iterations=15) AS
  SELECT
      age,
      CASE 
        WHEN workclass IS NULL THEN 'Private' 
        WHEN LTRIM(workclass) LIKE '?' THEN 'Private'
        ELSE workclass
      END AS workclass,
      CASE 
        WHEN native_country IS NULL THEN 'United States' 
        WHEN LTRIM(native_country) LIKE '?' THEN 'United States'
        ELSE native_country
      END AS native_country,        
      CASE 
        WHEN LTRIM(marital_status) IN 
          (
           'Never-married',
           'Divorced',
           'Separated',
           'Widowed'
          ) THEN 'Single' 
        WHEN LTRIM(marital_status) IN 
          (
           'Married-civ-spouse',
           'Married-spouse-absent',
           'Married-AF-spouse'
          ) THEN 'Married' 
        ELSE NULL 
      END AS marital_status,
      education_num,
      occupation,
      race,       
      hours_per_week,        
      income_bracket
    FROM   
      `bigquery-public-data.ml_datasets.census_adult_income`
    WHERE
      MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) < 8
      AND (occupation IS NOT NULL OR LTRIM(occupation) NOT LIKE '?%')
    GROUP BY  1, 2, 3, 4, 5, 6, 7, 8, 9
"""
client.query(train_query)

Model information

A machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss. This process is called empirical risk minimization.

Loss is the penalty for a bad prediction, a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.



In [0]:

    
training_info = """
SELECT
  training_run,
  iteration,
  loss,
  eval_loss,
  duration_ms,
  learning_rate
FROM
  ML.TRAINING_INFO(MODEL `census_tutorial.census_model`)
ORDER BY iteration ASC
"""
client.query(training_info).to_dataframe()

Model evaluation



In [0]:

    
# Perform model evaluation
query_evaluate = """
SELECT 
  precision,
  recall,
  accuracy,
  f1_score,
  log_loss,
  roc_auc
FROM ML.EVALUATE (MODEL `census_tutorial.census_model`, 
  (
    SELECT
      age,
      CASE 
        WHEN workclass IS NULL THEN 'Private' 
        WHEN LTRIM(workclass) LIKE '?' THEN 'Private'
        ELSE workclass
      END AS workclass,
      CASE 
        WHEN native_country IS NULL THEN 'United States' 
        WHEN LTRIM(native_country) LIKE '?' THEN 'United States'
        ELSE native_country
      END AS native_country,        
      CASE 
        WHEN LTRIM(marital_status) IN 
          (
            'Never-married',
            'Divorced',
            'Separated',
            'Widowed'
            ) THEN 'Single' 
        WHEN LTRIM(marital_status) IN 
          (
            'Married-civ-spouse',
            'Married-spouse-absent',
            'Married-AF-spouse'
            ) THEN 'Married' 
        ELSE NULL 
      END AS marital_status,
      education_num,
      occupation,
      race,       
      hours_per_week,        
      income_bracket
    FROM   
      `bigquery-public-data.ml_datasets.census_adult_income`
    WHERE
      MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) = 8
      AND (occupation IS NOT NULL OR LTRIM(occupation) NOT LIKE '?%')
    GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9
    ))
"""
evaluation_job = client.query(query_evaluate).to_dataframe()

Model results

By examining the ROC accuracy (roc_auc) printed during evaluation, you should see that your model ultimately learned to predict a person's income bracket with around 78% accuracy. Can you improve it?



In [0]:

    
evaluation_job

ROC

The output ML.ROC_CURVE function includes multiple rows with metrics for different threshold values for the model. The metrics include:

threshold
recall
false_positive_rate
true_positives
false_positives
true_negatives
false_negatives

More information about these metrics here.



In [0]:

    
# Perform model evaluation
query_roc_curve = """
SELECT
  threshold,
  recall,
  false_positive_rate,
  true_positives,
  false_positives,
  true_negatives,
  false_negatives
FROM
  ML.ROC_CURVE(MODEL `census_tutorial.census_model`,
  (
    SELECT
      age,
      CASE 
        WHEN workclass IS NULL THEN 'Private' 
        WHEN LTRIM(workclass) LIKE '?' THEN 'Private'
        ELSE workclass
      END AS workclass,
      CASE 
        WHEN native_country IS NULL THEN 'United States' 
        WHEN LTRIM(native_country) LIKE '?' THEN 'United States'
        ELSE native_country
      END AS native_country,        
      CASE 
        WHEN LTRIM(marital_status) IN 
          (
            'Never-married',
            'Divorced',
            'Separated',
            'Widowed'
            ) THEN 'Single' 
        WHEN LTRIM(marital_status) IN 
          (
            'Married-civ-spouse',
            'Married-spouse-absent',
            'Married-AF-spouse'
            ) THEN 'Married' 
        ELSE NULL 
      END AS marital_status,
      education_num,
      occupation,
      race,       
      hours_per_week,        
      income_bracket
    FROM   
       `bigquery-public-data.ml_datasets.census_adult_income`
    WHERE
      MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) = 8
      AND (occupation IS NOT NULL OR LTRIM(occupation) NOT LIKE '?%')
    GROUP BY  1, 2, 3, 4, 5, 6, 7, 8, 9))
"""
roc_curve_job = client.query(query_roc_curve).to_dataframe()
roc_curve_job

Visualize Model history



In [0]:

    
# Visualize History for Accuracy.
plt.title('BigQuery ML Model accuracy')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.02])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
fpr = roc_curve_job['false_positive_rate']
tpr = roc_curve_job['recall']
plt.plot(fpr, tpr, color='darkorange')
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.show()

Model predictions

Generate predictions

Now you will use your model to predict outcomes. The following query uses the ML.PREDICT. The query returns these columns:

income_bracket
predicted_income_bracket
predicted_income_bracket_probs
All other columns specified in query.

Note: Normally for prediction, your label will be empty, in this example we use it to be able to compare the model result vs expected label.



In [0]:

    
query_prediction = """
SELECT
  income_bracket,
  predicted_income_bracket,
  predicted_income_bracket_probs
FROM
  ML.PREDICT(MODEL `census_tutorial.census_model`,
  (
    SELECT
      age,
      CASE 
        WHEN workclass IS NULL THEN 'Private' 
        WHEN LTRIM(workclass) LIKE '?' THEN 'Private'
        ELSE workclass
      END AS workclass,
      CASE 
        WHEN native_country IS NULL THEN 'United States' 
        WHEN LTRIM(native_country) LIKE '?' THEN 'United States'
        ELSE native_country
      END AS native_country,        
      CASE 
        WHEN LTRIM(marital_status) IN 
        (
            'Never-married',
            'Divorced',
            'Separated',
            'Widowed'
            ) THEN 'Single' 
        WHEN LTRIM(marital_status) IN 
          (
            'Married-civ-spouse',
            'Married-spouse-absent',
            'Married-AF-spouse'
            ) THEN 'Married' 
        ELSE NULL 
      END AS marital_status,
      education_num,
      occupation,
      race,       
      hours_per_week,        
      income_bracket
    FROM   
       `bigquery-public-data.ml_datasets.census_adult_income`
    WHERE
      MOD(ABS(FARM_FINGERPRINT(CAST(functional_weight AS STRING))), 10) = 9
      AND occupation IS NOT NULL AND LTRIM(occupation) NOT LIKE '?%'
      GROUP BY  1, 2, 3, 4, 5, 6, 7, 8, 9
    ))
"""
predictions = client.query(query_prediction).to_dataframe()

We will see a few predictions and the expected result:



In [0]:

    
predictions[['income_bracket', 'predicted_income_bracket']].head()



In [0]:

    
predictions['predicted_income_bracket_probs'].head()

In this case the first prediction probability is ~ 0.53, which corresponds to >50K. As you can see is very easy to load data into BigQuery and create a model to start training and serving using SQL language only.

Count the number of correct predictions:



In [0]:

    
_count = predictions['income_bracket'].str.strip().str.lower() == \
  predictions['predicted_income_bracket'].str.strip().str.lower()
# Group predictions:
_count.value_counts(normalize=True) 
# Note: Set normalize=False to see grouped results.

Conclusion

Using BigQuery to create Machine Learning has never been so easy, with BigQuery ML now you can start using the power of the Cloud and SQL language to create different Machine Learning solutions to solve your different business challenges.

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

You can delete the project you created.
Or you can keep the project and delete the dataset.
Make sure you check the Pricing section when using this tutorial.

Support

https://cloud.google.com/bigquery/support