In [ ]:
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Purchase Prediction with AutoML Tables


Overview

One of the most common use cases in Marketing is to predict the likelihood of conversion. Conversion could be defined by the marketer as taking a certain action like making a purchase, signing up for a free trial, subscribing to a newsletter, etc. Knowing the likelihood that a marketing lead or prospect will ‘convert’ can enable the marketer to target the lead with the right marketing campaign. This could take the form of remarketing, targeted email campaigns, online offers or other treatments.

Here we demonstrate how you can use BigQuery and AutoML Tables to build a supervised binary classification model for purchase prediction.

Dataset

The model uses a real dataset from the Google Merchandise store consisting of Google Analytics web sessions.

The goal is to predict the likelihood that a visitor to the online Google Merchandise Store will make a purchase on the website during that Google Analytics session. The user's past web interactions on the store website, in addition to information such as browser details and geography, are used to make this prediction.

This is framed as a binary classification problem: label each user session as either true (a purchase is made) or false (no purchase is made).

Dataset Details

The dataset consists of a set of tables corresponding to Google Analytics sessions tracked on the Google Merchandise Store. Each table holds a single day of GA sessions. More details about the schema can be seen here.

You can access the data on BigQuery here.

Costs

This tutorial uses billable components of Google Cloud Platform (GCP):

  • Cloud AI Platform
  • Cloud Storage
  • BigQuery
  • AutoML Tables

Learn about Cloud AI Platform pricing, Cloud Storage pricing, BigQuery pricing and AutoML Tables pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.

Set up your local development environment

If you are using Colab or AI Platform Notebooks, your environment already meets all the requirements to run this notebook. If you are using AI Platform Notebooks, make sure the machine configuration type is 4 vCPU, 15 GB RAM or above. You can skip this step.

Otherwise, make sure your environment meets this notebook's requirements. You need the following:

  • The Google Cloud SDK
  • Git
  • Python 3
  • virtualenv
  • Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to Setting up a Python development environment and the Jupyter installation guide provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions; an example shell session for steps 3-5 follows the list:

  1. Install and initialize the Cloud SDK.

  2. Install Python 3.

  3. Install virtualenv and create a virtual environment that uses Python 3.

  4. Activate that environment and run pip install jupyter in a shell to install Jupyter.

  5. Run jupyter notebook in a shell to launch Jupyter.

  6. Open this notebook in the Jupyter Notebook Dashboard.
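
A minimal sketch of steps 3 through 5 on a Unix-like shell (the environment name automl-env is an arbitrary choice):

  virtualenv -p python3 automl-env
  source automl-env/bin/activate
  pip install jupyter
  jupyter notebook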

Set up your GCP project

The following steps are required, regardless of your notebook environment.

  1. Select or create a GCP project. When you first create an account, you get a $300 free credit towards your compute/storage costs.

  2. Make sure that billing is enabled for your project.

  3. Enable the AI Platform APIs and Compute Engine APIs.

  4. Enable the AutoML API. (Example gcloud commands for enabling these APIs are shown below.)
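
If you prefer the command line, you can also enable these APIs with gcloud from a shell where the Cloud SDK is initialized against your project. This is a sketch; verify the exact service names in the API Library:

  gcloud services enable ml.googleapis.com compute.googleapis.com
  gcloud services enable automl.googleapis.com bigquery.googleapis.com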

PIP Install Packages and dependencies

Install additional dependencies that are not already installed in the notebook environment.


In [ ]:
! pip install --upgrade --quiet --user google-cloud-automl
! pip install --upgrade --quiet --user google-cloud-bigquery
! pip install --upgrade --quiet --user google-cloud-storage
! pip install --upgrade --quiet --user matplotlib
! pip install --upgrade --quiet --user pandas 
! pip install --upgrade --quiet --user pandas-gbq 
! pip install --upgrade --quiet --user gcsfs

Note: If the above commands throw any permission errors, try installing with sudo.

Restart the kernel so that the newly installed automl_v1beta1 package can be imported in the notebook.


In [ ]:
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

Set up your GCP Project Id

Enter your Project Id in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.


In [ ]:
PROJECT_ID = "[your-project-id]" # @param {type:"string"}
COMPUTE_REGION = "us-central1" # Currently only supported region.

Authenticate your GCP account

If you are using AI Platform Notebooks, your environment is already authenticated. Skip this step.

Otherwise, follow these steps:

  1. In the GCP Console, go to the Create service account key page.

  2. From the Service account drop-down list, select New service account.

  3. In the Service account name field, enter a name.

  4. From the Role drop-down list, select AutoML > AutoML Admin, Storage > Storage Admin and BigQuery > BigQuery Admin.

  5. Click Create. A JSON file that contains your key downloads to your local environment.

Note: Jupyter runs lines prefixed with ! as shell commands, and it interpolates Python variables prefixed with $ into these commands.
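
As a small illustration of that interpolation (it simply echoes the project ID set earlier), a Python variable can be used directly inside a shell command:


In [ ]:
! echo "Using project: $PROJECT_ID"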


In [ ]:
import sys

# Upload the downloaded JSON file that contains your key.
if 'google.colab' in sys.modules:    
  from google.colab import files
  keyfile_upload = files.upload()
  keyfile = list(keyfile_upload.keys())[0]
  %env GOOGLE_APPLICATION_CREDENTIALS $keyfile
  ! gcloud auth activate-service-account --key-file $keyfile

If you are running the notebook locally, enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below and run the cell.


In [ ]:
# If you are running this notebook locally, replace the string below with the
# path to your service account key and run this cell to authenticate your GCP
# account.

%env GOOGLE_APPLICATION_CREDENTIALS /path/to/service/account
! gcloud auth activate-service-account --key-file '/path/to/service/account'

Create a Cloud Storage bucket

The following steps are required, regardless of your notebook environment.

When you submit a training job using the Cloud SDK, you upload a Python package containing your training code to a Cloud Storage bucket. AI Platform runs the code from this package. In this tutorial, AI Platform also saves the trained model that results from your job in the same bucket. You can then create an AI Platform model version based on this output in order to serve online predictions.

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.

You may also change the REGION variable, which is used for operations throughout the rest of this notebook. Make sure to choose a region where Cloud AI Platform services are available. You may not use a Multi-Regional Storage bucket for training with AI Platform.


In [ ]:
BUCKET_NAME = "[your-bucket-name]" #@param {type:"string"}

Only if your bucket doesn't exist: run the following cell to create your Cloud Storage bucket. Make sure that the Storage > Storage Admin role is assigned to the account you authenticated with.


In [ ]:
! gsutil mb -p $PROJECT_ID -l $COMPUTE_REGION gs://$BUCKET_NAME

Finally, validate access to your Cloud Storage bucket by examining its contents:


In [ ]:
! gsutil ls -al gs://$BUCKET_NAME

Import libraries and define constants

Import relevant packages.


In [ ]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

In [ ]:
# AutoML library.
from google.cloud import automl_v1beta1 as automl
import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types
from google.cloud import bigquery
from google.cloud import storage

In [ ]:
import matplotlib.pyplot as plt
import datetime
import pandas as pd
import numpy as np
from sklearn import metrics

Populate the following cell with the necessary constants and run it to initialize constants.


In [ ]:
#@title Constants { vertical-output: true }

# A name for the AutoML tables Dataset to create.
DATASET_DISPLAY_NAME = 'purchase_prediction' #@param {type: 'string'}
# A name for the file to hold the nested data.
NESTED_CSV_NAME = 'FULL.csv' #@param {type: 'string'}
# A name for the file to hold the unnested data.
UNNESTED_CSV_NAME = 'FULL_unnested.csv' #@param {type: 'string'}
# A name for the input train data.
TRAINING_CSV = 'training_unnested_balanced_FULL' #@param {type: 'string'}
# A name for the input validation data.
VALIDATION_CSV = 'validation_unnested_FULL' #@param {type: 'string'}
# A name for the AutoML tables model to create.
MODEL_DISPLAY_NAME = 'model_1' #@param {type:'string'}

assert all([
    PROJECT_ID,
    COMPUTE_REGION,
    DATASET_DISPLAY_NAME,
    MODEL_DISPLAY_NAME,
])

Initialize the clients for AutoML, AutoML Tables, BigQuery, and Cloud Storage.


In [ ]:
# Initialize the clients.
automl_client = automl.AutoMlClient()
tables_client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION)
bq_client = bigquery.Client()
storage_client = storage.Client()

Test the setup

To test whether your project set up and authentication steps were successful, run the following cell to list your datasets in this project.

If no dataset has previously been imported into AutoML Tables, you should see an empty list.


In [ ]:
# List the datasets.
list_datasets = tables_client.list_datasets()
datasets = { dataset.display_name: dataset.name for dataset in list_datasets }
datasets

You can also print the list of your models by running the following cell.

If no model has previously been trained with AutoML Tables, you should see an empty list.


In [ ]:
# List the models.
list_models = tables_client.list_models()
models = { model.display_name: model.name for model in list_models }
models

Transformation and Feature Engineering Functions

The data cleaning and transformation step was by far the most involved. It includes a few sections that create an AutoML Tables dataset, pull the Google Merchandise Store data from BigQuery, transform the data, and save it at several stages to CSV files in Google Cloud Storage.

The dataset is made viewable in the AutoML Tables UI. It will eventually hold the training data after that data is cleaned and transformed.

Only around 1% of the rows in this dataset have a positive label value of True, i.e. cases where a transaction was made. This is a class imbalance problem. There are several ways to handle class imbalance; we chose random oversampling of the positive class, which artificially increases the number of sessions labeled with a true transaction value.

There were also many columns whose values were either all missing or all constant. These columns would not add any signal to the model, so we dropped them.

There were also columns with NaN rather than 0 values; for instance, rather than having a count of 0, a column might have a null value. We therefore added code to change some of these null values to 0, specifically in the target column, where AutoML Tables does not allow null values. (AutoML Tables can handle null values in the feature columns.)

Feature Engineering

The dataset had rich information on customer location and behavior; however, it could be improved by feature engineering. There was also a concern about data leakage. The decision to do feature engineering therefore had two motivations: to remove data leakage without losing too much useful data, and to improve the signal in the data.

Weekdays

The date seemed like a useful piece of information to include, as it could capture seasonal effects. Unfortunately, we only had one year of data, so seasonality on an annual scale would be difficult (read: impossible) to incorporate. Fortunately, we could try to detect seasonal effects on a micro scale, with perhaps equally informative results. We ended up creating a new weekday column derived from the date, denoting which day of the week the session fell on. This new feature turned out to have some useful predictive power when added as a variable to our model.

Data Leakage

The marginal gain from adding a weekday feature was overshadowed by the concern of data leakage in our training data. The initial naive models we trained produced outstanding results: so outstanding that we knew something must be going on. As it turned out, quite a few features functioned as proxies for the label we were trying to predict, meaning some of the features we conditioned on to build the model had an almost 1:1 correlation with the target. Intuitively, this made sense.

One feature that exhibited this behavior was the number of page views a customer made during a session. By conditioning on page views in a session, we could very reliably predict which customer sessions a purchase would be made in. At first this seems like the golden ticket: we can reliably predict whether or not a purchase is made! The catch: the full page view count can only be collected at the end of the session, by which point we would also know whether a transaction was made. Seen from this perspective, collecting page views at the same time as the transaction information makes it pointless to predict the transaction from the page views, as we would already have both.

One solution was to drop page views as a feature entirely. This would safely stop the data leakage, but we would lose some critically useful information. Another solution (the one we went with) was to track the page view counts of all previous sessions for a given customer, and use them to inform the current session. This way, we could still use page view information, but only information that we would have before the session even began. So we created a new column called previous_views, populated it with the total count of all page views made by the customer in previous sessions, and then deleted the page views feature to stop the data leakage.

Our rationale for this change boils down to a concise heuristic: only use the information that is available to us on the first click of the session. Applying this reasoning, we performed similar data engineering on other features that we found to be proxies for the label. We also refined our objective in the process: for a visit to the Google Merchandise Store, what is the probability that a customer will make a purchase, and can we calculate this probability the moment the customer arrives? By clarifying the question, we made the result more powerful and useful, and eliminated the data leakage that threatened to make the predictive power trivial.


In [ ]:
def balanceTable(table):
  # class count.
  count_class_false, count_class_true = table.totalTransactionRevenue\
                                        .value_counts()

  # divide by class.
  table_class_false = table[table["totalTransactionRevenue"]==False]
  table_class_true = table[table["totalTransactionRevenue"]==True]

  # random over-sampling.
  table_class_true_over = table_class_true.sample(
                          count_class_false, replace=True)
  table_test_over = pd.concat([table_class_false, table_class_true_over])
  return table_test_over

In [ ]:
def partitionTable(table, dt=20170500):
  # The automl tables model could be training on future data and implicitly learning about past data in the testing
  # dataset, this would cause data leakage. To prevent this, we are training only with the first 9 months of data (table1)
  # and doing validation with the last three months of data (table2).
  table1 = table[table["date"]<=dt].copy(deep=False)
  table2 = table[table["date"]>dt].copy(deep=False)
  return table1, table2

In [ ]:
def N_updatePrevCount(table, new_column, old_column):
  table = table.fillna(0)
  # Order each visitor's sessions by date so the cumulative sum is meaningful.
  table = table.sort_values(by=['fullVisitorId', 'date'])
  # Cumulative count of old_column per visitor, stored in new_column.
  table[new_column] = table.groupby(['fullVisitorId'])[old_column].apply(
                        lambda x: x.cumsum())
  table.drop([old_column], axis=1, inplace=True)
  return table

In [ ]:
def N_updateDate(table):
  # Parse the integer date (YYYYMMDD) and derive the day of the week.
  table['date'] = pd.to_datetime(table['date'].astype(str), format='%Y%m%d')
  table['weekday'] = table['date'].dt.dayofweek
  return table

In [ ]:
def change_transaction_values(table):
  # Convert revenue into a boolean label: True if any revenue was recorded.
  table['totalTransactionRevenue'] = table['totalTransactionRevenue'].fillna(0)
  table['totalTransactionRevenue'] = table['totalTransactionRevenue'].apply(
                                      lambda x: x!=0)
  return table

In [ ]:
def saveTable(table, csv_file_name, bucket_name):
  # Write the table to a local CSV file, then upload it to the GCS bucket.
  table.to_csv(csv_file_name, index=False)
  bucket = storage_client.get_bucket(bucket_name)
  blob = bucket.blob(csv_file_name)
  blob.upload_from_filename(filename=csv_file_name)

Getting training data

If you are using Colab, the memory may not be sufficient to generate the nested and unnested data using the queries below. In that case, you can directly download the unnested data FULL_unnested.csv from here and upload the file manually to the GCS bucket that was created in the previous steps (BUCKET_NAME).
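
If you downloaded FULL_unnested.csv to a machine that has the Cloud SDK installed, one way to copy it into the bucket is with gsutil. A minimal sketch, assuming the file is in your current working directory (replace [your-bucket-name] with your bucket):

  gsutil cp FULL_unnested.csv gs://[your-bucket-name]/FULL_unnested.csv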

If you are using an AI Platform Notebook or a local environment, run the following code.


In [ ]:
# Save table.
query = """
SELECT
 date, 
 device, 
 geoNetwork, 
 totals, 
 trafficSource, 
 fullVisitorId 
FROM 
 `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
 _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB('2017-08-01', INTERVAL 366 DAY)) AND
 FORMAT_DATE('%Y%m%d',DATE_SUB('2017-08-01', INTERVAL 1 DAY))
"""
df = bq_client.query(query).to_dataframe()
print(df.iloc[:3])
saveTable(df, NESTED_CSV_NAME, BUCKET_NAME)

In [ ]:
# Unnest the Data.
nested_gcs_uri = 'gs://{}/{}'.format(BUCKET_NAME, NESTED_CSV_NAME)
table = pd.read_csv(nested_gcs_uri, low_memory=False)

column_names = ['device', 'geoNetwork','totals', 'trafficSource']

for name in column_names:
  print(name)
  table[name] = table[name].apply(lambda i: dict(eval(i)))
  temp = table[name].apply(pd.Series)
  table = pd.concat([table, temp], axis=1).drop(name, axis=1)

# Drop the still-nested adwordsClickInfo column left over from unnesting trafficSource.
table.drop(['adwordsClickInfo'], axis=1, inplace=True)
saveTable(table, UNNESTED_CSV_NAME, BUCKET_NAME)

Run the Transformations


In [ ]:
# Run the transformations.
unnested_gcs_uri = 'gs://{}/{}'.format(BUCKET_NAME, UNNESTED_CSV_NAME)
table = pd.read_csv(unnested_gcs_uri, low_memory=False)

consts = ['transactionRevenue', 'transactions', 'adContent', 'browserSize', 
          'campaignCode', 'cityId', 'flashVersion', 'javaEnabled', 'language', 
          'latitude', 'longitude', 'mobileDeviceBranding', 'mobileDeviceInfo', 
          'mobileDeviceMarketingName','mobileDeviceModel','mobileInputSelector',
          'networkLocation', 'operatingSystemVersion', 'screenColors', 
          'screenResolution', 'screenviews', 'sessionQualityDim', 
          'timeOnScreen', 'visits', 'uniqueScreenviews', 'browserVersion', 
          'referralPath','fullVisitorId', 'date']

table = N_updatePrevCount(table, 'previous_views', 'pageviews')
table = N_updatePrevCount(table, 'previous_hits', 'hits')
table = N_updatePrevCount(table, 'previous_timeOnSite', 'timeOnSite')
table = N_updatePrevCount(table, 'previous_Bounces', 'bounces')

table = change_transaction_values(table)

In [ ]:
table1, table2 = partitionTable(table)
table1 = N_updateDate(table1)
table2 = N_updateDate(table2)

table1.drop(consts, axis=1, inplace=True)
table2.drop(consts, axis=1, inplace=True)

saveTable(table2,'{}.csv'.format(VALIDATION_CSV), BUCKET_NAME)

table1 = balanceTable(table1)

# The training CSV holds the balanced first 9 months of data.
saveTable(table1, '{}.csv'.format(TRAINING_CSV), BUCKET_NAME)

Import Training Data

Select a dataset display name and pass your table source information to create a new dataset.

Create Dataset


In [ ]:
# Create dataset.
dataset = tables_client.create_dataset(
    dataset_display_name=DATASET_DISPLAY_NAME)
dataset_name = dataset.name
dataset

Import Data


In [ ]:
# Read the data source from GCS. 
dataset_gcs_input_uris = ['gs://{}/{}.csv'.format(BUCKET_NAME, TRAINING_CSV)]

import_data_response = tables_client.import_data(
    dataset=dataset,
    gcs_input_uris=dataset_gcs_input_uris
)

print('Dataset import operation: {}'.format(import_data_response.operation))

# Synchronous check of operation status. Wait until import is done.
print('Dataset import response: {}'.format(import_data_response.result()))

# Verify the status by checking the example_count field.
dataset = tables_client.get_dataset(dataset_name=dataset_name)
dataset

Review the specs

Run the following command to see table specs such as row count.


In [ ]:
# List table specs.
list_table_specs_response = tables_client.list_table_specs(dataset=dataset)
table_specs = [s for s in list_table_specs_response]

# List column specs.
list_column_specs_response = tables_client.list_column_specs(dataset=dataset)
column_specs = {s.display_name: s for s in list_column_specs_response}

# Print Features and data_type.
features = [(key, data_types.TypeCode.Name(value.data_type.type_code)) 
            for key, value in column_specs.items()]
print('Feature list:\n')
for feature in features:
    print(feature[0],':', feature[1])

In [ ]:
# Table schema pie chart.
type_counts = {}
for column_spec in column_specs.values():
  type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)
  type_counts[type_name] = type_counts.get(type_name, 0) + 1
    
plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')
plt.axis('equal')
plt.show()

Update dataset: assign a label column and enable nullable columns

AutoML Tables automatically detects your data column type. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema.

Update a column: set to not nullable


In [ ]:
# Update column.
column_spec_display_name = 'totalTransactionRevenue' #@param {type: 'string'}
update_column_response = tables_client.update_column_spec(
    dataset=dataset,
    column_spec_display_name=column_spec_display_name,
    nullable=False,
)
update_column_response

Tip: You can use kwarg type_code='CATEGORY' in the preceding update_column_spec(..) call to convert the column data type from FLOAT64 to CATEGORY.
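
A minimal sketch of that call for the totalTransactionRevenue label column (it is only needed if the column was detected as FLOAT64; the nullable setting repeats the cell above):


In [ ]:
# Optionally convert the label column type to CATEGORY.
update_column_response = tables_client.update_column_spec(
    dataset=dataset,
    column_spec_display_name='totalTransactionRevenue',
    type_code='CATEGORY',
    nullable=False,
)
update_column_response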

Update dataset: assign a target column


In [ ]:
# Assign target column.
column_spec_display_name = 'totalTransactionRevenue' #@param {type: 'string'}
update_dataset_response = tables_client.set_target_column(
    dataset=dataset,
    column_spec_display_name=column_spec_display_name,
)
update_dataset_response

Creating a model

Train a model

To create the datasets for training, testing, and validation, we first had to consider what kind of data we were dealing with. The data tracks all customer sessions with the Google Merchandise Store over a year. AutoML Tables does its own training and testing, and provides a nice UI for viewing the results. For the training and testing dataset, then, we simply used the oversampled, balanced dataset created by the transformations described above. But we first partitioned the data so that the first 9 months went into one table and the last 3 months into another. This allowed us to train and test with an entirely different dataset than the one we used to validate.

Moreover, we held off on oversampling the validation dataset, so as not to bias the data that we would ultimately use to judge the success of the model.

The decision to split the sessions along time was made to avoid the model training on future data to predict past data. (This can also be avoided by providing a datetime variable in the dataset and toggling a button in the UI.)

Training the model may take one hour or more. The following cell keeps running until training is done. If your Colab times out, use tables_client.list_models() to check whether your model has been created, then use the model name to continue with the next steps. Run the following command to retrieve your model, replacing model_name with its actual value.

model = tables_client.get_model(model_name=model_name)
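
The cell below sketches that recovery flow end to end. It is safe to run at any point; it does nothing if no model named MODEL_DISPLAY_NAME exists yet.


In [ ]:
# Optional: reattach to an existing model from a previous training run
# (for example, after a Colab timeout).
existing_models = {m.display_name: m.name for m in tables_client.list_models()}
if MODEL_DISPLAY_NAME in existing_models:
  model = tables_client.get_model(model_name=existing_models[MODEL_DISPLAY_NAME])
  model_name = model.name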

Note that we trained on the first 9 months of data and we validate using the last 3.

For demonstration purposes, the following command sets the budget to 1 node hour ('train_budget_milli_node_hours': 1000). You can increase that number up to a maximum of 72 hours ('train_budget_milli_node_hours': 72000) for the best model performance.

Even with a budget of 1 node hour (the minimum possible budget), training a model can take more than the specified node hours.

You can also select the objective to optimize your model training for by setting optimization_objective. This solution uses the default optimization objective. Refer to the AutoML Tables documentation on optimization objectives for more details.


In [ ]:
# The number of hours to train the model.
model_train_hours = 1 #@param {type:'integer'}

create_model_response = tables_client.create_model(
    MODEL_DISPLAY_NAME,
    dataset=dataset,
    train_budget_milli_node_hours=model_train_hours*1000,
)

operation_id = create_model_response.operation.name

print('Create model operation: {}'.format(create_model_response.operation))

In [ ]:
# Wait until model training is done.
model = create_model_response.result()
model_name = model.name
model

Make a prediction

In this section, we take our validation data prediction results and plot the Precision Recall curve and the ROC curve for both the true and false predictions.

There are two different prediction modes: online and batch. The following cell shows you how to make a batch prediction.


In [ ]:
#@title Start batch prediction { vertical-output: true }

batch_predict_gcs_input_uris = ['gs://{}/{}.csv'.format(BUCKET_NAME, VALIDATION_CSV)] #@param {type:'string'}
batch_predict_gcs_output_uri_prefix = 'gs://{}'.format(BUCKET_NAME) #@param {type:'string'}

batch_predict_response = tables_client.batch_predict(
    model=model, 
    gcs_input_uris=batch_predict_gcs_input_uris,
    gcs_output_uri_prefix=batch_predict_gcs_output_uri_prefix,
)
print('Batch prediction operation: {}'.format(batch_predict_response.operation))

# Wait until batch prediction is done.
batch_predict_result = batch_predict_response.result()
batch_predict_response.metadata

Evaluate your prediction

The following cells create a Precision Recall curve and a ROC curve for both the true and false classifications.


In [ ]:
def invert(x):
  return 1-x

def switch_label(x):
  return(not x)

In [ ]:
batch_predict_results_location = batch_predict_response.metadata\
                                 .batch_predict_details.output_info\
                                 .gcs_output_directory
table = pd.read_csv('{}/tables_1.csv'.format(batch_predict_results_location))
y = table["totalTransactionRevenue"]
scores = table["totalTransactionRevenue_True_score"]
scores_invert = table['totalTransactionRevenue_False_score']

In [ ]:
# code for ROC curve, for true values.
fpr, tpr, thresholds = metrics.roc_curve(y, scores)
roc_auc = metrics.auc(fpr, tpr)
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area=%0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for True')
plt.legend(loc="lower right")
plt.show()

In [ ]:
# code for ROC curve, for false values.
plt.figure()
lw = 2
label_invert = y.apply(switch_label)
fpr, tpr, thresholds = metrics.roc_curve(label_invert, scores_invert)
# Recompute the AUC for the inverted labels instead of reusing the value above.
roc_auc = metrics.auc(fpr, tpr)
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area=%0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for False')
plt.legend(loc="lower right")
plt.show()

In [ ]:
# code for PR curve, for true values.
precision, recall, thresholds = metrics.precision_recall_curve(y, scores)
plt.figure()
lw = 2
plt.plot( recall, precision, color='darkorange',
         lw=lw, label='Precision recall curve for True')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision Recall Curve for True')
plt.legend(loc="lower right")
plt.show()

In [ ]:
# code for PR curve, for false values.
precision, recall, thresholds = metrics.precision_recall_curve(
                                label_invert, scores_invert)
print(precision.shape)
print(recall.shape)

plt.figure()
lw = 2
plt.plot( recall, precision, color='darkorange',
          label='Precision recall curve for False')
plt.xlim([0.0, 1.1])
plt.ylim([0.0, 1.1])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision Recall Curve for False')
plt.legend(loc="lower right")
plt.show()

Cleaning up

To clean up all GCP resources used in this project, you can delete the GCP project you used for the tutorial.


In [ ]:
# Delete model resource.
tables_client.delete_model(model_name=model_name)

# Delete dataset resource.
tables_client.delete_dataset(dataset_name=dataset_name)

# Delete Cloud Storage objects that were created.
! gsutil -m rm -r gs://$BUCKET_NAME

# If the model is still training, cancel the training operation.
automl_client.transport._operations_client.cancel_operation(operation_id)