In [ ]:
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
AutoML Tables enables you to build machine learning models based on tables of your own data and host them on Google Cloud for scalability. This notebook demonstrates how you can use AutoML Tables to solve a product stockouts problem in the retail industry. The problem is framed as binary classification: predict whether a particular product at a certain store will be out of stock in the next four weeks. Once the solution is built, you can plug it into your production system and proactively predict stockouts for your business.
In this solution, you will use two datasets: Training/Evaluation data and Batch Prediction inputs. To access the datasets in BigQuery, you need the following information.
Training/Evaluation data:
Project ID: product-stockout
Dataset ID: product_stockout
Table ID: stockout
Batch Prediction inputs:
Project ID: product-stockout
Dataset ID: product_stockout
Table ID: batch_prediction_inputs
Field name | Datatype | Type | Description |
---|---|---|---|
Item_Number | STRING | Identifier | This is the product/ item identifier |
Category | STRING | Identifier | Several items could belong to one category |
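If you want to peek at the training table before importing it, a minimal sketch using the google-cloud-bigquery client is shown below. That client is an extra dependency not used elsewhere in this notebook, and the sketch assumes you have already authenticated with credentials that can read the table and that have a default project set (see the setup cells later in this notebook). The selected columns follow the schema described above.
In [ ]:
# Sketch: preview a few rows of the training table in BigQuery.
# Assumes `pip install google-cloud-bigquery` and authenticated credentials
# with a default project set.
from google.cloud import bigquery

bq_client = bigquery.Client()
query = """
    SELECT Item_Number, Category, Stockout
    FROM `product-stockout.product_stockout.stockout`
    LIMIT 5
"""
for row in bq_client.query(query).result():
    print(dict(row))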
Instructions to download the dataset:
Sample Dataset: Download this dataset, which contains sales data.
Upload this dataset to GCS or BigQuery (optional).
You can choose either GCS or BigQuery as the location for storing the data in this solution, as shown in the sketch below.
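If you choose to stage your own copy, a sketch of both options follows. The file name stockout.csv, the bucket name, and the dataset ID are placeholders; the sketch assumes the gsutil and bq command-line tools are available, as they are in Colab and AI Platform Notebooks.
In [ ]:
# Option 1: copy the downloaded CSV (placeholder file name) to a Cloud Storage bucket.
! gsutil cp stockout.csv gs://[your-bucket-name]/stockout.csv
# Option 2: load the CSV into a BigQuery table in your own project
# (placeholder dataset ID; --autodetect infers the schema).
! bq load --autodetect --source_format=CSV [your-dataset-id].stockout stockout.csv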
A stockout, or out-of-stock (OOS) event is an event that causes inventory to be exhausted. While out-of-stocks can occur along the entire supply chain, the most visible kind are retail out-of-stocks in the fast-moving consumer goods industry (e.g., sweets, diapers, fruits). Stockouts are the opposite of overstocks, where too much inventory is retained.
According to a study by researchers Thomas Gruen and Daniel Corsten, the global average level of out-of-stocks within the retail fast-moving consumer goods sector across developed economies was 8.3% in 2002. This means that shoppers would have only about a 42% chance of fulfilling a ten-item shopping list without encountering a stockout. Despite the initiatives designed to improve the collaboration of retailers and their suppliers, such as Efficient Consumer Response (ECR), and despite the increasing use of new technologies such as radio-frequency identification (RFID) and point-of-sale data analytics, this situation has improved little over the past decades.
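As a quick check of that figure: with an 8.3% out-of-stock rate and assuming independence between items, the probability that all ten items on a list are in stock is (1 - 0.083)^10, or roughly 42%.
In [ ]:
# Quick sanity check of the 42% figure quoted above.
oos_rate = 0.083  # average out-of-stock rate
p_all_in_stock = (1 - oos_rate) ** 10  # ten independent items
print('{:.0%}'.format(p_all_in_stock))  # ~42%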
Stockouts have significant impacts on both retailers and shoppers. Using machine learning to predict stockouts can help with store operations and thus prevent out-of-stock events proactively.
There are three big challenges any retailer would face as they try to solve this problem with machine learning:
Hence, we recommend using AutoML Tables. With AutoML Tables, you only need to focus on acquiring the data and features; AutoML Tables does the rest. It offers a nearly one-click approach to solving the stockout problem with machine learning.
This tutorial uses billable components of Google Cloud Platform (GCP):
Learn about Cloud AI Platform pricing, Cloud Storage pricing, BigQuery pricing, AutoML Tables pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.
If you are using Colab or AI Platform Notebooks, your environment already meets all the requirements to run this notebook, and you can skip this step. If you are using AI Platform Notebooks, make sure the machine type is at least 1 vCPU with 3.75 GB of RAM.
Otherwise, make sure your environment meets this notebook's requirements. You need the following:
The Google Cloud guide to Setting up a Python development environment and the Jupyter installation guide provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:
Install virtualenv and create a virtual environment that uses Python 3.
Activate that environment and run pip install jupyter in a shell to install Jupyter.
Run jupyter notebook in a shell to launch Jupyter.
Open this notebook in the Jupyter Notebook Dashboard.
The following steps are required, regardless of your notebook environment.
Select or create a GCP project. When you first create an account, you get a $300 free credit towards your compute/storage costs.
In [ ]:
! pip install --upgrade --quiet --user google-cloud-automl
! pip install matplotlib
Note: Try installing using sudo if the above commands throw any permission errors.
Restart the kernel to allow automl_v1beta1 to be imported for Jupyter Notebooks.
In [ ]:
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
In [ ]:
PROJECT_ID = "[your-project-id]" #@param {type:"string"}
COMPUTE_REGION = "us-central1" # Currently only supported region.
If you are running this notebook in Colab or on your local machine, you need a service account key to authenticate your GCP account. Follow these steps to create one:
In the GCP Console, go to the Create service account key page.
From the Service account drop-down list, select New service account.
In the Service account name field, enter a name.
From the Role drop-down list, select AutoML > AutoML Admin, Storage > Storage Object Admin and BigQuery > BigQuery Admin.
Click Create. A JSON file that contains your key downloads to your local environment.
Note: Jupyter runs lines prefixed with ! as shell commands, and it interpolates Python variables prefixed with $ into these commands.
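As a trivial illustration of this note (EXAMPLE_MESSAGE is just a throwaway variable for this demo):
In [ ]:
# A Python variable interpolated into a shell command via $.
EXAMPLE_MESSAGE = 'hello from Python'
! echo $EXAMPLE_MESSAGE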
In [ ]:
# Upload the downloaded JSON file that contains your key.
import sys
if 'google.colab' in sys.modules:
    from google.colab import files
    keyfile_upload = files.upload()
    keyfile = list(keyfile_upload.keys())[0]
    %env GOOGLE_APPLICATION_CREDENTIALS $keyfile
    ! gcloud auth activate-service-account --key-file $keyfile
If you are running the notebook locally, enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below and run the cell.
In [ ]:
# If you are running this notebook locally, replace the string below with the
# path to your service account key and run this cell to authenticate your GCP
# account.
%env GOOGLE_APPLICATION_CREDENTIALS /path/to/service/account
! gcloud auth activate-service-account --key-file '/path/to/service/account'
The following steps are required, regardless of your notebook environment.
In this tutorial, you use a Cloud Storage bucket to hold the output of batch prediction. AutoML Tables writes the prediction results to this bucket so that you can inspect them or feed them into your production system.
Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.
You may also change the COMPUTE_REGION variable, which is used for operations throughout the rest of this notebook. Make sure to choose a region where Cloud AI Platform services are available. You may not use a Multi-Regional Storage bucket for training with AI Platform.
In [ ]:
BUCKET_NAME = "[your-bucket-name]" #@param {type:"string"}
Only if your bucket doesn't exist: Run the following cell to create your Cloud Storage bucket. Make sure the Storage > Storage Admin role is enabled.
In [ ]:
! gsutil mb -p $PROJECT_ID -l $COMPUTE_REGION gs://$BUCKET_NAME
Finally, validate access to your Cloud Storage bucket by examining its contents:
In [ ]:
! gsutil ls -al gs://$BUCKET_NAME
Import relevant packages.
In [ ]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
In [ ]:
# AutoML library.
from google.cloud import automl_v1beta1 as automl
import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types
import matplotlib.pyplot as plt
Populate the following cell with the necessary constants and run it to initialize them.
In [ ]:
#@title Constants { vertical-output: true }
# A name for the AutoML tables Dataset to create.
DATASET_DISPLAY_NAME = 'stockout_data' #@param {type: 'string'}
# The BigQuery Dataset URI to import data from.
BQ_INPUT_URI = 'bq://product-stockout.product_stockout.stockout' #@param {type: 'string'}
# A name for the AutoML tables model to create.
MODEL_DISPLAY_NAME = 'stockout_model' #@param {type: 'string'}
assert all([
PROJECT_ID,
COMPUTE_REGION,
DATASET_DISPLAY_NAME,
BQ_INPUT_URI,
MODEL_DISPLAY_NAME,
])
Initialize the client for AutoML and AutoML Tables.
In [ ]:
# Initialize the clients.
automl_client = automl.AutoMlClient()
tables_client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION)
In [ ]:
# List the datasets.
list_datasets = tables_client.list_datasets()
datasets = { dataset.display_name: dataset.name for dataset in list_datasets }
datasets
You can also print the list of your models by running the following cell.
If no model has been trained with AutoML Tables before, expect an empty result.
In [ ]:
# List the models.
list_models = tables_client.list_models()
models = { model.display_name: model.name for model in list_models }
models
In [ ]:
# Create dataset.
dataset = tables_client.create_dataset(DATASET_DISPLAY_NAME)
dataset_name = dataset.name
dataset
You can import your data to AutoML Tables from GCS or BigQuery. For this solution, you will import data from a BigQuery table. The URI for your table is in the format of bq://PROJECT_ID.DATASET_ID.TABLE_ID.
The BigQuery table used for demonstration purposes can be accessed as bq://product-stockout.product_stockout.stockout.
See the table schema and dataset description in the README.
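For illustration only, the URI format described above can be assembled with a small helper like the following (bq_uri is a hypothetical function, not part of the AutoML client):
In [ ]:
# Hypothetical helper that builds a bq://PROJECT_ID.DATASET_ID.TABLE_ID URI.
def bq_uri(project_id, dataset_id, table_id):
    return 'bq://{}.{}.{}'.format(project_id, dataset_id, table_id)

# The demonstration table resolves to the BQ_INPUT_URI constant set earlier.
assert bq_uri('product-stockout', 'product_stockout', 'stockout') == BQ_INPUT_URI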
In [ ]:
# Import data.
import_data_response = tables_client.import_data(
dataset=dataset,
bigquery_input_uri=BQ_INPUT_URI,
)
print('Dataset import operation: {}'.format(import_data_response.operation))
# Synchronous check of operation status. Wait until import is done.
print('Dataset import response: {}'.format(import_data_response.result()))
# Verify the status by checking the example_count field.
dataset = tables_client.get_dataset(dataset_name=dataset_name)
dataset
Importing this stockout dataset takes about 10 minutes.
If you re-visit this notebook, uncomment the following cell and run the command to retrieve your dataset. Replace YOUR_DATASET_NAME with its actual value obtained in the preceding cells. YOUR_DATASET_NAME is a string in the format of 'projects/<project_id>/locations/<location>/datasets/<dataset_id>'.
In [ ]:
# dataset_name = '<YOUR_DATASET_NAME>' #@param {type: 'string'}
# dataset = tables_client.get_dataset(dataset_name=dataset_name)
In [ ]:
# List table specs.
list_table_specs_response = tables_client.list_table_specs(dataset=dataset)
table_specs = [s for s in list_table_specs_response]
# List column specs.
list_column_specs_response = tables_client.list_column_specs(dataset=dataset)
column_specs = {s.display_name: s for s in list_column_specs_response}
# Print Features and data_type.
features = [(key, data_types.TypeCode.Name(value.data_type.type_code))
            for key, value in column_specs.items()]
print('Feature list:\n')
for feature in features:
    print(feature[0], ':', feature[1])
In [ ]:
# Table schema pie chart.
type_counts = {}
for column_spec in column_specs.values():
    type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)
    type_counts[type_name] = type_counts.get(type_name, 0) + 1
plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')
plt.axis('equal')
plt.show()
In the pie chart above, you see this dataset contains three variable types: FLOAT64 (treated as Numeric), CATEGORY (treated as Categorical), and STRING (treated as Text).
AutoML Tables automatically detects your data column type.
There are a total of 120 columns in this stockout dataset.
Run the following command to check the column data types that were automatically detected. If a column contains only numerical values but they represent categories, change that column's data type to categorical by updating your schema.
In addition, AutoML Tables detects Stockout to be categorical, and therefore chooses to run a classification model.
In [ ]:
# Print column data types.
for column in column_specs:
    print(column, '-', column_specs[column].data_type)
From the column data types, you notice that Item_Number, Category, Vendor_Number, Store_Number, Zip_Code and County_Number have been auto-detected as FLOAT64 (Numerical) instead of CATEGORY (Categorical).
In this solution, the columns Item_Number, Category, Vendor_Number and Store_Number are not nullable, but Zip_Code and County_Number can take null values.
To change the data type, you can update the schema by updating the column spec.
In [ ]:
type_code='CATEGORY' #@param {type:'string'}
# Update dataset.
categorical_column_names = ['Item_Number', 'Category', 'Vendor_Number',
'Store_Number', 'Zip_Code', 'County_Number']
is_nullable = [False, False, False, False, True, True]
for column_name, nullable in zip(categorical_column_names, is_nullable):
    tables_client.update_column_spec(
        dataset=dataset,
        column_spec_display_name=column_name,
        type_code=type_code,
        nullable=nullable,
    )
In [ ]:
#@title Update dataset { vertical-output: true }
target_column_name = 'Stockout' #@param {type: 'string'}
update_dataset_response = tables_client.set_target_column(
dataset=dataset,
column_spec_display_name=target_column_name,
)
update_dataset_response
Training the model may take one hour or more. To obtain results with less training time or budget, you can set train_budget_milli_node_hours, which is the training budget for creating this model, expressed in milli node hours; a value of 1,000 in this field means 1 node hour.
For demonstration purposes, the following command sets the budget to 1 node hour ('train_budget_milli_node_hours': 1000). You can increase that number up to a maximum of 72 hours ('train_budget_milli_node_hours': 72000) for the best model performance.
Even with a budget of 1 node hour (the minimum possible budget), training a model can take more than the specified node hours.
You can also select the objective to optimize your model training by setting optimization_objective. This solution optimizes the model by maximizing the Area Under the Precision-Recall (PR) Curve.
In [ ]:
# The number of hours to train the model.
model_train_hours = 1 #@param {type:'integer'}
# Set optimization objective to train a model.
model_optimization_objective = 'MAXIMIZE_AU_PRC' #@param {type:'string'}
create_model_response = tables_client.create_model(
MODEL_DISPLAY_NAME,
dataset=dataset,
train_budget_milli_node_hours=model_train_hours*1000,
optimization_objective=model_optimization_objective,
)
operation_id = create_model_response.operation.name
print('Create model operation: {}'.format(create_model_response.operation))
In [ ]:
# Wait until model training is done.
model = create_model_response.result()
model_name = model.name
model
If your Colab times out, use tables_client.list_models() to check whether your model has been created. Then uncomment the following cell and run the command to retrieve your model. Replace YOUR_MODEL_NAME with its actual value obtained in the preceding cell. YOUR_MODEL_NAME is a string in the format of 'projects/<project_id>/locations/<location>/models/<model_id>'.
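Alternatively, if you only have the display name, a minimal sketch like the following (using the same list_models() call shown earlier) recovers the full resource name:
In [ ]:
# Look up the model resource name by its display name.
for m in tables_client.list_models():
    if m.display_name == MODEL_DISPLAY_NAME:
        model_name = m.name
        print('Found model: {}'.format(model_name))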
In [ ]:
#model_name = '<YOUR_MODEL_NAME>' #@param {type: 'string'}
# model = tables_client.get_model(model_name=model_name)
Your data source for batch prediction can be GCS or BigQuery. For this solution, you will use a BigQuery table as the input source. The URI for your table is in the format of bq://PROJECT_ID.DATASET_ID.TABLE_ID.
To write out the predictions, you need to specify a GCS bucket gs://BUCKET_NAME. AutoML Tables logs any errors in the errors.csv file.
NOTE: The batch prediction output file(s) will be written to the GCS bucket that you set in the preceding cells.
In [ ]:
#@title Start batch prediction { vertical-output: true, output-height: 200 }
batch_predict_bq_input_uri = 'bq://product-stockout.product_stockout.batch_prediction_inputs' #@param {type:'string'}
batch_predict_gcs_output_uri_prefix = 'gs://{}'.format(BUCKET_NAME) #@param {type:'string'}
batch_predict_response = tables_client.batch_predict(
model_name=model_name,
bigquery_input_uri=batch_predict_bq_input_uri,
gcs_output_uri_prefix=batch_predict_gcs_output_uri_prefix,
)
print('Batch prediction operation: {}'.format(batch_predict_response.operation))
# Wait until batch prediction is done.
batch_predict_result = batch_predict_response.result()
batch_predict_response.metadata
In [ ]:
# Check prediction results.
gcs_output_directory = batch_predict_response.metadata.batch_predict_details\
.output_info.gcs_output_directory
result_file = gcs_output_directory + 'tables_1.csv'
print('Batch prediction results are stored as: {}'.format(result_file))
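To take a quick look at the predictions, one option is to copy the output file locally and load it with pandas. This is only a sketch; it assumes pandas is installed and that the output file name matches the one printed above.
In [ ]:
# Copy the batch prediction output locally and preview the first rows.
# Assumes pandas is available (e.g. `pip install pandas`).
import pandas as pd

! gsutil cp $result_file .
predictions = pd.read_csv('tables_1.csv')
predictions.head()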
To clean up all GCP resources used in this project, you can delete the GCP project you used for the tutorial.
In [ ]:
# Delete model resource.
tables_client.delete_model(model_name=model_name)
# Delete dataset resource.
tables_client.delete_dataset(dataset_name=dataset_name)
# Delete Cloud Storage objects that were created.
! gsutil -m rm -r gs://$BUCKET_NAME
# If the model training operation is still running, cancel it.
automl_client.transport._operations_client.cancel_operation(operation_id)