In [ ]:
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0 
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Retail Product Stockouts Prediction using AutoML Tables


Overview

AutoML Tables enables you to build machine learning models based on tables of your own data and host them on Google Cloud for scalability. This notebook demonstrates how you can use AutoML Tables to solve a product stockouts problem in the retail industry. The problem is solved using a binary classification approach, which predicts whether a particular product at a certain store will be out of stock in the next four weeks. Once the solution is built, you can plug it into your production system and proactively predict stock-outs for your business.

Dataset

In this solution, you will use two datasets: Training/Evaluation data and Batch Prediction inputs. To access the datasets in BigQuery, you need the following information.

Training/Evaluation dataset
  • Project ID: product-stockout
  • Dataset ID: product_stockout
  • Table ID: stockout
Batch Prediction inputs
  • Project ID: product-stockout
  • Dataset ID: product_stockout
  • Table ID: batch_prediction_inputs
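
When importing data and running batch prediction later in this notebook, these identifiers are combined into BigQuery URIs of the form bq://PROJECT_ID.DATASET_ID.TABLE_ID; for example:


In [ ]:
# BigQuery URIs for the tables listed above (the same values are used in the
# Constants and batch prediction cells later in this notebook).
TRAINING_INPUT_URI = 'bq://product-stockout.product_stockout.stockout'
BATCH_PREDICTION_INPUT_URI = 'bq://product-stockout.product_stockout.batch_prediction_inputs'
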
Data Schema

Field name | Datatype | Type | Description
Item_Number | STRING | Identifier | Product/item identifier
Category | STRING | Identifier | Several items can belong to one category
Vendor_Number | STRING | Identifier | Product vendor identifier
Store_Number | STRING | Identifier | Store identifier
Item_Description | STRING | Text feature | Item description
Category_Name | STRING | Text feature | Category name
Vendor_Name | STRING | Text feature | Vendor name
Store_Name | STRING | Text feature | Store name
Address | STRING | Text feature | Address
City | STRING | Categorical feature | City
Zip_Code | STRING | Categorical feature | Zip code
Store_Location | STRING | Categorical feature | Store location
County_Number | STRING | Categorical feature | County number
County | STRING | Categorical feature | County name
Weekly Sales Quantity | INTEGER | Time-series data | 52 columns for weekly sales quantity, from week 1 to week 52
Weekly Sales Dollars | INTEGER | Time-series data | 52 columns for weekly sales dollars, from week 1 to week 52
Inventory | FLOAT | Numeric feature | Inventory stocked by the retailer, based on past sales and product seasonality, to meet demand for future sales
Stockout | INTEGER | Label | 1 - stock-out, 0 - no stock-out. A stock-out occurs when the inventory in stock cannot meet the demand for the next four weeks of sales. Predicting four weeks ahead gives the retailer an early warning, with enough lead time to replenish the stock.

To use AutoML Tables with BigQuery you do not need to download this dataset. However, if you would like to use AutoML Tables with GCS, you may want to download this dataset and upload it to a storage bucket in your GCP project.

Instructions to download dataset:

  1. Sample Dataset: Download this dataset which contains sales data.

  2. Upload this dataset to GCS or BigQuery (optional).
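
If you choose the optional upload, here is a minimal sketch of the commands, assuming the file was downloaded locally as stockout.csv; the bucket and dataset names below are placeholders that you would replace with your own:


In [ ]:
# Hypothetical upload of the downloaded CSV; the names below are placeholders.
# Copy the file into a Cloud Storage bucket you own:
! gsutil cp stockout.csv gs://your-bucket-name/data/stockout.csv
# Or load it into a BigQuery table in your project:
! bq load --autodetect --source_format=CSV your_dataset.stockout stockout.csv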

Objective

Problem statement

A stockout, or out-of-stock (OOS) event, occurs when inventory is exhausted. While out-of-stocks can occur along the entire supply chain, the most visible kind are retail out-of-stocks in the fast-moving consumer goods industry (e.g., sweets, diapers, fruits). Stockouts are the opposite of overstocks, where too much inventory is retained.

Impact

According to a study by researchers Thomas Gruen and Daniel Corsten, the global average level of out-of-stocks within the retail fast-moving consumer goods sector across developed economies was 8.3% in 2002. This means that shoppers would have only a 42% chance of fulfilling a ten-item shopping list without encountering a stockout. Despite initiatives designed to improve the collaboration of retailers and their suppliers, such as Efficient Consumer Response (ECR), and despite the increasing use of new technologies such as radio-frequency identification (RFID) and point-of-sale data analytics, this situation has improved little over the past decades.
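
As a quick check on that figure: with an 8.3% per-item out-of-stock rate and independence assumed across items, the probability of finding all ten items in stock is about (1 - 0.083)^10 ≈ 0.42.


In [ ]:
# Quick check of the 42% figure: probability that all 10 items on a shopping
# list are in stock, assuming an 8.3% out-of-stock rate and independent items.
print((1 - 0.083) ** 10)  # ~0.42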

The biggest impacts are:

  • Customer dissatisfaction
  • Loss of revenue

Machine Learning Solution

Using machine learning to predict stock-outs can help with store operations and thus prevent out-of-stock events proactively.

There are three big challenges any retailer would face as they try and solve this problem with machine learning:

  1. Data silos: Sales data, supply-chain data, inventory data, etc. may all be in silos. Such disjoint datasets could be a challenge to work with as a machine learning model tries to derive insights from all these data points.
  2. Missing features: Features such as vendor location, weather conditions, etc. could add a lot of value for a machine learning algorithm to learn from. But such features are not always available, so when building machine learning solutions, treat feature collection as an iterative process for improving the model.
  3. Imbalanced dataset: Datasets for classification problems such as retail stock-out are traditionally very imbalanced, with fewer cases of stock-out. Designing machine learning solutions by hand for such problems would be a time-consuming effort when your team should be focusing on collecting features.

Hence, we recommend using AutoML Tables. With AutoML Tables you only need to work on acquiring data and features, and AutoML Tables does the rest. This makes solving the stock-out problem with machine learning nearly a one-click experience.

Costs

This tutorial uses billable components of Google Cloud Platform (GCP):

  • Cloud AI Platform
  • Cloud Storage
  • BigQuery
  • AutoML Tables

Learn about Cloud AI Platform pricing, Cloud Storage pricing, BigQuery pricing, AutoML Tables pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.

Set up your local development environment

If you are using Colab or AI Platform Notebooks, your environment already meets all the requirements to run this notebook. If you are using AI Platform Notebooks, make sure the machine configuration type is 1 vCPU, 3.75 GB RAM or above. You can skip this step.

Otherwise, make sure your environment meets this notebook's requirements. You need the following:

  • The Google Cloud SDK
  • Git
  • Python 3
  • virtualenv
  • Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to Setting up a Python development environment and the Jupyter installation guide provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:

  1. Install and initialize the Cloud SDK.

  2. Install Python 3.

  3. Install virtualenv and create a virtual environment that uses Python 3.

  4. Activate that environment and run pip install jupyter in a shell to install Jupyter.

  5. Run jupyter notebook in a shell to launch Jupyter.

  6. Open this notebook in the Jupyter Notebook Dashboard.

Set up your GCP project

The following steps are required, regardless of your notebook environment.

  1. Select or create a GCP project. When you first create an account, you get a $300 free credit towards your compute/storage costs.

  2. Make sure that billing is enabled for your project.

  3. Enable the AI Platform APIs and Compute Engine APIs.

  4. Enable the AutoML API. (A command-line alternative is sketched after this list.)
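
If you prefer the command line, the same APIs can be enabled with gcloud. This is a minimal sketch and assumes the Cloud SDK is installed and already configured to use your project:


In [ ]:
# Command-line alternative for enabling the required APIs
# (assumes gcloud is configured with your project).
! gcloud services enable automl.googleapis.com compute.googleapis.com ml.googleapis.com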

PIP Install Packages and dependencies

Install additional dependencies not installed in the notebook environment.


In [ ]:
! pip install --upgrade --quiet --user google-cloud-automl
! pip install matplotlib

Note: Try installing with sudo if the above commands throw any permission errors.

Restart the kernel to allow automl_v1beta1 to be imported for Jupyter Notebooks.


In [ ]:
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

Set up your GCP Project Id

Enter your Project Id in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.


In [ ]:
PROJECT_ID = "[your-project-id]" #@param {type:"string"}
COMPUTE_REGION = "us-central1" # Currently only supported region.

Authenticate your GCP account

If you are using AI Platform Notebooks, your environment is already authenticated. Skip this step.

Otherwise, follow these steps:

  1. In the GCP Console, go to the Create service account key page.

  2. From the Service account drop-down list, select New service account.

  3. In the Service account name field, enter a name.

  4. From the Role drop-down list, select AutoML > AutoML Admin, Storage > Storage Object Admin, and BigQuery > BigQuery Admin.

  5. Click Create. A JSON file that contains your key downloads to your local environment. (A command-line sketch of these steps follows this list.)
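
If you prefer the command line, here is a rough gcloud equivalent of the Console steps above. This is a sketch only: the service account name automl-tables-sa and the key file name key.json are placeholders, and the roles match those listed in step 4.


In [ ]:
# Hypothetical command-line alternative to the Console steps above.
# The service account name and key file name are placeholders.
SA_EMAIL = 'automl-tables-sa@{}.iam.gserviceaccount.com'.format(PROJECT_ID)

! gcloud iam service-accounts create automl-tables-sa --project $PROJECT_ID
# Grant the AutoML Admin role; repeat for roles/storage.objectAdmin and roles/bigquery.admin.
! gcloud projects add-iam-policy-binding $PROJECT_ID --member serviceAccount:$SA_EMAIL --role roles/automl.admin
# Create and download a JSON key for the service account.
! gcloud iam service-accounts keys create key.json --iam-account $SA_EMAIL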

Note: Jupyter runs lines prefixed with ! as shell commands, and it interpolates Python variables prefixed with $ into these commands.


In [ ]:
# Upload the downloaded JSON file that contains your key.
import sys

if 'google.colab' in sys.modules:    
  from google.colab import files
  keyfile_upload = files.upload()
  keyfile = list(keyfile_upload.keys())[0]
  %env GOOGLE_APPLICATION_CREDENTIALS $keyfile
  ! gcloud auth activate-service-account --key-file $keyfile

If you are running the notebook locally, enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below and run the cell.


In [ ]:
# If you are running this notebook locally, replace the string below with the
# path to your service account key and run this cell to authenticate your GCP
# account.

%env GOOGLE_APPLICATION_CREDENTIALS /path/to/service/account
! gcloud auth activate-service-account --key-file '/path/to/service/account'

Create a Cloud Storage bucket

The following steps are required, regardless of your notebook environment.

When you submit a training job using the Cloud SDK, you upload a Python package containing your training code to a Cloud Storage bucket. AI Platform runs the code from this package. In this tutorial, AI Platform also saves the trained model that results from your job in the same bucket. You can then create an AI Platform model version based on this output in order to serve online predictions.

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.

You may also change the COMPUTE_REGION variable, which is used for operations throughout the rest of this notebook. Make sure to choose a region where Cloud AI Platform services are available. You may not use a Multi-Regional Storage bucket for training with AI Platform.


In [ ]:
BUCKET_NAME = "[your-bucket-name]" #@param {type:"string"}

Only if your bucket doesn't exist: Run the following cell to create your Cloud Storage bucket. Make sure the Storage > Storage Admin role is enabled for your account.


In [ ]:
! gsutil mb -p $PROJECT_ID -l $COMPUTE_REGION gs://$BUCKET_NAME

Finally, validate access to your Cloud Storage bucket by examining its contents:


In [ ]:
! gsutil ls -al gs://$BUCKET_NAME

Import libraries and define constants

Import relevant packages.


In [ ]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

In [ ]:
# AutoML library.
from google.cloud import automl_v1beta1 as automl
import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types
import matplotlib.pyplot as plt

Populate the following cell with the necessary constants and run it to initialize constants.


In [ ]:
#@title Constants { vertical-output: true }

# A name for the AutoML tables Dataset to create.
DATASET_DISPLAY_NAME = 'stockout_data' #@param {type: 'string'}
# The BigQuery Dataset URI to import data from.
BQ_INPUT_URI = 'bq://product-stockout.product_stockout.stockout' #@param {type: 'string'}
# A name for the AutoML tables model to create.
MODEL_DISPLAY_NAME = 'stockout_model' #@param {type: 'string'}

assert all([
    PROJECT_ID,
    COMPUTE_REGION,
    DATASET_DISPLAY_NAME,
    BQ_INPUT_URI,
    MODEL_DISPLAY_NAME,
])

Initialize the client for AutoML and AutoML Tables.


In [ ]:
# Initialize the clients.
automl_client = automl.AutoMlClient()
tables_client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION)

Test the set up

To test whether your project setup and authentication steps were successful, run the following cell to list the datasets in this project.

If no dataset has previously been imported into AutoML Tables, you should expect an empty result.


In [ ]:
# List the datasets.
list_datasets = tables_client.list_datasets()
datasets = { dataset.display_name: dataset.name for dataset in list_datasets }
datasets

You can also print the list of your models by running the following cell.

If no model has previously been trained using AutoML Tables, you should expect an empty result.


In [ ]:
# List the models.
list_models = tables_client.list_models()
models = { model.display_name: model.name for model in list_models }
models

Import training data

Create dataset

Select a dataset display name and pass your table source information to create a new dataset.


In [ ]:
# Create dataset.
dataset = tables_client.create_dataset(DATASET_DISPLAY_NAME)
dataset_name = dataset.name
dataset

Import data

You can import your data to AutoML Tables from GCS or BigQuery. For this solution, you will import data from a BigQuery Table. The URI for your table is in the format of bq://PROJECT_ID.DATASET_ID.TABLE_ID.

The BigQuery table used for demonstration purposes can be accessed as bq://product-stockout.product_stockout.stockout.

See the table schema and dataset description from the README.


In [ ]:
# Import data.
import_data_response = tables_client.import_data(
    dataset=dataset,
    bigquery_input_uri=BQ_INPUT_URI,
)
print('Dataset import operation: {}'.format(import_data_response.operation))

# Synchronous check of operation status. Wait until import is done.
print('Dataset import response: {}'.format(import_data_response.result()))

# Verify the status by checking the example_count field.
dataset = tables_client.get_dataset(dataset_name=dataset_name)
dataset

Importing this stockout dataset takes about 10 minutes.

If you re-visit this Notebook, uncomment the following cell and run the command to retrieve your dataset. Replace YOUR_DATASET_NAME with its actual value obtained in the preceding cells.

YOUR_DATASET_NAME is a string in the format of 'projects/<project_id>/locations/<location>/datasets/<dataset_id>'.


In [ ]:
# dataset_name = '<YOUR_DATASET_NAME>' #@param {type: 'string'}
# dataset = tables_client.get_dataset(dataset_name=dataset_name)

Review the specs

Run the following command to see table specs such as row count.


In [ ]:
# List table specs.
list_table_specs_response = tables_client.list_table_specs(dataset=dataset)
table_specs = [s for s in list_table_specs_response]

# List column specs.
list_column_specs_response = tables_client.list_column_specs(dataset=dataset)
column_specs = {s.display_name: s for s in list_column_specs_response}

# Print Features and data_type.
features = [(key, data_types.TypeCode.Name(value.data_type.type_code))
            for key, value in column_specs.items()]
print('Feature list:\n')
for feature in features:
    print(feature[0],':', feature[1])

In [ ]:
# Table schema pie chart.
type_counts = {}
for column_spec in column_specs.values():
  type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)
  type_counts[type_name] = type_counts.get(type_name, 0) + 1
    
plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')
plt.axis('equal')
plt.show()

In the pie chart above, you see this dataset contains three variable types: FLOAT64 (treated as Numeric), CATEGORY (treated as Categorical) and STRING (treated as Text).

Update dataset: assign a label column and enable nullable columns

Get column specs

AutoML Tables automatically detects your data column type.

There are a total of 120 columns in this stockout dataset.

Run the following command to check the column data types that were automatically detected. If a column contains only numerical values but actually represents categories, change that column's data type to categorical by updating your schema.

In addition, AutoML Tables detects Stockout to be categorical and therefore chooses to run a classification model.


In [ ]:
# Print column data types.
for column in column_specs:
    print(column, '-', column_specs[column].data_type)

Update columns: make categorical

From the column data types, you can see that Item_Number, Category, Vendor_Number, Store_Number, Zip_Code and County_Number have been auto-detected as FLOAT64 (Numerical) instead of CATEGORY (Categorical).

In this solution, the columns Item_Number, Category, Vendor_Number and Store_Number are not nullable, but Zip_Code and County_Number can take null values.

To change the data type, you can update the schema by updating the column spec.


In [ ]:
type_code='CATEGORY' #@param {type:'string'}

# Update dataset.
categorical_column_names = ['Item_Number', 'Category', 'Vendor_Number', 
                            'Store_Number', 'Zip_Code', 'County_Number']

is_nullable = [False, False, False, False, True, True]                      

for i in range(len(categorical_column_names)):
    column_name = categorical_column_names[i]
    nullable = is_nullable[i]
    tables_client.update_column_spec(
        dataset=dataset,
        column_spec_display_name=column_name,
        type_code=type_code,
        nullable=nullable,
    )

Update dataset: Assign a label

Select the target column and update the dataset.


In [ ]:
#@title Update dataset { vertical-output: true }

target_column_name = 'Stockout' #@param {type: 'string'}
update_dataset_response = tables_client.set_target_column(
    dataset=dataset,
    column_spec_display_name=target_column_name,
)
update_dataset_response

Creating a model

Train a model

Training the model may take one hour or more. To control training time and budget, you can set train_budget_milli_node_hours, the training budget for this model expressed in milli node hours; a value of 1,000 means 1 node hour.

For demonstration purposes, the following command sets the budget to 1 node hour ('train_budget_milli_node_hours': 1000). You can increase that number up to a maximum of 72 node hours ('train_budget_milli_node_hours': 72000) for the best model performance.

Even with a budget of 1 node hour (the minimum possible budget), training a model can take longer than the specified number of node hours.

You can also select the objective to optimize your model training by setting optimization_objective. This solution optimizes the model by maximizing the Area Under the Precision-Recall (PR) Curve.


In [ ]:
# The number of hours to train the model.
model_train_hours = 1 #@param {type:'integer'}
# Set optimization objective to train a model.
model_optimization_objective = 'MAXIMIZE_AU_PRC' #@param {type:'string'}

create_model_response = tables_client.create_model(
    MODEL_DISPLAY_NAME,
    dataset=dataset,
    train_budget_milli_node_hours=model_train_hours*1000,
    optimization_objective=model_optimization_objective,
)
operation_id = create_model_response.operation.name

print('Create model operation: {}'.format(create_model_response.operation))

In [ ]:
# Wait until model training is done.
model = create_model_response.result()
model_name = model.name
model

If your Colab times out, use tables_client.list_models() to check whether your model has been created.

Then uncomment the following cell and run the command to retrieve your model. Replace YOUR_MODEL_NAME with its actual value obtained in the preceding cell.

YOUR_MODEL_NAME is a string in the format of 'projects/<project_id>/locations/<location>/models/<model_id>'


In [ ]:
# model_name = '<YOUR_MODEL_NAME>' #@param {type: 'string'}
# model = tables_client.get_model(model_name=model_name)
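
Alternatively, you can recover the model resource name from its display name; a minimal sketch using the tables_client initialized earlier:


In [ ]:
# Recover the model resource name by matching on the display name.
for m in tables_client.list_models():
    if m.display_name == MODEL_DISPLAY_NAME:
        model_name = m.name
        print('Found model: {}'.format(model_name))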

Batch prediction

Initialize prediction

Your data source for batch prediction can be GCS or BigQuery. For this solution, you will use a BigQuery Table as the input source. The URI for your table is in the format of bq://PROJECT_ID.DATASET_ID.TABLE_ID.

To write out the predictions, you need to specify a GCS bucket gs://BUCKET_NAME.

AutoML Tables logs errors in the errors.csv file.

NOTE: The batch prediction output file(s) will be written to the GCS bucket that you set in the preceding cells.


In [ ]:
#@title Start batch prediction { vertical-output: true, output-height: 200 }
batch_predict_bq_input_uri = 'bq://product-stockout.product_stockout.batch_prediction_inputs' #@param {type:'string'}
batch_predict_gcs_output_uri_prefix = 'gs://{}'.format(BUCKET_NAME)  #@param {type:'string'}

batch_predict_response = tables_client.batch_predict(
    model_name=model_name, 
    bigquery_input_uri=batch_predict_bq_input_uri,
    gcs_output_uri_prefix=batch_predict_gcs_output_uri_prefix,
)
print('Batch prediction operation: {}'.format(batch_predict_response.operation))

# Wait until batch prediction is done.
batch_predict_result = batch_predict_response.result()
batch_predict_response.metadata

In [ ]:
# Check prediction results.
gcs_output_directory = batch_predict_response.metadata.batch_predict_details\
                       .output_info.gcs_output_directory
result_file = gcs_output_directory + 'tables_1.csv'
print('Batch prediction results are stored as: {}'.format(result_file))
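
To take a quick look at the predictions, one option is to copy the results file locally and load it with pandas; this is a sketch and assumes the result_file printed above exists and that pandas is installed in your environment:


In [ ]:
# Optional inspection step: download the prediction results and preview the first rows.
import pandas as pd

! gsutil cp $result_file .
predictions = pd.read_csv('tables_1.csv')
predictions.head()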

Cleaning up

To clean up all GCP resources used in this project, you can delete the GCP project you used for the tutorial.


In [ ]:
# Delete model resource.
tables_client.delete_model(model_name=model_name)

# Delete dataset resource.
tables_client.delete_dataset(dataset_name=dataset_name)

# Delete Cloud Storage objects that were created.
! gsutil -m rm -r gs://$BUCKET_NAME

# If training model is still running, cancel it.
automl_client.transport._operations_client.cancel_operation(operation_id)