In [ ]:
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Getting Started with AutoML Tables


Overview

Google’s AutoML lets software engineers build high-quality models without needing expertise in building, training, or deploying/serving models on the cloud. Instead, you only need to know how to curate a dataset, evaluate the results, and follow the how-to steps.

AutoML Tables is a supervised learning service. This means that you train a machine learning model with example data. AutoML Tables uses tabular (structured) data to train a machine learning model to make predictions on new data. One column from your dataset, called the target, is what your model will learn to predict. Some number of the other data columns are inputs (called features) that the model will learn patterns from.

In this notebook, we will use the AutoML Tables Python client library to create a binary classification model on the Census Income Dataset.

We will provide the training and evaluation data. Once the dataset is created, we will use the AutoML API to train a model and then run predictions that estimate whether a given individual's income is above or below 50K, given information such as the person's age, education level, marital status, and occupation.

For setting up a Google Cloud Platform (GCP) account for using AutoML, please see the online documentation for Getting Started.

Dataset

This tutorial uses the United States Census Income Dataset provided by the UC Irvine Machine Learning Repository, containing information about people from a 1994 Census database, including age, education, marital status, occupation, and whether they make more than $50,000 a year. The dataset consists of over 30k rows, where each row corresponds to a different person. For a given row, there are 14 features that the model conditions on to predict the person's income. A few of the features are named above; the exhaustive list can be found at the dataset link above.

Costs

This tutorial uses billable components of Google Cloud Platform (GCP):

  • Cloud AI Platform
  • Cloud Storage
  • AutoML Tables

Learn about Cloud AI Platform pricing, Cloud Storage pricing, AutoML Tables pricing and use the Pricing Calculator to generate a cost estimate based on your projected usage.

Set up your local development environment

If you are using Colab or AI Platform Notebooks, your environment already meets all the requirements to run this notebook, and you can skip this step. If you are using AI Platform Notebooks, make sure the machine configuration type is 1 vCPU, 3.75 GB RAM or above.

Otherwise, make sure your environment meets this notebook's requirements. You need the following:

  • The Google Cloud SDK
  • Git
  • Python 3
  • virtualenv
  • Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to Setting up a Python development environment and the Jupyter installation guide provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions (a shell sketch of steps 3 through 5 follows the list):

  1. Install and initialize the Cloud SDK.

  2. Install Python 3.

  3. Install virtualenv and create a virtual environment that uses Python 3.

  4. Activate that environment and run pip install jupyter in a shell to install Jupyter.

  5. Run jupyter notebook in a shell to launch Jupyter.

  6. Open this notebook in the Jupyter Notebook Dashboard.
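
As a rough sketch, steps 3 through 5 look like the following in a Unix-like shell (the environment name automl-env is just an example):

  pip3 install --user virtualenv
  python3 -m virtualenv automl-env   # create a Python 3 virtual environment
  source automl-env/bin/activate     # activate the environment
  pip install jupyter                # install Jupyter inside it
  jupyter notebook                   # launch the notebook server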

Set up your GCP project

The following steps are required, regardless of your notebook environment.

  1. Select or create a GCP project. When you first create an account, you get a $300 free credit towards your compute/storage costs.

  2. Make sure that billing is enabled for your project.

  3. Enable the AI Platform APIs and Compute Engine APIs.

  4. Enable the AutoML API (if you prefer the command line, see the optional gcloud sketch after this list).
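
If you already have the Cloud SDK configured for your project, the same APIs can also be enabled from a terminal. This is only a sketch: replace your-project-id with your project, and the service identifiers below are the standard API names (the Console links above remain the documented path).

  gcloud services enable compute.googleapis.com ml.googleapis.com automl.googleapis.com --project your-project-id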

PIP Install Packages and dependencies

Install additional dependencies not installed in the notebook environment.


In [ ]:
# Use the latest major GA version of the framework.
! pip install --upgrade --quiet --user google-cloud-automl

Note: Try installing with sudo if the above command throws any permission errors.

Restart the kernel so that automl_v1beta1 can be imported in Jupyter notebooks.


In [ ]:
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")


Out[ ]:

Set up your GCP Project Id

Enter your Project Id in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.


In [ ]:
PROJECT_ID = "[your-project-id]" #@param {type:"string"}
COMPUTE_REGION = "us-central1" # Currently only supported region.
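
The cell above only sets Python variables. Optionally, you can also point the Cloud SDK at the same project so that the gcloud and gsutil commands in this notebook default to it; this assumes gcloud is installed and initialized (it already is on Colab and AI Platform Notebooks).


In [ ]:
! gcloud config set project $PROJECT_ID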

Authenticate your GCP account

If you are using AI Platform Notebooks, your environment is already authenticated. Skip this step.

Otherwise, follow these steps:

  1. In the GCP Console, go to the Create service account key page.

  2. From the Service account drop-down list, select New service account.

  3. In the Service account name field, enter a name.

  4. From the Role drop-down list, select AutoML > AutoML Admin and Storage > Storage Object Admin.

  5. Click Create. A JSON file that contains your key downloads to your local environment.

Note: Jupyter runs lines prefixed with ! as shell commands, and it interpolates Python variables prefixed with $ into these commands.


In [ ]:
# Upload the downloaded JSON file that contains your key.
import sys

if 'google.colab' in sys.modules:    
  from google.colab import files
  keyfile_upload = files.upload()
  keyfile = list(keyfile_upload.keys())[0]
  %env GOOGLE_APPLICATION_CREDENTIALS $keyfile
  ! gcloud auth activate-service-account --key-file $keyfile

If you are running the notebook locally, enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below and run the cell.


In [ ]:
# If you are running this notebook locally, replace the string below with the
# path to your service account key and run this cell to authenticate your GCP
# account.

%env GOOGLE_APPLICATION_CREDENTIALS /path/to/service/account
! gcloud auth activate-service-account --key-file '/path/to/service/account'

Create a Cloud Storage bucket

The following steps are required, regardless of your notebook environment.

In this tutorial, the Cloud Storage bucket is used to stage the training CSV that AutoML Tables imports, and to hold the input and output files for batch prediction.

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.

The COMPUTE_REGION variable set earlier is used for operations throughout the rest of this notebook; AutoML Tables currently supports only us-central1. You may not use a Multi-Regional Storage bucket for training.


In [ ]:
BUCKET_NAME = "[your-bucket-name]" #@param {type:"string"}

Only if your bucket doesn't already exist: run the following cell to create your Cloud Storage bucket. Make sure the account you authenticated with has the Storage > Storage Admin role.


In [ ]:
! gsutil mb -p $PROJECT_ID -l $COMPUTE_REGION gs://$BUCKET_NAME

Finally, validate access to your Cloud Storage bucket by examining its contents:


In [ ]:
! gsutil ls -al gs://$BUCKET_NAME

Import libraries and define constants

Import relevant packages.


In [ ]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

In [ ]:
# AutoML library.
from google.cloud import automl_v1beta1 as automl
import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types

In [ ]:
import matplotlib.pyplot as plt
from ipywidgets import interact
import ipywidgets as widgets

Populate the following cell with the necessary constants and run it.


In [ ]:
#@title Constants { vertical-output: true }

# A name for the AutoML tables Dataset to create.
DATASET_DISPLAY_NAME = 'census' #@param {type: 'string'}
# The GCS data to import data from (doesn't need to exist).
INPUT_CSV_NAME = 'census_income' #@param {type: 'string'}
# A name for the AutoML tables model to create.
MODEL_DISPLAY_NAME = 'census_income_model' #@param {type: 'string'}

assert all([
    PROJECT_ID,
    COMPUTE_REGION,
    DATASET_DISPLAY_NAME,
    INPUT_CSV_NAME,
    MODEL_DISPLAY_NAME,
])

Initialize client for AutoML and AutoML Tables


In [ ]:
# Initialize the clients.
automl_client = automl.AutoMlClient()
tables_client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION)

Test the set up

To test whether your project setup and authentication steps were successful, run the following cell to list the datasets in this project.

If no dataset has previously been imported into AutoML Tables, you should expect an empty result.


In [ ]:
# List the datasets.
list_datasets = tables_client.list_datasets()
datasets = { dataset.display_name: dataset.name for dataset in list_datasets }
datasets

You can also print the list of your models by running the following cell. If no model has previously been trained using AutoML Tables, you should expect an empty result.


In [ ]:
# List the models.
list_models = tables_client.list_models()
models = { model.display_name: model.name for model in list_models }
models

Import training data

Create dataset

Now we are ready to create a dataset instance (on GCP) using the client method create_dataset(). This method has one required parameter, the human readable display name DATASET_DISPLAY_NAME.

Select a dataset display name and pass your table source information to create a new dataset.


In [ ]:
# Create dataset.
dataset = tables_client.create_dataset(
          dataset_display_name=DATASET_DISPLAY_NAME)
dataset_name = dataset.name
dataset

Import data

You can import your data to AutoML Tables from GCS or BigQuery. For this tutorial, you can use the census_income dataset as your training data. The code below automatically copies the data into a bucket you own. You are free to adjust the value of BUCKET_NAME as needed.


In [ ]:
GCS_DATASET_URI = 'gs://{}/{}.csv'.format(BUCKET_NAME, INPUT_CSV_NAME)
! gsutil ls gs://$BUCKET_NAME || gsutil mb -l $COMPUTE_REGION gs://$BUCKET_NAME
! gsutil cp gs://cloud-ml-data-tables/notebooks/census_income.csv $GCS_DATASET_URI
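
Optionally, take a quick look at the beginning of the CSV you just copied to confirm it matches the dataset description above. This is just a sanity check; the byte range below is an arbitrary size chosen to show the header row and the first few records.


In [ ]:
# Print roughly the first few lines (the first 1000 bytes) of the training CSV.
! gsutil cat -r 0-1000 $GCS_DATASET_URI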

Import the data into the dataset. This process may take a while, depending on your data. Once it completes, you can verify the status by printing the dataset object; this time, pay attention to the example_count field, which should show 32461 records.


In [ ]:
# Read the data source from GCS. 
import_data_response = tables_client.import_data(
    dataset=dataset,
    gcs_input_uris=GCS_DATASET_URI
)
print('Dataset import operation: {}'.format(import_data_response.operation))

# Synchronous check of operation status. Wait until import is done.
print('Dataset import response: {}'.format(import_data_response.result()))

# Verify the status by checking the example_count field.
dataset = tables_client.get_dataset(dataset_name=dataset_name)
dataset

Review the specs

Run the following command to see table specs such as row count.


In [ ]:
# List table specs.
list_table_specs_response = tables_client.list_table_specs(dataset=dataset)
table_specs = [s for s in list_table_specs_response]

# List column specs.
list_column_specs_response = tables_client.list_column_specs(dataset=dataset)
column_specs = {s.display_name: s for s in list_column_specs_response}

# Print Features and data_type.
features = [(key, data_types.TypeCode.Name(value.data_type.type_code)) 
            for key, value in column_specs.items()]
print('Feature list:\n')
for feature in features:
    print(feature[0],':', feature[1])

In [ ]:
# Table schema pie chart.
type_counts = {}
for column_spec in column_specs.values():
  type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)
  type_counts[type_name] = type_counts.get(type_name, 0) + 1
    
plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')
plt.axis('equal')
plt.show()

Update dataset: assign a label column and enable nullable columns

This section is important, as it is where you specify which column (meaning which feature) you will use as your label. This label feature will then be predicted using all other features in the row.

AutoML Tables automatically detects your data column type. For example, for the census_income dataset it detects the income column to be categorical (since it is just either over or under 50K) and age to be numerical. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema.

Update a column: Set nullable parameter


In [ ]:
column_spec_display_name = 'income' #@param {type:'string'}
type_code='CATEGORY' #@param {type:'string'}

update_column_response = tables_client.update_column_spec(
    dataset=dataset,
    column_spec_display_name=column_spec_display_name,
    type_code=type_code,
    nullable=False,
)
update_column_response

Tip: You can pass type_code='CATEGORY' in the preceding update_column_spec call to convert a column's data type from FLOAT64 to CATEGORY.

Update dataset: Assign a label


In [ ]:
column_spec_display_name = 'income' #@param {type:'string'}

update_dataset_response = tables_client.set_target_column(
    dataset=dataset,
    column_spec_display_name=column_spec_display_name,
)
update_dataset_response

Creating a model

Train a Model

Once we have defined our dataset and features, we will create a model.

Specify the duration of the training. For example, 'train_budget_milli_node_hours': 1000 runs the training for one hour. You can increase that number up to a maximum of 72 hours ('train_budget_milli_node_hours': 72000) for the best model performance.

Even with a budget of 1 node hour (the minimum possible budget), training a model can take longer than the specified number of node hours.

If your Colab times out, use tables_client.list_models() to check whether your model has been created. Then use the model name to continue to the next steps. Run the following command to retrieve your model:

model = tables_client.get_model(model_display_name=MODEL_DISPLAY_NAME)

You can also select the objective used to optimize your model training by setting optimization_objective. This tutorial trains with the default optimization objective; see the optional sketch below for how to set one explicitly, and refer to the AutoML Tables documentation for more details.
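
For example, to optimize a classification model for area under the ROC curve instead of the default, the training call could be given an explicit objective. The following is only a sketch, shown commented out so it is not run by accident; it assumes your installed client version exposes the optimization_objective keyword on create_model (check help(tables_client.create_model) if unsure).


In [ ]:
# Hypothetical variant of the training call with an explicit objective.
# For classification, valid values include MAXIMIZE_AU_ROC, MAXIMIZE_AU_PRC,
# and MINIMIZE_LOG_LOSS.
#
# create_model_response = tables_client.create_model(
#     model_display_name=MODEL_DISPLAY_NAME,
#     dataset=dataset,
#     train_budget_milli_node_hours=1000,
#     optimization_objective='MAXIMIZE_AU_ROC',
#     exclude_column_spec_names=['fnlwgt', 'income'],
# )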


In [ ]:
# The number of hours to train the model.
model_train_hours = 1 #@param {type:'integer'}

create_model_response = tables_client.create_model(
    model_display_name=MODEL_DISPLAY_NAME,
    dataset=dataset,
    train_budget_milli_node_hours=model_train_hours*1000,
    exclude_column_spec_names=['fnlwgt','income'],
)

operation_id = create_model_response.operation.name

print('Create model operation: {}'.format(create_model_response.operation))

In [ ]:
# Wait until model training is done.
model = create_model_response.result()
model_name = model.name
model

Model deployment

Important: Deploy the model, then wait until the model FINISHES deployment.

The model takes a while to deploy online. When the deployment call tables_client.deploy_model(model=model).result() finishes, you will be able to see this in the UI: navigate to the Predict tab of your model and then to the online prediction section to confirm that online deployment has finished before running the prediction cell. You should see "model deployed" on the far right of the screen if the model is deployed, or a "deploying model" message if it is still deploying.


In [ ]:
tables_client.deploy_model(model=model).result()

Verify that the model has been deployed by checking the deployment_state field; it should show DEPLOYED.


In [ ]:
model = tables_client.get_model(model_name=model_name)
model

Run the prediction only after the model finishes deployment.

Make an Online prediction

You can set exactly the values you want for all of the numeric features, and choose the values you want for the categorical features from the drop-down menus.

Note: If the model has not finished deployment, the prediction will NOT work. The following cells show you how to make an online prediction.


In [ ]:
workclass_ids = ['Private', 'Self-emp-not-inc', 'Self-emp-inc', 'Federal-gov',
                 'Local-gov', 'State-gov', 'Without-pay', 'Never-worked']
education_ids = ['Bachelors', 'Some-college', '11th', 'HS-grad', 'Prof-school',
                 'Assoc-acdm', 'Assoc-voc', '9th', '7th-8th', '12th', 'Masters',
                 '1st-4th', '10th', 'Doctorate', '5th-6th', 'Preschool']
marital_status_ids = ['Married-civ-spouse', 'Divorced', 'Never-married',
                      'Separated', 'Widowed', 'Married-spouse-absent', 
                      'Married-AF-spouse']
occupation_ids = ['Tech-support', 'Craft-repair', 'Other-service', 'Sales', 
                  'Exec-managerial', 'Prof-specialty', 'Handlers-cleaners', 
                  'Machine-op-inspct', 'Adm-clerical', 'Farming-fishing', 
                  'Transport-moving', 'Priv-house-serv', 'Protective-serv', 
                  'Armed-Forces']
relationship_ids = ['Wife', 'Own-child', 'Husband', 'Not-in-family', 
                    'Other-relative', 'Unmarried']
race_ids = ['White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other',
            'Black']

In [ ]:
sex_ids = ['Female', 'Male']
native_country_ids = ['United-States', 'Cambodia', 'England', 'Puerto-Rico', 
                      'Canada', 'Germany', 'Outlying-US(Guam-USVI-etc)', 
                      'India', 'Japan', 'Greece', 'South', 'China', 'Cuba', 
                      'Iran', 'Honduras', 'Philippines', 'Italy', 'Poland', 
                      'Jamaica', 'Vietnam', 'Mexico', 'Portugal', 'Ireland',
                      'France', 'Dominican-Republic', 'Laos', 'Ecuador',
                      'Taiwan', 'Haiti', 'Columbia', 'Hungary', 'Guatemala', 
                      'Nicaragua', 'Scotland', 'Thailand', 'Yugoslavia', 
                      'El-Salvador', 'Trinadad&Tobago', 'Peru', 'Hong', 
                      'Holand-Netherlands']

In [ ]:
# Create dropdown for workclass.
workclass = widgets.Dropdown(
    options=workclass_ids, 
    value=workclass_ids[0],
    description='workclass:'
)

In [ ]:
# Create dropdown for education.
education = widgets.Dropdown(
    options=education_ids, 
    value=education_ids[0],
    description='education:', 
    width='500px'
)

In [ ]:
# Create dropdown for marital status.
marital_status = widgets.Dropdown(
    options=marital_status_ids, 
    value=marital_status_ids[0],
    description='marital status:', 
    width='500px'
)

In [ ]:
# Create dropdown for occupation.
occupation = widgets.Dropdown(
    options=occupation_ids, 
    value=occupation_ids[0],
    description='occupation:', 
    width='500px'
)

In [ ]:
# Create dropdown for relationship.
relationship = widgets.Dropdown(
    options=relationship_ids, 
    value=relationship_ids[0],
    description='relationship:', 
    width='500px'
)

In [ ]:
# Create dropdown for race.
race = widgets.Dropdown(
    options=race_ids, 
    value=race_ids[0],                           
    description='race:', 
    width='500px'
)

In [ ]:
# Create dropdown for sex.
sex = widgets.Dropdown(
    options=sex_ids, 
    value=sex_ids[0],
    description='sex:', 
    width='500px'
)

In [ ]:
# Create dropdown for native country.
native_country = widgets.Dropdown(
    options=native_country_ids, 
    value=native_country_ids[0],
    description='native_country:', 
    width='500px'
)

In [ ]:
display(workclass)
display(education)
display(marital_status)
display(occupation)
display(relationship)
display(race)
display(sex)
display(native_country)

Adjust the sliders on the right to the desired test values for your online prediction.


In [ ]:
#@title Make an online prediction: set the numeric variables{ vertical-output: true }

age = 36 #@param {type:'slider', min:1, max:100, step:1}
capital_gain = 40000 #@param {type:'slider', min:0, max:100000, step:10000}
capital_loss = 559.5 #@param {type:'slider', min:0, max:4000, step:0.1}
fnlwgt = 150000 #@param {type:'slider', min:0, max:1000000, step:50000}
education_num = 9 #@param {type:'slider', min:1, max:16, step:1}
hours_per_week = 40 #@param {type:'slider', min:1, max:100, step:1}

Run the following cell to collect the values you selected above into the prediction request.


In [ ]:
inputs = {
    'age': age,
    'workclass': workclass.value,
    'fnlwgt': fnlwgt,
    'education': education.value,
    'education_num': education_num,
    'marital_status': marital_status.value,
    'occupation': occupation.value,
    'relationship': relationship.value,
    'race': race.value,
    'sex': sex.value,
    'capital_gain': capital_gain,
    'capital_loss': capital_loss,
    'hours_per_week': hours_per_week,
    'native_country': native_country.value,
}

In [ ]:
prediction_result = tables_client.predict(model=model, inputs=inputs)
prediction_result

Get Prediction

We iterate over the google.cloud.automl_v1beta1.types.PredictResponse object prediction_result to build a list of (score, label) tuples, sort it by score in descending order, and display the top prediction.


In [ ]:
predictions = [(prediction.tables.score, prediction.tables.value.string_value) 
               for prediction in prediction_result.payload]
predictions = sorted(
    predictions, key=lambda tup: (tup[0],tup[1]), reverse=True)
print('Prediction is: ', predictions[0])

Undeploy the model


In [ ]:
undeploy_model_response = tables_client.undeploy_model(model=model)

Batch prediction

Initialize prediction

Your data source for batch prediction can be GCS or BigQuery.

For this tutorial, you can use the provided sample input file. The cell below copies it into your own Cloud Storage bucket (BUCKET_NAME) and defines the output folder for the prediction results.

Some of the rows in the batch prediction input file are intentionally missing values; AutoML Tables logs these errors in an errors.csv file alongside the prediction output.

NOTE: The client library has a known issue. The batch prediction call may raise TypeError: Could not convert Any to BatchPredictResult even though the operation succeeds on the server; the Launch Batch prediction cell below catches and ignores this error.

The batch prediction output file(s) will be written under the GCS output folder defined in the next cell.


In [ ]:
gcs_output_folder_name = 'census_income_predictions' #@param {type: 'string'}

SAMPLE_INPUT = 'gs://cloud-ml-data/automl-tables/notebooks/census_income_batch_prediction_input.csv'
# Location in your own bucket for the batch prediction input, and the folder
# where the prediction results will be written.
GCS_BATCH_PREDICT_URI = 'gs://{}/{}'.format(
    BUCKET_NAME, 'census_income_batch_prediction_input.csv')
GCS_BATCH_PREDICT_OUTPUT = 'gs://{}/{}/'.format(BUCKET_NAME,
                                                gcs_output_folder_name)

# Copy the sample batch prediction input into your own bucket.
! gsutil cp $SAMPLE_INPUT $GCS_BATCH_PREDICT_URI

Launch Batch prediction


In [ ]:
batch_predict_response = tables_client.batch_predict(
    model=model, 
    gcs_input_uris=GCS_BATCH_PREDICT_URI,
    gcs_output_uri_prefix=GCS_BATCH_PREDICT_OUTPUT,
)
print('Batch prediction operation: {}'.format(
    batch_predict_response.operation))

# Wait until batch prediction is done. A known client-library issue can raise
# "TypeError: Could not convert Any to BatchPredictResult"; it is safe to ignore.
try:
    batch_predict_result = batch_predict_response.result()
except TypeError as e:
    print('Ignoring known client issue: {}'.format(e))

batch_predict_response.metadata
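
Once the operation completes, you can list what was written under the output prefix. The service creates a timestamped subfolder with the prediction results, plus an errors file when some input rows fail; the exact file names are determined by the service.


In [ ]:
# List everything written under the batch prediction output prefix.
! gsutil ls -r $GCS_BATCH_PREDICT_OUTPUT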

Cleaning up

To clean up all GCP resources used in this project, you can delete the GCP project you used for the tutorial, or run the following cell to delete the individual resources that were created.


In [ ]:
# Delete model resource.
tables_client.delete_model(model_name=model_name)

# Delete dataset resource.
tables_client.delete_dataset(dataset_name=dataset_name)

# Delete Cloud Storage objects that were created.
! gsutil -m rm -r gs://$BUCKET_NAME
  
# If training model is still running, cancel it.
automl_client.transport._operations_client.cancel_operation(operation_id)

Next steps

Please follow the latest updates on AutoML here.