In [ ]:
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Google’s AutoML lets software engineers build high-quality models without needing to know how to build, train, or deploy/serve models on the cloud. Instead, you only need to know about dataset curation, how to evaluate results, and the how-to steps in this notebook.
AutoML Tables is a supervised learning service. This means that you train a machine learning model with example data. AutoML Tables uses tabular (structured) data to train a machine learning model to make predictions on new data. One column from your dataset, called the target, is what your model will learn to predict. Some number of the other data columns are inputs (called features) that the model will learn patterns from.
In this notebook, we will use the Google Cloud AutoML Python client library to create a binary classification model on the Census Income Dataset.
We will provide the training and evaluation data. Once the dataset is created, we will use the AutoML API to train a model and then run predictions that estimate whether a given individual's income is above or below 50k, given information such as the person's age, education level, marital status, and occupation.
For setting up a Google Cloud Platform (GCP) account for using AutoML, please see the online documentation for Getting Started.
This tutorial uses the United States Census Income Dataset provided by the UC Irvine Machine Learning Repository, containing information about people from a 1994 Census database, including age, education, marital status, occupation, and whether they make more than $50,000 a year. The dataset consists of over 30k rows, where each row corresponds to a different person. For a given row, there are 14 features that the model conditions on to predict the person's income. A few of the features are named above; the exhaustive list can be found at the dataset link above.
This tutorial uses billable components of Google Cloud Platform (GCP):
Learn about Cloud AI Platform pricing, Cloud Storage pricing, AutoML Tables pricing and use the Pricing Calculator to generate a cost estimate based on your projected usage.
If you are using Colab or AI Platform Notebooks, your environment already meets all the requirements to run this notebook, and you can skip this step. If you are using AI Platform Notebooks, just make sure the machine configuration type is 1 vCPU, 3.75 GB RAM or above.
Otherwise, make sure your environment meets this notebook's requirements. You need the following:
The Google Cloud guide to Setting up a Python development environment and the Jupyter installation guide provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:
Install virtualenv and create a virtual environment that uses Python 3.
Activate that environment and run pip install jupyter in a shell to install Jupyter.
Run jupyter notebook in a shell to launch Jupyter.
Open this notebook in the Jupyter Notebook Dashboard.
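For reference, here is a minimal sketch of those terminal commands, assuming a bash shell with Python 3 already installed; the environment name automl-env is just a placeholder:
python3 -m pip install --user virtualenv    # install virtualenv
python3 -m virtualenv automl-env            # create a Python 3 virtual environment (name is a placeholder)
source automl-env/bin/activate              # activate the environment
pip install jupyter                         # install Jupyter inside the environment
jupyter notebook                            # launch Jupyter, then open this notebook in the dashboard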
The following steps are required, regardless of your notebook environment.
Select or create a GCP project. When you first create an account, you get a $300 free credit towards your compute/storage costs.
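If you prefer the command line, this is a minimal sketch of creating and selecting a project with the gcloud CLI; the project ID my-automl-project is a placeholder:
gcloud projects create my-automl-project     # create a new project (ID is a placeholder)
gcloud config set project my-automl-project  # make it the default project for gcloud commands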
In [ ]:
# Use the latest major GA version of the framework.
! pip install --upgrade --quiet --user google-cloud-automl
Note: Try installing with sudo if the above command throws any permission errors.
Restart the kernel so that automl_v1beta1 can be imported in the notebook.
In [ ]:
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
Out[ ]:
In [ ]:
PROJECT_ID = "[your-project-id]" #@param {type:"string"}
COMPUTE_REGION = "us-central1" # Currently only supported region.
If you are using AI Platform Notebooks, your environment is already authenticated and you can skip this step. Otherwise, follow these steps:
In the GCP Console, go to the Create service account key page.
From the Service account drop-down list, select New service account.
In the Service account name field, enter a name.
From the Role drop-down list, select AutoML > AutoML Admin and Storage > Storage Object Admin.
Click Create. A JSON file that contains your key downloads to your local environment.
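Alternatively, you can create the same service account and key from a terminal with the gcloud CLI. This is only a sketch; the account name automl-tutorial and the key file name key.json are placeholders, and PROJECT_ID must be replaced with (or exported as) your project ID:
# Create the service account (name is a placeholder).
gcloud iam service-accounts create automl-tutorial --display-name "AutoML tutorial"
# Grant the AutoML Admin and Storage Object Admin roles.
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member "serviceAccount:automl-tutorial@$PROJECT_ID.iam.gserviceaccount.com" \
    --role "roles/automl.admin"
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member "serviceAccount:automl-tutorial@$PROJECT_ID.iam.gserviceaccount.com" \
    --role "roles/storage.objectAdmin"
# Download a JSON key for the account.
gcloud iam service-accounts keys create key.json \
    --iam-account "automl-tutorial@$PROJECT_ID.iam.gserviceaccount.com"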
Note: Jupyter runs lines prefixed with ! as shell commands, and it interpolates Python variables prefixed with $ into these commands.
In [ ]:
# Upload the downloaded JSON file that contains your key.
import sys
if 'google.colab' in sys.modules:
  from google.colab import files
  keyfile_upload = files.upload()
  keyfile = list(keyfile_upload.keys())[0]
  %env GOOGLE_APPLICATION_CREDENTIALS $keyfile
  ! gcloud auth activate-service-account --key-file $keyfile
If you are running the notebook locally, enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below and run the cell.
In [ ]:
# If you are running this notebook locally, replace the string below with the
# path to your service account key and run this cell to authenticate your GCP
# account.
%env GOOGLE_APPLICATION_CREDENTIALS /path/to/service/account
! gcloud auth activate-service-account --key-file '/path/to/service/account'
The following steps are required, regardless of your notebook environment.
In this tutorial, the Cloud Storage bucket is used to stage the training data CSV that AutoML Tables imports, and later to hold the batch prediction input and output files.
Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.
You may also change the COMPUTE_REGION variable, which is used for operations throughout the rest of this notebook. Make sure to choose a region where Cloud AI Platform services are available. You may not use a Multi-Regional Storage bucket for training with AI Platform.
In [ ]:
BUCKET_NAME = "[your-bucket-name]" #@param {type:"string"}
Only if your bucket doesn't already exist: run the following cell to create your Cloud Storage bucket. Make sure the Storage > Storage Admin role is enabled for your account.
In [ ]:
! gsutil mb -p $PROJECT_ID -l $COMPUTE_REGION gs://$BUCKET_NAME
Finally, validate access to your Cloud Storage bucket by examining its contents:
In [ ]:
! gsutil ls -al gs://$BUCKET_NAME
Import relevant packages.
In [ ]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
In [ ]:
# AutoML library.
from google.cloud import automl_v1beta1 as automl
import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types
In [ ]:
import matplotlib.pyplot as plt
from ipywidgets import interact
import ipywidgets as widgets
Populate the following cell with the necessary constants and run it to initialize constants.
In [ ]:
#@title Constants { vertical-output: true }
# A name for the AutoML tables Dataset to create.
DATASET_DISPLAY_NAME = 'census' #@param {type: 'string'}
# The GCS data to import data from (doesn't need to exist).
INPUT_CSV_NAME = 'census_income' #@param {type: 'string'}
# A name for the AutoML tables model to create.
MODEL_DISPLAY_NAME = 'census_income_model' #@param {type: 'string'}
assert all([
PROJECT_ID,
COMPUTE_REGION,
DATASET_DISPLAY_NAME,
INPUT_CSV_NAME,
MODEL_DISPLAY_NAME,
])
Initialize the clients for AutoML and AutoML Tables.
In [ ]:
# Initialize the clients.
automl_client = automl.AutoMlClient()
tables_client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION)
In [ ]:
# List the datasets.
list_datasets = tables_client.list_datasets()
datasets = { dataset.display_name: dataset.name for dataset in list_datasets }
datasets
You can also print the list of your models by running the following cell. If no model has previously been trained using AutoML Tables, you should expect an empty result.
In [ ]:
# List the models.
list_models = tables_client.list_models()
models = { model.display_name: model.name for model in list_models }
models
In [ ]:
# Create dataset.
dataset = tables_client.create_dataset(
dataset_display_name=DATASET_DISPLAY_NAME)
dataset_name = dataset.name
dataset
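If a dataset with this display name already exists (for example, from a previous run of this notebook), you can fetch it instead of creating a new one. A minimal sketch, assuming the client supports display-name lookup for datasets the same way it does for models:
In [ ]:
# Reuse an existing dataset instead of creating a new one (sketch).
dataset = tables_client.get_dataset(dataset_display_name=DATASET_DISPLAY_NAME)
dataset_name = dataset.name
dataset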
You can import your data to AutoML Tables from GCS or BigQuery. For this tutorial, you can use the census_income dataset as your training data. The code below automatically copies the data into a bucket you own. You are free to adjust the value of BUCKET_NAME as needed.
In [ ]:
GCS_DATASET_URI = 'gs://{}/{}.csv'.format(BUCKET_NAME, INPUT_CSV_NAME)
! gsutil ls gs://$BUCKET_NAME || gsutil mb -l $COMPUTE_REGION gs://$BUCKET_NAME
! gsutil cp gs://cloud-ml-data-tables/notebooks/census_income.csv $GCS_DATASET_URI
Import data into the dataset. This process may take a while, depending on your data. Once it completes, you can verify the status by printing the dataset object; this time, pay attention to the example_count field, which should show 32461 records.
In [ ]:
# Read the data source from GCS.
import_data_response = tables_client.import_data(
dataset=dataset,
gcs_input_uris=GCS_DATASET_URI
)
print('Dataset import operation: {}'.format(import_data_response.operation))
# Synchronous check of operation status. Wait until import is done.
print('Dataset import response: {}'.format(import_data_response.result()))
# Verify the status by checking the example_count field.
dataset = tables_client.get_dataset(dataset_name=dataset_name)
dataset
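If your training data lived in BigQuery instead of GCS, the same import call would take a BigQuery URI rather than GCS URIs. A hedged sketch, with the table path being a placeholder:
In [ ]:
# Import from BigQuery instead of GCS (table path is a placeholder).
import_data_response = tables_client.import_data(
    dataset=dataset,
    bigquery_input_uri='bq://my-project.my_dataset.census_income',
)
# Wait until the import is done.
import_data_response.result()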
In [ ]:
# List table specs.
list_table_specs_response = tables_client.list_table_specs(dataset=dataset)
table_specs = [s for s in list_table_specs_response]
# List column specs.
list_column_specs_response = tables_client.list_column_specs(dataset=dataset)
column_specs = {s.display_name: s for s in list_column_specs_response}
# Print Features and data_type.
features = [(key, data_types.TypeCode.Name(value.data_type.type_code))
for key, value in column_specs.items()]
print('Feature list:\n')
for feature in features:
print(feature[0],':', feature[1])
In [ ]:
# Table schema pie chart.
type_counts = {}
for column_spec in column_specs.values():
type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)
type_counts[type_name] = type_counts.get(type_name, 0) + 1
plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')
plt.axis('equal')
plt.show()
This section is important: it is where you specify which column (that is, which feature) you will use as your label. This label column will then be predicted using all other features in the row.
AutoML Tables automatically detects your data column types. For example, for the census_income dataset it detects income to be categorical (as it is just either over or under 50k) and age to be numerical. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema.
In [ ]:
column_spec_display_name = 'income' #@param {type:'string'}
type_code='CATEGORY' #@param {type:'string'}
update_column_response = tables_client.update_column_spec(
dataset=dataset,
column_spec_display_name=column_spec_display_name,
type_code=type_code,
nullable=False,
)
update_column_response
Tip: Passing type_code='CATEGORY' in the preceding update_column_spec call converts the column data type from FLOAT64 to CATEGORY.
In [ ]:
column_spec_display_name = 'income' #@param {type:'string'}
update_dataset_response = tables_client.set_target_column(
dataset=dataset,
column_spec_display_name=column_spec_display_name,
)
update_dataset_response
Once we have defined our dataset and features, we will create a model.
Specify the duration of the training. For example, 'train_budget_milli_node_hours': 1000 runs the training for one hour. You can increase that number up to a maximum of 72 hours ('train_budget_milli_node_hours': 72000) for the best model performance.
Even with a budget of 1 node hour (the minimum possible budget), training a model can take more than the specified node hours.
If your Colab times out, use tables_client.list_models() to check whether your model has been created. Then use the model name to continue to the next steps. Run the following command to retrieve your model.
model = tables_client.get_model(model_display_name=MODEL_DISPLAY_NAME)
You can also select the objective to optimize your model training for by setting optimization_objective; a sketch of this appears after the training cell below. This tutorial optimizes the model using the default optimization objective. Refer to the AutoML Tables documentation for more details.
In [ ]:
# The number of hours to train the model.
model_train_hours = 1 #@param {type:'integer'}
create_model_response = tables_client.create_model(
model_display_name=MODEL_DISPLAY_NAME,
dataset=dataset,
train_budget_milli_node_hours=model_train_hours*1000,
exclude_column_spec_names=['fnlwgt','income'],
)
operation_id = create_model_response.operation.name
print('Create model operation: {}'.format(create_model_response.operation))
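If you want a non-default objective, the same call can include optimization_objective. This is a hedged sketch (run it instead of, not in addition to, the cell above), assuming the TablesClient accepts the optimization_objective argument; for binary classification the documented values include MAXIMIZE_AU_ROC, MAXIMIZE_AU_PRC, and MINIMIZE_LOG_LOSS.
In [ ]:
# Alternative: train with an explicit optimization objective (sketch).
create_model_response = tables_client.create_model(
    model_display_name=MODEL_DISPLAY_NAME,
    dataset=dataset,
    train_budget_milli_node_hours=model_train_hours*1000,
    exclude_column_spec_names=['fnlwgt', 'income'],
    optimization_objective='MAXIMIZE_AU_PRC',
)
operation_id = create_model_response.operation.name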
In [ ]:
# Wait until model training is done.
model = create_model_response.result()
model_name = model.name
model
Important: Deploy the model, then wait until the model FINISHES deployment.
The model takes a while to deploy online. When the deployment call tables_client.deploy_model(model=model).result() finishes, you will be able to see this in the UI. Check the UI and navigate to the predict tab of your model, and then to the online prediction portion, to see when online deployment finishes before running the prediction cell. You should see "online prediction" text near the top; click on it, and it will take you to a view of your online prediction interface. You should see "model deployed" on the far right of the screen if the model is deployed, or a "deploying model" message if it is still deploying.
In [ ]:
tables_client.deploy_model(model=model).result()
Verify that the model has been deployed by checking the deployment_state field; it should show DEPLOYED.
In [ ]:
model = tables_client.get_model(model_name=model_name)
model
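To check the state programmatically instead of reading the printed model, compare deployment_state against the DEPLOYED enum value. A minimal sketch, assuming the enums module of automl_v1beta1:
In [ ]:
# Programmatic check of the deployment state (sketch).
from google.cloud.automl_v1beta1 import enums
if model.deployment_state == enums.Model.DeploymentState.DEPLOYED:
    print('Model is deployed and ready for online predictions.')
else:
    print('Model is not deployed yet; wait and re-run the cell above.')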
Run the prediction only after the model finishes deployment.
You can toggle exactly which values you want for all of the numeric features, and choose from the drop-down menus which values you want for the categorical features.
Note: If the model has not finished deployment, the prediction will NOT work. The following cells show you how to make an online prediction.
In [ ]:
workclass_ids = ['Private', 'Self-emp-not-inc', 'Self-emp-inc', 'Federal-gov',
'Local-gov', 'State-gov', 'Without-pay', 'Never-worked']
education_ids = ['Bachelors', 'Some-college', '11th', 'HS-grad', 'Prof-school',
'Assoc-acdm', 'Assoc-voc', '9th', '7th-8th', '12th', 'Masters',
'1st-4th', '10th', 'Doctorate', '5th-6th', 'Preschool']
marital_status_ids = ['Married-civ-spouse', 'Divorced', 'Never-married',
'Separated', 'Widowed', 'Married-spouse-absent',
'Married-AF-spouse']
occupation_ids = ['Tech-support', 'Craft-repair', 'Other-service', 'Sales',
'Exec-managerial', 'Prof-specialty', 'Handlers-cleaners',
'Machine-op-inspct', 'Adm-clerical', 'Farming-fishing',
'Transport-moving', 'Priv-house-serv', 'Protective-serv',
'Armed-Forces']
relationship_ids = ['Wife', 'Own-child', 'Husband', 'Not-in-family',
'Other-relative', 'Unmarried']
race_ids = ['White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other',
'Black']
In [ ]:
sex_ids = ['Female', 'Male']
native_country_ids = ['United-States', 'Cambodia', 'England', 'Puerto-Rico',
'Canada', 'Germany', 'Outlying-US(Guam-USVI-etc)',
'India', 'Japan', 'Greece', 'South', 'China', 'Cuba',
'Iran', 'Honduras', 'Philippines', 'Italy', 'Poland',
'Jamaica', 'Vietnam', 'Mexico', 'Portugal', 'Ireland',
'France', 'Dominican-Republic', 'Laos', 'Ecuador',
'Taiwan', 'Haiti', 'Columbia', 'Hungary', 'Guatemala',
'Nicaragua', 'Scotland', 'Thailand', 'Yugoslavia',
'El-Salvador', 'Trinadad&Tobago', 'Peru', 'Hong',
'Holand-Netherlands']
In [ ]:
# Create dropdown for workclass.
workclass = widgets.Dropdown(
options=workclass_ids,
value=workclass_ids[0],
description='workclass:'
)
In [ ]:
# Create dropdown for education.
education = widgets.Dropdown(
options=education_ids,
value=education_ids[0],
description='education:',
width='500px'
)
In [ ]:
# Create dropdown for marital status.
marital_status = widgets.Dropdown(
options=marital_status_ids,
value=marital_status_ids[0],
description='marital status:',
width='500px'
)
In [ ]:
# Create dropdown for occupation.
occupation = widgets.Dropdown(
options=occupation_ids,
value=occupation_ids[0],
description='occupation:',
width='500px'
)
In [ ]:
# Create dropdown for relationship.
relationship = widgets.Dropdown(
options=relationship_ids,
value=relationship_ids[0],
description='relationship:',
width='500px'
)
In [ ]:
# Create dropdown for race.
race = widgets.Dropdown(
options=race_ids,
value=race_ids[0],
description='race:',
width='500px'
)
In [ ]:
# Create dropdown for sex.
sex = widgets.Dropdown(
options=sex_ids,
value=sex_ids[0],
description='sex:',
width='500px'
)
In [ ]:
# Create dropdown for native country.
native_country = widgets.Dropdown(
options=native_country_ids,
value=native_country_ids[0],
description='native_country:',
width='500px'
)
In [ ]:
display(workclass)
display(education)
display(marital_status)
display(occupation)
display(relationship)
display(race)
display(sex)
display(native_country)
Adjust the sliders on the right to the desired test values for your online prediction.
In [ ]:
#@title Make an online prediction: set the numeric variables{ vertical-output: true }
age = 36 #@param {type:'slider', min:1, max:100, step:1}
capital_gain = 40000 #@param {type:'slider', min:0, max:100000, step:10000}
capital_loss = 559.5 #@param {type:'slider', min:0, max:4000, step:0.1}
fnlwgt = 150000 #@param {type:'slider', min:0, max:1000000, step:50000}
education_num = 9 #@param {type:'slider', min:1, max:16, step:1}
hours_per_week = 40 #@param {type:'slider', min:1, max:100, step:1}
Run the following cell to collect the values you selected above into the prediction inputs.
In [ ]:
inputs = {
'age': age,
'workclass': workclass.value,
'fnlwgt': fnlwgt,
'education': education.value,
'education_num': education_num,
'marital_status': marital_status.value,
'occupation': occupation.value,
'relationship': relationship.value,
'race': race.value,
'sex': sex.value,
'capital_gain': capital_gain,
'capital_loss': capital_loss,
'hours_per_week': hours_per_week,
'native_country': native_country.value,
}
In [ ]:
prediction_result = tables_client.predict(model=model, inputs=inputs)
prediction_result
Get Prediction
We extract the google.cloud.automl_v1beta1.types.PredictResponse object prediction_result and iterate over it to create a list of (score, label) tuples, then sort by score and display the highest-scoring prediction.
In [ ]:
predictions = [(prediction.tables.score, prediction.tables.value.string_value)
for prediction in prediction_result.payload]
predictions = sorted(
predictions, key=lambda tup: (tup[0],tup[1]), reverse=True)
print('Prediction is: ', predictions[0])
Undeploy the model
In [ ]:
undeploy_model_response = tables_client.undeploy_model(model=model)
Initialize prediction
Your data source for batch prediction can be GCS or BigQuery. For this tutorial, you can use the census_income_batch_prediction_input.csv sample file; the cell below copies it into your own bucket.
Some of the lines in the batch prediction input file are intentionally left missing some values; AutoML Tables logs these errors in the errors.csv file. Also make sure the bucket into which the predictions will be written exists; here it is the bucket you set in BUCKET_NAME above.
NOTE: The client library has a bug. If the following cell returns a TypeError: Could not convert Any to BatchPredictResult error, ignore it. The batch prediction output file(s) will still be written to the GCS bucket that you set in the preceding cells.
In [ ]:
gcs_output_folder_name = 'census_income_predictions' #@param {type: 'string'}
SAMPLE_INPUT = 'gs://cloud-ml-data/automl-tables/notebooks/census_income_batch_prediction_input.csv'
GCS_BATCH_PREDICT_URI = 'gs://{}/{}'.format(BUCKET_NAME,
                                            'census_income_batch_prediction_input.csv')
GCS_BATCH_PREDICT_OUTPUT = 'gs://{}/{}/'.format(BUCKET_NAME,
                                                gcs_output_folder_name)
# Copy the sample batch prediction input into your own bucket.
! gsutil cp $SAMPLE_INPUT $GCS_BATCH_PREDICT_URI
Launch Batch prediction
In [ ]:
batch_predict_response = tables_client.batch_predict(
model=model,
gcs_input_uris=GCS_BATCH_PREDICT_URI,
gcs_output_uri_prefix=GCS_BATCH_PREDICT_OUTPUT,
)
print('Batch prediction operation: {}'.format(
batch_predict_response.operation))
# Wait until batch prediction is done.
batch_predict_result = batch_predict_response.result()
batch_predict_response.metadata
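Once batch prediction completes, the result files (and errors.csv, if any rows failed) are written under the output prefix. A quick way to inspect them:
In [ ]:
# List the batch prediction output files.
! gsutil ls -r $GCS_BATCH_PREDICT_OUTPUT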
To clean up all GCP resources used in this project, you can delete the GCP project you used for the tutorial.
In [ ]:
# Delete model resource.
tables_client.delete_model(model_name=model_name)
# Delete dataset resource.
tables_client.delete_dataset(dataset_name=dataset_name)
# Delete Cloud Storage objects that were created.
! gsutil -m rm -r gs://$BUCKET_NAME
# If training model is still running, cancel it.
automl_client.transport._operations_client.cancel_operation(operation_id)
Please follow the latest updates on AutoML here.