This notebook demonstrates how to use AI Platform to train a simple classification model using scikit-learn, and then deploy the model to get predictions.
You train the model to predict a person's income level based on the Census Income data set.
Before you jump in, let’s cover some of the different tools you’ll be using:
AI Platform is a managed service that enables you to easily build machine learning models that work on any type of data, of any size.
Cloud Storage is a unified object storage for developers and enterprises, from live data serving to data analytics/ML to data archiving.
Cloud SDK is a command-line tool that allows you to interact with Google Cloud products. This notebook introduces several gcloud and gsutil commands, which are part of the Cloud SDK. Note that shell commands in a notebook must be prefixed with a !.
In [ ]:
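# Enable the AI Platform and Compute Engine APIs for the current project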
!gcloud services enable ml.googleapis.com
!gcloud services enable compute.googleapis.com
Buckets are the basic containers that hold your data. Everything that you store in Cloud Storage must be contained in a bucket. You can use buckets to organize your data and control access to your data.
Start by defining a globally unique name.
For more information about naming buckets, see Bucket name requirements.
In [ ]:
BUCKET_NAME = 'your-new-bucket'
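If you want to reduce the chance of a name collision, one option is to derive the bucket name from your project ID. The cell below is only a sketch: it assumes your gcloud project is already configured, and the "-census-sklearn" suffix is just an example.
In [ ]:
# Sketch (optional): derive a globally unique bucket name from the project ID.
# This overwrites the placeholder value defined above.
project_id = !gcloud config get-value project
BUCKET_NAME = "{}-census-sklearn".format(project_id[0])
print(BUCKET_NAME)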
In the examples below, the BUCKET_NAME variable is referenced in the commands using $.
Create the new bucket with the gsutil mb command:
In [ ]:
!gsutil mb gs://$BUCKET_NAME/
The Census Income Data Set that this sample uses for training is provided by the UC Irvine Machine Learning Repository.
Census data courtesy of: Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://archive.ics.uci.edu/ml - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
The data used in this tutorial is located in a public Cloud Storage bucket:
gs://cloud-samples-data/ml-engine/sklearn/census_data/
The training file is adult.data (download) and the evaluation file is adult.test (download). The evaluation file is not used in this tutorial.
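If you would like a quick look at the raw data before training, the cell below is a sketch that reads the first few rows directly from the public bucket (it assumes gsutil and head are available in the notebook environment):
In [ ]:
# Preview the first few rows of the public training data
!gsutil cat gs://cloud-samples-data/ml-engine/sklearn/census_data/adult.data | head -n 5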
The easiest (and recommended) way to create a training application package is to use gcloud to package and upload the application when you submit your training job. This method allows you to create a very simple file structure with only two files. For this tutorial, the file structure of your training application package should appear similar to the following:
census_training/
__init__.py
train.py
Create a directory locally:
In [ ]:
!mkdir census_training
Create a blank file named __init__.py:
In [ ]:
!touch ./census_training/__init__.py
Save the training code in one Python file in the census_training directory. The following cell writes a training file to the census_training directory. The training file performs the following operations:
- Loads the data into a pandas DataFrame that can be used by scikit-learn
- Trains the model against the training data
- Exports the model with the Python pickle library

The following model training code is not executed within this notebook. Instead, it is saved to a Python file and packaged as a Python module that runs on AI Platform after you submit the training job.
In [ ]:
%%writefile ./census_training/train.py
import argparse
import pickle
import pandas as pd
from google.cloud import storage
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer
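# Parse the --bucket-name argument that gcloud passes after the '--' separator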
parser = argparse.ArgumentParser()
parser.add_argument("--bucket-name", help="The bucket name", required=True)
arguments, unknown = parser.parse_known_args()
bucket_name = arguments.bucket_name
# Define the format of your input data, including unused columns.
# These are the columns from the census data files.
COLUMNS = (
'age',
'workclass',
'fnlwgt',
'education',
'education-num',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'capital-gain',
'capital-loss',
'hours-per-week',
'native-country',
'income-level'
)
# Categorical columns are columns that need to be turned into a numerical value
# to be used by scikit-learn
CATEGORICAL_COLUMNS = (
'workclass',
'education',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'native-country'
)
# Create a Cloud Storage client to download the census data
storage_client = storage.Client()
# Download the data
public_bucket = storage_client.bucket('cloud-samples-data')
blob = public_bucket.blob('ml-engine/sklearn/census_data/adult.data')
blob.download_to_filename('adult.data')
# Load the training census dataset
with open("./adult.data", "r") as train_data:
    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)
# Removing the whitespaces in categorical features
for col in CATEGORICAL_COLUMNS:
    raw_training_data[col] = raw_training_data[col].apply(lambda x: str(x).strip())
# Remove the column we are trying to predict ('income-level') from our features
# list and convert the DataFrame to a lists of lists
train_features = raw_training_data.drop("income-level", axis=1).values.tolist()
# Create our training labels list, convert the DataFrame to a lists of lists
train_labels = (raw_training_data["income-level"] == " >50K").values.tolist()
# Since the census data set has categorical features, we need to convert
# them to numerical values. We'll use a list of pipelines to convert each
# categorical column and then use FeatureUnion to combine them before calling
# the RandomForestClassifier.
categorical_pipelines = []
# Each categorical column needs to be extracted individually and converted to a
# numerical value. To do this, each categorical column will use a pipeline that
# extracts one feature column via SelectKBest(k=1) and a LabelBinarizer() to
# convert the categorical value to a numerical one. A scores array (created
# below) will select and extract the feature column. The scores array is
# created by iterating over the columns and checking if it is a
# categorical column.
for i, col in enumerate(COLUMNS[:-1]):
    if col in CATEGORICAL_COLUMNS:
        # Create a scores array to get the individual categorical column.
        # Example:
        #  data = [
        #      39, 'State-gov', 77516, 'Bachelors', 13, 'Never-married',
        #      'Adm-clerical', 'Not-in-family', 'White', 'Male', 2174, 0,
        #      40, 'United-States'
        #  ]
        #  scores = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        #
        # Returns: [['State-gov']]
        # Build the scores array
        scores = [0] * len(COLUMNS[:-1])
        # This column is the categorical column we want to extract.
        scores[i] = 1
        skb = SelectKBest(k=1)
        skb.scores_ = scores
        # Convert the categorical column to a numerical value
        lbn = LabelBinarizer()
        r = skb.transform(train_features)
        lbn.fit(r)
        # Create the pipeline to extract the categorical feature
        categorical_pipelines.append(
            (
                'categorical-{}'.format(i),
                Pipeline([
                    ('SKB-{}'.format(i), skb),
                    ('LBN-{}'.format(i), lbn)])
            )
        )
# Create pipeline to extract the numerical features
skb = SelectKBest(k=6)
# From COLUMNS use the features that are numerical
skb.scores_ = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]
categorical_pipelines.append(("numerical", skb))
# Combine all the features using FeatureUnion
preprocess = FeatureUnion(categorical_pipelines)
# Create the classifier
classifier = RandomForestClassifier()
# Transform the features and fit them to the classifier
classifier.fit(preprocess.transform(train_features), train_labels)
# Create the overall model as a single pipeline
pipeline = Pipeline([("union", preprocess), ("classifier", classifier)])
# Create the model file
# It is required to name the model file "model.pkl" if you are using pickle
model_filename = "model.pkl"
with open(model_filename, "wb") as model_file:
    pickle.dump(pipeline, model_file)
# Upload the model to Cloud Storage
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(model_filename)
blob.upload_from_filename(model_filename)
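Before submitting the job, you can optionally test the training module on the notebook machine. The cell below is a sketch using gcloud ai-platform local train; it runs the same package locally and still writes the model file to your bucket.
In [ ]:
# Optional: run the training module locally to catch errors before submitting
!gcloud ai-platform local train \
  --package-path ./census_training \
  --module-name census_training.train \
  -- \
  --bucket-name $BUCKET_NAME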
In this section, you use gcloud ai-platform jobs submit training to submit your training job. The -- argument passed to the command is a separator; anything after the separator is passed to the Python code as input arguments.
For more information about the arguments preceding the separator, run the following:
gcloud ai-platform jobs submit training --help
The argument given to the Python script is --bucket-name. The --bucket-name argument specifies the name of the bucket in which to save the model file.
In [ ]:
import time
# Define a timestamped job name
JOB_NAME = "census_training_{}".format(int(time.time()))
In [ ]:
# Submit the training job:
!gcloud ai-platform jobs submit training $JOB_NAME \
--job-dir gs://$BUCKET_NAME/census_job_dir \
--package-path ./census_training \
--module-name census_training.train \
--region us-central1 \
--runtime-version=1.12 \
--python-version=3.5 \
--scale-tier BASIC \
--stream-logs \
-- \
--bucket-name $BUCKET_NAME
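The --stream-logs flag keeps the cell running until the job completes. If you omit it, you can check on the job later; the cell below is a sketch of how to do that.
In [ ]:
# Check the current state of the training job
!gcloud ai-platform jobs describe $JOB_NAME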
In [ ]:
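# Verify that the trained model file was exported to your bucket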
!gsutil ls gs://$BUCKET_NAME/
Once the model is successfully created and trained, you can serve it. A model can have multiple versions. To serve the model, create a model resource and a version in AI Platform.
Define the model and version names:
In [ ]:
MODEL_NAME = "CensusPredictor"
VERSION_NAME = "census_predictor_{}".format(int(time.time()))
Create the model in AI Platform:
In [ ]:
!gcloud ai-platform models create $MODEL_NAME --regions us-central1
Create a version that points to your model file in Cloud Storage:
In [ ]:
!gcloud ai-platform versions create $VERSION_NAME \
--model=$MODEL_NAME \
--framework=scikit-learn \
--origin=gs://$BUCKET_NAME/ \
--python-version=3.5 \
--runtime-version=1.12
Before you send an online prediction request, you must format your test data to prepare it for use by the AI Platform prediction service. Make sure that the format of your input instances matches what your model expects.
Create an input.json file with each input instance on a separate line. The following example uses ten data instances. The Census model requires 14 features, so your input must be a matrix of shape (num_instances, 14).
In [ ]:
# Define a name for the input file
INPUT_FILE = "./census_training/input.json"
In [ ]:
%%writefile $INPUT_FILE
[25, "Private", 226802, "11th", 7, "Never-married", "Machine-op-inspct", "Own-child", "Black", "Male", 0, 0, 40, "United-States"]
[38, "Private", 89814, "HS-grad", 9, "Married-civ-spouse", "Farming-fishing", "Husband", "White", "Male", 0, 0, 50, "United-States"]
[28, "Local-gov", 336951, "Assoc-acdm", 12, "Married-civ-spouse", "Protective-serv", "Husband", "White", "Male", 0, 0, 40, "United-States"]
[44, "Private", 160323, "Some-college", 10, "Married-civ-spouse", "Machine-op-inspct", "Husband", "Black", "Male", 7688, 0, 40, "United-States"]
[18, "?", 103497, "Some-college", 10, "Never-married", "?", "Own-child", "White", "Female", 0, 0, 30, "United-States"]
[34, "Private", 198693, "10th", 6, "Never-married", "Other-service", "Not-in-family", "White", "Male", 0, 0, 30, "United-States"]
[29, "?", 227026, "HS-grad", 9, "Never-married", "?", "Unmarried", "Black", "Male", 0, 0, 40, "United-States"]
[63, "Self-emp-not-inc", 104626, "Prof-school", 15, "Married-civ-spouse", "Prof-specialty", "Husband", "White", "Male", 3103, 0, 32, "United-States"]
[24, "Private", 369667, "Some-college", 10, "Never-married", "Other-service", "Unmarried", "White", "Female", 0, 0, 40, "United-States"]
[55, "Private", 104996, "7th-8th", 4, "Married-civ-spouse", "Craft-repair", "Husband", "White", "Male", 0, 0, 10, "United-States"]
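Alternatively, you can generate the file programmatically. The cell below is a sketch that writes a shorter list of instances in the same JSON-lines format; only run it if you prefer this approach, since it overwrites the file created above.
In [ ]:
import json

# Sketch: each line of the file is one JSON array with the 14 feature values,
# in the same column order used during training.
instances = [
    [25, "Private", 226802, "11th", 7, "Never-married", "Machine-op-inspct",
     "Own-child", "Black", "Male", 0, 0, 40, "United-States"],
    [38, "Private", 89814, "HS-grad", 9, "Married-civ-spouse", "Farming-fishing",
     "Husband", "White", "Male", 0, 0, 50, "United-States"],
]

with open(INPUT_FILE, "w") as f:
    for instance in instances:
        f.write(json.dumps(instance) + "\n")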
The prediction results return True if the person's income is predicted to be greater than $50,000 per year, and False otherwise. The output of the command below may appear similar to the following:
[False, False, False, True, False, False, False, False, False, False]
In [ ]:
!gcloud ai-platform predict --model $MODEL_NAME --version $VERSION_NAME \
  --json-instances $INPUT_FILE
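You can also request predictions from Python instead of the gcloud CLI. The cell below is a sketch that uses the Google API client library; it assumes google-api-python-client is installed and that PROJECT_ID holds your Google Cloud project ID (PROJECT_ID is not defined elsewhere in this notebook).
In [ ]:
import googleapiclient.discovery

# Sketch: call the AI Platform online prediction REST API directly.
# PROJECT_ID is an assumed variable containing your project ID.
instances = [
    [25, "Private", 226802, "11th", 7, "Never-married", "Machine-op-inspct",
     "Own-child", "Black", "Male", 0, 0, 40, "United-States"],
]

service = googleapiclient.discovery.build("ml", "v1")
name = "projects/{}/models/{}/versions/{}".format(PROJECT_ID, MODEL_NAME, VERSION_NAME)

response = service.projects().predict(name=name, body={"instances": instances}).execute()
if "error" in response:
    raise RuntimeError(response["error"])
print(response["predictions"])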
To delete all resources you created in this tutorial, run the following commands:
In [ ]:
# Delete the model version
!gcloud ai-platform versions delete $VERSION_NAME --model=$MODEL_NAME --quiet
# Delete the model
!gcloud ai-platform models delete $MODEL_NAME --quiet
# Delete the bucket and contents
!gsutil rm -r gs://$BUCKET_NAME
# Delete the local files created by the tutorial
!rm -rf census_training