In [0]:
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Beta
This is a beta release of custom code for scikit-learn pipelines. This feature might be changed in backward-incompatible ways and is not subject to any SLA or deprecation policy.
This tutorial shows how to use AI Platform to deploy a scikit-learn pipeline that uses custom transformers.
scikit-learn pipelines allow you to compose multiple estimators. For example, you can use transformers to preprocess data and pass the transformed data to a classifier. scikit-learn provides many transformers in the sklearn package.
You can also use scikit-learn's FunctionTransformer or TransformerMixin class to create your own custom transformer. If you want to deploy a pipeline that uses custom transformers to AI Platform Prediction, you must provide that code to AI Platform as a source distribution package.
This tutorial presents a sample problem involving Census data to walk you through the following steps:
Creating a training application that defines a scikit-learn pipeline with custom transformers.
Training the pipeline on AI Platform.
Packaging your custom transformer code as a source distribution and deploying the trained pipeline, together with that code, as an AI Platform Prediction version.
Serving online predictions from the deployed version.
Cleaning up the resources created by the tutorial.
This tutorial uses the United States Census Income Dataset provided by the UC Irvine Machine Learning Repository. This dataset contains information about people from a 1994 Census database, including age, education, marital status, occupation, and whether they make more than $50,000 a year.
The data used in this tutorial is available in a public Cloud Storage bucket:
gs://cloud-samples-data/ai-platform/sklearn/census_data/
The goal is to train a scikit-learn pipeline that predicts whether a person makes more than $50,000 a year (target label) based on other Census information about the person (features).
This tutorial focuses more on using this model with AI Platform than on the design of the model itself. However, it's always important to think about potential problems and unintended consequences when building machine learning systems. See the Machine Learning Crash Course exercise about fairness to learn about sources of bias in the Census dataset, as well as machine learning fairness more generally.
This tutorial uses billable components of Google Cloud Platform (GCP):
Learn about AI Platform pricing and Cloud Storage pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.
You must do several things before you can train and deploy a model in AI Platform: set up your development environment, set up a GCP project with billing and the necessary APIs enabled, authenticate your GCP account, and create a Cloud Storage bucket.
If you are using Colab, your environment already meets the requirements to run this notebook. Otherwise, make sure your environment meets this notebook's requirements: the Google Cloud SDK, Python 3, virtualenv, and Jupyter.
The Google Cloud guide to Setting up a Python development environment and the Jupyter installation guide provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:
Install virtualenv and create a virtual environment that uses Python 3. (You can skip this step if you want to install Jupyter globally.)
Activate that environment and run pip install jupyter in a shell to install Jupyter.
Run jupyter notebook in a shell to launch Jupyter.
Open this notebook in the Jupyter Notebook Dashboard.
The following steps are required, regardless of your notebook environment.
Enable the AI Platform ("Cloud Machine Learning Engine") and Compute Engine APIs.
Enter your project ID below and run the following cells to make sure the Cloud SDK uses the right project for all the commands in this notebook.
Note: Jupyter runs lines prefixed with ! as shell commands, and it interpolates Python variables prefixed with $ into these commands.
In [0]:
PROJECT_ID = "<your-project-id>" #@param {type:"string"}
In [0]:
! gcloud config set project $PROJECT_ID
If you are using Colab, run the cell below and follow the instructions when prompted to authenticate your account via OAuth.
Otherwise, follow these steps:
In the GCP Console, go to the Create service account key page.
From the Service account drop-down list, select New service account.
In the Service account name field, enter a name.
From the Role drop-down list, select Machine Learning Engine > AI Platform Admin and Storage > Storage Object Admin.
Click Create. A JSON file that contains your key downloads to your local environment.
Enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below and run the cell.
In [0]:
import sys
# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.
if 'google.colab' in sys.modules:
    from google.colab import auth as google_auth
    google_auth.authenticate_user()

# If you are running this notebook locally, replace the string below with the
# path to your service account key and run this cell to authenticate your GCP
# account.
else:
    %env GOOGLE_APPLICATION_CREDENTIALS '<path-to-your-service-account-key.json>'
The following steps are required, regardless of your notebook environment.
This tutorial uses Cloud Storage in several ways:
When you submit a training job using the Cloud SDK, you upload a Python package containing your training code to a Cloud Storage bucket. AI Platform runs the code from this package.
In this tutorial, AI Platform also saves the trained model that results from your job in the same bucket.
To deploy your scikit-learn pipeline that uses custom code to serve predictions, you must upload the custom transformers that your pipeline uses to Cloud Storage.
When you create the AI Platform version resource that serves predictions, you provide the trained scikit-learn pipeline and your custom code as Cloud Storage URIs.
Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.
You may also change the REGION variable, which is used for operations throughout the rest of this notebook. Make sure to choose a region where Cloud AI Platform services are available. You may not use a Multi-Regional Storage bucket for training with AI Platform.
In [0]:
BUCKET_NAME = "<your-bucket-name>" #@param {type:"string"}
REGION = "us-central1" #@param {type:"string"}
Only if your bucket doesn't already exist: Run the following cell to create your Cloud Storage bucket.
In [0]:
! gsutil mb -l $REGION gs://$BUCKET_NAME
Finally, validate access to your Cloud Storage bucket by examining its contents:
In [0]:
! gsutil ls -al gs://$BUCKET_NAME
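If you prefer to check from Python instead of gsutil, the following optional sketch lists the bucket's contents with the google-cloud-storage client (the same library the training module below uses). It assumes the client library is installed in this environment and uses the credentials you configured in the authentication step.
In [0]:
# Optional sketch: list the bucket's contents with the Python client instead of
# gsutil. Assumes google-cloud-storage is installed and that the credentials
# configured in the authentication step are available here.
from google.cloud import storage

client = storage.Client(project=PROJECT_ID)
for blob in client.bucket(BUCKET_NAME).list_blobs():
    print(blob.name, blob.size)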
Create an application to train a scikit-learn pipeline with the Census data. In this tutorial, the training package also contains the custom code that the trained pipeline uses during prediction. This is a useful pattern, because pipelines are generally designed to use the same transformers during training and prediction.
Use the following steps to create a directory with three files inside that matches the following structure:
census_package/
    __init__.py
    my_pipeline.py
    train.py
First, create the empty census_package/ directory:
In [0]:
! mkdir census_package
Within census_package/, create a blank file named __init__.py:
In [0]:
! touch ./census_package/__init__.py
This makes it possible to import census_package as a package in Python.
scikit-learn provides many transformers that you can use as part of a pipeline, but it also lets you define your own custom transformers. These transformers can even learn a saved state during training that gets used later during prediction.
Extend sklearn.base.TransformerMixin to create several custom transformers in a file named census_package/my_pipeline.py.
Run the following cell to define three transformers:
PositionalSelector: Given a list of indices C and a matrix M, this returns a matrix with a subset of M's columns, indicated by C.
StripString: Given a matrix of strings, this strips leading and trailing whitespace from each string.
SimpleOneHotEncoder: A simple one-hot encoder that can be applied to a matrix of strings.
In [0]:
%%writefile ./census_package/my_pipeline.py
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class PositionalSelector(BaseEstimator, TransformerMixin):
    """Selects a subset of a matrix's columns by positional index."""

    def __init__(self, positions):
        self.positions = positions

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array(X)[:, self.positions]


class StripString(BaseEstimator, TransformerMixin):
    """Strips leading and trailing whitespace from every string in a matrix."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        strip = np.vectorize(str.strip)
        return strip(np.array(X))


class SimpleOneHotEncoder(BaseEstimator, TransformerMixin):
    """One-hot encodes a matrix of strings, learning the categories during fit."""

    def fit(self, X, y=None):
        # Remember the distinct values of each column and their positions.
        self.values = []
        for c in range(X.shape[1]):
            Y = X[:, c]
            values = {v: i for i, v in enumerate(np.unique(Y))}
            self.values.append(values)
        return self

    def transform(self, X):
        X = np.array(X)
        matrices = []
        for c in range(X.shape[1]):
            Y = X[:, c]
            # One output column per value seen during fit; unseen values are
            # encoded as all zeros.
            matrix = np.zeros(shape=(len(Y), len(self.values[c])), dtype=np.int8)
            for i, x in enumerate(Y):
                if x in self.values[c]:
                    matrix[i][self.values[c][x]] = 1
            matrices.append(matrix)
        res = np.concatenate(matrices, axis=1)
        return res
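To see how these transformers behave, and how SimpleOneHotEncoder saves state during fit that it reuses during transform, you can try them on a small made-up matrix. The following cell is purely illustrative and isn't part of the training package:
In [0]:
# Illustrative only (not part of the training package): exercise the custom
# transformers on a tiny, made-up matrix of strings.
import numpy as np
import census_package.my_pipeline as mp

X = np.array([
    [' State-gov ', 'Never-married', 'ignored'],
    ['Private', 'Divorced', 'ignored'],
    ['Private', 'Never-married', 'ignored'],
])

selected = mp.PositionalSelector([0, 1]).transform(X)  # keep columns 0 and 1
stripped = mp.StripString().transform(selected)        # strip surrounding whitespace

encoder = mp.SimpleOneHotEncoder().fit(stripped)       # learns the categories per column
print(encoder.values)                                  # state saved during fit
print(encoder.transform(stripped))                     # one-hot matrix built from that state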
Note: You can also create custom transformers by using sklearn.preprocessing.FunctionTransformer, but this only works for stateless transformations.
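For example, a stateless transformer like StripString could also be written with FunctionTransformer, whereas SimpleOneHotEncoder could not, because it must remember the categories it saw during fit. The following sketch is illustrative only and isn't used elsewhere in this tutorial:
In [0]:
# Illustrative only: a stateless equivalent of StripString built with
# FunctionTransformer.
import numpy as np
from sklearn.preprocessing import FunctionTransformer

strip_transformer = FunctionTransformer(
    lambda X: np.vectorize(str.strip)(np.array(X)), validate=False)

print(strip_transformer.fit_transform(np.array([[' State-gov ', ' Bachelors ']])))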
Next, create a training module to train your scikit-learn pipeline on Census data. Part of this code involves defining the pipeline.
This training module does several things:
It downloads the training data from Cloud Storage and loads it into a DataFrame that can be used by scikit-learn.
It defines the pipeline: the pipeline selects three numerical features ('age', 'education-num', and 'hours-per-week') and three categorical features ('workclass', 'marital-status', and 'relationship') from the input data. It transforms the numerical features using scikit-learn's built-in StandardScaler and transforms the categorical ones with the custom one-hot encoder you defined in my_pipeline.py. Then it combines the preprocessed data as input for a classifier.
It exports the final trained pipeline using joblib and saves it to your Cloud Storage bucket.
Run the following cell to create census_package/train.py:
In [0]:
%%writefile ./census_package/train.py
import warnings
import argparse

from google.cloud import storage
import pandas as pd
import numpy as np
from sklearn.externals import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline

import census_package.my_pipeline as mp

warnings.filterwarnings('ignore')


def download_data(bucket_name, gcs_path, local_path):
    """Downloads a blob from Cloud Storage to a local file."""
    bucket = storage.Client().bucket(bucket_name)
    blob = bucket.blob(gcs_path)
    blob.download_to_filename(local_path)


def upload_data(bucket_name, gcs_path, local_path):
    """Uploads a local file to Cloud Storage."""
    bucket = storage.Client().bucket(bucket_name)
    blob = bucket.blob(gcs_path)
    blob.upload_from_filename(local_path)


def get_features_target(local_path):
    """Loads the Census CSV and splits it into features and target."""
    strip = np.vectorize(str.strip)
    raw_df = pd.read_csv(local_path, header=None)
    target_index = len(raw_df.columns) - 1  # The last column, 'income-level', is the target
    features_df = raw_df.drop(target_index, axis=1)
    features = features_df.values
    target = strip(raw_df[target_index].values)
    return features, target


def create_pipeline():
    """Defines a pipeline that preprocesses 6 features and feeds a classifier."""
    # We want to use 3 numerical and 3 categorical features in this sample.
    # Numerical features: age, education-num, and hours-per-week
    # Categorical features: workclass, marital-status, and relationship
    numerical_indices = [0, 4, 12]   # age, education-num, and hours-per-week
    categorical_indices = [1, 5, 7]  # workclass, marital-status, and relationship

    p1 = make_pipeline(mp.PositionalSelector(categorical_indices),
                       mp.StripString(),
                       mp.SimpleOneHotEncoder())
    p2 = make_pipeline(mp.PositionalSelector(numerical_indices),
                       StandardScaler())

    feats = FeatureUnion([
        ('categoricals', p1),
        ('numericals', p2),
    ])

    pipeline = Pipeline([
        ('pre', feats),
        ('estimator', GradientBoostingClassifier(max_depth=4, n_estimators=100))
    ])
    return pipeline


def get_bucket_path(gcs_uri):
    """Splits a gs:// URI into a bucket name and an object path."""
    if not gcs_uri.startswith('gs://'):
        raise Exception('{} does not start with gs://'.format(gcs_uri))
    no_gs_uri = gcs_uri[len('gs://'):]
    first_slash_index = no_gs_uri.find('/')
    bucket_name = no_gs_uri[:first_slash_index]
    gcs_path = no_gs_uri[first_slash_index + 1:]
    return bucket_name, gcs_path


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--gcs_data_path', action="store", required=True)
    parser.add_argument('--gcs_model_path', action="store", required=True)
    arguments, others = parser.parse_known_args()

    local_path = '/tmp/adult.data'
    data_bucket, data_path = get_bucket_path(arguments.gcs_data_path)
    print('Downloading the data...')
    download_data(data_bucket, data_path, local_path)
    features, target = get_features_target(local_path)
    pipeline = create_pipeline()

    print('Training the model...')
    pipeline.fit(features, target)

    joblib.dump(pipeline, './model.joblib')

    model_bucket, model_path = get_bucket_path(arguments.gcs_model_path)
    upload_data(model_bucket, model_path, './model.joblib')
    print('Model was successfully uploaded.')
Use gcloud to submit a training job to AI Platform. The following command packages your training application, uploads it to Cloud Storage, and tells AI Platform to run your training module.
The -- argument is a separator: the AI Platform service doesn't use arguments that follow the separator, but your training module can still access them.
In [0]:
! gcloud ai-platform jobs submit training census_training_$(date +"%Y%m%d_%H%M%S") \
--job-dir gs://$BUCKET_NAME/custom_pipeline_tutorial/job \
--package-path ./census_package \
--module-name census_package.train \
--region $REGION \
--runtime-version 1.13 \
--python-version 3.5 \
--scale-tier BASIC \
--stream-logs \
-- \
--gcs_data_path gs://cloud-samples-data/ai-platform/census/data/adult.data.csv \
--gcs_model_path gs://$BUCKET_NAME/custom_pipeline_tutorial/model/model.joblib
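Because of the --stream-logs flag, the command above blocks until the job finishes. Once it succeeds, you can optionally confirm from Python that the trained pipeline was exported to your bucket. This is a minimal sketch that assumes the google-cloud-storage client is available in this environment:
In [0]:
# Optional check: confirm that the training job wrote model.joblib to your bucket.
from google.cloud import storage

model_blob = storage.Client(project=PROJECT_ID).bucket(BUCKET_NAME).blob(
    'custom_pipeline_tutorial/model/model.joblib')
print('model.joblib exists:', model_blob.exists())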
To serve predictions from AI Platform, you must deploy a model resource and a version resource. The model helps you organize multiple deployments if you modify and train your pipeline multiple times. The version uses your trained model and custom code to serve predictions.
To deploy these resources, you need to provide two artifacts:
A Cloud Storage directory containing your trained pipeline. The training job in the previous step created this when it exported model.joblib to your bucket.
A .tar.gz source distribution package in Cloud Storage containing any custom transformers your pipeline uses. Create this in the next step.
If you deploy a version without providing the code from my_pipeline.py, AI Platform Prediction won't be able to import the custom transformers (for example, mp.SimpleOneHotEncoder) and it will be unable to serve predictions.
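To see why the package is needed, note that the pickled pipeline stores only references to classes such as census_package.my_pipeline.SimpleOneHotEncoder, not their code, so whatever loads model.joblib must be able to import census_package. The following local sketch is optional and illustrative only; it assumes you have copied model.joblib from your bucket into the working directory (for example with gsutil cp) and that your local scikit-learn version matches the training runtime (scikit-learn 0.20 for runtime 1.13):
In [0]:
# Illustrative only: unpickling the trained pipeline requires census_package to
# be importable -- the same requirement AI Platform Prediction satisfies with
# the .tar.gz package you create below. Assumes model.joblib was copied locally
# and that scikit-learn 0.20 is installed.
from sklearn.externals import joblib

import census_package.my_pipeline  # unpickling imports this automatically; shown here for emphasis

pipeline = joblib.load('model.joblib')  # raises ImportError if census_package isn't on the path
print(pipeline.named_steps.keys())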
Create the following setup.py to define a source distribution package for your code:
In [0]:
%%writefile ./setup.py
import setuptools
setuptools.setup(name='census_package',
packages=['census_package'],
version="1.0",
)
Then run the following command to create dist/census_package-1.0.tar.gz:
In [0]:
! python setup.py sdist --formats=gztar
Note: This package contains train.py as well, but the version resource doesn't require it.
Finally, upload this tarball to your Cloud Storage bucket:
In [0]:
! gsutil cp ./dist/census_package-1.0.tar.gz gs://$BUCKET_NAME/custom_pipeline_tutorial/code/census_package-1.0.tar.gz
First, define model and version names:
In [0]:
MODEL_NAME = 'CensusPredictor'
VERSION_NAME = 'v1'
Then use the following command to create the model resource:
In [0]:
! gcloud ai-platform models create $MODEL_NAME \
--regions $REGION
Finally, create the version resource by providing Cloud Storage paths to your model directory (the one that contains model.joblib) and your custom code (census_package-1.0.tar.gz):
In [0]:
# --quiet automatically installs the beta component if it isn't already installed
! gcloud --quiet beta ai-platform versions create $VERSION_NAME --model $MODEL_NAME \
--origin gs://$BUCKET_NAME/custom_pipeline_tutorial/model/ \
--runtime-version 1.13 \
--python-version 3.5 \
--framework SCIKIT_LEARN \
--package-uris gs://$BUCKET_NAME/custom_pipeline_tutorial/code/census_package-1.0.tar.gz
To send online prediction requests, first install the Google API Client Library for Python:
In [0]:
! pip install --upgrade google-api-python-client
Then send two instances of Census data to your deployed version:
In [0]:
import googleapiclient.discovery
instances = [
    [39, 'State-gov', 77516, ' Bachelors . ', 13, 'Never-married', 'Adm-clerical', 'Not-in-family',
     'White', 'Male', 2174, 0, 40, 'United-States', '<=50K'],
    [50, 'Self-emp-not-inc', 83311, 'Bachelors', 13, 'Married-civ-spouse', 'Exec-managerial', 'Husband',
     'White', 'Male', 0, 0, 13, 'United-States', '<=50K']
]

service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}/versions/{}'.format(PROJECT_ID, MODEL_NAME, VERSION_NAME)

response = service.projects().predict(
    name=name,
    body={'instances': instances}
).execute()

if 'error' in response:
    raise RuntimeError(response['error'])
else:
    print(response['predictions'])
Note: This code uses the credentials you set up during the authentication step to make the online prediction request.
The version passes the input data through the trained pipeline and returns the classifier's results: either <=50K or >50K for each instance, depending on its prediction for the person's income bracket.
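The predictions are returned in the same order as the instances you sent, so you can pair each prediction with its input. A small sketch reusing the instances and response objects from the cells above:
In [0]:
# Pair each input instance with its predicted income bracket.
# Reuses `instances` and `response` from the prediction cells above.
for instance, prediction in zip(instances, response['predictions']):
    print('age={}, workclass={} -> {}'.format(instance[0], instance[1], prediction))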
To clean up all GCP resources used in this project, you can delete the GCP project you used for the tutorial.
Alternatively, you can clean up individual resources by running the following commands:
In [0]:
# Delete version resource
! gcloud ai-platform versions delete $VERSION_NAME --quiet --model $MODEL_NAME
# Delete model resource
! gcloud ai-platform models delete $MODEL_NAME --quiet
# Delete Cloud Storage objects that were created
! gsutil -m rm -r gs://$BUCKET_NAME/custom_pipeline_tutorial