This notebook illustrates training a scikit-learn model locally and on Cloud ML Engine, tuning its hyperparameters, and deploying the trained model as a REST web service.
Please see this notebook for more context on this problem and how the features were chosen.
In [1]:
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
PROJECTNUMBER = '663413318684'
REGION = 'us-central1'
In [2]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['PROJECTNUMBER'] = PROJECTNUMBER
os.environ['REGION'] = REGION
In [3]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION
In [4]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi
In [138]:
%%bash
# Pandas will use this privatekey to access BigQuery on our behalf.
# Do NOT check the private key into git!!!
# If you get a JWT grant error when using this key, create the key via the GCP web console (IAM > Service Accounts).
KEYFILE=babyweight/trainer/privatekey.json
if [ ! -f $KEYFILE ]; then
  gcloud iam service-accounts keys create \
    --iam-account ${PROJECTNUMBER}-compute@developer.gserviceaccount.com \
    $KEYFILE
fi
In [21]:
KEYDIR='babyweight/trainer'
In [76]:
#%%writefile babyweight/trainer/model.py
# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
We can use BigQuery to create the training and evaluation datasets. Because of the masking (ultrasound vs. no ultrasound), the query itself is a little complex.
In [1]:
#%%writefile -a babyweight/trainer/model.py
def create_queries():
  query_all = """
  WITH with_ultrasound AS (
    SELECT
      weight_pounds AS label,
      CAST(is_male AS STRING) AS is_male,
      mother_age,
      CAST(plurality AS STRING) AS plurality,
      gestation_weeks,
      FARM_FINGERPRINT(CONCAT(CAST(year AS STRING), CAST(month AS STRING))) AS hashmonth
    FROM
      publicdata.samples.natality
    WHERE
      year > 2000
      AND gestation_weeks > 0
      AND mother_age > 0
      AND plurality > 0
      AND weight_pounds > 0
  ),

  without_ultrasound AS (
    SELECT
      weight_pounds AS label,
      'Unknown' AS is_male,
      mother_age,
      IF(plurality > 1, 'Multiple', 'Single') AS plurality,
      gestation_weeks,
      FARM_FINGERPRINT(CONCAT(CAST(year AS STRING), CAST(month AS STRING))) AS hashmonth
    FROM
      publicdata.samples.natality
    WHERE
      year > 2000
      AND gestation_weeks > 0
      AND mother_age > 0
      AND plurality > 0
      AND weight_pounds > 0
  ),

  preprocessed AS (
    SELECT * from with_ultrasound
    UNION ALL
    SELECT * from without_ultrasound
  )

  SELECT
    label,
    is_male,
    mother_age,
    plurality,
    gestation_weeks
  FROM
    preprocessed
  """

  train_query = "{} WHERE ABS(MOD(hashmonth, 4)) < 3".format(query_all)
  eval_query = "{} WHERE ABS(MOD(hashmonth, 4)) = 3".format(query_all)
  return train_query, eval_query
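The hashmonth computed with FARM_FINGERPRINT is what makes this split repeatable: every row from the same year/month hashes to the same bucket, and MOD 4 routes roughly three quarters of the months to training and the rest to evaluation, no matter how often the query is rerun. A minimal sketch of the same idea in plain Python (hashlib stands in for BigQuery's FARM_FINGERPRINT; the helper name is ours):

import hashlib

def split_for(year, month, num_buckets=4):
  # hash the concatenated year/month, like FARM_FINGERPRINT(CONCAT(...)) in the query
  h = int(hashlib.md5('{}{}'.format(year, month).encode('utf-8')).hexdigest(), 16)
  return 'eval' if h % num_buckets == num_buckets - 1 else 'train'

print(split_for(2005, 7))  # the same month always lands in the same split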
In [2]:
print(create_queries()[0])
In [19]:
#%%writefile -a babyweight/trainer/model.py
def query_to_dataframe(query):
  import pandas as pd
  import pkgutil
  privatekey = pkgutil.get_data(KEYDIR, 'privatekey.json')
  print(privatekey[:200])
  return pd.read_gbq(query,
                     project_id=PROJECT,
                     dialect='standard',
                     private_key=privatekey)

def create_dataframes(frac):
  # small dataset for testing
  if frac > 0 and frac < 1:
    sample = " AND RAND() < {}".format(frac)
  else:
    sample = ""

  train_query, eval_query = create_queries()
  train_query = "{} {}".format(train_query, sample)
  eval_query = "{} {}".format(eval_query, sample)

  train_df = query_to_dataframe(train_query)
  eval_df = query_to_dataframe(eval_query)
  return train_df, eval_df
In [22]:
train_df, eval_df = create_dataframes(0.001)
train_df.describe()
Out[22]:
In [23]:
eval_df.head()
Out[23]:
Let's train the model locally
In [36]:
#%%writefile -a babyweight/trainer/model.py
def input_fn(indf):
  import copy
  import pandas as pd
  df = copy.deepcopy(indf)

  # one-hot encode the categorical columns
  df["plurality"] = df["plurality"].astype(pd.api.types.CategoricalDtype(
      categories=["Single","Multiple","1","2","3","4","5"]))
  df["is_male"] = df["is_male"].astype(pd.api.types.CategoricalDtype(
      categories=["Unknown","false","true"]))

  # features, label
  label = df['label']
  del df['label']
  features = pd.get_dummies(df)
  return features, label
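Pinning the category lists down matters: with a fixed CategoricalDtype, pd.get_dummies always emits the same columns in the same order, even when some category never occurs in a particular sample, so the training and evaluation matrices line up. A quick illustration on a made-up two-row frame:

import pandas as pd

toy = pd.DataFrame({'is_male': ['true', 'true']})  # no 'false' or 'Unknown' rows at all
toy['is_male'] = toy['is_male'].astype(pd.api.types.CategoricalDtype(
    categories=['Unknown', 'false', 'true']))
print(pd.get_dummies(toy).columns.tolist())
# ['is_male_Unknown', 'is_male_false', 'is_male_true'] -- all three columns appear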
In [37]:
train_x, train_y = input_fn(train_df)
print(train_x[:5])
print(train_y[:5])
In [38]:
from sklearn.ensemble import RandomForestRegressor
estimator = RandomForestRegressor(max_depth=5, n_estimators=100, random_state=0)
estimator.fit(train_x, train_y)
Out[38]:
In [39]:
import numpy as np
eval_x, eval_y = input_fn(eval_df)
eval_pred = estimator.predict(eval_x)
print(eval_pred[1000:1005])
print(eval_y[1000:1005])
print(np.sqrt(np.mean((eval_pred-eval_y)*(eval_pred-eval_y))))
In [142]:
#%%writefile -a babyweight/trainer/model.py
def train_and_evaluate(frac, max_depth=5, n_estimators=100):
  import numpy as np

  # get data
  train_df, eval_df = create_dataframes(frac)
  train_x, train_y = input_fn(train_df)

  # train
  from sklearn.ensemble import RandomForestRegressor
  estimator = RandomForestRegressor(max_depth=max_depth, n_estimators=n_estimators, random_state=0)
  estimator.fit(train_x, train_y)

  # evaluate
  eval_x, eval_y = input_fn(eval_df)
  eval_pred = estimator.predict(eval_x)
  rmse = np.sqrt(np.mean((eval_pred - eval_y) * (eval_pred - eval_y)))
  print("Eval rmse={}".format(rmse))
  return estimator, rmse
In [72]:
#%%writefile -a babyweight/trainer/model.py
def save_model(estimator, gcspath, name):
  from sklearn.externals import joblib
  import os, subprocess, datetime
  model = 'model.joblib'
  joblib.dump(estimator, model)
  model_path = os.path.join(gcspath, datetime.datetime.now().strftime(
      'export_%Y%m%d_%H%M%S'), model)
  subprocess.check_call(['gsutil', 'cp', model, model_path])
  return model_path
In [69]:
saved = save_model(estimator, 'gs://{}/babyweight/sklearn'.format(BUCKET), 'babyweight')
In [70]:
print(saved)
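Before deploying, it's worth a quick sanity check that the export round-trips; a minimal sketch, assuming `saved` still holds the gs:// path printed above and `eval_x` is still in memory:

import subprocess
from sklearn.externals import joblib  # the same joblib that dumped the model

subprocess.check_call(['gsutil', 'cp', saved, 'model.joblib'])
reloaded = joblib.load('model.joblib')
print(reloaded.predict(eval_x[:2]))  # should match estimator.predict(eval_x[:2])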
In [11]:
%%writefile babyweight/trainer/task.py
# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import hypertune
import model
if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument(
      '--bucket',
      help='GCS path to output.',
      required=True
  )
  parser.add_argument(
      '--frac',
      help='Fraction of input to process',
      type=float,
      required=True
  )
  parser.add_argument(
      '--maxDepth',
      help='Depth of trees',
      type=int,
      default=5
  )
  parser.add_argument(
      '--numTrees',
      help='Number of trees',
      type=int,
      default=100
  )
  parser.add_argument(
      '--projectId',
      help='ID (not name) of your project',
      required=True
  )
  parser.add_argument(
      '--job-dir',
      help='output directory for model, automatically provided by gcloud',
      required=True
  )

  args = parser.parse_args()
  arguments = args.__dict__

  model.PROJECT = arguments['projectId']
  model.KEYDIR = 'trainer'

  estimator, rmse = model.train_and_evaluate(arguments['frac'],
                                             arguments['maxDepth'],
                                             arguments['numTrees'])
  loc = model.save_model(estimator,
                         arguments['job_dir'], 'babyweight')
  print("Saved model to {}".format(loc))

  # this is for hyperparameter tuning
  hpt = hypertune.HyperTune()
  hpt.report_hyperparameter_tuning_metric(
      hyperparameter_metric_tag='rmse',
      metric_value=rmse,
      global_step=0)
# done
In [127]:
!pip freeze | grep pandas
In [18]:
%%writefile babyweight/setup.py
# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from setuptools import setup
setup(name='trainer',
version='1.0',
description='Natality, with sklearn',
url='http://github.com/GoogleCloudPlatform/training-data-analyst',
author='Google',
author_email='nobody@google.com',
license='Apache2',
packages=['trainer'],
## WARNING! Do not upload this package to PyPI
## BECAUSE it contains a private key
package_data={'': ['privatekey.json']},
install_requires=[
'pandas-gbq==0.3.0',
'urllib3',
'google-cloud-bigquery==0.29.0',
'cloudml-hypertune'
],
zip_safe=False)
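For reference, the package assembled by these cells should end up looking like this (a sketch: trainer/__init__.py is assumed, since setuptools needs it to treat trainer as a package; model.py assumes the commented-out %%writefile lines above were run uncommented; privatekey.json exists only if the key-creation cell ran):

babyweight/
  setup.py
  trainer/
    __init__.py
    model.py
    privatekey.json
    task.py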
Try out the package on a subset of the data.
In [ ]:
%%bash
export PYTHONPATH=${PYTHONPATH}:${PWD}/babyweight
python -m trainer.task \
  --bucket=${BUCKET} --frac=0.001 --job-dir=gs://${BUCKET}/babyweight/sklearn --projectId $PROJECT
Submit the code to the ML Engine service
In [ ]:
%%bash
RUNTIME_VERSION="1.8"
PYTHON_VERSION="2.7"
JOB_NAME=babyweight_skl_$(date +"%Y%m%d_%H%M%S")
JOB_DIR="gs://$BUCKET/babyweight/sklearn/${JOB_NAME}"

gcloud ml-engine jobs submit training $JOB_NAME \
  --job-dir $JOB_DIR \
  --package-path $(pwd)/babyweight/trainer \
  --module-name trainer.task \
  --region us-central1 \
  --runtime-version=$RUNTIME_VERSION \
  --python-version=$PYTHON_VERSION \
  -- \
  --bucket=${BUCKET} --frac=0.1 --projectId $PROJECT
The training finished in 20 minutes with an RMSE of 1.05 lbs.
Deploying the trained model to act as a REST web service is a simple gcloud call.
In [8]:
%%bash
gsutil ls gs://${BUCKET}/babyweight/sklearn/ | tail -1
In [ ]:
%%bash
MODEL_NAME="babyweight"
MODEL_VERSION="skl"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/babyweight/sklearn/ | tail -1)
echo "Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"
#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}
#gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud alpha ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} \
  --framework SCIKIT_LEARN --runtime-version 1.8 --python-version=2.7
Send a JSON request to the endpoint of the service to have it predict a baby's weight. Note that we need to send in an array of numbers in the same order as the columns during training. We could have avoided some of this preprocessing by folding it into the model with sklearn's Pipeline, but since we did our preprocessing with Pandas, that is not an option; a sketch of what the Pipeline alternative might look like follows.
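For the record, a minimal sketch of that alternative, assuming a newer scikit-learn (0.20+, where ColumnTransformer exists; the 1.8 runtime used above predates it, so this is not part of the deployed package):

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# one-hot encode the two categorical columns inside the model itself,
# so callers could send raw strings instead of pre-encoded arrays
pipeline = Pipeline([
    ('encode', ColumnTransformer(
        [('onehot', OneHotEncoder(handle_unknown='ignore'), ['is_male', 'plurality'])],
        remainder='passthrough')),
    ('rf', RandomForestRegressor(max_depth=5, n_estimators=100, random_state=0)),
])
# pipeline.fit(train_df.drop('label', axis=1), train_df['label'])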
So, let's find the order of columns:
In [40]:
import json

data = []
for i in range(2):
  data.append([])
  for col in eval_x:
    # convert from numpy integers to standard integers
    data[i].append(int(np.uint64(eval_x[col][i]).item()))
print(eval_x.columns)
print(json.dumps(data))
As long as you send in the data in that order, it will work:
In [35]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json
credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials)
request_data = {'instances':
  # [u'mother_age', u'gestation_weeks', u'is_male_Unknown', u'is_male_false',
  #  u'is_male_true', u'plurality_Single', u'plurality_Multiple',
  #  u'plurality_1', u'plurality_2', u'plurality_3', u'plurality_4',
  #  u'plurality_5']
  [[24, 38, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
   [34, 39, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]]
}
parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'babyweight', 'skl')
response = api.projects().predict(body=request_data, name=parent).execute()
print "response={0}".format(response)
In [14]:
%%writefile hyperparam.yaml
trainingInput:
  hyperparameters:
    goal: MINIMIZE
    maxTrials: 100
    maxParallelTrials: 5
    hyperparameterMetricTag: rmse
    params:
    - parameterName: maxDepth
      type: INTEGER
      minValue: 2
      maxValue: 8
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: numTrees
      type: INTEGER
      minValue: 50
      maxValue: 150
      scaleType: UNIT_LINEAR_SCALE
In [ ]:
%%bash
RUNTIME_VERSION="1.8"
PYTHON_VERSION="2.7"
JOB_NAME=babyweight_skl_$(date +"%Y%m%d_%H%M%S")
JOB_DIR="gs://$BUCKET/babyweight/sklearn/${JOB_NAME}"

gcloud ml-engine jobs submit training $JOB_NAME \
  --job-dir $JOB_DIR \
  --package-path $(pwd)/babyweight/trainer \
  --module-name trainer.task \
  --region us-central1 \
  --runtime-version=$RUNTIME_VERSION \
  --python-version=$PYTHON_VERSION \
  --config=hyperparam.yaml \
  -- \
  --bucket=${BUCKET} --frac=0.01 --projectId $PROJECT
If you go to the GCP console and click on the job, you will see the trial information start to populate, with the lowest-RMSE trial listed first. I got the best performance with these settings:

"hyperparameters": {
  "maxDepth": "8",
  "numTrees": "90"
},
"finalMetric": {
  "trainingStep": "1",
  "objectiveValue": 1.03123724461
}
Let's train on the full dataset with these hyperparameters. I am using a larger machine (8 CPUs, 52 GB of memory).
In [21]:
%%writefile largemachine.yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: large_model
In [ ]:
%%bash
RUNTIME_VERSION="1.8"
PYTHON_VERSION="2.7"
JOB_NAME=babyweight_skl_$(date +"%Y%m%d_%H%M%S")
JOB_DIR="gs://$BUCKET/babyweight/sklearn/${JOB_NAME}"

gcloud ml-engine jobs submit training $JOB_NAME \
  --job-dir $JOB_DIR \
  --package-path $(pwd)/babyweight/trainer \
  --module-name trainer.task \
  --region us-central1 \
  --runtime-version=$RUNTIME_VERSION \
  --python-version=$PYTHON_VERSION \
  --scale-tier=CUSTOM \
  --config=largemachine.yaml \
  -- \
  --bucket=${BUCKET} --frac=1 --projectId $PROJECT --maxDepth 8 --numTrees 90
Copyright 2018 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.