Structured data prediction using Cloud ML Engine with scikit-learn

This notebook illustrates:

  1. Creating datasets for Machine Learning using BigQuery
  2. Creating a model using scikit-learn
  3. Training on Cloud ML Engine
  4. Deploying model
  5. Predicting with model
  6. Hyperparameter tuning of scikit-learn models

Please see this notebook for more context on this problem and how the features were chosen.


In [1]:
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
PROJECTNUMBER = '663413318684'
REGION = 'us-central1'

In [2]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['PROJECTNUMBER'] = PROJECTNUMBER
os.environ['REGION'] = REGION

In [3]:
%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION


Updated property [core/project].
Updated property [compute/region].

In [4]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi

In [138]:
%bash
# Pandas will use this privatekey to access BigQuery on our behalf.
# Do NOT check in the private key into git!!!
# if you get a JWT grant error when using this key, create the key via gcp web console in IAM > Service Accounts section
KEYFILE=babyweight/trainer/privatekey.json
if [ ! -f $KEYFILE ]; then
  gcloud iam service-accounts keys create \
      --iam-account ${PROJECTNUMBER}-compute@developer.gserviceaccount.com \
      $KEYFILE
fi

In [21]:
KEYDIR='babyweight/trainer'

Exploring dataset

Please see this notebook for more context on this problem and how the features were chosen.


In [76]:
#%writefile babyweight/trainer/model.py

# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Creating a ML dataset using BigQuery

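We can use BigQuery to create the training and evaluation datasets. The query is a little complex because it emits every birth twice: once with the features an ultrasound would reveal (the baby's sex and the exact plurality) and once with those features masked (sex set to 'Unknown', plurality reduced to 'Single' or 'Multiple').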
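The masking can be mimicked in plain Python to see what the two halves of the query produce for one birth record. This is only a sketch: `expand_record` is a hypothetical helper that is not part of the trainer package, and it covers just the two masked columns.

```python
def expand_record(is_male, plurality):
    """Return the two rows derived from one birth record (illustrative only)."""
    # with_ultrasound: CAST(... AS STRING) keeps the exact values as strings
    with_ultrasound = {
        'is_male': 'true' if is_male else 'false',
        'plurality': str(plurality),
    }
    # without_ultrasound: sex is unknown and plurality collapses to two classes
    without_ultrasound = {
        'is_male': 'Unknown',
        'plurality': 'Multiple' if plurality > 1 else 'Single',
    }
    return [with_ultrasound, without_ultrasound]


# A boy who is one of twins yields one detailed row and one masked row
rows = expand_record(is_male=True, plurality=2)
```

Because both rows share the same label, the model learns to predict with or without the ultrasound information.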


In [1]:
#%writefile -a babyweight/trainer/model.py
def create_queries():
  query_all = """
  WITH with_ultrasound AS (
    SELECT
      weight_pounds AS label,
      CAST(is_male AS STRING) AS is_male,
      mother_age,
      CAST(plurality AS STRING) AS plurality,
      gestation_weeks,
      FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth
    FROM
      publicdata.samples.natality
    WHERE
      year > 2000
      AND gestation_weeks > 0
      AND mother_age > 0
      AND plurality > 0
      AND weight_pounds > 0
  ),

  without_ultrasound AS (
    SELECT
      weight_pounds AS label,
      'Unknown' AS is_male,
      mother_age,
      IF(plurality > 1, 'Multiple', 'Single') AS plurality,
      gestation_weeks,
      FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth
    FROM
      publicdata.samples.natality
    WHERE
      year > 2000
      AND gestation_weeks > 0
      AND mother_age > 0
      AND plurality > 0
      AND weight_pounds > 0
  ),

  preprocessed AS (
    SELECT * from with_ultrasound
    UNION ALL
    SELECT * from without_ultrasound
  )

  SELECT
      label,
      is_male,
      mother_age,
      plurality,
      gestation_weeks
  FROM
      preprocessed
  """

  train_query = "{} WHERE ABS(MOD(hashmonth, 4)) < 3".format(query_all)
  eval_query  = "{} WHERE ABS(MOD(hashmonth, 4)) = 3".format(query_all)
  return train_query, eval_query

In [2]:
print(create_queries()[0])


  WITH with_ultrasound AS (
    SELECT
      weight_pounds AS label,
      CAST(is_male AS STRING) AS is_male,
      mother_age,
      CAST(plurality AS STRING) AS plurality,
      gestation_weeks,
      FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth
    FROM
      publicdata.samples.natality
    WHERE
      year > 2000
      AND gestation_weeks > 0
      AND mother_age > 0
      AND plurality > 0
      AND weight_pounds > 0
  ),

  without_ultrasound AS (
    SELECT
      weight_pounds AS label,
      'Unknown' AS is_male,
      mother_age,
      IF(plurality > 1, 'Multiple', 'Single') AS plurality,
      gestation_weeks,
      FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth
    FROM
      publicdata.samples.natality
    WHERE
      year > 2000
      AND gestation_weeks > 0
      AND mother_age > 0
      AND plurality > 0
      AND weight_pounds > 0
  ),

  preprocessed AS (
    SELECT * from with_ultrasound
    UNION ALL
    SELECT * from without_ultrasound
  )

  SELECT
      label,
      is_male,
      mother_age,
      plurality,
      gestation_weeks
  FROM
      preprocessed
   WHERE ABS(MOD(hashmonth, 4)) < 3
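The WHERE clause is what splits the data: every row from the same year and month gets the same hash, so a whole month lands in either training or evaluation and the split is repeatable. Here is a rough Python sketch of the idea. It assumes SHA-256 as a stand-in for FARM_FINGERPRINT (so the actual assignments will differ from BigQuery's), and the year range is only illustrative.

```python
import hashlib


def month_bucket(year, month, buckets=4):
    # Stand-in for FARM_FINGERPRINT: any stable hash illustrates the idea
    key = '{}{}'.format(year, month).encode('utf-8')
    # Python's % is never negative here, so no ABS() is needed; BigQuery's MOD
    # keeps the sign of a negative hash, which is why the query wraps it in ABS
    return int(hashlib.sha256(key).hexdigest(), 16) % buckets


def split_for(year, month):
    # Mirrors ABS(MOD(hashmonth, 4)) < 3: remainders 0, 1, 2 go to training,
    # remainder 3 goes to evaluation (roughly a 75/25 split by month)
    return 'train' if month_bucket(year, month) < 3 else 'eval'


assignments = {(y, m): split_for(y, m)
               for y in range(2001, 2009) for m in range(1, 13)}
```

Re-running the sketch always gives the same assignment for a given month, which is the property the query relies on.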

In [19]:
#%writefile -a babyweight/trainer/model.py
def query_to_dataframe(query):
  import pandas as pd
  import pkgutil
  privatekey = pkgutil.get_data(KEYDIR, 'privatekey.json')
  print(privatekey[:200])
  return pd.read_gbq(query,
                     project_id=PROJECT,
                     dialect='standard',
                     private_key=privatekey)

def create_dataframes(frac):  
  # small dataset for testing
  if frac > 0 and frac < 1:
    sample = " AND RAND() < {}".format(frac)
  else:
    sample = ""

  train_query, eval_query = create_queries()
  train_query = "{} {}".format(train_query, sample)
  eval_query =  "{} {}".format(eval_query, sample)

  train_df = query_to_dataframe(train_query)
  eval_df = query_to_dataframe(eval_query)
  return train_df, eval_df

In [22]:
train_df, eval_df = create_dataframes(0.001)
train_df.describe()


{
  "type": "service_account",
  "project_id": "cloud-training-demos",
  "private_key_id": "ef88065bb770531b91fb45e31ae60539475547d8",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvgIBADANBgkqhk
{
  "type": "service_account",
  "project_id": "cloud-training-demos",
  "private_key_id": "ef88065bb770531b91fb45e31ae60539475547d8",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvgIBADANBgkqhk
Out[22]:
              label    mother_age  gestation_weeks
count  52962.000000  52962.000000     52962.000000
mean       7.221937     27.415996        38.581757
std        1.325003      6.148160         2.580018
min        0.500449     12.000000        17.000000
25%        6.563162     23.000000        38.000000
50%        7.312733     27.000000        39.000000
75%        8.062305     32.000000        40.000000
max       13.232145     53.000000        47.000000

In [23]:
eval_df.head()


Out[23]:
      label  is_male  mother_age plurality  gestation_weeks
0  1.124358  Unknown          24    Single               17
1  0.562179    false          34         1               19
2  1.563077     true          18         1               20
3  0.522496    false          18         2               20
4  0.623908    false          21         1               20

Creating a scikit-learn model using random forests

Let's train the model locally.


In [36]:
#%writefile -a babyweight/trainer/model.py
def input_fn(indf):
  import copy
  import pandas as pd
  df = copy.deepcopy(indf)

  # one-hot encode the categorical columns
  df["plurality"] = df["plurality"].astype(pd.api.types.CategoricalDtype(
                    categories=["Single","Multiple","1","2","3","4","5"]))
  df["is_male"] = df["is_male"].astype(pd.api.types.CategoricalDtype(
                  categories=["Unknown","false","true"]))
  # features, label
  label = df['label']
  del df['label']
  features = pd.get_dummies(df)
  return features, label

In [37]:
train_x, train_y = input_fn(train_df)
print(train_x[:5])
print(train_y[:5])


   mother_age  gestation_weeks  is_male_Unknown  is_male_false  is_male_true  \
0          20               17                0              1             0   
1          21               17                0              1             0   
2          26               17                0              0             1   
3          27               17                1              0             0   
4          29               17                1              0             0   

   plurality_Single  plurality_Multiple  plurality_1  plurality_2  \
0                 0                   0            1            0   
1                 0                   0            1            0   
2                 0                   0            0            0   
3                 0                   1            0            0   
4                 1                   0            0            0   

   plurality_3  plurality_4  plurality_5  
0            0            0            0  
1            0            0            0  
2            1            0            0  
3            0            0            0  
4            0            0            0  
0    1.000899
1    1.437414
2    0.518086
3    1.344820
4    0.551156
Name: label, dtype: float64

In [38]:
from sklearn.ensemble import RandomForestRegressor
estimator = RandomForestRegressor(max_depth=5, n_estimators=100, random_state=0)
estimator.fit(train_x, train_y)


Out[38]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=5,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=0, verbose=0, warm_start=False)

In [39]:
import numpy as np
eval_x, eval_y = input_fn(eval_df)
eval_pred = estimator.predict(eval_x)
print(eval_pred[1000:1005])
print(eval_y[1000:1005])
print(np.sqrt(np.mean((eval_pred-eval_y)*(eval_pred-eval_y))))


[6.10063904 6.10063904 5.19180783 5.20729443 6.10063904]
1000    6.563162
1001    6.415452
1002    5.749656
1003    5.500533
1004    9.751046
Name: label, dtype: float64
1.0421166304081477
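The last number printed is the root mean squared error (RMSE) over the whole evaluation set, in pounds. As a sketch of what that NumPy expression computes, here is the same formula in plain Python applied to only the five rows printed above, so its result will not match the full-dataset figure.

```python
import math


def rmse(predictions, labels):
    # Square each error, average the squares, then take the square root
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# The five predictions and labels printed by the cell above
pred = [6.10063904, 6.10063904, 5.19180783, 5.20729443, 6.10063904]
actual = [6.563162, 6.415452, 5.749656, 5.500533, 9.751046]
slice_rmse = rmse(pred, actual)
```

On this small slice the error is dominated by the last row, a 9.75 lb baby that the model predicts at about 6.1 lb, which shows how strongly RMSE penalizes large misses.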

In [142]:
#%writefile -a babyweight/trainer/model.py
def train_and_evaluate(frac, max_depth=5, n_estimators=100):
  import numpy as np

  # get data
  train_df, eval_df = create_dataframes(frac)
  train_x, train_y = input_fn(train_df)
  # train
  from sklearn.ensemble import RandomForestRegressor
  estimator = RandomForestRegressor(max_depth=max_depth, n_estimators=n_estimators, random_state=0)
  estimator.fit(train_x, train_y)
  # evaluate
  eval_x, eval_y = input_fn(eval_df)
  eval_pred = estimator.predict(eval_x)
  rmse = np.sqrt(np.mean((eval_pred-eval_y)*(eval_pred-eval_y)))
  print("Eval rmse={}".format(rmse))
  return estimator, rmse

In [72]:
#%writefile -a babyweight/trainer/model.py
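def save_model(estimator, gcspath, name):
  # name is currently unused: the file is always written as model.joblib,
  # the file name Cloud ML Engine expects for scikit-learn models
  from sklearn.externals import joblib
  import os, subprocess, datetime
  model = 'model.joblib'
  joblib.dump(estimator, model)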
  model_path = os.path.join(gcspath, datetime.datetime.now().strftime(
    'export_%Y%m%d_%H%M%S'), model)
  subprocess.check_call(['gsutil', 'cp', model, model_path])
  return model_path

In [69]:
saved = save_model(estimator, 'gs://{}/babyweight/sklearn'.format(BUCKET), 'babyweight')

In [70]:
print(saved)


gs://cloud-training-demos-ml/babyweight/sklearn/export_20180524_233356/babyweight.joblib

Packaging up as a Python package

Note the commented-out %writefile lines in the cells above. I uncommented those and ran the cells to write out model.py. The following cell writes out task.py.


In [11]:
%writefile babyweight/trainer/task.py
# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os

import hypertune
import model

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--bucket',
        help = 'GCS path to output.',
        required = True
    )
    parser.add_argument(
        '--frac',
        help = 'Fraction of input to process',
        type = float,
        required = True
    )
    parser.add_argument(
        '--maxDepth',
        help = 'Depth of trees',
        type = int,
        default = 5
    )
    parser.add_argument(
        '--numTrees',
        help = 'Number of trees',
        type = int,
        default = 100
    )
    parser.add_argument(
        '--projectId',
        help = 'ID (not name) of your project',
        required = True
    )
    parser.add_argument(
        '--job-dir',
        help = 'output directory for model, automatically provided by gcloud',
        required = True
    )
    
    args = parser.parse_args()
    arguments = args.__dict__
    
    model.PROJECT = arguments['projectId']
    model.KEYDIR  = 'trainer'
    
    estimator, rmse = model.train_and_evaluate(arguments['frac'],
                                         arguments['maxDepth'],
                                         arguments['numTrees']
                                        )
    loc = model.save_model(estimator, 
                           arguments['job_dir'], 'babyweight')
    print("Saved model to {}".format(loc))
    
    # this is for hyperparameter tuning
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag='rmse',
        metric_value=rmse,
        global_step=0)

# done


Overwriting babyweight/trainer/task.py
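One detail in task.py that is easy to trip over: the flag is spelled --job-dir on the command line, but the code reads arguments['job_dir']. argparse replaces dashes with underscores when it derives the attribute name. A minimal sketch, using a made-up bucket name:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--job-dir', required=True)
parser.add_argument('--maxDepth', type=int, default=5)

# 'my-bucket' is a placeholder, not a real bucket
args = parser.parse_args(['--job-dir', 'gs://my-bucket/babyweight/sklearn'])

# vars(args) returns the same mapping as args.__dict__ in task.py
arguments = vars(args)
job_dir = arguments['job_dir']      # note the underscore
max_depth = arguments['maxDepth']   # camelCase names are kept unchanged
```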

In [127]:
!pip freeze | grep pandas


pandas==0.22.0
pandas-gbq==0.3.0
pandas-profiling==1.4.1
You are using pip version 9.0.3, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

In [18]:
%writefile babyweight/setup.py
# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from setuptools import setup

setup(name='trainer',
      version='1.0',
      description='Natality, with sklearn',
      url='http://github.com/GoogleCloudPlatform/training-data-analyst',
      author='Google',
      author_email='nobody@google.com',
      license='Apache2',
      packages=['trainer'],
      ## WARNING! Do not upload this package to PyPI
      ## BECAUSE it contains a private key
      package_data={'': ['privatekey.json']},
      install_requires=[
          'pandas-gbq==0.3.0',
          'urllib3',
          'google-cloud-bigquery==0.29.0',
          'cloudml-hypertune'
      ],
      zip_safe=False)


Overwriting babyweight/setup.py

Try out the package on a subset of the data.


In [ ]:
%bash
export PYTHONPATH=${PYTHONPATH}:${PWD}/babyweight
python -m trainer.task \
   --bucket=${BUCKET} --frac=0.001 --job-dir=gs://${BUCKET}/babyweight/sklearn --projectId $PROJECT

Training on Cloud ML Engine

Submit the code to the ML Engine service


In [ ]:
%bash

RUNTIME_VERSION="1.8"
PYTHON_VERSION="2.7"
JOB_NAME=babyweight_skl_$(date +"%Y%m%d_%H%M%S")
# the trainer writes timestamped export_* directories directly under this path
JOB_DIR="gs://$BUCKET/babyweight/sklearn"

gcloud ml-engine jobs submit training $JOB_NAME \
  --job-dir $JOB_DIR \
  --package-path $(pwd)/babyweight/trainer \
  --module-name trainer.task \
  --region us-central1 \
  --runtime-version=$RUNTIME_VERSION \
  --python-version=$PYTHON_VERSION \
  -- \
  --bucket=${BUCKET} --frac=0.1 --projectId $PROJECT

The training finished in 20 minutes with an RMSE of 1.05 lbs.

Deploying the trained model

Deploying the trained model to act as a REST web service is a simple gcloud call.


In [8]:
%bash
gsutil ls gs://${BUCKET}/babyweight/sklearn/ | tail -1


gs://cloud-training-demos-ml/babyweight/sklearn/export_20180526_185457/
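Picking the last line of the listing works because save_model names each export with a zero-padded timestamp, so alphabetical order is also chronological order. The sketch below assumes the listing comes back in lexicographic order; the third timestamp is made up for illustration.

```python
import datetime


def export_dir(when):
    # Same pattern that save_model uses for its export directories
    return when.strftime('export_%Y%m%d_%H%M%S')


runs = [
    datetime.datetime(2018, 5, 26, 18, 54, 57),
    datetime.datetime(2018, 5, 24, 23, 33, 56),
    datetime.datetime(2018, 5, 25, 9, 0, 0),   # hypothetical extra run
]
names = [export_dir(r) for r in runs]

# Sorting the names as plain strings puts the most recent export last
latest = sorted(names)[-1]
```

Keep in mind that anything else stored under the same prefix whose name sorts after export_ would be picked up instead, so it is worth checking the listing before deploying.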

In [ ]:
%bash
MODEL_NAME="babyweight"
MODEL_VERSION="skl"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/babyweight/sklearn/ | tail -1)
echo "Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"
#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}
#gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud alpha ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} \
    --framework SCIKIT_LEARN --runtime-version 1.8  --python-version=2.7

Using the model to predict

Send a JSON request to the service endpoint to make it predict a baby's weight. Note that we need to send an array of numbers in the same column order that was used when training the model. Wrapping the preprocessing in an sklearn Pipeline would let the service do some of that work, but we did our preprocessing with Pandas, so that is not an option here.

So, let's find the order of columns:


In [40]:
import json

data = []
for i in range(2):
  data.append([])
  for col in eval_x:
    # convert from numpy integers to standard integers
    data[i].append(int(np.uint64(eval_x[col][i]).item()))

print(eval_x.columns)
print(json.dumps(data))


Index([u'mother_age', u'gestation_weeks', u'is_male_Unknown', u'is_male_false',
       u'is_male_true', u'plurality_Single', u'plurality_Multiple',
       u'plurality_1', u'plurality_2', u'plurality_3', u'plurality_4',
       u'plurality_5'],
      dtype='object')
[[24, 17, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0], [34, 19, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]]

As long as you send in the data in that order, it will work:


In [35]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials)

request_data = {'instances':
  # [u'mother_age', u'gestation_weeks', u'is_male_Unknown', u'is_male_false',
  #     u'is_male_true', u'plurality_Single', u'plurality_Multiple',
  #     u'plurality_1', u'plurality_2', u'plurality_3', u'plurality_4',
  #     u'plurality_5']
  [[24, 38, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0], 
   [34, 39, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]]
}

parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'babyweight', 'skl')
response = api.projects().predict(body=request_data, name=parent).execute()
print("response={0}".format(response))


response={u'predictions': [7.075709854636022, 7.715287844651161]}
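Hand-writing the one-hot vectors is error-prone, so it can help to build each instance from the raw values. The helper below is a sketch, not part of the trainer package: `encode_instance` is a hypothetical name, and the column list is copied from the eval_x.columns output above.

```python
# Column order copied from the eval_x.columns output in this notebook
COLUMNS = ['mother_age', 'gestation_weeks', 'is_male_Unknown', 'is_male_false',
           'is_male_true', 'plurality_Single', 'plurality_Multiple',
           'plurality_1', 'plurality_2', 'plurality_3', 'plurality_4',
           'plurality_5']


def encode_instance(mother_age, gestation_weeks, is_male, plurality):
    """Build one feature vector in the trained column order (hypothetical helper)."""
    row = {col: 0 for col in COLUMNS}
    row['mother_age'] = int(mother_age)
    row['gestation_weeks'] = int(gestation_weeks)
    # Switch on exactly one is_male_* column and one plurality_* column
    for key in ('is_male_{}'.format(is_male), 'plurality_{}'.format(plurality)):
        if key not in row:
            raise ValueError('unknown category: {}'.format(key))
        row[key] = 1
    return [row[col] for col in COLUMNS]


# The same two instances that were sent in the request above
instances = [encode_instance(24, 38, 'Unknown', 'Single'),
             encode_instance(34, 39, 'true', '1')]
```

The resulting list can be used as the value of 'instances' in the request body.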

Hyperparameter tuning

Let's run a batch of trials, several of them in parallel, to find good values for maxDepth and numTrees.


In [14]:
%writefile hyperparam.yaml
trainingInput:
  hyperparameters:
    goal: MINIMIZE
    maxTrials: 100
    maxParallelTrials: 5
    hyperparameterMetricTag: rmse
    params:
    - parameterName: maxDepth
      type: INTEGER
      minValue: 2
      maxValue: 8
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: numTrees
      type: INTEGER
      minValue: 50
      maxValue: 150
      scaleType: UNIT_LINEAR_SCALE


Writing hyperparam.yaml
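For each trial, the service passes the chosen values to the trainer as command-line flags named after each parameterName, so the names in hyperparam.yaml have to match the flags that task.py defines. The sketch below imitates that hand-off: `trial_argv` is a hypothetical stand-in for what the service does, and the trial values are made up.

```python
import argparse


def build_parser():
    # Only the flags relevant to tuning, with the same names as in task.py
    parser = argparse.ArgumentParser()
    parser.add_argument('--frac', type=float, required=True)
    parser.add_argument('--maxDepth', type=int, default=5)
    parser.add_argument('--numTrees', type=int, default=100)
    return parser


def trial_argv(fixed_args, trial_params):
    # Append one --name=value flag per tuned parameter (illustrative only)
    argv = list(fixed_args)
    for name in sorted(trial_params):
        argv.append('--{}={}'.format(name, trial_params[name]))
    return argv


argv = trial_argv(['--frac=0.01'], {'maxDepth': 6, 'numTrees': 120})
args = build_parser().parse_args(argv)
```

If a parameterName were misspelled in the YAML file, argparse would reject the unknown flag and the trial would fail at startup.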

In [ ]:
%bash
RUNTIME_VERSION="1.8"
PYTHON_VERSION="2.7"
JOB_NAME=babyweight_skl_$(date +"%Y%m%d_%H%M%S")
# the trainer writes timestamped export_* directories under this path
JOB_DIR="gs://$BUCKET/babyweight/sklearn"

gcloud ml-engine jobs submit training $JOB_NAME \
  --job-dir $JOB_DIR \
  --package-path $(pwd)/babyweight/trainer \
  --module-name trainer.task \
  --region us-central1 \
  --runtime-version=$RUNTIME_VERSION \
  --python-version=$PYTHON_VERSION \
  --config=hyperparam.yaml \
  -- \
  --bucket=${BUCKET} --frac=0.01 --projectId $PROJECT

If you go to the GCP console and click on the job, you will see the trial information start to populate, with the lowest-rmse trial listed first. I got the best performance with these settings:

      "hyperparameters": {
        "maxDepth": "8",
        "numTrees": "90"
      },
      "finalMetric": {
        "trainingStep": "1",
        "objectiveValue": 1.03123724461
      }

Train on full dataset

Let's train on the full dataset with these hyperparameters. I am using a larger machine (8 CPUs, 52 GB of memory).


In [21]:
%writefile largemachine.yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: large_model


Writing largemachine.yaml

In [ ]:
%bash

RUNTIME_VERSION="1.8"
PYTHON_VERSION="2.7"
JOB_NAME=babyweight_skl_$(date +"%Y%m%d_%H%M%S")
# the trainer writes timestamped export_* directories directly under this path
JOB_DIR="gs://$BUCKET/babyweight/sklearn"

gcloud ml-engine jobs submit training $JOB_NAME \
  --job-dir $JOB_DIR \
  --package-path $(pwd)/babyweight/trainer \
  --module-name trainer.task \
  --region us-central1 \
  --runtime-version=$RUNTIME_VERSION \
  --python-version=$PYTHON_VERSION \
  --scale-tier=CUSTOM \
  --config=largemachine.yaml \
  -- \
  --bucket=${BUCKET} --frac=1 --projectId $PROJECT --maxDepth 8 --numTrees 90

Copyright 2018 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License

