Structured data prediction using Cloud ML Engine with scikit-learn

This notebook illustrates:

Creating datasets for Machine Learning using BigQuery
Creating a model using scikit-learn
Training on Cloud ML Engine
Deploying model
Predicting with model
Hyperparameter tuning of scikit-learn models

Please see this notebook for more context on this problem and how the features were chosen.



In [1]:

    
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
PROJECTNUMBER = '663413318684'
REGION = 'us-central1'



In [2]:

    
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['PROJECTNUMBER'] = PROJECTNUMBER
os.environ['REGION'] = REGION



In [3]:

    
%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION









    



Updated property [core/project].
Updated property [compute/region].



In [4]:

    
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi



In [138]:

    
%bash
# Pandas will use this privatekey to access BigQuery on our behalf.
# Do NOT check in the private key into git!!!
# if you get a JWT grant error when using this key, create the key via gcp web console in IAM > Service Accounts section
KEYFILE=babyweight/trainer/privatekey.json
if [ ! -f $KEYFILE ]; then
  gcloud iam service-accounts keys create \
      --iam-account ${PROJECTNUMBER}-compute@developer.gserviceaccount.com \
      $KEYFILE
fi



In [21]:

    
KEYDIR='babyweight/trainer'

Exploring dataset

Please see this notebook for more context on this problem and how the features were chosen.



In [76]:

    
#%writefile babyweight/trainer/model.py

# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Creating a ML dataset using BigQuery

We can use BigQuery to create the training and evaluation datasets. Because of the masking (ultrasound vs. no ultrasound), the query itself is a little complex.
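To make the masking concrete, here is a minimal pure-Python sketch of what the two sub-queries do to a single row. The function names and the example row are made up for illustration; the real work happens in the SQL that follows.

```python
def with_ultrasound(row):
    """Keep the sex and the exact plurality, cast to strings as the query does."""
    return {
        "label": row["weight_pounds"],
        "is_male": str(row["is_male"]).lower(),  # BigQuery renders booleans as 'true'/'false'
        "mother_age": row["mother_age"],
        "plurality": str(row["plurality"]),
        "gestation_weeks": row["gestation_weeks"],
    }


def without_ultrasound(row):
    """Mask what is only known from an ultrasound: sex and exact plurality."""
    return {
        "label": row["weight_pounds"],
        "is_male": "Unknown",
        "mother_age": row["mother_age"],
        "plurality": "Multiple" if row["plurality"] > 1 else "Single",
        "gestation_weeks": row["gestation_weeks"],
    }


def preprocess(rows):
    """Mimic UNION ALL: every input row appears once in each form."""
    return [with_ultrasound(r) for r in rows] + [without_ultrasound(r) for r in rows]


# Hypothetical row, not taken from the natality table
example = {"weight_pounds": 7.5, "is_male": True, "mother_age": 28,
           "plurality": 2, "gestation_weeks": 39}
for record in preprocess([example]):
    print(record)
```

Because of the UNION ALL, the model sees each birth twice, once with and once without the ultrasound-only information.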



In [1]:

    
#%writefile -a babyweight/trainer/model.py
def create_queries():
  query_all = """
  WITH with_ultrasound AS (
    SELECT
      weight_pounds AS label,
      CAST(is_male AS STRING) AS is_male,
      mother_age,
      CAST(plurality AS STRING) AS plurality,
      gestation_weeks,
      FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth
    FROM
      publicdata.samples.natality
    WHERE
      year > 2000
      AND gestation_weeks > 0
      AND mother_age > 0
      AND plurality > 0
      AND weight_pounds > 0
  ),

  without_ultrasound AS (
    SELECT
      weight_pounds AS label,
      'Unknown' AS is_male,
      mother_age,
      IF(plurality > 1, 'Multiple', 'Single') AS plurality,
      gestation_weeks,
      FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth
    FROM
      publicdata.samples.natality
    WHERE
      year > 2000
      AND gestation_weeks > 0
      AND mother_age > 0
      AND plurality > 0
      AND weight_pounds > 0
  ),

  preprocessed AS (
    SELECT * from with_ultrasound
    UNION ALL
    SELECT * from without_ultrasound
  )

  SELECT
      label,
      is_male,
      mother_age,
      plurality,
      gestation_weeks
  FROM
      preprocessed
  """

  train_query = "{} WHERE ABS(MOD(hashmonth, 4)) < 3".format(query_all)
  eval_query  = "{} WHERE ABS(MOD(hashmonth, 4)) = 3".format(query_all)
  return train_query, eval_query



In [2]:

    
print(create_queries()[0])









    



  WITH with_ultrasound AS (
    SELECT
      weight_pounds AS label,
      CAST(is_male AS STRING) AS is_male,
      mother_age,
      CAST(plurality AS STRING) AS plurality,
      gestation_weeks,
      FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth
    FROM
      publicdata.samples.natality
    WHERE
      year > 2000
      AND gestation_weeks > 0
      AND mother_age > 0
      AND plurality > 0
      AND weight_pounds > 0
  ),

  without_ultrasound AS (
    SELECT
      weight_pounds AS label,
      'Unknown' AS is_male,
      mother_age,
      IF(plurality > 1, 'Multiple', 'Single') AS plurality,
      gestation_weeks,
      FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth
    FROM
      publicdata.samples.natality
    WHERE
      year > 2000
      AND gestation_weeks > 0
      AND mother_age > 0
      AND plurality > 0
      AND weight_pounds > 0
  ),

  preprocessed AS (
    SELECT * from with_ultrasound
    UNION ALL
    SELECT * from without_ultrasound
  )

  SELECT
      label,
      is_male,
      mother_age,
      plurality,
      gestation_weeks
  FROM
      preprocessed
   WHERE ABS(MOD(hashmonth, 4)) < 3
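The WHERE clause on hashmonth sends every row from the same year and month to the same split, with roughly three quarters of the distinct months going to training. The sketch below illustrates the idea locally. It assumes hashlib.sha256 as a stand-in for FARM_FINGERPRINT (the two produce different values), and the year range is an arbitrary choice for illustration.

```python
import hashlib


def hashmonth(year, month):
    # Stand-in for FARM_FINGERPRINT(CONCAT(year, month)); the hash values
    # differ from BigQuery's, only the bucketing idea is the same.
    key = "{}{}".format(year, month).encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16)


def split_for(year, month):
    # The integer here is non-negative, so no ABS() is needed before the modulo.
    return "train" if hashmonth(year, month) % 4 < 3 else "eval"


# Arbitrary illustrative range of year/month pairs
months = [(y, m) for y in range(2001, 2009) for m in range(1, 13)]
assignment = {ym: split_for(*ym) for ym in months}

n_train = sum(1 for s in assignment.values() if s == "train")
print("train months: {}, eval months: {}".format(n_train, len(months) - n_train))
```

Splitting on a hash of the month rather than on RAND() keeps the split repeatable across runs and keeps all births from one month on the same side.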



In [19]:

    
#%writefile -a babyweight/trainer/model.py
def query_to_dataframe(query):
  import pandas as pd
  import pkgutil
  privatekey = pkgutil.get_data(KEYDIR, 'privatekey.json')
  print(privatekey[:200])
  return pd.read_gbq(query,
                     project_id=PROJECT,
                     dialect='standard',
                     private_key=privatekey)

def create_dataframes(frac):  
  # small dataset for testing
  if frac > 0 and frac < 1:
    sample = " AND RAND() < {}".format(frac)
  else:
    sample = ""

  train_query, eval_query = create_queries()
  train_query = "{} {}".format(train_query, sample)
  eval_query =  "{} {}".format(eval_query, sample)

  train_df = query_to_dataframe(train_query)
  eval_df = query_to_dataframe(eval_query)
  return train_df, eval_df



In [22]:

    
train_df, eval_df = create_dataframes(0.001)
train_df.describe()









    



{
  "type": "service_account",
  "project_id": "cloud-training-demos",
  "private_key_id": "ef88065bb770531b91fb45e31ae60539475547d8",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvgIBADANBgkqhk
{
  "type": "service_account",
  "project_id": "cloud-training-demos",
  "private_key_id": "ef88065bb770531b91fb45e31ae60539475547d8",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvgIBADANBgkqhk






    Out[22]:







  
    
      
              label    mother_age  gestation_weeks
count  52962.000000  52962.000000     52962.000000
mean       7.221937     27.415996        38.581757
std        1.325003      6.148160         2.580018
min        0.500449     12.000000        17.000000
25%        6.563162     23.000000        38.000000
50%        7.312733     27.000000        39.000000
75%        8.062305     32.000000        40.000000
max       13.232145     53.000000        47.000000



In [23]:

    
eval_df.head()









    Out[23]:







  
    
      
      label  is_male  mother_age plurality  gestation_weeks
0  1.124358  Unknown          24    Single               17
1  0.562179    false          34         1               19
2  1.563077     true          18         1               20
3  0.522496    false          18         2               20
4  0.623908    false          21         1               20

Creating a scikit-learn model using random forests

Let's train the model locally.



In [36]:

    
#%writefile -a babyweight/trainer/model.py
def input_fn(indf):
  import copy
  import pandas as pd
  df = copy.deepcopy(indf)

  # one-hot encode the categorical columns
  df["plurality"] = df["plurality"].astype(pd.api.types.CategoricalDtype(
                    categories=["Single","Multiple","1","2","3","4","5"]))
  df["is_male"] = df["is_male"].astype(pd.api.types.CategoricalDtype(
                  categories=["Unknown","false","true"]))
  # features, label
  label = df['label']
  del df['label']
  features = pd.get_dummies(df)
  return features, label



In [37]:

    
train_x, train_y = input_fn(train_df)
print(train_x[:5])
print(train_y[:5])









    



   mother_age  gestation_weeks  is_male_Unknown  is_male_false  is_male_true  \
0          20               17                0              1             0   
1          21               17                0              1             0   
2          26               17                0              0             1   
3          27               17                1              0             0   
4          29               17                1              0             0   

   plurality_Single  plurality_Multiple  plurality_1  plurality_2  \
0                 0                   0            1            0   
1                 0                   0            1            0   
2                 0                   0            0            0   
3                 0                   1            0            0   
4                 1                   0            0            0   

   plurality_3  plurality_4  plurality_5  
0            0            0            0  
1            0            0            0  
2            1            0            0  
3            0            0            0  
4            0            0            0  
0    1.000899
1    1.437414
2    0.518086
3    1.344820
4    0.551156
Name: label, dtype: float64



In [38]:

    
from sklearn.ensemble import RandomForestRegressor
estimator = RandomForestRegressor(max_depth=5, n_estimators=100, random_state=0)
estimator.fit(train_x, train_y)









    Out[38]:





RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=5,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=0, verbose=0, warm_start=False)



In [39]:

    
import numpy as np
eval_x, eval_y = input_fn(eval_df)
eval_pred = estimator.predict(eval_x)
print(eval_pred[1000:1005])
print(eval_y[1000:1005])
print(np.sqrt(np.mean((eval_pred-eval_y)*(eval_pred-eval_y))))









    



[6.10063904 6.10063904 5.19180783 5.20729443 6.10063904]
1000    6.563162
1001    6.415452
1002    5.749656
1003    5.500533
1004    9.751046
Name: label, dtype: float64
1.0421166304081477
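The last line of the cell above computes the root-mean-squared error with NumPy. As a minimal sketch of the same formula in plain Python, applied only to the five rows printed above (so its result will not match the full-dataset figure):

```python
import math


def rmse(predictions, labels):
    """Root-mean-squared error between two equal-length sequences."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# The five predictions and labels printed above (rows 1000-1004)
preds = [6.10063904, 6.10063904, 5.19180783, 5.20729443, 6.10063904]
labels = [6.563162, 6.415452, 5.749656, 5.500533, 9.751046]
print(rmse(preds, labels))
```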



In [142]:

    
#%writefile -a babyweight/trainer/model.py
def train_and_evaluate(frac, max_depth=5, n_estimators=100):
  import numpy as np

  # get data
  train_df, eval_df = create_dataframes(frac)
  train_x, train_y = input_fn(train_df)
  # train
  from sklearn.ensemble import RandomForestRegressor
  estimator = RandomForestRegressor(max_depth=max_depth, n_estimators=n_estimators, random_state=0)
  estimator.fit(train_x, train_y)
  # evaluate
  eval_x, eval_y = input_fn(eval_df)
  eval_pred = estimator.predict(eval_x)
  rmse = np.sqrt(np.mean((eval_pred-eval_y)*(eval_pred-eval_y)))
  print("Eval rmse={}".format(rmse))
  return estimator, rmse



In [72]:

    
#%writefile -a babyweight/trainer/model.py
def save_model(estimator, gcspath, name):
  from sklearn.externals import joblib
  import os, subprocess, datetime
  model = 'model.joblib'
  joblib.dump(estimator, model)
  model_path = os.path.join(gcspath, datetime.datetime.now().strftime(
    'export_%Y%m%d_%H%M%S'), model)
  subprocess.check_call(['gsutil', 'cp', model, model_path])
  return model_path
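The path logic in save_model can be tried without touching Cloud Storage. This is a sketch only: export_path is a hypothetical helper, the bucket name is a placeholder, and posixpath is used so the result has forward slashes on any operating system.

```python
import datetime
import posixpath


def export_path(gcspath, when, filename="model.joblib"):
    # Same timestamp format that save_model uses for its export folder
    folder = when.strftime("export_%Y%m%d_%H%M%S")
    return posixpath.join(gcspath, folder, filename)


base = "gs://my-bucket/babyweight/sklearn"  # placeholder bucket
older = export_path(base, datetime.datetime(2018, 5, 24, 23, 33, 56))
newer = export_path(base, datetime.datetime(2018, 5, 26, 18, 54, 57))

print(older)
print(newer)
# Zero-padded timestamps sort chronologically, so the newest export is last
print(sorted([newer, older])[-1] == newer)
```

Because the folder names sort in time order, a sorted directory listing puts the most recent export at the end.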



In [69]:

    
saved = save_model(estimator, 'gs://{}/babyweight/sklearn'.format(BUCKET), 'babyweight')



In [70]:

    
print(saved)









    



gs://cloud-training-demos-ml/babyweight/sklearn/export_20180524_233356/babyweight.joblib

Packaging up as a Python package

Note the commented-out %writefile lines in the cells above. I uncommented those and ran the cells to write out model.py. The following cell writes out task.py.



In [11]:

    
%writefile babyweight/trainer/task.py
# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os

import hypertune
import model

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--bucket',
        help = 'GCS path to output.',
        required = True
    )
    parser.add_argument(
        '--frac',
        help = 'Fraction of input to process',
        type = float,
        required = True
    )
    parser.add_argument(
        '--maxDepth',
        help = 'Depth of trees',
        type = int,
        default = 5
    )
    parser.add_argument(
        '--numTrees',
        help = 'Number of trees',
        type = int,
        default = 100
    )
    parser.add_argument(
        '--projectId',
        help = 'ID (not name) of your project',
        required = True
    )
    parser.add_argument(
        '--job-dir',
        help = 'output directory for model, automatically provided by gcloud',
        required = True
    )
    
    args = parser.parse_args()
    arguments = args.__dict__
    
    model.PROJECT = arguments['projectId']
    model.KEYDIR  = 'trainer'
    
    estimator, rmse = model.train_and_evaluate(arguments['frac'],
                                         arguments['maxDepth'],
                                         arguments['numTrees']
                                        )
    loc = model.save_model(estimator, 
                           arguments['job_dir'], 'babyweight')
    print("Saved model to {}".format(loc))
    
    # this is for hyperparameter tuning
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag='rmse',
        metric_value=rmse,
        global_step=0)

# done









    



Overwriting babyweight/trainer/task.py
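One detail in task.py that is easy to miss: argparse stores --job-dir under the key job_dir (dashes become underscores), while camelCase flags such as --maxDepth keep their names. A trimmed-down sketch of the parser, with a placeholder bucket name, shows this:

```python
import argparse


def build_parser():
    # Subset of the flags defined in task.py
    parser = argparse.ArgumentParser()
    parser.add_argument('--frac', type=float, required=True)
    parser.add_argument('--maxDepth', type=int, default=5)
    parser.add_argument('--numTrees', type=int, default=100)
    parser.add_argument('--job-dir', required=True)
    return parser


arguments = build_parser().parse_args(
    ['--frac', '0.001', '--job-dir', 'gs://my-bucket/babyweight/sklearn']
).__dict__

# Keys are frac, job_dir, maxDepth and numTrees; defaults fill in the omitted flags
print(sorted(arguments.items()))
```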



In [127]:

    
!pip freeze | grep pandas









    



pandas==0.22.0
pandas-gbq==0.3.0
pandas-profiling==1.4.1
You are using pip version 9.0.3, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.



In [18]:

    
%writefile babyweight/setup.py
# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from setuptools import setup

setup(name='trainer',
      version='1.0',
      description='Natality, with sklearn',
      url='http://github.com/GoogleCloudPlatform/training-data-analyst',
      author='Google',
      author_email='nobody@google.com',
      license='Apache2',
      packages=['trainer'],
      ## WARNING! Do not upload this package to PyPI
      ## BECAUSE it contains a private key
      package_data={'': ['privatekey.json']},
      install_requires=[
          'pandas-gbq==0.3.0',
          'urllib3',
          'google-cloud-bigquery==0.29.0',
          'cloudml-hypertune'
      ],
      zip_safe=False)









    



Overwriting babyweight/setup.py

Try out the package on a subset of the data.



In [ ]:

    
%bash
export PYTHONPATH=${PYTHONPATH}:${PWD}/babyweight
python -m trainer.task \
   --bucket=${BUCKET} --frac=0.001 --job-dir=gs://${BUCKET}/babyweight/sklearn --projectId $PROJECT

Training on Cloud ML Engine

Submit the code to the ML Engine service.



In [ ]:

    
%bash

RUNTIME_VERSION="1.8"
PYTHON_VERSION="2.7"
JOB_NAME=babyweight_skl_$(date +"%Y%m%d_%H%M%S")
# save_model() adds its own timestamped export_* subfolder under this path
JOB_DIR="gs://$BUCKET/babyweight/sklearn"

gcloud ml-engine jobs submit training $JOB_NAME \
  --job-dir $JOB_DIR \
  --package-path $(pwd)/babyweight/trainer \
  --module-name trainer.task \
  --region us-central1 \
  --runtime-version=$RUNTIME_VERSION \
  --python-version=$PYTHON_VERSION \
  -- \
  --bucket=${BUCKET} --frac=0.1 --projectId $PROJECT

The training finished in 20 minutes with an RMSE of 1.05 lbs.

Deploying the trained model

Deploying the trained model to act as a REST web service is a simple gcloud call.



In [8]:

    
%bash
gsutil ls gs://${BUCKET}/babyweight/sklearn/ | tail -1









    



gs://cloud-training-demos-ml/babyweight/sklearn/export_20180526_185457/



In [ ]:

    
%bash
MODEL_NAME="babyweight"
MODEL_VERSION="skl"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/babyweight/sklearn/ | tail -1)
echo "Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"
#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}
#gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud alpha ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} \
    --framework SCIKIT_LEARN --runtime-version 1.8  --python-version=2.7

Using the model to predict

Send a JSON request to the service endpoint to predict a baby's weight. Note that each instance must be an array of numbers in the same column order that was used to train the model. sklearn's Pipeline can carry preprocessing steps along with the model, but we did our preprocessing with Pandas, so that is not an option here.

So, let's find the order of columns:



In [40]:

    
import json

data = []
for i in range(2):
  data.append([])
  for col in eval_x:
    # convert from numpy integers to standard integers
    data[i].append(int(np.uint64(eval_x[col][i]).item()))

print(eval_x.columns)
print(json.dumps(data))









    



Index([u'mother_age', u'gestation_weeks', u'is_male_Unknown', u'is_male_false',
       u'is_male_true', u'plurality_Single', u'plurality_Multiple',
       u'plurality_1', u'plurality_2', u'plurality_3', u'plurality_4',
       u'plurality_5'],
      dtype='object')
[[24, 17, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0], [34, 19, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]]
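If you would rather build instances from raw values than copy rows out of eval_x, the one-hot layout can be reproduced by hand. The helper below is a hypothetical sketch, not part of the trainer package; it hard-codes the column order printed above.

```python
# Column order as printed by eval_x.columns above
COLUMNS = ['mother_age', 'gestation_weeks',
           'is_male_Unknown', 'is_male_false', 'is_male_true',
           'plurality_Single', 'plurality_Multiple',
           'plurality_1', 'plurality_2', 'plurality_3', 'plurality_4',
           'plurality_5']


def encode_instance(mother_age, gestation_weeks, is_male, plurality):
    """Return one prediction instance as a list in training column order.

    is_male is 'Unknown', 'false' or 'true'; plurality is 'Single',
    'Multiple' or '1' through '5' (all strings, as in the dataframe).
    """
    numeric = {'mother_age': mother_age, 'gestation_weeks': gestation_weeks}
    hot = {'is_male_' + is_male, 'plurality_' + plurality}
    unknown = hot - set(COLUMNS)
    if unknown:
        raise ValueError('unknown category: {}'.format(sorted(unknown)))
    return [numeric[c] if c in numeric else int(c in hot) for c in COLUMNS]


print(encode_instance(24, 17, 'Unknown', 'Single'))
print(encode_instance(34, 19, 'false', '1'))
```

The two calls reproduce the two rows of the JSON printed above.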

As long as you send in the data in that order, it will work:



In [35]:

    
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials)

request_data = {'instances':
  # [u'mother_age', u'gestation_weeks', u'is_male_Unknown', u'is_male_false',
  #     u'is_male_true', u'plurality_Single', u'plurality_Multiple',
  #     u'plurality_1', u'plurality_2', u'plurality_3', u'plurality_4',
  #     u'plurality_5']
  [[24, 38, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0], 
   [34, 39, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]]
}

parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'babyweight', 'skl')
response = api.projects().predict(body=request_data, name=parent).execute()
print("response={0}".format(response))









    



response={u'predictions': [7.075709854636022, 7.715287844651161]}

Hyperparameter tuning

Let's run a batch of parallel trials to find good values for maxDepth and numTrees.



In [14]:

    
%writefile hyperparam.yaml
trainingInput:
  hyperparameters:
    goal: MINIMIZE
    maxTrials: 100
    maxParallelTrials: 5
    hyperparameterMetricTag: rmse
    params:
    - parameterName: maxDepth
      type: INTEGER
      minValue: 2
      maxValue: 8
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: numTrees
      type: INTEGER
      minValue: 50
      maxValue: 150
      scaleType: UNIT_LINEAR_SCALE









    



Writing hyperparam.yaml
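To get a feel for the size of this search space: maxDepth has 7 possible values and numTrees has 101, so there are 707 combinations and 100 trials cover about a seventh of them. The sketch below counts the grid and draws a few candidate trials at random. Random sampling is only a stand-in for illustration; the service chooses trials with its own search algorithm.

```python
import random

# Inclusive integer ranges copied from hyperparam.yaml
SEARCH_SPACE = {'maxDepth': (2, 8), 'numTrees': (50, 150)}


def grid_size(space):
    """Number of distinct integer combinations in the search space."""
    size = 1
    for low, high in space.values():
        size *= (high - low + 1)
    return size


def sample_trial(space, rng):
    """Draw one candidate uniformly at random (a stand-in for the service's search)."""
    return {name: rng.randint(low, high) for name, (low, high) in space.items()}


rng = random.Random(0)
print(grid_size(SEARCH_SPACE))
print([sample_trial(SEARCH_SPACE, rng) for _ in range(3)])
```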



In [ ]:

    
%bash
RUNTIME_VERSION="1.8"
PYTHON_VERSION="2.7"
JOB_NAME=babyweight_skl_$(date +"%Y%m%d_%H%M%S")
# save_model() adds its own timestamped export_* subfolder under this path
JOB_DIR="gs://$BUCKET/babyweight/sklearn"

gcloud ml-engine jobs submit training $JOB_NAME \
  --job-dir $JOB_DIR \
  --package-path $(pwd)/babyweight/trainer \
  --module-name trainer.task \
  --region us-central1 \
  --runtime-version=$RUNTIME_VERSION \
  --python-version=$PYTHON_VERSION \
  --config=hyperparam.yaml \
  -- \
  --bucket=${BUCKET} --frac=0.01 --projectId $PROJECT

If you go to the GCP console and click on the job, you will see the trial information start to populate, with the lowest-rmse trial listed first. I got the best performance with these settings:

      "hyperparameters": {
        "maxDepth": "8",
        "numTrees": "90"
      },
      "finalMetric": {
        "trainingStep": "1",
        "objectiveValue": 1.03123724461
      }
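Note that the trial output reports the hyperparameter values as strings. A small sketch of turning the trial shown above into command-line flags for the trainer (to_flags is a hypothetical helper, not part of any Google library):

```python
# The best trial as reported above
best_trial = {
    "hyperparameters": {"maxDepth": "8", "numTrees": "90"},
    "finalMetric": {"trainingStep": "1", "objectiveValue": 1.03123724461},
}


def to_flags(trial):
    """Convert a trial's hyperparameters into a list of trainer flags."""
    flags = []
    for name in sorted(trial["hyperparameters"]):
        value = int(trial["hyperparameters"][name])  # validates the string is an integer
        flags.extend(["--" + name, str(value)])
    return flags


print(" ".join(to_flags(best_trial)))
```

The printed flags can be appended to a training command after the `--` separator.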

Train on full dataset

Let's train on the full dataset with these hyperparameters. I am using a larger machine (8 CPUs, 52 GB of memory).



In [21]:

    
%writefile largemachine.yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: large_model









    



Writing largemachine.yaml



In [ ]:

    
%bash

RUNTIME_VERSION="1.8"
PYTHON_VERSION="2.7"
JOB_NAME=babyweight_skl_$(date +"%Y%m%d_%H%M%S")
# save_model() adds its own timestamped export_* subfolder under this path
JOB_DIR="gs://$BUCKET/babyweight/sklearn"

gcloud ml-engine jobs submit training $JOB_NAME \
  --job-dir $JOB_DIR \
  --package-path $(pwd)/babyweight/trainer \
  --module-name trainer.task \
  --region us-central1 \
  --runtime-version=$RUNTIME_VERSION \
  --python-version=$PYTHON_VERSION \
  --scale-tier=CUSTOM \
  --config=largemachine.yaml \
  -- \
  --bucket=${BUCKET} --frac=1 --projectId $PROJECT --maxDepth 8 --numTrees 90

Copyright 2018 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License


