This notebook demonstrates how to use ML Workbench to create a regression model that accepts numeric and categorical data. It shows the "local run" mode, which does most of the work (except for exporting data from BigQuery) on Datalab's VM, and it uses only 0.3% of the data (about 200K instances). The next notebook demonstrates how to deal with the full data (~70M instances) by running every step in Google Cloud.
Execution of this notebook requires Google Datalab (see setup instructions).
We will use the Chicago Taxi Trips data. From the pickup location, drop-off location, taxi company, and trip start time, the model we build predicts the trip fare.
In [34]:
%%bq query --name taxi_query
SELECT
unique_key,
fare,
CAST(EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS STRING) as weekday,
CAST(EXTRACT(DAYOFYEAR FROM trip_start_timestamp) AS STRING) as day,
CAST(EXTRACT(HOUR FROM trip_start_timestamp) AS STRING) as hour,
pickup_latitude,
pickup_longitude,
dropoff_latitude,
dropoff_longitude,
company
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
fare > 2.0 AND fare < 200.0 AND
pickup_latitude IS NOT NULL AND
pickup_longitude IS NOT NULL AND
dropoff_latitude IS NOT NULL AND
dropoff_longitude IS NOT NULL AND
taxi_id IS NOT NULL
In [35]:
# Sample 0.3% of the data and split it into train/eval sets.
import google.datalab.bigquery as bq
import numpy as np
sampling = bq.Sampling.random(percent=0.3)
job = taxi_query.execute(sampling=sampling)
df = job.result().to_dataframe()
msk = np.random.rand(len(df)) < 0.95  # ~95% of rows go to training, the rest to eval
train_df = df[msk]
eval_df = df[~msk]
In [36]:
print('Training set includes %d instances.' % len(train_df))
print('Eval set includes %d instances.' % len(eval_df))
In [37]:
!mkdir -p ./taxi
In [38]:
train_df.to_csv('./taxi/train.csv', header=False, index=False)
eval_df.to_csv('./taxi/eval.csv', header=False, index=False)
Before we use the data, we need to explore it. In reality, data exploration and feature engineering form an iterative process; for example, the filters in the query above (fare < 200, pickup_latitude IS NOT NULL, etc.) were informed by earlier exploration.
The following %%ml commands define the dataset and then explore it with an overview and a facets view. Note that these views do not show up if this notebook is viewed on GitHub, because they require frontend files that are served from Google Cloud Datalab.
In [39]:
# This loads %%ml commands
import google.datalab.contrib.mlworkbench.commands
In [40]:
%%ml dataset create
format: csv
train: ./taxi/train.csv
eval: ./taxi/eval.csv
name: taxi_data
schema:
- name: unique_key
type: STRING
- name: fare
type: FLOAT
- name: weekday
type: STRING
- name: day
type: STRING
- name: hour
type: STRING
- name: pickup_latitude
type: FLOAT
- name: pickup_longitude
type: FLOAT
- name: dropoff_latitude
type: FLOAT
- name: dropoff_longitude
type: FLOAT
- name: company
type: STRING
In [41]:
%%ml dataset explore --overview
name: taxi_data
In [42]:
%%ml dataset explore --facets
name: taxi_data
The MLWorkbench Magics are a set of Datalab commands that provide an easy, largely code-free experience for training, deploying, and predicting with ML models. They are a collection of magic commands, one for each step of the ML workflow: analyzing input data to build transforms, transforming data, training a model, evaluating a model, and deploying a model. This notebook takes the sampled data and builds a regression model.
For details of each command, run with --help. For example, "%%ml train --help".
When the dataset is small, there is little benefit in using cloud services, so this notebook runs the analyze, transform, and training steps locally. However, we will take the locally trained model, deploy it to ML Engine, and show how to make real predictions against the deployed model. Every MLWorkbench magic can run locally or use cloud services (by adding the --cloud flag).
The next notebook in this sequence shows the cloud version of every command, using the full data.
In [43]:
!rm -r -f ./taxi/analysis # Delete previous run results.
In [44]:
%%ml analyze
output: ./taxi/analysis
data: $taxi_data
features:
unique_key:
transform: key
fare:
transform: target
weekday:
transform: one_hot
day:
transform: one_hot
hour:
transform: one_hot
pickup_latitude:
transform: scale
pickup_longitude:
transform: scale
dropoff_latitude:
transform: scale
dropoff_longitude:
transform: scale
company:
transform: embedding
embedding_dim: 10
Note that in the above "features" config, "target" is required and has to be specified explicitly. "key" means the column is not used as a model feature and is simply passed through to the output. For the other columns, a default transform is chosen if none is specified.
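To build intuition for what these transforms mean, here is a rough, hand-rolled illustration using pandas. This is not MLWorkbench's internal implementation, and the exact formulas it uses may differ: "scale" maps a numeric column to a small, centered range; "one_hot" turns each distinct string value into its own indicator column; "embedding" (used above for company with embedding_dim: 10) maps each distinct string to a learned 10-dimensional vector inside the model.
# Rough illustration of "scale" and "one_hot" on toy values (not MLWorkbench internals).
import pandas as pd

sample = pd.DataFrame({
    'pickup_latitude': [41.88, 41.97, 42.01],  # numeric column -> scale
    'weekday': ['1', '5', '7'],                # categorical column -> one_hot
})

# scale: standardize a numeric column (z-score is one common choice).
scaled = (sample['pickup_latitude'] - sample['pickup_latitude'].mean()) / sample['pickup_latitude'].std()

# one_hot: one 0/1 indicator column per distinct value.
one_hot = pd.get_dummies(sample['weekday'], prefix='weekday')

print(scaled)
print(one_hot)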
This step is optional, as training can start directly from csv data (the same data used in the analysis step). The transform step applies the transformations to the input data and saves the results in a special TensorFlow file format called TFRecord, which contains tf.Example protocol buffers; this allows training to start from preprocessed data. If this step is skipped, training has to perform the same preprocessing on every row of csv data each time it is read, and since TensorFlow reads the same rows multiple times during training, the same row would be preprocessed multiple times. Writing the preprocessed data to disk therefore speeds up training.
We run the transform step for the training and eval data.
In [45]:
!rm -r -f ./taxi/transform # Delete previous run results.
In [46]:
%%ml transform
output: ./taxi/transform
analysis: ./taxi/analysis
data: $taxi_data
Now define the transformed dataset.
In [47]:
%%ml dataset create
format: transformed
name: taxi_transformed_data
train: ./taxi/transform/train-*
eval: ./taxi/transform/eval-*
In [48]:
%%ml dataset explore
name: taxi_transformed_data
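If you want to verify what the transform step wrote, you can peek at a single raw record directly. This is a minimal sketch, assuming the TensorFlow 1.x APIs bundled with Datalab; the output files may or may not be gzip-compressed, so the reader options are chosen from the file name.
import glob
import tensorflow as tf

# Pick one transformed eval file and read a single tf.Example record from it.
path = sorted(glob.glob('./taxi/transform/eval-*'))[0]
options = (tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
           if path.endswith('.gz') else None)
for record in tf.python_io.tf_record_iterator(path, options=options):
    print(tf.train.Example.FromString(record))
    break  # one record is enough to see the transformed feature names and values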
In [49]:
# Delete previous run results.
!rm -r -f ./taxi/linear_train
!rm -r -f ./taxi/dnn_train
In [50]:
%%ml train
output: ./taxi/linear_train
analysis: ./taxi/analysis
data: $taxi_transformed_data
model_args:
model: linear_regression
learning-rate: 0.1
max-steps: 30000
You can click the TensorBoard link to monitor training progress.
In TensorBoard, the last eval loss value is 50.9953, so the RMSE is around 7.14 (sqrt(50.9953) ≈ 7.14).
Alternatively, you can plot the training events inside the notebook for sharing or presentation.
In [51]:
from google.datalab.ml import Summary
summary = Summary('./taxi/linear_train')
summary.plot('loss')
An RMSE of 7.14 is not very impressive. Let's see if we can do better with a DNN regression model. Note that this time we added a few parameters (hidden-layer-size1, hidden-layer-size2); for DNN models, you need to provide the number and size of the hidden layers. Also, max-steps is omitted, which means training will run until it detects that the eval loss is no longer decreasing, or until it hits the epoch limit (1000).
In [52]:
%%ml train
output: ./taxi/dnn_train
analysis: ./taxi/analysis
data: $taxi_transformed_data
model_args:
model: dnn_regression
hidden-layer-size1: 200
hidden-layer-size2: 100
In [53]:
summary = Summary('./taxi/dnn_train')
summary.plot('loss')
Loss = 13.79, so the RMSE is about 3.71. The DNN model performs much better than the linear one. This is not surprising, because the trip fare is probably not "very" linear in any of the features; we need some non-linearity in the model, which is exactly what a DNN's activations provide.
In [54]:
!rm -r -f ./taxi/batch_predict # Delete previous results.
There are two model directories under our training directory: "evaluation_model" and "model". The difference is that the evaluation model expects input that includes the target (truth) column, while the regular model expects no target column. The evaluation model passes the input target value through to its output, so it emits both the target and the predicted value, which makes it convenient for model evaluation, as illustrated below.
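To make the difference concrete, here are the two shapes of a single csv instance (the fare value 9.25 is made up for illustration; the column order follows the dataset schema above):
# evaluation_model input: includes the fare (target) column, as in eval.csv
144b42f903352f760b969b3a7bca941fa7474b26,9.25,4,289,22,42.009018227,-87.672723959,42.009018227,-87.672723959,
# model input: the same instance with the fare column removed
144b42f903352f760b969b3a7bca941fa7474b26,4,289,22,42.009018227,-87.672723959,42.009018227,-87.672723959,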
In [55]:
!ls ./taxi/dnn_train/
In [56]:
%%ml batch_predict
model: ./taxi/dnn_train/evaluation_model/
output: ./taxi/batch_predict
format: csv
data:
csv: ./taxi/eval.csv
In [57]:
!ls ./taxi/batch_predict
Note that the "predict_results_schema.json" file includes the csv schema of "predict_results_eval.csv".
In [58]:
%%ml evaluate regression
csv: ./taxi/batch_predict/predict_results_eval.csv
Out[58]:
In [59]:
%%ml predict
model: ./taxi/dnn_train/model/
data:
- 144b42f903352f760b969b3a7bca941fa7474b26,4,289,22,42.009018227,-87.672723959,42.009018227,-87.672723959,
- 2c09f875e5a58220344e717c4276fd322ff3c3e6,1,307,0,41.912364354,-87.675062757,41.963374382,-87.67018455,Taxi Affiliation Services
- b352a154e8670f35d4050d35be6b8c73222854fc,7,214,14,41.912364354,-87.675062757,41.891971508,-87.612945414,Taxi Affiliation Services
- 2e84ad9967c1a07de42582679a2891b2ecacd3b0,7,38,1,41.912364354,-87.675062757,41.921877461,-87.66407824,
In [60]:
# Create a staging GCS bucket
!gsutil mb gs://datalab-taxi-local-model-staging
In [61]:
# Copy model files over.
!gsutil -m cp -r ./taxi/dnn_train/model gs://datalab-taxi-local-model-staging/model
In [62]:
%%ml model deploy
name: chicago_taxi.v1
path: gs://datalab-taxi-local-model-staging/model
A common task is to call a deployed model from other applications. Below is an example of a Python client that runs prediction outside of Datalab.
For more information about model permissions, see https://cloud.google.com/ml-engine/docs/tutorials/python-guide and https://developers.google.com/identity/protocols/application-default-credentials .
In [63]:
import json
import google.datalab
from oauth2client.client import GoogleCredentials
from googleapiclient import discovery
from googleapiclient import errors
# Store your project ID, model name, and version name in the format the API needs.
api_path = 'projects/{your_project_ID}/models/{model_name}/versions/{version_name}'.format(
your_project_ID=google.datalab.Context.default().project_id,
model_name='chicago_taxi',
version_name='v1')
# Get application default credentials (possible only if the gcloud tool is
# configured on your machine). See https://developers.google.com/identity/protocols/application-default-credentials
# for more info.
credentials = GoogleCredentials.get_application_default()
# Build a representation of the Cloud ML API.
ml = discovery.build('ml', 'v1', credentials=credentials)
# Create a dictionary containing data to predict.
# Note that the data is a list of csv strings.
body = {
'instances': [
'cacd255b228cae40828feb8575b7d51d01f7c30e,7,201,21,41.912364354,-87.675062757,41.892042136,-87.63186395,',
'd41200b7ad9f1ae499a27eacec13ccebd3f227e4,1,327,0,41.912364354,-87.675062757,41.949060526,-87.661642904,Northwest Management LLC',
'd36e0da792ff7d075a31460945a473fd91f1770b,6,262,19,41.912364354,-87.675062757,41.914747305,-87.654007029,',
]
}
# Create a request
request = ml.projects().predict(
name=api_path,
body=body)
# Make the call.
try:
response = request.execute()
print('\nThe response:\n')
print(json.dumps(response, indent=2))
except errors.HttpError as err:
# Something went wrong, print out some information.
print('There was an error. Check the details:')
print(err._get_reason())
In [64]:
%%ml model delete
name: chicago_taxi.v1
In [65]:
%%ml model delete
name: chicago_taxi
In [66]:
# Delete the GCS bucket
!gsutil -m rm -r gs://datalab-taxi-local-model-staging
In [ ]: