This notebook demonstrates how to use ML Workbench to create a regression model that accepts numeric and categorical data. It shows the "local run" mode, which does most of the work (except for exporting data from BigQuery) on Datalab's VM, and it uses only 0.3% of the data (about 200K instances). The next notebook demonstrates how to deal with the full data (~70M instances) by running every step in Google Cloud.
Execution of this notebook requires Google Datalab (see setup instructions).
We will use the Chicago Taxi Trips data. From the pickup location, drop-off location, taxi company, and trip start time, the model we build predicts the trip fare.
In [34]:
%%bq query --name taxi_query
SELECT
unique_key,
fare,
CAST(EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS STRING) as weekday,
CAST(EXTRACT(DAYOFYEAR FROM trip_start_timestamp) AS STRING) as day,
CAST(EXTRACT(HOUR FROM trip_start_timestamp) AS STRING) as hour,
pickup_latitude,
pickup_longitude,
dropoff_latitude,
dropoff_longitude,
company
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
fare > 2.0 AND fare < 200.0 AND
pickup_latitude IS NOT NULL AND
pickup_longitude IS NOT NULL AND
dropoff_latitude IS NOT NULL AND
dropoff_longitude IS NOT NULL AND
taxi_id IS NOT NULL
In [35]:
# Sample 0.3% of the data and split it into train/eval sets.
import google.datalab.bigquery as bq
import numpy as np
sampling = bq.Sampling.random(percent=0.3)
job = taxi_query.execute(sampling=sampling)
df = job.result().to_dataframe()
msk = np.random.rand(len(df)) < 0.95  # ~95% of rows go to training, the rest to eval
train_df = df[msk]
eval_df = df[~msk]
In [36]:
print('Training set includes %d instances.' % len(train_df))
print('Eval set includes %d instances.' % len(eval_df))
In [37]:
!mkdir -p ./taxi
In [38]:
train_df.to_csv('./taxi/train.csv', header=False, index=False)
eval_df.to_csv('./taxi/eval.csv', header=False, index=False)
Before we use the data, we need to explore it. In reality, data exploration and feature engineering form an iterative process; for example, the filters in the query above (fare < 200, pickup_latitude IS NOT NULL, etc.) were informed by earlier exploration.
The following %%ml commands define the dataset and then explore it with an overview and a facets view. Note that these views do not show up if this notebook is viewed on GitHub, because they require frontend files that are served from Google Cloud Datalab.
In [39]:
# This loads %%ml commands
import google.datalab.contrib.mlworkbench.commands
In [40]:
%%ml dataset create
format: csv
train: ./taxi/train.csv
eval: ./taxi/eval.csv
name: taxi_data
schema:
- name: unique_key
type: STRING
- name: fare
type: FLOAT
- name: weekday
type: STRING
- name: day
type: STRING
- name: hour
type: STRING
- name: pickup_latitude
type: FLOAT
- name: pickup_longitude
type: FLOAT
- name: dropoff_latitude
type: FLOAT
- name: dropoff_longitude
type: FLOAT
- name: company
type: STRING
In [41]:
%%ml dataset explore --overview
name: taxi_data
In [42]:
%%ml dataset explore --facets
name: taxi_data
The MLWorkbench Magics are a set of Datalab commands that provide an easy, largely code-free experience for training, deploying, and predicting with ML models. They are a collection of magic commands, one for each step of the ML workflow: analyzing input data to build transforms, transforming data, training a model, evaluating a model, and deploying a model. This notebook takes the sampled data and builds a regression model.
For details of each command, run with --help. For example, "%%ml train --help".
When the dataset is small, there is little benefit in using cloud services, so this notebook runs the analyze, transform, and training steps locally. However, we will take the locally trained model, deploy it to ML Engine, and show how to make real predictions against the deployed model. Every MLWorkbench magic can run locally or use cloud services (by adding the --cloud flag).
The next notebook in this sequence shows the cloud version of every command, using the full data.
In [43]:
!rm -r -f ./taxi/analysis # Delete previous run results.
In [44]:
%%ml analyze
output: ./taxi/analysis
data: $taxi_data
features:
unique_key:
transform: key
fare:
transform: target
weekday:
transform: one_hot
day:
transform: one_hot
hour:
transform: one_hot
pickup_latitude:
transform: scale
pickup_longitude:
transform: scale
dropoff_latitude:
transform: scale
dropoff_longitude:
transform: scale
company:
transform: embedding
embedding_dim: 10
Note that in the above "features" config, "target" is required and has to be specified explicitly. "key" means the column is not used as a model feature and is simply passed through to the output. For the other columns, a default transform is chosen if none is specified.
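To build intuition for what these transforms mean, here is a rough, hand-rolled illustration using pandas. This is not MLWorkbench's internal implementation, and the exact formulas it uses may differ: "scale" maps a numeric column to a small, centered range; "one_hot" turns each distinct string value into its own indicator column; "embedding" (used above for company with embedding_dim: 10) maps each distinct string to a learned 10-dimensional vector inside the model.
# Rough illustration of "scale" and "one_hot" on toy values (not MLWorkbench internals).
import pandas as pd

sample = pd.DataFrame({
    'pickup_latitude': [41.88, 41.97, 42.01],  # numeric column -> scale
    'weekday': ['1', '5', '7'],                # categorical column -> one_hot
})

# scale: standardize a numeric column (z-score is one common choice).
scaled = (sample['pickup_latitude'] - sample['pickup_latitude'].mean()) / sample['pickup_latitude'].std()

# one_hot: one 0/1 indicator column per distinct value.
one_hot = pd.get_dummies(sample['weekday'], prefix='weekday')

print(scaled)
print(one_hot)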
This step is optional, as training can start directly from csv data (the same data used in the analysis step). The transform step applies the transformations to the input data and saves the results in a special TensorFlow file format called TFRecord, which contains tf.Example protocol buffers; this allows training to start from preprocessed data. If this step is skipped, training has to perform the same preprocessing on every row of csv data each time it is read, and since TensorFlow reads the same rows multiple times during training, the same row would be preprocessed multiple times. Writing the preprocessed data to disk therefore speeds up training.
We run the transform step for the training and eval data.
In [45]:
!rm -r -f ./taxi/transform # Delete previous run results.
In [46]:
%%ml transform
output: ./taxi/transform
analysis: ./taxi/analysis
data: $taxi_data
Now define the transformed dataset.
In [47]:
%%ml dataset create
format: transformed
name: taxi_transformed_data
train: ./taxi/transform/train-*
eval: ./taxi/transform/eval-*
In [48]:
%%ml dataset explore
name: taxi_transformed_data
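If you want to verify what the transform step wrote, you can peek at a single raw record directly. This is a minimal sketch, assuming the TensorFlow 1.x APIs bundled with Datalab; the output files may or may not be gzip-compressed, so the reader options are chosen from the file name.
import glob
import tensorflow as tf

# Pick one transformed eval file and read a single tf.Example record from it.
path = sorted(glob.glob('./taxi/transform/eval-*'))[0]
options = (tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
           if path.endswith('.gz') else None)
for record in tf.python_io.tf_record_iterator(path, options=options):
    print(tf.train.Example.FromString(record))
    break  # one record is enough to see the transformed feature names and values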
In [49]:
# Delete previous run results.
!rm -r -f ./taxi/linear_train
!rm -r -f ./taxi/dnn_train
In [50]:
%%ml train
output: ./taxi/linear_train
analysis: ./taxi/analysis
data: $taxi_transformed_data
model_args:
model: linear_regression
learning-rate: 0.1
max-steps: 30000
You can click the TensorBoard link to monitor training progress.
In TensorBoard, the last eval loss value is 50.9953, so the RMSE is around 7.14 (sqrt(50.9953) ≈ 7.14).
Alternatively, you can plot the training events inside the notebook for sharing or presentation.
In [51]:
from google.datalab.ml import Summary
summary = Summary('./taxi/linear_train')
summary.plot('loss')
An RMSE of 7.14 is not very impressive. Let's see if we can do better with a DNN regression model. Note that this time we added a few parameters (hidden-layer-size1, hidden-layer-size2); for DNN models, you need to provide the number and size of the hidden layers. Also, max-steps is omitted, which means training will run until it detects that the eval loss is no longer decreasing, or until it hits the epoch limit (1000).
In [52]:
%%ml train
output: ./taxi/dnn_train
analysis: ./taxi/analysis
data: $taxi_transformed_data
model_args:
model: dnn_regression
hidden-layer-size1: 200
hidden-layer-size2: 100
In [53]:
summary = Summary('./taxi/dnn_train')
summary.plot('loss')
Loss = 13.79, so the RMSE is about 3.71. The DNN model performs much better than the linear one. This is not surprising, because the trip fare is probably not "very" linear in any of the features; we need some non-linearity in the model, which is exactly what a DNN's activations provide.
In [54]:
!rm -r -f ./taxi/batch_predict # Delete previous results.
There are two model directories under our training directory: "evaluation_model" and "model". The difference is that the evaluation model expects input that includes the target (truth) column, while the regular model expects no target column. The evaluation model passes the input target value through to its output, so it emits both the target and the predicted value, which makes it convenient for model evaluation, as illustrated below.
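To make the difference concrete, here are the two shapes of a single csv instance (the fare value 9.25 is made up for illustration; the column order follows the dataset schema above):
# evaluation_model input: includes the fare (target) column, as in eval.csv
144b42f903352f760b969b3a7bca941fa7474b26,9.25,4,289,22,42.009018227,-87.672723959,42.009018227,-87.672723959,
# model input: the same instance with the fare column removed
144b42f903352f760b969b3a7bca941fa7474b26,4,289,22,42.009018227,-87.672723959,42.009018227,-87.672723959,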
In [55]:
!ls ./taxi/dnn_train/
In [56]:
%%ml batch_predict
model: ./taxi/dnn_train/evaluation_model/
output: ./taxi/batch_predict
format: csv
data:
csv: ./taxi/eval.csv
In [57]:
!ls ./taxi/batch_predict
Note that the "predict_results_schema.json" file includes the csv schema of "predict_results_eval.csv".
In [58]:
%%ml evaluate regression
csv: ./taxi/batch_predict/predict_results_eval.csv
Out[58]:
In [59]:
%%ml predict
model: ./taxi/dnn_train/model/
data:
- 144b42f903352f760b969b3a7bca941fa7474b26,4,289,22,42.009018227,-87.672723959,42.009018227,-87.672723959,
- 2c09f875e5a58220344e717c4276fd322ff3c3e6,1,307,0,41.912364354,-87.675062757,41.963374382,-87.67018455,Taxi Affiliation Services
- b352a154e8670f35d4050d35be6b8c73222854fc,7,214,14,41.912364354,-87.675062757,41.891971508,-87.612945414,Taxi Affiliation Services
- 2e84ad9967c1a07de42582679a2891b2ecacd3b0,7,38,1,41.912364354,-87.675062757,41.921877461,-87.66407824,
In [60]:
# Create a staging GCS bucket
!gsutil mb gs://datalab-taxi-local-model-staging
In [61]:
# Copy model files over.
!gsutil -m cp -r ./taxi/dnn_train/model gs://datalab-taxi-local-model-staging/model
In [62]:
%%ml model deploy
name: chicago_taxi.v1
path: gs://datalab-taxi-local-model-staging/model
A common task is to call a deployed model from other applications. Below is an example of a Python client that runs prediction outside of Datalab.
For more information about model permissions, see https://cloud.google.com/ml-engine/docs/tutorials/python-guide and https://developers.google.com/identity/protocols/application-default-credentials .
In [63]:
import json
import google.datalab
from oauth2client.client import GoogleCredentials
from googleapiclient import discovery
from googleapiclient import errors
# Store your project ID, model name, and version name in the format the API needs.
api_path = 'projects/{your_project_ID}/models/{model_name}/versions/{version_name}'.format(
your_project_ID=google.datalab.Context.default().project_id,
model_name='chicago_taxi',
version_name='v1')
# Get application default credentials (possible only if the gcloud tool is
# configured on your machine). See https://developers.google.com/identity/protocols/application-default-credentials
# for more info.
credentials = GoogleCredentials.get_application_default()
# Build a representation of the Cloud ML API.
ml = discovery.build('ml', 'v1', credentials=credentials)
# Create a dictionary containing data to predict.
# Note that the data is a list of csv strings.
body = {
'instances': [
'cacd255b228cae40828feb8575b7d51d01f7c30e,7,201,21,41.912364354,-87.675062757,41.892042136,-87.63186395,',
'd41200b7ad9f1ae499a27eacec13ccebd3f227e4,1,327,0,41.912364354,-87.675062757,41.949060526,-87.661642904,Northwest Management LLC',
'd36e0da792ff7d075a31460945a473fd91f1770b,6,262,19,41.912364354,-87.675062757,41.914747305,-87.654007029,',
]
}
# Create a request
request = ml.projects().predict(
name=api_path,
body=body)
# Make the call.
try:
response = request.execute()
print('\nThe response:\n')
print(json.dumps(response, indent=2))
except errors.HttpError as err:
# Something went wrong, print out some information.
print('There was an error. Check the details:')
print(err._get_reason())
In [64]:
%%ml model delete
name: chicago_taxi.v1
In [65]:
%%ml model delete
name: chicago_taxi
In [66]:
# Delete the GCS bucket
!gsutil -m rm -r gs://datalab-taxi-local-model-staging
In [ ]: