About This Notebook

This notebook demonstrates how to use ML Workbench to create a regression model that accepts numeric and categorical data. It shows "cloud run" mode, which performs each step on Google Cloud Platform using managed services. Cloud runs are distributed, so they can handle large data without being restricted by memory, computation, or disk limits. The notebook is similar to the previous one (Taxi Fare Model (small data)), but it uses the full dataset (about 77M instances).

Only a few things need to change between "local run" and "cloud run":

  • All data sources and file paths must be on Google Cloud Storage (GCS).
  • The --cloud flag must be set for each step.
  • "cloud_config" can be set for cloud-specific settings, such as project_id and machine_type. In some cases it is required.

Other than that, nothing changes from local to cloud!

Note: "Run all cells" does not work for this notebook because the steps are asynchonous. In many steps it submits a cloud job, and you should track the status by following the job link.

Execution of this notebook requires Google Datalab (see setup instructions).

The Data

We will use the Chicago Taxi Trips dataset. Using pickup location, drop-off location, and taxi company, the model we build predicts the trip fare.

Split Data Into Train/Eval Sets

Use BigQuery to select the features we need, randomly assigning 5% of the instances to the eval set and 95% to the training set.


In [27]:
%%bq query --name texi_query_eval
SELECT
  unique_key,
  fare,
  CAST(EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS STRING) as weekday,
  CAST(EXTRACT(DAYOFYEAR FROM trip_start_timestamp) AS STRING) as day,
  CAST(EXTRACT(HOUR FROM trip_start_timestamp) AS STRING) as hour,
  pickup_latitude,
  pickup_longitude,
  dropoff_latitude,
  dropoff_longitude,
  company
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE 
  fare > 2.0 AND fare < 200.0 AND
  pickup_latitude IS NOT NULL AND
  pickup_longitude IS NOT NULL AND
  dropoff_latitude IS NOT NULL AND
  dropoff_longitude IS NOT NULL AND
  MOD(ABS(FARM_FINGERPRINT(unique_key)), 100) < 5



In [28]:
%%bq query --name texi_query_train
SELECT
  unique_key,
  fare,
  CAST(EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS STRING) as weekday,
  CAST(EXTRACT(DAYOFYEAR FROM trip_start_timestamp) AS STRING) as day,
  CAST(EXTRACT(HOUR FROM trip_start_timestamp) AS STRING) as hour,
  pickup_latitude,
  pickup_longitude,
  dropoff_latitude,
  dropoff_longitude,
  company
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE 
  fare > 2.0 AND fare < 200.0 AND
  pickup_latitude IS NOT NULL AND
  pickup_longitude IS NOT NULL AND
  dropoff_latitude IS NOT NULL AND
  dropoff_longitude IS NOT NULL AND
  MOD(ABS(FARM_FINGERPRINT(unique_key)), 100) >= 5
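
The FARM_FINGERPRINT clause above implements a deterministic hash split: each unique_key always lands in the same bucket, so the train and eval sets never overlap and the split is reproducible across runs. A minimal Python sketch of the same idea, substituting MD5 for FARM_FINGERPRINT (a different hash, but with the same stable-split behavior):

```python
import hashlib

def split_bucket(unique_key, num_buckets=100):
    """Map a key to a stable bucket in [0, num_buckets).

    BigQuery uses FARM_FINGERPRINT; MD5 here is a stand-in, not the
    same hash, so bucket assignments will differ from BigQuery's.
    """
    digest = hashlib.md5(unique_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def assign(unique_key, eval_percent=5):
    # Keys hashing below the threshold go to eval, the rest to train.
    return "eval" if split_bucket(unique_key) < eval_percent else "train"

keys = ["key-%d" % i for i in range(1000)]
eval_count = sum(assign(k) == "eval" for k in keys)
# Roughly 5% of keys land in eval, and the assignment is stable per key.
```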


Create "chicago_taxi.train" and "chicago_taxi.eval" BQ tables to store results.


In [29]:
%%bq datasets create --name chicago_taxi



In [30]:
%%bq execute
query: texi_query_eval
table: chicago_taxi.eval
mode: overwrite


Out[30]:
unique_key | fare | weekday | day | hour | pickup_latitude | pickup_longitude | dropoff_latitude | dropoff_longitude | company
(25 sample rows omitted)

(rows: 3585149, time: 51.7s, 9GB processed, job: job_BPas0uNDL2FpMAydl51JoD1V3CED)

In [31]:
%%bq execute
query: texi_query_train
table: chicago_taxi.train
mode: overwrite


Out[31]:
unique_key | fare | weekday | day | hour | pickup_latitude | pickup_longitude | dropoff_latitude | dropoff_longitude | company
(25 sample rows omitted)

(rows: 68126775, time: 66.1s, 9GB processed, job: job_XY_3EdOUA4H4U-H1mmrmxnjGNXgw)

Sanity check on the data.


In [32]:
%%bq query
SELECT count(*) FROM chicago_taxi.train


Out[32]:
f0_
68126775

(rows: 1, time: 1.7s, 0B processed, job: job_6KM__IJkMn19rntxw102Zo5365PA)

In [10]:
%%bq query
SELECT count(*) FROM chicago_taxi.eval


Out[10]:
f0_
3585149

(rows: 1, time: 0.5s, cached, job: job_zNwX2dEzvBNHEC3TN36z6IcFs7Th)

Explore Data

See previous notebook (Taxi Fare Model (small data)) for data exploration.

Create Model with ML Workbench

The ML Workbench magics are a set of Datalab commands that provide an easy, code-free experience for training, deploying, and predicting with ML models. This notebook takes the data in the BigQuery tables above and builds a regression model. There is a magic command for each step of the ML workflow: analyzing input data to build transforms, transforming data, training a model, evaluating a model, and deploying a model.

For details of each command, run it with --help. For example, "%%ml train --help".

This notebook runs the analyze, transform, and training steps in the cloud with services. Notice that the "--cloud" flag is set for each step.


In [3]:
import google.datalab.contrib.mlworkbench.commands # this loads the %%ml commands



In [35]:
%%ml dataset create
name: taxi_data_full
format: bigquery
train: chicago_taxi.train
eval: chicago_taxi.eval



In [ ]:
!gsutil mb gs://datalab-chicago-taxi-demo # Create a Storage Bucket to store results.

Step 1: Analyze

The first step in the ML Workbench workflow is to analyze the data for the requested transformations. In this case, analysis builds a vocabulary for each categorical feature and computes numeric statistics for each numeric feature.
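
Conceptually, the analysis output looks like the sketch below: a vocabulary (distinct values, most frequent first) per categorical column, and summary statistics per numeric column. This is a toy in-memory version only; the real %%ml analyze step computes the same kind of outputs at scale with cloud services.

```python
import math

def analyze(rows, categorical, numeric):
    """Toy version of the analyze step: build a vocabulary for each
    categorical column and summary stats for each numeric column."""
    stats = {}
    for col in categorical:
        # Vocabulary: distinct values, most frequent first.
        counts = {}
        for r in rows:
            counts[r[col]] = counts.get(r[col], 0) + 1
        stats[col] = {"vocab": sorted(counts, key=counts.get, reverse=True)}
    for col in numeric:
        vals = [r[col] for r in rows]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats[col] = {"min": min(vals), "max": max(vals),
                      "mean": mean, "stddev": math.sqrt(var)}
    return stats

rows = [{"hour": "7", "fare": 5.0}, {"hour": "7", "fare": 9.0},
        {"hour": "22", "fare": 7.0}]
s = analyze(rows, categorical=["hour"], numeric=["fare"])
# s["hour"]["vocab"] -> ["7", "22"]; s["fare"]["mean"] -> 7.0
```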


In [ ]:
!gsutil rm -r -f gs://datalab-chicago-taxi-demo/analysis # Remove previous analysis results if any

In [38]:
%%ml analyze --cloud
output: gs://datalab-chicago-taxi-demo/analysis
data: taxi_data_full
features:
  unique_key:
    transform: key
  fare:
    transform: target         
  company:
    transform: embedding
    embedding_dim: 10
  weekday:
    transform: one_hot
  day:
    transform: one_hot
  hour:
    transform: one_hot
  pickup_latitude:
    transform: scale    
  pickup_longitude:
    transform: scale
  dropoff_latitude:
    transform: scale
  dropoff_longitude:
    transform: scale


Analyzing column fare...
Updated property [core/project].
column fare analyzed.
Analyzing column hour...
Updated property [core/project].
column hour analyzed.
Analyzing column company...
Updated property [core/project].
column company analyzed.
Analyzing column pickup_longitude...
Updated property [core/project].
column pickup_longitude analyzed.
Analyzing column day...
Updated property [core/project].
column day analyzed.
Analyzing column dropoff_longitude...
Updated property [core/project].
column dropoff_longitude analyzed.
Analyzing column weekday...
Updated property [core/project].
column weekday analyzed.
Analyzing column pickup_latitude...
Updated property [core/project].
column pickup_latitude analyzed.
Analyzing column dropoff_latitude...
Updated property [core/project].
column dropoff_latitude analyzed.
Updated property [core/project].

Step 2: Transform

The transform step performs the requested transformations on the input data and saves the results to a special TensorFlow file format called TFRecord, containing TF.Example protocol buffers. This allows training to start from preprocessed data. Without this step, training would have to perform the same preprocessing on every input row each time it is read. Since TensorFlow reads the same rows multiple times during training, the same row would be preprocessed multiple times. Writing the preprocessed data to disk therefore speeds up training.

The transform step is required if your source data is in a BigQuery table.

We run the transform step for both the training and eval data.
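
The per-column transformations can be sketched in plain Python. This is a simplified approximation, not ML Workbench's actual implementation; in particular, the assumption that "scale" maps values into [-1, 1] via min-max scaling may not match the exact scheme the service uses.

```python
def scale(value, vmin, vmax):
    # Min-max scale into [-1, 1]. This range is an assumption made for
    # illustration; ML Workbench's 'scale' transform may differ.
    return 2.0 * (value - vmin) / (vmax - vmin) - 1.0

def one_hot(value, vocab):
    # Encode a categorical value as a one-hot vector over the analyzed
    # vocabulary (here materialized eagerly for clarity).
    vec = [0] * len(vocab)
    vec[vocab.index(value)] = 1
    return vec

vocab = ["7", "22"]  # hypothetical vocabulary from the analyze step
print(scale(7.0, vmin=2.0, vmax=200.0))  # about -0.949
print(one_hot("22", vocab))              # [0, 1]
```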


In [ ]:
!gsutil -m rm -r -f gs://datalab-chicago-taxi-demo/transform # Remove previous transform results if any.

Transform takes about 6 hours in the cloud. The data is fairly big (33GB), and processing it locally on a single VM would take much longer.


In [40]:
%%ml transform --cloud
output: gs://datalab-chicago-taxi-demo/transform
analysis: gs://datalab-chicago-taxi-demo/analysis
data: taxi_data_full


/usr/local/lib/python2.7/dist-packages/apache_beam/coders/typecoders.py:135: UserWarning: Using fallback coder for typehint: Any.
  warnings.warn('Using fallback coder for typehint: %r.' % typehint)
running sdist
running egg_info
writing requirements to trainer.egg-info/requires.txt
writing trainer.egg-info/PKG-INFO
writing top-level names to trainer.egg-info/top_level.txt
writing dependency_links to trainer.egg-info/dependency_links.txt
reading manifest file 'trainer.egg-info/SOURCES.txt'
writing manifest file 'trainer.egg-info/SOURCES.txt'
warning: sdist: standard file not found: should have one of README, README.rst, README.txt, README.md

running check
warning: check: missing required meta-data: url

creating trainer-1.0.0
creating trainer-1.0.0/trainer
creating trainer-1.0.0/trainer.egg-info
copying files to trainer-1.0.0...
copying setup.py -> trainer-1.0.0
copying trainer/__init__.py -> trainer-1.0.0/trainer
copying trainer/feature_analysis.py -> trainer-1.0.0/trainer
copying trainer/feature_transforms.py -> trainer-1.0.0/trainer
copying trainer/task.py -> trainer-1.0.0/trainer
copying trainer.egg-info/PKG-INFO -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/SOURCES.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/dependency_links.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/requires.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/top_level.txt -> trainer-1.0.0/trainer.egg-info
Writing trainer-1.0.0/setup.cfg
Creating tar archive
removing 'trainer-1.0.0' (and everything under it)
DEPRECATION: pip install --download has been deprecated and will be removed in the future. Pip now has a download command that should be used instead.
Collecting google-cloud-dataflow==2.0.0
  Downloading google-cloud-dataflow-2.0.0.tar.gz (576kB)
  Saved /tmp/tmp0aFKBX/google-cloud-dataflow-2.0.0.tar.gz
Successfully downloaded google-cloud-dataflow
/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py:113: DeprecationWarning: object() takes no parameters
  super(GcsIO, cls).__new__(cls, storage_client))
View job at https://console.developers.google.com/dataflow/job/2017-10-27_14_53_34-6842464010721369152?project=bradley-playground
/usr/local/lib/python2.7/dist-packages/apache_beam/coders/typecoders.py:135: UserWarning: Using fallback coder for typehint: Any.
  warnings.warn('Using fallback coder for typehint: %r.' % typehint)
running sdist
running egg_info
writing requirements to trainer.egg-info/requires.txt
writing trainer.egg-info/PKG-INFO
writing top-level names to trainer.egg-info/top_level.txt
writing dependency_links to trainer.egg-info/dependency_links.txt
reading manifest file 'trainer.egg-info/SOURCES.txt'
writing manifest file 'trainer.egg-info/SOURCES.txt'
warning: sdist: standard file not found: should have one of README, README.rst, README.txt, README.md

running check
warning: check: missing required meta-data: url

creating trainer-1.0.0
creating trainer-1.0.0/trainer
creating trainer-1.0.0/trainer.egg-info
copying files to trainer-1.0.0...
copying setup.py -> trainer-1.0.0
copying trainer/__init__.py -> trainer-1.0.0/trainer
copying trainer/feature_analysis.py -> trainer-1.0.0/trainer
copying trainer/feature_transforms.py -> trainer-1.0.0/trainer
copying trainer/task.py -> trainer-1.0.0/trainer
copying trainer.egg-info/PKG-INFO -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/SOURCES.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/dependency_links.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/requires.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/top_level.txt -> trainer-1.0.0/trainer.egg-info
Writing trainer-1.0.0/setup.cfg
Creating tar archive
removing 'trainer-1.0.0' (and everything under it)
DEPRECATION: pip install --download has been deprecated and will be removed in the future. Pip now has a download command that should be used instead.
Collecting google-cloud-dataflow==2.0.0
  Using cached google-cloud-dataflow-2.0.0.tar.gz
  Saved /tmp/tmpoTcybt/google-cloud-dataflow-2.0.0.tar.gz
Successfully downloaded google-cloud-dataflow
/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py:113: DeprecationWarning: object() takes no parameters
  super(GcsIO, cls).__new__(cls, storage_client))
View job at https://console.developers.google.com/dataflow/job/2017-10-27_14_53_46-5094631904622372762?project=bradley-playground

In [5]:
!gsutil list gs://datalab-chicago-taxi-demo/transform/eval-*


gs://datalab-chicago-taxi-demo/transform/eval-00000-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00001-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00002-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00003-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00004-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00005-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00006-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00007-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00008-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00009-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00010-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00011-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00012-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00013-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00014-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00015-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00016-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00017-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00018-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00019-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00020-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00021-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00022-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00023-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00024-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00025-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00026-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00027-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00028-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00029-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00030-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00031-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00032-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00033-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00034-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00035-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00036-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00037-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00038-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00039-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00040-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00041-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00042-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00043-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00044-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00045-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00046-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00047-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00048-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00049-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00050-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00051-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00052-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00053-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00054-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00055-of-00056.tfrecord.gz

In [5]:
%%ml dataset create
name: taxi_data_transformed
format: transformed
train: gs://datalab-chicago-taxi-demo/transform/train-*
eval: gs://datalab-chicago-taxi-demo/transform/eval-*


Step 3: Training

ML Workbench helps you build standard TensorFlow models without writing any TensorFlow code. We already know from the previous notebook that a DNN regression model works better.


In [ ]:
!gsutil -m rm -r -f gs://datalab-chicago-taxi-demo/train # Remove previous training results.

Training takes about 30 minutes with the "STANDARD_1" scale_tier. Note that we perform 1M training steps, which would take much longer on Datalab's local VM. Cloud ML Engine runs training in a distributed fashion across multiple VMs, so it finishes much faster.


In [6]:
%%ml train --cloud
output: gs://datalab-chicago-taxi-demo/train
analysis: gs://datalab-chicago-taxi-demo/analysis
data: taxi_data_transformed
model_args:
    model: dnn_regression
    hidden-layer-size1: 400
    hidden-layer-size2: 200
    train-batch-size: 1000
    max-steps: 1000000
cloud_config:
    region: us-east1
    scale_tier: STANDARD_1


Job "trainer_task_171028_154103" submitted.

Click here to view cloud log.

TensorBoard was started successfully with pid 13979. Click here to access it.

Step 4: Evaluation using batch prediction

Below, we use the evaluation model and run batch prediction in the cloud. For demo purposes, we will use the evaluation data again.


In [ ]:
# Delete previous results
!gsutil -m rm -r gs://datalab-chicago-taxi-demo/batch_prediction

Currently, the batch prediction service does not work with BigQuery data, so we export the eval data to a CSV file.


In [9]:
%%bq extract
table: chicago_taxi.eval
format: csv
path: gs://datalab-chicago-taxi-demo/eval.csv


Run batch prediction. Note that we use the evaluation model because it accepts input data with a target (truth) column.


In [8]:
%%ml batch_predict --cloud
model: gs://datalab-chicago-taxi-demo/train/evaluation_model
output: gs://datalab-chicago-taxi-demo/batch_prediction
format: csv
data:
  csv: gs://datalab-chicago-taxi-demo/eval.csv
cloud_config:
  region: us-east1


Job "prediction_171028_180359" submitted.

Click here to view cloud log.

Once batch prediction is done, check the result files. The batch prediction service outputs JSON files.
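
Each prediction.results-* file is newline-delimited JSON, one object per input row, with the fields produced by the evaluation model (unique_key, predicted, and target). A small sketch of parsing such output, using made-up sample lines:

```python
import json

# Hypothetical sample lines in the batch prediction output format:
# one JSON object per line, one line per input row.
lines = [
    '{"unique_key": "abc", "predicted": 9.8, "target": 10.5}',
    '{"unique_key": "def", "predicted": 5.1, "target": 4.6}',
]
records = [json.loads(line) for line in lines]
errors = [abs(r["predicted"] - r["target"]) for r in records]
# errors is approximately [0.7, 0.5] (up to floating-point rounding)
```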


In [14]:
!gsutil list -l -h gs://datalab-chicago-taxi-demo/batch_prediction


       0 B  2017-10-28T04:23:12Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.errors_stats-00000-of-00001
       0 B  2017-10-28T04:10:44Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00000-of-00001
 19.74 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00000-of-00022
 19.76 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00001-of-00022
 19.79 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00002-of-00022
 19.76 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00003-of-00022
 19.86 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00004-of-00022
 19.81 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00005-of-00022
 19.87 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00006-of-00022
 19.82 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00007-of-00022
 19.65 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00008-of-00022
  19.9 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00009-of-00022
 19.88 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00010-of-00022
 19.86 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00011-of-00022
 19.75 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00012-of-00022
 19.73 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00013-of-00022
 19.74 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00014-of-00022
 19.76 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00015-of-00022
  8.91 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00016-of-00022
  5.93 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00017-of-00022
 10.88 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00018-of-00022
 19.92 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00019-of-00022
 19.78 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00020-of-00022
 19.73 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00021-of-00022
TOTAL: 24 objects, 421338589 bytes (401.82 MiB)

We can load the results back into BigQuery.


In [10]:
%%bq load
format: json
mode: overwrite  
table: chicago_taxi.eval_results
path: gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results*
schema:
  - name: unique_key
    type: STRING
  - name: predicted
    type: FLOAT
  - name: target
    type: FLOAT


With the data in BigQuery, we can do some analysis with queries, such as computing the RMSE.


In [11]:
%%ml evaluate regression
bigquery: chicago_taxi.eval_results


Out[11]:
metric value
0 Root Mean Square Error 3.474433
1 Mean Absolute Error 1.531116
2 50 Percentile Absolute Error 0.911206
3 90 Percentile Absolute Error 2.882050
4 99 Percentile Absolute Error 12.204504

From the above, the results are better than the local run on sampled data: RMSE is reduced by 2.5%, MAE by around 20%, and average absolute error by around 30%.
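
These metrics can also be recomputed from raw predictions. Below is a sketch on toy data; note that the nearest-rank percentile used here is an assumption and may differ from the exact percentile method the evaluate command uses.

```python
import math

def regression_metrics(predicted, target):
    """Recompute regression metrics from raw predictions:
    RMSE, MAE, and absolute-error percentiles."""
    abs_err = sorted(abs(p - t) for p, t in zip(predicted, target))
    n = len(abs_err)
    rmse = math.sqrt(sum(e * e for e in abs_err) / n)
    mae = sum(abs_err) / n

    def pct(p):
        # Nearest-rank percentile of absolute error (one of several
        # common percentile definitions).
        return abs_err[min(n - 1, int(round(p / 100.0 * n)))]

    return {"rmse": rmse, "mae": mae,
            "p50": pct(50), "p90": pct(90), "p99": pct(99)}

m = regression_metrics([5.0, 9.0, 7.0, 12.0], [5.5, 8.0, 7.0, 20.0])
# Absolute errors are [0.5, 1.0, 0.0, 8.0], so mae == 2.375.
```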

Select top results sorted by error.


In [12]:
%%bq query
SELECT
  predicted, 
  target,
  ABS(predicted-target) as error,
  s.* 
FROM `chicago_taxi.eval_results` as r 
JOIN `chicago_taxi.eval` as s 
ON r.unique_key = s.unique_key 
ORDER BY error DESC
LIMIT 10


Out[12]:
predicted | target | error | unique_key | fare | weekday | day | hour | pickup_latitude | pickup_longitude | dropoff_latitude | dropoff_longitude | company
(10 rows omitted)

(rows: 10, time: 14.0s, 610MB processed, job: job_nkhfAM3ZPP_jzDXv8H20S8LtC1xg)

There is also a feature-slice visualization component designed for viewing evaluation results. It shows how prediction error correlates with feature values.
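
The aggregation behind such a chart is simple: group absolute errors by a feature value and compute the count, mean, and standard deviation per slice, as the error_by_hour and error_by_weekday queries below do in SQL. A small Python sketch on toy data:

```python
import math
from collections import defaultdict

def error_by_slice(rows, feature):
    """Group absolute prediction errors by a feature value: per slice,
    report count, average error, and standard deviation of error."""
    buckets = defaultdict(list)
    for r in rows:
        buckets[r[feature]].append(abs(r["predicted"] - r["target"]))
    out = {}
    for value, errs in buckets.items():
        mean = sum(errs) / len(errs)
        var = sum((e - mean) ** 2 for e in errs) / len(errs)
        out[value] = {"count": len(errs), "avg_error": mean,
                      "stddev_error": math.sqrt(var)}
    return out

rows = [{"hour": "5", "predicted": 9.0, "target": 12.0},
        {"hour": "5", "predicted": 8.0, "target": 9.0},
        {"hour": "14", "predicted": 7.0, "target": 7.5}]
res = error_by_slice(rows, "hour")
# res["5"] has count 2 and avg_error 2.0; res["14"] has avg_error 0.5.
```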


In [40]:
%%bq query --name error_by_hour
SELECT
  COUNT(*) as count,
  hour as feature,
  AVG(ABS(predicted - target)) as avg_error,
  STDDEV(ABS(predicted - target)) as stddev_error
FROM `chicago_taxi.eval_results` as r
JOIN `chicago_taxi.eval` as s 
ON r.unique_key = s.unique_key 
GROUP BY hour

In [44]:
# Note: the interactive output is replaced with a static image so it displays well in github.
# Please execute this cell to see the interactive component.

from google.datalab.ml import FeatureSliceView

FeatureSliceView().plot(error_by_hour)


Out[44]:

In [42]:
%%bq query --name error_by_weekday
SELECT
  COUNT(*) as count,
  weekday as feature,
  AVG(ABS(predicted - target)) as avg_error,
  STDDEV(ABS(predicted - target)) as stddev_error
FROM `chicago_taxi.eval_results` as r
JOIN `chicago_taxi.eval` as s 
ON r.unique_key = s.unique_key 
GROUP BY weekday

In [45]:
# Note: the interactive output is replaced with a static image so it displays well in github.
# Please execute this cell to see the interactive component.

from google.datalab.ml import FeatureSliceView

FeatureSliceView().plot(error_by_weekday)


Out[45]:

What we can see from the charts above is that the model performs worst during hours 5 and 6 (why?), and best on Sundays (less traffic?).

Model Deployment and Online Prediction

Model deployment works the same for locally trained and cloud-trained models. Please see the previous notebook (Taxi Fare Model (small data)).

Cleanup


In [ ]:
!gsutil -m rm -rf gs://datalab-chicago-taxi-demo

In [ ]: