Distributed Training

Learning Objectives

  • Use AI Platform Training Service to run a distributed training job

Introduction

In the previous notebook we trained our model on AI Platform Training Service, but we didn't receive any benefit. In fact, it was much slower to train in the cloud (5-10 minutes) than it was to train locally! Why is this?

1. The job was too small

AI Platform Training Service provisions hardware on-demand. This is good because it means you only pay for what you use, but for small jobs it means the start up time for the hardware is longer than the training time itself!

To address this we'll use a dataset that is 100x as big, and enough steps to go through all the data at least once.

2. The hardware was too small

By default, AI Platform Training Service jobs train on an n1-standard-4 instance, which isn't much more powerful than our local VM. And even if it were, we could easily increase the specs of our local VM to match.

To get the most benefit out of AI Platform Training Service we need to move beyond training on a single instance and instead train across multiple machines.

Because we're using tf.estimator.train_and_evaluate(), our model already knows how to distribute itself while training! So all we need to do is supply a --scale-tier parameter to the AI Platform Training Service training job, which will provide the distributed training environment. See the different scale tiers available here.
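
For reference, the relevant piece of our model code looks roughly like the sketch below. This is a minimal outline in the style of the earlier taxifare notebooks; the feature columns, read_dataset() helper, and network size are assumptions, not the exact code in model.py. The key point is that tf.estimator.train_and_evaluate() reads the TF_CONFIG environment variable that AI Platform Training Service sets on each machine, so the same code runs single-node or distributed without changes.

    import tensorflow as tf

    def train_and_evaluate(output_dir, train_data_path, eval_data_path, train_steps):
        # The Estimator picks up the cluster layout from the TF_CONFIG
        # environment variable that AI Platform Training Service sets on
        # each master/worker/parameter-server machine.
        run_config = tf.estimator.RunConfig(
            model_dir=output_dir, save_checkpoints_secs=300)

        estimator = tf.estimator.DNNRegressor(
            hidden_units=[32, 8],          # illustrative architecture
            feature_columns=feature_cols,  # assumed to be defined in model.py
            config=run_config)

        train_spec = tf.estimator.TrainSpec(
            input_fn=lambda: read_dataset(train_data_path, tf.estimator.ModeKeys.TRAIN),
            max_steps=train_steps)

        eval_spec = tf.estimator.EvalSpec(
            input_fn=lambda: read_dataset(eval_data_path, tf.estimator.ModeKeys.EVAL),
            steps=None,          # evaluate on the full validation set
            throttle_secs=300)   # evaluate at most every 5 minutes

        # Runs distributed training automatically when TF_CONFIG describes a cluster.
        tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)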

We will use STANDARD_1, which corresponds to 1 n1-highcpu-8 master instance, 4 n1-highcpu-8 worker instances, and 3 n1-standard-4 parameter servers. We will cover the details of the distribution strategy and why there are master/worker/parameter server designations later in the course.
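
As an aside, the same cluster could be requested explicitly with the CUSTOM tier via a config file passed to gcloud with --config (the file name config.yaml is our choice, and we won't actually use this here; we'll simply pass --scale-tier=STANDARD_1 on the command line instead):

    # config.yaml -- CUSTOM-tier equivalent of STANDARD_1 (illustrative only)
    trainingInput:
      scaleTier: CUSTOM
      masterType: n1-highcpu-8
      workerType: n1-highcpu-8
      workerCount: 4
      parameterServerType: n1-standard-4
      parameterServerCount: 3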

Training will take about 20 minutes


In [ ]:
PROJECT = "cloud-training-demos"  # Replace with your PROJECT
BUCKET = "cloud-training-bucket"  # Replace with your BUCKET
REGION = "us-central1"            # Choose an available region for AI Platform Training Service
TFVERSION = "1.14"                # TF version for AI Platform Training Service to use

Run distributed cloud job

Having tested our training pipeline both locally and in the cloud on a subset of the data, we'll now submit another (much larger) training job to the cloud. The gcloud command is almost exactly the same, though we'll need to alter some of the previous parameters to point our training job at the much larger dataset.

Note the train_data_path and eval_data_path in the exercise code below, as well as train_steps, the number of training steps.

To start, we'll set up our output directory as before, now calling it trained_large. Then we submit the training job using gcloud ai-platform, just as before.

Exercise 1

In the cell below, we will submit another (much larger) training job to the cloud. However, this time we'll alter some of the previous parameters. Fill in the missing code in the TODOs below. You can reference the previous f_ai_platform notebook if you get stuck. Note that we will now want to include an additional scale-tier parameter to specify the distributed training environment. You can follow these links to read more about "Using Distributed TensorFlow with Cloud ML Engine" or "Specifying Machine Types or Scale Tiers".

Exercise 2

Notice how our train_data_path contains a wildcard character. This means we're going to be reading in a list of sharded files; modify your read_dataset() function in model.py to handle this (or verify that it already does). One way to do so is sketched after this paragraph.
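
If your current read_dataset() builds the input pipeline from a single file, the sketch below shows one way to make it pattern-aware. It is a minimal outline in the style of the earlier taxifare notebooks; CSV_COLUMNS, DEFAULTS, the fare_amount label, and the headerless CSV layout are assumptions carried over from model.py. tf.data.Dataset.list_files() expands the wildcard into the individual shard filenames.

    def read_dataset(csv_path, mode, batch_size=128):
        def parse_row(row):
            # CSV_COLUMNS and DEFAULTS are assumed to be defined in model.py,
            # with fare_amount as the label column.
            fields = tf.decode_csv(records=row, record_defaults=DEFAULTS)
            features = dict(zip(CSV_COLUMNS, fields))
            label = features.pop("fare_amount")
            return features, label

        # Expand the wildcard (e.g. taxi-train*.csv) into individual shard names,
        # then read every line of every shard.
        dataset = tf.data.Dataset.list_files(file_pattern=csv_path)
        dataset = dataset.flat_map(map_func=tf.data.TextLineDataset)
        dataset = dataset.map(map_func=parse_row)

        if mode == tf.estimator.ModeKeys.TRAIN:
            dataset = dataset.shuffle(buffer_size=1000).repeat(count=None)
        dataset = dataset.batch(batch_size=batch_size)
        return dataset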


In [ ]:
OUTDIR = "gs://{}/taxifare/trained_large".format(BUCKET)
!gsutil -m rm -rf # TODO: Your code goes here
!gcloud ai-platform # TODO: Your code goes here
    --package-path= # TODO: Your code goes here
    --module-name= # TODO: Your code goes here
    --job-dir= # TODO: Your code goes here
    --python-version= # TODO: Your code goes here
    --runtime-version= # TODO: Your code goes here
    --region= # TODO: Your code goes here
    --scale-tier= # TODO: Your code goes here
    -- \
    --train_data_path=gs://cloud-training-demos/taxifare/large/taxi-train*.csv \
    --eval_data_path=gs://cloud-training-demos/taxifare/small/taxi-valid.csv  \
    --train_steps=200000 \
    --output_dir={OUTDIR}
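
For reference, a completed version of the cell might look roughly like the following. The job name, package path (taxifaremodel), module name (taxifaremodel.task), job-dir, and python-version here are assumptions based on the earlier taxifare notebooks; adjust them to match your own code layout.

    OUTDIR = "gs://{}/taxifare/trained_large".format(BUCKET)
    !gsutil -m rm -rf {OUTDIR}
    !gcloud ai-platform jobs submit training taxifare_large_$(date -u +%y%m%d_%H%M%S) \
        --package-path=taxifaremodel \
        --module-name=taxifaremodel.task \
        --job-dir=gs://{BUCKET}/taxifare \
        --python-version=3.5 \
        --runtime-version={TFVERSION} \
        --region={REGION} \
        --scale-tier=STANDARD_1 \
        -- \
        --train_data_path=gs://cloud-training-demos/taxifare/large/taxi-train*.csv \
        --eval_data_path=gs://cloud-training-demos/taxifare/small/taxi-valid.csv \
        --train_steps=200000 \
        --output_dir={OUTDIR}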

Instructions to obtain larger dataset

Note the new train_data_path above. It is ~20,000,000 rows (100x the original dataset) and 1.25GB sharded across 10 files. How did we create these files?

Go to https://console.cloud.google.com/bigquery and paste the query:

    #standardSQL
    SELECT
      (tolls_amount + fare_amount) AS fare_amount,
      EXTRACT(DAYOFWEEK from pickup_datetime) AS dayofweek,
      EXTRACT(HOUR from pickup_datetime) AS hourofday,
      pickup_longitude AS pickuplon,
      pickup_latitude AS pickuplat,
      dropoff_longitude AS dropofflon,
      dropoff_latitude AS dropofflat
    FROM
      `nyc-tlc.yellow.trips`
    WHERE
      trip_distance > 0
      AND fare_amount >= 2.5
      AND pickup_longitude > -78
      AND pickup_longitude < -70
      AND dropoff_longitude > -78
      AND dropoff_longitude < -70
      AND pickup_latitude > 37
      AND pickup_latitude < 45
      AND dropoff_latitude > 37
      AND dropoff_latitude < 45
      AND passenger_count > 0
      AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 50)) = 1

Export this to CSV using the following steps (Note that we have already done this and made the resulting GCS data publicly available, so following these steps is optional):

  1. Click on the "Save Results" button and select "BigQuery Table" (we can't directly export to CSV because the file is too large).
  2. Specify a dataset name and table name (if you don't have an existing dataset, create one).
  3. On the BigQuery console, find the newly exported table in the left-hand-side menu, and click on the name.
  4. Click on the "Export" button, then select "Export to GCS".
  5. Supply your bucket and file name (for example: gs://cloud-training-demos/taxifare/large/taxi-train*.csv). The asterisk allows for sharding of large files.
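
If you prefer the command line to the console, a roughly equivalent flow using the bq tool is sketched below. The dataset and table names (taxifare.taxi_train_large), the query file large_query.sql, and the bucket path are placeholders you would replace with your own.

    # Run the query above (saved to large_query.sql) and materialize the
    # results into a BigQuery table; the dataset/table names are placeholders.
    bq query --use_legacy_sql=false \
        --destination_table=taxifare.taxi_train_large \
        "$(cat large_query.sql)"

    # Export that table to GCS as sharded CSV files. The * in the filename lets
    # BigQuery split the output into multiple shards; we drop the header row to
    # match the headerless CSVs used elsewhere in this course (an assumption
    # about how the original export was done).
    bq extract --destination_format=CSV --noprint_header \
        taxifare.taxi_train_large \
        'gs://your-bucket/taxifare/large/taxi-train*.csv'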

Note: We are still using the original, smaller validation dataset. It already contains ~31K records, which should suffice to give us a good indication of learning. Increasing the validation dataset 100x would slow down training, because the full validation dataset is processed at each checkpoint, and the value of a larger validation dataset is questionable.

Analysis

Our previous RMSE was 9.26, and the new RMSE is about the same (9.24), so more training data didn't help.

However, we still haven't done any feature engineering, so the signal in the data is very hard for the model to extract, even when we have lots of data. In the next section we'll apply feature engineering to try to improve our model.

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License