Scaling up ML using Cloud AI Platform

In this notebook, we take a previously developed TensorFlow model that predicts taxi fares and package it so that it can run on Cloud AI Platform. For now, we'll train it on a small dataset. The model is simplistic, so its accuracy is not great. The point of this notebook is to show how to package a TensorFlow model and run it on Cloud AI Platform.

Later in the course, we will look at ways to make a more effective machine learning model.

Environment variables for project and bucket

Note that:

  1. Your project id is the *unique* string that identifies your project (not the project name). You can find this from the GCP Console dashboard's Home page. My dashboard reads: Project ID: cloud-training-demos
  2. Cloud training often involves saving and restoring model files. If you don't have a bucket already, I suggest that you create one from the GCP console (because it will dynamically check whether the bucket name you want is available). A common pattern is to prefix the bucket name with the project id, so that it is unique. Also, for cost reasons, you might want to use a single-region bucket.

Change the cell below to reflect your Project ID and bucket name.
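
The bucket-naming pattern from item 2 can be sketched in Python. This is a minimal illustration and not part of the original notebook: `default_bucket_name` and its `suffix` parameter are hypothetical, and the check covers only the basic Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit). It cannot tell you whether the name is already taken.

```python
import re

# Basic Cloud Storage bucket-name rules only; names containing dots and
# reserved prefixes have extra restrictions that this sketch ignores.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def default_bucket_name(project_id, suffix="ml"):
    """Prefix the bucket name with the project id so it is likely unique."""
    name = f"{project_id}-{suffix}"
    if not _BUCKET_RE.match(name):
        raise ValueError(f"not a valid bucket name: {name!r}")
    return name


print(default_bucket_name("cloud-training-demos"))
```

With the example project id used in this notebook, the helper produces the same bucket name that appears in the cell below.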

    
    
    In [ ]:
    !sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst
    
    
    
    In [ ]:
    import os
    PROJECT = 'cloud-training-demos' # REPLACE WITH YOUR PROJECT ID
    BUCKET = 'cloud-training-demos-ml' # REPLACE WITH YOUR BUCKET NAME
    REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1
    
    
    
    In [ ]:
    # for bash
    os.environ['PROJECT'] = PROJECT
    os.environ['BUCKET'] = BUCKET
    os.environ['REGION'] = REGION
    os.environ['TFVERSION'] = '2.1'  # TensorFlow version
    
    
    
    In [ ]:
    %%bash
    gcloud config set project $PROJECT
    gcloud config set compute/region $REGION
    

    Allow the Cloud AI Platform service account to read/write to the bucket containing training data.

    
    
    In [ ]:
    %%bash
    PROJECT_ID=$PROJECT
    AUTH_TOKEN=$(gcloud auth print-access-token)
    SVC_ACCOUNT=$(curl -X GET -H "Content-Type: application/json" \
        -H "Authorization: Bearer $AUTH_TOKEN" \
        https://ml.googleapis.com/v1/projects/${PROJECT_ID}:getConfig \
        | python -c "import json; import sys; response = json.load(sys.stdin); \
        print(response['serviceAccount'])")
    
    echo "Authorizing the Cloud AI Platform account $SVC_ACCOUNT to access files in $BUCKET"
    gsutil -m defacl ch -u $SVC_ACCOUNT:R gs://$BUCKET
    gsutil -m acl ch -u $SVC_ACCOUNT:R -r gs://$BUCKET  # error message (if bucket is empty) can be ignored
    gsutil -m acl ch -u $SVC_ACCOUNT:W gs://$BUCKET
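
The inline `python -c` step above reads the `getConfig` response from standard input and prints its `serviceAccount` field. The same extraction is sketched below as a standalone function. The sample response and the account address in it are made up for illustration, `service_account_from_config` is a hypothetical name, and real responses may carry additional fields.

```python
import json


def service_account_from_config(response_text):
    """Return the 'serviceAccount' field of a getConfig JSON response."""
    response = json.loads(response_text)
    try:
        return response["serviceAccount"]
    except KeyError:
        # An error payload (for example, after a bad token) lacks this field.
        raise ValueError(
            f"unexpected getConfig response: {response_text!r}") from None


# Hypothetical response body, for illustration only.
sample = '{"serviceAccount": "service-123@example.iam.gserviceaccount.com"}'
print(service_account_from_config(sample))
```

Raising an explicit error when the field is missing makes a failed authorization easier to spot than the bare `KeyError` the one-liner would produce.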
    

    Packaging up the code

    Take your code and put it into a standard Python package structure. model.py and task.py contain the TensorFlow code from earlier (explore the directory structure).
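
A quick way to confirm the layout is sketched below. `model.py` and `task.py` come from the text above; the `__init__.py` entry is an assumption based on how a regular Python package is normally laid out, and `missing_package_files` is a hypothetical helper name.

```python
from pathlib import Path

# model.py and task.py are named in the notebook; __init__.py is assumed.
EXPECTED_FILES = ("__init__.py", "model.py", "task.py")


def missing_package_files(package_dir):
    """Return the expected trainer files that are absent from package_dir."""
    package_dir = Path(package_dir)
    return [name for name in EXPECTED_FILES
            if not (package_dir / name).is_file()]


# Run from the notebook's directory, where taxifare/ lives.
print(missing_package_files("taxifare/trainer"))
```

An empty list means every expected file was found.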

    
    
    In [ ]:
    %%bash
    ## check whether there are any more TODOs
    ## exit with 0 to avoid notebook process error
    grep TODO taxifare/trainer/*.py; rc=$?
    
    case $rc in 
        0) ;;
        1) echo "No more TODOs!"; exit 0;;
    esac
    

    Find absolute paths to your data

    Note the absolute paths below. In Datalab, /content is mapped to the location that the home icon takes you to.

    
    
    In [ ]:
    %%bash
    echo $PWD
    rm -rf $PWD/taxi_trained
    head -1 $PWD/taxi-train.csv
    head -1 $PWD/taxi-valid.csv
    

    Running the Python module from the command-line

    
    
    In [ ]:
    %%bash
    rm -rf taxifare.tar.gz taxi_trained
    export PYTHONPATH=${PYTHONPATH}:${PWD}/taxifare
    python -m trainer.task \
       --train_data_paths="${PWD}/taxi-train*" \
       --eval_data_paths=${PWD}/taxi-valid.csv  \
       --output_dir=${PWD}/taxi_trained \
       --train_steps=100 --job-dir=./tmp
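
The flags passed to `trainer.task` above suggest a command-line interface along the following lines. This is a hedged sketch of how such an entry point might parse them with `argparse`; the actual `task.py` in `taxifare/trainer` may differ, and the defaults, help strings, and the `parse_args` name are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags used in the command above (a sketch, not the real task.py)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_data_paths", required=True,
                        help="File or glob pattern for the training CSVs")
    parser.add_argument("--eval_data_paths", required=True,
                        help="File or glob pattern for the evaluation CSVs")
    parser.add_argument("--output_dir", required=True,
                        help="Where checkpoints and exported models are written")
    parser.add_argument("--train_steps", type=int, default=100,
                        help="Number of training steps")
    # argparse exposes --job-dir as args.job_dir.
    parser.add_argument("--job-dir", default=None,
                        help="Accepted so that the flag does not break parsing")
    return parser.parse_args(argv)


args = parse_args([
    "--train_data_paths=taxi-train*",
    "--eval_data_paths=taxi-valid.csv",
    "--output_dir=taxi_trained",
    "--train_steps=100",
    "--job-dir=./tmp",
])
print(args.train_steps, args.job_dir)
```

Note the mix of underscores and a dash in the flag names: `argparse` converts the dash in `--job-dir` to an underscore when it builds the attribute name.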
    
    
    
    In [ ]:
    %%bash
    ls $PWD/taxi_trained/export/exporter/
    
    
    
    In [ ]:
    %%writefile ./test.json
    {"pickuplon": -73.885262,"pickuplat": 40.773008,"dropofflon": -73.987232,"dropofflat": 40.732403,"passengers": 2}
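
The `--json-instances` file used for prediction holds one JSON object per line. A minimal sketch for writing several instances to such a file follows. The second ride and the `write_json_instances` helper are invented for illustration, and the output goes to a separate file name so that `test.json` is left untouched.

```python
import json


def write_json_instances(path, instances):
    """Write one JSON object per line (newline-delimited JSON)."""
    with open(path, "w") as f:
        for instance in instances:
            f.write(json.dumps(instance) + "\n")


instances = [
    {"pickuplon": -73.885262, "pickuplat": 40.773008,
     "dropofflon": -73.987232, "dropofflat": 40.732403, "passengers": 2},
    # Made-up second ride, for illustration only.
    {"pickuplon": -73.98, "pickuplat": 40.75,
     "dropofflon": -73.95, "dropofflat": 40.78, "passengers": 1},
]
write_json_instances("test_instances.json", instances)
```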
    
    
    
    In [ ]:
    %%bash
    sudo find "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/command_lib/ml_engine" -name '*.pyc' -delete
    
    
    
    In [ ]:
    %%bash
    model_dir=$(ls ${PWD}/taxi_trained/export/exporter)
    gcloud ai-platform local predict \
        --model-dir=${PWD}/taxi_trained/export/exporter/${model_dir} \
        --json-instances=./test.json
    

    Running locally using gcloud

    
    
    In [ ]:
    %%bash
    rm -rf taxifare.tar.gz taxi_trained
    gcloud ai-platform local train \
       --module-name=trainer.task \
       --package-path=${PWD}/taxifare/trainer \
       -- \
       --train_data_paths=${PWD}/taxi-train.csv \
       --eval_data_paths=${PWD}/taxi-valid.csv  \
       --train_steps=1000 \
       --output_dir=${PWD}/taxi_trained
    

    When I ran it (your results will differ because of random seeds), the average_loss (mean squared error) on the evaluation dataset was 187, meaning that the RMSE was about 13.7.
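
The conversion from mean squared error to RMSE is just a square root, which puts the error back into the same units as the fare. A short check using the value quoted above:

```python
import math

mse = 187.0             # average_loss reported on the evaluation set
rmse = math.sqrt(mse)   # root mean squared error, in fare units
print(f"RMSE = {rmse:.2f}")
```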

    
    
    In [ ]:
    !ls $PWD/taxi_trained
    
    
    
    In [ ]:
    %%bash
    echo $BUCKET
    gsutil -m rm -rf gs://${BUCKET}/taxifare/smallinput/
    gsutil -m cp ${PWD}/*.csv gs://${BUCKET}/taxifare/smallinput/
    
    
    
    In [ ]:
    %%bash
    OUTDIR=gs://${BUCKET}/taxifare/smallinput/taxi_trained
    JOBNAME=lab3a_$(date -u +%y%m%d_%H%M%S)
    echo $OUTDIR $REGION $JOBNAME
    gsutil -m rm -rf $OUTDIR
    gcloud ai-platform jobs submit training $JOBNAME \
       --region=$REGION \
       --module-name=trainer.task \
       --package-path=${PWD}/taxifare/trainer \
       --job-dir=$OUTDIR \
       --staging-bucket=gs://$BUCKET \
       --scale-tier=BASIC \
       --runtime-version 2.1 \
       --python-version 3.7 \
       -- \
       --train_data_paths="gs://${BUCKET}/taxifare/smallinput/taxi-train*" \
       --eval_data_paths="gs://${BUCKET}/taxifare/smallinput/taxi-valid*"  \
       --output_dir=$OUTDIR \
       --train_steps=10000
    

    Don't be concerned if the notebook appears stalled (with a blue progress bar) or returns with an error about being unable to refresh auth tokens. This is a long-lived Cloud job and work is going on in the cloud.

    Use the Cloud Console link to monitor the job and do NOT proceed until the job is done.
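
One way to wait for the job without watching the console is to poll its state. The sketch below assumes that `gcloud ai-platform jobs describe` with `--format=value(state)` is available and authenticated in your environment. `wait_for_job`, its parameters, and the polling interval are hypothetical, and the state fetcher is injectable so that the loop can be tried without submitting a real job.

```python
import subprocess
import time

# Terminal states of an AI Platform training job.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def gcloud_job_state(job_name):
    """Ask gcloud for the job's current state (requires gcloud and auth)."""
    result = subprocess.run(
        ["gcloud", "ai-platform", "jobs", "describe", job_name,
         "--format=value(state)"],
        check=True, capture_output=True, text=True)
    return result.stdout.strip()


def wait_for_job(job_name, get_state=gcloud_job_state, poll_seconds=60,
                 sleep=time.sleep):
    """Poll until the job reaches a terminal state, then return that state."""
    while True:
        state = get_state(job_name)
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)


# Dry run with a fake state sequence instead of a real job.
fake_states = iter(["QUEUED", "PREPARING", "RUNNING", "SUCCEEDED"])
print(wait_for_job("lab3a_demo", get_state=lambda _: next(fake_states),
                   sleep=lambda _: None))
```

For a real job, call `wait_for_job` with the job name printed by the submit cell and leave the other arguments at their defaults.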

    
    
    In [ ]:
    %%bash
    gsutil ls gs://${BUCKET}/taxifare/smallinput/
    

    Train on larger dataset

    I have already followed the steps below and the files are already available. You don't need to do the steps in this comment. In the next chapter (on feature engineering), we will avoid all this manual processing by using Cloud Dataflow.

    Go to http://bigquery.cloud.google.com/ and type the query:

    SELECT
      (tolls_amount + fare_amount) AS fare_amount,
      pickup_longitude AS pickuplon,
      pickup_latitude AS pickuplat,
      dropoff_longitude AS dropofflon,
      dropoff_latitude AS dropofflat,
      passenger_count*1.0 AS passengers,
      'nokeyindata' AS key
    FROM
      [nyc-tlc:yellow.trips]
    WHERE
      trip_distance > 0
      AND fare_amount >= 2.5
      AND pickup_longitude > -78
      AND pickup_longitude < -70
      AND dropoff_longitude > -78
      AND dropoff_longitude < -70
      AND pickup_latitude > 37
      AND pickup_latitude < 45
      AND dropoff_latitude > 37
      AND dropoff_latitude < 45
      AND passenger_count > 0
      AND ABS(HASH(pickup_datetime)) % 1000 == 1
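
The last WHERE condition keeps roughly one row in a thousand, and because the hash of `pickup_datetime` is deterministic, the same rows are selected on every run. The idea is sketched below in Python. Note that `hashlib.md5` is a stand-in chosen for illustration and does not reproduce BigQuery's `HASH` function, so the rows it selects would differ; `bucket_of`, `in_sample`, and the keys are hypothetical.

```python
import hashlib


def bucket_of(key, num_buckets=1000):
    """Map a string key to a stable bucket in [0, num_buckets)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def in_sample(key, bucket=1, num_buckets=1000):
    """True if the key falls in the chosen bucket (about 1/num_buckets of keys)."""
    return bucket_of(key, num_buckets) == bucket


# Made-up keys standing in for pickup_datetime values.
keys = [f"ride-{i}" for i in range(100_000)]
train = [k for k in keys if in_sample(k, bucket=1)]
valid = [k for k in keys if in_sample(k, bucket=2)]
print(len(train), len(valid))        # each roughly 0.1% of the keys
print(set(train).isdisjoint(valid))  # True: every key has exactly one bucket
```

Choosing a different bucket number gives a second sample that cannot overlap the first, which is how the query is reused to build separate training and validation sets.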
    

    Note that this is now 1,000,000 rows (i.e. 100x the original dataset). Export this to CSV using the following steps (I have already done this and made the resulting GCS data publicly available, so you don't need to do it):

    1. Click on the "Save As Table" button and note down the name of the dataset and table.
    2. On the BigQuery console, find the newly exported table in the left-hand-side menu, and click on the name.
    3. Click on "Export Table".
    4. Supply your bucket name and give it the name train.csv (for example: gs://cloud-training-demos-ml/taxifare/ch3/train.csv). Note down what this is. Wait for the job to finish (look at the "Job History" on the left-hand-side menu).
    5. In the query above, change the final "== 1" to "== 2" and export this to Cloud Storage as valid.csv (e.g. gs://cloud-training-demos-ml/taxifare/ch3/valid.csv).
    6. Download the two files, remove the header line from each, and upload them back to GCS.
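
Step 6 can be done without a text editor. The following sketch copies a downloaded CSV while dropping its first line. The file names and rows in the demo are placeholders, and `strip_header` is a hypothetical helper.

```python
import os
import shutil
import tempfile


def strip_header(src_path, dst_path):
    """Copy src_path to dst_path, skipping the first (header) line."""
    with open(src_path, "r", newline="") as src, \
         open(dst_path, "w", newline="") as dst:
        src.readline()               # discard the header line
        shutil.copyfileobj(src, dst)  # stream the rest unchanged


# Demo on a throwaway file with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "train_with_header.csv")
    dst = os.path.join(tmp, "train.csv")
    with open(src, "w", newline="") as f:
        f.write("fare_amount,pickuplon\n12.5,-73.98\n7.0,-73.95\n")
    strip_header(src, dst)
    with open(dst, newline="") as f:
        print(f.read(), end="")
```

Streaming the remainder with `shutil.copyfileobj` avoids loading a million-row file into memory.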

    Run Cloud training on 1-million row dataset

    This took 60 minutes and used 1 million rows as input. The model is exactly the same as above. The only changes are to the input (to use the larger dataset) and to the Cloud AI Platform scale tier (to use STANDARD_1 instead of BASIC; STANDARD_1 is approximately 10x more powerful than BASIC). At the end of training the loss was 32, but the RMSE (calculated on the validation dataset) was stubbornly at 9.03. So, simply adding more data doesn't help.

    
    
    In [ ]:
    %%bash
    
    OUTDIR=gs://${BUCKET}/taxifare/ch3/taxi_trained
    JOBNAME=lab3a_$(date -u +%y%m%d_%H%M%S)
    CRS_BUCKET=cloud-training-demos # use the already exported data
    echo $OUTDIR $REGION $JOBNAME
    gsutil -m rm -rf $OUTDIR
    gcloud ai-platform jobs submit training $JOBNAME \
       --region=$REGION \
       --module-name=trainer.task \
       --package-path=${PWD}/taxifare/trainer \
       --job-dir=$OUTDIR \
       --staging-bucket=gs://$BUCKET \
       --scale-tier=STANDARD_1 \
       --runtime-version 2.1 \
       --python-version 3.7 \
       -- \
       --train_data_paths="gs://${CRS_BUCKET}/taxifare/ch3/train.csv" \
       --eval_data_paths="gs://${CRS_BUCKET}/taxifare/ch3/valid.csv"  \
       --output_dir=$OUTDIR \
       --train_steps=100000
    

    Challenge Exercise

    Modify your solution to the challenge exercise in d_trainandevaluate.ipynb appropriately. Make sure that you implement training and deployment. Increase the size of your dataset by 10x since you are running on the cloud. Does your accuracy improve?

    Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License