Learning Objectives
In the previous notebook we demonstrated how to do ML in BigQuery. However, BQML is limited to linear models.
For more advanced ML we need to pull the data out of BigQuery and load it into an ML framework, in our case TensorFlow.
While TensorFlow can read from BigQuery directly, the performance is slow. The best practice is to first export the BigQuery data to .csv files, and then read the .csv files into TensorFlow.
The .csv files can reside on local disk if we're training locally, but if we're training in the cloud we'll need to move them to cloud storage, in our case Google Cloud Storage.
In [ ]:
PROJECT = "cloud-training-demos" # Replace with your PROJECT
REGION = "us-central1" # Choose an available region for Cloud MLE
In [ ]:
import os
os.environ["PROJECT"] = PROJECT
os.environ["REGION"] = REGION
In [ ]:
!pip freeze | grep google-cloud-bigquery==1.21.0 || pip install google-cloud-bigquery==1.21.0
In [ ]:
%load_ext google.cloud.bigquery
In the a_sample_explore_clean notebook we came up with the following query to extract a repeatable and clean sample:
#standardSQL
SELECT
    (tolls_amount + fare_amount) AS fare_amount, -- label
    pickup_datetime,
    pickup_longitude,
    pickup_latitude,
    dropoff_longitude,
    dropoff_latitude
FROM
    `nyc-tlc.yellow.trips`
WHERE
    -- Clean Data
    trip_distance > 0
    AND passenger_count > 0
    AND fare_amount >= 2.5
    AND pickup_longitude > -78
    AND pickup_longitude < -70
    AND dropoff_longitude > -78
    AND dropoff_longitude < -70
    AND pickup_latitude > 37
    AND pickup_latitude < 45
    AND dropoff_latitude > 37
    AND dropoff_latitude < 45
    -- repeatable 1/5000th sample
    AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 5000)) = 1
We will use the same query with one change. Instead of using pickup_datetime as is, we will extract dayofweek and hourofday from it. This is to give us some categorical features in our dataset so we can illustrate how to deal with them when we get to feature engineering. The new query will be:
SELECT
    (tolls_amount + fare_amount) AS fare_amount, -- label
    EXTRACT(DAYOFWEEK from pickup_datetime) AS dayofweek,
    EXTRACT(HOUR from pickup_datetime) AS hourofday,
    pickup_longitude,
    pickup_latitude,
    dropoff_longitude,
    dropoff_latitude
    -- rest same as before
For ML modeling we need not just one, but three datasets.
Train: This is what our model learns on
Evaluation (aka Validation): We shouldn't evaluate our model on the same data we trained on, because then we couldn't tell whether the model was memorizing the input data or generalizing from it. Therefore we evaluate on a separate evaluation dataset, aka validation dataset.
Test: We use our evaluation dataset to tune our hyperparameters (we'll cover hyperparameter tuning in a future lesson). We need to know that our chosen set of hyperparameters will work well for data we haven't seen before because in production, that will be the case. For this reason, we create a third dataset that we never use during the model development process. We only evaluate on this once our model development is finished. Data scientists don't always create a test dataset (aka holdout dataset), but to be thorough you should.
We can divide our existing 1/5000th sample three ways, 70%/15%/15% (or whatever split we like), with some modulo math demonstrated below.
Because we are using a hash function, these results are deterministic: we'll get the exact same split every time the query is run (assuming the underlying data hasn't changed).
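To see why a hash-based split is deterministic, here is a tiny illustration in Python (a sketch only; it uses hashlib's md5 as a stand-in for BigQuery's FARM_FINGERPRINT, and the datetime string is hypothetical):

import hashlib

def hash_bucket(key, num_buckets=100):
    # The same key always hashes to the same bucket, so re-running
    # the split never moves a row between TRAIN, VALID, and TEST.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % num_buckets

bucket = hash_bucket("2015-03-07 14:02:00")  # hypothetical pickup_datetime
phase = "TRAIN" if bucket < 70 else ("VALID" if bucket < 85 else "TEST")
print(bucket, phase)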
The create_query function below returns a query string that we will pass to BigQuery to collect our data. It takes as arguments the phase (TRAIN, VALID, or TEST) and the sample_size (relating to the fraction of the data we wish to sample). Complete the code below so that when the phase is set as VALID or TEST a new 15% split of the data will be created.
In [ ]:
def create_query(phase, sample_size):
basequery = """
SELECT
(tolls_amount + fare_amount) AS fare_amount,
EXTRACT(DAYOFWEEK from pickup_datetime) AS dayofweek,
EXTRACT(HOUR from pickup_datetime) AS hourofday,
pickup_longitude AS pickuplon,
pickup_latitude AS pickuplat,
dropoff_longitude AS dropofflon,
dropoff_latitude AS dropofflat
FROM
`nyc-tlc.yellow.trips`
WHERE
trip_distance > 0
AND fare_amount >= 2.5
AND pickup_longitude > -78
AND pickup_longitude < -70
AND dropoff_longitude > -78
AND dropoff_longitude < -70
AND pickup_latitude > 37
AND pickup_latitude < 45
AND dropoff_latitude > 37
AND dropoff_latitude < 45
AND passenger_count > 0
AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N)) = 1
"""
if phase == "TRAIN":
subsample = """
AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) >= (EVERY_N * 0)
AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) < (EVERY_N * 70)
"""
elif phase == "VALID":
subsample = """
# TODO: Your code goes here
"""
elif phase == "TEST":
subsample = """
# TODO: Your code goes here
"""
query = basequery + subsample
return query.replace("EVERY_N", sample_size)
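If you want to check your work on the two TODOs above, one possible pair of subsample clauses, following the same hash-bucket pattern as the TRAIN branch (buckets [70, 85) of EVERY_N * 100 for VALID and [85, 100) for TEST), is sketched below:

# Possible completions for the VALID and TEST branches (a sketch only).
# Paste each string into the matching elif branch of create_query above.
valid_subsample = """
    AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) >= (EVERY_N * 70)
    AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) < (EVERY_N * 85)
"""

test_subsample = """
    AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) >= (EVERY_N * 85)
    AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), EVERY_N * 100)) < (EVERY_N * 100)
"""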
The for loop below will generate the TRAIN/VALID/TEST sampled subsets of our dataset. Complete the code in the cell below to 1) create the BigQuery query_string using the create_query function you completed above, keeping our original 1/5000th sample of the dataset, and 2) load the BigQuery results of that query_string into a DataFrame named df.
The remaining lines of code write that DataFrame to a csv file with the appropriate naming.
In [ ]:
from google.cloud import bigquery
bq = bigquery.Client(project=PROJECT)
for phase in ["TRAIN", "VALID", "TEST"]:
# 1. Create query string
query_string = # TODO: Your code goes here
# 2. Load results into DataFrame
df = # TODO: Your code goes here
# 3. Write DataFrame to CSV
df.to_csv("taxi-{}.csv".format(phase.lower()), index_label = False, index = False)
print("Wrote {} lines to {}".format(len(df), "taxi-{}.csv".format(phase.lower())))
Note that even with a 1/5000th sample we have a good amount of data for ML: 150K training examples and 30K validation examples.
In [ ]:
!ls -l *.csv
In [ ]:
!head taxi-train.csv
Looks good! We now have our ML datasets and are ready to train ML models, validate them and test them.
Before we start building complex ML models, it is a good idea to come up with a simple rules-based model and use that as a benchmark. After all, there's no point using ML if it can't beat the traditional rules-based approach!
Our rule is going to be to divide the mean fare_amount by the mean estimated distance to come up with a rate, and then use that rate to predict.
Recall we can't use the actual trip_distance because we won't have it available at prediction time (it depends on the route taken). However, we do know the user's pickup and dropoff locations, so we can use the Euclidean distance between those coordinates.
In the code below, we create a rules-based benchmark and measure the Root Mean Squared Error against the label. The function euclidean_distance takes as input a Pandas DataFrame and should measure the straight-line distance between the pickup location and the dropoff location. Complete the code so that the function returns the Euclidean distance between the pickup and dropoff locations.
The compute_rmse function takes the actual (label) values and the predicted values and computes the Root Mean Squared Error between the two. Complete the code below for the compute_rmse function.
In [ ]:
import pandas as pd
def euclidean_distance(df):
return # TODO: Your code goes here
def compute_rmse(actual, predicted):
return # TODO: Your code goes here
def print_rmse(df, rate, name):
print("{} RMSE = {}".format(compute_rmse(df["fare_amount"], rate * euclidean_distance(df)), name))
df_train = pd.read_csv("taxi-train.csv")
df_valid = pd.read_csv("taxi-valid.csv")
rate = df_train["fare_amount"].mean() / euclidean_distance(df_train).mean()
print_rmse(df_train, rate, "Train")
print_rmse(df_valid, rate, "Valid")
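If you want to check your work, one possible completion of the two benchmark functions (a sketch; the distance is measured directly in degrees of longitude/latitude, which is fine for a rough rate-based benchmark) is:

def euclidean_distance(df):
    # Straight-line distance between pickup and dropoff coordinates
    return ((df["pickuplon"] - df["dropofflon"]) ** 2
            + (df["pickuplat"] - df["dropofflat"]) ** 2) ** 0.5

def compute_rmse(actual, predicted):
    # Root Mean Squared Error between the label and the prediction
    return (((actual - predicted) ** 2).mean()) ** 0.5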
The simple distance-based rule gives us an RMSE of $7.70 on the validation dataset. We have to beat this, of course, but you will find that simple rules of thumb like this can be surprisingly difficult to beat.
You don't want to set your benchmark on the test dataset, because you'll be tweaking your hyperparameters and model architecture to get the best validation error. Only then do you evaluate ONCE on the test data.
Let's say that you want to predict whether a Stack Overflow question will be acceptably answered. Using this public dataset of questions, create a machine learning dataset that you can use for classification.
What is a reasonable benchmark for this problem? What features might be useful?
If you got the above easily, try this harder problem: you want to predict whether a question will be acceptably answered within 2 days. How would you create the dataset?
Hint:
You will need to do a SQL join with the table of [answers](https://bigquery.cloud.google.com/table/bigquery-public-data:stackoverflow.posts_answers) to determine whether the question was answered within 2 days.
Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.