Sample, Explore, and Clean Taxifare Dataset

Learning Objectives

  • Practice querying BigQuery
  • Sample from a large dataset in a reproducible way
  • Practice exploring data using Pandas
  • Identify corrupt data and clean accordingly

Introduction

In this notebook, we will explore a dataset corresponding to taxi rides in New York City to build a Machine Learning model that estimates taxi fares. The idea is to suggest a likely fare to taxi riders so that they are not surprised, and so that they can protest if the charge is much higher than expected. Such a model would also be useful for ride-hailing apps that quote you the trip price in advance.

Set up environment variables and load necessary libraries


In [ ]:
PROJECT = "cloud-training-demos"  # Replace with your PROJECT
REGION = "us-central1"            # Choose an available region for Cloud MLE

In [ ]:
import os
os.environ["PROJECT"] = PROJECT
os.environ["REGION"] = REGION

Check that the Google BigQuery library is installed and if not, install it.


In [ ]:
!pip freeze | grep google-cloud-bigquery==1.21.0 || pip install google-cloud-bigquery==1.21.0

In [ ]:
%load_ext google.cloud.bigquery

View data schema and size

Our dataset is hosted in BigQuery: Google's petabyte-scale, SQL-queryable, fully managed cloud data warehouse. It is a publicly available dataset, meaning anyone with a GCP account has access.

  1. Click here to access the dataset.
  2. In the web UI, below the query editor, you will see the schema of the dataset. What fields are available, and what does each mean?
  3. Click the 'details' tab. How big is the dataset?
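
If you prefer to inspect the schema and size programmatically, the sketch below uses the google-cloud-bigquery client installed above (PROJECT is the variable defined at the top of this notebook):


In [ ]:
from google.cloud import bigquery

client = bigquery.Client(project=PROJECT)
table = client.get_table("nyc-tlc.yellow.trips")  # public dataset

# Schema: each field's name, type, and description
for field in table.schema:
    print(field.name, field.field_type, field.description)

# Size and row count
print("Rows: {:,}".format(table.num_rows))
print("Size: {:.1f} GB".format(table.num_bytes / 1e9))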

Preview data

Let's see what a few rows of our data look like. Any cell that starts with %%bigquery will be interpreted as a SQL query that is executed on BigQuery, and the result is printed to our notebook.

BigQuery supports two flavors of SQL syntax: legacy SQL and standard SQL. Standard SQL is preferred because it complies with the official SQL:2011 standard. To instruct BigQuery to interpret our syntax as such, we start the query with #standardSQL.

There are over 1 billion rows in this dataset and it's 130 GB in size, so let's retrieve a small sample. With a sampling rate of 10^-7, we expect the query below to return on the order of 100 rows.


In [ ]:
%%bigquery --project $PROJECT
#standardSQL
SELECT
    *
FROM
    `nyc-tlc.yellow.trips`
WHERE
    RAND() < .0000001 -- sample a small fraction of the data

Preview data (alternate way)

Alternatively, we can use BigQuery's web UI to execute queries.

  1. Open the web UI
  2. Paste the above query minus the %%bigquery part into the Query Editor
  3. Click the 'Run' button or press 'CTRL + ENTER' to execute the query

Query results will be displayed below the Query editor.

Sample data repeatably

There's one issue with using RAND() < N to sample: it's non-deterministic. Each time you run the query above you'll get a different sample.

Since repeatability is key to data science, let's instead use a hash function (which is deterministic by definition) and then sample using the modulo operation on the hashed value.

We obtain our hash values using:

FARM_FINGERPRINT(CAST(hashkey AS STRING))

Working from inside out:

  • CAST(): Casts hashkey to a string because our hash function only works on strings
  • FARM_FINGERPRINT(): Hashes strings to 64-bit integers

The hashkey should be:

  1. Unrelated to the objective
  2. Sufficiently high cardinality

Given these properties we can sample our data repeatably using the modulo operation.

To get a 1% sample:

WHERE ABS(MOD(hashvalue, 100)) = 0

To get a different 1% sample change the remainder condition, for example:

WHERE ABS(MOD(hashvalue, 100)) = 55

To get a 20% sample:

WHERE ABS(MOD(hashvalue, 100)) < 20

Alternatively:

WHERE ABS(MOD(hashvalue, 5)) = 0

And so forth...

We'll use pickup_datetime as our hash key because it meets our desired properties. If such a column doesn't exist in the data you can synthesize a hashkey by concatenating multiple columns.
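
For example, a hedged sketch of a synthesized hashkey using columns from this dataset (the particular combination is illustrative):

FARM_FINGERPRINT(CONCAT(CAST(pickup_datetime AS STRING), CAST(pickup_longitude AS STRING)))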

Below we sample 1/5000th of the data. The syntax is admittedly less elegant than RAND() < N, but now each time you run the query you'll get the same result.

Tech note: Taking the absolute value halves the output range and therefore doubles the chance of hash collisions, but since there are about 2^63 possible absolute hash values and fewer than 2^30 distinct hash keys, the collision risk is negligible: by the birthday bound, the expected number of collisions is roughly (2^30)^2 / (2 × 2^63) = 2^-4, or about 0.06.
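
The determinism is easy to verify in pure Python. Below is a minimal sketch that uses hashlib.md5 in place of FARM_FINGERPRINT (the hash values differ, but the hash-then-modulo sampling principle is identical):


In [ ]:
import hashlib

def in_sample(hashkey, buckets=5000, remainder=1):
    """Deterministic: the same hashkey always lands in the same bucket."""
    h = int(hashlib.md5(str(hashkey).encode("utf-8")).hexdigest(), 16)
    return h % buckets == remainder

# Re-running this cell always prints the same result, unlike RAND()
print(in_sample("2014-01-01 12:34:56"))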


In [ ]:
%%bigquery --project $PROJECT
#standardSQL
SELECT
    *
FROM
    `nyc-tlc.yellow.trips`
WHERE 
    -- repeatable 1/5000th sample
    ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 5000)) = 1

Load sample into Pandas dataframe

The advantage of querying BigQuery directly, as opposed to using the web UI, is that we can supplement SQL analysis with Python analysis. A popular Python library for data analysis on structured data is Pandas, and the primary data structure in Pandas is called a DataFrame.

To store BigQuery results in a Pandas DataFrame we have to query the data with a slightly different syntax.

  1. Import the google.cloud bigquery module
  2. Create a variable called bq which is equal to the BigQuery Client bigquery.Client()
  3. Store the desired SQL query as a Python string
  4. Execute bq.query(query_string).to_dataframe() where query_string is what you created in the previous step

This will take about a minute.

Tip: Use triple quotes for a multi-line string in Python

Tip: You can measure execution time of a cell by starting that cell with %%time


In [ ]:
from google.cloud import bigquery
bq = bigquery.Client(project=PROJECT)

query_string = """
#standardSQL
SELECT
    *
FROM
    `nyc-tlc.yellow.trips`
WHERE
    -- repeatable 1/5000th sample
    ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 5000)) = 1
"""

trips = bq.query(query_string).to_dataframe()

Explore dataframe


In [ ]:
print(type(trips))
trips.head()

The Python variable trips is now a Pandas DataFrame. The .head() function above prints the first 5 rows of a DataFrame.

The rows in the DataFrame may be in a different order than when using %%bigquery, but the data is the same.

It would be useful to understand the distribution of each of our columns, which is to say the mean, min, max, standard deviation, etc.

A DataFrame's .describe() method provides this. By default it only analyzes numeric columns. To include stats about non-numeric columns, use describe(include='all').


In [ ]:
trips.describe()
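
To also get stats on the non-numeric columns (such as pickup_datetime), run the variant mentioned above:


In [ ]:
trips.describe(include='all')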

Distribution analysis

Do you notice anything off about the data? Pay attention to min and max. Latitudes should be between -90 and 90, and longitudes should be between -180 and 180, so clearly some of this data is bad.

Furthermore, some trip fares are negative and some passenger counts are 0, which doesn't seem right. We'll clean this up later.
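
As a quick sanity check, we can count how many rows of our sample violate each constraint. This is a sketch against the trips DataFrame loaded above:


In [ ]:
# Count rows that violate basic sanity constraints
print("bad latitude:   ", ((trips["pickup_latitude"].abs() > 90) |
                           (trips["dropoff_latitude"].abs() > 90)).sum())
print("bad longitude:  ", ((trips["pickup_longitude"].abs() > 180) |
                           (trips["dropoff_longitude"].abs() > 180)).sum())
print("negative fare:  ", (trips["fare_amount"] < 0).sum())
print("zero passengers:", (trips["passenger_count"] == 0).sum())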

Investigate trip distance

Looks like some trip distances are 0 as well, let's investigate this.


In [ ]:
trips[trips["trip_distance"] == 0][:10] # first 10 rows with trip_distance == 0

It appears that trips are being charged substantial fares despite having 0 distance.

Let's graph trip_distance vs fare_amount using the Pandas .plot() method to corroborate.


In [ ]:
%matplotlib inline
trips.plot(x = "trip_distance", y = "fare_amount", kind = "scatter")

It appears that we have a lot of invalid data that is being coded as zero distance and some fare amounts that are definitely illegitimate. Let's remove them from our analysis. We can do this by modifying the BigQuery query to keep only trips longer than zero miles and fare amounts that are at least the minimum cab fare ($2.50).
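
Before changing the query, we can preview the effect of these two rules on the in-memory sample (a sketch; the authoritative cleanup happens in the BigQuery query below):


In [ ]:
# Keep only trips with positive distance and at least the $2.50 minimum fare
mask = (trips["trip_distance"] > 0) & (trips["fare_amount"] >= 2.5)
print("Keeping {} of {} rows".format(mask.sum(), len(trips)))
trips[mask].plot(x="trip_distance", y="fare_amount", kind="scatter")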

Identify correct label

Should we use fare_amount or total_amount as our label? What's the difference?

To make this clear let's look at some trips that included a toll.


In [ ]:
trips[trips["tolls_amount"] > 0][:10] # first 10 rows with toll_amount > 0

Looking at the samples above, we can see that the total amount reflects the fare amount, tolls, and tip somewhat arbitrarily -- this is because when customers pay cash, the tip is not known. In any case, tips are discretionary and should not be included in our fare estimation tool.

So, we'll use the sum of fare_amount + tolls_amount as our label.
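
In Pandas, the equivalent label computation on our in-memory sample looks like the sketch below (the cleaned-up BigQuery query later in this notebook does the same at the source):


In [ ]:
# Label = metered fare plus tolls; tips are excluded because they're discretionary
trips["label"] = trips["fare_amount"] + trips["tolls_amount"]
trips[["fare_amount", "tolls_amount", "label"]].head()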

Select useful fields

What fields do you see that may be useful in modeling taxifare? They should be

  1. Related to the objective
  2. Available at prediction time

Related to the objective

For example, we know passenger_count shouldn't have any effect on fare because fare is calculated by time and distance. It's best to eliminate it to reduce the amount of noise in the data and make the job of the ML algorithm easier.

If you're not sure whether a column is related to the objective, err on the side of keeping it and let the ML algorithm figure out whether it's useful or not.

Available at prediction time

For example, trip_distance is certainly related to the objective, but we can't know its value until a trip is completed (it depends on the route taken), so it can't be used for prediction.

We will use the following fields:

pickup_datetime, pickup_longitude, pickup_latitude, dropoff_longitude, and dropoff_latitude.

Clean the data

We need to do some clean-up of the data:

  • Filter to latitudes and longitudes that are reasonable for NYC
  • Fare amounts should be at least $2.50, the minimum cab fare
  • Trip distances and passenger counts should be non-zero
  • Have the label reflect the sum of fare_amount and tolls_amount

Let's change the BigQuery query appropriately, and only return the fields we'll use in our model.


In [ ]:
%%bigquery --project $PROJECT
#standardSQL
SELECT
    (tolls_amount + fare_amount) AS fare_amount, -- label
    pickup_datetime,
    pickup_longitude, 
    pickup_latitude, 
    dropoff_longitude, 
    dropoff_latitude
FROM
    `nyc-tlc.yellow.trips`
WHERE
    -- Clean Data
    trip_distance > 0
    AND passenger_count > 0
    AND fare_amount >= 2.5
    AND pickup_longitude > -78
    AND pickup_longitude < -70
    AND dropoff_longitude > -78
    AND dropoff_longitude < -70
    AND pickup_latitude > 37
    AND pickup_latitude < 45
    AND dropoff_latitude > 37
    AND dropoff_latitude < 45
    -- repeatable 1/5000th sample
    AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 5000)) = 1

We now have a repeatable and clean sample we can use for modeling taxi fares.

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.