Learning Objectives
In this notebook, we will explore a dataset corresponding to taxi rides in New York City to build a Machine Learning model that estimates taxi fares. The idea is to suggest a likely fare to taxi riders so that they are not surprised, and so that they can protest if the charge is much higher than expected. Such a model would also be useful for ride-hailing apps that quote you the trip price in advance.
In [ ]:
PROJECT = "cloud-training-demos" # Replace with your PROJECT
REGION = "us-central1" # Choose an available region for Cloud MLE
In [ ]:
import os
os.environ["PROJECT"] = PROJECT
os.environ["REGION"] = REGION
Check that the Google BigQuery library is installed and if not, install it.
In [ ]:
!pip freeze | grep google-cloud-bigquery==1.21.0 || pip install google-cloud-bigquery==1.21.0
In [ ]:
%load_ext google.cloud.bigquery
Our dataset is hosted in BigQuery: Google's petabyte scale, SQL queryable, fully managed cloud data warehouse. It is a publically available dataset, meaning anyone with a GCP account has access.
Let's see what a few rows of our data looks like. Any cell that starts with %%bigquery
will be interpreted as a SQL query that is executed on BigQuery, and the result is printed to our notebook.
BigQuery supports two flavors of SQL syntax: legacy SQL and standard SQL. The preferred is standard SQL because it complies with the official SQL:2011 standard. To instruct BigQuery to interpret our syntax as such we start the query with #standardSQL
.
There are over 1 Billion rows in this dataset and it's 130GB large, so let's retrieve a small sample
In [ ]:
%%bigquery --project $PROJECT
#standardSQL
SELECT
*
FROM
`nyc-tlc.yellow.trips`
WHERE
RAND() < .0000001 -- sample a small fraction of the data
Alternatively we can use BigQuery's web UI to execute queries.
%%bigquery
part into the Query EditorQuery results will be displayed below the Query editor.
There's one issue with using RAND() < N
to sample. It's non-deterministic. Each time you run the query above you'll get a different sample.
Since repeatability is key to data science, let's instead use a hash function (which is deterministic by definition) and then sample the using the modulo operation on the hashed value.
We obtain our hash values using:
FARM_FINGERPRINT(CAST(hashkey AS STRING))
Working from inside out:
CAST()
: Casts hashkey to string because our hash function only works on stringsFARM_FINGERPRINT()
: Hashes strings to 64bit integersThe hashkey
should be:
Given these properties we can sample our data repeatably using the modulo operation.
To get a 1% sample:
WHERE ABS(MOD(hashvalue, 100)) = 0
To get a different 1% sample change the remainder condition, for example:
WHERE ABS(MOD(hashvalue, 100)) = 55
To get a 20% sample:
WHERE ABS(MOD(hashvalue, 100)) < 20
Alternatively: WHERE ABS(MOD(hashvalue, 5)) = 0
And so forth...
We'll use pickup_datetime
as our hash key because it meets our desired properties. If such a column doesn't exist in the data you can synthesize a hashkey by concatenating multiple columns.
Below we sample 1/5000th of the data. The syntax is admittedly less elegant than RAND() < N
, but now each time you run the query you'll get the same result.
*Tech note: Taking absolute value doubles the chances of hash collisions but since there are 2^64 possible hash values and less than 2^30 hash keys the collision risk is negligable.
In [ ]:
%%bigquery --project $PROJECT
#standardSQL
SELECT
*
FROM
`nyc-tlc.yellow.trips`
WHERE
-- repeatable 1/5000th sample
ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 5000)) = 1
The advantage of querying BigQuery directly as opposed to the web UI is that we can supplement SQL analysis with Python analysis. A popular Python library for data analysis on structured data is Pandas, and the primary data strucure in Pandas is called a DataFrame.
To store BigQuery results in a Pandas DataFrame we have have to query the data with a slightly differently syntax.
google.cloud
bigquery
modulebq
which is equal to the BigQuery Client bigquery.Client()
bq.query(query_string).to_dataframe()
where query_string
is what you created in the previous stepThis will take about a minute
Tip: Use triple quotes for a multi-line string in Python
Tip: You can measure execution time of a cell by starting that cell with %%time
In [ ]:
from google.cloud import bigquery
bq = bigquery.Client(project=PROJECT)
query_string = """
#standardSQL
SELECT
*
FROM
`nyc-tlc.yellow.trips`
Where
-- repeatable 1/5000th sample
ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 5000)) = 1
"""
trips = bq.query(query_string).to_dataframe()
In [ ]:
print(type(trips))
trips.head()
The Python variable trips
is now a Pandas DataFrame. The .head()
function above prints the first 5 rows of a DataFrame.
The rows in the DataFrame may be in a different order than when using %%bigquery
, but the data is the same.
It would be useful to understand the distribution of each of our columns, which is to say the mean, min, max, standard deviation etc..
A DataFrame's .describe()
method provides this. By default it only analyzes numeric columns. To include stats about non-numeric column use describe(include='all')
.
In [ ]:
trips.describe()
Do you notice anything off about the data? Pay attention to min
and max
. Latitudes should be between -90 and 90, and longitudes should be between -180 and 180, so clearly some of this data is bad.
Further more some trip fares are negative and some passenger counts are 0 which doesn't seem right. We'll clean this up later.
In [ ]:
trips[trips["trip_distance"] == 0][:10] # first 10 rows with trip_distance == 0
It appears that trips are being charged substantial fares despite having 0 distance.
Let's graph trip_distance
vs fare_amount
using the Pandas .plot()
method to corroborate.
In [ ]:
%matplotlib inline
trips.plot(x = "trip_distance", y = "fare_amount", kind = "scatter")
It appears that we have a lot of invalid data that is being coded as zero distance and some fare amounts that are definitely illegitimate. Let's remove them from our analysis. We can do this by modifying the BigQuery query to keep only trips longer than zero miles and fare amounts that are at least the minimum cab fare ($2.50).
In [ ]:
trips[trips["tolls_amount"] > 0][:10] # first 10 rows with toll_amount > 0
Looking at the samples above, we can see that the total amount reflects fare amount, toll and tip somewhat arbitrarily -- this is because when customers pay cash, the tip is not known. In any case tips are discretionary and shoud not be included in our fare estimation tool.
So, we'll use the sum of fare_amount
+ tolls_amount
as our label.
What fields do you see that may be useful in modeling taxifare? They should be
Related to the objective
For example we know passenger_count
shouldn't have any affect on fare because fare is calculated by time and distance. Best to eliminate it to reduce the amount of noise in the data and make the job of the ML algorithm easier.
If you're not sure whether a column is related to the objective, err on the side of keeping it and let the ML algorithm figure out whether it's useful or not.
Available at prediction time
For example trip_distance
is certainly related to the objective, but we can't know the value until a trip is completed (depends on the route taken), so it can't be used for prediction.
We will use the following
pickup_datetime
, pickup_longitude
, pickup_latitude
, dropoff_longitude
, and dropoff_latitude
.
We need to do some clean-up of the data:
Let's change the BigQuery query appropriately, and only return the fields we'll use in our model.
In [ ]:
%%bigquery --project $PROJECT
#standardSQL
SELECT
(tolls_amount + fare_amount) AS fare_amount, -- label
pickup_datetime,
pickup_longitude,
pickup_latitude,
dropoff_longitude,
dropoff_latitude
FROM
`nyc-tlc.yellow.trips`
WHERE
-- Clean Data
trip_distance > 0
AND passenger_count > 0
AND fare_amount >= 2.5
AND pickup_longitude > -78
AND pickup_longitude < -70
AND dropoff_longitude > -78
AND dropoff_longitude < -70
AND pickup_latitude > 37
AND pickup_latitude < 45
AND dropoff_latitude > 37
AND dropoff_latitude < 45
-- repeatable 1/5000th sample
AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 5000)) = 1
We now have a repeatable and clean sample we can use for modeling taxi fares.
Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.