Datalab Tutorial

In this tutorial, we'll do some exploratory data analysis in BigQuery using Datalab.

Requirements

  1. If you haven't already, you may sign-up for the free GCP trial credit. Before you begin, give this project any name you like and enable the BigQuery API.
  2. Create a Datalab instance.

NYC Yellow Taxi Data

We'll analyze BigQuery's public dataset on the NYC yellow taxi ride. BigQuery supports both standard and legacy SQL, which are demonstrated in this tutorial.


In [3]:
%sql -d standard
SELECT
  *
FROM
  `nyc-tlc.yellow.trips`
LIMIT
  5


Out[3]:
vendor_idpickup_datetimedropoff_datetimepickup_longitudepickup_latitudedropoff_longitudedropoff_latituderate_codepassenger_counttrip_distancepayment_typefare_amountextramta_taximp_surchargetip_amounttolls_amounttotal_amountstore_and_fwd_flag
VTS2009-02-21 11:18:002009-02-21 11:19:00-73.90702240.751902-73.90673540.751867 20.06CSH2.50.0  0.00.02.5 
VTS2009-11-02 22:28:002009-11-02 22:28:00-73.80835240.688103-73.80909340.689398 20.14CSH2.50.50.5 0.00.03.5 
VTS2011-03-24 23:12:002011-03-24 23:12:00-73.96763740.761357-73.96775740.761223110.0CSH2.50.50.5 0.00.03.5 
VTS2010-06-11 05:20:002010-06-11 05:20:00-73.78624840.644517-73.78651340.644857110.07CAS2.50.50.5 0.00.03.5 
VTS2013-11-22 07:04:002013-11-22 07:05:00-73.95452240.78734-73.95439340.787322110.01CSH2.50.00.5 0.00.03.0 

(rows: 5, time: 1.5s, cached, job: job_soX2fvl6-OhWk1bPqaaImUxHG2nu)

Let's look at the table schema:


In [4]:
%bigquery schema --table nyc-tlc:yellow.trips


Out[4]:

1. What is the most common pick-up time?


In [8]:
%%bq query -n pickup_time
WITH subquery AS (
  SELECT
    EXTRACT(HOUR FROM pickup_datetime) AS hour
  FROM
    `nyc-tlc.yellow.trips`)
SELECT
  Hour,
  COUNT(Hour) AS count
FROM
  subquery
GROUP BY
  Hour
ORDER BY
  count DESC

Let's name this query result pickup_time and reference it to create the chart below.


In [9]:
# Let's visualize the pick-up time distribution
%chart columns --data pickup_time


Out[9]:

7:00 PM is the most common pick-up time.

2. Give the vendor distribution

The above queries were all standard SQL. This is an example of how legacy SQL can be executed in Datalab.


In [10]:
%%sql -d legacy -m vendor
SELECT
  TOP(vendor_id) AS vendor,
  COUNT(*) AS count
FROM
  [nyc-tlc:yellow.trips]

Let's label this query result vendor and reference it to create the following pie chart.


In [11]:
%chart pie --data vendor


Out[11]:

3. Provide summary statistics on trip distance


In [12]:
%%sql -d legacy
SELECT
  QUANTILES(trip_distance, 5) AS quantile,
  MIN(trip_distance) AS min,
  MAX(trip_distance) AS max,
  AVG(trip_distance) AS avg,
  STDDEV(trip_distance) AS std_dev
FROM
  [nyc-tlc:yellow.trips]


Out[12]:
quantileminmaxavgstd_dev
-40840124.4-40840124.459016609.35.108387425095114.54193218
0.9-40840124.459016609.35.108387425095114.54193218
1.53-40840124.459016609.35.108387425095114.54193218
2.7-40840124.459016609.35.108387425095114.54193218
59016609.3-40840124.459016609.35.108387425095114.54193218

(rows: 5, time: 6.8s, 8GB processed, job: job_6g7cBuvtCm4Xu-w9npXEjcyYVr7n)

Datalab also supports LaTeX rendering. The min distance is $-4.08\times10^7$ miles (interesting!), $Q_1$ is 0.9 miles and $Q_3$ is 2.7 miles. The trip distance is skewed to the right since the mean is greater than the median (1.54 miles).

4. Let's plot the pickup location


In [13]:
%%bq query -n pickup_location
SELECT
  pickup_latitude,
  pickup_longitude
FROM
  `nyc-tlc.yellow.trips`
LIMIT
  10

In [25]:
%%chart map --data pickup_location


Out[25]:

4. Could distance and fare amount explain the payment disputes for rides from the JFK airport?


In [28]:
%%bq query -n dispute
SELECT
  trip_distance,
  fare_amount
FROM
  `nyc-tlc.yellow.trips`
WHERE
  rate_code = "2"
  AND payment_type = "DIS"

In [29]:
%%chart scatter --data dispute
height: 400
hAxis:
  title: Distance
vAxis:
  title: Fare Amount
trendlines:
  0:
    type: line
    color: green
    showR2: true
    visibleInLegend: true


Out[29]:

There seems to be a weak positive relationship ($r = +\sqrt{r^2} = 0.145$) between the trip distance and the fare amount for taxis that picked up rides from the airport and had payment disputes.

How can you share your notebook?

  1. To download your notebook, go to Notebook > Download in Datalab.
  2. To push your notebook to your GitHub repo, type the usual git commands in a cell precendented with an exclamation mark, like so:

In [ ]:
!git add * 
!git commit -m "your message"
!git push

Clean-up

Delete your Datalab VM instance to avoid incurring charges to your account.