Example of using BigQuery and Pandas together

This notebook illustrates a typical data science workflow that leverages the relative advantages of both Pandas and BigQuery:

  • Pandas for transformations and graphics.
  • BigQuery for scale

Install library and extensions if needed

On Notebook instances on Google Cloud, the BigQuery client library is already installed.


In [1]:
#!python -m pip install google-cloud-bigquery
%matplotlib inline
#%load_ext google.cloud.bigquery

In [2]:
PROJECT='cloud-training-demos'  # CHANGE THIS

Problem

We believe that if someone rents a bicycle for less than 10 minutes and returns the bicycle to the same station that they rented it at, it is likely that the bicycle has a problem. We'll call this a "bad" trip.

We want to send a crew out to examine a few of the stations that had lots of bad trips to see if there are any systemic problems.

Find stations with problem bikes

As a first step, let's find which stations had the most "bad" trips in 2015


In [3]:
%%bigquery badtrips --project $PROJECT
SELECT *, bad_trips / num_trips AS fraction_bad FROM (
    SELECT 
      start_station_name
      , SUM(IF(duration < 600 AND start_station_name = end_station_name, 1, 0)) AS bad_trips
      , COUNT(start_station_name) as num_trips
    FROM `bigquery-public-data`.london_bicycles.cycle_hire
    WHERE EXTRACT(YEAR FROM start_date) = 2015
    GROUP BY start_station_name
    HAVING num_trips > 10
)
ORDER BY fraction_bad DESC

Statistics

Let's look at the statistics of the dataset to learn what kind of threshold of the fraction_bad is appropriate


In [4]:
badtrips.describe()


Out[4]:
bad_trips num_trips fraction_bad
count 823.000000 823.000000 823.000000
mean 75.074119 11869.755772 0.007636
std 70.512207 9906.268656 0.014739
min 0.000000 11.000000 0.000000
25% 41.000000 5903.000000 0.005002
50% 62.000000 9998.000000 0.006368
75% 91.500000 14852.500000 0.008383
max 967.000000 95740.000000 0.416667

Looks like the fraction_bad ranges from 0 to 0.4, but is not clear how relevant this ratio is because the stations also vary quite dramatically. For example, the number of trips ranges from 11 to 95740. A 0.4 fraction_bad is not terrible if num_trips=11.

Let's look at a scatter plot to see if there is any clear trend here.


In [5]:
badtrips.plot.scatter('num_trips', 'fraction_bad');


Let's zoom in a bit and add a line of best fit


In [6]:
import seaborn as sns
ax = sns.regplot(badtrips['num_trips'],badtrips['fraction_bad']);
ax.set_ylim(0, 0.05);


Indeed, we see that higher fraction_bad are associated with lower num_trips. We need to use the fraction_bad carefully.

Banding

Let's pick the 5 worst of the really busy stations, 5 of the next most busy, etc. We can do this by creating 4 different bands from the quantile of the station by num_trips, and within each band, finding the 5 stations that are most bad.


In [7]:
stations_to_examine = []

import pandas as pd
for band in range(1,5):
  min_trips = badtrips['num_trips'].quantile(0.2*(band))
  max_trips = badtrips['num_trips'].quantile(0.2*(band+1))
  query = 'num_trips >= {} and num_trips < {}'.format(min_trips, max_trips)
  print(query) # band
  stations = badtrips.query(query)
  stations = stations.sort_values(by=['fraction_bad'], ascending=False)[:5] 
  print(stations) # 5 worst
  stations_to_examine.append(stations)
  print()


num_trips >= 4826.4 and num_trips < 8511.8
                 start_station_name  bad_trips  num_trips  fraction_bad
6        River Street , Clerkenwell        221       8279      0.026694
9   Courland Grove, Wandsworth Road        105       5369      0.019557
10         Stanley Grove, Battersea         92       4882      0.018845
12              Southern Grove, Bow        112       6152      0.018205
18    Richmond Way, Shepherd's Bush        126       8149      0.015462

num_trips >= 8511.8 and num_trips < 11502.6
                  start_station_name  bad_trips  num_trips  fraction_bad
3              Ormonde Gate, Chelsea        315       8932      0.035266
41  Emperor's Gate, South Kensington        136      10554      0.012886
53     Chepstow Villas, Notting Hill        135      11038      0.012230
57            Salmon Lane, Limehouse        113       9345      0.012092
63       Greyhound Road, Hammersmith        118       9998      0.011802

num_trips >= 11502.6 and num_trips < 16509.2
                           start_station_name  bad_trips  num_trips  \
35            Hertford Road, De Beauvoir Town        195      14188   
39              Bramham Gardens, Earl's Court        212      16178   
62                 Lancaster Gate , Bayswater        172      14554   
75               Ilchester Gardens, Bayswater        159      14137   
80  Westfield Library Corner, Shepherd's Bush        126      11517   

    fraction_bad  
35      0.013744  
39      0.013104  
62      0.011818  
75      0.011247  
80      0.010940  

num_trips >= 16509.2 and num_trips < 95740.0
                  start_station_name  bad_trips  num_trips  fraction_bad
25  Queen's Gate, Kensington Gardens        396      27457      0.014423
74     Speakers' Corner 2, Hyde Park        468      41107      0.011385
76        Cumberland Gate, Hyde Park        303      26981      0.011230
77            Albert Gate, Hyde Park        729      66547      0.010955
82      Triangle Car Park, Hyde Park        454      41675      0.010894

Notice how, by banding, we are able to use lower thresholds for the busier stations. Had we chosen a single threshold, our crew would have not get to look at a diverse set of stations.

Stations to examine


In [8]:
stations_to_examine = pd.concat(stations_to_examine)
stations_to_examine


Out[8]:
start_station_name bad_trips num_trips fraction_bad
6 River Street , Clerkenwell 221 8279 0.026694
9 Courland Grove, Wandsworth Road 105 5369 0.019557
10 Stanley Grove, Battersea 92 4882 0.018845
12 Southern Grove, Bow 112 6152 0.018205
18 Richmond Way, Shepherd's Bush 126 8149 0.015462
3 Ormonde Gate, Chelsea 315 8932 0.035266
41 Emperor's Gate, South Kensington 136 10554 0.012886
53 Chepstow Villas, Notting Hill 135 11038 0.012230
57 Salmon Lane, Limehouse 113 9345 0.012092
63 Greyhound Road, Hammersmith 118 9998 0.011802
35 Hertford Road, De Beauvoir Town 195 14188 0.013744
39 Bramham Gardens, Earl's Court 212 16178 0.013104
62 Lancaster Gate , Bayswater 172 14554 0.011818
75 Ilchester Gardens, Bayswater 159 14137 0.011247
80 Westfield Library Corner, Shepherd's Bush 126 11517 0.010940
25 Queen's Gate, Kensington Gardens 396 27457 0.014423
74 Speakers' Corner 2, Hyde Park 468 41107 0.011385
76 Cumberland Gate, Hyde Park 303 26981 0.011230
77 Albert Gate, Hyde Park 729 66547 0.010955
82 Triangle Car Park, Hyde Park 454 41675 0.010894

Write the stations to a new table in BigQuery


In [ ]:
!pip install pyarrow

In [11]:
from google.cloud import bigquery
bq = bigquery.Client(project=PROJECT)

table_id = '{}.ch05eu.bad_bikes'.format(PROJECT)
job = bq.load_table_from_dataframe(stations_to_examine, table_id)
job.result() # blocks and waits
print("Loaded {} rows into {}".format(job.output_rows, table_id))


Loaded 20 rows into swast-scratch.ch05eu.bad_bikes

Plot map for crew

Add latitude and longitude through a join


In [12]:
%%bigquery stations_to_examine --project $PROJECT
SELECT 
  start_station_name AS station_name
  , num_trips
  , fraction_bad
  , latitude
  , longitude
FROM ch05eu.bad_bikes AS bad
JOIN `bigquery-public-data`.london_bicycles.cycle_stations AS s
ON bad.start_station_name = s.name

In [13]:
!pip install folium


Collecting folium
  Downloading https://files.pythonhosted.org/packages/47/28/b3199bf87100e389c1dff88a44a38936d27e5e99eece870b5308186217c8/folium-0.8.2-py2.py3-none-any.whl (87kB)
    100% |████████████████████████████████| 92kB 1.4MB/s ta 0:00:01
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/63/36/1c93318e9653f4e414a2e0c3b98fc898b4970e939afeedeee6075dd3b703/branca-0.3.1-py3-none-any.whl
Requirement already satisfied: jinja2 in /Users/swast/.pyenv/versions/3.7.2/envs/bigquery-oreilly-book/lib/python3.7/site-packages (from folium) (2.10)
Requirement already satisfied: requests in /Users/swast/.pyenv/versions/3.7.2/envs/bigquery-oreilly-book/lib/python3.7/site-packages (from folium) (2.21.0)
Requirement already satisfied: numpy in /Users/swast/.pyenv/versions/3.7.2/envs/bigquery-oreilly-book/lib/python3.7/site-packages (from folium) (1.16.2)
Requirement already satisfied: six in /Users/swast/.pyenv/versions/3.7.2/envs/bigquery-oreilly-book/lib/python3.7/site-packages (from folium) (1.12.0)
Requirement already satisfied: MarkupSafe>=0.23 in /Users/swast/.pyenv/versions/3.7.2/envs/bigquery-oreilly-book/lib/python3.7/site-packages (from jinja2->folium) (1.1.1)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Users/swast/.pyenv/versions/3.7.2/envs/bigquery-oreilly-book/lib/python3.7/site-packages (from requests->folium) (3.0.4)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /Users/swast/.pyenv/versions/3.7.2/envs/bigquery-oreilly-book/lib/python3.7/site-packages (from requests->folium) (1.24.1)
Requirement already satisfied: idna<2.9,>=2.5 in /Users/swast/.pyenv/versions/3.7.2/envs/bigquery-oreilly-book/lib/python3.7/site-packages (from requests->folium) (2.8)
Requirement already satisfied: certifi>=2017.4.17 in /Users/swast/.pyenv/versions/3.7.2/envs/bigquery-oreilly-book/lib/python3.7/site-packages (from requests->folium) (2018.11.29)
Installing collected packages: branca, folium
Successfully installed branca-0.3.1 folium-0.8.2
You are using pip version 18.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

In [14]:
import folium
map_pts = folium.Map(location=[51.5, -0.15], zoom_start=12)
for idx, row in stations_to_examine.iterrows():
  folium.Marker( location=[row['latitude'], row['longitude']], popup=row['station_name'] ).add_to(map_pts)
map_pts


Out[14]:

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License