Example of using BigQuery and Pandas together

This notebook illustrates a typical data science workflow that leverages the relative advantages of both Pandas and BigQuery:

Pandas for transformations and graphics.
BigQuery for scale

Install library and extensions if needed

On Notebook instances on Google Cloud, the BigQuery client library is already installed.



In [1]:

    
#!python -m pip install google-cloud-bigquery
%matplotlib inline
#%load_ext google.cloud.bigquery



In [2]:

    
PROJECT='cloud-training-demos'  # CHANGE THIS

Problem

We believe that if someone rents a bicycle for less than 10 minutes and returns the bicycle to the same station that they rented it at, it is likely that the bicycle has a problem. We'll call this a "bad" trip.

We want to send a crew out to examine a few of the stations that had lots of bad trips to see if there are any systemic problems.

Find stations with problem bikes

As a first step, let's find which stations had the most "bad" trips in 2015



In [3]:

    
%%bigquery badtrips --project $PROJECT
SELECT *, bad_trips / num_trips AS fraction_bad FROM (
    SELECT 
      start_station_name
      , SUM(IF(duration < 600 AND start_station_name = end_station_name, 1, 0)) AS bad_trips
      , COUNT(start_station_name) as num_trips
    FROM `bigquery-public-data`.london_bicycles.cycle_hire
    WHERE EXTRACT(YEAR FROM start_date) = 2015
    GROUP BY start_station_name
    HAVING num_trips > 10
)
ORDER BY fraction_bad DESC

Statistics

Let's look at the statistics of the dataset to learn what kind of threshold of the fraction_bad is appropriate



In [4]:

    
badtrips.describe()









    Out[4]:







  
    
      
      bad_trips
      num_trips
      fraction_bad
    
  
  
    
      count
      823.000000
      823.000000
      823.000000
    
    
      mean
      75.074119
      11869.755772
      0.007636
    
    
      std
      70.512207
      9906.268656
      0.014739
    
    
      min
      0.000000
      11.000000
      0.000000
    
    
      25%
      41.000000
      5903.000000
      0.005002
    
    
      50%
      62.000000
      9998.000000
      0.006368
    
    
      75%
      91.500000
      14852.500000
      0.008383
    
    
      max
      967.000000
      95740.000000
      0.416667

Looks like the fraction_bad ranges from 0 to 0.4, but is not clear how relevant this ratio is because the stations also vary quite dramatically. For example, the number of trips ranges from 11 to 95740. A 0.4 fraction_bad is not terrible if num_trips=11.

Let's look at a scatter plot to see if there is any clear trend here.



In [5]:

    
badtrips.plot.scatter('num_trips', 'fraction_bad');

Let's zoom in a bit and add a line of best fit



In [6]:

    
import seaborn as sns
ax = sns.regplot(badtrips['num_trips'],badtrips['fraction_bad']);
ax.set_ylim(0, 0.05);

Indeed, we see that higher fraction_bad are associated with lower num_trips. We need to use the fraction_bad carefully.

Banding

Let's pick the 5 worst of the really busy stations, 5 of the next most busy, etc. We can do this by creating 4 different bands from the quantile of the station by num_trips, and within each band, finding the 5 stations that are most bad.



In [7]:

    
stations_to_examine = []

import pandas as pd
for band in range(1,5):
  min_trips = badtrips['num_trips'].quantile(0.2*(band))
  max_trips = badtrips['num_trips'].quantile(0.2*(band+1))
  query = 'num_trips >= {} and num_trips < {}'.format(min_trips, max_trips)
  print(query) # band
  stations = badtrips.query(query)
  stations = stations.sort_values(by=['fraction_bad'], ascending=False)[:5] 
  print(stations) # 5 worst
  stations_to_examine.append(stations)
  print()









    



num_trips >= 4826.4 and num_trips < 8511.8
                 start_station_name  bad_trips  num_trips  fraction_bad
6        River Street , Clerkenwell        221       8279      0.026694
9   Courland Grove, Wandsworth Road        105       5369      0.019557
10         Stanley Grove, Battersea         92       4882      0.018845
12              Southern Grove, Bow        112       6152      0.018205
18    Richmond Way, Shepherd's Bush        126       8149      0.015462

num_trips >= 8511.8 and num_trips < 11502.6
                  start_station_name  bad_trips  num_trips  fraction_bad
3              Ormonde Gate, Chelsea        315       8932      0.035266
41  Emperor's Gate, South Kensington        136      10554      0.012886
53     Chepstow Villas, Notting Hill        135      11038      0.012230
57            Salmon Lane, Limehouse        113       9345      0.012092
63       Greyhound Road, Hammersmith        118       9998      0.011802

num_trips >= 11502.6 and num_trips < 16509.2
                           start_station_name  bad_trips  num_trips  \
35            Hertford Road, De Beauvoir Town        195      14188   
39              Bramham Gardens, Earl's Court        212      16178   
62                 Lancaster Gate , Bayswater        172      14554   
75               Ilchester Gardens, Bayswater        159      14137   
80  Westfield Library Corner, Shepherd's Bush        126      11517   

    fraction_bad  
35      0.013744  
39      0.013104  
62      0.011818  
75      0.011247  
80      0.010940  

num_trips >= 16509.2 and num_trips < 95740.0
                  start_station_name  bad_trips  num_trips  fraction_bad
25  Queen's Gate, Kensington Gardens        396      27457      0.014423
74     Speakers' Corner 2, Hyde Park        468      41107      0.011385
76        Cumberland Gate, Hyde Park        303      26981      0.011230
77            Albert Gate, Hyde Park        729      66547      0.010955
82      Triangle Car Park, Hyde Park        454      41675      0.010894

Notice how, by banding, we are able to use lower thresholds for the busier stations. Had we chosen a single threshold, our crew would have not get to look at a diverse set of stations.

Stations to examine



In [8]:

    
stations_to_examine = pd.concat(stations_to_examine)
stations_to_examine









    Out[8]:







  
    
      
      start_station_name
      bad_trips
      num_trips
      fraction_bad
    
  
  
    
      6
      River Street , Clerkenwell
      221
      8279
      0.026694
    
    
      9
      Courland Grove, Wandsworth Road
      105
      5369
      0.019557
    
    
      10
      Stanley Grove, Battersea
      92
      4882
      0.018845
    
    
      12
      Southern Grove, Bow
      112
      6152
      0.018205
    
    
      18
      Richmond Way, Shepherd's Bush
      126
      8149
      0.015462
    
    
      3
      Ormonde Gate, Chelsea
      315
      8932
      0.035266
    
    
      41
      Emperor's Gate, South Kensington
      136
      10554
      0.012886
    
    
      53
      Chepstow Villas, Notting Hill
      135
      11038
      0.012230
    
    
      57
      Salmon Lane, Limehouse
      113
      9345
      0.012092
    
    
      63
      Greyhound Road, Hammersmith
      118
      9998
      0.011802
    
    
      35
      Hertford Road, De Beauvoir Town
      195
      14188
      0.013744
    
    
      39
      Bramham Gardens, Earl's Court
      212
      16178
      0.013104
    
    
      62
      Lancaster Gate , Bayswater
      172
      14554
      0.011818
    
    
      75
      Ilchester Gardens, Bayswater
      159
      14137
      0.011247
    
    
      80
      Westfield Library Corner, Shepherd's Bush
      126
      11517
      0.010940
    
    
      25
      Queen's Gate, Kensington Gardens
      396
      27457
      0.014423
    
    
      74
      Speakers' Corner 2, Hyde Park
      468
      41107
      0.011385
    
    
      76
      Cumberland Gate, Hyde Park
      303
      26981
      0.011230
    
    
      77
      Albert Gate, Hyde Park
      729
      66547
      0.010955
    
    
      82
      Triangle Car Park, Hyde Park
      454
      41675
      0.010894

Write the stations to a new table in BigQuery



In [ ]:

    
!pip install pyarrow



In [11]:

    
from google.cloud import bigquery
bq = bigquery.Client(project=PROJECT)

table_id = '{}.ch05eu.bad_bikes'.format(PROJECT)
job = bq.load_table_from_dataframe(stations_to_examine, table_id)
job.result() # blocks and waits
print("Loaded {} rows into {}".format(job.output_rows, table_id))









    



Loaded 20 rows into swast-scratch.ch05eu.bad_bikes

Plot map for crew

Add latitude and longitude through a join



In [12]:

    
%%bigquery stations_to_examine --project $PROJECT
SELECT 
  start_station_name AS station_name
  , num_trips
  , fraction_bad
  , latitude
  , longitude
FROM ch05eu.bad_bikes AS bad
JOIN `bigquery-public-data`.london_bicycles.cycle_stations AS s
ON bad.start_station_name = s.name



In [13]:

    
!pip install folium









    



Collecting folium
  Downloading https://files.pythonhosted.org/packages/47/28/b3199bf87100e389c1dff88a44a38936d27e5e99eece870b5308186217c8/folium-0.8.2-py2.py3-none-any.whl (87kB)
    100% |████████████████████████████████| 92kB 1.4MB/s ta 0:00:01
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/63/36/1c93318e9653f4e414a2e0c3b98fc898b4970e939afeedeee6075dd3b703/branca-0.3.1-py3-none-any.whl
Requirement already satisfied: jinja2 in /Users/swast/.pyenv/versions/3.7.2/envs/bigquery-oreilly-book/lib/python3.7/site-packages (from folium) (2.10)
Requirement already satisfied: requests in /Users/swast/.pyenv/versions/3.7.2/envs/bigquery-oreilly-book/lib/python3.7/site-packages (from folium) (2.21.0)
Requirement already satisfied: numpy in /Users/swast/.pyenv/versions/3.7.2/envs/bigquery-oreilly-book/lib/python3.7/site-packages (from folium) (1.16.2)
Requirement already satisfied: six in /Users/swast/.pyenv/versions/3.7.2/envs/bigquery-oreilly-book/lib/python3.7/site-packages (from folium) (1.12.0)
Requirement already satisfied: MarkupSafe>=0.23 in /Users/swast/.pyenv/versions/3.7.2/envs/bigquery-oreilly-book/lib/python3.7/site-packages (from jinja2->folium) (1.1.1)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Users/swast/.pyenv/versions/3.7.2/envs/bigquery-oreilly-book/lib/python3.7/site-packages (from requests->folium) (3.0.4)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /Users/swast/.pyenv/versions/3.7.2/envs/bigquery-oreilly-book/lib/python3.7/site-packages (from requests->folium) (1.24.1)
Requirement already satisfied: idna<2.9,>=2.5 in /Users/swast/.pyenv/versions/3.7.2/envs/bigquery-oreilly-book/lib/python3.7/site-packages (from requests->folium) (2.8)
Requirement already satisfied: certifi>=2017.4.17 in /Users/swast/.pyenv/versions/3.7.2/envs/bigquery-oreilly-book/lib/python3.7/site-packages (from requests->folium) (2018.11.29)
Installing collected packages: branca, folium
Successfully installed branca-0.3.1 folium-0.8.2
You are using pip version 18.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.



In [14]:

    
import folium
map_pts = folium.Map(location=[51.5, -0.15], zoom_start=12)
for idx, row in stations_to_examine.iterrows():
  folium.Marker( location=[row['latitude'], row['longitude']], popup=row['station_name'] ).add_to(map_pts)
map_pts









    Out[14]:

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License

	bad_trips	num_trips	fraction_bad
count	823.000000	823.000000	823.000000
mean	75.074119	11869.755772	0.007636
std	70.512207	9906.268656	0.014739
min	0.000000	11.000000	0.000000
25%	41.000000	5903.000000	0.005002
50%	62.000000	9998.000000	0.006368
75%	91.500000	14852.500000	0.008383
max	967.000000	95740.000000	0.416667

	start_station_name	bad_trips	num_trips	fraction_bad
6	River Street , Clerkenwell	221	8279	0.026694
9	Courland Grove, Wandsworth Road	105	5369	0.019557
10	Stanley Grove, Battersea	92	4882	0.018845
12	Southern Grove, Bow	112	6152	0.018205
18	Richmond Way, Shepherd's Bush	126	8149	0.015462
3	Ormonde Gate, Chelsea	315	8932	0.035266
41	Emperor's Gate, South Kensington	136	10554	0.012886
53	Chepstow Villas, Notting Hill	135	11038	0.012230
57	Salmon Lane, Limehouse	113	9345	0.012092
63	Greyhound Road, Hammersmith	118	9998	0.011802
35	Hertford Road, De Beauvoir Town	195	14188	0.013744
39	Bramham Gardens, Earl's Court	212	16178	0.013104
62	Lancaster Gate , Bayswater	172	14554	0.011818
75	Ilchester Gardens, Bayswater	159	14137	0.011247
80	Westfield Library Corner, Shepherd's Bush	126	11517	0.010940
25	Queen's Gate, Kensington Gardens	396	27457	0.014423
74	Speakers' Corner 2, Hyde Park	468	41107	0.011385
76	Cumberland Gate, Hyde Park	303	26981	0.011230
77	Albert Gate, Hyde Park	729	66547	0.010955
82	Triangle Car Park, Hyde Park	454	41675	0.010894