In [1]:
#!python -m pip install google-cloud-bigquery
%matplotlib inline
#%load_ext google.cloud.bigquery
In [2]:
PROJECT='cloud-training-demos' # CHANGE THIS
We believe that if someone rents a bicycle for less than 10 minutes and returns it to the same station they rented it from, the bicycle likely has a problem. We'll call this a "bad" trip.
We want to send a crew out to examine a few of the stations that had lots of bad trips to see if there are any systemic problems.
In [3]:
%%bigquery badtrips --project $PROJECT
SELECT *, bad_trips / num_trips AS fraction_bad FROM (
SELECT
start_station_name
, SUM(IF(duration < 600 AND start_station_name = end_station_name, 1, 0)) AS bad_trips
, COUNT(start_station_name) as num_trips
FROM `bigquery-public-data`.london_bicycles.cycle_hire
WHERE EXTRACT(YEAR FROM start_date) = 2015
GROUP BY start_station_name
HAVING num_trips > 10
)
ORDER BY fraction_bad DESC
In [4]:
badtrips.describe()
Out[4]:
It looks like fraction_bad ranges from 0 to 0.4, but it is not clear how meaningful this ratio is on its own because the stations vary quite dramatically. For example, the number of trips ranges from 11 to 95740, and a fraction_bad of 0.4 is not terrible if num_trips is only 11.
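As a quick optional check (not part of the original analysis; the cutoff of 50 trips is arbitrary), we can list the low-traffic stations to see how few bad trips a high ratio actually represents:
In [ ]:
# Optional sanity check: at num_trips=11, a fraction_bad of 0.4 is only ~4 bad trips.
# The cutoff of 50 trips is an arbitrary choice for illustration.
small = badtrips.query('num_trips < 50').sort_values('fraction_bad', ascending=False)
small[['start_station_name', 'num_trips', 'bad_trips', 'fraction_bad']].head()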
Let's look at a scatter plot to see if there is any clear trend here.
In [5]:
badtrips.plot.scatter('num_trips', 'fraction_bad');
Let's zoom in a bit and add a line of best fit.
In [6]:
import seaborn as sns
ax = sns.regplot(x='num_trips', y='fraction_bad', data=badtrips)
ax.set_ylim(0, 0.05);
Indeed, we see that higher values of fraction_bad are associated with lower values of num_trips. We need to interpret fraction_bad carefully.
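To put a number on this association, here is an optional check that is not in the original notebook; a rank (Spearman) correlation is used because the relationship is clearly nonlinear:
In [ ]:
# Optional: rank correlation between station traffic and fraction of bad trips.
# A negative value would indicate that busier stations tend to have a lower fraction_bad.
badtrips['num_trips'].corr(badtrips['fraction_bad'], method='spearman')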
In [7]:
import pandas as pd

stations_to_examine = []
for band in range(1, 5):
    min_trips = badtrips['num_trips'].quantile(0.2 * band)
    max_trips = badtrips['num_trips'].quantile(0.2 * (band + 1))
    query = 'num_trips >= {} and num_trips < {}'.format(min_trips, max_trips)
    print(query)  # band
    stations = badtrips.query(query)
    stations = stations.sort_values(by=['fraction_bad'], ascending=False)[:5]
    print(stations)  # 5 worst
    stations_to_examine.append(stations)
    print()
Notice how, by banding, we are able to use lower thresholds for the busier stations. Had we chosen a single threshold, our crew would not have gotten to look at a diverse set of stations.
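As an optional check (not in the original notebook), printing the quintile boundaries makes the banding concrete; each successive band covers busier stations, so the num_trips threshold rises:
In [ ]:
# Optional: inspect the quintile boundaries that define the bands.
badtrips['num_trips'].quantile([0.2, 0.4, 0.6, 0.8, 1.0])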
In [8]:
stations_to_examine = pd.concat(stations_to_examine)
stations_to_examine
Out[8]:
In [ ]:
!pip install pyarrow
In [11]:
from google.cloud import bigquery
bq = bigquery.Client(project=PROJECT)
table_id = '{}.ch05eu.bad_bikes'.format(PROJECT)
job = bq.load_table_from_dataframe(stations_to_examine, table_id)
job.result() # blocks and waits
print("Loaded {} rows into {}".format(job.output_rows, table_id))
Add latitude and longitude through a join
In [12]:
%%bigquery stations_to_examine --project $PROJECT
SELECT
start_station_name AS station_name
, num_trips
, fraction_bad
, latitude
, longitude
FROM ch05eu.bad_bikes AS bad
JOIN `bigquery-public-data`.london_bicycles.cycle_stations AS s
ON bad.start_station_name = s.name
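Because the join matches on station name, it is worth a quick optional check that no stations were silently dropped:
In [ ]:
# Optional: the join is on station name, so a name mismatch would silently drop rows.
print('{} stations retrieved with coordinates'.format(len(stations_to_examine)))
stations_to_examine.head()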
In [13]:
!pip install folium
In [14]:
import folium
map_pts = folium.Map(location=[51.5, -0.15], zoom_start=12)
for idx, row in stations_to_examine.iterrows():
    folium.Marker(
        location=[row['latitude'], row['longitude']],
        popup=row['station_name']
    ).add_to(map_pts)
map_pts
Out[14]:
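If you want to share the map outside the notebook (an optional step; the filename below is just an example), folium can write it out as a standalone HTML file:
In [ ]:
# Optional: save the interactive map as a standalone HTML file.
map_pts.save('bad_bikes_map.html')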
Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License