This notebook accompanies the presentation "Machine Learning and Bayesian Statistics in minutes: How data scientists use BigQuery"
Let's say that we want to find the probability of a flight being late $\theta$ given a specific departure delay $\textbf{D}$ Bayes' Law tells that can be obtained for any specific departure delay using the formula:
Once you have large datasets, the probabilities above are just exercises in counting and so, applying Bayesian statistics is super-easy in BigQuery.
For example, let's find the probability that a flight will be 15+ minutes late:
In [4]:
%%bigquery df
WITH rawnumbers AS (
SELECT
departure_delay,
COUNT(1) AS num_flights,
COUNTIF(arrival_delay < 15) AS num_ontime
FROM
`bigquery-samples.airline_ontime_data.flights`
GROUP BY
departure_delay
HAVING
num_flights > 100
),
totals AS (
SELECT
SUM(num_flights) AS tot_flights,
SUM(num_ontime) AS tot_ontime
FROM rawnumbers
),
bayes AS (
SELECT
departure_delay,
num_flights / tot_flights AS prob_D,
num_ontime / tot_ontime AS prob_D_theta,
tot_ontime / tot_flights AS prob_theta
FROM
rawnumbers, totals
WHERE
num_ontime > 0
)
SELECT
*, (prob_theta * prob_D_theta / prob_D) AS prob_ontime
FROM
bayes
ORDER BY
departure_delay ASC
In [5]:
df.plot(x='departure_delay', y='prob_ontime');
But is it right, though? What's with the weird hump for early departures (departure_delay less than zero)?
First, we should verify that we can apply Bayes Law. Grouping by the departure delay is incorrect if the departure delay is a chaotic input variable. We have do exploratory analysis to validate that:
This, too, is straightforward in BigQuery
In [7]:
%%bigquery df
SELECT
departure_delay,
COUNT(1) AS num_flights,
APPROX_QUANTILES(arrival_delay, 10) AS arrival_delay_deciles
FROM
`bigquery-samples.airline_ontime_data.flights`
GROUP BY
departure_delay
HAVING
num_flights > 100
ORDER BY
departure_delay ASC
In [8]:
import pandas as pd
percentiles = df['arrival_delay_deciles'].apply(pd.Series)
percentiles = percentiles.rename(columns = lambda x : str(x*10) + "%")
df = pd.concat([df['departure_delay'], percentiles], axis=1)
df.head()
Out[8]:
In [9]:
without_extremes = df.drop(['0%', '100%'], 1)
without_extremes.plot(x='departure_delay', xlim=(-30,50), ylim=(-50,50));
Note the crazy non-linearity for top half of of the flights that leave more than 20 minutes early. Most likely, these are planes that try to beat some weather situation. About half of such flights succeed (the linear bottom) and the other half don't (the non-linear top). The average is what we saw as the weird hump in the probability plot. So yes, the hump is real. The rest of the distribution is clear-cut and the Bayes probabilities are quite valid.
In [ ]:
%%bigquery
CREATE OR REPLACE MODEL ch09eu.bicycle_model_dnn
OPTIONS(input_label_cols=['duration'],
model_type='dnn_regressor', hidden_units=[32, 4])
TRANSFORM(
duration
, start_station_name
, CAST(EXTRACT(dayofweek from start_date) AS STRING)
as dayofweek
, CAST(EXTRACT(hour from start_date) AS STRING)
as hourofday
)
AS
SELECT
duration, start_station_name, start_date
FROM
`bigquery-public-data`.london_bicycles.cycle_hire
In [10]:
%%bigquery
SELECT * FROM ML.EVALUATE(MODEL ch09eu.bicycle_model_dnn)
Out[10]:
In [13]:
%%bigquery
SELECT * FROM ML.PREDICT(MODEL ch09eu.bicycle_model_dnn,(
SELECT
'Park Street, Bankside' AS start_station_name
,CURRENT_TIMESTAMP() AS start_date
))
Out[13]:
In [ ]:
%%bigquery
CREATE OR REPLACE MODEL advdata.txtclass_tf
OPTIONS (model_type='tensorflow',
model_path='gs://cloud-training-demos/txtclass/export/exporter/1549825580/*')
In [14]:
%%bigquery
SELECT
input,
(SELECT AS STRUCT(p, ['github', 'nytimes', 'techcrunch'][ORDINAL(s)]) prediction FROM
(SELECT p, ROW_NUMBER() OVER() AS s FROM
(SELECT * FROM UNNEST(dense_1) AS p))
ORDER BY p DESC LIMIT 1).*
FROM ML.PREDICT(MODEL advdata.txtclass_tf,
(
SELECT 'Unlikely Partnership in House Gives Lawmakers Hope for Border Deal' AS input
UNION ALL SELECT "Fitbit\'s newest fitness tracker is just for employees and health insurance members"
UNION ALL SELECT "Show HN: Hello, a CLI tool for managing social media"
))
Out[14]:
Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
In [ ]: