Heuristic Benchmark

This notebook demonstrates the Heuristic Benchmark design pattern

1. Regression on poorly understood features

Problem: Time interval before a question on Stack Overflow is answered.

Benchmark: Median time to first answer over the entire training dataset, so 2120 seconds.


In [1]:
%%bigquery
SELECT 
  bqutil.fn.median(ARRAY_AGG(TIMESTAMP_DIFF(a.creation_date, q.creation_date, SECOND))) AS time_to_answer
FROM `bigquery-public-data.stackoverflow.posts_questions` q
JOIN `bigquery-public-data.stackoverflow.posts_answers` a
ON q.accepted_answer_id = a.id


Out[1]:
time_to_answer
0 2120.0

Find the error metric of always predicting that it will take 2120 seconds to get an answer. This the baseline metric against which to report model performance.


In [2]:
%%bigquery
WITH benchmark_eval AS (
SELECT 
  2120 - TIMESTAMP_DIFF(a.creation_date, q.creation_date, SECOND) AS error
FROM `bigquery-public-data.stackoverflow.posts_questions` q
JOIN `bigquery-public-data.stackoverflow.posts_answers` a
ON q.accepted_answer_id = a.id
)

SELECT
   AVG(ABS(error)) AS mean_absolute_error
FROM
   benchmark_eval


Out[2]:
mean_absolute_error
0 857315.119106

2. Classification on poorly understood features

Problem: Whether or not an accepted answer will be edited.

Benchmark: Probability distribution of accepted answers that are edited.


In [3]:
%%bigquery
SELECT 
  AVG(IF(a.last_edit_date IS NULL, 0, 1)) AS prob_edited
FROM `bigquery-public-data.stackoverflow.posts_questions` q
JOIN `bigquery-public-data.stackoverflow.posts_answers` a
ON q.accepted_answer_id = a.id


Out[3]:
prob_edited
0 0.36226

Problem: Country from which a Stack Overflow question will be answered.

Benchmark: Fractions of answers written by people from France, India, and so on.


In [4]:
%%bigquery
SELECT 
  COUNTIF(ENDS_WITH(u.location, 'France')) / COUNT(u.location) AS from_france,
  COUNTIF(ENDS_WITH(u.location, 'India')) / COUNT(u.location) AS from_india
FROM `bigquery-public-data.stackoverflow.posts_questions` q
JOIN `bigquery-public-data.stackoverflow.posts_answers` a
ON q.accepted_answer_id = a.id
JOIN `bigquery-public-data.stackoverflow.users` u
ON u.id = a.owner_user_id


Out[4]:
from_france from_india
0 0.029717 0.08415

3. Regression with one good numeric feature

Problem: Predict taxi fare amount given pickup and dropoff locations. The distance between the two points is, intuitively, a key feature.

Benchmark: linear regression based on this feature


In [6]:
%%bigquery
With trips AS (
SELECT
  total_amount,
  ST_Distance(ST_GeogPoint(pickup_longitude, pickup_latitude),
              ST_GeogPoint(dropoff_longitude, dropoff_latitude))/1000 AS dist
FROM `bigquery-public-data.new_york.tlc_yellow_trips_2015`
WHERE pickup_latitude BETWEEN 35 and 45
AND dropoff_latitude BETWEEN 35 and 45
AND pickup_longitude BETWEEN -80 and -70
AND dropoff_longitude BETWEEN -80 and -70
AND total_amount IS NOT NULL
)

SELECT AVG(total_amount)/AVG(dist)
FROM trips


Out[6]:
f0_
0 4.644356

4. Regression with one or two important features

Problem: Predict duration of bicycle rental.

Benchmark: Lookup table


In [7]:
%%bigquery
CREATE TEMPORARY FUNCTION is_peak_hour(start_date TIMESTAMP) aS
(EXTRACT(DAYOFWEEK FROM start_date) BETWEEN 2 AND 6 -- weekday
    AND (
       EXTRACT(HOUR FROM start_date) BETWEEN 6 AND 10
       OR
       EXTRACT(HOUR FROM start_date) BETWEEN 15 AND 18))
;

SELECT 
   start_station_name,
   is_peak_hour(start_date) AS is_peak,
   AVG(duration) AS predicted_duration,
FROM `bigquery-public-data.london_bicycles.cycle_hire`
GROUP BY 1, 2
ORDER BY predicted_duration DESC
LIMIT 10


Out[7]:
start_station_name is_peak predicted_duration
0 Contact Centre, Southbury House False 7012.500000
1 Stewart's Road, Nine Elms False 6401.018182
2 Speakers' Corner 2, Hyde Park True 4455.441717
3 Speakers' Corner 2, Hyde Park False 3785.754375
4 Speakers' Corner 1, Hyde Park True 3728.008525
5 Stewart's Road, Nine Elms True 3727.422680
6 Speakers' Corner 1, Hyde Park False 3702.115147
7 Black Lion Gate, Kensington Gardens True 3653.733728
8 Black Lion Gate, Kensington Gardens False 3552.613008
9 Mechanical Workshop Penton True 3533.424658

In [ ]:

Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License