Hashed Features Design Pattern

Solves problems related to not knowing the vocabulary of a categorical variable beforehand.


In [22]:
%%bigquery df
CREATE TEMPORARY FUNCTION hashed(airport STRING, numbuckets INT64) AS (
   ABS(MOD(FARM_FINGERPRINT(airport), numbuckets))
);

WITH airports AS (
SELECT 
   DISTINCT(departure_airport)
FROM `bigquery-samples.airline_ontime_data.flights`
)

SELECT 
   departure_airport,
   hashed(departure_airport, 3) AS hash3,
   hashed(departure_airport, 10) AS hash10,
   hashed(departure_airport, 1000) AS hash1000,
FROM airports

In [23]:
df.head(n=10)


Out[23]:
departure_airport hash3 hash10 hash1000
0 SJU 2 1 81
1 PVD 1 7 667
2 MSY 1 1 991
3 MHT 2 7 507
4 CHS 1 2 432
5 CPR 2 6 306
6 PHX 0 3 743
7 SFO 0 4 714
8 SNA 2 7 587
9 SAN 0 7 447

In [24]:
len(df)


Out[24]:
347

Some airports had very few flights


In [27]:
%%bigquery
SELECT 
   departure_airport, COUNT(1) AS num_flights
FROM `bigquery-samples.airline_ontime_data.flights`
GROUP BY departure_airport
ORDER BY num_flights ASC
LIMIT 10


Out[27]:
departure_airport num_flights
0 MKC 1
1 BFF 1
2 PVU 1
3 GLH 2
4 FMN 3
5 PUB 4
6 OGD 5
7 CKB 8
8 PIR 14
9 SHD 18

Likelihood of collision

To calculate the likelihood of collision, consider that previous N entries don't collide. Even with 100,000 hash-length, the chance of a collision is 45%. So, hashed columns should never be used if collisions will cause problems.


In [37]:
import pandas as pd

def calc_collision_prob(num_total, num_hash):
    no_collision_prob = 1.0
    for i in range(num_total):
        # i of the previous buckets is occupied now
        collision_likelihood = float(i) / num_hash
        no_collision_prob *= (1 - collision_likelihood)
    return 1 - no_collision_prob


data = []
for num_hash in [3, 10, 100, 1000, 10000, 100000]:
    data.append([num_hash, 
                 len(df)/num_hash, 
                 calc_collision_prob(len(df), num_hash)
                ])
prob = pd.DataFrame(data, columns=['num_hash_buckets', 'entries_per_bucket', 'collision_prob'])
prob


Out[37]:
num_hash_buckets entries_per_bucket collision_prob
0 3 115.666667 1.000000
1 10 34.700000 1.000000
2 100 3.470000 1.000000
3 1000 0.347000 1.000000
4 10000 0.034700 0.997697
5 100000 0.003470 0.451739

In [29]:
calc_collision_prob(5, 1000) # num_hash >> num_total


Out[29]:
0.009965049976000118

Airports that share hash buckets

Which airports have the same hashbucket as Chicago, assuming 100 hash buckets?


In [33]:
%%bigquery
CREATE TEMPORARY FUNCTION hashed(airport STRING, numbuckets INT64) AS (
   ABS(MOD(FARM_FINGERPRINT(airport), numbuckets))
);

WITH airports AS (
SELECT 
   departure_airport, COUNT(1) AS num_flights
FROM `bigquery-samples.airline_ontime_data.flights`
GROUP BY departure_airport 
)

SELECT 
   departure_airport, num_flights
FROM airports
WHERE hashed(departure_airport, 100) = hashed('ORD', 100)


Out[33]:
departure_airport num_flights
0 ORD 3610491
1 BTV 66555
2 MCI 597761

Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License