Hashed Features Design Pattern

Solves problems related to not knowing the vocabulary of a categorical variable beforehand.



In [22]:

    
%%bigquery df
CREATE TEMPORARY FUNCTION hashed(airport STRING, numbuckets INT64) AS (
   ABS(MOD(FARM_FINGERPRINT(airport), numbuckets))
);

WITH airports AS (
SELECT 
   DISTINCT(departure_airport)
FROM `bigquery-samples.airline_ontime_data.flights`
)

SELECT 
   departure_airport,
   hashed(departure_airport, 3) AS hash3,
   hashed(departure_airport, 10) AS hash10,
   hashed(departure_airport, 1000) AS hash1000,
FROM airports



In [23]:

    
df.head(n=10)









    Out[23]:







  
    
      
      departure_airport
      hash3
      hash10
      hash1000
    
  
  
    
      0
      SJU
      2
      1
      81
    
    
      1
      PVD
      1
      7
      667
    
    
      2
      MSY
      1
      1
      991
    
    
      3
      MHT
      2
      7
      507
    
    
      4
      CHS
      1
      2
      432
    
    
      5
      CPR
      2
      6
      306
    
    
      6
      PHX
      0
      3
      743
    
    
      7
      SFO
      0
      4
      714
    
    
      8
      SNA
      2
      7
      587
    
    
      9
      SAN
      0
      7
      447



In [24]:

    
len(df)









    Out[24]:





347

Some airports had very few flights



In [27]:

    
%%bigquery
SELECT 
   departure_airport, COUNT(1) AS num_flights
FROM `bigquery-samples.airline_ontime_data.flights`
GROUP BY departure_airport
ORDER BY num_flights ASC
LIMIT 10









    Out[27]:







  
    
      
      departure_airport
      num_flights
    
  
  
    
      0
      MKC
      1
    
    
      1
      BFF
      1
    
    
      2
      PVU
      1
    
    
      3
      GLH
      2
    
    
      4
      FMN
      3
    
    
      5
      PUB
      4
    
    
      6
      OGD
      5
    
    
      7
      CKB
      8
    
    
      8
      PIR
      14
    
    
      9
      SHD
      18

Likelihood of collision

To calculate the likelihood of collision, consider that previous N entries don't collide. Even with 100,000 hash-length, the chance of a collision is 45%. So, hashed columns should never be used if collisions will cause problems.



In [37]:

    
import pandas as pd

def calc_collision_prob(num_total, num_hash):
    no_collision_prob = 1.0
    for i in range(num_total):
        # i of the previous buckets is occupied now
        collision_likelihood = float(i) / num_hash
        no_collision_prob *= (1 - collision_likelihood)
    return 1 - no_collision_prob


data = []
for num_hash in [3, 10, 100, 1000, 10000, 100000]:
    data.append([num_hash, 
                 len(df)/num_hash, 
                 calc_collision_prob(len(df), num_hash)
                ])
prob = pd.DataFrame(data, columns=['num_hash_buckets', 'entries_per_bucket', 'collision_prob'])
prob









    Out[37]:







  
    
      
      num_hash_buckets
      entries_per_bucket
      collision_prob
    
  
  
    
      0
      3
      115.666667
      1.000000
    
    
      1
      10
      34.700000
      1.000000
    
    
      2
      100
      3.470000
      1.000000
    
    
      3
      1000
      0.347000
      1.000000
    
    
      4
      10000
      0.034700
      0.997697
    
    
      5
      100000
      0.003470
      0.451739



In [29]:

    
calc_collision_prob(5, 1000) # num_hash >> num_total









    Out[29]:





0.009965049976000118

Which airports have the same hashbucket as Chicago, assuming 100 hash buckets?



In [33]:

    
%%bigquery
CREATE TEMPORARY FUNCTION hashed(airport STRING, numbuckets INT64) AS (
   ABS(MOD(FARM_FINGERPRINT(airport), numbuckets))
);

WITH airports AS (
SELECT 
   departure_airport, COUNT(1) AS num_flights
FROM `bigquery-samples.airline_ontime_data.flights`
GROUP BY departure_airport 
)

SELECT 
   departure_airport, num_flights
FROM airports
WHERE hashed(departure_airport, 100) = hashed('ORD', 100)









    Out[33]:







  
    
      
      departure_airport
      num_flights
    
  
  
    
      0
      ORD
      3610491
    
    
      1
      BTV
      66555
    
    
      2
      MCI
      597761

Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License

	departure_airport	hash3	hash10	hash1000
0	SJU	2	1	81
1	PVD	1	7	667
2	MSY	1	1	991
3	MHT	2	7	507
4	CHS	1	2	432
5	CPR	2	6	306
6	PHX	0	3	743
7	SFO	0	4	714
8	SNA	2	7	587
9	SAN	0	7	447

	num_hash_buckets	entries_per_bucket	collision_prob
0	3	115.666667	1.000000
1	10	34.700000	1.000000
2	100	3.470000	1.000000
3	1000	0.347000	1.000000
4	10000	0.034700	0.997697
5	100000	0.003470	0.451739

	departure_airport	hash3	hash10	hash1000
0	SJU	2	1	81
1	PVD	1	7	667
2	MSY	1	1	991
3	MHT	2	7	507
4	CHS	1	2	432
5	CPR	2	6	306
6	PHX	0	3	743
7	SFO	0	4	714
8	SNA	2	7	587
9	SAN	0	7	447

Hashed Features Design Pattern

Likelihood of collision

Airports that share hash buckets

	departure_airport	hash3	hash10	hash1000
0	SJU	2	1	81
1	PVD	1	7	667
2	MSY	1	1	991
3	MHT	2	7	507
4	CHS	1	2	432
5	CPR	2	6	306
6	PHX	0	3	743
7	SFO	0	4	714
8	SNA	2	7	587
9	SAN	0	7	447