Hashing can be useful in speeding up the search process for a specific item that is part of a larger collection of items. Depending on the implementation of the hashing algorithm, this can turn the computational complexity of our search algorithm from $O(n)$ to $O(1)$. We do this by building a specific data structure, which we'll dive into next.
A hash table is a collection of items, stored in such a way as to make it easier to find them later. The table consists of slots that hold items and are named by a specific integer value, starting with 0.
Example of a hash table (sorry for the poor formatting because markdown :
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
None | None | None | None | None | None | None | None | None | None | None |
Each entry in this hash table, is currently set to a value of None
.
A hash function is used when mapping values into the slots available within a Hash table. The hash function typically takes, as input, an item from a collection, and will return an integer in the range of slot names, between $0$ and $m-1$. There are many different hash functions, but the first we can discuss is the "remainder method" hash function.
The remainder hash function takes an item from a collection, divides it by the table size, returning the remainder of it's hash value $h(item) = item \% \text{table_size}$. Typically modulo arithmetic is present in some form for all hash functions, as the result must be in the range of the total number of slots within the table.
Assuming we have a set of integer items $\{25,54,34,67,75,21,77,31\}$, we can use our hash function to find slots for our values, accordingly.
In [1]:
items = [25,54,34,67,75,21,77,31]
def hash(item_list, table_size):
hash_table = dict([(i,None) for i,x in enumerate(range(table_size))])
for item in item_list:
i = item % table_size
print("The hash for %s is %s" % (item, i))
hash_table[i] = item
return hash_table
# Execute the hash function
# Create table with 11 entries to match example above
hash_table = hash(items, 11)
# Print the resulting hash table
print(hash_table)
Once the hash values have been computed, we inset each item into the hash table at the designated position(s). We can now see that there are entries with corresponding hash values stored in a python dictionary. This is obviously a very simple implementation of a hash table.
There is something interesting to note here, though, when working through using a simple hashing algorithm like the remainder method. We have items, in our case integers, which hash to the same value. Specifically, we can see that there are 2 items that hash to each of the 1, 9, and 10 slots. These are what are known as collisions.
Clearly these collisions can cause problems, as out of the 8 initial items that we'd started with, we only have 5 items actually stored in our hash table. This leads us into the next section we'll discuss, and that is hash functions that can help alleviate this collision problem.
Hash functions that map, perfectly, every item into it's own unique slot in a hash table is known as a perfect hash function. If we knew the collection of items and that it would never change, it's possible to construct a perfect hash function specific to this collection, but we know that the dynamics of the real world tend to not allow something so simple.
Dynamically growing the hash table size so each possible item in the item range can be accomodated is one way to construct a perfect hash function. This guarantees each item will have it's own slot. But this isn't feasible, as something as simple as tracking social security numbers would require over one billion slots within the hash table. And if we're only tracking a small subset of the full set of social security numbers, this would become horribly inefficient with respect to hardware resources available within the machine our code is running on.
With the goal of constructing a hash function that will minimize the number of collisions, has low computational complexity, and evenly distributes our items within the hash table, we can take a look at some common ways to extend this remainder method.
The folding method for hashing an item begins by diving the item into equal size pieces (though the last piece may not be of equal size to the rest). These pieces are then added together to create the resulting hash value. A good example of this is a phone number,such as 456-555-1234. We can break each pair of integers up into groups of 2, add them up, and use that resulting value as an input to our hashing function.
In [2]:
def stringify(item_list):
"""
Method to convert integer values into array of component integers
"""
string_items = []
while len(item_list) > 0:
for item in item_list:
chars = [int(c) for c in str(item)]
item_list.remove(item)
string_items.append(chars)
return string_items
def folding_hash(item_list):
'''
Quick hack at a folding hash algorithm
'''
hashes = []
while len(item_list) > 0:
hash_val = 0
for item in item_list:
while len(item) > 1:
str_1 = str(item[0])
str_2 = str(item[1])
str_concat = str_1 + str_2
bifold = int(str_concat)
hash_val += bifold
item.pop(0)
item.pop(0)
else:
if len(item) > 0:
hash_val += item[0]
else:
pass
hashes.append(hash_val)
return hashes
# Example phone numbers
phone_number = [4565551234, 4565557714, 9871542544, 4365554601]
# String/Character-fy the phone numbers
str_pn = stringify(phone_number)
# Hash the phone numbers
folded_hash = folding_hash(str_pn)
# Input values into hash table
folding_hash_table = hash(folded_hash, 11)
# Print the results
print(folding_hash_table)
In [3]:
def ord_hash(string, table_size):
hash_val = 0
for position in range(len(string)):
hash_val = hash_val + ord(string[position])
return hash_val % table_size
print(ord_hash("cat", 11))
print(ord_hash("tac", 11))
In the case above, just using ordinal values can cause hash collisions. We can actually use the positional structure of the word to as a set of weights for generating a given hash. As seen below.
A simple multiplication by the positional value of each character will cause anagrams to evaluate to different hash values.
In [4]:
def weighted_ord_hash(string, table_size):
hash_val = 0
for position in range(len(string)):
hash_val = hash_val + (ord(string[position]) * position)
return hash_val % table_size
# ord_hash
print(ord_hash("cat", 11))
# weighted_ord_hash
print(weighted_ord_hash("tac", 11))
When there are hash collisions, like we've seen previously, it's important to understand ways that we can alleviate the collisions.
One simple way to handle the collision, should there already be an entry in our hash table with the same hash value, is to search sequentially through all slots near the original hash, for an empty slot. This may require us to circularly traverse the entire hash table to allow us to cover all possible slots. This process is known as open addressing and the technique within this process that we're using is called linear probing.
In the following code examples, we'll reuse the simple remainder method hash function that we've defined above. Along with the original set of integers we were hashing, as there were some collisions that occured.
In [5]:
items = [25,54,34,67,75,21,77,31]
# Execute the hash function
# Create table with 11 entries to match example above
hash_table = hash(items, 11)
# Print the resulting hash table
print(hash_table)
We can see there were multiple collisions within this dataset. Specifically hashes of 1, 9, and 10. And we can see in the resulting table that only the last computed hashes are stored in the respective table slots.
Below we'll implement an lp_hash
function that will perform linear probing over the slots available within the table for any collisions that occur.
In [6]:
items = [25,54,34,67,75,21,77,31]
def rehash(oldhash, table_size):
return (oldhash+1) % table_size
def lp_hash(item_list, table_size):
lp_hash_table = dict([(i,None) for i,x in enumerate(range(table_size))])
for item in item_list:
i = item % table_size
print("%s hashed == %s \n" %(item, i))
if lp_hash_table[i] == None:
lp_hash_table[i] = item
elif lp_hash_table[i] != None:
print("Collision, attempting linear probe \n")
next_slot = rehash(i, table_size)
print("Setting next slot to %s \n" % next_slot)
while lp_hash_table[next_slot] != None:
next_slot = rehash(next_slot, len(lp_hash_table.keys()))
print("Next slot was not empty, trying next slot %s \n" % next_slot)
if lp_hash_table[next_slot] == None:
lp_hash_table[next_slot] = item
return lp_hash_table
print(lp_hash(items, 11))
Used a little more interestingly, we can use the weighted ordinal hash function that we've defined above, combined with the lp_hash function that we've just defined, to store string(s) for later lookup.
In [7]:
animal_items = ["cat", "dog", "goat",
"chicken", "pig", "horse",
"ostrich", "lion", "puma"]
def rehash(oldhash, table_size):
return (oldhash+1) % table_size
def weighted_ord_hash(string, table_size):
hash_val = 0
for position in range(len(string)):
hash_val = hash_val + (ord(string[position]) * position)
return hash_val % table_size
def lp_hash(item_list, table_size):
lp_hash_table = dict([(i,None) for i,x in enumerate(range(table_size))])
for item in item_list:
i = weighted_ord_hash(item, table_size)
print("%s hashed == %s \n" %(item, i))
if lp_hash_table[i] == None:
lp_hash_table[i] = item
elif lp_hash_table[i] != None:
print("Collision, attempting linear probe \n")
next_slot = rehash(i, table_size)
print("Setting next slot to %s \n" % next_slot)
while lp_hash_table[next_slot] != None:
next_slot = rehash(next_slot, len(lp_hash_table.keys()))
print("Next slot was not empty, trying next slot %s \n" % next_slot)
if lp_hash_table[next_slot] == None:
lp_hash_table[next_slot] = item
return lp_hash_table
print(lp_hash(animal_items, 11))
In [ ]: