As you probably know, disk access is painfully slow and should be avoided whenever possible. Programs often use a large set (a collection of unique values) that cannot fit in RAM and must be stored on disk. In a simple set implementation, we must access the full set data on disk both to add a value (regardless of whether it's already a member) and to check whether a value is a member.

In many cases, it would be useful to have a smaller data structure stored in memory that estimates whether a value is a member of our set. Because this structure would be much smaller than the full set, thereby storing much less information, it could not give a definitive answer. However, instead of returning “True” or “False” when asked if a set contains a value, our “first-pass filter” could return “Probably” or “False”. When it returns “Probably”, the full set must then be searched to determine whether the value is actually a member. When it returns “False”, though, we need not access disk at all.

A simple solution is to store each value’s hash in a list. Hash functions are many-to-one; multiple values can have the same hash, but each value has only one valid hash. If a value’s hash exists in our list, the value might be in the set, or the set might just contain another value with the same hash (two values having the same hash is known as a collision). If the value’s hash isn’t in the list, the value definitely isn’t in the set.

A simple Python implementation of this filter is below.


In [56]:
class ListFilter:

    def __init__(self):
        self.hashes = []


    def add(self, val):
        # Record the value's hash if it isn't already present
        valHash = hash(val)
        if valHash not in self.hashes:
            self.hashes.append(valHash)


    def check(self, val):
        # True means "Probably"; False means "Definitely not"
        return hash(val) in self.hashes

A return value of True from check means "Probably", because the filter can't definitively tell us whether a value is in the actual set.

For simplicity and modularity, we didn't incorporate the actual set into the filter. This means that, after adding a value to the filter with add, a program would have to also add the value to the set. Similarly, if a program calls check and True is returned, it will have to check the set to see if the value is actually a member.
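
To make this workflow concrete, below is a minimal sketch (not part of the filter itself) of how a program might pair ListFilter with its actual set. The on-disk set is played here by an ordinary Python set named diskSet, a hypothetical stand-in for whatever storage layer you use.


In [ ]:
# Sketch of the first-pass protocol. diskSet stands in for the slow,
# on-disk set; in a real program it would live in a file or database.
diskSet = set()
listFilter = ListFilter()

def addValue(val):
    listFilter.add(val)
    diskSet.add(val)             # the slow, on-disk operation

def contains(val):
    if not listFilter.check(val):
        return False             # "False": no disk access needed
    return val in diskSet        # "Probably": consult the full set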

Our first-pass list filter is a decent solution, but it isn’t quite ideal. Ignoring collisions, it grows in proportion to the set, and therefore becomes unwieldy for very large sets. Assuming we don’t want to use extra memory to make a hash table of the hashes, finding a hash is $O(\log_2{n})$ if the list is kept sorted and $O(n)$ if it isn’t (as in our implementation, where the in operator scans the whole list).
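
For reference, here is a sketch of the sorted-list variant using Python's bisect module, which keeps the hash list ordered so that lookups take $O(\log_2{n})$ time. This is just an illustration of the trade-off, not part of our filter.


In [ ]:
import bisect

class SortedListFilter:

    def __init__(self):
        self.hashes = []    # kept sorted at all times


    def add(self, val):
        valHash = hash(val)
        i = bisect.bisect_left(self.hashes, valHash)
        # Insert only if absent, preserving sorted order
        if i == len(self.hashes) or self.hashes[i] != valHash:
            self.hashes.insert(i, valHash)


    def check(self, val):
        # Binary search: O(log2 n) instead of O(n)
        valHash = hash(val)
        i = bisect.bisect_left(self.hashes, valHash)
        return i != len(self.hashes) and self.hashes[i] == valHash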

Bloom filters, an alternative, occupy a fixed amount of space and offer $O(1)$ checking. While the rate of false positives (“Probably” responses for values not in the set) increases modestly as the set grows, relatively small Bloom filters have low ($\leq1\%$) false positive rates for huge data sets.

Below is an implementation of a Bloom filter.


In [4]:
class BloomFilter:

    def __init__(self, m, k):
        self.m = m    # number of bits in the array
        self.k = k    # number of hash functions
        self.bitArray = [0 for _ in range(m)]


    def getIndices(self, val):
        # Simulate k hash functions by salting val with "0", "1", ...,
        # then map each hash to an array index with mod m
        indices = []
        for i in range(self.k):
            hash_ = hash(str(i) + val)
            index = hash_ % self.m
            indices.append(index)
        return indices


    def add(self, val):
        indices = self.getIndices(val)
        for index in indices:
            self.bitArray[index] = 1


    def check(self, val):
        # Any 0 bit among the value's indices means it was never added
        indices = self.getIndices(val)
        for index in indices:
            if self.bitArray[index] == 0:
                return False
        return True

A Bloom filter stores an array of $m$ bits (bitArray, in our code), all of which are initially set to $0$. We use an integer list here for clarity; in a real implementation, each array index is only one bit. For readability, our examples below will use $m=16$. Typical real-world Bloom filters use an $m$ value about $9.6$ times larger than the expected number of elements in the set, which gives a false positive rate of about $1\%$.
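
As a quick sanity check on that rule of thumb, the snippet below evaluates the false positive formula derived later in this post for $m = 9.6n$. The hash count $k=7$ is an assumption: it is roughly the optimal value for this ratio, about $(m/n)\ln{2}$.


In [ ]:
# Checking the 9.6-bits-per-element rule of thumb against the false
# positive formula derived later in this post
n = 10 ** 6            # expected number of set elements
m = int(9.6 * n)       # bits in the filter
k = 7                  # roughly optimal: (m/n) * ln(2) ~= 6.65
print((1 - (1 - 1.0 / m) ** (k * n)) ** k)    # ~0.01, i.e. about 1%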

The filter uses $k$ different hash functions. Often, these are just the same hash function with a different salt, a value prepended to the value to be hashed. In our implementation, the first hash function prepends "0", the second prepends "1", and so on. The hashing algorithm used is Python's default: hash.

Each hash function’s output must map to an index of the bit array. One way to do this is to ensure that the function’s output is $\log_2 m$ bits, mapping each hash value directly to an index of the bit array. If no available hash functions have our desired output size, we can use $hash \bmod{m}$ to map the hash value to an array index:


In [1]:
hash("0" + "hopkins")


Out[1]:
4028870035670873848

In [54]:
filter = BloomFilter(m = 16, k = 3)
print hash("0" + "hopkins") % filter.m
print hash("1" + "hopkins") % filter.m
print hash("2" + "hopkins") % filter.m


8
15
6

The getIndices method, which does the hashing in our implementation, uses the $\bmod$ solution. It returns a list of $k$ indices, each generated by a salted hash function:


In [6]:
filter.getIndices("hopkins")


Out[6]:
[8, 15, 6]

When a value is added to a Bloom filter, the bit array index indicated by each hash value is set to $1$. Our code uses the add method, calling getIndices and updating bitArray accordingly.


In [8]:
filter.bitArray


Out[8]:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [9]:
filter.add("hopkins")
filter.bitArray


Out[9]:
[0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1]

To check if a value is possibly in the Bloom filter's set, we compute its indices using getIndices. If the bit array values at any of these indices are $0$, then the value cannot be in the set:


In [13]:
for index in filter.getIndices("johns"):
    print filter.bitArray[index]


1
0
0

In [55]:
filter.check("johns")


Out[55]:
False

The possibility of a false positive remains (especially with an $m$ value as unrealistically small as $16$):


In [15]:
filter.getIndices("bootstrap")


Out[15]:
[8, 15, 6]

In [16]:
filter.check("bootstrap")


Out[16]:
True

Using simple probability laws, we can estimate the probability of a false positive. We will assume that the hashing algorithm used is effectively random, which implies that the probability of a given $0$ bit being flipped to $1$ during a single bit assignment is $\frac{1}{m}$. The probability of this bit remaining $0$ after a single bit assignment is therefore $1 - \frac{1}{m}$. Since adding one value performs $k$ independent bit assignments, the probability that the bit remains $0$ when a new value is added to the filter is:

$(1 - \frac{1}{m})^k$.

Applying the same reasoning across all $n$ added values, the probability that it will remain $0$ after $n$ values are added is:

$((1 - \frac{1}{m})^k)^n = (1 - \frac{1}{m})^{kn}$

This implies that the probability of a bit being $1$ after $n$ values are added to the filter is:

$1 - (1 - \frac{1}{m})^{kn}$

For a false positive, all $k$ of a value's bit array indices must already be $1$ by chance. The probability of a false positive is therefore:

$(1 - (1 - \frac{1}{m})^{kn})^k$

If we use five hash functions ($k=5$) and two gigabytes of RAM ($m=1.6 \times 10^{10}$ bits) for a Bloom filter and store a billion values in the set ($n=10^9$), the likelihood of a false positive is only:

$(1 - (1 - \frac{1}{m})^{kn})^k = (1 - (1 - \frac{1}{1.6 \times 10^{10}})^{5 \times 10^9})^5 \approx 0.0014 = 0.14\%$
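
This figure is easy to verify numerically:


In [ ]:
import math

# Evaluate the false positive formula for k = 5, m = 1.6e10 bits
# (two gigabytes), and n = 1e9 stored values. log1p keeps the
# computation of (1 - 1/m)^(kn) numerically stable.
m = 1.6e10
k = 5
n = 1e9
pZero = math.exp(k * n * math.log1p(-1 / m))   # P(a given bit is still 0)
print((1 - pZero) ** k)                        # ~0.0014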

It's clear why Bloom filters are used for massive-scale database software like Google BigTable, Apache Hadoop, and Apache Cassandra.


There are situations in which we can further improve our filter's performance. We may, for example, load a set of fixed-length strings into a Bloom filter and then search for these substrings in a larger string. Similarly, we may want to determine whether and where a genome sequencing read differs from a reference genome. We could do this by storing all the reference genome's $k$-mers (sequences that are $k$ nucleotides long; this $k$ is unrelated to the filter's number of hash functions) in a set and checking which $k$-mers in the sequencing reads don't exist in the reference genome.

These problems are best treated with frame shifting: to search for four-character substrings in the string "refrigerator", you would check "refr", then "efri", then "frig", and so on. No matter the substring length, each frame differs from the one before it by only a deletion at the front and an insertion at the back. The frames are very similar, yet our current Bloom filter implementation doesn't reuse any computation when calculating their hashes, as the sketch below shows.
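
Below is a sketch of such a naive scan using the BloomFilter class above. The substrings of interest are made-up examples; every frame of the text is rehashed from scratch.


In [ ]:
# Naive frame-shifting scan: each frame of the text is rehashed from
# scratch, so no work is shared between overlapping frames
bf = BloomFilter(m = 1024, k = 3)
for substring in ["frig", "rato"]:    # made-up substrings of interest
    bf.add(substring)

frameLen = 4
text = "refrigerator"
for start in range(len(text) - frameLen + 1):
    frame = text[start:start + frameLen]
    if bf.check(frame):
        print(frame)    # possibly one of our substrings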

A family of hash functions known as rolling hashes exists for this purpose. A rolling hash can convert a substring's hash to that of its successor frame, such as hash("refr") to hash("efri"), in $O(1)$ time. We will use the rolling hash function traditionally associated with the Rabin–Karp algorithm, an efficient method of searching for substrings. The hash value of a string of length $m$ (our code, like this formula, reuses the name $m$ for the frame length) is:

$\sum_{i=0}^{m-1} a_i \times b^{m-1-i}$

where $b$ is an arbitrary constant (typically prime) and $a_i$ is the ASCII value of the $i$th character of the string $a$. For example, the hash value of the string "deck" where $b=73$ (relevant ASCII values: $d=100$, $e=101$, $c=99$, $k=107$) is:

$(100 \times 73^3) + (101 \times 73^2) + (99 \times 73^1) + (107 \times 73^0) = 39,447,263$
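
We can confirm this arithmetic directly:


In [ ]:
# Evaluate the polynomial hash of "deck" with b = 73
b = 73
word = "deck"
print(sum(ord(word[i]) * b ** (len(word) - 1 - i) for i in range(len(word))))
# prints 39447263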

A Python implementation is below.


In [62]:
class RKHash:

    def __init__(self, b):
        self.b = b
        self.prevFrame = None
        self.prevHash = None


    def getHash(self, frame):

        m = len(frame)

        if self.prevFrame is not None:
            # Roll the previous hash forward in O(1): remove the old
            # first character, shift by b, and add the new last one
            tailHash = self.prevHash - (ord(self.prevFrame[0]) * self.b ** (m - 1))
            thisHash = (tailHash * self.b) + ord(frame[-1])

        else:
            # First frame: evaluate the polynomial from scratch
            thisHash = 0
            for i in range(m):
                thisHash += ord(frame[i]) * (self.b ** (m - (i + 1)))

        self.prevFrame = frame
        self.prevHash = thisHash
        return thisHash


    def clear(self):
        self.prevFrame = None
        self.prevHash = None
    
    
hasher = RKHash(6)
firstHash = hasher.getHash("tomat")
print firstHash
nextHash = hasher.getHash("omato")
print nextHash


178934
171699

prevFrame and prevHash are initialized to None, so the first frame's hash value is computed from scratch. This is handled by the else clause in getHash, which simply evaluates the summation given above.

The if clause is executed for all but the first frame. It converts prevHash, the previous frame's hash, to thisHash, the current frame's hash, in three simple $O(1)$ operations. We will trace these operations with an example prevFrame of length $4$, composed of characters $a_0$ through $a_3$. The algorithm converts its hash to that of its successor, frame, which is composed of characters $a_1$ through $a_4$. Initially:

$prevHash = (a_0 \times b^3) + (a_1 \times b^2) + (a_2 \times b^1) + (a_3 \times b^0)$

We then subtract $a_0 \times b^3$, obtaining the hash of prevHash's tail (tail refers to all but the first character; the tail of "couch" is "ouch"):

$(a_1 \times b^2) + (a_2 \times b^1) + (a_3 \times b^0)$

We then multiply by $b$:

$(a_1 \times b^3) + (a_2 \times b^2) + (a_3 \times b^1)$

And finally add $a_4 \times b^0 = a_4$, obtaining thisHash:

$(a_1 \times b^3) + (a_2 \times b^2) + (a_3 \times b^1) + (a_4 \times b^0) = thisHash$

Regardless of the length of the substrings processed, the algorithm uses only these three $O(1)$ arithmetic operations.
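
We can also verify that a rolled hash matches a from-scratch computation of the same frame:


In [ ]:
# The rolled hash of "omato" should equal the hash computed from
# scratch by a fresh hasher with the same b
rollingHasher = RKHash(6)
rollingHasher.getHash("tomat")
rolled = rollingHasher.getHash("omato")    # O(1) rolled hash

freshHasher = RKHash(6)
scratch = freshHasher.getHash("omato")     # computed from scratch

print(rolled == scratch)    # True; both equal 171699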

The clear function simply resets the RKHash instance, allowing us to start with a new first frame.

We omitted exception handling for soundness and sanity checks, so remember that getHash assumes that len(frame) == len(prevFrame), that prevFrame[1:] == frame[:-1], and that b != 0.
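
If you want those assumptions enforced, a minimal sketch of the omitted checks, placed at the top of getHash, might look like this:


In [ ]:
# Hypothetical sanity checks for RKHash.getHash, enforcing the
# assumptions stated above
def getHash(self, frame):
    if self.b == 0:
        raise ValueError("b must be nonzero")
    if self.prevFrame is not None:
        if len(frame) != len(self.prevFrame):
            raise ValueError("all frames must have the same length")
        if self.prevFrame[1:] != frame[:-1]:
            raise ValueError("frame must directly succeed prevFrame")
    # ... the rest of getHash proceeds as before ...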

Below is a modified version of our above Bloom filter implementation that uses Rabin-Karp hash functions:


In [67]:
import random

class BloomFilter:

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bitArray = [0 for _ in range(m)]
        
        # random.sample(population, x) returns a list of x unique
        # elements from the supplied population
        randBs = random.sample(range(1, 200), k)
        self.hashFuncs = [RKHash(b) for b in randBs]
        

    def getIndices(self, val):
        hashes = [f.getHash(val) for f in self.hashFuncs]
        return map((lambda hash_: hash_ % self.m), hashes)
            
    
    def add(self, val):
        indices = self.getIndices(val)
        for index in indices:
            self.bitArray[index] = 1

            
    def check(self, val):
        indices = self.getIndices(val)
        for index in indices:
            if self.bitArray[index] == 0:
                return False
        return True
    
bloomFilter = BloomFilter(1024, 3)
print "randomly selected b's for Rabin-Karp hash functions:", [f.b for f in bloomFilter.hashFuncs]
print "indices returned for the string \"johns\":", bloomFilter.getIndices("johns")
bloomFilter.add("johns")


randomly selected b's for Rabin-Karp hash functions: [101, 3, 145]
indices returned for the string "johns": [478, 676, 978]

The value of $b$ for each Rabin–Karp hashing algorithm is unique and pseudorandomly selected. Higher values of $b$ weight earlier characters more heavily relative to later ones, so two values that collide under one hash function are unlikely to collide under any other:


In [66]:
hasher6 = RKHash(6)
hasher7 = RKHash(7)

print 'hasher6.getHash("kudos") =', hasher6.getHash("kudos")
hasher6.clear()
print 'hasher6.getHash("locus") =', hasher6.getHash("locus")
print 'hasher7.getHash("kudos") =', hasher7.getHash("kudos")
hasher7.clear()
print 'hasher7.getHash("locus") =', hasher7.getHash("locus")


hasher6.getHash("kudos") = 168325
hasher6.getHash("locus") = 168325
hasher7.getHash("kudos") = 302830
hasher7.getHash("locus") = 303166
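
Finally, to close the loop on the genome example, below is a sketch of how the rolling-hash filter might be used to scan a read against a reference. The sequences are made up, and each hasher is cleared between scans; because RKHash is stateful, successive calls must pass successive frames of the same string.


In [ ]:
# Find read k-mers that are definitely absent from the reference.
# Frames are fed to the filter in order, so each RKHash instance can
# roll its hash forward in O(1).
frameLen = 4
reference = "ACGTACGGTCAGT"
read = "ACGTTCGGTCAGT"

bloom = BloomFilter(1024, 3)
for start in range(len(reference) - frameLen + 1):
    bloom.add(reference[start:start + frameLen])

# Reset the stateful hashers before scanning a new string
for f in bloom.hashFuncs:
    f.clear()

for start in range(len(read) - frameLen + 1):
    frame = read[start:start + frameLen]
    if not bloom.check(frame):
        print(frame)    # definitely not in the reference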
