In [1]:
%load_ext watermark

In [2]:
%watermark -a 'Sebastian Raschka' -u -d -v


Sebastian Raschka 
last updated: 2017-07-23 

CPython 3.6.1
IPython 6.0.0

Bloom Filters

Bloom filters in a nutshell

A bloom filter is a probablistic data structures for memory-efficient look-ups to test if a element or value is a member of a set. In a nutshell, you can think of a bloom filter as a large bit array (an array that contains 1s and 0s), and by only checking a few elements (bits) of this array, we can tell whether an element is likely a member of a set or is definitely not a member of a set. Checking the set membership via bloom filters can return the following outputs:

  • Element is probably a member of the set
  • Element is definitely not a member of a set

Or in other words, bloom filters can produce false positives (a match means that a element is a member of the set with a given uncertainty) but does not produce false negatives (non-matches always mean that the element is not a member of that set).

So, if bloom filters are probablistic and can produce false positives, why are they useful in practice? Bloom filters are extremely useful if we are working with large databases and want to run a quick check whether or not it's worth to run a computationally more expensive database query to retrieve an element. If a bloom-filter returns a non-match, we know that an element is not contained in the set and thus we can't save computational resources to check the actual database for that element.

The basic mechanics of bloom filters

Bloom filters allow us to implement computationally cheap and memory efficient set membership checks using bit arrays. Given an element, a bloom filter uses multiple hash functions (or the same hash function with different random seeds) to encode the element as a position in the bit array. To walk through the inner works of a bloom filter step by step.

1)

let's assume we have initialized the following, empty bitarray b underlying the bloom filter of size 10:

b = [0 0 0 0 0 0 0 0 0 0]

2)

Next, we use two different hash functions, h1 and h2 to encode an element e. These hash functions convert the output of the hash into an integer and normalize the integer so that it fits into the bounds of the array b:

h1(e) -> 5
h2(e) -> 3

In the example above, the first hash function hashes element e to the array index position 5, and the second hash function hashes the element to the array index position 3.

3)

If we consider the hash operations of step 2) as part of an add to the set operation, we would use those returned array index position to update the bitarray b as follows:

[0 0 0 0 0 0 0 0 0 0] -> [0 0 0 1 0 1 0 0 0 0]

If step 2) was part of a query or look-up operation, we would simply check the respective array index positions:

  • If both position 3 and 5 have the bit value 1, the query returns "probably in set"
  • If position 3 or 5 (or both) have the bit value 0, the query returns "definitely not in set"

Implementing a bloom filter in Python

In this section, we are going to implement a bloom filter in Python. However, note that the following implementation of a bloom filter in Python mainly serves illustrative purposes and has not be designed for efficiency. For example, using Python list objects for representing bit arrays is very inefficient compared to using the bitarray package.

To generate the hashes, we will use the hashlib module from Python's standard library. Let's start with a simple example generating an integer hash using the MD5 hash function:


In [3]:
import hashlib

h1 = hashlib.md5()
h1.update('hello-world'.encode('utf-8'))
int(h1.hexdigest(), 16)


Out[3]:
43309944592180122138158756568456140780

Unfortunately, the update method will render the hash function non-deterministic in the context of bloom filters. I.e., hashing the same value would return a different integer hash:


In [4]:
h1.update('hello-world'.encode('utf-8'))
int(h1.hexdigest(), 16)


Out[4]:
94628577942110548051725978613383116803
A one-line work around that we can use in our bloom filter implementation would be as follows:

In [5]:
int(hashlib.new('md5', ('%s' % 'hello-world').encode('utf-8')).hexdigest(), 16)


Out[5]:
43309944592180122138158756568456140780

In [6]:
int(hashlib.new('md5', ('%s' % 'hello-world').encode('utf-8')).hexdigest(), 16)


Out[6]:
43309944592180122138158756568456140780

Next, let's implement a BloomFilter class based on the concepts we discussed earlier:


In [7]:
class BloomFilter():
    def __init__(self, array_size, hash_names):
        self.array_size = array_size
        self.bitarray = [0] * array_size
        self.hash_names = hash_names
        
    def _get_hash_positions(self, value):
        pos = []
        for h in self.hash_names:
            hash_hex = hashlib.new(h, ('%s' % value).encode(
                'utf-8')).hexdigest()
            # convert hashed value into an integer
            asint = int(hash_hex, 16)
            # modulo array_size to fit hash value into the bitarray
            pos.append(asint % self.array_size)
        return pos
        
    def add(self, value):
        hash_pos = self._get_hash_positions(value)
        for pos in hash_pos:
            self.bitarray[pos] = 1
        
    def query(self, value):
        hash_pos = self._get_hash_positions(value)
        for pos in hash_pos:
            if not self.bitarray[pos]:
                return False
        return True

To test our implementation, let's initialize a bloom filter with array size 10 and two different hash function, a simple SHA hash and MD5 hash:


In [8]:
bloom = BloomFilter(array_size=10, hash_names=('md5', 'sha1'))
bloom.bitarray


Out[8]:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Next, we will add a new value to the bloom filter, 'hello world!'


In [9]:
bloom.add('hello world!')
bloom.bitarray


Out[9]:
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0]

As we can see from running the previous code example, the array index position the value was hashed into are 1 and 7. Let's check the element for membership now:


In [10]:
bloom.query('hello world!')


Out[10]:
True

So far so good. Let's add another value, 'foo-bar':


In [11]:
bloom.add('foo-bar')
bloom.bitarray


Out[11]:
[0, 1, 0, 1, 0, 1, 0, 1, 0, 0]

The 'foo-bar' value was hashed into positions 3 and 5. Similarly, we can check the membership as follows:


In [12]:
bloom.query('foo-bar')


Out[12]:
True

Just to confirm that our bloom filter is implemented correctly and does not return false negative, let's run a query on a new value:


In [13]:
bloom.query('test')


Out[13]:
False

Determining the probability of false positives

A nice property of bloom filters is that we can determine the size of the bitarray mathematically for a desired false positive rate. The following equation approximates the probability of returning a false positive (for more details, and how to determine the optimal size of the bitarray and number of hashes, see the excellent article at https://en.wikipedia.org/wiki/Bloom_filter):

$$P \approx \big(1 - e^{-kn / m}\big)^k$$

Here, $k$ is the number of hashes, $m$ is the size of the bitarray, $n$ is the expected number of elements that are to be stored in the bloom filter, and $k$ is the number of hash functions.

For instance, say we only expect 3 values to be stored, only have 2 hash functions, and a tiny bitarray (just as in the previous code example), then our probability of returning a false positive would be approximately 20%:

$$\big(1 - e^{-2*3 / 10}\big)^2 \approx 0.2$$