Basic Idea of Count Min sketch

We map the input value to multiple points in a relatively small output space. Therefore, the count associated with a given input will be applied to multiple counts in the output space. Even though collisions will occur, the minimum count associated with a given input will have some desirable properties, including the ability to be used to estimate the largest N counts.

http://debasishg.blogspot.com/2014/01/count-min-sketch-data-structure-for.html

Parameters of the sketch:

  • epsilon
  • delta

These parameters are inversely and exponentially (respectively) related to the sketch size parameters, d and w.

Implementation of the CM sketch


In [1]:
import sys
import random
import numpy as np
import heapq
import json
import time

BIG_PRIME = 9223372036854775783

def random_parameter():
    return random.randrange(0, BIG_PRIME - 1)


class Sketch:
    def __init__(self, delta, epsilon, k):
        """
        Setup a new count-min sketch with parameters delta, epsilon and k

        The parameters delta and epsilon control the accuracy of the
        estimates of the sketch

        Cormode and Muthukrishnan prove that for an item i with count a_i, the
        estimate from the sketch a_i_hat will satisfy the relation

        a_hat_i <= a_i + epsilon * ||a||_1

        with probability at least 1 - delta, where a is the the vector of all
        all counts and ||x||_1 is the L1 norm of a vector x

        Parameters
        ----------
        delta : float
            A value in the unit interval that sets the precision of the sketch
        epsilon : float
            A value in the unit interval that sets the precision of the sketch
        k : int
            A positive integer that sets the number of top items counted

        Examples
        --------
        >>> s = Sketch(10**-7, 0.005, 40)

        Raises
        ------
        ValueError
            If delta or epsilon are not in the unit interval, or if k is
            not a positive integer

        """
        if delta <= 0 or delta >= 1:
            raise ValueError("delta must be between 0 and 1, exclusive")
        if epsilon <= 0 or epsilon >= 1:
            raise ValueError("epsilon must be between 0 and 1, exclusive")
        if k < 1:
            raise ValueError("k must be a positive integer")

        self.w = int(np.ceil(np.exp(1) / epsilon))
        self.d = int(np.ceil(np.log(1 / delta)))
        self.k = k
        self.hash_functions = [self.__generate_hash_function() for i in range(self.d)]
        self.count = np.zeros((self.d, self.w), dtype='int32')
        self.heap, self.top_k = [], {} # top_k => [estimate, key] pairs

    def update(self, key, increment):
        """
        Updates the sketch for the item with name of key by the amount
        specified in increment

        Parameters
        ----------
        key : string
            The item to update the value of in the sketch
        increment : integer
            The amount to update the sketch by for the given key

        Examples
        --------
        >>> s = Sketch(10**-7, 0.005, 40)
        >>> s.update('http://www.cnn.com/', 1)

        """
        for row, hash_function in enumerate(self.hash_functions):
            column = hash_function(abs(hash(key)))
            self.count[row, column] += increment

        self.update_heap(key)

    def update_heap(self, key):
        """
        Updates the class's heap that keeps track of the top k items for a
        given key

        For the given key, it checks whether the key is present in the heap,
        updating accordingly if so, and adding it to the heap if it is
        absent

        Parameters
        ----------
        key : string
            The item to check against the heap

        """
        estimate = self.get(key)

        if not self.heap or estimate >= self.heap[0][0]:
            if key in self.top_k:
                old_pair = self.top_k.get(key)
                old_pair[0] = estimate
                heapq.heapify(self.heap)
            else:
                if len(self.top_k) < self.k:
                    heapq.heappush(self.heap, [estimate, key])
                    self.top_k[key] = [estimate, key]
                else:
                    new_pair = [estimate, key]
                    old_pair = heapq.heappushpop(self.heap, new_pair)
                    del self.top_k[old_pair[1]]
                    self.top_k[key] = new_pair

    def get(self, key):
        """
        Fetches the sketch estimate for the given key

        Parameters
        ----------
        key : string
            The item to produce an estimate for

        Returns
        -------
        estimate : int
            The best estimate of the count for the given key based on the
            sketch

        Examples
        --------
        >>> s = Sketch(10**-7, 0.005, 40)
        >>> s.update('http://www.cnn.com/', 1)
        >>> s.get('http://www.cnn.com/')
        1

        """
        value = sys.maxint
        for row, hash_function in enumerate(self.hash_functions):
            column = hash_function(abs(hash(key)))
            value = min(self.count[row, column], value)

        return value

    def __generate_hash_function(self):
        """
        Returns a hash function from a family of pairwise-independent hash
        functions

        """
        a, b = random_parameter(), random_parameter()
        return lambda x: (a * x + b) % BIG_PRIME % self.w

In [2]:
# define a function to return a list of the exact top users, sorted by count
def exact_top_users(f, top_n = 10):
    import operator
    counts = {}
    for line in f:
        try:
            user = json.loads(line)['actor']['preferredUsername']
            if user not in counts:
                counts[user] = 1
            else:
                counts[user] += 1
        except ValueError:
            pass
        except KeyError:
            pass
    counter = 0
    results = []
    for user,count in reversed(sorted(counts.iteritems(), key=operator.itemgetter(1))):
        if counter >= top_n:
            break
        results.append('{} {}'.format(user,str(count)))
        counter += 1
    return results

In [8]:
f = open('CM_small.json')
results_exact = sorted(exact_top_users(f))
print(results_exact)


['HPSupport 3', 'user1 3', 'user2 3', 'user4 1']

In [9]:
# define a function to return a list of the estimated top users, sorted by count
def CM_top_users(f, s):
    for line in f:
        try:
            user_name = json.loads(line)['actor']['preferredUsername']
            s.update(user_name,1)
        except ValueError:
            pass
        except KeyError:
            pass
    
    results = []
    for value in reversed(sorted(s.top_k.values())):
        results.append('{1} {0}'.format(str(value[0]),str(value[1])))
    return results

In [10]:
# instantiate a Sketch object
s = Sketch(10**-3, 0.1, 10)

In [13]:
f = open('CM_small.json')
results_CM = sorted(CM_top_users(f,s))
print(results_CM)


['HPSupport 6', 'user1 6', 'user2 6', 'user4 2']

In [14]:
for item in zip(results_exact,results_CM):
    print(item)


('HPSupport 3', 'HPSupport 6')
('user1 3', 'user1 6')
('user2 3', 'user2 6')
('user4 1', 'user4 2')

Is it possible to make the sketchs so coarse that its estimates are wrong even for this data set?


In [182]:
s = Sketch(0.9, 0.9, 10)
f = open('CM_small.json')
results_coarse_CM = CM_top_users(f,s)
print(results_coarse_CM)


['user2 6', 'HPSupport 5', 'user1 3', 'user4 1']

Yes! (if you try enough) Why?

  • The 'w' parameter goes like ceiling(exp(1)/epsilon), which is always >=~ 3.
  • The 'd' parameter goes like ceiling(log(1/delta), which is always >= 1.

So, you're dealing with a space with minimum size 3 x 1. With 10 records, it's possible that all 4 users map their counts to the point. So it's possible to see an estimate as high as 10, in this case.

Now for a larger data set.


In [184]:
f = open('CM_large.json')
%time results_exact = exact_top_users(f)
print(results_exact)


CPU times: user 18.6 s, sys: 255 ms, total: 18.9 s
Wall time: 19 s
['ryuutuu19 39', 'jidousya_ 29', 'food_nourin 27', 'life_style_s 26', 'punpun4 25', 'SportsAB 25', 'fudousankensetu 22', 'TaylorMadeGolf 17', '333_shy 15', 'FuckMica_ 14']

In [185]:
f = open('CM_large.json')
s = Sketch(10**-4, 0.001, 10)
%time results_CM = CM_top_users(f,s)
print(results_CM)


CPU times: user 40.1 s, sys: 222 ms, total: 40.4 s
Wall time: 40.3 s
['ryuutuu19 82', 'food_nourin 72', 'life_style_s 70', 'jidousya_ 69', 'SportsAB 67', '333_shy 62', 'fudousankensetu 61', 'punpun4 59', 'TaylorMadeGolf 59', 'FuckMica_ 58']

For this precision and dataset size, the CM algo takes longer than the exact solution. How's the accuracy?


In [190]:
for item in zip(results_exact,results_CM):
    print(item)
    print item[1] in results_exact and item[0] in results_CM


('ryuutuu19 39', 'yuki_kkth 538')
False
('jidousya_ 29', 'georgiaa_wbuu 531')
False
('food_nourin 27', '_MahlonDallee 529')
False
('life_style_s 26', 'daniel_leparulo 527')
False
('punpun4 25', 'Carla_Carpediem 525')
False
('SportsAB 25', 'viishthay 523')
False
('fudousankensetu 22', 'YOLOfonseca 521')
False
('TaylorMadeGolf 17', 'cautiouX 520')
False
('333_shy 15', 'Ana_Bia147 520')
False
('FuckMica_ 14', 'AdrianaCelis7 520')
False

In [188]:
f = open('CM_large.json')
s = Sketch(10**-3, 0.01, 10)
%time results_CM = CM_top_users(f,s)
print(results_CM)


CPU times: user 35.9 s, sys: 254 ms, total: 36.1 s
Wall time: 36.1 s
['yuki_kkth 538', 'georgiaa_wbuu 531', '_MahlonDallee 529', 'daniel_leparulo 527', 'Carla_Carpediem 525', 'viishthay 523', 'YOLOfonseca 521', 'cautiouX 520', 'Ana_Bia147 520', 'AdrianaCelis7 520']

In [191]:
for item in zip(results_exact,results_CM):
    print(item)
    print item[1] in results_exact and item[0] in results_CM


('ryuutuu19 39', 'yuki_kkth 538')
False
('jidousya_ 29', 'georgiaa_wbuu 531')
False
('food_nourin 27', '_MahlonDallee 529')
False
('life_style_s 26', 'daniel_leparulo 527')
False
('punpun4 25', 'Carla_Carpediem 525')
False
('SportsAB 25', 'viishthay 523')
False
('fudousankensetu 22', 'YOLOfonseca 521')
False
('TaylorMadeGolf 17', 'cautiouX 520')
False
('333_shy 15', 'Ana_Bia147 520')
False
('FuckMica_ 14', 'AdrianaCelis7 520')
False

The most common use of the CM sketch is analysis of streaming data. Why?

  • Becasue the data are arriving in real time, the hashing of the inputs is not a bottleneck as it is when the data are already collected.
  • The sketches are associative, meaning that the operation can be parallelized trivially, and the results easily combined in the end.

This will be a primary function of the Enumerator, which will be demo-ed soon.


In [ ]: