We map the input value to multiple points in a relatively small output space. Therefore, the count associated with a given input will be applied to multiple counts in the output space. Even though collisions will occur, the minimum count associated with a given input will have some desirable properties, including the ability to be used to estimate the largest N counts.
http://debasishg.blogspot.com/2014/01/count-min-sketch-data-structure-for.html
Parameters of the sketch:
These parameters are inversely and exponentially (respectively) related to the sketch size parameters, d and w.
In [1]:
import sys
import random
import numpy as np
import heapq
import json
import time
BIG_PRIME = 9223372036854775783
def random_parameter():
return random.randrange(0, BIG_PRIME - 1)
class Sketch:
def __init__(self, delta, epsilon, k):
"""
Setup a new count-min sketch with parameters delta, epsilon and k
The parameters delta and epsilon control the accuracy of the
estimates of the sketch
Cormode and Muthukrishnan prove that for an item i with count a_i, the
estimate from the sketch a_i_hat will satisfy the relation
a_hat_i <= a_i + epsilon * ||a||_1
with probability at least 1 - delta, where a is the the vector of all
all counts and ||x||_1 is the L1 norm of a vector x
Parameters
----------
delta : float
A value in the unit interval that sets the precision of the sketch
epsilon : float
A value in the unit interval that sets the precision of the sketch
k : int
A positive integer that sets the number of top items counted
Examples
--------
>>> s = Sketch(10**-7, 0.005, 40)
Raises
------
ValueError
If delta or epsilon are not in the unit interval, or if k is
not a positive integer
"""
if delta <= 0 or delta >= 1:
raise ValueError("delta must be between 0 and 1, exclusive")
if epsilon <= 0 or epsilon >= 1:
raise ValueError("epsilon must be between 0 and 1, exclusive")
if k < 1:
raise ValueError("k must be a positive integer")
self.w = int(np.ceil(np.exp(1) / epsilon))
self.d = int(np.ceil(np.log(1 / delta)))
self.k = k
self.hash_functions = [self.__generate_hash_function() for i in range(self.d)]
self.count = np.zeros((self.d, self.w), dtype='int32')
self.heap, self.top_k = [], {} # top_k => [estimate, key] pairs
def update(self, key, increment):
"""
Updates the sketch for the item with name of key by the amount
specified in increment
Parameters
----------
key : string
The item to update the value of in the sketch
increment : integer
The amount to update the sketch by for the given key
Examples
--------
>>> s = Sketch(10**-7, 0.005, 40)
>>> s.update('http://www.cnn.com/', 1)
"""
for row, hash_function in enumerate(self.hash_functions):
column = hash_function(abs(hash(key)))
self.count[row, column] += increment
self.update_heap(key)
def update_heap(self, key):
"""
Updates the class's heap that keeps track of the top k items for a
given key
For the given key, it checks whether the key is present in the heap,
updating accordingly if so, and adding it to the heap if it is
absent
Parameters
----------
key : string
The item to check against the heap
"""
estimate = self.get(key)
if not self.heap or estimate >= self.heap[0][0]:
if key in self.top_k:
old_pair = self.top_k.get(key)
old_pair[0] = estimate
heapq.heapify(self.heap)
else:
if len(self.top_k) < self.k:
heapq.heappush(self.heap, [estimate, key])
self.top_k[key] = [estimate, key]
else:
new_pair = [estimate, key]
old_pair = heapq.heappushpop(self.heap, new_pair)
del self.top_k[old_pair[1]]
self.top_k[key] = new_pair
def get(self, key):
"""
Fetches the sketch estimate for the given key
Parameters
----------
key : string
The item to produce an estimate for
Returns
-------
estimate : int
The best estimate of the count for the given key based on the
sketch
Examples
--------
>>> s = Sketch(10**-7, 0.005, 40)
>>> s.update('http://www.cnn.com/', 1)
>>> s.get('http://www.cnn.com/')
1
"""
value = sys.maxint
for row, hash_function in enumerate(self.hash_functions):
column = hash_function(abs(hash(key)))
value = min(self.count[row, column], value)
return value
def __generate_hash_function(self):
"""
Returns a hash function from a family of pairwise-independent hash
functions
"""
a, b = random_parameter(), random_parameter()
return lambda x: (a * x + b) % BIG_PRIME % self.w
In [2]:
# define a function to return a list of the exact top users, sorted by count
def exact_top_users(f, top_n = 10):
import operator
counts = {}
for line in f:
try:
user = json.loads(line)['actor']['preferredUsername']
if user not in counts:
counts[user] = 1
else:
counts[user] += 1
except ValueError:
pass
except KeyError:
pass
counter = 0
results = []
for user,count in reversed(sorted(counts.iteritems(), key=operator.itemgetter(1))):
if counter >= top_n:
break
results.append('{} {}'.format(user,str(count)))
counter += 1
return results
In [8]:
f = open('CM_small.json')
results_exact = sorted(exact_top_users(f))
print(results_exact)
In [9]:
# define a function to return a list of the estimated top users, sorted by count
def CM_top_users(f, s):
for line in f:
try:
user_name = json.loads(line)['actor']['preferredUsername']
s.update(user_name,1)
except ValueError:
pass
except KeyError:
pass
results = []
for value in reversed(sorted(s.top_k.values())):
results.append('{1} {0}'.format(str(value[0]),str(value[1])))
return results
In [10]:
# instantiate a Sketch object
s = Sketch(10**-3, 0.1, 10)
In [13]:
f = open('CM_small.json')
results_CM = sorted(CM_top_users(f,s))
print(results_CM)
In [14]:
for item in zip(results_exact,results_CM):
print(item)
Is it possible to make the sketchs so coarse that its estimates are wrong even for this data set?
In [182]:
s = Sketch(0.9, 0.9, 10)
f = open('CM_small.json')
results_coarse_CM = CM_top_users(f,s)
print(results_coarse_CM)
Yes! (if you try enough) Why?
So, you're dealing with a space with minimum size 3 x 1. With 10 records, it's possible that all 4 users map their counts to the point. So it's possible to see an estimate as high as 10, in this case.
In [184]:
f = open('CM_large.json')
%time results_exact = exact_top_users(f)
print(results_exact)
In [185]:
f = open('CM_large.json')
s = Sketch(10**-4, 0.001, 10)
%time results_CM = CM_top_users(f,s)
print(results_CM)
For this precision and dataset size, the CM algo takes longer than the exact solution. How's the accuracy?
In [190]:
for item in zip(results_exact,results_CM):
print(item)
print item[1] in results_exact and item[0] in results_CM
In [188]:
f = open('CM_large.json')
s = Sketch(10**-3, 0.01, 10)
%time results_CM = CM_top_users(f,s)
print(results_CM)
In [191]:
for item in zip(results_exact,results_CM):
print(item)
print item[1] in results_exact and item[0] in results_CM
The most common use of the CM sketch is analysis of streaming data. Why?
This will be a primary function of the Enumerator, which will be demo-ed soon.
In [ ]: