Quiz - Week 3B

Q1.

Suppose we hash the elements of a set S having 20 members, to a bit array of length 99. The array is initially all-0's, and we set a bit to 1 whenever a member of S hashes to it. The hash function is random and uniform in its distribution. What is the expected fraction of 0's in the array after hashing? What is the expected fraction of 1's? You may assume that 99 is large enough that asymptotic limits are reached

Solution 1.

Throwing Darts approximation - since we know that in this case we have t = 99, d = S*20, the fraction of 0's that remain will follow the following equation:

\begin{align} (1 - 1/t)^{t(d/t)} &~= e^{-d/t} \\ &= e^{ \frac{-20*S}{99} } \end{align}

So in this cae, we can derive the fraction of 1's that remain will be $ 1 - e^{ \frac{-20*S}{99} }$

Q2.

A certain Web mail service (like gmail, e.g.) has $10^8$ users, and wishes to create a sample of data about these users, occupying $10^{10}$ bytes. Activity at the service can be viewed as a stream of elements, each of which is an email. The element contains the ID of the sender, which must be one of the $10^8$ users of the service, and other information, e.g., the recipient(s), and contents of the message. The plan is to pick a subset of the users and collect in the $10^{10}$ bytes records of length 100 bytes about every email sent by the users in the selected set (and nothing about other users).
The method of Section 4.2.4 will be used. User ID's will be hashed to a bucket number, from 0 to 999,999. At all times, there will be a threshold t such that the 100-byte records for all the users whose ID's hash to t or less will be retained, and other users' records will not be retained. You may assume that each user generates emails at exactly the same rate as other users. As a function of n, the number of emails in the stream so far, what should the threshold t be in order that the selected records will not exceed the $10^{10}$ bytes available to store records? From the list below, identify the true statement about a value of n and its value of t.

Solution 2.

Since in this problem, basically we are going to partition the data stream by its value. Thus the relationship betwee n and t will be as folloing[note that hash value starts at 0]:

\begin{align} \frac{t}{buckets\#} = \frac{10^{10}}{n} \end{align}

Quiz - Week 3A(Advanced)

Q1.

   C -- D -- E
 / |    |    | \
A  |    |    |  B
 \ |    |    | /
   F -- G -- H

Write the adjacency matrix A, the degree matrix D, and the Laplacian matrix L. For each, find the sum of all entries and the number of nonzero entries. Then identify the true statement from the list below.



In [24]:

    
## Solution 1.
import numpy as np

A = np.array([[0, 0, 1, 0, 0, 1, 0, 0], #A
              [0, 0, 0, 0, 1, 0, 0, 1], #B
              [1, 0, 0, 1, 0, 1, 0, 0], #C
              [0, 0, 1, 0, 1, 0, 1, 0], #D
              [0, 1, 0, 1, 0, 0, 0, 1], #E
              [1, 0, 1, 0, 0, 0, 1, 0], #F
              [0, 0, 0, 1, 0, 1, 0, 1], #G
              [0, 1, 0, 0, 1, 0, 1, 0]  #H
             ])
print "Adjacent Matrix A:"
print A
print "Zero elements in A:"
print A.shape[1]* A.shape[0] - A.sum()
print "Sum as well as Non zero elements in A:"
print A.sum()
print "=========================================="

D = np.diag([2,2,3,3,3,3,3,3])
print "Degree Matrix D:"
print D

print "=========================================="

L = D - A

print "Laplacian Matrix L:"
print L









    



Adjacent Matrix A:
[[0 0 1 0 0 1 0 0]
 [0 0 0 0 1 0 0 1]
 [1 0 0 1 0 1 0 0]
 [0 0 1 0 1 0 1 0]
 [0 1 0 1 0 0 0 1]
 [1 0 1 0 0 0 1 0]
 [0 0 0 1 0 1 0 1]
 [0 1 0 0 1 0 1 0]]
Zero elements in A:
42
Sum as well as Non zero elements in A:
22
==========================================
Degree Matrix D:
[[2 0 0 0 0 0 0 0]
 [0 2 0 0 0 0 0 0]
 [0 0 3 0 0 0 0 0]
 [0 0 0 3 0 0 0 0]
 [0 0 0 0 3 0 0 0]
 [0 0 0 0 0 3 0 0]
 [0 0 0 0 0 0 3 0]
 [0 0 0 0 0 0 0 3]]
==========================================
Laplacian Matrix L:
[[ 2  0 -1  0  0 -1  0  0]
 [ 0  2  0  0 -1  0  0 -1]
 [-1  0  3 -1  0 -1  0  0]
 [ 0  0 -1  3 -1  0 -1  0]
 [ 0 -1  0 -1  3  0  0 -1]
 [-1  0 -1  0  0  3 -1  0]
 [ 0  0  0 -1  0 -1  3 -1]
 [ 0 -1  0  0 -1  0 -1  3]]

Q2.

Given the following graph

    2 -----6
   /  \    |
  1    4   |
   \  /  \ |
    3      5

The goal is to find two clusters in this graph using Spectral Clustering on the Laplacian matrix. Compute the Laplacian of this graph. Then compute the second eigen vector of the Laplacian (the one corresponding to the second smallest eigenvalue).
To cluster the points, we decide to split at the mean value. We say that a node is a tie if its value in the eigen-vector is exactly equal to the mean value. Let's assume that if a point is a tie, we choose its cluster at random. Identify the true statement from the list below.



In [62]:

    
import numpy as np

A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 0, 1, 0, 0],
              [1, 0, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [0, 0, 0, 1, 0, 1],
              [0, 1, 0, 0, 1, 0]
            ])

D = np.diag([2, 2, 2, 3, 2, 2])
L = (D - A)
print "=========================================="
print "Graph Laplacian Matrix: "
print L
values, vectors = np.linalg.eig(L)
print "Eigen values: "
values = np.around(values, decimals=4)
print values
print "=========================================="
print "Eigen vectors: "
vectors = np.around(vectors, decimals=4)
print vectors
print "=========================================="
print "Second Smallest Eigen vectors: "
print vectors[:, 2]
print "=========================================="
print "Mean of each row in vectors: "
print np.mean(vectors[:, 2])









    



==========================================
Graph Laplacian Matrix: 
[[ 2 -1 -1  0  0  0]
 [-1  2  0 -1  0  0]
 [-1  0  2 -1  0  0]
 [ 0 -1 -1  3 -1  0]
 [ 0  0  0 -1  2 -1]
 [ 0 -1  0  0 -1  2]]
Eigen values: 
[ 4.618  0.     1.     3.     2.382  2.   ]
==========================================
Eigen vectors: 
[[ 0.277  -0.4082 -0.3536  0.5774 -0.5804 -0.4082]
 [-0.3626 -0.4082 -0.1768 -0.2887  0.1108 -0.4082]
 [-0.3626 -0.4082 -0.1768 -0.2887  0.1108  0.4082]
 [ 0.6723 -0.4082  0.1768 -0.2887  0.538   0.4082]
 [-0.3626 -0.4082  0.7071  0.5774  0.1108  0.4082]
 [ 0.277  -0.4082  0.5303 -0.2887 -0.5804 -0.4082]]
==========================================
Second Smallest Eigen vectors: 
[-0.3536 -0.1768 -0.1768  0.1768  0.7071  0.5303]
==========================================
Mean of each row in vectors: 
0.117833333333

Q3.

We wish to estimate the surprise number (2nd moment) of a data stream, using the method of AMS. It happens that our stream consists of ten different values, which we'll call 1, 2,..., 10, that cycle repeatedly. That is, at timestamps 1 through 10, the element of the stream equals the timestamp, at timestamps 11 through 20, the element is the timestamp minus 10, and so on. It is now timestamp 75, and a 5 has just been read from the stream. As a start, you should calculate the surprise number for this time.
For our estimate of the surprise number, we shall choose three timestamps at random, and estimate the surprise number from each, using the AMS approach (length of the stream times 2m-1, where m is the number of occurrences of the element of the stream at that timestamp, considering all times from that timestamp on, to the current time). Then, our estimate will be the median of the three resulting values.
You should discover the simple rules that determine the estimate derived from any given timestamp and from any set of three timestamps. Then, identify from the list below the set of three "random" timestamps that give the closest estimate.

Some hints:

Be sure you have the surprise number correct. Notice that some of the elements appear 8 times and others appear 7 times. The surprise number is the sum over all elements that appear of the square of the number of times they appear.
Remember that for any given timestamp, the estimate is the length of the stream (75 in this example) times 2m-1, where m is the number of times the element at that timestamp appears, at that time or later.



In [78]:

    
def cal_supprise(current_t = 75):
    common = ( current_t / 10 ) % 10
    more_set = ( current_t ) % 10

    less_set = 10 - more_set
    
    return more_set * (common+1)**2 + less_set * (common**2)
    
    

def AMS(time_list, current_t = 75):
    
    estimates = []
    for t in time_list:
        delta = current_t - t
        
        elem = t % 10
        threshold = current_t % 10
        
        common = (delta / 10) % 10

        if elem > threshold:
            estimates.append( ( 2*common - 1)*current_t )
        else:
            estimates.append( ( 2*common + 1)*current_t )

    return estimates

print cal_supprise()
print np.mean(AMS([31,32,44]))
print np.mean(AMS([14,35,42]))
print np.mean(AMS([32,48,50]))
print np.mean(AMS([22,42,62]))









    



565






    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-78-a79f5ad1e1fe> in <module>()
     29 
     30 print cal_supprise()
---> 31 print math.mean(AMS([31,32,44]))
     32 print math.mean(AMS([14,35,42]))
     33 print math.mean(AMS([32,48,50]))

AttributeError: 'module' object has no attribute 'mean'



In [ ]:

    
#buffer

Q4

We wish to use the Flagolet-Martin lgorithm of Section 4.4 to count the number of distinct elements in a stream. Suppose that there are ten possible elements, 1, 2,..., 10, that could appear in the stream, but only four of them have actually appeared. To make our estimate of the count of distinct elements, we hash each element to a 4-bit binary number. The element x is hashed to 3x + 7 (modulo 11). For example, element 8 hashes to 3*8+7 = 31, which is 9 modulo 11 (i.e., the remainder of 31/11 is 9). Thus, the 4-bit string for element 8 is 1001.
A set of four of the elements 1 through 10 could give an estimate that is exact (if the estimate is 4), or too high, or too low. You should figure out under what circumstances a set of four elements falls into each of those categories. Then, identify in the list below the set of four elements that gives the exactly correct estimate.



In [63]:

    
def hash1(x):
    return (3*x + 7) % 11

def printBinary(x):
    print "- The binary result of {0} is:".format(str(x)) + str(bin(x))

def countTrailingZerosInBinary(num):
    bnum = str(bin(num))
    return len(bnum) - len(bnum.rstrip('0'))
    
def FlagoletMatrtin(hash_list):
    maxnum = 0
    for val in hash_list:
        num = countTrailingZerosInBinary(val)
        if num > maxnum:
            maxnum = num
    return 2**maxnum
        
        
print FlagoletMatrtin(([hash1(x) for x in [1,3,6,8]]))
print FlagoletMatrtin(([hash1(x) for x in [2,4,6,10]]))
print FlagoletMatrtin(([hash1(x) for x in [2,6,8,10]]))
print FlagoletMatrtin(([hash1(x) for x in [3,4,8,10]]))

print "================="

print FlagoletMatrtin(([hash1(x) for x in [2,6,8,9]]))
print FlagoletMatrtin(([hash1(x) for x in [4,6,9,10]]))
print FlagoletMatrtin(([hash1(x) for x in [1,5,8,9]]))
print FlagoletMatrtin(([hash1(x) for x in [1,6,7,10]]))

print "================="

print FlagoletMatrtin(([hash1(x) for x in [1,2,3,9]]))
print FlagoletMatrtin(([hash1(x) for x in [1,3,9,10]]))
print FlagoletMatrtin(([hash1(x) for x in [3,4,8,10]]))
print FlagoletMatrtin(([hash1(x) for x in [4,6,9,10]]))

print "================="

print FlagoletMatrtin(([hash1(x) for x in [1,4,7,9]]))
print FlagoletMatrtin(([hash1(x) for x in [4,6,9,10]]))
print FlagoletMatrtin(([hash1(x) for x in [1,6,7,10]]))
print FlagoletMatrtin(([hash1(x) for x in [4,5,6,10]]))









    



2
8
4
8
=================
2
8
2
4
=================
2
4
8
8
=================
8
8
4
8

Q5

Suppose we are using the DGIM algorithm of Section 4.6.2 to estimate the number of 1's in suffixes of a sliding window of length 40. The current timestamp is 100, and we have the following buckets stored:

    End Time    100  98  95  92  87  80  65
        Size    1    1   2   2   4   8   8

Note: we are showing timestamps as absolute values, rather than modulo the window size, as DGIM would do.
Suppose that at times 101 through 105, 1's appear in the stream. Compute the set of buckets that would exist in the system at time 105. Then identify one such bucket from the list below. Buckets are represented by pairs (end-time, size).

Solution5.

step 1:
    101 100 95 87 80 65
    1   2   4  4  8  8
step 2:
    102 101 100 95 87 80 65
    1   1   2   4  4  8  8
step 3:
    103 102 100 95 87 80 65
    1   2   2   4  4  8  8
step 4:
    104 103 102 100 95 87 80 65
    1   1   2   2   4  4  8  8
step 5:
    105 104 102 95 80
    1   2   4   8  8



In [ ]: