In [3]:
import numpy as np
import math

Q1. Solution

  • 3-shingles for "hello world":
    • hel, ell, llo, lo_, o_w, _wo, wor, orl, rld => 9 in total
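A quick way to verify the count (treating the space, written as _ above, as an ordinary character):

```python
def k_shingles(s, k=3):
    # all length-k substrings of s, in order
    return [s[i:i + k] for i in range(len(s) - k + 1)]

print(k_shingles("hello world"))
print(len(k_shingles("hello world")))  # 9
```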

Q2. Solution


In [19]:
## Q2 Solution.
def h(x):
    # bucket for x under h(x) = (3x + 2) mod 11; integer arithmetic,
    # not math.fmod, so the buckets come out as ints
    return (3 * x + 2) % 11

for i in range(1, 12):
    print(h(i))


5
8
0
3
6
9
1
4
7
10
2

Q3.

  • This question involves three different Bloom-filter-like scenarios. Each scenario involves setting to 1 certain bits of a 10-bit array, each bit of which is initially 0.

  • Scenario A:

    • We use one hash function that randomly, and with equal probability, selects one of the ten bits of the array. We apply this hash function to four different inputs and set to 1 each of the selected bits.
  • Scenario B:

    • We use two hash functions, each of which randomly, with equal probability, and independently of the other hash function selects one of the 10 bits of the array. We apply both hash functions to each of two inputs and set to 1 each of the selected bits.
  • Scenario C:

    • We use one hash function that randomly and with equal probability selects two different bits among the ten in the array. We apply this hash function to two inputs and set to 1 each of the selected bits.
  • Let a, b, and c be the expected number of bits set to 1 under scenarios A, B, and C, respectively. Which of the following correctly describes the relationships among a, b, and c?


In [14]:
## Q3 Solution.
# For each scenario, first find the probability that one particular bit
# is still 0, then the expected number of bits set to 1 (10 bits total).
p_zero_a = (1 - 0.1)**4      # 4 inputs x 1 bit each
a = 10 * (1 - p_zero_a)
print(round(a, 4))
p_zero_b = (1 - 0.1)**4      # 2 inputs x 2 hash functions = 4 independent bits
b = 10 * (1 - p_zero_b)
print(round(b, 4))
p_zero_c = (1 - 0.2)**2      # each application marks 2 of the 10 bits
c = 10 * (1 - p_zero_c)
print(round(c, 4))


3.439
3.439
3.6

  • Hence a = b < c.
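Since scenario C is the easy one to miscount, here is a quick Monte Carlo sanity check of the expected counts (scenario B has the same distribution as A: four independent single-bit selections):

```python
import random

random.seed(0)
trials = 200000
tot_a = tot_c = 0
for _ in range(trials):
    # A: four independent uniform choices of one bit out of ten
    tot_a += len({random.randrange(10) for _ in range(4)})
    # C: two applications, each selecting two *different* bits
    tot_c += len(set(random.sample(range(10), 2)) | set(random.sample(range(10), 2)))
print(tot_a / trials, tot_c / trials)  # close to 3.439 and 3.6
```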

Q4.

  • In this market-basket problem, there are 99 items, numbered 2 to 100. There is a basket for each prime number between 2 and 100. The basket for p contains all and only the items whose numbers are a multiple of p. For example, the basket for 17 contains the following items: {17, 34, 51, 68, 85}. What is the support of the pair of items {12, 30}?

Q4 Solution.

support = 2: the basket for prime p contains both 12 and 30 exactly when p divides both, i.e. p ∈ {2, 3}.
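The support can be checked by brute force over all the prime-numbered baskets:

```python
def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n**0.5) + 1))

# the basket for prime p holds the multiples of p in {2, ..., 100}
baskets = {p: {m for m in range(2, 101) if m % p == 0}
           for p in range(2, 101) if is_prime(p)}

support = sum(1 for b in baskets.values() if {12, 30} <= b)
print(support)  # 2  (only the baskets for 2 and 3)
```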

Q5.

  • To two decimal places, what is the cosine of the angle between the vectors [2,1,1] and [10,-7,1]?

In [18]:
## Q5 Solution.
vec1 = np.array([2,   1, 1])
vec2 = np.array([10, -7, 1])
# cosine = dot product / (product of norms), to two decimal places
print(round(float(vec1.dot(vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))), 2))


0.47

Q6.

  • In this question we use six minhash functions, organized as three bands of two rows each, to identify sets of high Jaccard similarity. If two sets have Jaccard similarity 0.6, what is the probability (to two decimal places) that this pair will become a candidate pair?

In [21]:
## Q6 Solution.

# probability that the two sets agree in one particular band (2 rows)
p1 = 0.6**2
# candidate pair <=> they agree in at least one of the 3 bands
print(round(1 - (1 - p1)**3, 2))


0.74

Q7.

  • Suppose we have a (.4, .6, .9, .1)-sensitive family of functions. If we apply a 3-way OR construction to this family, we get a new family of functions whose sensitivity is:

In [26]:
## Q7 Solution.
# an r-way OR construction maps each probability p to 1 - (1 - p)^r
p1 = 1 - (1 - .9)**3
p2 = 1 - (1 - .1)**3
print("new LSH family is (.4, .6, {:.3f}, {:.3f})-sensitive".format(p1, p2))


new LSH family is (.4, .6, 0.999, 0.271)-sensitive

Q8.

  • Suppose we have a database of (Class, Student, Grade) facts, each giving the grade the student got in the class. We want to estimate the fraction of students who have gotten A's in at least 10 classes, but we do not want to examine the entire relation, just a sample of 10% of the tuples. We shall hash tuples to 10 buckets, and take only those tuples in the first bucket. But to get a valid estimate of the fraction of students with at least 10 A's, we need to pick our hash key judiciously. To which Attribute(s) of the relation should we apply the hash function?

Q8 Solution.

  • Hash on the Student attribute only. Then all tuples for a given student fall in the same bucket, so for each sampled student we can count their A's exactly; hashing on Class (or on Class and Student together) would keep only a fraction of each student's tuples and bias the estimate of "at least 10 A's."
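A small illustration on made-up data (the relation, the 10-bucket count, and the hash function here are all hypothetical): hashing on Student keeps every sampled student's tuples together, so per-student counts of A's survive intact.

```python
# hypothetical relation: 50 students, each with 12 straight-A tuples
tuples = [("class%d" % c, "student%d" % s, "A")
          for s in range(50) for c in range(12)]

def bucket(key):
    # hypothetical deterministic 10-bucket hash
    return sum(map(ord, key)) % 10

# hash on the Student attribute and keep only bucket 0
sample = [t for t in tuples if bucket(t[1]) == 0]
counts = {}
for _, s, _ in sample:
    counts[s] = counts.get(s, 0) + 1
print(set(counts.values()))  # {12}: each sampled student keeps all 12 tuples
```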

Q9

  • Suppose the Web consists of four pages A, B, C, and D, that form a chain A-->B-->C-->D

  • We wish to compute the PageRank of each of these pages, but since D is a "dead end," we will "teleport" from D with probability 1 to one of the four pages, each with equal probability. We do not teleport from pages A, B, or C. Assuming the sum of the PageRanks of the four pages is 1, what is the PageRank of page B, correct to two decimal places?


In [41]:
## Q9 Solution.
# Column-stochastic link matrix; the dead end D teleports to all four pages.
M = np.array([[0, 0, 0, .25],
              [1, 0, 0, .25],
              [0, 1, 0, .25],
              [0, 0, 1, .25]])
r = np.array([.25, .25, .25, .25])

# power iteration
for i in range(30):
    r = M.dot(r)

print(np.round(r, 2))


[0.1 0.2 0.3 0.4]

  • The PageRank of page B is 0.20.

Q10.

  • Suppose in the AGM model we have four individuals {A,B,C,D} and two communities. Community 1 consists of {A,B,C} and Community 2 consists of {B,C,D}. For Community 1 there is a 30% chance that it will cause an edge between any two of its members. For Community 2 there is a 40% chance that it will cause an edge between any two of its members. To the nearest two decimal places, what is the probability that there is an edge between B and C?

In [44]:
## Q10 Solution.
# B and C belong to both communities; the edge is absent only if
# neither community creates it.
print(round(1 - (1 - .3) * (1 - .4), 2))


0.58

Q11.

  • X is a dataset of n columns on which we train a supervised Machine Learning algorithm. e is the error of the model measured against a validation dataset. Unfortunately, e is too high because the model has overfitted the training data X and doesn't generalize well. We now decide to reduce the model variance by reducing the dimensionality of X using a Singular Value Decomposition, and we use the resulting dataset to train our model. If i is the number of singular values used in the SVD reduction, how does e change as a function of i, for i ∈ {1, 2,...,n}?

Solution.

  • e is roughly a convex (U-shaped) function of i: for small i too much signal is discarded and the model underfits, so e is high; as i grows, e first decreases; as i approaches n we recover X, and with it the original overfitting, so e rises again.

In [48]:
## Q12 Solution.
# Multiply out the given SVD factors: U (7x2), Sigma (2x2), V^T (2x5).
L = np.array([[-.25, -.5, -.76, -.29, -.03, -.07, -.01],
              [-.05, -.1, -.15,  .20,  .26,  .51,  .77 ]]).T
print(L)

V = np.array([[6.74, 0], [0, 5.44]])
print(V)

R = np.array([[-.57, -.11, -.57, -.11, -.57],
              [-.09, 0.70, -.09,  .7,  -.09]])
print(R)
print(L.dot(V).dot(R))


[[-0.25 -0.05]
 [-0.5  -0.1 ]
 [-0.76 -0.15]
 [-0.29  0.2 ]
 [-0.03  0.26]
 [-0.07  0.51]
 [-0.01  0.77]]
[[ 6.74  0.  ]
 [ 0.    5.44]]
[[-0.57 -0.11 -0.57 -0.11 -0.57]
 [-0.09  0.7  -0.09  0.7  -0.09]]
[[ 0.98493  -0.00505   0.98493  -0.00505   0.98493 ]
 [ 1.96986  -0.0101    1.96986  -0.0101    1.96986 ]
 [ 2.993208 -0.007736  2.993208 -0.007736  2.993208]
 [ 1.016202  0.976606  1.016202  0.976606  1.016202]
 [-0.012042  1.012322 -0.012042  1.012322 -0.012042]
 [ 0.01923   1.993978  0.01923   1.993978  0.01923 ]
 [-0.338574  2.939574 -0.338574  2.939574 -0.338574]]

Q13.

  • Recall that the power iteration does r = X·r until convergence, where X is an n×n matrix and n is the number of nodes in the graph.

  • Using the power-iteration notation above, what is the value of the matrix X when solving topic-sensitive PageRank with teleport set {0,1} for the following graph? Use beta = 0.8. (Recall that the teleport set contains the destination nodes used when teleporting.)


In [53]:
## Q13 Solution.
# X = beta * M + (1 - beta) * T, where T teleports only to nodes 0 and 1.
X  = 0.8 * np.array([[1.0/3, 0, 0],
                     [1.0/3, 0, 0],
                     [1.0/3, 1, 0]])
X += 0.2 * np.array([[.5, .5, .5],
                     [.5, .5, .5],
                     [ 0,  0,  0]])
print(X)


[[ 0.36666667  0.1         0.1       ]
 [ 0.36666667  0.1         0.1       ]
 [ 0.26666667  0.8         0.        ]]

Q14.

  • Here are two sets of integers S = {1,2,3,4} and T = {1,2,5,6,x}, where x stands for some integer. For how many different integer values of x are the Jaccard similarity and the Jaccard distance of S and T the same? (Note: x can be one of 1, 2, 5, or 6, but in that case T, being a set, will contain x only once and thus have four members, not five.)

Solution.

  • x = 3 or x = 4, so there are 2 such values: in both cases |S ∩ T| = 3 and |S ∪ T| = 6, giving similarity = distance = 1/2. For any other new x the similarity is 2/7, and for x ∈ {1,2,5,6} it is 1/3, neither of which equals 1/2.
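The answer can be confirmed by scanning a range of integer values of x (similarity equals distance exactly when the similarity is 0.5):

```python
def jaccard(s, t):
    # Jaccard similarity: |intersection| / |union|
    return len(s & t) / len(s | t)

S = {1, 2, 3, 4}
hits = [x for x in range(-50, 51) if jaccard(S, {1, 2, 5, 6, x}) == 0.5]
print(hits)  # [3, 4]
```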

Q15.

  • Which of the following are advantages of using decision trees? (check all correct options)

Solution.

  • My Answer
    • It can handle categorical input data without any special preprocessing
    • The resulting model is easy to interpret
    • The training is easy to parallelize

Q16.

  • Consider a dataset of points x1,...,xn with labels y1,...,yn ∈ {-1, 1}, such that the data is separable. We run a soft-margin SVM and a hard-margin SVM, and in each case we obtain parameters w and b. Check the option that is true:
    • The resulting w and b can be different, and the boundaries can be different.
    • The resulting w and b are the same in the two cases, hence boundaries are the same.
    • The resulting w and b can be different in the two cases, but the boundaries are the same.
    • None of the above.

Q17.

  • Consider the following MapReduce algorithm. The input is a collection of positive integers. Given integer X, the Map function produces a tuple with key Y and value X for each prime divisor Y of X. For example, if X = 20, there are two key-value pairs: (2,20) and (5,20). The Reduce function, given a key K and list L of values, produces a tuple with key K and value sum(L) i.e., the sum of the values in the list. Given the input 9, 15, 16, 23, 25, 27, 28, 56 which of the following tuples appears in the final output?

Solution.

  • (2, 16 + 28 + 56) => (2, 100)
  • (3, 9 + 15 + 27) => (3, 51)
  • (5, 15 + 25) => (5, 40)
  • (7, 28 + 56) => (7, 84)
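The whole Map/Reduce pipeline is small enough to simulate directly and confirm the sums (23, being prime, also yields the tuple (23, 23)):

```python
from collections import defaultdict

def prime_divisors(x):
    # distinct prime divisors of x by trial division
    divs, d = [], 2
    while d * d <= x:
        if x % d == 0:
            divs.append(d)
            while x % d == 0:
                x //= d
        d += 1
    if x > 1:
        divs.append(x)
    return divs

# Map: emit (prime divisor Y, X); Reduce: sum the values for each key
buckets = defaultdict(list)
for x in [9, 15, 16, 23, 25, 27, 28, 56]:
    for p in prime_divisors(x):
        buckets[p].append(x)

result = {k: sum(v) for k, v in sorted(buckets.items())}
print(result)  # {2: 100, 3: 51, 5: 40, 7: 84, 23: 23}
```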

Q18.

  • Suppose we run K-means clustering over the following set of points in 2-d space using the L1 distance metric:

    • (1,1), (2,1), (2,2), (3,3), (4,2), (2,4), (4,4).
    • We pick k=2 and the initial centroids are (1,1) and (4,4).
    • Which of these is the centroid of the cluster containing the point (3,3) when the algorithm terminates?
  • Recall that the L1 distance between two points is the sum of their distances along each dimension,

    • e.g. the L1 distance between (1, 2) and (-1, 3) is 3.

In [68]:
## Q18 Solution.
# scipy's kmeans supports only Euclidean distance, so run the Lloyd
# iteration by hand with the L1 metric.
def L1(p1, p2):
    return abs(p1[0] - p2[0]) + abs(p1[1] - p2[1])

points    = [(1, 1), (2, 1), (2, 2), (3, 3), (4, 2), (2, 4), (4, 4)]
centroids = [(1.0, 1.0), (4.0, 4.0)]

while True:
    # assign each point to the nearest centroid under L1
    clusters = [[], []]
    for p in points:
        idx = min((0, 1), key=lambda i: L1(p, centroids[i]))
        clusters[idx].append(p)
    # recompute each centroid as the coordinate-wise mean of its cluster
    new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
           for c in clusters]
    if new == centroids:
        break
    centroids = new

print(centroids)


[(1.6666666666666667, 1.3333333333333333), (3.25, 3.25)]

  • The cluster containing (3,3) terminates with centroid (3.25, 3.25).

Q20.

  • Consider an execution of the BALANCE algorithm with 4 advertisers, A1, A2, A3, A4, and 4 kinds of queries, Q1, Q2, Q3, Q4.
    • Advertiser A1 bids on queries Q1 and Q2;
    • A2 bids on queries Q2 and Q3;
    • A3 on queries Q3 and Q4;
    • and A4 on queries Q1 and Q4.
    • All bids are equal to 1, and all clickthrough rates are equal.
  • All advertisers have a budget of 3, and ties are broken in favor of the advertiser with the lower index (e.g., A1 beats A2). Queries appear in the following order:

                  Q1, Q2, Q3, Q3, Q1, Q2, Q3, Q1, Q4, Q1 
  • Which advertiser’s budget is exhausted first?

Solution

  • A3's budget is exhausted first: under BALANCE (assign to the eligible advertiser with the largest remaining budget, ties to the lower index), A3 reaches budget 0 on the 9th query (Q4), one step before A1 does.
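The run can be verified with a short simulation of the BALANCE rule (assign each query to the eligible advertiser with the largest remaining budget, breaking ties in favor of the lower index):

```python
bids    = {'A1': {'Q1', 'Q2'}, 'A2': {'Q2', 'Q3'},
           'A3': {'Q3', 'Q4'}, 'A4': {'Q1', 'Q4'}}
order   = ['A1', 'A2', 'A3', 'A4']          # lower index wins ties
budget  = {a: 3 for a in order}
queries = ['Q1', 'Q2', 'Q3', 'Q3', 'Q1', 'Q2', 'Q3', 'Q1', 'Q4', 'Q1']

first_exhausted = None
for q in queries:
    eligible = [a for a in order if q in bids[a] and budget[a] > 0]
    if not eligible:
        continue
    # largest remaining budget; ties broken by position in `order`
    winner = max(eligible, key=lambda a: (budget[a], -order.index(a)))
    budget[winner] -= 1
    if budget[winner] == 0 and first_exhausted is None:
        first_exhausted = winner

print(first_exhausted)  # A3
```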

Q21.

  • Consider the bipartite graph with the following edges (you might want to draw a picture):

          (a,1), (a,3), (b,1), (b,2), (b,4), (c,2), (d,1), (d,4) 
  • Which of the following edges appears in NO perfect matching?

Solution

  • Perfect matching 1: a-3, b-4, c-2, d-1
  • Perfect matching 2: a-3, b-1, c-2, d-4
  • These are the only two (c must take 2, and a must take 3, since otherwise b and d would both need 4), so the edges (a,1) and (b,2) appear in no perfect matching.
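A brute-force enumeration over all assignments confirms these are the only perfect matchings, and therefore that (a,1) and (b,2) appear in none:

```python
from itertools import permutations

edges = {('a', 1), ('a', 3), ('b', 1), ('b', 2),
         ('b', 4), ('c', 2), ('d', 1), ('d', 4)}
left = ['a', 'b', 'c', 'd']

# a perfect matching assigns a distinct right node to each left node
matchings = [list(zip(left, perm)) for perm in permutations([1, 2, 3, 4])
             if all((u, v) in edges for u, v in zip(left, perm))]
print(matchings)

used = {e for m in matchings for e in m}
unused = sorted(edges - used)
print(unused)  # [('a', 1), ('b', 2)]
```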

Q22.

  • The Utility Matrix below captures the ratings of 5 users (A,B,C,D,E) for 5 movies (P,Q,R,S,T). Each known rating is a number between 1 and 5, and blanks represent unknown ratings. What is the Pearson Correlation (also known as the Centered Cosine) between users B and D?
      P   Q   R   S   T
    
    A 2 4
    B 3 1 2
    C 5 5
    D 4 3 2
    E 4 5 1


In [69]:
## Q22 Solution.
# Centered (mean-subtracted) rating vectors over movies P..T:
vec1 = np.array([0, 1, -1, 0, 0])   # B rated Q=3, R=1, S=2 (mean 2)
vec2 = np.array([0, 1,  0, 0, -1])  # D rated Q=4, S=3, T=2 (mean 3)
print(round(float(vec1.dot(vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))), 2))


0.5

Q23.

  • The Utility Matrix below captures the ratings of 5 users (A,B,C,D,E) for 5 movies (P,Q,R,S,T). Each known rating is a number between 1 and 5, and blanks represent unknown ratings. Let (U,M) denote the rating of movie M by user U. We evaluate a Recommender System by withholding the ratings (A,P), (B,Q), and (C,S). The recommender system estimates (A,P)=1, (B,Q)=4, and (C,S)=5. What is the RMSE of the Recommender System, rounded to 2 decimal places?

    
    
    
      P   Q   R   S   T

    A 2 4
    B 3 1 2
    C 5 5
    D 4 3 2
    E 4 5 1


In [71]:
## Q23 Solution.
# Withheld true ratings: (A,P)=2, (B,Q)=3, (C,S)=5, so the squared
# errors are (1-2)^2 + (4-3)^2 + (5-5)^2 = 2 over 3 predictions.
print("RMSE = {:.2f}".format(math.sqrt(2.0 / 3)))


RMSE = 0.82

Q24.

  • We are going to perform a hierarchical (agglomerative) clustering on the four strings {he, she, her, their}, using edit distance (just insertions and deletions; no mutations of characters).
    • Initially, each string is in a cluster by itself.
    • The distance between two clusters is the minimum edit distance between two strings, one chosen from each of the two clusters.
    • When we complete the hierarchical clustering, there is one cluster containing all four strings, and we performed three mergers of clusters to get to that point.
    • For each of the three mergers there was a distance between the merged clusters. What is the sum of those three distances?

In [ ]:
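With insertions and deletions only, the edit distance between two strings equals len(a) + len(b) − 2·LCS(a, b). A sketch of the agglomerative clustering: the merges happen at distances 1 (he with she), 1 (her joins), and 2 (their joins), for a total of 4.

```python
def lcs(a, b):
    # longest-common-subsequence length via dynamic programming
    m = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            m[i + 1][j + 1] = m[i][j] + 1 if ca == cb else max(m[i][j + 1], m[i + 1][j])
    return m[-1][-1]

def edit(a, b):
    # insert/delete-only edit distance
    return len(a) + len(b) - 2 * lcs(a, b)

clusters = [{'he'}, {'she'}, {'her'}, {'their'}]
total = 0
while len(clusters) > 1:
    # find the closest pair of clusters (minimum edit distance between members)
    (i, j), d = min((((i, j), min(edit(a, b) for a in clusters[i] for b in clusters[j]))
                     for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                    key=lambda t: t[1])
    total += d
    clusters[i] |= clusters[j]
    del clusters[j]

print(total)  # 4
```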