In [3]:
import numpy as np
import math

Q1. Solution

  • 3-shingles for "hello world":
    • hel, ell, llo, lo_, o_w, _wo, wor, orl, rld => 9 in total
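A quick way to confirm the count: a string of length n has n - k + 1 k-shingles, and "hello world" has 11 characters. A minimal sketch, writing the space as '_' as in the list above:

```python
def shingles(s, k=3):
    """Return the list of k-shingles of s, with spaces shown as '_'."""
    s = s.replace(" ", "_")
    return [s[i:i + k] for i in range(len(s) - k + 1)]

print(shingles("hello world"))       # the 9 shingles listed above
print(len(shingles("hello world")))  # 11 - 3 + 1 = 9
```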

Q2. Solution


In [19]:
## Q2 Solution.
def h(x):  # named h to avoid shadowing the built-in hash()
    return math.fmod(3 * x + 2, 11)

for i in range(1, 12):
    print(h(i))


5.0
8.0
0.0
3.0
6.0
9.0
1.0
4.0
7.0
10.0
2.0

Q3.

  • This question involves three different Bloom-filter-like scenarios. Each scenario involves setting to 1 certain bits of a 10-bit array, each bit of which is initially 0.

  • Scenario A:

    • We use one hash function that randomly, and with equal probability, selects one of the ten bits of the array. We apply this hash function to four different inputs and set to 1 each of the selected bits.
  • Scenario B:

    • We use two hash functions, each of which randomly, with equal probability, and independently of the other hash function selects one of the 10 bits of the array. We apply both hash functions to each of two inputs and set to 1 each of the selected bits.
  • Scenario C:

    • We use one hash function that randomly and with equal probability selects two different bits among the ten in the array. We apply this hash function to two inputs and set to 1 each of the selected bits.
  • Let a, b, and c be the expected number of bits set to 1 under scenarios A, B, and C, respectively. Which of the following correctly describes the relationships among a, b, and c?


In [14]:
## Q3 Solution.
# In each scenario, E[# bits set to 1] = 10 * P(a given bit gets set).
miss = 1 - 1.0 / 10                 # one uniform selection misses a given bit w.p. 9/10
a = 10 * (1 - miss**4)              # A: one hash, four inputs -> 4 independent selections
print(round(a, 3))
b = 10 * (1 - miss**4)              # B: two hashes, two inputs -> also 4 independent selections
print(round(b, 3))
c = 10 * (1 - (1 - 2.0/10)**2)      # C: each input sets 2 distinct bits, so a given bit is missed w.p. 8/10
print(round(c, 3))


3.439
3.439
3.6

  • Hence a = b < c.

Q4.

  • In this market-basket problem, there are 99 items, numbered 2 to 100. There is a basket for each prime number between 2 and 100. The basket for p contains all and only the items whose numbers are a multiple of p. For example, the basket for 17 contains the following items: {17, 34, 51, 68, 85}. What is the support of the pair of items {12, 30}?

Q4 Solution.

  • support = 2: the basket for p contains the pair {12, 30} exactly when p divides both 12 and 30, i.e., p ∈ {2, 3} (the baskets {2, 4, 6, 8, ...} and {3, 6, 9, ...}).
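As a sanity check, the baskets from the problem statement can be built directly; this sketch just counts the baskets containing both items:

```python
# Build the Q4 market baskets: one basket per prime p in [2, 100],
# containing every item in [2, 100] divisible by p.
primes = [p for p in range(2, 101) if all(p % d for d in range(2, p))]
baskets = {p: {m for m in range(2, 101) if m % p == 0} for p in primes}

# Support of {12, 30} = number of baskets containing both items.
support = sum(1 for items in baskets.values() if {12, 30} <= items)
print(support)  # 2
```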

Q5.

  • To two decimal places, what is the cosine of the angle between the vectors [2,1,1] and [10,-7,1]?

In [18]:
## Q5 Solution.
vec1 = np.array([2,   1, 1])
vec2 = np.array([10, -7, 1])
print(round(vec1.dot(vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)), 2))


0.47

Q6.

  • In this question we use six minhash functions, organized as three bands of two rows each, to identify sets of high Jaccard similarity. If two sets have Jaccard similarity 0.6, what is the probability (to two decimal places) that this pair will become a candidate pair?

In [21]:
## Q6 Solution.

# P(agree on one band of two rows) = 0.6**2;
# the pair becomes a candidate iff it agrees on at least one of the 3 bands
p1 = 0.6**2
print(round(1 - (1 - p1)**3, 2))


0.74

Q7.

  • Suppose we have a (.4, .6, .9, .1)-sensitive family of functions. If we apply a 3-way OR construction to this family, we get a new family of functions whose sensitivity is:

In [26]:
## Q7 Solution.
p1 = 1 - (1 - .9)**3
p2 = 1 - (1 - .1)**3
print("new LSH is (.4, .6, {}, {})-sensitive family".format(round(p1, 3), round(p2, 3)))


new LSH is (.4, .6, 0.999, 0.271)-sensitive family

Q8.

  • Suppose we have a database of (Class, Student, Grade) facts, each giving the grade the student got in the class. We want to estimate the fraction of students who have gotten A's in at least 10 classes, but we do not want to examine the entire relation, just a sample of 10% of the tuples. We shall hash tuples to 10 buckets, and take only those tuples in the first bucket. But to get a valid estimate of the fraction of students with at least 10 A's, we need to pick our hash key judiciously. To which Attribute(s) of the relation should we apply the hash function?

Q8 Solution.

  • Hash on the Student attribute only. Then all of a student's tuples fall in the same bucket, so for every sampled student we see all of their grades and can count their A's exactly. Hashing on Class, or on Class and Student jointly, would keep only about 10% of each student's tuples, making the count of A's invalid.
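A toy illustration of the all-or-none property a valid hash key must have: hashing on Student keeps each student's tuples together. The tuples and the bucket function here are made up for the example:

```python
def bucket(student, n=10):
    # toy deterministic stand-in for the hash function
    return sum(ord(c) for c in student) % n

tuples = [("CS101", "ann", "A"), ("CS229", "ann", "A"),
          ("CS101", "bob", "B"), ("CS229", "bob", "A")]

# Keep only tuples whose Student attribute hashes to bucket 0.
sample = [t for t in tuples if bucket(t[1]) == 0]

# Because we hashed on Student, each student is either entirely in the
# sample or entirely out of it, so per-student counts of A's are exact.
for s in {t[1] for t in tuples}:
    own = [t for t in tuples if t[1] == s]
    kept = [t for t in own if t in sample]
    assert len(kept) in (0, len(own))
```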

Q9.

  • Suppose the Web consists of four pages A, B, C, and D, that form a chain A-->B-->C-->D

  • We wish to compute the PageRank of each of these pages, but since D is a "dead end," we will "teleport" from D with probability 1 to one of the four pages, each with equal probability. We do not teleport from pages A, B, or C. Assuming the sum of the PageRanks of the four pages is 1, what is the PageRank of page B, correct to two decimal places?


In [41]:
## Q9 Solution.
# Column-stochastic transition matrix; the dead end D teleports to all four pages.
M = np.array([[0, 0, 0, .25],
              [1, 0, 0, .25],
              [0, 1, 0, .25],
              [0, 0, 1, .25]])
r = np.array([.25, .25, .25, .25])

for i in range(30):
    r = M.dot(r)

print(r)  # PageRank of B is the second entry, ~0.20


[ 0.10000018  0.20000019  0.29999981  0.39999982]

Q10.

  • Suppose in the AGM model we have four individuals {A,B,C,D} and two communities. Community 1 consists of {A,B,C} and Community 2 consists of {B,C,D}. For Community 1 there is a 30% chance that it will cause an edge between any two of its members. For Community 2 there is a 40% chance it will cause an edge between any two of its members. To the nearest two decimal places, what is the probability that there is an edge between B and C?

In [44]:
## Q10 Solution.
# Edge (B,C) exists unless both communities fail to create it.
print(round(1 - (1 - .3) * (1 - .4), 2))


0.58

Q11.

  • X is a dataset of n columns on which we train a supervised Machine Learning algorithm. e is the error of the model measured against a validation dataset. Unfortunately, e is too high because the model has overfitted the training data X and doesn't generalize well. We now decide to reduce the model variance by reducing the dimensionality of X using a Singular Value Decomposition, and to use the resulting dataset to train our model. If i is the number of singular values used in the SVD reduction, how does e change as a function of i, for i ∈ {1, 2,...,n}?

Q11 Solution.

  • e is roughly a convex (U-shaped) function of i: for small i too much of X's information is discarded, so the model underfits and e starts high; as i grows, e falls; as i approaches n the variance returns and the model overfits again, so e rises. The minimum lies at some intermediate i.

In [48]:
## Q12 Solution: reconstruct the matrix from its rank-2 SVD factors L·V·R.
L = np.array([[-.25, -.5, -.76, -.29, -.03, -.07, -.01],
              [-.05, -.1, -.15,  .20,  .26,  .51,  .77 ]]).T
print(L)

V = np.array([[6.74, 0], [0, 5.44]])
print(V)

R = np.array([[-.57, -.11, -.57, -.11, -.57],
              [-.09, 0.70, -.09,  .7,  -.09]])
print(R)
print(L.dot(V).dot(R))


[[-0.25 -0.05]
 [-0.5  -0.1 ]
 [-0.76 -0.15]
 [-0.29  0.2 ]
 [-0.03  0.26]
 [-0.07  0.51]
 [-0.01  0.77]]
[[ 6.74  0.  ]
 [ 0.    5.44]]
[[-0.57 -0.11 -0.57 -0.11 -0.57]
 [-0.09  0.7  -0.09  0.7  -0.09]]
[[ 0.98493  -0.00505   0.98493  -0.00505   0.98493 ]
 [ 1.96986  -0.0101    1.96986  -0.0101    1.96986 ]
 [ 2.993208 -0.007736  2.993208 -0.007736  2.993208]
 [ 1.016202  0.976606  1.016202  0.976606  1.016202]
 [-0.012042  1.012322 -0.012042  1.012322 -0.012042]
 [ 0.01923   1.993978  0.01923   1.993978  0.01923 ]
 [-0.338574  2.939574 -0.338574  2.939574 -0.338574]]

Q13.

  • Recall that the power iteration iterates r = X·r until convergence, where X is an n×n matrix and n is the number of nodes in the graph.

  • Using the power-iteration notation above, what is the matrix X when solving topic-sensitive PageRank with teleport set {0,1} for the following graph? Use beta = 0.8. (Recall that the teleport set contains the destination nodes used when teleporting.)


In [53]:
## Q13 Solution.
# X = beta * M + (1 - beta) * (teleport matrix restricted to nodes {0, 1})
X  = 0.8 * np.array([[1.0/3, 0, 0],
                     [1.0/3, 0, 0],
                     [1.0/3, 1, 0]])
X += 0.2 * np.array([[.5, .5, .5],
                     [.5, .5, .5],
                     [ 0,  0,  0]])
print(X)


[[ 0.36666667  0.1         0.1       ]
 [ 0.36666667  0.1         0.1       ]
 [ 0.26666667  0.8         0.        ]]

Q14.

  • Here are two sets of integers S = {1,2,3,4} and T = {1,2,5,6,x}, where x stands for some integer. For how many different integer values of x are the Jaccard similarity and the Jaccard distance of S and T the same? (Note: x can be one of 1, 2, 5, or 6, but in that case T, being a set, will contain x only once and thus have four members, not five.)

Q14 Solution.

  • x = 3 or x = 4: either choice gives |S ∩ T| = 3 and |S ∪ T| = 6, so similarity = distance = 1/2. Hence there are 2 such values.
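The answer can be verified by brute force: similarity equals distance exactly when J(S, T) = 1/2, since distance = 1 - similarity. A small check over a range of candidate values of x:

```python
def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)

S = {1, 2, 3, 4}
# T is built as a set, so a repeated x collapses to four members automatically.
matches = [x for x in range(-20, 21) if jaccard(S, {1, 2, 5, 6, x}) == 0.5]
print(matches)  # [3, 4]
```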

Q15.

  • Which of the following are advantages of using decision trees? (check all correct options)

Q15 Solution.

  • My Answer
    • It can handle categorical input data without any special preprocessing
    • The resulting model is easy to interpret
    • The training is easy to parallelize

Q16.

  • Consider a dataset of points x1,...,xn with labels y1,...,yn ∈ {-1, 1}, such that the data is separable. We run a soft-margin SVM and a hard-margin SVM, and in each case we obtain parameters w and b. Check the option that is true:
    • The resulting w and b can be different, and the boundaries can be different.
    • The resulting w and b are the same in the two cases, hence boundaries are the same.
    • The resulting w and b can be different in the two cases, but the boundaries are the same.
    • None of the above.

Q17.

  • Consider the following MapReduce algorithm. The input is a collection of positive integers. Given integer X, the Map function produces a tuple with key Y and value X for each prime divisor Y of X. For example, if X = 20, there are two key-value pairs: (2,20) and (5,20). The Reduce function, given a key K and list L of values, produces a tuple with key K and value sum(L) i.e., the sum of the values in the list. Given the input 9, 15, 16, 23, 25, 27, 28, 56 which of the following tuples appears in the final output?
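This MapReduce job is small enough to simulate in plain Python, using a dict to stand in for the shuffle phase; the final output below answers the question directly:

```python
from collections import defaultdict

def prime_divisors(x):
    """Primes dividing x (trial division; fine for these small inputs)."""
    return [p for p in range(2, x + 1)
            if x % p == 0 and all(p % d for d in range(2, p))]

inputs = [9, 15, 16, 23, 25, 27, 28, 56]

# Map: emit (Y, X) for each prime divisor Y of X.
pairs = [(p, x) for x in inputs for p in prime_divisors(x)]

# Shuffle + Reduce: group values by key and sum each list.
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)
output = {k: sum(vs) for k, vs in groups.items()}
print(sorted(output.items()))  # [(2, 100), (3, 51), (5, 40), (7, 84), (23, 23)]
```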