Quiz Week 7B Basic

Q1.

  • Compute the Topic-Specific PageRank for the following link topology. Assume that the pages selected for the teleport set are nodes 1 and 2, and that in the teleport set, the weight assigned to node 1 is twice that of node 2. Assume further that the teleport probability, (1 − β), is 0.3. Which of the following statements is correct?


In [23]:
import numpy as np

# Calculate Topic-Specific PageRank by power iteration.
# M[i][j] = probability of following a link from page i+1 to page j+1.
M = np.array([[0, 0.5, 0.5, 0],
              [1,   0,   0, 0],
              [0,   0,   0, 1],
              [0,   0,   1, 0]
            ])
beta = 0.7  # link-following probability; the teleport probability is 1 - beta = 0.3

# Teleport distribution over the teleport set {1, 2}: node 1 gets twice
# the weight of node 2, so each row is [2/3, 1/3, 0, 0].
teleport_M = np.array([[2.0/3, 1.0/3, 0, 0],
                       [2.0/3, 1.0/3, 0, 0],
                       [2.0/3, 1.0/3, 0, 0],
                       [2.0/3, 1.0/3, 0, 0]
                      ])
M_after = beta * M + (1 - beta) * teleport_M
r_old = np.array([0.25 for i in xrange(4)])

print M_after
print r_old

cnt = 0
while True:
    cnt += 1
    # One power-iteration step; M_after.T is column-stochastic.
    r_new = np.dot(M_after.T, r_old)

    print "iteration#{0}:".format(cnt)
    print "r_new = " + str(r_new)
    # Stop when the L1 change is below 1e-4 or after 100 iterations.
    if np.sum(np.abs(r_new - r_old)) < 10**-4 or cnt > 100:
        break

    r_old = r_new


[[ 0.2   0.45  0.35  0.  ]
 [ 0.9   0.1   0.    0.  ]
 [ 0.2   0.1   0.    0.7 ]
 [ 0.2   0.1   0.7   0.  ]]
[ 0.25  0.25  0.25  0.25]
iteration#1:
r_new = [ 0.375   0.1875  0.2625  0.175 ]
iteration#2:
r_new = [ 0.33125  0.23125  0.25375  0.18375]
iteration#3:
r_new = [ 0.361875   0.2159375  0.2445625  0.177625 ]
iteration#4:
r_new = [ 0.35115625  0.22665625  0.25099375  0.17119375]
iteration#5:
r_new = [ 0.35865937  0.22290469  0.24274031  0.17569562]
iteration#6:
r_new = [ 0.35603328  0.22553078  0.24851772  0.16991822]
iteration#7:
r_new = [ 0.35787155  0.22461165  0.2435544   0.1739624 ]
iteration#8:
r_new = [ 0.35722815  0.22525504  0.24702872  0.17048808]
iteration#9:
r_new = [ 0.35767853  0.22502985  0.24437151  0.17292011]
iteration#10:
r_new = [ 0.3575209   0.22518749  0.24623156  0.17106006]
iteration#11:
r_new = [ 0.35763124  0.22513231  0.24487435  0.17236209]
iteration#12:
r_new = [ 0.35759262  0.22517093  0.2458244   0.17141205]
iteration#13:
r_new = [ 0.35761965  0.22515742  0.24514585  0.17207708]
iteration#14:
r_new = [ 0.35761019  0.22516688  0.24562083  0.1716021 ]
iteration#15:
r_new = [ 0.35761682  0.22516357  0.24528503  0.17193458]
iteration#16:
r_new = [ 0.3576145   0.22516589  0.24552009  0.17169952]
iteration#17:
r_new = [ 0.35761612  0.22516507  0.24535474  0.17186407]
iteration#18:
r_new = [ 0.35761555  0.22516564  0.24547049  0.17174832]
iteration#19:
r_new = [ 0.35761595  0.22516544  0.24538927  0.17182934]
iteration#20:
r_new = [ 0.35761581  0.22516558  0.24544612  0.17177249]
iteration#21:
r_new = [ 0.35761591  0.22516553  0.24540627  0.17181228]
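
  • The estimates for pages 3 and 4 keep oscillating because those two pages form a 2-cycle, so the power iteration converges slowly. As a sanity check, we can also solve the fixed-point equation r = βMᵀr + (1 − β)v directly. This is a minimal sketch added for verification, not part of the original quiz; it reuses M and beta from the cell above:

In [ ]:
import numpy as np

M = np.array([[0, 0.5, 0.5, 0],
              [1,   0,   0, 0],
              [0,   0,   0, 1],
              [0,   0,   1, 0]
            ])
beta = 0.7
v = np.array([2.0/3, 1.0/3, 0, 0])  # teleport distribution

# r = beta * M^T r + (1 - beta) * v  <=>  (I - beta * M^T) r = (1 - beta) * v
r = np.linalg.solve(np.eye(4) - beta * M.T, (1 - beta) * v)
print r  # approximately [0.3576, 0.2252, 0.2454, 0.1718]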

Q2.

  • The spam-farm architecture described in Section 5.4.1 suffers from the problem that the target page has many links --- one to each supporting page. To avoid that problem, the spammer could use the architecture shown below:

  • There, k "second-tier" nodes act as intermediaries. The target page t has only to link to the k second-tier pages, and each of those pages links to m/k of the m supporting pages. Each of the supporting pages links only to t (although most of these links are not shown). Suppose the taxation parameter is β = 0.85, and x is the amount of PageRank supplied from outside to the target page. Let n be the total number of pages in the Web. Finally, let y be the PageRank of target page t. If we compute the formula for y in terms of k, m, and n, we get a formula with the form
$$ y = a \cdot x + \frac{b \cdot m}{n} + \frac{c \cdot k}{n} $$
  • Note: To arrive at this form, it is necessary at the last step to drop a low-order term that is a fraction of 1/n. Determine coefficients a, b, and c, remembering that β is fixed at 0.85. Then, identify the value, correct to two decimal places, for one of these coefficients.
  • Hint: Here is an outline of the solution.
    • Use w as the PageRank of each second-tier page and z as the PageRank of each supporting page.
    • You can write three equations, one for y in terms of z, one for w in terms of y, and one for z in terms of w.
    • For example, y equals x (the PageRank from outside) plus all the untaxed PageRank of each of the m supporting pages (a total of βzm), plus its share of the tax (which is (1-β)/n).
    • Discover the equations for w and z, then substitute these in the equation for y to get an equation for y in terms of itself, from which you can solve for y.

Solution 2.

  • Let w be the PageRank of each of the second-tier pages, and let z be the PageRank of each of the supporting pages. Then the equations relating y, w, and z are:
    1. $y = x + \beta \cdot z \cdot m + \frac{ (1 - \beta) }{n} $
    2. $w = \frac{\beta \cdot y}{k} + \frac{(1 - \beta)}{n}$
    3. $z = \frac{\beta \cdot w \cdot k}{m} + \frac{(1 - \beta)}{n}$
  • The first equation says that the PageRank of t is the external contribution x, plus βz (the amount of PageRank not taxed) times the number of supporting pages, plus (1-β)/n, which is the share of the "tax" that every page gets. The second equation says that each second-tier page gets 1/k-th of the untaxed PageRank of t, plus its share of the tax. The third equation says that each supporting page gets the untaxed PageRank βw of its second-tier page divided by m/k (the number of supporting pages each second-tier page links to), plus its share of the tax.

    • Begin by substituting for z in the first equation:

      • $y = x + \beta^2 k w + \frac{\beta (1-\beta) m}{n} + \frac{(1-\beta)}{n}$
    • Now, substitute for w in the above:

      • $y = x + \beta^3 y + \frac{\beta (1-\beta) m}{n} + \frac{\beta^2 (1-\beta) k}{n} + \frac{(1-\beta)}{n}$
    • Neglect the last term (1-β)/n, per the directions in the statement of the problem. If we move the term β³y to the left and note that 1 - β³ = (1-β)(1+β+β²), we get
      • $y = \frac{x}{1-\beta^3} + \frac{\beta}{1+\beta+\beta^2} \cdot \frac{m}{n} + \frac{\beta^2}{1+\beta+\beta^2} \cdot \frac{k}{n}$
    • For β = 0.85, these coefficients evaluate to:
      • y = 2.59x + 0.33(m/n) + 0.28(k/n)
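
  • As a quick arithmetic check (a sketch we added; not part of the original solution), the three coefficients can be evaluated for β = 0.85:

In [ ]:
beta = 0.85
denom = 1 + beta + beta**2  # equals (1 - beta^3) / (1 - beta)

# a = 1/(1 - beta^3), b = beta/denom, c = beta^2/denom
print "a = {0:.2f}".format(1 / (1 - beta**3))
print "b = {0:.2f}".format(beta / denom)
print "c = {0:.2f}".format(beta**2 / denom)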

Quiz Week 7A Advanced

Q1.

  • Suppose we have an LSH family h of (d1, d2, 0.6, 0.4) hash functions. We can use three functions from h and the AND-construction to form a (d1, d2, w, x) family, and we can use two functions from h and the OR-construction to form a (d1, d2, y, z) family. Calculate w, x, y, and z, and then identify the correct value of one of these in the list below.

In [27]:
# AND-construction with 3 functions: both probabilities are cubed.
print "w = {0}".format(0.6 ** 3)
print "x = {0}".format(0.4 ** 3)

# OR-construction with 2 functions: p -> 1 - (1 - p)^2.
print "y = {0}".format(1 - (1 - 0.6) ** 2)
print "z = {0}".format(1 - (1 - 0.4) ** 2)


w = 0.216
x = 0.064
y = 0.84
z = 0.64
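
  • The same arithmetic generalizes to any base probability p. Here is a minimal sketch (the helper names are our own, not from the quiz):

In [ ]:
def and_construction(p, r):
    # AND of r independent functions: all r must agree.
    return p ** r

def or_construction(p, b):
    # OR of b independent functions: at least one must agree.
    return 1 - (1 - p) ** b

print and_construction(0.6, 3), and_construction(0.4, 3)  # w, x
print or_construction(0.6, 2), or_construction(0.4, 2)    # y, z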

Q2.

  • Here are eight strings that represent sets:
                                s1 = abcef
                                s2 = acdeg
                                s3 = bcdefg
                                s4 = adfg
                                s5 = bcdfgh
                                s6 = bceg
                                s7 = cdfg
                                s8 = abcd
  • Suppose our upper limit on Jaccard distance is 0.2, and we use the indexing scheme of Section 3.9.4 based on symbols appearing in the prefix (no position or length information). For each of s1, s3, and s6, determine how many other strings that string will be compared with, if it is used as the probe string. Then, identify the true count from the list below.

In [37]:
import math

jDist = 0.2  # upper limit on Jaccard distance

def get_symbol(st):
    # Index a string of length L on the symbols in its prefix
    # of length floor(jDist * L) + 1.
    L = len(st)
    pLen = int(math.floor(jDist * L)) + 1
    return st[:pLen]

print "s1:" + get_symbol('abcef')
print "s2:" + get_symbol('acdeg')
print "s3:" + get_symbol('bcdefg')
print "s4:" + get_symbol('adfg')
print "s5:" + get_symbol('bcdfgh')
print "s6:" + get_symbol('bceg')
print "s7:" + get_symbol('cdfg')
print "s8:" + get_symbol('abcd')


print "===========================\nSo the inverted index groups are: "
print "a : [s1, s2, s4, s8]"
print "b : [s1, s3, s5, s6]"
print "c : [s2, s3, s5, s7]"


s1:ab
s2:ac
s3:bc
s4:a
s5:bc
s6:b
s7:c
s8:a
===========================
So the inverted index groups are: 
a : [s1, s2, s4, s8]
b : [s1, s3, s5, s6]
c : [s2, s3, s5, s7]

Solution2.

  • First, we index a string of length L on the symbols appearing in its prefix of length floor(0.2L) + 1. Thus, strings of length 5 and 6 are indexed on their first two symbols, while strings of length 4 are indexed on their first symbol only. Thus, the index for a consists of {s1, s2, s4, s8}; the index for b consists of {s1, s3, s5, s6}; the index for c consists of {s2, s3, s5, s7}; and no other symbol is indexed at all.
  • For s1, we examine the indexes for a and b, which together contain all strings but s7. Thus, s1 is compared with six other strings.

  • For s3, we examine the indexes for b and c, which together contain s1, s2, s3, s5, s6, and s7. Thus, s3 is compared with five other strings.

  • For s6, we examine only the index for b. Thus, s6 is compared only with the three other strings s1, s3, and s5.
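
  • For completeness, here is a small sketch (our own addition, not from the original notebook) that builds the inverted index and counts the comparisons automatically:

In [ ]:
import math
from collections import defaultdict

strings = {'s1': 'abcef', 's2': 'acdeg', 's3': 'bcdefg', 's4': 'adfg',
           's5': 'bcdfgh', 's6': 'bceg', 's7': 'cdfg', 's8': 'abcd'}

def prefix(s, jDist=0.2):
    # Prefix of length floor(jDist * len(s)) + 1, as in the cell above.
    return s[:int(math.floor(jDist * len(s))) + 1]

# Inverted index: symbol -> names of the strings whose prefix contains it.
index = defaultdict(set)
for name, s in strings.items():
    for sym in prefix(s):
        index[sym].add(name)

for probe in ('s1', 's3', 's6'):
    candidates = set()
    for sym in prefix(strings[probe]):
        candidates |= index[sym]
    candidates.discard(probe)  # do not compare the probe with itself
    print "{0} is compared with {1} others: {2}".format(
        probe, len(candidates), sorted(candidates))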

Q3.

  • Consider the link graph

  • First, construct L, the link matrix, as discussed in Section 5.5 on the HITS algorithm. Then do the following:
  1. Start by assuming the hubbiness of each node is 1; that is, the vector h is (the transpose of) [1,1,1,1].
  2. Compute an estimate of the authority vector $a = L^T h$.
  3. Normalize a by dividing all values so the largest value is 1.
  4. Compute an estimate of the hubbiness vector $h = L a$.
  5. Normalize h by dividing all values so the largest value is 1.
  6. Repeat steps 2-5.
  7. Now, identify in the list below the true statement about the final estimates.

Solution 3.

  • Here is the matrix L:

              0   1   1   0
              1   0   0   0
              0   0   0   1
              0   0   1   0
    
  • In what follows, all vectors will be written as rows, i.e., in transposed form. We start with h = [1,1,1,1] and compute $L^T h$ = [1,1,2,1]. Since the largest value is 2, we divide all values by 2, giving us the first estimate a = [1/2,1/2,1,1/2].

  • Next, we compute La = [3/2,1/2,1/2,1] and normalize by multiplying by 2/3 to get h = [1,1/3,1/3,2/3].

  • The next calculation of a from the estimate of h gives $L^T h$ = [1/3,1,5/3,1/3], and normalizing gives a = [1/5,3/5,1,1/5].

  • For the final estimate of h we compute La = [8/5,1/5,1/5,1], which after normalizing gives h = [1,1/8,1/8,5/8].
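
  • The hand computation can be reproduced with a few lines of numpy. This sketch is our own verification, not part of the original quiz:

In [ ]:
import numpy as np

# Link matrix L from the solution above.
L = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

h = np.ones(4)  # step 1: hubbiness of each node is 1
for i in xrange(2):  # two rounds of steps 2-5
    a = np.dot(L.T, h)
    a = a / a.max()  # normalize so the largest authority value is 1
    h = np.dot(L, a)
    h = h / h.max()  # normalize so the largest hubbiness value is 1
    print "round {0}: a = {1}, h = {2}".format(i + 1, a, h)
# Final estimates: a = [1/5, 3/5, 1, 1/5], h = [1, 1/8, 1/8, 5/8]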

Q4.

  • Consider an implementation of the Block-Stripe Algorithm discussed in Section 5.2 to compute page rank on a graph of N nodes (i.e., Web pages). Suppose each page has, on average, 20 links, and we divide the new rank vector into k blocks (and correspondingly, the matrix M into k stripes). Each stripe of M has one line per "source" web page, in the format:
                      [source_id, degree, m, dest_1, ...., dest_m]
  • Notice that we had to add an additional entry, m, to denote the number of destination nodes in this stripe, which of course is no more than the degree of the node. Assume that all entries (scores, degrees, identifiers,...) are encoded using 4 bytes.

  • There is an additional detail we need to account for, namely, locality of links. As a very simple model, assume that we divide web pages into two disjoint sets:

  1. Introvert pages, which link only to other pages within the same host as themselves.
  2. Extrovert pages, which have links to pages across several hosts.
  • Assume that a fraction x of the pages (0 < x < 1) are introverts and the remaining fraction (1 - x) are extroverts. Construct a formula that counts the amount of I/O per PageRank iteration in terms of N, x, and k. The 4-tuples below list combinations of N, k, x, and I/O (in bytes). Pick the correct combination.
  • Note. There are some additional optimizations one can think of, such as striping the old score vector, encoding introvert and extrovert pages using different schemes, etc. For the purposes of working this problem, assume we don't do any optimizations beyond the block-stripe algorithm discussed in class.

Solution4.

  • Reading the old PageRank vector (once per stripe, i.e., k times) and writing the new PageRank vector to disk costs 4(k+1)N bytes.
  • For the matrix M:
    • The introvert pages each appear in only one stripe, so they contribute xN rows, and each row has on average 23 entries (3 metadata plus 20 destination links). Total bytes read: 4 · 23 · xN.
    • The extrovert pages appear in every stripe, contributing (1-x)kN rows, and each row has on average 3 (metadata) + 20/k (destination links) entries. Total bytes read: 4 · (3 + 20/k) · (1-x)kN.
  • Total I/O per PageRank iteration, in GB (taking 1 GB ≈ 10^9 = N bytes):
    $$ \frac{4[(k+1)N + 23xN + (3k+20)(1-x)N]}{N} = 4[(k+1) + 23x + (3k+20)(1-x)] = 4[21 + k + 3(x + (1-x)k)] $$
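
  • To make the formula concrete, here is a small sketch (our own addition) that evaluates the I/O cost in bytes for given N, k, and x:

In [ ]:
def blockstripe_io_bytes(N, k, x):
    # Old rank vector read once per stripe, new rank vector written once.
    vectors = 4 * (k + 1) * N
    # Introvert rows: one stripe each, 23 entries per row on average.
    introverts = 4 * 23 * x * N
    # Extrovert rows: one per stripe, 3 metadata + 20/k links on average.
    extroverts = 4 * (3 + 20.0 / k) * (1 - x) * k * N
    return vectors + introverts + extroverts

# Hypothetical example values: N = 10^9 pages, k = 3 blocks, x = 0.75.
print blockstripe_io_bytes(10**9, 3, 0.75)  # 1.14e+11 bytes = 114 GB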
