In [23]:
import numpy as np

# Calculate topic-specific PageRank by power iteration.
M = np.array([[0, 0.5, 0.5, 0],
              [1, 0,   0,   0],
              [0, 0,   0,   1],
              [0, 0,   1,   0]])
p = 0.7  # probability of following an out-link; 1 - p is the teleport probability
# Teleport distribution: every teleport lands on the topic pages 0 and 1,
# with weights 2/3 and 1/3.
teleport_M = np.array([[2.0/3, 1.0/3, 0, 0],
                       [2.0/3, 1.0/3, 0, 0],
                       [2.0/3, 1.0/3, 0, 0],
                       [2.0/3, 1.0/3, 0, 0]])
M_after = M * p + (1 - p) * teleport_M
r_old = np.array([0.25] * 4)  # start from the uniform distribution
r_1 = np.dot(M_after.T, r_old)  # first two steps, computed once for reference
r_2 = np.dot(M_after.T, r_1)
print(M_after)
print(r_old)
cnt = 0
while True:
    cnt += 1
    r_new = np.dot(M_after.T, r_old)
    print("iteration #{0}:".format(cnt))
    print("r_new = " + str(r_new))
    if np.sum(np.abs(r_new - r_old)) < 1e-4 or cnt > 100:
        break
    r_old = r_new
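As an added sanity check (not part of the original exercise), the vector the power iteration converges to is the principal left eigenvector of M_after, so it can also be read off directly:

In [ ]:
# Added sketch: compare the power-iteration result against the principal
# left eigenvector of M_after, rescaled to a probability vector.
vals, vecs = np.linalg.eig(M_after.T)
principal = vecs[:, np.argmax(vals.real)].real
principal /= principal.sum()  # rescale so the entries sum to 1
print("eigenvector solution =", principal)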
Let y be the PageRank of the target page t, w the PageRank of each second-tier page, and z the PageRank of each supporting page. The structure of the farm gives three equations:

y = x + βmz + (1-β)/n
w = βy/k + (1-β)/n
z = βwk/m + (1-β)/n

The first equation says that the PageRank of t is the external contribution x, plus βz (the untaxed PageRank of each supporting page) times m, the number of supporting pages, plus (1-β)/n, which is the share of "tax" that every page gets. The second equation says that each second-tier page gets 1/k-th of the untaxed PageRank of t, plus its share of the tax. The third equation says each supporting page gets 1 part in m/k of the untaxed PageRank of the second-tier page that reaches it, plus its share of the tax.
Begin by substituting for z in the first equation:

y = x + βm(βwk/m + (1-β)/n) + (1-β)/n = x + β²kw + βm(1-β)/n + (1-β)/n

Now, substitute for w in the above:

y = x + β²k(βy/k + (1-β)/n) + βm(1-β)/n + (1-β)/n = x + β³y + (1-β)(1 + βm + β²k)/n

Solving for y, and using 1-β³ = (1-β)(1+β+β²):

y = x/(1-β³) + (1 + βm + β²k)/(n(1+β+β²))
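As a quick numeric check of the closed form (added; the values of β, n, m, k, and x below are arbitrary illustrative choices, not from the exercise), iterating the three equations to their fixed point should reproduce it:

In [ ]:
# Added sketch: iterate the fixed-point equations and compare against the
# closed form. Parameter values are arbitrary illustrative choices.
beta, n, m, k, x = 0.85, 1000.0, 100.0, 10.0, 0.001
y = w = z = 1.0 / n
for _ in range(1000):
    y = x + beta * m * z + (1 - beta) / n
    w = beta * y / k + (1 - beta) / n
    z = beta * w * k / m + (1 - beta) / n
closed = x / (1 - beta**3) + (1 + beta*m + beta**2*k) / (n * (1 + beta + beta**2))
print("iterated y    =", y)
print("closed-form y =", closed)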
In [27]:
# AND-construction: all three hash functions must agree, so p becomes p^3.
print("w = {0}".format(.6 ** 3))
print("x = {0}".format(.4 ** 3))
# OR-construction: at least one of two hash functions agrees, so p becomes 1-(1-p)^2.
print("y = {0}".format(1 - (1 - .6)**2))
print("z = {0}".format(1 - (1 - .4)**2))
s1 = abcef, s2 = acdeg, s3 = bcdefg, s4 = adfg, s5 = bcdfgh, s6 = bceg, s7 = cdfg, s8 = abcd
In [37]:
import math

jDist = 0.2  # Jaccard-distance threshold

def get_symbol(st):
    # Return the indexing prefix: the first floor(jDist * |st|) + 1 symbols.
    L = len(st)
    pLen = int(math.floor(jDist * L)) + 1
    return st[:pLen]

print("s1: " + get_symbol('abcef'))
print("s2: " + get_symbol('acdeg'))
print("s3: " + get_symbol('bcdefg'))
print("s4: " + get_symbol('adfg'))
print("s5: " + get_symbol('bcdfgh'))
print("s6: " + get_symbol('bceg'))
print("s7: " + get_symbol('cdfg'))
print("s8: " + get_symbol('abcd'))
print("===========================\nSo we get the groups:")
print("a : [s1, s2, s4, s8]")
print("b : [s1, s3, s5, s6]")
print("c : [s2, s3, s5, s7]")
For s1, we examine the indexes for a and b, which together contain all strings but s7. Thus, s1 is compared with six other strings.
For s3, we examine the indexes for b and c, which together contain s1, s2, s3, s5, s6, and s7. Thus, s3 is compared with five other strings.
For s6, we examine only the index for b. Thus, s6 is compared only with the three other strings s1, s3, and s5.
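The index construction and the comparison counts above can also be checked programmatically; a sketch (added), reusing get_symbol from the cell above:

In [ ]:
from collections import defaultdict

strings = {'s1': 'abcef', 's2': 'acdeg', 's3': 'bcdefg', 's4': 'adfg',
           's5': 'bcdfgh', 's6': 'bceg', 's7': 'cdfg', 's8': 'abcd'}

# Index every string under each symbol of its prefix.
index = defaultdict(list)
for name in sorted(strings):
    for sym in get_symbol(strings[name]):
        index[sym].append(name)
print(dict(index))

# A probe string is compared with everything sharing one of its index entries.
for name in sorted(strings):
    cands = {other for sym in get_symbol(strings[name])
             for other in index[sym]} - {name}
    print(name, "is compared with", len(cands), "others:", sorted(cands))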
Here is the matrix L:
0 1 1 0
1 0 0 0
0 0 0 1
0 0 1 0
In what follows, all vectors will be written as rows, i.e., in transposed form. We start with h = [1,1,1,1] and compute Lᵀh = [1,1,2,1]. Since the largest value is 2, we divide all values by 2, giving us the first estimate a = [1/2,1/2,1,1/2].
Next, we compute La = [3/2,1/2,1/2,1] and normalize by multiplying by 2/3 to get h = [1,1/3,1/3,2/3].
The next calculation of a from the estimate of h gives Lᵀh = [1/3,1,5/3,1/3], and normalizing gives a = [1/5,3/5,1,1/5].
For the final estimate of h we compute La = [8/5,1/5,1/5,1], which after normalizing gives h = [1,1/8,1/8,5/8].
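The hand computation can be verified with a few lines of numpy (added as a sketch; as in the text, each vector is normalized by its largest component):

In [ ]:
import numpy as np

L = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

h = np.ones(4)
for _ in range(2):
    a = L.T.dot(h)
    a /= a.max()   # largest authority score becomes 1
    h = L.dot(a)
    h /= h.max()   # largest hub score becomes 1
    print("a =", a, " h =", h)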
[source_id, degree, m, dest_1, ..., dest_m]
Notice that we had to add an additional entry, m, to denote the number of destination nodes in this stripe, which of course is no more than the degree of the node. Assume that all entries (scores, degrees, identifiers,...) are encoded using 4 bytes.
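For concreteness, a sketch (added; the node ids are hypothetical) of one such stripe and its size under the 4-byte assumption:

In [ ]:
# Hypothetical stripe: node 7 has degree 5, of which m = 3 destinations
# fall in this block: [source_id, degree, m, dest_1, ..., dest_m].
stripe = [7, 5, 3, 12, 19, 23]
print("stripe occupies", 4 * len(stripe), "bytes")  # (3 + m) * 4 = 24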
There is an additional detail we need to account for, namely, locality of links. As a very simple model, assume that we divide web pages into two disjoint sets: