Suppose we have transactions that satisfy the following assumptions:
Suppose we run the a-priori algorithm to find frequent pairs and can choose on the second pass between the triangular-matrix method for counting candidate pairs (a triangular array count[i][j] that holds an integer count for each pair of items (i, j) where i < j) and a hash table of item-item-count triples. Neglect in the first case the space needed to translate between original item numbers and numbers for the frequent items, and in the second case neglect the space needed for the hash table. Assume that item numbers and counts are always 4-byte integers.
As a function of N and M, what is the minimum number of bytes of main memory needed to execute the a-priori algorithm on this data? Demonstrate that you have the correct formula by selecting, from the choices below, the triple consisting of values for N, M, and the (approximate, i.e., to within 10%) minumum number of bytes of main memory, S, needed for the a-priori algorithm to execute with this data.
In [1]:
import os
import sys
# N = 100,000; M = 50,000,000; S = 5,000,000,000
# N = 40,000; M = 60,000,000; S = 3,200,000,000
# N = 50,000; M = 80,000,000; S = 1,500,000,000
# N = 100,000; M = 100,000,000; S = 1,200,000,000
soln = [[100000, 50000000, 5000000000],
[40000, 60000000, 3200000000],
[50000, 80000000, 1500000000],
[100000, 100000000, 1200000000]]
In [2]:
def a(N, M):
memory = min(12*(M + 10**6), 4*N+2*N*(N-1))
#memory = min(12*(M + 10**6), 4*N*N)
return memory
In [3]:
def pct(e, a):
return 100 * (abs(e - a) / float(a))
In [4]:
for n, m, s in soln:
print pct(s, a(n, m))
Answer: N = 100,000; M = 100,000,000; S = 1,200,000,000
Imagine there are 100 baskets, numbered 1,2,...,100, and 100 items, similarly numbered. Item i is in basket j if and only if i divides j evenly. For example, basket 24 is the set of items {1,2,3,4,6,8,12,24}. Describe all the association rules that have 100% confidence. Which of the following rules has 100% confidence?
In [5]:
baskets = range(1,101)
items = range(1,101)
# Create transactions
transactions = []
for i in baskets:
basket = []
for item in items:
if i % item == 0:
basket.append(item)
transactions.append(basket)
In [6]:
def check(transactions,query):
count=0
for t in transactions:
query_in = True
for q in query:
if q not in t:
query_in = False
if query_in:
count+=1
return count
In [7]:
def confidence(num,denom):
count_denom = check(transactions,denom)
count_num = check(transactions,num)
confidence = count_num /(1.0* count_denom ) * 100
return confidence
In [8]:
print "{1,2}-> 4,Condidence = %d"%(confidence([1,2,4],[1,2]))
print "{1}-> 2,Condidence = %d"%(confidence([1,2],[1]))
print "{1,4,7}-> 14,Condidence = %d"%(confidence([1,4,7,14],[1,4,7]))
print "{1,3,6}-> 12,Condidence = %d"%(confidence([1,3,6,12],[1,3,6]))
print "{4,6}-> 12,Condidence = %d"%(confidence([4,6,12],[4,6]))
print "{8,12}-> 96,Condidence = %d"%(confidence([8,12,96],[8,12]))
print "{4,6}-> 24,Condidence = %d"%(confidence([4,6,24],[4,6]))
print "{1,3,6}-> 12,Condidence = %d"%(confidence([1,3,6,12],[1,3,6]))
Suppose we perform the PCY algorithm to find frequent pairs, with market-basket data meeting the following specifications:
Suppose there are S bytes of main memory. In order to run the PCY algorithm successfully, the number of buckets must be sufficiently large that most buckets are not large. In addition, on the second pass, there must be enough room to count all the candidate pairs. As a function of S, what is the largest value of P for which we can successfully run the PCY algorithm on this data? Demonstrate that you have the correct formula by indicating which of the following is a value for S and a value for P that is approximately (i.e., to within 10%) the largest possible value of P for that S.
In [9]:
import os
import sys
In [10]:
# S = 1,000,000,000; P = 10,000,000,000
# S = 300,000,000; P = 750,000,000
# S = 1,000,000,000; P = 20,000,000,000 (correct)
# S = 100,000,000; P = 120,000,000
soln = [[1000000000, 10000000000],
[300000000, 750000000],
[1000000000, 20000000000],
[100000000, 120000000]]
In [11]:
def b(S, P):
memory = S*S/48000000 - P
return memory
In [12]:
def pct(e, a):
return 100 * (abs(e - a) / float(a))
return 100 * (a/(e*e/48000000))
In [13]:
for s, p in soln:
print "Difference between P and S^2/48,000,000 = %d" % b(s, p)
print "Percentage of P to the largest possible P value for that S = %d" % pct(s,p)
print
Answer: Largest P value that is < S^2/48000000, within 10% to the largest possible value of P for that S.
In [ ]: