In [49]:
import pdb #debugger
import math
import numpy as np

Mining Massive Datasets: Week 02

Locality-Sensitive Hashing

Many data mining problems can be described as finding similar sets:

  • pages with similar words
    • similar documents on the web
    • aggregating news stories
  • customers with similar tastes in movies
  • movies with similar sets of fans
  • entity resolution
    • do these records refer to the same person?

We can get both false positives and false negatives. To complete the task we use the three steps below.

Shingling

Converting documents to sets

  • a k-shingle (k-gram) is any run of k consecutive characters in the document
  • we are often forced to use k >= 10, due to the size of the documents, so that shingles are rare enough to be discriminative
  • we can compress each shingle into a (4-byte) token by hashing
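
A minimal sketch of shingling plus token compression (the helper ShingleDocument and the toy string are my own illustration; Python's built-in hash stands in for a real 4-byte hash function):

In [ ]:
def ShingleDocument(document, k):
    #every run of k consecutive characters in the document
    shingles = set(document[idx:idx + k] for idx in range(len(document) - k + 1))
    #compress each shingle into a 4-byte token
    tokens = set(hash(shingle) % 2 ** 32 for shingle in shingles)
    return shingles, tokens

shingles, tokens = ShingleDocument('abcdabd', 2)
print sorted(shingles) #['ab', 'bc', 'bd', 'cd', 'da']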

Minhashing

  • convert large sets to short signatures, while preserving similarity
  • the signatures take much less space in memory
  • Jaccard similarity is the size of the sets' intersection divided by the size of their union
    • $sim(C_1,C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$
  • we can represent all the sets as a boolean matrix where
    • rows are all possible k-shingles (the universal set)
    • columns are the sets
    • the matrix will be very sparse
    • the similarity of two columns is the number of rows where both are 1, divided by the number of rows where at least one is 1
  • process
    • assume the rows are permuted randomly
    • the minhash function h(C) = number of the first row (in the permuted order) in which column C has a 1
      • once every column has a value we can stop iterating
    • use several independent permutations to create a signature for each column
    • the signature matrix collects these values
  • the probability (over all permutations of the rows) that $h(C_1) = h(C_2)$ equals $sim(C_1,C_2)$
  • we can't afford that many permutations, so instead we simulate them with multiple hash functions; for each row r:
      for each hash function hi:
          compute hi(r)
      for each column c:
          if c has 1 in row r:
              for each hi:
                  if hi(r) < SIG(i,c):
                      SIG(i,c) = hi(r)
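
A minimal simulation of this row-by-row update (the helper name, toy matrix and hash coefficients are my own illustration, not from the lecture):

In [ ]:
def MinhashSignatures(M, hashFunctions):
    #M: 0/1 numpy array (rows = shingles, columns = sets)
    nRows, nCols = M.shape
    SIG = np.full((len(hashFunctions), nCols), np.inf) #signature matrix, init to infinity
    for r in range(nRows):
        hashValues = [h(r) for h in hashFunctions] #compute hi(r) once per row
        for c in range(nCols):
            if M[r, c] == 1:
                for i in range(len(hashFunctions)):
                    if hashValues[i] < SIG[i, c]:
                        SIG[i, c] = hashValues[i]
    return SIG

#two toy hash functions over 5 rows; the coefficients are arbitrary
hashFunctions = [lambda r: (r + 1) % 5, lambda r: (3 * r + 1) % 5]
toyM = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]])
print MinhashSignatures(toyM, hashFunctions) #signature = [[0, 2], [1, 2]]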

Locality-Sensitive Hashing

  • focus on the pairs of signatures likely to be similar
  • we examine only this small subset of candidate pairs, not all pairs
  • how do we do it?
    • we first define the minimum required level of similarity
    • divide the signature matrix M into b bands of r rows each
    • for each band, hash its portion of each column to a hash table with k buckets
    • candidate column pairs are those that hash to the same bucket in >= 1 band
    • we tune b and r to catch most of the truly similar pairs (see the sketch below)
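
Two columns with similarity s agree on one particular band with probability $s^r$, so they become candidates in at least one of the b bands with probability $1 - (1 - s^r)^b$. A quick check with illustrative values (b = 20 bands of r = 5 rows, i.e. 100-row signatures); note the S-curve, whose threshold sits near $(1/b)^{1/r} \approx 0.55$ here:

In [ ]:
def CandidateProbability(s, b, r):
    #probability that two columns with similarity s share a bucket in >= 1 band
    return 1 - (1 - s ** r) ** b

for s in [0.2, 0.4, 0.6, 0.8]:
    print "s = %.1f -> P(candidate) = %.3f" % (s, CandidateProbability(s, 20, 5))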

Nearest-Neighbour Learning

A distance measure d(x,y) describes how similar two points are. A distance must satisfy:

  • d(x,y) >= 0
  • d(x,y) = 0 iff x = y
  • d(x,y) = d(y,x)
  • d(x,y) <= d(x,z) + d(z,y) (the triangle inequality)

Euclidean Distance

  • L2 norm - the square root of the sum of the squares of the differences in each dimension
  • L1 norm - the sum of the absolute differences in each dimension (Manhattan distance)
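
For instance, for the two cluster centres used in Question 6 below (a quick sketch of both norms):

In [ ]:
def DistanceL1(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def DistanceL2(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print DistanceL1((0, 0), (100, 40)) #140
print DistanceL2((0, 0), (100, 40)) #~107.70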

Non-Euclidean Distance

Quiz Week 2A: LSH (Basic)

Question 1

The edit distance is the minimum number of character insertions and character deletions required to turn one string into another. Compute the edit distance between each pair of the strings he, she, his, and hers. Then, identify which of the following is a true statement about the number of pairs at a certain edit distance.


In [44]:
from itertools import combinations as Combinations
#https://github.com/ztane/python-Levenshtein
from Levenshtein import distance as EditDistance
from Levenshtein import editops as EditOps
from collections import Counter

inputWords = ['he', 'she', 'his', 'hers']

distanceGroups = []
for wordA, wordB in Combinations(inputWords, 2): #process every pair of words
    editDistance = EditDistance(wordA, wordB)
    print "Distance between %s and %s is: %i" % (wordA, wordB, editDistance)
    print EditOps(wordA, wordB) #the edit operations behind that distance
    distanceGroups.append(editDistance)

#count how many pairs fall at each distance
Counter(distanceGroups)


Distance between he and she is: 1
[('insert', 0, 0)]
Distance between he and his is: 2
[('insert', 1, 1), ('replace', 1, 2)]
Distance between he and hers is: 2
[('insert', 2, 2), ('insert', 2, 3)]
Distance between she and his is: 3
[('replace', 0, 0), ('replace', 1, 1), ('replace', 2, 2)]
Distance between she and hers is: 3
[('delete', 0, 0), ('insert', 3, 2), ('insert', 3, 3)]
Distance between his and hers is: 2
[('insert', 1, 1), ('replace', 1, 2)]
Out[44]:
Counter({2: 3, 3: 2, 1: 1})

Note that the question asks about insertions and deletions only, which differs from the standard definition, as it excludes replacements. Each replace in the editops output above costs one insertion plus one deletion, so we can adjust the results by hand (verified by the sketch below), giving us:

  • one pair at distance 1
  • one pair at distance 2
  • three pairs at distance 3
  • one pair at distance 4
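
The insert/delete-only edit distance equals len(a) + len(b) - 2*LCS(a,b), where LCS is the length of the longest common subsequence; a minimal sketch (the helper names are mine) that reproduces the counts above:

In [ ]:
def LcsLength(a, b):
    #longest common subsequence length, via dynamic programming
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def InsertDeleteDistance(a, b):
    #every character outside the LCS must be deleted or inserted
    return len(a) + len(b) - 2 * LcsLength(a, b)

for wordA, wordB in Combinations(inputWords, 2):
    print "%s <-> %s: %i" % (wordA, wordB, InsertDeleteDistance(wordA, wordB))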

Question 2

Consider the given matrix and perform a minhashing of the data, with the order of rows: R4, R6, R1, R3, R5, R2. Which of the following is the correct minhash value of the stated column? Note: we give the minhash value in terms of the original name of the row, rather than the order of the row in the permutation. These two schemes are equivalent, since we only care whether hash values for two columns are equal, not what their actual values are.


In [120]:
M =  np.matrix('0,1,1,0;\
               1,0,1,1;\
               0,1,0,1;\
               0,0,1,0;\
               1,0,1,0;\
               0,1,0,0')
print M
mininhashOrder = [x - 1 for x in [4,6,1,3,5,2]] #python indexes from 0
print mininhashOrder

orderedM = M[mininhashOrder,:]
print orderedM
print "Note that the answer is given in terms of the ORIGINAL rows!"
#minhash value of a column = index of the first row (in permuted order) with a 1
minihashValues = [np.argmax(orderedM[:,col]) for col in range(orderedM.shape[1])]

for idx in range(len(minihashValues)):
    print "Column c%i == R%i" %(idx+1, mininhashOrder[minihashValues[idx]]+1)


[[0 1 1 0]
 [1 0 1 1]
 [0 1 0 1]
 [0 0 1 0]
 [1 0 1 0]
 [0 1 0 0]]
[3, 5, 0, 2, 4, 1]
[[0 0 1 0]
 [0 1 0 0]
 [0 1 1 0]
 [0 1 0 1]
 [1 0 1 0]
 [1 0 1 1]]
Note that the answer is given in terms of the ORIGINAL rows!
Column c1 == R5
Column c2 == R6
Column c3 == R4
Column c4 == R3


Question 4

Find the set of 2-shingles for the "documents". Answer the following questions:

  • How many 2-shingles does ABRACADABRA have?
  • How many 2-shingles does BRICABRAC have?
  • How many 2-shingles do they have in common?
  • What is the Jaccard similarity between the two documents?

In [78]:
#http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/
def CalculateNgrams(document, ngram_size):
    #slide a window of ngram_size characters over the document
    ngram = zip(*[document[idx:] for idx in range(ngram_size)])
    return list(set(ngram)) #remove duplicates

In [84]:
documentA = 'ABRACADABRA'
documentB = 'BRICABRAC'
ngram_documentA = CalculateNgrams(documentA,2)
ngram_documentB = CalculateNgrams(documentB,2)
ngram_same = list(set(ngram_documentA) & set(ngram_documentB))
print ngram_documentA
print ngram_documentB
print "We have %i ngrams for document A and %i for document B" %(len(ngram_documentA),len(ngram_documentB))

#Jaccard similarity = |intersection| / |union|
ngram_union = set(ngram_documentA) | set(ngram_documentB)
similiarityJaccard = len(ngram_same)/float(len(ngram_union))
print "%i are in common between documents, leading to Jaccard similarity of %.2f"  %(len(ngram_same),similiarityJaccard)
print ngram_same


[('C', 'A'), ('A', 'D'), ('B', 'R'), ('D', 'A'), ('R', 'A'), ('A', 'B'), ('A', 'C')]
[('R', 'I'), ('I', 'C'), ('C', 'A'), ('B', 'R'), ('R', 'A'), ('A', 'B'), ('A', 'C')]
We have 7 ngrams for document A and 7 for document B
5 are in common between documents, leading to Jaccard similarity of 0.56
[('R', 'A'), ('C', 'A'), ('A', 'B'), ('B', 'R'), ('A', 'C')]

Question 6

Suppose we want to assign points to whichever of the points (0,0) or (100,40) is nearer. Depending on whether we use the L1 or L2 norm, a point (x,y) could be clustered with a different one of these two points. For this problem, you should work out the conditions under which a point will be assigned to (0,0) when the L1 norm is used, but assigned to (100,40) when the L2 norm is used. Identify one of those points from the list below.


In [171]:
from scipy.spatial.distance import cdist as Distance

def IsThePointCorrect(point):
    pointA = [(0,0)]
    pointB = [(100,40)]

    #closer to (0,0) under the L1 norm...
    closerToAUnderL1 = Distance(pointA,point,'cityblock')[0][0] < Distance(pointB,point,'cityblock')[0][0]
    #...but closer to (100,40) under the L2 norm
    closerToBUnderL2 = Distance(pointB,point,'euclidean')[0][0] < Distance(pointA,point,'euclidean')[0][0]
    return closerToAUnderL1 and closerToBUnderL2

In [173]:
print IsThePointCorrect([(56,13)])
print IsThePointCorrect([(61,10)])
print IsThePointCorrect([(58,13)])


True
False
False

Quiz Week 2B: Frequent Itemsets (Basic)

Question 1

Suppose we have transactions that satisfy the following assumptions:

  • s, the support threshold, is 10,000.
  • There are one million items, which are represented by the integers 0,1,...,999999.
  • There are N frequent items, that is, items that occur 10,000 times or more.
  • There are one million pairs that occur 10,000 times or more.
  • There are 2M pairs that occur exactly once. M of these pairs consist of two frequent items, the other M each have at least one nonfrequent item.
  • No other pairs occur at all.

Integers are always represented by 4 bytes. Suppose we run the a-priori algorithm to find frequent pairs and can choose on the second pass between the triangular-matrix method for counting candidate pairs (a triangular array count[i][j] that holds an integer count for each pair of items (i, j) where i < j) and a hash table of item-item-count triples. Neglect in the first case the space needed to translate between original item numbers and numbers for the frequent items, and in the second case neglect the space needed for the hash table. Assume that item numbers and counts are always 4-byte integers. As a function of N and M, what is the minimum number of bytes of main memory needed to execute the a-priori algorithm on this data? Demonstrate that you have the correct formula by selecting, from the choices below, the triple consisting of values for N, M, and the (approximate, i.e., to within 10%) minimum number of bytes of main memory, S, needed for the a-priori algorithm to execute with this data.
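
Sketch of the reasoning (the concrete N and M plugged in below are illustrative, not necessarily one of the quiz choices): the triangular matrix stores a 4-byte count for each of the roughly $N^2/2$ pairs of frequent items, about $2N^2$ bytes in total, while the triples method stores 12 bytes (item, item, count) for each pair of frequent items that occurs at all: the 1,000,000 frequent pairs plus the M once-occurring pairs whose items are both frequent (pairs containing a nonfrequent item are never counted on the second pass). Hence $S \approx \min(2N^2, 12(10^6 + M))$.

In [ ]:
def AprioriSecondPassBytes(N, M):
    triangularMatrixBytes = 2 * N ** 2    #4 bytes * N(N-1)/2 counts
    tripleTableBytes = 12 * (1000000 + M) #12 bytes per pair actually counted
    return min(triangularMatrixBytes, tripleTableBytes)

#illustrative values only
print AprioriSecondPassBytes(100000, 40000000) #492000000 -> the triples method wins here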