Applications of Set-Similarity

Pages with similar words e.g., for classification by topic.
NetFlix users with similar tastes in movies, for recommendation systems.
Dual: movies with similar sets of fans.
Entity resolution.

Three Essential Techniques for Similar Documents

Shingling : convert documents, emails, etc., to sets.
Minhashing : convert large sets to short signatures, while preserving similarity.
Locality-sensitive hashing : focus on pairs of signatures likely to be similar.

The Big Picture

LSH Big Picture

Shingles

A k-shingle (or k-gram or popularly n-gram) for a document is a sequence of k characters that appears in the document.
A document can be represented by its set of k -shingles.
k = 5 or 10 is usually used so that we do not get many matches between two documents since the probability of k-shingles matching in two documents is large when k is small. That might lead to many false positives.
Blanks are also considered part of the document. Markups such as XML or HTML may or may not be included depending on the requirements.



In [10]:

    
# Code: k-shingle set for a given document

def create_shingles(doc, k):
    """
    Create k-shingled set for a given document
    
    Keyword arguments:
    doc -- the content of a document
    k -- number of shingles
    """
    shingled_set = set() # create an empty set
    
    doc_length = len(doc) 
    
    # iterate through the string and slice it up by k-chars at a time
    for idx in range(doc_length - k + 1):
        doc_slice = doc[idx:idx + k]
        shingled_set.add(doc_slice)
        
    return shingled_set

# invoke the create_shingles function

doc = 'abcab'
print create_shingles(doc, 2)









    



set(['ca', 'ab', 'bc'])

Shingles and Similarity

Documents that are intuitively similar will have many shingles in common.
Changing a word only affects k‐ shingles within distance k from the word, plus the shingles within the changed word
Reordering paragraphs only affects the 2k shingles that cross paragraph boundaries.



In [2]:

    
# Example: Changing a word will affect only k-shingles in each set

doc1 = 'The dog which chased the cat'
doc2 = 'The dog that chased the cat'

shingle_1 = create_shingles(doc1, 3)
shingle_2 = create_shingles(doc2, 3)

print 'The different shingles between the first and second shingle set are:'
print shingle_1 - shingle_2

print 'The new shingles in the second shingle set are:'
print shingle_2 - shingle_1









    



The different shingles between the first and second shingle set are:
set(['ch ', 'ich', 'h c', 'g w', 'whi', ' wh', 'hic'])
The new shingles in the second shingle set are:
set(['g t', 'hat', 't c', 'tha', 'at '])

Shingles: Compression Option

To compress long shingles,we can hash them to(say) 4 bytes.
- Called tokens.
- Saves memory.
Represent a doc by its tokens, that is, the set of hash values of its k‐shingles.
Two documents could(rarely) appear to have shingles in common, when in fact only the hash‐values were shared.

Minhashing

Minhash is a technique for quickly estimating how similar two sets are. Jaccard Similarity is used to compute similarity betweeb two sets.

Jaccard Similarity

The Jaccard similarity of two sets is the size of their interesection divided by size of their union.
$ Sim(C_1,C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|} $
Values of Jaccard similarity close to 1 indicate very similar sets and values close to 0 indicate dissimilar sets.

Example: Jaccard Similarity



In [11]:

    
# we need 'true' division for our functions
from __future__ import division

# Code: Compute Jaccard Similarity between two sets

def compute_jaccard_sim(set1, set2):
    """Compute the Jaccard similarity between two sets
    
    Keyword Arguments
    set1 , set2 -- two sets
    """
    intersection = s1 & s2 # using Python's symbol for intersection
    union = s1 | s2 # using Python's symbol for union
    
    # Jaccard similarity is the number of elements in the intersection divided by  
    # number of elements in the union of two sets
    jaccard_similarity = len(intersection) / len(union)
    return jaccard_similarity

# testing the Jaccard similarity function
s1 = {1, 2, 3, 4, 5, 6}
s2 = { 2, 4, 6, 8, 20}
print compute_jaccard_sim(s1, s2)

# another test
str1 = 'ABRACADABRA'
str2 = 'BRICABRAC'

s3 = create_shingles(str1, 2)
s4 = create_shingles(str2, 2)

print 'Number of 2-Shingles in ' + str1 + ':: ' ,(s3)
print 'Number of 2-Shingles in ' + str2 + ':: ' ,(s4)

print 'Number of common shingles:: ', len(s3 & s4)

print 'Jaccard Similarity between the two documents ::', compute_jaccard_sim(s3,s4)









    



0.375
Number of 2-Shingles in ABRACADABRA::  set(['AC', 'AB', 'AD', 'CA', 'DA', 'RA', 'BR'])
Number of 2-Shingles in BRICABRAC::  set(['AC', 'AB', 'CA', 'RA', 'BR', 'IC', 'RI'])
Number of common shingles::  5
Jaccard Similarity between the two documents :: 0.375

From Sets to Boolean Matrices

It is easier to represent the sets as a Boolean Matrix and then use them to compute the Jaccard similarity.

Rows = elements of the universal set
- Example: the set of all k-shingles.
Columns = sets
1 in row e and and column S only if e is a member of S
Column similarity is the Jaccard similarity of the sets of their rows with 1
Typical matrix is sparse

Example: Column Similarity

Four Types of Rows

Given columns $ C_1$ and $C_2 $, rows may be classified as :
Also, a = # rows of type a etc.
Note: $ Sim(C_1, C_2) = \frac {a}{a + b + c} $
Reason: a is the number of rows in the intersection (both 1) and a + b + c is the number of rows in the union.
Note: we only consider the rows with at least a 1 in it. So we don't use d type columns in our calculation.

Minhashing

Imagine the rows permuted randomly.
Define minhash function h(C)= the number of the first row in which column C has 1. (Note: Row Number is taken from the permuted order)
Use several (e.g.,100) independent hash functions to create a signature for each column.
The signatures can be displayed in another matrix – the signature matrix – whose columns represent the sets and the rows represent the minhash values,in order for that column.

Example

Surprising Property

The probability (over all permutations of the rows) that $ h(C_1) = h(C_2) $ is the same as $ Sim(C_1, C_2) $
Both are $ \frac {a}{a + b + c} $ !!
Why ?
- Look down the permuted columns until we see a 1.
- If it's a type-a rowm then $h(C_1) = h(C_2) $. If it is a type-a or type-c row, then not.

Similarity for Signatures

The similarity of signatures is the fraction of the minhash functions in which they agree.
- Thinking of signatures as columns of integers,the similarity of signatures is the fraction of rows in which they agree.
Thus,the expected similarity of two signatures equals the Jaccard similarity of the columns or sets that the signatures represent.
- And the longer the signatures,the smaller will be the expected error

Minhashing - Example

Implementation of Minhashing

Permutation Method

Suppose 1 billion rows.
Hard to pick a random permutation of 1 billion.
Memory and time constraints
Accessing rows in permuted order leads to thrashing.

Implementing using Hashing

A good approximation to permuting rows: pick, say 100 hash functions.
For each column c and each hash function $ h_i $, keep a slot M(i,c)
Intent: M(i,c) will become the smallest value of $ h_i(r) $ for which column c has 1 in row r.
- i.e., $ h_i(r)$ gives order of rows for $ i^{th} $ permutation.

Algorithm

for each row r do begin
    for each hash function  h  do
        compute h(r) ;
for each column  c
    if c has 1 in row r
    for each hash function h do
        if h(r) is smaller than  M(i,c) then
            M(i,c):= h(r) ;
end;

Example

In the above example, the Minhash value is 0. The actual similarity value is 0.2



In [6]:

    
# TODO: A simple Minhash algorithm implementation



In [ ]:

Applications of Set-Similarity

Similar Documents

Problem

Three Essential Techniques for Similar Documents

The Big Picture

Shingles

Shingles and Similarity

Shingles: Compression Option

Minhashing

Jaccard Similarity

Example: Jaccard Similarity

From Sets to Boolean Matrices

Example: Column Similarity

Four Types of Rows

Minhashing

Example

Surprising Property

Similarity for Signatures

Minhashing - Example

Implementation of Minhashing

Permutation Method

Implementing using Hashing

Algorithm

Example