Applications of Set-Similarity

  1. Pages with similar words e.g., for classification by topic.
  2. NetFlix users with similar tastes in movies, for recommendation systems.
  3. Dual: movies with similar sets of fans.
  4. Entity resolution.

Similar Documents

Problem

Given a large body of documents, e.g., from the web, find pairs of documents that are similar (i.e., have lots of text in common) such as:

  • Mirror sites, or approximate mirrors.
    • Application: Don't want to show both in a search.
  • Plagiarism
  • Similar news articles at many news sites
    • Application: Cluster articles by "same story" (as in Google News)

Three Essential Techniques for Similar Documents

  1. Shingling : convert documents, emails, etc., to sets.
  2. Minhashing : convert large sets to short signatures, while preserving similarity.
  3. Locality-sensitive hashing : focus on pairs of signatures likely to be similar.

The Big Picture

Shingles

  • A k-shingle (or k-gram or popularly n-gram) for a document is a sequence of k characters that appears in the document.
  • A document can be represented by its set of k -shingles.
  • k = 5 or 10 is usually used so that we do not get many matches between two documents since the probability of k-shingles matching in two documents is large when k is small. That might lead to many false positives.
  • Blanks are also considered part of the document. Markups such as XML or HTML may or may not be included depending on the requirements.

In [10]:
# Code: k-shingle set for a given document

def create_shingles(doc, k):
    """
    Create k-shingled set for a given document
    
    Keyword arguments:
    doc -- the content of a document
    k -- number of shingles
    """
    shingled_set = set() # create an empty set
    
    doc_length = len(doc) 
    
    # iterate through the string and slice it up by k-chars at a time
    for idx in range(doc_length - k + 1):
        doc_slice = doc[idx:idx + k]
        shingled_set.add(doc_slice)
        
    return shingled_set

# invoke the create_shingles function

doc = 'abcab'
print create_shingles(doc, 2)


set(['ca', 'ab', 'bc'])

Shingles and Similarity

  • Documents that are intuitively similar will have many shingles in common.
  • Changing a word only affects k‐ shingles within distance k from the word, plus the shingles within the changed word
  • Reordering paragraphs only affects the 2k shingles that cross paragraph boundaries.

In [2]:
# Example: Changing a word will affect only k-shingles in each set

doc1 = 'The dog which chased the cat'
doc2 = 'The dog that chased the cat'

shingle_1 = create_shingles(doc1, 3)
shingle_2 = create_shingles(doc2, 3)

print 'The different shingles between the first and second shingle set are:'
print shingle_1 - shingle_2

print 'The new shingles in the second shingle set are:'
print shingle_2 - shingle_1


The different shingles between the first and second shingle set are:
set(['ch ', 'ich', 'h c', 'g w', 'whi', ' wh', 'hic'])
The new shingles in the second shingle set are:
set(['g t', 'hat', 't c', 'tha', 'at '])

Shingles: Compression Option

  • To compress long shingles,we can hash them to(say) 4 bytes.
    • Called tokens.
    • Saves memory.
  • Represent a doc by its tokens, that is, the set of hash values of its k‐shingles.
  • Two documents could(rarely) appear to have shingles in common, when in fact only the hash‐values were shared.

Minhashing

Minhash is a technique for quickly estimating how similar two sets are. Jaccard Similarity is used to compute similarity betweeb two sets.

Jaccard Similarity

  • The Jaccard similarity of two sets is the size of their interesection divided by size of their union.
  • $ Sim(C_1,C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|} $
  • Values of Jaccard similarity close to 1 indicate very similar sets and values close to 0 indicate dissimilar sets.

Example: Jaccard Similarity


In [11]:
# we need 'true' division for our functions
from __future__ import division

# Code: Compute Jaccard Similarity between two sets

def compute_jaccard_sim(set1, set2):
    """Compute the Jaccard similarity between two sets
    
    Keyword Arguments
    set1 , set2 -- two sets
    """
    intersection = s1 & s2 # using Python's symbol for intersection
    union = s1 | s2 # using Python's symbol for union
    
    # Jaccard similarity is the number of elements in the intersection divided by  
    # number of elements in the union of two sets
    jaccard_similarity = len(intersection) / len(union)
    return jaccard_similarity

# testing the Jaccard similarity function
s1 = {1, 2, 3, 4, 5, 6}
s2 = { 2, 4, 6, 8, 20}
print compute_jaccard_sim(s1, s2)

# another test
str1 = 'ABRACADABRA'
str2 = 'BRICABRAC'

s3 = create_shingles(str1, 2)
s4 = create_shingles(str2, 2)

print 'Number of 2-Shingles in ' + str1 + ':: ' ,(s3)
print 'Number of 2-Shingles in ' + str2 + ':: ' ,(s4)

print 'Number of common shingles:: ', len(s3 & s4)

print 'Jaccard Similarity between the two documents ::', compute_jaccard_sim(s3,s4)


0.375
Number of 2-Shingles in ABRACADABRA::  set(['AC', 'AB', 'AD', 'CA', 'DA', 'RA', 'BR'])
Number of 2-Shingles in BRICABRAC::  set(['AC', 'AB', 'CA', 'RA', 'BR', 'IC', 'RI'])
Number of common shingles::  5
Jaccard Similarity between the two documents :: 0.375

From Sets to Boolean Matrices

It is easier to represent the sets as a Boolean Matrix and then use them to compute the Jaccard similarity.

  • Rows = elements of the universal set
    • Example: the set of all k-shingles.
  • Columns = sets
  • 1 in row e and and column S only if e is a member of S
  • Column similarity is the Jaccard similarity of the sets of their rows with 1
  • Typical matrix is sparse

Example: Column Similarity

Four Types of Rows

  • Given columns $ C_1$ and $C_2 $, rows may be classified as :

  • Also, a = # rows of type a etc.

  • Note: $ Sim(C_1, C_2) = \frac {a}{a + b + c} $
  • Reason: a is the number of rows in the intersection (both 1) and a + b + c is the number of rows in the union.
  • Note: we only consider the rows with at least a 1 in it. So we don't use d type columns in our calculation.

Minhashing

  • Imagine the rows permuted randomly.
  • Define minhash function h(C)= the number of the first row in which column C has 1. (Note: Row Number is taken from the permuted order)
  • Use several (e.g.,100) independent hash functions to create a signature for each column.
  • The signatures can be displayed in another matrix – the signature matrix – whose columns represent the sets and the rows represent the minhash values,in order for that column.

Example

Surprising Property

  • The probability (over all permutations of the rows) that $ h(C_1) = h(C_2) $ is the same as $ Sim(C_1, C_2) $
  • Both are $ \frac {a}{a + b + c} $ !!
  • Why ?
    • Look down the permuted columns until we see a 1.
    • If it's a type-a rowm then $h(C_1) = h(C_2) $. If it is a type-a or type-c row, then not.

Similarity for Signatures

  • The similarity of signatures is the fraction of the minhash functions in which they agree.
    • Thinking of signatures as columns of integers,the similarity of signatures is the fraction of rows in which they agree.
  • Thus,the expected similarity of two signatures equals the Jaccard similarity of the columns or sets that the signatures represent.
    • And the longer the signatures,the smaller will be the expected error

Minhashing - Example

Implementation of Minhashing

Permutation Method

  • Suppose 1 billion rows.
  • Hard to pick a random permutation of 1 billion.
  • Memory and time constraints
  • Accessing rows in permuted order leads to thrashing.

Implementing using Hashing

  • A good approximation to permuting rows: pick, say 100 hash functions.
  • For each column c and each hash function $ h_i $, keep a slot M(i,c)
  • Intent: M(i,c) will become the smallest value of $ h_i(r) $ for which column c has 1 in row r.
    • i.e., $ h_i(r)$ gives order of rows for $ i^{th} $ permutation.

Algorithm

for each row r do begin
    for each hash function  h  do
        compute h(r) ;
for each column  c
    if c has 1 in row r
    for each hash function h do
        if h(r) is smaller than  M(i,c) then
            M(i,c):= h(r) ;
end;

Example

In the above example, the Minhash value is 0. The actual similarity value is 0.2


In [6]:
# TODO: A simple Minhash algorithm implementation

In [ ]: