Given a large body of documents, e.g., from the web, find pairs of documents that are similar (i.e., have lots of text in common) such as:
In [10]:
# Code: k-shingle set for a given document
def create_shingles(doc, k):
"""
Create k-shingled set for a given document
Keyword arguments:
doc -- the content of a document
k -- number of shingles
"""
shingled_set = set() # create an empty set
doc_length = len(doc)
# iterate through the string and slice it up by k-chars at a time
for idx in range(doc_length - k + 1):
doc_slice = doc[idx:idx + k]
shingled_set.add(doc_slice)
return shingled_set
# invoke the create_shingles function
doc = 'abcab'
print create_shingles(doc, 2)
In [2]:
# Example: Changing a word will affect only k-shingles in each set
doc1 = 'The dog which chased the cat'
doc2 = 'The dog that chased the cat'
shingle_1 = create_shingles(doc1, 3)
shingle_2 = create_shingles(doc2, 3)
print 'The different shingles between the first and second shingle set are:'
print shingle_1 - shingle_2
print 'The new shingles in the second shingle set are:'
print shingle_2 - shingle_1
Minhash is a technique for quickly estimating how similar two sets are. Jaccard Similarity is used to compute similarity betweeb two sets.
In [11]:
# we need 'true' division for our functions
from __future__ import division
# Code: Compute Jaccard Similarity between two sets
def compute_jaccard_sim(set1, set2):
"""Compute the Jaccard similarity between two sets
Keyword Arguments
set1 , set2 -- two sets
"""
intersection = s1 & s2 # using Python's symbol for intersection
union = s1 | s2 # using Python's symbol for union
# Jaccard similarity is the number of elements in the intersection divided by
# number of elements in the union of two sets
jaccard_similarity = len(intersection) / len(union)
return jaccard_similarity
# testing the Jaccard similarity function
s1 = {1, 2, 3, 4, 5, 6}
s2 = { 2, 4, 6, 8, 20}
print compute_jaccard_sim(s1, s2)
# another test
str1 = 'ABRACADABRA'
str2 = 'BRICABRAC'
s3 = create_shingles(str1, 2)
s4 = create_shingles(str2, 2)
print 'Number of 2-Shingles in ' + str1 + ':: ' ,(s3)
print 'Number of 2-Shingles in ' + str2 + ':: ' ,(s4)
print 'Number of common shingles:: ', len(s3 & s4)
print 'Jaccard Similarity between the two documents ::', compute_jaccard_sim(s3,s4)
It is easier to represent the sets as a Boolean Matrix and then use them to compute the Jaccard similarity.
Given columns $ C_1$ and $C_2 $, rows may be classified as :
Also, a = # rows of type a etc.
for each row r do begin
for each hash function h do
compute h(r) ;
for each column c
if c has 1 in row r
for each hash function h do
if h(r) is smaller than M(i,c) then
M(i,c):= h(r) ;
end;
In the above example, the Minhash value is 0. The actual similarity value is 0.2
In [6]:
# TODO: A simple Minhash algorithm implementation
In [ ]: