This notebook gives an overview of locality sensitive hashing for the purpose of deduplicating large collections of documents. The associated code is available from https://github.com/mattilyra/lsh.
Finding exact duplicates is trivially easy (just use `md5`) but finding near duplicates is much harder. Many document collections, from tweets to news text to forum discussions, contain documents that are almost exactly the same, with only a few sentences or characters that differ. A hash function such as `md5` won't find these near duplicates as they are not the same on the byte level, only on the semantic level. Finding these semantic duplicates can be done, for instance, by using n-gram shingles and computing the Jaccard similarity of the shingle sets. This is computationally very expensive as the similarity between all pairs of documents needs to be computed. The Reuters Corpus Version 1.0 (RCV1) [1] is ~810k documents; that's $810000^2 = 656100000000$ comparisons. Even assuming we have a really fast computer that can generate a pair of shingle sets and compute their Jaccard similarity in 1 microsecond, it would take a bit over a week to go through all of them. Luckily the similarities are symmetric, so going through half of those $656100000000$ pairs is enough - that is still almost 4 days. The Amazon Product Data [2] is what could be called webscale: ~140 million product reviews. I don't even want to think about how long that would take.
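To make that back-of-the-envelope estimate concrete, here is the arithmetic as a quick snippet (the 1 microsecond per comparison figure is just the assumption from above):

# back-of-the-envelope cost of an all pairs comparison over RCV1,
# assuming 1 microsecond per pairwise Jaccard comparison
num_docs = 810000
num_pairs = num_docs * (num_docs - 1) // 2  # similarities are symmetric, count each pair once
seconds = num_pairs * 1e-6
print(num_pairs, seconds / 86400)  # roughly 3.3e11 pairs, just under 4 days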
Locality sensitive hashing (LSH) relies on two different methods: first a hash fingerprint of each document is created, and then locality sensitive hashing is applied to that fingerprint. If the fingerprint is generated using the minhash algorithm then the probability of a hash collision is equal to the Jaccard similarity of the documents. There are other hash functions that correspond to, for instance, the cosine similarity, but I won't deal with those here.
Minhash is a hash function that computes the lowest hash value for a set of objects to be hashed. In the case of comparing document similarities the set of objects is the word or character n-grams taken over a sliding window from the document - also known as shingles. The set of shingles allows us to compute the document similarity (defined in this case as Jaccard similarity) between a pair of documents.
In [1]:
document = 'Lorem Ipsum dolor sit amet'
# shingle and discard the last 5 as these are just the last n<5 characters from the document
shingles = [document[i:i+5] for i in range(len(document))][:-5]
shingles
Out[1]:
In [2]:
other_document = 'Lorem Ipsum dolor sit amet is how dummy text starts'
# shingle and discard the last 5 as these are just the last n<5 characters from the document
other_shingles = [other_document[i:i+5] for i in range(len(other_document))][:-5]
# Jaccard similarity is the size of the set intersection divided by the size of the set union
len(set(shingles) & set(other_shingles)) / len(set(shingles) | set(other_shingles))
Out[2]:
As we can see, these two documents are not very similar, at least in terms of the Jaccard similarity of their character 5-gram shingles. That aside, the problem with these shingles is that they do not allow us to compute the similarities of large numbers of documents very easily; we'd have to do an all pairs comparison. To get around that we can use locality sensitive hashing, but before LSH we'll turn the documents into a more manageable and uniform representation: a fixed length fingerprint comprised of $k$ minhashes.
Every document has a different number of shingles depending on the length of the document, so for a corpus of any size predicting the memory requirements for an all pairs comparison is not possible, as each document will consume a variable amount of memory. For LSH we would like to have a fixed length representation of the documents without changing the semantics of document similarity. This is where minhashing comes in. It turns out that the probability of a hash collision for a minhash is exactly the Jaccard similarity of two sets. This can be seen by considering the two sets of shingles as a matrix. For two dummy documents the shingles could be represented as the table below (the zeros and ones indicate whether a shingle is present in the document or not). Notice that the Jaccard similarity of the documents is 2/5.
| row | shingle ID | Doc 1 | Doc 2 |
|-----|------------|-------|-------|
| 1   | 1          | 0     | 1     |
| 2   | 2          | 1     | 1     |
| 3   | 3          | 0     | 1     |
| 4   | 4          | 1     | 0     |
| 5   | 5          | 1     | 1     |
| 6   | 6          | 0     | 0     |
The minhash corresponds to a random permutation of the rows and gives back the row number where the first non-zero entry is found. For the above table the minhashes for documents one and two would thus be 2 and 1 respectively - meaning that the documents are not similar. The above table however is just one ordering of the shingles of each document. A different random permutation of the rows will give a different minhash, in this case 2 and 2, making the documents similar.
| row | shingle ID | Doc 1 | Doc 2 |
|-----|------------|-------|-------|
| 1   | 6          | 0     | 0     |
| 2   | 2          | 1     | 1     |
| 3   | 3          | 0     | 1     |
| 4   | 1          | 0     | 1     |
| 5   | 4          | 1     | 0     |
| 6   | 5          | 1     | 1     |
A random permutation of the rows can produce any of $6! = 720$ different orderings. However, we only care about whether the first row (in the permuted order) containing a 1 is the same for both columns, that is whether that row has shingle ID $\in \{2, 5\}$. Since rows with zeros in both columns don't count, there are 5 rows with a 1 in at least one column and two rows with a 1 in both columns. All a random permutation can therefore do is put two out of those five rows first, in other words produce a hash collision with probability 2/5.
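If you'd rather convince yourself by simulation, here's a small pure Python sketch (not using the lsh library) that applies random permutations to the two toy shingle sets from the table; the fraction of permutations on which the two documents get the same minhash should come out close to their Jaccard similarity of 2/5:

import random

# shingle IDs with a 1 in each column of the table above
doc1 = {2, 4, 5}
doc2 = {1, 2, 3, 5}

def permutation_minhash(shingle_set, permutation):
    # row number (in permuted order) of the first shingle the document contains
    return min(permutation.index(s) for s in shingle_set)

rows = list(range(1, 7))
collisions = 0
trials = 100000
for _ in range(trials):
    random.shuffle(rows)
    if permutation_minhash(doc1, rows) == permutation_minhash(doc2, rows):
        collisions += 1

print(collisions / trials)  # should be close to 2/5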
The above explanation follows Chapter 3 of Mining Massive Datasets [3]. An in-depth explanation of why and how minhash works is provided there, along with other interesting hash functions.
Applying minhash gives us a fixed length $k$ (you pick the length) representation of each document, such that the probability of a hash collision at any position of the fingerprint is equal to the Jaccard similarity of a pair of documents. This being a probabilistic measure you're not guaranteed to get a collision. For the Lorem Ipsum documents above and $k=100$ we get similarities that are roughly the same as the Jaccard similarity.
In [3]:
from lsh import minhash
for _ in range(5):
    hasher = minhash.MinHasher(seeds=100, char_ngram=5)
    fingerprint0 = hasher.fingerprint('Lorem Ipsum dolor sit amet'.encode('utf8'))
    fingerprint1 = hasher.fingerprint('Lorem Ipsum dolor sit amet is how dummy text starts'.encode('utf8'))
    print(sum(fingerprint0[i] in fingerprint1 for i in range(hasher.seeds)) / hasher.seeds)
Increasing the length of the fingerprint from $k=100$ to $k=1000$ reduces the variance between random initialisations of the minhasher.
In [136]:
for _ in range(5):
    hasher = minhash.MinHasher(seeds=1000, char_ngram=5)
    fingerprint0 = hasher.fingerprint('Lorem Ipsum dolor sit amet'.encode('utf8'))
    fingerprint1 = hasher.fingerprint('Lorem Ipsum dolor sit amet is how dummy text starts'.encode('utf8'))
    print(sum(fingerprint0[i] in fingerprint1 for i in range(hasher.seeds)) / hasher.seeds)
Increasing the fingerprint length however comes at the cost of increased memory usage and more time spent computing the minhashes. For a collection of documents we are still left with comparing all pairs, and when that collection grows larger this becomes a very real problem. Cue LSH.
The idea behind locality sensitive hashing is to take the document fingerprints and chop them up into pieces, each piece being some number of minhashes. Since a single minhash (a single entry in the fingerprint) has a probability of producing a collision equal to the Jaccard similarity, each chopped up portion of the fingerprint does as well. This chopped up portion is the *locality* in locality sensitive hashing; the *hashing* is just a hash function (any hash function) which produces a bin ID from the fingerprint locality being hashed. Each bin holds the entire fingerprint (with optional meta information) of the document, along with that of any other documents that hash to the same bin.
Let's say our fingerprint has 100 minhashes in it and we chop the fingerprints into 10 pieces. Each piece of each fingerprint therefore contains 10 minhashes; we hash those again (not using minhash this time) to get a bin ID and store the whole fingerprint in every bin that each of the pieces happens to land in.
When we want to know which documents are similar to a query document, we look in all the bins the query document lands in; any document in any of those bins is a potential duplicate. Comparing the full fingerprints of all documents in the bin, or computing the actual Jaccard similarity between the shingle sets, yields the final similarity of documents. Crucially, since not all documents will land in the same bins, we've reduced the number of comparisons needed to find similar or near duplicate documents.
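To make the banding idea a bit more tangible, here is a minimal sketch of the bookkeeping involved (this is not the lsh library's implementation; the function and variable names are made up for illustration, fingerprints are assumed to be sequences of minhash integers, and only document IDs are stored in the bins rather than full fingerprints):

from collections import defaultdict

bins = defaultdict(set)  # bin ID -> document IDs whose fingerprints share that piece

def lsh_bins(fingerprint, num_pieces=10):
    # chop the fingerprint into num_pieces equally sized pieces and
    # hash each piece (together with its position) into a bin ID
    piece_size = len(fingerprint) // num_pieces
    for p in range(num_pieces):
        piece = tuple(fingerprint[p * piece_size:(p + 1) * piece_size])
        yield hash((p, piece))

def add_document(doc_id, fingerprint):
    for bin_id in lsh_bins(fingerprint):
        bins[bin_id].add(doc_id)

def candidates_for(fingerprint):
    # any document sharing at least one bin with the query is a candidate duplicate
    found = set()
    for bin_id in lsh_bins(fingerprint):
        found |= bins[bin_id]
    return found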
The number of pieces to chop each fingerprint into and the size of each piece are parameters that need to be set. These should be set such that $num\_pieces \times size\_of\_piece = num\_minhashes$ - this makes sense since, having computed all $N$ minhashes, we want to use all of them in the locality sensitive hashing part. There is however a further issue that needs to be considered when setting the parameters: the relation between the number and size of the pieces and the probability of LSH "finding" a pair of similar documents.
LSH is a probabilistic model, which means that it won't always do the "right thing". When using LSH one needs to consider the similarity of a pair of documents (in this case the Jaccard similarity) and the probability that LSH will find that pair to be similar (a true positive, i.e. a correctly discovered duplicate pair). The pairs of documents LSH finds to be similar should be thought of as candidate duplicates. The higher the probability, or guarantee, that LSH will find a pair of documents to be similar, the more false positives the model will also produce - that is, candidate duplicates that are not in fact duplicates.
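Following Chapter 3 of Mining Massive Datasets [3] this trade-off can be written down exactly: if a pair of documents has Jaccard similarity $s$ and the fingerprint is chopped into $b$ pieces of $r$ minhashes each, then the probability that the pair lands in at least one common bin - i.e. becomes a candidate pair - is $1 - (1 - s^r)^b$. That is exactly the quantity plotted below for a few choices of $b$ (pieces) and $r$ (size).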
In [16]:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
ix = pd.IndexSlice
In [4]:
df = pd.DataFrame(data=[(2, 50), (50, 2), (10, 10), (5, 20), (20, 5)], columns=['pieces', 'size'])
df['hashes'] = df['pieces'] * df['size']
for pr in np.linspace(0, 1, 200):
    df[pr] = 1 - (1 - pr**df['size']) ** df['pieces']

df = pd.pivot_table(df, index=['hashes', 'pieces', 'size'])

ax = df.T.plot(figsize=(10, 7), title='Probability of LSH finding a candidate pair');
plt.ylabel('p(candidate | Jaccard)');
plt.xlabel('Jaccard similarity');
plt.legend(list(df.loc[ix[100]].index),
           bbox_to_anchor=(1., 1, 1., 0), loc='upper left', fontsize=12,
           ncol=1, borderaxespad=0., title='Each line shows the\nfingerprint chopped\ninto (pieces, size)\n');
The figure shows the probability that LSH with minhash will "find" a pair of similar documents (y-axis) given the Jaccard similarity (x-axis) of those documents, for different settings of LSH. Each of the five lines corresponds to a different setting; the number of hashes is always 100, so we're just changing the number of pieces to chop each fingerprint into (the size of each piece is then determined, since pieces times size must equal the number of hashes).
Creating just two pieces with 50 rows each - that is two localities, each with a size of 50 minhashes - yields an LSH model (blue line) that tries really really hard not to find documents to be similar. Plugging the numbers into the formula above, documents whose actual Jaccard similarity is 95% are only found about 15% of the time, and the probability only climbs above 80% for near identical documents with a Jaccard similarity of around 99%. Documents whose Jaccard similarity is 80% will hardly ever be found to be similar.
Creating 5 pieces with 20 rows each (green line) is slightly more relaxed. The above graph should give you a pretty good idea of how to set the parameters for your use case so that you can be reasonably certain that LSH will generate acceptable candidate pairs.
The Reuters Corpus, Volume 1 (RCV1) is a commonly used resource for various NLP tasks, especially document classification. It was made available in 2000 by Reuters Ltd and consists of ~810,000 English language news stories collected between August 20th 1996 and August 19th 1997 from the Reuters newswire.
I've preprocessed the corpus so that it is all in a single file, one line per document. Each line has the format:
ITEMID<TAB>HEADLINE<SPACE>TEXT
In [94]:
!wc -l ../data/rcv1/headline.text.txt
In [123]:
!head -1 ../data/rcv1/headline.text.txt
Some duplicate items are present in the corpus, so let's see what happens when we apply LSH to it. First, a helper function that takes a file pointer and some parameters for minhash and LSH and then finds candidate duplicates.
In [65]:
import itertools
from lsh import lsh, minhash  # https://github.com/mattilyra/lsh

# a pure python shingling function that will be used in comparing
# LSH to true Jaccard similarities
def shingles(text, char_ngram=5):
    return set(text[head:head + char_ngram] for head in range(0, len(text) - char_ngram))


def jaccard(set_a, set_b):
    intersection = set_a & set_b
    union = set_a | set_b
    return len(intersection) / len(union)


def candidate_duplicates(document_feed, char_ngram=5, seeds=100, bands=5, hashbytes=4):
    hasher = minhash.MinHasher(seeds=seeds, char_ngram=char_ngram, hashbytes=hashbytes)
    if seeds % bands != 0:
        raise ValueError('Seeds has to be a multiple of bands. {} % {} != 0'.format(seeds, bands))

    cache = lsh.Cache(bands=bands, hasher=hasher)
    for i, line in enumerate(document_feed):
        line = line.decode('utf8')
        docid, headline_text = line.split('\t', 1)
        fingerprint = hasher.fingerprint(headline_text.encode('utf8'))

        # in addition to storing the fingerprint store the line
        # number and document ID to help analysis later on
        cache.update(fingerprint, meta=(i, docid))

    candidate_pairs = set()
    for b in cache.bins:
        for bucket_id in b:
            if len(b[bucket_id]) > 1:
                pairs_ = set(itertools.combinations(b[bucket_id], r=2))
                candidate_pairs.update(pairs_)

    return candidate_pairs
In [62]:
hasher = minhash.MinHasher(seeds=100, char_ngram=5, hashbytes=4)
cache = lsh.Cache(bands=10, hasher=hasher)

with open('../data/rcv1/headline.text.txt', 'rb') as fh:
    feed = itertools.islice(fh, 100)
    for line in feed:
        cache.update(hasher.fingerprint(line))

p = set()
for b in cache.bins:
    for bucket_id in b:
        if len(b[bucket_id]) > 1:
            pairs_ = set(itertools.combinations(b[bucket_id], r=2))
            p.update(pairs_)
Now let's run LSH with a few different parameter settings and see what the results look like. To save some time I'm only using the first 1000 documents.
In [66]:
num_candidates = []
bands = [2, 5, 10, 20]
for num_bands in bands:
    with open('../data/rcv1/headline.text.txt', 'rb') as fh:
        feed = itertools.islice(fh, 1000)
        candidates = candidate_duplicates(feed, char_ngram=5, seeds=100, bands=num_bands, hashbytes=4)
        num_candidates.append(len(candidates))
In [122]:
fig, ax = plt.subplots(figsize=(8, 6))
plt.bar(bands, num_candidates, align='center');
plt.title('Number of candidate duplicate pairs found by LSH using 100 minhash fingerprint.');
plt.xlabel('Number of bands');
plt.ylabel('Number of candidate duplicates');
plt.xticks(bands, bands);
So the more promiscuous [4] version (20 bands per fingerprint) finds many more candidate pairs than the conservative 2 band model. The first implication of this difference is that you have to do more comparisons to find the real duplicates. Let's see what that looks like in practice.
In [93]:
lines = []
with open('../data/rcv1/headline.text.txt', 'rb') as fh:
    # read the first 1000 lines into memory so we can compare them
    for line in itertools.islice(fh, 1000):
        lines.append(line.decode('utf8'))

    # reset file pointer and do LSH
    fh.seek(0)
    feed = itertools.islice(fh, 1000)
    candidates = candidate_duplicates(feed, char_ngram=5, seeds=100, bands=20, hashbytes=4)

# go over all the generated candidates comparing their similarities
similarities = []
for (f_a, (line_a, docid_a)), (f_b, (line_b, docid_b)) in candidates:
    doc_a, doc_b = lines[line_a], lines[line_b]
    shingles_a = shingles(lines[line_a])
    shingles_b = shingles(lines[line_b])
    jaccard_sim = jaccard(shingles_a, shingles_b)
    f_a, f_b = set(f_a), set(f_b)
    minhash_sim = len(f_a & f_b) / len(f_a | f_b)
    similarities.append((docid_a, docid_b, jaccard_sim, minhash_sim))
In [72]:
import random
print('There are {} candidate duplicates in total'.format(len(candidates)))
random.sample(similarities, k=15)
Out[72]:
So LSH with 20 bands indeed finds a lot of candidate duplicates, some of which - for instance (2865, 2762) above - are not all that similar. Since we only did this for the first 1000 lines, let's see how many pairs LSH missed, given some similarity threshold.
In [74]:
sims_all = np.zeros((1000, 1000), dtype=np.float64)
for i, line in enumerate(lines):
    shingles_a = shingles(line)
    for j in range(i + 1, len(lines)):
        shingles_b = shingles(lines[j])
        jaccard_sim = jaccard(shingles_a, shingles_b)

        # similarities are symmetric so we only care about the
        # upper diagonal here and leave (j, i) to be 0
        sims_all[i, j] = jaccard_sim
In [116]:
# turn the candidates into a dictionary so we have easy access to
# candidate pairs that were found
candidates_dict = {(line_a, line_b): (docid_a, docid_b)
                   for (_, (line_a, docid_a)), (_, (line_b, docid_b)) in candidates}

found = 0
for i in range(len(lines)):
    for j in range(i + 1, len(lines)):
        if sims_all[i, j] >= .9:
            # documents i and j have an actual Jaccard similarity >= 90%
            found += ((i, j) in candidates_dict or (j, i) in candidates_dict)

print('Out of {} pairs with similarity >= 90% {} were found, that\'s {:.1%}'.format((sims_all >= .9).sum(), found, found / (sims_all >= .9).sum()))
That seems pretty well in line with the figure showing how setting bands and rows affects the probability of finding similar documents. So we're doing quite well in terms of the true positives, but what about the false positives? 27 of the pairs found were true positives, so the rest are false positives. Since LSH found 110 document pairs in total, $110-27 = 83$ pairs were incorrect; that's 83 pairs checked in vain, compared to the 499500 pairs we would have had to go through for an all pairs comparison.
499500 is the number of entries above the diagonal of a $1000\times1000$ matrix. Since document similarities are symmetric we only need to compare `i` to `j`, not `j` to `i`, and we never compare a document to itself, which cuts out the 1000 entries on the diagonal. That leaves $\frac{1000\times1000 - 1000}{2} = 499500$ pairs.
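A quick sanity check of that count:

n = 1000
print(n * (n - 1) // 2)  # 499500 unordered pairs of distinct documents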
[3] Leskovec, Rajaraman and Ullman, Mining of Massive Datasets
[4] Just to be absolute
In [93]:
# preprocess RCV1 to be contained in a single file
import glob, zipfile, re
import xml.etree.ElementTree as ET

files = glob.glob('../data/rcv1/xml/*.zip')
with open('../data/rcv1/headline.text.txt', 'wb') as out:
    for f in files:
        zf = zipfile.ZipFile(f)
        for zi in zf.namelist():
            fh = zf.open(zi, 'r')
            root = ET.fromstring(fh.read().decode('latin-1'))
            itemid = root.attrib['itemid']
            headline = root.find('./headline').text
            text = ' '.join(root.find('./text').itertext())
            text = re.sub(r'\s+', ' ', text)
            out.write(('{}\t{} {}\n'.format(itemid, headline, text)).encode('utf8'))