Document deduplication using Locality Sensitive Hashing (LSH) with minhash

This notebook gives an overview of locality sensitive hashing for the purpose of deduplicating large collections of documents. The associated code is available from https://github.com/mattilyra/lsh.

Finding exact duplicates is trivially easy (just use md5), but finding near duplicates is much harder. Many document collections, from tweets to news text to forum discussions, contain documents that are almost exactly the same, with only a few sentences or characters that differ. A hash function such as md5 won't find these near duplicates as the documents are not identical at the byte level, only at the semantic level. Finding these semantic duplicates can be done, for instance, by using n-gram shingles and computing the Jaccard similarity of the shingle sets. This is computationally very expensive because the similarity between all pairs of documents needs to be computed. The Reuters Corpus Volume 1 (RCV1) [1] contains ~810k documents, that's $810000^2 = 656100000000$ comparisons. Assuming we have a really fast computer that can generate a pair of shingle sets and compute their Jaccard similarity in 1 microsecond, going through all of them would still take a bit over a week. Luckily the similarities are symmetric, so going through half of $656100000000$ is enough - that is still almost 4 days. The Amazon Product Data [2] is what could be called web scale: ~140 million product reviews. I don't even want to think about how long that would take.
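
A quick back of the envelope calculation, using the hypothetical 1 microsecond per comparison from above, shows where these numbers come from:


In [ ]:
# rough runtime estimate for an all pairs comparison of RCV1,
# assuming a very optimistic 1 microsecond per pair
n_docs = 810000
n_pairs = n_docs * (n_docs - 1) // 2  # similarities are symmetric
seconds = n_pairs * 1e-6
print('{:,} pairs, ~{:.1f} days'.format(n_pairs, seconds / (60 * 60 * 24)))
# 328,049,595,000 pairs, ~3.8 days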

Locality sensitive hashing (LSH) relies on two different methods: first a hash fingerprint of each document is created, and then locality sensitive hashing is applied to that fingerprint. If the fingerprint is generated using the minhash algorithm then the probability of a hash collision is equal to the Jaccard similarity of the documents. There are other hash functions that correspond to other similarity measures, the cosine similarity for instance, but I won't deal with those here.

Document similarity and minhash

Minhash is a hash function that computes the lowest hash value for a set of objects to be hashed. In the case of comparing document similarities the set of objects consists of word or character n-grams taken over a sliding window from the document - also known as shingles. The set of shingles allows us to compute the document similarity (defined in this case as Jaccard similarity) between a pair of documents.

For instance:


In [1]:
document = 'Lorem Ipsum dolor sit amet'
# shingle and discard the last 5 as these are just the last n<5 characters from the document
shingles = [document[i:i+5] for i in range(len(document))][:-5]
shingles


Out[1]:
['Lorem',
 'orem ',
 'rem I',
 'em Ip',
 'm Ips',
 ' Ipsu',
 'Ipsum',
 'psum ',
 'sum d',
 'um do',
 'm dol',
 ' dolo',
 'dolor',
 'olor ',
 'lor s',
 'or si',
 'r sit',
 ' sit ',
 'sit a',
 'it am',
 't ame']

In [2]:
other_document = 'Lorem Ipsum dolor sit amet is how dummy text starts'
# shingle and discard the last 5 as these are just the last n<5 characters from the document
other_shingles = [other_document[i:i+5] for i in range(len(other_document))][:-5]

# Jaccard distance is the size of set intersection divided by the size of set union
len(set(shingles) & set(other_shingles)) / len(set(shingles) | set(other_shingles))


Out[2]:
0.45652173913043476

As we can see these two documents are not very similar, at least in terms of the Jaccard similarity of their character 5-gram shingles. That aside, the problem with shingles is that they do not allow us to compute the similarities of a large number of documents easily - we still have to do an all pairs comparison. To get around that we can use locality sensitive hashing, but before LSH we'll turn the documents into a more manageable and uniform representation: a fixed length fingerprint comprised of $k$ minhashes.

Every document has a different number of shingles depending on its length, so for a corpus of any size the memory requirements of an all pairs comparison are hard to predict as each document consumes a variable amount of memory. For LSH we would like to have a fixed length representation of the documents without changing the semantics of document similarity. This is where minhashing comes in. It turns out that the probability of a hash collision for a minhash is exactly the Jaccard similarity of two sets. This can be seen by considering the two sets of shingles as a matrix. For two dummy documents the shingles could be represented as the table below (the zeros and ones indicate whether a shingle is present in the document or not). Notice that the Jaccard similarity of the documents is 2/5.

Document Shingles

row    shingle ID    Doc 1    Doc 2
1      1             0        1
2      2             1        1
3      3             0        1
4      4             1        0
5      5             1        1
6      6             0        0

A minhash corresponds to a random permutation of the rows and gives back the row number of the first non-zero entry in each column. For the above table the minhash for documents one and two would thus be 2 and 1 respectively - meaning that the documents are not similar. The above table however is just one ordering of the shingles of each document. A different random permutation of the rows will give a different minhash, in this case 2 and 2 (see the table below), making the documents similar.

Document Shingles

row    shingle ID    Doc 1    Doc 2
1      6             0        0
2      2             1        1
3      3             0        1
4      1             0        1
5      4             1        0
6      5             1        1

A random permutation of the rows can produce any of $6! = 720$ different orderings. However, we only care about which shingle ends up in the lowest numbered row with a 1 in it; a collision happens when that shingle is one both documents contain, that is shingle ID $\in \{2, 5\}$. The all-zero row doesn't matter, so there are 5 rows with a 1 in at least one column and 2 rows with a 1 in both columns. All a random permutation can do is place one of those 5 rows first, and with probability 2/5 it is one of the 2 shared rows - in other words it produces a hash collision with probability equal to the Jaccard similarity.
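
We can sanity check the 2/5 figure by simulating random row permutations of the toy matrix above (a quick standalone sketch, not part of the lsh library):


In [ ]:
import random

# shingle IDs present in the two dummy documents from the tables above
doc1 = {2, 4, 5}
doc2 = {1, 2, 3, 5}

shingle_ids = [1, 2, 3, 4, 5, 6]
trials, collisions = 100000, 0
for _ in range(trials):
    random.shuffle(shingle_ids)
    # the minhash of a document is the first shingle ID (lowest row) it contains
    min1 = next(s for s in shingle_ids if s in doc1)
    min2 = next(s for s in shingle_ids if s in doc2)
    collisions += (min1 == min2)

print(collisions / trials)  # ~0.4, the Jaccard similarity of the two documents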

The above explanation follows Chapter 3 of Mining Massive Datasets [3]. An in depth explanation for why and how minhash works is provided there along with other interesting hash functions.
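
In practice the rows are never explicitly permuted - that would be far too expensive. Instead each "permutation" is simulated by a random hash function, and the minhash is the smallest hash value any shingle of the document receives under that function. Below is a minimal sketch of the idea; the MinHasher used later presumably works along these lines, but this is an illustration rather than its actual implementation, and the helper name minhash_fingerprint is made up.


In [ ]:
import random

def minhash_fingerprint(shingle_set, num_hashes=100, prime=(1 << 61) - 1, seed=42):
    # each (a, b) pair defines one random hash function h(x) = (a*x + b) % prime
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]
    values = [hash(s) for s in shingle_set]  # map shingles to integers first
    # one minhash per hash function: the smallest hash value over all shingles
    return [min((a * v + b) % prime for v in values) for a, b in params]

fp0 = minhash_fingerprint(set(shingles))
fp1 = minhash_fingerprint(set(other_shingles))
# the fraction of positions where the two fingerprints agree estimates
# the Jaccard similarity of the shingle sets
print(sum(a == b for a, b in zip(fp0, fp1)) / len(fp0))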

Applying minhash $k$ times (you pick $k$) gives us a fixed length representation of each document such that the probability of a hash collision at any one position is equal to the Jaccard similarity of the pair. This being a probabilistic measure you're not guaranteed to get a collision. For the Lorem Ipsum documents above and $k=100$ we get similarities that are roughly the same as the Jaccard similarity.


In [3]:
from lsh import minhash

for _ in range(5):
    hasher = minhash.MinHasher(seeds=100, char_ngram=5)
    fingerprint0 = hasher.fingerprint('Lorem Ipsum dolor sit amet'.encode('utf8'))
    fingerprint1 = hasher.fingerprint('Lorem Ipsum dolor sit amet is how dummy text starts'.encode('utf8'))
    print(sum(fingerprint0[i] in fingerprint1 for i in range(hasher.seeds)) / hasher.seeds)


0.53
0.51
0.52
0.57
0.46

Increasing the length of the fingerprint from $k=100$ to $k=1000$ reduces the variance between random initialisations of the minhasher.


In [136]:
for _ in range(5):
    hasher = minhash.MinHasher(seeds=1000, char_ngram=5)
    fingerprint0 = hasher.fingerprint('Lorem Ipsum dolor sit amet'.encode('utf8'))
    fingerprint1 = hasher.fingerprint('Lorem Ipsum dolor sit amet is how dummy text starts'.encode('utf8'))
    print(sum(fingerprint0[i] in fingerprint1 for i in range(hasher.seeds)) / hasher.seeds)


0.449
0.472
0.467
0.464
0.463

Increasing the fingerprint length however comes at the cost of increased memory usage and more time spent computing the minhashes. For a collection of documents we are still left with comparing all pairs, and when that collection grows larger this becomes a very real problem. Cue LSH.

Locality sensitive hashing

The idea behind locality sensitive hashing is to take the document fingerprints and chop them up into pieces, each piece being some number of minhashes. Since a single minhash (a single entry in the fingerprint) produces a collision with probability equal to the Jaccard similarity $s$, a piece of $r$ minhashes collides in its entirety with probability $s^r$. The pieces are the locality in locality sensitive hashing; the hashing is just a hash function (any hash function) which produces a bin ID from the fingerprint piece being hashed. Each bin holds the entire fingerprint (with optional meta information) of the document and that of every other document that hashes to the same bin.

Let's say our fingerprint has 100 minhashes in it and we chop it into 10 pieces. Each piece of each fingerprint therefore contains 10 minhashes; we hash those again (not using minhash this time) to get a bin ID and store the whole fingerprint in every bin each of the pieces lands in.

When we want to know which documents are similar to a query document, we look in all the bins the query document lands in; any document in any of those bins is a potential duplicate. Comparing the full fingerprints, or computing the actual Jaccard similarity between the shingle sets, of only those documents yields the final similarity. Crucially, since not all documents land in the same bins we've reduced the number of comparisons needed to find similar or near duplicate documents.
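
A minimal sketch of the banding and binning just described, reusing the fp0 and fp1 fingerprints from the earlier sketch (the helper bin_keys is made up for illustration; the lsh library wraps all of this in lsh.Cache, used further down):


In [ ]:
from collections import defaultdict

def bin_keys(fingerprint, num_pieces=10):
    # chop the fingerprint into num_pieces equally sized pieces and hash each
    # one; the band index is part of the key so different bands never share bins
    size = len(fingerprint) // num_pieces
    for band in range(num_pieces):
        piece = tuple(fingerprint[band * size:(band + 1) * size])
        yield band, hash(piece)

# index the two fingerprints from the earlier sketch
bins = defaultdict(set)
for doc_id, fp in enumerate([fp0, fp1]):
    for key in bin_keys(fp):
        bins[key].add(doc_id)

# query: every document sharing at least one bin with fp0 is a candidate duplicate
candidates = set()
for key in bin_keys(fp0):
    candidates.update(bins[key])
candidates.discard(0)
# for these two fairly dissimilar documents the candidate set will most likely be empty
print(candidates)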

The number of pieces to chop each fingerprint into and the size of each piece are parameters that need to be set. They should satisfy $num\_pieces \times size\_of\_piece = num\_minhashes$ - having computed all $N$ minhashes we want to use all of them in the locality sensitive hashing part. There is however a further issue to consider when setting the parameters: the relation between the number and size of the pieces and the probability of LSH "finding" a pair of similar documents.

LSH is a probabilistic model, which means that it won't always do the "right thing". Using LSH one needs to consider the similarity of a pair of documents (in this case the Jaccard similarity) and the probability that LSH will find that pair to be similar (a true positive, i.e. a correctly discovered duplicate pair). Pairs of documents that LSH finds should be thought of as candidate duplicates. The higher the probability, or guarantee, that LSH will find a truly similar pair, the more false positives the model will also produce - candidate duplicates that are not in fact duplicates.


In [16]:
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

ix = pd.IndexSlice

In [4]:
df = pd.DataFrame(data=[(2, 50), (50, 2), (10, 10), (5, 20), (20, 5)], columns=['pieces', 'size'])
df['hashes'] = df['pieces'] * df['size']
for pr in np.linspace(0, 1, 200):
    df[pr] = 1 - (1 - pr**df['size']) ** df['pieces']

df = pd.pivot_table(df, index=['hashes', 'pieces', 'size'])

ax = df.T.plot(figsize=(10, 7), title='Probability of LSH finding a candidate pair');
plt.ylabel('p(candidate | Jaccard)');
plt.xlabel('Jaccard similarity');
plt.legend(list(df.loc[ix[100]].index),
           bbox_to_anchor=(1., 1, 1., 0), loc='upper left', fontsize=12, 
           ncol=1, borderaxespad=0., title='Each line shows the\nfingerprint chopped\ninto (pieces, size)\n');


The figure shows the probability that LSH with minhash will "find" a pair of similar documents (y-axis) given the Jaccard similarity (x-axis) of those documents, for different LSH settings. Each of the five lines corresponds to a different setting; the number of hashes is always 100, so we're only changing the number of pieces each fingerprint is chopped into (the size of the pieces is then determined by the number of hashes).

Creating just two pieces with 50 rows each - that is two localities, each spanning 50 minhashes - yields an LSH model (blue line) that tries really hard not to find documents to be similar. Even a pair of documents whose actual Jaccard similarity is 95% only has about a 15% chance of being proposed as a candidate, and documents whose Jaccard similarity is 80% will hardly ever be found.

Creating 5 pieces with 20 rows each (green line) is slightly more relaxed: a pair at 95% Jaccard similarity is found with a probability of roughly 89%, while a pair at 80% similarity is still only found about 6% of the time. The above graph should give you a pretty good idea of how to set the parameters for your use case so that you can be reasonably certain that LSH will generate acceptable candidate pairs.
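
The lines in the figure are just the function $1 - (1 - s^{rows})^{pieces}$ evaluated over Jaccard similarities $s$, so the claims above are easy to check directly:


In [ ]:
def p_candidate(sim, pieces, size):
    # probability that at least one piece of the fingerprint collides
    return 1 - (1 - sim**size)**pieces

for pieces, size in [(2, 50), (5, 20)]:
    probs = ['{:.3f}'.format(p_candidate(s, pieces, size)) for s in (0.7, 0.8, 0.9, 0.95)]
    print(pieces, size, probs)

# 2 50 ['0.000', '0.000', '0.010', '0.148']
# 5 20 ['0.004', '0.056', '0.477', '0.891']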


Deduplicating the Reuters RCV1 corpus [1]

The Reuters Corpus, Volume 1 (RCV1) is a commonly used resource for various NLP tasks, especially document classification. It was made available in 2000 by Reuters Ltd and consists of ~810,000 English language news stories collected between August 20th 1996 and August 19th 1997 from the Reuters newswire.

I've preprocessed the corpus so that it is all in a single file, one line per document. Each line has the format:

ITEMID<TAB>HEADLINE<SPACE>TEXT
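
Splitting a line into the document ID and its contents is then a matter of splitting on the first tab, which is also what the helper function further down does:


In [ ]:
with open('../data/rcv1/headline.text.txt', 'rb') as fh:
    line = next(fh).decode('utf8')

# the document ID is separated from the headline and text by a single tab
docid, headline_text = line.split('\t', 1)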

In [94]:
!wc -l ../data/rcv1/headline.text.txt


  806791 ../data/rcv1/headline.text.txt

In [123]:
!head -1 ../data/rcv1/headline.text.txt


2286	Recovery excitement brings Mexican markets to life.  Emerging evidence that Mexico's economy was back on the recovery track sent Mexican markets into a buzz of excitement Tuesday, with stocks closing at record highs and interest rates at 19-month lows. "Mexico has been trying to stage a recovery since the beginning of this year and it's always been getting ahead of itself in terms of fundamentals," said Matthew Hickman of Lehman Brothers in New York. "Now we're at the point where the fundamentals are with us. The history is now falling out of view." That history is one etched into the minds of all investors in Mexico: an economy in crisis since December 1994, a free-falling peso and stubbornly high interest rates. This week, however, second-quarter gross domestic product was reported up 7.2 percent, much stronger than most analysts had expected. Interest rates on governent Treasury bills, or Cetes, in the secondary market fell on Tuesday to 23.90 percent, their lowest level since Jan. 25, 1995. The stock market's main price index rallied 77.12 points, or 2.32 percent, to a record 3,401.79 points, with volume at a frenzied 159.89 million shares. Confounding all expectations has been the strength of the peso, which ended higher in its longer-term contracts on Tuesday despite the secondary Cetes drop and expectations of lower benchmark rates in Tuesday's weekly auction. With U.S. long-term interest rates expected to remain steady after the Federal Reserve refrained from raising short-term rates on Tuesday, the attraction of Mexico, analysts say, is that it offers robust returns for foreigners and growing confidence that they will not fall victim to a crumbling peso. "The focus is back on Mexican fundamentals," said Lars Schonander, head of researcher at Santander in Mexico City. "You have a continuing decline in inflation, a stronger-than-expected GDP growth figure and the lack of any upward move in U.S. rates." Other factors were also at play, said Felix Boni, head of research at James Capel in Mexico City, such as positive technicals and economic uncertainty in Argentina, which has put it and neighbouring Brazil's markets at risk. "There's a movement out of South American markets into Mexico," he said. But Boni was also wary of what he said could be "a lot of hype." The economic recovery was still export-led, and evidence was patchy that the domestic consumer was back with a vengeance. Also, corporate earnings need to grow strongly to justify the run-up in the stock market, he said. 

Some duplicate items are present in the corpus, so let's see what happens when we apply LSH to it. First, a helper function that takes a file handle and some parameters for minhash and LSH, and returns the candidate duplicate pairs.


In [65]:
import itertools

from lsh import lsh, minhash # https://github.com/mattilyra/lsh

# a pure python shingling function that will be used in comparing
# LSH to true Jaccard similarities
def shingles(text, char_ngram=5):
    return set(text[head:head + char_ngram] for head in range(0, len(text) - char_ngram))


def jaccard(set_a, set_b):
    intersection = set_a & set_b
    union = set_a | set_b
    return len(intersection) / len(union)


def candidate_duplicates(document_feed, char_ngram=5, seeds=100, bands=5, hashbytes=4):
    if seeds % bands != 0:
        raise ValueError('Seeds has to be a multiple of bands. {} % {} != 0'.format(seeds, bands))
    hasher = minhash.MinHasher(seeds=seeds, char_ngram=char_ngram, hashbytes=hashbytes)
    
    cache = lsh.Cache(bands=bands, hasher=hasher)
    for i, line in enumerate(document_feed):
        line = line.decode('utf8')
        docid, headline_text = line.split('\t', 1)
        fingerprint = hasher.fingerprint(headline_text.encode('utf8'))
        
        # in addition to storing the fingerprint store the line
        # number and document ID to help analysis later on
        cache.update(fingerprint, meta=(i, docid))

    candidate_pairs = set()
    for b in cache.bins:
        for bucket_id in b:
            if len(b[bucket_id]) > 1:
                pairs_ = set(itertools.combinations(b[bucket_id], r=2))
                candidate_pairs.update(pairs_)
    
    return candidate_pairs

In [62]:
hasher = minhash.MinHasher(seeds=100, char_ngram=5, hashbytes=4)
cache = lsh.Cache(bands=10, hasher=hasher)
with open('../data/rcv1/headline.text.txt', 'rb') as fh:
    feed = itertools.islice(fh, 100)
    for line in feed:
        cache.update(hasher.fingerprint(line))

p = set()
for b in cache.bins:
    for bucket_id in b:
        if len(b[bucket_id]) > 1:
            pairs_ = set(itertools.combinations(b[bucket_id], r=2))
            p.update(pairs_)

Now let's run LSH on a few different parameter settings and see what the results look like. To save some time I'm only using the first 1000 documents.


In [66]:
num_candidates = []
bands = [2, 5, 10, 20]
for num_bands in bands:
    with open('../data/rcv1/headline.text.txt', 'rb') as fh:
        feed = itertools.islice(fh, 1000)
        candidates = candidate_duplicates(feed, char_ngram=5, seeds=100, bands=num_bands, hashbytes=4)
        num_candidates.append(len(candidates))

In [122]:
fig, ax = plt.subplots(figsize=(8, 6))
plt.bar(bands, num_candidates, align='center');
plt.title('Number of candidate duplicate pairs found by LSH using a 100 minhash fingerprint.');
plt.xlabel('Number of bands');
plt.ylabel('Number of candidate duplicates');
plt.xticks(bands, bands);


So the more promiscuous [4] version (20 bands per fingerprint) finds many more candidate pairs than the conservative 2 band model. The first implication of this difference is that more comparisons are needed to weed out the false positives and find the real duplicates. Let's see what that looks like in practice.


In [93]:
lines = []
with open('../data/rcv1/headline.text.txt', 'rb') as fh:
    # read the first 1000 lines into memory so we can compare them
    for line in itertools.islice(fh, 1000):
        lines.append(line.decode('utf8'))
    
    # reset file pointer and do LSH
    fh.seek(0)
    feed = itertools.islice(fh, 1000)
    candidates = candidate_duplicates(feed, char_ngram=5, seeds=100, bands=20, hashbytes=4)

# go over all the generated candidates comparing their similarities
similarities = []
for (f_a, (line_a, docid_a)), (f_b, (line_b, docid_b)) in candidates:
    doc_a, doc_b = lines[line_a], lines[line_b]
    shingles_a = shingles(doc_a)
    shingles_b = shingles(doc_b)
    
    jaccard_sim = jaccard(shingles_a, shingles_b)
    f_a, f_b = set(f_a), set(f_b)
    minhash_sim = len(f_a & f_b) / len(f_a | f_b)
    similarities.append((docid_a, docid_b, jaccard_sim, minhash_sim))

In [72]:
import random

print('There are {} candidate duplicates in total'.format(len(candidates)))
random.sample(similarities, k=15)


There are 101 candidate duplicates in total
Out[72]:
[('2536', '2334', 0.8379651436646255, 0.7391304347826086),
 ('2889', '2883', 0.5324232081911263, 0.43884892086330934),
 ('2788', '2895', 0.2471655328798186, 0.16279069767441862),
 ('2372', '2332', 0.7001640689089418, 0.5267175572519084),
 ('2554', '2556', 0.6660666066606661, 0.4925373134328358),
 ('2865', '2762', 0.45588235294117646, 0.3333333333333333),
 ('2868', '2497', 0.97796817625459, 1.0),
 ('2500', '2501', 0.9816933638443935, 1.0),
 ('2941', '2762', 0.48120300751879697, 0.3793103448275862),
 ('2511', '2977', 0.7044592030360531, 0.5873015873015873),
 ('2863', '2860', 0.4676258992805755, 0.4084507042253521),
 ('2860', '2762', 0.3355263157894737, 0.27388535031847133),
 ('3005', '2994', 0.33774834437086093, 0.28205128205128205),
 ('2742', '2387', 0.6901698404529079, 0.45985401459854014),
 ('2542', '2399', 0.9192044748290864, 0.8691588785046729)]

So LSH with 20 bands indeed finds a lot of candidate duplicates, some of which - for instance (2865, 2762) above - are not all that similar. Since we only did this for the first 1000 lines, let's see how many pairs LSH missed given some similarity threshold.


In [74]:
# precompute the shingle sets so they aren't rebuilt for every pair
shingles_all = [shingles(line) for line in lines]

sims_all = np.zeros((1000, 1000), dtype=np.float64)
for i in range(len(lines)):
    for j in range(i + 1, len(lines)):
        # similarities are symmetric so we only fill in the upper
        # triangle of the matrix and leave (j, i) at 0
        sims_all[i, j] = jaccard(shingles_all[i], shingles_all[j])

In [116]:
# turn the candidates into a dictionary so we have easy access to
# candidates pairs that were found
candidates_dict = {(line_a, line_b): (docid_a, docid_b) for (_, (line_a, docid_a)), (_, (line_b, docid_b)) in candidates}
found = 0
for i in range(len(lines)):
    for j in range(i+1, len(lines)):
        if sims_all[i, j] >= .9:
            # documents i and j have an actual Jaccard similarity >= 90%
            found += ((i, j) in candidates_dict or (j, i) in candidates_dict)

print('Out of {} pairs with similarity >= 90% {} were found, that\'s {:.1%}'.format((sims_all >= .9).sum(), found, found / (sims_all >= .9).sum()))


Out of 27 pairs with similarity >= 90% 27 were found, that's 100.0%

That seems pretty well in line with the figure showing how setting bands and rows affects the probability of finding similar documents. So we're doing quite well in terms of true positives, but what about false positives? 27 of the pairs found were true positives, so the rest are false positives. Since LSH found 101 candidate pairs in total, $101 - 27 = 74$ pairs were incorrect; that's 74 pairs checked in vain, compared to the 499500 pairs we would have had to go through for an all pairs comparison.

499500 is the number of entries above the diagonal of a $1000 \times 1000$ matrix. Since document similarities are symmetric we only need to compare $i$ to $j$, not $j$ to $i$, and we never need to compare a document to itself, which removes the 1000 diagonal entries: $\frac{1000 \times 1000 - 1000}{2} = 499500$ pairs.
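
The same count in code:


In [ ]:
n = 1000
# unordered pairs (i, j) with i < j: no self comparisons, no double counting
print(n * (n - 1) // 2)  # 499500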

Just to be absolute

References


In [93]:
# preprocess RCV1 to be contained in a single file
import glob, re, zipfile
import xml.etree.ElementTree as ET

files = glob.glob('../data/rcv1/xml/*.zip')
with open('../data/rcv1/headline.text.txt', 'wb') as out:
    for f in files:
        with zipfile.ZipFile(f) as zf:
            for zi in zf.namelist():
                with zf.open(zi, 'r') as fh:
                    root = ET.fromstring(fh.read().decode('latin-1'))
                itemid = root.attrib['itemid']
                headline = root.find('./headline').text
                text = ' '.join(root.find('./text').itertext())
                text = re.sub(r'\s+', ' ', text)
                out.write('{}\t{} {}\n'.format(itemid, headline, text).encode('utf8'))