In data science, it's common to have lots of nearly duplicate data. For instance, you'll find lots of nearly duplicate web pages when crawling the internet, and lots of pictures of the same cat or dog overlaid with slightly different text, posted on Twitter as memes.

Near duplicate data just means data that is almost the same. The sentences "Congress returned from recess" and "Congress returned from recess last week" are near duplicates. They are not exactly the same, but they're similar.

Sometimes it's helpful to find all such duplicates, either to remove them from the dataset or to analyze the duplicates themselves. For instance, you might want to find all the groups of memes on Twitter, or delete online comments posted by bots.

In order to find duplicates, you need a formal way to represent similarity that a computer can work with. One commonly used metric is the Jaccard similarity, which is the size of the intersection of two sets divided by the size of their union. We can express the Jaccard in symbols as follows. If $A$ and $B$ are sets, and $|A|$ and $|B|$ denote the sizes of those sets (sometimes called their "cardinality"), then:

$$Jaccard(A,B) = \frac{|A \cap B|}{|A \cup B|}$$

If you try calculating a few similarities with pen and paper, you'll quickly get a good intuition for the Jaccard. For instance, say $A = \{1,2,3\}$ and $B=\{2,3,4\}$. $A \cap B$ is just the elements which are in both $A$ and $B$, namely $\{2,3\}$, which has a size of 2. Similarly, $A \cup B$ is the elements which are in $A$ or $B$, namely $\{1,2,3,4\}$, which has a size of 4. Thus, the Jaccard is $2/4 = 0.5$.
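
You can check this with a couple of lines of Python, using the set operators for intersection and union:

In [ ]:
A = {1, 2, 3}
B = {2, 3, 4}

# size of the intersection divided by size of the union
len(A & B) / len(A | B)  # 0.5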

Exercise: Try calculating the Jaccard when $A = \{1,2,3\}$ and $B=\{1,2,3\}$. What happens? How about when $A =\{1,2,3\}$ and $B=\{4,5,6\}$? Is it possible to have a Jaccard that is lower? How about a Jaccard that is higher?

Now let's say you are trying to find near-duplicate documents in your collection. You can represent each document as a set of words, assigning each word a unique number. Then you compute the Jaccard similarity between all pairs of documents and keep the pairs whose similarity is greater than, say, $0.9$.
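
For real text, one simple way to turn a document into a set of words is to lowercase it and split on whitespace. Here is a minimal sketch (the cells below sidestep this step by generating synthetic documents instead):

In [ ]:
def to_word_set(text):
    # crude tokenization: lowercase and split on whitespace
    return set(text.lower().split())

to_word_set("Congress returned from recess")
# {'congress', 'returned', 'from', 'recess'} (set order may vary)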


In [1]:
import itertools
import string
import functools
# Build a synthetic vocabulary of all two-letter "words": 'aa', 'ab', ..., 'zz' (676 in total).
letters = string.ascii_lowercase
vocab = list(map(''.join, itertools.product(letters, repeat=2)))
vocab[:2]

Out[1]:
['aa', 'ab']

In [2]:
from random import choices

def zipf_pdf(k):
    # Unnormalized Zipf-like weight for the k-th word in the vocabulary.
    return 1/k**1.07

def exponential_pdf(k, base):
    # Unnormalized, exponentially increasing weight for the k-th word.
    return base**k

def new_document(n_words, pdf):
    # Draw n_words from the vocabulary with replacement, weighting the
    # k-th word by pdf(k), and keep the distinct words as a set.
    return set(
        choices(
            vocab,
            weights=map(pdf, range(1, 1+len(vocab))),
            k=n_words
        )
    )

def new_documents(n_documents, n_words, pdf):
    return [new_document(n_words, pdf) for _ in range(n_documents)]

def jaccard(a, b):
    # Size of the intersection divided by the size of the union.
    return len(a & b) / len(a | b)

def all_pairs(documents):
    return list(itertools.combinations(documents, 2))

def filter_similar(pairs, cutoff=0.9):
    # Keep only the pairs whose Jaccard similarity exceeds the cutoff.
    return list(filter(
        lambda docs: jaccard(docs[0], docs[1]) > cutoff,
        pairs
    ))
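
As a quick sanity check, applying jaccard to the two sentences from the introduction (treated as word sets) gives about 0.67:

In [ ]:
a = {'congress', 'returned', 'from', 'recess'}
b = {'congress', 'returned', 'from', 'recess', 'last', 'week'}
jaccard(a, b)  # 4 shared words out of 6 total = 0.666...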

In [3]:
documents = new_documents(1000, 1000, functools.partial(exponential_pdf, base=1.1))
pairs = all_pairs(documents)

Based on the way we are choosing words, we see that 515 pairs out of our 1000 documents have a high enough Jaccard to call them similar. This seems realistic enough, and we can fiddle with it if we want by changing the base of the exponential distribution (a quick sweep is sketched after the next cell).


In [4]:
len(filter_similar(pairs))


Out[4]:
515
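
To get a feel for how the base affects the number of similar pairs, here is a rough sketch that sweeps a few values, reusing the functions defined above (the counts are random and will differ from run to run; a smaller collection keeps the sweep quick):

In [ ]:
# Count the similar pairs for a few different bases (100 documents each).
for base in [1.05, 1.1, 1.2]:
    docs = new_documents(100, 1000, functools.partial(exponential_pdf, base=base))
    print(base, len(filter_similar(all_pairs(docs))))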

We can also see that the Jaccard similarities look roughly normally distributed:


In [5]:
jaccards = list(map(lambda docs: jaccard(docs[0], docs[1]), pairs))

In [144]:
%matplotlib inline

import seaborn as sns
sns.distplot(jaccards)


Out[144]:
[histogram and density plot of the pairwise Jaccard similarities]

Now let's time this, to see how the running time grows with the number of documents $n$. We expect it to be $\Theta(n^2)$, because each document is compared to every other document.
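
Concretely, comparing every pair of $n$ documents means examining

$$\binom{n}{2} = \frac{n(n-1)}{2}$$

pairs, which grows like $n^2$.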


In [20]:
def create_and_filter(n_documents):
    documents = new_documents(n_documents, 500, functools.partial(exponential_pdf, base=1.1))
    pairs = all_pairs(documents)
    return filter_similar(pairs)

In [8]:
import timeit

In [14]:
def time_create_and_filter(n_documents):
    return timeit.timeit(
        'create_and_filter(n)',
        globals={
            "n": n_documents,
            "create_and_filter": create_and_filter
        },
        number=1
    )

In [10]:
import pandas as pd
from tqdm import tnrange, tqdm_notebook

In [11]:
def create_timing_df(ns):
    return pd.DataFrame({
        'n': ns,
        'time': list(map(time_create_and_filter, tqdm_notebook(ns)))
    })

In [21]:
df = create_timing_df([2 ** e for e in range(1, 13)])



In [86]:
sns.lmplot(x="n", y="time", data=df, order=2)


Out[86]:
[scatter plot of running time against the number of documents n, with a quadratic (order=2) fit]

So it would be helpful to have a method that takes less time as the number of documents increases.

