In data science, it's common to have lots of nearly duplicate data. For instance, you'll find lots of nearly duplicate web pages when crawling the internet, and lots of pictures of the same cat or dog overlaid with slightly different text posted on Twitter as memes.

Near duplicate data just means data that is almost the same. The sentences "Congress returned from recess" and "Congress returned from recess last week" are near duplicates. They are not **exactly** the same, but they're similar.

Sometimes it's helpful to find all such duplicates, either to remove them from the dataset or to analyze the duplicates themselves. For instance, you might want to find all the groups of memes on Twitter, or delete online comments from bots.

In order to find duplicates, you need a formal way to represent similarity that a computer can compute. One commonly used metric is the Jaccard similarity, which is the size of the intersection of two sets divided by the size of their union. We can express the Jaccard similarity in symbols as follows. If $A$ and $B$ are sets, and $|A|$ and $|B|$ denote the sizes of those sets (sometimes called their "cardinality"), then:

$$Jaccard(A,B) = \frac{|A \cap B|}{|A \cup B|}$$
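To make this concrete, here is a quick sketch computing the Jaccard similarity of the word sets of the two example sentences above:

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

a = set("congress returned from recess".split())
b = set("congress returned from recess last week".split())
print(jaccard(a, b))  # 4 shared words out of 6 distinct words -> 0.666...
```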

**Exercise**: Try calculating the Jaccard when $A = \{1,2,3\}$ and $B=\{1,2,3\}$. What happens? How about when $A =\{1,2,3\}$ and $B=\{4,5,6\}$? Is it possible to have a Jaccard that is lower? How about a Jaccard that is higher?

In [1]:

```python
import itertools
import string
import functools

# Vocabulary of all two-letter strings: 'aa', 'ab', ..., 'zz'
letters = string.ascii_lowercase
vocab = list(map(''.join, itertools.product(letters, repeat=2)))
```
In [2]:

```python
from random import choices

def zipf_pdf(k):
    return 1 / k ** 1.07

def exponential_pdf(k, base):
    return base ** k

def new_document(n_words, pdf):
    # Sample n_words words (with replacement) weighted by the pdf;
    # the set keeps only the unique words.
    return set(
        choices(
            vocab,
            weights=list(map(pdf, range(1, 1 + len(vocab)))),
            k=n_words,
        )
    )

def new_documents(n_documents, n_words, pdf):
    return [new_document(n_words, pdf) for _ in range(n_documents)]

def jaccard(a, b):
    return len(a & b) / len(a | b)

def all_pairs(documents):
    return list(itertools.combinations(documents, 2))

def filter_similar(pairs, cutoff=0.9):
    return list(filter(
        lambda docs: jaccard(docs[0], docs[1]) > cutoff,
        pairs
    ))
```
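To get a feel for the two weighting schemes, here is a quick standalone sketch: `zipf_pdf` decays with rank, while `exponential_pdf` with a base above 1 grows, so it weights words later in the vocabulary more heavily.

```python
def zipf_pdf(k):
    return 1 / k ** 1.07

def exponential_pdf(k, base):
    return base ** k

# Zipf weights fall off with rank...
print(zipf_pdf(1), zipf_pdf(100))  # 1.0 vs ~0.007
# ...while exponential weights (base 1.1) grow with rank
print(exponential_pdf(1, 1.1), exponential_pdf(100, 1.1))  # 1.1 vs ~13780
```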
In [3]:

```python
documents = new_documents(1000, 1000, functools.partial(exponential_pdf, base=1.1))
pairs = all_pairs(documents)
```
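Note that comparing every pair scales quadratically: `all_pairs` on $n$ documents yields $\binom{n}{2} = n(n-1)/2$ pairs. A quick check:

```python
import math

n = 1000
print(math.comb(n, 2))  # 499500 pairs for 1000 documents
```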
In [4]:

```python
len(filter_similar(pairs))
```

We can also see that the Jaccard similarities look roughly normally distributed:

In [5]:

```python
jaccards = list(map(lambda docs: jaccard(docs[0], docs[1]), pairs))
```

In [144]:

```python
%matplotlib inline
import seaborn as sns
sns.distplot(jaccards)
```

In [20]:

```python
def create_and_filter(n_documents):
    documents = new_documents(n_documents, 500, functools.partial(exponential_pdf, base=1.1))
    pairs = all_pairs(documents)
    return filter_similar(pairs)
```
In [8]:

```python
import timeit
```
In [14]:

```python
def time_create_and_filter(n_documents):
    return timeit.timeit(
        'create_and_filter(n)',
        globals={
            "n": n_documents,
            "create_and_filter": create_and_filter
        },
        number=1
    )
```
In [10]:

```python
import pandas as pd
from tqdm import tqdm_notebook
```
In [11]:

```python
def create_timing_df(ns):
    return pd.DataFrame({
        'n': ns,
        'time': list(map(time_create_and_filter, tqdm_notebook(ns)))
    })
```
In [21]:

```python
df = create_timing_df([2 ** e for e in range(1, 13)])
```
In [86]:

```python
sns.lmplot(x="n", y="time", data=df, order=2)
```