Detecting Copied Text Using Bags of Words and Naive Bayes Classifiers

RJ Nowling

Plagiarism detection is concerned with automatically identifying text that has been copied from one document to another. Since the goal is generally to find text that was copied without permission, it involves both detecting copied text and matching that text against citations. The methods for identifying copied text have broader applicability in areas such as sociology. For example, sociologists may be interested in studying communication patterns or the spread of information in social media.

As an applied field, plagiarism detection borrows heavily from other fields such as natural language processing and bioinformatics. For example, Wikipedia cites string matching and alignment techniques from bioinformatics as among the popular methods.

In this blog post, I'm going to explore a basic method for identifying copied text using the bag of words (unordered set of words) model and Naive Bayes classifiers. I'll explain the Naive Bayes classifier and test the method using Wikipedia articles.

For my approach, I want to match each word in a sample document to a corpus of reference documents. To do so, I create a bag of words for each word and its neighbors in the sample and reference documents. I like the bag of words approach since it is robust to perturbations such as:

  • Word order may vary locally (e.g., "the painted red house" can become "the house painted red").
  • Individual words may be dropped, added, or changed (e.g., "the painted red house" can become "the red colored house").
  • Units of copied text may be as small as a sentence or clause.

To extract the bags of words, I use a sliding window approach. For example, consider the sentence

The quick brown fox jumped over the lazy dog.

We get the following bags of words:

A = {the quick brown fox jumped}
B = {quick brown fox jumped over}
C = {brown fox jumped over the}
D = {fox jumped over the lazy}
E = {jumped over the lazy dog}
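
To make the windowing concrete, here is a minimal sketch of the extraction (the function name sliding_bags and the window size of 5 are only for this illustration; the full extraction code used on the Wikipedia articles appears later in the post):

def sliding_bags(text, window=5):
    # Slide a fixed-size window over the words and keep each window
    # as an unordered set (a "bag") of words.
    words = text.lower().replace(".", "").split()
    return [frozenset(words[i:i + window])
            for i in range(len(words) - window + 1)]

for bag in sliding_bags("The quick brown fox jumped over the lazy dog."):
    print(sorted(bag))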

Now that we have a representation for the words, we can study Naive Bayes classifiers.

Naive Bayes Classifiers

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features. Given a class variable $Y$ and a dependent feature vector $x_1$ through $x_n$, Bayes' theorem states the following relationship:

$$ P(Y \mid x_1, \dots, x_n) = \frac{P(Y) P(x_1, \dots, x_n \mid Y)} {P(x_1, \dots, x_n)} $$

Using the naive independence assumption that

$$ P(x_i \mid Y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid Y) $$

for all $i$, this relationship is simplified to

$$ P(Y \mid x_1, \dots, x_n) = \frac{P(Y) \prod_{i=1}^{n} P(x_i \mid Y)} {P(x_1, \dots, x_n)} $$

Since $P(x_1, \dots, x_n)$ is constant given the input, we can use the following classification rule:

$$ P(Y \mid x_1, \dots, x_n) \propto P(Y) \prod_{i=1}^{n} P(x_i \mid Y) \\ \Downarrow \\ \hat{Y} = \arg\max_Y P(Y) \prod_{i=1}^{n} P(x_i \mid Y) $$

The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of $P(x_i \mid Y)$.

When creating a Naive Bayes classifier, we need to make three decisions:

  • The class probability $P(Y)$ -- For our work, we'll assume that the classes are uniformly distributed, i.e., $P(Y) = \frac{1}{N_c}$
  • The feature estimator $\hat{\theta}(w \mid Y)$ giving the probability of the occurrence of the word $w$ in class $Y$
  • The probability $P(x_w \mid Y)$ -- the probability of the feature $x_w$ (occurrence of the word $w$ in the instance $X$) appearing in an instance of class $Y$

We will look at three variants of Naive Bayes.

Multinomial Estimator, Multinomial Combination Rule

The multinomial estimator is given by:

$$ \hat{\theta}(w \mid Y) = \frac{y_w + \alpha}{N_y + \alpha n} $$

where $y_w$ is a binary indicator variable giving whether the word $w$ occurs in class $Y$, $N_y$ is the number of words in class $Y$, $n$ is the total number of distinct words (features) in the training set, and $\alpha$ is a smoothing parameter, often set to 1.

The probability rule is given by:

$$ P(x_w | Y ) \propto \hat{\theta}(w \mid Y)^{x_w} $$

Note that since $x_w = 0$ and $ P(x_w | Y ) = 1 $ for words not in the instance $X$, we can simplify the probability rules so that we only need to consider words in $X$ instead of all of the words in the training set instances:

$$ P(x_w | Y ) \propto \hat{\theta}(w \mid Y)\\ \\ P(Y \mid x_1, \dots, x_n) \propto P(Y) \prod_{w \in X} P(x_w \mid Y) $$

This observation can reduce the computation time quite a bit since we can exclude training classes that do not share any features with the instance, as well as words in the training classes that do not appear in the validation instances.

Let's run through a calculation using the example from the introduction.

We'll use the following bags of words for our validation set:

A = {the quick brown fox jumped}
F = {alpha beta delta gamma kappa}

Let's use bags of words B, C, and D for our training set:

B = {quick brown fox jumped over}
C = {brown fox jumped over the}
D = {fox jumped over the lazy}

The total set of training words is:

{quick brown fox jumped over the lazy}

Since each training class has 5 words and there are a total of 7 words, we can find the feature estimator as:

$$ \hat{\theta}(w | Y) = \frac{y_w + \alpha}{N_y + \alpha n} = \frac{y_w + 1}{5 + 7} = \frac{y_w + 1}{12} $$

As an example, let's find the likelihood of observing "the" and "quick" in class B.

$$ \hat{\theta}(\text{the} \mid B) = \frac{B_{\text{the}} + 1}{12} = \frac{0 + 1}{12} = \frac{1}{12} $$

Since the word "the" is not in class B, the occurence value is 0, resulting in a likelihood of 1/12.

$$ \hat{\theta}(\text{quick} \mid B) = \frac{B_{\text{quick}} + 1}{12} = \frac{1 + 1}{12} = \frac{1}{6} $$

For the word "quick", which is in class B, we find a likelihood of 1/6.

We can compute the probabilities for our validation instances as follows:

$$ P(A \mid B) \propto \frac{1}{3} \Big(\frac{1}{12}\Big) \Big(\frac{2}{12}\Big)^4 = \frac{16}{746496} $$$$ P(A \mid C) \propto \frac{1}{3} \Big(\frac{2}{12}\Big) \Big(\frac{1}{12}\Big) \Big(\frac{2}{12}\Big)^3 = \frac{16}{746496} $$$$ P(A \mid D) \propto \frac{1}{3} \Big(\frac{2}{12}\Big) \Big(\frac{1}{12}\Big)^2 \Big(\frac{2}{12}\Big)^2 = \frac{8}{746496} $$

Validation instance $A$ has an equal probability of belonging to classes $B$ and $C$ and a lower probability of belonging to class $D$. Thus, $A$ would be predicted to belong to either class $B$ or $C$.

Since instance $F$ doesn't have any words in common with any of the training instances, $F$ is computed as having an equal probability of belonging to all three classes:

$$ P(F | B) \propto \frac{1}{3} \Big(\frac{1}{12}\Big)^5 = \frac{1}{746496} $$$$ P(F | C) \propto \frac{1}{3} \Big(\frac{1}{12}\Big)^5 = \frac{1}{746496} $$$$ P(F | D) \propto \frac{1}{3} \Big(\frac{1}{12}\Big)^5 = \frac{1}{746496} $$
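
If you want to check the arithmetic, here is a minimal sketch of the multinomial calculation above (the helpers theta and multinomial_score are hypothetical names for this example, not part of the implementation later in the post):

from __future__ import division

training = {"B": frozenset("quick brown fox jumped over".split()),
            "C": frozenset("brown fox jumped over the".split()),
            "D": frozenset("fox jumped over the lazy".split())}
vocabulary = frozenset().union(*training.values())  # 7 distinct training words

def theta(word, class_words, alpha=1):
    # Smoothed multinomial estimator: (y_w + alpha) / (N_y + alpha * n)
    y_w = 1 if word in class_words else 0
    return (y_w + alpha) / (len(class_words) + alpha * len(vocabulary))

def multinomial_score(instance, class_words, prior=1.0 / 3.0):
    # P(Y) times the product of theta(w | Y) over the instance's words
    score = prior
    for word in instance:
        score *= theta(word, class_words)
    return score

A = frozenset("the quick brown fox jumped".split())
for label in sorted(training):
    print("%s %.3g" % (label, multinomial_score(A, training[label])))
    # B, C: 16/746496 ~ 2.14e-05; D: 8/746496 ~ 1.07e-05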

Inverse Document Frequency (IDF) Estimator, Multinomial Combination Rule

Since the multinomial estimator does not weight the effect of each occurrence or non-occurrence by the word itself, it has difficulty distinguishing between cases which have the same number of missing words, as we saw. Instead, we can weight the words by their "inverse document frequency," or basically, give higher weights to words that appear in fewer documents. The idea behind IDF is that words that appear in fewer documents are stronger indicators than those that appear in many documents.

The IDF estimator is given by:

$$ \hat{\theta}(w \mid Y) = \Big(\frac{N_c}{N_w}\Big)^{y_w} $$

where $N_c$ is the total number of classes, $y_w$ is a binary indicator variable giving whether the word $w$ is in class $Y$, and $N_w$ is the number of classes containing the word $w$.

We can use the same probability rules as above.

To review, the total set of words (features) is:

{quick brown fox jumped over the lazy}

The class counts for each word are as follows:

 quick = 1, brown = 2, fox = 3, jumped = 3,
 over = 3, the = 2, lazy = 1

Since there are 3 training classes, we can find the feature estimator as:

$$ P(x \mid Y) \propto \Big(\frac{N_c}{N_w}\Big)^{y_w} = \Big(\frac{3}{N_w}\Big)^{y_w} $$

As an example, let's find the proportional likelihood of observing "the" and "quick" in class B.

$$ P(\text{the} \mid B) \propto \Big(\frac{3}{N_\text{the}}\Big)^{y_\text{the}} = \Big(\frac{3}{2}\Big)^0 = 1 $$

Since the word "the" is not in class B, the occurence value is 0, resulting in a proportional likelihood of 1.

$$ P(\text{quick} \mid B) \propto \Big(\frac{3}{N_\text{quick}}\Big)^{y_\text{quick}} = \Big(\frac{3}{1}\Big)^1 = 3 $$

For the word "quick", which is in class B, we find a proportional likelihood of 3.

The resulting proportional probabilities for our validation set instances under the IDF model are as follows:

$$ P(A \mid B) \propto \frac{1}{3} \Big(\frac{3}{2}\Big)^0 \Big(\frac{3}{1}\Big)^1 \Big(\frac{3}{2}\Big)^1 \Big(\frac{3}{3}\Big)^1 \Big(\frac{3}{3}\Big)^1 = \frac{81}{54} = 1.5 $$$$ P(A \mid C) \propto \frac{1}{3} \Big(\frac{3}{2}\Big)^1 \Big(\frac{3}{1}\Big)^0 \Big(\frac{3}{2}\Big)^1 \Big(\frac{3}{3}\Big)^1 \Big(\frac{3}{3}\Big)^1 = \frac{81}{108} = 0.75 $$$$ P(A \mid D) \propto \frac{1}{3} \Big(\frac{3}{2}\Big)^1 \Big(\frac{3}{1}\Big)^0 \Big(\frac{3}{2}\Big)^0 \Big(\frac{3}{3}\Big)^1 \Big(\frac{3}{3}\Big)^1 = \frac{27}{54} = 0.5 $$

Validation instance $A$ has the highest probability of belonging to class $B$.

Since $F$ doesn't have any words in common with any of the training instances, $F$ is computed as having equal probabilities of belonging to all three classes:

$$ P(F \mid B) \propto \frac{1}{3} \Big(1\Big)^5 = \frac{1}{3} $$$$ P(F \mid C) \propto \frac{1}{3} \Big(1\Big)^5 = \frac{1}{3} $$$$ P(F \mid D) \propto \frac{1}{3} \Big(1\Big)^5 = \frac{1}{3} $$
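
The corresponding sketch for the IDF variant (again with hypothetical helper names; the class counts $N_w$ are derived from the three training bags):

from __future__ import division

training = {"B": frozenset("quick brown fox jumped over".split()),
            "C": frozenset("brown fox jumped over the".split()),
            "D": frozenset("fox jumped over the lazy".split())}
n_classes = len(training)

def idf_factor(word, class_words):
    # (N_c / N_w) ** y_w: rare words contribute large factors;
    # words absent from the class contribute a factor of 1.
    if word not in class_words:
        return 1.0
    n_word_classes = sum(1 for ws in training.values() if word in ws)
    return n_classes / n_word_classes

def idf_score(instance, class_words, prior=1.0 / 3.0):
    score = prior
    for word in instance:
        score *= idf_factor(word, class_words)
    return score

A = frozenset("the quick brown fox jumped".split())
for label in sorted(training):
    print("%s %.2f" % (label, idf_score(A, training[label])))
    # B: 1.50, C: 0.75, D: 0.50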

Multinomial Estimator, Bernoulli Combination Rule

Unlike multinomial Naive Bayes, Bernoulli Naive Bayes penalizes a class for vocabulary words that are missing from the instance. We can use the same estimator as the multinomial Naive Bayes, but the Bernoulli probability rule is given by:

$$ P(x_w \mid Y) \propto P(w \mid Y)^{x_w} (1 - P(w \mid Y))^{(1 - {x_w})} $$

As a reminder, the total set of words (features) is:

{the quick brown fox jumped over lazy}

Now, on to the examples. We'll use the multinomial feature estimator from above. Since we're using the Bernoulli combination rule, however, we need to consider every word in the vocabulary, unlike with the multinomial combination rule. This means we have four cases for every word:

  • Word is in X and Y
  • Word is in X and not Y
  • Word is not in X and is in Y
  • Word is not in X and not in Y

As a result, we find the following proportional probabilities:

$$ P(A | B) \propto \frac{1}{3} \Big(\frac{1}{12}\Big) \Big(\frac{2}{12}\Big)^4 \Big(\frac{10}{12}\Big) \Big(\frac{11}{12}\Big) = \frac{1760}{107495424} $$$$ P(A | C) \propto \frac{1}{3} \Big(\frac{2}{12}\Big) \Big(\frac{1}{12}\Big) \Big(\frac{2}{12}\Big)^3 \Big(\frac{10}{12}\Big) \Big(\frac{11}{12}\Big) = \frac{1760}{107495424} $$$$ P(A | D) \propto \frac{1}{3} \Big(\frac{2}{12}\Big) \Big(\frac{1}{12}\Big)^2 \Big(\frac{2}{12}\Big)^2 \Big(\frac{10}{12}\Big) \Big(\frac{10}{12}\Big) = \frac{800}{107495424} $$

Validation instance $A$ has equal probabilities for belonging in classes $B$ and $C$ and a lower probability of belonging to class $D$. Thus, $A$ would be predicted to belong to either class $B$ or $C$.

Since $F$ doesn't have any words in common with any of the training instances, $F$ is computed as having equal probabilities of belonging to all three classes:

$$ P(F | B) \propto \frac{1}{3} \Big(\frac{10}{12}\Big)^5 \Big(\frac{11}{12}\Big)^2 = \frac{12100000}{107495424} $$$$ P(F | C) \propto \frac{1}{3} \Big(\frac{10}{12}\Big)^5 \Big(\frac{11}{12}\Big)^2 = \frac{12100000}{107495424} $$$$ P(F | D) \propto \frac{1}{3} \Big(\frac{10}{12}\Big)^5 \Big(\frac{11}{12}\Big)^2 = \frac{12100000}{107495424} $$
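
And a sketch of the Bernoulli combination rule, which loops over the full training vocabulary rather than only the instance's words (hypothetical helper names, as before):

from __future__ import division

training = {"B": frozenset("quick brown fox jumped over".split()),
            "C": frozenset("brown fox jumped over the".split()),
            "D": frozenset("fox jumped over the lazy".split())}
vocabulary = frozenset().union(*training.values())

def theta(word, class_words, alpha=1):
    y_w = 1 if word in class_words else 0
    return (y_w + alpha) / (len(class_words) + alpha * len(vocabulary))

def bernoulli_score(instance, class_words, prior=1.0 / 3.0):
    # Every vocabulary word contributes: theta if it appears in the
    # instance, (1 - theta) if it is missing, penalizing absent words.
    score = prior
    for word in vocabulary:
        t = theta(word, class_words)
        score *= t if word in instance else (1.0 - t)
    return score

A = frozenset("the quick brown fox jumped".split())
for label in sorted(training):
    print("%s %.3g" % (label, bernoulli_score(A, training[label])))
    # B, C: 1760/107495424 ~ 1.64e-05; D: 800/107495424 ~ 7.44e-06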

Evaluation with Wikipedia Data

Now that we understand how the three methods work, we can evaluate their real-world performance on Wikipedia articles. The English Wikipedia article dumps can be downloaded from the Wikimedia database dump site (dumps.wikimedia.org). I used the smallest file of the complete pages, current versions only, in XML format, bzipped.

The first step is to extract the articles from the XML dump and remove non-content pages (e.g., Talk pages) and short articles.


In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import random

import xml.etree.ElementTree as ET

def read_articles(flname):
    tree = ET.parse(flname)
    root = tree.getroot()
    formatted_articles = []
    for page in root.iterfind("{http://www.mediawiki.org/xml/export-0.8/}page"):
        title = page.find("{http://www.mediawiki.org/xml/export-0.8/}title").text
        revision = page.find("{http://www.mediawiki.org/xml/export-0.8/}revision")
        text = revision.find("{http://www.mediawiki.org/xml/export-0.8/}text").text
        formatted_articles.append((title, text))
    return formatted_articles
        
articles = read_articles("data/enwiki-20140707-pages-meta-current1.xml-p000000010p000010000")
articles = filter(lambda t: not t[0].startswith("Talk:"), articles)
articles = filter(lambda t: len(t[1]) > 5000, articles)
sampled_articles = list(random.sample(articles, 30))
training_articles = sampled_articles[:10]
validation_articles = sampled_articles[10:20]
near_match_articles = sampled_articles[20:]

Cleaning the Data

The second step is to clean the data by removing the MediaWiki formatting. This removes hyperlinks and other markup that could confuse the classification process.


In [2]:
import re

def remove_block_regex(text, start_delim, end_delim):
    block_regex = re.compile(start_delim + r".+?" + end_delim, flags=re.DOTALL)
    new_string = ""
    last_end = 0
    for match in block_regex.finditer(text):
        new_string += text[last_end:match.start(0)] + " " 
        last_end = match.end(0)
    new_string += text[last_end:]
    
    return new_string

def replace_link_block(block):
    if block.startswith("[[image:") or \
        block.startswith("[[file:"):
            return " "
    elif block[1] != "[":
        start = block.find(" ")
        if start == -1:
            return " "
        else:
            return block[start+1:-1]
    elif "|" in block:
        start = block.find("|")
        return block[start+1:-2]
    
    return block[2:-2]

def substitute_links(text):
    pos = 0
    start_pos = 0
    while pos < len(text):
        if text[pos:pos+2] == "[[":
            start_pos = pos
            pos += 2
        elif text[pos] == "[":
            start_pos = pos
            pos += 1
        elif text[pos:pos+2] == "]]":
            block = text[start_pos:pos+2]
            block = replace_link_block(block)
            text = text[:start_pos] + block + text[pos+2:]
            pos = 0
        elif text[pos] == "]":
            block = text[start_pos:pos+1]
            block = replace_link_block(block)
            text = text[:start_pos] + block + text[pos+1:]
            pos = 0
        else:
            pos += 1
            
    return text

def remove_tail(text):
    headers = ["==See Also==", "== See Also ==", "==References==", "== References ==",
                "==External Links==", "== External Links ==", "==Further Reading==",
                "== Further Reading =="]
    positions = [text.find(s.lower()) for s in headers]
    positions = filter(lambda p: p != -1, positions)
    
    if len(positions) == 0:
        return text
    
    pos = min(positions)
    
    return text[:pos]

TAG_REGEX = re.compile(r"<\s*(\w+).+?/\1\s*>", flags=re.DOTALL)

def remove_html_tags(text):
    pos = 0
    start_pos = 0
    last_block_end = 0
    new_string = ""
    while pos < len(text):
        if text[pos] == "<":
            start_pos = pos
        elif text[pos:pos+2] == "/>":
            new_string += text[last_block_end:start_pos] + " "
            pos += 2
            last_block_end = pos
        pos += 1
    new_string += text[last_block_end:]
    text = new_string
    
    new_string = ""
    last_end = 0
    for match in TAG_REGEX.finditer(text):
        new_string += text[last_end:match.start(0)] + " " 
        last_end = match.end(0)
    new_string += text[last_end:]
    
    return new_string
        

def remove_wiki_formatting(text):
    text = text.lower()
    
    text = remove_tail(text)
    
    text = remove_block_regex(text, r"\{\{", r"\}\}")
    text = substitute_links(text)
    
    text = remove_block_regex(text, r"<!--", r"-->")
    text = remove_html_tags(text)
   
    text = text.replace(">", " ")
    text = text.replace("<", " ")
    text = text.replace("-", " ")
    text = text.replace("&nbsp;", " ")
    text = text.replace("&gt;", ">")
    text = text.replace("&lt;", "<")
    text = text.replace("&amp;", " ")
    text = text.replace("|", " ")
    text = text.replace("/", " ")
    text = text.replace("}}", " ")
    text = text.replace("{{", " ")
    
    text = text.replace("===", "")
    text = text.replace("==", "")
    text = text.replace("=", "")
    text = text.replace("'''", "")
    text = text.replace("''", "")
    text = text.replace(".", "")
    text = text.replace(",", "")
    text = text.replace("\"", "")
    text = text.replace("(", "")
    text = text.replace(")", "")
    text = text.replace("'", "")
    text = text.replace("?", "")
    text = text.replace(";", "")
    text = text.replace(":", "")
    text = text.replace("!", "")
    text = text.replace("*", "")
    
    return " ".join(text.split())

stripped_training_articles = map(lambda t: (t[0], remove_wiki_formatting(t[1])), training_articles)
stripped_validation_articles = map(lambda t: (t[0], remove_wiki_formatting(t[1])), validation_articles)
stripped_near_match_articles = map(lambda t: (t[0], remove_wiki_formatting(t[1])), near_match_articles)

Preparation of Data Sets

Now that we have each article in unformatted, plain text, we can extract the bags and prepare our test sets. The bags are found by scanning over the articles' text with a sliding window. For each word, we extract the word, its $n$ neighbors to the left, and its $n$ neighbors to the right. We represent each word set as a tuple containing an unordered set of the extracted words, the index of the center word, and the title of the article. The center word index and title of the article are used to identify the sources of the bags and verify the correctness of the classification.

We prepare training and validation sets for four data sets:

  • Random word sets -- Our null hypothesis. We want to look at the distributions for randomly-drawn word sets.
  • Perfect matches -- $H_0$: Duplicate word sets are placed in the training and validation sets along with randomly-drawn word sets. Mimic cases where there is 1 good perfect match among many random documents.
  • Off-by-1 matches -- $H_1$: Bags of words and their nearest neighbor are placed in the training and validation sets, respectively, along with randomly-drawn bags. Mimic cases where there is 1 match with 1 word differing among many random documents.
  • Off-by-2 matches -- $H_2$: Bags of words and their neighbor 2 words down are placed in the training and validation sets, respectively, along with randomly-drawn bags. Mimic cases where there is 1 match with 2 words differing among many random documents.

In each data set, we include not only the instances mimicking our test case but also randomly-drawn bags to mimic the normal "background noise" we would find.


In [3]:
def extract_word_sets(article, window_length):
    title, text = article
    words = text.split()
    word_sets = list()
    for i in xrange(len(words) - window_length + 1):
        word_set = ((title, i + ((window_length - 1) / 2)), frozenset(words[i:i+window_length]))
        word_sets.append(word_set)
    return word_sets

WINDOW_LENGTH = 7
training_article_word_sets = map(lambda article: extract_word_sets(article, WINDOW_LENGTH), stripped_training_articles)
validation_article_word_sets = map(lambda article: extract_word_sets(article, WINDOW_LENGTH), stripped_validation_articles)
near_match_article_word_sets = map(lambda article: extract_word_sets(article, WINDOW_LENGTH), stripped_near_match_articles)
training_word_sets = random.sample(reduce(lambda x, y: x + y, training_article_word_sets), 3000)
validation_word_sets = random.sample(reduce(lambda x, y: x + y, validation_article_word_sets), 3000)

word_set_pairs_0 = []
for article_word_sets in near_match_article_word_sets:
    for i in xrange(len(article_word_sets)):
        word_set_pairs_0.append((article_word_sets[i], article_word_sets[i]))

word_set_pairs_1 = []
for article_word_sets in near_match_article_word_sets:
    for i in xrange(len(article_word_sets) - 1):
        word_set_pairs_1.append(article_word_sets[i:i+2])
        
word_set_pairs_2 = []
for article_word_sets in near_match_article_word_sets:
    for i in xrange(len(article_word_sets) - 2):
        word_set_pairs_2.append((article_word_sets[i], article_word_sets[i+2]))
        

training_word_set_0 = training_word_sets[:1500]
validation_word_set_0 = validation_word_sets[:1500]
validation_set_labels_0 = dict()
for ws1, ws2 in random.sample(word_set_pairs_0, 1500):
    training_word_set_0.append(ws1)
    validation_word_set_0.append(ws2)
    validation_set_labels_0[ws2[0]] = ws1[0]
        
        
training_word_set_1 = training_word_sets[:1500]
validation_word_set_1 = validation_word_sets[:1500]
validation_set_labels_1 = dict()
for ws1, ws2 in random.sample(word_set_pairs_1, 1500):
    training_word_set_1.append(ws1)
    validation_word_set_1.append(ws2)
    validation_set_labels_1[ws2[0]] = ws1[0]
    
training_word_set_2 = training_word_sets[:1500]
validation_word_set_2 = validation_word_sets[:1500]
validation_set_labels_2 = dict()
for ws1, ws2 in random.sample(word_set_pairs_2, 1500):
    training_word_set_2.append(ws1)
    validation_word_set_2.append(ws2)
    validation_set_labels_2[ws2[0]] = ws1[0]
    
print len(training_word_sets), len(validation_word_sets)
print training_word_sets[0]


3000 3000
(('David Beckham', 9313), frozenset([u'england', u'beckham', u'lead', u'into', u'1\u20130', u'put', u'the']))

Examination of Word Set Distance Distributions

To determine the viability of using a Naive Bayes approach, we want to examine the distribution of overlaps between the word sets. The distance function is the number of words in $s_1$ that are not in $s_2$, i.e., the size of the set difference of $s_1$ and $s_2$:

$$ d_a(s_1, s_2) = | s_1 - s_2 | $$

In [4]:
from collections import defaultdict

def compute_asymm_dist_distr(word_set1, word_set2):
    min_dist_distribution = defaultdict(lambda: 0)
    for key, s1 in word_set1:
        min_dist = len(s1)
        for key2, s2 in word_set2:
            dist = 0
            for w in s1:
                if w not in s2:
                    dist += 1
            min_dist = min(min_dist, dist)
            if min_dist == 0:
                break
        min_dist_distribution[min_dist] += 1
    return min_dist_distribution

dist_random_distr = compute_asymm_dist_distr(validation_word_sets, training_word_sets)
dist_offset0_distr = compute_asymm_dist_distr(validation_word_set_0, training_word_set_0)
dist_offset1_distr = compute_asymm_dist_distr(validation_word_set_1, training_word_set_1)
dist_offset2_distr = compute_asymm_dist_distr(validation_word_set_2, training_word_set_2)

distances_random = list(sorted(dist_random_distr.keys()))
distances_offset0 = list(sorted(dist_offset0_distr.keys()))
distances_offset1 = list(sorted(dist_offset1_distr.keys()))
distances_offset2 = list(sorted(dist_offset2_distr.keys()))
densities_random = [dist_random_distr[d] / float(sum(dist_random_distr.values())) for d in distances_random]
densities_offset0 = [dist_offset0_distr[d] / float(sum(dist_offset0_distr.values())) for d in distances_offset0]
densities_offset1 = [dist_offset1_distr[d] / float(sum(dist_offset1_distr.values())) for d in distances_offset1]
densities_offset2 = [dist_offset2_distr[d] / float(sum(dist_offset2_distr.values())) for d in distances_offset2]

print densities_random

plt.plot(distances_random, densities_random, color="c", label="Random")
plt.hold(True)
plt.plot(distances_offset0, densities_offset0, color="r", label="Offset 0")
plt.plot(distances_offset1, densities_offset1, color="g", label="Offset 1")
plt.plot(distances_offset2, densities_offset2, color="b", label="Offset 2")
plt.xlabel("Distance (missing words)", fontsize=16)
plt.ylabel("Density", fontsize=16)
plt.legend(loc="upper right")


[0.0006666666666666666, 0.01, 0.099, 0.4073333333333333, 0.3863333333333333, 0.08833333333333333, 0.008333333333333333]
Out[4]:
<matplotlib.legend.Legend at 0x5697c90>

The background noise, or the probability of finding another bag of words with a minimum distance of $d$, will have a large impact on our ability to distinguish between true and false matches. With a window length of 7 words, we find that the probabilities (cyan) of finding another bag of words with the same words, 1 word different, and 2 words different in a randomly-chosen set are 0.07%, 1.0%, and 9.9%, respectively. A majority of the bags have a minimal distance between 3 and 6.

As true positives with differences of 0, 1, and 2 words are added into the test sets, we see that the distribution becomes a superposition of the original random distribution and the new distributions. This change suggests that the presence of true matches should be readily observable.

Naive Bayes Implementations

Below, we've implemented the multinomial, IDF, and Bernoulli variants discussed above.


In [5]:
from collections import defaultdict

class BaseNaiveBayes(object):
    def __init__(self):
        self.word_classes = defaultdict(list)
        self.total_classes = 0.0
        
    def train(self, labelled_classes):
        self.total_classes = float(len(labelled_classes))

        for label, word_set in labelled_classes:
            for w in word_set:
                self.word_classes[w].append((label, word_set))
                
        # Gotcha!  defaultdict inserts an empty entry whenever a
        # missing key is indexed (as classify() does below), so
        # record the vocabulary size now.
        self.n_total_features = len(self.word_classes)
                
    def _log_likelihood(self, class_word_set, instance_word_set, word):
        raise NotImplementedError

    def classify(self, instance):
        instance_label, instance_word_set = instance
        
        # base class -- no match
        best_match = (0.0, None)
        
        possible_classes = []
        for word in instance_word_set:
            possible_classes.extend(self.word_classes[word])
            
        for label, class_word_set in possible_classes:
            log_sum_likelihood = -np.log(self.total_classes)
            for word in instance_word_set:
                log_sum_likelihood += self._log_likelihood(class_word_set, instance_word_set, word)

            likelihood = np.exp(log_sum_likelihood)
            best_match = max((likelihood, label), best_match)
            
        return instance_label, best_match[1], best_match[0]

class MultinomialNaiveBayes(BaseNaiveBayes):
    def __init__(self, alpha=1.0):
        super(MultinomialNaiveBayes, self).__init__()
        self.alpha = alpha
        
    def _log_likelihood(self, class_word_set, instance_word_set, word):
        if word in class_word_set:
            occurence = 1.0
        else:
            occurence = 0.0
        
        n_class_features = float(len(class_word_set))
        log_likelihood = np.log(occurence + self.alpha) - np.log(n_class_features + self.alpha * self.n_total_features)
        
        return log_likelihood


class IDFNaiveBayes(BaseNaiveBayes):
    def _log_likelihood(self, class_word_set, instance_word_set, word):
        if word not in class_word_set:
            return np.log(1.0)
        
        n_word_classes = float(len(self.word_classes[word]))
        log_likelihood = np.log(self.total_classes) - np.log(n_word_classes)
        
        return log_likelihood
    
class BernoulliNaiveBayes(BaseNaiveBayes):
    def __init__(self, alpha=1.0):
        super(BernoulliNaiveBayes, self).__init__()
        self.alpha = alpha
                
    def _log_likelihood(self, class_word_set, instance_word_set, word):
        if word in class_word_set:
            occurence = 1.0
        else:
            occurence = 0.0
        
        n_class_features = float(len(class_word_set))
        log_likelihood = np.log(occurence + self.alpha) - np.log(n_class_features + self.alpha * self.n_total_features)
        
        if word in instance_word_set:
            return log_likelihood
        else:
            return np.log(1.0 - np.exp(log_likelihood))

    def classify(self, instance):
        instance_label, instance_word_set = instance
        
        # base class -- no match
        best_match = (0.0, None)
        
        possible_classes = []
        for word in instance_word_set:
            possible_classes.extend(self.word_classes[word])
            
        for label, class_word_set in possible_classes:
            log_sum_likelihood = -np.log(self.total_classes)
            for word in self.word_classes.iterkeys():
                log_sum_likelihood += self._log_likelihood(class_word_set,
                                                           instance_word_set,
                                                           word)

            likelihood = np.exp(log_sum_likelihood)
            best_match = max((likelihood, label), best_match)
            
        return instance_label, best_match[1], best_match[0]

Evaluating Naive Bayes Implementations

We'll evaluate the multinomial and IDF Naive Bayes classifiers using the test sets containing true matches at distances of 0, 1, and 2 words.


In [8]:
mnb = MultinomialNaiveBayes(alpha=1.0)
mnb.train(training_word_set_0)
offset0_word_set_classifications = map(lambda instance: mnb.classify(instance), validation_word_set_0)

mnb = MultinomialNaiveBayes(alpha=1.0)
mnb.train(training_word_set_1)
offset1_word_set_classifications = map(lambda instance: mnb.classify(instance), validation_word_set_1)

mnb = MultinomialNaiveBayes(alpha=1.0)
mnb.train(training_word_set_2)
offset2_word_set_classifications = map(lambda instance: mnb.classify(instance), validation_word_set_2)

def plot_roc_curve(*triplets):
    colors = ["r", "g", "b",]
    c = 0
    for label, word_classifications, mappings in triplets:
        sorted_classif = sorted([(prob, orig, classif) for orig, classif, prob in word_classifications])
        sorted_classif.reverse()
        tp = 0.0
        fp = 0.0
        xs = []
        ys = []
        for _, orig, classif in sorted_classif:
            if orig not in mappings:
                continue
            if mappings[orig] == classif:
                tp += 1.0
            else:
                fp += 1.0
            xs.append(fp / 1500.0)
            ys.append(tp / 1500.0)
        xs.append(1.0)
        ys.append(1.0)
        print label, tp / 1500.0, fp / 1500.0
        plt.plot(xs, ys, color=colors[c], label=label)
        c += 1
        plt.hold(True)
    plt.xlabel("Incorrect Class", fontsize=16)
    plt.ylabel("Correct Class", fontsize=16)
    plt.legend(loc="lower right")

triplets = [("Offset 0", offset0_word_set_classifications, validation_set_labels_0),
            ("Offset 1", offset1_word_set_classifications, validation_set_labels_1), 
            ("Offset 2", offset2_word_set_classifications, validation_set_labels_2)]
            
plot_roc_curve(*triplets)


Offset 0 0.998666666667 0.00133333333333
Offset 1 0.933333333333 0.0666666666667
Offset 2 0.858 0.142

With the multinomial NB model, we get true positive rates of 99.9%, 93.3%, and 85.8% for the test sets with 0, 1, and 2 word differences, respectively.


In [9]:
idfnb = IDFNaiveBayes()
idfnb.train(training_word_set_0)
offset0_word_set_classifications = map(lambda instance: idfnb.classify(instance), validation_word_set_0)

idfnb = IDFNaiveBayes()
idfnb.train(training_word_set_1)
offset1_word_set_classifications = map(lambda instance: idfnb.classify(instance), validation_word_set_1)

idfnb = IDFNaiveBayes()
idfnb.train(training_word_set_2)
offset2_word_set_classifications = map(lambda instance: idfnb.classify(instance), validation_word_set_2)

triplets = [("Offset 0", offset0_word_set_classifications, validation_set_labels_0),
            ("Offset 1", offset1_word_set_classifications, validation_set_labels_1), 
            ("Offset 2", offset2_word_set_classifications, validation_set_labels_2)]
            
plot_roc_curve(*triplets)


Offset 0 0.995333333333 0.00466666666667
Offset 1 0.93 0.07
Offset 2 0.860666666667 0.139333333333

With the IDF NB model, we get true positive rates of 99.5%, 93.0%, and 86.1% for the test sets with 0, 1, and 2 word differences, respectively. The IDF and multinomial models perform similarly, suggesting that either would be a good choice.

Conclusion

By this point, we've discussed quite a bit. We reviewed plagiarism detection and the problem of identifying copied text. We looked at the bag of words model for feature extraction and showed how it works with a simple example. We looked at Naive Bayes classifiers and three different variants, the multinomial, IDF, and Bernoulli, and demonstrated how the calculations are performed by hand. We then implemented all three classifiers and evaluated the multinomial and IDF variants on Wikipedia articles. We found that both the IDF and multinomial models performed well, correctly classifying roughly 86% to 99.9% of matching bags, depending on the number of words that had been changed.

As a preliminary example, there is still a lot of work to do. We didn't consider how to handle misspellings. One approach would be to cluster the words so that variants such as "definitely" and "defenitely" are grouped together. Then, all of the variants could be replaced by a single canonical version.
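
As a rough sketch of that idea, we could map each token onto its closest known vocabulary word using difflib from the standard library (the function canonicalize and the cutoff value are only illustrative; a real system would want a more careful edit-distance or phonetic clustering):

import difflib

def canonicalize(word, vocabulary, cutoff=0.8):
    # Replace a possibly misspelled word with the closest vocabulary word,
    # falling back to the original word if nothing is similar enough.
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

vocabulary = ["definitely", "separate", "necessary"]
print(canonicalize("defenitely", vocabulary))  # prints "definitely"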

We could also improve our validation. From a statistical standpoint, we should perform multiple samples for each data set and compute the variation in the distributions and error rates.

We also didn't test on real data -- we built synthetic data sets from the Wikipedia articles. Since the bags were sampled from a small set of articles, it could be possible that words have a higher repetition rate than if we sampled from all of Wikipedia. For example, an article on psychology is likely to use the word psychology multiple times. Our results could be better in the more general case.

