Wikipedia analysis fun times

What follows is a mini data science project with two main parts:

  • web scraping Wikipedia with a very simple scraper.
  • building a classifier that predicts which category a Wikipedia article belongs to.

Part 1: What is interesting about Wikipedia data?

The obvious information in a Wikipedia article is the text itself. Some information is also encoded in the structure of the document, such as headings and subheadings. There is interesting information in the links too, which form a graph over the topics in Wikipedia. Further sources of information include the pictures, the references, and the editors (another graph?), to name only a few.

For simplicity, we are only going to look at the words and maybe the link graph as a bonus.

When extracting the words from the wiki pages, we will keep only the words themselves, not the tags or any document structure. One result of this is that there will be noise in the data: for example, every document will contain the words from the left-hand navigation menu. These will mostly just be irrelevant features, but it is something to keep in mind if you ever want to classify a document about, say, navigation bars.
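
For concreteness, here is a minimal sketch of what a parse_html_simple-style extractor could look like. This is an illustration only -- the real implementation lives in wiki_scraper and may differ -- and it assumes BeautifulSoup 4 is available.

In [ ]:
# Hypothetical stand-in for parse_html_simple: strip the tags, keep lowercase words.
# Assumes BeautifulSoup 4; the actual wiki_scraper implementation may differ.
import re
from bs4 import BeautifulSoup

def parse_html_words(html):
    # get_text() drops all markup, including the navigation menu,
    # which is why menu words end up as features in every document
    text = BeautifulSoup(html, 'html.parser').get_text()
    return [word.lower() for word in re.findall(r'[A-Za-z]+', text)]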

Get the data, preprocess it, and do some data cleaning

We'll keep it simple and not use n-grams, TF-IDF, or any other transformations that tend to give better results. The data set is fairly small, so n-grams could exacerbate an already underdetermined problem.
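
For reference, here is roughly how n-grams and TF-IDF could be wired in with scikit-learn if you wanted to try them later; this is a sketch and is not used anywhere below.

In [ ]:
# Sketch only (not used below): unigrams + bigrams followed by TF-IDF weighting,
# feeding the same multinomial naive Bayes model used in the rest of the notebook.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

ngram_tfidf_nb = Pipeline([
    ('counts', CountVectorizer(ngram_range=(1, 2))),  # unigrams and bigrams
    ('tfidf', TfidfTransformer()),                     # reweight the raw counts
    ('nb', MultinomialNB()),
])
# ngram_tfidf_nb.fit(string_data, labels) would train it on the raw document strings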


In [12]:
# load all of the imports
%matplotlib inline
import numpy
from wiki_scraper import (
    parse_html_simple,
    crawl_page)
from parallel_webscrape import scrape
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import cross_validation
from sklearn.metrics import mean_absolute_error, classification_report
from sklearn.feature_extraction.text import TfidfTransformer
from matplotlib import pyplot
from sklearn.metrics import (
    roc_curve,
    auc,
    precision_score,
    recall_score,
    f1_score)
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.cross_validation import KFold

In [2]:
# get the data set for a list of pages
# Uncomment to scrape wikipedia
# links = ['/wiki/Category:Rare_diseases',
#          '/wiki/Category:Infectious_diseases',
#          '/wiki/Category:Cancer',
#          '/wiki/Category:Congenital_disorders',
#          '/wiki/Category:Organs',
#          '/wiki/Category:Machine_learning_algorithms',
#          '/wiki/Category:Medical_devices']
# domains = ['http://en.wikipedia.org'] * len(links)
# scrape(domains, links)
categories = pickle.load(open('categories.pickle', 'rb'))
link_edges = pickle.load(open('link_edges.pickle', 'rb'))

In [3]:
# make some dictionaries to preprocess the words
english_words = set()
with open("american-english.txt") as english_dictionary:
    english_words = set(word.strip().lower() for word in english_dictionary)

stop_words = set()
with open("english_stopwords.txt") as stopwords:
    stop_words = set(word.strip().lower() for word in stopwords)

In [4]:
# parse the html, turning it into a list of words
# and removing stop words and non-dictionary words
# we'll also collect all of the words so that we can make a map of words to numbers

all_words = set()
# the category level
for k, v in categories.iteritems():
    # the document level
    for inner_k, inner_document in v.iteritems():
        # parse the html to get lists of words per document
        words = parse_html_simple(inner_document)
        parsed = []
        for word in words:
            if word in english_words and word not in stop_words:
                all_words.add(word)
                parsed.append(word)
        categories[k][inner_k] = parsed

In [7]:
# aggregate all of the documents into one big data set while transforming them into counts
count_vect = CountVectorizer(vocabulary=list(all_words))

# tfidf_vect = TfidfVectorizer(vocabulary=list(all_words))  # (TfidfVectorizer, not TfidfTransformer, takes a vocabulary)
count_data = []
string_data = []
labels = []

# the category level
for k, v in categories.iteritems():
    # the document level
    for inner_k, inner_document in v.iteritems():
        # oops, we actually need this in string format
        string_data.append(' '.join(inner_document))
        labels.append(k)

# transform the string data into count data
count_data = count_vect.transform(string_data).todense()

# transform count_data and labels into numpy arrays for easy indexing
count_data = numpy.array(count_data)
labels = numpy.array(labels)

# make a map from the string label to a number and vice versa
label_map = {}
reverse_label_map = {}
for i, label in enumerate(sorted(set(labels))):
    reverse_label_map[i] = label
    label_map[label] = i

In [8]:
# Oh, hi there curse of dimensionality.
# We need a lot more data if we want to create a reliable model
# or extract information from rare features
print(count_data.shape)


(814, 21648)

Now let's build a model

We are going to train a multinomial naive Bayes model on a data set composed of the word counts for each document along with its category label. We will evaluate the model with 10-fold cross-validation and look at the error and some examples.

Multinomial naive Bayes is about the simplest thing we could do: we essentially fit a multinomial distribution over the word counts for each class, then calculate the probability that a document was generated by each of those multinomials.
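
To make that concrete, here is a hand-rolled version of the decision rule (with Laplace smoothing, which is what MultinomialNB uses by default). It is for intuition only; the actual model below comes from scikit-learn.

In [ ]:
# Hand-rolled multinomial naive Bayes scoring, for intuition only.
# x is a word-count vector for one document; class_counts[c] is the summed
# count vector for class c; class_priors[c] is P(c).
import numpy

def mnb_log_scores(x, class_counts, class_priors, alpha=1.0):
    scores = []
    for counts, prior in zip(class_counts, class_priors):
        # smoothed per-word probabilities for this class
        theta = (counts + alpha) / (counts.sum() + alpha * len(counts))
        # log P(c) + sum_w x_w * log theta_{c,w}
        scores.append(numpy.log(prior) + numpy.dot(x, numpy.log(theta)))
    return scores  # the predicted class is the argmax of these scores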

What features are important for each category?


In [13]:
from sklearn.cross_validation import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    count_data, labels,
    test_size=0.2,
    random_state=numpy.random.randint(100))

mnb = MultinomialNB()
mnb.fit(x_train, y_train)
predicted = mnb.predict(x_test)

# print the top 10 words in each class
print("Best features of each class:")
feature_names = numpy.array(count_vect.get_feature_names())
for row in mnb.coef_:
    # sort the feature indices by log-probability, highest first
    best_feature_inds = row.argsort()[::-1]
    print(feature_names[best_feature_inds[:10]])
    
print("Some metrics:")
print(classification_report(y_test, predicted))


Best features of each class:
['cancer' 'wikipedia' 'page' 'edit' 'alcohol' 'risk' 'category' 'article'
 'pages' 'may']
['syndrome' 'wikipedia' 'edit' 'page' 'may' 'function' 'congenital'
 'disease' 'article' 'navigation']
['disease' 'wikipedia' 'infection' 'diseases' 'may' 'edit' 'virus' 'page'
 'health' 'infectious']
['wikipedia' 'learning' 'function' 'page' 'edit' 'pages' 'algorithm'
 'articles' 'article' 'category']
['wikipedia' 'edit' 'medical' 'philips' 'devices' 'stimulation' 'page'
 'heart' 'retrieved' 'device']
['wikipedia' 'edit' 'page' 'liver' 'skin' 'category' 'function' 'system'
 'article' 'fish']
['syndrome' 'wikipedia' 'disease' 'edit' 'page' 'may' 'function' 'also'
 'articles' 'article']
Some metrics:
             precision    recall  f1-score   support

/wiki/Category:Cancer       0.40      0.14      0.21        14
/wiki/Category:Congenital_disorders       0.45      0.55      0.49        38
/wiki/Category:Infectious_diseases       0.75      0.75      0.75        28
/wiki/Category:Machine_learning_algorithms       0.29      0.67      0.40         9
/wiki/Category:Medical_devices       0.70      0.37      0.48        19
/wiki/Category:Organs       0.88      0.47      0.61        15
/wiki/Category:Rare_diseases       0.64      0.70      0.67        40

avg / total       0.60      0.56      0.56       163

10-fold cross-validation to get a sense of how reliable the results are

Evaluate the model on stratified random folds -- we want each category to be proportionally represented in case the classes aren't balanced.
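
As a quick sanity check before the full evaluation: each stratified fold should mirror the overall label proportions. A small sketch, using the labels array and the same (now-deprecated) cross_validation API as the rest of the notebook:

In [ ]:
# Sanity check: each stratified fold should roughly preserve the class proportions.
from collections import Counter

print(Counter(labels))
for train_index, test_index in cross_validation.StratifiedKFold(labels, n_folds=10):
    print(Counter(labels[test_index]))
    break  # one fold is enough to illustrate the point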


In [15]:
# we need numeric labels for this library
numeric_labels = [label_map[l] for l in labels]

# Binarize the output
y = label_binarize(numeric_labels, classes=range(len(set(labels))))
n_classes = y.shape[1]

kfold = cross_validation.StratifiedKFold(labels, n_folds=10)
for train_index, test_index in kfold:
    # Learn to predict each class against the other
    mnb = MultinomialNB()
    classifier = OneVsRestClassifier(mnb)

    y_score = classifier.fit(count_data[train_index],
                             y[train_index]).predict_proba(count_data[test_index])
    
    # Compute ROC curve and ROC area for each class
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y[test_index][:, i], y_score[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])
    
    # Compute micro-average ROC curve and ROC area
    fpr["micro"], tpr["micro"], _ = roc_curve(y[test_index].ravel(), y_score.ravel())
    roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
    
# Plot the ROC curve for one class; fpr/tpr hold the values from the last fold
f = pyplot.figure()
pyplot.plot(fpr[2], tpr[2],
            label='ROC curve for %s (area = %0.2f)' % (reverse_label_map[2], roc_auc[2]))
pyplot.plot([0, 1], [0, 1], 'k--')
pyplot.xlim([0.0, 1.0])
pyplot.ylim([0.0, 1.05])
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.title('ROC curve for a single class (last fold)')
pyplot.legend(loc="lower right")
f.set_size_inches(18.5, 10.5)
pyplot.show()

# Plot the ROC curve of every class (again from the last fold), plus the micro-average
f = pyplot.figure()
pyplot.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["micro"]))
for i in range(n_classes):
    pyplot.plot(fpr[i], tpr[i], label='ROC curve of class {0} (area = {1:0.2f})'
                                   ''.format(reverse_label_map[i], roc_auc[i]))
f.set_size_inches(18.5, 10.5)

pyplot.plot([0, 1], [0, 1], 'k--')
pyplot.xlim([0.0, 1.0])
pyplot.ylim([0.0, 1.05])
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.title('ROC curve per class')
pyplot.legend(loc="lower right")
pyplot.show()


The AUC is the probability that a randomly selected instance of a given class will receive a higher score for that class than a randomly selected instance of a different class. This particular classifier appears to have trouble identifying articles about cancer. The cross-validation ran over ten folds, so the behaviour here is not a fluke of a single split. Considering the size of the data and the number and sparsity of the features, I'd consider this OK for such a naive approach, but I'd rather have a much larger data set to work with.
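
That rank interpretation can be checked directly on a toy example: the AUC equals the fraction of (positive, negative) pairs that the scores rank correctly. The y_bin and s values below are made up purely for illustration.

In [ ]:
# Illustration: AUC as a pairwise ranking probability.
# y_bin and s are hypothetical binary labels and scores, not data from this notebook.
from sklearn.metrics import roc_auc_score

def pairwise_auc(y_bin, s):
    pos = [score for label, score in zip(y_bin, s) if label == 1]
    neg = [score for label, score in zip(y_bin, s) if label == 0]
    wins = sum(1.0 for p in pos for n in neg if p > n)
    ties = sum(0.5 for p in pos for n in neg if p == n)
    return (wins + ties) / (len(pos) * len(neg))

y_bin = [0, 0, 1, 1, 1]
s = [0.1, 0.4, 0.35, 0.8, 0.9]
print(pairwise_auc(y_bin, s), roc_auc_score(y_bin, s))  # both should be ~0.83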

Now we'll take a look at how the classifier is wrong instead of just how wrong it is.


In [19]:
# Evaluate the model on stratified random folds -- we want each category
# to be proportionally represented in case the classes aren't balanced
# in order to get a really good picture of the bias variance tradeoff for this algorithm, we
# could do multiple iterations of 10-fold cross-validation and plot the error

n_iterations = 1
true_class_probabilities = []
kfold = cross_validation.StratifiedKFold(
    labels,
    n_folds=10, shuffle=True,
    random_state=numpy.random.randint(100))
for iteration in range(n_iterations):
    print("iteration", iteration)
    for train_index, test_index in kfold:
        
        mnb = MultinomialNB()
        mnb.fit(count_data[train_index], labels[train_index])
        probs = mnb.predict_proba(count_data[test_index])
        for index, instance_p in enumerate(probs):
            label = labels[test_index[index]]
            true_class_probabilities.append(
                (instance_p[label_map[label]],
                 label))


('iteration', 0)

In [17]:
# Plot a histogram of the probability assigned to the true class
# for each document

# separate the probs per class
true_class_probabilities = sorted(true_class_probabilities,
                                  key=lambda tup: tup[1])
buckets = []
for label in label_map.keys():
    label_probs = []
    for item in true_class_probabilities:
        if item[1] == label:
            label_probs.append(item[0])
    buckets.append(label_probs)

# make the figure    
f, ax = pyplot.subplots(1, 1)
ax.set_ylabel('Frequency')
ax.set_xlabel('Probability of true class')

n, bins, patches = pyplot.hist(buckets, 10, histtype='bar',
                               label=label_map.keys())
f.set_size_inches(18.5, 10.5)
pyplot.legend(loc=2)
pyplot.show()


For three of the classes (Infectious_diseases, Rare_diseases, and Congenital_disorders) the classifier is very confident about two thirds of the time. When the classifier is wrong, it is very wrong (look at the leftmost mode). Interestingly, there is also a middle mode of sorts for the three best-discriminated classes. This may be due to features shared between the three classes -- they are, after all, all disease categories, and a rare disease can also be infectious. It could also be due to poor data cleaning.
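
One rough way to probe the shared-features hypothesis is to look at how much the top features of those three classes overlap, reusing the mnb model and feature_names fitted above (a sketch, not a rigorous test):

In [ ]:
# Rough check of the shared-features hypothesis: how many of the top 100 features
# do the three disease classes have in common? Uses mnb and feature_names from above.
top_features = {}
for class_label, row in zip(mnb.classes_, mnb.coef_):
    top_features[class_label] = set(feature_names[row.argsort()[::-1][:100]])

disease_classes = ['/wiki/Category:Infectious_diseases',
                   '/wiki/Category:Rare_diseases',
                   '/wiki/Category:Congenital_disorders']
shared = top_features[disease_classes[0]] & \
         top_features[disease_classes[1]] & \
         top_features[disease_classes[2]]
print(len(shared))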

Other interesting things to do.

Wikipedia is a human-curated knowledge base, which means that people with opinions and biases determined the category hierarchy. It would be interesting to see how the results would differ if we inferred the categories with a topic model instead. Unfortunately, that's all of our time for now. For those interested in doing a network analysis, the web scraper does construct an adjacency list of the links between pages.
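
As a possible starting point for the topic model idea, something like this could be run on the count matrix we already have. This is only a sketch, and it assumes a scikit-learn version that ships LatentDirichletAllocation (0.17 or later; the parameter is named n_components in newer releases).

In [ ]:
# Sketch: infer topics from the same count matrix instead of relying on
# Wikipedia's category hierarchy. Assumes sklearn >= 0.17 for LatentDirichletAllocation
# (the n_topics argument is called n_components in newer versions).
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_topics=7, random_state=0)
doc_topics = lda.fit_transform(count_data)

# print the top words of each inferred topic
for topic_idx, topic in enumerate(lda.components_):
    print(topic_idx, feature_names[topic.argsort()[::-1][:10]])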


In [ ]: