Hashing Vectorizer example

In this notebook, we'll experiment with HashingVectorizer by building a classifier that predicts whether a text chunk comes from the English Wikipedia article "Anarchism" or "Anachronism".

Preamble

First, let's import everything that we'll need.


In [1]:
import random

import mwapi
import mwparserfromhell as mwparser
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import HashingVectorizer

Get the text from the API


In [2]:
session = mwapi.Session("https://en.wikipedia.org", 
                        user_agent="Hashing vectorizer example <aaron.halfaker@gmail.com>")

doc = session.get(action="query", prop="revisions", titles=["Anarchism", "Anachronism"], rvprop=['content'])

anarchism_text = doc['query']['pages']['12']['revisions'][0]['*']
anachronism_text = doc['query']['pages']['60731']['revisions'][0]['*']
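
If you'd rather not hard-code page IDs, you can unpack the same response by iterating over the returned pages. A minimal sketch, assuming each page record carries its 'title' key (the query API includes it by default):

In [ ]:
# Sketch: unpack the response without hard-coding page IDs.
texts_by_title = {page['title']: page['revisions'][0]['*']
                  for page in doc['query']['pages'].values()}
anarchism_text = texts_by_title['Anarchism']
anachronism_text = texts_by_title['Anachronism']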

Build up a set of observations


In [3]:
observations = []
for text_chunk in mwparser.parse(anarchism_text).filter_text():
    text = text_chunk.value
    if len(text) > 25:
        observations.append((text, "anarchism"))
for text_chunk in mwparser.parse(anachronism_text).filter_text():
    text = text_chunk.value
    if len(text) > 25:
        observations.append((text, "anachronism"))

print("anarchism paragraphs:", sum(1 for _, label in observations if label == "anarchism"))
print("anachronism paragraphs:", sum(1 for _, label in observations if label == "anachronism"))


anarchism paragraphs: 1440
anachronism paragraphs: 116

Split into train/test


In [4]:
random.shuffle(observations)
train_set = observations[:int(len(observations)*0.8)]
test_set = observations[int(len(observations)*0.8):]
len(train_set), len(test_set)


Out[4]:
(1244, 312)
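
With classes this imbalanced, a plain shuffle can leave the test set with very few "anachronism" chunks. As an alternative sketch, scikit-learn's train_test_split can stratify on the label so both splits keep the same class ratio:

In [ ]:
# Sketch: a stratified 80/20 split preserves the anarchism/anachronism
# ratio in both the train and test sets.
from sklearn.model_selection import train_test_split
all_texts, all_labels = zip(*observations)
train_texts, test_texts, train_labels, test_labels = train_test_split(
    all_texts, all_labels, test_size=0.2, stratify=all_labels, random_state=0)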

HashingVectorizer and prediction model

We'll use a gradient boosting model, which ought to work reasonably well on these sparse hashed features without much tuning.


In [13]:
hv = HashingVectorizer(n_features=2**16)
gbc = GradientBoostingClassifier()
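
To see what the vectorizer does, note that transform() hashes token counts into a fixed-width sparse matrix without ever building a vocabulary. A quick illustrative check (the example strings are made up):

In [ ]:
# Sketch: hashing two short strings yields a (2, 65536) sparse matrix;
# since no vocabulary is stored, unseen words never cause an error.
demo_X = hv.transform(["anarchism is a political philosophy",
                       "an anachronism is a chronological inconsistency"])
print(demo_X.shape)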

Train the classifier

We'll set each observation's sample weight inversely proportional to its class frequency, so the rarer "anachronism" chunks count more heavily during training (a sketch for deriving these weights from the labels follows the training cell).


In [14]:
# Training: hash the raw text chunks into sparse feature vectors, then fit
# the classifier with inverse-frequency sample weights (note the class
# counts are hard-coded here)
texts, labels_y = zip(*train_set)
features_X = hv.transform(texts)
gbc.fit(features_X, labels_y, 
        sample_weight=[119/(119+1433) if l == "anarchism" else 1433/(119+1433) for l in labels_y])


Out[14]:
GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=3, max_features=None, max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)
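
The class counts in the cell above are hard-coded. Here's a sketch that derives inverse-frequency weights from the training labels themselves:

In [ ]:
# Sketch: with two classes, weighting each observation by the share of the
# *other* class gives the same inverse-frequency scheme as the hand-written
# weights above.
from collections import Counter
counts = Counter(labels_y)
total = sum(counts.values())
weights = [(total - counts[l]) / total for l in labels_y]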

Test the classifier

The score() method computes simple accuracy. This result suggests that we predict ~95% of the test set correctly.


In [15]:
# Testing
texts, labels_y = zip(*test_set)
features_X = hv.transform(texts)
gbc.score(features_X.todense(), labels_y)


Out[15]:
0.95192307692307687
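
Accuracy alone can be misleading with classes this imbalanced; always predicting "anarchism" would already score roughly 92% here. A sketch of per-class precision and recall:

In [ ]:
# Sketch: per-class precision/recall gives a fuller picture than accuracy,
# especially for the rare "anachronism" class.
from sklearn.metrics import classification_report
predictions = gbc.predict(features_X.todense())
print(classification_report(labels_y, predictions))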

Statistics are great, but let's look at some example predictions. This loop prints the predicted class probabilities (as percentages) for the first ten test observations.


In [16]:
for text, label in test_set[:10]:
    features_X = hv.transform([text])
    print("text:", repr(text[:50] + "..."), "\n",
          "\tactual:", label, "\n",
          "\tprediction:", dict(zip(gbc.classes_, 
                                    [int(v*100) for v in gbc.predict_proba(features_X.todense())[0]])))


text: ' Tolstoy established a conceptual difference betwe...' 
 	actual: anarchism 
 	prediction: {'anachronism': 28, 'anarchism': 71}
text: 'http://newleftreview.org/II/28/benedict-anderson-i...' 
 	actual: anarchism 
 	prediction: {'anachronism': 63, 'anarchism': 36}
text: '– an online collection of news and information abo...' 
 	actual: anarchism 
 	prediction: {'anachronism': 8, 'anarchism': 91}
text: 'Communist Party of Spain (main)...' 
 	actual: anarchism 
 	prediction: {'anachronism': 28, 'anarchism': 71}
text: 'thumb|left|May day demonstration of Spanish ...' 
 	actual: anarchism 
 	prediction: {'anachronism': 28, 'anarchism': 71}
text: 'Confederación Nacional del Trabajo...' 
 	actual: anarchism 
 	prediction: {'anachronism': 28, 'anarchism': 71}
text: 'http://www.theanarchistlibrary.org/HTML/Murray_Boo...' 
 	actual: anarchism 
 	prediction: {'anachronism': 10, 'anarchism': 89}
text: '"T.A.Z.: The Temporary Autonomous Zone, Ontologica...' 
 	actual: anarchism 
 	prediction: {'anachronism': 33, 'anarchism': 66}
text: 'http://www.libertarian.co.uk/lapubs/polin/polin168...' 
 	actual: anarchism 
 	prediction: {'anachronism': 16, 'anarchism': 83}
text: 'Short history of the IAF-IFA...' 
 	actual: anarchism 
 	prediction: {'anachronism': 33, 'anarchism': 66}

Feature selection

Let's build a histogram of the feature importance measurements.


In [23]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(gbc.feature_importances_, bins=100, log=True)
plt.title("Feature importance histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")


Out[23]:
<matplotlib.text.Text at 0x7f39ac43c6a0>
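
The hashing trick is one-way, so feature indices don't come labeled with words. A rough workaround (a sketch, not part of the original analysis): hash each token we've seen and report the ones that land on high-importance indices, keeping in mind that collisions mean several tokens can share an index.

In [ ]:
# Sketch: find candidate tokens for the most important features by hashing
# every token we've seen and checking which column each one maps to.
import numpy as np

top_indices = set(np.argsort(gbc.feature_importances_)[-20:])
seen_tokens = {token.lower() for text, _ in observations for token in text.split()}
for token in sorted(seen_tokens):
    columns = hv.transform([token]).nonzero()[1]
    hits = top_indices.intersection(columns)
    if hits:
        print(token, "->", sorted(hits))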
