Visual Diagnosis of Text Analysis with Baleen

This notebook was created as part of the Yellowbrick user study. I hope to explore how visual methods might improve the workflow of text classification on a small-to-medium-sized corpus.

Dataset

The dataset used in this study is a sample of the Baleen corpus. Baleen, an open source ingestion system, has been collecting RSS feeds on the hour from a variety of topical feeds since March 2016, including news, hobbies, and politics, and the corpus currently contains over 1.2M posts from 373 feeds. Baleen has a sister library, Minke, that provides multiprocessing support for dealing with gigabytes of text.

The dataset I'll use in this study is a sample of the larger corpus containing 68,052 documents, or roughly 6% of the total. For this test, I've chosen to use the preprocessed corpus, which means I won't have to do any tokenization, but I can still apply normalization techniques. The corpus is described as follows:

Baleen corpus contains 68,052 files in 12 categories. Structured as:

  • 1,200,378 paragraphs (17.639 mean paragraphs per file)
  • 2,058,635 sentences (1.715 mean sentences per paragraph).

Word count of 44,821,870 with a vocabulary of 303,034 (lexical diversity of 147.910, i.e. total words divided by vocabulary size).

Category Counts:

  • books: 1,700 docs
  • business: 9,248 docs
  • cinema: 2,072 docs
  • cooking: 733 docs
  • data science: 692 docs
  • design: 1,259 docs
  • do it yourself: 2,620 docs
  • gaming: 2,884 docs
  • news: 33,253 docs
  • politics: 3,793 docs
  • sports: 4,710 docs
  • tech: 5,088 docs

This is quite a lot of data, so for now we'll simply create a classifier for the "hobbies" categories: books, cinema, cooking, diy, gaming, and sports.

Note: this data set is not currently publicly available, but I am happy to provide it on request.


In [1]:
%matplotlib inline

In [2]:
import os
import sys
import nltk
import pickle

# Append the repository root so the local yellowbrick package can be imported
sys.path.append("../..")

Loading Data

In order to load the data, I'd typically use a CorpusReader. However, for the sake of simplicity, I'll load it using some simple Python generator functions. I need two primary functions (plus a small helper that lists file paths): the first loads the documents using pickle, and the second yields the target labels for supervised learning.


In [3]:
CORPUS_ROOT = os.path.join(os.getcwd(), "data") 
CATEGORIES = ["books", "cinema", "cooking", "diy", "gaming", "sports"]

def fileids(root=CORPUS_ROOT, categories=CATEGORIES): 
    """
    Fetch the paths, filtering on categories (pass None for all). 
    """
    for name in os.listdir(root):
        dpath = os.path.join(root, name)
        if not os.path.isdir(dpath):
            continue 
        
        if categories and name in categories: 
            for fname in os.listdir(dpath):
                yield os.path.join(dpath, fname)


def documents(root=CORPUS_ROOT, categories=CATEGORIES):
    """
    Load the pickled documents and yield one at a time. 
    """
    for path in fileids(root, categories):
        with open(path, 'rb') as f:
            yield pickle.load(f)


def labels(root=CORPUS_ROOT, categories=CATEGORIES):
    """
    Return a list of the labels associated with each document. 
    """            
    for path in fileids(root, categories):
        dpath = os.path.dirname(path) 
        yield dpath.split(os.path.sep)[-1]
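
As a quick sanity check on these generators, I can peek at the first document and its label. This is just a sketch; it assumes the corpus directory exists locally and that each pickled document has the paragraphs → sentences → (token, tag) structure that the preprocessed Baleen corpus provides.

In [ ]:
# Sanity check (sketch): peek at the first document and its label.
# Assumes the corpus lives under CORPUS_ROOT and each pickle holds a
# list of paragraphs, each a list of sentences of (token, tag) pairs.
doc   = next(documents())
label = next(labels())

print(label)
print(doc[0][0][:10])   # first paragraph, first sentence, first 10 (token, tag) pairs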

Feature Extraction and Normalization

In order to conduct analyses with Scikit-Learn, I'll need some helper transformers that convert the loaded documents into a form usable by the sklearn.feature_extraction text vectorizers. I'll mostly be using CountVectorizer and TfidfVectorizer, so this normalization transformer and the identity function help a lot.


In [4]:
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer 
from unicodedata import category as ucat
from nltk.corpus import stopwords as swcorpus
from sklearn.base import BaseEstimator, TransformerMixin 


def identity(args):
    """
    The identity function is used as the "tokenizer" for 
    pre-tokenized text. It just passes back its arguments.
    """
    return args 


def is_punctuation(token):
    """
    Returns true if all characters in the token are
    unicode punctuation (works for most punct). 
    """
    return all(
        ucat(c).startswith('P')
        for c in token 
    )


def wnpos(tag):
    """
    Returns the wn part of speech tag from the penn treebank tag. 
    """
    return {
        "N": wn.NOUN,
        "V": wn.VERB,
        "J": wn.ADJ, 
        "R": wn.ADV, 
    }.get(tag[0], wn.NOUN)


class TextNormalizer(BaseEstimator, TransformerMixin):
    
    def __init__(self, stopwords='english', lowercase=True, lemmatize=True, depunct=True):
        self.stopwords  = frozenset(swcorpus.words(stopwords)) if stopwords else frozenset()
        self.lowercase  = lowercase 
        self.depunct    = depunct 
        self.lemmatizer = WordNetLemmatizer() if lemmatize else None 
    
    def fit(self, docs, labels=None):
        return self

    def transform(self, docs): 
        for doc in docs: 
            yield list(self.normalize(doc)) 
    
    def normalize(self, doc):
        for paragraph in doc:
            for sentence in paragraph:
                for token, tag in sentence: 
                    if token.lower() in self.stopwords:
                        continue 
                    
                    if self.depunct and is_punctuation(token):
                        continue 
                    
                    if self.lowercase:
                        token = token.lower() 
                    
                    if self.lemmatizer:
                        token = self.lemmatizer.lemmatize(token, wnpos(tag))
                    
                    yield token
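
To make the normalization step concrete, here is a minimal sketch of TextNormalizer applied to a tiny hand-made, pre-tagged document (assuming the NLTK stopwords and wordnet corpora have already been downloaded):

In [ ]:
# Minimal sketch: normalize a hand-made, pre-tagged document.
# Assumes nltk.download('stopwords') and nltk.download('wordnet') have been run.
sample_doc = [                                    # document = list of paragraphs
    [                                             # paragraph = list of sentences
        [("The", "DT"), ("cats", "NNS"),          # sentence = list of (token, tag)
         ("are", "VBP"), ("running", "VBG"), (".", ".")],
    ],
]

normalizer = TextNormalizer()
print(list(normalizer.transform([sample_doc])))
# Expected output along the lines of: [['cat', 'run']]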

Corpus Analysis

At this stage, I'd like to get a feel for what's in my corpus, so that I can start thinking about how best to vectorize the text and do different types of counting. With the Yellowbrick 0.3.3 release, support has been added for two text visualizers, which I'd like to test out at scale using this corpus.


In [5]:
from sklearn.pipeline import Pipeline 
from sklearn.feature_extraction.text import CountVectorizer 
from yellowbrick.text import FreqDistVisualizer

visualizer = Pipeline([
    ('norm', TextNormalizer()),
    ('count', CountVectorizer(tokenizer=lambda x: x, preprocessor=None, lowercase=False)),
    ('viz', FreqDistVisualizer())
])

visualizer.fit_transform(documents(), labels())
visualizer.named_steps['viz'].poof()


/usr/local/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-3514380b0c82> in <module>()
      9 ])
     10 
---> 11 visualizer.fit_transform(documents(), labels())
     12 visualizer.named_steps['viz'].poof()

/usr/local/lib/python3.5/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    301         Xt, fit_params = self._fit(X, y, **fit_params)
    302         if hasattr(last_step, 'fit_transform'):
--> 303             return last_step.fit_transform(Xt, y, **fit_params)
    304         elif last_step is None:
    305             return Xt

/usr/local/lib/python3.5/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    495         else:
    496             # fit method of arity 2 (supervised transformation)
--> 497             return self.fit(X, y, **fit_params).transform(X)
    498 
    499 

AttributeError: 'NoneType' object has no attribute 'transform'

It looks like FreqDistVisualizer can't be used as the final step of a Pipeline here: judging from the traceback, its fit doesn't return self, so the pipeline has nothing to call transform on. As a workaround, I'll run the vectorization pipeline on its own and then fit the visualizer directly on the resulting document-term matrix and feature names.

In [6]:
vect = Pipeline([
    ('norm', TextNormalizer()),
    ('count', CountVectorizer(tokenizer=lambda x: x, preprocessor=None, lowercase=False)),
])

docs = vect.fit_transform(documents(), labels())
viz = FreqDistVisualizer() 
viz.fit(docs, vect.named_steps['count'].get_feature_names())
viz.poof()
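
The same workaround can also be restricted to a single category, which is a quick way to compare one hobby's frequency distribution against the corpus as a whole. A sketch, reusing the functions above and assuming the "cooking" category is present on disk:

In [ ]:
# Sketch: frequency distribution for a single category ("cooking") only.
cooking_vect = Pipeline([
    ('norm', TextNormalizer()),
    ('count', CountVectorizer(tokenizer=lambda x: x, preprocessor=None, lowercase=False)),
])

cooking_docs = cooking_vect.fit_transform(documents(categories=["cooking"]))
viz = FreqDistVisualizer()
viz.fit(cooking_docs, cooking_vect.named_steps['count'].get_feature_names())
viz.poof()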



In [8]:
from sklearn.pipeline import Pipeline 
from sklearn.feature_extraction.text import TfidfVectorizer 
from yellowbrick.text import TSNEVisualizer

vect = Pipeline([
    ('norm', TextNormalizer()),
    ('tfidf', TfidfVectorizer(tokenizer=lambda x: x, preprocessor=None, lowercase=False)),
])

docs = vect.fit_transform(documents(), labels())

viz = TSNEVisualizer() 
viz.fit(docs, labels())
viz.poof()


Classification

The primary machine learning task for this kind of corpus is classification: here, predicting a document's topical category, though the same workflow applies to tasks like sentiment analysis.


In [19]:
from sklearn.model_selection import train_test_split as tts 

docs_train, docs_test, labels_train, labels_test = tts(docs, list(labels()), test_size=0.2)
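
Given how imbalanced the hobby categories are (e.g. 733 cooking documents versus 4,710 sports documents), a stratified split may be preferable. A sketch using train_test_split's stratify parameter:

In [ ]:
# Sketch: a stratified split so class proportions are preserved
# in both the training and test sets.
y = list(labels())
docs_train, docs_test, labels_train, labels_test = tts(
    docs, y, test_size=0.2, stratify=y
)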

In [21]:
from sklearn.linear_model import LogisticRegression 
from yellowbrick.classifier import ClassBalance, ClassificationReport, ROCAUC

logit = LogisticRegression()
logit.fit(docs_train, labels_train)


Out[21]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [23]:
logit_balance = ClassBalance(logit, classes=set(labels_test))
logit_balance.score(docs_test, labels_test)
logit_balance.poof()



In [27]:
logit_balance = ClassificationReport(logit, classes=set(labels_test))
logit_balance.score(docs_test, labels_test)
logit_balance.poof()


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-27-7b911cc79c37> in <module>()
      1 logit_balance = ClassificationReport(logit, classes=set(labels_test))
----> 2 logit_balance.score(docs_test, labels_test)
      3 logit_balance.poof()

/Users/benjamin/Repos/tmp/yellowbrick/yellowbrick/classifier.py in score(self, X, y, **kwargs)
    133         self.scores = map(lambda s: dict(zip(self.classes_, s)), self.scores[0:3])
    134         self.scores = dict(zip(keys, self.scores))
--> 135         return self.draw(y, y_pred)
    136 
    137     def draw(self, y, y_pred):

/Users/benjamin/Repos/tmp/yellowbrick/yellowbrick/classifier.py in draw(self, y, y_pred)
    158         for column in range(len(self.matrix)+1):
    159             for row in range(len(self.classes_)):
--> 160                 self.ax.text(column,row,self.matrix[row][column],va='center',ha='center')
    161 
    162         fig = plt.imshow(self.matrix, interpolation='nearest', cmap=self.cmap, vmin=0, vmax=1)

IndexError: list index out of range

In [29]:
logit_balance = ClassificationReport(LogisticRegression())
logit_balance.fit(docs_train, labels_train)
logit_balance.score(docs_test, labels_test)
logit_balance.poof()


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-29-3e54aae07a7d> in <module>()
      1 logit_balance = ClassificationReport(LogisticRegression())
      2 logit_balance.fit(docs_train, labels_train)
----> 3 logit_balance.score(docs_test, labels_test)
      4 logit_balance.poof()

/Users/benjamin/Repos/tmp/yellowbrick/yellowbrick/classifier.py in score(self, X, y, **kwargs)
    133         self.scores = map(lambda s: dict(zip(self.classes_, s)), self.scores[0:3])
    134         self.scores = dict(zip(keys, self.scores))
--> 135         return self.draw(y, y_pred)
    136 
    137     def draw(self, y, y_pred):

/Users/benjamin/Repos/tmp/yellowbrick/yellowbrick/classifier.py in draw(self, y, y_pred)
    158         for column in range(len(self.matrix)+1):
    159             for row in range(len(self.classes_)):
--> 160                 self.ax.text(column,row,self.matrix[row][column],va='center',ha='center')
    161 
    162         fig = plt.imshow(self.matrix, interpolation='nearest', cmap=self.cmap, vmin=0, vmax=1)

IndexError: list index out of range
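
Since the visualizer errors out either way, a plain-text fallback with scikit-learn's own classification_report at least shows the model's per-class precision, recall, and F1 (a sketch using the already-fitted logit):

In [ ]:
# Sketch: text fallback while the visual classification report is unavailable.
from sklearn.metrics import classification_report

print(classification_report(labels_test, logit.predict(docs_test)))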

In [28]:
logit_balance = ROCAUC(logit)
logit_balance.score(docs_test, labels_test)
logit_balance.poof()


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-28-5eccb2a02e03> in <module>()
      1 logit_balance = ROCAUC(logit)
----> 2 logit_balance.score(docs_test, labels_test)
      3 logit_balance.poof()

/Users/benjamin/Repos/tmp/yellowbrick/yellowbrick/classifier.py in score(self, X, y, **kwargs)
    311         """
    312         y_pred = self.predict(X)
--> 313         self.fpr, self.tpr, self.thresholds = roc_curve(y, y_pred)
    314         self.roc_auc = auc(self.fpr, self.tpr)
    315         return self.draw(y, y_pred)

/usr/local/lib/python3.5/site-packages/sklearn/metrics/ranking.py in roc_curve(y_true, y_score, pos_label, sample_weight, drop_intermediate)
    503     """
    504     fps, tps, thresholds = _binary_clf_curve(
--> 505         y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)
    506 
    507     # Attempt to drop thresholds corresponding to points in between and

/usr/local/lib/python3.5/site-packages/sklearn/metrics/ranking.py in _binary_clf_curve(y_true, y_score, pos_label, sample_weight)
    312              array_equal(classes, [-1]) or
    313              array_equal(classes, [1]))):
--> 314         raise ValueError("Data is not binary and pos_label is not specified")
    315     elif pos_label is None:
    316         pos_label = 1.

ValueError: Data is not binary and pos_label is not specified
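
The ValueError arises because roc_curve expects binary labels, while this is a six-class problem. One possible workaround outside of Yellowbrick is to binarize the labels and compute a one-vs-rest ROC AUC per class; a sketch with plain scikit-learn:

In [ ]:
# Sketch: one-vs-rest ROC AUC per class using plain scikit-learn.
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc

classes = sorted(set(labels_test))
y_test_bin = label_binarize(labels_test, classes=classes)
y_scores = logit.decision_function(docs_test)   # shape: (n_samples, n_classes)

for i, cls in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_scores[:, i])
    print(cls, auc(fpr, tpr))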