Topic Classification and Clustering

The two most celebrated areas that machine learning are applied is in natural language processing and machine vision — that is machine learning that is applied to text and images. This is probably because these types of algorithms exhibit the most human-like pattern recognition and are therefore the most useful for short circuiting human-type work.

In this notebook we'll take a look at how to get started with Machine Learning on text, performing both supervised topic identification (classification) and unsupervised topic modeling (clustering).


In [44]:
import os
import time

from bs4 import BeautifulSoup
from sklearn.datasets.base import Bunch
from sklearn import metrics

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.decomposition import NMF
from sklearn.decomposition import LatentDirichletAllocation

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

The Dataset

In this tutorial we are going to use a collection of RSS feeds that have been ingested using feedparser in a project called Baleen. Baleen uses an OPML file to specifically target blog posts of certain categories and saves them accordingly. The result is a corpus that contains 11 categories with a few thousand HTML documents inside.

The document structure is as follows:

|-rss_corpus
|---README
|---books
|---business
|---cinema
|---cooking
|---data_science
|---design
|---do_it_yourself
|---essays
|---gaming
|---sports
|---tech

As you can see this directory structure contains a lot of information, including the predefined topics or classes for supervised machine learning, and a README for the description of the dataset.

We will load our dataset into a bundle using the following function:


In [3]:
CORPUS_ROOT = "data/rss_corpus/"

def load_data(root=CORPUS_ROOT):
    """
    Loads the text data into memory using the bundle dataset structure.
    Note that on larger corpora, memory safe CorpusReaders should be used.
    """
    
    # Open the README and store
    with open(os.path.join(root, 'README'), 'r') as readme:
        DESCR = readme.read()
    
    # Iterate through all the categories
    # Read the HTML into the data and store the category in target
    data      = []
    target    = []
    filenames = []
    
    for category in os.listdir(root):
        if category == "README": continue # Skip the README
        for doc in os.listdir(os.path.join(root, category)):
            fname = os.path.join(root, category, doc)
            
            # Store information about document
            filenames.append(fname)
            target.append(category)

            # Read data and store in data list
            with open(fname, 'r') as f:
                data.append(f.read())
    
    return Bunch(
        data=data,
        target=target,
        filenames=filenames,
        target_names=frozenset(target),
        DESCR=DESCR,
    )

In [4]:
dataset = load_data()

In [5]:
print dataset.DESCR


Baleen RSS Export
=================

These feeds were exported on Oct 03, 2014 at 17:30

There are 5506 posts in 11 categories in this corpus as follows:
   tech: 979 posts
   do it yourself: 294 posts
   cooking: 279 posts
   cinema: 679 posts
   gaming: 475 posts
   essays: 46 posts
   business: 903 posts
   design: 449 posts
   sports: 653 posts
   books: 253 posts
   data science: 496 posts

Feature Extraction


In [6]:
class HTMLPreprocessor(BaseEstimator, TransformerMixin):
    """
    Preprocesses HTML to extract the title and the paragraph.
    """

    def fit(self, X, y=None):
        return self

    def parse_body_(self, soup):
        """
        Helper function for dealing with the HTML body
        """
        
        if soup.find('p'):
            # Use paragraph extraction
            return "\n\n".join([
                    p.text.strip() 
                    for p in soup.find_all('p') 
                    if p.text.strip()
                ])
        
        else:
            # Use raw text extraction
            return soup.find('body').text.strip()
    
    def parse_html_(self, text):
        """
        Helper function for dealing with an HTML document
        """
        soup  = BeautifulSoup(text, 'lxml')
        title = soup.find('title').text
        body  = self.parse_body_(soup)

        # Get rid of the soup
        soup.decompose()
        del soup

        return {
            'title': title,
            'body': body
        }
            
    def transform(self, texts):
        """
        Extracts the text from all the paragraph tags 
        """
        return [
            self.parse_html_(text)
            for text in texts
        ]

    
class ValueByKey(BaseEstimator, TransformerMixin):
    """
    Extracts a value from a dictionary by key.
    """
    
    def __init__(self, key):
        self.key = key
    
    def fit(self, X, y=None):
        return self

    def transform(self, dicts):
        """
        Returns a list of values by key.
        """
        return [
            d[self.key] for d in dicts
        ]

In [7]:
## Construct the Feature Extraction Pipeline
features = Pipeline([
    
    # Preprocess html to extract the text.
    ('preprocess', HTMLPreprocessor()),
        
    # Use FeatureUnion to combine title and body features 
    ('html_union', FeatureUnion(
    
        # Create union of Title and Body
        transformer_list=[
            
            # Pipeline for Title  Extraction
            ('title', Pipeline([
                ('title_extract', ValueByKey('title')),
                ('title_tf', CountVectorizer(
                    max_features=4000, stop_words='english'
                )),
                # TODO: Add advanced TF parameters for better features
            ])),
                    
            # Pipeline for Task Extraction
            ('body', Pipeline([
                ('body_extract', ValueByKey('body')),
                ('body_tfidf', TfidfVectorizer(stop_words='english')),
                # TODO: Add advanced TF-IDF parameters for better features
            ])),
                      
        ],

        # weight components in FeatureUnion
        transformer_weights={
            'title': 0.45,
            'body':  0.55,
        },
                
    ))
        
])

In [8]:
start = time.time()
data  = features.fit_transform(dataset.data)

print (
    "Processed {} documents with {} features in {:0.3f} seconds"
    .format(data.shape[0], data.shape[1], time.time()-start)
)


Processed 5506 documents with 53852 features in 12.564 seconds

In [28]:
feature_names = features.steps[1][1].transformer_list[0][1].steps[1][1].get_feature_names()
feature_names.extend(
    features.steps[1][1].transformer_list[1][1].steps[1][1].get_feature_names()
)


Out[28]:
53852

Topic Modeling (Clustering)

In this section we will use our features extracted from the HTML documents to do some topic clustering using NMF and LDA.


In [40]:
N_TOPICS    = 10
N_TOP_WORDS = 20

def model_topics(model, data, **kwargs):
    """
    Automatic topic modeling and elucidation of topic classes.
    """
    
    start = time.time()
    clust = model(**kwargs).fit(data)
    
    print "Fit {} model in {:0.3f} seconds\n".format(clust.__class__.__name__, time.time()-start)
    for idx, topic in enumerate(clust.components_):
        print "  Topic {}:".format(idx)
        for tdx in topic.argsort()[:-N_TOP_WORDS - 1:-1]:
            print "    - {}".format(feature_names[tdx])
        print

In [41]:
model_topics(NMF, data, n_components=N_TOPICS, random_state=1, alpha=.1, l1_ratio=.5)


Fit NMF model in 1.316 seconds

  Topic 0:
    - new
    - york
    - trailer
    - new
    - watch
    - interstellar
    - product
    - iphones
    - read
    - home
    - oculus
    - prototype
    - times
    - mobile
    - launches
    - ad
    - gets
    - online
    - set
    - trailer

  Topic 1:
    - data
    - big
    - science
    - read
    - data
    - analytics
    - scientist
    - comments
    - sep
    - mining
    - upcoming
    - business
    - analytics
    - science
    - big
    - kdnuggets
    - using
    - webcasts
    - learning
    - free

  Topic 2:
    - apple
    - apple
    - ios
    - watch
    - report
    - beats
    - music
    - says
    - aapl
    - update
    - music
    - ios
    - beats
    - tim
    - cook
    - event
    - 16
    - ipad
    - october
    - fix

  Topic 3:
    - video
    - tv
    - night
    - best
    - trailer
    - video
    - late
    - watch
    - games
    - music
    - football
    - live
    - drone
    - lip
    - says
    - movies
    - year
    - et
    - wednesday
    - says

  Topic 4:
    - iphone
    - plus
    - iphone
    - plus
    - apple
    - aapl
    - review
    - new
    - phones
    - phones
    - best
    - record
    - vs
    - sales
    - iphones
    - buy
    - big
    - camera
    - phone
    - watching

  Topic 5:
    - 2014
    - september
    - week
    - october
    - links
    - short
    - fantastic
    - fest
    - design
    - bestsellers
    - npr
    - critical
    - linking
    - day
    - tiff
    - london
    - festival
    - highlights
    - 15
    - film

  Topic 6:
    - 3dthursday
    - 3dprinting
    - 3d
    - 3d
    - electronichalloween
    - printing
    - projects
    - printed
    - 3dprinted
    - 3dxhalloween
    - thursday
    - printed
    - halloween
    - design
    - printer
    - adafruit
    - electronics
    - project
    - sculptures
    - collection

  Topic 7:
    - pi
    - raspberrypi
    - piday
    - raspberry_pi
    - raspberry
    - pi
    - raspberry
    - tutorials
    - adafruit
    - using
    - camera
    - model
    - build
    - station
    - piday
    - accessories
    - selection
    - code
    - kit
    - point

  Topic 8:
    - 10
    - know
    - things
    - need
    - windows
    - million
    - today
    - ways
    - great
    - 10
    - make
    - tech
    - windows
    - weekend
    - opening
    - want
    - goog
    - available
    - advertising
    - sells

  Topic 9:
    - game
    - game
    - vs
    - creed
    - assassin
    - play
    - thrones
    - 15
    - season
    - games
    - steam
    - china
    - imitation
    - unity
    - pass
    - set
    - free
    - games
    - 20
    - destiny


In [43]:
model_topics(LatentDirichletAllocation, data, n_topics=N_TOPICS, max_iter=5,
                                learning_method='online', learning_offset=50.,
                                random_state=0)


Fit LatentDirichletAllocation model in 9.603 seconds

  Topic 0:
    - turkey
    - vermilionville
    - 168located
    - barbecued
    - smoker
    - andouille
    - 1830s
    - gumbo
    - smoked
    - embodies
    - lafayette
    - hardcover
    - big
    - 168prejean
    - luxurious
    - gumbo
    - destiny
    - idea
    - andouille
    - making

  Topic 1:
    - chevron
    - 1washington
    - uncommonly
    - garlicky
    - pastas
    - redskins
    - standing
    - anchovy
    - sauce
    - 2014
    - salsa
    - september
    - fiction
    - week
    - desk
    - china
    - stand
    - redskins
    - comics
    - paperback

  Topic 2:
    - shark
    - literary
    - tank
    - entire
    - corcoran
    - barbara
    - gave
    - adult
    - killings
    - marlon
    - bourne
    - banksy
    - mhm
    - bookends
    - bourne
    - marlon
    - bemelmans
    - lda
    - corcoran
    - gensim

  Topic 3:
    - giuliani
    - rudy
    - lawsuit
    - hong
    - kong
    - duty
    - activision
    - noriega
    - noriega
    - walken
    - manuel
    - bitpay
    - counsel
    - giuliani
    - handing
    - cosby
    - hook
    - activision
    - artsbeat
    - dictator

  Topic 4:
    - hardcover
    - worlds
    - girls
    - kernel
    - bookends
    - oitnb
    - mashub
    - pca
    - westerfeld
    - mishaps
    - proofreading
    - pasternak
    - stonor
    - dugard
    - pandemonium
    - mandel
    - harbinger
    - nordberg
    - bacha
    - freedoms

  Topic 5:
    - unknown
    - margaret
    - soda
    - atwood
    - mattress
    - pudding
    - atwood
    - adolescence
    - psychologist
    - ovaltine
    - puddings
    - pichet
    - ong
    - pimco
    - malty
    - troubled
    - therapeutic
    - traumas
    - dreadfulness
    - child

  Topic 6:
    - new
    - apple
    - iphone
    - data
    - 2014
    - read
    - new
    - facebook
    - change
    - video
    - people
    - used
    - lets
    - check
    - uses
    - game
    - stolen
    - experiments
    - guidelines
    - big

  Topic 7:
    - wasn
    - chevron
    - deleon
    - chevron
    - sordid
    - ecuador
    - year
    - mistakes
    - climate
    - poker
    - coyotes
    - krissy
    - case
    - averted
    - athena
    - ebola
    - just
    - online
    - new
    - change

  Topic 8:
    - excellent
    - memoir
    - arches
    - swansea
    - nabokov
    - zoumbas
    - whereismouse
    - _mathildes_artbook_
    - heythererosetta
    - isariasir
    - lasaventurasdechaps
    - designbeginswithl
    - stencild
    - bigfootstudiosart
    - mister_vi
    - jgbrasier
    - shalem_bencivenga
    - meerkatsu
    - starshrouded
    - chhutisultana

  Topic 9:
    - toast
    - religious
    - russell
    - eimear
    - mcbride
    - chronicles
    - awesommme
    - mannn
    - vote
    - slathered
    - saffron
    - yearnings
    - couillard
    - separatist
    - perilous
    - savery
    - paysage
    - roeland
    - avec
    - animaux

Topic Identification (Classification)

In this section we will use our features extracted from the HTML documents to do some topic identification using SVM, Maximum Entropy, and Naive Bayes. (Maybe also a Random Forest).


In [ ]:
def classify_topics(model, data, **kwargs):
    start = time.time()
    clf = model(**kwargs).fit(data, dataset.target)
    
    print "Fit {} model in {:0.3f} seconds\n".format(clf.__class__.__name__, time.time()-start)
    classification_report(clf)