In [1]:
%%html
<style>
table {float:left}
</style>


TacticToolkit Introduction

TacticToolkit is a codebase that assists with machine learning and natural language processing. It builds on top of sklearn, tensorflow, keras, nltk, spaCy and other popular libraries, and helps throughout the workflow: from data acquisition to preprocessing to training to inference.

| Module | Description |
| --- | --- |
| corpus | Load and work with text corpora |
| data | Data generation and common data functions |
| plotting | Predefined and customizable plots |
| preprocessing | Transform and clean data in preparation for training |
| sandbox | Newer experimental features and references |
| text | Text manipulation and processing |

In [2]:
# until we can install, add parent dir to path so ttk is found
import sys
sys.path.insert(0, '..')

In [3]:
# basic imports
import pandas as pd
import numpy as np
import re

import matplotlib
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)

import matplotlib.pyplot as plt

Let's start with some text

The ttk.text module includes classes and functions that make working with text easier. They supplement the existing text processing in nltk and spaCy, and often work in conjunction with those libraries. Below is an overview of some of the major components. We'll explore these objects with some simple text now.

| Class | Purpose |
| --- | --- |
| Normalizer | Normalizes text by formatting, stemming and substitution |
| Tokenizer | High-level tokenizer; provides word, sentence and paragraph tokenizers |

In [4]:
# simple text normalization
# apply individually
# apply to sentences
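The cell above is a placeholder, and ttk's Normalizer API is not shown in this notebook. As a rough stand-in, here is what the kind of normalization it performs (lowercasing, punctuation stripping, whitespace collapsing) looks like in plain Python; the function name and behavior are illustrative, not ttk's actual interface:

```python
import re

def normalize(text):
    """Plain-Python sketch of a text normalizer: lowercase,
    strip punctuation, collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation
    return re.sub(r"\s+", " ", text).strip()

print(normalize("  The QUICK, brown fox!  "))
# the quick brown fox
```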

In [5]:
# simple text tokenization
# harder text tokenization
# sentence tokenization
# paragraph tokenization
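Again, ttk's Tokenizer API is not demonstrated here. A naive regex-based sketch shows both what word and sentence tokenization do and why a higher-level tokenizer is useful (note how the simple sentence splitter breaks on "Dr."); these helper names are hypothetical, not ttk's:

```python
import re

def word_tokenize(text):
    # words, keeping contractions such as "don't" intact
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

def sent_tokenize(text):
    # naive split after ., ! or ? followed by whitespace;
    # wrongly splits abbreviations like "Dr." -- exactly the
    # case a smarter sentence tokenizer is meant to handle
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Dr. Smith arrived. Don't panic! All is well."
print(word_tokenize(text))
print(sent_tokenize(text))
```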

Corpii? Corpuses? Corpora!

The ttk.corpus module builds on the nltk.corpus module, adding new corpus readers and corpus processing objects. It also includes loading functions for the corpora included with ttk, which will download the content from github as needed.

We'll use the Dated Headline corpus included with ttk. This corpus was created using ttk, and is maintained in a complementary github project, TacticCorpora (https://github.com/tacticsiege/TacticCorpora).

First, a quick look at the corpus module's major classes and functions.

| Class | Purpose |
| --- | --- |
| CategorizedDatedCorpusReader | Extends nltk's CategorizedPlaintextCorpusReader to include a second category, date |
| CategorizedDatedCorpusReporter | Summarizes corpora; filterable, with output as str, list or DataFrame |

| Function | Purpose |
| --- | --- |
| load_headline_corpus(with_date=True) | Loads a Categorized or CategorizedDated corpus reader from the headline data |
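load_headline_corpus fetches the corpus from github only when it is not already on disk. ttk's actual loader is not shown in this notebook, but the download-if-missing pattern it describes can be sketched like this (function and paths are hypothetical):

```python
import os
import urllib.request

def ensure_corpus(url, local_path):
    """Return local_path, downloading the corpus archive first
    if it is not already cached locally. Illustrative sketch of
    the download-as-needed behavior described above."""
    if not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        urllib.request.urlretrieve(url, local_path)
    return local_path
```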

In [6]:
from ttk.corpus import load_headline_corpus

# load the dated corpus. 
# This will attempt to download the corpus from github if it is not present locally.
corpus = load_headline_corpus(verbose=True)


Loading corpus from: S:\git\tacticsiege\tactictoolkit\ttk\..\env\corpus\dated\2017_08_22\corpus
Corpus loaded.

In [7]:
# inspect categories
print(len(corpus.categories()), 'categories')
for cat in corpus.categories():
    print(cat)


26 categories
CBS News
The Atlantic
Slate
Associated Press
The Guardian
Economist
Fox News
Russia Today
Reuters
Weekly Standard
The Independent
Al Jazeera
NPR
BBC
Boston Globe
CNBC
Business Insider
Wall Street Journal
Washington Post
Huffington Post
ABC News
Evening Standard
CNN
Breitbart
New York Times
The New Yorker

In [8]:
# all main corpus methods accept category and date filter lists
d = '2017-08-22'
print(len(corpus.categories(dates=[d])), 'categories')
for cat in corpus.categories(dates=[d]):
    print(cat)


24 categories
CBS News
The Atlantic
Associated Press
The Guardian
Fox News
Russia Today
Reuters
Weekly Standard
The Independent
Al Jazeera
NPR
BBC
Boston Globe
CNBC
Business Insider
Wall Street Journal
Washington Post
Huffington Post
ABC News
Evening Standard
CNN
Breitbart
New York Times
The New Yorker

In [10]:
# use the corpus reporters to get summary reports
from ttk.corpus import CategorizedDatedCorpusReporter
reporter = CategorizedDatedCorpusReporter()

# summarize categories
print(reporter.category_summary(corpus))


CBS News              88 dates  10412 sentences  113282 words  12288 unique words,  88 files
The Atlantic          86 dates   2610 sentences   24953 words   5077 unique words,  86 files
Slate                 79 dates   1062 sentences    7899 words   2428 unique words,  79 files
Associated Press      88 dates   7711 sentences   83257 words   9989 unique words,  88 files
The Guardian          87 dates  15278 sentences  205447 words  19845 unique words,  87 files
Economist             45 dates    252 sentences    2304 words   1007 unique words,  45 files
Fox News              86 dates  11213 sentences  145350 words  15793 unique words,  86 files
Russia Today          88 dates   6463 sentences   92975 words  12692 unique words,  88 files
Reuters               88 dates   9462 sentences  117009 words  10973 unique words,  88 files
Weekly Standard       80 dates   1161 sentences   10188 words   3077 unique words,  80 files
The Independent       87 dates   4584 sentences   72318 words  10048 unique words,  87 files
Al Jazeera            87 dates   3164 sentences   29095 words   5565 unique words,  87 files
NPR                   88 dates   6626 sentences   79370 words  11101 unique words,  88 files
BBC                   88 dates  12198 sentences  121363 words  15447 unique words,  88 files
Boston Globe          86 dates    607 sentences    7062 words   2275 unique words,  86 files
CNBC                  88 dates  12726 sentences  186294 words  14217 unique words,  88 files
Business Insider      86 dates   7949 sentences  122707 words  13074 unique words,  86 files
Wall Street Journal   88 dates   3573 sentences   37259 words   6257 unique words,  88 files
Washington Post       88 dates  16230 sentences  209588 words  15682 unique words,  88 files
Huffington Post       88 dates   3361 sentences   43254 words   6568 unique words,  88 files
ABC News              88 dates  15891 sentences  181688 words  14233 unique words,  88 files
Evening Standard      84 dates   5522 sentences   87879 words  10763 unique words,  84 files
CNN                   88 dates  18383 sentences  176168 words  14749 unique words,  88 files
Breitbart             88 dates   6750 sentences   99341 words  11599 unique words,  88 files
New York Times        88 dates   6304 sentences   83061 words  10201 unique words,  88 files
The New Yorker        80 dates    955 sentences    9504 words   2881 unique words,  80 files

In [13]:
# reporters can return str, list or dataframe
for s in reporter.date_summary(corpus,
                               dates=['2017-08-17', '2017-08-18', '2017-08-19',],
                               output='list'):
    print(s)


{'date': '2017-08-17', 'categories': 25, 'sentences': 2843, 'words': 36125, 'uniq_words': 7255, 'files': 25}
{'date': '2017-08-18', 'categories': 25, 'sentences': 2773, 'words': 34342, 'uniq_words': 6891, 'files': 25}
{'date': '2017-08-19', 'categories': 24, 'sentences': 1445, 'words': 17718, 'uniq_words': 4435, 'files': 24}
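Because output='list' returns plain dicts, the summaries compose with ordinary Python. For example, totaling words and sentences across the three days printed above (values copied from that output):

```python
summaries = [
    {'date': '2017-08-17', 'sentences': 2843, 'words': 36125},
    {'date': '2017-08-18', 'sentences': 2773, 'words': 34342},
    {'date': '2017-08-19', 'sentences': 1445, 'words': 17718},
]

total_words = sum(s['words'] for s in summaries)
total_sentences = sum(s['sentences'] for s in summaries)
print(total_words, total_sentences)
# 88185 7061
```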

In [14]:
cat_frame = reporter.category_summary(corpus,
                                      categories=['BBC', 'CNBC', 'CNN', 'NPR',],
                                      output='dataframe')
cat_frame.head()


Out[14]:
  category  dates  files  sentences  uniq_words   words
0      CNN     88     88      18383       14749  176168
1      NPR     88     88       6626       11101   79370
2     CNBC     88     88      12726       14217  186294
3      BBC     88     88      12198       15447  121363

In [ ]: