In [1]:
%%html
<style>
table {float:left}
</style>
TacticToolkit (ttk) is a codebase to assist with machine learning and natural language processing. It builds on sklearn, tensorflow, keras, nltk, spaCy and other popular libraries, and supports the whole workflow, from data acquisition through preprocessing and training to inference.
| Modules | Description |
|---|---|
| corpus | Load and work with text corpora |
| data | Data generation and common data functions |
| plotting | Predefined and customizable plots |
| preprocessing | Transform and clean data in preparation for training |
| sandbox | Newer experimental features and references |
| text | Text manipulation and processing |
In [2]:
# until we can install, add parent dir to path so ttk is found
import sys
sys.path.insert(0, '..')
In [3]:
# basic imports
import pandas as pd
import numpy as np
import re
import matplotlib
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)
import matplotlib.pyplot as plt
The ttk.text module includes classes and functions to make working with text easier. These are meant to supplement existing nltk and spaCy text processing, and often work in conjunction with these libraries. Below is an overview of some of the major components. We'll explore these objects with some simple text now.
| Class | Purpose |
|---|---|
| Normalizer | Normalizes text by formatting, stemming and substitution |
| Tokenizer | High level tokenizer, provides word, sentence and paragraph tokenizers |
In [4]:
# simple text normalization
# apply individually
# apply to sentences
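The normalization cell above is only a stub. As a rough illustration of what normalizing by formatting and substitution involves, here is a minimal standard-library sketch. The function `simple_normalize` and its substitution table are hypothetical and are not ttk's `Normalizer` API; stemming is left out.

```python
import re

# Hypothetical substitution table for illustration; not taken from ttk.
SUBSTITUTIONS = {
    "&": " and ",
    "%": " percent ",
}

def simple_normalize(text, substitutions=SUBSTITUTIONS):
    """Lowercase, apply literal substitutions, drop punctuation, collapse whitespace."""
    text = text.lower()
    for old, new in substitutions.items():
        text = text.replace(old, new)
    # keep lowercase letters, digits, apostrophes and whitespace only
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    # collapse runs of whitespace left behind by the steps above
    return re.sub(r"\s+", " ", text).strip()

# apply individually
print(simple_normalize("Stocks & Bonds rally 5%!"))

# apply to a list of sentences
sentences = ["Stocks & Bonds rally 5%!", "  Fed holds rates --  again. "]
normalized = [simple_normalize(s) for s in sentences]
print(normalized)
```

Applying substitutions before stripping punctuation matters here: otherwise symbols such as `&` and `%` would be removed before they could be replaced by words.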
In [5]:
# simple text tokenization
# harder text tokenization
# sentence tokenization
# paragraph tokenization
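The tokenization cell above is also a stub. To make the three levels (word, sentence, paragraph) concrete, here is a naive regex-based sketch using only the standard library. These helper functions are hypothetical stand-ins, not ttk's `Tokenizer`; a real tokenizer (for example nltk's or spaCy's) handles abbreviations and other edge cases that this one does not.

```python
import re

def paragraph_tokenize(text):
    """Split text into paragraphs on one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def sentence_tokenize(text):
    """Naively split after '.', '!' or '?' followed by whitespace.

    This breaks on abbreviations such as 'Dr. Smith'.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def word_tokenize(text):
    """Return words (keeping internal apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

text = "Markets rose today. Why? Nobody's sure.\n\nTech led the gains."

paragraphs = paragraph_tokenize(text)         # two paragraphs
sentences = sentence_tokenize(paragraphs[0])  # three sentences
words = word_tokenize(sentences[2])           # words plus trailing punctuation

print(paragraphs)
print(sentences)
print(words)
```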
The ttk.corpus module builds on the nltk.corpus model, adding new corpus readers and corpus processing objects. It also includes loading functions for the corpora included with ttk, which download the content from GitHub as needed.
We'll use the Dated Headline corpus included with ttk. This corpus was created using ttk, and is maintained in a complementary GitHub project, TacticCorpora (https://github.com/tacticsiege/TacticCorpora).
First, a quick look at the corpus module's major classes and functions.
| Class | Purpose |
|---|---|
| CategorizedDatedCorpusReader | Extends nltk's CategorizedPlaintextCorpusReader to include a second category, Date |
| CategorizedDatedCorpusReporter | Summarizes corpora. Filterable, and output can be str, list or DataFrame |
| Function | Purpose |
|---|---|
| load_headline_corpus(with_date=True) | Loads Categorized or CategorizedDated CorpusReader from headline data |
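To show what a second category dimension means, here is a toy in-memory index keyed by category and date. This is only a sketch: `DatedCategoryIndex`, its methods and the sample records are hypothetical, and say nothing about how ttk's reader stores or parses its files.

```python
from collections import defaultdict

class DatedCategoryIndex:
    """Toy index keyed by (category, date); not ttk's corpus reader."""

    def __init__(self, records):
        # records: iterable of (category, date_string, headline) tuples
        self._by_key = defaultdict(list)
        for category, date, headline in records:
            self._by_key[(category, date)].append(headline)

    def categories(self, dates=None):
        """Sorted categories, optionally limited to those with content on the given dates."""
        return sorted({c for c, d in self._by_key if dates is None or d in dates})

    def dates(self, categories=None):
        """Sorted dates, optionally limited to those covered by the given categories."""
        return sorted({d for c, d in self._by_key if categories is None or c in categories})

    def headlines(self, categories=None, dates=None):
        """Headlines matching both filters; a filter of None matches everything."""
        result = []
        for c, d in sorted(self._by_key):
            if (categories is None or c in categories) and (dates is None or d in dates):
                result.extend(self._by_key[(c, d)])
        return result

# Made-up sample records, for illustration only.
records = [
    ("BBC", "2017-08-21", "headline one"),
    ("BBC", "2017-08-22", "headline two"),
    ("CNN", "2017-08-21", "headline three"),
]
index = DatedCategoryIndex(records)

print(index.categories())
print(index.categories(dates=["2017-08-22"]))
print(index.headlines(categories=["BBC"]))
```

Treating `None` as "no filter" lets every method accept category and date lists in the same way, which mirrors the filtering style used in the cells that follow.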
In [6]:
from ttk.corpus import load_headline_corpus
# load the dated corpus
# this will attempt to download the corpus from GitHub if it is not present locally
corpus = load_headline_corpus(verbose=True)
In [7]:
# inspect categories
print(len(corpus.categories()), 'categories')
for cat in corpus.categories():
    print(cat)
In [8]:
# all main corpus methods accept lists of category and date filters
d = '2017-08-22'
print(len(corpus.categories(dates=[d])), 'categories')
for cat in corpus.categories(dates=[d]):
    print(cat)
In [10]:
# use the Corpus Reporters to get summary reports
from ttk.corpus import CategorizedDatedCorpusReporter
reporter = CategorizedDatedCorpusReporter()
# summarize categories
print (reporter.category_summary(corpus))
In [13]:
# reporters can return str, list or dataframe
for s in reporter.date_summary(corpus,
                               dates=['2017-08-17', '2017-08-18', '2017-08-19'],
                               output='list'):
    print(s)
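The idea of one summary returned as str, list or DataFrame can be sketched with a single function that builds the rows once and formats them at the end. The function `summarize_counts` and its column names are hypothetical; this is not how `CategorizedDatedCorpusReporter` is implemented.

```python
from collections import Counter

def summarize_counts(labels, output="str"):
    """Count labels and return the summary as 'str', 'list' or 'dataframe'."""
    # build the rows once, sorted by label, then format them per output type
    rows = sorted(Counter(labels).items())
    if output == "list":
        return [f"{label}: {count}" for label, count in rows]
    if output == "str":
        return "\n".join(f"{label}: {count}" for label, count in rows)
    if output == "dataframe":
        # imported lazily so the other formats work without pandas
        import pandas as pd
        return pd.DataFrame(rows, columns=["label", "count"])
    raise ValueError(f"unknown output format: {output!r}")

# Made-up labels, for illustration only.
labels = ["BBC", "CNN", "BBC", "NPR"]

print(summarize_counts(labels))
for line in summarize_counts(labels, output="list"):
    print(line)
```

Keeping the counting separate from the formatting means a new output type only needs one more branch.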
In [14]:
cat_frame = reporter.category_summary(corpus,
                                      categories=['BBC', 'CNBC', 'CNN', 'NPR'],
                                      output='dataframe')
cat_frame.head()