In [1]:
%matplotlib inline
In [2]:
import os
import sys
# Modify the path
sys.path.append("..")
import yellowbrick as yb
import matplotlib.pyplot as plt
Yellowbrick provides a text corpus wrangled from the Baleen RSS corpus for the following examples. If you haven't downloaded the data, you can do so by running:
$ python download.py
in the same directory as this notebook. Note that this will create a directory called data
that contains subdirectories with the provided datasets.
In [3]:
from download import download_all
from sklearn.utils import Bunch  # older scikit-learn versions exposed this as sklearn.datasets.base.Bunch
## The path to the test data sets
FIXTURES = os.path.join(os.getcwd(), "data")
## Dataset loading mechanisms
datasets = {
    "hobbies": os.path.join(FIXTURES, "hobbies")
}


def load_data(name, download=True):
    """
    Loads and wrangles the passed in text corpus by name.
    If download is specified, this method will download any missing files.
    """

    # Get the path from the datasets
    path = datasets[name]

    # Check if the data exists, otherwise download or raise
    if not os.path.exists(path):
        if download:
            download_all()
        else:
            raise ValueError((
                "'{}' dataset has not been downloaded, "
                "use the download.py module to fetch datasets"
            ).format(name))

    # Read the directories in the directory as the categories.
    categories = [
        cat for cat in os.listdir(path)
        if os.path.isdir(os.path.join(path, cat))
    ]

    files  = []  # holds the file names relative to the root
    data   = []  # holds the text read from the file
    target = []  # holds the string of the category

    # Load the data from the files in the corpus
    for cat in categories:
        for name in os.listdir(os.path.join(path, cat)):
            files.append(os.path.join(path, cat, name))
            target.append(cat)

            with open(os.path.join(path, cat, name), 'r') as f:
                data.append(f.read())

    # Return the data bunch for use similar to the newsgroups example
    return Bunch(
        categories=categories,
        files=files,
        data=data,
        target=target,
    )
In [4]:
corpus = load_data('hobbies')
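As a quick sanity check, here is a minimal sketch for inspecting the loaded corpus; the attributes come from the Bunch built in load_data above:
# Inspect the loaded corpus (Bunch supports both attribute and dict-style access)
print(corpus.categories)        # category names, one per subdirectory of the hobbies data
print(len(corpus.data))         # number of documents read from disk
print(corpus.data[0][:200])     # first 200 characters of the first document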
One method for visualizing the frequency of tokens within and across corpora is the frequency distribution. A frequency distribution tells us the frequency of each vocabulary item in the text. In general, it could count any kind of observable event. It is a distribution because it tells us how the total number of word tokens in the text are distributed across the vocabulary items.
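Before turning to the visualizer, here is a minimal sketch of the idea using only the standard library (the sentence is a purely illustrative toy example):
from collections import Counter

# A frequency distribution is simply a count of each vocabulary item in the text
tokens = "the cat sat on the mat with the other cat".split()
freqdist = Counter(tokens)
print(freqdist.most_common(3))  # e.g. [('the', 3), ('cat', 2), ...]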
In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from yellowbrick.text.freqdist import FreqDistVisualizer
Note that the FreqDistVisualizer
does not perform any normalization or vectorization; it expects text that has already been count vectorized.
We first instantiate a FreqDistVisualizer
object, and then call fit()
on that object with the count vectorized documents and the features (i.e. the words from the corpus), which computes the frequency distribution. The visualizer then plots a bar chart of the top 50 most frequent terms in the corpus, with the terms listed along the x-axis and frequency counts on the y-axis. As with other Yellowbrick visualizers, when the user invokes show()
, the finalized visualization is shown.
In [10]:
vectorizer = CountVectorizer()
docs = vectorizer.fit_transform(corpus.data)
features = vectorizer.get_feature_names()
visualizer = FreqDistVisualizer()
visualizer.fit(docs, features)
visualizer.show()
Note that the most frequent tokens above are common English stop words (e.g. "the", "and", "to") that carry little meaning; we can remove them by passing stop_words='english' to the CountVectorizer and plotting the distribution again.
In [15]:
vectorizer = CountVectorizer(stop_words='english')
docs = vectorizer.fit_transform(corpus.data)
features = vectorizer.get_feature_names()
visualizer = FreqDistVisualizer()
visualizer.fit(docs, features)
visualizer.show()
It is also interesting to explore the differences in tokens across the categories of a corpus. The hobbies corpus that comes with Yellowbrick has already been categorized (try corpus['categories']
), so let's visually compare the frequency distributions for two of the categories: "cooking" and "gaming".
In [18]:
hobby_types = {}

for category in corpus['categories']:
    texts = []
    for idx in range(len(corpus['data'])):
        if corpus['target'][idx] == category:
            texts.append(corpus['data'][idx])
    hobby_types[category] = texts
In [19]:
vectorizer = CountVectorizer(stop_words='english')
docs = vectorizer.fit_transform(text for text in hobby_types['cooking'])
features = vectorizer.get_feature_names()
visualizer = FreqDistVisualizer()
visualizer.fit(docs, features)
visualizer.show()
In [20]:
vectorizer = CountVectorizer(stop_words='english')
docs = vectorizer.fit_transform(text for text in hobby_types['gaming'])
features = vectorizer.get_feature_names()
visualizer = FreqDistVisualizer()
visualizer.fit(docs, features)
visualizer.show()