_adapted from https://github.com/StarCYing/open_data_day_dc_
In this workshop we walk through an example data science workflow, from initial data ingestion and cleaning through modeling and, ultimately, clustering. In this example we scrape the news feed of the National Institute of Standards and Technology (NIST). For those not in the know, NIST comprises multiple research centers covering a wide range of scientific and engineering fields.
That breadth makes it a natural target for topic modeling, a clustering-based way of identifying thematic patterns in a corpus.
In [1]:
# The __future__ import must come before any other statement in the cell.
from __future__ import print_function

from time import time

import requests
from lxml import html
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer
We request NIST news from the following URL: 'http://www.nist.gov/allnews.cfm?s=01-01-2014&e=12-31-2014'. For this workshop we look only at 2014 news articles posted on the NIST website.
We then pass the retrieved content to our HTML parser and search for a specific div class, "select_portal_module_wrapper", which is assigned to every headline and every description (headlines sit in a strong tag and descriptions in a p tag).
We then merge each headline and its description into a single entry in the list, because we don't need to differentiate between title and description.
In [2]:
print("Retrieving data from NIST...")
# Retrieve the data from the web page.
page = requests.get('http://www.nist.gov/allnews.cfm?s=01-01-2014&e=12-31-2014')
# Use html module to parse it out and store in tree.
tree = html.fromstring(page.content)
# Create list of news headlines and descriptions.
# This required obtaining the xpath of the elements by examining the web page.
list_of_headlines = tree.xpath('//div[@class="select_portal_module_wrapper"]/a/strong/text()')
list_of_descriptions = tree.xpath('//div[@class="select_portal_module_wrapper"]/p/text()')
# Combine each headline and its matching description into one entry in a list.
# A space is inserted so the last word of the headline doesn't fuse with the
# first word of the description.
news = []
for headline, description in zip(list_of_headlines, list_of_descriptions):
    news.append(headline + ' ' + description)
print("Last item in list retrieved: %s" % news[-1])
We now call a TF-IDF vectorizer to create a sparse matrix with term frequency-inverse document frequency weights:
In [3]:
print("Extracting features from the training dataset using a sparse vectorizer")
t0 = time()
# Build a sparse TF-IDF matrix over the most informative words, with:
#   Maximum document frequency = half the total documents
#   Minimum document frequency = two documents
#   Common English stop words removed.
# Note: the documents are passed to fit_transform below, not to the constructor.
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')
# Learn the vocabulary and compute the TF-IDF weight matrix.
X = vectorizer.fit_transform(news)
print("done in %fs" % (time() - t0))
print("n_samples: %d, n_features: %d" % X.shape)
print()
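Before clustering, it can help to eyeball the weights. The snippet below is a sketch (not part of the original workshop) that reuses the X and vectorizer objects above to print the five highest-weighted terms of the first article:
import numpy as np

# Look at the five highest-weighted terms in the first article.
feature_names = vectorizer.get_feature_names()
first_row = X[0].toarray().ravel()
top_indices = first_row.argsort()[::-1][:5]
print("Top terms in first article:", [feature_names[i] for i in top_indices])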
I happen to know there are 15 subject areas at NIST:
- Bioscience & Health
- Building and Fire Research
- Chemistry
- Electronics & Telecommunications
- Energy
- Environment/Climate
- Information Technology
- Manufacturing
- Materials Science
- Math
- Nanotechnology
- Physics
- Public Safety & Security
- Quality
- Transportation
So, why don't we cheat and set the number of clusters to 15?
Then we call the KMeans clustering model from sklearn, setting an upper bound on the number of iterations used to fit the data to the model.
Finally we list out each centroid and the top 10 terms associated with each centroid.
In [4]:
# Set the number of clusters to 15
k = 15
# Initialize the kMeans cluster model.
km = KMeans(n_clusters=k, init='k-means++', max_iter=100)
print("Clustering sparse data with %s" % km)
t0 = time()
# Pass the model our sparse matrix with the TF-IDF counts.
km.fit(X)
print("done in %0.3fs" % (time() - t0))
print()
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(k):
    print("Cluster %d:" % (i + 1), end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
How do the results compare to NIST's listed subject areas?
How would you operationalize this model?
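One simple way to operationalize the model is to assign new, unseen headlines to the learned clusters. The sketch below assumes the fitted vectorizer and km objects from above; the headline itself is a made-up example, not real NIST news:
# Transform a new headline with the same vectorizer, then predict its cluster.
new_headline = ["NIST researchers improve atomic clock accuracy"]  # hypothetical example
new_X = vectorizer.transform(new_headline)
cluster_id = km.predict(new_X)[0]
print("Assigned to cluster %d" % (cluster_id + 1))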