Clustering NIST headlines and descriptions

adapted from https://github.com/star-is-here/open_data_day_dc

Introduction:

In this workshop we show you an example of a workflow in data science from initial data ingestion, cleaning, modeling, and ultimately clustering. In this example we scrape the news feed of the National Institute of Standards and Technology (NIST). For those not in the know, NIST is comprised of multiple research centers which include:

Center for Nanoscale Science and Technology (CNST)
Engineering Laboratory (EL)
Information Technology Laboratory (ITL)
NIST Center for Neutron Research (NCNR)
Material Measurement Laboratory (MML)
Physical Measurement Laboratory (PML)

This makes it an easy target for topic modeling, a way of identifying patterns in a corpus that uses clustering.



In [1]:

    
from lxml import html
import requests
from __future__ import print_function
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, MiniBatchKMeans
from time import time

Get the Data

Building the list of headlines and descriptions

We request NIST news based on the following URL, 'http://www.nist.gov/allnews.cfm?s=01-01-2014&e=12-31-2014'. For this workshop, we look at only 2014 news articles posted on the NIST website.

We then pass that retrieved content to our HTML parser and search for a specific div class, "select_portal_module_wrapper" which is assigned to every headline and every description (headlines receive a strong tag and descriptions receive a p tag).

We then merge both the headline and description into one entry in the list because we don't need to differentiate between title and description.



In [24]:

    
print("Retrieving data from NIST...")

# Retrieve the data from the web page.
page = requests.get('https://www.nist.gov/news-events/news/search?combine=&field_campus_tid=All&term_node_tid_depth_1=All&date_filter%5Bmin%5D%5Bdate%5D=January+01%2C+2016&date_filter%5Bmax%5D%5Bdate%5D=June+30%2C+2016&items_per_page=200') 

# Use html module to parse it out and store in tree.
tree = html.fromstring(page.content)

# Create list of news headlines and descriptions. 
# This required obtaining the xpath of the elements by examining the web page.
list_of_headlines = tree.xpath('//h3[@class="nist-teaser__title"]/a/text()')
list_of_descriptions = tree.xpath('//div[@class="field-body field--body nist-body nist-teaser__content"]/text()')

#Combine each headline and description into one value in a list
news=[]
for each_headline in list_of_headlines:
    for each_description in list_of_descriptions:
        news.append(each_headline+each_description)

print("Last item in list retrieved: %s" % news[-1])









    



Retrieving data from NIST...
Last item in list retrieved: Evaluating Investments in Community Resilience: New Guide Explains How
    Communities weighing choices for capital improvement projects intended to improve their resilience to severe weather,...

Term Frequency-Inverse Document Frequency

The weight of a term that occurs in a document is proportional to the term frequency.
Term frequency is the number of times a term occurs in a document.
Inverse document frequency diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.

Convert collection of documents to TF-IDF matrix

We now call a TF-IDF vectorizer to create a sparse matrix with term frequency-inverse document frequency weights:



In [25]:

    
print("Extracting features from the training dataset using a sparse vectorizer")
t0 = time()

# Create a sparse word occurrence frequency matrix of the most frequent words
# with the following parameters:
# Maximum document frequency = half the total documents
# Minimum document frequency = two documents
# Toss out common English stop words.
vectorizer = TfidfVectorizer(input=news, max_df=0.5, min_df=2, stop_words='english')

# This calculates the counts 
X = vectorizer.fit_transform(news) 

print("done in %fs" % (time() - t0))
print("n_samples: %d, n_features: %d" % X.shape)
print()









    



Extracting features from the training dataset using a sparse vectorizer
done in 0.393537s
n_samples: 17161, n_features: 1243

Let's do some clustering!

I happen to know there are 15 subject areas at NIST:

- Bioscience & Health
- Building and Fire Research
- Chemistry
- Electronics & Telecommunications
- Energy
- Environment/Climate
- Information Technology
- Manufacturing
- Materials Science
- Math
- Nanotechnology
- Physics
- Public Safety & Security
- Quality
- Transportation

So, why don't we cheat and set the number of clusters to 15?

Then we call the KMeans clustering model from sklearn and set an upper bound to the number of iterations for fitting the data to the model.

Finally we list out each centroid and the top 10 terms associated with each centroid.



In [26]:

    
# Set the number of clusters to 15
k = 15

# Initialize the kMeans cluster model.
km = KMeans(n_clusters=k, init='k-means++', max_iter=100)

print("Clustering sparse data with %s" % km)
t0 = time()

# Pass the model our sparse matrix with the TF-IDF counts.
km.fit(X)
print("done in %0.3fs" % (time() - t0))
print()

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

terms = vectorizer.get_feature_names()
for i in range(k):
    print("Cluster %d:" % (i+1), end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()









    



Clustering sparse data with KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=15, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)
done in 8.552s

Top terms per cluster:
Cluster 1: save alarmingly count millions microbes lives arsenal antimicrobial present butler
Cluster 2: researchers institute developed communities working technology new national standards county
Cluster 3: form essential quantum frequency communication converting awaited arrives finally incorporate
Cluster 4: boulder labs march closed storm conferences snow continuing 23 thursday
Cluster 5: commerce technology standards department secretary institute today md national announced
Cluster 6: new measurement devices physical time kind world cybersecurity researchers study
Cluster 7: baldrige excellence program executive using performance health framework malcolm organizations
Cluster 8: centers mep states manufacturing hollings extension refresh multiyear volunteer common
Cluster 9: energy hosted zero net just manufacturing 2016 plastic intensive branch
Cluster 10: standards institute national technology new version designed developing document million
Cluster 11: large laser jila combing complex identify faire washington maker method
Cluster 12: blizzard continues access cleanup gaithersburg limited weekend manufacturers closed robot
Cluster 13: scientists quantum new gases employees particles greenhouse physics experiment telework
Cluster 14: institutes opportunity release funding nnmi announces manufacturing innovation science center
Cluster 15: currently recipients official million annual learn flu surprised including workhorse

Questions

How do the results compare to NIST's listed subject areas?
How would you operationalize this model?