Clustering NIST headlines and descriptions

_adapted from https://github.com/StarCYing/open_data_day_dc_

Introduction:

In this workshop we show you an example of a workflow in data science from initial data ingestion, cleaning, modeling, and ultimately clustering. In this example we scrape the news feed of the National Institute of Standards and Technology (NIST). For those not in the know, NIST is comprised of multiple research centers which include:

  • Center for Nanoscale Science and Technology (CNST)
  • Engineering Laboratory (EL)
  • Information Technology Laboratory (ITL)
  • NIST Center for Neutron Research (NCNR)
  • Material Measurement Laboratory (MML)
  • Physical Measurement Laboratory (PML)

This makes it an easy target for topic modeling, a way of identifying patterns in a corpus that uses clustering.


In [1]:
from lxml import html
import requests
from __future__ import print_function
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, MiniBatchKMeans
from time import time

Get the Data

Building the list of headlines and descriptions

We request NIST news based on the following URL, 'http://www.nist.gov/allnews.cfm?s=01-01-2014&e=12-31-2014'. For this workshop, we look at only 2014 news articles posted on the NIST website.

We then pass that retrieved content to our HTML parser and search for a specific div class, "select_portal_module_wrapper" which is assigned to every headline and every description (headlines receive a strong tag and descriptions receive a p tag).

We then merge both the headline and description into one entry in the list because we don't need to differentiate between title and description.


In [2]:
print("Retrieving data from NIST...")

# Retrieve the data from the web page.
page = requests.get('http://www.nist.gov/allnews.cfm?s=01-01-2014&e=12-31-2014') 

# Use html module to parse it out and store in tree.
tree = html.fromstring(page.content)

# Create list of news headlines and descriptions. 
# This required obtaining the xpath of the elements by examining the web page.
list_of_headlines = tree.xpath('//div[@class="select_portal_module_wrapper"]/a/strong/text()')
list_of_descriptions = tree.xpath('//div[@class="select_portal_module_wrapper"]/p/text()')

#Combine each headline and description into one value in a list
news=[]
for each_headline in list_of_headlines:
    for each_description in list_of_descriptions:
        news.append(each_headline+each_description)

print("Last item in list retrieved: %s" % news[-1])


Retrieving data from NIST...
Last item in list retrieved: A New NIST Online Database: The NIST Polycyclic Aromatic Hydrocarbon Structure Index Recently, a new website containing a wealth of information on polycyclic aromatic hydrocarbons (PAHs) was made publicly available by NIST. PAHs are compounds that are produced during the … 

Term Frequency-Inverse Document Frequency

  • The weight of a term that occurs in a document is proportional to the term frequency.
  • Term frequency is the number of times a term occurs in a document.
  • Inverse document frequency diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.

Convert collection of documents to TF-IDF matrix

We now call a TF-IDF vectorizer to create a sparse matrix with term frequency-inverse document frequency weights:


In [3]:
print("Extracting features from the training dataset using a sparse vectorizer")
t0 = time()

# Create a sparse word occurrence frequency matrix of the most frequent words
# with the following parameters:
# Maximum document frequency = half the total documents
# Minimum document frequency = two documents
# Toss out common English stop words.
vectorizer = TfidfVectorizer(input=news, max_df=0.5, min_df=2, stop_words='english')

# This calculates the counts 
X = vectorizer.fit_transform(news) 

print("done in %fs" % (time() - t0))
print("n_samples: %d, n_features: %d" % X.shape)
print()


Extracting features from the training dataset using a sparse vectorizer
done in 4.089050s
n_samples: 110224, n_features: 12197

Let's do some clustering!

I happen to know there are 15 subject areas at NIST:

- Bioscience & Health
- Building and Fire Research
- Chemistry
- Electronics & Telecommunications
- Energy
- Environment/Climate
- Information Technology
- Manufacturing
- Materials Science
- Math
- Nanotechnology
- Physics
- Public Safety & Security
- Quality
- Transportation

So, why don't we cheat and set the number of clusters to 15?

Then we call the KMeans clustering model from sklearn and set an upper bound to the number of iterations for fitting the data to the model.

Finally we list out each centroid and the top 10 terms associated with each centroid.


In [4]:
# Set the number of clusters to 15
k = 15

# Initialize the kMeans cluster model.
km = KMeans(n_clusters=k, init='k-means++', max_iter=100)

print("Clustering sparse data with %s" % km)
t0 = time()

# Pass the model our sparse matrix with the TF-IDF counts.
km.fit(X)
print("done in %0.3fs" % (time() - t0))
print()

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

terms = vectorizer.get_feature_names()
for i in range(k):
    print("Cluster %d:" % (i+1), end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()


Clustering sparse data with KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=15, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)
done in 119.277s

Top terms per cluster:
Cluster 1: attack terrorist information work prepare cyber phone hazard united states
Cluster 2: baldrige excellence performance program award malcolm quality organizations penny pritzker
Cluster 3: based technology university institute standards national researchers demonstrated new maryland
Cluster 4: forensic science standards national technology institute new research committees osac
Cluster 5: committee technology visiting agency primary sector national vcat advanced group
Cluster 6: induced silicon pressure resolution force atomic chip photothermal ptir lateral
Cluster 7: cnst make center edition nanoscale science spring recent xps news
Cluster 8: test image left john rpki 15 showing standard click materials
Cluster 9: optical fiber long dark ago laser nistar idea years recent
Cluster 10: draft public comment review standards federal institute technology national issued
Cluster 11: extension hollings partnership manufacturing mep new standards institute technology national
Cluster 12: dimensional metrology pml semiconductor division electron scanning researchers ultra energy
Cluster 13: technology standards institute national released scientists new research 2014 computer
Cluster 14: new used medical time national director 2014 technology center data
Cluster 15: 00 dna sequencer et participate analysts wednesday webinar invited free

Questions

  1. How do the results compare to NIST's listed subject areas?

  2. How would you operationalize this model?