In [ ]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb
sb.set_style('whitegrid')
import requests
import json
import re
from bs4 import BeautifulSoup
import string
import nltk
import networkx as nx
Load the data from disk into memory.
In [ ]:
with open('potus_wiki_bios_cleaned.json','r') as f:
    bios = json.load(f)
Confirm there are 44 presidents (shaking fist at Grover Cleveland) in the dictionary.
In [ ]:
print("There are {0} biographies of presidents.".format(len(bios)))
What's an example of a single biography? We access the dictionary by passing the key (President's name), which returns the value (the text of the biography).
In [ ]:
example = bios['Grover Cleveland']
print(example)
Get some metadata about the U.S. Presidents.
In [ ]:
presidents_df = pd.DataFrame(requests.get('https://raw.githubusercontent.com/hitch17/sample-data/master/presidents.json').json())
presidents_df = presidents_df.set_index('president')
presidents_df['wikibio words'] = pd.Series({bio_name:len(bio_text) for bio_name,bio_text in bios.items()})
presidents_df.head()
A basic exploratory scatterplot of the number of words in each President's biography against their position in the sequence of presidents (the 'number' column).
In [ ]:
presidents_df.plot.scatter(x='number',y='wikibio words')
We can create a document-term matrix where the rows are our 44 presidential biographies, the columns are the terms (words), and the values in the cells are the word counts: the number of times that document contains that word. This is the "term frequency" (TF) part of TF-IDF.
The IDF part of TF-IDF is the "inverse document frequency". The intuition is that words that occur frequently within a single document but are infrequent across the corpus of documents should receive a higher weight: these words carry greater relative meaning. Conversely, words that are used frequently across documents are down-weighted.
The image below has documents as columns and terms as rows.
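To make the weighting concrete, here is a minimal sketch on a tiny made-up corpus (not the presidential biographies). It uses scikit-learn's TfidfVectorizer, which combines the counting and IDF-weighting steps: a word that appears in every toy document ("president") ends up with a lower weight than a word that is distinctive to a single document ("tariff").
In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus: "president" occurs in every document, "tariff" in only one
toy_docs = ['president signs bill',
            'president vetoes bill',
            'president proposes tariff']

toy_vect = TfidfVectorizer()
toy_tfidf = toy_vect.fit_transform(toy_docs)

# In the third row, "tariff" receives a higher weight than "president"
pd.DataFrame(toy_tfidf.toarray(), columns=toy_vect.get_feature_names()).round(2)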
In [ ]:
# Import the libraries from scikit-learn
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
count_vect = CountVectorizer()
# Compute the word counts -- it expects a big string, so join our cleaned words back together
bio_counts = count_vect.fit_transform([' '.join(bio) for bio in bios.values()])
# Compute the TF-IDF for the word counts from each biography
bio_tfidf = TfidfTransformer().fit_transform(bio_counts)
# Convert from sparse matrix to dense array representation
bio_tfidf_dense = bio_tfidf.toarray()
Once we have the TF-IDF scores for every word in each president's biography, we can make a text similarity network. Multiplying the document-term matrix by its transpose returns the cosine similarities between documents: TfidfTransformer L2-normalizes each row by default, so the dot product of two rows is their cosine similarity. We can also import cosine_similarity from scikit-learn if you don't believe me (I didn't believe me either). Cosine similarity values closer to 1 indicate that two documents' words have more similar TF-IDF scores, and values closer to 0 indicate their words are more dissimilar.
The goal here is to create a network where nodes are presidents and edges are weighted similarity scores. All text documents will have some minimal similarity, so we can threshold the similarity scores to only those similarities in the top 10% for each president.
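As a quick check of the normalization claim above, each row of the TF-IDF matrix should have (approximately) unit length, which is why plain dot products between rows are already cosine similarities. A one-line sanity check, using the bio_tfidf_dense array from the cell above:
In [ ]:
# Each row of the TF-IDF matrix is L2-normalized, so its norm should be ~1
print(np.linalg.norm(bio_tfidf_dense[0]))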
In [ ]:
# Compute cosine similarity as the document-term matrix times its transpose
pres_pres_df = pd.DataFrame(bio_tfidf_dense @ bio_tfidf_dense.T)

# If you don't believe me that cosine similarity is the document-term matrix times its transpose
from sklearn.metrics.pairwise import cosine_similarity
pres_pres_df = pd.DataFrame(cosine_similarity(bio_tfidf_dense))
# Filter for edges in the 90th percentile or greater
pres_pres_filtered_df = pres_pres_df[pres_pres_df >= pres_pres_df.quantile(.9)]
# Reshape and filter data
edgelist_df = pres_pres_filtered_df.stack().reset_index()
edgelist_df = edgelist_df[(edgelist_df[0] != 0) & (edgelist_df['level_0'] != edgelist_df['level_1'])]
# Rename and replace data
edgelist_df.rename(columns={'level_0':'from','level_1':'to',0:'weight'},inplace=True)
edgelist_df.replace(dict(enumerate(bios.keys())),inplace=True)
# Inspect
edgelist_df.head()
We read this pandas edgelist into networkx using from_pandas_edgelist, report out some basic descriptives about the network, and write the graph object to file in case we want to visualize it in a dedicated network visualization package like Gephi.
In [ ]:
# Convert from edgelist to a graph object
g = nx.from_pandas_edgelist(edgelist_df,source='from',target='to',edge_attr=['weight'])
# Report out basic descriptives
print("There are {0:,} nodes and {1:,} edges in the network.".format(g.number_of_nodes(),g.number_of_edges()))
# Write graph object to disk for visualization
nx.write_gexf(g,'bio_similarity.gexf')
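If you want a couple more descriptives beyond the node and edge counts, here is a small optional sketch (not part of the original pipeline) using networkx's density and weighted degree to see which president's biography is most similar to the rest.
In [ ]:
# A couple of optional extra descriptives (a sketch, not part of the original pipeline)
print("Density: {0:.3f}".format(nx.density(g)))

# Which president's biography has the largest total similarity to the others?
weighted_degree = dict(g.degree(weight='weight'))
print(max(weighted_degree, key=weighted_degree.get))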
Since this is a small and sparse network, we can try to use Matplotlib to visualize it instead. I would only use the nx.draw functionality for small networks like this one.
In [ ]:
# Plot the nodes as a spring layout
#g_pos = nx.layout.fruchterman_reingold_layout(g, k = 5, iterations=10000)
g_pos = nx.layout.kamada_kawai_layout(g)
# Draw the graph
f,ax = plt.subplots(1,1,figsize=(10,10))
nx.draw(G = g,
ax = ax,
pos = g_pos,
with_labels = True,
node_size = [dc*(len(g) - 1)*100 for dc in nx.degree_centrality(g).values()],
font_size = 10,
font_weight = 'bold',
width = [d['weight']*10 for i,j,d in g.edges(data=True)],
node_color = 'tomato',
edge_color = 'grey'
)
In [ ]:
# Load the data
with open('sp500_wiki_articles.json','r') as f:
    sp500_articles = json.load(f)

# Bring in the text_preprocessor we wrote from Day 4, Lecture 1
# (all_stopwords and lemmatizer are assumed to match that lecture: NLTK's English
#  stopwords plus punctuation, and the WordNet lemmatizer; you may need
#  nltk.download('stopwords') and nltk.download('wordnet') the first time)
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
all_stopwords = set(stopwords.words('english')) | set(string.punctuation)
lemmatizer = WordNetLemmatizer().lemmatize

def text_preprocessor(text):
    """Takes a large string (document) and returns a list of cleaned tokens"""
    tokens = nltk.wordpunct_tokenize(text)
    clean_tokens = []
    for t in tokens:
        if t.lower() not in all_stopwords and len(t) > 2:
            clean_tokens.append(lemmatizer(t.lower()))
    return clean_tokens

# Clean each article
cleaned_sp500 = {}
for name,text in sp500_articles.items():
    cleaned_sp500[name] = text_preprocessor(text)

# Save to disk
with open('sp500_wiki_articles_cleaned.json','w') as f:
    json.dump(cleaned_sp500,f)
Step 2: Compute the TFIDF matrix for the S&P 500 companies.
In [ ]:
# Compute the word counts
sp500_counts =
# Compute the TF-IDF for the word counts from each biography
sp500_tfidf =
# Convert from sparse to dense array representation
sp500_tfidf_dense =
Step 3: Compute the cosine similarities.
In [ ]:
# Compute cosine similarity
company_company_df =
# Filter for edges in the 90th percentile or greater
company_company_filtered_df =
# Reshape and filter data
sp500_edgelist_df =
sp500_edgelist_df =
# Rename and replace data
sp500_edgelist_df.rename(columns={'level_0':'from','level_1':'to',0:'weight'},inplace=True)
sp500_edgelist_df.replace(dict(enumerate(sp500_articles.keys())),inplace=True)
# Inspect
sp500_edgelist_df.head()
Step 4: Visualize the resulting network.
In [ ]:
We used TF-IDF vectors of documents and cosine similarities between these document vectors as a way of representing similarity in the networks above. However, TF-IDF scores are simply (normalized) word frequencies: they do not capture semantic information. A vector space model like the popular Word2Vec represents each token (word) in a high-dimensional space (here we'll use 100 dimensions) that is trained from some (ideally) large corpus of documents. Ideally, tokens that are used in similar contexts end up in similar locations in this high-dimensional space. Once we have vectorized words into this space, we can efficiently perform a variety of operations, such as computing similarities between words or applying transformations that find analogies.
I lack the expertise and we lack the time to get into the math behind these methods, but here are some helpful tutorials I've found:
We'll use the 44 Presidential biographies as a small and specific corpus. We start by training a bios_model from the list of biographies using hyperparameters for the number of dimensions (size), the number of surrounding words used as context during training (window), and the minimum number of times a word must occur to be included in the vocabulary (min_count).
In [ ]:
from gensim.models import Word2Vec
bios_model = Word2Vec(bios.values(),size=100,window=10,min_count=8)
Each word in the vocabulary exists as an N-dimensional vector, where N is the "size" hyperparameter set in the model. The "congress" token is located at this position in the 100-dimensional space we trained in bios_model.
In [ ]:
bios_model.wv['congress']
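The most_similar and similarity methods used below report, under the hood, cosine similarities between these vectors. As a quick sanity check (a sketch assuming "congress" and "war" both cleared the min_count threshold and are in the vocabulary), we can reproduce gensim's similarity score with NumPy:
In [ ]:
# Cosine similarity between two word vectors, computed by hand
v1 = bios_model.wv['congress']
v2 = bios_model.wv['war']
manual_similarity = np.dot(v1,v2)/(np.linalg.norm(v1)*np.linalg.norm(v2))

# Should (approximately) match gensim's built-in similarity method
print(manual_similarity, bios_model.wv.similarity('congress','war'))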
In [ ]:
bios_model.wv.most_similar('congress')
In [ ]:
bios_model.wv.most_similar('court')
In [ ]:
bios_model.wv.most_similar('war')
In [ ]:
bios_model.wv.most_similar('election')
There's a doesnt_match method that predicts which word in a list doesn't match the other words in the list. Sometimes the results are predictable/trivial.
In [ ]:
bios_model.wv.doesnt_match(['democrat','republican','whig','panama'])
Other times the results are unexpected/interesting.
In [ ]:
bios_model.wv.doesnt_match(['canada','mexico','cuba','japan','france'])
One of the most powerful implications of having these vectorized embeddings of word meanings is the ability to do arithmetic-like operations on the vectors that recover or reveal interesting semantic relationships. The classic example is Man:Woman::King:Queen:
What are some examples of these vector similarities from our trained model?
republican - slavery = democrat - X
-(republican - slavery) + democrat = X
slavery + democrat - republican = X
In [ ]:
bios_model.wv.most_similar(positive=['democrat','slavery'],negative=['republican'])
In [ ]:
bios_model.wv.most_similar(positive=['republican','labor'],negative=['democrat'])
Finally, you can use the similarity method to return the similarity between two terms. In our trained model, "britain" and "france" are more similar to each other than "mexico" and "canada".
In [ ]:
bios_model.wv.similarity('republican','democrat')
In [ ]:
bios_model.wv.similarity('mexico','canada')
In [ ]:
bios_model.wv.similarity('britain','france')
Step 1: Open the "sp500_wiki_articles_cleaned.json" file you previously saved of the cleaned S&P 500 company article content, or use a text preprocessor on "sp500_wiki_articles.json" to generate a dictionary of cleaned article content. Train a sp500_model using the Word2Vec model on the values of the cleaned company article content. You can use the default hyperparameters for size, window, and min_count, or experiment with alternative values.
In [ ]:
Step 2: Using the most_similar method, explore some similarities this model has learned for salient tokens about companies (e.g., "board", "controversy", "executive", "investigation"). Use the positive and negative options to explore different analogies. Using the doesnt_match method, experiment with word combinations to discover predictable and unexpected exceptions. Using the similarity method, identify interesting similarity scores.
In [ ]:
Material from this segment is adapted from Jake Vanderplas's "Python Data Science Handbook" notebooks and Kevyn Collins-Thompson's "Applied Machine Learning in Python" module on Coursera.
In the TF-IDF matrix, we have over 17,000 dimensions (corresponding to the unique tokens) for each of the 44 presidential biographies. This data is large and sparse, which makes it hard to visualize. Ideally we'd have only two dimensions of data for a task like visualization.
Dimensionality reduction encompasses a set of methods like principal component analysis, multidimensional scaling, and more advanced "manifold learning" that reduces high-dimensional data down to fewer dimensions. For the purposes of visualization, we typically want 2 dimensions. These methods use a variety of different assumptions and modeling approaches. If you want to understand the differences between them, you'll likely need to find a graduate-level machine learning course.
Let's compare what each of these does on our presidential TF-IDF matrix. The goal here is to understand that there are different methods for dimensionality reduction and that each generates different components and/or clusters that you'll need to interpret.
In [ ]:
print(bio_tfidf_dense.shape)
bio_tfidf_dense
Principal component analysis (PCA) is probably one of the most widely-used and efficient methods for dimensionality reduction.
In [ ]:
# Step 1: Choose a class of models
from sklearn.decomposition import PCA
# Step 2: Instantiate the model
pca = PCA(n_components=2)
# Step 3: Arrange the data into features matrices
# Already done
# Step 4: Fit the model to the data
pca.fit(bio_tfidf_dense)
# Step 5: Evaluate the model
X_pca = pca.transform(bio_tfidf_dense)
# Visualize
f,ax = plt.subplots(1,1,figsize=(10,10))
ax.scatter(X_pca[:,0],X_pca[:,1])
ax.set_title('PCA')
for i,txt in enumerate(bios.keys()):
    if txt == 'Barack Obama':
        ax.annotate(txt,(X_pca[i,0],X_pca[i,1]),color='blue',fontweight='bold')
    elif txt == 'Donald Trump':
        ax.annotate(txt,(X_pca[i,0],X_pca[i,1]),color='red',fontweight='bold')
    else:
        ax.annotate(txt,(X_pca[i,0],X_pca[i,1]))
Multi-dimensional scaling is another common technique in the social sciences.
In [ ]:
# Step 1: Choose your model class(es)
from sklearn.manifold import MDS
# Step 2: Instantiate your model class(es)
mds = MDS(n_components=2,metric=False,n_jobs=-1)
# Step 3: Arrange data into features matrices
# Done!
# Step 4: Fit the data and transform
X_mds = mds.fit_transform(bio_tfidf_dense)
# Plot the data
f,ax = plt.subplots(1,1,figsize=(10,10))
ax.scatter(X_mds[:,0],X_mds[:,1])
ax.set_title('Multi-Dimensional Scaling')
for i,txt in enumerate(bios.keys()):
    if txt == 'Barack Obama':
        ax.annotate(txt,(X_mds[i,0],X_mds[i,1]),color='blue',fontweight='bold')
    elif txt == 'Donald Trump':
        ax.annotate(txt,(X_mds[i,0],X_mds[i,1]),color='red',fontweight='bold')
    else:
        ax.annotate(txt,(X_mds[i,0],X_mds[i,1]))
Isomap is an extension of MDS.
In [ ]:
# Step 1: Choose your model class(es)
from sklearn.manifold import Isomap
# Step 2: Instantiate your model class(es)
iso = Isomap(n_neighbors = 5, n_components = 2)
# Step 3: Arrange data into features matrices
# Done!
# Step 4: Fit the data and transform
X_iso = iso.fit_transform(bio_tfidf_dense)
# Plot the data
f,ax = plt.subplots(1,1,figsize=(10,10))
ax.scatter(X_iso[:,0],X_iso[:,1])
ax.set_title('IsoMap')
for i,txt in enumerate(bios.keys()):
    if txt == 'Barack Obama':
        ax.annotate(txt,(X_iso[i,0],X_iso[i,1]),color='blue',fontweight='bold')
    elif txt == 'Donald Trump':
        ax.annotate(txt,(X_iso[i,0],X_iso[i,1]),color='red',fontweight='bold')
    else:
        ax.annotate(txt,(X_iso[i,0],X_iso[i,1]))
Spectral embedding does interesting things with the eigenvectors of a similarity matrix: it builds a graph Laplacian from an affinity matrix and uses its leading eigenvectors as the low-dimensional coordinates.
In [ ]:
# Step 1: Choose your model class(es)
from sklearn.manifold import SpectralEmbedding
# Step 2: Instantiate your model class(es)
se = SpectralEmbedding(n_components = 2)
# Step 3: Arrange data into features matrices
# Done!
# Step 4: Fit the data and transform
X_se = se.fit_transform(bio_tfidf_dense)
# Plot the data
f,ax = plt.subplots(1,1,figsize=(9,6))
ax.scatter(X_se[:,0],X_se[:,1])
ax.set_title('Spectral Embedding')
for i,txt in enumerate(bios.keys()):
    if txt == 'Barack Obama':
        ax.annotate(txt,(X_se[i,0],X_se[i,1]),color='blue',fontweight='bold')
    elif txt == 'Donald Trump':
        ax.annotate(txt,(X_se[i,0],X_se[i,1]),color='red',fontweight='bold')
    else:
        ax.annotate(txt,(X_se[i,0],X_se[i,1]))
Locally Linear Embedding (LLE) is yet another dimensionality reduction method, though not my favorite to date: it is expensive to compute and, in my experience, rarely produces meaningful clusters as output.
In [ ]:
# Step 1: Choose your model class(es)
from sklearn.manifold import LocallyLinearEmbedding
# Step 2: Instantiate your model class(es)
lle = LocallyLinearEmbedding(n_components = 2,n_jobs=-1)
# Step 3: Arrange data into features matrices
# Done!
# Step 4: Fit the data and transform
X_lle = lle.fit_transform(bio_tfidf_dense)
# Plot the data
f,ax = plt.subplots(1,1,figsize=(9,6))
ax.scatter(X_lle[:,0],X_lle[:,1])
ax.set_title('Locally Linear Embedding')
for i,txt in enumerate(bios.keys()):
    if txt == 'Barack Obama':
        ax.annotate(txt,(X_lle[i,0],X_lle[i,1]),color='blue',fontweight='bold')
    elif txt == 'Donald Trump':
        ax.annotate(txt,(X_lle[i,0],X_lle[i,1]),color='red',fontweight='bold')
    else:
        ax.annotate(txt,(X_lle[i,0],X_lle[i,1]))
t-Distributed Stochastic Neighbor Embedding (t-SNE) is ubiquitous for visualizing word or document embeddings. It can be expensive to run, but it does a great job recovering clusters. There are some hyperparameters, particularly "perplexity", that you'll need to tune to get things to look interesting.
Wattenberg, Viégas, and Johnson have an outstanding interactive tool ("How to Use t-SNE Effectively," published in Distill) visualizing how t-SNE's different parameters influence the layout, as well as good advice on how to make the best of it.
In [ ]:
# Step 1: Choose your model class(es)
from sklearn.manifold import TSNE
# Step 2: Instantiate your model class(es)
tsne = TSNE(n_components = 2, init='pca', random_state=42, perplexity=11)
# Step 3: Arrange data into features matrices
# Done!
# Step 4: Fit the data and transform
X_tsne = tsne.fit_transform(bio_tfidf_dense)
# Plot the data
f,ax = plt.subplots(1,1,figsize=(10,10))
ax.scatter(X_tsne[:,0],X_tsne[:,1])
ax.set_title('t-SNE')
for i,txt in enumerate(bios.keys()):
    if txt == 'Barack Obama':
        ax.annotate(txt,(X_tsne[i,0],X_tsne[i,1]),color='blue',fontweight='bold')
    elif txt == 'Donald Trump':
        ax.annotate(txt,(X_tsne[i,0],X_tsne[i,1]),color='red',fontweight='bold')
    else:
        ax.annotate(txt,(X_tsne[i,0],X_tsne[i,1]))
Uniform Manifold Approximation and Projection (UMAP) is a new and particularly fast dimensionality reduction method with some comparatively great documentation. Unfortunately, UMAP is so new that it hasn't been incorporated into scikit-learn yet, so you'll need to install it separately from the terminal:
conda install -c conda-forge umap-learn
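If you aren't using conda, umap-learn is also published on PyPI, so installing it with pip should work as well:
pip install umap-learn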
In [ ]:
# Step 1: Choose your model class(es)
from umap import UMAP
# Step 2: Instantiate your model class(es)
umap_ = UMAP(n_components=2, n_neighbors=10, random_state=42)
# Step 3: Arrange data into features matrices
# Done!
# Step 4: Fit the data and transform
X_umap = umap_.fit_transform(bio_tfidf_dense)
# Plot the data
f,ax = plt.subplots(1,1,figsize=(10,10))
ax.scatter(X_umap[:,0],X_umap[:,1])
ax.set_title('UMAP')
for i,txt in enumerate(bios.keys()):
    if txt == 'Barack Obama':
        ax.annotate(txt,(X_umap[i,0],X_umap[i,1]),color='blue',fontweight='bold')
    elif txt == 'Donald Trump':
        ax.annotate(txt,(X_umap[i,0],X_umap[i,1]),color='red',fontweight='bold')
    else:
        ax.annotate(txt,(X_umap[i,0],X_umap[i,1]))
Step 1: Using the sp500_tfidf_dense array/DataFrame, experiment with the different dimensionality reduction tools we covered above. Visualize and inspect the distribution of S&P 500 companies for interesting dimensions (do the X and Y dimensions in this reduced data capture anything meaningful?) or clusters (do companies cluster together as we'd expect?).
In [ ]:
In [ ]:
# Sum the term counts across all 44 biographies and keep the 1,000 most frequent terms
top_words = pd.DataFrame(bio_counts.todense().sum(0).T,
                         index=count_vect.get_feature_names())[0]
top_words = top_words.sort_values(0,ascending=False).head(1000).index.tolist()
For each word in top_words, we get its vector from bios_model, add it to the top_word_vectors list, and cast this list back to a NumPy array. Words that fell below the min_count threshold have no vector, so we skip them and filter top_words so it stays aligned with the rows of the array.
In [ ]:
top_word_vectors = []
kept_words = []
for word in top_words:
    try:
        vector = bios_model.wv[word]
        top_word_vectors.append(vector)
        kept_words.append(word)
    except KeyError:
        # Word fell below the min_count threshold, so it has no vector
        pass

# Keep top_words aligned with the rows of top_word_vectors
top_words = kept_words
top_word_vectors = np.array(top_word_vectors)
We can then use the dimensionality reduction tools we just covered in the previous section to visualize the word similarities. PCA is fast but rarely does a great job with this kind of high-dimensional data: the result is a cloud of points with little discernible structure.
In [ ]:
# Step 1: Choose your model class(es)
# from sklearn.decomposition import PCA
# Step 2: Instantiate the model
pca = PCA(n_components=2)
# Step 3: Arrange data into features matrices
X_w2v = top_word_vectors
# Step 4: Fit the data and transform
X_w2v_pca = pca.fit_transform(X_w2v)
# Plot the data
f,ax = plt.subplots(1,1,figsize=(10,10))
ax.scatter(X_w2v_pca[:,0],X_w2v_pca[:,1],s=3)
ax.set_title('PCA')
for i,txt in enumerate(top_words):
    if i%10 == 0:
        ax.annotate(txt,(X_w2v_pca[i,0],X_w2v_pca[i,1]))
f.savefig('term_pca.pdf')
t-SNE was more or less engineered for precisely the task of visualizing word embeddings. It likely takes on the order of a minute or more for t-SNE to reduce the top_words embeddings to only two dimensions. Assuming our perplexity and other t-SNE hyperparameters are well-behaved, there should be relatively easy-to-discern clusters of words with similar meanings. You can also open the "term_tsne.pdf" file and zoom in to inspect.
In [ ]:
# Step 1: Choose your model class(es)
from sklearn.manifold import TSNE
# Step 2: Instantiate your model class(es)
tsne = TSNE(n_components = 2, init='pca', random_state=42, perplexity=25)
# Step 3: Arrange data into features matrices
X_w2v = top_word_vectors
# Step 4: Fit the data and transform
X_w2v_tsne = tsne.fit_transform(X_w2v)
# Plot the data
f,ax = plt.subplots(1,1,figsize=(10,10))
ax.scatter(X_w2v_tsne[:,0],X_w2v_tsne[:,1],s=3)
ax.set_title('t-SNE')
for i,txt in enumerate(top_words):
    if i%10 == 0:
        ax.annotate(txt,(X_w2v_tsne[i,0],X_w2v_tsne[i,1]))
f.savefig('term_tsne.pdf')
UMAP is faster and I think better, but you'll need to make sure this is installed on your system since it doesn't come with scikit-learn or Anaconda by default. Words like "nominee" and "campaign" or the names of the months cluster clearly together apart from the rest.
In [ ]:
# Step 1: Choose your model class(es)
from umap import UMAP
# Step 2: Instantiate your model class(es)
umap_ = UMAP(n_components=2, n_neighbors=5, random_state=42)
# Step 3: Arrange data into features matrices
# Done!
# Step 4: Fit the data and transform
X_w2v_umap = umap_.fit_transform(X_w2v)
# Plot the data
f,ax = plt.subplots(1,1,figsize=(10,10))
ax.scatter(X_w2v_umap[:,0],X_w2v_umap[:,1],s=3)
ax.set_title('UMAP')
for i,txt in enumerate(top_words):
    if i%10 == 0:
        ax.annotate(txt,(X_w2v_umap[i,0],X_w2v_umap[i,1]))
f.savefig('term_umap.pdf')
In [ ]:
Step 2: Reduce the dimensionality of these top word vectors using PCA, t-SNE, or (if you've installed it) UMAP and visualize the results. What meaningful or surprising clusters do you discover?
In [ ]: