This notebook analyzes PyCon talk data scraped from the conference website, using a few natural language processing techniques to uncover latent topics and build a list of similar talks for each PyCon talk.
The rest of the project can be found on github: https://github.com/mikecunha/pycon_reco
TOC
Load two years' worth of talks and concatenate a few fields together to form a bag of words (BOW) for each talk.
In [1]:
import pandas as pd
from collections import defaultdict
from datetime import datetime
In [2]:
talks1 = pd.read_csv( 'data/pycon_talks_2015.csv', sep="\t" )
talks1['start_dt'] = pd.to_datetime( talks1.start_dt )
talks2 = pd.read_csv( 'data/pycon_talks_2014.csv', sep='\t' )
talks2['start_dt'] = pd.to_datetime( talks2.start_dt )
talks = pd.concat([talks1, talks2], ignore_index=True)
talks.info()
In [3]:
talks.tail()
Out[3]:
In [4]:
# Code below depends on talks being sorted with current year talks first
cur_year_max_ID = talks[ talks.start_dt >= datetime(2015,1,1) ].index[-1] + 1
cur_year_max_ID
Out[4]:
Build a bag of words for each document
In [5]:
documents = []
for ind, talk in talks.fillna('').iterrows():
    documents.append(' '.join([talk['title'],
                               talk['desc'],
                               talk['abstract']]))

print("Read a corpus of %d documents" % len(documents))
TOC
Instead of examining the top documents and words in each topic by eye every time a model parameter is adjusted, define a quick, repeatable measure. The metric should be meaningful in terms of how the topic model output will actually be used, i.e. can we make good recommendations?
Note: these groups depend on the talks being in the same order (same DataFrame index) as when I created them.
In [6]:
# Hand-make some clusters of talks with similar topics
hand_made_groups = [
('Databases', set([50,60,80,83,84,24])),
('Django', set([5,40,43,59,101])),
('Docker', set([74,91])),
('Deployment', set([13,31,33,53,56,71,82,110,112,])),
('Systems', set([9,10,53,71,118,120,])),
('Web Apps', set([51,56,89,129,])),
('modules, packages', set([27,20,33,109])),
('API', set([43,56,124,130])),
('Machine Learning, modeling', set([50,52,72,78,97,104,116,125,])),
('Testing', set([21,39,59,64,65])),
('Open Source', set([70,99,122])),
('Python Language, Learning', set([2,7,11,16,19,29,47,67,69,79,90,98,100,126,134,135,137])),
('Data Science', set([8,22,24,25,26,30,52,57,68,72,76,78,81,88,97,102,104,116,125,132,136])),
('Math', set([37,26,57,132,])),
('Graphs', set([8,48,76])),
('Games', set([23,38,])),
]
# put it in context of each doc
labeled_docs = defaultdict(set)
for topic, group in hand_made_groups:
    for doc in group:
        labeled_docs[doc].update(group - {doc})
In [7]:
def score(make_preds):
    num_docs = len(labeled_docs)
    true_pos = 0
    false_pos = 0
    true_neg = 0
    false_neg = 0
    for doc in labeled_docs.keys():
        # get predictions for this doc
        preds = make_preds(doc)
        tp = len(set(preds).intersection(labeled_docs[doc]))
        true_pos += tp
        fp = len(set(preds).difference(labeled_docs[doc]))
        false_pos += fp
        fn = len(labeled_docs[doc].difference(set(preds)))
        false_neg += fn
        true_neg += (num_docs - (tp + fp + fn))
    accuracy = (true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg)
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f_one = 2 * ((precision * recall) / (precision + recall))
    print("accuracy: %0.2f \nprecision: %0.2f \nrecall: %0.2f \nF1: %0.2f" % (accuracy, precision, recall, f_one))
    return accuracy, precision, recall, f_one
TOC
Test the scoring function and create a baseline score by guessing that the next doc in the index is related.
In [8]:
acc, prec, rec, f_one = score( lambda x: [x+1])
In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn import decomposition
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
In [10]:
combined_stops = list(ENGLISH_STOP_WORDS) + ['tutorial', 'also', 'get', "we'll", 'll', 'code', '&' ]
ENGLISH_STOP_WORDS = frozenset( combined_stops )
In [11]:
tfidf = TfidfVectorizer(stop_words=ENGLISH_STOP_WORDS,
lowercase=True,
strip_accents="unicode",
use_idf=True,
norm="l2",
min_df = 6, # appears in >= X docs
max_df = 0.5, # appears in <= Y% of docs
ngram_range=(1,2), # use bigrams
)
A = tfidf.fit_transform(documents)
print ("Created document-term matrix of size %d x %d" % (A.shape[0],A.shape[1]) )
Save a mapping from feature indices to the words they represent
In [12]:
num_terms = len(tfidf.vocabulary_)
terms = [""] * num_terms
for term in tfidf.vocabulary_.keys():
    terms[ tfidf.vocabulary_[term] ] = term
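Side note: newer scikit-learn releases expose this index-to-term mapping directly on the vectorizer. Assuming a version that provides get_feature_names_out (1.0+), the loop above could be replaced with a one-liner:
In [ ]:
# equivalent to the loop above on recent scikit-learn versions
terms = list(tfidf.get_feature_names_out())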
Run the factorization to produce the factors W (documents x topics) and H (topics x terms)
In [13]:
num_tops = 22
model = decomposition.NMF(init="nndsvd",
n_components=num_tops,
max_iter=400,
tol=0.0001,
)
W = model.fit_transform(A)
H = model.components_
print ("Generated factor W of size %s and factor H of size %s" % ( str(W.shape), str(H.shape) ) )
For each talk in 2015 (the first 138 docs), calculate the cosine similarity between it and all the other 2015 documents, based on the topic weights each is associated with.
In [14]:
doc_sims = cosine_similarity( W[:cur_year_max_ID,:], W[:cur_year_max_ID,:] )
doc_sims.shape
Out[14]:
Build an index of recommended talks for each talk, incorporating a minimum similarity threshold and a maximum number of related talks.
Experimenting with different combinations of similarity threshold and top-n related docs gives pretty good control over precision vs. recall and can be tailored to the recommender application (a small sweep over these two knobs is sketched after the next cell).
In [15]:
NMF_lookup = {}
sim_thresh = 0.74 # min similarity docs need to be to recommend (0.0-1.0)
n_docs = 15 # how many docs above threshold to keep
for doc_key in range(doc_sims.shape[0]):
    # slice the row of the similarity matrix for this doc
    similarities = doc_sims[doc_key, :]
    # sort related docs by score, descending
    related = sorted(enumerate(similarities), key=lambda tup: tup[1], reverse=True)
    # drop the doc's similarity to itself and any docs under the relevance threshold
    related = [doc_id for doc_id, sim in related if doc_id != doc_key and sim >= sim_thresh]
    # only keep the top-N recommendations
    related = related[:n_docs]
    NMF_lookup[doc_key] = related

# Score it
def make_NMF_pred(doc_id):
    return [doc_num for doc_num in NMF_lookup[doc_id]]

a, p, r, f = score(make_NMF_pred)
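The sweep mentioned above, as a minimal sketch: rebuild the lookup for a few illustrative (not tuned) combinations of threshold and top-n, and reuse the score function to see how precision and recall move.
In [ ]:
# sweep the similarity threshold and top-n cutoff to see the precision/recall trade-off
for thresh in (0.5, 0.6, 0.7, 0.8):
    for top_n in (5, 10, 15):
        lookup = {}
        for doc_key in range(doc_sims.shape[0]):
            ranked = sorted(enumerate(doc_sims[doc_key, :]), key=lambda tup: tup[1], reverse=True)
            lookup[doc_key] = [d for d, sim in ranked if d != doc_key and sim >= thresh][:top_n]
        print("threshold=%0.2f, top_n=%d" % (thresh, top_n))
        score(lambda doc_id: lookup[doc_id])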
In [16]:
import numpy as np
In [17]:
for topic_index in range(H.shape[0]):
    top_indices = np.argsort(H[topic_index, :])[::-1][0:10]
    term_ranking = [terms[i] for i in top_indices]
    print("Topic %d: %s" % (topic_index, ", ".join(term_ranking)))
In [18]:
topics_to_show = 2
for topic_index in range(min(W.shape[1], topics_to_show)):
    top_indices = np.argsort(W[:cur_year_max_ID, topic_index])[::-1][0:15]
    term_ranking = [(talks.ix[i].title, W[i, topic_index]) for i in top_indices]
    print("Topic %d:" % (topic_index))
    for t in term_ranking:
        print(t)
    print()
TOC
Save the index to a file that a web service can load;
e.g. the 2nd line is a comma-separated list of the docs related to the 2nd talk.
In [19]:
lines = []
for k in range(len(NMF_lookup.keys())):
    line = []
    for doc_num in NMF_lookup[k]:
        line.append(str(doc_num))
    lines.append(','.join(line))

with open('app/rel_talks.txt', mode='wt', encoding='utf-8') as myfile:
    myfile.write('\n'.join(lines))
In [20]:
!head app/rel_talks.txt
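For reference, a minimal sketch of how a consumer could read that file back into the same dict-of-lists shape (this just mirrors the format written above; it is not the actual web-service code):
In [ ]:
# line i holds a comma-separated list of the doc ids related to talk i
with open('app/rel_talks.txt', encoding='utf-8') as f:
    loaded = {i: [int(x) for x in line.split(',') if x]
              for i, line in enumerate(f.read().splitlines())}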
TOC
Tokenize, make bigrams, remove stop words
In [21]:
import re
import itertools
from gensim import corpora, models, similarities
from nltk import bigrams
from nltk.corpus import stopwords
from nltk import PorterStemmer
stemmer = PorterStemmer()
# Replace dashes and most other punctuation with spaces
def remove_dashes(text):
    rx = re.compile(u'([\u2014\-\,\u2019,\n]|\.|\(|\)|\:|;|/|\[|\])', flags=re.UNICODE)
    new_text = rx.sub(" ", text)
    return new_text
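A quick check of the cleaning function on a made-up string (the example text is purely illustrative):
In [ ]:
# dashes and most punctuation become spaces before the text is split into words
print(remove_dashes("test-driven development (TDD): a how-to"))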
Make it as painless as possible to try different ways of preprocessing the talks
In [22]:
def prep_docs(documents, stem_words=True, cur_yr_only=True, use_bigrams=False, min_word_freq=5, tfidf=False):
    """Clean documents, remove stop words, etc. Returns a gensim corpus
    (one bag-of-words vector per doc) and the dictionary used to build it."""
    stops = stopwords.words('english') + ['python', 'use', 'learn', 'talk',
                                          'discuss', 'program', 'tutorial',
                                          'also', 'get', "we'll", 'll',
                                          'http', 'code', '&']
    if cur_yr_only:
        end_doc = cur_year_max_ID
    else:
        end_doc = len(documents)

    # Split and clean concatenated text
    if stem_words:
        print("Splitting %d docs into words and stemming" % len(documents[:end_doc]))
        stops = [stemmer.stem(word) for word in stops]
        texts_ = [[stemmer.stem(word) for word in remove_dashes(document.lower()).split()]
                  for document in documents[:end_doc]]
    else:
        print("Splitting %d docs into words" % len(documents[:end_doc]))
        texts_ = [[word for word in remove_dashes(document.lower()).split()]
                  for document in documents[:end_doc]]

    # Filter out stop words
    print("Filtering out stop words")
    texts_ = [[word for word in document if word not in stops]
              for document in texts_]

    # Bigrams
    if use_bigrams:
        print("Adding bigrams")
        bigram_texts = []
        for doc in texts_:
            bigram_texts.append(doc + [' '.join(x) for x in bigrams(doc)])
        texts_ = bigram_texts

    # Remove rare words
    if min_word_freq:
        print("Removing rare words")
        all_tokens = list(itertools.chain(*texts_))
        rare_tokens = set(word for word in set(all_tokens) if all_tokens.count(word) < min_word_freq)
        texts_ = [[ngram for ngram in btext if ngram not in rare_tokens]
                  for btext in texts_]

    dictionary = corpora.Dictionary(texts_)
    corpus = [dictionary.doc2bow(text) for text in texts_]

    # average number of n-grams per doc (useful for tuning model params)
    tot = 0
    for t in texts_:
        tot += len(t)
    print("%3.0f average n-grams per doc" % (tot / float(len(texts_))))

    # TF-IDF weighting
    if tfidf:
        print("Converting to TF-IDF vector space")
        tfidf = models.TfidfModel(corpus)
        corpus = tfidf[corpus]

    return corpus, dictionary
In [23]:
def save_BOW(corpus, dictionary, c_path="data/pycon_corpus.mm", d_path="data/pycon.dict"):
    corpora.MmCorpus.serialize(c_path, corpus)
    dictionary.save(d_path)
    return
In [24]:
def load_BOW(c_path="data/pycon_corpus.mm", d_path="data/pycon.dict"):
    corpus = corpora.MmCorpus(c_path)
    dictionary = corpora.Dictionary.load(d_path)
    return corpus, dictionary
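These helpers aren't invoked below, but a round trip with the default paths would look like this (left commented out, since the corpus isn't built until the next cell):
In [ ]:
# persist the gensim corpus/dictionary, then reload them in a later session
#save_BOW(corpus, dictionary)
#corpus, dictionary = load_BOW()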
In [25]:
corpus, dictionary = prep_docs(documents,
stem_words=False,
cur_yr_only=True,
use_bigrams=True,
min_word_freq=False,
tfidf=False )
Define a latent semantic index (LSI) of the documents
In [26]:
nf = 22 # number of topics / features
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=nf, extra_samples=150,)
index = similarities.MatrixSimilarity(lsi[corpus], num_features=nf)
#index.save('data/pycon_2yr_01.index')
#index = similarities.MatrixSimilarity.load('data/pycon_2yr_01.index')
Make a lookup table of related docs and score it
In [27]:
# make a lookup table
lsi_lookup = {}
for doc_key, scores in enumerate(index):
    # sort related docs by score, descending
    related = sorted(list(enumerate(scores)), key=lambda tup: tup[1], reverse=True)
    # drop the doc's similarity to itself and any docs under a threshold of relevance
    related = [doc for doc in related if doc[0] != doc_key and doc[1] >= 0.74]
    # keep a max of 15 related docs, so lists aren't too long
    related = related[:15]
    lsi_lookup[doc_key] = related

def make_LSI_pred(doc_id):
    return [doc_num for doc_num, score in lsi_lookup[doc_id]]

a, p, r, f = score(make_LSI_pred)
In [28]:
corpus, dictionary = prep_docs(documents,
stem_words=True,
cur_yr_only=True,
use_bigrams=True,
min_word_freq=5,
tfidf=False )
In [29]:
nf = 22
lda = models.ldamodel.LdaModel(corpus,
id2word=dictionary,
num_topics=nf,
passes= 10, # > 1 = batch mode; multiple passes over small corpus
iterations=105, # set to approx. number of words per doc
eta=1.5, # symmetric prior over all words
alpha='auto', # optimize alpha to asymmetric vals, using eta as start point
eval_every=0, # set to zero when corp. is small enough
chunksize=10, # approx. 10% of corpus size
gamma_threshold=0.00001, # lower than default 0.001 means more training per doc
)
index2 = similarities.MatrixSimilarity(lda[corpus], num_features=nf)
#index2.save("pycon_2yr-LDA_bigram_sims_v4.index")
In [30]:
# make a lookup table
lda_bigram_lookup = {}
for doc_key, scores in enumerate(index2):
    # sort related docs by score, descending
    related = sorted(list(enumerate(scores)), key=lambda tup: tup[1], reverse=True)
    # drop the doc's similarity to itself and any docs under a threshold of relevance
    related = [doc for doc in related if doc[0] != doc_key and doc[1] >= 0.75]
    # keep a max of 15 related docs, so lists aren't too long
    related = related[:15]
    lda_bigram_lookup[doc_key] = related

# Score
def make_LDA_bigram_pred(doc_id):
    return [doc_num for doc_num, score in lda_bigram_lookup[doc_id]]

a, p, r, f = score(make_LDA_bigram_pred)
In [31]:
%matplotlib inline
import matplotlib.pyplot as plt
from mpld3 import plugins, enable_notebook
import numpy as np
from tsne3 import tsne
In [32]:
enable_notebook()
Get the most likely topic label from the NMF factors for each doc (we have to pick a single topic to color each talk with in the plot)
In [33]:
nmf_labels = []
num_topics = W.shape[1]
for talk_index in range(W.shape[0]):
    nmf_labels.append(np.argsort(W[talk_index, :])[::-1][0])
Get labels for each talk according to the "category" listed on the PyCon website, for comparison
In [34]:
mapping = {}
i = 0
cats = talks.category.unique()
for category in cats:
    mapping[category] = float(i)
    i += 1
print("%d unique categories" % len(cats))
cat_labels = [ mapping[c] for c in talks.category.values ]
Run the t-SNE algorithm on the 2015 documents
In [39]:
Y = tsne(W[:cur_year_max_ID,:],
2, # Final Output Dimensions desired
20, # Dimensions for PCA to pre-process raw data to, same = no PCA
5) # Perplexity, clumped ball = too high
Now plot it
In [40]:
topic_colors = nmf_labels
#topic_colors = cat_labels
fig, ax = plt.subplots()
fig.set_size_inches(12.5,7.5)
points = ax.scatter(Y[:cur_year_max_ID,0], Y[:cur_year_max_ID,1],
c=topic_colors[:cur_year_max_ID],
s = 190,
alpha=0.4,
cmap=plt.cm.jet)
ax.set_title("Pycon Talks by NMF Topic", size=20)
# tooltip = talk title
labels = list(talks.ix[:cur_year_max_ID].title.values)
tooltip = plugins.PointLabelTooltip(points, labels)
plugins.connect(fig, tooltip)
In [ ]: