Analysis

This notebook analyzes PyCon talk data scraped from the conference website, using a few natural language processing techniques to determine latent topics and build a list of similar talks for each talk.

The rest of the project can be found on github: https://github.com/mikecunha/pycon_reco

Table of Contents

Load Talk Data

TOC
Load two years' worth of talks and concatenate a few fields together for the bag-of-words (BOW) representation.


In [1]:
import pandas as pd
from collections import defaultdict
from datetime import datetime

In [2]:
talks1 = pd.read_csv( 'data/pycon_talks_2015.csv', sep="\t" )
talks1['start_dt'] = pd.to_datetime( talks1.start_dt )
talks2 = pd.read_csv( 'data/pycon_talks_2014.csv', sep='\t' )
talks2['start_dt'] = pd.to_datetime( talks2.start_dt )

talks = talks1.append(talks2, ignore_index=True)
talks.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 269 entries, 0 to 268
Data columns (total 9 columns):
abstract    262 non-null object
author      269 non-null object
category    269 non-null object
desc        269 non-null object
level       262 non-null object
title       269 non-null object
weekday     269 non-null object
start_dt    269 non-null datetime64[ns]
end_dt      269 non-null object
dtypes: datetime64[ns](1), object(8)
memory usage: 21.0 KB

In [3]:
talks.tail()


Out[3]:
abstract author category desc level title weekday start_dt end_dt
264 A recommendation engine is a software system t... Diego Maniloff,Christian Fricke,Zach Howard Science In this tutorial we'll set ourselves the goal ... Intermediate Hands-on with Pydata: how to build a minimal r... Thursday 2014-04-10 09:00:00 2014-04-10 12:20:00
265 Working with developers on schema migrations i... Selena Deckelmann Databases Working with developers on schema migrations i... Intermediate Sane schema migrations with Alembic and SQLAlc... Saturday 2014-04-12 14:35:00 2014-04-12 15:05:00
266 Designing APIs is one of the hardest tasks in ... Erik Rose Best Practices & Patterns The language you speak determines the thoughts... Intermediate Designing Poetic APIs Saturday 2014-04-12 12:10:00 2014-04-12 12:55:00
267 Beginning programmers: welcome to PyCon! Jumps... Jessica McKellar Python Core (language, stdlib, etc.) Beginning programmers: welcome to PyCon! Jumps... Novice A hands-on introduction to Python for beginnin... Wednesday 2014-04-09 09:00:00 2014-04-09 12:20:00
268 I. Why Async? A. First, There Was Multithreadi... A. Jesse Jiryu Davis Web Frameworks Python’s asynchronous frameworks, like Tulip, ... Intermediate What Is Async, How Does It Work, And When Shou... Friday 2014-04-11 15:15:00 2014-04-11 16:00:00

In [4]:
# Code below depends on talks being sorted with current year talks first
cur_year_max_ID = talks[ talks.start_dt >= datetime(2015,1,1) ].index[-1] + 1
cur_year_max_ID


Out[4]:
138

Build a bag of words for each document


In [5]:
documents = []

for ind, talk in talks.fillna('').iterrows():
    documents.append( ' '.join([talk['title'],
                                talk['desc'],
                                talk['abstract'] 
                                ]) )
    
print ("Read %d corpus of documents" % len(documents))


Read 269 corpus of documents

Build a Labeled Test Set and Define a Scoring Function

TOC
Instead of examining the top documents and words in each topic by eye every time a model parameter is adjusted, define a quick, repeatable measure. The metric should be meaningful in terms of how we'll be using the topic model output, i.e. can we make good recommendations?

Note: these groups depend on the talks being in the same order (same index in the dataframe) as when I created them.


In [6]:
# Hand-make some clusters of talks with similar topics
hand_made_groups = [
('Databases', set([50,60,80,83,84,24])),
('Django', set([5,40,43,59,101])),
('Docker', set([74,91])),
('Deployment', set([13,31,33,53,56,71,82,110,112,])),
('Systems', set([9,10,53,71,118,120,])),
('Web Apps', set([51,56,89,129,])),
('modules, packages', set([27,20,33,109])),
('API', set([43,56,124,130])),
('Machine Learning, modeling', set([50,52,72,78,97,104,116,125,])),
('Testing', set([21,39,59,64,65])),
('Open Source', set([70,99,122])),
('Python Language, Learning', set([2,7,11,16,19,29,47,67,69,79,90,98,100,126,134,135,137])),
('Data Science', set([8,22,24,25,26,30,52,57,68,72,76,78,81,88,97,102,104,116,125,132,136])),
('Math', set([37,26,57,132,])),
('Graphs', set([8,48,76])),
('Games', set([23,38,])),
]

# expand each group into pairwise labels for each doc
labeled_docs = defaultdict(set)

for topic, group in hand_made_groups:
    for doc in group:
        for d in group:
            if d != doc:
                labeled_docs[doc].add(d)
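The effect of this expansion is easier to see on a toy pair of groups (the group contents below are illustrative, not the hand-made clusters above):

```python
from collections import defaultdict

# Toy version of the pairwise label expansion above: every doc in a
# hand-made group is marked as related to every other doc in that group.
groups = [('Docker', {74, 91}), ('Deployment', {13, 31, 33})]

labeled = defaultdict(set)
for topic, group in groups:
    for doc in group:
        labeled[doc].update(d for d in group if d != doc)

print(sorted(labeled[13]))  # [31, 33]
print(labeled[74])          # {91}
```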

In [7]:
def score( make_preds ):
    num_docs = len( labeled_docs )
    true_pos = 0
    false_pos = 0
    true_neg = 0
    false_neg = 0

    for doc in labeled_docs.keys():

        # get predictions for eval_doc
        preds = make_preds(doc)

        tp = len(set(preds).intersection( labeled_docs[doc] ))
        true_pos += tp

        fp = len(set(preds).difference( labeled_docs[doc] ))
        false_pos += fp

        fn = len(labeled_docs[doc].difference( set(preds) ))
        false_neg += fn

        true_neg += (num_docs - (tp + fp + fn))
        
    accuracy = (true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg)
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f_one = 2 * ((precision * recall) / (precision + recall))

    print ("accuracy:  %0.2f \nprecision: %0.2f \nrecall:    %0.2f \nF1:        %0.2f" % (accuracy, precision, recall, f_one))
    
    return accuracy, precision, recall, f_one
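As a quick sanity check, the metric formulas at the end of score() can be verified on hand-picked confusion-matrix counts (the numbers below are made up for illustration):

```python
# Hand-picked counts (illustrative, not from the talk data)
tp, fp, fn, tn = 8, 2, 4, 86

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 94 / 100 = 0.94
precision = tp / (tp + fp)                   # 8 / 10  = 0.80
recall = tp / (tp + fn)                      # 8 / 12  ~ 0.67
f_one = 2 * ((precision * recall) / (precision + recall))

print("accuracy: %0.2f  precision: %0.2f  recall: %0.2f  F1: %0.2f"
      % (accuracy, precision, recall, f_one))
```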

Baseline

TOC
Test the scoring function and create a baseline score by guessing that the next doc in the index is related.


In [8]:
acc, prec, rec, f_one = score( lambda x: [x+1])


accuracy:  0.85 
precision: 0.07 
recall:    0.01 
F1:        0.01

Non-Negative Matrix Factorization

TOC
Most of the NMF code below is taken directly from Derek Greene's (@derekgreene) excellent article and notebook on the topic.


In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn import decomposition
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS

In [10]:
combined_stops =  list(ENGLISH_STOP_WORDS) + ['tutorial', 'also', 'get', "we'll", 'll', 'code', '&' ]
ENGLISH_STOP_WORDS = frozenset( combined_stops )

In [11]:
tfidf = TfidfVectorizer(stop_words=ENGLISH_STOP_WORDS, 
                        lowercase=True, 
                        strip_accents="unicode", 
                        use_idf=True, 
                        norm="l2", 
                        min_df = 6,  # appears in >= X docs
                        max_df = 0.5,  # appears in <= Y% of docs
                        ngram_range=(1,2), # use bigrams
                        ) 

A = tfidf.fit_transform(documents)

print ("Created document-term matrix of size %d x %d" % (A.shape[0],A.shape[1]) )


Created document-term matrix of size 269 x 1023
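The min_df / max_df pruning that the vectorizer performs internally can be sketched in pure Python on toy documents:

```python
# Toy illustration of min_df / max_df vocabulary pruning (hypothetical docs)
docs = [{"python", "web"}, {"python", "data"}, {"python", "web", "data"}, {"python"}]
n = len(docs)

# document frequency of each term
df = {}
for d in docs:
    for w in d:
        df[w] = df.get(w, 0) + 1

min_df, max_df = 2, 0.75  # keep terms in >= 2 docs and <= 75% of docs
vocab = sorted(w for w, c in df.items() if c >= min_df and c / n <= max_df)

print(vocab)  # ['data', 'web'] -- 'python' appears in 100% of docs, so it's dropped
```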

Save a mapping from matrix column indices to the words they represent


In [12]:
num_terms = len(tfidf.vocabulary_)
terms = [""] * num_terms
for term, col_index in tfidf.vocabulary_.items():
    terms[col_index] = term

Do the factorization and produce the factors


In [13]:
num_tops = 22
model = decomposition.NMF(init="nndsvd", 
                          n_components=num_tops, 
                          max_iter=400, 
                          tol=0.0001,
                          )
W = model.fit_transform(A)
H = model.components_
print ("Generated factor W of size %s and factor H of size %s" % ( str(W.shape), str(H.shape) ) )


Generated factor W of size (269, 22) and factor H of size (22, 1023)

For each talk in 2015 (the first 138 docs), calculate the similarity between it and all the other 2015 documents based on the topics each is associated with.


In [14]:
doc_sims = cosine_similarity( W[:cur_year_max_ID,:], W[:cur_year_max_ID,:] )
doc_sims.shape


Out[14]:
(138, 138)
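For reference, the cosine similarity between two topic-weight vectors is just the normalized dot product; a minimal NumPy sketch on toy vectors:

```python
import numpy as np

def cos_sim(a, b):
    # dot product of the vectors divided by the product of their norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 2.0])
b = np.array([2.0, 0.0, 4.0])   # same direction as a
c = np.array([0.0, 3.0, 0.0])   # orthogonal to a

print(round(cos_sim(a, b), 6), round(cos_sim(a, c), 6))  # 1.0 0.0
```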

Build an index of recommended talks for each talk, incorporating a similarity threshold and a maximum number of related talks.
Experimenting with different combinations of similarity threshold and top-N related docs gives pretty good control over precision vs. recall and can be tailored to the recommender application.


In [15]:
NMF_lookup = {}
sim_thresh = 0.74  # min similarity (0.0-1.0) a doc needs to be recommended
n_docs = 15  # how many docs above threshold to keep

for doc_key in range(doc_sims.shape[0]):
    
    # slice column out of similarity matrix for this doc
    similarities = doc_sims[doc_key,:]
    
    # sort related docs by score, descending
    related = sorted(enumerate(similarities), key=lambda tup: tup[1], reverse=True)
    
    # drop the doc's similarity to itself and any docs below the relevance threshold
    related = [ doc_id for doc_id, sim in related if doc_id != doc_key and sim >= sim_thresh ]
    
    # only use top-N recommendations
    related = related[:n_docs]
    
    NMF_lookup[doc_key] = related

# Score it
def make_NMF_pred( doc_id ):
    return NMF_lookup[doc_id]

a, p, r, f = score( make_NMF_pred )


accuracy:  0.86 
precision: 0.49 
recall:    0.20 
F1:        0.28
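The filtering in the lookup-table loop boils down to three steps, shown here on a toy row of similarity scores:

```python
# Toy version of the threshold + top-N filtering above
doc_key = 0
sim_thresh, n_docs = 0.74, 2
similarities = [1.0, 0.9, 0.5, 0.8, 0.76]  # hypothetical row of doc_sims

# 1) sort by score descending, 2) drop self and sub-threshold docs, 3) keep top-N
related = sorted(enumerate(similarities), key=lambda tup: tup[1], reverse=True)
related = [doc_id for doc_id, sim in related if doc_id != doc_key and sim >= sim_thresh]
related = related[:n_docs]

print(related)  # [1, 3]
```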

Examine words in each topic

TOC


In [16]:
import numpy as np

In [17]:
for topic_index in range( H.shape[0] ):
    top_indices = np.argsort( H[topic_index,:] )[::-1][0:10]
    term_ranking = [terms[i] for i in top_indices]
    print ("Topic %d: %s" % ( topic_index, ", ".join( term_ranking ) ) )


Topic 0: source, open source, open, project, contribute, community, projects, source python, want, help
Topic 1: machine learning, machine, learning, scikit, scikit learn, learn, data, model, algorithms, science
Topic 2: data, analysis, scraping, using, data analysis, time, graph, pandas, matplotlib, libraries
Topic 3: django, templates, request, views, response, web, interface, apps, components, new
Topic 4: game, games, video, simple, developed, walk, platforms, development, attendees, using
Topic 5: tests, testing, test, unit, py, requests, unit tests, automated, functional, write
Topic 6: programming, programs, pycon, python programs, interactive, write, programmers, beginning, practice, python programming
Topic 7: hands, material, new python, intermediate, bring laptop, laptop python, decorators, laptop, bring, fast paced
Topic 8: ansible, configuration, management, configuration management, deployment, modules, systems, remote, execution, written python
Topic 9: twisted, event, asynchronous, loop, event loop, network, connection, protocol, server, frameworks
Topic 10: performance, slow, techniques, library, optimization, standard library, standard, used, cycle, applications
Topic 11: security, app, web app, secure, attacks, experience, site, web, issues, cross
Topic 12: function, class, definition, ways, languages, features, understand, like, functions, don
Topic 13: problems, people, help, students, real, example, simple, engineers, world, started
Topic 14: flask, services, web, framework, build, session, extensions, way, web framework, authentication
Topic 15: software, free, years, module, design, modules, developers, happen, behavior, learned
Topic 16: javascript, pypy, interpreter, browser, js, cpython, python interpreter, client, minutes, python javascript
Topic 17: memory, distributed, process, low, reference, module, high, cpython, systems, objects
Topic 18: ipython, engine, notebook, interactive, minimal, computing, process, numpy, shell, building
Topic 19: web, application, web application, python web, scraping, tools, development, web development, stack, end
Topic 20: database, sqlalchemy, orm, sql, special, query, schema, developer, core, complex
Topic 21: api, apis, rest, models, ve, graphics, build, design, patterns, practices
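The top-term selection uses a common NumPy idiom: argsort ascending, reverse, then slice the first k indices. On a toy topic row:

```python
import numpy as np

row = np.array([0.2, 0.9, 0.1, 0.5])   # toy weights of 4 terms in one topic
top2 = np.argsort(row)[::-1][:2]       # indices of the 2 largest weights

print(top2.tolist())  # [1, 3]
```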

Examine top talks in each topic

TOC


In [18]:
topics_to_show = 2
for topic_index in range( min(W.shape[1], topics_to_show) ):
    top_indices = np.argsort( W[:cur_year_max_ID,topic_index] )[::-1][0:15]
    term_ranking = [(talks.ix[i].title, W[i,topic_index]) for i in top_indices]
    print ("Topic %d:" % ( topic_index  ))
    for t in term_ranking:
        print (t)
    print ()


Topic 0:
('Open Source for Newcomers and the People Who Want to Welcome Them', 0.2826173899896588)
('Opening Statements', 0.14402782707582332)
('Robots Robots Ra Ra Ra!!!', 0.13648104563829647)
('PSF Chair', 0.1075813864177774)
("Django's Co-creator", 0.099293899702828689)
('Demystifying Docker', 0.09500002237769585)
('Docker 101: Introduction to Docker', 0.091417558154530981)
('Python Performance Profiling: The Guts And The Glory', 0.075179964413609593)
('Avoiding Burnout, and other essentials of Open Source Self-Care', 0.073562713452527059)
('How to make your code Python 2/3 compatible', 0.071725509051593653)
('I18N: World Domination the Easy Way', 0.069337481808145099)
('Hadoop with Python', 0.067601006595841312)
('streamparse: real-time streams with Python and Apache Storm', 0.062445942287859081)
('The Ethical Consequences Of Our Collective Activities', 0.057936841287175432)
('3D Print Anything with the Blender API', 0.051230040111769286)

Topic 1:
('Machine Learning with Scikit-Learn (I)', 0.69768370316642114)
('Machine Learning 101', 0.59800464405341514)
('Winning Machine Learning Competitions With Scikit-Learn', 0.58126458884981669)
('Machine Learning with Scikit-Learn (II)', 0.48176504643539941)
('Grids, Streets and Pipelines: Building a linguistic street map with scikit-learn', 0.2228875396081787)
('Hands-on Data Analysis with Python', 0.20734296846544256)
('Data Science in Advertising: Or a future when we love ads', 0.20509401221311555)
('Bytes in the Machine: Inside the CPython interpreter', 0.098007775974864125)
('Exploring Minecraft and Python: Learning to Code Through Play', 0.070529006473038697)
("Learning from other's mistakes: Data-driven analysis of Python code", 0.051900286675114676)
('What to do when you need crypto', 0.037300891553877238)
('Losing your Loops: Fast Numerical Computing with NumPy', 0.036183244158200324)
('"Words, words, words": Reading Shakespeare with Python', 0.035711659509561594)
('Build and test wheel packages on Linux, OSX & Windows', 0.033361721570761824)
('Slithering Into Elasticsearch', 0.031589324903414442)

Export Lookup Table for Web Service to Use

TOC
Save the index to a file that a web service can load; e.g. the 2nd line is a comma-separated list of the docs related to the 2nd talk.


In [19]:
lines = []
for k in range( len(NMF_lookup) ):
    line = []
    for doc_num in NMF_lookup[k]:
        line.append( str(doc_num) )
        
    lines.append(','.join(line) )

with open('app/rel_talks.txt', mode='wt', encoding='utf-8') as myfile:
    myfile.write( '\n'.join(lines) )

In [20]:
!head app/rel_talks.txt


24
11,105,84,132,35,7,92,70,23,17,104
120,79,134,28,100
25,84

40,73,101
20,61,63,122
132,105,11,57,35,1,23,70,75,104,41,92
25,26,76,128,14,48,34,81,24
107,98

Process Documents for Gensim

TOC
Tokenize, make bigrams, remove stop words


In [21]:
import re
import itertools
from gensim import corpora, models, similarities
from nltk import bigrams
from nltk.corpus import stopwords 
from nltk import PorterStemmer

stemmer = PorterStemmer()

# Replace dashes with spaces
def remove_dashes(text):
    rx = re.compile(u'([\u2014\-\,\u2019,\n]|\.|\(|\)|\:|;|/|\[|\])', flags=re.UNICODE)
    new_text = rx.sub(" ", text )
    return new_text
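A quick check of what remove_dashes does to a sample string (the pattern is restated as a raw string so the cell stands alone):

```python
import re

# Same substitution as remove_dashes above: dashes and assorted
# punctuation become spaces before tokenizing on whitespace.
rx = re.compile(r'([\u2014\-\u2019,\n]|\.|\(|\)|:|;|/|\[|\])')

sample = "real-time, event\u2014driven (async) apps"
cleaned = rx.sub(" ", sample)

print(cleaned.split())  # ['real', 'time', 'event', 'driven', 'async', 'apps']
```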

Make it as painless as possible to try different ways of preprocessing the talks


In [22]:
def prep_docs(documents, stem_words=True, cur_yr_only=True, use_bigrams=False, min_word_freq=5, tfidf=False ):
    """Clean documents, remove stop words etc. Returns a list 
    of docs where each doc is a list of n-grams"""
    
    stops = stopwords.words('english') + ['python', 'use', 'learn', 'talk', 
                                          'discuss', 'program', 'tutorial',
                                          'also', 'get', "we'll", 'll', 
                                          'http', 'code', '&' ]
    if cur_yr_only:
        end_doc = cur_year_max_ID
    else:
        end_doc = len(documents)
    
    # Split and clean concatenated text
    if stem_words:
        print("Splitting %d docs into words and stemming" % len(documents[:end_doc]))
        stops = [ stemmer.stem(word) for word in stops ]
        
        texts_ = [ [stemmer.stem(word) for word in remove_dashes(document.lower()).split()] 
                  for document in documents[:end_doc] ]
    else:
        print("Splitting %d docs into words" % len(documents[:end_doc]))
        texts_ = [ [word for word in remove_dashes(document.lower()).split()] 
                  for document in documents[:end_doc] ]

    # Filter out stop words
    print("Filtering out stop words")
    texts_ = [ [word for word in document if word not in stops]
               for document in texts_ ]
    
    # Bigrams
    if use_bigrams:
        print("Adding bigrams")
        bigram_texts = []
        for doc in texts_:
            bigram_texts.append( doc + [' '.join(x) for x in bigrams(doc)] )
        texts_ = bigram_texts

    # Remove rare words
    if min_word_freq:
        print("Removing rare words")
        all_tokens = list(itertools.chain(*texts_))
        rare_tokens = set(word for word in set(all_tokens) if all_tokens.count(word) < min_word_freq)
        texts_ = [[ngram for ngram in btext if ngram not in rare_tokens]
                  for btext in texts_]

    dictionary = corpora.Dictionary(texts_)
    corpus = [dictionary.doc2bow(text) for text in texts_]

    # average length of words per doc (useful for tuning model params)
    tot = 0
    for t in texts_:
        tot += len(t)
    print ( "%3.0f average n-gramas per doc" % (tot / float(len(texts_))) )

    # TF-IDF weighting
    if tfidf:
        print("Converting to TF-IDF vector space")
        tfidf = models.TfidfModel(corpus) 
        corpus = tfidf[corpus]
        
    return corpus, dictionary
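The rare-word removal step inside prep_docs can be seen in isolation on a toy corpus with min_word_freq=2:

```python
import itertools

# Toy tokenized docs (illustrative); 'rareword' appears only once
texts = [["data", "web"], ["data", "web"], ["data", "rareword"]]

# same filter as in prep_docs: drop tokens seen fewer than 2 times overall
all_tokens = list(itertools.chain(*texts))
rare_tokens = {w for w in set(all_tokens) if all_tokens.count(w) < 2}
texts = [[w for w in t if w not in rare_tokens] for t in texts]

print(texts)  # [['data', 'web'], ['data', 'web'], ['data']]
```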

In [23]:
def save_BOW(corpus, dictionary, c_path="data/pycon_corpus.mm", d_path="data/pycon.dict"):
    corpora.MmCorpus.serialize(c_path, corpus)
    dictionary.save(d_path)
    return

In [24]:
def load_BOW(c_path="data/pycon_corpus.mm", d_path="data/pycon.dict"):
    corpus = corpora.MmCorpus(c_path)
    dictionary = corpora.Dictionary.load(d_path)
    return corpus, dictionary

Latent Semantic Indexing

TOC


In [25]:
corpus, dictionary = prep_docs(documents, 
                               stem_words=False, 
                               cur_yr_only=True, 
                               use_bigrams=True, 
                               min_word_freq=False, 
                               tfidf=False )


Splitting 138 docs into words
Filtering out stop words
Adding bigrams
267 average n-grams per doc

Define a Latent Semantic Index of the documents


In [26]:
nf = 22 # number of topics / features
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=nf, extra_samples=150,)
index = similarities.MatrixSimilarity(lsi[corpus], num_features=nf) 

#index.save('data/pycon_2yr_01.index')
#index = similarities.MatrixSimilarity.load('data/pycon_2yr_01.index')

Make a lookup table of related docs and score it


In [27]:
# make a lookup table
lsi_lookup = {}

for doc_key, scores in enumerate( index ):
    
    # sort related docs by score, descending
    related = sorted(list(enumerate(scores)), key=lambda tup: tup[1], reverse=True)
    
    # drop the doc's similarity to itself and any docs below the relevance threshold
    related = [ doc for doc in related if doc[0] != doc_key and doc[1] >= 0.74 ]
    
    # keep at most 15 related docs, so lists aren't too long
    related = related[:15]
    
    lsi_lookup[doc_key] = related
    
def make_LSI_pred( doc_id ):
    
    return [ doc_num for doc_num, score in lsi_lookup[doc_id] ]

a, p, r, f = score( make_LSI_pred )


accuracy:  0.83 
precision: 0.29 
recall:    0.17 
F1:        0.21

Latent Dirichlet Allocation

TOC


In [28]:
corpus, dictionary = prep_docs(documents, 
                               stem_words=True, 
                               cur_yr_only=True, 
                               use_bigrams=True, 
                               min_word_freq=5, 
                               tfidf=False )


Splitting 138 docs into words and stemming
Filtering out stop words
Adding bigrams
Removing rare words
104 average n-grams per doc

In [29]:
nf = 22
lda = models.ldamodel.LdaModel(corpus, 
                               id2word=dictionary, 
                               num_topics=nf, 
                               passes= 10,                 # > 1 = batch mode; multiple passes over small corpus
                               iterations=105,             # set to approx. number of words per doc
                               eta=1.5,                    # symmetric prior over all words
                               alpha='auto',               # optimize alpha to asymmetric vals, using eta as start point
                               eval_every=0,               # set to zero when corp. is small enough 
                               chunksize=10,               # approx. 10% of corpus size 
                               gamma_threshold=0.00001,    # lower than default 0.001 means more training per doc
                               )   

index2 = similarities.MatrixSimilarity(lda[corpus], num_features=nf)
#index2.save("pycon_2yr-LDA_bigram_sims_v4.index")

In [30]:
# make a lookup table
lda_bigram_lookup = {}

for doc_key, scores in enumerate( index2 ):
    
    # sort related docs by score, descending
    related = sorted(list(enumerate(scores)), key=lambda tup: tup[1], reverse=True)
    
    # drop the doc's similarity to itself and any docs below the relevance threshold
    related = [ doc for doc in related if doc[0] != doc_key and doc[1] >= 0.75 ]
    
    # keep at most 15 related docs, so lists aren't too long
    related = related[:15]
    
    lda_bigram_lookup[doc_key] = related
    
# Score
def make_LDA_bigram_pred( doc_id ):
    
    return [ doc_num for doc_num, score in lda_bigram_lookup[doc_id] ]

a, p, r, f = score( make_LDA_bigram_pred )


accuracy:  0.81 
precision: 0.34 
recall:    0.40 
F1:        0.37

Visualize the Talks Grouped by Topic Using t-SNE

TOC
Using the Python implementation (adapted to Python 3) of t-Distributed Stochastic Neighbor Embedding by Laurens van der Maaten found here

and mpld3 to make an interactive d3 scatterplot in which you can zoom in on clusters of talks and hover for a tooltip with the talk's title.


In [31]:
%matplotlib inline
import matplotlib.pyplot as plt
from mpld3 import plugins, enable_notebook
import numpy as np
from tsne3 import tsne

In [32]:
enable_notebook()

Get the most likely topic label from the NMF factors for each doc (we have to pick a single topic to color each talk in the plot)


In [33]:
nmf_labels = []
num_topics = W.shape[1]

for talk_index in range(W.shape[0]):
    nmf_labels.append( np.argsort( W[talk_index,:] )[::-1][0] )
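Since only the single largest weight is needed per talk, the argsort-and-reverse idiom here is equivalent to np.argmax:

```python
import numpy as np

w_row = np.array([0.05, 0.70, 0.25])  # toy topic weights for one talk

# argsort descending, take first index == index of the maximum
assert np.argsort(w_row)[::-1][0] == np.argmax(w_row)
print(int(np.argmax(w_row)))  # 1
```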

Get labels for each talk according to the "category" listed on the PyCon website, for comparison


In [34]:
mapping = {}
cats = talks.category.unique()
for i, category in enumerate(cats):
    mapping[category] = float(i)
print("%d unique categories" % len(cats))
cat_labels = [ mapping[c] for c in talks.category.values ]


17 unique categories

Run the t-SNE Algorithm on the 2015 Documents


In [39]:
Y = tsne(W[:cur_year_max_ID,:], 
         2,   # Final Output Dimensions desired
         20,  # Dimensions for PCA to pre-process raw data to, same = no PCA
         5)   # Perplexity, clumped ball = too high


Preprocessing the data using PCA...
Computing pairwise distances...
Computing P-values for point  0  of  138 ...
Mean value of sigma:  0.0859821117558
Iteration  100 : error is  15.4334850425
Iteration  200 : error is  0.643691013238
Iteration  300 : error is  0.514638524071
Iteration  400 : error is  0.506172800415
Iteration  500 : error is  0.502392153897
Iteration  600 : error is  0.499322270326
Iteration  700 : error is  0.49632358316
Iteration  800 : error is  0.493053094674
Iteration  900 : error is  0.491477705305
Iteration  1000 : error is  0.490333328072

Now plot it


In [40]:
topic_colors = nmf_labels
#topic_colors = cat_labels
fig, ax = plt.subplots()
fig.set_size_inches(12.5,7.5)
points = ax.scatter(Y[:cur_year_max_ID,0], Y[:cur_year_max_ID,1],
                     c=topic_colors[:cur_year_max_ID],
                     s = 190,
                     alpha=0.4,
                     cmap=plt.cm.jet)

ax.set_title("Pycon Talks by NMF Topic", size=20)

# tooltip = talk title
labels = list(talks.ix[:cur_year_max_ID].title.values)
tooltip = plugins.PointLabelTooltip(points, labels)

plugins.connect(fig, tooltip)


TOC

