The popular question-and-answer (Q&A) site Stack Overflow (and its sister sites - http://stackexchange.com/sites) witnesses tremendous activity daily: users post questions, receive answers and comments, and engage in active discussion.
We want to analyze this data with the goal of identifying related questions. More generally, the analysis presented in this notebook can be applied in a variety of scenarios. Some illustrative examples:
Power users keep many browser tabs open simultaneously. Using the techniques outlined below, we can group similar tabs together; with appropriate visual cues, related tabs can then be spotted at a glance.
Many products offer support through a Twitter handle. User questions about the product or service (all within Twitter's 140-character limit) can be grouped by similarity.
The Stack Exchange network publishes monthly dumps of its data, which we use for this analysis.
"The anonymized dump of all user-contributed content on the Stack Exchange network. Each site is formatted as a separate archive consisting of XML files zipped via 7-zip using bzip2 compression. Each site archive includes Posts, Users, Votes, Comments, PostHistory and PostLinks. For complete schema information, see the included readme.txt.
All user content contributed to the Stack Exchange network is cc-by-sa 3.0 licensed, intended to be shared and remixed. We even provide all our data as a convenient data dump."
The schema for their data is located @ https://ia800500.us.archive.org/22/items/stackexchange/readme.txt.
However, the dumps are distributed as XML, whereas our analysis works with CSV.
The first step is therefore to write a converter. Our streaming XML-to-CSV converter expects the input data at input/english for english.stackexchange.com, input/aviation for aviation.stackexchange.com, and so on. The converted data is written to output/english and output/aviation respectively.
Also note that the converter writes the CSV in a compressed (gzipped) format, since pandas (the library we use to read the data) understands compressed formats natively.
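To make this concrete, here is a minimal sketch (illustrative only, assuming Python 3) of what such a streaming XML-to-CSV converter could look like; the bundled convertso2csv module used below may differ in its details, and the column list is just an example taken from the schema.
In [ ]:
# Illustrative sketch only -- not the actual convertso2csv implementation
import csv
import gzip
import xml.etree.ElementTree as ET

def xml2csv(xml_path, csv_path, columns):
    # write a gzipped CSV; pandas can read it directly with compression='gzip'
    with gzip.open(csv_path, 'wt') as out:
        writer = csv.DictWriter(out, fieldnames=columns, extrasaction='ignore')
        writer.writeheader()
        # iterparse streams the dump, so memory use stays flat even for large sites
        for event, elem in ET.iterparse(xml_path, events=('end',)):
            if elem.tag == 'row':
                writer.writerow(elem.attrib)
                elem.clear()  # release the parsed element

# e.g. xml2csv('input/english/Posts.xml', 'output/english/posts.csv.gz',
#              ['Id', 'PostTypeId', 'Title', 'Body', 'Tags', 'Score'])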
Please download the data from https://archive.org/download/stackexchange/english.stackexchange.com.7z and https://archive.org/download/stackexchange/aviation.stackexchange.com.7z into the corresponding input folders to run the dashboard. Alternatively, sample run output has been stored in the respective output folders.
In [8]:
#imports
import warnings
warnings.filterwarnings('ignore')
import os
import os.path
import pandas as pd
import numpy as np
import math
import re
import gensim
from gensim import corpora, models,similarities
from gensim.models import Doc2Vec, TfidfModel
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import WordNetLemmatizer
from pprint import pprint # pretty-printer
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from wordcloud import WordCloud
SAMPLE_SIZE = 20000
RECOMPUTE_MODEL = True
SITE1 = 'biology' #.stackexchange.com
SITE2 = 'bicycles' #.stackexchange.com
SITES = [SITE1, SITE2]
samples = {}
samples[SITE1] = [
'Why are the human knees and elbows bent in an opposite direction',
'How would one determine whether a chemical will upregulate a certain class of proteins?'
]
samples[SITE2] = [
'Picking a bike for a cyclist new to riding on the road?',
'max weight that the shimano Rs400 can carry ?'
]
stopw = {}
stopw[SITE1] = ['biology', 'human', 'cell', 'dna', 'protein','doe']
stopw[SITE2] = ['bike','bicycle','bikes','bicycles']
In [9]:
# Global variables
Posts = {}
NormalizedPosts = {}
Dictionaries = {}
Corpora = {}
TFIDF = {}
NUM_TOPICS = {}
STOPW = {}
In [10]:
#%%bash
#set XXX = 'chemistry'
#curl https://archive.org/download/stackexchange/$XXX.stackexchange.com.7z
In [11]:
from convertso2csv import process
# To get a new network, do the following
# wget https://archive.org/download/stackexchange/XXX.stackexchange.com.7z [XXX=chemistry, physics etc]
# Extract the zip into input/XXX folder and then run this cell.
process(os.getcwd() + '/input')
For the purposes of this analysis, we use only the post title and body. However, we also show the other data Stack Exchange makes available; it could be used to further enrich the analysis, though we do not do so in this notebook.
The original data is large, so to make it easy to play around with, we provide a few knobs to tune within the notebook (rather than in the parser) - most notably SAMPLE_SIZE, the number of rows read from each table.
Feel free to change this value to experiment or to get better results.
In [12]:
path = 'output/'+SITE1+'/'+'posts.csv.gz'
Posts[SITE1] = pd.read_csv(path,compression='gzip',nrows=SAMPLE_SIZE).dropna(subset=['Body','Title'])
Posts[SITE1]['Tags'] = Posts[SITE1]['Tags'].apply(lambda x : x.replace('<',' ').replace('>',' '))
Posts[SITE1][['Body','Title', 'Tags']].head(2)
Out[12]:
In [13]:
path = 'output/'+SITE2+'/'+'posts.csv.gz'
Posts[SITE2] = pd.read_csv(path,compression='gzip',nrows=SAMPLE_SIZE).dropna(subset=['Body','Title'])
Posts[SITE2]['Tags'] = Posts[SITE2]['Tags'].apply(lambda x : x.replace('<',' ').replace('>',' '))
Posts[SITE2][['Body','Title', 'Tags']].head(2)
Out[13]:
Related data can also be loaded. However, for the purposes of this demonstration, we are not exploiting the data from the tables below.
In [14]:
path = 'output/'+SITE2+'/'+'comments.csv.gz'
comments = pd.read_csv(path,compression='gzip',nrows=SAMPLE_SIZE).dropna()
comments[['Score','Text']].head(5)
Out[14]:
In [15]:
path='output/'+SITE1+'/'+'posthistory.csv.gz'
posthistory = pd.read_csv(path,compression='gzip',nrows=SAMPLE_SIZE).dropna(subset=['Text'])
posthistory[['Text']].head(5)
Out[15]:
In [16]:
path='output/' + SITE1 + '/' + 'users.csv.gz'
users = pd.read_csv(path,compression='gzip',nrows=SAMPLE_SIZE).dropna(subset=['AboutMe','Location'])
users[['Location','AboutMe']].head(5)
Out[16]:
The relationship between all these tables and the detailed meaning of all the attributes and values can be found @ https://ia800500.us.archive.org/22/items/stackexchange/readme.txt
In [17]:
class StopWords():
lemma = WordNetLemmatizer()
#stemmer = SnowballStemmer("english")
#stemmer = LancasterStemmer()
def __init__(self,add=[]):
self.stop_words = stopwords.words('english')
self.stop_words.append('use')
for word in add:
self.stop_words.append(word)
def remove(self, sentence):
line = [StopWords.lemma.lemmatize(tok) for tok in sentence.split()]
return [tok for tok in line if not tok in self.stop_words and len(tok) > 1]
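As a quick illustration of what this normalization does (assuming the NLTK stopwords and wordnet corpora have already been downloaded via nltk.download):
In [ ]:
# Illustrative only: stop words and any extra words are removed, remaining tokens are lemmatized
sw = StopWords(['bike'])
print(sw.remove('why are the gears on my bike slipping'))
# expected output is roughly ['gear', 'slipping']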
In [18]:
class SentenceTokens():
def __init__(self,df,field,sw):
self.field = field
self.df = df
self.sw = sw
    def posts(self):
        for index, row in self.df.iterrows():
            raw_sentence = row[self.field]
            tokens = self.sw.remove(raw_sentence)
            # keep posts that have at least 2 tokens which are not all identical
            if len(tokens) >= 2 and any(tok != tokens[0] for tok in tokens):
                yield raw_sentence
    def __iter__(self):
        for index, row in self.df.iterrows():
            raw_sentence = row[self.field]
            tokens = self.sw.remove(raw_sentence)
            # keep posts that have at least 2 tokens which are not all identical
            if len(tokens) >= 2 and any(tok != tokens[0] for tok in tokens):
                yield tokens
In [19]:
def get_dictionary(site, posts, force=False):
dictFile = 'models/' + site + '_posts.dict'
# Check if trained model file exists
if ( os.path.isfile(dictFile) and not force):
dictionary = corpora.Dictionary.load(dictFile)
else:
        # Build a mapping from each token to an integer id across all posts
dictionary = corpora.Dictionary(posts)
#dictionary.filter_extremes(no_below=5, no_above=10)
# store the dictionary, for future reference
dictionary.save(dictFile)
return dictionary
# Bag of words
# The corpus is a list with one vector per document.
# Each document vector is a list of (token_id, count) tuples.
def get_corpus(site, posts, force=False):
corpusFile = 'models/' + site + '_posts.mm'
# Check if corpus file is found
if ( os.path.isfile(corpusFile) and not force ):
corpus = corpora.MmCorpus(corpusFile)
else:
# Create corpus
corpus = [Dictionaries[site].doc2bow(post) for post in posts]
# Store corpus to file
corpora.MmCorpus.serialize(corpusFile, corpus) #Save the bow corpus
return corpus
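To make the bag-of-words representation concrete, here is a toy illustration (not part of the pipeline) of the dictionary and of the (token_id, count) tuples that make up each corpus entry:
In [ ]:
# Toy example, illustrative only
toy_docs = [['gear', 'slipping', 'chain'], ['chain', 'lube', 'chain']]
toy_dict = corpora.Dictionary(toy_docs)
print(toy_dict.token2id)                        # e.g. {'chain': 0, 'gear': 1, 'slipping': 2, 'lube': 3}
print([toy_dict.doc2bow(d) for d in toy_docs])  # e.g. [[(0, 1), (1, 1), (2, 1)], [(0, 2), (3, 1)]]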
In [20]:
for site in SITES:
STOPW[site] = StopWords(stopw.get(site,[]))
NormalizedPosts[site] = SentenceTokens(Posts[site],'Title',STOPW[site])
Dictionaries[site] = get_dictionary(site, NormalizedPosts[site], force=RECOMPUTE_MODEL)
Corpora[site] = get_corpus(site, NormalizedPosts[site], force=RECOMPUTE_MODEL)
    TFIDF[site] = TfidfModel(Corpora[site], normalize=True)
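As a quick sanity check (purely illustrative, not needed by the rest of the notebook), the TF-IDF weights of the first normalized post can be mapped back to words:
In [ ]:
# Illustrative only: top TF-IDF terms of the first normalized post of SITE1
first_bow = Corpora[SITE1][0]
for token_id, weight in sorted(TFIDF[SITE1][first_bow], key=lambda x: -x[1])[:5]:
    print(Dictionaries[SITE1].get(token_id), round(weight, 3))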
In [21]:
# Aggregate how often each word occurs across the whole corpus
def getWordOccurencesInCorpus(site):
    wordfreq = {}
    for row in Corpora[site]:
        for token_id, count in row:
            w = Dictionaries[site].get(token_id)
            wordfreq[w] = wordfreq.get(w, 0) + count
    return list(wordfreq.items())
In [22]:
fig = plt.figure(figsize=(20*3, 30))
plotId=1
for site in [SITE1]:
wfreq = getWordOccurencesInCorpus(site)
plt.subplot(120+plotId)
plt.imshow(WordCloud().fit_words(wfreq))
plt.axis("off")
    plt.title(site + str(plotId))
plotId +=1
plt.show()
In [23]:
fig = plt.figure(figsize=(30*3, 30))
plotId=1
for site in [SITE2]:
wfreq = getWordOccurencesInCorpus(site)
plt.subplot(120+plotId)
plt.imshow(WordCloud().fit_words(wfreq))
plt.axis("off")
    plt.title(site + str(plotId))
plotId +=1
plt.show()
We do not know the number of topics in advance, so we first reduce the features to two dimensions and then cluster the points for different values of K (the number of clusters) to find a good value.
One such dimensionality-reduction transform is Latent Semantic Indexing (LSI), which we use to project the original data to 2D.
In [24]:
def get_lsi_model(site, dictionary, corpus, force = False):
lsimodelFile = 'models/' + site + '_lsi.model'
# Check if corpus file is found
if ( os.path.isfile(lsimodelFile) and not force):
lsimodel = models.LsiModel.load(lsimodelFile)
else:
lsimodel = models.LsiModel(TFIDF[site][Corpora[site]], id2word=dictionary, num_topics=NUM_TOPICS[site])
lsimodel.save(lsimodelFile)
return lsimodel
In [25]:
# Get LSI Model
LsiModels = {}
for site in SITES:
NUM_TOPICS[site] = 2
LsiModels[site] = get_lsi_model(site, Dictionaries[site], Corpora[site], force=RECOMPUTE_MODEL)
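Each post is now representable by two LSI coordinates, which gensim returns as a list of (topic_id, weight) pairs. For example (illustrative only):
In [ ]:
# Illustrative only: the 2-D LSI representation of the first few posts of SITE1
for i, doc in enumerate(LsiModels[SITE1][Corpora[SITE1]]):
    print(doc)  # e.g. [(0, 0.41), (1, -0.07)]
    if i == 2:
        break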
In [26]:
def getWordCloud(line):
scores = [float(x.split("*")[0]) for x in line.split(" + ")]
words = [x.split("*")[1] for x in line.split(" + ")]
freqs = []
for word, score in zip(words, scores):
if word.startswith('"') and word.endswith('"'):
word = word[1:-1]
freqs.append((word, score))
return WordCloud().fit_words(freqs)
def plotWordCloud(site, model, num_topics = None, num_words=6):
plotId = 1
figureId = 1
if num_topics is None:
num_topics = NUM_TOPICS[site]
for line in model.print_topics(num_topics,num_words):
if plotId % 2 != 0:
plotId = 1
plt.figure(figureId,figsize=(5*3, 5))
plt.subplot(120+plotId)
plt.imshow(getWordCloud(line[1]))
plt.axis("off")
        plt.title(site + str(figureId))
plotId += 1
figureId +=1
plt.show()
In [27]:
for site in SITES:
plotWordCloud(site, LsiModels[site], 3)
We cluster the points in this reduced space using KMeans, varying the number of clusters (K) from 1 to 10, and use the inertia (the within-cluster sum of squared distances) reported by scikit-learn's KMeans to locate the elbow.
In [28]:
#pprint(Corpora[SITE1])
#pprint(LsiModels[SITE1][Corpora[SITE1]])
#for row in LsiModels[site][Corpora[SITE1]]:
# print(row)
In [29]:
X = {}
for site in SITES:
D = ''
for vector in LsiModels[site][Corpora[site]]:
if 2 == len(vector):
#'1 2; 3 4' => 2 X 2
            D = D + str(round(vector[0][1], 4)) + ' ' + str(round(vector[1][1], 4)) + ';'
else:
print(vector)
X[site] = np.matrix(D[:-1])
MAX_K = 10
ks = range(1, MAX_K + 1)
inertias = np.zeros(MAX_K)
diff = np.zeros(MAX_K)
diff2 = np.zeros(MAX_K)
diff3 = np.zeros(MAX_K)
for k in ks:
kmeans = KMeans(k).fit(X[site])
inertias[k - 1] = kmeans.inertia_
# first difference
if k > 1:
diff[k - 1] = inertias[k - 1] - inertias[k - 2]
# second difference
if k > 2:
diff2[k - 1] = diff[k - 1] - diff[k - 2]
# third difference
if k > 3:
diff3[k - 1] = diff2[k - 1] - diff2[k - 2]
elbow = np.argmin(diff3[3:]) + 3
plt.plot(ks, inertias, "b*-")
plt.plot(ks[elbow], inertias[elbow], marker='o', markersize=12,
markeredgewidth=2, markeredgecolor='r', markerfacecolor=None)
plt.ylabel("Inertia")
plt.xlabel("K")
plt.title(site)
plt.show()
In [30]:
print(Corpora[SITE1][49])
print(X[SITE1].shape[0])
print(len(LsiModels[SITE1][Corpora[SITE1]]))
In [31]:
NUM_TOPICS = {}
NUM_TOPICS[SITE1] = 5
NUM_TOPICS[SITE2] = 5
In [32]:
fig = plt.figure(figsize=(8, 8))
kmeans = KMeans(NUM_TOPICS[SITE1]).fit(X[SITE1])
labels = kmeans.labels_
colors = ["b", "g", "r", "m", "c", "p", "y"] #Not expecting more than 5 for the data size
for i in range(X[SITE1].shape[0]):
plt.scatter(X[SITE1][i,0] ,X[SITE1][i,1], c=colors[labels[i]], s=50)
plt.title(SITE1)
plt.show()
In [33]:
Buckets = {}
for k in range(NUM_TOPICS[SITE1]):
    Buckets[k] = []
for index, sentence in enumerate(NormalizedPosts[SITE1].posts()):
    Buckets[labels[index]].append(sentence)
for key in Buckets:
    print('LABEL : ' + str(key))
    print('-------------------')
    print("\n".join(Buckets[key][:5]))
In [34]:
fig = plt.figure(figsize=(10, 10))
kmeans = KMeans(NUM_TOPICS[SITE2]).fit(X[SITE2])
y = kmeans.labels_
colors = ["b", "g", "r", "m", "c", "p", "y"] #Not expecting more than 5 for the data size
for i in range(X[SITE2].shape[0]):
plt.scatter(X[SITE2][i,0] ,X[SITE2][i,1], c=colors[y[i]], s=50)
plt.title(SITE2)
plt.show()
In [35]:
Buckets = {}
for k in range(NUM_TOPICS[SITE2]):
    Buckets[k] = []
for index, sentence in enumerate(NormalizedPosts[SITE2].posts()):
    Buckets[y[index]].append(sentence)  # use the SITE2 cluster labels computed above
for key in Buckets:
    print('LABEL : ' + str(key))
    print('-------------------')
    print("\n".join(Buckets[key][:5]))
In [36]:
fig = plt.figure(figsize=(10, 10))
D = ''
for site in SITES:
for vector in LsiModels[site][Corpora[site]]:
if 2 == len(vector):
#'1 2; 3 4' => 2 X 2
            D = D + str(round(vector[0][1], 4)) + ' ' + str(round(vector[1][1], 4)) + ';'
X['ALL'] = np.matrix(D[:-1])
kmeans = KMeans(2).fit(X['ALL'])
y = kmeans.labels_
colors = ["r", "y",]
#plt.subplot(120+plotId)
for i in range(X['ALL'].shape[0]):
plt.scatter(X['ALL'][i,0] ,X['ALL'][i,1], c=colors[y[i]], s=50)
plt.title('ALL')
plt.show()
In [37]:
def QSim(site, document): # Query function for document
# Apply stop words
doc_processed = STOPW[site].remove(document)
# create vector
vec_bow = Dictionaries[site].doc2bow(doc_processed)
# convert the query (sample vector) to LSI space
vec_lsi = LsiModels[site][vec_bow]
corpus = Corpora[site]
    # build a similarity index over the LSI-transformed corpus
index = similarities.MatrixSimilarity(LsiModels[site][corpus])
# perform a similarity query against the corpus
sims = index[vec_lsi]
return sims
In [38]:
site = SITE2
target = samples[site][1]
# Search for similar documents
sims = QSim(site,target)
# Sort in descending order - highest matching percentage on top
sorted_sims = sorted(enumerate(sims), key=lambda item: -item[1])
sims_list = list(enumerate(sorted_sims))
print("Search results for [{}]:".format(target))
# Show top 10 matches only
for i in range(0, 10):
docid = sims_list[i][1][0]
matchPercentage = sims_list[i][1][1]
print("{:10.3f}% : {}".format(matchPercentage * 100, Posts[site].iloc[docid]['Title']))
In [39]:
def get_lda_model(site, dictionary, corpus, force=False):
modelFile = 'models/' + site + '_lda.model'
# Check if trained model file exists
if ( os.path.isfile(modelFile) and not force):
# load trained model from file
ldamodel = models.LdaModel.load(modelFile)
else:
# Create model
#num_topics: required. An LDA model requires the user to determine how many topics should be generated.
#id2word: required. The LdaModel class requires our previous dictionary to map ids to strings.
        #passes: optional. The number of passes the model makes over the corpus.
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS[site], id2word = dictionary, passes=30)
# save model to disk (no need to use pickle module)
ldamodel.save(modelFile)
return ldamodel
In [40]:
# Get LDA Models
LdaModels = {}
for site in SITES:
LdaModels[site] = get_lda_model(site, Dictionaries[site], Corpora[site], force=RECOMPUTE_MODEL)
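To see how a single post is distributed over the learned LDA topics (illustrative only; gensim omits topics whose probability falls below a small threshold):
In [ ]:
# Illustrative only: the (topic_id, probability) mixture of the first post of SITE1
print(LdaModels[SITE1][Corpora[SITE1][0]])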
In [41]:
from __future__ import print_function
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
In [42]:
@interact(num_topics=NUM_TOPICS[SITE1] + 5, num_words=3)
def understand(num_topics, num_words):
plotWordCloud(SITE1,LdaModels[SITE1],num_topics,num_words)
return LdaModels[SITE1].print_topics(num_topics, num_words)
In [43]:
@interact(num_topics=NUM_TOPICS[SITE2] + 5, num_words=3)
def understand(num_topics, num_words):
plotWordCloud(SITE2,LdaModels[SITE2],num_topics,num_words)
return LdaModels[SITE2].print_topics(num_topics, num_words)
LDAvis is a web-based interactive visualization of topics estimated using Latent Dirichlet Allocation. The visualization provides a global view of the topics (and how they differ from each other), while at the same time allowing for a deep inspection of the terms most highly associated with each individual topic.
Ref : 2014 ACL Workshop on Interactive Language Learning, Visualization, and Interfaces http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf
In [44]:
#for site in SITES:
site = SITE1
pyLDAvis.gensim.prepare(LdaModels[site],Corpora[site],Dictionaries[site])
Out[44]:
In [61]:
from gensim.models.doc2vec import TaggedDocument

class LabeledLineSentence(object):
    def __init__(self, sw, df, field, tag):
        self.df = df
        self.field = field
        self.tag = tag
        self.sw = sw
    def __iter__(self):
        for index, row in self.df.iterrows():
            tokens = self.sw.remove(row[self.field])
            yield TaggedDocument(words=tokens, tags=[row[self.tag]])
In [62]:
def get_doc2vec(site, force=False):
    doc2vecFile = 'models/' + site + '_doc2vec.model'
# Check if trained model file exists
if ( os.path.isfile(doc2vecFile) and not force):
# load trained model from file
docmodel = Doc2Vec.load(doc2vecFile)
else:
lablines = LabeledLineSentence(STOPW[site],Posts[site],'Title','Id')
# Create model
docmodel = Doc2Vec(alpha=0.025, min_alpha=0.025)
docmodel.build_vocab(lablines)
        for epoch in range(10):
            # note: newer gensim versions require total_examples and epochs arguments here
            docmodel.train(lablines)
            docmodel.alpha -= 0.002  # decrease the learning rate
            docmodel.min_alpha = docmodel.alpha  # fix the learning rate, no decay
# Save model
docmodel.save(doc2vecFile)
return docmodel
In [ ]:
site = SITE1
docmodel = get_doc2vec(site, RECOMPUTE_MODEL)
In [ ]:
class MatchingPost(object):
matchingPercentage = 0
title = ""
def __init__(self, matchingPercentage, title):
self.matchingPercentage = matchingPercentage
self.title = title
In [ ]:
def showsimilar(question):
    if not isinstance(question, str):
        question = str(question)
    norm_input = STOPW[site].remove(question)
    q_vector = docmodel.infer_vector(norm_input)
    similar_vecs = docmodel.docvecs.most_similar(positive=[q_vector])
    similarTitles = []
    for vec in similar_vecs:
        post = Posts[site][Posts[site]['Id'] == vec[0]]
        if len(post) == 0:
            continue
        title = post['Title']
        similarPostInfo = MatchingPost(vec[1], title.iloc[0])
        similarTitles.append(similarPostInfo)
print("Search results for [{}]:".format(question))
# Show top 10 matches only
for title in similarTitles:
post = title.title
matchPercentage = title.matchingPercentage
print("{:10.2f}% : {}".format(matchPercentage * 100, post))
return similarTitles
In [ ]:
# Find similar questions
similarTitles = showsimilar(samples[site][0])
In [ ]:
from IPython.display import display
from ipywidgets import widgets
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
def handler(sender):
showsimilar(text.value)
text = widgets.Text()
display(text)
text.on_submit(handler)
In [ ]:
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df = 0, stop_words = 'english')
tfidf = {}
featurenames = {}
for site in SITES:
tfidf[site] = tf.fit_transform([p['Title'] for i, p in Posts[site].iterrows()])
featurenames[site] = tf.get_feature_names()
#dense = tfidf_matrix.todense()
#len(feature_names)
#feature_names[50:70]
In [ ]:
tsne = {}
for site in SITES:
X_reduced = TruncatedSVD(n_components=50, random_state=0).fit_transform(tfidf[site])
tsne[site] = TSNE(learning_rate=10,n_components=2, perplexity=40, verbose=2).fit_transform(X_reduced)
In [ ]:
'''
X_pca = PCA().fit_transform(dense)
from matplotlib.pyplot import figure
figure(figsize=(10, 5))
plt.subplot(121)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1])
plt.title('T-SNE')
plt.subplot(122)
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title('PCA')
'''
In [ ]:
plt.figure(1,figsize=(5*3, 5))
plotId = 1
for site in SITES:
plt.subplot(120+plotId)
plt.scatter(tsne[site][:, 0], tsne[site][:, 1])
plt.title('T-SNE (' + site + ')')
plotId = plotId + 1
These are the kinds of questions we would like to pursue in the future:
Predict the next question a user may ask based on their current search
The raw data includes user-generated tags for every question; use supervised learning to predict tags from question text (a rough sketch follows after this list)
Build a browser plugin for the tab-grouping use case discussed in the introduction
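The tag-prediction idea above could be sketched roughly as follows. This is an illustrative outline only (not run in this notebook), reusing the Posts, samples and SITE1 objects defined earlier: each post title is the input, and the space-separated Tags column provides multi-label targets for a one-vs-rest logistic regression over TF-IDF features.
In [ ]:
# Rough sketch only -- supervised tag prediction from question titles
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

titles = Posts[SITE1]['Title']
tag_lists = Posts[SITE1]['Tags'].apply(lambda t: t.split())
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tag_lists)
clf = make_pipeline(TfidfVectorizer(stop_words='english'),
                    OneVsRestClassifier(LogisticRegression()))
clf.fit(titles, Y)
# predict tags for one of the sample questions defined at the top of the notebook
print(mlb.inverse_transform(clf.predict(samples[SITE1][:1])))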
We are always live @ https://github.com/dhruvaray/soml
In [ ]: