In [1]:
%load_ext autoreload
%autoreload 2
Example user:
Paul Groth
In [2]:
username = 'pgroth'
Publication titles found through Google Scholar
In [102]:
import json
publication_titles = json.load(open('paul_pubs.json'))
In [103]:
from gensim import similarities, models, corpora, utils
from nltk.corpus import stopwords  # requires nltk.download('stopwords') on first use
stoplist = stopwords.words('english')
In [104]:
import re
re_split = re.compile(r'\W+')  # split on any run of non-word characters
Read in all publication titles and tokenize them, removing stopwords and words that occur only once.
In [114]:
# Lowercase and tokenize each title, dropping stopwords and empty strings
texts = [[word for word in re_split.split(pub.lower())
          if word not in stoplist and word != '']
         for pub in publication_titles]
# Drop words that occur only once across all titles
from collections import Counter
token_counts = Counter(token for text in texts for token in text)
texts = [[word for word in text if token_counts[word] > 1] for text in texts]
Create a dictionary based on these texts, a 'bag-of-words' where every token gets an id, and a corpus based on this dictionary.
In [115]:
dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/paul.dict')
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('/tmp/paul.mm', corpus)
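As a quick sanity check, both saved files can be loaded back from disk; a minimal sketch using gensim's standard load methods:
In [ ]:
# Round-trip check: reload the persisted dictionary and corpus
loaded_dict = corpora.Dictionary.load('/tmp/paul.dict')
loaded_corpus = corpora.MmCorpus('/tmp/paul.mm')
print(loaded_dict)
print(loaded_corpus)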
Create a Latent Semantic Indexing (LSI) model.
In [123]:
lsi = models.LsiModel(corpus, id2word = dictionary, num_topics=10)
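To get a feel for what the model has learned, we can inspect the topics as weighted combinations of title words. A minimal sketch; the exact `print_topics` signature has shifted across gensim versions:
In [ ]:
# Show the 10 topics as weighted word combinations
lsi.print_topics(num_topics=10)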
From this model we can build a similarity index and measure how close new sentences are to the documents in the corpus. I hope we can also find a way to compare an entire Twitter feed to this model and say how similar it is to this corpus (see the sketch after the query below).
In [124]:
index = similarities.MatrixSimilarity(lsi[corpus])
index.save('/tmp/paul.index')
In [125]:
vec_bow = dictionary.doc2bow('New sentence containing the words Provenance Query experiments chemistry data'.lower().split())
vec_lsi = lsi[vec_bow]  # project the query into LSI space
sims = index[vec_lsi]   # cosine similarity against every title
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)
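The sentence query above generalizes to the Twitter-feed idea: one simple approach is to pool all of a user's tweets into a single bag-of-words and score it against the titles. This is only a sketch, and the `tweets` list here is a hypothetical stand-in for a real feed:
In [ ]:
# Sketch: score a whole feed against the publication corpus.
# `tweets` is a hypothetical list of tweet strings for the user.
tweets = ['provenance on the web of data', 'running chemistry experiments']
feed_tokens = [w for t in tweets for w in re_split.split(t.lower())
               if w not in stoplist and w != '']
feed_lsi = lsi[dictionary.doc2bow(feed_tokens)]
feed_sims = index[feed_lsi]
print('mean similarity to corpus: %.3f' % feed_sims.mean())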
In [ ]: