Clustering the twitter samples corpus

corpushash is a simple library that aims to make natural language processing of sensitive documents easier. The library enables common NLP tasks to be performed on sensitive documents without disclosing their contents. This is done by hashing every token in the corpus along with a salt (to prevent dictionary attacks).

Its workflow is simple: provide the sensitive corpus as a python nested list (or generator) whose elements are themselves (nested) lists of strings. After the hashing is done, NLP can be carried out by a third party, and when the results are in they can be decoded using a dictionary that maps hashes back to the original strings. In code, that looks like:

import corpushash as ch
hashed_corpus = ch.CorpusHash(mycorpus_as_a_nested_list, '/home/sensitive-corpus')
>>> "42 documents hashed and saved to '/home/sensitive-corpus/public/$(timestamp)'"

NLP is done, and results are in:

for token in results:
    print(token, ">", hashed_corpus.decode_dictionary[token])
>>> "7)JBMGG?sGu+>%Js~dG=%c1Qn1HpAU{jM-~Buu7?" > "gutenberg"

loading libraries


In [10]:
import gensim
import logging, bz2, os
from corpushash import CorpusHash
from nltk.corpus import twitter_samples as tt
import numpy as np
import string
from gensim import corpora, models, similarities

In [11]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

uncomment this if you don't have the corpus downloaded


In [12]:
#import nltk
#nltk.download('twitter_samples')

specify the directory you'd like to save files to:


In [13]:
path = os.getcwd()

fixing the random seed is needed because gensim's LSI training uses a randomized SVD algorithm:


In [14]:
np.random.seed(42)

the twitter samples corpus

this is what the corpus looks like:


In [15]:
tt.strings()[:10]


Out[15]:
['hopeless for tmr :(',
 "Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in 2 months :(",
 '@Hegelbon That heart sliding into the waste basket. :(',
 '“@ketchBurning: I hate Japanese call him "bani" :( :(”\n\nMe too',
 'Dang starting next week I have "work" :(',
 "oh god, my babies' faces :( https://t.co/9fcwGvaki0",
 '@RileyMcDonough make me smile :((',
 '@f0ggstar @stuartthull work neighbour on motors. Asked why and he said hates the updates on search :( http://t.co/XvmTUikWln',
 'why?:("@tahuodyy: sialan:( https://t.co/Hv1i0xcrL2"',
 'Athabasca glacier was there in #1948 :-( #athabasca #glacier #jasper #jaspernationalpark #alberta #explorealberta #… http://t.co/dZZdqmf7Cz']

but we'll be using the pre-tokenized version:


In [16]:
tt.tokenized()[0]


Out[16]:
['hopeless', 'for', 'tmr', ':(']

In [17]:
len(tt.tokenized())


Out[17]:
30000

In [18]:
decoded_twitter = tt.tokenized()

building gensim dictionary

from these documents gensim will build a dictionary that maps every token to an ID, a mapping which is later used to calculate the tf-idf weights:


In [19]:
id2word = gensim.corpora.Dictionary(decoded_twitter)


2017-05-23 22:39:18,669 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2017-05-23 22:39:20,438 : INFO : adding document #10000 to Dictionary(24343 unique tokens: ['Americano', 'http://t.co/lFm9Zq4Tj2', 'meditation', 'Danny', '@nabilaAF2013']...)
2017-05-23 22:39:22,948 : INFO : adding document #20000 to Dictionary(35614 unique tokens: ['Americano', 'Roche', 'http://t.co/lFm9Zq4Tj2', 'Republican', 'meditation']...)
2017-05-23 22:39:25,462 : INFO : built Dictionary(42532 unique tokens: ['https://t.co/aueMOZvKeq', 'Americano', '#Wow', 'Roche', 'http://t.co/lFm9Zq4Tj2']...) from 30000 documents (total 580322 corpus positions)

In [20]:
id2word[0]


Out[20]:
'hopeless'
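
the mapping also works in reverse; a minimal sketch (using the id2word dictionary built above) to check it:

# token2id is the reverse mapping, from token string to integer ID
print(id2word.token2id['hopeless'])  # 0, matching the lookup above
print(len(id2word))                  # number of unique tokens in the dictionary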

bag-of-words

to build a tf-idf model, the gensim library needs an input that yields a vectorized bag-of-words representation of each document when iterated over:
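
for intuition, here is what such a vector looks like for the first tweet (a quick sketch, reusing the id2word dictionary and the decoded_twitter list defined above):

# the bag-of-words vector is a list of (token id, count) pairs;
# mapping the ids back to strings makes it readable
first_bow = id2word.doc2bow(decoded_twitter[0])
print(first_bow)
print([(id2word[token_id], count) for token_id, count in first_bow])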


In [21]:
mm = [id2word.doc2bow(text) for text in decoded_twitter]
gensim.corpora.MmCorpus.serialize(os.path.join(path, 'twitter_pt_tfidf.mm'), mm)


2017-05-23 22:39:30,548 : INFO : storing corpus in Matrix Market format to /home/bruno/Documents/github/corpushash/notebooks/twitter_pt_tfidf.mm
2017-05-23 22:39:30,552 : INFO : saving sparse matrix to /home/bruno/Documents/github/corpushash/notebooks/twitter_pt_tfidf.mm
2017-05-23 22:39:30,556 : INFO : PROGRESS: saving document #0
2017-05-23 22:39:30,753 : INFO : PROGRESS: saving document #1000
2017-05-23 22:39:30,962 : INFO : PROGRESS: saving document #2000
2017-05-23 22:39:31,170 : INFO : PROGRESS: saving document #3000
2017-05-23 22:39:31,367 : INFO : PROGRESS: saving document #4000
2017-05-23 22:39:31,557 : INFO : PROGRESS: saving document #5000
2017-05-23 22:39:31,753 : INFO : PROGRESS: saving document #6000
2017-05-23 22:39:31,957 : INFO : PROGRESS: saving document #7000
2017-05-23 22:39:32,157 : INFO : PROGRESS: saving document #8000
2017-05-23 22:39:32,369 : INFO : PROGRESS: saving document #9000
2017-05-23 22:39:32,568 : INFO : PROGRESS: saving document #10000
2017-05-23 22:39:32,873 : INFO : PROGRESS: saving document #11000
2017-05-23 22:39:33,181 : INFO : PROGRESS: saving document #12000
2017-05-23 22:39:33,502 : INFO : PROGRESS: saving document #13000
2017-05-23 22:39:33,817 : INFO : PROGRESS: saving document #14000
2017-05-23 22:39:34,123 : INFO : PROGRESS: saving document #15000
2017-05-23 22:39:34,439 : INFO : PROGRESS: saving document #16000
2017-05-23 22:39:34,746 : INFO : PROGRESS: saving document #17000
2017-05-23 22:39:35,049 : INFO : PROGRESS: saving document #18000
2017-05-23 22:39:35,353 : INFO : PROGRESS: saving document #19000
2017-05-23 22:39:35,665 : INFO : PROGRESS: saving document #20000
2017-05-23 22:39:35,974 : INFO : PROGRESS: saving document #21000
2017-05-23 22:39:36,281 : INFO : PROGRESS: saving document #22000
2017-05-23 22:39:36,593 : INFO : PROGRESS: saving document #23000
2017-05-23 22:39:36,891 : INFO : PROGRESS: saving document #24000
2017-05-23 22:39:37,188 : INFO : PROGRESS: saving document #25000
2017-05-23 22:39:37,489 : INFO : PROGRESS: saving document #26000
2017-05-23 22:39:37,797 : INFO : PROGRESS: saving document #27000
2017-05-23 22:39:38,108 : INFO : PROGRESS: saving document #28000
2017-05-23 22:39:38,436 : INFO : PROGRESS: saving document #29000
2017-05-23 22:39:38,741 : INFO : saved 30000x42532 matrix, density=0.042% (538552/1275960000)
2017-05-23 22:39:38,745 : INFO : saving MmCorpus index to /home/bruno/Documents/github/corpushash/notebooks/twitter_pt_tfidf.mm.index

In [22]:
%%time
if os.path.exists(os.path.join(path, 'twitter_tfidf_model')):
    tfidf = models.TfidfModel.load(os.path.join(path, 'twitter_tfidf_model'))
else:
    tfidf = models.TfidfModel(mm)
    tfidf.save('twitter_tfidf_model')


2017-05-23 22:39:38,788 : INFO : collecting document frequencies
2017-05-23 22:39:38,793 : INFO : PROGRESS: processing document #0
2017-05-23 22:39:38,989 : INFO : PROGRESS: processing document #10000
2017-05-23 22:39:39,318 : INFO : PROGRESS: processing document #20000
2017-05-23 22:39:39,652 : INFO : calculating IDF weights for 30000 documents and 42531 features (538552 matrix non-zeros)
2017-05-23 22:39:39,890 : INFO : saving TfidfModel object under twitter_tfidf_model, separately None
2017-05-23 22:39:39,920 : INFO : saved twitter_tfidf_model
CPU times: user 1.1 s, sys: 28 ms, total: 1.13 s
Wall time: 1.14 s

Calculating the LSI model

The next step is to train the LSI model on the tf-idf-transformed corpus, so we will need yet another generator to yield the transformed documents.


In [23]:
def tfidf_corpus_stream(corpus):
    for doc in corpus:
        yield tfidf[doc]

In [24]:
tfidf_corpus_s = tfidf_corpus_stream(mm)
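
as a quick sanity check (a sketch, using the tfidf model and the mm list built above), a single transformed document can be inspected like this:

# the tf-idf transformation turns raw counts into (token id, weight) pairs
print(tfidf[mm[0]])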

In [25]:
if os.path.exists(os.path.join(path, 'twitter_lsi_model')):
    lsi = gensim.models.LsiModel.load(os.path.join(path, 'twitter_lsi_model'))
else:
    lsi = gensim.models.lsimodel.LsiModel(corpus=tfidf_corpus_s, id2word=id2word, num_topics=100)
    lsi.save(os.path.join(path, 'twitter_lsi_model'))


2017-05-23 22:39:40,046 : INFO : using serial LSI version on this node
2017-05-23 22:39:40,050 : INFO : updating model with new documents
2017-05-23 22:39:44,837 : INFO : preparing a new chunk of documents
2017-05-23 22:39:45,831 : INFO : using 100 extra samples and 2 power iterations
2017-05-23 22:39:45,836 : INFO : 1st phase: constructing (42532, 200) action matrix
2017-05-23 22:39:47,061 : INFO : orthonormalizing (42532, 200) action matrix
2017-05-23 22:40:12,259 : INFO : 2nd phase: running dense svd on (200, 20000) matrix
2017-05-23 22:40:19,609 : INFO : computing the final decomposition
2017-05-23 22:40:19,628 : INFO : keeping 100 factors (discarding 22.970% of energy spectrum)
2017-05-23 22:40:23,244 : INFO : processed documents up to #20000
2017-05-23 22:40:23,254 : INFO : topic #0(20.207): 0.246*""" + 0.207*"SNP" + 0.186*"Tories" + 0.174*"Miliband" + 0.171*"in" + 0.162*"is" + 0.160*"to" + 0.154*"of" + 0.146*"Sco" + 0.146*"…"
2017-05-23 22:40:23,263 : INFO : topic #1(17.533): -0.288*""" + 0.200*"SNP" + -0.187*"preoccupied" + -0.187*"@Tommy_Colc" + -0.187*"inequality" + -0.186*"wrote" + -0.179*"claiming" + -0.177*"man" + -0.174*"come" + -0.174*"w"
2017-05-23 22:40:23,272 : INFO : topic #2(14.004): 0.235*"(" + 0.209*"%" + 0.200*":(" + 0.189*"I" + 0.174*"-" + 0.170*"!" + 0.167*"you" + 0.163*"the" + 0.147*")" + 0.145*"a"
2017-05-23 22:40:23,282 : INFO : topic #3(12.137): 0.543*"%" + 0.331*"-" + 0.317*"(" + 0.251*")" + 0.223*"1" + 0.124*"+" + 0.116*"CON" + 0.115*"LAB" + -0.114*"I" + -0.113*"you"
2017-05-23 22:40:23,295 : INFO : topic #4(10.230): -0.424*":(" + -0.378*"(" + 0.216*"%" + -0.204*"!" + 0.182*"'" + -0.160*"i" + -0.158*":)" + 0.126*":" + -0.119*"I" + 0.117*"Cameron"
2017-05-23 22:40:25,938 : INFO : preparing a new chunk of documents
2017-05-23 22:40:26,487 : INFO : using 100 extra samples and 2 power iterations
2017-05-23 22:40:26,491 : INFO : 1st phase: constructing (42532, 200) action matrix
2017-05-23 22:40:27,203 : INFO : orthonormalizing (42532, 200) action matrix
2017-05-23 22:40:52,163 : INFO : 2nd phase: running dense svd on (200, 10000) matrix
2017-05-23 22:40:56,878 : INFO : computing the final decomposition
2017-05-23 22:40:56,888 : INFO : keeping 100 factors (discarding 23.851% of energy spectrum)
2017-05-23 22:41:00,371 : INFO : merging projections: (42532, 100) + (42532, 100)
2017-05-23 22:41:08,904 : INFO : keeping 100 factors (discarding 12.573% of energy spectrum)
2017-05-23 22:41:13,063 : INFO : processed documents up to #30000
2017-05-23 22:41:13,073 : INFO : topic #0(26.147): 0.313*""" + 0.181*"Tories" + 0.171*"preoccupied" + 0.171*"inequality" + 0.171*"@Tommy_Colc" + 0.171*"wrote" + 0.167*"Miliband" + 0.167*"claiming" + 0.166*"w" + 0.164*"man"
2017-05-23 22:41:13,084 : INFO : topic #1(22.136): 0.247*"SNP" + -0.210*""" + 0.178*"Sco" + 0.177*"to" + 0.176*"protect" + 0.176*"lots" + 0.175*"definitely" + 0.172*"@NicolaSturgeon" + 0.172*"rather" + 0.170*"let"
2017-05-23 22:41:13,095 : INFO : topic #2(17.285): 0.194*"the" + 0.175*"." + -0.169*"protect" + -0.169*"lots" + -0.168*"definitely" + -0.168*"Sco" + 0.159*"a" + 0.159*"%" + 0.148*"I" + -0.147*"MPs"
2017-05-23 22:41:13,107 : INFO : topic #3(14.181): -0.646*"%" + -0.319*"-" + -0.266*"(" + -0.220*")" + -0.195*"1" + -0.131*"CON" + -0.130*"LAB" + -0.127*"poll" + -0.126*"8" + -0.125*"34"
2017-05-23 22:41:13,119 : INFO : topic #4(11.743): -0.236*"thus" + -0.236*"ahem" + -0.236*"@thomasmessenger" + -0.236*"http://t.co/DkLwCwzhDA" + -0.235*"financial" + -0.234*"caused" + -0.233*"global" + -0.233*"crisis" + -0.225*"For" + -0.225*"overspent"
2017-05-23 22:41:13,179 : INFO : saving Projection object under /home/bruno/Documents/github/corpushash/notebooks/twitter_lsi_model.projection, separately None
2017-05-23 22:41:14,211 : INFO : saved /home/bruno/Documents/github/corpushash/notebooks/twitter_lsi_model.projection
2017-05-23 22:41:14,215 : INFO : saving LsiModel object under /home/bruno/Documents/github/corpushash/notebooks/twitter_lsi_model, separately None
2017-05-23 22:41:14,219 : INFO : not storing attribute projection
2017-05-23 22:41:14,224 : INFO : not storing attribute dispatcher
2017-05-23 22:41:14,339 : INFO : saved /home/bruno/Documents/github/corpushash/notebooks/twitter_lsi_model

In [26]:
for n in range(17):
    print("====================")
    print("Topic {}:".format(n))
    print("Coef.\t Token")
    print("--------------------")
    for tok,coef in lsi.show_topic(n):
        print("{:.3}\t{}".format(coef,tok))


====================
Topic 0:
Coef.	 Token
--------------------
0.313	"
0.181	Tories
0.171	preoccupied
0.171	inequality
0.171	@Tommy_Colc
0.171	wrote
0.167	Miliband
0.167	claiming
0.166	w
0.164	man
====================
Topic 1:
Coef.	 Token
--------------------
0.247	SNP
-0.21	"
0.178	Sco
0.177	to
0.176	protect
0.176	lots
0.175	definitely
0.172	@NicolaSturgeon
0.172	rather
0.17	let
====================
Topic 2:
Coef.	 Token
--------------------
0.194	the
0.175	.
-0.169	protect
-0.169	lots
-0.168	definitely
-0.168	Sco
0.159	a
0.159	%
0.148	I
-0.147	MPs
====================
Topic 3:
Coef.	 Token
--------------------
-0.646	%
-0.319	-
-0.266	(
-0.22	)
-0.195	1
-0.131	CON
-0.13	LAB
-0.127	poll
-0.126	8
-0.125	34
====================
Topic 4:
Coef.	 Token
--------------------
-0.236	thus
-0.236	ahem
-0.236	@thomasmessenger
-0.236	http://t.co/DkLwCwzhDA
-0.235	financial
-0.234	caused
-0.233	global
-0.233	crisis
-0.225	For
-0.225	overspent
====================
Topic 5:
Coef.	 Token
--------------------
0.341	FT
0.321	(
-0.232	%
0.2	)
0.19	:(
0.177	Jonathan
0.177	Ford
0.177	writer
0.176	Boris
-0.162	'
====================
Topic 6:
Coef.	 Token
--------------------
0.414	'
0.182	deal
-0.171	Cameron
-0.162	David
0.152	Tomorrow
0.147	myself
0.147	@mrmarksteel
0.145	case
-0.141	on
0.136	tell
====================
Topic 7:
Coef.	 Token
--------------------
0.282	!
0.279	:(
-0.232	FT
0.187	:)
0.182	you
0.171	I
-0.155	leader
-0.127	'
-0.123	Jonathan
-0.123	Ford
====================
Topic 8:
Coef.	 Token
--------------------
-0.312	'
0.24	-
0.21	(
0.197	Labour
-0.173	,
-0.162	%
0.146	"
0.133	SNP
0.13	with
-0.13	FT
====================
Topic 9:
Coef.	 Token
--------------------
0.365	(
0.264	David
0.248	Cameron
0.202	:(
-0.181	%
0.151	...
-0.144	#AskNigelFarage
0.137	-
0.13	)
0.129	*
====================
Topic 10:
Coef.	 Token
--------------------
-0.356	!
-0.224	you
-0.198	he'd
-0.185	than
0.182	"
-0.182	rather
-0.171	:)
0.153	no
-0.139	let
0.134	(
====================
Topic 11:
Coef.	 Token
--------------------
-0.527	"
-0.367	!
0.218	:(
-0.134	:)
0.129	I
0.126	.
-0.121	-
0.0924	(
0.0896	Tories
-0.0893	'
====================
Topic 12:
Coef.	 Token
--------------------
0.394	'
-0.262	retweet
-0.198	not
-0.194	this
-0.165	do
0.163	(
-0.156	@LabourEoin
-0.138	would
-0.133	http://t.co/5D2pKCstr3
-0.131	repeat
====================
Topic 13:
Coef.	 Token
--------------------
0.527	"
-0.278	!
-0.201	'
-0.156	-
0.15	not
-0.143	http
-0.126	*
0.115	says
0.106	do
0.103	than
====================
Topic 14:
Coef.	 Token
--------------------
-0.186	hungrier
-0.186	reliant
0.184	Ed
-0.182	five
-0.178	banks
-0.176	food
-0.176	@Markfergusonuk
0.174	*
-0.17	ago
-0.164	years
====================
Topic 15:
Coef.	 Token
--------------------
-0.26	#AskNigelFarage
-0.206	Farage
0.198	"
-0.18	(
-0.18	retweet
-0.162	Nigel
-0.16	@Nigel_Farage
-0.16	@UKIP
0.157	:)
-0.155	#UKIP
====================
Topic 16:
Coef.	 Token
--------------------
0.237	:(
0.162	Ed
-0.161	-
-0.16	)
0.144	%
-0.14	will
-0.137	#bbcqt
-0.134	#AskNigelFarage
-0.131	child
-0.13	*

LSI on the hashed corpus

Now we hash all of the original documents and run the same analysis we ran with the plain corpus.


In [27]:
np.random.seed(42)

processing using the corpushash library

instantiating the CorpusHash class, which hashes the provided corpus and saves it to the given corpus path:


In [28]:
%%time
hashed = CorpusHash(decoded_twitter, 'twitter')


2017-05-23 22:41:43,678 - corpushash.hashers - INFO - 30000 documents hashed and saved to twitter/public/2017-05-23_22-41-14-659510.
2017-05-23 22:41:43,678 : INFO : 30000 documents hashed and saved to twitter/public/2017-05-23_22-41-14-659510.
CPU times: user 24.9 s, sys: 3.85 s, total: 28.7 s
Wall time: 29 s

that is it. corpushash's work is done.
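
before handing the corpus over, we can peek at what a third party would actually receive (a sketch, assuming read_hashed_corpus yields one tokenized document at a time, as when it feeds gensim below):

# the first hashed document keeps the structure of the original tweet,
# but every token is replaced by its hash
for first_hashed_document in hashed.read_hashed_corpus():
    break
print(first_hashed_document)
print(len(first_hashed_document) == len(decoded_twitter[0]))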

building dictionary for the hashed corpus


In [29]:
id2word = gensim.corpora.Dictionary(hashed.read_hashed_corpus())


2017-05-23 22:41:43,707 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2017-05-23 22:41:48,381 : INFO : adding document #10000 to Dictionary(24343 unique tokens: ['yXWupV7(%Jh!0%xFvLM3KCn!Rr$+BsbDU$W$N&d_', '6KU!3F6?h!OF9NsnPE^;r9Rj8e)AqhO!PYKbM3I1', 'C2QO*QC}-Y2aYL9fLx%Vs?PB&9OrQD$tI2DTzc{b', '?l|xct~8Qo?l~`s8&#O_OzN6<GN!%MmH4a#J9U;M', '<K>XG@1gDWqVR831Pu(2>Drd^vJ@X6&v64cA&bGY']...)
2017-05-23 22:41:53,801 : INFO : adding document #20000 to Dictionary(35614 unique tokens: ['yXWupV7(%Jh!0%xFvLM3KCn!Rr$+BsbDU$W$N&d_', '?&<)40p-obP!%ppvV+NMcy5N&+J+4vkkDrw=`r)7', '6KU!3F6?h!OF9NsnPE^;r9Rj8e)AqhO!PYKbM3I1', 'C2QO*QC}-Y2aYL9fLx%Vs?PB&9OrQD$tI2DTzc{b', '?l|xct~8Qo?l~`s8&#O_OzN6<GN!%MmH4a#J9U;M']...)
2017-05-23 22:41:59,155 : INFO : built Dictionary(42532 unique tokens: ['yXWupV7(%Jh!0%xFvLM3KCn!Rr$+BsbDU$W$N&d_', 'HAUZdq60Hh-U@NAP|^&3i%rnpTXX1QD3zOt<I9?X', '?&<)40p-obP!%ppvV+NMcy5N&+J+4vkkDrw=`r)7', '6KU!3F6?h!OF9NsnPE^;r9Rj8e)AqhO!PYKbM3I1', 'C2QO*QC}-Y2aYL9fLx%Vs?PB&9OrQD$tI2DTzc{b']...) from 30000 documents (total 580322 corpus positions)

In [30]:
id2word[0]


Out[30]:
'>)>{1_a)^MuyM>C0m&TpEi&OwdI=p=a>{46-ep!@'

In [31]:
mm = [id2word.doc2bow(text) for text in hashed.read_hashed_corpus()]
gensim.corpora.MmCorpus.serialize(os.path.join(path, 'hashed_twitter_pt_tfidf.mm'), mm)


2017-05-23 22:42:12,175 : INFO : storing corpus in Matrix Market format to /home/bruno/Documents/github/corpushash/notebooks/hashed_twitter_pt_tfidf.mm
2017-05-23 22:42:12,179 : INFO : saving sparse matrix to /home/bruno/Documents/github/corpushash/notebooks/hashed_twitter_pt_tfidf.mm
2017-05-23 22:42:12,183 : INFO : PROGRESS: saving document #0
2017-05-23 22:42:12,379 : INFO : PROGRESS: saving document #1000
2017-05-23 22:42:12,574 : INFO : PROGRESS: saving document #2000
2017-05-23 22:42:12,768 : INFO : PROGRESS: saving document #3000
2017-05-23 22:42:12,962 : INFO : PROGRESS: saving document #4000
2017-05-23 22:42:13,149 : INFO : PROGRESS: saving document #5000
2017-05-23 22:42:13,349 : INFO : PROGRESS: saving document #6000
2017-05-23 22:42:13,557 : INFO : PROGRESS: saving document #7000
2017-05-23 22:42:13,758 : INFO : PROGRESS: saving document #8000
2017-05-23 22:42:13,964 : INFO : PROGRESS: saving document #9000
2017-05-23 22:42:14,163 : INFO : PROGRESS: saving document #10000
2017-05-23 22:42:14,463 : INFO : PROGRESS: saving document #11000
2017-05-23 22:42:14,773 : INFO : PROGRESS: saving document #12000
2017-05-23 22:42:15,089 : INFO : PROGRESS: saving document #13000
2017-05-23 22:42:15,400 : INFO : PROGRESS: saving document #14000
2017-05-23 22:42:15,715 : INFO : PROGRESS: saving document #15000
2017-05-23 22:42:16,014 : INFO : PROGRESS: saving document #16000
2017-05-23 22:42:16,321 : INFO : PROGRESS: saving document #17000
2017-05-23 22:42:16,621 : INFO : PROGRESS: saving document #18000
2017-05-23 22:42:16,915 : INFO : PROGRESS: saving document #19000
2017-05-23 22:42:17,211 : INFO : PROGRESS: saving document #20000
2017-05-23 22:42:17,507 : INFO : PROGRESS: saving document #21000
2017-05-23 22:42:17,800 : INFO : PROGRESS: saving document #22000
2017-05-23 22:42:18,101 : INFO : PROGRESS: saving document #23000
2017-05-23 22:42:18,394 : INFO : PROGRESS: saving document #24000
2017-05-23 22:42:18,697 : INFO : PROGRESS: saving document #25000
2017-05-23 22:42:19,006 : INFO : PROGRESS: saving document #26000
2017-05-23 22:42:19,312 : INFO : PROGRESS: saving document #27000
2017-05-23 22:42:19,614 : INFO : PROGRESS: saving document #28000
2017-05-23 22:42:19,920 : INFO : PROGRESS: saving document #29000
2017-05-23 22:42:20,221 : INFO : saved 30000x42532 matrix, density=0.042% (538552/1275960000)
2017-05-23 22:42:20,225 : INFO : saving MmCorpus index to /home/bruno/Documents/github/corpushash/notebooks/hashed_twitter_pt_tfidf.mm.index

In [32]:
%%time
if os.path.exists(os.path.join(path, 'hashed_twitter_tfidf_model')):
    tfidf = models.TfidfModel.load(os.path.join(path, 'hashed_twitter_tfidf_model'))
else:
    tfidf = models.TfidfModel(mm)
    tfidf.save(os.path.join(path, 'hashed_twitter_tfidf_model'))


2017-05-23 22:42:20,264 : INFO : collecting document frequencies
2017-05-23 22:42:20,268 : INFO : PROGRESS: processing document #0
2017-05-23 22:42:20,473 : INFO : PROGRESS: processing document #10000
2017-05-23 22:42:20,786 : INFO : PROGRESS: processing document #20000
2017-05-23 22:42:21,118 : INFO : calculating IDF weights for 30000 documents and 42531 features (538552 matrix non-zeros)
2017-05-23 22:42:21,348 : INFO : saving TfidfModel object under /home/bruno/Documents/github/corpushash/notebooks/hashed_twitter_tfidf_model, separately None
2017-05-23 22:42:21,377 : INFO : saved /home/bruno/Documents/github/corpushash/notebooks/hashed_twitter_tfidf_model
CPU times: user 1.12 s, sys: 8 ms, total: 1.12 s
Wall time: 1.12 s

As before, the next step is to train the LSI model on the tf-idf-transformed corpus, so we need the same kind of generator to yield the transformed documents.


In [33]:
def tfidf_corpus_stream(corpus):
    for doc in corpus:
        yield tfidf[doc]

In [34]:
tfidf_corpus_s = tfidf_corpus_stream(mm)

Calculating the LSI model


In [35]:
if os.path.exists(os.path.join(path, 'hashed_twitter_lsi_model')):
    lsih = gensim.models.LsiModel.load(os.path.join(path, 'hashed_twitter_lsi_model'))
else:
    lsih = gensim.models.lsimodel.LsiModel(corpus=tfidf_corpus_s, id2word=id2word, num_topics=100)
    lsih.save(os.path.join(path, 'hashed_twitter_lsi_model'))


2017-05-23 22:42:21,498 : INFO : using serial LSI version on this node
2017-05-23 22:42:21,503 : INFO : updating model with new documents
2017-05-23 22:42:25,736 : INFO : preparing a new chunk of documents
2017-05-23 22:42:26,726 : INFO : using 100 extra samples and 2 power iterations
2017-05-23 22:42:26,730 : INFO : 1st phase: constructing (42532, 200) action matrix
2017-05-23 22:42:27,956 : INFO : orthonormalizing (42532, 200) action matrix
2017-05-23 22:42:51,259 : INFO : 2nd phase: running dense svd on (200, 20000) matrix
2017-05-23 22:42:58,155 : INFO : computing the final decomposition
2017-05-23 22:42:58,160 : INFO : keeping 100 factors (discarding 22.970% of energy spectrum)
2017-05-23 22:43:01,784 : INFO : processed documents up to #20000
2017-05-23 22:43:01,796 : INFO : topic #0(20.207): 0.246*"+xy0wTTqhE|JZz*%3WH{1@1q9h<G<=uLOD^et9xA" + 0.207*"d>egDXR4GAFvN3>C#)E7D`$~a8Jrm^3Sk6!^KgLD" + 0.186*"kGAJg>G9k#tX-bv0@t`HG>lN^h{c3!)<j`xP7D7)" + 0.174*"UC&2ub*;{Z+u~QBI_5Nvm##j=vQ4Kb{jrqZ0cyQ9" + 0.171*"Cs){F8vfMX0J=VRZfm7hVzq|K)nN5lopFL-JMTRL" + 0.162*"scC~6Ewvk)rQlronLRbS2@`x+`h}}_p``j02te&(" + 0.160*"I~sD~jLSr+)GCK909lzdzZ3G;qmm{27cyfFDq9+6" + 0.154*"VxfA8F~$z1TM2|NDDjwr>&cZBx~ec$;6R_WjuN*8" + 0.146*"T%uM_zM%cSpWHVycNOkwYJ;A@0LKu#kFCR-A$<C-" + 0.146*"1a3qE3$b^|ktB>5M8w5JY53Vc_i#O;YW7R9`M<7%"
2017-05-23 22:43:01,809 : INFO : topic #1(17.533): -0.288*"+xy0wTTqhE|JZz*%3WH{1@1q9h<G<=uLOD^et9xA" + 0.200*"d>egDXR4GAFvN3>C#)E7D`$~a8Jrm^3Sk6!^KgLD" + -0.187*"7y5%8<c1FGHe#<|dSu3vccWGzM9Q=*v4Z~lCTyRF" + -0.187*"fcu&`jIKDa#uPvE5u2(^)Z6n*AYzTwzbe#gP5;R1" + -0.187*"%4N-jt=3f2@)u4VhSd^a*M@(S%#}e2p!OmWo6cGd" + -0.186*"MK{t^7~%aBL2s88ymX=}S3R5ZohAfC370_$lde}H" + -0.179*"!~2FQb#UvY9{_*vVk6$2r=BlJ7&KiuJ=m&?+39}`" + -0.177*"(=w$026I9$E;O3NWu`+r+EKS@8tfx7Y|rQ?#Qexa" + -0.174*"9`-{*T<vV}7!gZ=CHhVRMjf#Wo=fja1+Sy4V;Zkf" + -0.174*"Wze>-N3e(e%@~h_T9GmAjsj^FE15#}DV6I7z^#l0"
2017-05-23 22:43:01,817 : INFO : topic #2(14.004): 0.235*"|DFN1^-Rm2LBndh2UKPj&yu+vFnE%_e&2C5n~))K" + 0.209*"T*pV0z?{CsLU*C_G*C+s4<`lU`F2sUXL-4ACCo$N" + 0.200*"=P1Ag@Ij!t@IL?%A<%u8vd4M#*=i#bm2bEny=a@5" + 0.189*"bH3^hWim!;vcXsFT-QDx3;e0N77|4{g+t&W^&CCI" + 0.174*"t@eOlY=<ke^P>>>+{uBW@Ef%VR<YqbOWZ3V_Pdvi" + 0.170*"oDXju01V%d!rg9hE57SZym8*E(ck9rvJN+5tB)y%" + 0.167*"@^>@IC?15aQs^hUVBYM1-5fAvs_ordu&=rnNh6PB" + 0.163*"RIi|^?U#n%E1dN|mB5{%#t%U=ZRofW?YORx5e@zS" + 0.147*">4*}YwimZ6-#mVL2ER|n?mnn7qj|QPrw{7B`-N+m" + 0.145*"yd=I30lV+i$zYCPC0kR>%795MIUWZT##<gbXcgwF"
2017-05-23 22:43:01,827 : INFO : topic #3(12.137): -0.543*"T*pV0z?{CsLU*C_G*C+s4<`lU`F2sUXL-4ACCo$N" + -0.331*"t@eOlY=<ke^P>>>+{uBW@Ef%VR<YqbOWZ3V_Pdvi" + -0.317*"|DFN1^-Rm2LBndh2UKPj&yu+vFnE%_e&2C5n~))K" + -0.251*">4*}YwimZ6-#mVL2ER|n?mnn7qj|QPrw{7B`-N+m" + -0.223*"LrA@7L~CKxM)HY?-yvGd0aQ7{>D`XFUEP+lTEz0O" + -0.124*"`*3zeQ@f&=OlKs@DTj@8a*~|AEFVdm2DXJvUCWV%" + -0.116*"zw5l{Ei}J#O|YZD{lHbRPQSgo7iOkBQyq#7rI3En" + -0.115*"Sh|)rr4B!m;irvDrEaT@-7eNxQ{E<O1JkcoOuC_#" + 0.114*"bH3^hWim!;vcXsFT-QDx3;e0N77|4{g+t&W^&CCI" + 0.113*"@^>@IC?15aQs^hUVBYM1-5fAvs_ordu&=rnNh6PB"
2017-05-23 22:43:01,837 : INFO : topic #4(10.230): -0.424*"=P1Ag@Ij!t@IL?%A<%u8vd4M#*=i#bm2bEny=a@5" + -0.378*"|DFN1^-Rm2LBndh2UKPj&yu+vFnE%_e&2C5n~))K" + 0.216*"T*pV0z?{CsLU*C_G*C+s4<`lU`F2sUXL-4ACCo$N" + -0.204*"oDXju01V%d!rg9hE57SZym8*E(ck9rvJN+5tB)y%" + 0.182*"^9b$h%%)@4{(Y|b_bYp`)!Qww+w|Mp6PVxspg<TY" + -0.160*"^=}#sVuCM=X_>IB@gXfim!jq?b57MYloVue2UN|k" + -0.158*"#oN0tUUyM7arz)StmZs827s8GEM}zMITF+*Yoh))" + 0.126*"6<1+nW&?VjhF)H5le0c?_43A421Ea$zrb{fiOr9`" + -0.119*"bH3^hWim!;vcXsFT-QDx3;e0N77|4{g+t&W^&CCI" + 0.117*"m>`y63pFyS2X_RhMjirqjTIY~pM#7()AqvYZ>ha5"
2017-05-23 22:43:04,982 : INFO : preparing a new chunk of documents
2017-05-23 22:43:05,558 : INFO : using 100 extra samples and 2 power iterations
2017-05-23 22:43:05,562 : INFO : 1st phase: constructing (42532, 200) action matrix
2017-05-23 22:43:06,250 : INFO : orthonormalizing (42532, 200) action matrix
2017-05-23 22:43:32,582 : INFO : 2nd phase: running dense svd on (200, 10000) matrix
2017-05-23 22:43:37,064 : INFO : computing the final decomposition
2017-05-23 22:43:37,071 : INFO : keeping 100 factors (discarding 23.851% of energy spectrum)
2017-05-23 22:43:40,586 : INFO : merging projections: (42532, 100) + (42532, 100)
2017-05-23 22:43:49,259 : INFO : keeping 100 factors (discarding 12.573% of energy spectrum)
2017-05-23 22:43:53,250 : INFO : processed documents up to #30000
2017-05-23 22:43:53,262 : INFO : topic #0(26.147): 0.313*"+xy0wTTqhE|JZz*%3WH{1@1q9h<G<=uLOD^et9xA" + 0.181*"kGAJg>G9k#tX-bv0@t`HG>lN^h{c3!)<j`xP7D7)" + 0.171*"7y5%8<c1FGHe#<|dSu3vccWGzM9Q=*v4Z~lCTyRF" + 0.171*"%4N-jt=3f2@)u4VhSd^a*M@(S%#}e2p!OmWo6cGd" + 0.171*"fcu&`jIKDa#uPvE5u2(^)Z6n*AYzTwzbe#gP5;R1" + 0.171*"MK{t^7~%aBL2s88ymX=}S3R5ZohAfC370_$lde}H" + 0.167*"UC&2ub*;{Z+u~QBI_5Nvm##j=vQ4Kb{jrqZ0cyQ9" + 0.167*"!~2FQb#UvY9{_*vVk6$2r=BlJ7&KiuJ=m&?+39}`" + 0.166*"Wze>-N3e(e%@~h_T9GmAjsj^FE15#}DV6I7z^#l0" + 0.164*"(=w$026I9$E;O3NWu`+r+EKS@8tfx7Y|rQ?#Qexa"
2017-05-23 22:43:53,272 : INFO : topic #1(22.136): 0.247*"d>egDXR4GAFvN3>C#)E7D`$~a8Jrm^3Sk6!^KgLD" + -0.210*"+xy0wTTqhE|JZz*%3WH{1@1q9h<G<=uLOD^et9xA" + 0.178*"T%uM_zM%cSpWHVycNOkwYJ;A@0LKu#kFCR-A$<C-" + 0.177*"I~sD~jLSr+)GCK909lzdzZ3G;qmm{27cyfFDq9+6" + 0.176*"Bh2!q*S%}l{R(WEpNx(uuzvgEtw67?ef#IFfGP#M" + 0.176*"W?ub=D4w?+k(C@)>>S^;x;y={V{C;cUIfGFX!@Pr" + 0.175*"MX6KN<VGK3ocNM&IyWAe<=-N>Gj}w=8mKF|{$_Kr" + 0.172*"`@UR?Y)PdDQjaaM55=?%1GscqtN}j8r1|W&$pPN$" + 0.172*"y43B0#ro>^B{97WhYGV&V%~Kv7>Z*kZFL@#H!qyJ" + 0.170*"2fp#fh(ps5=9wpygjp(K+0o*EVg{S&W#rfN(ON|k"
2017-05-23 22:43:53,280 : INFO : topic #2(17.285): 0.194*"RIi|^?U#n%E1dN|mB5{%#t%U=ZRofW?YORx5e@zS" + 0.175*"jmSf~?N^VLfH=V6Xo*4<fEqJgMu;1of2Fg3cyZ@I" + -0.169*"Bh2!q*S%}l{R(WEpNx(uuzvgEtw67?ef#IFfGP#M" + -0.169*"W?ub=D4w?+k(C@)>>S^;x;y={V{C;cUIfGFX!@Pr" + -0.168*"MX6KN<VGK3ocNM&IyWAe<=-N>Gj}w=8mKF|{$_Kr" + -0.168*"T%uM_zM%cSpWHVycNOkwYJ;A@0LKu#kFCR-A$<C-" + 0.159*"yd=I30lV+i$zYCPC0kR>%795MIUWZT##<gbXcgwF" + 0.159*"T*pV0z?{CsLU*C_G*C+s4<`lU`F2sUXL-4ACCo$N" + 0.148*"bH3^hWim!;vcXsFT-QDx3;e0N77|4{g+t&W^&CCI" + -0.147*"?~1<E9+`x`2u^Q5*=4yO;TA<Iko~xtFJlYG2Q$;_"
2017-05-23 22:43:53,287 : INFO : topic #3(14.181): -0.646*"T*pV0z?{CsLU*C_G*C+s4<`lU`F2sUXL-4ACCo$N" + -0.319*"t@eOlY=<ke^P>>>+{uBW@Ef%VR<YqbOWZ3V_Pdvi" + -0.266*"|DFN1^-Rm2LBndh2UKPj&yu+vFnE%_e&2C5n~))K" + -0.220*">4*}YwimZ6-#mVL2ER|n?mnn7qj|QPrw{7B`-N+m" + -0.195*"LrA@7L~CKxM)HY?-yvGd0aQ7{>D`XFUEP+lTEz0O" + -0.131*"zw5l{Ei}J#O|YZD{lHbRPQSgo7iOkBQyq#7rI3En" + -0.130*"Sh|)rr4B!m;irvDrEaT@-7eNxQ{E<O1JkcoOuC_#" + -0.127*"41!mk9|#!#udzP<LI+GPhP3hL`nVYP@=1#FY{t!`" + -0.126*"pdDXUb+DAJjotCGEU%CCxef%Wo>3`I0cVE=KW^#(" + -0.125*"3W!Sw!#HkUTh78EUP~p&j-Uf_nLL`MuhZoYXR-kc"
2017-05-23 22:43:53,294 : INFO : topic #4(11.743): -0.236*"5aXF{gWMJU&6yNvQVr6#s&qNi;d7HwNA@yrT{p9|" + -0.236*"5%)p;QfLv<&WPMiMNYrK1=9Ceu4(iH{vkgs>{3in" + -0.236*"+uMb59>XHuaG9hHfIJrP$JZ@-E3kJ%i9NX#m%tpR" + -0.236*"w$&)+6bqZfUW~0{kpiQ`gUOG?HXyZ?7++gpKL43d" + -0.235*"ha@=a3kmg>VuPo`xVA^m9jbv<m0~f4f+mB~cWd_*" + -0.234*"j84fMp?Ul%;t#VeYz)~0Rgr>Z6zmXbw$<t&5tx~6" + -0.233*"-ah{cFz#^Zvm}gAFA?3Jpi2*Rz68HVFg$sp%G3ur" + -0.233*"0XP2~L;?Yfp!g`%@kYHx)7@t0lftfga^ZZ4O_;Ld" + -0.225*"kG?Jp;H^+XOQ2gStY*D%qgHp|ZYZ_dp_ee?%oXSo" + -0.225*"z-T)F`f(s=Ij%nN#e%$1LJ-;VE)IognonFQ8-BwX"
2017-05-23 22:43:53,349 : INFO : saving Projection object under /home/bruno/Documents/github/corpushash/notebooks/hashed_twitter_lsi_model.projection, separately None
2017-05-23 22:43:54,409 : INFO : saved /home/bruno/Documents/github/corpushash/notebooks/hashed_twitter_lsi_model.projection
2017-05-23 22:43:54,413 : INFO : saving LsiModel object under /home/bruno/Documents/github/corpushash/notebooks/hashed_twitter_lsi_model, separately None
2017-05-23 22:43:54,417 : INFO : not storing attribute projection
2017-05-23 22:43:54,421 : INFO : not storing attribute dispatcher
2017-05-23 22:43:54,527 : INFO : saved /home/bruno/Documents/github/corpushash/notebooks/hashed_twitter_lsi_model

Let's now look at the topics generated, decoding the hashed tokens using the decode_dictionary.


In [36]:
for n in range(17):
    print("====================")
    print("Topic {}:".format(n))
    print("Coef.\t Token")
    print("--------------------")
    for tok,coef in lsih.show_topic(n):
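        # map the hashed token back to its original string via corpushash's decode dictionary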
        tok = hashed.decode_dictionary[tok.strip()][0]
        print("{:.3}\t{}".format(coef,tok))


====================
Topic 0:
Coef.	 Token
--------------------
0.313	"
0.181	Tories
0.171	preoccupied
0.171	inequality
0.171	@Tommy_Colc
0.171	wrote
0.167	Miliband
0.167	claiming
0.166	w
0.164	man
====================
Topic 1:
Coef.	 Token
--------------------
0.247	SNP
-0.21	"
0.178	Sco
0.177	to
0.176	protect
0.176	lots
0.175	definitely
0.172	@NicolaSturgeon
0.172	rather
0.17	let
====================
Topic 2:
Coef.	 Token
--------------------
0.194	the
0.175	.
-0.169	protect
-0.169	lots
-0.168	definitely
-0.168	Sco
0.159	a
0.159	%
0.148	I
-0.147	MPs
====================
Topic 3:
Coef.	 Token
--------------------
-0.646	%
-0.319	-
-0.266	(
-0.22	)
-0.195	1
-0.131	CON
-0.13	LAB
-0.127	poll
-0.126	8
-0.125	34
====================
Topic 4:
Coef.	 Token
--------------------
-0.236	thus
-0.236	ahem
-0.236	@thomasmessenger
-0.236	http://t.co/DkLwCwzhDA
-0.235	financial
-0.234	caused
-0.233	global
-0.233	crisis
-0.225	For
-0.225	overspent
====================
Topic 5:
Coef.	 Token
--------------------
0.341	FT
0.321	(
-0.232	%
0.2	)
0.19	:(
0.177	Ford
0.177	Jonathan
0.177	writer
0.176	Boris
-0.162	'
====================
Topic 6:
Coef.	 Token
--------------------
-0.414	'
-0.182	deal
0.171	Cameron
0.162	David
-0.152	Tomorrow
-0.147	myself
-0.147	@mrmarksteel
-0.145	case
0.141	on
-0.136	tell
====================
Topic 7:
Coef.	 Token
--------------------
0.282	!
0.279	:(
-0.232	FT
0.187	:)
0.182	you
0.171	I
-0.155	leader
-0.127	'
-0.123	Ford
-0.123	Jonathan
====================
Topic 8:
Coef.	 Token
--------------------
-0.312	'
0.24	-
0.21	(
0.197	Labour
-0.173	,
-0.162	%
0.146	"
0.133	SNP
0.13	with
-0.13	FT
====================
Topic 9:
Coef.	 Token
--------------------
0.365	(
0.264	David
0.248	Cameron
0.202	:(
-0.181	%
0.151	...
-0.144	#AskNigelFarage
0.137	-
0.13	)
0.129	*
====================
Topic 10:
Coef.	 Token
--------------------
0.356	!
0.224	you
0.198	he'd
0.185	than
-0.182	"
0.182	rather
0.171	:)
-0.153	no
0.139	let
-0.134	(
====================
Topic 11:
Coef.	 Token
--------------------
-0.527	"
-0.367	!
0.218	:(
-0.134	:)
0.129	I
0.126	.
-0.121	-
0.0924	(
0.0896	Tories
-0.0893	'
====================
Topic 12:
Coef.	 Token
--------------------
0.394	'
-0.262	retweet
-0.198	not
-0.194	this
-0.165	do
0.163	(
-0.156	@LabourEoin
-0.138	would
-0.133	http://t.co/5D2pKCstr3
-0.131	repeat
====================
Topic 13:
Coef.	 Token
--------------------
0.527	"
-0.278	!
-0.201	'
-0.156	-
0.15	not
-0.143	http
-0.126	*
0.115	says
0.106	do
0.103	than
====================
Topic 14:
Coef.	 Token
--------------------
-0.186	hungrier
-0.186	reliant
0.184	Ed
-0.182	five
-0.178	banks
-0.176	food
-0.176	@Markfergusonuk
0.174	*
-0.17	ago
-0.164	years
====================
Topic 15:
Coef.	 Token
--------------------
-0.26	#AskNigelFarage
-0.206	Farage
0.198	"
-0.18	(
-0.18	retweet
-0.162	Nigel
-0.16	@Nigel_Farage
-0.16	@UKIP
0.157	:)
-0.155	#UKIP
====================
Topic 16:
Coef.	 Token
--------------------
0.237	:(
0.162	Ed
-0.161	-
-0.16	)
0.144	%
-0.14	will
-0.137	#bbcqt
-0.134	#AskNigelFarage
-0.131	child
-0.13	*

Comparing the resulting topics, we see that the NLP results are the same regardless of which corpus we use (up to the arbitrary sign of each LSI topic vector), i.e., we can use hashed corpora to perform NLP tasks in a lossless manner.
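
a rough programmatic check of this claim (a sketch; since the sign of each LSI topic is arbitrary, we compare only which tokens appear in each topic):

# compare the top tokens of every displayed topic between the plain and the hashed model
for n in range(17):
    plain_tokens = {tok for tok, _ in lsi.show_topic(n)}
    decoded_tokens = {hashed.decode_dictionary[tok.strip()][0]
                      for tok, _ in lsih.show_topic(n)}
    if plain_tokens != decoded_tokens:
        print("topic {} differs".format(n))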