corpushash is a simple library that aims to make natural language processing of sensitive documents easier. The library enables common NLP tasks to be performed on sensitive documents without disclosing their contents. This is done by hashing every token in the corpus along with a salt (to prevent dictionary attacks).
Its workflow is as simple as providing the sensitive corpora as a Python nested list (or generator) whose elements are themselves (nested) lists of strings. After the hashing is done, NLP can be carried out by a third party, and when the results are in, they can be decoded using a dictionary that maps hashes back to the original strings. In code, that looks like:
import corpushash as ch
hashed_corpus = ch.CorpusHash(mycorpus_as_a_nested_list, '/home/sensitive-corpus')
>>> "42 documents hashed and saved to '/home/sensitive-corpus/public/$(timestamp)'"
NLP is done, and results are in:
for token in results:
    print(token, ">", hashed_corpus.decode_dictionary[token])
>>> "7)JBMGG?sGu+>%Js~dG=%c1Qn1HpAU{jM-~Buu7?" > "gutenberg"
In [10]:
import gensim
import logging, bz2, os
from corpushash import CorpusHash
from nltk.corpus import twitter_samples as tt
import numpy as np
import string
from gensim import corpora, models, similarities
In [11]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
In [12]:
#import nltk
#nltk.download('twitter_samples')
Specify the directory you'd like to save files to:
In [13]:
path = os.getcwd()
This is needed because gensim's doc2bow has some random behaviour:
In [14]:
np.random.seed(42)
In [15]:
tt.strings()[:10]
Out[15]:
But we'll be using the pre-tokenized version:
In [16]:
tt.tokenized()[0]
Out[16]:
In [17]:
len(tt.tokenized())
Out[17]:
In [18]:
decoded_twitter = tt.tokenized()
In [19]:
id2word = gensim.corpora.Dictionary(decoded_twitter)
In [20]:
id2word[0]
Out[20]:
In [21]:
mm = [id2word.doc2bow(text) for text in decoded_twitter]
gensim.corpora.MmCorpus.serialize(os.path.join(path, 'twitter_pt_tfidf.mm'), mm)
In [22]:
%%time
if os.path.exists(os.path.join(path, 'twitter_tfidf_model')):
    tfidf = models.TfidfModel.load(os.path.join(path, 'twitter_tfidf_model'))
else:
    tfidf = models.TfidfModel(mm)
    tfidf.save(os.path.join(path, 'twitter_tfidf_model'))
The next step is to train the LSI model with the tf-idf-transformed corpus, so we will need yet another generator to yield the transformed documents.
In [23]:
def tfidf_corpus_stream(corpus):
    for doc in corpus:
        yield tfidf[doc]
In [24]:
tfidf_corpus_s = tfidf_corpus_stream(mm)
In [25]:
if os.path.exists(os.path.join(path, 'twitter_lsi_model')):
    lsi = gensim.models.LsiModel.load(os.path.join(path, 'twitter_lsi_model'))
else:
    lsi = gensim.models.lsimodel.LsiModel(corpus=tfidf_corpus_s, id2word=id2word, num_topics=100)
    lsi.save(os.path.join(path, 'twitter_lsi_model'))
In [26]:
for n in range(17):
    print("====================")
    print("Topic {}:".format(n))
    print("Coef.\t Token")
    print("--------------------")
    for tok, coef in lsi.show_topic(n):
        print("{:.3}\t{}".format(coef, tok))
Now we hash all of the original documents, so that the same analysis we ran with the plain corpus can be run on the hashed one.
In [27]:
np.random.seed(42)
In [28]:
%%time
hashed = CorpusHash(decoded_twitter, 'twitter')
That is it: corpushash's work is done.
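As a quick sanity check (not a step from the original pipeline), we could peek at the first hashed document and decode one of its tokens. This sketch assumes that read_hashed_corpus() yields lists of hashed tokens and that decode_dictionary maps each hash to a tuple whose first element is the original token, as in the decoding loop at the end of this notebook.

# hypothetical inspection step: look at the first hashed document
for document in hashed.read_hashed_corpus():
    print(document[:5])  # opaque hashed tokens, safe to hand to a third party
    token = document[0]
    # the decode dictionary stays with the corpus owner
    print(token, ">", hashed.decode_dictionary[token][0])
    break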
In [29]:
id2word = gensim.corpora.Dictionary(hashed.read_hashed_corpus())
In [30]:
id2word[0]
Out[30]:
In [31]:
mm = [id2word.doc2bow(text) for text in hashed.read_hashed_corpus()]
gensim.corpora.MmCorpus.serialize(os.path.join(path, 'hashed_twitter_pt_tfidf.mm'), mm)
In [32]:
%%time
if os.path.exists(os.path.join(path, 'hashed_twitter_tfidf_model')):
    tfidf = models.TfidfModel.load(os.path.join(path, 'hashed_twitter_tfidf_model'))
else:
    tfidf = models.TfidfModel(mm)
    tfidf.save(os.path.join(path, 'hashed_twitter_tfidf_model'))
The next step is to train the LSI model with the tf-idf-transformed hashed corpus, so we will again need a generator to yield the transformed documents.
In [33]:
def tfidf_corpus_stream(corpus):
    for doc in corpus:
        yield tfidf[doc]
In [34]:
tfidf_corpus_s = tfidf_corpus_stream(mm)
In [35]:
if os.path.exists(os.path.join(path, 'hashed_twitter_lsi_model')):
    lsih = gensim.models.LsiModel.load(os.path.join(path, 'hashed_twitter_lsi_model'))
else:
    lsih = gensim.models.lsimodel.LsiModel(corpus=tfidf_corpus_s, id2word=id2word, num_topics=100)
    lsih.save(os.path.join(path, 'hashed_twitter_lsi_model'))
Let us now look at the topics generated, decoding the hashed tokens using the decode_dictionary.
In [36]:
for n in range(17):
    print("====================")
    print("Topic {}:".format(n))
    print("Coef.\t Token")
    print("--------------------")
    for tok, coef in lsih.show_topic(n):
        tok = hashed.decode_dictionary[tok.strip()][0]
        print("{:.3}\t{}".format(coef, tok))