In [1]:
%%HTML
<style> code {background-color : lightgrey !important;} </style>
corpushash is a simple library that aims to make natural language processing of sensitive documents easier. it enables common NLP tasks to be performed on sensitive documents without disclosing their contents. this is done by hashing every token in the corpus along with a salt (to prevent dictionary attacks).
its workflow is as simple as providing the sensitive corpus as a python nested list (or generator) whose elements are the documents, themselves (nested) lists of strings. after the hashing is done, NLP can be carried out by a third party, and when the results are in they can be decoded using a dictionary that maps hashes back to the original strings. in code, that looks like:
import corpushash as ch
hashed_corpus = ch.CorpusHash(mycorpus_as_a_nested_list, '/home/sensitive-corpus')
>>> "42 documents hashed and saved to '/home/sensitive-corpus/public/$(timestamp)'"
NLP is done, and results are in:
for token in results:
    print(token, ">", hashed_corpus.decode_dictionary[token])
>>> "7)JBMGG?sGu+>%Js~dG=%c1Qn1HpAU{jM-~Buu7?" > "gutenberg"
In [2]:
import os
import string
import random
import pickle
import nltk
from nltk.corpus import gutenberg
In [3]:
#nltk.download("gutenberg")
files in test data:
In [4]:
gutenberg.fileids()
Out[4]:
creating the test corpus path, where the hashed documents will be stored as .json files:
In [5]:
corpus_path = os.path.join(os.getcwd(), 'guten_test')
corpus_path
Out[5]:
the library takes as input a nested list whose elements are the original documents, themselves nested lists of strings. the nesting represents the document's structure (sections, paragraphs, sentences). the input can be an in-memory nested list or a generator that yields one document per iteration.
as this is a simple tf-idf transformation, we don't need the nested structure: document structure is not important here, so just having the words with no sentence or paragraph divisions is sufficient. the library's input is designed to be flexible in this regard.
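for illustration, here is a hypothetical toy corpus in both accepted shapes: a nested list that keeps sentence structure, and a flat per-document token list, which is all a tf-idf analysis needs:
In [ ]:
# hypothetical toy corpus, just to illustrate the accepted input formats
nested_corpus = [
    [["in", "the", "beginning"], ["and", "the", "earth", "was", "without", "form"]],  # doc 0: two sentences
    [["call", "me", "ishmael"]],                                                      # doc 1: one sentence
]

flat_corpus = [
    ["in", "the", "beginning", "and", "the", "earth", "was", "without", "form"],      # doc 0, no structure
    ["call", "me", "ishmael"],                                                        # doc 1
]

def corpus_generator(corpus):
    """a generator works too: it just has to yield one document per iteration"""
    for document in corpus:
        yield document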
In [6]:
%%time
decoded_gutencorpus = []
for document_name in gutenberg.fileids():
    document = [word.lower() for word in gutenberg.words(document_name) if word not in string.punctuation and not word.isdigit()]
    decoded_gutencorpus.append(document)
an excerpt from a random document:
In [7]:
document = random.choice(decoded_gutencorpus)
print(document[:100])
loading the corpushash library:
In [8]:
import corpushash as ch
In [9]:
%time hashed_guten = ch.CorpusHash(decoded_gutencorpus, corpus_path)
that is it. corpushash's work is done.
from now on we are simulating the work of an analyst who receives the encoded documents and performs a common NLP task on them: the calculation of tf-idf weights.
note: we will be using the gensim library for the NLP. if you are not familiar with it, you may want to check its first couple of tutorials. (you could use the library of your choice, naturally.)
all the analyst has are the files in corpushash/gutenberg_test/public/$(timestamp).
In [10]:
encoded_corpus_path = hashed_guten.public_path
encoded_corpus_path
Out[10]:
In [11]:
os.path.exists(encoded_corpus_path)
Out[11]:
loading the libraries we need for processing:
In [12]:
import json
import gensim
in order to perform the tf-idf calculation, we use the gensim library, which takes as input any object that yields documents when iterated over.
corpushash has a built-in reader that yields the document index and the hashed document. these are the first tokens in each document:
In [13]:
for i in hashed_guten.read_hashed_corpus():
    print(i[0], i[1][:1])
as the analyst will not have access to this convenience function, we will build a new generator for this.
In [14]:
def encoded_gutenberg_yielder(corpus_path):
    for ix in range(len(gutenberg.fileids())):
        path = os.path.join(corpus_path, '{}.json'.format(ix))
        with open(path, 'r') as fp:
            document_tokens = json.load(fp)
        yield document_tokens
example_doc = encoded_gutenberg_yielder(encoded_corpus_path)
print("\n".join(next(example_doc)[:10]))
In [15]:
encoded_gutencorpus = encoded_gutenberg_yielder(encoded_corpus_path)
In [16]:
encoded_gutendict = gensim.corpora.Dictionary(encoded_gutencorpus)
#encoded_gutendict.save_as_text('enc_dict.txt', sort_by_word=False)
In [26]:
print("the number of unique words in our corpus is {}.".format(len(encoded_gutendict)))
In [27]:
def bow_gutenberg_yielder(corpus, dictionary):
    for document_tokens in corpus:
        yield dictionary.doc2bow(document_tokens)
we must re-instantiate the generator, else it'll be depleted.
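as a quick reminder of why: a generator can only be consumed once, so after gensim has iterated over it, a second pass yields nothing:
In [ ]:
# generators are single-use: once exhausted, iterating again yields nothing
gen = (n for n in range(3))
print(list(gen))   # [0, 1, 2]
print(list(gen))   # [] -- hence the fresh generator before each pass over the corpus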
In [29]:
encoded_gutencorpus = encoded_gutenberg_yielder(encoded_corpus_path)
encoded_gutenbow = bow_gutenberg_yielder(encoded_gutencorpus, encoded_gutendict)
In [30]:
print('token', '>>', "(token id, frequency in document)\n")
for i in next(encoded_gutenbow)[:10]:
    print(encoded_gutendict.get(i[0]), '>>', i)
In [31]:
encoded_gutencorpus = encoded_gutenberg_yielder(encoded_corpus_path)
encoded_gutenbow = bow_gutenberg_yielder(encoded_gutencorpus, encoded_gutendict)
encoded_tfidf = gensim.models.TfidfModel(encoded_gutenbow)
after computing the document frequencies, it's time to transform the bag-of-words vectors into the corresponding tf-idf weight vectors.
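roughly speaking, with gensim's default settings each term's weight is its raw count in the document times idf = log2(N/df), and the resulting document vector is normalized to unit length. a rough sketch of the unnormalized weight for a single token (assuming these defaults; check the gensim documentation for the exact scheme):
In [ ]:
import math

# rough sketch of the default (unnormalized) tf-idf weight, assuming
# idf = log2(total_documents / document_frequency)
def tfidf_weight(term_frequency, document_frequency, total_documents):
    idf = math.log2(total_documents / document_frequency)
    return term_frequency * idf

# e.g. a token appearing 3 times in a document and present in 6 of the 18 gutenberg files
print(tfidf_weight(3, 6, 18))   # 3 * log2(3), roughly 4.75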
In [32]:
encoded_gutencorpus = encoded_gutenberg_yielder(encoded_corpus_path)
encoded_gutenbow = bow_gutenberg_yielder(encoded_gutencorpus, encoded_gutendict)
encoded_guten_tfidf = encoded_tfidf[encoded_gutenbow]
example of token ids and their tf-idf weights:
In [33]:
print('token', '>>', "(token id, tf-idf weight)\n")
for i in encoded_guten_tfidf:
    for k in i[:10]:
        print(encoded_gutendict.get(k[0]), '>>', k)
    break
now we will take the role of the corpus owner when she gets the results back from the analyst.
we will validate the results in two steps:
1. decode the previous result back to the original tokens, keeping their tf-idf weights;
2. compare these tf-idf results with the same analysis done on the original (decoded) corpus.
In [34]:
decode_dictionary_path = hashed_guten.decode_dictionary_path
In [35]:
os.path.isfile(decode_dictionary_path)
Out[35]:
obtaining the decode dictionary:
In [36]:
with open(decode_dictionary_path, 'rb') as f:
    decode_dictionary = pickle.load(f)
here we are reinstantiating the generators, applying the tf-idf model, and then iterating over the model's results to replace the hashed tokens with their decoded counterparts. the result is saved to disk. make sure your indexing doesn't change document order (gutenberg.fileids returns the files in a stable order), else the file names may not match their contents.
In [37]:
encoded_gutencorpus = encoded_gutenberg_yielder(encoded_corpus_path)
encoded_gutenbow = bow_gutenberg_yielder(encoded_gutencorpus, encoded_gutendict)
encoded_guten_tfidf = encoded_tfidf[encoded_gutenbow]
for path, document in zip(gutenberg.fileids(), encoded_guten_tfidf):
    decoded_document = []
    for tuple_value in document:
        hashed_token = encoded_gutendict.get(tuple_value[0])  # 6 -> ed07dbbe94c8ff385a1a00e6720f0ab66ac420...
        token, _ = decode_dictionary[hashed_token]  # 'ed07dbbe94c8ff385a1a00e... -> 'genesis'
        decoded_document.append("{}: {}".format(token, tuple_value[1]))
    fname = 'decoded_' + path
    with open(os.path.join(corpus_path, fname), 'w') as f:
        f.write("\n".join(decoded_document))
In [49]:
example_id = random.choice(gutenberg.fileids())
with open(os.path.join(corpus_path, 'decoded_' + example_id), 'r') as f:
    decoded_doc = f.read().splitlines()
print(example_id, '>>')
print("\n".join(decoded_doc[:10]))
the next step in our validation procedure is to apply the same analysis to the decoded corpus.
the input this time is not the files in the /public/$(timestamp) directory, but the decoded_gutencorpus variable.
the tokens are in plain text, in contrast to the hashed tokens seen above.
In [50]:
example_document = random.choice(decoded_gutencorpus)
print(example_document[:10])
creating the dictionary that maps a token to an ID:
In [51]:
decoded_gutendict = gensim.corpora.Dictionary(decoded_gutencorpus)
#decoded_gutendict.save_as_text('dec_dict.txt', sort_by_word=False)
In [52]:
len(decoded_gutendict)
Out[52]:
creating a generator that yields the bag-of-words model of a document when iterated over:
In [53]:
decoded_gutenbow = bow_gutenberg_yielder(decoded_gutencorpus, decoded_gutendict)
In [54]:
print('token', '>>', "(token id, frequency in document)\n")
for i in next(decoded_gutenbow)[:10]:
    print(decoded_gutendict.get(i[0]), '>>', i)
creating the model:
In [56]:
decoded_gutenbow = bow_gutenberg_yielder(decoded_gutencorpus, decoded_gutendict)
In [57]:
decoded_tfidf = gensim.models.TfidfModel(decoded_gutenbow)
applying the model to the documents (not forgetting to reinstantiate the generators):
In [58]:
decoded_gutenbow = bow_gutenberg_yielder(decoded_gutencorpus, decoded_gutendict)
decoded_guten_tfidf = decoded_tfidf[decoded_gutenbow]
In [59]:
print('token', '>>', "(token id, tf-idf weight)\n")
for i in decoded_guten_tfidf:
    for k in i[:10]:
        print(decoded_gutendict.get(k[0]), '>>', k)
    break
reinstantiating the generators...
In [60]:
encoded_gutencorpus = encoded_gutenberg_yielder(encoded_corpus_path)
encoded_gutenbow = bow_gutenberg_yielder(encoded_gutencorpus, encoded_gutendict)
encoded_guten_tfidf = encoded_tfidf[encoded_gutenbow]
In [61]:
decoded_gutenbow = bow_gutenberg_yielder(decoded_gutencorpus, decoded_gutendict)
decoded_guten_tfidf = decoded_tfidf[decoded_gutenbow]
In [62]:
%%time
encoded_tfidf, decoded_tfidf = {}, {}  # note: reuses the model names for plain token -> weight dicts
for encoded_document, decoded_document in zip(encoded_guten_tfidf, decoded_guten_tfidf):
    for encoded_item, decoded_item in zip(encoded_document, decoded_document):
        hashed_token = encoded_gutendict.get(encoded_item[0])
        original_token = decode_dictionary[hashed_token][0]  # decode_dictionary maps hash -> (token, salt); keep the token
        encoded_tfidf[original_token] = round(encoded_item[1], 7)  # rounding to avoid spurious differences in the last float digits
        decoded_tfidf[decoded_gutendict.get(decoded_item[0])] = round(decoded_item[1], 7)
print(encoded_tfidf == decoded_tfidf)
In [79]:
random_token = random.choice(list(encoded_tfidf.keys()))
print("example token: tf-idf weight in encoded corpus | in decoded corpus\n{:^35}: {} | {}".format(random_token, encoded_tfidf[random_token], decoded_tfidf[random_token]))