In [1]:
%%HTML
<style> code {background-color : lightgrey !important;} </style>


TF-IDF on the gutenberg corpus

corpushash is a simple library that aims to make natural language processing of sensitive documents easier. it enables common NLP tasks to be performed on sensitive documents without disclosing their contents: every token in the corpus is hashed along with a salt (to prevent dictionary attacks).
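as a rough sketch of the idea (this illustrates salted hashing in general, not corpushash's actual scheme):

import hashlib
import os

salt = os.urandom(16)  # random salt, kept private by the corpus owner
token = "gutenberg"
digest = hashlib.sha256(salt + token.encode()).hexdigest()
# the same token always maps to the same digest within a corpus, but
# without the salt an attacker cannot precompute a token -> hash table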

its workflow is as simple as having the sensitive corpus as a python nested list (or generator) whose elements are themselves (nested) lists of strings. after the hashing is done, NLP can be carried out by a third party, and when the results are in they can be decoded using a dictionary that maps hashes back to the original strings. in code:

import corpushash as ch
hashed_corpus = ch.CorpusHash(mycorpus_as_a_nested_list, '/home/sensitive-corpus')
>>> "42 documents hashed and saved to '/home/sensitive-corpus/public/$(timestamp)'"

NLP is done, and results are in:

for token in results:
    print(token, ">", hashed_corpus.decode_dictionary[token])
>>> "7)JBMGG?sGu+>%Js~dG=%c1Qn1HpAU{jM-~Buu7?" > "gutenberg"

loading libraries for preprocessing


In [2]:
import os
import string
import random
import pickle
import nltk
from nltk.corpus import gutenberg

downloading nltk gutenberg corpus, if not downloaded already:


In [3]:
#nltk.download("gutenberg")

files in test data:


In [4]:
gutenberg.fileids()


Out[4]:
['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

preparing input

creating test corpus path, where hashed documents will be stored as .json files:


In [5]:
corpus_path = os.path.join(os.getcwd(), 'guten_test')
corpus_path


Out[5]:
'/home/bruno/Documents/github/hashed-nlp/guten_test'

the library takes as input a nested list whose elements are the original documents, themselves (nested) lists of strings. the nesting represents the document's structure (sections, paragraphs, sentences). the input can be an in-memory nested list or a generator that yields one document at a time.

as this is a simple tf-idf transformation, we don't need the nested list format: document structure is not important, and having the words with no sentence or paragraph structure is sufficient. the library's input is designed to be flexible in this regard.
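for instance, a toy corpus with one flat document and one structured document might look like this (a made-up example, not part of this notebook's data):

toy_corpus = [
    ['a', 'flat', 'document'],                       # no internal structure
    [['first', 'sentence', 'of', 'the', 'document'],
     ['second', 'sentence']],                        # one document, two sentences
]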

creating a list whose elements are the documents, each a list of its lowercased words (punctuation and digits removed):


In [6]:
%%time
decoded_gutencorpus = []
for document_name in gutenberg.fileids():
    # note: `word not in string.punctuation` only drops single-character
    # punctuation tokens, so multi-character tokens like '--' survive
    # (see the excerpt below)
    document = [word.lower() for word in gutenberg.words(document_name)
                if word not in string.punctuation and not word.isdigit()]
    decoded_gutencorpus.append(document)


CPU times: user 23.2 s, sys: 684 ms, total: 23.9 s
Wall time: 23.9 s

excerpt:


In [7]:
document = random.choice(decoded_gutencorpus)
print(document[:100])


['moby', 'dick', 'by', 'herman', 'melville', 'etymology', 'supplied', 'by', 'a', 'late', 'consumptive', 'usher', 'to', 'a', 'grammar', 'school', 'the', 'pale', 'usher', '--', 'threadbare', 'in', 'coat', 'heart', 'body', 'and', 'brain', 'i', 'see', 'him', 'now', 'he', 'was', 'ever', 'dusting', 'his', 'old', 'lexicons', 'and', 'grammars', 'with', 'a', 'queer', 'handkerchief', 'mockingly', 'embellished', 'with', 'all', 'the', 'gay', 'flags', 'of', 'all', 'the', 'known', 'nations', 'of', 'the', 'world', 'he', 'loved', 'to', 'dust', 'his', 'old', 'grammars', 'it', 'somehow', 'mildly', 'reminded', 'him', 'of', 'his', 'mortality', 'while', 'you', 'take', 'in', 'hand', 'to', 'school', 'others', 'and', 'to', 'teach', 'them', 'by', 'what', 'name', 'a', 'whale', 'fish', 'is', 'to', 'be', 'called', 'in', 'our', 'tongue', 'leaving']

processing using corpushash

loading libraries for corpushash


In [8]:
import corpushash as ch

instantiating the CorpusHash class, which hashes the provided corpus to corpus_path:


In [9]:
%time hashed_guten = ch.CorpusHash(decoded_gutencorpus, corpus_path)


2017-05-04 21:47:51,604 - corpushash.hashers - INFO - 18 documents hashed and saved to /home/bruno/Documents/github/hashed-nlp/guten_test/public/2017-05-04_21-47-13.
CPU times: user 37.4 s, sys: 1.08 s, total: 38.5 s
Wall time: 38.6 s

that is it. corpushash's work is done.

NLP: a tf-idf example

from now on we simulate the work of an analyst who receives the encoded documents and performs a common NLP task on them: the calculation of tf-idf weights.

note: we will be using the gensim library for the NLP. if you are not familiar with it, you may want to check its first couple of tutorials (you could, naturally, use the library of your choice).

all the analyst has are the files in guten_test/public/$(timestamp).


In [10]:
encoded_corpus_path = hashed_guten.public_path
encoded_corpus_path


Out[10]:
'/home/bruno/Documents/github/hashed-nlp/guten_test/public/2017-05-04_21-47-13'

In [11]:
os.path.exists(encoded_corpus_path)


Out[11]:
True
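if you are curious, you can peek at the files the analyst works with (a quick check we did not run above; the output will vary):

sorted(os.listdir(encoded_corpus_path))[:5]
# something like ['0.json', '1.json', '10.json', '11.json', '12.json']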

loading libraries we need for processing


In [12]:
import json
import gensim

defining iterable for gensim:

in order to perform the tf-idf calculation we use the gensim library, which takes as input any object that yields documents when iterated over.

corpushash has a built-in reader that yields the document index and the hashed document. these are the first tokens in each document:


In [13]:
for i in hashed_guten.read_hashed_corpus():
    print(i[0], i[1][:1])


0 ['jvUAM!v5r3%Qm9p$>Fvt^c*oxVyC0|t^6x^C^W*m']
1 ['NR2suR3NuRNIP*q#h`D(3J)+V#3IBR;3%|Rv>@6a']
2 ['o%Pqc+tEWvBC>f+VTmr1>#YqIx1x}#U-8ssStgNb']
3 ['Mg!>;<8IWflr?tDWG2Ah7P8GSrK+ag!oaD`xgZnF']
4 ['&Ef7M@=~%szmYdt#42k(k)Am6iZspGu9e*kfKEh_']
5 ['%5I7)%aXQ9>yB>a`CX7mcZS=mv^EbO7wTIH1G}vT']
6 ['Mg!>;<8IWflr?tDWG2Ah7P8GSrK+ag!oaD`xgZnF']
7 ['efB55nAlWX5}s*E05`$`DJV{P+?tRbZ@DfZ-J~>V']
8 ['Mg!>;<8IWflr?tDWG2Ah7P8GSrK+ag!oaD`xgZnF']
9 ['Mg!>;<8IWflr?tDWG2Ah7P8GSrK+ag!oaD`xgZnF']
10 ['Mg!>;<8IWflr?tDWG2Ah7P8GSrK+ag!oaD`xgZnF']
11 ['Mg!>;<8IWflr?tDWG2Ah7P8GSrK+ag!oaD`xgZnF']
12 ['#68+}E6K9gxiG!A6<6%zZZ$LNyyMHdv{CU_zaIHQ']
13 ['z)aMcvNBzTkH1+$qrU!Fdgt15F$+Lo&br5=E+QBU']
14 ['Mg!>;<8IWflr?tDWG2Ah7P8GSrK+ag!oaD`xgZnF']
15 ['Mg!>;<8IWflr?tDWG2Ah7P8GSrK+ag!oaD`xgZnF']
16 ['Mg!>;<8IWflr?tDWG2Ah7P8GSrK+ag!oaD`xgZnF']
17 ['|59SolgN4Ad>Y>67LeZ%w6-r-%aBPb8|0A7Jr;vr']

as the analyst will not have access to this convenience method, we will build a generator of our own:


In [14]:
def encoded_gutenberg_yielder(corpus_path):
    # corpushash stores the hashed documents as 0.json, 1.json, ...
    for ix in range(len(gutenberg.fileids())):
        path = os.path.join(corpus_path, '{}.json'.format(ix))
        with open(path, 'r') as fp:
            document_tokens = json.load(fp)
        yield document_tokens

example_doc = encoded_gutenberg_yielder(encoded_corpus_path)
print("\n".join(next(example_doc)[:10]))


jvUAM!v5r3%Qm9p$>Fvt^c*oxVyC0|t^6x^C^W*m
HIb_;r&gA>jyB(7uN3ccOc2^ezLBX^I?s8LsBWW6
(9E+nK-5R5&aH)68J0*s`j#I{;|{Qy$m+20Edx_U
hU_mku^J2ns3DavK#1gNfcWM;H?FrrrLs`XuaJA1
-7j6;I86bI1D(lzhW*-hG0^$J|F3`gwp91esOOks
wCz}5S=Sj?a!DwyA7-#|)r?d$;*HpVcRZdTNF*6x
|HEJp)5AGk;5N_I1OM)PS3sVJ-fve*0cT}4K`5AA
wCz}5S=Sj?a!DwyA7-#|)r?d$;*HpVcRZdTNF*6x
jvUAM!v5r3%Qm9p$>Fvt^c*oxVyC0|t^6x^C^W*m
?L6*!ZRoW*AKav}%p%F#4Nos4k!N?~%As6JcDyLS

building gensim dictionary

from this document generator gensim will build a dictionary that maps every hashed token to an ID, a mapping which is later used to calculate the tf-idf weights:


In [15]:
encoded_gutencorpus = encoded_gutenberg_yielder(encoded_corpus_path)

In [16]:
encoded_gutendict = gensim.corpora.Dictionary(encoded_gutencorpus)
#encoded_gutendict.save_as_text('enc_dict.txt', sort_by_word=False)

In [26]:
print("the number of unique words in our corpus is {}.".format(len(encoded_gutendict)))


the number of unique words in our corpus is 42020.
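if you want to inspect the mapping itself, the dictionary's token2id attribute holds the (hashed token -> integer id) pairs (a sketch, not run above):

for hashed_token, token_id in list(encoded_gutendict.token2id.items())[:3]:
    print(token_id, '>>', hashed_token)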

bag-of-words

to build a tf-idf model, the gensim library needs an input that yields each document as a bag-of-words vector, i.e., a list of (token id, count) tuples, when iterated over:


In [27]:
def bow_gutenberg_yielder(corpus, dictionary):
    for document_tokens in corpus:
        yield dictionary.doc2bow(document_tokens)

we must re-instantiate the generator, else it'll be depleted.
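a quick illustration of why (a toy example, not part of the analysis):

gen = (n for n in range(3))
print(list(gen))  # [0, 1, 2]
print(list(gen))  # [] -- a generator is exhausted after one pass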


In [29]:
encoded_gutencorpus = encoded_gutenberg_yielder(encoded_corpus_path)
encoded_gutenbow = bow_gutenberg_yielder(encoded_gutencorpus, encoded_gutendict)

In [30]:
print('token', '>>', "(token id, frequency in document)\n")
for i in next(encoded_gutenbow)[:10]:
    print(encoded_gutendict.get(i[0]), '>>', i)


token >> (token id, frequency in document)

v>@CDuT4Yoss>twzY&Z6^aTR}!CP6m(ht&fRM`eh >> (0, 865)
`;cqzu8WYSbhdPA;EgebuBgVt0sL{kl*69OoN^u& >> (1, 571)
S?VWSP0PVQ*Ze<nSimZQ|Lq4->d6xKrh5Wh+o7tE >> (2, 301)
o9zYE@{v;UprEjBZwc9v0PVlWPZ`e^i$9w29K<0c >> (3, 1)
ed>e_?cy=_9AX6Nk#;<yNp?mo_lxn_n{6{=1k2@i >> (4, 3)
etNN3mT&s;#89i7vY|s;39^H@e7XhhTvGIj>Y!V@ >> (5, 3178)
@e|=)yTLqzKhfIlyTg&UdGw26F`IG&lYaQs!d4=U >> (6, 56)
p%(T2Bw%|_G&T!P#*xLbVSwta9jyeHP@W9(JXY?A >> (7, 313)
?|7@`_*&RzcZM9EQ39<D6tVAOO*n_wK!qM1gPP!U >> (8, 38)
PA3Y-CI+Mi^<nXVdA?y@m<abq->ak5G{>}LlFk}n >> (9, 27)

tf-idf model

now we are ready to apply the tf-idf model to our corpus.

instantiating gensim.models.TfidfModel collects the document frequencies and calculates the IDF weight of every token in the corpus.


In [31]:
encoded_gutencorpus = encoded_gutenberg_yielder(encoded_corpus_path)
encoded_gutenbow = bow_gutenberg_yielder(encoded_gutencorpus, encoded_gutendict)
encoded_tfidf = gensim.models.TfidfModel(encoded_gutenbow)


INFO:gensim.models.tfidfmodel:collecting document frequencies
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #0
INFO:gensim.models.tfidfmodel:calculating IDF weights for 18 documents and 42019 features (121967 matrix non-zeros)
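for reference, with gensim's default settings the (unnormalized) weight of a token is its raw frequency in the document times log2(N/df), where N is the number of documents and df is the number of documents containing the token; each document vector is then L2-normalized. a minimal sketch, assuming those defaults:

import math

def raw_tfidf_weight(tf, df, num_docs):
    # term frequency times inverse document frequency (gensim's default
    # idf uses log base 2); gensim then L2-normalizes each document vector
    return tf * math.log2(num_docs / df)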

after calculating the frequencies, it's time to transform the bag-of-words vectors into the corresponding tf-idf weight vectors.


In [32]:
encoded_gutencorpus = encoded_gutenberg_yielder(encoded_corpus_path)
encoded_gutenbow = bow_gutenberg_yielder(encoded_gutencorpus, encoded_gutendict)
encoded_guten_tfidf = encoded_tfidf[encoded_gutenbow]

example of token ids and their tf-idf weights:


In [33]:
print('token', '>>', "(token id, tf-idf weight)\n")
for i in encoded_guten_tfidf:
    for k in i[:10]:
        print(encoded_gutendict.get(k[0]), '>>', k)
    break


token >> (token id, tf-idf weight)

v>@CDuT4Yoss>twzY&Z6^aTR}!CP6m(ht&fRM`eh >> (0, 0.5042503721243573)
S?VWSP0PVQ*Ze<nSimZQ|Lq4->d6xKrh5Wh+o7tE >> (2, 0.14308755923657526)
o9zYE@{v;UprEjBZwc9v0PVlWPZ`e^i$9w29K<0c >> (3, 0.00047537395095207725)
ed>e_?cy=_9AX6Nk#;<yNp?mo_lxn_n{6{=1k2@i >> (4, 0.00039197866090719016)
@e|=)yTLqzKhfIlyTg&UdGw26F`IG&lYaQs!d4=U >> (6, 0.012048339086347707)
p%(T2Bw%|_G&T!P#*xLbVSwta9jyeHP@W9(JXY?A >> (7, 0.24002347235604862)
?|7@`_*&RzcZM9EQ39<D6tVAOO*n_wK!qM1gPP!U >> (8, 0.004965063038157742)
PA3Y-CI+Mi^<nXVdA?y@m<abq->ak5G{>}LlFk}n >> (9, 0.0042105532568832305)
WlnemVIhYBWej5dRSV&RU!6SIYRXl$dCgyDybv|r >> (11, 0.00021230678602490824)
l&-X&S=o-LiHJSMg!)xpX&+8g=dY?p4v%g^oJzw# >> (14, 0.002935499650433135)

validating tf-idf model on unencoded corpus

now we will take the role of the corpus owner when she gets the results back from the analyst.

we will validate the results in two steps:

  1. we will decode the previous result back to the unhashed tokens, maintaining their tf-idf weights;

  2. we will compare the tf-idf results with the same analysis done on the decoded corpus.

1. decoding previous result

we will turn each entry of the previous result from a tuple like (6, 0.0017337194574225342) into something like genesis: 0.0017337194574225342 and store it on disk.


In [34]:
decode_dictionary_path = hashed_guten.decode_dictionary_path

In [35]:
os.path.isfile(decode_dictionary_path)


Out[35]:
True

obtaining decode dictionary:


In [36]:
with open(decode_dictionary_path, 'rb') as f:
    decode_dictionary = pickle.load(f)
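each entry maps a hashed token to a (token, salt) tuple. a quick peek (a sketch, not run above; the values will differ):

hashed_token = random.choice(list(decode_dictionary.keys()))
token, salt = decode_dictionary[hashed_token]
print(hashed_token, '>>', token)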

here we reinstantiate the generators, apply the tf-idf model, and then iterate over the model's results, replacing the hashed tokens with their decoded counterparts. the result is saved to disk. make sure your indexer yields documents in a stable order (gutenberg.fileids does), else the file names may not match their contents.


In [37]:
encoded_gutencorpus = encoded_gutenberg_yielder(encoded_corpus_path)
encoded_gutenbow = bow_gutenberg_yielder(encoded_gutencorpus, encoded_gutendict)
encoded_guten_tfidf = encoded_tfidf[encoded_gutenbow]
for path, document in zip(gutenberg.fileids(), encoded_guten_tfidf):
    decoded_document = []
    for tuple_value in document:
        hashed_token = encoded_gutendict.get(tuple_value[0]) # token id -> hashed token, e.g. 6 -> 'ed07dbbe94c8ff385a1a00e6720f0ab66ac420...'
        token, _ = decode_dictionary[hashed_token]  # hashed token -> original token, e.g. 'ed07dbbe94c8ff385a1a00e...' -> 'genesis'
        decoded_document.append("{}: {}".format(token, tuple_value[1]))
    fname = 'decoded_'+ path
    with open(os.path.join(corpus_path, fname), 'w') as f:
        f.write("\n".join(decoded_document))

In [49]:
example_id = random.choice(gutenberg.fileids())
with open(os.path.join(corpus_path, 'decoded_' + example_id), 'r') as f:
        decoded_doc = f.read().splitlines()
print(example_id, '>>')
print("\n".join(decoded_doc[:10]))


edgeworth-parents.txt >>
chapter: 0.004383086322422836
handsome: 0.0047913130193044
clever: 0.00413009101386359
rich: 0.0008959312959607944
comfortable: 0.0007035645947490613
disposition: 0.002859293778828639
seemed: 0.002956349779127873
existence: 0.0003176993087587376
lived: 0.005025921344935186
nearly: 0.003166040676370776

2. tf-idf on decoded corpus

creating unencoded corpus dictionary

the next step in our validation procedure is to apply the same analysis to the decoded corpus.

the input is not the files in the /public/$(timestamp) directory, but the in-memory decoded_gutencorpus variable.

the tokens are, again, in plaintext, in contrast to the hashed tokens seen above.


In [50]:
example_document = random.choice(decoded_gutencorpus)
print(example_document[:10])


['emma', 'by', 'jane', 'austen', 'volume', 'i', 'chapter', 'i', 'emma', 'woodhouse']

creating the dictionary that maps a token to an ID:


In [51]:
decoded_gutendict = gensim.corpora.Dictionary(decoded_gutencorpus)
#decoded_gutendict.save_as_text('dec_dict.txt', sort_by_word=False)


INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(42020 unique tokens: ['emma', 'by', 'jane', 'austen', 'volume']...) from 18 documents (total 2161659 corpus positions)

In [52]:
len(decoded_gutendict)


Out[52]:
42020

creating a generator that yields the bag-of-words model of a document when iterated over:


In [53]:
decoded_gutenbow = bow_gutenberg_yielder(decoded_gutencorpus, decoded_gutendict)

In [54]:
print('token', '>>', "(token id, frequency in document)\n")
for i in next(decoded_gutenbow)[:10]:
    print(decoded_gutendict.get(i[0]), '>>', i)


token >> (token id, frequency in document)

emma >> (0, 865)
by >> (1, 571)
jane >> (2, 301)
austen >> (3, 1)
volume >> (4, 3)
i >> (5, 3178)
chapter >> (6, 56)
woodhouse >> (7, 313)
handsome >> (8, 38)
clever >> (9, 27)

tf-idf model

creating the model:


In [56]:
decoded_gutenbow = bow_gutenberg_yielder(decoded_gutencorpus, decoded_gutendict)

In [57]:
decoded_tfidf = gensim.models.TfidfModel(decoded_gutenbow)


INFO:gensim.models.tfidfmodel:collecting document frequencies
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #0
INFO:gensim.models.tfidfmodel:calculating IDF weights for 18 documents and 42019 features (121967 matrix non-zeros)

applying the model to the documents (not forgetting to reinstantiate the generators):


In [58]:
decoded_gutenbow = bow_gutenberg_yielder(decoded_gutencorpus, decoded_gutendict)
decoded_guten_tfidf = decoded_tfidf[decoded_gutenbow]

In [59]:
print('token', '>>', "(token id, tf-idf weight)\n")
for i in decoded_guten_tfidf:
    for k in i[:10]:
        print(decoded_gutendict.get(k[0]), '>>', k)
    break


token >> (token id, tf-idf weight)

emma >> (0, 0.5042503721243573)
jane >> (2, 0.14308755923657526)
austen >> (3, 0.00047537395095207725)
volume >> (4, 0.00039197866090719016)
chapter >> (6, 0.012048339086347707)
woodhouse >> (7, 0.24002347235604862)
handsome >> (8, 0.004965063038157742)
clever >> (9, 0.0042105532568832305)
rich >> (11, 0.00021230678602490824)
comfortable >> (14, 0.002935499650433135)

comparing results of encoded and decoded corpus

reinstantiating the generators...


In [60]:
encoded_gutencorpus = encoded_gutenberg_yielder(encoded_corpus_path)
encoded_gutenbow = bow_gutenberg_yielder(encoded_gutencorpus, encoded_gutendict)
encoded_guten_tfidf = encoded_tfidf[encoded_gutenbow]

In [61]:
decoded_gutenbow = bow_gutenberg_yielder(decoded_gutencorpus, decoded_gutendict)
decoded_guten_tfidf = decoded_tfidf[decoded_gutenbow]

In [62]:
%%time
encoded_weights, decoded_weights = {}, {}  # plain dicts for the results, named so as not to shadow the TfidfModel objects
for encoded_document, decoded_document in zip(encoded_guten_tfidf, decoded_guten_tfidf):
    for encoded_item, decoded_item in zip(encoded_document, decoded_document):
        hashed_token = encoded_gutendict.get(encoded_item[0])
        original_token = decode_dictionary[hashed_token][0] # get original token, ignoring salt
        encoded_weights[original_token] = round(encoded_item[1], 7) # rounding because python <3.6 seems to represent floats inconsistently
        decoded_weights[decoded_gutendict.get(decoded_item[0])] = round(decoded_item[1], 7)
print(encoded_weights == decoded_weights)


True
CPU times: user 2.05 s, sys: 30 ms, total: 2.08 s
Wall time: 2.08 s

In [79]:
random_token = random.choice(list(encoded_weights.keys()))
print("example token: tf-idf weight in encoded corpus | in decoded corpus\n{:^35}: {} | {}".format(random_token, encoded_weights[random_token], decoded_weights[random_token]))


example token: tf-idf weight in encoded corpus | in decoded corpus
            nourisheth             : 0.0002203 | 0.0002203

thus we see that the NLP results are identical regardless of which corpus we use, i.e., we can use hashed corpora to perform NLP tasks in a lossless manner.