In [26]:
%%HTML
<style> code {background-color : lightgrey !important;} </style>


corpushash tutorial

corpushash is a simple library that aims to make the natural language processing of sensitive documents easier. the library enables performing common NLP tasks on sensitive documents without disclosing their contents. This is done by hashing every token in the corpus along with a salt (to prevent dictionary attacks).

its workflow is as simple as having the sensitive corpora as a python nested list (or generator) whose elements are themselves (nested) lists of strings. after the hashing is done, NLP can be carried out by a third party, and when the results are in they can be decoded by a dictionary that maps hashes to the original strings. so that makes:

import corpushash as ch
hashed_corpus = ch.CorpusHash(mycorpus_as_a_nested_list, '/home/sensitive-corpus')
>>> "42 documents hashed and saved to '/home/sensitive-corpus/public/$(timestamp)'"

NLP is done, and results are in:

for token in results:
    print(token, ">", hashed_corpus.decode_dictionary[token])
>>> "7)JBMGG?sGu+>%Js~dG=%c1Qn1HpAU{jM-~Buu7?" > "gutenberg"

The library requires as input:

  • a tokenized corpus as a nested list, whose elements are themselves nested lists of the tokens of each document in the corpus

    each list corresponds to a document structure: its chapters, paragraphs, sentences. you decide how the nested list is to be created or structured, as long as the input is a nested list with strings as their bottom-most elements.

  • corpus_path, a path to a directory where the output files are to be stored

The output includes:

  • a .json file for every item in the dictionary provided, named sequencially as positive integers, e.g., the first document being 0.json, stored in corpus_path/public/$(timestamp-of-hash)/

  • two .json dictionaries stored in corpus_path/private. they are used to decode the .json files or the NLP results

Preparing the input data

loading libraries...


In [27]:
import os
import json
from nltk.corpus import gutenberg
import corpushash as ch
import base64
import hashlib
import random

we'll use the gutenberg corpus as test data, which is available through the nltk library.

downloading test data (if needed):


In [28]:
import nltk
#nltk.download('gutenberg')  # comment (uncomment) if you have (don't have) the data

files in test data:


In [29]:
gutenberg.fileids()


Out[29]:
['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

creating test corpus path, where hashed documents will be stored as .json files:


In [30]:
base_path = os.getcwd()
base_path


Out[30]:
'/home/guest/Documents/git/corpushash/notebooks'

In [31]:
corpus_path = os.path.join(base_path, 'guten_test')
corpus_path


Out[31]:
'/home/guest/Documents/git/corpushash/notebooks/guten_test'

function to split text into nested list:


In [32]:
excerpt = gutenberg.raw('austen-emma.txt')[50:478]
print(excerpt)


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period. 

every paragraph and sentence is its own list:


In [33]:
print(ch.text_split(excerpt))


[[['Emma', 'Woodhouse', 'handsome', 'clever', 'and', 'rich', 'with', 'a', 'comfortable', 'home']], [['and', 'happy', 'disposition', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings']], [['of', 'existence', 'and', 'had', 'lived', 'nearly', 'twenty-one', 'years', 'in', 'the', 'world']], [['with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her']], [['She', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 'most', 'affectionate']], [['indulgent', 'father', 'and', 'had', 'in', 'consequence', 'of', 'her', "sister's", 'marriage']], [['been', 'mistress', 'of', 'his', 'house', 'from', 'a', 'very', 'early', 'period']]]

Input

the library takes as input a nested list whose elements are the original documents as nested lists. this can be an in-nested list or some generator that yields a nested list when it is iterated over.

creating nested list made from the raw texts in the gutenberg corpus:


In [34]:
%%time
guten_list = []
for document_name in gutenberg.fileids():
    document = gutenberg.raw(document_name)
    split_document = ch.text_split(document)
    guten_list.append(split_document)


CPU times: user 1.29 s, sys: 42 ms, total: 1.33 s
Wall time: 1.34 s

excerpt:


In [35]:
document = random.choice(guten_list)
print(document[:10])


[[['Persuasion', 'by', 'Jane', 'Austen', '1818']], [['Chapter', '1']], [['Sir', 'Walter', 'Elliot', 'of', 'Kellynch', 'Hall', 'in', 'Somersetshire', 'was', 'a', 'man', 'who']], [['for', 'his', 'own', 'amusement', 'never', 'took', 'up', 'any', 'book', 'but', 'the', 'Baronetage']], [['there', 'he', 'found', 'occupation', 'for', 'an', 'idle', 'hour', 'and', 'consolation', 'in', 'a']], [['distressed', 'one', 'there', 'his', 'faculties', 'were', 'roused', 'into', 'admiration', 'and']], [['respect', 'by', 'contemplating', 'the', 'limited', 'remnant', 'of', 'the', 'earliest', 'patents']], [['there', 'any', 'unwelcome', 'sensations', 'arising', 'from', 'domestic', 'affairs']], [['changed', 'naturally', 'into', 'pity', 'and', 'contempt', 'as', 'he', 'turned', 'over']], [['the', 'almost', 'endless', 'creations', 'of', 'the', 'last', 'century', 'and', 'there']]]

processing using corpushash

instantiating CorpusHash class, which hashes the provided corpus to the corpus_path:


In [36]:
%time hashed_guten = ch.CorpusHash(guten_list, corpus_path)


2017-05-23 19:36:58,004 - corpushash.hashers - INFO - dictionaries from previous hashing found. loading them.
2017-05-23 19:37:04,490 - corpushash.hashers - INFO - 18 documents hashed and saved to /home/guest/Documents/git/corpushash/notebooks/guten_test/public/2017-05-23_19-36-58-004565.
CPU times: user 6.46 s, sys: 88 ms, total: 6.55 s
Wall time: 6.55 s

Output

Encode dictionary

The encode dictionary is used to encode values to hashes, so that the same strings are guaranteed to be hashed to the same value, including the random salt.


In [37]:
entries = random.sample(list(hashed_guten.encode_dictionary.keys()), k=5)
for entry in entries:
    print("token >> {:^20} | hashed_token >> '{}'".format(entry, hashed_guten.encode_dictionary[entry]))


token >>      apartment       | hashed_token >> 'a}&qib+Q4Fgt8x<s0&h?kgF}2Z35<l+5xtFy;FHw'
token >>        parcel        | hashed_token >> '<GNXWu~|S}0CFuIsk0qD;pS&(48()%ELYA`#X1I|'
token >>       nigger's       | hashed_token >> '*a^-xD<Tl{$_m(Y1x9i4ti1~2KOWzMN@FjSXC4*Q'
token >>    deep-throated     | hashed_token >> '1-V5a1gRpXFi+009z8Wl+Y>qj8};2BSfSVT%zNZ#'
token >>      blastments      | hashed_token >> '_y{I3(0u$wyxyov4aZn)#o-dq(wVPn%wK|hFk<tV'

Decode dictionary

The decode dictionary is used to decode hashes to their original strings, so that one can make sense of the results of any posterior NLP analysis. It also lists the salt that was used to obtain the given hash. This dictionary must be kept secret with the owner of the data.


In [38]:
entries = random.sample(list(hashed_guten.decode_dictionary.keys()), k=5)
for entry in entries:
    print("hashed_token >> '{}' | (token >> '{}', salt >> '{}'".format(entry, hashed_guten.decode_dictionary[entry][0], hashed_guten.decode_dictionary[entry][1][:4]))  # cutting off some bytes for aesthetic reasons


hashed_token >> '$P%GRo*)j<5v@h_WoGU}4~SZqLAVuFe0z9T_f1RX' | (token >> 'washerwoman's?', salt >> ';u|x'
hashed_token >> '#FED2$9Adv7d@+x8|=T}gxBLf(L{TxXPyC`e|Ah+' | (token >> 'Ashes', salt >> 'BK;d'
hashed_token >> '>ER4JShc2BZ7r;9Zl%Z3SG%*j<6PyY<R=CXPSx!M' | (token >> 'AFRICA', salt >> 'dz6h'
hashed_token >> 'st(Oe56bMSeo02)lLqM2zoVLQjc}Y47AB)R_`d+$' | (token >> 'Tremisen', salt >> 'j*Ez'
hashed_token >> 'l_3Sw0Y>gO*6aHP9ogvdk3>KxK>pJc8w{F>zAi)O' | (token >> 'Ieering?', salt >> '1z;7'

hashed .json files

the walk_nested_list function yields items in a nested list in order, regardless of their depth:


In [39]:
print(excerpt)


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period. 

In [40]:
for element in ch.walk_nested_list(ch.text_split(excerpt)):
    print(element)


Emma
Woodhouse
handsome
clever
and
rich
with
a
comfortable
home
and
happy
disposition
seemed
to
unite
some
of
the
best
blessings
of
existence
and
had
lived
nearly
twenty-one
years
in
the
world
with
very
little
to
distress
or
vex
her
She
was
the
youngest
of
the
two
daughters
of
a
most
affectionate
indulgent
father
and
had
in
consequence
of
her
sister's
marriage
been
mistress
of
his
house
from
a
very
early
period

we can use this function to see what hashcodecs has done to the corpus.

adjusting parameters:


In [41]:
limit = 10  # showing first ten entries
document = random.randint(0, len(gutenberg.fileids()))
print('document {} corresponds to {}.'.format(document, gutenberg.fileids()[document]))


document 6 corresponds to burgess-busterbrown.txt.

note: take care to have your corpus yield the documents always in the same order, else you'll have a harder time knowing which hashed documents correspond to which documents.

you can read output .json directly:


In [42]:
document_path = os.path.join(hashed_guten.public_path, "{}.json".format(document))
with open(document_path, mode="rt") as fp:
    encoded_document = json.load(fp)

In [43]:
print("original token >> encoded token")
for ix, tokens in enumerate(zip(ch.walk_nested_list(encoded_document), ch.walk_nested_list(encoded_document))):
    print("'{}' >> '{}'".format(tokens[0], tokens[1]))
    if ix > limit:
        break


original token >> encoded token
'R=g*sYbtCCcVi&Y_p5N^?o5eSpz!&+L_V#DI6igc' >> 'R=g*sYbtCCcVi&Y_p5N^?o5eSpz!&+L_V#DI6igc'
'OYY5FGKuiV^a=QOnBS01I+^hyol?2xdA-T!GJk?a' >> 'OYY5FGKuiV^a=QOnBS01I+^hyol?2xdA-T!GJk?a'
'Ei-Hp(t}e3^ufaY>t{$s7y!WWMOfz8q!#^_f=v%r' >> 'Ei-Hp(t}e3^ufaY>t{$s7y!WWMOfz8q!#^_f=v%r'
'_U>d6-<x%GI7#FA!@1{cJua8^r5dr50$7f5y==qB' >> '_U>d6-<x%GI7#FA!@1{cJua8^r5dr50$7f5y==qB'
'uwHGGy0m-APOJn;I($j?1!(5E!Ev>#>vMHzU043k' >> 'uwHGGy0m-APOJn;I($j?1!(5E!Ev>#>vMHzU043k'
'IE%wyW#6pZRU2at4KdE#t!@f^9*M{1m@|lzxt9y_' >> 'IE%wyW#6pZRU2at4KdE#t!@f^9*M{1m@|lzxt9y_'
'vtlN74-vMdn=k5E9+IEI3FqdY+gGpC+9JzMGeF1d' >> 'vtlN74-vMdn=k5E9+IEI3FqdY+gGpC+9JzMGeF1d'
'XEmiU;A3lz#RoF}`DaYW{qqeBEJ;I(+IDgSPCQDA' >> 'XEmiU;A3lz#RoF}`DaYW{qqeBEJ;I(+IDgSPCQDA'
'iu^+c2Sp>L|G6uw7owEcxQ7SW%-CL^-tcL0H`K(3' >> 'iu^+c2Sp>L|G6uw7owEcxQ7SW%-CL^-tcL0H`K(3'
'ok4PU%pLEctS|MV&%!XgRyGRa4EVCp^DNG;YS<UK' >> 'ok4PU%pLEctS|MV&%!XgRyGRa4EVCp^DNG;YS<UK'
'as%U0&V-nDtNK{0E50ZA4C0dvk(V1yXidc6&;%p9' >> 'as%U0&V-nDtNK{0E50ZA4C0dvk(V1yXidc6&;%p9'
't^_RbJn2S;A1|_}UW-9Wf47S-rovkmM8F9^r91~B' >> 't^_RbJn2S;A1|_}UW-9Wf47S-rovkmM8F9^r91~B'

or using corpushash's read_hashed_corpus method, which yields the corpus' documents in order:


In [44]:
for document in hashed_guten.read_hashed_corpus():
    print(document[0])
    break


[['hiU9#l*$9>Sz)3~G67yj9MNmW8dw;QLuiw>r5pDN', 'IE%wyW#6pZRU2at4KdE#t!@f^9*M{1m@|lzxt9y_', 's3s)>`MqvdQM9BRX;S=fC#ED?z}kXVIR$fz>O+j8', 'FzVBj0XgxuZB$DsIvT3EN+HT@J=Vt>Kr6l0;~cm&', 'K1ivUg0Ub|nhkR%dU6=y!KbP{q;@bmZ7?&hl}oXR']]

alternatively, one can check the corpus_path directory and read the output files using one's favorite text editor.

the path is:


In [45]:
hashed_guten.public_path


Out[45]:
'/home/guest/Documents/git/corpushash/notebooks/guten_test/public/2017-05-23_19-36-58-004565'

Advanced

how this library works

this is the basic algorithm for the corpushash library. for more details, check the source code, it is readable.

  • Create an empty corpus structure (nested list) to hold the hashed tokens;
  • Create a decoding dictionary: a list of key, value pairs where the key is an encoded token (hash) and the values are the unhashed token and its salt.
  • Create an encoding dictionary: a list of key, value pairs where the key is a plain token and the value is its cryptographic hash.
  • Iterate over the unhashed tokens
    • Check if the word is in the encoding dictionary;
    • If so, add its hash value to the hashed tokens list
    • If not, hash it with the addition of a random salt, and add them to the encoding and decoding dictionaries.
  • Return the hashed corpus and the dictionaries

optional arguments

hash_function: default is sha-256, but can be any hash function offered by the hashlib library that does not need additional parameters (as does scrypt, for instance)

salt_length: determines salt length in bytes, default is 32 bytes.

one_salt: determines if tokens will be hashed with the same salt or one for each token. if True, os.urandom generates a salt to be used in all hashings. (note: in this case, choose a greater salt length). if False, os.urandom will generate a salt for each token.

encoding: determines the encoding of the outputted .json files. default is utf-8, and you probably want to keep it that way.

indent_json: if None, won't indent output .json files. if positive integer, will indent using this number of spaces. zero will add, but no indentation. if you don't have nested lists, the default argument None is probably the best option, for with large corpora indentation and can take up a lot of space.

hashing documents from the same corpus at different times

hashing documents from the same corpus at different times is supported in case you produce documents continuously.

by specifying a corpus_path from a previous instance of CorpusHash to this new instance, it'll automatically search for and employ the same dictionaries used in the last hashing, which means the same tokens in the old and new documents will map to the same hash.

the files will be saved to a new directory in the public/ directory, named after the timestamp of this instance of CorpusHash.

note: be careful when specifying a previously used corpus_path if the above is not what you want to do. if you want new dictionaries, just specify a new corpus_path.

note: when specifying a previously used corpus_path, take care with the optional arguments of CorpusHash.

  • specifying a different hash_function will cause the hashes of the same words in each instance to differ.

  • if you set one_salt to True, the library will assume this was also True for the previous instances of CorpusHash, and will take an arbitrary salt from the decode_dictionary as the salt to be used in this instance -- they should all be the same, after all. if one_salt was not True for the previous instances, this will produce unexpected results.

    • additionally, if you pass one_salt=True in this situation with any value to salt_length, this value will be ignored.
    • if you pass one_salt=True in this situation with a different value than the last to hash_function, any token not hashed in the previous instances will be hashed with a different hash function, which may mean a different hash length, for example. the tokens that were hashed in the previous instances will take the same value as before, because they are looked up in the encode_dictionary and not re-hashed.