In [26]:
%%HTML
<style> code {background-color : lightgrey !important;} </style>
corpushash is a simple library that aims to make the natural language processing of sensitive documents easier. it enables common NLP tasks to be performed on sensitive documents without disclosing their contents. this is done by hashing every token in the corpus along with a salt (to prevent dictionary attacks).
its workflow is as simple as providing the sensitive corpus as a python nested list (or generator) whose elements are themselves (nested) lists of strings. after the hashing is done, NLP can be carried out by a third party, and when the results are in they can be decoded using a dictionary that maps hashes back to the original strings. in code, that looks like:
import corpushash as ch
hashed_corpus = ch.CorpusHash(mycorpus_as_a_nested_list, '/home/sensitive-corpus')
>>> "42 documents hashed and saved to '/home/sensitive-corpus/public/$(timestamp)'"
NLP is done, and the results are in:
for token in results:
    print(token, ">", hashed_corpus.decode_dictionary[token])
>>> "7)JBMGG?sGu+>%Js~dG=%c1Qn1HpAU{jM-~Buu7?" > "gutenberg"
The library requires as input:
a tokenized corpus as a nested list, whose elements are themselves nested lists of the tokens of each document in the corpus
each list corresponds to a document's structure: its chapters, paragraphs, sentences. you decide how the nested list is created or structured, as long as the input is a nested list with strings as its bottom-most elements (see the sketch below)
corpus_path, a path to a directory where the output files are to be stored
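for illustration, a minimal (made-up) corpus of two documents could look like this:
toy_corpus = [
    [['the', 'first', 'sentence'], ['the', 'second', 'sentence']],  # document 0: two sentences
    [['a', 'second', 'document']],                                  # document 1: one sentence
]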
The output includes:
a .json file for every document in the corpus provided, named sequentially as integers starting from zero, e.g., the first document being 0.json, stored in corpus_path/public/$(timestamp-of-hash)/
two .json dictionaries stored in corpus_path/private. they are used to decode the hashed .json files or the NLP results
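to get a feel for this layout, one could list the output directories after a hashing run; the sketch below reuses the hypothetical corpus_path from the quickstart example:
import os
corpus_path = '/home/sensitive-corpus'
print(os.listdir(os.path.join(corpus_path, 'public')))   # one timestamped directory per hashing run
print(os.listdir(os.path.join(corpus_path, 'private')))  # the encode and decode dictionaries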
loading libraries...
In [27]:
import os
import json
from nltk.corpus import gutenberg
import corpushash as ch
import base64
import hashlib
import random
we'll use the gutenberg corpus as test data, which is available through the nltk library.
downloading test data (if needed):
In [28]:
import nltk
# nltk.download('gutenberg')  # uncomment if you don't have the data yet
files in test data:
In [29]:
gutenberg.fileids()
Out[29]:
creating test corpus path, where hashed documents will be stored as .json files:
In [30]:
base_path = os.getcwd()
base_path
Out[30]:
In [31]:
corpus_path = os.path.join(base_path, 'guten_test')
corpus_path
Out[31]:
In [32]:
excerpt = gutenberg.raw('austen-emma.txt')[50:478]
print(excerpt)
every paragraph and sentence is its own list:
In [33]:
print(ch.text_split(excerpt))
In [34]:
%%time
guten_list = []
for document_name in gutenberg.fileids():
    document = gutenberg.raw(document_name)
    split_document = ch.text_split(document)
    guten_list.append(split_document)
excerpt:
In [35]:
document = random.choice(guten_list)
print(document[:10])
In [36]:
%time hashed_guten = ch.CorpusHash(guten_list, corpus_path)
In [37]:
entries = random.sample(list(hashed_guten.encode_dictionary.keys()), k=5)
for entry in entries:
    print("token >> {:^20} | hashed_token >> '{}'".format(entry, hashed_guten.encode_dictionary[entry]))
In [38]:
entries = random.sample(list(hashed_guten.decode_dictionary.keys()), k=5)
for entry in entries:
    print("hashed_token >> '{}' | (token >> '{}', salt >> '{}')".format(entry, hashed_guten.decode_dictionary[entry][0], hashed_guten.decode_dictionary[entry][1][:4])) # cutting off some bytes for aesthetic reasons
the walk_nested_list function yields the items of a nested list in order, regardless of their depth:
In [39]:
print(excerpt)
In [40]:
for element in ch.walk_nested_list(ch.text_split(excerpt)):
    print(element)
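a toy example (the nested list below is made up) may make the behavior clearer:
toy = [['a', ['b', 'c']], ['d']]
print(list(ch.walk_nested_list(toy)))  # expected: ['a', 'b', 'c', 'd']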
In [41]:
limit = 10 # showing first ten entries
document = random.randint(0, len(gutenberg.fileids()) - 1)  # randint is inclusive at both ends
print('document {} corresponds to {}.'.format(document, gutenberg.fileids()[document]))
note: take care to have your corpus always yield the documents in the same order, or else it will be harder to tell which hashed documents correspond to which original documents.
you can read the output .json files directly:
In [42]:
document_path = os.path.join(hashed_guten.public_path, "{}.json".format(document))
with open(document_path, mode="rt") as fp:
    encoded_document = json.load(fp)
In [43]:
print("original token >> encoded token")
for ix, tokens in enumerate(zip(ch.walk_nested_list(guten_list[document]), ch.walk_nested_list(encoded_document))):
    print("'{}' >> '{}'".format(tokens[0], tokens[1]))
    if ix > limit:
        break
or using corpushash's read_hashed_corpus method, which yields the corpus' documents in order:
In [44]:
for document in hashed_guten.read_hashed_corpus():
    print(document[0])
    break
alternatively, one can check the corpus_path directory and read the output files using one's favorite text editor. the path is:
In [45]:
hashed_guten.public_path
Out[45]:
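for instance, a quick way to list the first few hashed documents on disk (a sketch; the files are named 0.json, 1.json, ... as described above):
print(sorted(os.listdir(hashed_guten.public_path))[:5])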
this is the basic algorithm of the corpushash library. for more details, check the source code; it is readable. its behavior can be tweaked with the following optional arguments of CorpusHash:
hash_function: default is sha-256, but it can be any hash function offered by the hashlib library that does not need additional parameters (as scrypt does, for instance)
salt_length: determines the salt length in bytes; default is 32 bytes
one_salt: determines whether tokens are hashed with a single shared salt or one salt per token. if True, os.urandom generates one salt to be used in all hashings (note: in this case, choose a greater salt length). if False, os.urandom will generate a salt for each token
encoding: determines the encoding of the output .json files. default is utf-8, and you probably want to keep it that way
indent_json: if None, the output .json files are not indented. if a positive integer, they are indented by that number of spaces; zero adds newlines but no indentation. if you don't have nested lists, the default argument None is probably the best option, for with large corpora indentation can take up a lot of space.
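as a sketch, these options would be passed to CorpusHash as keyword arguments; the values below are arbitrary examples, and the exact parameter names and types (e.g. whether hash_function takes a string) should be checked against the source:
hashed_custom = ch.CorpusHash(guten_list, corpus_path,
                              hash_function='sha512',  # assumed: a hashlib algorithm that needs no extra parameters
                              salt_length=64,          # longer salt, in bytes
                              one_salt=True,           # one salt shared by all tokens
                              indent_json=None)        # keep the output .json files compact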
hashing documents from the same corpus at different times is supported, in case you produce documents continuously.
by specifying a corpus_path from a previous instance of CorpusHash to the new instance, it will automatically find and employ the same dictionaries used in the last hashing, which means the same tokens in the old and new documents will map to the same hashes.
the files will be saved to a new directory inside public/, named after the timestamp of this instance of CorpusHash.
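a minimal sketch of this incremental use, reusing the corpus_path from the hashing above (the "new" document is just an example):
new_documents = [ch.text_split(gutenberg.raw('austen-emma.txt'))]  # pretend this document arrived later
later_hash = ch.CorpusHash(new_documents, corpus_path)             # same corpus_path as the first run
token = next(iter(hashed_guten.encode_dictionary))                 # some token from the first run
print(later_hash.encode_dictionary[token] == hashed_guten.encode_dictionary[token])  # expected: True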
note: be careful when specifying a previously used corpus_path if the above is not what you want to do. if you want new dictionaries, just specify a new corpus_path.
note: when specifying a previously used corpus_path, take care with the optional arguments of CorpusHash:
specifying a different hash_function will cause the hashes of the same words in each instance to differ.
if you set one_salt to True, the library will assume this was also True for the previous instances of CorpusHash, and will take an arbitrary salt from the decode_dictionary as the salt to be used in this instance -- they should all be the same, after all. if one_salt was not True for the previous instances, this will produce unexpected results.
if you use one_salt=True in this situation, any value passed to salt_length will be ignored.
if you use one_salt=True in this situation with a hash_function different from the last one, any token not hashed in the previous instances will be hashed with a different hash function, which may mean a different hash length, for example. the tokens that were hashed in the previous instances will keep the same values as before, because they are looked up in the encode_dictionary and not re-hashed.