corpushash is a simple library that aims to make natural language processing of sensitive documents easier. The library enables common NLP tasks to be performed on sensitive documents without disclosing their contents. This is done by hashing every token in the corpus along with a salt (to prevent dictionary attacks).
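To make the salting idea concrete, here is a minimal sketch of salted token hashing using Python's standard hashlib. It only illustrates the principle; it is not corpushash's actual implementation, which picks its own hash function, salt handling and encoding.

import hashlib
import os

salt = os.urandom(16)  # random salt, to be kept private along with the decode dictionary

def hash_token(token, salt):
    # hashing the token together with a salt prevents dictionary attacks:
    # without the salt, an attacker cannot precompute the hashes of common words
    return hashlib.blake2b(token.encode('utf-8'), salt=salt).hexdigest()

hash_token('gutenberg', salt)  # the same token and salt always yield the same digest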
Its workflow is as simple as providing the sensitive corpus as a Python nested list (or generator) whose elements are themselves (nested) lists of strings. After the hashing is done, NLP can be carried out by a third party, and when the results are in they can be decoded using a dictionary that maps the hashes back to the original strings. In code:
import corpushash as ch
hashed_corpus = ch.CorpusHash(mycorpus_as_a_nested_list, '/home/sensitive-corpus')
>>> "42 documents hashed and saved to '/home/sensitive-corpus/public/$(timestamp)'"

NLP is done, and the results are in:

for token in results:
    print(token, ">", hashed_corpus.decode_dictionary[token])
>>> "7)JBMGG?sGu+>%Js~dG=%c1Qn1HpAU{jM-~Buu7?" > "gutenberg"
The library requires as input:

- a tokenized corpus as a nested list, whose elements are themselves (nested) lists of the tokens of each document in the corpus. Each list corresponds to a document's structure: its chapters, paragraphs, sentences. You decide how the nested list is created or structured, as long as the input is a nested list with strings as its bottom-most elements;
- corpus_path, a path to a directory where the output files are to be stored.

The output includes:

- a .json file for every document in the corpus provided, named sequentially as integers starting from zero (so the first document becomes 0.json), stored in corpus_path/public/$(timestamp-of-hash)/;
- two pickled dictionaries stored in corpus_path/private. They are used to decode the .json files or the NLP results.
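As an illustration, here is a minimal sketch of that input/output contract. The toy documents and the example_corpus path are made up for the example; only the CorpusHash call and the decode_dictionary attribute come from the library as described above.

import corpushash as ch

# two toy documents: the first is flat, the second is nested into
# one paragraph containing two sentences
corpus = [
    ['the', 'quick', 'brown', 'fox'],
    [[['jumps', 'over'], ['the', 'lazy', 'dog']]],
]

hashed = ch.CorpusHash(corpus, 'example_corpus')
# the hashed documents land in example_corpus/public/$(timestamp)/0.json, 1.json, ...
# the decode dictionaries are written to example_corpus/private/
# hashed.decode_dictionary maps each hashed token back to its original string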
In [1]:
import os
import sys
import shutil
import copy
import numpy as np
import matplotlib.pyplot as plt
from timeit import default_timer as timer
import nltk
from nltk.corpus import words
from nltk.corpus import reuters
import corpushash as ch
Disabling logging so that we are not swamped by log messages when hashing several corpora in a row:
In [26]:
import logging
logging.getLogger('corpushash.hashers').setLevel(logging.WARNING)
In [27]:
corpus_path = 'bench_test'
In [28]:
plt.rcParams['figure.figsize'] = (30, 12)
The benchmark corpus is a wordlist, so it represents a worst-case scenario in which every token has to be hashed, rather than some tokens being repeated and merely looked up.
We'll compare the wordlist with the abc corpus, which is about three times bigger but has only 31885 unique words.
In [9]:
# if the corpora are missing, run these downloads (they rely on the nltk import above)
nltk.download('words')
nltk.download('abc')
nltk.download('reuters')
Out[9]:
In [10]:
print('|wordlist corpus|\ntotal nr of tokens & nr of unique tokens:', len(words.words()))
In [11]:
print('|abc corpus|\ntotal nr of tokens:', len(nltk.corpus.abc.words()), '\nunique tokens:', len(set(nltk.corpus.abc.words())))
In [28]:
corpus = [words.words()]
In [29]:
%time wordlistcorpus = ch.CorpusHash(corpus, corpus_path)
In [30]:
print('decode dictionary size (bytes)\n', os.path.getsize(os.path.join(corpus_path, 'private', 'decode_dictionary.json')))
In [31]:
print('hashed corpus size (bytes)\n', os.path.getsize(os.path.join(wordlistcorpus.public_path, '{}.json'.format(wordlistcorpus.corpus_size-1))))
Removing the output folder so we have a clean slate for the next corpus hash:
In [32]:
shutil.rmtree(corpus_path)
In [27]:
corpus = [list(nltk.corpus.abc.words())]
len(corpus[0])
Out[27]:
In [24]:
%time abccorpus = ch.CorpusHash(corpus, corpus_path)
In [25]:
print('decode dictionary size (bytes)\n', os.path.getsize(os.path.join(corpus_path, 'private', 'decode_dictionary.json')))
In [26]:
print('hashed corpus size (bytes)\n', os.path.getsize(os.path.join(abccorpus.public_path, '{}.json'.format(abccorpus.corpus_size-1))))
In [16]:
shutil.rmtree(corpus_path)
Two main variables should determine the time taken to hash a corpus:

- the documents' sizes;
- the documents' nesting level.

A corpus of depth zero is a plain list of tokens, while a higher-depth corpus is a nested list of tokens, in which the nesting represents document structure, such as sentences and paragraphs.
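For instance (toy documents, just to make the depth levels used in these benchmarks concrete):

# depth 0: a document is a flat list of tokens
doc_depth_0 = ['the', 'quick', 'brown', 'fox']

# depth 1: a document is a list of sentences, each a list of tokens
doc_depth_1 = [['the', 'quick'], ['brown', 'fox']]

# depth 2: a document is a list of paragraphs, each a list of sentences
doc_depth_2 = [[['the', 'quick'], ['brown', 'fox']]]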
In [17]:
corpus_size = 10
step = 100
iterations = 2330
max_corpus_size = corpus_size + step*iterations
loops = 1
iterations_time = np.zeros((iterations))
In [18]:
iterations_time.shape
Out[18]:
In [19]:
max_corpus_size
Out[19]:
In [33]:
%%time
i = 0
for size in range(corpus_size, max_corpus_size, step):
    corpus = [words.words()[:size]]  # a single document with the first `size` tokens of the wordlist
    startt = timer()
    ch.CorpusHash(corpus, corpus_path)
    endt = timer()
    iterations_time[i] = endt - startt
    shutil.rmtree(corpus_path)  # clean slate for the next run
    i += 1
    print(i)
In [ ]:
file_name = '0_d_tokens.npy'
if os.path.isfile(file_name):
    iterations_time = np.load(file_name)
else:
    np.save(file_name, iterations_time)
iterations_time is a vector in which each element is the time it took to hash the corpus at the corresponding size.
In [59]:
x = np.arange(corpus_size, max_corpus_size, step)
y = iterations_time / iterations_time[0]
plt.plot(x, y)
plt.ylabel('time (normalized)')
plt.xlabel('corpus size (in nr of tokens)')
plt.title('time to hash list of tokens')
plt.show()
In [60]:
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
print(p)
In [61]:
xp = np.linspace(corpus_size, max_corpus_size, 10000) # creating evenly-spaced points to evaluate polynomial
plt.plot(x, y, '.', xp, p(xp), '-')
plt.ylabel('time (normalized)')
plt.xlabel('corpus size (in nr of tokens)')
plt.title('time to hash list of tokens')
plt.show()
Using the reuters corpus, varying the number of documents and the document size independently:
In [63]:
corpus_size = 10
step = 5
iterations = 260
max_corpus_size = corpus_size + step*iterations
iterations_time = np.zeros((iterations*iterations))
In [64]:
iterations_time.shape
Out[64]:
In [65]:
max_corpus_size
Out[65]:
In [66]:
document = list(nltk.corpus.reuters.words())
In [67]:
corpus_length = len(document)
corpus_length
Out[67]:
In [68]:
%%time
i = 0
k = 0
for size in range(corpus_size, max_corpus_size, step):
    for nrdocs in range(iterations):
        doc_length = size
        corpus = []
        for doc in range(nrdocs+1):
            begin = doc*doc_length
            end = (1 + doc)*doc_length
            corpus.append(document[begin:end])
        startt = timer()
        a = ch.CorpusHash(corpus, corpus_path)
        endt = timer()
        iterations_time[i] = endt - startt
        shutil.rmtree(corpus_path)
        i += 1
    print(k)
    k += 1
In [ ]:
file_name = '0_d_tokens_documents.npy'
if os.path.isfile(file_name):
    iterations_time = np.load(file_name)
else:
    np.save(file_name, iterations_time)
In [9]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.gca(projection='3d')
In [13]:
xs = np.array(list(range(iterations))*iterations) # nr of docs
In [14]:
y = []
for i in range(corpus_size, max_corpus_size, step):
    y += [i]*iterations
ys = np.array(y) # nr of tokens in each document
In [39]:
iterations_time[0]
Out[39]:
In [42]:
zs = iterations_time / iterations_time[0]  # normalize by the first (smallest) run
In [51]:
ax.scatter(xs, ys, zs)
ax.set_xlabel('nr of documents')
ax.set_ylabel('nr of tokens per document')
ax.set_zlabel('time to hash (normalized)')
plt.show()
Using the reuters corpus again, varying the number of documents and the document size independently, this time with nested documents:
In [32]:
corpus_size = 10
step = 5
iterations = 260
max_corpus_size = corpus_size + step*iterations
iterations_time = np.zeros((iterations*iterations))
In [33]:
iterations_time.shape
Out[33]:
In [34]:
max_corpus_size
Out[34]:
Normal nesting, as produced by ch.text_split:
In [38]:
normal_corpus = ch.text_split(reuters.raw())
In [41]:
%time ch.CorpusHash([normal_corpus], 'split_test')
Out[41]:
In [42]:
document = []
for ix, word in enumerate(reuters.words()):  # wrap every other token in a list (nesting level 1)
    if ix % 2:
        document.append([word])
    else:
        document.append(word)
document[:10]
Out[42]:
In [43]:
shutil.rmtree('split_test')
In [44]:
%time ch.CorpusHash([document], 'split_test')
Out[44]:
In [45]:
corpus_length = len(document)
corpus_length
Out[45]:
In [47]:
%%time
i = 0
k = 0
for size in range(corpus_size, max_corpus_size, step):
    for nrdocs in range(iterations):
        doc_length = size
        corpus = []
        for doc in range(nrdocs+1):
            begin = doc*doc_length
            end = (1 + doc)*doc_length
            corpus.append(document[begin:end])
        startt = timer()
        a = ch.CorpusHash(corpus, corpus_path)
        endt = timer()
        iterations_time[i] = endt - startt
        shutil.rmtree(corpus_path)
        i += 1
    print(k)
    k += 1
In [ ]:
file_name = '1_d_tokens_documents.npy'
if os.path.isfile(file_name):
    iterations_time = np.load(file_name)
else:
    np.save(file_name, iterations_time)
In [23]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.gca(projection='3d')
In [24]:
xs = np.array(list(range(iterations))*iterations) # nr of docs
In [25]:
y = []
for i in range(corpus_size, max_corpus_size, step):
    y += [i]*iterations
ys = np.array(y) # nr of tokens in each document
In [44]:
iterations_time[0]
Out[44]:
In [27]:
zs = iterations_time / iterations_time[0]
In [28]:
ax.scatter(xs, ys, zs)
ax.set_xlabel('nr of documents')
ax.set_ylabel('nr of tokens per document')
ax.set_zlabel('time to hash (normalized)')
plt.show()
In [48]:
corpus_size = 10
step = 5
iterations = 260
max_corpus_size = corpus_size + step*iterations
iterations_time = np.zeros((iterations*iterations))
In [49]:
iterations_time.shape
Out[49]:
In [50]:
max_corpus_size
Out[50]:
In [51]:
document = []
for ix, word in enumerate(nltk.corpus.reuters.words()):  # spread tokens across nesting depths 0, 1 and 2
    if ix % 3 == 0:
        document.append(word)
    elif ix % 3 == 2:
        document.append([word])
    else:
        document.append([[word]])
document[:10]
Out[51]:
In [52]:
corpus_length = len(document)
corpus_length
Out[52]:
In [54]:
%%time
i = 0
k = 0
for size in range(corpus_size, max_corpus_size, step):
    for nrdocs in range(iterations):
        doc_length = size
        corpus = []
        for doc in range(nrdocs+1):
            begin = doc*doc_length
            end = (1 + doc)*doc_length
            corpus.append(document[begin:end])
        startt = timer()
        a = ch.CorpusHash(corpus, corpus_path)
        endt = timer()
        iterations_time[i] = endt - startt
        shutil.rmtree(corpus_path)
        i += 1
    print(k)
    k += 1
file_name = '2_d_tokens_documents.npy'
if os.path.isfile(file_name):
    iterations_time = np.load(file_name)
else:
    np.save(file_name, iterations_time)
In [33]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.gca(projection='3d')
In [34]:
xs = np.array(list(range(iterations))*iterations) # nr of docs
In [35]:
y = []
for i in range(corpus_size, max_corpus_size, step):
    y += [i]*iterations
ys = np.array(y) # nr of tokens in each document
In [40]:
iterations_time[0]
Out[40]:
In [37]:
zs = iterations_time / iterations_time[0]
In [38]:
ax.scatter(xs, ys, zs)
ax.set_xlabel('nr of documents')
ax.set_ylabel('nr of tokens per document')
ax.set_zlabel('time to hash (normalized)')
plt.show()
In [10]:
file_name = '0_d_tokens_documents.npy'
if os.path.isfile(file_name):
    zero = np.load(file_name)
In [9]:
file_name = '1_d_tokens_documents.npy'
if os.path.isfile(file_name):
    um = np.load(file_name)
In [ ]:
file_name = '2_d_tokens_documents.npy'
if os.path.isfile(file_name):
    dois = np.load(file_name)
In [31]:
from mpl_toolkits.mplot3d import Axes3D
In [32]:
xs = np.array(list(range(iterations))*iterations) # nr of docs
In [33]:
y = []
for i in range(corpus_size, max_corpus_size, step):
    y += [i]*iterations
ys = np.array(y) # nr of tokens in each document
In [ ]:
z0 = zero / zero[0]
In [34]:
z1 = um / zero[0]
In [35]:
z2 = dois / zero[0]
In [38]:
fig = plt.figure(figsize=plt.figaspect(3))
ax = fig.add_subplot(3, 1, 1, projection='3d')
ax.scatter(xs, ys, z0, c='r')
plt.title('nesting level 0')
ax = fig.add_subplot(3, 1, 2, projection='3d')
ax.scatter(xs, ys, z1, c='g')
plt.title('nesting level 1')
ax = fig.add_subplot(3, 1, 3, projection='3d')
ax.scatter(xs, ys, z2, c='b')
plt.title('nesting level 2')
ax.set_xlabel('nr of documents')
ax.set_ylabel('nr of tokens per document')
ax.set_zlabel('time to hash (normalized)')
plt.show()
In this benchmark we'll be using the Brazilian Portuguese and English dictionaries, for a corpus of moderate size.
In [5]:
with open('brazilian.txt', 'r') as f:
    portuguese = f.read().split()
english = words.words()
In [6]:
document = english + portuguese
corpus_length = len(document)
corpus_length
Out[6]:
In [7]:
corpus_size = 10
step = 100
iterations = 74
max_corpus_size = corpus_size + step*iterations
loops = 3
iterations_time = np.zeros((loops, iterations))
In [ ]:
%%time
i = 0
for size in range(corpus_size, max_corpus_size, step):
    corpus = []
    doc_length = size
    for doc in range(size):
        begin = doc*doc_length
        end = (1 + doc)*doc_length
        corpus.append(document[begin:end])
    for loop in range(loops):
        startt = timer()
        a = ch.CorpusHash(corpus, corpus_path)
        endt = timer()
        iterations_time[loop, i] = endt - startt
        shutil.rmtree(corpus_path)
    print(i)
    i += 1
Saving the timing data to disk:
In [8]:
file_name = 'paper-iterations-time.npy'
if os.path.isfile(file_name):
    iterations_time = np.load(file_name)
else:
    np.save(file_name, iterations_time)
Plotting:
In [9]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.gca(projection='3d')
In [10]:
xs = np.arange(1, iterations+1)
ys = np.arange(corpus_size, max_corpus_size, step)
zs = np.average(iterations_time, axis=0)
normalized_zs = np.divide(zs, zs[0])
In [11]:
ax.scatter(xs, ys, normalized_zs)
ax.set_xlabel('documents')
ax.set_ylabel('tokens per document')
ax.set_zlabel('time to hash (normalized)')
plt.show()
In [12]:
z = np.polyfit(ys*xs, normalized_zs, 1)
p = np.poly1d(z)
print(p)
In [13]:
from PIL import Image
from io import BytesIO
Saving the plot as a .tiff, for the paper:
In [15]:
fig = plt.figure(figsize=(20, 12), dpi=300)
xp = np.linspace(0, 600000, iterations)  # creating evenly-spaced points at which to evaluate the polynomial
plt.plot(xs*ys, normalized_zs, '.', xp, p(xp), 'r-')
plt.ylabel('time', fontsize=22)
plt.xlabel(r'corpus size (documents $\cdot$ words)', fontsize=22)
plt.text(300000, 3000, r'$y(x) = 0.01821 x + 172.5$', fontsize=20, color='r')
#plt.title('time to hash corpus')
plt.tick_params(axis='both', which='major', labelsize=20)
png1 = BytesIO()
fig.savefig(png1, format='png')
# (2) load this image into PIL
png2 = Image.open(png1)
# (3) save as TIFF
png2.save('complexity.tiff')
png1.close()
In [19]:
document2 = []
for ix, word in enumerate(document):  # wrap every other token in a list (nesting level 1)
    if ix % 2:
        document2.append([word])
    else:
        document2.append(word)
In [20]:
document3 = []
for ix, word in enumerate(document):  # spread tokens across nesting depths 0, 1 and 2
    if ix % 3 == 0:
        document3.append(word)
    elif ix % 3 == 2:
        document3.append([word])
    else:
        document3.append([[word]])
In [21]:
docs = [document, document2, document3]
In [35]:
corpus_length = len(document)
corpus_size = 10
step = 10000
iterations = 54
max_corpus_size = corpus_size + step*iterations
iterations_time = np.zeros((3, iterations))
In [36]:
max_corpus_size
Out[36]:
In [ ]:
%%time
i = 0
for size in range(corpus_size, max_corpus_size, step):
    for ix, doc in enumerate(docs):
        corpus = [doc[:size]]
        startt = timer()
        a = ch.CorpusHash(corpus, corpus_path)
        endt = timer()
        iterations_time[ix, i] = endt - startt
        shutil.rmtree(corpus_path)
    print(i)
    i += 1
In [7]:
file_name = 'nest-iterations-time.npy'
if os.path.isfile(file_name):
    iterations_time = np.load(file_name)
else:
    np.save(file_name, iterations_time)
In [8]:
iterations_time
Out[8]:
In [10]:
to_2_1 = np.divide(iterations_time[1], iterations_time[0])
In [15]:
print(to_2_1, np.mean(to_2_1), np.std(to_2_1))
In [17]:
to_3_1 = np.divide(iterations_time[2], iterations_time[0])
In [18]:
print(to_3_1, np.mean(to_3_1), np.std(to_3_1))
In [20]:
to_3_2 = np.divide(iterations_time[2], iterations_time[1])
In [21]:
print(to_3_2, np.mean(to_3_2), np.std(to_3_2))