In [1]:
from hashlib import sha256, md5
m = sha256()
m.update('hello'.encode('utf-8'))
m.hexdigest()
Out[1]:
'2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824'
In [3]:
m2 = sha256()
m2.update('hello'.encode('utf-8'))
m2.hexdigest()
Out[3]:
'2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824'
Similar-looking but different strings will yield different hashes.
In [4]:
m3 = sha256()
m3.update('héllo'.encode('utf-8'))
m3.hexdigest()
Out[4]:
Using a different hashing algorithm will yield a different hash.
In [5]:
n = md5()
n.update('hello'.encode('utf-8'))
n.hexdigest()
Out[5]:
'5d41402abc4b2a76b9719d911017c592'
Hashing functions don't work on all objects.
In [6]:
try:
    o = sha256()
    o.update(3)
except TypeError:
    print('Numbers cannot be hashed')
In [7]:
try:
    o = sha256()
    o.update('Hello world!')
except TypeError:
    print('Strings cannot be hashed without encoding.')
In [8]:
try:
    o = sha256()
    o.update('Hello world!'.encode('utf-8'))
    print(o.hexdigest())
except TypeError:
    print('Strings must be encoded first.')
Multiple approaches are possible. A good pragmatic balance is to check every row against a hash of that row; storing each row's hash helps us pinpoint which rows may have been tampered with.
Inside datafuncs.py, write a utility function with the signature hash_data(handle) that does the following:

- Use pandas to open the data file specified by the handle as the variable df.
- Create a new DataFrame called hashes.
- Add a column to hashes called concat, which is each column of data from df converted to strings and concatenated into a contiguous string.
- Add a column to hashes called hash, which is the computed hash of each row's contiguous string.
- Drop the concat column from hashes.
- Save hashes to hashes.csv.

It is also possible to check the hash of an entire file. Let's add an existing implementation found online to our toolkit, datafuncs.py.
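One way the steps above might be sketched is shown below; reading with pd.read_csv and the column order used for concatenation are assumptions, not requirements from the exercise text:

```python
import pandas as pd
from hashlib import sha256

def hash_data(handle):
    # Open the data file specified by the handle (assumed to be CSV).
    df = pd.read_csv(handle)
    hashes = pd.DataFrame()
    # Convert every column to strings and concatenate row-wise into
    # one contiguous string per row.
    hashes['concat'] = df.astype(str).apply(''.join, axis=1)
    # Compute the hash of each row's contiguous string.
    hashes['hash'] = hashes['concat'].apply(
        lambda s: sha256(s.encode('utf-8')).hexdigest())
    # Drop the intermediate column and save the hashes.
    hashes = hashes.drop('concat', axis=1)
    hashes.to_csv('hashes.csv', index=False)
    return hashes
```

Note that any change to the column order or string conversion changes every hash, so the same convention must be used when re-checking the data.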
(All credit to StackOverflow community: http://stackoverflow.com/questions/3431825/generating-an-md5-checksum-of-a-file)
In [33]:
def hash_file(fname):
    filehash = sha256()
    with open(fname, "rb") as f:
        # Read the file in 4096-byte chunks so large files
        # don't have to fit in memory.
        for chunk in iter(lambda: f.read(4096), b""):
            filehash.update(chunk)
    return filehash.hexdigest()
In [34]:
hash_file('data/Divvy_Stations_2013.csv')
Out[34]:
In [35]:
hash_file('data/Divvy_Stations_2013_corrupt.csv')
Out[35]:
Inside a new script, record_file_hash.py, write code that records the hash of a CSV file inside a database (say, tinydb) or a CSV file. The steps I think you might want to follow are outlined below:

- Use a pandas.DataFrame() (or create a tinydb database) to store the MD5 hash of the Divvy_Stations_2013.csv file.
- Place the database (or CSV file) in the directory called data_integrity/. Be sure to record, at the minimum, the following:
- If you use a tinydb database, then check the API docs here for more information on how to query for a particular record.
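As one possible starting point, here is a minimal sketch of record_file_hash.py that appends to a plain CSV file under data_integrity/; the recorded fields (filename, hash, timestamp) and the file name hashes.csv are assumptions, not part of the exercise text:

```python
import datetime
import os
from hashlib import md5

def hash_file_md5(fname):
    # Same chunked-read pattern as hash_file above, but with MD5.
    filehash = md5()
    with open(fname, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            filehash.update(chunk)
    return filehash.hexdigest()

def record_file_hash(fname, db_path='data_integrity/hashes.csv'):
    # Make sure the data_integrity/ directory exists.
    os.makedirs(os.path.dirname(db_path), exist_ok=True)
    # The recorded fields are an assumption: filename, hash, timestamp.
    row = '{},{},{}\n'.format(
        fname, hash_file_md5(fname), datetime.datetime.now().isoformat())
    with open(db_path, 'a') as db:
        db.write(row)
```

Appending one row per run keeps a history of hashes, so a later check can compare the current hash against the most recent recorded one.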
In [ ]: