File Integrity

With file integrity, the basic question we are answering is: "Has the file changed since the last time you used it?"

Hash (Browns)

File integrity can be verified by checking the "hash" of a file.

The layman's definition of a hash: a fixed-length, scrambled string that (for all practical purposes) uniquely identifies "a thing".

The layman's definition of a hashing function: a function that transforms "a thing" into its hash.

hashlib

hashlib is part of the Python standard library, and it provides a collection of hashing functions that operate on bytes, such as encoded strings.


In [1]:
from hashlib import sha256, md5

m = sha256()
m.update('hello'.encode('utf-8'))
m.hexdigest()


Out[1]:
'2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824'

Properties of hashes & hashlib functions

The first property of hashes is that the same "thing" should always yield the same hash value.


In [3]:
m2 = sha256()
m2.update('hello'.encode('utf-8'))
m2.hexdigest()


Out[3]:
'2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824'

Similar-looking but different strings will yield different hashes.


In [4]:
m3 = sha256()
m3.update('héllo'.encode('utf-8'))
m3.hexdigest()


Out[4]:
'3c48591d8d098a4538f5e013dfcf406e948eac4d3277b10bf614e295d6068179'

Using a different hashing algorithm will yield a different hash.


In [5]:
n = md5()
n.update('hello'.encode('utf-8'))
n.hexdigest()


Out[5]:
'5d41402abc4b2a76b9719d911017c592'

Hashing functions don't work on all objects.


In [6]:
try:
    o = sha256()
    o.update(3)
except TypeError:
    print('Numbers cannot be hashed')


Numbers cannot be hashed

In [7]:
try:
    o = sha256()
    o.update('Hello world!')
except TypeError:
    print('Strings cannot be hashed without encoding.')


Strings cannot be hashed without encoding.

In [8]:
try:
    o = sha256()
    o.update('Hello world!'.encode('utf-8'))
    print(o.hexdigest())
except TypeError:
    print('Strings must be encoded first.')


c0535e4be2b79ffd93291305436bf889314e4a3faec05ecffcbb7df31ad9e51a

Checking for changes in a data file

Multiple approaches are possible:

  • Check every cell against a "master" copy, assuming you have one. (inefficient, but good for pinpointing tampered cells)
  • Check every row against a hash of that row. (somewhat inefficient, but good for practice and for pinpointing tampered rows)
  • Check the hash of the whole file. (most efficient)

A good pragmatic balance is to check every row against a hash of that row; storing each row's hash helps us pinpoint which rows may have been tampered with.

Exercise Part 1

  • Write a convenience function that hashes strings and returns the digest, and add it to datafuncs.py. It should wrap the SHA256 algorithm.
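
A minimal sketch of such a function might look like this; the name hash_string is my choice, not anything prescribed by the exercise:

from hashlib import sha256

def hash_string(string):
    """Return the SHA256 hex digest of a string."""
    # Strings must be encoded to bytes before hashing (see In [7] above).
    return sha256(string.encode('utf-8')).hexdigest()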

Exercise Part 2

Inside datafuncs.py, write a utility function with the signature hash_data(handle) that does the following (a sketch follows the list):

  • Use pandas to open the data file specified by handle into a variable df.
  • Create a new DataFrame called hashes.
  • Create a new column in hashes called concat, which contains, for each row of df, every column value converted to a string and concatenated into one contiguous string.
  • Create a new column in hashes called hash, which is the computed hash of each row's contiguous string.
  • Delete the concat column from hashes.
  • Save the hashes to disk as the file hashes.csv.
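
One possible implementation is sketched below; it assumes the hash_string function from Part 1 lives in datafuncs.py, and it is a starting point rather than the canonical answer:

import pandas as pd

from datafuncs import hash_string  # hypothetical helper from Part 1

def hash_data(handle):
    """Hash every row of a data file and save the hashes to hashes.csv."""
    df = pd.read_csv(handle)
    hashes = pd.DataFrame()
    # Join every column value of each row into one contiguous string.
    hashes['concat'] = df.astype(str).apply(lambda row: ''.join(row), axis=1)
    # Hash each row's contiguous string.
    hashes['hash'] = hashes['concat'].apply(hash_string)
    del hashes['concat']
    hashes.to_csv('hashes.csv', index=False)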

Exercise Part 3

  • Now write a function, test_divvy_corrupt(), that compares the two CSV files and automatically finds out which rows contain corrupted data. You will need to import the functions written previously.
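
A sketch of such a test is below; it assumes hashes.csv (from Part 2) holds the row hashes of the original file, and it reuses the hypothetical hash_string helper:

import pandas as pd

from datafuncs import hash_string

def test_divvy_corrupt():
    """Flag rows in the corrupt file whose hashes differ from the stored ones."""
    stored = pd.read_csv('hashes.csv')
    df = pd.read_csv('data/Divvy_Stations_2013_corrupt.csv')
    # Recompute the row hashes of the (possibly corrupted) file.
    concat = df.astype(str).apply(lambda row: ''.join(row), axis=1)
    mismatched = stored['hash'] != concat.apply(hash_string)
    assert not mismatched.any(), \
        'Corrupted rows: {}'.format(list(stored.index[mismatched]))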

Hash of a file

It is possible to check the hash of a file. Let's add an existing implementation found online to our toolkit, datafuncs.py.

(All credit to the StackOverflow community: http://stackoverflow.com/questions/3431825/generating-an-md5-checksum-of-a-file; the version below swaps MD5 for SHA256.)


In [33]:
def hash_file(fname):
    """Return the SHA256 hex digest of the file at fname."""
    filehash = sha256()
    with open(fname, "rb") as f:
        # Read in 4 KB chunks so large files never load fully into memory.
        for chunk in iter(lambda: f.read(4096), b""):
            filehash.update(chunk)
    return filehash.hexdigest()

In [34]:
hash_file('data/Divvy_Stations_2013.csv')


Out[34]:
'c861005089beb7f09e26a5b7afa09843a0ac1ca98fe9c36ac0510a58b21da40d'

In [35]:
hash_file('data/Divvy_Stations_2013_corrupt.csv')


Out[35]:
'880ba1ef2e38e4c35df4b2cd745529797f08fb24048dea0600e8174518a99869'

Exercise

Inside a new script, record_file_hash.py, write code that records the hash of a CSV file in a database (say, tinydb) or in a CSV file. The steps I think you might want to follow are outlined below, with a sketch after the list:

  • Create a CSV file from a pandas.DataFrame() (or create a tinydb database) to store the SHA256 hash of the Divvy_Stations_2013.csv file. Place the database (or CSV file) in the directory called data_integrity/. Be sure to record, at the minimum, the following:
    • File name.
    • Hash.
    • Date and time on which the hash was computed.
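
A minimal sketch of record_file_hash.py using a plain CSV store follows; the column names and the store's path are my assumptions:

# record_file_hash.py
import os
from datetime import datetime

import pandas as pd

from datafuncs import hash_file

fname = 'data/Divvy_Stations_2013.csv'
record = pd.DataFrame([{
    'filename': fname,
    'hash': hash_file(fname),
    'datetime': datetime.now().isoformat(),  # when the hash was computed
}])
store = 'data_integrity/hashes.csv'
# Append if the store already exists; otherwise create it with a header row.
record.to_csv(store, mode='a', index=False, header=not os.path.exists(store))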

Exercise

  • Write a test that checks that the current file hash matches the value that was most recently recorded. (One possible shape for it is sketched below.)
  • If you used a tinydb database, check the tinydb API docs for more information on how to query for a particular record.
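
If you went the CSV route above, the test might take this shape (again a sketch, reusing the assumed column names from the previous exercise):

import pandas as pd

from datafuncs import hash_file

def test_file_hash():
    """Check the current hash against the most recently recorded value."""
    records = pd.read_csv('data_integrity/hashes.csv')
    latest = records.sort_values('datetime').iloc[-1]
    assert hash_file(latest['filename']) == latest['hash']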
