File Integrity

With file integrity, the basic question we are answering is: "Has the file changed since the last time you used it?"

Hash (Browns)

File integrity can be verified by checking the "hash" of a file.

The layman's definition of a hash: a fixed-length, scrambled string that (for all practical purposes) uniquely identifies "a thing".

The layman's definition of a hashing function: a function that transforms "a thing" into its hash.

hashlib

hashlib is part of the Python standard library, and it provides a collection of hashing functions that operate on bytes, such as encoded strings.


In [1]:
from hashlib import sha256, md5

m = sha256()
m.update('hello'.encode('utf-8'))
m.hexdigest()


Out[1]:
'2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824'

Properties of hashes & hashlib functions

The first property of hashes is that the same "thing" should always yield the same hash value.


In [3]:
m2 = sha256()
m2.update('hello'.encode('utf-8'))
m2.hexdigest()


Out[3]:
'2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824'

Similar-looking but different strings will yield different hashes.


In [4]:
m3 = sha256()
m3.update('héllo'.encode('utf-8'))
m3.hexdigest()


Out[4]:
'3c48591d8d098a4538f5e013dfcf406e948eac4d3277b10bf614e295d6068179'

Using a different hashing algorithm will yield a different hash.


In [5]:
n = md5()
n.update('hello'.encode('utf-8'))
n.hexdigest()


Out[5]:
'5d41402abc4b2a76b9719d911017c592'

Hashing functions don't work on all objects.


In [6]:
try:
    o = sha256()
    o.update(3)
except TypeError:
    print('Numbers cannot be hashed')


Numbers cannot be hashed

In [7]:
try:
    o = sha256()
    o.update('Hello world!')
except TypeError:
    print('Strings cannot be hashed without encoding.')


Strings cannot be hashed without encoding.

In [8]:
try:
    o = sha256()
    o.update('Hello world!'.encode('utf-8'))
    print(o.hexdigest())
except TypeError:
    print('Strings must be encoded first.')


c0535e4be2b79ffd93291305436bf889314e4a3faec05ecffcbb7df31ad9e51a

Checking for changes in a data file

Multiple approaches are possible:

  • Check every cell against a "master" copy, assuming you have one. (inefficient, but good for pinpointing tampered cells)
  • Check every row against a hash of that row. (somewhat inefficient, but good for practice and for pinpointing tampered rows)
  • Check the hash of the whole file. (most efficient)

A good pragmatic balance is to check every row against a hash of that row; storing each row's hash helps us pinpoint which rows may have been tampered with.

Exercise Part 1

  • Write a convenience function that hashes strings and returns the digest, and add it to datafuncs.py. It should wrap the SHA256 algorithm.
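
A minimal sketch of such a function might look like this; the name hash_string is my choice, not anything prescribed by the exercise:

from hashlib import sha256

def hash_string(string):
    """Return the SHA256 hex digest of a string."""
    # Strings must be encoded to bytes before hashing (see In [7] above).
    return sha256(string.encode('utf-8')).hexdigest()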

Exercise Part 2

Inside datafuncs.py, write a utility function with the signature hash_data(handle) that does the following (a sketch follows the list):

  • Use pandas to open the data file specified by handle into a variable df.
  • Create a new DataFrame called hashes.
  • Create a new column in hashes called concat, which contains, for each row of df, every column value converted to a string and concatenated into one contiguous string.
  • Create a new column in hashes called hash, which is the computed hash of each row's contiguous string.
  • Delete the concat column from hashes.
  • Save the hashes to disk as the file hashes.csv.
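
One possible implementation is sketched below; it assumes the hash_string function from Part 1 lives in datafuncs.py, and it is a starting point rather than the canonical answer:

import pandas as pd

from datafuncs import hash_string  # hypothetical helper from Part 1

def hash_data(handle):
    """Hash every row of a data file and save the hashes to hashes.csv."""
    df = pd.read_csv(handle)
    hashes = pd.DataFrame()
    # Join every column value of each row into one contiguous string.
    hashes['concat'] = df.astype(str).apply(lambda row: ''.join(row), axis=1)
    # Hash each row's contiguous string.
    hashes['hash'] = hashes['concat'].apply(hash_string)
    del hashes['concat']
    hashes.to_csv('hashes.csv', index=False)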

Exercise Part 3

  • Now write a function, test_divvy_corrupt(), that compares the two CSV files and automatically finds out which rows contain corrupted data. You will need to import the functions written previously.
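
A sketch of such a test is below; it assumes hashes.csv (from Part 2) holds the row hashes of the original file, and it reuses the hypothetical hash_string helper:

import pandas as pd

from datafuncs import hash_string

def test_divvy_corrupt():
    """Flag rows in the corrupt file whose hashes differ from the stored ones."""
    stored = pd.read_csv('hashes.csv')
    df = pd.read_csv('data/Divvy_Stations_2013_corrupt.csv')
    # Recompute the row hashes of the (possibly corrupted) file.
    concat = df.astype(str).apply(lambda row: ''.join(row), axis=1)
    mismatched = stored['hash'] != concat.apply(hash_string)
    assert not mismatched.any(), \
        'Corrupted rows: {}'.format(list(stored.index[mismatched]))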

Hash of a file

It is possible to check the hash of a file. Let's add an existing implementation found online to our toolkit, datafuncs.py.

(All credit to the StackOverflow community: http://stackoverflow.com/questions/3431825/generating-an-md5-checksum-of-a-file; the version below swaps MD5 for SHA256.)


In [33]:
def hash_file(fname):
    """Return the SHA256 hex digest of the file at fname."""
    filehash = sha256()
    with open(fname, "rb") as f:
        # Read in 4 KB chunks so large files never load fully into memory.
        for chunk in iter(lambda: f.read(4096), b""):
            filehash.update(chunk)
    return filehash.hexdigest()

In [34]:
hash_file('data/Divvy_Stations_2013.csv')


Out[34]:
'c861005089beb7f09e26a5b7afa09843a0ac1ca98fe9c36ac0510a58b21da40d'

In [35]:
hash_file('data/Divvy_Stations_2013_corrupt.csv')


Out[35]:
'880ba1ef2e38e4c35df4b2cd745529797f08fb24048dea0600e8174518a99869'

Exercise

Inside a new script, record_file_hash.py, write code that records the hash of a CSV file in a database (say, tinydb) or in a CSV file. The steps I think you might want to follow are outlined below, with a sketch after the list:

  • Create a CSV file from a pandas.DataFrame() (or create a tinydb database) to store the SHA256 hash of the Divvy_Stations_2013.csv file. Place the database (or CSV file) in the directory called data_integrity/. Be sure to record, at the minimum, the following:
    • File name.
    • Hash.
    • Date and time on which the hash was computed.
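
A minimal sketch of record_file_hash.py using a plain CSV store follows; the column names and the store's path are my assumptions:

# record_file_hash.py
import os
from datetime import datetime

import pandas as pd

from datafuncs import hash_file

fname = 'data/Divvy_Stations_2013.csv'
record = pd.DataFrame([{
    'filename': fname,
    'hash': hash_file(fname),
    'datetime': datetime.now().isoformat(),  # when the hash was computed
}])
store = 'data_integrity/hashes.csv'
# Append if the store already exists; otherwise create it with a header row.
record.to_csv(store, mode='a', index=False, header=not os.path.exists(store))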

Exercise

  • Write a test that checks that the current file hash matches the value that was most recently recorded. (One possible shape for it is sketched below.)
  • If you used a tinydb database, check the tinydb API docs for more information on how to query for a particular record.
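
If you went the CSV route above, the test might take this shape (again a sketch, reusing the assumed column names from the previous exercise):

import pandas as pd

from datafuncs import hash_file

def test_file_hash():
    """Check the current hash against the most recently recorded value."""
    records = pd.read_csv('data_integrity/hashes.csv')
    latest = records.sort_values('datetime').iloc[-1]
    assert hash_file(latest['filename']) == latest['hash']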
