DinoData


PYTHON FOR DATA SCIENCE

HASHING THE DINOS

This Notebook is admittedly a little weird in the topics it mixes. We bring in a large number of dinosaur names (species names, as discovered from the fossil record and perhaps from other records). However, this set of strings only serves to fuel the purely mathematical process of applying hashlib.sha256 to each one.

Think of dino names as passwords. We may consider these insecure, but let's not assume the game is that serious. For the purposes of today's exercise, they're secure enough.

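However, just because you've picked a password does not mean a DBA needs to keep it in her database, where it might get stolen. Rather, a hash of your password serves as a deterministic fingerprint: the same password always produces the same hash. Just save the fingerprint. At login, whatever is typed gets hashed again and compared with the saved fingerprint. A wrong password produces a different fingerprint, so no one with a wrong password will get through the gate.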
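To make the fingerprint idea concrete, here is a minimal sketch using only the standard library. The helper names fingerprint and check are made up for this illustration, and "Tyrannosaurus" is just a stand-in password.

```python
import hashlib

def fingerprint(password: str) -> str:
    """Return the SHA-256 hex digest of a password string."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()

# What gets saved: the fingerprint, never the password itself.
stored = fingerprint("Tyrannosaurus")

def check(attempt: str) -> bool:
    """Hash the attempt and compare it with the saved fingerprint."""
    return fingerprint(attempt) == stored

print(len(stored))             # a SHA-256 hex digest is 64 characters long
print(check("Tyrannosaurus"))  # the right password matches
print(check("Triceratops"))    # a wrong password does not
```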


In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

dinos = pd.read_json("dino_hash.json")

You'll notice the hashing algorithm has already been applied by the time we import this JSON data. For that part of the pipeline, I'll be showing you the Python source code of scripts that run outside Jupyter Notebook.
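As a rough idea of what such an upstream step might look like, here is a sketch. It is not the actual script: the three-name list, the sha256 column name, and the default to_json layout are all assumptions made for illustration.

```python
import hashlib
import io

import pandas as pd

# Stand-in list; the real pipeline presumably reads many more names.
names = ["Allosaurus", "Mtapaiasaurus", "Triceratops"]

# Hash every name and use the names themselves as the index.
hashes = [hashlib.sha256(n.encode("utf-8")).hexdigest() for n in names]
frame = pd.DataFrame({"sha256": hashes}, index=names)

# Serialize to JSON text, then read it back the way the Notebook does.
json_text = frame.to_json()
round_trip = pd.read_json(io.StringIO(json_text))

print(round_trip.head())
```

Writing json_text to a file instead of a StringIO buffer would give something pd.read_json can load by filename.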


In [ ]:
dinos.head()

Remember how the .loc attribute uses enhanced slice notation: it slices by index label rather than by position, and the end label is included ("enhanced" in the sense that core Python sequences do not support it).


In [ ]:
dinos.loc["Mo":"N"]
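Here is a small sketch of that label slicing on a toy frame. The four names and the two-character placeholder "hashes" are invented for the example. The index is sorted, which is what lets .loc accept slice bounds such as "Mo" and "N" that are not themselves labels.

```python
import pandas as pd

# Toy frame with a sorted string index and placeholder values.
toy = pd.DataFrame(
    {"sha256": ["aa", "bb", "cc", "dd"]},
    index=["Minmi", "Mononykus", "Mtapaiasaurus", "Nanosaurus"],
)

# Bounds need not be actual labels when the index is sorted:
# everything from "Mo" up to "N" in alphabetical order.
subset = toy.loc["Mo":"N"]
print(list(subset.index))

# When the end label does exist, it is included in the result.
inclusive = toy.loc["Mononykus":"Nanosaurus"]
print(list(inclusive.index))
```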

In [ ]:
dinos.dtypes

In [ ]:
dinos.index.is_unique

In [ ]:
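# First value in the 'Mtapaiasaurus' row; .iloc[0] replaces the deprecated positional [0] lookup.
code = dinos.loc['Mtapaiasaurus'].iloc[0]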

In [ ]:
code

In [ ]:
len(code)

In [ ]:
dinos.info()

In [ ]:
int(code, base=16)

In [ ]:
0xafe4c2b017ed3996bf5f4f3b937f0ae22e649df2f620787e136ed6bd3ea32e2d
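The last two cells show the same number written two ways. A short sketch of the round trip between the hex string and the integer, using the literal from the cell above (the variable names are made up here):

```python
# The 64-character hex string, copied from the literal above.
code_hex = "afe4c2b017ed3996bf5f4f3b937f0ae22e649df2f620787e136ed6bd3ea32e2d"

# Parsing the string in base 16 gives the same integer as the 0x literal.
as_int = int(code_hex, base=16)
literal = 0xafe4c2b017ed3996bf5f4f3b937f0ae22e649df2f620787e136ed6bd3ea32e2d
print(as_int == literal)

# 64 hex digits hold at most 256 bits, hence the name sha256.
print(as_int.bit_length())

# Formatting the integer as zero-padded hex recovers the string.
print(format(as_int, "064x") == code_hex)
```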

LAB

Add a column to dinos that contains the decimal equivalent of the sha256 hash. Hint: int(code, base=16), used above, converts a single hash.
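One possible approach, sketched on a toy frame rather than on dinos itself. The column name decimal is a made-up choice, and the two names are stand-ins. These integers are far too large for a 64-bit integer column, so expect pandas to fall back to storing them as plain Python objects.

```python
import hashlib

import pandas as pd

# Toy stand-in for dinos: names as the index, hex digests in "sha256".
names = ["Allosaurus", "Triceratops"]
toy = pd.DataFrame(
    {"sha256": [hashlib.sha256(n.encode("utf-8")).hexdigest() for n in names]},
    index=names,
)

# Convert each hex string to a Python int and store it in a new column.
toy["decimal"] = [int(h, 16) for h in toy["sha256"]]

print(toy["decimal"])
```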


In [ ]:

LAB

Sort dinos by the sha256 column. Since the hashes are stored as strings, this will be an alphabetical sort.


In [ ]:

How about in descending order?
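If you get stuck, here is a sketch of both directions on a toy frame with invented two-character values. One thing worth noticing: for lowercase hex strings of equal length, alphabetical order and numeric order agree, because the digits 0-9 sort before the letters a-f.

```python
import pandas as pd

# Toy frame; the short hex strings stand in for full digests.
toy = pd.DataFrame(
    {"sha256": ["9f", "0a", "b3"]},
    index=["Allosaurus", "Minmi", "Triceratops"],
)

# sort_values returns a new frame; ascending is the default.
ascending = toy.sort_values("sha256")
descending = toy.sort_values("sha256", ascending=False)

print(list(ascending.index))
print(list(descending.index))
```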


In [ ]: