PYT-DS: Harvesting Data

We're used to reading JSON and CSV files over the internet with Pandas. However, if you control a server, there's no reason scripts such as the one below can't fetch everything for you under the hood, using URL requests.

By the time the data surfaces in the Notebook, it's already an up-to-date DataFrame, sorted and massaged. I'm exposing the pipeline here, but it's easy to imagine the Notebook starting around the last cell, with the harvesting already done behind the scenes.

As a data scientist, your role may be as much about storing data for convenient access, in a usable form, as it is about end-user analysis of that data. At the end of the day, your role may be part DBA (database administrator). That's not a bad thing.


In [ ]:
import pandas as pd
import numpy as np

In [ ]:
"""
Created on Thu Jun 29 10:17:02 2017
Rewritten for get_data package on Oct 25, 2017

@author: Kirby Urner

Decorated generator used IN PLACE OF:
    
class Url:
    
    def __init__(self, the_url):
        self.url = the_url
        
    def __enter__(self):
        self.rq = urlopen(self.url)
        return self.rq
    
    def __exit__(self, *oops):
        if oops[0]:
            print("Failed to connect")
            return False
        self.rq.close()
        return True
"""

from urllib.request import urlopen
import json
from contextlib import contextmanager

PREFIX = "http://thekirbster.pythonanywhere.com/"

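@contextmanager
def url(target):
    """Open target, yield the response, and always close it afterwards."""
    try:
        rq = urlopen(target)
    except OSError:  # URLError and HTTPError are subclasses of OSError
        print("Failed to connect")
        raise
    try:
        yield rq
    finally:
        rq.close()  # mirrors the close in Url.__exit__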
           
def get_chems():
    """
    Get the element data from the web using API
    
    Typical record:
    [1, "H", "Hydrogen", 1.008, "diatomic nonmetal", 1498013115, "KTU"]
    """
    global chems
    with url(PREFIX + "api/elements?elem=all") as httpreq:
        data = json.loads(httpreq.read())  # getting JSON data
        chems = pd.DataFrame(data)

get_chems()
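
The docstring above contrasts a class with `__enter__` and `__exit__` against a decorated generator. Here is a minimal sketch of how the two line up, with no network involved. `FakeResponse` and `fake_url` are hypothetical names made up for this illustration; closing in a `finally` clause is one way to guarantee cleanup, not the only one.

```python
from contextlib import contextmanager

class FakeResponse:
    """Hypothetical stand-in for the object urlopen returns."""

    def __init__(self):
        self.closed = False

    def read(self):
        return b'{"H": [1, "H", "Hydrogen"]}'

    def close(self):
        self.closed = True

@contextmanager
def fake_url(target):
    rq = FakeResponse()   # code before the yield plays the role of __enter__
    try:
        yield rq          # the body of the with block runs here
    finally:
        rq.close()        # plays the role of __exit__, runs even on error

with fake_url("http://example.invalid/") as resp:
    payload = resp.read()

print(resp.closed)
```

Whatever follows `as` in the `with` statement is the object the generator yields, just as it would be the return value of `__enter__` in the class version.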

In [ ]:
chems.head()

This is not quite how we want to see the data. We need to flip it, or swap axes. Transpose will do. Then let's change the column names. Finally, we'll sort.
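
To see the flip in isolation, here is a small sketch on a hand-made sample. The `sample` dict is hypothetical and shortened to four fields per element. It assumes the parsed JSON is a dict keyed by symbol, which is a guess about the API's shape rather than something confirmed here.

```python
import pandas as pd

# Hypothetical, shortened sample; the real API records carry more fields.
sample = {
    "Li": [3, "Li", "Lithium", 6.94],
    "H": [1, "H", "Hydrogen", 1.008],
    "He": [2, "He", "Helium", 4.0026],
}

raw = pd.DataFrame(sample)   # dict keys become columns: one element per column
flipped = raw.T              # transpose: one element per row
flipped.columns = ["Protons", "Symbol", "Name", "Mass"]
by_protons = flipped.sort_values("Protons")  # returns a new, sorted DataFrame

print(raw.shape, flipped.shape)
print(list(by_protons.index))
```

Note that `sort_values` does not change `flipped` itself, so the result has to be kept under a name.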


In [ ]:
chems = chems.T
chems.columns = ["Protons", "Symbol", "Name", "Mass", "Category", "Changed", "Initials"]

In [ ]:
chems = chems.sort_values("Protons")  # sort_values returns a new DataFrame, so reassign
chems.head()

In [ ]:
chems = chems.sort_index()  # let's get this in index order

In [ ]:
chems.head()
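
One more bit of massaging that may be worth doing at this point: when a frame built from mixed-type records is transposed, the columns typically come out with the generic object dtype, even the numeric ones. The sketch below uses a hypothetical two-element sample and `pd.to_numeric` to convert them; whether the real data needs this depends on what the API returns.

```python
import pandas as pd

# Hypothetical sample: each record mixes ints, strings and floats.
sample = {"H": [1, "H", 1.008], "He": [2, "He", 4.0026]}

frame = pd.DataFrame(sample).T
frame.columns = ["Protons", "Symbol", "Mass"]
print(frame.dtypes)   # inspect the dtypes before conversion

# Convert the numeric columns explicitly.
frame["Protons"] = pd.to_numeric(frame["Protons"])
frame["Mass"] = pd.to_numeric(frame["Mass"])
print(frame.dtypes)   # Protons and Mass are now numeric
```

With numeric dtypes in place, aggregations such as `frame["Mass"].mean()` behave as expected.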

LAB:

Now that you have the whole periodic table, why not try some of your tricks? How many categories are there, and how many elements are in each category? Remember GroupBy? What other tricks (like magic spells) do you remember?
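
As a hint, here is a minimal sketch of the GroupBy pattern on a toy frame. The `toy` data is a hand-made sample with only the two columns needed, so the counts say nothing about the real table; try the same calls on `chems`.

```python
import pandas as pd

# Hand-made toy data, not the harvested table.
toy = pd.DataFrame({
    "Symbol": ["H", "He", "Li", "Ne", "Na"],
    "Category": ["diatomic nonmetal", "noble gas", "alkali metal",
                 "noble gas", "alkali metal"],
})

counts = toy.groupby("Category")["Symbol"].count()  # elements per category
n_categories = toy["Category"].nunique()            # number of distinct categories

print(counts)
print(n_categories)
```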