We're used to reading JSON and CSV files over the internet with pandas. If you control a server, though, there's no reason scripts such as the one below can't fetch everything for you under the hood, using URL requests.
By the time the data surfaces in the Notebook, it's already an up-to-date DataFrame, sorted and massaged. I'm exposing the pipeline here, but it's easy to imagine the Notebook starting around the last cell, with the job of harvesting the data already done behind the scenes.
As a data scientist, your role may be as much about storing data for convenient access, in a usable form, as it is about end user analysis of said data. Your role may be part DBA (database administrator) at the end of the day. That's not a bad thing.
In [ ]:
import pandas as pd
import numpy as np
In [ ]:
"""
Created on Thu Jun 29 10:17:02 2017
Rewritten for get_data package on Oct 25, 2017
@author: Kirby Urner
Decorated generator used IN PLACE OF:
class Url:
    def __init__(self, the_url):
        self.url = the_url
    def __enter__(self):
        self.rq = urlopen(self.url)
        return self.rq
    def __exit__(self, *oops):
        if oops[0]:
            print("Failed to connect")
            return False
        self.rq.close()
        return True
"""
from urllib.request import urlopen
import json
from contextlib import contextmanager
PREFIX = "http://thekirbster.pythonanywhere.com/"
@contextmanager
def url(target):
    """Yield an open response for target; close it when the block exits."""
    try:
        with urlopen(target) as response:
            yield response
    except Exception:
        print("Failed to connect")
        raise


def get_chems():
    """
    Get the element data from the web using API
    Typical record:
    [1, "H", "Hydrogen", 1.008, "diatomic nonmetal", 1498013115, "KTU"]
    """
    global chems  # later cells read this module-level name
    with url(PREFIX + "api/elements?elem=all") as httpreq:
        data = json.loads(httpreq.read())  # getting JSON data
    chems = pd.DataFrame(data)


get_chems()
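To try the fetch step without touching the server, here is a minimal sketch that points the same kind of decorated generator at a `file://` URL. The helper name `fetch_frame`, the temporary file, and the Helium record are illustrative assumptions, as is the guess that the API returns a mapping from symbol to record. Unlike `get_chems`, this version returns the DataFrame instead of setting a global.

```python
import json
import tempfile
from contextlib import contextmanager
from pathlib import Path
from urllib.request import urlopen

import pandas as pd


@contextmanager
def url(target):
    """Yield an open response for target and close it afterwards."""
    with urlopen(target) as response:
        yield response


def fetch_frame(target):
    """Hypothetical helper: read JSON from target, return a DataFrame."""
    with url(target) as httpreq:
        return pd.DataFrame(json.loads(httpreq.read()))


# Hypothetical sample, ASSUMED to mirror the API's shape: symbol -> record.
sample = {
    "H": [1, "H", "Hydrogen", 1.008, "diatomic nonmetal", 1498013115, "KTU"],
    "He": [2, "He", "Helium", 4.0026, "noble gas", 1498013115, "KTU"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "elements.json"
    path.write_text(json.dumps(sample))
    demo = fetch_frame(path.as_uri())  # file:// URL, so no network is needed

print(demo.shape)
```

With this input shape the symbols come back as columns and the seven record fields as rows, which is why a transpose is the natural next step.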
In [ ]:
chems.head()
This is not quite how we want to see the data. We need to flip it, swapping the axes so that each element gets its own row. Transpose will do. Then let's change the column names. Finally, we'll sort.
In [ ]:
chems = chems.T
chems.columns=["Protons", "Symbol", "Name", "Mass", "Category", "Changed", "Initials"]
In [ ]:
chems = chems.sort_values("Protons")  # sort_values returns a new frame, so keep the result
chems.head()
In [ ]:
chems = chems.sort_index()  # let's get this in index order
In [ ]:
chems.head()
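The transpose, rename, and sort steps can also be tried offline. The following is a sketch under stated assumptions, not the notebook's actual pipeline: the `raw` sample (including its Helium record), the `tidy` helper, and the symbol-keyed input shape are all hypothetical. It also casts two columns to numeric types, since a transposed frame of mixed records ends up with `object` columns.

```python
import pandas as pd

# Hypothetical sample; ASSUMED shape is symbol -> record, deliberately out of order.
raw = {
    "He": [2, "He", "Helium", 4.0026, "noble gas", 1498013115, "KTU"],
    "H": [1, "H", "Hydrogen", 1.008, "diatomic nonmetal", 1498013115, "KTU"],
}
COLUMNS = ["Protons", "Symbol", "Name", "Mass", "Category", "Changed", "Initials"]


def tidy(data):
    """Hypothetical helper: transpose, name the columns, cast, and sort."""
    frame = pd.DataFrame(data).T  # one row per element
    frame.columns = COLUMNS
    # Transposing mixed records leaves object columns, so cast the numeric ones.
    frame = frame.astype({"Protons": int, "Mass": float})
    return frame.sort_values("Protons")  # returns a new, sorted frame


elements = tidy(raw)
print(elements[["Protons", "Symbol", "Mass"]])
```

Because `sort_values` does not sort in place by default, the helper returns its result rather than relying on a side effect.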