We're used to reading JSON and CSV files over the internet, using Pandas. However, if you have control of a server, there's no reason you can't make scripts such as the one below fetch everything for you under the hood, using URL requests.
By the time the data surfaces in the Notebook, it's already an up-to-date DataFrame, sorted and massaged. I'm exposing the pipeline here, but it's easy to imagine the Notebook actually starting around the last cell, with the job of harvesting data already done behind the scenes.
As a data scientist, your role may be as much about storing data for convenient access, in a usable form, as it is about end-user analysis of that data. Your role may be part DBA (database administrator) at the end of the day. That's not a bad thing.
In [ ]:
import pandas as pd
import numpy as np
In [ ]:
"""
Created on Thu Jun 29 10:17:02 2017
Rewritten for get_data package on Oct 25, 2017

@author: Kirby Urner

Decorated generator used IN PLACE OF:

class Url:

    def __init__(self, the_url):
        self.url = the_url

    def __enter__(self):
        self.rq = urlopen(self.url)
        return self.rq

    def __exit__(self, *oops):
        if oops:
            print("Failed to connect")
            return False
        self.rq.close()
        return True
"""

from urllib.request import urlopen
import json
from contextlib import contextmanager

PREFIX = "http://thekirbster.pythonanywhere.com/"

@contextmanager
def url(target):
    try:
        yield urlopen(target)
    except:
        print("Failed to connect")
        raise

def get_chems():
    """
    Get the element data from the web using API
    Typical record:
    [1, "H", "Hydrogen", 1.008, "diatomic nonmetal", 1498013115, "KTU"]
    """
    global chems
    with url(PREFIX + "api/elements?elem=all") as httpreq:
        data = json.loads(httpreq.read())  # getting JSON data
    chems = pd.DataFrame(data)

get_chems()
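If you're curious how a decorated generator can stand in for a class with `__enter__` and `__exit__`, here's a minimal offline sketch. `FakeResponse` and `open_resource` are hypothetical names I'm using in place of `urlopen` and its response object, so nothing here touches the network. Unlike `url` above, this version also closes the resource in a `finally` clause, which is one possible refinement rather than what the original does.

```python
from contextlib import contextmanager

class FakeResponse:
    """Hypothetical stand-in for the object urlopen returns."""

    def __init__(self, payload):
        self.payload = payload
        self.closed = False

    def read(self):
        return self.payload

    def close(self):
        self.closed = True

@contextmanager
def open_resource(payload):
    resp = FakeResponse(payload)    # plays the role of urlopen(target)
    try:
        yield resp                  # the body of the with-block runs here
    except Exception:
        print("Failed inside the with-block")
        raise                       # re-raise so the caller still sees the error
    finally:
        resp.close()                # runs on success and on failure

with open_resource(b'{"H": 1}') as resp:
    raw = resp.read()

print(raw, resp.closed)
```

Everything before the `yield` behaves like `__enter__`, and everything after it (the `except` and `finally` clauses) behaves like `__exit__`.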
In [ ]:chems.head()
This is not quite how we want to see the data. We need to flip it, or swap axes. Transpose will do. Then let's change the column names. Finally, we'll sort.
In [ ]:
chems = chems.T
chems.columns = ["Protons", "Symbol", "Name", "Mass", "Category", "Changed", "Initials"]
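To see why the transpose is needed without hitting the server, here's a small sketch with a made-up payload. I'm assuming the API returns a mapping from a key (here the element symbol) to a record shaped like the typical one in the docstring. The hydrogen record is the one quoted there; the helium row, including its timestamp and initials, is invented for illustration.

```python
import pandas as pd

# Assumed payload shape: {key: record}. The helium row is illustrative only.
data = {
    "H":  [1, "H", "Hydrogen", 1.008, "diatomic nonmetal", 1498013115, "KTU"],
    "He": [2, "He", "Helium", 4.0026, "noble gas", 1498013115, "KTU"],
}

raw = pd.DataFrame(data)   # one column per element, one row per field
tidy = raw.T               # one row per element, one column per field
tidy.columns = ["Protons", "Symbol", "Name", "Mass",
                "Category", "Changed", "Initials"]

print(raw.shape)    # fields x elements
print(tidy.shape)   # elements x fields
print(tidy)
```

A dict of lists gives pandas one column per key, which is why the raw frame comes out sideways and the transpose puts each element on its own row.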
In [ ]:
chems = chems.sort_values("Protons")  # sort_values returns a new DataFrame, so keep the result
chems.head()
In [ ]:chems = chems.sort_index()  # let's get this in index order
In [ ]:chems.head()
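One thing to watch when sorting: `sort_values` and `sort_index` both return a new DataFrame rather than sorting in place, so the result has to be assigned to a name to have any effect. The sketch below uses a tiny hypothetical frame, not the real element data, to show the difference between sorting by a column and sorting by the index.

```python
import pandas as pd

# Tiny hypothetical frame, deliberately out of order
df = pd.DataFrame(
    {"Protons": [6, 1, 5], "Symbol": ["C", "H", "B"]},
    index=["C", "H", "B"],
)

df.sort_values("Protons")                # result is discarded; df is unchanged
unchanged = list(df["Protons"])

by_protons = df.sort_values("Protons")   # keep the sorted copy: H, B, C
by_index = df.sort_index()               # alphabetical by index label: B, C, H

print(unchanged)
print(list(by_protons["Symbol"]))
print(list(by_index.index))
```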