NTDS'17 demo 6: web APIs & data analysis with pandas

1 Web API

We already used the Twitter web API, albeit through a nice Python wrapper library. This time, we'll talk to an API directly.

1.1 API doc

The first step is always to look at the API documentation. Two questions to answer: i) how to construct the uniform resource locator (URL), and ii) how to interpret the returned data. For this tutorial and the third assignment: https://freemusicarchive.org/api. (Some of you might have referred to the Twitter API doc for the first assignment.)

Start simple. Open the following URL in your web browser, which sends an HTTP GET request to the server: https://freemusicarchive.org/recent.json.

1.2 API calls

A well-designed HTTP library for Python is requests.



In [ ]:

    
import requests

Send an HTTP GET request, as our browser did above, and receive a response from the web server.



In [ ]:

    
URL = 'https://freemusicarchive.org/recent.json'
response = requests.get(URL)

If the GET request worked, the server answers with a "200 OK", the standard response for successful HTTP requests. A failed request returns an error code instead, e.g. the infamous "404 Not Found".



In [ ]:

    
print(response.status_code)
print(requests.get('https://www.epfl.ch/do_not_exist').status_code)
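
To make the meaning of these numbers concrete, here is a minimal sketch that needs no network access. It uses the standard library's http.HTTPStatus to look up the reason phrase of a few codes. The helper is_success is a hypothetical name introduced only for this illustration.

```python
from http import HTTPStatus

# Look up the standard reason phrase attached to a few status codes.
for code in (200, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)


def is_success(code):
    """Hypothetical helper: codes in the 2xx range signal success."""
    return 200 <= code < 300


print(is_success(200), is_success(404))
```

With requests, a similar check is available through response.ok, and response.raise_for_status() raises an exception when the server returned an error code.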

1.3 Exercise

As is often the case, certain operations require an API key. Add the following to your credentials.ini file.

[freemusicarchive]
api_key = MY-KEY



In [ ]:

    
# Read the confidential api key.
import configparser
import os
credentials = configparser.ConfigParser()
credentials.read(os.path.join('..', 'credentials.ini'))
api_key = credentials.get('freemusicarchive', 'api_key')
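
Note that ConfigParser.read() silently skips files it cannot find, so a wrong path only shows up later as a missing section. The sketch below illustrates the lookup on an in-memory string instead of a file. INI_TEXT, api_key_demo and the absent twitter section are assumptions made for the illustration, and MY-KEY is a placeholder.

```python
import configparser

# Stand-in for the content of credentials.ini; MY-KEY is a placeholder.
INI_TEXT = """
[freemusicarchive]
api_key = MY-KEY
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

# fallback avoids an exception when the section or option is missing.
api_key_demo = parser.get('freemusicarchive', 'api_key', fallback=None)
missing = parser.get('twitter', 'api_key', fallback=None)
print(api_key_demo, missing)
```

Without the fallback argument, the second lookup would raise configparser.NoSectionError.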

Find the name of the artist whose ID is 58.



In [ ]:

    
ARTIST_ID = 58

BASE_URL = 'https://freemusicarchive.org/api/get/artists.json'
url = '{}?artist_id={}&api_key={}'.format(BASE_URL, ARTIST_ID, api_key)
print(url)
requests.get(url).content
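
Formatting the query string by hand works for simple values. As a hedged alternative, the sketch below builds the same URL with the standard library's urlencode, which also escapes special characters. The names BASE, params and demo_url are chosen for this illustration, and MY-KEY is a placeholder rather than a valid key.

```python
from urllib.parse import urlencode

BASE = 'https://freemusicarchive.org/api/get/artists.json'
params = {'artist_id': 58, 'api_key': 'MY-KEY'}  # placeholder key

# urlencode joins the pairs with '&' and percent-escapes special characters.
demo_url = '{}?{}'.format(BASE, urlencode(params))
print(demo_url)
```

requests can do this step itself: requests.get(BASE, params=params) appends the encoded parameters to the URL.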

2 JSON

The goal of an HTTP GET request is to get data. The returned data might be HTML (as you see when you browse the web), XML, JSON, etc. Most web APIs nowadays return data formatted as JSON. As JSON objects consist of key-value pairs and lists, the format maps naturally to Python dictionaries and lists.



In [ ]:

    
data = response.json()

The above call to json() interprets the returned data as being JSON and constructs Python dictionary and list objects out of it. In this case the top-level object is a dictionary, with some keys.



In [ ]:

    
print(type(data))
print(data.keys())
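
The mapping from JSON to Python objects can also be observed offline with the standard json module. The string below is hand-written for this sketch: its keys mimic the ones used in this section, but the values are made up and are not real API output.

```python
import json

# Hand-written sample; the keys mimic the response but the values are made up.
text = '{"title": "Recent tracks", "aTracks": [{"track_title": "A"}, {"track_title": "B"}]}'
parsed = json.loads(text)

print(type(parsed))             # a JSON object becomes a dict
print(type(parsed['aTracks']))  # a JSON array becomes a list
print([t['track_title'] for t in parsed['aTracks']])
```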

Let's look at the value of the "title" key.



In [ ]:

    
data['title']

Exploring the returned data is a good way to learn about the API. Let's get to what we were looking for, a list of recently added tracks.



In [ ]:

    
print(type(data['aTracks']))
print(data['aTracks'][0].keys())



In [ ]:

    
for track in data['aTracks'][:5]:
    print(track['track_title'])

2.1 Exercise

Construct a list of the names of the 16 top-level genres. No need to call the API again, everything is in the above collected JSON data.



In [ ]:

    
genres = [genre['genre_title'] for genre in data['nav_genres']]

assert type(genres) is list
print(genres)

3 Pandas: data analysis in Python

While it might be sufficient to keep the data as lists and dictionaries, we often prefer to see data in a tabular format for analysis. A tabular format allows operations on the rows and columns, e.g. taking a sum over prices. At a large scale, tabular data is stored in a database (think of a company's list of clients). At a small scale, you have probably used it in the form of an Excel spreadsheet. In Python, pandas is the most widely used data analysis tool. You can think of it as a programmable spreadsheet.

Let's first create a simple table, called a DataFrame in pandas' language. We can initialize the table with e.g. a Python list or a NumPy array. As our running example, let's say we want to do some accounting for our family and define the following schema: each row is a member of the family, the first column represents the revenue and the second the expenses. Sure enough, we can create a NumPy array.



In [ ]:

    
import numpy as np

accounts = np.array([[10, 20], [30, 30], [40, 20]])
print(accounts)

But that's not very user-friendly. Who is the second row again? Enter pandas.



In [ ]:

    
import pandas as pd

accounts = pd.DataFrame(accounts)
accounts

But this is not much more useful than our NumPy array. Let's name the rows and columns.



In [ ]:

    
accounts.columns = ['revenues', 'expenditures']
accounts.index = ['John', 'Mary', 'Alison']
accounts.index.name = 'given name'
accounts
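
The labels can also be passed when the table is created. The following sketch builds an equivalent table in a single step; the variable name family is chosen for this illustration so as not to overwrite accounts.

```python
import pandas as pd

# Same numbers as above, with row and column labels given at construction.
family = pd.DataFrame(
    [[10, 20], [30, 30], [40, 20]],
    index=pd.Index(['John', 'Mary', 'Alison'], name='given name'),
    columns=['revenues', 'expenditures'],
)
print(family)
```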

Now if I want to know how much Alison spent this month, I don't have to remember that Alison is the third row and that the expenditures are the second column. I can query:



In [ ]:

    
accounts.at['Alison', 'expenditures']

We may want to compute the revenue of the entire family (note the similarity with the way you would do it in a spreadsheet):



In [ ]:

    
accounts['revenues'].sum()

Or the balance of each member:



In [ ]:

    
accounts['balance'] = accounts['revenues'] - accounts['expenditures']
accounts

Another quite useful feature is selection:



In [ ]:

    
accounts[accounts['balance'] < 0]
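
Selection works in two steps: the comparison produces one boolean per row, and indexing with that boolean series keeps the rows marked True. A self-contained sketch, which rebuilds the table under the illustrative name demo:

```python
import pandas as pd

demo = pd.DataFrame(
    {'revenues': [10, 30, 40], 'expenditures': [20, 30, 20]},
    index=['John', 'Mary', 'Alison'],
)
demo['balance'] = demo['revenues'] - demo['expenditures']

# The comparison yields one boolean per row...
mask = demo['balance'] < 0
print(mask)

# ...and indexing with that mask keeps only the rows where it is True.
in_debt = demo[mask]
print(in_debt)
```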

Or sorting:



In [ ]:

    
accounts.sort_values('expenditures')

Now it's time to save our data for archival, or to open it up in another tool.



In [ ]:

    
accounts.to_csv(os.path.join('..', 'data', 'family_accounts.csv'))



In [ ]:

    
!cat ../data/family_accounts.csv
# Windows: !type ..\data\family_accounts.csv
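
Reading the data back is the mirror operation. The sketch below performs a round trip through an in-memory buffer rather than a file on disk, so it runs without the data directory. The names table, buffer and restored are illustrative.

```python
import io

import pandas as pd

table = pd.DataFrame(
    {'revenues': [10, 30, 40], 'expenditures': [20, 30, 20]},
    index=pd.Index(['John', 'Mary', 'Alison'], name='given name'),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
table.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)

# index_col tells pandas which column holds the row labels.
restored = pd.read_csv(io.StringIO(csv_text), index_col='given name')
print(restored)
```

Without index_col, the given names would come back as an ordinary column and the rows would be labeled 0, 1, 2.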

These are very basic operations to give you an idea of what pandas is. More info in the docs. That library will certainly be useful for your projects.

3.1 Exercise

Using pandas and the above data (i.e. data['aTracks']), find how many tracks each artist published.



In [ ]:

    
tracks = pd.DataFrame(data['aTracks'])
assert type(tracks) is pd.DataFrame
tracks.head()



In [ ]:

    
tracks['artist_name'].value_counts()
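
To see what value_counts() computes without calling the API, here is a sketch on a few made-up records shaped like the entries of data['aTracks']. The artist and track names are invented for the illustration. Grouping by artist and taking the size of each group gives the same counts.

```python
import pandas as pd

# Made-up records shaped like data['aTracks']; not real API output.
sample = pd.DataFrame([
    {'artist_name': 'Artist A', 'track_title': 'One'},
    {'artist_name': 'Artist B', 'track_title': 'Two'},
    {'artist_name': 'Artist A', 'track_title': 'Three'},
])

# Count how many rows carry each artist name, most frequent first.
counts = sample['artist_name'].value_counts()

# Equivalent counts, obtained by grouping rows per artist.
by_group = sample.groupby('artist_name').size()

print(counts)
print(by_group)
```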