.. _api_tutorial:

Using the OpenAQ API
====================

The openaq package is an easy-to-use Python wrapper around the OpenAQ API. Complete API documentation can be found on the OpenAQ website.

There are no keys or rate limits (as of March 2017), so working with the API is straightforward. If you are building a website or app, you may want to just use the Python wrapper and interact with the data in JSON format. However, the rest of this tutorial will assume you are interested in analyzing the data. To get more out of it, I recommend installing seaborn for tweaking the aesthetics of plots, and working with the data as DataFrames using pandas. For more information on these, check out the installation section of this documentation.

From this point forward, I assume you have at least a basic knowledge of Python and matplotlib. This documentation was built using the following versions of each package:


In [ ]:
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import openaq
import warnings

warnings.simplefilter('ignore')

%matplotlib inline

# Set major seaborn aesthetics
sns.set("notebook", style='ticks', font_scale=1.0)

# Increase the quality of inline plots
mpl.rcParams['figure.dpi'] = 500

print ("pandas v{}".format(pd.__version__))
print ("matplotlib v{}".format(mpl.__version__))
print ("seaborn v{}".format(sns.__version__))
print ("openaq v{}".format(openaq.__version__))

OpenAQ API
----------

The OpenAQ API has just eight endpoints that we are interested in:

  • cities: provides a simple listing of cities within the platform
  • countries: provides a simple listing of countries within the platform
  • fetches: provides data about the individual fetch operations that are used to populate data in the platform
  • latest: provides the latest value of each available parameter for every location in the system
  • locations: provides a list of measurement locations and their meta data
  • measurements: provides data about individual measurements
  • parameters: provides a simple listing of parameters within the platform
  • sources: provides a list of data sources

For detailed documentation about each one in the context of this API wrapper, please check out the API documentation.

Your First Request
------------------

Real quick, let's go ahead and initialize an instance of the openaq.OpenAQ class so we can begin looking at data:


In [ ]:
api = openaq.OpenAQ()

Cities
------

The cities API endpoint lists the cities available within the platform. Results can be subset by country and paginated to retrieve every entry in the database. Let's start by performing a basic query with an increased limit (so we can get all of them) and return the results as a DataFrame:


In [ ]:
resp = api.cities(df=True, limit=10000)

# Display a summary of the DataFrame
resp.info()

So we retrieved 1400+ entries from the database. We can then take a look at them:


In [ ]:
print (resp.head(10))

Let's try to find out which ones are in India:


In [ ]:
print (resp.query("country == 'IN'"))

Great! For the rest of the tutorial, we are going to focus on Delhi, India. Why? Well... because there are over 500,000 data points and my personal research is primarily in India. We will also take a look at some $SO_2$ data from Hawai'i later on (another great research locale).

Countries
---------

Similar to the cities endpoint, the countries endpoint lists the countries available. The only parameters we have to play with are the limit and page number. If we want to grab them all, we can just up the limit to the maximum (10000).


In [ ]:
res = api.countries(limit=10000, df=True)

print (res.head())
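The limit/page scheme works the same way on every endpoint: limit caps the number of results per request, and page (1-indexed) selects which slice comes back. Here is a minimal sketch of that logic using a plain Python list in place of a live endpoint — the `paged` helper below is purely illustrative, not part of openaq:

```python
def paged(records, limit, page):
    """Return one page of results, mimicking the API's limit/page scheme.

    Pages are 1-indexed, as in the OpenAQ API.
    """
    start = (page - 1) * limit
    return records[start:start + limit]

records = list(range(250))  # stand-in for 250 country entries

# Walk every page until a short (or empty) page signals the end
collected, page = [], 1
while True:
    batch = paged(records, limit=100, page=page)
    collected.extend(batch)
    if len(batch) < 100:
        break
    page += 1

print(len(collected))  # all 250 records gathered across 3 pages
```

A real pagination loop against the API would pass the same limit and page arguments to, e.g., api.countries() until a short page comes back.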

Fetches
-------

If you are interested in getting information pertaining to the individual data fetch operations, go ahead and use this endpoint. Most people won't need to use this. This API method does not allow the df parameter; if you would like it to be added, drop me a message.

Otherwise, here is how you can access the json-formatted data:


In [ ]:
status, resp = api.fetches(limit=1)

# Print out the meta info
resp['meta']

Parameters
----------

The parameters endpoint will provide a listing of all the parameters available:


In [ ]:
res = api.parameters(df=True)

print (res)

Sources
-------

The sources endpoint will provide a list of the sources where the raw data came from.


In [ ]:
res = api.sources(df=True)

# Print out the first one
res.iloc[0]

Locations
---------

The locations endpoint will return the list of measurement locations and their meta data. We can do quite a bit of querying with this one:

Let's see what the data looks like:


In [ ]:
res = api.locations(df=True)

res.info()

In [ ]:
# print out the first one
res.iloc[0]

What if we just want to grab the locations in Delhi?


In [ ]:
res = api.locations(city='Delhi', df=True)


res.iloc[0]

What about just figuring out which locations in Delhi have $PM_{2.5}$ data?


In [ ]:
res = api.locations(city='Delhi', parameter='pm25', df=True)

res.iloc[0]

Latest
------

Grab the latest data from a location or locations.

What was the most recent $PM_{2.5}$ data in Delhi?


In [ ]:
res = api.latest(city='Delhi', parameter='pm25', df=True)

res.head()

What about the most recent $SO_2$ data in Hawai'i?


In [ ]:
res = api.latest(city='Hilo', parameter='so2', df=True)

res

Measurements
------------

Finally, the endpoint we've all been waiting for! Measurements allows you to grab all of the data! You can query on a whole bunch of parameters listed in the API documentation. Let's dive in:

Let's grab the past 10000 data points for $PM_{2.5}$ in Delhi:


In [ ]:
res = api.measurements(city='Delhi', parameter='pm25', limit=10000, df=True)

# Print out the statistics on a per-location basis
res.groupby(['location'])['value'].describe()

Clearly, we should be doing some serious data cleaning ;) Why don't we go ahead and plot all of these locations on one figure?
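To see why that cleaning matters, here is a small self-contained example — the numbers are made up, and -999 is a hypothetical instrument error flag — showing how a few sentinel values wreck a per-location mean, and how the same `query("value >= 0.0")` filter used in the plotting code removes them:

```python
import pandas as pd

# Made-up readings for two locations; -999 is a hypothetical error flag
df = pd.DataFrame({
    'location': ['A'] * 5 + ['B'] * 5,
    'value': [80, 95, -999, 110, 102, 45, -999, 50, 48, 52],
})

# The sentinel values drag the raw means far below zero...
print(df.groupby('location')['value'].mean())

# ...while keeping only physically meaningful values fixes them
clean = df.query("value >= 0.0")
print(clean.groupby('location')['value'].mean())
```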


In [ ]:
fig, ax = plt.subplots(1, figsize=(10, 6))

for group, df in res.groupby('location'):
    # Query the data to only get positive values and resample to hourly
    _df = df.query("value >= 0.0").resample('1h').mean()
    
    _df.value.plot(ax=ax, label=group)
    
ax.set_ylabel(r"$PM_{2.5} \; [\mu g m^{-3}]$", fontsize=20)
ax.set_xlabel("")
sns.despine(offset=5)

# Place the legend outside the axes so it doesn't cover the data
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

plt.show()

Don't worry too much about how ugly and uninteresting the plot above is...we'll take care of that in the next tutorial! Let's go ahead and look at the distribution of $PM_{2.5}$ values seen in Delhi by various sensors. This is the same data as above, but viewed in a different way.


In [ ]:
fig, ax = plt.subplots(1, figsize=(14,7))

ax = sns.boxplot(
    x='location', 
    y='value', 
    data=res.query("value >= 0.0"), 
    fliersize=0, 
    palette='deep',
    ax=ax)

ax.set_ylim([0, 750])
ax.set_ylabel(r"$PM_{2.5} \; [\mu g m^{-3}]$", fontsize=18)
ax.set_xlabel("")

sns.despine(offset=10)

plt.xticks(rotation=90)
plt.show()

If we remember from above, there was at least one location where many parameters were measured. Let's go ahead and look at that location and see if there is any correlation among parameters!


In [ ]:
res = api.measurements(city='Delhi', location='Anand Vihar', limit=1000, df=True)

# Which params do we have?
res.parameter.unique()

In [ ]:
df = pd.DataFrame()

for u in res.parameter.unique():
    _df = res[res['parameter'] == u][['value']]
    _df.columns = [u]
    
    # Merge the dataframes together
    df = pd.merge(df, _df, left_index=True, right_index=True, how='outer')

# Keep only rows where every parameter has a value
df.dropna(how='any', inplace=True)

g = sns.PairGrid(df, diag_sharey=False)

g.map_lower(sns.kdeplot, cmap='Blues_d')
g.map_upper(plt.scatter)
g.map_diag(sns.kdeplot, lw=3)

plt.show()
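Incidentally, the merge loop above can be collapsed into a single `pivot_table` call. A quick sketch on synthetic data — the column names mirror a measurements response, but the values are invented:

```python
import pandas as pd

# Synthetic long-format data shaped like a measurements response
res = pd.DataFrame({
    'date.utc': pd.to_datetime(['2017-03-01'] * 3 + ['2017-03-02'] * 3),
    'parameter': ['pm25', 'so2', 'no2'] * 2,
    'value': [100.0, 9.0, 40.0, 120.0, 11.0, 38.0],
})

# One row per timestamp, one column per parameter
wide = res.pivot_table(index='date.utc', columns='parameter', values='value')
print(wide)
```

One caveat: unlike the merge loop, pivot_table silently averages duplicate (timestamp, parameter) pairs, and a dropna(how='any') afterwards gives the same complete-rows behavior as above.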

For kicks, let's go ahead and look at a timeseries of $SO_2$ data in Hawai'i. Quiz: What do you expect? Did you know that Hawai'i has a huge $SO_2$ problem?


In [ ]:
res = api.measurements(city='Hilo', parameter='so2', limit=10000, df=True)

# Print out the statistics on a per-location basis
res.groupby(['location'])['value'].describe()

In [ ]:
fig, ax = plt.subplots(1, figsize=(10, 5))

for group, df in res.groupby('location'):
    # Query the data to only get positive values and resample to hourly
    _df = df.query("value >= 0.0").resample('6h').mean()
    
    # Convert from ppm to ppb
    _df['value'] *= 1e3

    _df.value.plot(ax=ax, label=group)
    
ax.legend(loc='best')
ax.set_ylabel(r"$SO_2 \; [ppb]$", fontsize=18)
ax.set_xlabel("")

sns.despine(offset=5)

plt.show()

NOTE: These values are for 6h means. The local readings can actually get much, much higher (>5 ppm!) when looking at 1min data.
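The effect of the averaging window is easy to demonstrate with synthetic data: a short, intense spike that dominates 1-minute readings all but disappears in a 6-hour mean. The numbers below are invented for illustration:

```python
import pandas as pd

# One day of 1-minute SO2 readings (ppm): a low background plus a
# hypothetical 30-minute plume passing the monitor
idx = pd.date_range('2017-03-01', periods=1440, freq='1min')
so2 = pd.Series(0.01, index=idx)
so2.iloc[600:630] = 5.0

print(so2.max())                        # peak in the raw 1-minute data
print(so2.resample('6h').mean().max())  # far smaller peak after 6h averaging
```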

