The openaq
api is an easy-to-use wrapper built around the OpenAQ Api. Complete API documentation can be found on their website.
There are no keys or rate limits (as of March 2017), so working with the API is straight forward. If building a website or app, you may want to just use the python wrapper and interact with the data in json format. However, the rest of this tutorial will assume you are interested in analyzing the data. To get more out of it, I recommend installing seaborn
for manipulating the asthetics of plots, and working with data as DataFrames using pandas
. For more information on these, check out the installation section of this documentation.
From this point forward, I assume you have at least a basic knowledge of python and matplotlib. This documentation was built using the following versions of all packages:
In [32]:
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import openaq
import warnings
warnings.simplefilter('ignore')
%matplotlib inline
# Set major seaborn asthetics
sns.set("notebook", style='ticks', font_scale=1.0)
# Increase the quality of inline plots
mpl.rcParams['figure.dpi']= 500
print ("pandas v{}".format(pd.__version__))
print ("matplotlib v{}".format(mpl.__version__))
print ("seaborn v{}".format(sns.__version__))
print ("openaq v{}".format(openaq.__version__))
The OpenAQ APi has only eight endpoints that we are interested in:
For detailed documentation about each one in the context of this API wrapper, please check out the API documentation.
Real quick, let's go ahead and initiate an instance of the openaq.OpenAQ
class so we can begin looking at data:
In [2]:
api = openaq.OpenAQ()
In [3]:
resp = api.cities(df=True, limit=10000)
# display the first 10 rows
resp.info()
So we retrieved 1400+ entries from the database. We can then take a look at them:
In [4]:
print (resp.head(10))
Let's try to find out which ones are in India:
In [5]:
print (resp.query("country == 'IN'"))
Great! For the rest of the tutorial, we are going to focus on Delhi, India. Why? Well..because there are over 500,000 data points and my personal research is primarily in India. We will also take a look at some $SO_2$ data from Hawai'i later on (another great research locale).
In [7]:
res = api.countries(limit=10000, df=True)
print (res.head())
If you are interested in getting information pertaining to the individual data fetch operations, go ahead and use this endpoint. Most people won't need to use this. This API method does not allow the df
parameter; if you would like it to be added, drop me a message.
Otherwise, here is how you can access the json-formatted data:
In [9]:
status, resp = api.fetches(limit=1)
# Print out the meta info
resp['meta']
Out[9]:
In [10]:
res = api.parameters(df=True)
print (res)
In [11]:
res = api.sources(df=True)
# Print out the first one
res.ix[0]
Out[11]:
In [12]:
res = api.locations(df=True)
res.info()
In [13]:
# print out the first one
res.ix[0]
Out[13]:
What if we just want to grab the locations in Delhi?
In [14]:
res = api.locations(city='Delhi', df=True)
res.ix[0]
Out[14]:
What about just figuring out which locations in Delhi have $PM_{2.5}$ data?
In [15]:
res = api.locations(city='Delhi', parameter='pm25', df=True)
res.ix[0]
Out[15]:
In [17]:
res = api.latest(city='Delhi', parameter='pm25', df=True)
res.head()
Out[17]:
What about the most recent $SO_2$ data in Hawii?
In [18]:
res = api.latest(city='Hilo', parameter='so2', df=True)
res
Out[18]:
In [19]:
res = api.measurements(city='Delhi', parameter='pm25', limit=10000, df=True)
# Print out the statistics on a per-location basiss
res.groupby(['location'])['value'].describe()
Out[19]:
Clearly, we should be doing some serious data cleaning ;) Why don't we go ahead and plot all of these locations on a figure.
In [38]:
fig, ax = plt.subplots(1, figsize=(10, 6))
for group, df in res.groupby('location'):
# Query the data to only get positive values and resample to hourly
_df = df.query("value >= 0.0").resample('1h').mean()
_df.value.plot(ax=ax, label=group)
ax.legend(loc='best')
ax.set_ylabel("$PM_{2.5}$ [$\mu g m^{-3}$]", fontsize=20)
ax.set_xlabel("")
sns.despine(offset=5)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
Don't worry too much about how ugly and uninteresting the plot above is...we'll take care of that in the next tutorial! Let's go ahead and look at the distribution of $PM_{2.5}$ values seen in Delhi by various sensors. This is the same data as above, but viewed in a different way.
In [37]:
fig, ax = plt.subplots(1, figsize=(14,7))
ax = sns.boxplot(
x='location',
y='value',
data=res.query("value >= 0.0"),
fliersize=0,
palette='deep',
ax=ax)
ax.set_ylim([0, 750])
ax.set_ylabel("$PM_{2.5}\;[\mu gm^{-3}]$", fontsize=18)
ax.set_xlabel("")
sns.despine(offset=10)
plt.xticks(rotation=90)
plt.show()
If we remember from above, there was at least one location where many parameters were measured. Let's go ahead and look at that location and see if there is any correlation among parameters!
In [39]:
res = api.measurements(city='Delhi', location='Anand Vihar', limit=1000, df=True)
# Which params do we have?
res.parameter.unique()
Out[39]:
In [40]:
df = pd.DataFrame()
for u in res.parameter.unique():
_df = res[res['parameter'] == u][['value']]
_df.columns = [u]
# Merge the dataframes together
df = pd.merge(df, _df, left_index=True, right_index=True, how='outer')
# Get rid of rows where not all exist
df.dropna(how='any', inplace=True)
g = sns.PairGrid(df, diag_sharey=False)
g.map_lower(sns.kdeplot, cmap='Blues_d')
g.map_upper(plt.scatter)
g.map_diag(sns.kdeplot, lw=3)
plt.show()
For kicks, let's go ahead and look at a timeseries of $SO_2$ data in Hawai'i. Quiz: What do you expect? Did you know that Hawai'i has a huge $SO_2$ problem?
In [43]:
res = api.measurements(city='Hilo', parameter='so2', limit=10000, df=True)
# Print out the statistics on a per-location basiss
res.groupby(['location'])['value'].describe()
Out[43]:
In [45]:
fig, ax = plt.subplots(1, figsize=(10, 5))
for group, df in res.groupby('location'):
# Query the data to only get positive values and resample to hourly
_df = df.query("value >= 0.0").resample('6h').mean()
# Convert from ppm to ppb
_df['value'] *= 1e3
# Multiply the value by 1000 to get from ppm to ppb
_df.value.plot(ax=ax, label=group)
ax.legend(loc='best')
ax.set_ylabel("$SO_2 \; [ppb]$", fontsize=18)
ax.set_xlabel("")
sns.despine(offset=5)
plt.show()
NOTE: These values are for 6h means. The local readings can actually get much, much higher (>5 ppm!) when looking at 1min data.
In [ ]: