Case study: air quality data of European monitoring stations (AirBase)
© 2016, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons
AirBase is the European air quality database maintained by the European Environment Agency (EEA). It contains air quality monitoring data and information submitted by participating countries throughout Europe. The air quality database consists of a multi-annual time series of air quality measurement data and statistics for a number of air pollutants.
In [ ]:
from IPython.display import HTML
HTML('<iframe src=http://www.eea.europa.eu/data-and-maps/data/airbase-the-european-air-quality-database-8#tab-data-by-country width=900 height=350></iframe>')
Some of the data files that are available from AirBase were included in the data folder: the hourly concentrations of nitrogen dioxide (NO2) for 4 different measurement stations:
In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_rows = 8
In [ ]:
with open("data/BETR8010000800100hour.1-1-1990.31-12-2012") as f:
print(f.readline())
So we will need to do some manual processing.
Just reading the tab-delimited data:
In [ ]:
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t')#, header=None)
In [ ]:
data.head()
The above data is clearly not ready to be used! Each row contains the 24 measurements for each hour of the day, and also contains a flag (0/1) indicating the quality of the data. Furthermore, there is no header row with column names.
In [ ]:
# Column names: list consisting of 'date' and then intertwined the hour of the day and 'flag'
hours = ["{:02d}".format(i) for i in range(24)]
column_names = ['date'] + [item for pair in zip(hours, ['flag']*24) for item in pair]
In [ ]:
# %load snippets/07 - Case study - air quality data7.py
In [ ]:
# %load snippets/07 - Case study - air quality data8.py
For the sake of this tutorial, we will disregard the 'flag' columns (indicating the quality of the data).
EXERCISE:
Drop all 'flag' columns ('flag1', 'flag2', ...)
In [ ]:
flag_columns = [col for col in data.columns if 'flag' in col]
# we can now use this list to drop these columns
In [ ]:
# %load snippets/07 - Case study - air quality data10.py
In [ ]:
data.head()
Now, we want to reshape it: our goal is to have the different hours as row indices, merged with the date into a datetime-index. Here we have a wide and long dataframe, and want to make this a long, narrow timeseries.
BETR801 | |
---|---|
1990-01-02 09:00:00 | 48.0 |
1990-01-02 12:00:00 | 48.0 |
1990-01-02 13:00:00 | 50.0 |
1990-01-02 14:00:00 | 55.0 |
... | ... |
2012-12-31 20:00:00 | 16.5 |
2012-12-31 21:00:00 | 14.5 |
2012-12-31 22:00:00 | 16.5 |
2012-12-31 23:00:00 | 15.0 |
170794 rows × 1 columns
NOTE: This is an advanced exercise. Do not spend too much time on it and don't hesitate to look at the solutions.
</div>
In [ ]:
# %load snippets/07 - Case study - air quality data12.py
In [ ]:
# %load snippets/07 - Case study - air quality data13.py
In [ ]:
# %load snippets/07 - Case study - air quality data14.py
In [ ]:
# %load snippets/07 - Case study - air quality data15.py
In [ ]:
# %load snippets/07 - Case study - air quality data16.py
In [ ]:
data_stacked.head()
Our final data is now a time series. In pandas, this means that the index is a DatetimeIndex
:
In [ ]:
data_stacked.index
In [ ]:
data_stacked.plot()
We now have seen the code steps to process one of the files. We have however multiple files for the different stations with the same structure. Therefore, to not have to repeat the actual code, let's make a function from the steps we have seen above.
In [ ]:
def read_airbase_file(filename, station):
"""
Read hourly AirBase data files.
Parameters
----------
filename : string
Path to the data file.
station : string
Name of the station.
Returns
-------
DataFrame
Processed dataframe.
"""
...
return ...
In [ ]:
# %load snippets/07 - Case study - air quality data21.py
Test the function on the data file from above:
In [ ]:
filename = "data/BETR8010000800100hour.1-1-1990.31-12-2012"
station = filename.split("/")[-1][:7]
In [ ]:
station
In [ ]:
test = read_airbase_file(filename, station)
test.head()
We now want to use this function to read in all the different data files from AirBase, and combine them into one Dataframe.
In [ ]:
import glob
In [ ]:
# %load snippets/07 - Case study - air quality data33.py
In [ ]:
# %load snippets/07 - Case study - air quality data34.py
In [ ]:
# %load snippets/07 - Case study - air quality data35.py
In [ ]:
combined_data.head()
Finally, we don't want to have to repeat this each time we use the data. Therefore, let's save the processed data to a csv file.
In [ ]:
combined_data.to_csv("airbase_data.csv")
We processed the individual data files above, and saved it to a csv file airbase_data.csv
. Let's import the file here (if you didn't finish the above exercises, a version of the dataset is also available in data/airbase_data.csv
):
In [ ]:
alldata = pd.read_csv('airbase_data.csv', index_col=0, parse_dates=True)
We only use the data from 1999 onwards:
In [ ]:
data = alldata['1999':].copy()
Som first exploration with the typical functions:
In [ ]:
data.head() # tail()
In [ ]:
data.info()
In [ ]:
data.describe(percentiles=[0.1, 0.5, 0.9])
Quickly visualizing the data
In [ ]:
data.plot(kind='box', ylim=[0,250])
In [ ]:
data['BETR801'].plot(kind='hist', bins=50)
In [ ]:
data.plot(figsize=(12,6))
This does not say too much ..
We can select part of the data (eg the latest 500 data points):
In [ ]:
data[-500:].plot(figsize=(12,6))
In [ ]:
# %load snippets/07 - Case study - air quality data50.py
In [ ]:
# %load snippets/07 - Case study - air quality data51.py
In [ ]:
# %load snippets/07 - Case study - air quality data52.py
In [ ]:
# %load snippets/07 - Case study - air quality data53.py
In [ ]:
# %load snippets/07 - Case study - air quality data54.py
In [ ]:
# %load snippets/07 - Case study - air quality data55.py
resample
can actually be seen as a specific kind of groupby
. E.g. taking annual means with data.resample('A', 'mean')
is equivalent to data.groupby(data.index.year).mean()
(only the result of resample
still has a DatetimeIndex).
In [ ]:
data.groupby(data.index.year).mean().plot()
But, groupby
is more flexible and can also do resamples that do not result in a new continuous time series, e.g. by grouping by the hour of the day to get the diurnal cycle.
1. add a column to the dataframe that indicates the month (integer value of 1 to 12):
In [ ]:
# %load snippets/07 - Case study - air quality data57.py
2. Now, we can calculate the mean of each month over the different years:
In [ ]:
# %load snippets/07 - Case study - air quality data58.py
3. plot the typical monthly profile of the different stations:
In [ ]:
# %load snippets/07 - Case study - air quality data59.py
In [ ]:
data = data.drop('month', axis=1, errors='ignore')
In [ ]:
df2011 = data['2011'].dropna()
In [ ]:
# %load snippets/07 - Case study - air quality data64.py
In [ ]:
# %load snippets/07 - Case study - air quality data65.py
In [ ]:
# %load snippets/07 - Case study - air quality data66.py
In [ ]:
# %load snippets/07 - Case study - air quality data67.py
In [ ]:
# %load snippets/07 - Case study - air quality data68.py
In [ ]:
# %load snippets/07 - Case study - air quality data69.py
In [ ]:
# %load snippets/07 - Case study - air quality data70.py
In [ ]:
# %load snippets/07 - Case study - air quality data72.py
In [ ]:
# %load snippets/07 - Case study - air quality data73.py
In [ ]:
# %load snippets/07 - Case study - air quality data74.py
In [ ]:
data.index.weekday?
In [ ]:
# %load snippets/07 - Case study - air quality data76.py
Add a column indicating week/weekend
In [ ]:
# %load snippets/07 - Case study - air quality data77.py
In [ ]:
# %load snippets/07 - Case study - air quality data78.py
In [ ]:
# %load snippets/07 - Case study - air quality data79.py
In [ ]:
# %load snippets/07 - Case study - air quality data80.py
In [ ]:
# %load snippets/07 - Case study - air quality data81.py
Tip: the boxplot method of a DataFrame expects the data for the different boxes in different columns). For this, you can either use pivot_table
as a combination of groupby
and unstack
In [ ]:
# %load snippets/07 - Case study - air quality data82.py
In [ ]:
# %load snippets/07 - Case study - air quality data83.py
In [ ]:
# %load snippets/07 - Case study - air quality data84.py
An alternative method using groupby
and unstack
:
In [ ]:
# %load snippets/07 - Case study - air quality data85.py
Tip: have a look at the rolling
method to perform moving window operations.
Note: this is not an actual limit for NO2, but a nice exercise to introduce the rolling
method. Other pollutans, such as 03 have actually such kind of limit values.
In [ ]:
# %load snippets/07 - Case study - air quality data86.py
In [ ]:
# %load snippets/07 - Case study - air quality data87.py
In [ ]:
# %load snippets/07 - Case study - air quality data88.py
In [ ]: