Case study: air quality data of European monitoring stations (AirBase)


AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe.

© 2016, Joris Van den Bossche and Stijn Van Hoey (jorisvandenbossche@gmail.com, stijnvanhoey@gmail.com). Licensed under the CC BY 4.0 Creative Commons license.


AirBase is the European air quality database maintained by the European Environment Agency (EEA). It contains air quality monitoring data and information submitted by participating countries throughout Europe. The air quality database consists of a multi-annual time series of air quality measurement data and statistics for a number of air pollutants.


In [ ]:
from IPython.display import HTML
HTML('<iframe src="http://www.eea.europa.eu/data-and-maps/data/airbase-the-european-air-quality-database-8#tab-data-by-country" width="900" height="350"></iframe>')

Some of the data files that are available from AirBase were included in the data folder: the hourly concentrations of nitrogen dioxide (NO2) for 4 different measurement stations:

  • FR04037 (PARIS 13eme): urban background site at Square de Choisy
  • FR04012 (Paris, Place Victor Basch): urban traffic site at Rue d'Alesia
  • BETR802: urban traffic site in Antwerp, Belgium
  • BETN029: rural background site in Houtem, Belgium

See http://www.eea.europa.eu/themes/air/interactive/no2


In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.options.display.max_rows = 8

Processing a single file

We will start with processing one of the downloaded files (BETR8010000800100hour.1-1-1990.31-12-2012). Looking at the data, you will see it does not look like a nice csv file:


In [ ]:
with open("data/BETR8010000800100hour.1-1-1990.31-12-2012") as f:
    print(f.readline())

So we will need to do some manual processing.

Just reading the tab-delimited data:


In [ ]:
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t')#, header=None)

In [ ]:
data.head()

The above data is clearly not ready to be used! Each row contains the 24 measurements for the hours of the day, each followed by a flag (0/1) indicating the quality of the data. Furthermore, there is no header row with column names.

EXERCISE:

Clean up this dataframe by using more options of `read_csv` (see its [docstring](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html))
  • specify the correct delimiter
  • specify that the values of -999 and -9999 should be regarded as NaN
  • specify our own column names (for how the column names are constructed, see http://stackoverflow.com/questions/6356041/python-intertwining-two-lists); a sketch of the full call is shown after the next cell

In [ ]:
# Column names: 'date', followed by the 24 hours of the day, each interleaved with a numbered flag column
hours = ["{:02d}".format(i) for i in range(24)]
flags = ["flag{}".format(i + 1) for i in range(24)]
column_names = ['date'] + [item for pair in zip(hours, flags) for item in pair]
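With these column names, the full read could look something like the following sketch (it assumes -999 and -9999 are the only sentinel values, and that each line has exactly one field per column name; the loaded solution may differ):

In [ ]:
# A possible read_csv call (a sketch): tab-delimited, no header row,
# our own column names, and the sentinel values treated as NaN
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012",
                   sep='\t', header=None, names=column_names,
                   na_values=[-999, -9999])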

In [ ]:
# %load snippets/07 - Case study - air quality data7.py

In [ ]:
# %load snippets/07 - Case study - air quality data8.py

For the sake of this tutorial, we will disregard the 'flag' columns (indicating the quality of the data).

EXERCISE:

Drop all 'flag' columns ('flag1', 'flag2', ...)


In [ ]:
flag_columns = [col for col in data.columns if 'flag' in col]
# we can now use this list to drop these columns
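For example (a sketch):

In [ ]:
# axis=1 (or columns=...) means we drop columns, not rows
data = data.drop(flag_columns, axis=1)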

In [ ]:
# %load snippets/07 - Case study - air quality data10.py

In [ ]:
data.head()

Now we want to reshape the data: our goal is to move the different hours into the row index, merged with the date into a datetime index. In other words, we have a wide dataframe and want to turn it into a long, narrow timeseries.

REMEMBER:
  • Recap: reshaping your data with [`stack` and `unstack`](./pandas_07_reshaping_data.ipynb)
EXERCISE:

Reshape the dataframe to a timeseries. The end result should look like:

BETR801
1990-01-02 09:00:00 48.0
1990-01-02 12:00:00 48.0
1990-01-02 13:00:00 50.0
1990-01-02 14:00:00 55.0
... ...
2012-12-31 20:00:00 16.5
2012-12-31 21:00:00 14.5
2012-12-31 22:00:00 16.5
2012-12-31 23:00:00 15.0

170794 rows × 1 columns

  • Reshape the dataframe so that each row consists of one observation for one date + hour combination
  • When you have the date and hour values as two columns, combine them into a datetime (tip: string columns can be summed to concatenate the strings)
  • Set the new datetime values as the index, and remove the original date and hour columns

NOTE: This is an advanced exercise. Do not spend too much time on it and don't hesitate to look at the solutions.
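One possible reshaping sketch, assuming the dates in the file are formatted like '1990-01-01' and the hour columns are named '00' through '23' as constructed above (not necessarily the loaded solution):

In [ ]:
# Sketch: stack the hour columns into the rows, then build a datetime index
data_stacked = data.set_index('date').stack().reset_index(name='BETR801')
# concatenate the date and hour strings and parse them as datetimes
data_stacked.index = pd.to_datetime(data_stacked['date'] + ' ' + data_stacked['level_1'],
                                    format="%Y-%m-%d %H")
data_stacked = data_stacked.drop(['date', 'level_1'], axis=1)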



In [ ]:
# %load snippets/07 - Case study - air quality data12.py

In [ ]:
# %load snippets/07 - Case study - air quality data13.py

In [ ]:
# %load snippets/07 - Case study - air quality data14.py

In [ ]:
# %load snippets/07 - Case study - air quality data15.py

In [ ]:
# %load snippets/07 - Case study - air quality data16.py

In [ ]:
data_stacked.head()

Our final data is now a time series. In pandas, this means that the index is a DatetimeIndex:


In [ ]:
data_stacked.index

In [ ]:
data_stacked.plot()

Processing a collection of files

We have now seen the code steps to process one of the files. However, there are multiple files for the different stations, all with the same structure. Therefore, to avoid repeating the code, let's turn the steps above into a function.

EXERCISE:
  • Write a function `read_airbase_file(filename, station)` that uses the above steps to read in and process the data, and returns a processed timeseries.

In [ ]:
def read_airbase_file(filename, station):
    """
    Read hourly AirBase data files.
    
    Parameters
    ----------
    filename : string
        Path to the data file.
    station : string
        Name of the station.
       
    Returns
    -------
    DataFrame
        Processed dataframe.
    """
    
    ...
    
    return ...
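One possible implementation sketch, combining the steps above (not necessarily identical to the loaded solution):

In [ ]:
def read_airbase_file(filename, station):
    """Read and reshape one hourly AirBase file into a timeseries (sketch)."""
    hours = ["{:02d}".format(i) for i in range(24)]
    flags = ["flag{}".format(i + 1) for i in range(24)]
    names = ['date'] + [item for pair in zip(hours, flags) for item in pair]
    data = pd.read_csv(filename, sep='\t', header=None, names=names,
                       na_values=[-999, -9999])
    # drop the quality flag columns
    data = data.drop([col for col in data.columns if 'flag' in col], axis=1)
    # reshape to a timeseries with a datetime index, named after the station
    data = data.set_index('date').stack().reset_index(name=station)
    data.index = pd.to_datetime(data['date'] + ' ' + data['level_1'],
                                format="%Y-%m-%d %H")
    return data[[station]]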

In [ ]:
# %load snippets/07 - Case study - air quality data21.py

Test the function on the data file from above:


In [ ]:
filename = "data/BETR8010000800100hour.1-1-1990.31-12-2012"
station = filename.split("/")[-1][:7]

In [ ]:
station

In [ ]:
test = read_airbase_file(filename, station)
test.head()

We now want to use this function to read in all the different data files from AirBase and combine them into one DataFrame.

EXERCISE:
  • Use the `glob.glob` function to list all 4 AirBase data files that are included in the 'data' directory, and call the result `data_files`.

In [ ]:
import glob
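`glob.glob` takes a shell-style pattern. Assuming the four data files are the only files in the 'data' directory with 'hour' in their name, something like this could work (a sketch):

In [ ]:
data_files = glob.glob("data/*hour*")  # the pattern is an assumption
data_files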

In [ ]:
# %load snippets/07 - Case study - air quality data33.py

EXERCISE:
  • Loop over the data files, read and process each file using the function we defined above, and append the resulting dataframe to a list.
  • Combine the different DataFrames in the list into a single DataFrame where the different columns are the different stations. Call the result `combined_data`. A sketch follows below.
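A sketch of these two steps (it assumes `data_files` from the previous exercise):

In [ ]:
# read each file with our function and concatenate along the columns
dfs = []
for filename in data_files:
    station = filename.split("/")[-1][:7]   # station code = first 7 characters
    dfs.append(read_airbase_file(filename, station))

combined_data = pd.concat(dfs, axis=1)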

In [ ]:
# %load snippets/07 - Case study - air quality data34.py

In [ ]:
# %load snippets/07 - Case study - air quality data35.py

In [ ]:
combined_data.head()

Finally, we don't want to have to repeat this each time we use the data. Therefore, let's save the processed data to a csv file.


In [ ]:
combined_data.to_csv("airbase_data.csv")

Working with time series data

We processed the individual data files above and saved the result to a csv file, airbase_data.csv. Let's import that file here (if you didn't finish the above exercises, a version of the dataset is also available in data/airbase_data.csv):


In [ ]:
alldata = pd.read_csv('airbase_data.csv', index_col=0, parse_dates=True)

We only use the data from 1999 onwards:


In [ ]:
data = alldata['1999':].copy()

Some first exploration with the typical functions:


In [ ]:
data.head() # tail()

In [ ]:
data.info()

In [ ]:
data.describe(percentiles=[0.1, 0.5, 0.9])

Quickly visualizing the data


In [ ]:
data.plot(kind='box', ylim=[0,250])

In [ ]:
data['BETR801'].plot(kind='hist', bins=50)

In [ ]:
data.plot(figsize=(12,6))

This does not tell us much...

We can select part of the data (e.g. the last 500 data points):


In [ ]:
data[-500:].plot(figsize=(12,6))

Exercises

REMINDER:

Take a look at the [Timeseries notebook](05 - Time series data.ipynb) when you require more info about...
  • `resample`
  • string indexing of DateTimeIndex


QUESTION: plot the monthly mean and median concentration of the 'FR04037' station for the years 2009-2012
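A possible approach (a sketch): select the station and years with string indexing, then resample to monthly frequency:

In [ ]:
# monthly mean and median of FR04037 for 2009-2012
data.loc['2009':'2012', 'FR04037'].resample('M').agg(['mean', 'median']).plot()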

In [ ]:
# %load snippets/07 - Case study - air quality data50.py

In [ ]:
# %load snippets/07 - Case study - air quality data51.py

QUESTION: plot the monthly minimum and maximum daily concentration of the 'BETR801' station
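A sketch: first compute daily means, then take the monthly minimum and maximum of those daily values:

In [ ]:
daily = data['BETR801'].resample('D').mean()
daily.resample('M').agg(['min', 'max']).plot()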

In [ ]:
# %load snippets/07 - Case study - air quality data52.py

In [ ]:
# %load snippets/07 - Case study - air quality data53.py

QUESTION: make a bar plot of the mean concentration per station in the year 2012
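A sketch, using string indexing to select 2012:

In [ ]:
data['2012'].mean().plot(kind='bar')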

In [ ]:
# %load snippets/07 - Case study - air quality data54.py

QUESTION: plot the evolution of the yearly averages of the different stations, together with the overall mean over all stations (indicate the overall mean with a thicker black line)
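A sketch: resample to yearly means and overlay the mean over all stations as a thick black line:

In [ ]:
yearly = data.resample('A').mean()
ax = yearly.plot()
yearly.mean(axis=1).plot(ax=ax, color='black', linewidth=4, label='overall mean')
ax.legend()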

In [ ]:
# %load snippets/07 - Case study - air quality data55.py

Combination with groupby

resample can actually be seen as a specific kind of groupby. E.g. taking annual means with data.resample('A').mean() is equivalent to data.groupby(data.index.year).mean() (except that the result of resample still has a DatetimeIndex).


In [ ]:
data.groupby(data.index.year).mean().plot()
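For comparison, the resample counterpart, whose result keeps a DatetimeIndex:

In [ ]:
data.resample('A').mean().plot()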

But groupby is more flexible, and can also do resamples that do not result in a new continuous time series, e.g. grouping by the hour of the day to get the diurnal cycle.

QUESTION: what does the *typical monthly profile* look like for the different stations?

1. add a column to the dataframe that indicates the month (integer value of 1 to 12):


In [ ]:
# %load snippets/07 - Case study - air quality data57.py

2. Now, we can calculate the mean of each month over the different years:


In [ ]:
# %load snippets/07 - Case study - air quality data58.py

3. plot the typical monthly profile of the different stations:


In [ ]:
# %load snippets/07 - Case study - air quality data59.py
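Combining the three steps, a sketch could look like this (the 'month' helper column is removed again in the next cell):

In [ ]:
data['month'] = data.index.month                 # 1. month indicator (1-12)
monthly_profile = data.groupby('month').mean()   # 2. mean per month over all years
monthly_profile.plot()                           # 3. the typical monthly profiles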

In [ ]:
data = data.drop('month', axis=1, errors='ignore')

QUESTION: plot the weekly 95th percentile of the concentration in 'BETR801' and 'BETN029' for 2011

In [ ]:
df2011 = data['2011'].dropna()
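A sketch, resampling to weekly frequency and taking the 0.95 quantile (Resampler.quantile is available in recent pandas versions):

In [ ]:
df2011[['BETR801', 'BETN029']].resample('W').quantile(0.95).plot()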

In [ ]:
# %load snippets/07 - Case study - air quality data64.py

In [ ]:
# %load snippets/07 - Case study - air quality data65.py

QUESTION: What does the typical diurnal profile look like for the different stations?
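A sketch, grouping by the hour of the day:

In [ ]:
data.groupby(data.index.hour).mean().plot()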

In [ ]:
# %load snippets/07 - Case study - air quality data66.py

QUESTION: How many exceedances of the European hourly limit of 200 µg/m³ are there for each year and station?
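A sketch: build a boolean mask of the exceedances and count it per year (summing booleans counts the True values):

In [ ]:
exceedances = data > 200
exceedances.groupby(exceedances.index.year).sum().plot(kind='bar')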

In [ ]:
# %load snippets/07 - Case study - air quality data67.py

In [ ]:
# %load snippets/07 - Case study - air quality data68.py

In [ ]:
# %load snippets/07 - Case study - air quality data69.py

In [ ]:
# %load snippets/07 - Case study - air quality data70.py

QUESTION: And are there exceedances of the yearly limit value of 40 µg/m³ since 2000?
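A sketch: compare the yearly averages since 2000 against the 40 µg/m³ limit:

In [ ]:
ax = data['2000':].resample('A').mean().plot()
ax.axhline(y=40, color='darkorange', linestyle='--', label='limit')
ax.legend()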

In [ ]:
# %load snippets/07 - Case study - air quality data72.py

In [ ]:
# %load snippets/07 - Case study - air quality data73.py

In [ ]:
# %load snippets/07 - Case study - air quality data74.py

QUESTION: What is the difference in the typical diurnal profile between weekdays and weekend days? (and visualise it)

In [ ]:
data.index.weekday?

In [ ]:
# %load snippets/07 - Case study - air quality data76.py

Add a column indicating week/weekend
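One way to do this (a sketch; the 'weekend' column name and the example station are assumptions):

In [ ]:
# weekday 5 = Saturday, 6 = Sunday
data['weekend'] = data.index.weekday.isin([5, 6])
diurnal = data.groupby(['weekend', data.index.hour])['FR04037'].mean()
diurnal.unstack(level='weekend').plot()
data = data.drop('weekend', axis=1)  # clean up the helper column again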


In [ ]:
# %load snippets/07 - Case study - air quality data77.py

In [ ]:
# %load snippets/07 - Case study - air quality data78.py

In [ ]:
# %load snippets/07 - Case study - air quality data79.py

In [ ]:
# %load snippets/07 - Case study - air quality data80.py

In [ ]:
# %load snippets/07 - Case study - air quality data81.py

QUESTION: Visualize the typical week profile for the different stations as boxplots (where the values in one boxplot are the daily means for the different weeks for a certain weekday).

Tip: the boxplot method of a DataFrame expects the data for the different boxes in different columns. For this, you can either use pivot_table or a combination of groupby and unstack; a pivot_table sketch is shown below.
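A sketch of the pivot_table route, using to_period('W') to give each calendar week a unique label, so every (week, weekday) cell holds one daily mean:

In [ ]:
daily = data.resample('D').mean()                 # daily means per station
daily['weekday'] = daily.index.weekday
daily['week'] = daily.index.to_period('W')        # unique label per calendar week
daily.pivot_table(index='week', columns='weekday', values='BETR801').boxplot()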


In [ ]:
# %load snippets/07 - Case study - air quality data82.py

In [ ]:
# %load snippets/07 - Case study - air quality data83.py

In [ ]:
# %load snippets/07 - Case study - air quality data84.py

An alternative method using groupby and unstack:


In [ ]:
# %load snippets/07 - Case study - air quality data85.py

QUESTION: The maximum daily 8 hour mean should be below 100 µg/m³. How many exceedances of this limit are there for each year and station?

Tip: have a look at the rolling method to perform moving window operations.

Note: this is not an actual limit for NO2, but it is a nice exercise to introduce the rolling method. Other pollutants, such as O3, do have such limit values.
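A sketch using rolling: compute 8-hour rolling means, take the daily maximum, and count the exceedances per year:

In [ ]:
rolling_8h = data.rolling(window=8).mean()        # moving 8-hour window
daily_max = rolling_8h.resample('D').max()        # maximum per day
(daily_max > 100).groupby(daily_max.index.year).sum().plot(kind='bar')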


In [ ]:
# %load snippets/07 - Case study - air quality data86.py

QUESTION: Calculate the correlation between the different stations
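A sketch: DataFrame.corr gives the pairwise (Pearson) correlations between the columns:

In [ ]:
data.corr()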

In [ ]:
# %load snippets/07 - Case study - air quality data87.py

In [ ]:
# %load snippets/07 - Case study - air quality data88.py
