Case study: air quality data of European monitoring stations (AirBase)

AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe.

DS Data manipulation, analysis and visualisation in Python
December, 2019

© 2016, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons



In [ ]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotnine as pn

pd.options.display.max_rows = 8

In the previous notebook, we processed some raw data files of the AirBase air quality data. As a reminder, the data contains hourly concentrations of nitrogen dioxide (NO2) for 4 different measurement stations:

FR04037 (PARIS 13eme): urban background site at Square de Choisy
FR04012 (Paris, Place Victor Basch): urban traffic site at Rue d'Alesia
BETR802: urban traffic site in Antwerp, Belgium
BETN029: rural background site in Houtem, Belgium

See http://www.eea.europa.eu/themes/air/interactive/no2

Importing and quick exploration

We processed the individual data files in the previous notebook, and saved it to a csv file ../data/airbase_data_processed.csv. Let's import the file here (if you didn't finish the previous notebook, a version of the dataset is also available in ../data/airbase_data.csv):



In [ ]:

    
alldata = pd.read_csv('../data/airbase_data.csv', index_col=0, parse_dates=True)

We only use the data from 1999 onwards:



In [ ]:

    
data = alldata['1999':].copy()

Some first exploration with the typical functions:



In [ ]:

    
data.head() # tail()



In [ ]:

    
data.info()



In [ ]:

    
data.describe(percentiles=[0.1, 0.5, 0.9])



In [ ]:

    
data.plot(figsize=(12,6))

ATTENTION!:

When just using `.plot()` without further notice (selection, aggregation,...)

Risk of running into troubles by overloading your computer processing (certainly with looooong time series)
Not always the most informative/interpretable visualisation

Plot only a subset

Why not just using the head/tail possibilities?



In [ ]:

    
data.tail(500).plot(figsize=(12,6))

Summary figures

Use summary statistics...



In [ ]:

    
data.plot(kind='box', ylim=[0,250])

Also with seaborn plots function, just start with some subsets as first impression...

As we already have seen previously, the plotting library seaborn provides some high-level plotting functions on top of matplotlib (check the docs!). One of those functions is pairplot, which we can use here to quickly visualize the concentrations at the different stations and their relation:



In [ ]:

    
import seaborn as sns



In [ ]:

    
sns.pairplot(data.tail(5000).dropna())

Is this a tidy dataset ?



In [ ]:

    
data.head()

In principle this is not a tidy dataset. The variable that was measured is the NO2 concentration, and is divided in 4 columns. Of course those measurements were made at different stations, so one could interpet it as separate variables. But in any case, such format typically does not work well with plotnine which expects a pure tidy format.

Reason to not use a tidy dataset here:

bigger memory use
timeseries functionality like resample works better
pandas plotting already does what we want when having different columns for some types of plots (eg line plots of the timeseries)

EXERCISE:

Create a tidy version of this dataset data_tidy, ensuring the result has new columns 'station' and 'no2'.
Check how many missing values are contained in the 'no2' column.
Drop the rows with missing values in that column.



In [ ]:

    
# %load _solutions/case4_air_quality_analysis1.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis2.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis3.py

In the following exercises we will mostly do our analysis on dataand often use pandas (or seaborn) plotting, but once we produced some kind of summary dataframe as the result of an analysis, then it becomes more interesting to convert that result to a tidy format to be able to use the more advanced plotting functionality of plotnine.

Exercises

REMINDER:

Take a look at the [Timeseries notebook](pandas_04_time_series_data.ipynb) when you require more info about:

resample
string indexing of DateTimeIndex

Take a look at the [matplotlib](visualization_01_matplotlib.ipynb) and [plotnine](visualization_02_plotnine.ipynb) notebooks when you require more info about the plot requirements.

EXERCISE:

Plot the monthly mean and median concentration of the 'FR04037' station for the years 2009 - 2013 in a single figure/ax



In [ ]:

    
# %load _solutions/case4_air_quality_analysis4.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis5.py

EXERCISE

Make a violin plot for January 2011 until August 2011 (check out the documentation to improve the plotting settings)
Change the y-label to 'NO$_2$ concentration (µg/m³)'

NOTE: When having the data not in a long format but when having different columns for which you want to make violin plots, you can use [seaborn](http://seaborn.pydata.org/examples/index.html). When using the tidy data, we can use `plotnine`.



In [ ]:

    
# %load _solutions/case4_air_quality_analysis6.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis7.py

EXERCISE

Make a bar plot with pandas of the mean of each of the stations in the year 2012 (check the documentation of Pandas plot to adapt the rotation of the labels) and make sure all bars have the same color.
Using the matplotlib objects, change the y-label to 'NO$_2$ concentration (µg/m³)
Add a 'darkorange' horizontal line on the ax for the y-value 40 µg/m³ (command for horizontal line from matplotlib: axhline).
Place the text 'Yearly limit is 40 µg/m³' just above the 'darkorange' line.



In [ ]:

    
# %load _solutions/case4_air_quality_analysis8.py

EXERCISE: Did the air quality improve over time?

For the data from 1999 till the end, plot the yearly averages
For the same period, add the overall mean (all stations together) as an additional line to the graph, use a thicker black line (linewidth=4 and linestyle='--')
[OPTIONAL] Add a legend above the ax for all lines



In [ ]:

    
# %load _solutions/case4_air_quality_analysis9.py

REMEMBER:

`resample` is a special version of a`groupby` operation. For example, taking annual means with `data.resample('A').mean()` is equivalent to `data.groupby(data.index.year).mean()` (but the result of `resample` still has a DatetimeIndex).

Checking the index of the resulting DataFrame when using **groupby** instead of resample: You'll notice that the Index lost the DateTime capabilities:


> data.groupby(data.index.year).mean().index

Results in:


Int64Index([1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
            2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
            2012],
           dtype='int64')$

When using **resample**, we keep the DateTime capabilities:


> data.resample('A').mean().index

Results in:


DatetimeIndex(['1999-12-31', '2000-12-31', '2001-12-31', '2002-12-31',
               '2003-12-31', '2004-12-31', '2005-12-31', '2006-12-31',
               '2007-12-31', '2008-12-31', '2009-12-31', '2010-12-31',
               '2011-12-31', '2012-12-31'],
              dtype='datetime64[ns]', freq='A-DEC')

But, `groupby` is more flexible and can also do resamples that do not result in a new continuous time series, e.g. by grouping by the hour of the day to get the diurnal cycle.

EXERCISE

How does the typical yearly profile (typical averages for the different months over the years) look like for the different stations? (add a 'month' column as a first step)



In [ ]:

    
# %load _solutions/case4_air_quality_analysis10.py



In [ ]:

    
data = data.drop("month", axis=1)

Note: Technically, we could reshape the result of the groupby operation to a tidy format (we no longer have a real time series), but since we already have the things we want to plot as lines in different columns, doing .plot already does what we want.

EXERCISE

Plot the weekly 95% percentiles of the concentration in 'BETR801' and 'BETN029' for 2011



In [ ]:

    
# %load _solutions/case4_air_quality_analysis11.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis12.py

EXERCISE

Plot the typical diurnal profile (typical hourly averages) for the different stations taking into account the whole time period.



In [ ]:

    
# %load _solutions/case4_air_quality_analysis13.py

EXERCISE

What is the difference in the typical diurnal profile between week and weekend days? (and visualise it)

Start with only visualizing the different in diurnal profile for the BETR801 station. In a next step, make the same plot for each station.

**Hints:**

Add a column 'weekend' defining if a value of the index is in the weekend (i.e. weekdays 5 and 6) or not
Add a column 'hour' with the hour of the day for each row.
You can groupby on multiple items at the same time.



In [ ]:

    
# %load _solutions/case4_air_quality_analysis14.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis15.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis16.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis17.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis18.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis19.py



In [ ]:

    
data = data.drop(['hour', 'weekend'], axis=1)

EXERCISE:

Calculate the correlation between the different stations (check in the documentation, google "pandas correlation" or use the magic function %psearch)



In [ ]:

    
# %load _solutions/case4_air_quality_analysis20.py

EXERCISE:

Count the number of exceedances of hourly values above the European limit 200 µg/m3 for each year and station after 2005. Make a barplot of the counts. Add an horizontal line indicating the maximum number of exceedances (which is 18) allowed per year?

**Hints:**

Create a new DataFrame, called exceedances, (with boolean values) indicating if the threshold is exceeded or not
Remember that the sum of True values can be used to count elements
Adding a horizontal line can be done with the matplotlib function ax.axhline



In [ ]:

    
# %load _solutions/case4_air_quality_analysis21.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis22.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis23.py

More advanced exercises...



In [ ]:

    
data = alldata['1999':].copy()

EXERCISE: Perform the following actions for the station `'FR04012'` only:

Remove the rows containing NaN or zero values
Sort the values of the rows according to the air quality values (low to high values)
Rescale the values to the range [0-1] and store result as FR_scaled (Hint: check wikipedia)
Use pandas to plot these values sorted, not taking into account the dates
Add the station name 'FR04012' as y-label
[OPTIONAL] Add a vertical line to the plot where the line (hence, the values of variable FR_scaled) reach the value 0.3. You will need the documentation of np.searchsorted and matplotlib's axvline



In [ ]:

    
# %load _solutions/case4_air_quality_analysis24.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis25.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis26.py

EXERCISE:

Create a Figure with two subplots (axes), for which both axis are shared
In the left subplot, plot the histogram (30 bins) of station 'BETN029', only for the year 2009
In the right subplot, plot the histogram (30 bins) of station 'BETR801', only for the year 2009
Add the title representing the station name on each of the subplots, you do not want to have a legend



In [ ]:

    
# %load _solutions/case4_air_quality_analysis27.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis28.py

EXERCISE

Make a selection of the original dataset of the data in January 2009, call the resulting variable subset
Add a new column, called 'weekday', to the variable subset which defines for each data point the day of the week
From the subset DataFrame, select only Monday (= day 0) and Sunday (=day 6) and remove the others (so, keep this as variable subset)
Change the values of the weekday column in subset according to the following mapping: {0:"Monday", 6:"Sunday"}
With plotnine, make a scatter plot of the measurements at 'BETN029' vs 'FR04037', with the color variation based on the weekday. Add a linear regression to this plot.

**Note**: If you run into the **SettingWithCopyWarning** and do not know what to do, recheck [pandas_03b_indexing](pandas_03b_indexing.ipynb)



In [ ]:

    
# %load _solutions/case4_air_quality_analysis29.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis30.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis31.py

EXERCISE:

The maximum daily, 8 hour mean, should be below 100 µg/m³. What is the number of exceedances of this limit for each year/station?

**Tip:**
Have a look at the `rolling` method to perform moving window operations.

**Note:**
This is not an actual limit for NO$_2$, but a nice exercise to introduce the `rolling` method. Other pollutans, such as 0$_3$ have actually such kind of limit values based on 8-hour means.



In [ ]:

    
# %load _solutions/case4_air_quality_analysis32.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis33.py

EXERCISE:

Visualize the typical week profile for station 'BETR801' as boxplots (where the values in one boxplot are the daily means for the different weeks for a certain weekday).

**Tip:**
The boxplot method of a DataFrame expects the data for the different boxes in different columns. For this, you can either use `pivot_table` or a combination of `groupby` and `unstack`

Calculating daily means and add weekday information:



In [ ]:

    
# %load _solutions/case4_air_quality_analysis34.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis35.py

Plotting with plotnine:



In [ ]:

    
# %load _solutions/case4_air_quality_analysis36.py

Reshaping and plotting with pandas:



In [ ]:

    
# %load _solutions/case4_air_quality_analysis37.py



In [ ]:

    
# %load _solutions/case4_air_quality_analysis38.py