Case study: air quality data of European monitoring stations (AirBase)
AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe.
DS Data manipulation, analysis and visualisation in Python
December, 2019
© 2016, Joris Van den Bossche and Stijn Van Hoey (jorisvandenbossche@gmail.com, stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons.
In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotnine as pn
pd.options.display.max_rows = 8
In the previous notebook, we processed some raw data files of the AirBase air quality data. As a reminder, the data contains hourly concentrations of nitrogen dioxide (NO2) for 4 different measurement stations:
We processed the individual data files in the previous notebook and saved the result to a csv file, ../data/airbase_data_processed.csv. Let's import the file here (if you didn't finish the previous notebook, a version of the dataset is also available in ../data/airbase_data.csv):
In [ ]:
alldata = pd.read_csv('../data/airbase_data.csv', index_col=0, parse_dates=True)
We only use the data from 1999 onwards:
In [ ]:
data = alldata['1999':].copy()
Some first exploration with the typical functions:
In [ ]:
data.head() # tail()
In [ ]:
data.info()
In [ ]:
data.describe(percentiles=[0.1, 0.5, 0.9])
In [ ]:
data.plot(figsize=(12,6))
Plot only a subset
Why not just use the head/tail possibilities?
In [ ]:
data.tail(500).plot(figsize=(12,6))
Summary figures
Use summary statistics...
In [ ]:
data.plot(kind='box', ylim=[0,250])
The same holds for the seaborn plotting functions: just start with a subset of the data to get a first impression...
As we have already seen, the plotting library seaborn provides some high-level plotting functions on top of matplotlib (check the docs!). One of those functions is pairplot, which we can use here to quickly visualize the concentrations at the different stations and their relation:
In [ ]:
import seaborn as sns
In [ ]:
sns.pairplot(data.tail(5000).dropna())
In [ ]:
data.head()
In principle this is not a tidy dataset. The variable that was measured is the NO2 concentration, and it is spread over 4 columns. Of course those measurements were made at different stations, so one could interpret them as separate variables. But in any case, such a format typically does not work well with plotnine, which expects a pure tidy format.
Reason to not use a tidy dataset here:
data_tidy, ensuring the result has new columns 'station' and 'no2'.
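As an illustration of what such a reshaping step could look like, here is a minimal sketch on a small synthetic table. The station names `STATION_A` and `STATION_B`, the values, and the index name `datetime` are made up for the example; only the target column names 'station' and 'no2' come from the text above. It uses `melt`, which is one possible way to get from one column per station to one row per measurement.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wide table: one column per station.
# Station names and values are hypothetical.
index = pd.date_range("2012-01-01", periods=4, freq="h", name="datetime")
wide = pd.DataFrame(
    {
        "STATION_A": [10.0, 12.0, np.nan, 9.0],
        "STATION_B": [20.0, np.nan, 18.0, 21.0],
    },
    index=index,
)

# Move the datetime index into a column, then stack the station
# columns into two columns: the station name and the measured value.
data_tidy = wide.reset_index().melt(
    id_vars="datetime", var_name="station", value_name="no2"
)

# Count the missing measurements and drop those rows.
n_missing = data_tidy["no2"].isna().sum()
data_tidy = data_tidy.dropna(subset=["no2"])

print(n_missing)
print(data_tidy)
```

Each row of the result now holds a single observation, which is the layout plotnine expects.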
In [ ]:
# %load _solutions/case4_air_quality_analysis1.py
In [ ]:
# %load _solutions/case4_air_quality_analysis2.py
In [ ]:
# %load _solutions/case4_air_quality_analysis3.py
In the following exercises we will mostly do our analysis on data and often use pandas (or seaborn) plotting, but once we have produced some kind of summary dataframe as the result of an analysis, it becomes more interesting to convert that result to a tidy format to be able to use the more advanced plotting functionality of plotnine.
resample
In [ ]:
# %load _solutions/case4_air_quality_analysis4.py
In [ ]:
# %load _solutions/case4_air_quality_analysis5.py
In [ ]:
# %load _solutions/case4_air_quality_analysis6.py
In [ ]:
# %load _solutions/case4_air_quality_analysis7.py
axhline).
In [ ]:
# %load _solutions/case4_air_quality_analysis8.py
linewidth=4 and linestyle='--')
In [ ]:
# %load _solutions/case4_air_quality_analysis9.py
> data.groupby(data.index.year).mean().index
Results in:
Int64Index([1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
2012],
dtype='int64')
> data.resample('A').mean().index
Results in:
DatetimeIndex(['1999-12-31', '2000-12-31', '2001-12-31', '2002-12-31',
'2003-12-31', '2004-12-31', '2005-12-31', '2006-12-31',
'2007-12-31', '2008-12-31', '2009-12-31', '2010-12-31',
'2011-12-31', '2012-12-31'],
dtype='datetime64[ns]', freq='A-DEC')
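The difference between the two index types can be reproduced on a small synthetic series. The sketch below is only an illustration: the column name `STATION_A` and the values are made up, and it uses the year-start alias `"YS"` instead of `'A'` (an assumption made for portability, since the `'A'` alias is deprecated in recent pandas releases), so the resampled labels fall on January 1st rather than December 31st.

```python
import numpy as np
import pandas as pd

# Three years of synthetic daily values (hypothetical data).
index = pd.date_range("1999-01-01", "2001-12-31", freq="D")
ts = pd.DataFrame({"STATION_A": np.arange(len(index), dtype=float)}, index=index)

# groupby on the year: the result is indexed by plain integers.
by_year = ts.groupby(ts.index.year).mean()

# resample: the result keeps a DatetimeIndex, one label per year.
resampled = ts.resample("YS").mean()

print(by_year.index)
print(resampled.index)
```

The yearly means are identical in both cases; only the index differs. Keeping a DatetimeIndex means the result is still a time series, while the integer index is simply a set of group labels.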
In [ ]:
# %load _solutions/case4_air_quality_analysis10.py
In [ ]:
data = data.drop("month", axis=1)
Note: Technically, we could reshape the result of the groupby operation to a tidy format (we no longer have a real time series), but since the lines we want to plot are already in separate columns, doing .plot already does what we want.
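To make the note concrete, here is a minimal sketch of a groupby on a temporary 'month' column, using synthetic data. The station names and values are hypothetical, and this is only one possible way to build such a profile. The result has one row per month and one column per station, so calling `.plot()` on it would draw one line per station without any reshaping.

```python
import numpy as np
import pandas as pd

# Two years of synthetic hourly values for two hypothetical stations.
index = pd.date_range("2010-01-01", "2011-12-31 23:00", freq="h")
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "STATION_A": rng.uniform(0, 100, len(index)),
        "STATION_B": rng.uniform(0, 60, len(index)),
    },
    index=index,
)

# Add a helper column, average per calendar month over all years,
# and remove the helper column again afterwards.
demo["month"] = demo.index.month
monthly_profile = demo.groupby("month").mean()
demo = demo.drop("month", axis=1)

print(monthly_profile)
```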
In [ ]:
# %load _solutions/case4_air_quality_analysis11.py
In [ ]:
# %load _solutions/case4_air_quality_analysis12.py
In [ ]:
# %load _solutions/case4_air_quality_analysis13.py
In [ ]:
# %load _solutions/case4_air_quality_analysis14.py
In [ ]:
# %load _solutions/case4_air_quality_analysis15.py
In [ ]:
# %load _solutions/case4_air_quality_analysis16.py
In [ ]:
# %load _solutions/case4_air_quality_analysis17.py
In [ ]:
# %load _solutions/case4_air_quality_analysis18.py
In [ ]:
# %load _solutions/case4_air_quality_analysis19.py
In [ ]:
data = data.drop(['hour', 'weekend'], axis=1)
%psearch)
In [ ]:
# %load _solutions/case4_air_quality_analysis20.py
exceedances, (with boolean values) indicating if the threshold is exceeded or not
ax.axhline
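A boolean DataFrame of this kind can be counted per year because True values sum as 1. The sketch below illustrates the idea on synthetic data. The station names, the random values, and the numbers `threshold = 200` and `max_allowed = 18` are assumptions chosen for the example, not values taken from this notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs as a script
import numpy as np
import pandas as pd

# Two years of synthetic hourly concentrations (hypothetical stations).
index = pd.date_range("2005-01-01", "2006-12-31 23:00", freq="h")
rng = np.random.default_rng(42)
demo = pd.DataFrame(
    {
        "STATION_A": rng.gamma(2.0, 30.0, len(index)),
        "STATION_B": rng.gamma(2.0, 25.0, len(index)),
    },
    index=index,
)

threshold = 200   # illustrative hourly limit (assumption)
max_allowed = 18  # illustrative number of allowed exceedances per year (assumption)

# Boolean DataFrame: True where the threshold is exceeded.
exceedances = demo > threshold

# Summing booleans counts the True values, here per year and station.
counts = exceedances.groupby(exceedances.index.year).sum()

# Bar plot of the counts with a horizontal reference line.
ax = counts.plot(kind="bar")
ax.axhline(max_allowed, color="k", linestyle="--")

print(counts)
```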
In [ ]:
# %load _solutions/case4_air_quality_analysis21.py
In [ ]:
# %load _solutions/case4_air_quality_analysis22.py
In [ ]:
# %load _solutions/case4_air_quality_analysis23.py
In [ ]:
data = alldata['1999':].copy()
NaN or zero values
FR_scaled (Hint: check wikipedia)
0.3. You will need the documentation of np.searchsorted and matplotlib's axvline
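The role of `np.searchsorted` and `axvline` can be sketched on synthetic data. The example below assumes that "scaling" means min-max scaling to the range [0, 1], which is an interpretation and not something the text confirms. The values are random, and the variable `scaled` plays the role of `FR_scaled`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs as a script
import numpy as np
import pandas as pd

# Synthetic concentrations with some NaN and zero values mixed in.
rng = np.random.default_rng(1)
values = pd.Series(rng.gamma(2.0, 20.0, 1000))
values.iloc[::50] = np.nan
values.iloc[1::50] = 0.0

# Remove NaN and zero values, then sort from low to high.
cleaned = values.dropna()
cleaned = cleaned[cleaned != 0]
sorted_values = cleaned.sort_values()

# Min-max scaling to [0, 1] (assumed meaning of "scaled").
scaled = (sorted_values - sorted_values.min()) / (
    sorted_values.max() - sorted_values.min()
)

# searchsorted needs sorted input; it returns the first position
# at which the scaled values reach 0.3.
position = np.searchsorted(scaled.values, 0.3)

# Plot against the position in the sorted order (not the original index)
# and mark that position with a vertical line.
ax = scaled.reset_index(drop=True).plot()
ax.axvline(x=position, color="0.5", linewidth=2)

print(position)
```

Because the values are sorted, `np.searchsorted` can find the position with a binary search instead of scanning the whole array.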
In [ ]:
# %load _solutions/case4_air_quality_analysis24.py
In [ ]:
# %load _solutions/case4_air_quality_analysis25.py
In [ ]:
# %load _solutions/case4_air_quality_analysis26.py
In [ ]:
# %load _solutions/case4_air_quality_analysis27.py
In [ ]:
# %load _solutions/case4_air_quality_analysis28.py
subset
subset which defines for each data point the day of the week
subset DataFrame, select only Monday (= day 0) and Sunday (= day 6) and remove the others (so, keep this as variable subset)
subset according to the following mapping: {0:"Monday", 6:"Sunday"}
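The weekday steps above can be sketched as follows on synthetic data. The station name `STATION_A`, the chosen month, and the column name `weekday` are assumptions made for the example; the mapping is the one given in the text.

```python
import numpy as np
import pandas as pd

# One month of synthetic hourly values for a hypothetical station.
index = pd.date_range("2011-06-01", "2011-06-30 23:00", freq="h")
rng = np.random.default_rng(3)
subset = pd.DataFrame({"STATION_A": rng.uniform(0, 100, len(index))}, index=index)

# Day of the week for each data point (Monday = 0, Sunday = 6).
subset["weekday"] = subset.index.weekday

# Keep only Mondays and Sundays.
subset = subset[subset["weekday"].isin([0, 6])].copy()

# Replace the day numbers by their names.
subset["weekday"] = subset["weekday"].map({0: "Monday", 6: "Sunday"})

print(subset["weekday"].value_counts())
```

The `.copy()` after the boolean selection avoids a chained-assignment warning when the column is overwritten in the next step.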
In [ ]:
# %load _solutions/case4_air_quality_analysis29.py
In [ ]:
# %load _solutions/case4_air_quality_analysis30.py
In [ ]:
# %load _solutions/case4_air_quality_analysis31.py
In [ ]:
# %load _solutions/case4_air_quality_analysis32.py
In [ ]:
# %load _solutions/case4_air_quality_analysis33.py
Calculating daily means and adding weekday information:
In [ ]:
# %load _solutions/case4_air_quality_analysis34.py
In [ ]:
# %load _solutions/case4_air_quality_analysis35.py
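For reference, daily means with weekday information could be computed along the following lines. This is a minimal sketch on synthetic data, not necessarily the approach used in the solutions; the station name and values are made up.

```python
import numpy as np
import pandas as pd

# Two weeks of synthetic hourly values, starting on a Monday.
index = pd.date_range("2012-01-02", periods=14 * 24, freq="h")
hourly = pd.DataFrame({"STATION_A": np.arange(len(index)) % 24.0}, index=index)

# Daily means: resample keeps a DatetimeIndex with one row per day.
daily = hourly.resample("D").mean()

# Add the day of the week (Monday = 0, Sunday = 6) as a column.
daily["weekday"] = daily.index.weekday

print(daily.head(7))
```

With the weekday stored as a regular column, the result can be grouped by weekday or reshaped to a tidy format for plotting.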
Plotting with plotnine:
In [ ]:
# %load _solutions/case4_air_quality_analysis36.py
Reshaping and plotting with pandas:
In [ ]:
# %load _solutions/case4_air_quality_analysis37.py
In [ ]:
# %load _solutions/case4_air_quality_analysis38.py