Case study: air quality data of European monitoring stations (AirBase)
AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe.
DS Data manipulation, analysis and visualisation in Python
December 2019. © 2016, Joris Van den Bossche and Stijn Van Hoey (jorisvandenbossche@gmail.com, stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons.
AirBase is the European air quality database maintained by the European Environment Agency (EEA). It contains air quality monitoring data and information submitted by participating countries throughout Europe. The air quality database consists of a multi-annual time series of air quality measurement data and statistics for a number of air pollutants.
Some of the data files that are available from AirBase are included in the data folder: the hourly concentrations of nitrogen dioxide (NO2) for 4 different measurement stations.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_rows = 8
plt.style.use("seaborn-whitegrid")
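Note: on more recent matplotlib versions (3.6 and later) the built-in seaborn styles were renamed, so the line above may fail there. In that case, use the renamed style:
plt.style.use("seaborn-v0_8-whitegrid")  # renamed equivalent of "seaborn-whitegrid"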
In [2]:
with open("../data/BETR8010000800100hour.1-1-1990.31-12-2012") as f:
    print(f.readline())
This is not a standard CSV format, so we will need to do some manual processing.
Just reading the tab-delimited data:
In [3]:
data = pd.read_csv("../data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t')
In [4]:
data.head()
Out[4]:
The above data is clearly not ready to be used! Each row contains the 24 measurements for each hour of the day, each followed by a flag (0/1) indicating the quality of the measurement. Furthermore, there is no header row with column names.
In [5]:
# Column names: 'date' followed by the 24 hour-of-day columns, each interleaved with its 'flag' column
hours = ["{:02d}".format(i) for i in range(24)]
column_names = ['date'] + [item for pair in zip(hours, ['flag' + str(i) for i in range(24)]) for item in pair]
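To check the interleaving, we can inspect the first few entries; the values shown follow directly from the comprehension above:
column_names[:5]  # ['date', '00', 'flag0', '01', 'flag1']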
In [6]:
# -999 and -9999 are used in the raw files as missing-value sentinels
data = pd.read_csv("../data/BETR8010000800100hour.1-1-1990.31-12-2012",
                   sep='\t', header=None, names=column_names, na_values=[-999, -9999])
In [7]:
data.head()
Out[7]:
For the sake of this tutorial, we will disregard the 'flag' columns (indicating the quality of the data).
EXERCISE:
Drop all 'flag' columns ('flag1', 'flag2', ...)
In [8]:
flag_columns = [col for col in data.columns if 'flag' in col]
# we can now use this list to drop these columns
In [9]:
data = data.drop(flag_columns, axis=1)
In [10]:
data.head()
Out[10]:
Now, we want to reshape it: our goal is to have the different hours as row indices, merged with the date into a datetime index. The dataframe is currently in a wide format (one column per hour of the day); we want to turn it into a long, narrow timeseries that looks like this:
                     BETR801
1990-01-02 09:00:00     48.0
1990-01-02 12:00:00     48.0
1990-01-02 13:00:00     50.0
1990-01-02 14:00:00     55.0
...                      ...
2012-12-31 20:00:00     16.5
2012-12-31 21:00:00     14.5
2012-12-31 22:00:00     16.5
2012-12-31 23:00:00     15.0

170794 rows × 1 columns
NOTE: This is an advanced exercise. Do not spend too much time on it and don't hesitate to look at the solutions.
Reshaping using melt:
In [11]:
data_stacked = pd.melt(data, id_vars=['date'], var_name='hour')
data_stacked.head()
Out[11]:
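As an illustration of what melt does here: it keeps the id_vars columns and moves the remaining column labels into a new 'hour' column. A minimal sketch on a toy dataframe (names and values are made up for the example):

toy = pd.DataFrame({'date': ['1990-01-02'], '00': [48.0], '01': [50.0]})
pd.melt(toy, id_vars=['date'], var_name='hour')
#          date hour  value
# 0  1990-01-02   00   48.0
# 1  1990-01-02   01   50.0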
Reshaping using stack:
In [12]:
# We use stack to move the hours (the column labels) into the row index.
# But we don't want to move the 'date' column label, therefore we first set it as the index.
# You can check the difference with "data.stack()"
data_stacked = data.set_index('date').stack()
data_stacked.head()
Out[12]:
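To see the effect of stack in isolation, a minimal sketch on a toy dataframe (illustrative values): the column labels move into a second index level, giving a Series with a (date, hour) MultiIndex.

toy = pd.DataFrame({'00': [48.0], '01': [50.0]}, index=['1990-01-02'])
toy.stack()
# 1990-01-02  00    48.0
#             01    50.0
# dtype: float64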
In [13]:
# We reset the index to have the date and hours available as columns
data_stacked = data_stacked.reset_index()
data_stacked = data_stacked.rename(columns={'level_1': 'hour'})
data_stacked.head()
Out[13]:
Combine date and hour:
In [14]:
# Now we combine the dates and the hours into a datetime, and set this as the index
data_stacked.index = pd.to_datetime(data_stacked['date'] + data_stacked['hour'], format="%Y-%m-%d%H")
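The format string "%Y-%m-%d%H" matches the concatenation of a date string like '1990-01-02' with a zero-padded hour like '09'. For example:

pd.to_datetime('1990-01-02' + '09', format="%Y-%m-%d%H")
# Timestamp('1990-01-02 09:00:00')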
In [15]:
# Drop the original date and hour columns
data_stacked = data_stacked.drop(['date', 'hour'], axis=1)
data_stacked.head()
Out[15]:
In [16]:
# rename the remaining column to the name of the measurement station
# (the column is named 0 when stack was used, or 'value' when melt was used)
data_stacked = data_stacked.rename(columns={0: 'BETR801'})
In [17]:
data_stacked.head()
Out[17]:
Our final data is now a time series. In pandas, this means that the index is a DatetimeIndex:
In [18]:
data_stacked.index
Out[18]:
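A DatetimeIndex enables convenient time-based selection and resampling; for example (a sketch on the data from above):

data_stacked.loc['2012-01']        # all measurements of January 2012
data_stacked.resample('M').mean()  # monthly averages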
In [19]:
data_stacked.plot()
Out[19]:
We have now seen the code steps to process one of the files. However, there are multiple files for the different stations, all with the same structure. Therefore, to avoid repeating the code, let's turn the steps above into a function.
EXERCISE:
Write a function read_airbase_file(filename, station) that uses the above steps to read in and process the data, and returns the processed timeseries.
In [20]:
def read_airbase_file(filename, station):
    """
    Read hourly AirBase data files.

    Parameters
    ----------
    filename : string
        Path to the data file.
    station : string
        Name of the station.

    Returns
    -------
    DataFrame
        Processed dataframe.
    """
    ...
    return ...
In [21]:
def read_airbase_file(filename, station):
    """
    Read hourly AirBase data files.

    Parameters
    ----------
    filename : string
        Path to the data file.
    station : string
        Name of the station.

    Returns
    -------
    DataFrame
        Processed dataframe.
    """
    # construct the column names
    hours = ["{:02d}".format(i) for i in range(24)]
    flags = ['flag' + str(i) for i in range(24)]
    colnames = ['date'] + [item for pair in zip(hours, flags) for item in pair]

    # read the actual data
    data = pd.read_csv(filename, sep='\t', header=None, na_values=[-999, -9999], names=colnames)

    # drop the 'flag' columns
    data = data.drop([col for col in data.columns if 'flag' in col], axis=1)

    # reshape
    data = data.set_index('date')
    data_stacked = data.stack()
    data_stacked = data_stacked.reset_index()

    # parse to datetime and remove redundant columns
    data_stacked.index = pd.to_datetime(data_stacked['date'] + data_stacked['level_1'], format="%Y-%m-%d%H")
    data_stacked = data_stacked.drop(['date', 'level_1'], axis=1)
    data_stacked = data_stacked.rename(columns={0: station})

    return data_stacked
Test the function on the data file from above:
In [22]:
import os
In [23]:
filename = "../data/BETR8010000800100hour.1-1-1990.31-12-2012"
# the first 7 characters of the file name encode the station, e.g. 'BETR801'
station = os.path.split(filename)[-1][:7]
In [24]:
station
Out[24]:
In [25]:
test = read_airbase_file(filename, station)
test.head()
Out[25]:
We now want to use this function to read in all the different data files from AirBase and combine them into one DataFrame.
EXERCISE:
Use the glob.glob function to list all 4 AirBase data files that are included in the 'data' directory, and call the result data_files.
In [26]:
import glob
In [27]:
data_files = glob.glob("../data/*0008001*")
data_files
Out[27]:
EXERCISE:
Loop over the data_files, read and process each file using the function defined above, and combine the resulting dataframes (one per station) into a single dataframe called combined_data.
In [28]:
dfs = []
for filename in data_files:
    station = os.path.split(filename)[-1][:7]  # more robust than splitting on "/" (also works on Windows)
    df = read_airbase_file(filename, station)
    dfs.append(df)
In [29]:
combined_data = pd.concat(dfs, axis=1)
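pd.concat with axis=1 aligns the per-station dataframes on their datetime index, so each station becomes a separate column; timestamps that are missing for a station become NaN. A toy illustration (made-up values):

a = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'B': [3, 4]}, index=[1, 2])
pd.concat([a, b], axis=1)
#      A    B
# 0  1.0  NaN
# 1  2.0  3.0
# 2  NaN  4.0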
In [30]:
combined_data.head()
Out[30]:
Finally, we don't want to repeat this processing each time we use the data. Therefore, let's save the processed data to a CSV file.
In [31]:
# let's first give the index a descriptive name
combined_data.index.name = 'datetime'
In [32]:
combined_data.to_csv("../data/airbase_data_processed.csv")
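To load the processed file back in later, a sketch (assuming the file written above):

combined_data = pd.read_csv("../data/airbase_data_processed.csv",
                            index_col='datetime', parse_dates=True)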