Case study: air quality data of European monitoring stations (AirBase)
AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe.
DS Data manipulation, analysis and visualisation in Python
December 2019. © 2016, Joris Van den Bossche and Stijn Van Hoey (jorisvandenbossche@gmail.com, stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons.
AirBase is the European air quality database maintained by the European Environment Agency (EEA). It contains air quality monitoring data and information submitted by participating countries throughout Europe. The air quality database consists of a multi-annual time series of air quality measurement data and statistics for a number of air pollutants.
Some of the data files that are available from AirBase are included in the data folder: the hourly concentrations of nitrogen dioxide (NO2) for 4 different measurement stations.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_rows = 8
plt.style.use("seaborn-whitegrid")
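Note: on more recent matplotlib versions (3.6 and later) the built-in seaborn styles were renamed, so the line above may fail there. In that case, use the renamed style:
plt.style.use("seaborn-v0_8-whitegrid")  # renamed equivalent of "seaborn-whitegrid"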
In [2]:
with open("../data/BETR8010000800100hour.1-1-1990.31-12-2012") as f:
    print(f.readline())
This is not a standard CSV format, so we will need to do some manual processing.
Just reading the tab-delimited data:
In [3]:
data = pd.read_csv("../data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t')
In [4]:
data.head()
Out[4]:
The above data is clearly not ready to be used! Each row contains the 24 measurements for each hour of the day, each followed by a flag (0/1) indicating the quality of the measurement. Furthermore, there is no header row with column names.
In [5]:
# Column names: 'date' followed by the 24 hour-of-day columns, each interleaved with its 'flag' column
hours = ["{:02d}".format(i) for i in range(24)]
column_names = ['date'] + [item for pair in zip(hours, ['flag' + str(i) for i in range(24)]) for item in pair]
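To check the interleaving, we can inspect the first few entries; the values shown follow directly from the comprehension above:
column_names[:5]  # ['date', '00', 'flag0', '01', 'flag1']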
In [6]:
# -999 and -9999 are used in the raw files as missing-value sentinels
data = pd.read_csv("../data/BETR8010000800100hour.1-1-1990.31-12-2012",
                   sep='\t', header=None, names=column_names, na_values=[-999, -9999])
In [7]:
data.head()
Out[7]:
For the sake of this tutorial, we will disregard the 'flag' columns (indicating the quality of the data).
EXERCISE:
Drop all 'flag' columns ('flag1', 'flag2', ...)
In [8]:
flag_columns = [col for col in data.columns if 'flag' in col]
# we can now use this list to drop these columns
In [9]:
data = data.drop(flag_columns, axis=1)
In [10]:
data.head()
Out[10]:
Now, we want to reshape it: our goal is to have the different hours as row indices, merged with the date into a datetime index. The dataframe is currently in a wide format (one column per hour of the day); we want to turn it into a long, narrow timeseries that looks like this:
                     BETR801
1990-01-02 09:00:00     48.0
1990-01-02 12:00:00     48.0
1990-01-02 13:00:00     50.0
1990-01-02 14:00:00     55.0
...                      ...
2012-12-31 20:00:00     16.5
2012-12-31 21:00:00     14.5
2012-12-31 22:00:00     16.5
2012-12-31 23:00:00     15.0

170794 rows × 1 columns
NOTE: This is an advanced exercise. Do not spend too much time on it and don't hesitate to look at the solutions.
Reshaping using melt:
In [11]:
data_stacked = pd.melt(data, id_vars=['date'], var_name='hour')
data_stacked.head()
Out[11]:
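As an illustration of what melt does here: it keeps the id_vars columns and moves the remaining column labels into a new 'hour' column. A minimal sketch on a toy dataframe (names and values are made up for the example):

toy = pd.DataFrame({'date': ['1990-01-02'], '00': [48.0], '01': [50.0]})
pd.melt(toy, id_vars=['date'], var_name='hour')
#          date hour  value
# 0  1990-01-02   00   48.0
# 1  1990-01-02   01   50.0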
Reshaping using stack:
In [12]:
# We use stack to move the hours (the column labels) into the row index.
# But we don't want to move the 'date' column label, therefore we first set it as the index.
# You can check the difference with "data.stack()"
data_stacked = data.set_index('date').stack()
data_stacked.head()
Out[12]:
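To see the effect of stack in isolation, a minimal sketch on a toy dataframe (illustrative values): the column labels move into a second index level, giving a Series with a (date, hour) MultiIndex.

toy = pd.DataFrame({'00': [48.0], '01': [50.0]}, index=['1990-01-02'])
toy.stack()
# 1990-01-02  00    48.0
#             01    50.0
# dtype: float64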
In [13]:
# We reset the index to have the date and hours available as columns
data_stacked = data_stacked.reset_index()
data_stacked = data_stacked.rename(columns={'level_1': 'hour'})
data_stacked.head()
Out[13]:
Combine date and hour:
In [14]:
# Now we combine the dates and the hours into a datetime, and set this as the index
data_stacked.index = pd.to_datetime(data_stacked['date'] + data_stacked['hour'], format="%Y-%m-%d%H")
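The format string "%Y-%m-%d%H" matches the concatenation of a date string like '1990-01-02' with a zero-padded hour like '09'. For example:

pd.to_datetime('1990-01-02' + '09', format="%Y-%m-%d%H")
# Timestamp('1990-01-02 09:00:00')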
In [15]:
# Drop the original date and hour columns
data_stacked = data_stacked.drop(['date', 'hour'], axis=1)
data_stacked.head()
Out[15]:
In [16]:
# rename the remaining column to the name of the measurement station
# (the column is named 0 when stack was used, or 'value' when melt was used)
data_stacked = data_stacked.rename(columns={0: 'BETR801'})
In [17]:
data_stacked.head()
Out[17]:
Our final data is now a time series. In pandas, this means that the index is a DatetimeIndex:
In [18]:
data_stacked.index
Out[18]:
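A DatetimeIndex enables convenient time-based selection and resampling; for example (a sketch on the data from above):

data_stacked.loc['2012-01']        # all measurements of January 2012
data_stacked.resample('M').mean()  # monthly averages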
In [19]:
data_stacked.plot()
Out[19]:
We have now seen the code steps to process one of the files. However, there are multiple files for the different stations, all with the same structure. Therefore, to avoid repeating the code, let's turn the steps above into a function.
EXERCISE:
Write a function read_airbase_file(filename, station) that uses the above steps to read in and process the data, and returns the processed timeseries.
In [20]:
def read_airbase_file(filename, station):
    """
    Read hourly AirBase data files.

    Parameters
    ----------
    filename : string
        Path to the data file.
    station : string
        Name of the station.

    Returns
    -------
    DataFrame
        Processed dataframe.
    """
    ...
    return ...
In [21]:
def read_airbase_file(filename, station):
    """
    Read hourly AirBase data files.

    Parameters
    ----------
    filename : string
        Path to the data file.
    station : string
        Name of the station.

    Returns
    -------
    DataFrame
        Processed dataframe.
    """
    # construct the column names
    hours = ["{:02d}".format(i) for i in range(24)]
    flags = ['flag' + str(i) for i in range(24)]
    colnames = ['date'] + [item for pair in zip(hours, flags) for item in pair]

    # read the actual data
    data = pd.read_csv(filename, sep='\t', header=None, na_values=[-999, -9999], names=colnames)

    # drop the 'flag' columns
    data = data.drop([col for col in data.columns if 'flag' in col], axis=1)

    # reshape
    data = data.set_index('date')
    data_stacked = data.stack()
    data_stacked = data_stacked.reset_index()

    # parse to datetime and remove redundant columns
    data_stacked.index = pd.to_datetime(data_stacked['date'] + data_stacked['level_1'], format="%Y-%m-%d%H")
    data_stacked = data_stacked.drop(['date', 'level_1'], axis=1)
    data_stacked = data_stacked.rename(columns={0: station})

    return data_stacked
Test the function on the data file from above:
In [22]:
import os
In [23]:
filename = "../data/BETR8010000800100hour.1-1-1990.31-12-2012"
# the first 7 characters of the file name encode the station, e.g. 'BETR801'
station = os.path.split(filename)[-1][:7]
In [24]:
station
Out[24]:
In [25]:
test = read_airbase_file(filename, station)
test.head()
Out[25]:
We now want to use this function to read in all the different data files from AirBase and combine them into one DataFrame.
EXERCISE:
Use the glob.glob function to list all 4 AirBase data files that are included in the 'data' directory, and call the result data_files.
In [26]:
import glob
In [27]:
data_files = glob.glob("../data/*0008001*")
data_files
Out[27]:
EXERCISE:
Loop over the data_files, read and process each file using the function defined above, and combine the resulting dataframes (one per station) into a single dataframe called combined_data.
In [28]:
dfs = []
for filename in data_files:
    station = os.path.split(filename)[-1][:7]  # more robust than splitting on "/" (also works on Windows)
    df = read_airbase_file(filename, station)
    dfs.append(df)
In [29]:
combined_data = pd.concat(dfs, axis=1)
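pd.concat with axis=1 aligns the per-station dataframes on their datetime index, so each station becomes a separate column; timestamps that are missing for a station become NaN. A toy illustration (made-up values):

a = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'B': [3, 4]}, index=[1, 2])
pd.concat([a, b], axis=1)
#      A    B
# 0  1.0  NaN
# 1  2.0  3.0
# 2  NaN  4.0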
In [30]:
combined_data.head()
Out[30]:
Finally, we don't want to repeat this processing each time we use the data. Therefore, let's save the processed data to a CSV file.
In [31]:
# let's first give the index a descriptive name
combined_data.index.name = 'datetime'
In [32]:
combined_data.to_csv("../data/airbase_data_processed.csv")
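To load the processed file back in later, a sketch (assuming the file written above):

combined_data = pd.read_csv("../data/airbase_data_processed.csv",
                            index_col='datetime', parse_dates=True)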