In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_rows = 8
AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe.
I downloaded and preprocessed some of the data (python-airbase): data/airbase_data.csv. This file includes the hourly concentrations of NO2 for 4 different measurement stations:
Import the csv file:
In [2]:
!head -5 data/airbase_data.csv
As you can see, the missing values are indicated by -9999. read_csv can recognize these if we pass the na_values keyword:
In [3]:
data = pd.read_csv('data/airbase_data.csv', index_col=0, parse_dates=True, na_values=[-9999])
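To see what na_values changes, here is a minimal sketch on an in-memory CSV. The station names and values are made up for illustration and are not taken from data/airbase_data.csv.

```python
import io

import pandas as pd

# Hypothetical miniature of the CSV file (made-up station names and values).
csv_text = """datetime,STATION_A,STATION_B
2012-01-01 00:00:00,21.0,-9999
2012-01-01 01:00:00,-9999,18.5
2012-01-01 02:00:00,19.0,17.0
"""

# Without na_values the sentinel is read as an ordinary number.
raw = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# With na_values the sentinel becomes NaN and is skipped by aggregations.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True,
                    na_values=[-9999])

print(raw.mean())    # dominated by the -9999 sentinel
print(clean.mean())  # computed from the valid measurements only
```

Without the keyword, every statistic and plot would be distorted by the sentinel value.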
In [4]:
data.head(3)
Out[4]:
In [5]:
data.tail()
Out[5]:
In [6]:
data.plot(figsize=(12,6))
Out[6]:
This does not tell us much.
We can select part of the data (e.g. the latest 500 data points):
In [7]:
data[-500:].plot(figsize=(12,6))
Out[7]:
Or we can use some more advanced time series features -> next section!
When we ensure the DataFrame has a DatetimeIndex, time-series related functionality becomes available:
In [8]:
data.index
Out[8]:
Indexing a time series works with strings:
In [9]:
data["2010-01-01 09:00": "2010-01-01 12:00"]
Out[9]:
A nice feature is "partial string" indexing, where we can do implicit slicing by providing a partial datetime string.
E.g. all data of 2012:
In [10]:
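data.loc['2012']
Out[10]:
In older pandas versions this could also be written as data['2012']: normally that accesses a column named '2012', but for a DatetimeIndex pandas also tried to interpret it as a datetime slice. That shortcut was removed in pandas 2.0, so .loc is used here.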
Or all data of January up to March 2012:
In [11]:
data['2012-01':'2012-03']
Out[11]:
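The behaviour of partial string indexing is easiest to check on a small synthetic frame. The sketch below uses made-up daily data (not the AirBase file) and shows that both ends of a partial-string slice are included.

```python
import numpy as np
import pandas as pd

# Synthetic daily data spanning the end of 2011 and the start of 2012.
idx = pd.date_range("2011-12-30", "2012-04-02", freq="D")
ts = pd.DataFrame({"value": np.arange(len(idx))}, index=idx)

# A partial string selects every row that falls inside that period.
year = ts.loc["2012"]

# A slice of partial strings includes the whole end month as well.
q1 = ts.loc["2012-01":"2012-03"]

print(year.index.min(), year.index.max())
print(q1.index.min(), q1.index.max())
```

Note that the slice ends on the last day of March, not on its first day: the end label '2012-03' stands for the complete month.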
Time and date components can be accessed from the index:
In [12]:
data.index.hour
Out[12]:
In [13]:
data.index.year
Out[13]:
In [17]:
data[(data.index.hour >= 8) & (data.index.hour < 20)]
Out[17]:
In [18]:
data.between_time('08:00', '20:00')
Out[18]:
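The two selections above are close but not identical. As a small sketch on synthetic hourly data (made-up values), the boolean mask excludes the rows at 20:00, while between_time includes both end points by default:

```python
import numpy as np
import pandas as pd

# Two days of synthetic hourly data
# ('60min' is used so the alias works across pandas versions).
idx = pd.date_range("2012-01-01", periods=48, freq="60min")
ts = pd.DataFrame({"value": np.arange(48)}, index=idx)

# Boolean mask on the hour component: 08:00 up to and including 19:00.
masked = ts[(ts.index.hour >= 8) & (ts.index.hour < 20)]

# between_time: 08:00 up to and including 20:00.
between = ts.between_time("08:00", "20:00")

print(len(masked), len(between))
```

If the half-open interval is what you want, between_time('08:00', '19:59') gives the same rows as the mask for data on whole hours.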
A very powerful method is resample: converting the frequency of the time series (e.g. from hourly to daily data).
The time series has a frequency of 1 hour. I want to change this to daily:
In [19]:
data.resample('D').mean().head()
Out[19]:
Here the mean is used as aggregation function (in old pandas versions it was the implicit default), but other methods can also be specified:
In [20]:
data.resample('D').max().head()
Out[20]:
The strings that specify the new time frequency are listed at: http://pandas.pydata.org/pandas-docs/dev/timeseries.html#offset-aliases
These strings can also be combined with numbers, e.g. '10D'.
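As a small illustration of frequency strings with and without a number, the sketch below resamples 30 days of synthetic hourly data (made-up values, not the AirBase measurements) to daily means and to 10-day maxima.

```python
import numpy as np
import pandas as pd

# 30 days of synthetic hourly values: 0, 1, 2, ...
idx = pd.date_range("2012-01-01", periods=30 * 24, freq="60min")
ts = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# One value per day: the mean of the 24 hourly values.
daily_mean = ts.resample("D").mean()

# One value per block of 10 days: the maximum within each block.
ten_day_max = ts.resample("10D").max()

print(daily_mean.head())
print(ten_day_max)
```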
Further exploring the data:
In [21]:
data.resample('M').mean().plot()  # or 'A' for annual means (pandas >= 2.2 renames these to 'ME' and 'YE')
Out[21]:
In [22]:
# data.loc['2012'].resample('D').mean().plot()
resample can actually be seen as a specific kind of groupby. E.g. taking annual means with data.resample('A').mean() is equivalent to data.groupby(data.index.year).mean() (only the result of resample still has a DatetimeIndex).
In [26]:
data.groupby(data.index.year).mean().plot()
Out[26]:
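The equivalence between the two approaches can be checked on synthetic data. This sketch uses made-up random values and passes the offset object pd.offsets.YearEnd() instead of a string alias, an assumption made here only because the annual alias differs between pandas versions.

```python
import numpy as np
import pandas as pd

# Two years of synthetic daily data.
idx = pd.date_range("2011-01-01", "2012-12-31", freq="D")
rng = np.random.default_rng(0)
ts = pd.DataFrame({"value": rng.normal(size=len(idx))}, index=idx)

# Annual means via resample: the result keeps a DatetimeIndex.
by_resample = ts.resample(pd.offsets.YearEnd()).mean()

# Annual means via groupby: the result is indexed by the integer year.
by_groupby = ts.groupby(ts.index.year).mean()

print(by_resample.index)
print(by_groupby.index)

# The values agree; only the index differs.
same = np.allclose(by_resample["value"].to_numpy(),
                   by_groupby["value"].to_numpy())
print(same)
```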
But groupby is more flexible and can also do resamples that do not result in a new continuous time series, e.g. by grouping by the hour of the day to get the diurnal cycle.
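A minimal sketch of such a diurnal cycle, using a synthetic signal with a known daily pattern (the sine shape and the numbers are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# One week of synthetic hourly data with a daily sine pattern.
idx = pd.date_range("2012-01-01", periods=7 * 24, freq="60min")
hours = idx.hour.to_numpy()
ts = pd.DataFrame({"value": 10 + 5 * np.sin(2 * np.pi * hours / 24)},
                  index=idx)

# Group all measurements by hour of the day and average over the days.
diurnal = ts.groupby(ts.index.hour).mean()

print(diurnal.head())
```

The result has 24 rows, one per hour of the day, so its index is no longer a continuous time axis. Calling diurnal.plot() would draw the average daily profile.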
If you are done, you can try these exercises:
Tip: the boxplot method of a DataFrame expects the data for the different boxes in different columns.
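One possible way to get the data into that shape is to pivot a single series so that each group becomes a column. The sketch below is only an illustration on synthetic data, with hypothetical names (s, frame, wide), and turns one hourly series into a table with one column per hour of the day.

```python
import numpy as np
import pandas as pd

# Two weeks of synthetic hourly values for a single (made-up) station.
idx = pd.date_range("2012-01-01", periods=14 * 24, freq="60min")
rng = np.random.default_rng(1)
s = pd.Series(rng.normal(size=len(idx)), index=idx, name="value")

# Long format: one row per measurement, with its date and hour as columns.
frame = pd.DataFrame({
    "value": s.to_numpy(),
    "date": s.index.date,
    "hour": s.index.hour.to_numpy(),
})

# Wide format: one row per day, one column per hour of the day.
wide = frame.pivot(index="date", columns="hour", values="value")

print(wide.shape)
# wide.boxplot() would now draw one box per hour (requires matplotlib).
```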
© 2015, Stijn Van Hoey and Joris Van den Bossche (mailto:stijnvanhoey@gmail.com, mailto:jorisvandenbossche@gmail.com).
© 2015, modified by Bartosz Teleńczuk (original sources available from https://github.com/jorisvandenbossche/2015-EuroScipy-pandas-tutorial)
Licensed under CC BY 4.0 Creative Commons