04 - Pandas: Working with time series data

DS Data manipulation, analysis and visualisation in Python
December, 2019

© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons



In [1]:
# %matplotlib notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ggplot')

Introduction: datetime module

Standard Python contains the datetime module to handle date and time data:


In [3]:
import datetime

In [4]:
dt = datetime.datetime(year=2016, month=12, day=19, hour=13, minute=30)
dt


Out[4]:
datetime.datetime(2016, 12, 19, 13, 30)

In [5]:
print(dt) # .day,...


2016-12-19 13:30:00

In [6]:
print(dt.strftime("%d %B %Y"))


19 December 2016

Dates and times in pandas

The Timestamp object

Pandas has its own date and time objects, which are compatible with the standard datetime objects but provide additional functionality to work with.

The Timestamp object can also be constructed from a string:


In [7]:
ts = pd.Timestamp('2016-12-19')
ts


Out[7]:
Timestamp('2016-12-19 00:00:00')

Like with datetime.datetime objects, there are several useful attributes available on the Timestamp. For example, we can get the month (experiment with tab completion!):


In [8]:
ts.month


Out[8]:
12

There is also a Timedelta type, which can e.g. be used to add intervals of time:


In [9]:
ts + pd.Timedelta('5 days')


Out[9]:
Timestamp('2016-12-24 00:00:00')

Parsing datetime strings

Unfortunately, when working with real-world data, you will encounter many different datetime formats. Most of the time these come as text, e.g. from a CSV file. To work with such data in pandas, we first have to parse the strings into actual Timestamp objects.

REMEMBER:

To convert string formatted dates to Timestamp objects: use the `pandas.to_datetime` function

In [10]:
pd.to_datetime("2016-12-09")


Out[10]:
Timestamp('2016-12-09 00:00:00')

In [11]:
pd.to_datetime("09/12/2016")


Out[11]:
Timestamp('2016-09-12 00:00:00')

In [12]:
pd.to_datetime("09/12/2016", dayfirst=True)


Out[12]:
Timestamp('2016-12-09 00:00:00')

In [13]:
pd.to_datetime("09/12/2016", format="%d/%m/%Y")


Out[13]:
Timestamp('2016-12-09 00:00:00')

For a detailed overview of how to specify the format string, see the table in the python documentation: https://docs.python.org/3.5/library/datetime.html#strftime-and-strptime-behavior
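
A small sketch of a custom format string in action (the string and format codes below are our own illustration, not from the dataset):

```python
import pandas as pd

# Parse a hypothetical string with an explicit format:
# %d = day, %B = full month name, %Y = year, %H:%M = hour and minute
ts = pd.to_datetime("19 December 2016 13:30", format="%d %B %Y %H:%M")
print(ts)  # 2016-12-19 13:30:00
```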

Timestamp data in a Series or DataFrame column


In [14]:
s = pd.Series(['2016-12-09 10:00:00', '2016-12-09 11:00:00', '2016-12-09 12:00:00'])

In [15]:
s


Out[15]:
0    2016-12-09 10:00:00
1    2016-12-09 11:00:00
2    2016-12-09 12:00:00
dtype: object

The to_datetime function can also be used to convert a full series of strings:


In [16]:
ts = pd.to_datetime(s)

In [17]:
ts


Out[17]:
0   2016-12-09 10:00:00
1   2016-12-09 11:00:00
2   2016-12-09 12:00:00
dtype: datetime64[ns]

Notice the data type of this series has changed: the datetime64[ns] dtype. This indicates that we have a series of actual datetime values.

The same attributes as on single Timestamps are also available on a Series with datetime data, using the .dt accessor:


In [18]:
ts.dt.hour


Out[18]:
0    10
1    11
2    12
dtype: int64

In [19]:
ts.dt.weekday


Out[19]:
0    4
1    4
2    4
dtype: int64

To quickly construct some regular time series data, the pd.date_range function comes in handy:


In [20]:
pd.Series(pd.date_range(start="2016-01-01", periods=10, freq='3H'))


Out[20]:
0   2016-01-01 00:00:00
1   2016-01-01 03:00:00
2   2016-01-01 06:00:00
3   2016-01-01 09:00:00
            ...        
6   2016-01-01 18:00:00
7   2016-01-01 21:00:00
8   2016-01-02 00:00:00
9   2016-01-02 03:00:00
Length: 10, dtype: datetime64[ns]
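
The freq argument accepts other offset aliases as well; a minimal sketch with a 15-minute frequency (values chosen purely for illustration):

```python
import pandas as pd

# Four timestamps, 15 minutes apart, starting at midnight
idx = pd.date_range(start="2016-01-01", periods=4, freq="15min")
print(idx)
```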

Time series data: Timestamp in the index

River discharge example data

For the following demonstration of the time series functionality, we use a sample of discharge data of the Maarkebeek (Flanders) with 3-hourly averaged values, derived from the Waterinfo website.


In [21]:
data = pd.read_csv("../data/vmm_flowdata.csv")

In [22]:
data.head()


Out[22]:
Time L06_347 LS06_347 LS06_348
0 2009-01-01 00:00:00 0.137417 0.097500 0.016833
1 2009-01-01 03:00:00 0.131250 0.088833 0.016417
2 2009-01-01 06:00:00 0.113500 0.091250 0.016750
3 2009-01-01 09:00:00 0.135750 0.091500 0.016250
4 2009-01-01 12:00:00 0.140917 0.096167 0.017000

We already know how to parse a date column with Pandas:


In [23]:
data['Time'] = pd.to_datetime(data['Time'])

With set_index('Time'), we set the column with datetime values as the index, which can be done by both Series and DataFrame.


In [24]:
data = data.set_index("Time")

In [25]:
data


Out[25]:
L06_347 LS06_347 LS06_348
Time
2009-01-01 00:00:00 0.137417 0.097500 0.016833
2009-01-01 03:00:00 0.131250 0.088833 0.016417
2009-01-01 06:00:00 0.113500 0.091250 0.016750
2009-01-01 09:00:00 0.135750 0.091500 0.016250
... ... ... ...
2013-01-01 15:00:00 1.420000 1.420000 0.096333
2013-01-01 18:00:00 1.178583 1.178583 0.083083
2013-01-01 21:00:00 0.898250 0.898250 0.077167
2013-01-02 00:00:00 0.860000 0.860000 0.075000

11697 rows × 3 columns

The steps above are provided as built-in functionality of read_csv:


In [26]:
data = pd.read_csv("../data/vmm_flowdata.csv", index_col=0, parse_dates=True)
REMEMBER:

`pd.read_csv` provides a lot of built-in functionality to support these kinds of transformations when reading in a file! Check the help of the read_csv function...
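
A self-contained sketch of the same idea, using a small synthetic CSV that mimics the layout of the flow data file:

```python
import io
import pandas as pd

csv_text = """Time,L06_347
2009-01-01 00:00:00,0.137417
2009-01-01 03:00:00,0.131250
"""
# index_col=0 uses the first column as the index;
# parse_dates=True parses that index into datetime values
df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)
print(df.index.dtype)  # datetime64[ns]
```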

The DatetimeIndex

When we ensure the DataFrame has a DatetimeIndex, time-series related functionality becomes available:


In [27]:
data.index


Out[27]:
DatetimeIndex(['2009-01-01 00:00:00', '2009-01-01 03:00:00',
               '2009-01-01 06:00:00', '2009-01-01 09:00:00',
               '2009-01-01 12:00:00', '2009-01-01 15:00:00',
               '2009-01-01 18:00:00', '2009-01-01 21:00:00',
               '2009-01-02 00:00:00', '2009-01-02 03:00:00',
               ...
               '2012-12-31 21:00:00', '2013-01-01 00:00:00',
               '2013-01-01 03:00:00', '2013-01-01 06:00:00',
               '2013-01-01 09:00:00', '2013-01-01 12:00:00',
               '2013-01-01 15:00:00', '2013-01-01 18:00:00',
               '2013-01-01 21:00:00', '2013-01-02 00:00:00'],
              dtype='datetime64[ns]', name='Time', length=11697, freq=None)

Similar to a Series with datetime data, there are some attributes of the timestamp values available:


In [28]:
data.index.day


Out[28]:
Int64Index([ 1,  1,  1,  1,  1,  1,  1,  1,  2,  2,
            ...
            31,  1,  1,  1,  1,  1,  1,  1,  1,  2],
           dtype='int64', name='Time', length=11697)

In [29]:
data.index.dayofyear


Out[29]:
Int64Index([  1,   1,   1,   1,   1,   1,   1,   1,   2,   2,
            ...
            366,   1,   1,   1,   1,   1,   1,   1,   1,   2],
           dtype='int64', name='Time', length=11697)

In [30]:
data.index.year


Out[30]:
Int64Index([2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009,
            ...
            2012, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013],
           dtype='int64', name='Time', length=11697)

The plot method will also adapt its labels (when you zoom in, you can see the different levels of detail of the datetime labels):


In [31]:
# %matplotlib notebook

In [32]:
data.plot()


Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f34ccc2bbd0>

We have too much data to sensibly plot on one figure. Let's see how we can easily select part of the data or aggregate the data to other time resolutions in the next sections.

Selecting data from a time series

We can use label based indexing on a timeseries as expected:


In [33]:
data[pd.Timestamp("2012-01-01 09:00"):pd.Timestamp("2012-01-01 19:00")]


Out[33]:
L06_347 LS06_347 LS06_348
Time
2012-01-01 09:00:00 0.330750 0.293583 0.029750
2012-01-01 12:00:00 0.295000 0.285167 0.031750
2012-01-01 15:00:00 0.301417 0.287750 0.031417
2012-01-01 18:00:00 0.322083 0.304167 0.038083

But, for convenience, indexing a time series also works with strings:


In [34]:
data["2012-01-01 09:00":"2012-01-01 19:00"]


Out[34]:
L06_347 LS06_347 LS06_348
Time
2012-01-01 09:00:00 0.330750 0.293583 0.029750
2012-01-01 12:00:00 0.295000 0.285167 0.031750
2012-01-01 15:00:00 0.301417 0.287750 0.031417
2012-01-01 18:00:00 0.322083 0.304167 0.038083

A nice feature is "partial string" indexing, where we can do implicit slicing by providing a partial datetime string.

E.g. all data of 2013:


In [35]:
data['2013']


Out[35]:
L06_347 LS06_347 LS06_348
Time
2013-01-01 00:00:00 1.688333 1.688333 0.207333
2013-01-01 03:00:00 2.693333 2.693333 0.201500
2013-01-01 06:00:00 2.220833 2.220833 0.166917
2013-01-01 09:00:00 2.055000 2.055000 0.175667
... ... ... ...
2013-01-01 15:00:00 1.420000 1.420000 0.096333
2013-01-01 18:00:00 1.178583 1.178583 0.083083
2013-01-01 21:00:00 0.898250 0.898250 0.077167
2013-01-02 00:00:00 0.860000 0.860000 0.075000

9 rows × 3 columns

Normally you would expect this to access a column named '2013', but for a DatetimeIndex, pandas also tries to interpret it as a datetime slice.

Or all data of January up to March 2012:


In [36]:
data['2012-01':'2012-03']


Out[36]:
L06_347 LS06_347 LS06_348
Time
2012-01-01 00:00:00 0.307167 0.273917 0.028000
2012-01-01 03:00:00 0.302917 0.270833 0.030583
2012-01-01 06:00:00 0.331500 0.284750 0.030917
2012-01-01 09:00:00 0.330750 0.293583 0.029750
... ... ... ...
2012-03-31 12:00:00 0.098333 0.124417 0.011833
2012-03-31 15:00:00 0.091917 0.123917 0.011500
2012-03-31 18:00:00 0.085750 0.121417 0.011000
2012-03-31 21:00:00 0.068417 0.119750 0.010417

728 rows × 3 columns

EXERCISE:
  • select all data starting from 2012

In [37]:
data['2012':]


Out[37]:
L06_347 LS06_347 LS06_348
Time
2012-01-01 00:00:00 0.307167 0.273917 0.028000
2012-01-01 03:00:00 0.302917 0.270833 0.030583
2012-01-01 06:00:00 0.331500 0.284750 0.030917
2012-01-01 09:00:00 0.330750 0.293583 0.029750
... ... ... ...
2013-01-01 15:00:00 1.420000 1.420000 0.096333
2013-01-01 18:00:00 1.178583 1.178583 0.083083
2013-01-01 21:00:00 0.898250 0.898250 0.077167
2013-01-02 00:00:00 0.860000 0.860000 0.075000

2937 rows × 3 columns

EXERCISE:
  • select all data in January for all different years

In [38]:
data[data.index.month == 1]


Out[38]:
L06_347 LS06_347 LS06_348
Time
2009-01-01 00:00:00 0.137417 0.097500 0.016833
2009-01-01 03:00:00 0.131250 0.088833 0.016417
2009-01-01 06:00:00 0.113500 0.091250 0.016750
2009-01-01 09:00:00 0.135750 0.091500 0.016250
... ... ... ...
2013-01-01 15:00:00 1.420000 1.420000 0.096333
2013-01-01 18:00:00 1.178583 1.178583 0.083083
2013-01-01 21:00:00 0.898250 0.898250 0.077167
2013-01-02 00:00:00 0.860000 0.860000 0.075000

1001 rows × 3 columns

EXERCISE:
  • select all data in April, May and June for all different years

In [39]:
data[data.index.month.isin([4, 5, 6])]


Out[39]:
L06_347 LS06_347 LS06_348
Time
2009-04-01 00:00:00 0.270417 0.233500 0.031250
2009-04-01 03:00:00 0.249917 0.247083 0.030750
2009-04-01 06:00:00 0.278167 0.260083 0.032583
2009-04-01 09:00:00 0.324833 0.301583 0.034250
... ... ... ...
2012-06-30 12:00:00 0.288000 0.061417 0.006333
2012-06-30 15:00:00 0.287583 0.063083 0.004833
2012-06-30 18:00:00 0.287000 0.080917 0.028167
2012-06-30 21:00:00 0.287000 0.087000 0.010000

2912 rows × 3 columns

EXERCISE:
  • select all 'daytime' data (between 8h and 20h) for all days

In [40]:
data[(data.index.hour > 8) & (data.index.hour < 20)]


Out[40]:
L06_347 LS06_347 LS06_348
Time
2009-01-01 09:00:00 0.135750 0.091500 0.016250
2009-01-01 12:00:00 0.140917 0.096167 0.017000
2009-01-01 15:00:00 0.099167 0.091667 0.017583
2009-01-01 18:00:00 0.132667 0.090167 0.016250
... ... ... ...
2013-01-01 09:00:00 2.055000 2.055000 0.175667
2013-01-01 12:00:00 1.710000 1.710000 0.129583
2013-01-01 15:00:00 1.420000 1.420000 0.096333
2013-01-01 18:00:00 1.178583 1.178583 0.083083

5848 rows × 3 columns
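
For time-of-day selections like this, pandas also offers `DataFrame.between_time`; a sketch on synthetic data (both endpoints are inclusive by default):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"value": np.arange(8)},
    index=pd.date_range("2016-01-01", periods=8, freq="3H"),
)
# keeps the rows whose time of day falls between 08:00 and 20:00
daytime = df.between_time("08:00", "20:00")
print(daytime)  # the 09:00, 12:00, 15:00 and 18:00 rows
```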

The power of pandas: resample

A very powerful method is resample: converting the frequency of the time series (e.g. from hourly to daily data).

The time series has a frequency of 3 hours. We want to change this to daily:


In [41]:
data.resample('D').mean().head()


Out[41]:
L06_347 LS06_347 LS06_348
Time
2009-01-01 0.125010 0.092281 0.016635
2009-01-02 0.124146 0.095781 0.016406
2009-01-03 0.113562 0.085542 0.016094
2009-01-04 0.140198 0.102708 0.017323
2009-01-05 0.128812 0.104490 0.018167

Other aggregation methods can also be specified:


In [42]:
data.resample('D').max().head()


Out[42]:
L06_347 LS06_347 LS06_348
Time
2009-01-01 0.140917 0.097500 0.017583
2009-01-02 0.147833 0.101917 0.016833
2009-01-03 0.135833 0.092500 0.016833
2009-01-04 0.160417 0.113750 0.018417
2009-01-05 0.161500 0.115167 0.021583
REMEMBER:

The string to specify the new time frequency: http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases

These strings can also be combined with numbers, e.g. `'10D'`...
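
A minimal sketch of such a combined alias, on a synthetic daily series of ones so the bin sizes are easy to read off:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.ones(30), index=pd.date_range("2016-01-01", periods=30, freq="D"))
# '10D' groups the 30 daily values into three 10-day bins
res = s.resample("10D").sum()
print(res)  # three bins, each summing to 10.0
```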

In [43]:
data.resample('M').mean().plot() # 10D


Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f34cc671910>
EXERCISE:
  • plot the monthly standard deviation of the columns

In [44]:
data.resample('M').std().plot() # 'A'


Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f34cc59d9d0>
EXERCISE:
  • plot the monthly mean and median values for the years 2011-2012 for 'L06_347'

**Note**: remember the `agg` method when using `groupby` to derive multiple statistics at the same time?

In [46]:
subset = data['2011':'2012']['L06_347']
subset.resample('M').agg(['mean', 'median']).plot()


Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f34cc4d0e50>
EXERCISE:
  • plot the monthly minimum and maximum of the daily average values of the 'LS06_348' column

In [47]:
daily = data['LS06_348'].resample('D').mean() # daily averages calculated

In [48]:
daily.resample('M').agg(['min', 'max']).plot() # monthly minimum and maximum values of these daily averages


Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f34cc424250>
EXERCISE:
  • Make a bar plot of the mean of the stations in the year 2013

In [49]:
data['2013'].mean().plot(kind='barh')


Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f34cc42ae10>