04 - Pandas: Working with time series data
DS Data manipulation, analysis and visualisation in Python
December, 2019© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons
In [1]:
# %matplotlib notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
Standard Python contains the datetime
module to handle date and time data:
In [3]:
import datetime
In [4]:
dt = datetime.datetime(year=2016, month=12, day=19, hour=13, minute=30)
dt
Out[4]:
In [5]:
print(dt) # .day,...
In [6]:
print(dt.strftime("%d %B %Y"))
Pandas has its own date and time objects, which are compatible with the standard datetime
objects, but provide some more functionality to work with.
The Timestamp
object can also be constructed from a string:
In [7]:
ts = pd.Timestamp('2016-12-19')
ts
Out[7]:
Like with datetime.datetime
objects, there are several useful attributes available on the Timestamp
. For example, we can get the month (experiment with tab completion!):
In [8]:
ts.month
Out[8]:
There is also a Timedelta
type, which can e.g. be used to add intervals of time:
In [9]:
ts + pd.Timedelta('5 days')
Out[9]:
Unfortunately, when working with real world data, you encounter many different datetime
formats. Most of the time when you have to deal with them, they come in text format, e.g. from a CSV
file. To work with those data in Pandas, we first have to parse the strings to actual Timestamp
objects.
In [10]:
pd.to_datetime("2016-12-09")
Out[10]:
In [11]:
pd.to_datetime("09/12/2016")
Out[11]:
In [12]:
pd.to_datetime("09/12/2016", dayfirst=True)
Out[12]:
In [13]:
pd.to_datetime("09/12/2016", format="%d/%m/%Y")
Out[13]:
A detailed overview of how to specify the format
string, see the table in the python documentation: https://docs.python.org/3.5/library/datetime.html#strftime-and-strptime-behavior
In [14]:
s = pd.Series(['2016-12-09 10:00:00', '2016-12-09 11:00:00', '2016-12-09 12:00:00'])
In [15]:
s
Out[15]:
The to_datetime
function can also be used to convert a full series of strings:
In [16]:
ts = pd.to_datetime(s)
In [17]:
ts
Out[17]:
Notice the data type of this series has changed: the datetime64[ns]
dtype. This indicates that we have a series of actual datetime values.
The same attributes as on single Timestamp
s are also available on a Series with datetime data, using the .dt
accessor:
In [18]:
ts.dt.hour
Out[18]:
In [19]:
ts.dt.weekday
Out[19]:
To quickly construct some regular time series data, the pd.date_range
function comes in handy:
In [20]:
pd.Series(pd.date_range(start="2016-01-01", periods=10, freq='3H'))
Out[20]:
For the following demonstration of the time series functionality, we use a sample of discharge data of the Maarkebeek (Flanders) with 3 hour averaged values, derived from the Waterinfo website.
In [21]:
data = pd.read_csv("../data/vmm_flowdata.csv")
In [22]:
data.head()
Out[22]:
We already know how to parse a date column with Pandas:
In [23]:
data['Time'] = pd.to_datetime(data['Time'])
With set_index('datetime')
, we set the column with datetime values as the index, which can be done by both Series
and DataFrame
.
In [24]:
data = data.set_index("Time")
In [25]:
data
Out[25]:
The steps above are provided as built-in functionality of read_csv
:
In [26]:
data = pd.read_csv("../data/vmm_flowdata.csv", index_col=0, parse_dates=True)
When we ensure the DataFrame has a DatetimeIndex
, time-series related functionality becomes available:
In [27]:
data.index
Out[27]:
Similar to a Series with datetime data, there are some attributes of the timestamp values available:
In [28]:
data.index.day
Out[28]:
In [29]:
data.index.dayofyear
Out[29]:
In [30]:
data.index.year
Out[30]:
The plot
method will also adapt its labels (when you zoom in, you can see the different levels of detail of the datetime labels):
In [31]:
# %matplotlib notebook
In [32]:
data.plot()
Out[32]:
We have too much data to sensibly plot on one figure. Let's see how we can easily select part of the data or aggregate the data to other time resolutions in the next sections.
We can use label based indexing on a timeseries as expected:
In [33]:
data[pd.Timestamp("2012-01-01 09:00"):pd.Timestamp("2012-01-01 19:00")]
Out[33]:
But, for convenience, indexing a time series also works with strings:
In [34]:
data["2012-01-01 09:00":"2012-01-01 19:00"]
Out[34]:
A nice feature is "partial string" indexing, where we can do implicit slicing by providing a partial datetime string.
E.g. all data of 2013:
In [35]:
data['2013']
Out[35]:
Normally you would expect this to access a column named '2013', but as for a DatetimeIndex, pandas also tries to interprete it as a datetime slice.
Or all data of January up to March 2012:
In [36]:
data['2012-01':'2012-03']
Out[36]:
In [37]:
data['2012':]
Out[37]:
In [38]:
data[data.index.month == 1]
Out[38]:
In [39]:
data[data.index.month.isin([4, 5, 6])]
Out[39]:
In [40]:
data[(data.index.hour > 8) & (data.index.hour < 20)]
Out[40]:
A very powerfull method is resample
: converting the frequency of the time series (e.g. from hourly to daily data).
The time series has a frequency of 1 hour. I want to change this to daily:
In [41]:
data.resample('D').mean().head()
Out[41]:
Other mathematical methods can also be specified:
In [42]:
data.resample('D').max().head()
Out[42]:
In [43]:
data.resample('M').mean().plot() # 10D
Out[43]:
In [44]:
data.resample('M').std().plot() # 'A'
Out[44]:
In [46]:
subset = data['2011':'2012']['L06_347']
subset.resample('M').agg(['mean', 'median']).plot()
Out[46]:
In [47]:
daily = data['LS06_348'].resample('D').mean() # daily averages calculated
In [48]:
daily.resample('M').agg(['min', 'max']).plot() # monthly minimum and maximum values of these daily averages
Out[48]:
In [49]:
data['2013'].mean().plot(kind='barh')
Out[49]: