A time series is a measurement of one or more variables over a period of time, taken at a specific interval. Once a time series is captured, it is analyzed to identify patterns in the data, in essence, to determine what is happening as time goes by.
pandas provides extensive support for working with time-series data. When working with such data you will frequently need to perform tasks such as representing dates and times, creating ranges of timestamps, indexing and slicing by date, shifting and lagging values, converting frequencies, resampling, applying rolling-window calculations, and handling time zones, all of which are covered in this section.
In [1]:
# import numpy, pandas and datetime
import numpy as np
import pandas as pd
# needed for representing dates and times
import datetime
from datetime import datetime
# Set some pandas options for controlling output
pd.set_option('display.notebook_repr_html',False)
pd.set_option('display.max_columns',10)
pd.set_option('display.max_rows',10)
# matplotlib and inline graphics
import matplotlib.pyplot as plt
%matplotlib inline
The datetime object is part of the datetime standard library module and not part of pandas. This class can be used to construct objects that represent a fixed point in time (a specific date and time), a date without a time component, or a time without a date component.
With respect to pandas, datetime objects do not have the precision needed for much of the mathematics involved in extensive time-series calculations (pandas timestamps work at nanosecond resolution). However, datetime objects are commonly used to initialize pandas objects, with pandas converting them into Timestamp objects behind the scenes.
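For example (a minimal sketch, not from the original text), pandas converts a datetime into a Timestamp, and a Timestamp supports nanosecond resolution that datetime cannot express:

# pandas converts a datetime into a Timestamp behind the scenes
pd.Timestamp(datetime(2014, 8, 1))
# a Timestamp can resolve down to nanoseconds, which datetime cannot express
pd.Timestamp('2014-08-01 00:00:00.000000001').nanosecond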
In [2]:
# datetime object for Dec 15 2014
datetime(2014,12,15)
Out[2]:
In [4]:
# specific date and also with a time of 5:30 pm
datetime(2014,12,15,17,30)
Out[4]:
In [5]:
# get the local "now" (date and time)
# can take a time zone, but that's not demonstrated here
datetime.now()
Out[5]:
In [7]:
# a date without time can be represented
# by creating a date using a datetime object
datetime.date(datetime(2014,12,14))
Out[7]:
In [8]:
# get the current date
datetime.now().date()
Out[8]:
In [9]:
# get just a time from a datetime
datetime.time(datetime(2015,12,14,15,17,30))
Out[9]:
In [10]:
# get current local time
datetime.now().time()
Out[10]:
In [11]:
# a timestamp representing a specific date
pd.Timestamp('2014-12-15')
Out[11]:
In [12]:
# a timestamp with both date and time
pd.Timestamp('2014-12-14 17:30')
Out[12]:
In [13]:
# timestamp with just a time
# which adds in the current local date
pd.Timestamp('17:55')
Out[13]:
In [14]:
# get the current date and time (now)
pd.Timestamp("now")
Out[14]:
In [15]:
# what is one day from 2014-11-30?
today = datetime(2014,11,30)
tomorrow = today + pd.Timedelta(days=1)
tomorrow
Out[15]:
In [16]:
# how many days between two dates?
date1 = datetime(2014,12,2)
date2 = datetime(2014,11,28)
date1 - date2
Out[16]:
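The Timedelta objects used above can also be constructed directly with multiple components or from a string; a brief hedged sketch (not part of the original examples):

# a Timedelta built from several keyword components
pd.Timedelta(days=2, hours=6, minutes=30)
# the same duration expressed as a string
pd.Timedelta('2 days 06:30:00')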
Due to its roots in finance, pandas excels at manipulating time-series data, and its time-series capabilities have been continuously refined across its releases.
The core of time-series functionality in pandas revolves around the use of specialized indexes that represent measurements of data at one or more timestamps. These indexes are referred to as DatetimeIndex objects.
In [17]:
# create a very simple time-series with two index labels
# and random values
dates = [datetime(2014,8,1),datetime(2014,8,2)]
ts = pd.Series(np.random.randn(2),dates)
ts
Out[17]:
In [18]:
# what is the type of the index?
type(ts.index)
Out[18]:
In [19]:
# and we can see it is a collection of timestamps
type(ts.index[0])
Out[19]:
In [20]:
# create from just a list of dates as strings!
np.random.seed(123456)
dates = ['2014-08-01','2014-08-02']
ts = pd.Series(np.random.randn(2),dates)
ts
Out[20]:
In [22]:
# convert a sequence of objects to a DatetimeIndex
dti = pd.to_datetime(['Aug 1,2014','2014-08-2','2014.8.3',None])
for l in dti: print(l)
In [23]:
type(dti)
Out[23]:
In [27]:
# older versions of pandas fell back to a NumPy array of
# objects if a value could not be parsed to a Timestamp;
# current versions raise an error by default, so this
# call is left commented out
# pd.to_datetime(['Aug 1, 2014','foo'])
In [30]:
# force the conversion; unparseable items become NaT
pd.to_datetime(['Aug 1, 2014','foo'], errors='coerce')
In [31]:
# create a range of dates starting at a specific date
# and for a specific number of days, creating a Series
np.random.seed(123456)
periods = pd.date_range('8/1/2014',periods=10)
date_series = pd.Series(np.random.randn(10),index=periods)
date_series
Out[31]:
In [34]:
# slice by location
subset = date_series[3:7]
subset
Out[34]:
In [35]:
# a series to demonstrate alignment
s2 = pd.Series([10,100,1000,10000],subset.index)
s2
Out[35]:
In [36]:
# demonstrate alignment by date on a subset of items
date_series + s2
Out[36]:
In [37]:
# lookup item by a string representing a date
date_series['2014-08-05']
Out[37]:
In [38]:
# slice between two dates specified by string representing dates
date_series['2014-08-05':'2014-08-07']
Out[38]:
In [39]:
# a two year range of daily data in a Series
# only select those in 2013
s3 = pd.Series(0,pd.date_range('2013-01-01','2014-12-31'))
s3['2013']
Out[39]:
In [40]:
# 31 items for May 2014
s3['2014-05']
Out[40]:
In [41]:
# items between two months
s3['2014-08':'2014-09']
Out[41]:
In [43]:
# generate a Series at one minute intervals
np.random.seed(123456)
bymin = pd.Series(np.random.randn(24*60*90),pd.date_range('2014-08-01','2014-10-29 23:59',freq='T'))
bymin
Out[43]:
In [44]:
# slice down to the minute
bymin['2014-08-01 00:02':'2014-08-01 00:10']
Out[44]:
Some of the possible frequency values are listed in the following table (a short sketch using a few of these aliases follows the table):
Alias | Description |
---|---|
B | Business Day Frequency |
C | Custom Business Day Frequency |
D | Calendar Day Frequency (the default) |
W | Weekly Frequency |
M | Month End Frequency |
BM | Business Month End Frequency |
CBM | Custom Business Month End Frequency |
MS | Month Start Frequency |
BMS | Business Month Start Frequency |
CBMS | Custom Business Month Start Frequency |
Q | Quarter End Frequency |
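A brief hedged sketch of a few of these aliases used with pd.date_range (the specific dates are arbitrary, chosen only for illustration):

# business days in the first week of August 2014
pd.date_range('2014-08-04', periods=5, freq='B')
# calendar month-end dates in 2014
pd.date_range('2014-01-01', '2014-12-31', freq='M')
# weekly dates, anchored (by default) to Sunday
pd.date_range('2014-08-01', periods=4, freq='W')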
In [45]:
# generate a series based upon business days
days = pd.date_range('2014-08-29','2014-09-05',freq='B')
for d in days: print(d)
In this time series, we can see that two days were skipped because they fell on the weekend, which would not have occurred using a calendar-day frequency.
A range can be created starting at a particular date and time with a specific frequency and for a specific number of periods using the periods parameter.
In [46]:
# periods will use the frequency as the increment
pd.date_range('2014-08-01 12:10:01',freq='S',periods=10)
Out[46]:
Frequencies in pandas are represented using date offsets. We touched on this concept earlier when discussing Timedelta objects. pandas extends this capability with the concept of DateOffset objects, which encapsulate the knowledge of how to apply an offset of a given frequency to a timestamp or to the timestamps in a DatetimeIndex.
The combination of DatetimeIndex and DateOffset objects gives the pandas user great flexibility in calculating a new date/time from another, using an offset other than one that represents a fixed period of time.
A practical example is calculating the next business day. This is not determined simply by adding one day to a datetime: if a date falls on a Friday, the next business day in the US financial market is not Saturday but Monday, and in some cases one business day from a Friday may actually be Tuesday, if the Monday is a holiday.
In [47]:
dti = pd.date_range('2014-08-29','2014-09-05',freq='B')
dti.values
Out[47]:
In [48]:
# check the frequency is BusinessDay
dti.freq
Out[48]:
The following table lists some of the DateOffset classes (a short sketch using a couple of them follows the table):
Class | Description |
---|---|
DateOffset | Generic offset; defaults to one calendar day |
BDay | Business Day |
CDay | Custom Business Day |
Week | One week, optionally anchored on a day of the week |
WeekOfMonth | The x-th day of the y-th week of each month |
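To illustrate a couple of these, here is a minimal hedged sketch (the dates are arbitrary and chosen only for illustration):

from pandas.tseries.offsets import Week, WeekOfMonth
# roll a date forward to the next Monday
datetime(2014, 8, 1) + Week(weekday=0)
# roll forward to the third Friday of the month (week is zero-based)
datetime(2014, 8, 1) + WeekOfMonth(week=2, weekday=4)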
pandas uses this strategy of DateOffset and its specializations to codify the logic for calculating the next datetime from another datetime. This makes these objects both flexible and powerful.
DateOffset objects can be created using keyword arguments that specify the desired duration (for example, days=1), or by using one of the specialized subclasses shown in the previous table.
In [49]:
# calculate a one day offset from 2014-8-29
d = datetime(2014,8,29)
do = pd.DateOffset(days = 1)
d + do
Out[49]:
In [50]:
# import the date offset types
from pandas.tseries.offsets import *
# calculate one business day from 2014-8-29
d + BusinessDay()
Out[50]:
In [51]:
# determine 2 business days from 2014-8-29
d + 2 * BusinessDay()
Out[51]:
In [52]:
# what is the next business month end
# from a specific date?
d + BMonthEnd()
Out[52]:
In [53]:
# calculate the next month end by
# rolling forward from a specific date
BMonthEnd().rollforward(datetime(2014,9,15))
Out[53]:
In [54]:
# calculate the date of the Tuesday previous
# to a specified date
d - Week(weekday = 1)
Out[54]:
In [55]:
# calculate all Wednesdays between 2014-06-01
# and 2014-08-31
wednesdays = pd.date_range('2014-06-01','2014-08-31', freq='W-WED')
wednesdays.values
Out[55]:
In [56]:
# business quarter start dates in 2014, with
# quarters anchored to a fiscal year ending in June
qends = pd.date_range('2014-01-01','2014-12-31',freq='BQS-JUN')
qends.values
Out[56]:
Many useful mathematical operations on time-series data require analyzing events that fall within a specific interval of time. A simple example would be determining how many financial transactions occurred in a given period.
This can be performed using Timestamp and DateOffset, where the bounds are calculated and items are then filtered based on those bounds. However, this becomes cumbersome when events must be grouped into multiple periods of time, as you then need to manage many sets of Timestamp and DateOffset objects.
To facilitate this type of data organization and calculation, pandas makes intervals of time a formal construct using the Period class.
pandas also formalizes series of Period objects using PeriodIndex, which provides the ability to align data items based on the index's associated Period objects.
A Period is created from a timestamp and a frequency, where the timestamp represents the anchor used as a point of reference and the frequency is the duration of the interval.
In [57]:
# create a period representing a month of time
# starting in August 2014
aug2014 = pd.Period('2014-08',freq='M')
aug2014
Out[57]:
In [58]:
# examine the start and end times of this period
aug2014.start_time, aug2014.end_time
Out[58]:
In [59]:
# calculate the period that is one frequency
# unit of aug2014 period further along in time
# This happens to be September 2014
sep2014 = aug2014 + 1
sep2014
Out[59]:
The concept of shifting is very important and powerful. Adding 1 to this Period object tells it to shift forward in time by one positive unit of whatever frequency it represents. In this case, it shifts the period one month forward, to September 2014.
In [60]:
sep2014.start_time, sep2014.end_time
Out[60]:
Note that Period has the intelligence to know that September has 30 days and not 31. This is part of the intelligence behind the Period object that saves us a lot of coding: it does not simply add 30 days, but one frequency unit of the period.
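As a further hedged illustration (not part of the original examples), this length-awareness also holds when shifting across months of different lengths:

# shift a monthly period back to February 2014
feb2014 = pd.Period('2014-06', freq='M') - 4
feb2014.start_time, feb2014.end_time
# days_in_month reflects the actual month length (28 for February 2014)
feb2014.days_in_month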
A series of Period objects can be combined into a special form of pandas index known as PeriodIndex. A PeriodIndex is useful for associating data with specific intervals of time and for slicing and analyzing the events in each interval represented in the index.
In [61]:
# create a period index representing
# all monthly boundaries in 2013
mp2013 = pd.period_range('1/1/2013','12/31/2013',freq='M')
mp2013
Out[61]:
In [62]:
# loop through all period objects in the index
# printing start and end time of each
for p in mp2013:
    print("{0} {1}".format(p.start_time, p.end_time))
In [63]:
# create a Series with a PeriodIndex
np.random.seed(123456)
ps = pd.Series(np.random.randn(12),mp2013)
ps
Out[63]:
In [65]:
# create a Series with a PeriodIndex that
# represents all calendar month periods in 2013 and 2014
np.random.seed(123456)
ps = pd.Series(np.random.randn(24),pd.period_range('1/1/2013','12/31/2014',freq='M'))
ps
Out[65]:
In [66]:
# get value for period represented by 2014-06
ps['2014-06']
Out[66]:
In [67]:
# get values for all periods in 2014
ps['2014']
Out[67]:
In [68]:
# all values between (and including) March and June 2014
ps['2014-03':'2014-06']
Out[68]:
Earlier, when we calculated the next business day from August 29, 2014, pandas told us that this date is September 1, 2014. This is actually not correct in the United States: September 1, 2014 is a US federal holiday (Labor Day), and banks and exchanges are closed on that day. The reason is that pandas uses a specific default calendar when calculating the next business day, and this default calendar does not include September 1, 2014 as a holiday.
In [69]:
# demonstrate using the US federal holiday calendar
# first need to import it
from pandas.tseries.holiday import *
# create it and show what it considers holidays
cal = USFederalHolidayCalendar()
for d in cal.holidays(start='2014-01-01', end='2014-12-31'):
    print(d)
In [70]:
# create CustomBusinessDay object based on the federal calendar
cbd = CustomBusinessDay(holidays=cal.holidays())
# now calc next business day from 2014-8-29
datetime(2014,8,29) + cbd
Out[70]:
Time zone management can be one of the most complicated issues to deal with when working with time-series data. Data is often collected in different systems across the globe using local time, and at some point it will require coordination with data collected in other time zones.
In [71]:
# get the current local time and demonstrate there is no
# timezone info by default
now = pd.Timestamp('now')
now, now.tz is None
Out[71]:
In [72]:
# default DatetimeIndex and its Timestamps do not have
# time zone information
rng = pd.date_range('3/6/2012 00:00', periods=15, freq='D')
rng.tz is None, rng[0].tz is None
Out[72]:
In [73]:
# import common timezones from pytz
from pytz import common_timezones
# report the first 5
common_timezones[:5]
Out[73]:
In [74]:
# get now, and now localized to UTC
now = pd.Timestamp("now")
local_now = now.tz_localize('UTC')
now, local_now
Out[74]:
In [76]:
# localize a timestamp to US/Mountain time zone
tstamp = pd.Timestamp('2014-08-01 12:00:00', tz='US/Mountain')
tstamp
Out[76]:
In [78]:
# create a DatetimeIndex using a time zone
rng = pd.date_range('3/6/2012 00:00:00', periods=10,freq="D",tz="US/Mountain")
rng.tz, rng[0].tz
Out[78]:
In [79]:
# show use of time zone objects
# need to reference pytz
import pytz
# create an object for two different time zones
mountain_tz = pytz.timezone("US/Mountain")
eastern_tz = pytz.timezone("US/Eastern")
# apply each to 'now'
mountain_tz.localize(now),eastern_tz.localize(now)
Out[79]:
In [80]:
# create two Series, same start, same periods, same frequencies
# each with a different time zone
s_mountain = pd.Series(np.arange(0, 5),
                       index=pd.date_range('2014-08-01', periods=5,
                                           freq="H", tz="US/Mountain"))
s_eastern = pd.Series(np.arange(0, 5),
                      index=pd.date_range('2014-08-01', periods=5,
                                          freq="H", tz="US/Eastern"))
s_mountain
Out[80]:
In [81]:
s_eastern
Out[81]:
In [82]:
# add the two series
# This only results in three items being aligned
s_eastern + s_mountain
Out[82]:
Once a time zone is assigned to an object, that object can be converted to another time zone using the .tz_convert() method.
In [84]:
# convert s_eastern from US/Eastern to US/Pacific
s_pacific = s_eastern.tz_convert("US/Pacific")
s_pacific
Out[84]:
In [85]:
# this will be the same result as s_eastern + s_mountain
# as the time zones still get aligned to be the same
s_mountain + s_pacific
Out[85]:
Common operations performed on time-series data include realigning data, changing the frequency of the samples and their values, and calculating aggregate results on continuously moving subsets of the data to determine how the values behave as time changes.
In [86]:
# create a series to work with
np.random.seed(123456)
ts = pd.Series([1, 2, 2.5, 1.5, 0.5], pd.date_range('2014-08-01', periods=5))
ts
Out[86]:
In [87]:
# shift forward one day
ts.shift(1)
Out[87]:
pandas has moved the values forward by one unit of the index's frequency, which is one day. The index itself remains unchanged. There was no replacement value for 2014-08-01, so it is filled with NaN.
A lag is a shift in the negative direction. The following lags the Series by 2 days:
In [88]:
# lag two days
ts.shift(-2)
Out[88]:
A common calculation performed using a shift is the daily percentage change in values. This can be computed by dividing a Series object by its values shifted by 1 (subtracting 1 from the result gives the percentage change):
In [89]:
# calculate daily percentage change
ts / ts.shift(1)
Out[89]:
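The division above yields the ratio of each value to the previous one; as a hedged aside (not in the original text), pandas also provides the .pct_change() method, which returns the equivalent percentage change directly:

# equivalent to ts / ts.shift(1) - 1
ts.pct_change()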
Shifts can also be performed using a different frequency than that of the index. When this is done, the index is modified and the values remain the same. As an example, the following shifts the Series forward by one business day:
In [90]:
# shift forward one business day
ts.shift(1,freq="B")
Out[90]:
In [91]:
# shift forward by 5 hours
ts.shift(5, freq="H")
Out[91]:
In [92]:
# shift using a DateOffset of half a minute (30 seconds)
ts.shift(1, freq=DateOffset(minutes=0.5))
Out[92]:
There is an alternative form of shifting provided by the .tshift() method. This method shifts only the index labels, by the specified number of units of the frequency given by the freq parameter.
In [93]:
# shift just the index values
ts.tshift(-1,freq="H")
Out[93]:
Frequency data can be converted in pandas using the .asfreq() method of a time-series object, such as a Series or DataFrame. When converting frequency, a new Series object with a new DatetimeIndex is created. The DatetimeIndex of the new Series starts at the first Timestamp of the original and progresses at the given frequency until the last Timestamp of the original. Values are then aligned into the new Series.
In [95]:
# create a Series of incremental values
# indexed by hour through all of August 2014
periods = 31 * 24
hourly = pd.Series(np.arange(0, periods),
                   pd.date_range('08-01-2014', freq="H", periods=periods))
hourly
Out[95]:
In [97]:
# convert to daily frequency
# many items will be dropped due to alignment
daily = hourly.asfreq('D')
daily
Out[97]:
In [98]:
# convert back to hourly
# results in many NaNs
# as the new index has many labels that do not
# align with the source
daily.asfreq("H")
Out[98]:
The new index has Timestamp objects at hourly intervals, so only the timestamps that fall exactly on a day boundary align with the daily time series, and every other hourly label is assigned NaN. This default behavior can be changed using the method parameter of the .asfreq() method, which can forward fill ('ffill'/'pad') or backward fill ('bfill'/'backfill') the NaN values.
In [99]:
daily.asfreq('H',method='ffill')
Out[99]:
In [100]:
daily.asfreq('H',method='bfill')
Out[100]:
Frequency conversion provides a basic way to convert the index of a time series to another frequency. Data in the new series is aligned with the old data and can result in many NaN values. This can be partially addressed with a fill method, but filling is limited in its ability to supply appropriate values.
Resampling differs in that it does not perform a pure alignment. The values placed in the new series can use the same forward and backward fill options, but they can also be computed using other pandas-provided algorithms or with your own functions.
In [101]:
# calculate a random walk five days long at one second intervals
# this many items will be needed
count = 24 * 60 * 60 * 5
# create a series of values
np.random.seed(123456)
values = np.random.randn(count)
ws = pd.Series(values)
# calculate the walk
walk = ws.cumsum()
# patch the index
walk.index = pd.date_range('2014-08-01',periods=count,freq="S")
walk
Out[101]:
Resampling in pandas is accomplished using the .resample() method, passing it a new frequency and then applying an aggregation to the result. To demonstrate this, the following resamples our by-the-second data to by-the-minute data. This is a downsampling, as the result has a lower frequency and results in fewer values:
In [106]:
# resample to minute intervals, taking the mean of each bucket
walkmin = walk.resample("1Min").mean()
A resampling splits the data into buckets based on the new periods and then applies a particular operation to the data in each bucket; here, the mean of each bucket is calculated. This can be verified with the following, which slices the first minute of data from the walk and calculates its mean:
In [107]:
# calculate the mean of the first minute of the walk
walk['2014-08-01 00:00'].mean()
Out[107]:
In downsampling, as the existing data is put into buckets based on the new intervals, there can be a question of which values fall on each end of a bucket. As an example, should the first interval in the previous resampling run from 2014-08-01 00:00:00 through 2014-08-01 00:00:59, or should it start at 2014-08-01 00:00:01 and end at 2014-08-01 00:01:00?
The default is the former, which is referred to as a left close. The other scenario, which excludes the left value and includes the right, is a right close and can be selected using the closed='right' parameter.
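To make the difference concrete, here is a small hedged sketch on a toy series (the values and timestamps are made up purely for illustration):

# six values, ten seconds apart, spanning one minute
toy = pd.Series(range(6), index=pd.date_range('2014-08-01', periods=6, freq='10S'))
# default left close: each interval includes its left edge
toy.resample('1Min').sum()
# right close: each interval includes its right edge instead
toy.resample('1Min', closed='right').sum()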
In [108]:
# use a right close
walk.resample("1Min", closed="right")
Out[108]:
In [110]:
# take the first value of each bucket
walk.resample("1Min").first()
Out[110]:
In [112]:
# resample to 1 min intervals, then back to 1 sec
bymin = walk.resample("1Min").mean()
bymin.resample('S').mean()
Out[112]:
In [114]:
# resample to 1 sec intervals using backward fill
bymin.resample("S").bfill()
Out[114]:
In [115]:
# demo interpolating the NaN values
interpolated = bymin.resample("S").interpolate()
interpolated
Out[115]:
In [117]:
# show ohlc resampling
ohlc = walk.resample("H").ohlc()
ohlc
Out[117]:
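As noted earlier, the values in each resampling bucket can also be computed with your own functions. A hedged sketch using .agg() with both a built-in and a custom aggregation (the bucket_range helper is purely illustrative):

# aggregate each hourly bucket with the mean and a custom range (max - min)
bucket_range = lambda x: x.max() - x.min()
walk.resample("H").agg(['mean', bucket_range])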
pandas computes moving (also known as rolling) statistics through the .rolling() method of a Series or DataFrame. For a rolling window, pandas computes the statistic on a window of data represented by a particular period of time. The window is then rolled along a certain interval, and the statistic is continually calculated on each window as long as the window fits within the dates of the time series.
As a practical example, a rolling mean is commonly used to smooth out short-term fluctuations and highlight longer-term trends in data and is used quite commonly in financial time-series analysis.
In [118]:
first_minute = walk['2014-08-01 00:00']
# calculate a rolling mean with a window of 5 periods
first_minute.rolling(window=5).mean().plot()
# plot it against the raw data
first_minute.plot()
# add a legend
plt.legend(labels=['Rolling Mean','Raw']);
It can be seen how the rolling mean provides a smoother representation of the underlying data. A larger window creates less variance, and a smaller window creates more.
In [119]:
# demonstrate the difference between 2, 5 and 10
# interval rolling windows
hlw = walk['2014-08-01 00:00']
hlw.plot()
hlw.rolling(window=2).mean().plot()
hlw.rolling(window=5).mean().plot()
hlw.rolling(window=10).mean().plot()
plt.legend(labels=['Raw','2-interval RM','5-interval RM','10-interval RM']);
Any function can be applied via a rolling window using the .apply() method of the rolling window object. The supplied function is passed the values in the window and should return a single value, which pandas aggregates with the other results into a time series.
In [120]:
# calculate the mean absolute deviation with a window of 5 intervals
mean_abs_dev = lambda x: np.fabs(x - x.mean()).mean()
hlw.rolling(window=5).apply(mean_abs_dev).plot();
An expanding window mean can be thought of as a rolling mean whose window always starts at the first value in the time series and grows by one for each iteration. An expanding window mean is more stable (less responsive) than a rolling mean, because as the size of the window increases, the impact of the next value decreases:
In [122]:
# calculate an expanding window mean
# (equivalent to a rolling mean whose window grows to
# cover all values seen so far, as in the lambda below)
expanding_mean = lambda x: x.rolling(window=len(x), min_periods=1).mean()
hlw.plot()
hlw.expanding().mean().plot()
plt.legend(labels=['Raw','Expanding Mean']);