Overview. We introduce the tools for working with dates, times, and time series data. We start with functionality built into python itself, then discuss how pandas builds on these tools to add powerful time series capabilities to DataFrames.
Outline
Note: requires internet access to run.
This Jupyter notebook was created by Chase Coleman and Spencer Lyon for the NYU Stern course Data Bootcamp.
In order to run the code in this notebook, you will need to have the quandl package installed. You can do this from the command line using
pip install quandl --upgrade
In [ ]:
import sys # system module
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics module
import datetime as dt # date and time module
import numpy as np
%matplotlib inline
plt.style.use("ggplot")
# quandl package
import quandl
# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('quandl version: ', quandl.version.VERSION)
print('Today: ', dt.date.today())
# helper function to print info about dataframe
def df_info(df):
print("Shape: ", df.shape)
print("dtypes: ", df.dtypes.to_dict())
print("index dtype: ", df.index.dtype)
return pd.concat([df.head(3), df.tail(3)])
quandl is a company that collects and maintains financial and economic data from standard sources (e.g. FRED, IMF, BEA) and non-standard sources (FX data, company-level data, trader receipts). The data is viewable on their webpage (see here or there for examples), but it is also made available to programming languages via their API. We will access the API using their python library.
Suppose, for example, that we wanted to get data on taxes in the US. Searching on Quandl's website turns up a dataset with a Quandl code (OECD/REV_NES_TOTALTAX_TAXUSD_USA) and a description.

Exercise (5 min): Go to Quandl's website and explore some of the data quandl has available. Come up with 2 datasets and make a dictionary that maps the quandl code into a reasonable name. For example, for the US tax revenue dataset above I could have done

my_data = {"OECD/REV_NES_TOTALTAX_TAXUSD_USA": "US_tax_rev"}
In [ ]:
We can download the data using the quandl.get function, passing it one of the Quandl codes we collected in the previous exercise:
In [ ]:
us_tax = quandl.get("OECD/REV_NES_TOTALTAX_TAXUSD_USA")
df_info(us_tax)
We can also pass start_date and end_date parameters to control the dates for the downloaded data:
In [ ]:
us_tax_recent = quandl.get("OECD/REV_NES_TOTALTAX_TAXUSD_USA", start_date="2000-01-01")
df_info(us_tax_recent)
Now, let's read in the datasets we found interesting. Feel free to use the codes you looked up, or the ones I'm using here.
In [ ]:
my_data = {
"FRED/DFF": "risk_free_rate",
"NVCA/VENTURE_3_09C": "vc_investments"
}
dfs = []
for k in my_data.keys():
dfs.append(quandl.get(k))
In [ ]:
df_info(dfs[0])
In [ ]:
df_info(dfs[1])
So, "FRED/DFF" is the federal funds rate: the interest rate at which banks lend funds to each other overnight. This is often used as a proxy for the risk free rate in economic analysis.
From the printout above it looks like we have more than 22k observations starting in 1954 at a daily frequency.
Notice, however, that the column name is VALUE. Let's use our dict to clean up that name:
In [ ]:
dfs[0].rename(columns={"VALUE": my_data["FRED/DFF"]}, inplace=True)
df_info(dfs[0])
The second dataframe we downloaded (using code NVCA/VENTURE_3_09C) contains quarterly data on total investment by venture capital firms in the US, broken down by the stage of the project.
The column names here are ok, so we don't need to change anything.
Exercise (5 min): do a similar analysis/report for whatever datasets you chose to work with. Make sure to do the following:
If you chose to use the same data as me, do something interesting with the data. Perhaps construct plots of different variables, or compute summary statistics -- use your imagination here.
In [ ]:
So that we have the data easily accessible for later on, let's store these two variables in individual dataframes:
In [ ]:
ffr = dfs[0]
vc = dfs[1]
The date and time functionality in python comes from the built-in datetime module. Notice above that we ran
import datetime as dt
We've been using the dt.date.today() function throughout this course when we print the date at the top of our notebooks, but we haven't given it very much thought. Let's take a closer look now.

To start, let's see what the type of dt.date.today() is:
In [ ]:
today = dt.date.today()
print("the type of today is ", type(today))
Given that we have an object of type datetime.date, we can do things like ask for the day, month, and year:
In [ ]:
print("the day of the month is: ", today.day)
print("we are currently in month number", today.month)
print("The year is", today.year)
In [ ]:
# construct a date by hand
new_years_eve = dt.date(2016, 12, 31)
In [ ]:
until_nye = new_years_eve - today
In [ ]:
type(until_nye)
We can get the number of days until New Year's Eve by looking at until_nye.days:
In [ ]:
until_nye.days
Exercise (5 min): write a python function named days_until that accepts one argument (a datetime.date) and returns the number of days between today and that date. Apply your function to
In [ ]:
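One possible solution, as a sketch (the function name days_until comes from the exercise prompt):

```python
import datetime as dt

def days_until(date):
    """Return the number of days from today until `date` (a datetime.date)."""
    return (date - dt.date.today()).days

# a date in the future gives a positive number; today itself gives 0
days_until(dt.date.today() + dt.timedelta(days=10))  # 10
```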
We could also construct a datetime.timedelta by hand and add it to an existing date. Here's an example to see how many days until Spencer turns 30:
In [ ]:
spencer_bday = dt.date(1989, 4, 25)
# NOTE: add 7 for the 7 leap years between 1989 and 2019
thirty_years = dt.timedelta(days=365*30 + 7)
In [ ]:
# check to make sure it is still April 25th
spencer_bday + thirty_years
In [ ]:
days_to_30 = (spencer_bday + thirty_years - today).days
print("Spencer will be 30 in {} days".format(days_to_30))
In [ ]:
now = dt.datetime.now()
print("type of now:", type(now))
now
The numbers in the printout above are year, month, day, hour, minute, second, and microsecond.
Because we still have day, month, and year information, we can access these properties just as we did for today above:
In [ ]:
print("the day of the month is: ", now.day)
print("we are currently in month number", now.month)
print("The year is", now.year)
Exercise (2 min): Use tab completion to see what else we can access on our dt.datetime object now.
In [ ]:
Time deltas work the same way with datetime objects as they did with date objects.
We can see how many seconds until Spencer turns 30:
In [ ]:
# NOTE: we can only do arithmetic between two date objects or two datetime objects;
# we cannot add or subtract a datetime to/from a date, so we need to
# re-create spencer_bday as a datetime object.
# NOTE: the timedelta object is compatible with both date and datetime objects
spencer_bday_time = dt.datetime(1989, 4, 25, 16, 33, 5)
# use total_seconds() -- the .seconds attribute only gives the seconds
# component of the timedelta (a number between 0 and 86399), not the whole span
seconds_to_30 = (spencer_bday_time + thirty_years - now).total_seconds()
print("Spencer will be 30 in {} seconds".format(seconds_to_30))
In [ ]:
print(today.strftime("Today is %Y-%m-%d"))
Notice that the argument to strftime is a python string that can contain normal text (e.g. Today is) and special formatters (the stuff starting with %). We haven't talked much about how to do string formatting, but in Python and many other languages using % inside strings has special meaning.
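A few of the common formatters, shown on a fixed date (the date below is just an illustration):

```python
import datetime as dt

d = dt.datetime(2016, 12, 31, 18, 30)
print(d.strftime("%Y-%m-%d"))   # 2016-12-31   (year, month, day as numbers)
print(d.strftime("%A, %B %d"))  # Saturday, December 31  (weekday and month names)
print(d.strftime("%H:%M"))      # 18:30        (24-hour time)
```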
Exercise (6 min): Using the documentation for the string formatting behavior, figure out how to write the following strings using the strftime method on the spencer_bday_time object.
In [ ]:
spencer_bday_time.strftime("Spencer was born on %A, %B %dth at %I:%M %p")
Now we will look at how to use date and datetime functionality in pandas.

To begin, let's take a closer look at the type of index we have on our ffr and vc dataframes:
In [ ]:
type(ffr.index)
Here we have a DatetimeIndex, which means pandas recognizes this DataFrame as containing time series data.
What can we do now? A lot; here's a brief list:

- subset the data using strings to get data for a particular time frame
- resample the data to a different frequency: this means we could convert daily data to monthly, quarterly, etc.
- extract the year, month, and day for the observations
- snap the observations to a particular frequency -- this one is a bit advanced and we won't cover it here

For a much more comprehensive list with other examples see the docs.
For now, let's look at how to do these things with the data we obtained from quandl.

NOTE: You can only do these things when you have a DatetimeIndex. This means that even if one of the columns in your DataFrame has date or datetime information, you will need to set it as the index to access this functionality.
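For example, if the dates arrive as plain strings in an ordinary column, a minimal sketch (with made-up data) of promoting that column to a DatetimeIndex looks like:

```python
import pandas as pd

# hypothetical frame where the dates are plain strings in a column
data = pd.DataFrame({"date": ["2008-01-01", "2008-01-02"], "rate": [4.0, 3.9]})
data["date"] = pd.to_datetime(data["date"])  # parse the strings into Timestamps
data = data.set_index("date")                # promote the column to the index
print(type(data.index))                      # now a DatetimeIndex
```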
In [ ]:
ffr2008 = ffr["2008"]
print("ffr2008 is a", type(ffr2008))
df_info(ffr2008)
In [ ]:
ffr2008.plot()
Suppose we want to restrict to September 2008:
In [ ]:
ffr_sep2008 = ffr["2008-09"]
df_info(ffr_sep2008)
We can use this same functionality to extract ranges of dates. To get the data starting in June 2007 and going until March 2011 we would do:
In [ ]:
ffr2 = ffr["2007-06":"2011-03"]
df_info(ffr2)
In [ ]:
ffr2.plot()
Exercise (3 min) Using one of your datasets from quandl, plot one or more variables for the last 3 years (2013 through 2016)
In [ ]:
In [ ]:
# MS means "month start"
ffrM_resample = ffr.resample("MS")
type(ffrM_resample)
Notice that when we call resample we don't get back a DataFrame at the new frequency. This is because there is some ambiguity regarding just how the frequency should be converted: should we take the average during the period, the first observation, the last observation, or the sum of the observations?

In order to get a DataFrame we have to call a method on our DatetimeIndexResampler object. For this example, let's take the first observation in each period:
In [ ]:
ffrM = ffrM_resample.first()
df_info(ffrM)
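To see how much the choice of aggregation method matters, here is a small sketch on a made-up daily series; the same methods apply to ffr.resample("MS"):

```python
import pandas as pd

# four days of synthetic data, resampled into two-day bins
daily = pd.Series([1.0, 2.0, 3.0, 4.0],
                  index=pd.date_range("2000-01-01", periods=4, freq="D"))
r = daily.resample("2D")

print(r.first().tolist())  # [1.0, 3.0]  first observation in each bin
print(r.last().tolist())   # [2.0, 4.0]  last observation in each bin
print(r.mean().tolist())   # [1.5, 3.5]  average within each bin
print(r.sum().tolist())    # [3.0, 7.0]  sum within each bin
```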
Note that we can also combine numbers with the specification of the resampling frequency. As an example, we can resample to a bi-weekly frequency using
In [ ]:
ffr.resample("2w").first()
Exercise (5 min): Using the documentation for the most common frequencies, figure out how to resample one of your datasets to

More than you need: I want to point out that when you use the first or last methods to perform the aggregations, there are two dates involved: (1) the date the resultant index will have and (2) the date used to fill in the data at that date. The first date (the one on the index) will be assigned based on the string you pass to the resample method. The second date (the one for extracting data from the original dataframe) will be determined by the method used to do the aggregation: first will extract the first data point from that subset and last will extract the last.
Let's see some examples:
In [ ]:
ffr.resample("M").first().head()
In [ ]:
ffr.resample("M").last().head()
Notice that the index is the same on both, but the data is clearly different.
If we use MS instead of M, the index will be based on the first day of the month:
In [ ]:
ffr.resample("MS").first().head()
In [ ]:
ffr.resample("MS").last().head()
Notice how the data associated with "M" and first is the same as the data for "MS" and first. The same holds for last.
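The point about "M" versus "MS" can be checked on a small synthetic series (the dates here are arbitrary):

```python
import pandas as pd

# two months of synthetic daily data
s = pd.Series(range(60), index=pd.date_range("2008-01-01", periods=60, freq="D"))

m_first = s.resample("M").first()    # index labels are month-end dates
ms_first = s.resample("MS").first()  # index labels are month-start dates

# the labels differ, but .first() pulls the same underlying observations
print(m_first.index[0].day, ms_first.index[0].day)  # 31 1
print((m_first.values == ms_first.values).all())    # True
```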
We can pull the year, month, or day out of each date on the index:
In [ ]:
ffr.index.year
In [ ]:
ffr.index.day
In [ ]:
ffr.index.month
We can use pandas to do rolling computations.
For example, suppose we want to plot the maximum and minimum of the risk free rate within the past week at each date (think about that slowly -- for every date, we want to look back 7 days and compute the max).
Here's how we can do that
In [ ]:
fig, ax = plt.subplots()
ffr.rolling(window=7).max().plot(ax=ax)
ffr.rolling(window=7).min().plot(ax=ax)
ax.legend(["max", "min"])
Note that this is different from just resampling because we will have an observation for every date in the original dataframe (except the number of dates at the front needed to construct the initial window).
In [ ]:
ffr.rolling(window=7).max().head(10)
In [ ]:
ffr.resample("7D").max().head(10)
In [ ]:
# do a left merge on the index (date info)
df = pd.merge(ffr, vc, left_index=True, right_index=True, how="left")
df_info(df)
In [ ]:
vc.head()
Notice that we ended up with a lot of missing data. This happened for two reasons:

1. The ffr data starts in 1954, but the vc data doesn't begin until 1985.
2. The ffr data is at a daily frequency, but the vc data is quarterly.

To resolve the first issue we can subset the ffr data and only keep observations from 1985 on:
In [ ]:
ffr_recent = ffr["1985":]
To resolve the second issue we will take two steps:

1. Resample the ffr data to a monthly frequency.
2. Resample the vc data to a monthly frequency by padding. This is called upsampling because we are going from a lower frequency (quarterly) to a higher one (monthly).
In [ ]:
ffr_recentM = ffr_recent.resample("M").first()
vc_M = vc.resample("M").pad()
In [ ]:
vc_M.head()
Notice that using pad here just copied data forward to fill in missing months (e.g. the data for March 1985 was applied to April and May).
Now let's try that merge again
In [ ]:
df = pd.merge(ffr_recentM, vc_M, left_index=True, right_index=True, how="left")
print(df.head(6))
print("\n\n", df.tail(8))
That looks much better -- we have missing data at the top and the bottom for months that aren't available in the venture capital dataset, but nothing else should be missing.
Let's try to do something interesting with this data. We want to plot the growth rate in the risk free rate, early stage vc funding, and total vc funding for the months following the start of the dotcom boom (roughly Jan 1995) and the housing boom (roughly Jan 2004).
Read that again carefully.
For each of the three series we want 2 lines. For each line, the x axis will be quarters since the start of the boom, and the y axis will show growth rates since the first month of the bubble.
In [ ]:
# subset the data, then remove datetime index as we don't need it again
post_dotcom = df["1995":].reset_index(drop=True)
post_housing = df["2004":].reset_index(drop=True)
# take logs so we can do growth rates as log(x_{t+N}) - log(x_t)
post_dotcom = np.log(post_dotcom)
post_housing = np.log(post_housing)
dotcom_growth = post_dotcom - post_dotcom.iloc[0, :]
housing_growth = post_housing - post_housing.iloc[0, :]
In [ ]:
fig, axs = plt.subplots(3, 1, figsize=(10, 5))
variables = ["risk_free_rate", "Early Stage", "Total"]
for i in range(len(variables)):
var = variables[i]
# add dotcom line
dotcom_growth[var].plot(ax=axs[i])
# add housing line
housing_growth[var].plot(ax=axs[i])
# set title
axs[i].set_title(var)
# set legend and xlabel on last plot only
axs[-1].legend(["dotcom", "housing"])
axs[-1].set_xlabel("Quarters since boom")
# make subplots not overlap
fig.tight_layout()
So we see that these booms were qualitatively different in terms of the response of the risk free rate and venture capital funding.
In [ ]: