Timeseries


Overview. We introduce the tools for working with dates, times, and time series data. We start with functionality built into python itself, then discuss how pandas builds on these tools to add powerful time series capabilities to DataFrames.

Outline

  • Quandl: We show how to use the quandl package to access a large database of financial and economic data.
  • Dates in python: covers the basics of working with dates and times in python
  • Dates in pandas: shows how to use dates with pandas objects

Note: requires internet access to run.

This Jupyter notebook was created by Chase Coleman and Spencer Lyon for the NYU Stern course Data Bootcamp.

In order to run the code in this notebook, you will need to have the quandl package installed. You can do this from the command line using

pip install quandl --upgrade


In [2]:
import sys                             # system module
import pandas as pd                    # data package
import matplotlib.pyplot as plt        # graphics module  
import datetime as dt                  # date and time module
import numpy as np
%matplotlib inline    
plt.style.use("ggplot")

# quandl package
import quandl

# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('quandl version: ', quandl.version.VERSION)
print('Today: ', dt.date.today())

# helper function to print info about dataframe

def df_info(df):
    print("Shape: ", df.shape)
    print("dtypes: ", df.dtypes.to_dict())
    print("index dtype: ", df.index.dtype)
    return pd.concat([df.head(3), df.tail(3)])


Python version: 3.5.2 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:53:06) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Pandas version:  0.19.2
quandl version:  3.1.0
Today:  2017-04-24

Quandl

quandl is a company that collects and maintains financial and economic data from standard sources (e.g. FRED, IMF, BEA, etc.) and non-standard sources (Fx data, company level data, trader receipts). The data is viewable on their webpage (see here or there for examples), but made available to programming languages via their API. We will access their API using their python library.

Suppose, for example, that we wanted to get data on taxes in the US. Here's how we might find some:

  • Open up the quandl search page
  • Type in "US tax revenue"
  • Click on one of the results that seems interesting to us
  • Checkout things like the frequency (Annual for this data set), the quandl code (top right, here it is OECD/REV_NES_TOTALTAX_TAXUSD_USA) and description.

Exercise (5 min): Go to Quandl's website and explore some of the data quandl has available. Come up with 2 datasets and make a dictionary that maps the quandl code into a reasonable name. For example, for the us tax revenue dataset above I could have done

my_data = {"OECD/REV_NES_TOTALTAX_TAXUSD_USA": "US_tax_rev"}

In [ ]:

We can download the data using the quandl.get function and passing it one of the Quandl codes we collected in the previous exercise


In [3]:
us_tax = quandl.get("OECD/REV_NES_TOTALTAX_TAXUSD_USA")
df_info(us_tax)


Shape:  (51, 1)
dtypes:  {'Value': dtype('float64')}
index dtype:  datetime64[ns]
Out[3]:
Value
Date
1965-12-31 167022.00
1966-12-31 186065.00
1967-12-31 208577.00
2013-12-31 4283028.74
2014-12-31 4500529.62
2015-12-31 4754119.88

We can also pass start_date and end_date parameters to control the dates for the downloaded data:


In [4]:
us_tax_recent = quandl.get("OECD/REV_NES_TOTALTAX_TAXUSD_USA", start_date="2000-01-01")
df_info(us_tax_recent)


Shape:  (16, 1)
dtypes:  {'Value': dtype('float64')}
index dtype:  datetime64[ns]
Out[4]:
Value
Date
2000-12-31 2900519.06
2001-12-31 2884729.87
2002-12-31 2738000.14
2013-12-31 4283028.74
2014-12-31 4500529.62
2015-12-31 4754119.88

Now, let's read in the data sets we found were interesting. Feel free to use the codes you looked up, or the ones I'm using here.


In [5]:
# For multiple data sources, it's useful to define a list/dict
my_data = {"FRED/DFF": "risk_free_rate",
           "NVCA/VENTURE_3_09C": "vc_investments"}

dfs = []

for k in my_data.keys():
    dfs.append(quandl.get(k))

In [6]:
df_info(dfs[0])


Shape:  (126, 5)
dtypes:  {'Expansion': dtype('float64'), 'Seed': dtype('float64'), 'Later Stage': dtype('float64'), 'Total': dtype('float64'), 'Early Stage': dtype('float64')}
index dtype:  datetime64[ns]
Out[6]:
Seed Early Stage Expansion Later Stage Total
Date
1985-03-31 153.0000 96.3000 219.5000 154.4000 623.1000
1985-06-30 146.5000 185.3000 319.6000 89.4000 740.8000
1985-09-30 93.7000 106.3000 312.8000 164.4000 677.2000
2015-12-31 482.5349 5193.9647 3282.8216 3167.6023 12126.9235
2016-03-31 561.8670 4601.1276 4010.2623 3516.5483 12689.8052
2016-06-30 534.7283 3962.5863 8507.1810 2286.6906 15291.1862

To request specific columns use column indices (NOT 0-based)


In [35]:
quandl.get(['NVCA/VENTURE_3_09C.2', 'NVCA/VENTURE_3_09C.5']).head()


Out[35]:
NVCA/VENTURE_3_09C - Early Stage NVCA/VENTURE_3_09C - Total
Date
1985-03-31 96.3 623.1
1985-06-30 185.3 740.8
1985-09-30 106.3 677.2
1985-12-31 129.9 735.1
1986-03-31 129.6 710.5

To combine variables from different datasets


In [32]:
quandl.get(['NSE/OIL.1', 'WIKI/AAPL.4'])


Out[32]:
NSE/OIL - Open WIKI/AAPL - Close
Date
1980-12-12 NaN 28.75
1980-12-15 NaN 27.25
1980-12-16 NaN 25.25
1980-12-17 NaN 25.87
1980-12-18 NaN 26.63
1980-12-19 NaN 28.25
1980-12-22 NaN 29.63
1980-12-23 NaN 30.88
1980-12-24 NaN 32.50
1980-12-26 NaN 35.50
1980-12-29 NaN 36.00
1980-12-30 NaN 35.12
1980-12-31 NaN 34.13
1981-01-02 NaN 34.50
1981-01-05 NaN 33.75
1981-01-06 NaN 32.25
1981-01-07 NaN 30.88
1981-01-08 NaN 30.25
1981-01-09 NaN 31.88
1981-01-12 NaN 31.62
1981-01-13 NaN 30.50
1981-01-14 NaN 30.63
1981-01-15 NaN 31.25
1981-01-16 NaN 31.00
1981-01-19 NaN 32.87
1981-01-20 NaN 31.88
1981-01-21 NaN 32.50
1981-01-22 NaN 32.87
1981-01-23 NaN 32.75
1981-01-26 NaN 32.25
... ... ...
2017-03-10 325.00 139.14
2017-03-13 NaN 139.20
2017-03-14 337.20 138.99
2017-03-15 334.10 140.46
2017-03-16 336.80 140.69
2017-03-17 338.00 139.99
2017-03-20 336.00 141.46
2017-03-21 332.60 139.84
2017-03-22 330.05 141.42
2017-03-23 329.90 140.92
2017-03-24 331.70 140.64
2017-03-27 331.70 140.88
2017-03-28 332.20 143.80
2017-03-29 331.80 144.12
2017-03-30 331.95 143.93
2017-03-31 329.95 143.71
2017-04-03 333.40 143.70
2017-04-04 NaN 144.77
2017-04-05 330.80 144.02
2017-04-06 332.20 143.66
2017-04-07 334.00 143.34
2017-04-10 334.30 143.14
2017-04-11 340.50 141.63
2017-04-12 337.75 141.79
2017-04-13 334.10 141.05
2017-04-17 334.75 141.80
2017-04-18 334.45 141.20
2017-04-19 331.10 140.68
2017-04-20 332.70 142.44
2017-04-21 324.80 NaN

9233 rows × 2 columns

What happened to the first column?


In [33]:
mix = quandl.get(['NSE/OIL.1', 'WIKI/AAPL.4'])
mix.plot(subplots=True)


Out[33]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7fc9eccacf60>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7fc9e9526e10>], dtype=object)

In [36]:
df_info(dfs[1])


Shape:  (22940, 1)
dtypes:  {'risk_free_rate': dtype('float64')}
index dtype:  datetime64[ns]
Out[36]:
risk_free_rate
DATE
1954-07-01 1.13
1954-07-02 1.25
1954-07-03 1.25
2017-04-18 0.91
2017-04-19 0.91
2017-04-20 0.91

So, "FRED/DFF" is the federal funds rate, or the interest rate at which banks can trade federal assets with each other overnight. This is often used as a proxy for the risk free rate in economic analysis.

From the printout above it looks like we have more than 22k observations starting in 1954 at a daily frequency.

Notice, however that the column name is DFF. Let's use our dict to clean up that name:


In [14]:
dfs[1].rename(columns={"DFF": my_data["FRED/DFF"]}, inplace=True)
df_info(dfs[1])


Shape:  (22940, 1)
dtypes:  {'risk_free_rate': dtype('float64')}
index dtype:  datetime64[ns]
Out[14]:
risk_free_rate
DATE
1954-07-01 1.13
1954-07-02 1.25
1954-07-03 1.25
2017-04-18 0.91
2017-04-19 0.91
2017-04-20 0.91

We can change the sampling frequency:


In [44]:
quart = quandl.get("FRED/DFF", collapse='quarterly')
print(quart.head())
quart.plot()


             DFF
DATE            
1954-09-30  1.44
1954-12-31  1.44
1955-03-31  1.44
1955-06-30  1.69
1955-09-30  2.19
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e93dd588>

Or we can perform elementary calculations on the data


In [45]:
diff = quandl.get("FRED/DFF", transformation='rdiff')
diff.plot()


Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e94c3a90>

The other dataframe we dowloaded (using code NVCA/VENTURE_3_09C) contains quarterly data on total investment by venture capital firms in the US, broken down by the stage of the project.

The column names here are ok, so we don't need to change anything.

So that we have the data easily acessible for later on, let's store these two variables in individual dataframes:


In [46]:
ffr = dfs[1]
vc = dfs[0]

Usage Limits

The Quandl Python module is free. If you would like to make more than 50 calls a day, however, you will need to create a free Quandl account and set your API key:

quandl.ApiConfig.api_key = "YOUR_KEY_HERE"
mydata = quandl.get("FRED/GDP")

The personalized API key is available here: https://www.quandl.com/account/api

Dates in python

The date and time functionality in python comes from the built in datetime module. Notice above that we ran

import datetime as dt

We've been using the dt.date.today() function throughout this course when we print the date at the top of our notebooks, but we haven't given it very much thought. Let's take a closer look now.

To start, let's see what the type of dt.date.today() is


In [47]:
today = dt.date.today()
print("the type of today is ", type(today))


the type of today is  <class 'datetime.date'>

Given that we have an object of type datetime.date we can do things like ask for the day, month, and year


In [48]:
print("the day of the month is: ", today.day)
print("we are curretly in month number", today.month)
print("The year is", today.year)


the day of the month is:  24
we are curretly in month number 4
The year is 2017

timedelta

Suppose that we wanted to construct a "days until" counter.

To do this we will construct another datetime.date and use the - operator to find the differene between the other date and today.


In [50]:
# construct a date by hand
new_years_eve = dt.date(2017, 12, 31)

In [51]:
until_nye = new_years_eve - today

In [52]:
type(until_nye)


Out[52]:
datetime.timedelta

We can get the number of days until new years eve by looking at until_nye.days


In [53]:
until_nye.days


Out[53]:
251

Exercise: write a python function named days_until that accepts one argument (a datetime.date) and returns the number of days between today and that date. Apply your function to

  • May 5, 2017 (day the UG project is due)
  • Your birthday

In [54]:
def days_until(date):
    today = dt.date.today()
    numb_days = date - today
    
    return numb_days.days

In [55]:
project_due = dt.date(2017, 5, 5)

In [56]:
days_until(project_due)


Out[56]:
11

We could also construct a datetime.timedelta by hand and add it to an existing date. Here's an example to see how many days until Spencer turns 30


In [57]:
spencer_bday = dt.date(1989, 4, 25)

# NOTE: add 7 for the 7 leap years between 1989 and 2019
thirty_years = dt.timedelta(days=365*30 + 7)

In [58]:
# check to make sure it is still April 25th
spencer_bday + thirty_years


Out[58]:
datetime.date(2019, 4, 25)

In [59]:
days_to_30 = (spencer_bday + thirty_years - today).days
print("Spencer will be 30 in {} days".format(days_to_30))


Spencer will be 30 in 731 days

datetime

Being able to work with dates and the difference between dates is very useful, but sometimes we need to also think about times. To do that, we will look to the dt.datetime module.

We can get the current date and time using dt.datetime.now():


In [60]:
now = dt.datetime.now()
print("type of now:", type(now))
now


type of now: <class 'datetime.datetime'>
Out[60]:
datetime.datetime(2017, 4, 24, 11, 13, 54, 622506)

The numbers in the printout above are year, month, day, hour, minute, second, millisecond.

Because we still have day, month, year information ; we can access these properties just as we did for the today above:


In [61]:
print("the day of the month is: ", now.day)
print("we are curretly in month number", now.month)
print("The year is", now.year)


the day of the month is:  24
we are curretly in month number 4
The year is 2017

Exercise: Use tab completion to see what else we can access on our dt.datetime object now


In [67]:
now.weekday()


Out[67]:
0

Time deltas work the same way with datetime objects as they did with date objects.

We can see how many seconds until Spencer turns 30:


In [68]:
# NOTE: we can only do arithmetic between many date objects or datetime obejcts
#       we cannot add or subtract a datetime to/from a date. So, we need to 
#       re-create spencer_bday as a datetime object. 
# NOTE: The timedelta object is already compatible with date and datetime objects

spencer_bday_time = dt.datetime(1989, 4, 25, 16, 33, 5)
seconds_to_30 = (spencer_bday_time + thirty_years - now).seconds

print("Spencer will be 30 in {} seconds".format(seconds_to_30))


Spencer will be 30 in 19150 seconds

strftime

Once we have date and time information, a very common thing to do is to print out a formatted version of that date. For example, suppose we wanted to print out a string in the format YYYY-MM-DD. To do this we use the strftime method. Here's an example


In [69]:
print(today.strftime("Today is %Y-%m-%d"))


Today is 2017-04-24

Notice that the argument to strftime is a python string that can contain normal text (e.g. Today is) and a special formatters (the stuff starting with %). We haven't talked much about how to do string formatting, but in Python and many other languages using % inside strings has special meaning.

Exercise Using the documentation for the string formatting behavior, figure out how to write the following strings using the method strftime method on the spencer_bday_time object

  • "Spencer was born on 1989-04-25"

In [80]:
spencer_bday_time.strftime("Spencer was born on %Y-%m-%d")


Out[80]:
'Spencer was born on 1989-04-25'
  • "Spencer was born on a Tuesday"

In [79]:
spencer_bday_time.strftime("Spencer was born on a %A")


Out[79]:
'Spencer was born on a Tuesday'
  • "Spencer was born on Tuesday, Apr 25th"

In [78]:
spencer_bday_time.strftime("Spencer was born on %A, %b %dth")


Out[78]:
'Spencer was born on Tuesday, Apr 25th'
  • (bonus) "Spencer was born on Tuesday, April 25th at 04:33 PM"

In [77]:
spencer_bday_time.strftime("Spencer was born on %A, %B %dth at %I:%M %p")


Out[77]:
'Spencer was born on Tuesday, April 25th at 04:33 PM'

Dates in Pandas

Now we will look at how to use date and dateime functionality in pandas.

To begin, lets take a closer look at the type of index we have on our ffr and vc dataframes:


In [81]:
type(ffr.index)


Out[81]:
pandas.tseries.index.DatetimeIndex

Here we have a DatetimeIndex, which means pandas recogizes this DataFrame as containing time series data.

What can we do now? A lot, here's a brief list:

  • subset the data using strings to get data for a particular time frame
  • resample the data to a diffrent frequency: this means we could convert daily to monthly, quarterly, etc.
  • quickly access things like year, month, and day for the observation
  • rolling computations: this will allow us to compute statistics on a rolling subset of the data. We'll show a simple example here, but check out the docs for more info
  • snap the observations to a particular frequency -- this one is a bit advanced and we won't cover it here

For a much more comprehensive list with other examples see the docs

For now, let's look at how to do these things with the data we obtained from quandl

NOTE You can only do these things when you have a DatetimeIndex. This means that even if one of the columns in your DataFrame has date or datetime information, you will need to set it as the index to access this functionality.

subsetting

Suppose we wanted to extract all the data for the federal funds rate for the year 2008.


In [82]:
ffr2008 = ffr["2008"]
print("ffr2008 is a", type(ffr2008))
df_info(ffr2008)


ffr2008 is a <class 'pandas.core.frame.DataFrame'>
Shape:  (366, 1)
dtypes:  {'risk_free_rate': dtype('float64')}
index dtype:  datetime64[ns]
Out[82]:
risk_free_rate
DATE
2008-01-01 3.06
2008-01-02 4.11
2008-01-03 4.25
2008-12-29 0.10
2008-12-30 0.09
2008-12-31 0.14

In [83]:
ffr2008.plot()


Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e8e44ba8>

Suppose we want to restrict to September 2008:


In [84]:
ffr_sep2008 = ffr["2008-09"]
df_info(ffr_sep2008)


Shape:  (30, 1)
dtypes:  {'risk_free_rate': dtype('float64')}
index dtype:  datetime64[ns]
Out[84]:
risk_free_rate
DATE
2008-09-01 1.94
2008-09-02 1.96
2008-09-03 2.01
2008-09-28 1.08
2008-09-29 1.56
2008-09-30 2.03

In [85]:
ffr_sep2008.plot()


Out[85]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e9157748>

We can use this same functionality to extract ranges of dates. To get the data starting in june 2007 and going until march 2011 we would do


In [86]:
ffr2 = ffr["2007-06":"2011-03"]
df_info(ffr2)


Shape:  (1400, 1)
dtypes:  {'risk_free_rate': dtype('float64')}
index dtype:  datetime64[ns]
Out[86]:
risk_free_rate
DATE
2007-06-01 5.23
2007-06-02 5.23
2007-06-03 5.23
2011-03-29 0.13
2011-03-30 0.13
2011-03-31 0.10

In [87]:
ffr2.plot()


Out[87]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e91c59e8>

Exercise Using one of your datasets from quandl, plot one or more variables for the last 3 years (2013 through 2016)


In [91]:
vc['2013':'2016'].plot()


Out[91]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e8d63a20>

resampling

Now suppose that instead of daily data, we wanted our federal funds data at a monthly frequency.

To do this we use the resample method on our DataFrame


In [92]:
# MS means "month start"
ffrM_resample = ffr.resample("MS")
type(ffrM_resample)


Out[92]:
pandas.tseries.resample.DatetimeIndexResampler

Notice that when we call resample we don't get back a DataFrame at that frequency. This is because there is some ambiguity regarding just how the frequency should be converted: should we take the average during the period, the first observation, last observation, sum the observations?

In order to get a DataFrame we have to call a method on our DatetimeIndexResampler object. For this example, let's do the first observation in each period:


In [93]:
ffrM = ffrM_resample.first()
df_info(ffrM)


Shape:  (754, 1)
dtypes:  {'risk_free_rate': dtype('float64')}
index dtype:  datetime64[ns]
Out[93]:
risk_free_rate
DATE
1954-07-01 1.13
1954-08-01 0.25
1954-09-01 1.44
2017-02-01 0.66
2017-03-01 0.66
2017-04-01 0.82

Note that we can also combine numbers with the specification of the resampling frequency. As an example, we can resample to a bi-weekly frequency using


In [96]:
ffr.resample("2w")


Out[96]:
DatetimeIndexResampler [freq=<2 * Weeks: weekday=6>, axis=0, closed=right, label=right, convention=start, base=0]

In [97]:
ffr.resample("2w").mean()


Out[97]:
risk_free_rate
DATE
1954-07-04 1.220000
1954-07-18 0.974286
1954-08-01 0.465000
1954-08-15 1.115000
1954-08-29 1.364286
1954-09-12 1.275000
1954-09-26 0.832143
1954-10-10 0.911429
1954-10-24 0.868571
1954-11-07 0.886429
1954-11-21 0.493571
1954-12-05 1.386429
1954-12-19 1.300714
1955-01-02 1.237857
1955-01-16 1.381429
1955-01-30 1.390714
1955-02-13 1.282857
1955-02-27 1.300714
1955-03-13 1.395000
1955-03-27 1.278571
1955-04-10 1.440000
1955-04-24 1.476429
1955-05-08 1.475000
1955-05-22 1.193571
1955-06-05 1.662857
1955-06-19 1.640714
1955-07-03 1.642143
1955-07-17 1.677143
1955-07-31 1.685714
1955-08-14 1.916429
... ...
2016-03-20 0.364286
2016-04-03 0.361429
2016-04-17 0.370000
2016-05-01 0.355000
2016-05-15 0.370000
2016-05-29 0.370000
2016-06-12 0.364286
2016-06-26 0.382857
2016-07-10 0.397857
2016-07-24 0.400000
2016-08-07 0.378571
2016-08-21 0.400000
2016-09-04 0.392857
2016-09-18 0.400000
2016-10-02 0.376429
2016-10-16 0.404286
2016-10-30 0.410000
2016-11-13 0.402857
2016-11-27 0.410000
2016-12-11 0.402857
2016-12-25 0.606429
2017-01-08 0.628571
2017-01-22 0.660000
2017-02-05 0.652857
2017-02-19 0.660000
2017-03-05 0.653571
2017-03-19 0.731429
2017-04-02 0.890714
2017-04-16 0.910000
2017-04-30 0.910000

1640 rows × 1 columns

Exercise: Using the documentation for the most common frequencies, figure out how to resample one of your datasets to

  • A quarterly frequency -- make sure to get the start of the quarter

In [99]:
ffr.resample('QS').mean().head()


Out[99]:
risk_free_rate
DATE
1954-07-01 1.028478
1954-10-01 0.988370
1955-01-01 1.345000
1955-04-01 1.501099
1955-07-01 1.936522
  • An annual frequency -- use the end of the year

In [100]:
ffr.resample('A').mean().head()


Out[100]:
risk_free_rate
DATE
1954-12-31 1.008424
1955-12-31 1.788603
1956-12-31 2.729590
1957-12-31 3.105342
1958-12-31 1.572356

more than you need: I want to point out that when you use the first or last methods to perform the aggregations, there are two dates involved: (1) the date the resultant index will have and (2) the date used to fill in the data at that date. The first date (one on the index) will be assigned based on the string you pass to the resample method. The second date (the one for extracting data from the original dataframe) will be determined based on the method used to do the aggregation. first will extract the first data point from that subset and last will extract the last.

Let's see some examples:


In [101]:
ffr.resample("M").first().head()


Out[101]:
risk_free_rate
DATE
1954-07-31 1.13
1954-08-31 0.25
1954-09-30 1.44
1954-10-31 0.50
1954-11-30 1.38

In [103]:
ffr.resample("M").last().head()


Out[103]:
risk_free_rate
DATE
1954-07-31 0.25
1954-08-31 1.44
1954-09-30 1.44
1954-10-31 1.13
1954-11-30 1.38

Notice that the index is the same on both, but the data is clearly different.

If we use MS instead of M we will have the index based on the first day of the month:


In [104]:
ffr.resample("MS").first().head()


Out[104]:
risk_free_rate
DATE
1954-07-01 1.13
1954-08-01 0.25
1954-09-01 1.44
1954-10-01 0.50
1954-11-01 1.38

In [105]:
ffr.resample("MS").last().head()


Out[105]:
risk_free_rate
DATE
1954-07-01 0.25
1954-08-01 1.44
1954-09-01 1.44
1954-10-01 1.13
1954-11-01 1.38

Notice how the data associated with "M" and first is the same as the data for "MS" and first. The same holds for last.

Access year, month, day...

Given a DatetimeIndex you can access the day, month, or year (also second, millisecond, etc.) by simply accessing the .XX property; where XX is the data you want


In [106]:
ffr.index.year


Out[106]:
array([1954, 1954, 1954, ..., 2017, 2017, 2017], dtype=int32)

In [107]:
ffr.index.day


Out[107]:
array([ 1,  2,  3, ..., 18, 19, 20], dtype=int32)

In [108]:
ffr.index.month


Out[108]:
array([7, 7, 7, ..., 4, 4, 4], dtype=int32)

Rolling computations

We can use pandas to do rolling computations.

For example, suppose we want to plot the maximum and minimum of the risk free rate within the past week at each date (think about that slowly -- for every date, we want to look back 7 days and compute the max).

Here's how we can do that


In [111]:
fig, ax = plt.subplots()
ffr.rolling(window=7).max().plot(ax=ax)
ffr.rolling(window=7).min().plot(ax=ax)
ax.legend(["max", "min"])


Out[111]:
<matplotlib.legend.Legend at 0x7fc9e2c49240>

Note that this is different from just resampling because we will have an observation for every date in the original dataframe (except the number of dates at the front needed to construct the initial window).


In [112]:
ffr.rolling(window=7).max().head(10)


Out[112]:
risk_free_rate
DATE
1954-07-01 NaN
1954-07-02 NaN
1954-07-03 NaN
1954-07-04 NaN
1954-07-05 NaN
1954-07-06 NaN
1954-07-07 1.25
1954-07-08 1.25
1954-07-09 1.25
1954-07-10 1.25

In [113]:
ffr.resample("7D").max().head(10)


Out[113]:
risk_free_rate
DATE
1954-07-01 1.25
1954-07-08 1.25
1954-07-15 0.75
1954-07-22 0.75
1954-07-29 0.38
1954-08-05 1.44
1954-08-12 1.44
1954-08-19 1.44
1954-08-26 1.44
1954-09-02 1.38

Merging with dates

Let's see what happens when we merge the ffr and vc datasets


In [114]:
# do a left merge on the index (date info)
df = pd.merge(ffr, vc, left_index=True, right_index=True, how="left")
df_info(df)


Shape:  (22940, 6)
dtypes:  {'Seed': dtype('float64'), 'Expansion': dtype('float64'), 'risk_free_rate': dtype('float64'), 'Later Stage': dtype('float64'), 'Total': dtype('float64'), 'Early Stage': dtype('float64')}
index dtype:  datetime64[ns]
Out[114]:
risk_free_rate Seed Early Stage Expansion Later Stage Total
DATE
1954-07-01 1.13 NaN NaN NaN NaN NaN
1954-07-02 1.25 NaN NaN NaN NaN NaN
1954-07-03 1.25 NaN NaN NaN NaN NaN
2017-04-18 0.91 NaN NaN NaN NaN NaN
2017-04-19 0.91 NaN NaN NaN NaN NaN
2017-04-20 0.91 NaN NaN NaN NaN NaN

In [115]:
vc.head()


Out[115]:
Seed Early Stage Expansion Later Stage Total
Date
1985-03-31 153.0 96.3 219.5 154.4 623.1
1985-06-30 146.5 185.3 319.6 89.4 740.8
1985-09-30 93.7 106.3 312.8 164.4 677.2
1985-12-31 133.0 129.9 393.7 78.5 735.1
1986-03-31 185.6 129.6 270.0 125.3 710.5

Notice that we ended up with a lot of missing data. This happened for two reasons:

  1. The ffr data goes back to 1954, but the vc data starts in 1985
  2. The ffr data is at a daily frequency, but vc is at quarterly.

To resolve the first issue we can subset the ffr data and only keep from 1985 on


In [116]:
ffr_recent = ffr["1985":]

To resolve the second issue we will do two-steps:

  1. resample the ffr data to a monthly frequency
  2. resample the vc data to a monthly frequency by padding. This is called upsampling because we are going from a lower frequency (quarterly) to a higher one (monthly)

In [117]:
ffr_recentM = ffr_recent.resample("M").first()
vc_M = vc.resample("M").pad()

In [118]:
vc_M.head()


Out[118]:
Seed Early Stage Expansion Later Stage Total
Date
1985-03-31 153.0 96.3 219.5 154.4 623.1
1985-04-30 153.0 96.3 219.5 154.4 623.1
1985-05-31 153.0 96.3 219.5 154.4 623.1
1985-06-30 146.5 185.3 319.6 89.4 740.8
1985-07-31 146.5 185.3 319.6 89.4 740.8

Notice that using pad here just copied data forwards to fill in missing months (e.g. the data for March 1985 was applied to April and May)

Now let's try that merge again


In [119]:
df = pd.merge(ffr_recentM, vc_M, left_index=True, right_index=True, how="left")
print(df.head(6))
print("\n\n", df.tail(8))


            risk_free_rate   Seed  Early Stage  Expansion  Later Stage  Total
DATE                                                                         
1985-01-31            8.74    NaN          NaN        NaN          NaN    NaN
1985-02-28            8.74    NaN          NaN        NaN          NaN    NaN
1985-03-31            8.74  153.0         96.3      219.5        154.4  623.1
1985-04-30            8.83  153.0         96.3      219.5        154.4  623.1
1985-05-31            8.83  153.0         96.3      219.5        154.4  623.1
1985-06-30            7.64  146.5        185.3      319.6         89.4  740.8


             risk_free_rate  Seed  Early Stage  Expansion  Later Stage  Total
DATE                                                                        
2016-09-30            0.40   NaN          NaN        NaN          NaN    NaN
2016-10-31            0.29   NaN          NaN        NaN          NaN    NaN
2016-11-30            0.41   NaN          NaN        NaN          NaN    NaN
2016-12-31            0.41   NaN          NaN        NaN          NaN    NaN
2017-01-31            0.55   NaN          NaN        NaN          NaN    NaN
2017-02-28            0.66   NaN          NaN        NaN          NaN    NaN
2017-03-31            0.66   NaN          NaN        NaN          NaN    NaN
2017-04-30            0.82   NaN          NaN        NaN          NaN    NaN

That looks much better -- we have missing data at the top and the bottom for months that aren't available in the venture capital dataset, but nothing else should be missing.

Let's try to do something interesting with this data. We want to plot the growth rate in the risk free rate, early stage vc funding, and total vc funding for the months following the start of the dotcom boom (rougly Jan 1995) and the housing boom (roughly Jan 2004).

Read that again carefully.

For each of the three series we want 2 lines. For each line, the x axis will be quarters since start of boom. The y axis growth rates since first month of bubble.


In [120]:
# subset the data, then remove datetime index as we don't need it again
post_dotcom = df["1995":].reset_index(drop=True)
post_housing = df["2004":].reset_index(drop=True)

# take logs so we can do growth rates as log(x_{t+N}) - log(x_t)
post_dotcom = np.log(post_dotcom)
post_housing = np.log(post_housing)

dotcom_growth = post_dotcom - post_dotcom.iloc[0, :]
housing_growth = post_housing - post_housing.iloc[0, :]

In [121]:
fig, axs = plt.subplots(3, 1, figsize=(10, 5))

variables = ["risk_free_rate", "Early Stage", "Total"]

for i in range(len(variables)):
    var = variables[i]
    
    # add dotcom line
    dotcom_growth[var].plot(ax=axs[i])
    
    # add housing line
    housing_growth[var].plot(ax=axs[i])
    
    # set title
    axs[i].set_title(var)

# set legend and xlabel on last plot only
axs[-1].legend(["dotcom", "housing"])
axs[-1].set_xlabel("Quarters since boom")
        
# make subplots not overlap
fig.tight_layout()


So we see that these booms were qualitatively different in terms of the response of the risk free rate and venture capital funding.


In [ ]: