Timeseries

Overview. We introduce the tools for working with dates, times, and time series data. We start with functionality built into python itself, then discuss how pandas builds on these tools to add powerful time series capabilities to DataFrames.

Outline

Quandl: We show how to use the quandl package to access a large database of financial and economic data.
Dates in python: covers the basics of working with dates and times in python
Dates in pandas: shows how to use dates with pandas objects

Note: requires internet access to run.

This Jupyter notebook was created by Chase Coleman and Spencer Lyon for the NYU Stern course Data Bootcamp.

In order to run the code in this notebook, you will need to have the quandl package installed. You can do this from the command line using

pip install quandl --upgrade



In [2]:

    
import sys                             # system module
import pandas as pd                    # data package
import matplotlib.pyplot as plt        # graphics module  
import datetime as dt                  # date and time module
import numpy as np
%matplotlib inline    
plt.style.use("ggplot")

# quandl package
import quandl

# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('quandl version: ', quandl.version.VERSION)
print('Today: ', dt.date.today())

# helper function to print info about dataframe

def df_info(df):
    print("Shape: ", df.shape)
    print("dtypes: ", df.dtypes.to_dict())
    print("index dtype: ", df.index.dtype)
    return pd.concat([df.head(3), df.tail(3)])









    



Python version: 3.5.2 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:53:06) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Pandas version:  0.19.2
quandl version:  3.1.0
Today:  2017-04-24

Quandl

quandl is a company that collects and maintains financial and economic data from standard sources (e.g. FRED, IMF, BEA, etc.) and non-standard sources (Fx data, company level data, trader receipts). The data is viewable on their webpage (see here or there for examples), but made available to programming languages via their API. We will access their API using their python library.

Suppose, for example, that we wanted to get data on taxes in the US. Here's how we might find some:

Open up the quandl search page
Type in "US tax revenue"
Click on one of the results that seems interesting to us
Checkout things like the frequency (Annual for this data set), the quandl code (top right, here it is OECD/REV_NES_TOTALTAX_TAXUSD_USA) and description.

Exercise (5 min): Go to Quandl's website and explore some of the data quandl has available. Come up with 2 datasets and make a dictionary that maps the quandl code into a reasonable name. For example, for the us tax revenue dataset above I could have done

my_data = {"OECD/REV_NES_TOTALTAX_TAXUSD_USA": "US_tax_rev"}



In [ ]:

We can download the data using the quandl.get function and passing it one of the Quandl codes we collected in the previous exercise



In [3]:

    
us_tax = quandl.get("OECD/REV_NES_TOTALTAX_TAXUSD_USA")
df_info(us_tax)









    



Shape:  (51, 1)
dtypes:  {'Value': dtype('float64')}
index dtype:  datetime64[ns]






    Out[3]:






  
    
      
      Value
    
    
      Date
      
    
  
  
    
      1965-12-31
      167022.00
    
    
      1966-12-31
      186065.00
    
    
      1967-12-31
      208577.00
    
    
      2013-12-31
      4283028.74
    
    
      2014-12-31
      4500529.62
    
    
      2015-12-31
      4754119.88

We can also pass start_date and end_date parameters to control the dates for the downloaded data:



In [4]:

    
us_tax_recent = quandl.get("OECD/REV_NES_TOTALTAX_TAXUSD_USA", start_date="2000-01-01")
df_info(us_tax_recent)









    



Shape:  (16, 1)
dtypes:  {'Value': dtype('float64')}
index dtype:  datetime64[ns]






    Out[4]:






  
    
      
      Value
    
    
      Date
      
    
  
  
    
      2000-12-31
      2900519.06
    
    
      2001-12-31
      2884729.87
    
    
      2002-12-31
      2738000.14
    
    
      2013-12-31
      4283028.74
    
    
      2014-12-31
      4500529.62
    
    
      2015-12-31
      4754119.88

Now, let's read in the data sets we found were interesting. Feel free to use the codes you looked up, or the ones I'm using here.



In [5]:

    
# For multiple data sources, it's useful to define a list/dict
my_data = {"FRED/DFF": "risk_free_rate",
           "NVCA/VENTURE_3_09C": "vc_investments"}

dfs = []

for k in my_data.keys():
    dfs.append(quandl.get(k))



In [6]:

    
df_info(dfs[0])









    



Shape:  (126, 5)
dtypes:  {'Expansion': dtype('float64'), 'Seed': dtype('float64'), 'Later Stage': dtype('float64'), 'Total': dtype('float64'), 'Early Stage': dtype('float64')}
index dtype:  datetime64[ns]






    Out[6]:






  
    
      
      Seed
      Early Stage
      Expansion
      Later Stage
      Total
    
    
      Date
      
      
      
      
      
    
  
  
    
      1985-03-31
      153.0000
      96.3000
      219.5000
      154.4000
      623.1000
    
    
      1985-06-30
      146.5000
      185.3000
      319.6000
      89.4000
      740.8000
    
    
      1985-09-30
      93.7000
      106.3000
      312.8000
      164.4000
      677.2000
    
    
      2015-12-31
      482.5349
      5193.9647
      3282.8216
      3167.6023
      12126.9235
    
    
      2016-03-31
      561.8670
      4601.1276
      4010.2623
      3516.5483
      12689.8052
    
    
      2016-06-30
      534.7283
      3962.5863
      8507.1810
      2286.6906
      15291.1862

To request specific columns use column indices (NOT 0-based)



In [35]:

    
quandl.get(['NVCA/VENTURE_3_09C.2', 'NVCA/VENTURE_3_09C.5']).head()









    Out[35]:






  
    
      
      NVCA/VENTURE_3_09C - Early Stage
      NVCA/VENTURE_3_09C - Total
    
    
      Date
      
      
    
  
  
    
      1985-03-31
      96.3
      623.1
    
    
      1985-06-30
      185.3
      740.8
    
    
      1985-09-30
      106.3
      677.2
    
    
      1985-12-31
      129.9
      735.1
    
    
      1986-03-31
      129.6
      710.5

To combine variables from different datasets



In [32]:

    
quandl.get(['NSE/OIL.1', 'WIKI/AAPL.4'])









    Out[32]:






  
    
      
      NSE/OIL - Open
      WIKI/AAPL - Close
    
    
      Date
      
      
    
  
  
    
      1980-12-12
      NaN
      28.75
    
    
      1980-12-15
      NaN
      27.25
    
    
      1980-12-16
      NaN
      25.25
    
    
      1980-12-17
      NaN
      25.87
    
    
      1980-12-18
      NaN
      26.63
    
    
      1980-12-19
      NaN
      28.25
    
    
      1980-12-22
      NaN
      29.63
    
    
      1980-12-23
      NaN
      30.88
    
    
      1980-12-24
      NaN
      32.50
    
    
      1980-12-26
      NaN
      35.50
    
    
      1980-12-29
      NaN
      36.00
    
    
      1980-12-30
      NaN
      35.12
    
    
      1980-12-31
      NaN
      34.13
    
    
      1981-01-02
      NaN
      34.50
    
    
      1981-01-05
      NaN
      33.75
    
    
      1981-01-06
      NaN
      32.25
    
    
      1981-01-07
      NaN
      30.88
    
    
      1981-01-08
      NaN
      30.25
    
    
      1981-01-09
      NaN
      31.88
    
    
      1981-01-12
      NaN
      31.62
    
    
      1981-01-13
      NaN
      30.50
    
    
      1981-01-14
      NaN
      30.63
    
    
      1981-01-15
      NaN
      31.25
    
    
      1981-01-16
      NaN
      31.00
    
    
      1981-01-19
      NaN
      32.87
    
    
      1981-01-20
      NaN
      31.88
    
    
      1981-01-21
      NaN
      32.50
    
    
      1981-01-22
      NaN
      32.87
    
    
      1981-01-23
      NaN
      32.75
    
    
      1981-01-26
      NaN
      32.25
    
    
      ...
      ...
      ...
    
    
      2017-03-10
      325.00
      139.14
    
    
      2017-03-13
      NaN
      139.20
    
    
      2017-03-14
      337.20
      138.99
    
    
      2017-03-15
      334.10
      140.46
    
    
      2017-03-16
      336.80
      140.69
    
    
      2017-03-17
      338.00
      139.99
    
    
      2017-03-20
      336.00
      141.46
    
    
      2017-03-21
      332.60
      139.84
    
    
      2017-03-22
      330.05
      141.42
    
    
      2017-03-23
      329.90
      140.92
    
    
      2017-03-24
      331.70
      140.64
    
    
      2017-03-27
      331.70
      140.88
    
    
      2017-03-28
      332.20
      143.80
    
    
      2017-03-29
      331.80
      144.12
    
    
      2017-03-30
      331.95
      143.93
    
    
      2017-03-31
      329.95
      143.71
    
    
      2017-04-03
      333.40
      143.70
    
    
      2017-04-04
      NaN
      144.77
    
    
      2017-04-05
      330.80
      144.02
    
    
      2017-04-06
      332.20
      143.66
    
    
      2017-04-07
      334.00
      143.34
    
    
      2017-04-10
      334.30
      143.14
    
    
      2017-04-11
      340.50
      141.63
    
    
      2017-04-12
      337.75
      141.79
    
    
      2017-04-13
      334.10
      141.05
    
    
      2017-04-17
      334.75
      141.80
    
    
      2017-04-18
      334.45
      141.20
    
    
      2017-04-19
      331.10
      140.68
    
    
      2017-04-20
      332.70
      142.44
    
    
      2017-04-21
      324.80
      NaN
    
  

9233 rows × 2 columns

What happened to the first column?



In [33]:

    
mix = quandl.get(['NSE/OIL.1', 'WIKI/AAPL.4'])
mix.plot(subplots=True)









    Out[33]:





array([<matplotlib.axes._subplots.AxesSubplot object at 0x7fc9eccacf60>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7fc9e9526e10>], dtype=object)



In [36]:

    
df_info(dfs[1])









    



Shape:  (22940, 1)
dtypes:  {'risk_free_rate': dtype('float64')}
index dtype:  datetime64[ns]






    Out[36]:






  
    
      
      risk_free_rate
    
    
      DATE
      
    
  
  
    
      1954-07-01
      1.13
    
    
      1954-07-02
      1.25
    
    
      1954-07-03
      1.25
    
    
      2017-04-18
      0.91
    
    
      2017-04-19
      0.91
    
    
      2017-04-20
      0.91

So, "FRED/DFF" is the federal funds rate, or the interest rate at which banks can trade federal assets with each other overnight. This is often used as a proxy for the risk free rate in economic analysis.

From the printout above it looks like we have more than 22k observations starting in 1954 at a daily frequency.

Notice, however that the column name is DFF. Let's use our dict to clean up that name:



In [14]:

    
dfs[1].rename(columns={"DFF": my_data["FRED/DFF"]}, inplace=True)
df_info(dfs[1])









    



Shape:  (22940, 1)
dtypes:  {'risk_free_rate': dtype('float64')}
index dtype:  datetime64[ns]






    Out[14]:






  
    
      
      risk_free_rate
    
    
      DATE
      
    
  
  
    
      1954-07-01
      1.13
    
    
      1954-07-02
      1.25
    
    
      1954-07-03
      1.25
    
    
      2017-04-18
      0.91
    
    
      2017-04-19
      0.91
    
    
      2017-04-20
      0.91

We can change the sampling frequency:



In [44]:

    
quart = quandl.get("FRED/DFF", collapse='quarterly')
print(quart.head())
quart.plot()









    



             DFF
DATE            
1954-09-30  1.44
1954-12-31  1.44
1955-03-31  1.44
1955-06-30  1.69
1955-09-30  2.19






    Out[44]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e93dd588>

Or we can perform elementary calculations on the data



In [45]:

    
diff = quandl.get("FRED/DFF", transformation='rdiff')
diff.plot()









    Out[45]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e94c3a90>

The other dataframe we dowloaded (using code NVCA/VENTURE_3_09C) contains quarterly data on total investment by venture capital firms in the US, broken down by the stage of the project.

The column names here are ok, so we don't need to change anything.

So that we have the data easily acessible for later on, let's store these two variables in individual dataframes:



In [46]:

    
ffr = dfs[1]
vc = dfs[0]

Usage Limits

The Quandl Python module is free. If you would like to make more than 50 calls a day, however, you will need to create a free Quandl account and set your API key:

quandl.ApiConfig.api_key = "YOUR_KEY_HERE"
mydata = quandl.get("FRED/GDP")

The personalized API key is available here: https://www.quandl.com/account/api

Dates in python

The date and time functionality in python comes from the built in datetime module. Notice above that we ran

import datetime as dt

We've been using the dt.date.today() function throughout this course when we print the date at the top of our notebooks, but we haven't given it very much thought. Let's take a closer look now.

To start, let's see what the type of dt.date.today() is



In [47]:

    
today = dt.date.today()
print("the type of today is ", type(today))









    



the type of today is  <class 'datetime.date'>

Given that we have an object of type datetime.date we can do things like ask for the day, month, and year



In [48]:

    
print("the day of the month is: ", today.day)
print("we are curretly in month number", today.month)
print("The year is", today.year)









    



the day of the month is:  24
we are curretly in month number 4
The year is 2017

timedelta

Suppose that we wanted to construct a "days until" counter.

To do this we will construct another datetime.date and use the - operator to find the differene between the other date and today.



In [50]:

    
# construct a date by hand
new_years_eve = dt.date(2017, 12, 31)



In [51]:

    
until_nye = new_years_eve - today



In [52]:

    
type(until_nye)









    Out[52]:





datetime.timedelta

We can get the number of days until new years eve by looking at until_nye.days



In [53]:

    
until_nye.days









    Out[53]:





251

Exercise: write a python function named days_until that accepts one argument (a datetime.date) and returns the number of days between today and that date. Apply your function to

May 5, 2017 (day the UG project is due)
Your birthday



In [54]:

    
def days_until(date):
    today = dt.date.today()
    numb_days = date - today
    
    return numb_days.days



In [55]:

    
project_due = dt.date(2017, 5, 5)



In [56]:

    
days_until(project_due)









    Out[56]:





11

We could also construct a datetime.timedelta by hand and add it to an existing date. Here's an example to see how many days until Spencer turns 30



In [57]:

    
spencer_bday = dt.date(1989, 4, 25)

# NOTE: add 7 for the 7 leap years between 1989 and 2019
thirty_years = dt.timedelta(days=365*30 + 7)



In [58]:

    
# check to make sure it is still April 25th
spencer_bday + thirty_years









    Out[58]:





datetime.date(2019, 4, 25)



In [59]:

    
days_to_30 = (spencer_bday + thirty_years - today).days
print("Spencer will be 30 in {} days".format(days_to_30))









    



Spencer will be 30 in 731 days

datetime

Being able to work with dates and the difference between dates is very useful, but sometimes we need to also think about times. To do that, we will look to the dt.datetime module.

We can get the current date and time using dt.datetime.now():



In [60]:

    
now = dt.datetime.now()
print("type of now:", type(now))
now









    



type of now: <class 'datetime.datetime'>






    Out[60]:





datetime.datetime(2017, 4, 24, 11, 13, 54, 622506)

The numbers in the printout above are year, month, day, hour, minute, second, millisecond.

Because we still have day, month, year information ; we can access these properties just as we did for the today above:



In [61]:

    
print("the day of the month is: ", now.day)
print("we are curretly in month number", now.month)
print("The year is", now.year)









    



the day of the month is:  24
we are curretly in month number 4
The year is 2017

Exercise: Use tab completion to see what else we can access on our dt.datetime object now



In [67]:

    
now.weekday()









    Out[67]:





0

Time deltas work the same way with datetime objects as they did with date objects.

We can see how many seconds until Spencer turns 30:



In [68]:

    
# NOTE: we can only do arithmetic between many date objects or datetime obejcts
#       we cannot add or subtract a datetime to/from a date. So, we need to 
#       re-create spencer_bday as a datetime object. 
# NOTE: The timedelta object is already compatible with date and datetime objects

spencer_bday_time = dt.datetime(1989, 4, 25, 16, 33, 5)
seconds_to_30 = (spencer_bday_time + thirty_years - now).seconds

print("Spencer will be 30 in {} seconds".format(seconds_to_30))









    



Spencer will be 30 in 19150 seconds

strftime

Once we have date and time information, a very common thing to do is to print out a formatted version of that date. For example, suppose we wanted to print out a string in the format YYYY-MM-DD. To do this we use the strftime method. Here's an example



In [69]:

    
print(today.strftime("Today is %Y-%m-%d"))









    



Today is 2017-04-24

Notice that the argument to strftime is a python string that can contain normal text (e.g. Today is) and a special formatters (the stuff starting with %). We haven't talked much about how to do string formatting, but in Python and many other languages using % inside strings has special meaning.

Exercise Using the documentation for the string formatting behavior, figure out how to write the following strings using the method strftime method on the spencer_bday_time object

"Spencer was born on 1989-04-25"



In [80]:

    
spencer_bday_time.strftime("Spencer was born on %Y-%m-%d")









    Out[80]:





'Spencer was born on 1989-04-25'

"Spencer was born on a Tuesday"



In [79]:

    
spencer_bday_time.strftime("Spencer was born on a %A")









    Out[79]:





'Spencer was born on a Tuesday'

"Spencer was born on Tuesday, Apr 25th"



In [78]:

    
spencer_bday_time.strftime("Spencer was born on %A, %b %dth")









    Out[78]:





'Spencer was born on Tuesday, Apr 25th'

(bonus) "Spencer was born on Tuesday, April 25th at 04:33 PM"



In [77]:

    
spencer_bday_time.strftime("Spencer was born on %A, %B %dth at %I:%M %p")









    Out[77]:





'Spencer was born on Tuesday, April 25th at 04:33 PM'

Dates in Pandas

Now we will look at how to use date and dateime functionality in pandas.

To begin, lets take a closer look at the type of index we have on our ffr and vc dataframes:



In [81]:

    
type(ffr.index)









    Out[81]:





pandas.tseries.index.DatetimeIndex

Here we have a DatetimeIndex, which means pandas recogizes this DataFrame as containing time series data.

What can we do now? A lot, here's a brief list:

subset the data using strings to get data for a particular time frame
resample the data to a diffrent frequency: this means we could convert daily to monthly, quarterly, etc.
quickly access things like year, month, and day for the observation
rolling computations: this will allow us to compute statistics on a rolling subset of the data. We'll show a simple example here, but check out the docs for more info
snap the observations to a particular frequency -- this one is a bit advanced and we won't cover it here

For a much more comprehensive list with other examples see the docs

For now, let's look at how to do these things with the data we obtained from quandl

NOTE You can only do these things when you have a DatetimeIndex. This means that even if one of the columns in your DataFrame has date or datetime information, you will need to set it as the index to access this functionality.

subsetting

Suppose we wanted to extract all the data for the federal funds rate for the year 2008.



In [82]:

    
ffr2008 = ffr["2008"]
print("ffr2008 is a", type(ffr2008))
df_info(ffr2008)









    



ffr2008 is a <class 'pandas.core.frame.DataFrame'>
Shape:  (366, 1)
dtypes:  {'risk_free_rate': dtype('float64')}
index dtype:  datetime64[ns]






    Out[82]:






  
    
      
      risk_free_rate
    
    
      DATE
      
    
  
  
    
      2008-01-01
      3.06
    
    
      2008-01-02
      4.11
    
    
      2008-01-03
      4.25
    
    
      2008-12-29
      0.10
    
    
      2008-12-30
      0.09
    
    
      2008-12-31
      0.14



In [83]:

    
ffr2008.plot()









    Out[83]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e8e44ba8>

Suppose we want to restrict to September 2008:



In [84]:

    
ffr_sep2008 = ffr["2008-09"]
df_info(ffr_sep2008)









    



Shape:  (30, 1)
dtypes:  {'risk_free_rate': dtype('float64')}
index dtype:  datetime64[ns]






    Out[84]:






  
    
      
      risk_free_rate
    
    
      DATE
      
    
  
  
    
      2008-09-01
      1.94
    
    
      2008-09-02
      1.96
    
    
      2008-09-03
      2.01
    
    
      2008-09-28
      1.08
    
    
      2008-09-29
      1.56
    
    
      2008-09-30
      2.03



In [85]:

    
ffr_sep2008.plot()









    Out[85]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e9157748>

We can use this same functionality to extract ranges of dates. To get the data starting in june 2007 and going until march 2011 we would do



In [86]:

    
ffr2 = ffr["2007-06":"2011-03"]
df_info(ffr2)









    



Shape:  (1400, 1)
dtypes:  {'risk_free_rate': dtype('float64')}
index dtype:  datetime64[ns]






    Out[86]:






  
    
      
      risk_free_rate
    
    
      DATE
      
    
  
  
    
      2007-06-01
      5.23
    
    
      2007-06-02
      5.23
    
    
      2007-06-03
      5.23
    
    
      2011-03-29
      0.13
    
    
      2011-03-30
      0.13
    
    
      2011-03-31
      0.10



In [87]:

    
ffr2.plot()









    Out[87]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e91c59e8>

Exercise Using one of your datasets from quandl, plot one or more variables for the last 3 years (2013 through 2016)



In [91]:

    
vc['2013':'2016'].plot()









    Out[91]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e8d63a20>

resampling

Now suppose that instead of daily data, we wanted our federal funds data at a monthly frequency.

To do this we use the resample method on our DataFrame



In [92]:

    
# MS means "month start"
ffrM_resample = ffr.resample("MS")
type(ffrM_resample)









    Out[92]:





pandas.tseries.resample.DatetimeIndexResampler

Notice that when we call resample we don't get back a DataFrame at that frequency. This is because there is some ambiguity regarding just how the frequency should be converted: should we take the average during the period, the first observation, last observation, sum the observations?

In order to get a DataFrame we have to call a method on our DatetimeIndexResampler object. For this example, let's do the first observation in each period:



In [93]:

    
ffrM = ffrM_resample.first()
df_info(ffrM)









    



Shape:  (754, 1)
dtypes:  {'risk_free_rate': dtype('float64')}
index dtype:  datetime64[ns]






    Out[93]:






  
    
      
      risk_free_rate
    
    
      DATE
      
    
  
  
    
      1954-07-01
      1.13
    
    
      1954-08-01
      0.25
    
    
      1954-09-01
      1.44
    
    
      2017-02-01
      0.66
    
    
      2017-03-01
      0.66
    
    
      2017-04-01
      0.82

Note that we can also combine numbers with the specification of the resampling frequency. As an example, we can resample to a bi-weekly frequency using



In [96]:

    
ffr.resample("2w")









    Out[96]:





DatetimeIndexResampler [freq=<2 * Weeks: weekday=6>, axis=0, closed=right, label=right, convention=start, base=0]



In [97]:

    
ffr.resample("2w").mean()









    Out[97]:






  
    
      
      risk_free_rate
    
    
      DATE
      
    
  
  
    
      1954-07-04
      1.220000
    
    
      1954-07-18
      0.974286
    
    
      1954-08-01
      0.465000
    
    
      1954-08-15
      1.115000
    
    
      1954-08-29
      1.364286
    
    
      1954-09-12
      1.275000
    
    
      1954-09-26
      0.832143
    
    
      1954-10-10
      0.911429
    
    
      1954-10-24
      0.868571
    
    
      1954-11-07
      0.886429
    
    
      1954-11-21
      0.493571
    
    
      1954-12-05
      1.386429
    
    
      1954-12-19
      1.300714
    
    
      1955-01-02
      1.237857
    
    
      1955-01-16
      1.381429
    
    
      1955-01-30
      1.390714
    
    
      1955-02-13
      1.282857
    
    
      1955-02-27
      1.300714
    
    
      1955-03-13
      1.395000
    
    
      1955-03-27
      1.278571
    
    
      1955-04-10
      1.440000
    
    
      1955-04-24
      1.476429
    
    
      1955-05-08
      1.475000
    
    
      1955-05-22
      1.193571
    
    
      1955-06-05
      1.662857
    
    
      1955-06-19
      1.640714
    
    
      1955-07-03
      1.642143
    
    
      1955-07-17
      1.677143
    
    
      1955-07-31
      1.685714
    
    
      1955-08-14
      1.916429
    
    
      ...
      ...
    
    
      2016-03-20
      0.364286
    
    
      2016-04-03
      0.361429
    
    
      2016-04-17
      0.370000
    
    
      2016-05-01
      0.355000
    
    
      2016-05-15
      0.370000
    
    
      2016-05-29
      0.370000
    
    
      2016-06-12
      0.364286
    
    
      2016-06-26
      0.382857
    
    
      2016-07-10
      0.397857
    
    
      2016-07-24
      0.400000
    
    
      2016-08-07
      0.378571
    
    
      2016-08-21
      0.400000
    
    
      2016-09-04
      0.392857
    
    
      2016-09-18
      0.400000
    
    
      2016-10-02
      0.376429
    
    
      2016-10-16
      0.404286
    
    
      2016-10-30
      0.410000
    
    
      2016-11-13
      0.402857
    
    
      2016-11-27
      0.410000
    
    
      2016-12-11
      0.402857
    
    
      2016-12-25
      0.606429
    
    
      2017-01-08
      0.628571
    
    
      2017-01-22
      0.660000
    
    
      2017-02-05
      0.652857
    
    
      2017-02-19
      0.660000
    
    
      2017-03-05
      0.653571
    
    
      2017-03-19
      0.731429
    
    
      2017-04-02
      0.890714
    
    
      2017-04-16
      0.910000
    
    
      2017-04-30
      0.910000
    
  

1640 rows × 1 columns

Exercise: Using the documentation for the most common frequencies, figure out how to resample one of your datasets to

A quarterly frequency -- make sure to get the start of the quarter



In [99]:

    
ffr.resample('QS').mean().head()









    Out[99]:






  
    
      
      risk_free_rate
    
    
      DATE
      
    
  
  
    
      1954-07-01
      1.028478
    
    
      1954-10-01
      0.988370
    
    
      1955-01-01
      1.345000
    
    
      1955-04-01
      1.501099
    
    
      1955-07-01
      1.936522

An annual frequency -- use the end of the year



In [100]:

    
ffr.resample('A').mean().head()









    Out[100]:






  
    
      
      risk_free_rate
    
    
      DATE
      
    
  
  
    
      1954-12-31
      1.008424
    
    
      1955-12-31
      1.788603
    
    
      1956-12-31
      2.729590
    
    
      1957-12-31
      3.105342
    
    
      1958-12-31
      1.572356

more than you need: I want to point out that when you use the first or last methods to perform the aggregations, there are two dates involved: (1) the date the resultant index will have and (2) the date used to fill in the data at that date. The first date (one on the index) will be assigned based on the string you pass to the resample method. The second date (the one for extracting data from the original dataframe) will be determined based on the method used to do the aggregation. first will extract the first data point from that subset and last will extract the last.

Let's see some examples:



In [101]:

    
ffr.resample("M").first().head()









    Out[101]:






  
    
      
      risk_free_rate
    
    
      DATE
      
    
  
  
    
      1954-07-31
      1.13
    
    
      1954-08-31
      0.25
    
    
      1954-09-30
      1.44
    
    
      1954-10-31
      0.50
    
    
      1954-11-30
      1.38



In [103]:

    
ffr.resample("M").last().head()









    Out[103]:






  
    
      
      risk_free_rate
    
    
      DATE
      
    
  
  
    
      1954-07-31
      0.25
    
    
      1954-08-31
      1.44
    
    
      1954-09-30
      1.44
    
    
      1954-10-31
      1.13
    
    
      1954-11-30
      1.38

Notice that the index is the same on both, but the data is clearly different.

If we use MS instead of M we will have the index based on the first day of the month:



In [104]:

    
ffr.resample("MS").first().head()









    Out[104]:






  
    
      
      risk_free_rate
    
    
      DATE
      
    
  
  
    
      1954-07-01
      1.13
    
    
      1954-08-01
      0.25
    
    
      1954-09-01
      1.44
    
    
      1954-10-01
      0.50
    
    
      1954-11-01
      1.38



In [105]:

    
ffr.resample("MS").last().head()









    Out[105]:






  
    
      
      risk_free_rate
    
    
      DATE
      
    
  
  
    
      1954-07-01
      0.25
    
    
      1954-08-01
      1.44
    
    
      1954-09-01
      1.44
    
    
      1954-10-01
      1.13
    
    
      1954-11-01
      1.38

Notice how the data associated with "M" and first is the same as the data for "MS" and first. The same holds for last.

Access year, month, day...

Given a DatetimeIndex you can access the day, month, or year (also second, millisecond, etc.) by simply accessing the .XX property; where XX is the data you want



In [106]:

    
ffr.index.year









    Out[106]:





array([1954, 1954, 1954, ..., 2017, 2017, 2017], dtype=int32)



In [107]:

    
ffr.index.day









    Out[107]:





array([ 1,  2,  3, ..., 18, 19, 20], dtype=int32)



In [108]:

    
ffr.index.month









    Out[108]:





array([7, 7, 7, ..., 4, 4, 4], dtype=int32)

Rolling computations

We can use pandas to do rolling computations.

For example, suppose we want to plot the maximum and minimum of the risk free rate within the past week at each date (think about that slowly -- for every date, we want to look back 7 days and compute the max).

Here's how we can do that



In [111]:

    
fig, ax = plt.subplots()
ffr.rolling(window=7).max().plot(ax=ax)
ffr.rolling(window=7).min().plot(ax=ax)
ax.legend(["max", "min"])









    Out[111]:





<matplotlib.legend.Legend at 0x7fc9e2c49240>

Note that this is different from just resampling because we will have an observation for every date in the original dataframe (except the number of dates at the front needed to construct the initial window).



In [112]:

    
ffr.rolling(window=7).max().head(10)









    Out[112]:






  
    
      
      risk_free_rate
    
    
      DATE
      
    
  
  
    
      1954-07-01
      NaN
    
    
      1954-07-02
      NaN
    
    
      1954-07-03
      NaN
    
    
      1954-07-04
      NaN
    
    
      1954-07-05
      NaN
    
    
      1954-07-06
      NaN
    
    
      1954-07-07
      1.25
    
    
      1954-07-08
      1.25
    
    
      1954-07-09
      1.25
    
    
      1954-07-10
      1.25



In [113]:

    
ffr.resample("7D").max().head(10)









    Out[113]:






  
    
      
      risk_free_rate
    
    
      DATE
      
    
  
  
    
      1954-07-01
      1.25
    
    
      1954-07-08
      1.25
    
    
      1954-07-15
      0.75
    
    
      1954-07-22
      0.75
    
    
      1954-07-29
      0.38
    
    
      1954-08-05
      1.44
    
    
      1954-08-12
      1.44
    
    
      1954-08-19
      1.44
    
    
      1954-08-26
      1.44
    
    
      1954-09-02
      1.38

Merging with dates

Let's see what happens when we merge the ffr and vc datasets



In [114]:

    
# do a left merge on the index (date info)
df = pd.merge(ffr, vc, left_index=True, right_index=True, how="left")
df_info(df)









    



Shape:  (22940, 6)
dtypes:  {'Seed': dtype('float64'), 'Expansion': dtype('float64'), 'risk_free_rate': dtype('float64'), 'Later Stage': dtype('float64'), 'Total': dtype('float64'), 'Early Stage': dtype('float64')}
index dtype:  datetime64[ns]






    Out[114]:






  
    
      
      risk_free_rate
      Seed
      Early Stage
      Expansion
      Later Stage
      Total
    
    
      DATE
      
      
      
      
      
      
    
  
  
    
      1954-07-01
      1.13
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      1954-07-02
      1.25
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      1954-07-03
      1.25
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      2017-04-18
      0.91
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      2017-04-19
      0.91
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      2017-04-20
      0.91
      NaN
      NaN
      NaN
      NaN
      NaN



In [115]:

    
vc.head()









    Out[115]:






  
    
      
      Seed
      Early Stage
      Expansion
      Later Stage
      Total
    
    
      Date
      
      
      
      
      
    
  
  
    
      1985-03-31
      153.0
      96.3
      219.5
      154.4
      623.1
    
    
      1985-06-30
      146.5
      185.3
      319.6
      89.4
      740.8
    
    
      1985-09-30
      93.7
      106.3
      312.8
      164.4
      677.2
    
    
      1985-12-31
      133.0
      129.9
      393.7
      78.5
      735.1
    
    
      1986-03-31
      185.6
      129.6
      270.0
      125.3
      710.5

Notice that we ended up with a lot of missing data. This happened for two reasons:

The ffr data goes back to 1954, but the vc data starts in 1985
The ffr data is at a daily frequency, but vc is at quarterly.

To resolve the first issue we can subset the ffr data and only keep from 1985 on



In [116]:

    
ffr_recent = ffr["1985":]

To resolve the second issue we will do two-steps:

resample the ffr data to a monthly frequency
resample the vc data to a monthly frequency by padding. This is called upsampling because we are going from a lower frequency (quarterly) to a higher one (monthly)



In [117]:

    
ffr_recentM = ffr_recent.resample("M").first()
vc_M = vc.resample("M").pad()



In [118]:

    
vc_M.head()









    Out[118]:






  
    
      
      Seed
      Early Stage
      Expansion
      Later Stage
      Total
    
    
      Date
      
      
      
      
      
    
  
  
    
      1985-03-31
      153.0
      96.3
      219.5
      154.4
      623.1
    
    
      1985-04-30
      153.0
      96.3
      219.5
      154.4
      623.1
    
    
      1985-05-31
      153.0
      96.3
      219.5
      154.4
      623.1
    
    
      1985-06-30
      146.5
      185.3
      319.6
      89.4
      740.8
    
    
      1985-07-31
      146.5
      185.3
      319.6
      89.4
      740.8

Notice that using pad here just copied data forwards to fill in missing months (e.g. the data for March 1985 was applied to April and May)

Now let's try that merge again



In [119]:

    
df = pd.merge(ffr_recentM, vc_M, left_index=True, right_index=True, how="left")
print(df.head(6))
print("\n\n", df.tail(8))









    



            risk_free_rate   Seed  Early Stage  Expansion  Later Stage  Total
DATE                                                                         
1985-01-31            8.74    NaN          NaN        NaN          NaN    NaN
1985-02-28            8.74    NaN          NaN        NaN          NaN    NaN
1985-03-31            8.74  153.0         96.3      219.5        154.4  623.1
1985-04-30            8.83  153.0         96.3      219.5        154.4  623.1
1985-05-31            8.83  153.0         96.3      219.5        154.4  623.1
1985-06-30            7.64  146.5        185.3      319.6         89.4  740.8


             risk_free_rate  Seed  Early Stage  Expansion  Later Stage  Total
DATE                                                                        
2016-09-30            0.40   NaN          NaN        NaN          NaN    NaN
2016-10-31            0.29   NaN          NaN        NaN          NaN    NaN
2016-11-30            0.41   NaN          NaN        NaN          NaN    NaN
2016-12-31            0.41   NaN          NaN        NaN          NaN    NaN
2017-01-31            0.55   NaN          NaN        NaN          NaN    NaN
2017-02-28            0.66   NaN          NaN        NaN          NaN    NaN
2017-03-31            0.66   NaN          NaN        NaN          NaN    NaN
2017-04-30            0.82   NaN          NaN        NaN          NaN    NaN

That looks much better -- we have missing data at the top and the bottom for months that aren't available in the venture capital dataset, but nothing else should be missing.

Let's try to do something interesting with this data. We want to plot the growth rate in the risk free rate, early stage vc funding, and total vc funding for the months following the start of the dotcom boom (rougly Jan 1995) and the housing boom (roughly Jan 2004).

Read that again carefully.

For each of the three series we want 2 lines. For each line, the x axis will be quarters since start of boom. The y axis growth rates since first month of bubble.



In [120]:

    
# subset the data, then remove datetime index as we don't need it again
post_dotcom = df["1995":].reset_index(drop=True)
post_housing = df["2004":].reset_index(drop=True)

# take logs so we can do growth rates as log(x_{t+N}) - log(x_t)
post_dotcom = np.log(post_dotcom)
post_housing = np.log(post_housing)

dotcom_growth = post_dotcom - post_dotcom.iloc[0, :]
housing_growth = post_housing - post_housing.iloc[0, :]



In [121]:

    
fig, axs = plt.subplots(3, 1, figsize=(10, 5))

variables = ["risk_free_rate", "Early Stage", "Total"]

for i in range(len(variables)):
    var = variables[i]
    
    # add dotcom line
    dotcom_growth[var].plot(ax=axs[i])
    
    # add housing line
    housing_growth[var].plot(ax=axs[i])
    
    # set title
    axs[i].set_title(var)

# set legend and xlabel on last plot only
axs[-1].legend(["dotcom", "housing"])
axs[-1].set_xlabel("Quarters since boom")
        
# make subplots not overlap
fig.tight_layout()

So we see that these booms were qualitatively different in terms of the response of the risk free rate and venture capital funding.



In [ ]:

	Value
Date
1965-12-31	167022.00
1966-12-31	186065.00
1967-12-31	208577.00
2013-12-31	4283028.74
2014-12-31	4500529.62
2015-12-31	4754119.88

	Value
Date
2000-12-31	2900519.06
2001-12-31	2884729.87
2002-12-31	2738000.14
2013-12-31	4283028.74
2014-12-31	4500529.62
2015-12-31	4754119.88

	Seed	Early Stage	Expansion	Later Stage	Total
Date
1985-03-31	153.0000	96.3000	219.5000	154.4000	623.1000
1985-06-30	146.5000	185.3000	319.6000	89.4000	740.8000
1985-09-30	93.7000	106.3000	312.8000	164.4000	677.2000
2015-12-31	482.5349	5193.9647	3282.8216	3167.6023	12126.9235
2016-03-31	561.8670	4601.1276	4010.2623	3516.5483	12689.8052
2016-06-30	534.7283	3962.5863	8507.1810	2286.6906	15291.1862

	NVCA/VENTURE_3_09C - Early Stage	NVCA/VENTURE_3_09C - Total
Date
1985-03-31	96.3	623.1
1985-06-30	185.3	740.8
1985-09-30	106.3	677.2
1985-12-31	129.9	735.1
1986-03-31	129.6	710.5

	NSE/OIL - Open	WIKI/AAPL - Close
Date
1980-12-12	NaN	28.75
1980-12-15	NaN	27.25
1980-12-16	NaN	25.25
1980-12-17	NaN	25.87
1980-12-18	NaN	26.63
1980-12-19	NaN	28.25
1980-12-22	NaN	29.63
1980-12-23	NaN	30.88
1980-12-24	NaN	32.50
1980-12-26	NaN	35.50
1980-12-29	NaN	36.00
1980-12-30	NaN	35.12
1980-12-31	NaN	34.13
1981-01-02	NaN	34.50
1981-01-05	NaN	33.75
1981-01-06	NaN	32.25
1981-01-07	NaN	30.88
1981-01-08	NaN	30.25
1981-01-09	NaN	31.88
1981-01-12	NaN	31.62
1981-01-13	NaN	30.50
1981-01-14	NaN	30.63
1981-01-15	NaN	31.25
1981-01-16	NaN	31.00
1981-01-19	NaN	32.87
1981-01-20	NaN	31.88
1981-01-21	NaN	32.50
1981-01-22	NaN	32.87
1981-01-23	NaN	32.75
1981-01-26	NaN	32.25
...	...	...
2017-03-10	325.00	139.14
2017-03-13	NaN	139.20
2017-03-14	337.20	138.99
2017-03-15	334.10	140.46
2017-03-16	336.80	140.69
2017-03-17	338.00	139.99
2017-03-20	336.00	141.46
2017-03-21	332.60	139.84
2017-03-22	330.05	141.42
2017-03-23	329.90	140.92
2017-03-24	331.70	140.64
2017-03-27	331.70	140.88
2017-03-28	332.20	143.80
2017-03-29	331.80	144.12
2017-03-30	331.95	143.93
2017-03-31	329.95	143.71
2017-04-03	333.40	143.70
2017-04-04	NaN	144.77
2017-04-05	330.80	144.02
2017-04-06	332.20	143.66
2017-04-07	334.00	143.34
2017-04-10	334.30	143.14
2017-04-11	340.50	141.63
2017-04-12	337.75	141.79
2017-04-13	334.10	141.05
2017-04-17	334.75	141.80
2017-04-18	334.45	141.20
2017-04-19	331.10	140.68
2017-04-20	332.70	142.44
2017-04-21	324.80	NaN

	risk_free_rate
DATE
1954-07-01	1.13
1954-07-02	1.25
1954-07-03	1.25
2017-04-18	0.91
2017-04-19	0.91
2017-04-20	0.91

	risk_free_rate
DATE
2008-01-01	3.06
2008-01-02	4.11
2008-01-03	4.25
2008-12-29	0.10
2008-12-30	0.09
2008-12-31	0.14

	risk_free_rate
DATE
2008-09-01	1.94
2008-09-02	1.96
2008-09-03	2.01
2008-09-28	1.08
2008-09-29	1.56
2008-09-30	2.03

	risk_free_rate
DATE
2007-06-01	5.23
2007-06-02	5.23
2007-06-03	5.23
2011-03-29	0.13
2011-03-30	0.13
2011-03-31	0.10

	risk_free_rate
DATE
1954-07-04	1.220000
1954-07-18	0.974286
1954-08-01	0.465000
1954-08-15	1.115000
1954-08-29	1.364286
1954-09-12	1.275000
1954-09-26	0.832143
1954-10-10	0.911429
1954-10-24	0.868571
1954-11-07	0.886429
1954-11-21	0.493571
1954-12-05	1.386429
1954-12-19	1.300714
1955-01-02	1.237857
1955-01-16	1.381429
1955-01-30	1.390714
1955-02-13	1.282857
1955-02-27	1.300714
1955-03-13	1.395000
1955-03-27	1.278571
1955-04-10	1.440000
1955-04-24	1.476429
1955-05-08	1.475000
1955-05-22	1.193571
1955-06-05	1.662857
1955-06-19	1.640714
1955-07-03	1.642143
1955-07-17	1.677143
1955-07-31	1.685714
1955-08-14	1.916429
...	...
2016-03-20	0.364286
2016-04-03	0.361429
2016-04-17	0.370000
2016-05-01	0.355000
2016-05-15	0.370000
2016-05-29	0.370000
2016-06-12	0.364286
2016-06-26	0.382857
2016-07-10	0.397857
2016-07-24	0.400000
2016-08-07	0.378571
2016-08-21	0.400000
2016-09-04	0.392857
2016-09-18	0.400000
2016-10-02	0.376429
2016-10-16	0.404286
2016-10-30	0.410000
2016-11-13	0.402857
2016-11-27	0.410000
2016-12-11	0.402857
2016-12-25	0.606429
2017-01-08	0.628571
2017-01-22	0.660000
2017-02-05	0.652857
2017-02-19	0.660000
2017-03-05	0.653571
2017-03-19	0.731429
2017-04-02	0.890714
2017-04-16	0.910000
2017-04-30	0.910000

	risk_free_rate
DATE
1954-07-01	1.028478
1954-10-01	0.988370
1955-01-01	1.345000
1955-04-01	1.501099
1955-07-01	1.936522

	risk_free_rate
DATE
1954-12-31	1.008424
1955-12-31	1.788603
1956-12-31	2.729590
1957-12-31	3.105342
1958-12-31	1.572356

	risk_free_rate
DATE
1954-07-31	1.13
1954-08-31	0.25
1954-09-30	1.44
1954-10-31	0.50
1954-11-30	1.38

	Seed	Early Stage	Expansion	Later Stage	Total
Date
1985-03-31	153.0	96.3	219.5	154.4	623.1
1985-06-30	146.5	185.3	319.6	89.4	740.8
1985-09-30	93.7	106.3	312.8	164.4	677.2
1985-12-31	133.0	129.9	393.7	78.5	735.1
1986-03-31	185.6	129.6	270.0	125.3	710.5