From Rob Pike's Notes on Programming in C:

Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

Pandas is built on a hierarchy of a few powerful data structures. Each structure is composed of, and designed to interoperate with, the simpler ones; the short sketch after this list shows how the pieces fit together.

  • Index (1-Dimensional immutable ordered hash table)
  • Series (1-Dimensional Labelled Array)
  • DataFrame (2-Dimensional Labelled Array)
  • Panel (3-Dimensional Labelled Array)
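
As a quick, minimal sketch of how the pieces fit together (the labels and values here are throwaway examples, not part of the data used later):

import pandas as pd

s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
s.index    # the labels, stored as an (immutable) Index
s.values   # the data, stored as a NumPy array

df = pd.DataFrame({'left': s, 'right': s * 2})
df.columns   # another Index, holding the column labels
df['left']   # each column of a DataFrame is itself a Series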

In [1]:
# Tell IPython to display matplotlib plots inline.
%matplotlib inline

# Set default font attributes.
import matplotlib
font = {'family' : 'normal',
        'weight' : 'bold',
        'size'   : 13}
matplotlib.rc('font', **font)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
randn = np.random.randn

pd.set_option('display.mpl_style', 'default')
pd.set_option('display.max_rows', 15)

# Make a default figure size for later use.
DEFAULT_FIGSIZE = (12, 6)

Series

Basics


In [3]:
s = pd.Series([3,5,7,2])
s


Out[3]:
0    3
1    5
2    7
3    2
dtype: int64

In [4]:
# An important concept to understand when working with a `Series` is that it's
# actually composed of two pieces: an index array, and a data array.

print "The index is {0}.".format(s.index)
print "The values are {0}.".format(s.values)


The index is Int64Index([0, 1, 2, 3], dtype='int64').
The values are [3 5 7 2].

In [5]:
# You can explicitly pass your own labels to use as an index.  If you don't,
# Pandas will construct a default index with integer labels.
pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd'])


Out[5]:
a    1.185723
b   -0.180358
c    0.762084
d    1.277645
dtype: float64

In [6]:
# You can also construct a Series from a dictionary.
# The keys are used as the index, and the values are used as the Series' values
pd.Series(
    {
        'a': 1, 
        'b': 2,
        'c': 3,
    }
)


Out[6]:
a    1
b    2
c    3
dtype: int64

In [7]:
# You get performance (and code clarity!) benefits if your Series'
# labels/values are homogeneously typed, but mixed-type arrays are supported.
pd.Series(
    [1, 2.6, 'a', {'a': 'b'}], 
    index=[1, 'a', 2, 2.5],
)


Out[7]:
1                 1
a               2.6
2                 a
2.5    {u'a': u'b'}
dtype: object

Slicing Series with __getitem__ (aka [])

Pandas objects support a wide range of selection and filtering methods. An important idea to keep in mind is the following (a short sketch follows the list):

If you have an N-dimensional object:

  • Indexing with a scalar returns a value of dimension N-1.
  • Indexing with a slice filters the object, but maintains the original dimension.
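
A minimal illustration of that rule, using a throwaway two-column DataFrame (DataFrames are covered in detail later):

frame = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
frame['x']         # scalar label   -> a Series    (dimension drops from 2 to 1)
frame[['x', 'y']]  # list of labels -> a DataFrame (dimension stays 2)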

In [8]:
s = pd.Series(range(10), index=list('ABCDEFGHIJ'))
s


Out[8]:
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int64

In [9]:
# Lookups by key work as you'd expect.
s['E']


Out[9]:
4

In [10]:
# We can look up multiple values at a time by passing a list of keys.
# The resulting value is a new `Series`.
s[['E', 'I', 'B']]


Out[10]:
E    4
I    8
B    1
dtype: int64

In [11]:
# Because the Index is ordered, we can use Python's slicing syntax.
s['E':]


Out[11]:
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int64

In [12]:
# Label-based slicing is inclusive of both endpoints.
s[:'I']


Out[12]:
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
dtype: int64

In [13]:
s['E':'I']


Out[13]:
E    4
F    5
G    6
H    7
I    8
dtype: int64

In [14]:
# Step arguments work just like Python lists.
s['E':'I':2]


Out[14]:
E    4
G    6
I    8
dtype: int64

In [15]:
# If you don't know the label you want, but you do know the position, you can
# use `iloc`.
print "The first entry is: %d" % s.iloc[0]
print "The last entry is: %d" % s.iloc[-1]


The first entry is: 0
The last entry is: 9

In [16]:
# Slicing works with `iloc` as well.

# Note that, unlike with label-based slicing, integer-based slices are
# right-open intervals, i.e. doing s.iloc[X:Y] gives you elements with indices
# in [X, Y).  This is the same as the semantics for list slicing.
s.iloc[5:]


Out[16]:
F    5
G    6
H    7
I    8
J    9
dtype: int64

In [17]:
print s.iloc[:5]


A    0
B    1
C    2
D    3
E    4
dtype: int64

In [18]:
s.iloc[-3:]


Out[18]:
H    7
I    8
J    9
dtype: int64

Numerical Operations


In [19]:
# Create two Series objects containing 100 samples each of sine and cosine.
sine = pd.Series(np.sin(np.linspace(0, 3.14 * 2, 100)), name='sine')
cosine = pd.Series(np.cos(np.linspace(0, 3.14 * 2, 100)), name='cosine')

In [20]:
sine


Out[20]:
0    0.000000
1    0.063392
2    0.126529
3    0.189156
4    0.251023
...
94   -0.314905
95   -0.254105
96   -0.192283
97   -0.129688
98   -0.066570
99   -0.003185
Name: sine, Length: 100, dtype: float64

In [21]:
cosine


Out[21]:
0    1.000000
1    0.997989
2    0.991963
3    0.981947
4    0.967981
...
94    0.949123
95    0.967177
96    0.981339
97    0.991555
98    0.997782
99    0.999995
Name: cosine, Length: 100, dtype: float64

In [22]:
# Multiplying two Series objects produces a new Series by multiplying values that have the same keys.
product = cosine * sine
product


Out[22]:
0    0.000000
1    0.063264
2    0.125512
3    0.185742
4    0.242986
...
94   -0.298884
95   -0.245765
96   -0.188695
97   -0.128592
98   -0.066423
99   -0.003185
Length: 100, dtype: float64
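
Note that the multiplication pairs values by label, not by position. With sine and cosine this is invisible because they share an identical index, so here is a small sketch with two deliberately shuffled, made-up Series:

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'x'])
a * b   # x: 1 * 30, y: 2 * 20, z: 3 * 10 -- matched by label, not position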

In [23]:
# Adding or multiplying a Series by a scalar applies that operation to each value in the Series.
cosine_plus_one = cosine + 1
cosine_plus_one


Out[23]:
0    2.000000
1    1.997989
2    1.991963
3    1.981947
4    1.967981
...
94    1.949123
95    1.967177
96    1.981339
97    1.991555
98    1.997782
99    1.999995
Name: cosine, Length: 100, dtype: float64

In [24]:
# Other binary operators work as you'd expect.

# Note how much cleaner and clearer this is compared to looping over two
# containers and performing multiple operations on elements from each.
identity = (sine ** 2) + (cosine ** 2)
identity


Out[24]:
0    1
1    1
2    1
3    1
4    1
...
94    1
95    1
96    1
97    1
98    1
99    1
Length: 100, dtype: float64

All of the pandas data structures have plot methods that provide a user-friendly interface to matplotlib.


In [25]:
# Plot our sine values.
trigplot = sine.plot(
    ylim=(-1.2, 1.2),
    legend=True,
    figsize=DEFAULT_FIGSIZE,
    linewidth=3,
    label='sine',
)
# Add our other Series to the same plot.
cosine.plot(ax=trigplot, legend=True, linewidth=3)
product.plot(ax=trigplot, legend=True, linewidth=3, label='product')
identity.plot(ax=trigplot, legend=True, linewidth=3, label='identity')


Out[25]:
<matplotlib.axes.AxesSubplot at 0x1055dec50>

We can map more complicated functions over a Series using the apply method.


In [26]:
def tenths_place(N):
    # Return the digit in the tenths place of N's decimal representation.
    s = str(N)
    return s[s.find('.') + 1]
product.apply(tenths_place)


Out[26]:
0    0
1    0
2    1
3    1
4    2
...
94    2
95    2
96    1
97    1
98    0
99    0
Length: 100, dtype: object

Handling Missing Data


In [27]:
# A major problem when working with real-world data is handling missing entries.
# When combining two Series, Pandas aligns them by label and inserts NaN
# wherever a label is present in one Series but missing from the other.
s1 = pd.Series({'a': 1, 'b': 2, 'c': 3})
# s2 is missing an entry for 'b'
s2 = pd.Series({'a': 4, 'c': 5})
s1 + s2


Out[27]:
a     5
b   NaN
c     8
dtype: float64
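
If propagating NaN isn't what you want, the arithmetic methods (add, sub, mul, div) accept a fill_value that substitutes a default for labels missing from one side. A quick sketch using the same two Series:

# Treat the missing 'b' entry in s2 as 0 rather than producing NaN.
s1.add(s2, fill_value=0)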

Boolean Operations on Series


In [28]:
s1 = pd.Series(
    {
        'A': 1,
        'B': 2,
        'C': 3,
        'D': 4,
        'E': 3,
        'F': 2,
        'G': 1,
    }
)
# You can create a constant Series by passing a scalar value and an index.
s2 = pd.Series(2, index=s1.index)

In [29]:
greater = s1 > s2
greater


Out[29]:
A    False
B    False
C     True
D     True
E     True
F    False
G    False
dtype: bool

In [30]:
less = s1 < s2
less


Out[30]:
A     True
B    False
C    False
D    False
E    False
F    False
G     True
dtype: bool

In [31]:
equal = s1 == s2
equal


Out[31]:
A    False
B     True
C    False
D    False
E    False
F     True
G    False
dtype: bool

In [32]:
# Comparisons against scalars also work.
s1_equal_to_3 = s1 == 3
s1_equal_to_3


Out[32]:
A    False
B    False
C     True
D    False
E     True
F    False
G    False
dtype: bool

In [33]:
# Summarize the Series and the comparisons above in a single DataFrame.

pd.DataFrame({
    's1': s1,
    's2': s2,
    's1 > s2': greater,
    's1 == s2': equal,
    's1 < s2': less,
    's1 == 3': s1_equal_to_3,
}, columns=['s1','s2', 's1 > s2', 's1 == s2', 's1 < s2', 's1 == 3'])


Out[33]:
s1 s2 s1 > s2 s1 == s2 s1 < s2 s1 == 3
A 1 2 False False True False
B 2 2 False True False False
C 3 2 True False False True
D 4 2 True False False False
E 3 2 True False False True
F 2 2 False True False False
G 1 2 False False True False

Boolean-valued Series can be used for slicing. You can think of this as marking particular index values as "keep" (True) or "drop" (False).


In [34]:
# Indexing into a Series with a boolean Series keeps only the entries whose
# corresponding value in the passed Series is True.
s1[s1 > s2]


Out[34]:
C    3
D    4
E    3
dtype: int64

In [35]:
# We can combine these operators to concisely express complex
# computations/filters.
s1[(s1 > 1) & ~(s1 > s2)]


Out[35]:
B    2
F    2
dtype: int64

Working with Time Series Data


In [36]:
# Pandas has a special index class, `DatetimeIndex`, for representing
# time series data.
start = pd.Timestamp('2014-01-01', tz='UTC')
end = pd.Timestamp('2014-01-09', tz='UTC')

# date_range is an easy way to construct a DatetimeIndex
daily_index = pd.date_range(start, end)
daily_index


Out[36]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01, ..., 2014-01-09]
Length: 9, Freq: D, Timezone: UTC

In [37]:
# A DatetimeIndex has a notion of its frequency.
from pandas.tseries.offsets import Day, Hour, BDay, Minute

hourly_index = pd.date_range(
    pd.Timestamp('2014-01-01', tz='UTC'),
    pd.Timestamp('2014-01-09', tz='UTC'),
    freq=Hour(),
)
hourly_index


Out[37]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 00:00:00+00:00, ..., 2014-01-09 00:00:00+00:00]
Length: 193, Freq: H, Timezone: UTC

In [38]:
bihourly_index = pd.date_range(
    pd.Timestamp('2014-01-01', tz='UTC'),
    pd.Timestamp('2014-01-09', tz='UTC'),
    freq=Hour(2),
)
bihourly_index


Out[38]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 00:00:00+00:00, ..., 2014-01-09 00:00:00+00:00]
Length: 97, Freq: 2H, Timezone: UTC

In [39]:
weekday_index = pd.date_range(
    pd.Timestamp('2014-01-01', tz='UTC'),
    pd.Timestamp('2014-01-09', tz='UTC'),
    freq=BDay(),
)
print weekday_index
[i for i in weekday_index]


<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01, ..., 2014-01-09]
Length: 7, Freq: B, Timezone: UTC
Out[39]:
[Timestamp('2014-01-01 00:00:00+0000', tz='UTC', offset='B'),
 Timestamp('2014-01-02 00:00:00+0000', tz='UTC', offset='B'),
 Timestamp('2014-01-03 00:00:00+0000', tz='UTC', offset='B'),
 Timestamp('2014-01-06 00:00:00+0000', tz='UTC', offset='B'),
 Timestamp('2014-01-07 00:00:00+0000', tz='UTC', offset='B'),
 Timestamp('2014-01-08 00:00:00+0000', tz='UTC', offset='B'),
 Timestamp('2014-01-09 00:00:00+0000', tz='UTC', offset='B')]
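
These offset classes all have string aliases ('D', 'H', '2H', 'B', and so on), so the same indexes can be built by passing a frequency string instead of importing from pandas.tseries.offsets. For example, this is equivalent to the Hour(2) index above:

pd.date_range(
    pd.Timestamp('2014-01-01', tz='UTC'),
    pd.Timestamp('2014-01-09', tz='UTC'),
    freq='2H',
)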

If your Series has a DatetimeIndex, then you immediately get access to sophisticated resampling tools.


In [40]:
ts = pd.Series(
    np.arange(30) ** 2,
    pd.date_range(
        start=pd.Timestamp('2014-01-01', tz='UTC'),
        freq='1D',
        periods=30,
    )
)
ts.plot()


Out[40]:
<matplotlib.axes.AxesSubplot at 0x105610b50>

In [41]:
# By default, resampling to a lower frequency takes the mean of the entries
# that fall into each new period.
resampled = ts.resample('5D')
resampled


Out[41]:
2014-01-01 00:00:00+00:00      6
2014-01-06 00:00:00+00:00     51
2014-01-11 00:00:00+00:00    146
2014-01-16 00:00:00+00:00    291
2014-01-21 00:00:00+00:00    486
2014-01-26 00:00:00+00:00    731
Freq: 5D, dtype: int64

In [42]:
# We can customize this behavior though.
resampled_first = ts.resample('5D', how='first')
resampled_first


Out[42]:
2014-01-01 00:00:00+00:00      0
2014-01-06 00:00:00+00:00     25
2014-01-11 00:00:00+00:00    100
2014-01-16 00:00:00+00:00    225
2014-01-21 00:00:00+00:00    400
2014-01-26 00:00:00+00:00    625
Freq: 5D, dtype: int64

In [43]:
resampled_last = ts.resample('5D', how='last')
resampled_last


Out[43]:
2014-01-01 00:00:00+00:00     16
2014-01-06 00:00:00+00:00     81
2014-01-11 00:00:00+00:00    196
2014-01-16 00:00:00+00:00    361
2014-01-21 00:00:00+00:00    576
2014-01-26 00:00:00+00:00    841
Freq: 5D, dtype: int64

In [44]:
# We can even supply our own custom aggregation function.
def geometric_mean(subseries):
    return np.product(subseries.values) ** (1.0 / len(subseries))
resampled_geometric = ts.resample('5D', how=geometric_mean)
print resampled_geometric


2014-01-01 00:00:00+00:00      0.000000
2014-01-06 00:00:00+00:00     46.970017
2014-01-11 00:00:00+00:00    141.990120
2014-01-16 00:00:00+00:00    286.995117
2014-01-21 00:00:00+00:00    481.997094
2014-01-26 00:00:00+00:00    726.998074
Freq: 5D, dtype: float64

In [45]:
pd.DataFrame(
    {
        "resampled": resampled,
        "resampled_first": resampled_first,
        "resampled_last": resampled_last,
        "resampled_geometric": resampled_geometric,
    }
).plot(linewidth=2, figsize=DEFAULT_FIGSIZE)


Out[45]:
<matplotlib.axes.AxesSubplot at 0x105610c90>

In [46]:
# Upsampling creates missing data, which is represented by numpy.nan.
ts.resample('6H')


Out[46]:
2014-01-01 00:00:00+00:00     0
2014-01-01 06:00:00+00:00   NaN
2014-01-01 12:00:00+00:00   NaN
2014-01-01 18:00:00+00:00   NaN
2014-01-02 00:00:00+00:00     1
...
2014-01-28 18:00:00+00:00    NaN
2014-01-29 00:00:00+00:00    784
2014-01-29 06:00:00+00:00    NaN
2014-01-29 12:00:00+00:00    NaN
2014-01-29 18:00:00+00:00    NaN
2014-01-30 00:00:00+00:00    841
Freq: 6H, Length: 117

We can handle missing data in a variety of ways.


In [47]:
# We can fill missing values with a constant using fillna.
zero_filled = ts.resample('6H').fillna(0)
print zero_filled


2014-01-01 00:00:00+00:00    0
2014-01-01 06:00:00+00:00    0
2014-01-01 12:00:00+00:00    0
2014-01-01 18:00:00+00:00    0
2014-01-02 00:00:00+00:00    1
...
2014-01-28 18:00:00+00:00      0
2014-01-29 00:00:00+00:00    784
2014-01-29 06:00:00+00:00      0
2014-01-29 12:00:00+00:00      0
2014-01-29 18:00:00+00:00      0
2014-01-30 00:00:00+00:00    841
Freq: 6H, Length: 117

In [48]:
# We can forward-fill with the last known prior value.
ffilled = ts.resample('6H').ffill()
print ffilled


2014-01-01 00:00:00+00:00    0
2014-01-01 06:00:00+00:00    0
2014-01-01 12:00:00+00:00    0
2014-01-01 18:00:00+00:00    0
2014-01-02 00:00:00+00:00    1
...
2014-01-28 18:00:00+00:00    729
2014-01-29 00:00:00+00:00    784
2014-01-29 06:00:00+00:00    784
2014-01-29 12:00:00+00:00    784
2014-01-29 18:00:00+00:00    784
2014-01-30 00:00:00+00:00    841
Freq: 6H, Length: 117

In [49]:
# We can backfill with the earliest known next value.
bfilled = ts.resample('6H').bfill()
print bfilled


2014-01-01 00:00:00+00:00    0
2014-01-01 06:00:00+00:00    1
2014-01-01 12:00:00+00:00    1
2014-01-01 18:00:00+00:00    1
2014-01-02 00:00:00+00:00    1
...
2014-01-28 18:00:00+00:00    784
2014-01-29 00:00:00+00:00    784
2014-01-29 06:00:00+00:00    841
2014-01-29 12:00:00+00:00    841
2014-01-29 18:00:00+00:00    841
2014-01-30 00:00:00+00:00    841
Freq: 6H, Length: 117

In [50]:
# We can interpolate between known values.

# Note: `interpolate` is new as of pandas 0.14.0
# Quantopian is currently on pandas 0.12.0 due to breaking changes in the
# pandas API in 0.13.0.
linear_interpolated = ts.resample('6H').interpolate()
linear_interpolated


Out[50]:
2014-01-01 00:00:00+00:00    0.00
2014-01-01 06:00:00+00:00    0.25
2014-01-01 12:00:00+00:00    0.50
2014-01-01 18:00:00+00:00    0.75
2014-01-02 00:00:00+00:00    1.00
...
2014-01-28 18:00:00+00:00    770.25
2014-01-29 00:00:00+00:00    784.00
2014-01-29 06:00:00+00:00    798.25
2014-01-29 12:00:00+00:00    812.50
2014-01-29 18:00:00+00:00    826.75
2014-01-30 00:00:00+00:00    841.00
Freq: 6H, Length: 117

In [51]:
quadratic_interpolated = ts.resample('6H').interpolate('polynomial', order=2)
quadratic_interpolated


Out[51]:
2014-01-01 00:00:00+00:00    0.0000
2014-01-01 06:00:00+00:00    0.0625
2014-01-01 12:00:00+00:00    0.2500
2014-01-01 18:00:00+00:00    0.5625
2014-01-02 00:00:00+00:00    1.0000
...
2014-01-28 18:00:00+00:00    770.0625
2014-01-29 00:00:00+00:00    784.0000
2014-01-29 06:00:00+00:00    798.0625
2014-01-29 12:00:00+00:00    812.2500
2014-01-29 18:00:00+00:00    826.5625
2014-01-30 00:00:00+00:00    841.0000
Freq: 6H, Length: 117

In [52]:
pd.DataFrame(
    {
        "linear_interpolated": linear_interpolated,
        "quadratic_interpolated": quadratic_interpolated,
        "bfilled": bfilled,
        "ffilled": ffilled,
        "zero_filled": zero_filled,
    }
).plot(linewidth=2, figsize=DEFAULT_FIGSIZE)


Out[52]:
<matplotlib.axes.AxesSubplot at 0x1072bf350>

DataFrame - 2D Tables of Interwoven Series
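
A DataFrame is essentially a dict of Series that have been aligned onto a shared row index: each column you put in is aligned by label, and each column you pull out is a Series. A minimal sketch with made-up values before we load real market data:

prices = pd.Series([10.0, 10.5], index=['2014-01-02', '2014-01-03'])
volumes = pd.Series([200, 150, 300],
                    index=['2014-01-01', '2014-01-02', '2014-01-03'])
# Rows are the union of the two indexes; the price missing on
# 2014-01-01 shows up as NaN.
pd.DataFrame({'price': prices, 'volume': volumes})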


In [53]:
# Oftentimes we have more than one axis on which we want to store data.
from pandas.io.data import get_data_yahoo
spy = get_data_yahoo(
    symbols='SPY',
    start=pd.Timestamp('2011-01-01'),
    end=pd.Timestamp('2014-01-01'),
    adjust_price=True,
)
spy


Out[53]:
Open High Low Close Volume Adj_Ratio
Date
2011-01-03 117.923577 118.751861 116.983613 118.24 138725200 0.930657
2011-01-04 118.495717 118.532941 117.434811 118.17 137409700 0.930619
2011-01-05 117.803496 118.864453 117.691816 118.79 133975300 0.930664
2011-01-06 118.839206 118.969502 118.206340 118.56 122519000 0.930685
2011-01-07 118.710864 118.906295 117.398679 118.32 156034600 0.930628
2011-01-10 117.797752 118.337511 117.444117 118.17 122401700 0.930619
2011-01-11 118.599306 118.878495 118.143298 118.59 110287000 0.930629
... ... ... ... ... ... ...
2013-12-20 179.037954 180.326069 178.919052 179.90 197087000 0.990857
2013-12-23 180.780732 180.968994 180.404209 180.86 85598000 0.990851
2013-12-24 180.873560 181.339270 180.863652 181.26 45368800 0.990871
2013-12-26 181.664751 182.279086 181.644934 182.18 63365000 0.990863
2013-12-27 182.417716 182.496984 181.981736 182.17 61814000 0.990862
2013-12-30 182.189543 182.338172 181.902193 182.14 56857000 0.990861
2013-12-31 182.385673 183.000000 182.246954 183.00 86119900 0.990850

754 rows × 6 columns


In [54]:
# Just plotting this DataFrame with the default arguments isn't very useful,
# because the scale of volume is so much greater than all the other columns.
spy.plot(figsize=DEFAULT_FIGSIZE)


Out[54]:
<matplotlib.axes.AxesSubplot at 0x107331910>

In [55]:
# Let's make a more interesting plot.

# Create a figure
fig = plt.figure()

# Add a subplot for price.
price_subplot = fig.add_subplot('311', xlabel='Date', ylabel='Price')
spy['Close'].plot(ax=price_subplot, lw=2)  # lw means "line width"

# Add another subplot for each day's spread.
spread_subplot = fig.add_subplot('312', xlabel='Date', ylabel='Spread')
spread = spy['High'] - spy['Low']
spread.plot(ax=spread_subplot, lw=2, color='r')

# And add a third plot for volume.
volume_subplot = fig.add_subplot('313', xlabel='Date', ylabel='Volume')
spy['Volume'].plot(ax=volume_subplot, lw=2)

# matplotlib.pyplot.gcf is short for "Get Current Figure".  It provides an easy
# way to modify the last drawn plot.
plt.gcf().set_size_inches(*DEFAULT_FIGSIZE)



In [56]:
# Unsurprisingly, spread is strongly correlated with daily volume.
spread.corr(spy['Volume'])


Out[56]:
0.787975095601951

Selecting Data with DataFrames

A DataFrame has two indices, representing row labels and column labels. Since these are Index objects, we can use all the same slicing tools we used when working with Series.
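
Concretely, for the spy DataFrame loaded above:

spy.index    # the row labels: a DatetimeIndex of trading days
spy.columns  # the column labels: Open, High, Low, Close, Volume, Adj_Ratio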


In [57]:
# Default slicing acts on column labels.
# Passing a scalar value drops the dimension by one.
spy['Close'] # Returns a Series


Out[57]:
Date
2011-01-03    118.24
2011-01-04    118.17
2011-01-05    118.79
2011-01-06    118.56
2011-01-07    118.32
...
2013-12-23    180.86
2013-12-24    181.26
2013-12-26    182.18
2013-12-27    182.17
2013-12-30    182.14
2013-12-31    183.00
Name: Close, Length: 754

In [58]:
# Passing a list filters the columns down to the supplied values.
spy[['Close', 'Volume']]


Out[58]:
Close Volume
Date
2011-01-03 118.24 138725200
2011-01-04 118.17 137409700
2011-01-05 118.79 133975300
2011-01-06 118.56 122519000
2011-01-07 118.32 156034600
2011-01-10 118.17 122401700
2011-01-11 118.59 110287000
... ... ...
2013-12-20 179.90 197087000
2013-12-23 180.86 85598000
2013-12-24 181.26 45368800
2013-12-26 182.18 63365000
2013-12-27 182.17 61814000
2013-12-30 182.14 56857000
2013-12-31 183.00 86119900

754 rows × 2 columns


In [59]:
# Using .loc with one argument takes a slice of rows based on label.
spy.loc[pd.Timestamp('2013-02-01'):pd.Timestamp('2013-02-28')]


Out[59]:
Open High Low Close Volume Adj_Ratio
Date
2013-02-01 146.316970 147.064823 146.064448 146.89 131173000 0.971238
2013-02-04 145.997571 146.920254 145.133163 145.24 159073600 0.971245
2013-02-05 146.030113 147.127645 145.971836 146.71 113912400 0.971268
2013-02-06 146.188418 146.907122 146.081583 146.81 138762800 0.971223
2013-02-07 146.862813 146.998788 145.551624 146.62 162490000 0.971251
2013-02-08 146.876659 147.527415 146.876659 147.44 103133700 0.971278
2013-02-11 147.380862 147.536265 147.040917 147.41 73775000 0.971272
... ... ... ... ... ... ...
2013-02-20 148.738262 148.786825 146.912299 146.99 160574800 0.971257
2013-02-21 146.614456 147.061214 145.623817 146.09 183257000 0.971214
2013-02-22 146.801290 147.520000 146.160279 147.52 106356600 0.971229
2013-02-25 148.245729 148.469122 144.720000 144.72 245824800 0.971275
2013-02-26 145.418619 145.884829 144.457061 145.71 186596200 0.971270
2013-02-27 145.578109 147.947918 145.451849 147.54 150781900 0.971233
2013-02-28 147.531660 148.473765 147.055752 147.25 126866000 0.971242

19 rows × 6 columns


In [60]:
# Using .loc with two arguments takes a slice of rows based on label, then a
# slice of columns based on name.

# Note the comma between the first slice and the second slice!
spy.loc[pd.Timestamp('2013-02-01'):pd.Timestamp('2013-02-28'), 'Open':'Low']


Out[60]:
Open High Low
Date
2013-02-01 146.316970 147.064823 146.064448
2013-02-04 145.997571 146.920254 145.133163
2013-02-05 146.030113 147.127645 145.971836
2013-02-06 146.188418 146.907122 146.081583
2013-02-07 146.862813 146.998788 145.551624
2013-02-08 146.876659 147.527415 146.876659
2013-02-11 147.380862 147.536265 147.040917
... ... ... ...
2013-02-20 148.738262 148.786825 146.912299
2013-02-21 146.614456 147.061214 145.623817
2013-02-22 146.801290 147.520000 146.160279
2013-02-25 148.245729 148.469122 144.720000
2013-02-26 145.418619 145.884829 144.457061
2013-02-27 145.578109 147.947918 145.451849
2013-02-28 147.531660 148.473765 147.055752

19 rows × 3 columns


In [61]:
# We can use iloc when we want lookups by position.
spy.iloc[-20:-10, [0,2]]


Out[61]:
Open Low
Date
2013-12-03 177.327241 176.568422
2013-12-04 176.509114 175.769963
2013-12-05 176.813197 176.182461
2013-12-06 178.053910 177.541439
2013-12-09 178.838985 178.533480
2013-12-10 178.356666 178.021594
2013-12-11 178.199567 175.913188
2013-12-12 176.052613 175.185359
2013-12-13 175.914351 175.194925
2013-12-16 176.353917 176.304642

Boolean Series slicing is very useful with DataFrames.


In [62]:
# Get the days on which SPY closed higher than it opened.
up_days = spy['Close'] > spy['Open']
up_days


Out[62]:
Date
2011-01-03     True
2011-01-04    False
2011-01-05     True
2011-01-06    False
2011-01-07    False
...
2013-12-23     True
2013-12-24     True
2013-12-26     True
2013-12-27    False
2013-12-30    False
2013-12-31     True
Length: 754

In [63]:
spy[up_days]


Out[63]:
Open High Low Close Volume Adj_Ratio
Date
2011-01-03 117.923577 118.751861 116.983613 118.24 138725200 0.930657
2011-01-05 117.803496 118.864453 117.691816 118.79 133975300 0.930664
2011-01-10 117.797752 118.337511 117.444117 118.17 122401700 0.930619
2011-01-12 119.315668 119.790288 118.617698 119.66 107929200 0.930627
2011-01-14 119.297005 120.357919 119.213248 120.33 117677900 0.930626
2011-01-18 120.223573 120.651680 120.083973 120.54 114401300 0.930667
2011-01-20 119.088320 119.497814 118.315865 119.20 175745700 0.930668
... ... ... ... ... ... ...
2013-12-18 176.330239 179.099566 174.753398 179.07 234906000 0.985526
2013-12-19 178.554492 179.066957 178.091303 178.86 136531200 0.985509
2013-12-20 179.037954 180.326069 178.919052 179.90 197087000 0.990857
2013-12-23 180.780732 180.968994 180.404209 180.86 85598000 0.990851
2013-12-24 180.873560 181.339270 180.863652 181.26 45368800 0.990871
2013-12-26 181.664751 182.279086 181.644934 182.18 63365000 0.990863
2013-12-31 182.385673 183.000000 182.246954 183.00 86119900 0.990850

423 rows × 6 columns


In [64]:
# We can use .ix when we want mixed lookups.
spy.ix[-20:-10, 'Open':'High']


Out[64]:
Open High
Date
2013-12-03 177.327241 177.770707
2013-12-04 176.509114 177.869150
2013-12-05 176.813197 177.138421
2013-12-06 178.053910 178.487538
2013-12-09 178.838985 179.036085
2013-12-10 178.356666 178.731158
2013-12-11 178.199567 178.229132
2013-12-12 176.052613 176.269427
2013-12-13 175.914351 176.072033
2013-12-16 176.353917 177.201441

Example: Investigating a Momentum Strategy


In [65]:
five_day_returns = spy['Close'].pct_change(5)
five_day_returns


Out[65]:
Date
2011-01-03   NaN
2011-01-04   NaN
2011-01-05   NaN
2011-01-06   NaN
2011-01-07   NaN
...
2013-12-23    0.024006
2013-12-24    0.029535
2013-12-26    0.017368
2013-12-27    0.018506
2013-12-30    0.012451
2013-12-31    0.011832
Name: Close, Length: 754

In [66]:
# Checking for equality of floating point numbers is a bad idea because of
# roundoff error.  `numpy.allclose` does an appropriate epsilon test.
test_return = (spy['Close'].iloc[5] - spy['Close'].iloc[0]) / spy['Close'].iloc[0]
np.allclose(five_day_returns.iloc[5], test_return)


Out[66]:
True

In [67]:
# shift(-30) moves each value 30 rows earlier, so for each date we get the
# close 30 trading days in the future; comparing it to today's close gives
# the 30-day forward return.
thirty_day_forward_returns = (spy['Close'].shift(-30) - spy['Close']) / spy['Close']
test_return = (spy['Close'].iloc[30] - spy['Close'].iloc[0]) / spy['Close'].iloc[0]
np.allclose(thirty_day_forward_returns.iloc[0], test_return)


Out[67]:
True

In [68]:
returns = pd.DataFrame(
    {
        'forward_30Day': thirty_day_forward_returns,
        'backA_2Day': spy['Close'].pct_change(2),
        'backB_5Day': spy['Close'].pct_change(5),
        'backD_50Day': spy['Close'].pct_change(50),
        'backE_100Day': spy['Close'].pct_change(100),
        'backF_200Day': spy['Close'].pct_change(200),
        'backG_300Day': spy['Close'].pct_change(300),
    }
).dropna(how='any')
returns.plot(figsize=DEFAULT_FIGSIZE)


Out[68]:
<matplotlib.axes.AxesSubplot at 0x101508190>

In [69]:
# Pairwise correlation of forward and backward returns.
corr = returns.corr()
corr


Out[69]:
backA_2Day backB_5Day backD_50Day backE_100Day backF_200Day backG_300Day forward_30Day
backA_2Day 1.000000 0.595697 0.188921 0.071506 0.162341 0.103689 -0.069359
backB_5Day 0.595697 1.000000 0.303945 0.106169 0.209141 0.133395 -0.132127
backD_50Day 0.188921 0.303945 1.000000 0.443526 0.299287 0.324473 -0.364206
backE_100Day 0.071506 0.106169 0.443526 1.000000 0.116012 0.288164 -0.533907
backF_200Day 0.162341 0.209141 0.299287 0.116012 1.000000 0.276626 -0.016570
backG_300Day 0.103689 0.133395 0.324473 0.288164 0.276626 1.000000 0.135996
forward_30Day -0.069359 -0.132127 -0.364206 -0.533907 -0.016570 0.135996 1.000000

In [70]:
corr.ix['forward_30Day',:-1].plot(kind='bar', position=.5, xlim=(-1, 6))
plt.gcf().set_size_inches(9, 6)


Example: Investigating a Pair Trade for Pepsi and Coca-Cola

Here we show how to load data for two securities, graph the data, and compute correlation of returns and volatility for each security over the specified period.


In [71]:
# Load data for Pepsi and Coca-Cola from Yahoo.
symbols = [
    'PEP',
    'KO',
]
cola_data = get_data_yahoo(symbols=symbols, adjust_price=True)
cola_data


Out[71]:
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 1186 (major_axis) x 2 (minor_axis)
Items axis: Open to Adj_Ratio
Major_axis axis: 2010-01-04 00:00:00 to 2014-09-18 00:00:00
Minor_axis axis: KO to PEP

In [72]:
# Compute the daily log returns for both securities' close prices.
closes = cola_data['Close']
yesterday_closes = cola_data['Close'].shift(1)
cola_log_returns = (closes / yesterday_closes).apply(np.log)
cola_raw_returns = closes.pct_change(1)

# Look at the data we just calculated by throwing it into a Panel and
# pulling out just the DataFrame for KO.
pd.Panel({
    'closes'     : closes,
    'prev_closes': yesterday_closes,
    'log_returns': cola_log_returns,
    'raw_returns': cola_raw_returns,
}).loc[:,:,'KO']


Out[72]:
closes log_returns prev_closes raw_returns
Date
2010-01-04 24.84 NaN NaN NaN
2010-01-05 24.54 -0.012151 24.84 -0.012077
2010-01-06 24.53 -0.000408 24.54 -0.000407
2010-01-07 24.47 -0.002449 24.53 -0.002446
2010-01-08 24.02 -0.018561 24.47 -0.018390
2010-01-11 24.50 0.019786 24.02 0.019983
2010-01-12 24.77 0.010960 24.50 0.011020
... ... ... ... ...
2014-09-10 41.86 0.005269 41.64 0.005283
2014-09-11 41.95 0.002148 41.86 0.002150
2014-09-12 41.46 -0.011749 41.95 -0.011681
2014-09-15 41.50 0.000964 41.46 0.000965
2014-09-16 41.64 0.003368 41.50 0.003373
2014-09-17 41.61 -0.000721 41.64 -0.000720
2014-09-18 41.79 0.004317 41.61 0.004326

1186 rows × 4 columns


In [73]:
# Pull the closes, the log returns, and the raw returns into a single DataFrame
# using DataFrame.join.
closes.join(cola_log_returns, rsuffix='_lr')\
      .join(cola_raw_returns, rsuffix='_rr')\
      .dropna(how='any')


Out[73]:
KO PEP KO_lr PEP_lr KO_rr PEP_rr
Date
2010-01-05 24.54 53.80 -0.012151 0.011967 -0.012077 0.012039
2010-01-06 24.53 53.26 -0.000408 -0.010088 -0.000407 -0.010037
2010-01-07 24.47 52.93 -0.002449 -0.006215 -0.002446 -0.006196
2010-01-08 24.02 52.75 -0.018561 -0.003407 -0.018390 -0.003401
2010-01-11 24.50 52.69 0.019786 -0.001138 0.019983 -0.001137
2010-01-12 24.77 53.43 0.010960 0.013947 0.011020 0.014044
2010-01-13 24.84 53.86 0.002822 0.008016 0.002826 0.008048
... ... ... ... ... ... ...
2014-09-10 41.86 91.79 0.005269 0.004039 0.005283 0.004047
2014-09-11 41.95 91.65 0.002148 -0.001526 0.002150 -0.001525
2014-09-12 41.46 90.87 -0.011749 -0.008547 -0.011681 -0.008511
2014-09-15 41.50 91.20 0.000964 0.003625 0.000965 0.003632
2014-09-16 41.64 92.57 0.003368 0.014910 0.003373 0.015022
2014-09-17 41.61 92.85 -0.000721 0.003020 -0.000720 0.003025
2014-09-18 41.79 93.37 0.004317 0.005585 0.004326 0.005600

1185 rows × 6 columns


In [74]:
# Create a figure with three 'slots' for subplots.
fig = plt.figure()
# 311 here means "Put the subplot in the 1st slot of a 3 x 1 grid."
# 312 and 313 tell matplotlib to place the subsequent plots in the 2nd and 3rd slots.
price_subplot = fig.add_subplot('311', xlabel='Date', ylabel='Price')
return_subplot_pep = fig.add_subplot('312', xlabel='Date', ylabel='PEP Log Returns')
return_subplot_ko = fig.add_subplot('313', xlabel='Date', ylabel='KO Log Returns')

cola_data['Close'].plot(ax=price_subplot, color=['purple', 'red'])
cola_log_returns['PEP'].plot(ax=return_subplot_pep, color='red')
cola_log_returns['KO'].plot(ax=return_subplot_ko, color='purple')

# Set the size of the whole plot array.  gcf stands for `get_current_figure`.
plt.gcf().set_size_inches(14, 10)



In [75]:
# Compute the correlation of our log returns
correlation = (cola_log_returns['PEP']).corr(cola_log_returns['KO'])
correlation


Out[75]:
0.60912363471867426

In [76]:
# Compute the column-wise standard deviation of daily log returns and multiply
# by sqrt(252) (the approximate number of trading days in a year) to annualize.
volatility = cola_log_returns.std() * np.sqrt(252)
volatility


Out[76]:
KO     0.151500
PEP    0.140033
dtype: float64