From Rob Pike's Notes on Programming in C:
Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
Pandas is built on a hierarchy of a few powerful data structures. Each of these structures is composed of, and designed to interoperate with, the simpler structures.
- Index (1-dimensional immutable ordered hash table)
- Series (1-dimensional labelled array)
- DataFrame (2-dimensional labelled array)
- Panel (3-dimensional labelled array)
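As a rough sketch of how these structures compose (the variable names here are illustrative, and the imports come from the setup cells below):

idx = pd.Index(['a', 'b', 'c'])          # ordered, immutable labels
s = pd.Series([1, 2, 3], index=idx)      # an Index plus a values array
df = pd.DataFrame({'x': s, 'y': s * 2})  # a column-dict of Series
df['x']  # Pulling a column back out yields a Series with the same Index.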
In [1]:
# Tell IPython to display matplotlib plots inline.
%matplotlib inline
# Set default font attributes.
import matplotlib
font = {'family': 'normal',
        'weight': 'bold',
        'size': 13}
matplotlib.rc('font', **font)
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
randn = np.random.randn
pd.set_option('display.mpl_style', 'default')
pd.set_option('display.max_rows', 15)
# Make a default figure size for later use.
DEFAULT_FIGSIZE = (12, 6)
In [3]:
s = pd.Series([3,5,7,2])
s
Out[3]:
In [4]:
# An important concept to understand when working with a `Series` is that it's
# actually composed of two pieces: an index array, and a data array.
print "The index is {0}.".format(s.index)
print "The values are {0}.".format(s.values)
In [5]:
# You can explicitly pass your own labels to use as an index. If you don't,
# Pandas will construct a default index with integer labels.
pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd'])
Out[5]:
In [6]:
# You can also construct a Series from a dictionary. The keys are used as
# the index, and the values are used as the Series' values.
pd.Series(
    {
        'a': 1,
        'b': 2,
        'c': 3,
    }
)
Out[6]:
In [7]:
# You get performance (and code clarity!) benefits if your Series'
# labels/values are homogeneously typed, but mixed-type arrays are supported.
pd.Series(
    [1, 2.6, 'a', {'a': 'b'}],
    index=[1, 'a', 2, 2.5],
)
Out[7]:
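The cost shows up in the dtype: homogeneous data gets a native numpy dtype, while mixed data falls back to the generic object dtype, which boxes every element as a Python object. A quick sketch:

print pd.Series([1.0, 2.0, 3.0]).dtype  # float64: a native numpy dtype
print pd.Series([1, 'a', {}]).dtype     # object: boxed Python values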
Series with __getitem__ (aka [])
Pandas objects support a wide range of selection and filtering methods. An important idea to keep in mind is the following: if you have an N-dimensional object, selecting with a single label drops the dimension by one, while selecting with a list or slice gives you back another N-dimensional object.
In [8]:
s = pd.Series(range(10), index=list('ABCDEFGHIJ'))
s
Out[8]:
In [9]:
# Lookups by key work as you'd expect.
s['E']
Out[9]:
In [10]:
# We can look up multiple values at a time by passing a list of keys.
# The resulting value is a new `Series`.
s[['E', 'I', 'B']]
Out[10]:
In [11]:
# Because the Index is ordered, we can use Python's slicing syntax.
s['E':]
Out[11]:
In [12]:
# Label-based slicing is inclusive of both endpoints.
s[:'I']
Out[12]:
In [13]:
s['E':'I']
Out[13]:
In [14]:
# Step arguments work just like Python lists.
s['E':'I':2]
Out[14]:
In [15]:
# If you don't know the label you want, but you do know the position, you can
# use `iloc`.
print "The first entry is: %d" % s.iloc[0]
print "The last entry is: %d" % s.iloc[-1]
In [16]:
# Slicing works with `iloc` as well.
# Note that, unlike with label-based slicing, integer-based slices are
# right-open intervals, i.e. doing s.iloc[X:Y] gives you elements with indices
# in [X, Y). This is the same as the semantics for list slicing.
s.iloc[5:]
Out[16]:
In [17]:
print s.iloc[:5]
In [18]:
s.iloc[-3:]
Out[18]:
In [19]:
# Create two Series objects containing 100 samples each of sine and cosine.
sine = pd.Series(np.sin(np.linspace(0, np.pi * 2, 100)), name='sine')
cosine = pd.Series(np.cos(np.linspace(0, np.pi * 2, 100)), name='cosine')
In [20]:
sine
Out[20]:
In [21]:
cosine
Out[21]:
In [22]:
# Multiplying two Series objects produces a new Series by multiplying values that have the same keys.
product = cosine * sine
product
Out[22]:
In [23]:
# Adding or multiplying a Series by a scalar applies that operation to each value in the Series.
cosine_plus_one = cosine + 1
cosine_plus_one
Out[23]:
In [24]:
# Other binary operators work as you'd expect.
# Note how much cleaner and clearer this is
# compared to looping over two containers and
# performing multiple operations on elements
# from each.
identity = (sine ** 2) + (cosine ** 2)
identity
Out[24]:
In [25]:
# Plot our sine values.
trigplot = sine.plot(
    ylim=(-1.2, 1.2),
    legend=True,
    figsize=DEFAULT_FIGSIZE,
    linewidth=3,
    label='sine',
)
# Add our other Series to the same plot.
cosine.plot(ax=trigplot, legend=True, linewidth=3)
product.plot(ax=trigplot, legend=True, linewidth=3, label='product')
identity.plot(ax=trigplot, legend=True, linewidth=3, label='identity')
Out[25]:
In [26]:
# Return the digit in the tenths place of N's decimal expansion.
def tenths_place(N):
    s = str(N)
    return s[s.find('.') + 1]
# `apply` maps a Python function over each value, returning a new Series.
product.apply(tenths_place)
Out[26]:
In [27]:
# A major problem when working with real-world data is handling missing
# entries. Pandas aligns the two Series on their indexes before operating;
# any label present in one Series but not the other gets the missing-data
# marker, NaN, in the result.
s1 = pd.Series({'a': 1, 'b': 2, 'c': 3})
# s2 is missing an entry for 'b'
s2 = pd.Series({'a': 4, 'c': 5})
s1 + s2
Out[27]:
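If you'd rather treat a missing label as zero than propagate NaN, the method form of `+` accepts a fill value. A minimal sketch:

# Treat s2's missing 'b' as 0 instead of producing NaN.
s1.add(s2, fill_value=0)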
In [28]:
s1 = pd.Series(
    {
        'A': 1,
        'B': 2,
        'C': 3,
        'D': 4,
        'E': 3,
        'F': 2,
        'G': 1,
    }
)
# You can create a constant Series by passing a scalar value and an index.
s2 = pd.Series(2, index=s1.index)
In [29]:
greater = s1 > s2
greater
Out[29]:
In [30]:
less = s1 < s2
less
Out[30]:
In [31]:
equal = s1 == s2
equal
Out[31]:
In [32]:
# Comparisons against scalars also work.
s1_equal_to_3 = s1 == 3
s1_equal_to_3
Out[32]:
In [33]:
# Collect the comparison results above into a single DataFrame.
pd.DataFrame({
    's1': s1,
    's2': s2,
    's1 > s2': greater,
    's1 == s2': equal,
    's1 < s2': less,
    's1 == 3': s1_equal_to_3,
}, columns=['s1', 's2', 's1 > s2', 's1 == s2', 's1 < s2', 's1 == 3'])
Out[33]:
Boolean-valued Series can be used for slicing. You can think of this as marking particular index values as "keep" (True) or "drop" (False).
In [34]:
# Indexing into a series with a boolean Series masks away the values which were
# false in the passed Series.
s1[s1 > s2]
Out[34]:
In [35]:
# We can combine these operators to concisely express complex
# computations/filters.
s1[(s1 > 1) & ~(s1 > s2)]
Out[35]:
In [36]:
# Pandas has a special index class, `DatetimeIndex`, for representing
# TimeSeries data.
start = pd.Timestamp('2014-01-01', tz='UTC')
end = pd.Timestamp('2014-01-09', tz='UTC')
# date_range is an easy way to construct a DatetimeIndex
daily_index = pd.date_range(start, end)
daily_index
Out[36]:
In [37]:
# DatetimeIndex has a notion of its frequency.
from pandas.tseries.offsets import Day, Hour, BDay, Minute
hourly_index = pd.date_range(
    pd.Timestamp('2014-01-01', tz='UTC'),
    pd.Timestamp('2014-01-09', tz='UTC'),
    freq=Hour(),
)
hourly_index
Out[37]:
In [38]:
bihourly_index = pd.date_range(
    pd.Timestamp('2014-01-01', tz='UTC'),
    pd.Timestamp('2014-01-09', tz='UTC'),
    freq=Hour(2),
)
bihourly_index
Out[38]:
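These offsets also have string aliases ('D' for Day, 'H' for Hour, 'B' for BDay), and a count prefix multiplies them. A sketch reproducing the bi-hourly index (the name `alias_index` is illustrative):

alias_index = pd.date_range(
    pd.Timestamp('2014-01-01', tz='UTC'),
    pd.Timestamp('2014-01-09', tz='UTC'),
    freq='2H',  # '2H' is the string alias for Hour(2).
)
alias_index.equals(bihourly_index)  # True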
In [39]:
weekday_index = pd.date_range(
    pd.Timestamp('2014-01-01', tz='UTC'),
    pd.Timestamp('2014-01-09', tz='UTC'),
    freq=BDay(),
)
print weekday_index
[i for i in weekday_index]
Out[39]:
If your Series has a DatetimeIndex, then you immediately get access to sophisticated resampling tools.
In [40]:
ts = pd.Series(
    np.arange(30) ** 2,
    pd.date_range(
        start=pd.Timestamp('2014-01-01', tz='UTC'),
        freq='1D',
        periods=30,
    )
)
ts.plot()
Out[40]:
In [41]:
# By default, resampling to a lower frequency takes the mean of the entries
# that were downsampled.
resampled = ts.resample('5D')
resampled
Out[41]:
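We can sanity-check that the default aggregation really is the mean by comparing the first bucket against a hand-computed mean. A sketch:

# The first '5D' bucket spans the first five daily entries.
np.allclose(resampled.iloc[0], ts.iloc[:5].mean())  # True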
In [42]:
# We can customize this behavior though.
resampled_first = ts.resample('5D', how='first')
resampled_first
Out[42]:
In [43]:
resampled_last = ts.resample('5D', how='last')
resampled_last
Out[43]:
In [44]:
# We can even define our own custom sampling methods.
def geometric_mean(subseries):
    return np.product(subseries.values) ** (1.0 / len(subseries))
resampled_geometric = ts.resample('5D', how=geometric_mean)
print resampled_geometric
In [45]:
pd.DataFrame(
    {
        "resampled": resampled,
        "resampled_first": resampled_first,
        "resampled_last": resampled_last,
        "resampled_geometric": resampled_geometric,
    }
).plot(linewidth=2, figsize=DEFAULT_FIGSIZE)
Out[45]:
In [46]:
# Upsampling creates missing data, which is represented by numpy.nan.
ts.resample('6H')
Out[46]:
We can handle missing data in a variety of ways.
In [47]:
# We can fill empty values with fillna.
zero_filled = ts.resample('6H').fillna(0)
print zero_filled
In [48]:
# We can forward-fill with the last known prior value.
ffilled = ts.resample('6H').ffill()
print ffilled
In [49]:
# We can backfill with the next known value.
bfilled = ts.resample('6H').bfill()
print bfilled
In [50]:
# We can interpolate between known values.
# Note: `interpolate` is new as of pandas 0.14.0
# Quantopian is currently on pandas 0.12.0 due to breaking changes in the
# pandas API in 0.13.0.
linear_interpolated = ts.resample('6H').interpolate()
linear_interpolated
Out[50]:
In [51]:
quadratic_interpolated = ts.resample('6H').interpolate('polynomial', order=2)
quadratic_interpolated
Out[51]:
In [52]:
pd.DataFrame(
    {
        "linear_interpolated": linear_interpolated,
        "quadratic_interpolated": quadratic_interpolated,
        "bfilled": bfilled,
        "ffilled": ffilled,
        "zero_filled": zero_filled,
    }
).plot(linewidth=2, figsize=DEFAULT_FIGSIZE)
Out[52]:
In [53]:
# Oftentimes we have more than one axis on which we want to store data.
from pandas.io.data import get_data_yahoo
spy = get_data_yahoo(
    symbols='SPY',
    start=pd.Timestamp('2011-01-01'),
    end=pd.Timestamp('2014-01-01'),
    adjust_price=True,
)
spy
Out[53]:
In [54]:
# Just plotting this DataFrame with the default arguments isn't very useful,
# because the scale of volume is so much greater than all the other columns.
spy.plot(figsize=DEFAULT_FIGSIZE)
Out[54]:
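One lightweight option (assuming your pandas version supports the `secondary_y` plotting argument, which this era does) is to give Volume its own right-hand axis:

# Put Volume on a secondary y-axis so the price columns stay readable.
spy.plot(secondary_y=['Volume'], figsize=DEFAULT_FIGSIZE)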
In [55]:
# Let's make a more interesting plot.
# Create a figure
fig = plt.figure()
# Add a subplot for price.
price_subplot = fig.add_subplot('311', xlabel='Date', ylabel='Price')
spy['Close'].plot(ax=price_subplot, lw=2) # lw means "line width"
# Add another subplot for each day's spread.
spread_subplot = fig.add_subplot('312', xlabel='Date', ylabel='Spread')
spread = spy['High'] - spy['Low']
spread.plot(ax=spread_subplot, lw=2, color='r')
# And add a third plot for volume.
volume_subplot = fig.add_subplot('313', xlabel='Date', ylabel='Volume')
spy['Volume'].plot(ax=volume_subplot, lw=2)
# matplotlib.pyplot.gcf is short for "Get Current Figure". It provides an easy
# way to modify the last drawn plot.
plt.gcf().set_size_inches(*DEFAULT_FIGSIZE)
In [56]:
# Unsurprisingly, spread is strongly correlated with daily volume.
spread.corr(spy['Volume'])
Out[56]:
In [57]:
# Default slicing acts on column labels.
# Passing a scalar value drops the dimension by one.
spy['Close'] # Returns a Series
Out[57]:
In [58]:
# Passing a list filters the columns down to the supplied values.
spy[['Close', 'Volume']]
Out[58]:
In [59]:
# Using .loc with one argument takes a slice of rows based on label.
spy.loc[pd.Timestamp('2013-02-01'):pd.Timestamp('2013-02-28')]
Out[59]:
In [60]:
# Using .loc with two arguments takes a slice of rows based on label, then a
# slice of columns based on name.
# Note the comma between the first slice and the second slice!
spy.loc[pd.Timestamp('2013-02-01'):pd.Timestamp('2013-02-28'), 'Open':'Low']
Out[60]:
In [61]:
# We can use iloc when we want lookups by position.
spy.iloc[-20:-10, [0,2]]
Out[61]:
Boolean Series slicing is very useful with DataFrame.
In [62]:
# Get the days on which SPY closed higher than it opened.
up_days = spy['Close'] > spy['Open']
up_days
Out[62]:
In [63]:
spy[up_days]
Out[63]:
In [64]:
# We can use .ix when we want mixed lookups.
spy.ix[-20:-10, 'Open':'High']
Out[64]:
In [65]:
five_day_returns = spy['Close'].pct_change(5)
five_day_returns
Out[65]:
In [66]:
# Checking for equality of floating point numbers is a bad idea because of
# roundoff error. `numpy.allclose` does an appropriate epsilon test.
test_return = (spy['Close'].iloc[5] - spy['Close'].iloc[0]) / spy['Close'].iloc[0]
np.allclose(five_day_returns.iloc[5], test_return)
Out[66]:
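A classic one-liner showing why exact floating-point comparison misfires:

print 0.1 + 0.2 == 0.3             # False: binary roundoff error
print np.allclose(0.1 + 0.2, 0.3)  # True: within numpy's default tolerance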
In [67]:
thirty_day_forward_returns = (spy['Close'].shift(-30) - spy['Close']) / spy['Close']
test_return = (spy['Close'].iloc[30] - spy['Close'].iloc[0]) / spy['Close'].iloc[0]
np.allclose(thirty_day_forward_returns.iloc[0], test_return)
Out[67]:
In [68]:
returns = pd.DataFrame(
    {
        'forward_30Day': thirty_day_forward_returns,
        'backA_2Day': spy['Close'].pct_change(2),
        'backB_5Day': spy['Close'].pct_change(5),
        'backD_50Day': spy['Close'].pct_change(50),
        'backE_100Day': spy['Close'].pct_change(100),
        'backF_200Day': spy['Close'].pct_change(200),
        'backG_300Day': spy['Close'].pct_change(300),
    }
).dropna(how='any')
returns.plot(figsize=DEFAULT_FIGSIZE)
Out[68]:
In [69]:
# Pairwise correlation of forward and backward returns.
corr = returns.corr()
corr
Out[69]:
In [70]:
corr.ix['forward_30Day',:-1].plot(kind='bar', position=.5, xlim=(-1, 6))
plt.gcf().set_size_inches(9, 6)
In [71]:
# Load data for Pepsi and Coca-Cola from Yahoo.
symbols = [
    'PEP',
    'KO',
]
cola_data = get_data_yahoo(symbols, adjust_price=True)
cola_data
Out[71]:
In [72]:
# Compute the 1-day trailing log returns for both securities' close prices.
closes = cola_data['Close']
yesterday_closes = cola_data['Close'].shift(1)
cola_log_returns = (closes / yesterday_closes).apply(np.log)
cola_raw_returns = closes.pct_change(1)
# Look at the data we just calculated by throwing it into a Panel and
# pulling out just the DataFrame for KO.
pd.Panel({
    'closes': closes,
    'prev_closes': yesterday_closes,
    'log_returns': cola_log_returns,
    'raw_returns': cola_raw_returns,
}).loc[:, :, 'KO']
Out[72]:
In [73]:
# Pull the standard returns and the log returns into a single DataFrame using DataFrame.join.
closes.join(cola_log_returns, rsuffix='_lr')\
.join(cola_raw_returns, rsuffix='_rr')\
.dropna(how='any')
Out[73]:
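One reason to prefer log returns is that they add across time: the sum of daily log returns equals the log of the total gross return. A quick check, as a sketch (the name `total_log_return` is illustrative, and this assumes no missing closes in the middle of the series):

# `sum` skips the leading NaN produced by the shift.
total_log_return = cola_log_returns['KO'].sum()
np.allclose(total_log_return,
            np.log(closes['KO'].iloc[-1] / closes['KO'].iloc[0]))  # True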
In [74]:
# Create a figure with three 'slots' for subplots.
fig = plt.figure()
# 311 here means "Put the subplot in the 1st slot of a 3 x 1 grid."
# 312 and 313 tell matplotlib to place the subsequent plots in the 2nd and 3rd slots.
price_subplot = fig.add_subplot('311', xlabel='Date', ylabel='Price')
return_subplot_pep = fig.add_subplot('312', xlabel='Date', ylabel='PEP Log Returns')
return_subplot_ko = fig.add_subplot('313', xlabel='Date', ylabel='KO Log Returns')
cola_data['Close'].plot(ax=price_subplot, color=['purple', 'red'])
cola_log_returns['PEP'].plot(ax=return_subplot_pep, color='red')
cola_log_returns['KO'].plot(ax=return_subplot_ko, color='purple')
# Set the size of the whole plot array. gcf stands for `get_current_figure`.
plt.gcf().set_size_inches(14, 10)
In [75]:
# Compute the correlation of our log returns.
correlation = cola_log_returns['PEP'].corr(cola_log_returns['KO'])
correlation
Out[75]:
In [76]:
# Compute the column-wise standard deviation of daily log returns and
# multiply by sqrt(252) to annualize.
volatility = cola_log_returns.std() * np.sqrt(252)
volatility
Out[76]:
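The sqrt(252) factor follows from treating daily returns as independent: variance then scales linearly with the number of periods (about 252 trading days per year), so standard deviation scales with the square root. Spelled out as a sketch (the intermediate names are illustrative):

daily_vol = cola_log_returns.std()        # per-column daily standard deviation
annualized_vol = daily_vol * np.sqrt(252) # scale by sqrt(252 trading days)
np.allclose(annualized_vol, volatility)   # True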