In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.markers import MarkerStyle
import pandas as pd
import seaborn as sns
import numpy as np

np.set_printoptions(threshold=30)
pd.set_option('display.max_rows', 15)
pd.set_option('display.max_columns', 15)
sns.set(style="whitegrid", context="poster", font_scale=1.3)

Developing an Expression Language for Quantitative Financial Modeling


Scott Sanderson


https://github.com/ssanderson/pydata-nyc-2015

About Me:

Outline

  • Trading API Design 101
  • Benefits of Symbolic Computation
  • Demo
  • The Joys of Financial Data
  • Future Work

What Is a Trading Algorithm?

Any program is a function from current state of the world $\rightarrow$ side effects.

Our World State:

  • Current Portfolio State (e.g. Position Counts, Available Cash)
  • Per-Asset Data (Price/Volume Data, Fundamentals, many more exotic sources)
  • Global Macroeconomic Data (Oil Prices, Unemployment, GDP)

Our Side Effects:

  • Orders
  • Cancellations

Good APIs Encourage Problem Decomposition

Large Problem

Place Orders Based on State of the World

Medium Problems

  • Compute desired portfolio allocations.
  • Place orders to move from current allocations toward desired allocations.

Small Problems:

  • Compute desired portfolio allocations.
    • Choose input data.
    • Decide what values we want to compute on the data.
      • Reductions
      • Boolean Masks
      • Categoricals/Classifiers
    • Combine and Compare Computed Values
      • Weighted Combinations
      • Ranking
      • Filtering
      • Normalization

Many of these atomic computations share a common structure:

  1. Get the last N periods' worth of data. (Sometimes N is just 1.)
  2. Apply a reduction function to produce a single-valued output.
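For example, a "trailing mean of closing prices" reduction can be sketched with the CustomFactor-style API used in the demo later (the specific class and field names here are illustrative, not a spec):

import numpy as np
from zipline.pipeline import CustomFactor
from zipline.pipeline.data import USEquityPricing

class MeanClose(CustomFactor):
    # 1. Ask for the last N periods of data...
    inputs = [USEquityPricing.close]
    window_length = 10

    # 2. ...and reduce each asset's trailing window to a single value.
    def compute(self, today, assets, out, close):
        out[:] = np.nanmean(close, axis=0)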

Three Major Kinds of Expression:

  • Factors produce numerical-valued results.
  • Filters produce boolean-valued results.
  • Classifiers produce categorical-valued results.

We can compose symbolic expressions differently based on the type of data they produce.

Factors

  • Factors are reductions producing numerically-valued outputs.
  • mean, median, first, last, stddev, etc.
  • Factors can be added, subtracted, etc.
  • Factors can produce new factors via operations like rank() or zscore().
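A rough sketch of that kind of composition (SimpleMovingAverage is one of Pipeline's built-in factors; the particular combination below is just for illustration):

from zipline.pipeline.data import USEquityPricing
from zipline.pipeline.factors import SimpleMovingAverage

short_mavg = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10)
long_mavg = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=30)

# Arithmetic between factors yields a new factor...
momentum = (short_mavg - long_mavg) / long_mavg

# ...and so do operations like rank().
momentum_rank = momentum.rank()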

Filters

  • Filters are reductions producing boolean-valued outputs.
  • Most commonly constructed via comparison operators (e.g. factor1 > factor2).
  • Can be combined via & and |.
  • Can be used as masks for operations like rank() and percentile().
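A similarly hedged sketch for filters (the names and thresholds below are made up):

from zipline.pipeline.data import USEquityPricing
from zipline.pipeline.factors import SimpleMovingAverage

avg_volume = SimpleMovingAverage(inputs=[USEquityPricing.volume], window_length=30)
avg_close = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=30)

# Comparisons against factors or scalars produce filters...
liquid = avg_volume > 1000000
cheap = avg_close < 50

# ...which combine with & and | and can mask other computations.
tradeable = liquid & cheap
cheap_rank = avg_close.rank(mask=tradeable)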

Classifiers

  • Classifiers are reductions producing categorical-valued outputs.
  • Can be combined via operations like cross_product().
  • Can be used to define grouping criteria for normalizations.
    • Example: "Compute each stock's earnings minus the mean earnings of stocks in the same industry."
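Classifiers aren't implemented yet (see Future Work), but the industry example above boils down to a group-wise demean, sketched here with plain numpy (the function name and numbers are made up):

import numpy as np

def demean_by_group(values, groups):
    # Subtract each group's mean from that group's members.
    out = np.empty_like(values, dtype=float)
    for label in np.unique(groups):
        mask = groups == label
        out[mask] = values[mask] - values[mask].mean()
    return out

earnings = np.array([10.0, 12.0, 50.0, 70.0])
industry = np.array(['tech', 'tech', 'energy', 'energy'])

print demean_by_group(earnings, industry)  # [ -1.   1. -10.  10.]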

"Platonic" Target Algorithm:

  1. For each asset in a known (large) universe, look at trailing windows of data and compute Factors (numerical-valued expressions), Filters (boolean-valued expressions) and Classifiers (categorical-valued expressions).
  2. Compose meta-expressions by combining computed values.
  3. Use outputs from (2) to compute desired portfolio allocations.
  4. Compare desired to current allocations and place orders to reconcile the difference.
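Steps 3 and 4 are mostly bookkeeping once the expressions in steps 1 and 2 are computed. A toy version with plain pandas (all tickers and weights are made up):

import pandas as pd

# Output of steps 1-2: a score per asset, higher is better.
scores = pd.Series({'AAPL': 3.0, 'MSFT': 2.0, 'XOM': 1.0, 'IBM': 0.5})

# Step 3: equal-weight the top two scores.
longs = scores.nlargest(2)
desired = pd.Series(1.0 / len(longs), index=longs.index)

# Step 4: diff against current allocations to get what we need to trade.
current = pd.Series({'AAPL': 0.5, 'IBM': 0.5})
to_trade = desired.subtract(current, fill_value=0)
print to_trade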

Design Goals

  • Make it easy to share common transformations.
  • Make it easy to compose transformations.
  • Be performant enough to operate on large universes without slowing down backtests.
  • Be abstract enough that the underlying machinery can be changed without breaking lots of APIs.

Symbolic Computation is Eating the World

Symbolic and/or deferred computation frameworks are increasingly the norm for providing a high-level API to performant code.

In the PyData ecosystem alone we have:

  • Blaze
  • Dask
  • Ibis
  • Theano
  • TensorFlow (as of yesterday!)

Optimization

  • Common subexpression elimination.
  • Caching/precomputation of pure functions.
  • Rewriting of equivalent terms (e.g. a + a + a -> 3a).
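A toy illustration of the first point, nothing like the real engine: if terms hash by structure, identical subexpressions collapse into a single node and get computed once.

class Term(object):
    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

    def __eq__(self, other):
        return (self.op, self.inputs) == (other.op, other.inputs)

    def __hash__(self):
        return hash((self.op, self.inputs))

# Two separately-constructed but structurally-identical terms...
a = Term('mean', 'close', 10)
b = Term('mean', 'close', 10)

# ...occupy one slot in a results cache, so the work happens only once.
results = {}
for term in (a, b):
    if term not in results:
        results[term] = 'expensive computation'

print len(results)  # 1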

Abstraction

  • Framework can change how computations are executed under the hood without breaking public APIs.
  • Framework can support multiple execution styles in different contexts:
    • Blaze can compute the same abstract expression against SQL or numpy.
    • Dask can compute single-threaded, multi-threaded, multi-process, or multi-machine.

Correctness

  • Symbolic computation frameworks tend to encourage immutability of data structures, eliminating many classes of bugs.
  • Type systems can catch bugs at "expression compile" time, providing many of the benefits of a traditional compiler.

Demo Time

The Joys of Financial Data

or

"How hard can it be to get the last 30 days of price?"

Harder than one might think, unfortunately.

  • Companies are constantly created and dissolved.
  • Splits, Dividends, and Mergers render past and current prices/volumes incomparable.
  • Restatements are tricky to model correctly in a simulation.

Asset Flux


In [3]:
from zipline.assets import AssetFinder
 
finder = AssetFinder("sqlite:///data/assets.db")
lifetimes = finder.lifetimes(
    dates=pd.date_range('2001-01-01', '2015-10-01'), 
    include_start_date=True,
)
lifetimes.head(5)


Out[3]:
2 6 14 21 24 25 31 ... 49526 49528 49529 49530 49531 49532 49533
2001-01-01 True True True True True True True ... False False False False False False False
2001-01-02 True True True True True True True ... False False False False False False False
2001-01-03 True True True True True True True ... False False False False False False False
2001-01-04 True True True True True True True ... False False False False False False False
2001-01-05 True True True True True True True ... False False False False False False False

5 rows × 20353 columns


In [4]:
daily_count = lifetimes.sum(axis=1)
daily_count.plot(title="Companies in Existence by Day");


Splits, Dividends, and Mergers Oh My!


In [5]:
AAPL_prices = pd.read_csv(
    'data_public/AAPL-split.csv', 
    parse_dates=['Date'], 
    index_col='Date',
)

def plot_prices(prices):
    price_plot = prices.plot(title='AAPL Price', grid=False)
    price_plot.set_ylabel("Price", rotation='horizontal', labelpad=50)
    price_plot.vlines(
        ['2014-05-08'], 0, 700, 
        label="$3.05 Dividend", 
        linestyles='dotted',
        colors='black',
    )
    price_plot.vlines(
        ['2014-06-09'], 0, 700, 
        label="7:1 Split", 
        linestyles='--', 
        colors='black',
    )

    price_plot.legend()
    sns.despine()
    return price_plot

In [6]:
plot_prices(AAPL_prices);


Prices occurring at different times aren't necessarily comparable.


In [7]:
naive_returns = AAPL_prices.pct_change()
naive_returns.plot();


Restatements

Hypothetical Scenario:

  • On March 5th, Apple announces that its Q1 revenue was $15.
  • On March 15th, Apple issues an amendment; its Q1 revenue was actually $15 billion.
  • User asks for Apple's Q1 revenue.

Question: What's the "correct" value to return?

Answer: It depends.

  • On March 4th, the right answer is "I don't know".
  • On March 6th, the right answer (as of that date) is $15.
  • On March 16th, the right answer is $15 billion.
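One way to make that answer precise is to store each value together with the date on which it became known, and filter on that "knowledge date" when querying. A tiny sketch (the column names, year, and storage layout are made up):

import pandas as pd

q1_revenue = pd.DataFrame({
    'knowledge_date': pd.to_datetime(['2015-03-05', '2015-03-15']),
    'value': [15.0, 15e9],
})

def revenue_as_of(t_ref):
    known = q1_revenue[q1_revenue['knowledge_date'] <= t_ref]
    if known.empty:
        return None  # "I don't know"
    # The most recently announced (or restated) figure wins.
    return known.sort_values('knowledge_date')['value'].iloc[-1]

print revenue_as_of(pd.Timestamp('2015-03-04'))  # None
print revenue_as_of(pd.Timestamp('2015-03-06'))  # 15.0
print revenue_as_of(pd.Timestamp('2015-03-16'))  # 1.5e+10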

Perspectival Data

The traditional solution to the above problems is to use "Adjusted Prices".

  • Generally means making prices comparable by normalizing everything to current-day prices.
  • This is what you get from Yahoo/Google Finance APIs.

Problems with Adjusted Prices:

  • Non-reproducible:
    • A backtest run in real time in 2014 sees different values for historical prices than a backtest run today.
  • Can potentially introduce subtle biases:
    • If the algorithm also has access to the real (unadjusted) price, it can tell that a dividend/split is coming.
    • Stocks that have undergone multiple splits will have very low prices in the past.
  • Still doesn't help for restatements.
    • Faithfully representing restated values requires quoting different values at different simulation times.

When in Doubt, Add Another Dimension

Naively, we want to say something like this:

$$D(a, t)$$

is the value of dataset $D$ for asset $a$ at time $t$.

This abstraction is broken in the face of splits, dividends, and restatements.

When in Doubt, Add Another Dimension

What we can say is this:

$$D(a, t, t_{ref})$$

is the value of $D$ for asset $a$ at time $t$ from the perspective of time $t_{ref}$.
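Concretely, the same query can legitimately return different values depending on when it is asked. A minimal sketch around the AAPL 7:1 split shown earlier (the function and layout are made up; prices are approximate):

import pandas as pd

raw_close = pd.Series(
    [645.57, 93.70],
    index=pd.to_datetime(['2014-06-06', '2014-06-09']),
)
split_effective = pd.Timestamp('2014-06-09')
split_ratio = 1.0 / 7.0

def close(t, t_ref):
    value = raw_close[t]
    # Pre-split prices are only scaled down once the split is known,
    # i.e. from the perspective of a t_ref on or after the effective date.
    if t < split_effective <= t_ref:
        value *= split_ratio
    return value

friday = pd.Timestamp('2014-06-06')
print close(friday, friday)                      # 645.57: the price as quoted that day
print close(friday, pd.Timestamp('2014-06-10'))  # ~92.22: comparable to post-split prices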

Napkin Math:


In [8]:
from bcolz import open
from humanize import naturalsize

all_prices = open('data/equity_daily_bars.bcolz')
# Each asset records an offset into the shared trading calendar marking where
# its stored data begins (see "Tricks used to make dataset smaller" below).
min_offset = min(all_prices.attrs['calendar_offset'].itervalues())
max_offset = max(all_prices.attrs['calendar_offset'].itervalues())
calendar = pd.DatetimeIndex(all_prices.attrs['calendar'])[min_offset:max_offset]

nassets = len(lifetimes.columns)
ndates = len(calendar)
nfields = len(('id', 'open', 'high', 'low', 'close', 'volume', 'date'))

print "Number of Assets: %d" % nassets
print "Number of Dates: %d" % ndates
print "Naive Dataset Size: %s" % naturalsize(
    nassets * ndates * nfields * 8
)


Number of Assets: 20353
Number of Dates: 3480
Naive Dataset Size: 4.0 GB

Data Size on Disk


In [9]:
!du -h -d0 data/equity_daily_bars.bcolz
!du -h -d0 data/adjustments.db


299M	data/equity_daily_bars.bcolz
30M	data/adjustments.db

Tricks used to make dataset smaller:

  • Don't store entries on dates for which an asset didn't exist.
    • Instead, store offset into the calendar of the date the asset started/stopped trading.
  • Represent data points as 32-bit unsigned ints on disk.
    • Going from 64-bit to 32-bit cuts the data size in half.
    • Switching from floating-point to integer dramatically improves compression ratio.
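A small sketch of the first trick (the layout is illustrative, not the actual on-disk format): store each asset's prices as a contiguous run plus an offset recording where in the shared calendar that run begins.

import numpy as np
import pandas as pd

calendar = pd.date_range('2014-01-02', periods=10, freq='B')

# Asset 24 only existed for the last seven sessions of this calendar, so we
# store seven values plus an offset of 3 instead of ten values padded with NaNs.
calendar_offset = {24: 3}
stored_closes = {24: np.array([10.0, 10.5, 10.25, 11.0, 10.9, 11.2, 11.5])}

def close_on(asset, date):
    idx = calendar.get_loc(date) - calendar_offset[asset]
    if not 0 <= idx < len(stored_closes[asset]):
        raise KeyError("asset %d did not exist on %s" % (asset, date))
    return stored_closes[asset][idx]

print close_on(24, calendar[3])  # 10.0, the asset's first day of trading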

Data Structures for Representing Perspectival Data

Rule 5: Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming. - Rob Pike


In [10]:
import pandas as pd
from zipline.utils.tradingcalendar import trading_day
from zipline.pipeline.data import USEquityPricing
from zipline.pipeline.loaders import USEquityPricingLoader

loader = USEquityPricingLoader.from_files(
    'data/equity_daily_bars.bcolz',
    'data/adjustments.db'
)
dates = pd.date_range(
    '2014-5-20', 
    '2014-06-30', 
    freq=trading_day,
    tz='UTC',
)

In [11]:
# load_adjusted_array() returns a dictionary mapping columns to instances of `AdjustedArray`.
(closes,) = loader.load_adjusted_array(
    columns=[USEquityPricing.close],
    dates=dates,
    assets=pd.Int64Index([24, 5061]),
    mask=None,
).values()
closes


Out[11]:
Adjusted Array:

Data:
 array([[ 604.4 ,   39.74],
       [ 604.55,   39.69],
       [ 606.28,   40.35],
       ..., 
       [  90.35,   42.02],
       [  90.92,   41.73],
       [  91.96,   42.24]])

Adjustments:
{13: [Float64Multiply(first_row=0, last_row=13, first_col=0, last_col=0, value=0.142860)]}

In [14]:
dates_iter = iter(dates[4:])
window = closes.traverse(5)
window


Out[14]:
_Float64AdjustedArrayWindow
Window Length: 5
Current Buffer:
[[ 604.4     39.74 ]
 [ 604.55    39.69 ]
 [ 606.28    40.35 ]
 [ 607.33    40.105]
 [ 614.14    40.12 ]]
Remaining Adjustments:
{13: [Float64Multiply(first_row=0, last_row=13, first_col=0, last_col=0, value=0.142860)]}

In [15]:
# This cell is run multiple times to show the numbers scrolling up until we hit the split.
data = next(window)
print data
print next(dates_iter)


[[ 604.4     39.74 ]
 [ 604.55    39.69 ]
 [ 606.28    40.35 ]
 [ 607.33    40.105]
 [ 614.14    40.12 ]]
2014-05-27 00:00:00+00:00

Future Work

Near Term:

  • Classifiers/Normalization schemes
  • Non-numeric data types. (Currently we only support float and bool.)
  • Datasets that have more than one data point per asset/day.
    • Earnings estimates.
    • Calendars.
    • More exotic data?

Longer Term:

  • Parallel and/or Remote execution engines.
    • Leveraging dask is a potentially interesting route.
  • Just-in-time compilation of factor compute functions.

Questions?