An Introductory Tutorial to IPython Notebooks

By Delaney Granizo-Mackenzie and Justin Lent. Adapted from a notebook by Dr. Thomas Wiecki.

Notebook released under the Creative Commons Attribution 4.0 License.


IPython notebooks are a powerful tool for interacting with data. We hook our notebooks up to historical market data so that you can try out ideas for strategies. Often, the first step to developing a strategy is testing whether an idea would work in a clean, controlled environment. To do that we'll artificially generate some returns.

What is an IPython Notebook?

An IPython notebook is an interface connected to a server somewhere. The interface is built out of cells that can be evaluated. Each cell is a chunk of code, and the output of the last line is displayed underneath the cell. Text editing commands work normally within a cell, and pressing the play button evaluates the cell.


In [1]:
2 + 2


Out[1]:
4

In [2]:
# This is a comment, it won't be evaluated
x = 1
x = x + 1
x


Out[2]:
2

Your work is saved between cells.


In [3]:
# You can also print within a cell to get more than just the final line's output
print x
x = x + 1
x


2
Out[3]:
3

In [4]:
# Some lines return nothing
x = 1

This is a markdown cell.

Markdown cells can display $\LaTeX$ as well. $$\sum_{i=1}^{n} x_{i}^k$$

Simulated Returns

Importing Libraries

The first step is to import the libraries we need. You can pretty much copy-paste these imports for most cases. Notice how I'm renaming the libraries for easier use later.


In [5]:
# This is a numerical processing library
import numpy as np
# This is a library for plotting
import matplotlib.pyplot as plt

Generating Random Numbers

Now we'll call a function from the numpy library we imported. We'll use numpy.random.randn to sample some simulated returns from a normal distribution.


In [6]:
# You can get help with a function like this
np.random.randn?

In [7]:
np.random.randn(10)


Out[7]:
array([-0.46452141,  0.85763376, -0.93862487,  1.81387423, -0.83241194,
       -0.33915358,  0.03584861, -1.4449243 ,  0.11033862, -1.14644981])

In [8]:
# Get 1000 random returns
returns_vector = np.random.randn(1000)
# Plot the returns
plt.plot(returns_vector)


Out[8]:
[<matplotlib.lines.Line2D at 0x7fa63a594210>]

In [9]:
# Plot the price of the instrument
# This is the cumulative sum of the returns
plt.plot(np.cumsum(returns_vector))


Out[9]:
[<matplotlib.lines.Line2D at 0x7fa63a4cb9d0>]

Obtaining Statistics in IPython

We can use numpy's tools to obtain statistics. In this case we're obtaining statistics from sample data, but we would do exactly the same thing had we obtained real market data.


In [10]:
np.mean(returns_vector)


Out[10]:
-0.030911955574617102

In [11]:
np.std(returns_vector)


Out[11]:
0.97941100440079165

Getting Real Pricing Data

Our notebooks are hooked up to our database of historical pricing data. You can use the get_pricing function to load pricing data into your notebook. The arguments we pass tell it which symbol to get pricing for, and over what time period.


In [12]:
# Get data for the S&P 500 from the start of this year
real_pricing = get_pricing('SPY', start_date='2015-01-01', end_date='2015-03-31')
real_pricing.head(5)


Out[12]:
open_price high low close_price volume price
2015-01-02 00:00:00+00:00 206.38 206.88 204.180 205.41 96186838 205.41
2015-01-05 00:00:00+00:00 204.17 204.37 201.350 201.80 135105261 201.80
2015-01-06 00:00:00+00:00 202.09 202.72 198.855 199.82 169761073 199.82
2015-01-07 00:00:00+00:00 201.42 202.72 200.880 202.34 105090881 202.34
2015-01-08 00:00:00+00:00 204.01 206.16 203.990 205.92 114210211 205.92

We have all these fields, but we're just interested in the daily reported price for now.


In [13]:
# Index in to just get the pricing part
real_pricing = real_pricing['price']
real_pricing.head(5)


Out[13]:
2015-01-02 00:00:00+00:00    205.41
2015-01-05 00:00:00+00:00    201.80
2015-01-06 00:00:00+00:00    199.82
2015-01-07 00:00:00+00:00    202.34
2015-01-08 00:00:00+00:00    205.92
Name: price, dtype: float64

Let's plot it to get a sense of what it looks like.


In [14]:
plt.plot(real_pricing)


Out[14]:
[<matplotlib.lines.Line2D at 0x7fa638a954d0>]

Now let's get the daily percent returns of the stock's price.


In [15]:
real_returns = real_pricing.pct_change()
plt.plot(real_returns.index, real_returns)


Out[15]:
[<matplotlib.lines.Line2D at 0x7fa638509610>]

In [16]:
# or you can plot the daily percent returns as a bar graph like this
plt.bar(real_returns.index,real_returns)


Out[16]:
<Container object of 61 artists>

Finally, let's compute the mean and standard deviation of the S&P 500 returns so far this year, as well as the Sharpe ratio.


In [17]:
np.mean(real_returns)


Out[17]:
0.00012180751193912783

In [18]:
np.std(real_returns)


Out[18]:
0.008769633527615702

In [19]:
sharpe_ratio = (np.mean(real_returns) * 252) / (np.std(real_returns) * np.sqrt(252))
sharpe_ratio


Out[19]:
0.22049203086693839
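
Note that the 252 in the numerator and the sqrt(252) in the denominator partially cancel, so the annualized Sharpe ratio reduces to sqrt(252) times the daily mean over the daily standard deviation. A minimal sketch, using simulated daily returns (the real_returns series above requires the research environment):

```python
import numpy as np

np.random.seed(0)
daily_returns = np.random.randn(252) * 0.01  # simulated daily returns

# Full form: annualized mean over annualized standard deviation
sharpe_full = (np.mean(daily_returns) * 252) / (np.std(daily_returns) * np.sqrt(252))

# Simplified form: the annualization factors partially cancel, leaving sqrt(252)
sharpe_simple = np.sqrt(252) * np.mean(daily_returns) / np.std(daily_returns)

print(np.isclose(sharpe_full, sharpe_simple))  # the two forms agree
```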

In [20]:
compounded_returns_series = np.cumprod(1+real_returns) - 1

In [22]:
# if you compare this to the plot of the actual SPY above, plt.plot(real_pricing)
# you'll notice it's the same shape except normalized to a starting value of 100
(100 * compounded_returns_series).plot()


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa63893b1d0>
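
The cumulative product reproduces the price shape because compounding the percent changes recovers each price relative to its start. A small sketch with a hypothetical price series (made-up numbers, not market data):

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 102.0, 99.0, 105.0, 103.0])  # hypothetical prices
returns = prices.pct_change().dropna()

# Compounding the returns gives price relative to the start, minus one
compounded = np.cumprod(1 + returns) - 1

# Scaling back up by the starting price recovers the original series
recovered = prices.iloc[0] * (1 + compounded)

print(np.allclose(recovered.values, prices.iloc[1:].values))  # True
```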

Now we'll explore something more interesting.

A tutorial on Markowitz portfolio optimization in Python using cvxopt

Authors: Dr. Thomas Starke, David Edwards, Dr. Thomas Wiecki

About the author:

Today's blog post is written in collaboration with Dr. Thomas Starke. It is based on a longer whitepaper by Thomas Starke on the relationship between Markowitz portfolio optimization and Kelly optimization. The full whitepaper can be found here.

Introduction

In this blog post you will learn about the basic idea behind Markowitz portfolio optimization as well as how to do it in Python. We will then show how you can create a simple backtest that rebalances its portfolio in a Markowitz-optimal way. We hope you enjoy it and get a little more enlightened in the process.

In this post we will use random data rather than actual stock data, which will hopefully help you to get a sense of how to use modelling and simulation to improve your understanding of the theoretical concepts. Don't forget that the skill of an algo-trader is to put mathematical models into code, and this example is great practice.

As a final note, we would like to point out that, in our experience, many readers of these articles have an IT background but not always an in-depth mathematical one. For that reason we try not to skip too many steps, to increase the educational value. So let's start by importing a few modules that we will need later, and produce a series of normally distributed returns. cvxopt is a convex solver which you can easily download with sudo pip install cvxopt (it is already available on the Quantopian research platform).

Simulations


In [23]:
import cvxopt as opt
from cvxopt import blas, solvers
import pandas as pd

np.random.seed(123)

# Turn off progress printing 
solvers.options['show_progress'] = False

Assume that we have 4 assets, each with a return series of length 1000. We can use numpy.random.randn to sample returns from a normal distribution.


In [24]:
## NUMBER OF ASSETS
n_assets = 4

## NUMBER OF OBSERVATIONS
n_obs = 1000

return_vec = np.random.randn(n_assets, n_obs)

In [25]:
plt.plot(return_vec.T, alpha=.4);
plt.xlabel('time')
plt.ylabel('returns')


Out[25]:
<matplotlib.text.Text at 0x7fa6389c9b10>

These return series can be used to create a wide range of portfolios, which all have different returns and risks (standard deviation). We can produce a wide range of random weight vectors and plot those portfolios. As we want all of our capital to be invested, this vector will have to sum to one.


In [26]:
def rand_weights(n):
    ''' Produces n random weights that sum to 1 '''
    k = np.random.rand(n)
    return k / sum(k)

print rand_weights(n_assets)
print rand_weights(n_assets)


[ 0.54066805  0.2360283   0.11660484  0.1066988 ]
[ 0.27638339  0.03006307  0.47850085  0.21505269]

Next, let's evaluate how a collection of these random portfolios would perform. To do this we calculate the mean return as well as the volatility (here we use standard deviation) of each. You will also notice a filter that only allows plotting portfolios with a standard deviation of < 2, for better illustration.


In [27]:
def random_portfolio(returns):
    ''' 
    Returns the mean and standard deviation of returns for a random portfolio
    '''

    p = np.asmatrix(np.mean(returns, axis=1))
    w = np.asmatrix(rand_weights(returns.shape[0]))
    C = np.asmatrix(np.cov(returns))
    
    mu = w * p.T
    sigma = np.sqrt(w * C * w.T)
    
    # This recursion reduces outliers to keep plots pretty
    if sigma > 2:
        return random_portfolio(returns)
    return mu, sigma

In the code you will notice the calculation of the return with:

$$ R = p^T w $$

where $R$ is the expected return, $p^T$ is the transpose of the vector of mean returns for each time series, and $w$ is the weight vector of the portfolio. $p$ is an Nx1 column vector, so $p^T$ turns into a 1xN row vector which can be multiplied with the Nx1 weight (column) vector $w$ to give a scalar result. This is equivalent to the dot product used in the code. Keep in mind that Python has a reversed definition of rows and columns here, so the accurate NumPy version of the previous equation is R = w * p.T.
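
As a sanity check on the shapes, here is a minimal sketch (with made-up numbers) showing that the matrix product w * p.T on np.asmatrix objects equals the ordinary dot product of the two vectors:

```python
import numpy as np

mean_returns = np.asmatrix([0.01, 0.02, 0.015, 0.005])  # 1xN row of mean returns
w = np.asmatrix([0.25, 0.25, 0.25, 0.25])               # 1xN weight vector

# Matrix form used in random_portfolio: (1xN) * (Nx1) -> 1x1 matrix
mu = w * mean_returns.T

# Equivalent plain dot product of the two vectors
mu_dot = np.dot([0.25, 0.25, 0.25, 0.25], [0.01, 0.02, 0.015, 0.005])

print(float(mu), mu_dot)  # both 0.0125
```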

Next, we calculate the standard deviation with

$$\sigma = \sqrt{w^T C w}$$

where $C$ is the covariance matrix of the returns, which is an NxN matrix. Please note that if we simply calculated the standard deviation with the appropriate weighting using std(array(ret_vec).T*w) we would get a slightly different 'bullet'. This is because the simple standard deviation calculation would not take covariances into account. In the covariance matrix, the values on the diagonal represent the simple variances of each asset while the off-diagonals are the covariances between the assets. By using ordinary std() we effectively only regard the diagonal and miss the rest. A small but significant difference.
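
To see that the covariance form is the right one, here is a small sketch: the standard deviation of the actual portfolio return series (the weighted sum of asset returns in each period) matches sqrt(w C w^T) exactly, because that series carries the cross-asset covariance terms:

```python
import numpy as np

np.random.seed(1)
returns = np.random.randn(4, 1000)   # 4 assets, 1000 periods of simulated returns
w = np.array([0.4, 0.3, 0.2, 0.1])   # example portfolio weights

# Portfolio return in each period: weighted sum across assets
portfolio_returns = w.dot(returns)

# Covariance form: sqrt(w C w^T), with ddof=1 to match np.cov's default
C = np.cov(returns)
sigma_cov = np.sqrt(w.dot(C).dot(w))

# Standard deviation of the actual portfolio return series (same ddof)
sigma_series = np.std(portfolio_returns, ddof=1)

print(np.isclose(sigma_cov, sigma_series))  # True: the two agree
```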

Lets generate the mean returns and volatility for 500 random portfolios:


In [28]:
n_portfolios = 500
means, stds = np.column_stack([
    random_portfolio(return_vec) 
    for _ in xrange(n_portfolios)
])

Upon plotting those you will observe that they form a characteristic parabolic shape called the 'Markowitz bullet', with the boundary being called the 'efficient frontier', where we have the lowest variance for a given expected return.


In [29]:
plt.plot(stds, means, 'o', markersize=5)
plt.xlabel('std')
plt.ylabel('mean')
plt.title('Mean and standard deviation of returns of randomly generated portfolios')


Out[29]:
<matplotlib.text.Text at 0x7fa6382c5750>

Markowitz optimization and the Efficient Frontier

Once we have a good representation of our portfolios as the blue dots show we can calculate the efficient frontier Markowitz-style. This is done by minimising

$$ w^T C w$$

for $w$ on the expected portfolio return $R^T w$ whilst keeping the sum of all the weights equal to 1:

$$ \sum_{i}{w_i} = 1 $$

Here we parametrically run through $R^T w = \mu$ and find the minimum variance for different $\mu$'s. This can be done with scipy.optimize.minimize, but we would have to define quite a complex problem with bounds, constraints and a Lagrange multiplier. Conveniently, the cvxopt package, a convex solver, does all of that for us. We used one of their examples with some modifications as shown below. You will notice that there are some conditioning expressions in the code. They are simply needed to set up the problem. For more information please have a look at the cvxopt example.

The mus vector produces a series of expected return values $\mu$ in a non-linear and more appropriate way. We will see later that we don't need to calculate a lot of these as they perfectly fit a parabola, which can safely be extrapolated for higher values.
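
The expression 10**(5.0 * t/N - 1.0) is simply a log-spaced grid of target values. A quick sketch, assuming N = 100 as in the function that follows:

```python
import numpy as np

N = 100
# The list comprehension from optimal_portfolio
mus = [10 ** (5.0 * t / N - 1.0) for t in range(N)]

# Equivalent log-spaced grid: exponents run from -1 up to (but excluding) 4
mus_np = np.logspace(-1, 4, num=N, endpoint=False)

print(np.allclose(mus, mus_np))  # True
```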


In [30]:
def optimal_portfolio(returns):
    n = len(returns)
    returns = np.asmatrix(returns)
    
    N = 100
    mus = [10**(5.0 * t/N - 1.0) for t in range(N)]
    
    # Convert to cvxopt matrices
    S = opt.matrix(np.cov(returns))
    pbar = opt.matrix(np.mean(returns, axis=1))
    
    # Create constraint matrices
    G = -opt.matrix(np.eye(n))   # negative n x n identity matrix
    h = opt.matrix(0.0, (n ,1))
    A = opt.matrix(1.0, (1, n))
    b = opt.matrix(1.0)
    
    # Calculate efficient frontier weights using quadratic programming
    portfolios = [solvers.qp(mu*S, -pbar, G, h, A, b)['x'] 
                  for mu in mus]
    ## CALCULATE RISKS AND RETURNS FOR FRONTIER
    returns = [blas.dot(pbar, x) for x in portfolios]
    risks = [np.sqrt(blas.dot(x, S*x)) for x in portfolios]
    ## CALCULATE THE 2ND DEGREE POLYNOMIAL OF THE FRONTIER CURVE
    m1 = np.polyfit(returns, risks, 2)
    x1 = np.sqrt(m1[2] / m1[0])
    # CALCULATE THE OPTIMAL PORTFOLIO
    wt = solvers.qp(opt.matrix(x1 * S), -pbar, G, h, A, b)['x']
    return np.asarray(wt), returns, risks

weights, returns, risks = optimal_portfolio(return_vec)

plt.plot(stds, means, 'o')
plt.ylabel('mean')
plt.xlabel('std')
plt.plot(risks, returns, 'y-o')


Out[30]:
[<matplotlib.lines.Line2D at 0x7fa638a42450>]

In yellow you can see the optimal portfolios for each of the desired returns (i.e. the mus). In addition, we get the one optimal portfolio returned:


In [31]:
print weights


[[  2.77880107e-09]
 [  3.20322848e-06]
 [  1.54301198e-06]
 [  9.99995251e-01]]

Backtesting on real market data

This is all very interesting but not very applied. We next demonstrate how you can create a simple algorithm in zipline -- the open-source backtester that powers Quantopian -- to test this optimization on actual historical stock data.

First, lets load in some historical data using Quantopian's get_pricing().


In [32]:
data = get_pricing(['IBM', 'GLD', 'XOM', 'AAPL', 
                    'MSFT', 'TLT', 'SHY'],
                     start_date='2005-06-07', end_date='2014-01-27')



In [33]:
data.loc['price', :, :].plot(figsize=(8,5))
plt.ylabel('price in $')


Out[33]:
<matplotlib.text.Text at 0x7fa638a32c50>

Next, we'll create a zipline algorithm by defining two functions -- initialize() which is called once before the simulation starts, and handle_data() which is called for every trading bar. We then instantiate the algorithm object.

If you are confused about the syntax of zipline, check out the tutorial.


In [34]:
import zipline
from zipline.api import (add_history, 
                         history, 
                         set_slippage, 
                         slippage,
                         set_commission, 
                         commission, 
                         order_target_percent)

from zipline import TradingAlgorithm


def initialize(context):
    '''
    Called once at the very beginning of a backtest (and live trading). 
    Use this method to set up any bookkeeping variables.
    
    The context object is passed to all the other methods in your algorithm.

    Parameters

    context: An initialized and empty Python dictionary that has been 
             augmented so that properties can be accessed using dot 
             notation as well as the traditional bracket notation.
    
    Returns None
    '''
    # Register history container to keep a window of the last 100 prices.
    add_history(100, '1d', 'price')
    # Turn off the slippage model
    set_slippage(slippage.FixedSlippage(spread=0.0))
    # Set the commission model (Interactive Brokers Commission)
    set_commission(commission.PerShare(cost=0.01, min_trade_cost=1.0))
    context.tick = 0
    
def handle_data(context, data):
    '''
    Called when a market event occurs for any of the algorithm's 
    securities. 

    Parameters

    data: A dictionary keyed by security id containing the current 
          state of the securities in the algo's universe.

    context: The same context object from the initialize function.
             Stores the up to date portfolio as well as any state 
             variables defined.

    Returns None
    '''
    # Allow history to accumulate 100 days of prices before trading
    # and rebalance every day thereafter.
    context.tick += 1
    if context.tick < 100:
        return
    # Get rolling window of past prices and compute returns
    prices = history(100, '1d', 'price').dropna()
    returns = prices.pct_change().dropna()
    try:
        # Perform Markowitz-style portfolio optimization
        weights, _, _ = optimal_portfolio(returns.T)
        # Rebalance portfolio accordingly
        for stock, weight in zip(prices.columns, weights):
            order_target_percent(stock, weight)
    except ValueError as e:
        # Sometimes this error is thrown
        # ValueError: Rank(A) < p or Rank([P; A; G]) < n
        pass
        
# Instantiate algorithm
algo = TradingAlgorithm(initialize=initialize, 
                        handle_data=handle_data)
# Run algorithm
results = algo.run(data.swapaxes(2, 0, 1))
results.portfolio_value.plot()


[2015-04-07 21:35:18.641456] INFO: Performance: Simulated 2175 trading days out of 2175.
[2015-04-07 21:35:18.642022] INFO: Performance: first open: 2005-06-07 13:31:00+00:00
[2015-04-07 21:35:18.642446] INFO: Performance: last close: 2014-01-27 21:00:00+00:00
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa62b386bd0>

As you can see, the performance here is pretty good, even through the 2008 financial crisis. This is most likely due to our universe selection and shouldn't always be expected. Increasing the number of stocks in the universe might reduce the volatility as well. Please let us know in the comments section if you had any success with this strategy and how many stocks you used.

Conclusions

In this blog, co-written by Quantopian friend Dr. Thomas Starke, we wanted to provide an intuitive and gentle introduction to Markowitz portfolio optimization, which still remains relevant today. By simulating various random portfolios we have seen that certain portfolios perform better than others. Convex optimization using cvxopt allowed us to then numerically determine the portfolios that live on the efficient frontier. The zipline backtest serves as an example but also shows compelling performance.

Next steps

  • Clone this notebook in the Quantopian Research Platform and run it on your own to see if you can enhance the performance.
  • You can also download just the notebook for use in your own environment here.
  • In a future blog post we will outline the connections to Kelly optimization which also tells us the amount of leverage to use.
  • We are currently in the process of adding cvxopt to the Quantopian backtester -- stay tuned!

In [ ]: