By Evgenia "Jenny" Nitishinskaya, Delaney Granizo-Mackenzie, and Maxwell Margenot.
Part of the Quantopian Lecture Series.
Notebook released under the Creative Commons Attribution 4.0 License.
Fundamentals are data having to do with the asset's issuer, such as the sector, size, and expenses of the company. We can use this data to build a linear factor model, expressing the returns on any asset as
$$R_t = a_t + b_{t1} F_1 + b_{t2} F_2 + \ldots + b_{tK} F_K + \epsilon_t$$

There are two different approaches to computing the factors $F_j$, which represent the returns associated with some fundamental characteristics, and the factor sensitivities $b_{tj}$.
In the first, we start by representing each characteristic of interest by a portfolio: we sort all assets by that characteristic, then build the portfolio by going long the top quantile of assets and short the bottom quantile. The factor corresponding to this characteristic is the return on this portfolio. Then the sensitivities $b_{ij}$ are estimated for each asset $i$ by regressing its historical returns $R_i$ on the historical factor returns.
We'll use the canonical Fama-French factors for this example, which are the returns of portfolios constructed based on fundamental characteristics.
We start by getting the fundamentals data for all assets and constructing the portfolios for each characteristic:
Import some libraries.
In [1]:
import numpy as np
import pandas as pd
from quantopian.pipeline.data import morningstar
import statsmodels.api as sm
from statsmodels import regression
import matplotlib.pyplot as plt
import scipy.stats
Set the date range for which we want data.
In [2]:
start_date = '2011-1-1'
end_date = '2012-1-1'
The Pipeline API is a very useful tool for factor analysis. We use it here to get the data for our analysis: specifically, the daily values of book-to-price ratio and market cap for every security. We also perform several other useful filtering steps, which are detailed in the code comments.
In [3]:
import numpy as np
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import CustomFactor, Returns
# Here's the raw data we need, everything else is derivative.
class MarketCap(CustomFactor):
    # Here's the data we need for this factor
    inputs = [morningstar.valuation.shares_outstanding, USEquityPricing.close]
    # Only need the most recent values for both series
    window_length = 1

    def compute(self, today, assets, out, shares, close_price):
        # Shares * price/share = total price = market cap
        out[:] = shares * close_price

class BookToPrice(CustomFactor):
    # pb = price to book, we'll need to take the reciprocal later
    inputs = [morningstar.valuation_ratios.pb_ratio]
    window_length = 1

    def compute(self, today, assets, out, pb):
        out[:] = 1 / pb

def make_pipeline():
    """
    Create and return our pipeline.

    We break this piece of logic out into its own function to make it easier to
    test and modify in isolation. In particular, this function can be
    copy/pasted into research and run by itself.
    """
    pipe = Pipeline()

    # Add our factors to the pipeline
    market_cap = MarketCap()
    # Raw market cap and book to price data gets fed in here
    pipe.add(market_cap, "market_cap")
    book_to_price = BookToPrice()
    pipe.add(book_to_price, "book_to_price")

    # We also get daily returns
    returns = Returns(inputs=[USEquityPricing.close], window_length=2)
    pipe.add(returns, "returns")

    # We compute a daily rank of both factors, this is used in the next step,
    # which is computing portfolio membership.
    market_cap_rank = market_cap.rank()
    pipe.add(market_cap_rank, 'market_cap_rank')
    book_to_price_rank = book_to_price.rank()
    pipe.add(book_to_price_rank, 'book_to_price_rank')

    # Build Filters representing the top and bottom 1000 stocks by each ranking.
    biggest = market_cap_rank.top(1000)
    smallest = market_cap_rank.bottom(1000)
    highpb = book_to_price_rank.top(1000)
    lowpb = book_to_price_rank.bottom(1000)

    # Don't return anything not in this set, as we don't need it.
    pipe.set_screen(biggest | smallest | highpb | lowpb)

    # Add the boolean flags we computed to the output data
    pipe.add(biggest, 'biggest')
    pipe.add(smallest, 'smallest')
    pipe.add(highpb, 'highpb')
    pipe.add(lowpb, 'lowpb')

    return pipe
Now we initialize the pipeline.
In [4]:
pipe = make_pipeline()
We can visualize the dependency graph of our data computations here.
In [5]:
pipe.show_graph('png')
Out[5]:
This function will allow us to run the pipeline.
In [6]:
from quantopian.research import run_pipeline
Now let's actually run it and check out our results.
In [7]:
# This takes a few minutes.
results = run_pipeline(pipe, start_date, end_date)
results
Out[7]:
Great, we have all the data. Now we need to compute the returns of our portfolios over time. We have the daily returns for each equity, plus whether or not that equity was included in any given portfolio on any given day. We can combine that information in the following way to yield daily portfolio returns.
Step 1: Subset our results into only data belonging to our 'biggest' portfolio.
In [8]:
results[results.biggest]
Out[8]:
Step 2: Get returns.
In [9]:
results[results.biggest]['returns']
Out[9]:
Step 3: Group by day and take the mean. This is pretty deep into pandas logic, so if you don't understand it on a first pass, check out the pandas documentation for the functions used, especially groupby, which is very useful. Keep in mind that the index in our results is a MultiIndex rather than a regular Index, which can complicate things.
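To make the grouping concrete, here is a toy example, with made-up dates, tickers, and returns, of taking the level-0 mean of a MultiIndexed Series; this is exactly the operation we apply to the pipeline output below.

idx = pd.MultiIndex.from_tuples([('2011-01-03', 'A'), ('2011-01-03', 'B'),
                                 ('2011-01-04', 'A'), ('2011-01-04', 'B')],
                                names=['date', 'equity'])
toy_returns = pd.Series([0.01, -0.02, 0.03, 0.00], index=idx)  # fabricated values
# Mean return across all equities on each date
print toy_returns.groupby(level=0).mean()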
In [10]:
results[results.biggest]['returns'].groupby(level=0).mean()
Out[10]:
Now we run through this computation for each portfolio and construct the classic Fama-French factors: SMB (the return of small caps over big caps) and HML (the return of high book-to-price over low book-to-price).
In [11]:
R_biggest = results[results.biggest]['returns'].groupby(level=0).mean()
R_smallest = results[results.smallest]['returns'].groupby(level=0).mean()
R_highpb = results[results.highpb]['returns'].groupby(level=0).mean()
R_lowpb = results[results.lowpb]['returns'].groupby(level=0).mean()
SMB = R_smallest - R_biggest
HML = R_highpb - R_lowpb
What were the daily returns?
In [12]:
plt.plot(SMB.index, SMB.values)
plt.ylabel('Daily Percent Return')
plt.legend(['SMB Portfolio Returns']);
In [13]:
plt.plot(HML.index, HML.values)
plt.ylabel('Daily Percent Return')
plt.legend(['HML Portfolio Returns']);
And what would it look like to hold these portfolios over time?
In [14]:
plt.plot(SMB.index, np.cumprod(SMB.values+1))
plt.ylabel('Cumulative Return')
plt.legend(['SMB Portfolio Returns']);
The last data we need are the daily returns on the broad market.
In [15]:
M = get_pricing('SPY', start_date=start_date, end_date=end_date, fields='price').pct_change()[1:]
In [16]:
plt.plot(M.index, M.values)
plt.ylabel('Daily Percent Return')
plt.legend(['Market Portfolio Returns']);
In [17]:
# Get returns data for our portfolio
portfolio = get_pricing(['MSFT', 'AAPL', 'YHOO', 'FB', 'TSLA'],
                        fields='price', start_date=start_date, end_date=end_date).pct_change()[1:]
R = np.mean(portfolio, axis=1)
Put all the data into one dataframe for convenience.
In [18]:
# Define a constant to compute intercept
constant = pd.TimeSeries(np.ones(len(R.index)), index=R.index)

df = pd.DataFrame({'R': R,
                   'M': M,
                   'SMB': SMB,
                   'HML': HML,
                   'Constant': constant})
df = df.dropna()
Perform the regression. You'll notice that these are the sensitivities over an entire year. It can be valuable to look at the rolling sensitivities as well to determine how stable they are.
In [19]:
# Perform linear regression to get the coefficients in the model;
# include the constant column so the model is fit with an intercept
b1, b2, b3, a = regression.linear_model.OLS(df['R'], df[['M', 'SMB', 'HML', 'Constant']]).fit().params

# Print the coefficients from the linear regression
print 'Historical Sensitivities of portfolio returns to factors:\nMarket: %f\nMarket cap: %f\nB/P: %f' % (b1, b2, b3)
Let's perform a rolling regression to look at how the estimated sensitivities change over time.
In [20]:
model = pd.stats.ols.MovingOLS(y=df['R'], x=df[['M', 'SMB', 'HML']],
                               window_type='rolling',
                               window=100)
rolling_parameter_estimates = model.beta

rolling_parameter_estimates.plot();
plt.title('Computed Betas');
plt.legend(['Market Beta', 'SMB Beta', 'HML Beta', 'Intercept']);
Another approach is to normalize the factor values within each bar and see how predictive of that bar's returns they were. We do this by computing a normalized factor value $b_{aj}$ for each asset $a$ in the following way:
$$b_{aj} = \frac{F_{aj} - \mu_{F_j}}{\sigma_{F_j}}$$

$F_{aj}$ is the value of factor $j$ for asset $a$ during this bar, $\mu_{F_j}$ is the mean factor value across all assets, and $\sigma_{F_j}$ is the standard deviation of factor values over all assets. Notice that we are just computing a z-score to make asset-specific factor values comparable across different factors.
The exceptions to this formula are indicator variables, which are set to 1 for true and 0 for false. One example is industry membership: the exposure simply records whether or not the asset belongs to the industry.
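For an industry-membership indicator, the exposure is just

$$b_{aj} = \begin{cases} 1 & \text{if asset } a \text{ is in industry } j \\ 0 & \text{otherwise} \end{cases}$$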
After we calculate all of the normalized scores during bar $t$, we can estimate factor $j$'s return $F_{jt}$ using a cross-sectional regression (i.e. at each time step, we perform a regression using the equations for all of the assets). Specifically, once we have the returns $R_{at}$ and normalized factor exposures $b_{aj}$ for each asset, we construct the following model and estimate the $F_{jt}$s and $a_t$:
$$R_{at} = a_t + b_{a1}F_{1t} + b_{a2}F_{2t} + \dots + b_{aK}F_{Kt} + \epsilon_{at}$$

You can think of this as slicing through the other direction from the first analysis: now the factor returns are the unknowns to be solved for, whereas originally the sensitivities were. Another way to think about it is that you're determining how predictive of returns the factor was on that day, and therefore how much return you could have squeezed out of that factor.
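In matrix form, the regression at a single bar $t$ stacks the equations for all $N$ assets:

$$\begin{pmatrix} R_{1t} \\ \vdots \\ R_{Nt} \end{pmatrix} = \begin{pmatrix} 1 & b_{11} & \cdots & b_{1K} \\ \vdots & \vdots & & \vdots \\ 1 & b_{N1} & \cdots & b_{NK} \end{pmatrix} \begin{pmatrix} a_t \\ F_{1t} \\ \vdots \\ F_{Kt} \end{pmatrix} + \begin{pmatrix} \epsilon_{1t} \\ \vdots \\ \epsilon_{Nt} \end{pmatrix}$$

so each day's regression has $N$ observations and $K + 1$ unknowns.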
Following this procedure, we'll get the cross-sectional returns on 2011-01-03, and compute the coefficients for all assets:
In [21]:
BTP = results['book_to_price']['2011-1-3']
zscore = (BTP - np.mean(BTP)) / np.std(BTP)
zscore.dropna(inplace=True)
plt.hist(zscore)
plt.xlabel('Z-Score')
plt.ylabel('Frequency');
Notice the big outliers in the dataset, which cause the z-scores to lose a lot of information: the presence of a few huge book-to-price datapoints squeezes the rest of the data into a relatively small range of z-scores. We need to get around this issue using some data cleaning technique; here we use winsorization.
Winsorization takes the top $n\%$ of a dataset and sets each of those values equal to the least extreme value in the top $n\%$ (and likewise for the bottom $n\%$). For example, if your dataset ranged from 0-10, plus a few crazy outliers, those outliers would be set to 0 or 10 depending on their direction. Here is an example.
In [22]:
# Get some random data
X = np.random.normal(0, 1, 100)
# Put in some outliers
X[0] = 1000
X[1] = -1000
# Perform winsorization
print 'Before winsorization', np.min(X), np.max(X)
scipy.stats.mstats.winsorize(X, inplace=True, limits=0.01)
print 'After winsorization', np.min(X), np.max(X)
This looks good; let's see how our book-to-price data looks when winsorized.
In [23]:
BTP = results['book_to_price']['2011-1-3']
scipy.stats.mstats.winsorize(BTP, inplace=True, limits=0.01)
BTP_z = (BTP - np.mean(BTP)) / np.std(BTP)
BTP_z.dropna(inplace=True)
plt.hist(BTP_z)
plt.xlabel('Z-Score')
plt.ylabel('Frequency');
We need the returns for that day as well.
In [24]:
R_day = results['returns']['2011-1-3']
Now set up our data and estimate $F_j$ using linear regression.
In [25]:
constant = pd.TimeSeries(np.ones(len(R_day.index)), index=R_day.index)

df_day = pd.DataFrame({'R': R_day,
                       'BTP_z': BTP_z,
                       'Constant': constant})
df_day = df_day.dropna()

# Perform linear regression to get the coefficients in the model;
# include the constant column so we also estimate the intercept a_t
F1, a = regression.linear_model.OLS(df_day['R'], df_day[['BTP_z', 'Constant']]).fit().params
print F1
Finally, let's add another factor so you can see how the code changes.
In [26]:
MKT = results['market_cap']['2011-1-3']
scipy.stats.mstats.winsorize(MKT, inplace=True, limits=0.01)
MKT_z = (MKT - np.mean(MKT)) / np.std(MKT)

constant = pd.TimeSeries(np.ones(len(R_day.index)), index=R_day.index)

df_day = pd.DataFrame({'R': R_day,
                       'BTP_z': BTP_z,
                       'MKT_z': MKT_z,
                       'Constant': constant})
df_day = df_day.dropna()

# Perform linear regression to get the coefficients in the model;
# again including the constant column for the intercept
F1, F2, a = regression.linear_model.OLS(df_day['R'], df_day[['BTP_z', 'MKT_z', 'Constant']]).fit().params
print F1, F2
To expand this analysis, you would simply loop through days, running this regression every day to get a time series of estimated factor returns.
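As a rough illustration, here is a sketch of that loop, assuming the `results` DataFrame from the pipeline run above; it just repeats the single-day steps for every date in the sample and is illustrative rather than optimized.

factor_returns = {}
for day in results.index.levels[0]:
    R_day = results['returns'][day]
    BTP = results['book_to_price'][day]
    scipy.stats.mstats.winsorize(BTP, inplace=True, limits=0.01)
    BTP_z = (BTP - np.mean(BTP)) / np.std(BTP)
    MKT = results['market_cap'][day]
    scipy.stats.mstats.winsorize(MKT, inplace=True, limits=0.01)
    MKT_z = (MKT - np.mean(MKT)) / np.std(MKT)
    constant = pd.TimeSeries(np.ones(len(R_day.index)), index=R_day.index)
    df_day = pd.DataFrame({'R': R_day,
                           'BTP_z': BTP_z,
                           'MKT_z': MKT_z,
                           'Constant': constant}).dropna()
    # Each day's regression coefficients are that day's estimated factor
    # returns (plus the intercept)
    factor_returns[day] = regression.linear_model.OLS(
        df_day['R'], df_day[['BTP_z', 'MKT_z', 'Constant']]).fit().params

factor_returns = pd.DataFrame(factor_returns).T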
As discussed in the Arbitrage Pricing Theory lecture, factor modeling can be used to predict future returns based on current fundamental factors, or to determine when an asset may be mispriced. Modeling future returns is accomplished by offsetting the returns in the regression, so that rather than predicting current returns, you are predicting future returns. Once you have a predictive model, the most canonical way to create a strategy is to attempt a long-short equity approach.
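Before turning to strategy construction, here is a minimal sketch of that offset, assuming the factor DataFrame `df` built earlier in this notebook; we shift the return column back one bar so each row's factor values are regressed against the following day's return.

df_pred = df.copy()
df_pred['R_fwd'] = df_pred['R'].shift(-1)  # tomorrow's return on today's row
df_pred = df_pred.dropna()

predictive_model = regression.linear_model.OLS(
    df_pred['R_fwd'], df_pred[['M', 'SMB', 'HML', 'Constant']]).fit()
print predictive_model.params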
There is a full lecture describing long-short equity, but the general idea is that you rank equities based on their predicted future returns, then long the top $p\%$ and short the bottom $p\%$ while keeping the dollar amounts on each side equal (dollar neutral). If the assets at the top of the ranking tend to make $5\%$ more per year than the market on average, and assets at the bottom tend to make $5\%$ less, then you will make $(M + 0.05) - (M - 0.05) = 0.10$, or $10\%$, per year, where $M$ is the market return that gets canceled out.
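Here is a minimal sketch of forming such a portfolio from a ranking; the predicted returns below are randomly generated placeholders, not output from the model above.

predicted = pd.Series(np.random.normal(0, 0.01, 100),
                      index=['equity_%d' % i for i in range(100)])
p = 0.2                                # fraction of the universe to long/short
n = int(len(predicted) * p)
ranked = predicted.rank(ascending=False)

weights = pd.Series(0.0, index=predicted.index)
weights[ranked <= n] = 1.0 / n                   # long the top p%
weights[ranked > len(predicted) - n] = -1.0 / n  # short the bottom p%

# Equal dollars long and short, so the market return M cancels out
print 'Net exposure:', weights.sum()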
Once we've determined that we are exposed to a factor, we may want to avoid depending on the performance of that factor by taking out a hedge. This is discussed in the Beta Hedging lecture and also in the Risk Factor Exposure notebook.
This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by Quantopian, Inc. ("Quantopian"). Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, Quantopian, Inc. has not taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information, believed to be reliable, available to Quantopian, Inc. at the time of publication. Quantopian makes no guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.