By Christopher van Hoecke, Max Margenot, and Delaney Mackenzie
Part of the Quantopian Lecture Series: https://www.quantopian.com/lectures/linear-regression
This homework corresponds to the Linear Regression lecture in the Quantopian lecture series and expects you to rely heavily on the code presented there. Copy and paste freely from that lecture as you work through the problems; attempting them from scratch will likely be too difficult.
In [ ]:
# Useful Functions
def linreg(X,Y):
    # Running the linear regression
    X = sm.add_constant(X)
    model = regression.linear_model.OLS(Y, X).fit()
    a = model.params[0]
    b = model.params[1]
    X = X[:, 1]
    # Return summary of the regression and plot results
    X2 = np.linspace(X.min(), X.max(), 100)
    Y_hat = X2 * b + a
    plt.scatter(X, Y, alpha=0.3) # Plot the raw data
    plt.plot(X2, Y_hat, 'r', alpha=0.9); # Add the regression line, colored in red
    plt.xlabel('X Value')
    plt.ylabel('Y Value')
    return model.summary()
In [ ]:
# Useful Libraries
import math
import numpy as np
import matplotlib.pyplot as plt
from statsmodels import regression
from statsmodels.stats import diagnostic
import statsmodels.regression as smr
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breushpagan
import scipy as sp
import scipy.stats
import seaborn
In [ ]:
fahrenheit = [-868, -778, -688, -598, -508, -418, -328, -238, -144, -58, 32, 122, 212, 302, 392, 482,
572, 662, 752, 842, 932]
celsius = [-500, -450, -400, -350, -300, -250, -200, -150, -100, -50, 0, 50, 100, 150, 200, 250,
300, 350, 400, 450, 500]
## Your code goes here
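If you are unsure where to start, one possible sketch (assuming the exercise asks for a regression of the Fahrenheit readings on the Celsius readings) is:
In [ ]:
# Possible sketch: regress the Fahrenheit readings on the Celsius readings using
# the linreg helper defined above. The fitted intercept and slope should come out
# close to 32 and 1.8.
linreg(np.array(celsius), np.array(fahrenheit))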
In [ ]:
start = '2014-01-01'
end = '2015-01-01'
asset = get_pricing('KO', fields='price', start_date=start, end_date=end)
benchmark = get_pricing('PEP', fields='price', start_date=start, end_date=end)
returns1 = asset.pct_change()[1:]
returns2 = benchmark.pct_change()[1:]
## Your code goes here
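One possible sketch, assuming the exercise asks you to regress the asset's returns (KO) on the benchmark's returns (PEP):
In [ ]:
# Possible sketch: regress KO returns on PEP returns with the linreg helper.
linreg(returns2.values, returns1.values)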
Let's directly calculate the $95\%$ confidence intervals of our parameters. The formula for a given parameter is:
$$ CI = \left(\beta_i - z \cdot SE_{i,i}, \; \beta_i + z \cdot SE_{i,i}\right) $$
where $\beta_i$ is the coefficient, $z$ is the critical value (the t-statistic required to obtain a probability less than the alpha significance level), and $SE_{i,i}$ is the corresponding diagonal entry of the standard error matrix.
In [ ]:
start = '2014-01-01'
end = '2015-01-01'
asset = get_pricing('KO', fields='price', start_date=start, end_date=end)
benchmark = get_pricing('PEP', fields='price', start_date=start, end_date=end)
X = asset.pct_change()[1:]
Y = benchmark.pct_change()[1:]
result = sm.OLS(Y, sm.add_constant(X)).fit()
# Convert X to a matrix (adding a row of ones for the intercept)
X = np.vstack((X, np.ones( X.size ) ))
X = np.matrix( X )
# Matrix Multiplication and inverse calculation
C = np.linalg.inv( X * X.T )
C *= result.mse_resid
SE = np.sqrt(C) # Calculation of the standard error matrix
# Critical Values of the t-statistic
N = result.nobs
P = result.df_model
dof = N - P - 1
z = scipy.stats.t(dof).ppf(0.975)
i = 0
## Your code goes here
# Fetch the value of beta and its standard error from the SE matrix
beta = ## Your code goes here
c = ## Your code goes here
print ## Your code goes here
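One possible completion of the cell above, assuming the confidence interval is reported for the slope coefficient (because the data row was stacked first, SE[0, 0] is the standard error of the slope and SE[1, 1] that of the intercept):
In [ ]:
# Possible completion: a 95% confidence interval for the slope coefficient.
i = 0
beta = result.params[1]   # slope estimate (params[0] is the intercept)
c = SE[i, i]              # standard error of the slope from the diagonal of SE
print '95% confidence interval for the slope:', (beta - z * c, beta + z * c)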
$R^2$ measures how closely the data points lie to the regression line, and is defined as $$ R^2 = 1 - \frac{\sum \left(y_{predicted} - y_{actual}\right)^2}{\sum \left( y_{predicted} - \frac{\sum y_{actual}}{len(y_{actual})} \right)^2} $$ Given the information from Exercise 1, calculate the value of $R^2$ manually. You can start by expressing $f$ as a function of $c$ from the data obtained in Exercise 1 (these are the predicted values of $y$).
In [ ]:
# Create an empty numpy array (float values).
# Find the predicted value of f for every c in celsius (given by f = 32 + 1.8c)
fpred = np.array([])
f = [#________#
for a in celsius] ## Your code goes here (fill in the values of Beta, and X1)
ypredicted = np.append(f, fpred)
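One possible completion, using the conversion $f = 32 + 1.8c$ stated in the comment above:
In [ ]:
# Possible completion: predicted Fahrenheit value for every Celsius reading,
# appended to the empty fpred array defined above.
f = [32 + 1.8 * a for a in celsius]
ypredicted = np.append(f, fpred)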
Using the values of $y_{predicted}$ and $y_{actual}$, calculate the squared element-by-element difference of the two lists, and sum the results.
In [ ]:
# Calculate the difference between the predicted values of y and the actual values of y,
# Find the square of the difference
# Sum the Squares
ypred_yact = [#______#
for a, b in zip(ypredicted, fahrenheit)] ## your code goes here (a - b)
diff1squared = [#_______#
for a in ypred_yact] ## Your code goes here
sumsquares1 = sum(diff1squared) ## Your code goes here
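One possible completion of the cell above:
In [ ]:
# Possible completion: squared differences between predicted and actual values.
ypred_yact = [a - b for a, b in zip(ypredicted, fahrenheit)]  # element-by-element differences
diff1squared = [a ** 2 for a in ypred_yact]                   # squared differences
sumsquares1 = sum(diff1squared)                               # sum of the squares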
Next, calculate the mean of the predicted values, then the difference $y_{predicted} - mean$ for each element. Square the values in the resulting list and sum them.
In [ ]:
# Calculate the difference between the predicted values of y and the mean of y.
# Find the square of the difference
# Sum the Squares
mean = ## Your code goes here
ypred_mean = ## Your code goes here
ypred_meansquared = ## Your code goes here
sumsquares2 = ## Your code goes here
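One possible completion, following the instruction above and using the mean of the predicted values:
In [ ]:
# Possible completion: squared differences between the predicted values and their mean.
mean = np.mean(ypredicted)                        # mean of the predicted values
ypred_mean = [a - mean for a in ypredicted]       # element-by-element differences from the mean
ypred_meansquared = [a ** 2 for a in ypred_mean]  # squared differences
sumsquares2 = sum(ypred_meansquared)              # sum of the squares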
We can now calculate the R-squared by subtracting the ratio of the two sums from one.
In [ ]:
r = ## Your code goes here
print 'R-squared = ', r
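One possible completion:
In [ ]:
# Possible completion: R-squared is one minus the ratio of the two sums of squares.
r = 1 - sumsquares1 / sumsquares2
print 'R-squared = ', r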
Definition: In statistics, the residuals are the differences between the actual values and the predicted values:
$$e = y - \hat{y}$$
In [ ]:
asset1 = get_pricing('SPY',
fields='price',
start_date='2005-01-01',
end_date='2010-01-01')
asset2 = get_pricing('GS',
fields='price',
start_date='2005-01-01',
end_date='2010-01-01')
returns1 = asset1.pct_change()[1:]
returns2 = asset2.pct_change()[1:]
## Your code goes here
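If you get stuck, one possible sketch (assuming asset1's returns are the dependent variable, as suggested by the "Predicted values of asset1" comment in the cell below) is:
In [ ]:
# Possible sketch: regress SPY returns (asset1) on GS returns (asset2), so that
# the fitted model predicts asset1; the residuals are then e = y - y_hat.
results = sm.OLS(returns1, sm.add_constant(returns2)).fit()
print results.summary()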
Run the Breusch-Pagan test to check for heteroskedasticity in the residuals. The residuals of the model should have constant variance; the presence of heteroskedasticity would indicate that our choice of model is not optimal.
In [ ]:
lm, p_lm, fv, p_fv = ## Your code goes here
print 'p-value for f-statistic of the Breusch-Pagan test:', ## Your code goes here
print '===='
print "Since the p-value obtained is ______ than alpha (0.05), \
we ______ reject the null hypothesis of the Breusch-Pagan test"
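One possible completion, assuming the het_breushpagan function imported above and the regression's design matrix as the explanatory variables:
In [ ]:
# Possible completion: Breusch-Pagan test on the residuals of the model fit above.
lm, p_lm, fv, p_fv = het_breushpagan(results.resid, results.model.exog)
print 'p-value for f-statistic of the Breusch-Pagan test:', p_fv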
In [ ]:
# Predicted values of asset1
y = ## Your code goes here
plt.scatter(y, results.resid)
plt.title('Scatter plot of residuals against predicted values')
plt.xlabel('Predicted values')
plt.ylabel('Residuals');
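A possible completion for the predicted values (the scatter plot above can then be re-run):
In [ ]:
# Possible completion: the model's fitted (predicted) values for asset1.
y = results.fittedvalues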
In [ ]:
p1 = get_pricing('SPY', start_date = '2005-01-01',
end_date = '2010-01-01',
fields = 'price').pct_change()[1:]
p2 = get_pricing('XLF', start_date = '2005-01-01',
end_date = '2010-01-01',
fields = 'price').pct_change()[1:]
## Your code goes here
results2 = ## Your code goes here
y = ## Your code goes here
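One possible sketch, assuming p1 (SPY) is the dependent variable, mirroring the previous exercise:
In [ ]:
# Possible sketch: regress SPY returns on XLF returns and take the fitted values.
results2 = sm.OLS(p1, sm.add_constant(p2)).fit()
y = results2.fittedvalues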
In [ ]:
plt.scatter(y, results2.resid)
plt.title('Scatter plot of residuals against predicted values')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
lm, p_lm, fv, p_fv = ## Your code goes here
print 'p-value for f-statistic of the Breusch-Pagan test:', ## Your code goes here
print '===='
print "Since the p-value obtained is ____ than alpha (0.05), \
we ______ the null hypothesis of the Breusch-Pagan test"
While examining the residuals is a good way of checking the accuracy of our model choice, we must also check for heteroscedasticity (whether there are sub-populations with different variabilities from others). An assumption of the linear regression model is that there is no heteroscedasticity; if this assumption is broken, OLS estimators are no longer the Best Linear Unbiased Estimators.
Read more about heteroscedasticity here: https://en.wikipedia.org/wiki/Heteroscedasticity#Consequences
Congratulations on completing the Linear Regression exercise!
As you learn more about writing trading algorithms and the Quantopian platform, be sure to check out the Quantopian Daily Contest, in which you can compete for a cash prize every day.
Start by going through the Writing a Contest Algorithm Tutorial.
This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by Quantopian, Inc. ("Quantopian"). Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, Quantopian, Inc. has not taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information, believed to be reliable, available to Quantopian, Inc. at the time of publication. Quantopian makes no guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.