Linear Regression

Load Data

We load the data from Quandl (http://www.quandl.com):

  • London gold index (LBMA/GOLD)
  • London silver index (LBMA/SILVER)
  • a third series, stored as copper in the code (CHRIS/CME_SI3)

In [1]:
import sys
from os import path

import numpy as np
import quandl

# Make the local regression helpers importable
sys.path.append(path.abspath('../src/regression'))
from linear_regression import *
%matplotlib inline

# Fetch the London gold and silver prices (and a third series stored as copper) since 2015-01-01
gold = quandl.get("LBMA/GOLD", returns="numpy", start_date="2015-01-01")
silver = quandl.get("LBMA/SILVER", returns="numpy", start_date="2015-01-01")
copper = quandl.get("CHRIS/CME_SI3", returns="numpy", start_date="2015-01-01")



Format Data

We downloaded the daily London gold and silver prices in dollars since 2015-01-01. For each day (x_gold or x_silver), we have the gold price (y_gold) and the silver price (y_silver) in dollars.

We visualize the gold and silver prices over time (left plot) and how the gold price depends on the silver price (right plot).


In [ ]:
# Retrieve gold and silver values in $ by day
XY_gold = stock_arr_to_XY(gold)
XY_silver = stock_arr_to_XY(silver)
XY_copper = stock_arr_to_XY(copper)

# Filter the arrays so that gold, silver and copper share the same Xs
XY_gold, XY_silver = filter_on_same_X(XY_gold, XY_silver)
XY_gold, XY_copper = filter_on_same_X(XY_gold, XY_copper)
XY_copper, XY_silver = filter_on_same_X(XY_copper, XY_silver)

x_gold, y_gold = XY_gold
x_silver, y_silver = XY_silver
x_copper, y_copper = XY_copper

# Plot the data
plot_data(XY_silver, XY_gold, 'silver', 'gold')
plot_data(XY_copper, XY_gold, 'copper', 'gold')
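
The filtering step above keeps only the days on which every series has a value. The helpers stock_arr_to_XY and filter_on_same_X live in the local linear_regression module and are not shown here, so the following is only a minimal sketch of how such an alignment could be done with NumPy. The function name align_on_shared_x and the toy values are hypothetical, and the real helper may behave differently.

```python
import numpy as np

def align_on_shared_x(xy_a, xy_b):
    """Keep only the points whose x value appears in both series.

    Hypothetical stand-in for filter_on_same_X; the real helper may differ.
    Assumes each x array holds unique values.
    """
    (x_a, y_a), (x_b, y_b) = xy_a, xy_b
    x_a, y_a = np.asarray(x_a), np.asarray(y_a)
    x_b, y_b = np.asarray(x_b), np.asarray(y_b)
    # intersect1d returns the sorted common values and their positions in each input
    shared, idx_a, idx_b = np.intersect1d(x_a, x_b, return_indices=True)
    return (shared, y_a[idx_a]), (shared, y_b[idx_b])

# Toy data (not market data): days as integers, prices as floats
gold_toy = (np.array([1, 2, 3, 5]), np.array([1200.0, 1210.0, 1190.0, 1220.0]))
silver_toy = (np.array([2, 3, 4, 5]), np.array([16.0, 15.5, 15.8, 16.2]))

gold_aligned, silver_aligned = align_on_shared_x(gold_toy, silver_toy)
print(gold_aligned[0])    # days present in both series
print(gold_aligned[1])    # gold prices on those days
print(silver_aligned[1])  # silver prices on those days
```

Once the two series share the same days, the i-th silver price and the i-th gold price refer to the same date, which is what the regression needs.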

Linear Regression

Train Linear Regression On Gold knowing Silver

(If in doubt, see http://cs229.stanford.edu/notes/cs229-notes1.pdf)

We want to know whether the gold and silver prices are correlated. We hypothesize that the price of gold (y_gold, or $Y$) is a linear function of the price of silver (y_silver, or $X$):

  • $P = wX + b$
    where $w$ (the weight) and $b$ (the bias) are the parameters of our linear hypothesis (or prediction).

We set the weight and bias to some initial values and compute how far our predictions $P$ are from the true gold price $Y$. There are many distances we can choose from, one of which is the sum of squares (L2 distance):

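  • $L = \sum_i \frac{1}{2}(y^{(i)} - p^{(i)})^2$
    ($L$ stands for loss; it is also called the cost function, and $i$ indexes the days). The code below averages over the samples instead of summing, which rescales the loss without moving its minimum.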
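
To make the hypothesis and the loss concrete, here is a small worked example on toy numbers (not market data). The function names predict and l2_loss are illustrative and are not part of the notebook's helper module.

```python
import numpy as np

# Toy values: the targets follow Y = 2*X + 1 exactly
X = np.array([1.0, 2.0, 3.0])   # plays the role of the silver price
Y = np.array([3.0, 5.0, 7.0])   # plays the role of the gold price

def predict(X, w, b):
    """Linear hypothesis P = w*X + b."""
    return w * X + b

def l2_loss(Y, P):
    """Sum over samples of 1/2 * (y - p)^2."""
    return np.sum(0.5 * (Y - P) ** 2)

# Loss for an arbitrary starting guess and for the true parameters
loss_guess = l2_loss(Y, predict(X, 0.5, 0.5))
loss_exact = l2_loss(Y, predict(X, 2.0, 1.0))
print(loss_guess)  # 20.625
print(loss_exact)  # 0.0
```

With $w = 0.5$ and $b = 0.5$ the predictions are 1.0, 1.5 and 2.0, so the errors are 2, 3.5 and 5 and the loss is $(4 + 12.25 + 25)/2 = 20.625$. The loss drops to zero only when the parameters reproduce the data exactly.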

In [ ]:
# Rename y_silver to X and y_gold to Y
X, Y = [np.array(y_silver), ], np.array(y_gold)

# Initialize the parameters: Ws[0] is the weight w, Ws[1] is the bias b
Ws = [0.5, 0.5]
alphas = (0.0001, 0.01)

# Load Trainer
t = Trainer(X, Y, Ws, alphas)

# Define Prediction and Loss
t.pred = lambda X : np.multiply(X[0], t.Ws[0]) + t.Ws[1]
t.loss = lambda : (np.power((t.Y - t.pred(X)), 2) * 1 / 2.).mean()

# Define the gradient functions
dl_dp = lambda : -(t.Y - t.pred(X))
dl_dw0 = lambda : np.multiply(dl_dp(), X[0]).mean()
dl_dw1 = lambda : dl_dp().mean()
t.dWs = (dl_dw0, dl_dw1)

# Start training
anim = t.animated_train(is_notebook=True)

# Show it
from IPython.display import HTML
HTML(anim.to_html5_video())
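
The gradient functions in the cell above follow from the chain rule: for one sample, $\partial L / \partial p = -(y - p)$, and since $p = wx + b$ we get $\partial L / \partial w = -(y - p)\,x$ and $\partial L / \partial b = -(y - p)$. The Trainer class is not shown in this notebook, so the loop below is only a sketch of plain gradient descent using the same gradients, on toy data and with a single learning rate chosen for that toy problem. It is not necessarily what animated_train does internally.

```python
import numpy as np

# Toy data generated from y = 2*x + 1 (not market data)
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X + 1.0

w, b = 0.5, 0.5   # same starting values as Ws in the cell above
alpha = 0.05      # learning rate picked for this toy problem (an assumption)

def loss(w, b):
    """Mean over samples of 1/2 * (y - p)^2, as in t.loss."""
    return (0.5 * (Y - (w * X + b)) ** 2).mean()

initial_loss = loss(w, b)
for step in range(2000):
    dl_dp = -(Y - (w * X + b))    # derivative of each sample's loss term w.r.t. its prediction
    grad_w = (dl_dp * X).mean()   # chain rule: dP/dw = X
    grad_b = dl_dp.mean()         # chain rule: dP/db = 1
    w -= alpha * grad_w           # step against the gradient
    b -= alpha * grad_b
final_loss = loss(w, b)

print("w = %f, b = %f" % (w, b))
print("loss went from %f to %f" % (initial_loss, final_loss))
```

On this noiseless toy data the parameters approach $w = 2$ and $b = 1$. On real prices, which are larger and not centered, the two parameters usually need very different step sizes, which may be why the cell above passes a pair of alphas.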

In [ ]:
print("Final loss is %f" % t.loss())