From the Wikipedia, we see that linear regression can be expressed as: $$ y = \alpha + \beta x $$ where: $$ \beta = \frac{\overline{xy} -\bar x \bar y}{\overline{x^2} - \bar{x}^2}=\frac{\mathrm{Cov}[x,y]}{\mathrm{Var}[x]} $$ and $\alpha=\overline y - \beta \bar x$
We first import the basic modules and generate some data with noise.
In [6]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [8]:
x = np.arange(10.)
y = 5*x+3
np.random.seed(3)
y+= np.random.normal(scale=10,size=x.size)
plt.scatter(x,y);
In [6]:
def lin_reg(x,y):
"""
Perform a linear regression of x vs y.
x, y are 1 dimensional numpy arrays
returns alpha and beta for the model y = alpha + beta*x
"""
beta = np.mean(x*y)-np.mean(x)*np.mean(y)
#finish...
lin_reg(x,y)
Out[6]:
We could also implement it with the numpy covariance function. The diagonal terms represent the variance.
In [9]:
def lin_reg2(x,y):
"""
Perform a linear regression of x vs y. Uses covariances.
x, y are 1 dimensional numpy arrays
returns alpha and beta for the model y = alpha + beta*x
"""
c = np.cov(x,y)
#finish...
lin_reg2(x,y)
Out[9]:
The previous methods only works for single variables. We could generalize it if we code it as a least square problem: $$ \bf y = \bf A \boldsymbol \beta $$ Remark that $\bf A$ is $\bf X$ with an extra column to represent independent term, previously called $\alpha$, that now corresponds to $\beta_{N+1}$. $$ \bf A = \left[\bf X , \bf 1 \right] $$
In [10]:
def lin_reg3(x,y):
"""
Perform a linear regression of x vs y. Uses least squares.
x, y are 1 dimensional numpy arrays
returns alpha and beta for the model y = alpha + beta*x
"""
#finish...
lin_reg3(x,y)
Out[10]:
As usual, for tasks as common as a linear regression, there are already implemented solutions in several packages. In numpy, we can use polyfit, which can fit polinomial of degree $N$.
In [11]:
#finish...
Out[11]:
scipy has a statistics module that returns the fit and the correlation coefficient:
In [12]:
import scipy.stats as stats
In [13]:
#finish
Out[13]:
The most powerful module for doing data analysis, and Machine Learning is scikit-learn. There is a good documentation on linear models
In [63]:
from sklearn import linear_model
In [64]:
#Finish
In [11]:
x = np.arange(10.)
y = 5*x+3
np.random.seed(3)
y+= np.random.normal(scale=10,size=x.size)
plt.scatter(x,y);
In [ ]: