In [35]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
def LSCE(x, y):
    # least squares coefficient estimates for simple linear regression
    beta_1 = np.sum((x - np.mean(x)) * (y - np.mean(y))) / np.sum((x - np.mean(x)) * (x - np.mean(x)))
    beta_0 = np.mean(y) - beta_1 * np.mean(x)
    return beta_0, beta_1
advertising = pd.read_csv('Advertising.csv',index_col=0)
tv = advertising['TV']
sales = advertising['Sales']
beta_0, beta_1 = LSCE(tv,sales)
x = np.linspace(-10,310,1000)
y = beta_1 * x + beta_0
plt.scatter(tv, sales, marker='+')
plt.plot(x, y,c='k')
plt.xlim(-10,310)
plt.show()
In [45]:
beta_1 = 3
beta_0 = 2
random = np.random.normal(size=100, loc=0, scale=1)
X = np.linspace(-2,2,500)
X = np.random.choice(X, size=100, replace=False)
Y = X*beta_1+beta_0 +random
y_true = X*beta_1+beta_0
beta_0_, beta_1_ = LSCE(X, Y)
y_predict = X *beta_1_ + beta_0_
plt.scatter(X,Y)
plt.plot(X,y_true, c='g')
plt.plot(X, y_predict, c='r')
plt.show()
The difference between the population regression line and the least squares line may seem quite confusing. The distinction comes from using a sample to estimate the characteristics of a large population.
How accurate is the sample mean $\hat{\mu}$ as an estimate of $\mu$?
$$
Var(\hat{\mu}) = SE(\hat{\mu})^2=\frac{\sigma^2}{n}
$$
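As a quick sanity check of this formula, here is a minimal simulation sketch (the true $\sigma$, sample size, and number of repetitions below are arbitrary illustrative values) comparing the empirical variance of the sample mean with $\sigma^2/n$:
In [ ]:
# empirical variance of the sample mean vs sigma^2 / n (illustrative parameters)
rng = np.random.RandomState(0)
sigma, n_obs, trials = 2.0, 50, 10000
sample_means = rng.normal(loc=0.0, scale=sigma, size=(trials, n_obs)).mean(axis=1)
print('empirical Var(mu_hat):', sample_means.var())
print('theoretical sigma^2/n:', sigma**2 / n_obs)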
Similarly, the standard errors associated with $\hat{\beta}_0$ and $\hat{\beta}_1$ are computed as
$$
SE(\hat{\beta_0})^2 = \sigma^2\left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n(x_i-\bar{x})^2} \right],
SE(\hat{\beta_1})^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}
$$
where $\sigma^2=Var(\epsilon)$. In general $\sigma$ is not known, but it can be estimated from the data; this estimate is known as the residual standard error and is given by the formula $RSE=\sqrt{RSS/(n-2)}$.
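A sketch of these formulas applied to the Sales-on-TV fit (it recomputes the coefficients with LSCE so it does not depend on names overwritten by the simulation cell above):
In [ ]:
# residual standard error and coefficient standard errors for Sales ~ TV
b0, b1 = LSCE(tv, sales)
n = len(tv)
rss = np.sum((sales - (b0 + b1 * tv))**2)
rse = np.sqrt(rss / (n - 2))                     # estimate of sigma
sxx = np.sum((tv - np.mean(tv))**2)
se_b1 = np.sqrt(rse**2 / sxx)
se_b0 = np.sqrt(rse**2 * (1.0 / n + np.mean(tv)**2 / sxx))
print('RSE:', rse, 'SE(beta_0):', se_b0, 'SE(beta_1):', se_b1)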
The most common hypothesis test involves testing the null hypothesis $$H_0: \text{There is no relationship between } X \text{ and } Y$$ versus the alternative hypothesis $$H_a: \text{There is some relationship between } X \text{ and } Y.$$ We use the t-statistic given by $$t= \frac{\hat{\beta}_1-0}{SE(\hat{\beta}_1)},$$ which we expect to follow a $t$-distribution with $n-2$ degrees of freedom, assuming $\beta_1=0$. The probability of observing a value of $|t|$ this large or larger under that assumption is the $p$-value. A small $p$-value indicates that it is unlikely to observe such a substantial association between the predictor and the response by chance. The typical $p$-value cutoffs for rejecting the null hypothesis are 5% or 1%.
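A short sketch of this test for the Sales-on-TV regression, reusing b1, se_b1 and n from the sketch above (scipy is assumed to be available; it is not imported elsewhere in this notebook):
In [ ]:
# t-statistic and two-sided p-value for H_0: beta_1 = 0
from scipy import stats
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 2)
print('t:', t_stat, 'p-value:', p_value)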
The $R^2$ statistic provides an alternative measure of fit. It takes the form of a proportion, taking on a value between 0 and 1, and is independent of the scale of $Y$. $$R^2 = \frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS}$$ where $TSS=\sum(y_i-\bar{y})^2$ is the total sum of squares. Hence, $R^2$ measures the proportion of variability in $Y$ that can be explained using $X$.
Correlation $$ Cor(X,Y)=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}} $$ We can also use $r=Cor(X,Y)$ instead of $R^2$ to assess the fit of the linear model; in simple linear regression $R^2 = r^2$.
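A minimal sketch computing both quantities for Sales ~ TV (it reuses b0, b1 and rss from the standard-error sketch above):
In [ ]:
# R^2 and the correlation r; in simple regression R^2 equals r^2
tss = np.sum((sales - np.mean(sales))**2)
r_squared = 1 - rss / tss
r = np.corrcoef(tv, sales)[0, 1]
print('R^2:', r_squared, 'r^2:', r**2)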
Then the multiple linear regression model takes the form $$ Y = \beta_0 + \beta_1X_1 + \beta_2X_2+\ldots+\beta_pX_p+\epsilon $$ Given estimates $\hat{\beta}_0,\hat{\beta}_1,\ldots,\hat{\beta}_p$, we can make predictions using the formula $$ \hat{y}=\hat{\beta}_0+\hat{\beta}_1x_1+\ldots+\hat{\beta}_px_p $$ The residual sum of squares is $$ RSS = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2 $$ For the Advertising data we assume that $$Sales=\beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$$ We can rewrite this in matrix form as $$ \begin{bmatrix} Sales_1 \\ Sales_2 \\ \vdots \\ Sales_n \end{bmatrix} = \begin{bmatrix} tv_1 & radio_1 & newspaper_1 & 1 \\ tv_2 & radio_2 & newspaper_2 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ tv_n & radio_n & newspaper_n & 1 \\ \end{bmatrix} \times \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_0 \\ \end{bmatrix} $$ The least squares estimate then follows from the normal equations: $$y=X\beta \rightarrow \hat{\beta}=(X^TX)^{-1}X^Ty$$
In [69]:
# calculate the parameter
from numpy.linalg import inv
X = advertising[['TV','Radio','Newspaper']].values
Y = advertising['Sales'].values
X = np.hstack((X, np.full((len(Y),1), 1.0)))
beta = inv(X.T.dot(X)).dot(X.T).dot(Y)
print ('the parameters are: ',beta[0], beta[1], beta[2], beta[-1])
In [83]:
# calculate the correlation
# X = advertising[['TV', 'Radio','Newspaper','Sales']].values
# X_mean = np.mean(X,axis=0)
# X -= X_mean
# numerator =X.T.dot(X)
# XX = X*X
# XX = np.sum(XX, axis=0)
# denominator = np.sqrt((XX.T.dot(XX)))
# numerator/denominator
advertising.corr()
Out[83]:
We test the null hypothesis $$H_0:\beta_1=\beta_2=\ldots=\beta_p=0$$ versus the alternative hypothesis $$H_a: \text{at least one}\quad \beta_j \quad\text{is non-zero}$$ using the F-statistic $$F=\frac{(TSS-RSS)/p}{RSS/(n-p-1)}$$
If there is no relationship between the response and the predictors, the F-statistic takes on a value close to 1. On the other hand, if $H_a$ is true, we expect $F$ to be greater than 1.
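A sketch of the F-statistic for the multiple regression of Sales on TV, Radio and Newspaper, reusing X, Y and beta from the normal-equations cell above:
In [ ]:
# F-statistic for H_0: beta_1 = beta_2 = beta_3 = 0
n, p = len(Y), 3
rss = np.sum((Y - X.dot(beta))**2)
tss = np.sum((Y - np.mean(Y))**2)
F = ((tss - rss) / p) / (rss / (n - p - 1))
print('F-statistic:', F)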
For a qualitative predictor with two levels, we create a dummy variable (e.g. 1 if the ith person is female, 0 if male) and use this variable as the predictor in the regression equation $$ y_i = \beta_0+\beta_1x_i+\epsilon_i = \begin{cases} \beta_0+\beta_1+\epsilon_i & \text{if ith person is female} \\ \beta_0 + \epsilon_i & \text{if ith person is male} \\ \end{cases} $$
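A minimal sketch on simulated data (the dummy column and effect sizes below are made up for illustration): least squares on the 0/1 dummy recovers the two group means, with $\hat{\beta}_0$ the "male" mean and $\hat{\beta}_0 + \hat{\beta}_1$ the "female" mean.
In [ ]:
# dummy-variable regression on simulated data (illustrative values only)
rng = np.random.RandomState(1)
female = rng.randint(0, 2, size=200)            # 1 if "female", 0 if "male"
balance = 500.0 + 20.0 * female + rng.normal(scale=10.0, size=200)
b0_hat, b1_hat = LSCE(female, balance)
print('beta_0 (male mean):', b0_hat, 'beta_0 + beta_1 (female mean):', b0_hat + b1_hat)
print('group means:', balance[female == 0].mean(), balance[female == 1].mean())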
So far, we have assumed that the relationship between the predictors and the response is additive and linear.
One way of relaxing the additive assumption is to include an interaction term: $$ Y = \beta_0 + \beta_1X_1+\beta_2X_2+\beta_3X_1X_2+\epsilon=\beta_0+\tilde{\beta}_1X_1+\beta_2X_2+\epsilon, $$ where $\tilde{\beta}_1 = \beta_1 + \beta_3X_2$.
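As a sketch with the Advertising data (the choice of a TV x Radio interaction here is illustrative), the interaction simply becomes an extra column in the design matrix:
In [ ]:
# add a TV x Radio interaction column and solve the normal equations
Xi = advertising[['TV', 'Radio']].values
Xi = np.hstack((Xi, (Xi[:, 0] * Xi[:, 1]).reshape(-1, 1), np.ones((len(Xi), 1))))
beta_int = inv(Xi.T.dot(Xi)).dot(Xi.T).dot(advertising['Sales'].values)
print('beta_1 (TV), beta_2 (Radio), beta_3 (TV*Radio), beta_0:', beta_int)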
An interaction between a quantitative and a qualitative variable has a particularly nice interpretation. Without an interaction term, the model is $$ balance_i = \beta_0 +\beta_1\times income_i + \begin{cases} \beta_2 & \text{if ith person is a student} \\ 0 & \text{if ith person is not a student} \\ \end{cases} $$ which amounts to fitting two parallel lines that differ only in their intercept; adding an $income_i \times student_i$ interaction term lets the two lines have different slopes as well.
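A minimal sketch of the no-interaction model on simulated data (income, student status, and all coefficients below are invented for illustration):
In [ ]:
# quantitative + qualitative predictors on simulated data (illustrative values only)
rng = np.random.RandomState(2)
income = rng.uniform(20, 120, size=300)
student = rng.randint(0, 2, size=300)
balance = 200.0 + 6.0 * income + 380.0 * student + rng.normal(scale=50.0, size=300)
Xs = np.column_stack((income, student, np.ones_like(income)))
beta_s = inv(Xs.T.dot(Xs)).dot(Xs.T).dot(balance)
print('beta_1 (income), beta_2 (student), beta_0:', beta_s)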
In [145]:
auto = pd.read_table('Auto', sep=r'\s+')
# drop rows containing the missing-value marker '?'
rows = np.sum(auto.values == '?', axis=1)
delete_rows = []
for idx, count in enumerate(rows):
    if count != 0:
        delete_rows.append(idx)
auto = auto.drop(auto.index[delete_rows])
data = auto[['mpg', 'horsepower']].copy()
horsepower = data['horsepower'].values.astype('float')
data['horsepower_2'] = horsepower * horsepower
data['beta_0'] = np.full(horsepower.shape, 1.0)
# plot the scatter
plt.scatter(horsepower, auto['mpg'])
# calculate the linear and quadratic least squares fits
X = data[['horsepower','beta_0']].values.astype('float')
y = data['mpg'].values.astype('float').reshape(X.shape[0],1)
beta_linear =inv(X.T.dot(X)).dot(X.T).dot(y)
X = data[['horsepower','horsepower_2','beta_0']].values.astype('float')
beta_linear2 = inv(X.T.dot(X)).dot(X.T).dot(y)
x = np.linspace(40,230, 500)
y_linear = x*beta_linear[0] + beta_linear[1]
y_linear2 = x*beta_linear2[0] + x*x*beta_linear2[1] +beta_linear2[2]
plt.plot(x, y_linear, c='b')
plt.plot(x, y_linear2, c='g')
plt.show()
In [146]:
horsepower = data['horsepower'].values.astype('float')
mpg = data['mpg'].values.astype('float')
residual_linear = mpg - (horsepower*beta_linear[0]+beta_linear[1])
plt.scatter(mpg, residual_linear)
plt.show()
In [147]:
residual_quadratic = mpg - (horsepower*beta_linear2[0]+horsepower*horsepower*beta_linear2[1]+beta_linear2[-1])
plt.scatter(mpg, residual_quadratic)
plt.show()
Why might correlations among the error terms occur? Such correlations frequently occur in the context of time series data.
Another important assumption of the linear regression model is that the error terms have a constant variance, $Var(\epsilon_i) = \sigma ^2$. The standard errors, confidence intervals, and hypothesis tests associated with the linear model rely upon this assumption.
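A small simulated sketch of what a violation looks like (all values below are arbitrary): when the noise scale grows with $x$, the residual plot shows a funnel shape rather than a constant band.
In [ ]:
# simulated non-constant error variance: residual spread grows with the fitted value
rng = np.random.RandomState(3)
x_h = np.linspace(1, 10, 200)
y_h = 2.0 + 3.0 * x_h + rng.normal(scale=0.5 * x_h)   # noise scale grows with x
b0_h, b1_h = LSCE(x_h, y_h)
fitted = b0_h + b1_h * x_h
plt.scatter(fitted, y_h - fitted)
plt.show()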
In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy.linalg import inv
auto = pd.read_table('Auto', sep=r'\s+')
# drop rows containing the missing-value marker '?'
rows = np.sum(auto.values == '?', axis=1)
delete_rows = []
for idx, count in enumerate(rows):
    if count != 0:
        delete_rows.append(idx)
auto = auto.drop(auto.index[delete_rows])
horsepower= auto['horsepower'].values.astype('float')
auto['ones'] = np.full(horsepower.shape, 1.0)
X = auto[['horsepower','ones']].values.astype('float')
y = auto['mpg'].values.astype('float').reshape(X.shape[0],1)
beta_linear =inv(X.T.dot(X)).dot(X.T).dot(y)
print('β0 :',beta_linear[-1][0])
print('β1 :', beta_linear[0][0])
In [9]:
sample_num = len(y)
rss = np.power(X.dot(beta_linear) - y, 2).sum()
sigma = np.sqrt(rss / (sample_num - 2))          # residual standard error
horsepower_98 = np.array([[98.0, 1.0]])
mpg_98 = horsepower_98.dot(beta_linear)[0, 0]
# rough interval: predicted value +/- 2 * RSE
mpg_98_upper_bound = mpg_98 + 2 * sigma
mpg_98_lower_bound = mpg_98 - 2 * sigma
print('predicted value is %f when horsepower is 98' % mpg_98)
print('The range is [%f,%f]' % (mpg_98_lower_bound, mpg_98_upper_bound))
In [11]:
mpg = auto['mpg'].values.astype('float')
plt.scatter(horsepower, mpg)
plt.show()
In [13]:
auto_cor=auto[['mpg','displacement','horsepower','weight','acceleration']]
auto_cor.corr()
Out[13]: