Answer in Markdown. 2 Points each
What assumption do we make on the noise terms when doing linear regression? How can we check it?
Your friend tells you that it's important to minimize both the SSR and TSS. What's wrong with minimizing the TSS?
How do you justify the presence of a slope?
What is the best numeric value or statistic for justifying the existence of a correlation?
What should you plot to justify a 4-dimensional ordinary least squares regression?
Why do we use a different number of deducted degrees of freedom when doing hypothesis testing vs. performing the regression?
Write a model equation for 3-dimensional ordinary least squares regression with an intercept. For example, a one-dimensional model equation without an intercept would be $y = \beta_0 x + \epsilon$
Write a model equation for when $y \propto \ln{x}$. Assume no intercept
Write a model equation for a person's life expectancy ($l$) assuming it depends on gender ($s$) and if the person eats vegetables ($v$). Assume for this problem that gender and eating vegetables are both binary (0 or 1).
Write a model equation for homework performance ($h$) based on the music genre listened to while working. The following genres are considered: Kwaito, Electroswing, and Djent Metal. You can only listen to one genre at a time. Use the letters $k$, $e$, and $d$.
Normal. Check with a Shapiro-Wilk test on the residuals.
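A minimal sketch of that check, assuming `residuals` holds the residuals from some fit (the numbers below are made up):

```python
import numpy as np
import scipy.stats as ss

# hypothetical residuals; in practice use y - yhat from your fit
residuals = np.array([0.1, -0.3, 0.2, 0.05, -0.15, 0.4, -0.2, 0.1])

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed
stat, p = ss.shapiro(residuals)
print(p)  # a large p-value means we cannot reject normality
```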
The TSS is fixed by the data and cannot change with the model parameters, so there is nothing to minimize.
Hypothesis test on slope coefficient or Spearman hypothesis test
$p$-value for slope or $p$-value from Spearman hypothesis test
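A quick sketch of the Spearman approach with made-up data; `x` and `y` stand in for the actual samples:

```python
import numpy as np
import scipy.stats as ss

# made-up data; replace with the samples being tested
x = np.array([0.5, 1.1, 1.8, 2.4, 3.0, 3.9, 4.5])
y = np.array([1.2, 1.9, 2.1, 3.3, 3.1, 4.6, 4.8])

# spearmanr returns the rank-correlation coefficient and its p-value
rho, p = ss.spearmanr(x, y)
print(rho, p)  # a small p-value justifies claiming a correlation
```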
Plot $\hat{y}$ vs $y$, or a histogram of the residuals.
In regression, the deducted degrees of freedom is the number of fit parameters. In a hypothesis test, we assume the null, which means the parameter we're testing is 0 and not part of the model, so it is not deducted.
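For example, with $N = 15$ points and a 1D OLS model with an intercept, a confidence interval on a coefficient uses a $t$-distribution with $N - 2$ degrees of freedom, but the hypothesis test for the existence of the intercept assumes under the null that the intercept is not in the model, leaving $N - 1$ degrees of freedom.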
$y = \beta_0 x_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 + \epsilon$
$y = \beta \ln x + \epsilon$
$l = \beta_0 \delta_s + \beta_1 \delta_v + \beta_2 \delta_v\delta_s + \beta_3 + \epsilon$
No interaction terms, because you can only listen to one genre at a time.
$h = \beta_0 \delta_k + \beta_1 \delta_e + \beta_2 \delta_d + \beta_3 + \epsilon$
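A minimal sketch of how the indicator variables in this model could be assembled into a design matrix; the genre labels below are made up:

```python
import numpy as np

# hypothetical genre listened to during each homework assignment
genres = ['kwaito', 'electroswing', 'djent', 'kwaito', 'djent']

# one indicator column per genre, plus a column of ones for the intercept beta_3
delta_k = np.array([g == 'kwaito' for g in genres], dtype=float)
delta_e = np.array([g == 'electroswing' for g in genres], dtype=float)
delta_d = np.array([g == 'djent' for g in genres], dtype=float)
x_mat = np.column_stack((delta_k, delta_e, delta_d, np.ones(len(genres))))
print(x_mat)
```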
Answer in Python or Markdown as appropriate. 4 Points each
If $\sigma_{xy} = -2.1$, $\sigma_{x}^2 = 3.5$, $\sigma_{y}^2 = 1.7$, what is the best fit slope? How does it change if the intercept is $-2.1$?
If your model equation is $y = \beta_0 + \beta_1 x + \beta_2 z + \epsilon$, what is the deducted degrees of freedom?
If $N = 12$, $D = 2$, and $S^2_{\beta_0} = 2.5$, what is the width of a 90% confidence interval for $\beta_0$?
If your best fit intercept is $\hat{\alpha} = 3$ with a standard error of $0.7$, what is the $p$-value for the existence of that intercept? Take $N = 15$ and assume it's 1D OLS.
In [1]:
# best-fit slope is the covariance divided by the variance of x
slope = -2.1 / 3.5
# the slope estimate does not depend on the intercept, so it is unchanged
print(slope, 'no change with respect to the intercept')
In [2]:
import scipy.stats as ss
import numpy as np
# 90% two-sided interval -> 0.95 quantile; t-distribution has N - D = 12 - 2 dof
T = ss.t.ppf(0.95, 12 - 2)
# interval width is T times the standard error of beta_0
print(np.sqrt(2.5) * T)
In [3]:
# t-statistic for the intercept: estimate divided by its standard error
T = 3 / 0.7
# two-tailed p-value; under the null the intercept is not in the model, so dof = N - 1
p = 1 - (ss.t.cdf(T, 15 - 1) - ss.t.cdf(-T, 15 - 1))
print(p)
Regress the following data to the model equation $y = \beta_0 \ln x + \beta_1 x + \beta_2 +\epsilon$ using a linearization so that you can use ND OLS. Report the following: a justification for the regression, the best-fit coefficients, 95% confidence intervals for the coefficients, a plot of the fit against the data, and a check that the residuals are normal.
x = [0.2, 0.29, 0.39, 0.48, 0.57, 0.66, 0.76, 0.85, 0.94, 1.04, 1.13, 1.22, 1.31, 1.41, 1.5]
y = [2.92, 2.58, 3.18, 4.27, 4.5, 3.93, 4.32, 4.57, 4.55, 4.7, 5.02, 4.21, 3.04, 4.98, 6.45]
In [4]:
x = [0.2, 0.29, 0.39, 0.48, 0.57, 0.66, 0.76, 0.85, 0.94, 1.04, 1.13, 1.22, 1.31, 1.41, 1.5]
y = [2.92, 2.58, 3.18, 4.27, 4.5, 3.93, 4.32, 4.57, 4.55, 4.7, 5.02, 4.21, 3.04, 4.98, 6.45]
x = np.array(x)
y = np.array(y)
In [5]:
#justification
ss.spearmanr(x, y)
Out[5]:
The $p$-value shows the correlation is significant, which justifies the regression.
In [6]:
#fit
x_mat = np.column_stack( (np.log(x), x, np.ones(len(x))) )
beta, *_ = np.linalg.lstsq(x_mat, y, rcond=None)
print(beta)
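As an optional sanity check (not part of the required answer), the same coefficients should come out of the normal equations $\hat{\beta} = (X^TX)^{-1}X^Ty$:

```python
# solve the normal equations directly; should match the lstsq result above
beta_check = np.linalg.inv(x_mat.T @ x_mat) @ (x_mat.T @ y)
print(beta_check)
```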
In [7]:
#confidence intervals
yhat = beta[0] * np.log(x) + beta[1] * x + beta[2]
# residual variance, with N - D deducted degrees of freedom
s2_e = np.sum( (yhat - y)**2 ) / (len(x) - len(beta))
# covariance matrix of the coefficients: s^2_e * (X^T X)^-1
se2_beta = s2_e * np.linalg.inv(x_mat.transpose() @ x_mat)
T = ss.t.ppf(0.975, len(x) - len(beta))
for i in range(len(beta)):
    cwidth = T * np.sqrt(se2_beta[i, i])
    print("beta_{} is {:.3f} +/- {:.3f} with 95% confidence".format(i, beta[i], cwidth))
In [8]:
#plot
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(x,y, 'p', label='data')
plt.plot(x, yhat, '-', label='fit')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
In [9]:
ss.shapiro(y - yhat)
Out[9]:
The $p$-value is 0.13, so we cannot reject the null hypothesis that the residuals are normal. They are probably normal.
In [10]:
#fit by minimizing the SSR directly (non-linear regression)
import scipy.optimize as opt

def SSR(beta):
    yhat = beta[0] * np.log(x) + beta[1] * x + beta[2]
    return np.sum( (yhat - y)**2 )

result = opt.minimize(SSR, x0=[1, 1, 1])
print(result.message)
print(result.x)
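As an optional check (assuming the `beta` array from the linearized fit above is still defined), both approaches fit the same model, so the coefficients should differ only by the optimizer's tolerance:

```python
# difference between the non-linear fit and the linearized OLS fit
print(result.x - beta)
```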
In [11]:
#making F: each column is the partial derivative of the model with respect to one parameter (ln x, x, 1)
f_mat = np.column_stack( (np.log(x), x, np.ones(len(x))) )
In [12]:
#confidence intervals for the non-linear fit
beta_nl = result.x
yhat = beta_nl[0] * np.log(x) + beta_nl[1] * x + beta_nl[2]
s2_e = np.sum( (yhat - y)**2 ) / (len(x) - len(beta_nl))
#MAKE SURE THEY USE f_mat HERE AND NOT REPEAT x_mat
se2_beta = s2_e * np.linalg.inv(f_mat.transpose() @ f_mat)
T = ss.t.ppf(0.975, len(x) - len(beta_nl))
for i in range(len(beta_nl)):
    cwidth = T * np.sqrt(se2_beta[i, i])
    print("beta_{} is {:.3f} +/- {:.3f} with 95% confidence".format(i, beta_nl[i], cwidth))