Problem 1 Instructions

Answer the following short-answer questions using Markdown cells

Problem 1.1

A $t$-test and $zM$ test rely on the assumption of normality. How could you test that assumption?

The Shapiro-Wilk hypothesis test

Problem 1.2

What is $\hat{\beta}$ in OLS?

The best-fit slope
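
For reference, in OLS-1D with an intercept, the best-fit slope has a closed form:

$$\hat{\beta} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$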

Problem 1.3

What is $S_{\epsilon}$ in OLS?

The standard error of the residuals.
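
For reference, it is computed from the SSR and the degrees of freedom; for example, in OLS-1D with an intercept:

$$S_\epsilon^2 = \frac{1}{N - 2}\sum_i \left(y_i - \hat{y}_i\right)^2$$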

Problem 1.4

What is the difference between SSR and TSS? Is one always greater than the other?

SSR is the sum of squared distances between the fitted $\hat{y}$ values and the $y$ data. TSS is the sum of squared distances between the average $\bar{y}$ and the $y$ data. $TSS \geq SSR$ (with an intercept, the fit can do no worse than the mean).
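
In symbols:

$$SSR = \sum_i \left(y_i - \hat{y}_i\right)^2, \qquad TSS = \sum_i \left(y_i - \bar{y}\right)^2$$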

Problem 1.5

We learned three ways to do regression. One way was with algebraic equations (OLS-ND). What were the other two ways?

OLS-1D and NLS-ND

Problem 1.6

Aside from a plot, what are the steps to complete for a good regression analysis?

(1) Justify the regression with a Spearman correlation test, (2) check the normality of the residuals, and (3) run hypothesis tests/confidence intervals as needed. A sketch of these steps is below.
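
Here is a minimal sketch of these steps with scipy, assuming made-up example data for $x$ and $y$:


In [ ]:
import numpy as np
import scipy.stats as ss

# hypothetical example data
x = np.linspace(0, 10, 20)
y = 3 * x + 2 + np.random.normal(size=20)

# (1) justify the regression with a Spearman correlation test
print(ss.spearmanr(x, y))

# fit with OLS-1D
slope, intercept, rvalue, pvalue, stderr = ss.linregress(x, y)

# (2) check the normality of the residuals with a Shapiro-Wilk test
residuals = y - (slope * x + intercept)
print(ss.shapiro(residuals))

# (3) hypothesis tests/confidence intervals on the coefficients would follow here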

Problem 1.7

Is a goodness-of-fit plot applicable to a multidimensional regression? If so, what are the x/y axes for this plot?

Yes: $y$ vs $\hat{y}$
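
A minimal sketch of such a plot, with made-up values for $y$ and $\hat{y}$:


In [ ]:
import numpy as np
import matplotlib.pyplot as plt

# hypothetical data and fitted values from some multidimensional regression
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1])
yhat = np.array([1.1, 1.9, 3.0, 4.0, 5.2])

plt.plot(y, yhat, 'o')
plt.plot(y, y, '-')  # parity line, where yhat = y
plt.xlabel('$y$')
plt.ylabel('$\\hat{y}$')
plt.show()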

Problem 1.8

When is it valid to linearize a non-linear problem?

When the linearization doesn't change the noise in the model from normal to some other distribution.
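
For example, if the noise enters multiplicatively, taking the log keeps it additive and normal:

$$y = \alpha e^{\beta x} e^{\epsilon} \quad\Rightarrow\quad \ln y = \ln \alpha + \beta x + \epsilon$$

whereas with additive noise, $y = \alpha e^{\beta x} + \epsilon$, taking the log distorts the noise distribution and linearizing is not valid.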

Problem 1.9

Sometimes expressions for a model have $\hat{y}=\ldots$ on the left-hand side and other times $y=\ldots$. What is the difference between these two quantities and what changes on the right-hand side when adding/removing the $\hat{}$?

$\hat{y}$ is the best fit and $y$ is the data. When we write $y$, to achieve equality with our model we have to add $\epsilon$, a noise term that describes the discrepancy between our model and the data.
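
For example, in OLS-1D:

$$\hat{y} = \hat{\alpha} + \hat{\beta} x, \qquad y = \hat{\alpha} + \hat{\beta} x + \epsilon$$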

Problem Set 2

Problem 2.1

Are these numbers normally distributed? [-26.3,-24.2, -20.9, -25.8, -24.3, -22.6, -23.0, -26.8, -26.5, -23.1, -20.0, -23.1, -22.4, -22.8]


In [1]:
import scipy.stats as ss

# Shapiro-Wilk test: the null hypothesis is that the data are normally distributed
ss.shapiro([-26.3, -24.2, -20.9, -25.8, -24.3, -22.6, -23.0, -26.8, -26.5, -23.1, -20.0, -23.1, -22.4, -22.8])


Out[1]:
(0.9408471584320068, 0.42928802967071533)

The $p$-value is 0.43, so we cannot reject the null hypothesis of normality; the data could be normal.

Problem 2.2

Given $\hat{\alpha} = 0.2$, $\hat{\beta} = 1.6$, $N = 11$, $S^2_\alpha = 0.4$, $S^2_\epsilon = 0.5$, $S^2_\beta = 4$, give a justification for or against there being an intercept.


In [2]:
import numpy as np
T = (0.2 - 0) / np.sqrt(0.4)  # T statistic: (alpha_hat - 0) / S_alpha
# Use 11 - 1 degrees of freedom because the null hypothesis is that there is no intercept!
1 - (ss.t.cdf(T, 11 - 1) - ss.t.cdf(-T, 11 - 1))


Out[2]:
0.75833153571117373

The $p$-value is 0.76, so we cannot reject the null hypothesis of no intercept

Problem 2.3

Conduct a hypothesis test for the slope being positive using the above data. This is a one-sided hypothesis test. Hint: a good null hypothesis would be that the slope is negative. Describe your test in Markdown first, then complete it in Python, and finally write an explanation of the $p$-value in the final cell.

Let's make the null hypothesis that the slope is negative, as suggested. We will create a $T$ statistic, which should correspond to a $p$-value that gets smaller (closer to our significance threshold) as the slope gets more positive. This will work:

$$ p = 1 - \int_{-\infty}^{T} p(t)\,dt$$

where $T = \hat{\beta} / S_\beta$ is our positive value reflecting how positive the slope is.

You can deduct 1 or 2 degrees of freedom. Deducting 1 is correct, since there is no degree of freedom for the intercept here, but it's a little tricky to see that.


In [4]:
T = 1.6 / np.sqrt(4)  # T statistic: beta_hat / S_beta
1 - ss.t.cdf(T, 11 - 1)


Out[4]:
0.22115020957077069

The $p$-value is 0.22, so we cannot reject the null hypothesis; it is not guaranteed that the slope is positive. This is due to the large uncertainty in the slope ($S_\beta = 2$)

Problem 2.4

Write a function which computes the SSR for $\hat{y} = \beta_0 x + \beta_1 \exp\left( -\beta_2 x\right) $. Your function should take in one argument. You may assume $x$ and $y$ are defined elsewhere in the code.


In [ ]:
def ssr(beta):
    # residual sum of squares for yhat = beta_0 * x + beta_1 * exp(-beta_2 * x)
    # x and y are assumed to be defined elsewhere in the code
    yhat = beta[0] * x + beta[1] * np.exp(-beta[2] * x)
    return np.sum((y - yhat)**2)
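
For context, here is one way such a function might be used to fit the coefficients, with made-up example data for x and y and scipy.optimize.minimize as the minimizer:


In [ ]:
import numpy as np
import scipy.optimize

# hypothetical example data
x = np.linspace(0, 5, 25)
y = 2.0 * x + 1.5 * np.exp(-0.8 * x) + np.random.normal(scale=0.1, size=25)

def ssr(beta):
    yhat = beta[0] * x + beta[1] * np.exp(-beta[2] * x)
    return np.sum((y - yhat)**2)

# minimize the SSR starting from an initial guess for [beta_0, beta_1, beta_2]
result = scipy.optimize.minimize(ssr, x0=[1.0, 1.0, 1.0])
print(result.x)  # the fitted coefficients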

Problem 2.5

In NLS-ND, if I have 11 $x$ values, each of which is 2-dimensional, and my fit equation is $y = \beta_0 x_0 x_1$ (where $x_0$ is the first dimension and $x_1$ the second), how many degrees of freedom do I have? Why?

$11 - 1 = 10$. In non-linear regression we deduct only the number of fit coefficients (here one, $\beta_0$), not the dimensionality of $x$.

Problem 2.6

If my model equation is $\hat{z} = \beta_0 x y^{\,\beta_1}$, what would ${\mathbf F_{10}}$ be if $\hat{\beta_0} = 1.2$, $\hat{\beta_1} = 1.8$, $x_1 = 1.0$, $x_2 = 1.5$, $y_1 = 0.5$, $y_2 = -0.2$? Answer in Markdown (you can compute in a Python cell or with a calculator).

$$F_{10} = \frac{\partial f(\hat{\beta}, x_1)}{\partial \beta_0} = x_1 y_1^{\hat{\beta}_1} = 0.287$$
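
A quick check of the arithmetic, assuming these values:


In [ ]:
1.0 * 0.5 ** 1.8  # x_1 * y_1 ** beta_1_hat, approximately 0.287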