Revisit a problem you got wrong on homework 7. If you got a perfect score on homework 7, state that fact. Go through each part you missed and state what your answer was and what your mistake was. If you already completed this on homework 8, state that you completed it on homework 8.
For example: My answer used the scipy comb function instead of factorial.
[-26.6,-24.0, -20.9, -25.8, -24.3, -22.6, -23.0, -26.8, -26.5, -23.6, -20.0, -23.1, -22.4, -22.5]
In [1]:
import scipy.stats as ss
ss.shapiro([-26.6,-24.0, -20.9, -25.8, -24.3, -22.6, -23.0, -26.8, -26.5, -23.6, -20.0, -23.1, -22.4, -22.5])
Out[1]:
The $p$-value is 0.46, so we cannot reject normality; the data are likely normally distributed.
In [8]:
import numpy as np
T = (1.2 - 0) / np.sqrt(1.2)
1 - (ss.t.cdf(T, 14- 1) - ss.t.cdf(-T, 14- 1))
Out[8]:
The $p$-value is 0.29, so we cannot reject the null hypothesis that the intercept is zero. No intercept is necessary.
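For reference, and assuming the 1.2 and $\sqrt{1.2}$ in the cell above are the intercept estimate and its standard error from the earlier fit, the statistic being tested is
$$T = \frac{\hat{\beta}_0 - 0}{s_{\beta_0}} = \frac{1.2}{\sqrt{1.2}} \approx 1.10$$
compared against a $t$-distribution with $14 - 1 = 13$ degrees of freedom.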
Let's take as the null hypothesis that the slope is positive. We will create a $T$ statistic whose corresponding interval/$p$-value gets smaller (closer to our significance threshold) as the slope becomes more negative. This will work:
$$ p = \int_{-\infty}^{T} p(t)\,dt $$
where $T$ is our (negative) test statistic, which reflects how negative the slope is.
You can deduct either 1 or 2 degrees of freedom. Deducting 1 is correct, since no intercept is fit here and so it consumes no degree of freedom, but that is a little tricky to see.
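Concretely, assuming the $-5.3$ and $12$ appearing in the next cell are the slope estimate and its squared standard error, the statistic is
$$T = \frac{\hat{\beta}_1 - 0}{s_{\beta_1}} = \frac{-5.3}{\sqrt{12}} \approx -1.53$$
evaluated with $14 - 1 = 13$ degrees of freedom as discussed above.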
In [11]:
T = -5.3 / np.sqrt(12)
ss.t.cdf(T, 14 - 1)
Out[11]:
Due to the high standard error, there is not enough evidence to reject the null hypothesis of a positive slope.
In [ ]:
def ssr(beta):
    yhat = beta[0] + beta[1] * np.cos(x * beta[2])
    return np.sum((y - yhat)**2)
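As a sketch of how a hint function like this could be used (the x and y below are made-up illustration data, not the problem's data):
import numpy as np
import scipy.optimize as opt

#made-up data roughly following y = b0 + b1*cos(b2*x), for illustration only
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * np.cos(1.3 * x)

def ssr(beta):
    yhat = beta[0] + beta[1] * np.cos(x * beta[2])
    return np.sum((y - yhat)**2)

#minimize the sum of squared residuals; note the objective is non-convex in beta[2],
#so the answer may be a local minimum depending on the starting guess
result = opt.minimize(ssr, x0=[1.0, 1.0, 1.0])
print(result.x)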
In [2]:
from math import log
1.5 * 1.5 * 1.2**(2.0) * log(1.2)
Out[2]:
Regress the data in the next cell to a slope/intercept equation. Use the np.savetxt function to create a CSV file. Provide the following labeled/bolded quantities at the top of your Excel file:
You do not need to do all the steps for a good regression, but do make a plot of your fit and the data. Use the linest command in Excel to compute the slope/intercept and standard errors.
In [13]:
x = [0.5,1.3, 2.1, 1.0, 2.1, 1.7, 1.2, 3.9, 3.9, 1.5, 3.5, 3.9, 5.7, 4.7, 5.8, 4.6, 5.1, 5.9, 5.5, 6.4, 6.7, 7.8, 7.4, 6.7, 8.4, 6.9, 10.2, 9.7, 10.0, 9.9]
y = [-1.6,0.5, 3.0, 3.1, 1.5, -1.8, -3.6, 7.0, 8.6, 2.2, 9.3, 3.6, 14.1, 9.5, 14.0, 7.4, 6.4, 17.2, 11.8, 12.2, 18.9, 21.9, 20.6, 15.7, 23.7, 13.6, 26.8, 22.0, 27.5, 23.3]
In [15]:
np.savetxt('data.csv', np.column_stack((x, y)), delimiter=',')
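As an optional cross-check of the Excel linest result (not part of the required Excel workflow), a minimal NumPy sketch using the x and y defined above:
#np.polyfit with degree 1 returns the coefficients highest-order first: (slope, intercept)
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)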
Regress the following non-linear equation in MATLAB:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2$$
Perform the regression with and without $\beta_2$. Should there be a $\beta_2$ term? Justify your answer. You do not need to do all the steps for a good regression. Do plot your two regressions and the original data.
Hints:
In [4]:
x = [-5.8,-4.6, -3.9, -3.4, -1.8, -2.1, -3.0, -0.8, 0.4, -0.2, -0.4, -0.0, 2.0, 1.1, 1.4, 1.2, 3.3, 4.3, 4.3, 3.0]
y = [-6.4,-7.7, -9.3, -9.2, -8.9, -7.3, -9.5, -5.0, -3.7, -6.9, -4.0, -3.8, 2.6, -0.6, -0.7, -0.1, 5.0, 4.8, 8.5, 2.5]
We will linearize the equation to the dimensions $\left[1, x, x^2\right]$ and $\left[1, x\right]$ using the equations from the lecture notes. Note: writing the code below is easier done in a MATLAB notebook and then copied back. That will give you autocomplete, syntax highlighting, etc.
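For comparison only (an addition of mine, not part of the assignment's MATLAB requirement), a minimal NumPy sketch of the same normal-equations fit:
import numpy as np

x = np.array([-5.8, -4.6, -3.9, -3.4, -1.8, -2.1, -3.0, -0.8, 0.4, -0.2,
              -0.4, -0.0, 2.0, 1.1, 1.4, 1.2, 3.3, 4.3, 4.3, 3.0])
y = np.array([-6.4, -7.7, -9.3, -9.2, -8.9, -7.3, -9.5, -5.0, -3.7, -6.9,
              -4.0, -3.8, 2.6, -0.6, -0.7, -0.1, 5.0, 4.8, 8.5, 2.5])

#design matrices for dimensions [1, x] and [1, x, x^2]
X1 = np.column_stack((np.ones_like(x), x))
X2 = np.column_stack((np.ones_like(x), x, x**2))

#solve beta = (X'X)^-1 X'y; lstsq is equivalent to the normal equations but more stable
beta1, *_ = np.linalg.lstsq(X1, y, rcond=None)
beta2, *_ = np.linalg.lstsq(X2, y, rcond=None)
print(beta1, beta2)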
In [2]:
%load_ext pymatbridge
In [69]:
%%matlab
x = [-5.8,-4.6, -3.9, -3.4, -1.8, -2.1, -3.0, -0.8, 0.4, -0.2, -0.4, -0.0, 2.0, 1.1, 1.4, 1.2, 3.3, 4.3, 4.3, 3.0];
y = [-6.4,-7.7, -9.3, -9.2, -8.9, -7.3, -9.5, -5.0, -3.7, -6.9, -4.0, -3.8, 2.6, -0.6, -0.7, -0.1, 5.0, 4.8, 8.5, 2.5];
[~, N] = size(x);
%get regressed fit
x_mat_1 = cat(1, ones(1, N), x)';
beta_mat_1 = inv(x_mat_1' * x_mat_1) * x_mat_1' * y'
%get regressed fit for 2nd degree polynomial
x_mat_2 = cat(1, ones(1, N), x, x .^ 2)';
beta_mat_2 = inv(x_mat_2' * x_mat_2) * x_mat_2' * y'
%get DOF
dof_2 = N - 3
%get error
s2_e_2 = sum((x_mat_2 * beta_mat_2 - y') .^ 2) / dof_2;
s_b_2 = sqrt(diag((s2_e_2 * inv(x_mat_2' * x_mat_2))))
%compute T-value
T = beta_mat_2(3) / s_b_2(3)
%for plotting, some of the x values are out of order so I'll make new ones
x_plot = linspace(-6, 5, 100);
x_mat_plot_2 = cat(1, ones(1, 100), x_plot, x_plot .^ 2)';
x_mat_plot_1 = cat(1, ones(1, 100), x_plot)';
plot(x, y, 'o')
hold on
plot(x_plot, x_mat_plot_1 * beta_mat_1, '-')
plot(x_plot, x_mat_plot_2 * beta_mat_2, '-')
legend('data', 'fit - line', 'fit - 2nd degree')
Our $T$-value is 0.46. Our MATLAB install doesn't have the stats add-on, so we'll use a table or Python to look up the CDF. The degrees of freedom come from the null hypothesis (the model without the $x^2$ term), which gives $N - 2 = 18$.
In [70]:
import scipy.stats as ss
1 - (ss.t.cdf(0.46, 18) - ss.t.cdf(-0.46, 18))
Out[70]:
So we do not have enough evidence for the extra $x^2$ term in the regression. The fit parameters are $-2.5$ (intercept) and $1.7$ (slope).
In [72]:
x = [1.4,2.3, 3.7, 5.3, 6.6, 8.2, 10.2, 11.8, 12.7, 13.3, 14.6, 17.3, 18.6, 19.5, 21.6, 22.7, 23.6, 24.1]
y = [1.0,0.3, -0.1, -0.1, -0.3, -0.4, -0.4, -0.5, -0.4, -0.5, -0.4, -0.6, -0.8, -0.8, -0.6, -0.9, -0.7, -1.1]
In [73]:
ss.spearmanr(x, y)
Out[73]:
There is a strong correlation in the data, with a $p$-value of $10^{-9}$. This indicates that performing a regression is justified.
In [76]:
import numpy as np
import scipy.optimize as opt
def ssr(betas, data):
    '''Compute the SSR given the betas and the data. The data should be a list containing two arrays, the x and y data.'''
    x = data[0]
    y = data[1]
    yhat = betas[0] * np.log(x / betas[1])
    return np.sum((yhat - y)**2)
#test my function
betas = [1,1]
x = np.array(x)
y = np.array(y)
ssr(betas, data=[x, y])
Out[76]:
In [80]:
result = opt.minimize(ssr, x0=[1,1], args=([x,y],))
print(result.x)
#check another starting point to see if the problem is non-convex
result = opt.minimize(ssr, x0=[-2,100], args=([x,y]))
print(result.x)
Based on this test, the problem is non-convex, so I will use basin-hopping.
In [85]:
minimizer_kwargs = {'args': ([x,y])}
result = opt.basinhopping(ssr, x0=[1000,1000], niter=1000, minimizer_kwargs=minimizer_kwargs)
print(result)
The optimizer tried to put negative values into the log, so I will add a bound.
In [86]:
minimizer_kwargs = {'args': ([x,y]), 'bounds':[(-np.infty, np.infty), (10**-10, np.infty)]}
result = opt.basinhopping(ssr, x0=[100,100], niter=10000, minimizer_kwargs=minimizer_kwargs)
print(result)
In [87]:
#store them in a nice spot for later
beta_hat = result.x
In [91]:
import matplotlib.pyplot as plt
%matplotlib inline
resids = beta_hat[0] * np.log(x / beta_hat[1]) - y
plt.hist(resids)
plt.show()
print(ss.shapiro(resids))
SSR = np.sum(resids ** 2)
TSS = np.sum((np.mean(y) - y)**2)
print(1 - SSR / TSS)
The $R^2$ value is $0.895$ and the $p$-value for normality is $0.82$. Together these results show that the residuals are likely normally distributed and that the regression fits well.
In [93]:
plt.plot(x,y,'o', label='data')
plt.plot(x, beta_hat[0] * np.log(x / beta_hat[1]), label='fit')
plt.legend()
plt.show()
The fit looks excellent
In [106]:
import scipy.linalg as linalg
#compute the F-matrix
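#For yhat = beta_hat[0] * log(x / beta_hat[1]), the columns of F are the partial derivatives:
#  d yhat / d beta0 = log(x / beta1)
#  d yhat / d beta1 = -beta0 / beta1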
F_mat = np.column_stack( (np.log(x / beta_hat[1]), np.ones(len(x)) * -beta_hat[0] / beta_hat[1]) )
#standard error in residuals
s2_e = np.sum(resids**2) / (len(x) - len(beta_hat))
#standard error in parameters
s2_b = s2_e * linalg.inv(F_mat.transpose().dot(F_mat))
#go from matrix of squares to actual standard errors
s_b = np.sqrt(np.diag(s2_b))
#use a for loop to print them all pretty
from IPython import display
for beta, se, i in zip(beta_hat, s_b, range(len(s_b))):
    display.display(display.Latex('$\\beta_{} = {:.2} \\pm {:.2}$'.format(i, beta, se * ss.t.ppf(0.975, len(x) - len(beta_hat)))))