In [4]:
from math import erf, sqrt
import numpy as np
import scipy.stats
For full credit, you must have the following items for each problem:
[1 point] Describe which method you're using and why it is applicable. For example, 'I chose the signed rank test because these are two matched datasets describing one measurement'
[1 point] Write out the null hypothesis. For example, 'The null hypothesis is that the two measurement sets came from the same population (synonymous with probability distribution)'
[1 point] Report the p-value and your alpha value
[1 point] State whether you accept/reject the null hypothesis and answer the question
You have a sample of an unknown metal with a melting point of $1,070^\circ{}$ C. You know that gold has a melting point of $1,064^\circ{}$ C and your measurements have a standard deviation of $5^\circ{}$ C. Is the unknown metal likely to be gold?
Recall from confidence intervals that the standard deviation of the distance from the true mean is $\sigma / \sqrt{N}$ when you know the true standard deviation, $\sigma$. You take three additional samples and get $1,071^\circ{}$ C, $1,067^\circ{}$ C, and $1,075^\circ{}$ C. Does your evidence for gold change? Use the original measurement as well.
In [8]:
mu_sample = 1070.  # measured melting point of the unknown metal
mu_popul = 1064.   # known melting point of gold
st_dev = 5.        # measurement standard deviation
z = (mu_sample - mu_popul) / st_dev
print('Z:', z)
# two-sided p-value: probability of a deviation at least this large
p = 1 - (scipy.stats.norm.cdf(abs(z)) - scipy.stats.norm.cdf(-abs(z)))
print('P-Value:', p)
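Equivalently, the two-sided p-value for a normal test statistic can be computed from the survival function, which is more robust in the far tails; a quick cross-check of the cell above:

```python
import scipy.stats

# same Z as above: one measurement against a known mean, with known sigma
z = (1070 - 1064) / 5
# two-sided p-value via the survival function: P(|Z| >= z)
p = 2 * scipy.stats.norm.sf(abs(z))
print('P-Value:', p)
```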
The formula for a $Z$-statistic with a sample size greater than 1 is:
$$ Z = \frac{\mu - \bar{x}}{\sigma / \sqrt{N}}$$
In [14]:
mu = 1064.   # known melting point of gold
sigma = 5.   # known measurement standard deviation
data = [1070, 1071, 1067, 1075]  # all four measurements
Z = (mu - np.mean(data)) / (sigma / sqrt(len(data)))
print('Z:', Z)
# two-sided p-value
p = 1 - (scipy.stats.norm.cdf(abs(Z)) - scipy.stats.norm.cdf(-abs(Z)))
print('P-Value:', p)
In [12]:
mu = 89.3
data = [112.7, 78, 59.9, 127]
# one-sample t-statistic using the sample standard deviation (ddof=1)
T = (mu - np.mean(data)) / np.sqrt(np.var(data, ddof=1) / len(data))
T = np.abs(T)
print('T:', T)
# two-sided p-value with N - 1 degrees of freedom
p = 1 - (scipy.stats.t.cdf(T, len(data) - 1) - scipy.stats.t.cdf(-T, len(data) - 1))
print('p-value:', p)
In [15]:
mu = 1064
data = [1070, 1071, 1067, 1075]
# one-sample t-statistic using the sample standard deviation (ddof=1)
T = (mu - np.mean(data)) / np.sqrt(np.var(data, ddof=1) / len(data))
T = np.abs(T)
print('T:', T)
# two-sided p-value with N - 1 degrees of freedom
p = 1 - (scipy.stats.t.cdf(T, len(data) - 1) - scipy.stats.t.cdf(-T, len(data) - 1))
print('p-value:', p)
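The same one-sample t-test is built into scipy as `ttest_1samp`, which handles the degrees of freedom internally; a quick cross-check of the cell above:

```python
import scipy.stats

data = [1070, 1071, 1067, 1075]
# two-sided one-sample t-test against the known melting point of gold
t_stat, p = scipy.stats.ttest_1samp(data, 1064)
print('T:', t_stat)
print('p-value:', p)
```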
In [17]:
data_1 = [3.05, 3.01, 3.20, 3.16, 3.11, 3.09]
data_2 = [3.18, 3.23, 3.19, 3.28, 3.08, 3.18]
In [16]:
_,p = scipy.stats.ranksums(data_1, data_2)
print(p)
In [21]:
data_empty_tummy = [17.1, 29.5, 23.8, 37.3, 19.6, 24.2, 30.0, 20.9]
data_garbage_tummy = [14.2, 30.3, 21.5, 36.3, 19.6, 24.5, 26.7, 20.6]
In [25]:
_,p = scipy.stats.wilcoxon(data_empty_tummy, data_garbage_tummy)
print('p-value:', p)
In [26]:
temperature = [15, 18, 21, 24, 27, 30, 33]
chem_yield = [66, 69, 69, 70, 64, 73, 75]
In [27]:
scipy.stats.spearmanr(temperature, chem_yield)
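`spearmanr` returns both the correlation coefficient and its p-value; unpacking them makes the report explicit:

```python
import scipy.stats

temperature = [15, 18, 21, 24, 27, 30, 33]
chem_yield = [66, 69, 69, 70, 64, 73, 75]
# Spearman rank correlation and its two-sided p-value
rho, p = scipy.stats.spearmanr(temperature, chem_yield)
print('rho:', rho)
print('p-value:', p)
```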
In [29]:
p_winning = 1 / 10**7         # chance a single ticket wins
expected = p_winning * 10**6  # expected winners among 10^6 tickets
# probability of at least 4 winners when 0.1 are expected, doubled for two sides
p = 2 * (1 - scipy.stats.poisson.cdf(3, mu=expected))
print(p)
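As a sanity check, `1 - cdf(3)` is the probability of seeing 4 or more winners; the same tail can be built directly from the Poisson pmf:

```python
import scipy.stats

lam = (1 / 10**7) * 10**6  # expected number of winners: 0.1
# P(X >= 4) = 1 - P(X <= 3), summed term by term from the pmf
tail = 1 - sum(scipy.stats.poisson.pmf(k, lam) for k in range(4))
print(tail)
```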
In [42]:
p = 0.5  # null hypothesis: equal probability
N = 25   # number of trials
n = 8    # observed successes
# double the left tail P(X <= 8) for a two-sided p-value
print(2 * scipy.stats.binom.cdf(n, N, p))
The set-up for this p-value is to construct an interval over the known binomial distribution that just includes the observed value. I've done this by summing the probability mass from 0 up to that value. Our value lies below the mean, so this gives the left side of the interval; I multiply by 2 to get the other side.
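As a cross-check, scipy also ships an exact binomial test (assuming SciPy >= 1.7, where `binomtest` replaced the older `binom_test`). With $p = 0.5$ the distribution is symmetric, so the doubled left tail above should match the exact two-sided p-value:

```python
import scipy.stats

# exact two-sided binomial test: 8 successes in 25 trials under p = 0.5
res = scipy.stats.binomtest(8, n=25, p=0.5)
print('p-value:', res.pvalue)
```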
State which test is most appropriate for the following:
In [ ]: