If you sum together 20 numbers sampled from a binomial distribution and 10 from a Poisson distribution, how is your sum distributed?
If you sample 25 numbers from different beta distributions, how will each of the numbers be distributed?
Assume a HW grade is determined as the average of 3 HW assignments. How is the HW grade distributed?
You measure the height of 3 people. What distribution will the uncertainty of the mean of the heights follow?
Answers:
1.1 The sum should follow a normal distribution. Since the total sample size is 30, it is large enough for the CLT to apply. Hence the differences between the distributions that we are sampling from do not matter.
1.2 Since we are just sampling 25 numbers without taking either their sum or mean, each of the numbers will reflect the beta distribution that it is sampled from.
1.3 The HW grade is the mean of 3 HW assignments. Since 3 is too small a sample to rely on the CLT alone, here we are assuming that each of the HW assignments is normally distributed. If we know the individual standard deviation of each HW assignment, the HW grade will follow a normal distribution. However, if the standard deviation of each HW assignment is not known, the HW grade will follow a t-distribution.
1.4 The uncertainty of the mean of the heights will follow a t-distribution. This is because 3 is a small sample size and we do not know the true standard deviation of the heights.
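The CLT argument in 1.1 can be checked numerically. The following is a minimal sketch, not part of the original answers: the binomial parameters (`n=10`, `p=0.3`) and the Poisson rate (`lam=4.0`) are assumptions chosen for illustration, since the question does not specify them. It standardizes many simulated sums and checks that they look approximately standard normal.

```python
import numpy as np
from scipy import stats as ss

rng = np.random.default_rng(0)
n_trials = 20_000

# Hypothetical parameters: the question does not state them.
binom_n, binom_p, poisson_lam = 10, 0.3, 4.0

# Each row is one experiment: 20 binomial draws and 10 Poisson draws.
binom_draws = rng.binomial(n=binom_n, p=binom_p, size=(n_trials, 20))
poisson_draws = rng.poisson(lam=poisson_lam, size=(n_trials, 10))
sums = binom_draws.sum(axis=1) + poisson_draws.sum(axis=1)

# Theoretical mean and variance of the sum of independent draws.
mean_theory = 20 * binom_n * binom_p + 10 * poisson_lam
var_theory = 20 * binom_n * binom_p * (1 - binom_p) + 10 * poisson_lam

# Standardized sums should have mean ~0, std ~1, and small skewness.
z = (sums - mean_theory) / np.sqrt(var_theory)
print('mean={:.3f} std={:.3f} skew={:.3f}'.format(z.mean(), z.std(), ss.skew(z)))
```

With other parameter choices the approximation can be better or worse, for example a very small Poisson rate makes the sum noticeably skewed.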
Report the given confidence interval for error in the mean using the data in the next cell and describe in words what the confidence interval is for each example. 4 points each
data_1 = [0.41,2.69,3.82,0.42,1.20]
data_2 = [5.07,2.79,1.24,6.50,3.17,3.59,5.42,4.10,1.26,0.54,1.22,4.43,3.83,0.93,3.45,5.24,3.51,4.64,0.65,3.27,2.41,4.31,4.15,2.24,2.30,3.3]
data_3 = [5.62,2.34,2.76,2.80,1.15,5.19,-0.91]
2.1 Answer
Since $N=5$ and the true standard deviation is not known, we use the t-distribution. We can say with 80% confidence that the true mean lies within the interval 1.7 $\pm$ 1.0.
In [1]:
from scipy import stats as ss
import numpy as np
In [2]:
data1 = np.array([0.41,2.69,3.82,0.42,1.20])
CI = 0.80
sample_mean = np.mean(data1)
sample_var = np.var(data1, ddof=1)
# Lower-tail t quantile (negative), so negate it to get the half-width
T = ss.t.ppf((1 - CI) / 2, df=len(data1)-1)
y = -T * np.sqrt(sample_var / len(data1))
print('{} +/- {}'.format(sample_mean, y))
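As a cross-check on the manual calculation, the same two-sided interval can be obtained from SciPy directly. This is a sketch using `scipy.stats.t.interval` and `scipy.stats.sem` (which uses `ddof=1` by default); the names `low`, `high` and `half_width` are introduced here for illustration.

```python
import numpy as np
from scipy import stats as ss

data1 = np.array([0.41, 2.69, 3.82, 0.42, 1.20])

# 80% two-sided interval for the mean, t-distribution with N-1 degrees of freedom.
# The confidence level is passed positionally because its keyword name
# differs between SciPy versions.
low, high = ss.t.interval(0.80, len(data1) - 1,
                          loc=np.mean(data1), scale=ss.sem(data1))

half_width = (high - low) / 2
print('{:.2f} +/- {:.2f}'.format((low + high) / 2, half_width))
```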
2.2 Answer
Since $N=26$, we use the normal distribution. We can say with 99% confidence that the true mean lies below 3.941 (3.214 + 0.727).
In [3]:
data2 = np.array([5.07,2.79,1.24,6.50,3.17,3.59,5.42,4.10,1.26,0.54,1.22,4.43,3.83,0.93,3.45,5.24,3.51,4.64,0.65,3.27,2.41,4.31,4.15,2.24,2.30,3.3])
CI = 0.99
sample_mean = np.mean(data2)
sample_var = np.var(data2, ddof=1)
Z = ss.norm.ppf((1 - CI))
y = -Z * np.sqrt(sample_var / len(data2))
print('{} + {}'.format(sample_mean, y))
2.3 Answer
Since $N=7$ and the true standard deviation is not known, we use the t-distribution. We can say with 95% confidence that the true mean lies within the interval 2.7 $\pm$ 2.1.
In [4]:
data3 = np.array([5.62,2.34,2.76,2.80,1.15,5.19,-0.91])
CI = 0.95
sample_mean = np.mean(data3)
sample_var = np.var(data3,ddof=1)
# Lower-tail t quantile (negative), so negate it to get the half-width
T = ss.t.ppf((1 - CI)/2, df=len(data3)-1)
y = -T * np.sqrt(sample_var / len(data3))
print('{} +/- {}'.format(sample_mean, y))
2.4 Answer
Even though we have a small sample size, $N=7$, we use the normal distribution since we know the true standard deviation. We can say with 95% confidence that the true mean lies within the interval 2.7 $\pm$ 1.5.
In [5]:
data3 = np.array([5.62,2.34,2.76,2.80,1.15,5.19,-0.91])
CI = 0.95
sample_mean = np.mean(data3)
true_var = 2**2
# Known standard deviation, so use the lower-tail normal quantile (negative)
Z = ss.norm.ppf((1 - CI)/2)
y = -Z * np.sqrt(true_var / len(data3))
print('{} +/- {}'.format(sample_mean, y))
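The two-sided calculations in 2.1, 2.3 and 2.4 share the same steps, so they can be wrapped in one function. The sketch below is an illustration rather than part of the original solution: the function name `mean_confidence_interval` and its `sigma` argument are hypothetical. It assumes the t-distribution is used whenever the true standard deviation is unknown, which differs from 2.2, where the normal distribution is used for $N=26$.

```python
import numpy as np
from scipy import stats as ss


def mean_confidence_interval(data, confidence, sigma=None):
    """Return (sample_mean, half_width) of a two-sided interval for the mean.

    If sigma (the true standard deviation) is given, the normal
    distribution is used; otherwise the t-distribution with N-1
    degrees of freedom and the sample standard deviation.
    """
    data = np.asarray(data, dtype=float)
    n = len(data)
    upper_tail = 1 - (1 - confidence) / 2
    if sigma is not None:
        critical = ss.norm.ppf(upper_tail)
        std_err = sigma / np.sqrt(n)
    else:
        critical = ss.t.ppf(upper_tail, df=n - 1)
        std_err = np.std(data, ddof=1) / np.sqrt(n)
    return np.mean(data), critical * std_err


data3 = [5.62, 2.34, 2.76, 2.80, 1.15, 5.19, -0.91]

mean_t, half_t = mean_confidence_interval(data3, 0.95)             # as in 2.3
mean_z, half_z = mean_confidence_interval(data3, 0.95, sigma=2.0)  # as in 2.4
print('{:.1f} +/- {:.1f}'.format(mean_t, half_t))
print('{:.1f} +/- {:.1f}'.format(mean_z, half_z))
```

The t-based interval is wider than the normal-based one here, both because the t quantile is larger for 6 degrees of freedom and because the sample standard deviation exceeds the stated true value of 2.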
State the distribution and its parameters for each of the following cases. 2 points each.
Answers
3.1 Normal distribution
mean=0, standard deviation=$\sigma/\sqrt N$ = 1.2
3.2 t-distribution
scale=$\sigma_x/\sqrt N$ = 0.965, degrees of freedom=$N-1$ = 10
3.3 Normal distribution
mean=-3, standard deviation=$\sigma_x/\sqrt N$ = 0.355
3.4 Normal distribution
mean=6, standard deviation=$\sigma/\sqrt N$ = 2.008
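The distributions stated above can be written down as frozen SciPy distributions, which makes it easy to query intervals or quantiles from them. This is a sketch only: the variable names are hypothetical, and because answer 3.2 does not state a center for the t-distribution, `loc=0` is an assumption made here for illustration.

```python
from scipy import stats as ss

dist_3_1 = ss.norm(loc=0, scale=1.2)
# loc=0 is an assumption: the answer gives only the scale and degrees of freedom.
dist_3_2 = ss.t(df=10, loc=0, scale=0.965)
dist_3_3 = ss.norm(loc=-3, scale=0.355)
dist_3_4 = ss.norm(loc=6, scale=2.008)

# Central 95% interval of each distribution.
for name, dist in [('3.1', dist_3_1), ('3.2', dist_3_2),
                   ('3.3', dist_3_3), ('3.4', dist_3_4)]:
    low, high = dist.interval(0.95)
    print('{}: [{:.3f}, {:.3f}]'.format(name, low, high))
```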