In [1]:
SL_CI = 0.05
n = 73
s = 14.26
avg_diff = 12.76
SE = round(s/sqrt(n),digits=2)
z = round(qnorm(SL_CI/2,lower.tail=F),digits=2)
print('confidence interval is')
ME = z*SE
print(c(avg_diff-ME,avg_diff+ME))
print('hypothesis testing is')
# two-sided z test of the mean difference against 0 (this form is valid since avg_diff > 0)
pnorm(avg_diff, mean=0,sd=SE, lower.tail=F) * 2
Out[1]:
In [27]:
x1 = 41.8
s_1 = 15.14
n_1 = 505
x2 = 39.4
s_2 = 15.12
n_2 = 667
pt_est = x1 - x2   # point estimate ('pt' would mask R's pt() function)
SL_CI = 0.05
SE = round(sqrt(s_1**2/n_1 + s_2**2/n_2),digits=2)
z = round(qnorm(SL_CI/2,lower.tail=F),digits=2)
print('confidence interval is')
ME = z*SE
print(c(pt_est-ME,pt_est+ME))
print('hypothesis testing is')
# two-sided z test of the difference in means against 0
pnorm(pt_est, mean=0, sd=SE, lower.tail=F) * 2
Out[27]:
HT "If there is no difference of work hours on average between college degree vs non college degree, there is 0.7% chance of obtaining random samples of 505 college and 667 non-college degree give average difference of work hours at least 2.4 hours"
Screenshot taken from Coursera 03:46
Screenshot taken from Coursera 12:14
In [28]:
#95% = 1.96
#99% = 2.58
n = 100
mu = 882.515
# s = 89.5758
# z = 1.96
CL = 0.9
z = round(qnorm((1-CL)/2,lower.tail=F),digits=2)
# SE = s/sqrt(n)
SE = 89.5759
ME = z*SE
c(mu-ME,mu+ME)
Out[28]:
Bootstrap samples are created by resampling with replacement from the one sample we have. This differs from a sampling distribution, which draws samples from the population. We can use the percentile method: take, say, 100 bootstrap samples and cut off the two sides for an XX% interval. Or, relying on the bootstrap distribution being normal, use the bootstrap point estimate and bootstrap standard error. One weakness of the bootstrap: when the bootstrap distribution is skewed and sparse, it is not reliable.
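A minimal sketch of the percentile and SE methods, on a made-up sample x (all numbers illustrative, not from the course):
In [ ]:
# Percentile bootstrap: resample the sample itself, with replacement
set.seed(42)
x = rnorm(50, mean=10, sd=3)    # hypothetical sample
B = 1000                        # number of bootstrap samples
boot_means = replicate(B, mean(sample(x, replace=TRUE)))
quantile(boot_means, c(0.025, 0.975))      # percentile method, 95% interval
mean(x) + c(-1,1) * 1.96 * sd(boot_means)  # SE method, if the distribution looks normal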
Paired data arise when each observation in one set depends on (corresponds to) an observation in the other. We can use the set of differences as the basis for hypothesis testing and confidence intervals.
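For example, a small sketch with hypothetical pre/post scores, where all inference is built from the differences:
In [ ]:
# Paired data: run the test on the differences (values made up)
pre  = c(57, 66, 71, 60, 62, 68, 70, 64)
post = c(64, 70, 73, 65, 68, 71, 75, 67)
d = post - pre
SE = sd(d)/sqrt(length(d))
t_stat = mean(d)/SE                                # null: mean difference = 0
2 * pt(abs(t_stat), df=length(d)-1, lower.tail=F)  # two-sided p-value
# equivalently: t.test(post, pre, paired=TRUE)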
Screenshot taken from Coursera 03:42
Tests based on the t-distribution can also be two-sided.
The t-distribution is used when we have a small sample size, less than 30. It has a single parameter, the degrees of freedom, which compensates for a poorly estimated standard error. We also verify that, even though the sample is small, it is less than 10% of the population, and that outliers are not too extreme (roughly within two standard deviations).
Calculating the tail probability for a t-score 1.65 standard deviations away, with 20 degrees of freedom, looking at the outer tail.
In [3]:
df=17
#NOT POINT ESTIMATE, BUT T-STATISTIC
t_statistic = 0.5
pt(t_statistic,df=df,lower.tail=F)
Out[3]:
Given a confidence level (or significance level) and the degrees of freedom, find the cutoff value t*, which is always positive.
In [1]:
CL = 0.95
n = 19
qt((1-CL)/2, df=n-1,lower.tail=F)
Out[1]:
In [5]:
CL = 0.99
n = 12
qt((1-CL)/2, df=n-1,lower.tail=F)
Out[5]:
Screenshot taken from Coursera 09:57
Screenshot taken from Coursera 15:49
In [24]:
xbar = 56
mu = 0
s = 8     # sample standard deviation ('sd' would mask R's sd() function)
n = 20
CL = 0.9
SE = round(s/sqrt(n),digits=2)
t_stat = round((xbar-mu)/SE,digits=2)                        # test statistic
t_crit = round(qt((1-CL)/2,lower.tail=F,df=n-1),digits=2)    # critical t value
print('Hypothesis Testing is ')
pt(t_stat, df=n-1, lower.tail=xbar < mu)   # one-sided p-value
print('Confidence Interval is ')
ME = round(t_crit*SE, digits=2)
c(xbar-ME,xbar+ME)
Out[24]:
Out[24]:
In [12]:
t_stat
Out[12]:
Since the null value is not in the interval, the CI agrees with the HT in this case. When we reject the null hypothesis, it must be that the null value is not in the confidence interval.
In [38]:
x1 = 52.1
s_1 = 45.1
n_1 = 22
x2 = 27.1
s_2 = 26.4
n_2 = 22
avg_diff = x1-x2
SL_CI = 0.02
SE = round(sqrt(s_1**2/n_1 + s_2**2/n_2),digits=2)
t_crit = round(qt(SL_CI/2,df=min(n_1-1,n_2-1),lower.tail=F),digits=2)
print('confidence interval is')
ME = t_crit*SE
print(c(avg_diff-ME,avg_diff+ME))
print('hypothesis testing is')
# two-sided t test; the test statistic is avg_diff/SE, not the raw difference
pt(avg_diff/SE, df=min(n_1-1,n_2-1), lower.tail=avg_diff < 0) * 2
Out[38]:
In [46]:
x1 = 248.3
s_1 = 2
n_1 = 10
x2 = 244.8
s_2 = 3
n_2 = 10
avg_diff = x1-x2
SL_CI = 0.05
# SE = round(sqrt(s_1**2/n_1 + s_2**2/n_2),digits=2)
# pooled variance, for use when the two population SDs are similar
s2_pooled = ((s_1**2*(n_1-1)) + (s_2**2*(n_2-1))) / (n_1+n_2-2)
SE = round(sqrt(s2_pooled/n_1 + s2_pooled/n_2),digits=2)
df_pooled = n_1+n_2-2    # pooling earns the combined degrees of freedom
t_crit = round(qt(SL_CI/2,df=df_pooled,lower.tail=F),digits=2)
print('confidence interval is')
ME = t_crit*SE
print(c(avg_diff-ME,avg_diff+ME))
print('hypothesis testing is')
pt(avg_diff/SE, df=df_pooled, lower.tail=avg_diff < 0) * 2   # two-sided t test
Out[46]:
So we can say we are 95% confident that those with eating distractions consume, on average, between 1.83 and 48.17 grams more snacks than those without distractions. Since the null value is not within the interval, the CI agrees with the HT.
Pooling simplifies the calculation when the population SDs of the two groups are similar: $s^2_{pooled} = \frac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{n_1 + n_2 - 2}$, with $df = n_1 + n_2 - 2$.
Screenshot taken from Coursera 08:39
Recognize that the test statistic for ANOVA, the F statistic, is calculated as the ratio of the mean square between groups (MSG, variability between groups) and mean square error (MSE, variability within errors). Also recognize that the F statistic has a right skewed distribution with two different measures of degrees of freedom: one for the numerator (dfG=k−1, where k is the number of groups) and one for the denominator (dfE=n−k, where n is the total sample size). Note that you won’t be expected to calculate MSG or MSE from the raw data, but you should have a conceptual understanding of how they’re calculated and what they measure.
Screenshot taken from Coursera 08:42
For the F statistic in ANOVA, larger values mean smaller p-values. The distribution is right (positively) skewed because between-group and within-group variability can never be negative.
In [42]:
n_group = 4
n_total = 795
f_value = 21.735
dfg = n_group -1
dft = n_total - 1
dfe = dft - dfg
pf(f_value,dfg,dfe,lower.tail=F)
Out[42]:
In [47]:
n_group = 3
n_total = 831
f_value = 3.47
dfg = n_group -1
dft = n_total - 1
dfe = dft - dfg
pf(f_value,dfg,dfe,lower.tail=F)
Out[47]:
Interpreting ANOVA:
If p-value < $\alpha$, the data provide convincing evidence that at least one pair of population means differs (we can't tell which). In this case we reject the null hypothesis.
If p-value > $\alpha$, the data do not provide convincing evidence that any pair of population means differs; the observed difference is due to sampling variability (chance). In this case we fail to reject the null hypothesis.
Since the p-value is very small in this problem, we reject the null hypothesis and conclude that the data provide convincing evidence that at least one pair of population means differs.
Bonferroni correction: lower the significance level to compensate for the inflated Type 1 error of multiple comparisons. After rejecting the null hypothesis with ANOVA, when testing which pair(s) contribute, use $\alpha^* = \alpha / K$ with $K = k(k-1)/2$.
In [44]:
k = 7
SL = 0.05
SL/((k*(k-1))/2)   # corrected significance level: alpha / K, with K = k(k-1)/2 pairwise tests
Out[44]:
Screenshot taken from Coursera 08:19
In [30]:
x_1 = 6.76
n_1 = 331
x_2 = 5.07
n_2 = 41
MSE = 3.628
df_res = 791
null=0
# pairwise comparison after ANOVA: use MSE as the common variance and the residual df
t_stat = (x_1-x_2-null)/sqrt(MSE/n_1+MSE/n_2)
2*pt(t_stat,df_res,lower.tail=F)   # two-sided p-value
Out[30]:
Therefore we reject the null hypothesis; the data provide convincing evidence that the average vocabulary scores of self-reported middle-class and lower-class Americans are different.
ANOVA is a statistical tool that lets you analyze many groups at once: is the variability of a particular group just chance, compared with the variability of the others? Unlike the usual HT, in ANOVA the null hypothesis is equal means across all groups, while the alternative is that at least one group differs. There are only two possibilities: all means equal, or at least one different.
The conditions for ANOVA are independence between and within groups, a nearly normal distribution within each group, and roughly constant variability across groups. Use graphical tools like boxplots and summary statistics to build intuition about the data. The F statistic is obtained by dividing MSG by MSE, where each mean square is a sum of squares divided by its corresponding degrees of freedom. The F distribution is right skewed and always positive, so we always use a one-sided p-value.
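A sketch of that F calculation, with made-up sums of squares (SSG, SSE, k, and n are all illustrative):
In [ ]:
# F statistic from hypothetical sums of squares
k = 4; n = 795                 # number of groups and total sample size
SSG = 236.56; SSE = 2869.29    # hypothetical between/within sums of squares
dfg = k - 1; dfe = n - k
MSG = SSG/dfg; MSE = SSE/dfe   # mean square = sum of squares / df
f_stat = MSG/MSE
pf(f_stat, dfg, dfe, lower.tail=F)   # one-sided p-value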
Recall that a single hypothesis test carries a 5% Type 1 error rate. Doing multiple tests piles up the error, so a fixed 5% is not enough; we push the error rate down by incorporating K, the number of comparisons. It is also possible to reject the null hypothesis in ANOVA (statistically significant) yet fail to find a significant difference in any pairwise test: ANOVA only says at least one group is different, and it may be a single group that differs from the rest.
Checking the standard error:
In [ ]:
sqrt(0.08*(1-0.08)/125)   # SE of a sample proportion: sqrt(p(1-p)/n)
Using the binomial distribution, calculate the probability of at least 190 successes out of 200, given a success probability of 90%.
In [ ]:
sum(dbinom(190:200,200,0.90))
In [40]:
# note: dbinom is 0 for counts above the size argument, so only the x = 100 term contributes
sum(dbinom(100:131,100,0.11))
Out[40]:
In [51]:
#Required sample size for a desired margin of error (proportion)
CL = 0.95            # confidence level
p = 0.5              # conservative value when p-hat is unknown
ME = 0.04
z = round(qnorm((1-CL)/2,lower.tail=F),digits=2)
z**2*p*(1-p)/ME**2   # n >= z^2 p(1-p) / ME^2, round up
Out[51]:
Observing the likelihood of the sample:
In [84]:
# presumably 30 successes out of 250 (observed proportion 0.12); dbinom needs an integer count
dbinom(30, 250, 0.08)
Out[84]:
Screenshot taken from Coursera 04:33
In [98]:
n = 2254
p = 0.17
CL = 0.95
p_pop = 0.38
z_star = round(qnorm((1-CL)/2, lower.tail=F),digits=2)
print('Confidence Interval is ')
SE_CI = round(sqrt(p*(1-p)/n),digits=3)             # CI: SE uses the sample proportion
ME = z_star*SE_CI
c(p-ME, p+ME)
print('Hypothesis Testing is ')
SE_HT = round(sqrt((p_pop*(1-p_pop)/n)),digits=3)   # HT: SE uses the null proportion
pnorm(p, mean=p_pop,sd=SE_HT, lower.tail=p < p_pop) * 2   # two-sided
Out[98]:
Out[98]:
In [5]:
n = 3226
p = 0.24
CL = 0.95
p_pop = 0.2
z_star = round(qnorm((1-CL)/2, lower.tail=F),digits=2)
print('Confidence Interval is ')
SE_CI = round(sqrt(p*(1-p)/n),digits=5)
ME = z_star*SE_CI
c(p-ME, p+ME)
print('Hypothesis Testing is ')
SE_HT = round(sqrt((p_pop*(1-p_pop)/n)),digits=5)
pnorm(p, mean=p_pop,sd=SE_HT, lower.tail=p < p_pop) * 2
Out[5]:
Out[5]:
In [6]:
SE_HT
Out[6]:
In [7]:
SE_CI
Out[7]:
In [36]:
# pnorm() has no df argument; a runnable guess: p-hat = 4/100 tested against p = 0.11
pnorm(4/100, mean=0.11, sd=sqrt(0.11*(1-0.11)/100), lower.tail=F)
In [34]:
dbinom(7,100,0.11)
Out[34]:
In [51]:
#Required sample size for a desired ME, using a previously observed p-hat
CL = 0.95
p = 0.11
ME = 0.04
z = round(qnorm((1-CL)/2,lower.tail=F),digits=2)
z**2*p*(1-p)/ME**2
Out[51]:
In [96]:
n_1 = 90
p_1 = 0.38
n_2 = 122
p_2 = 0.5
CL = 0.95
avg_diff = p_1-p_2
z_star = round(qnorm((1-CL)/2, lower.tail=F),digits=2)
p_pool = round((p_1*n_1+p_2*n_2)/(n_1+n_2),digits=2)   # pooled proportion, since null = 0
p_population = p_pool
null = 0
print('Confidence Interval is ')
SE_CI = round(sqrt((p_1*(1-p_1)/n_1)+(p_2*(1-p_2)/n_2)),digits=3)
ME = z_star*SE_CI
c((p_1-p_2)-ME, (p_1-p_2)+ME)
print('p-value for hypothesis test is')
#ONE SIDED OR TWO SIDED?
SE_HT = round(sqrt((p_population*(1-p_population)/n_1)+(p_population*(1-p_population)/n_2)),digits=3)
pnorm(avg_diff, mean=null,sd=SE_HT, lower.tail=avg_diff < null) * 2
Out[96]:
Out[96]:
In [106]:
n_1 = 819
p_1 = 0.7
n_2 = 783
p_2 = 0.42
CL = 0.95
avg_diff = p_1-p_2
z_star = round(qnorm((1-CL)/2, lower.tail=F),digits=2)
# p_population = 0.3
p_pool = round((p_1*n_1+p_2*n_2)/(n_1+n_2),digits=2)
p_population = p_pool
null = 0
print('Confidence Interval is ')
SE_CI = round(sqrt((p_1*(1-p_1)/n_1)+(p_2*(1-p_2)/n_2)),digits=3)
ME = z_star*SE_CI
c((p_1-p_2)-ME, (p_1-p_2)+ME)
print('p-value for hypothesis test is')
#ONE SIDED OR TWO SIDED?
SE_HT = round(sqrt((p_population*(1-p_population)/n_1)+(p_population*(1-p_population)/n_2)),digits=3)
pnorm(avg_diff, mean=null,sd=SE_HT, lower.tail=avg_diff < null) * 2
Out[106]:
Out[106]:
In [1]:
n_1 = 120    # note: 493/n_1 exceeds 1, so one of these numbers is likely a typo
p_1 = 493/n_1
n_2 = 1028
p_2 = 596/n_2
CL = 0.95
SE = sqrt( (p_1*(1-p_1)/n_1)+(p_2*(1-p_2)/n_2) )
z_star = round(qnorm((1-CL)/2, lower.tail=F),digits=2)
ME = z_star*SE
c((p_1-p_2)-ME, (p_1-p_2)+ME)
Out[1]:
Based on the p-value and the 5% significance level, we fail to reject the null hypothesis and state that there is no difference between males and females in the likelihood of reporting that their kids are being bullied.
When referring to the population proportion, use p; for the sample proportion, use $\hat{p}$. Plug the population proportion into the standard error formula, but since it is almost always unknown, use the sample proportion instead.
For a proportion, the CLT states that the sampling distribution will be nearly normal, centered at the true population proportion, with standard error $SE = \sqrt{\frac{p(1-p)}{n}}$, as long as:
* Observations in the sample are independent of one another.
* There are at least 10 expected successes and 10 expected failures in the sample (a quick check is sketched below).
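A quick check of those conditions and the SE, with illustrative numbers:
In [ ]:
# Success-failure check and SE for a sample proportion (numbers made up)
n = 125; p = 0.08
c(n*p, n*(1-p))    # both should be at least 10
sqrt(p*(1-p)/n)    # SE = sqrt(p(1-p)/n)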
For a confidence interval, we use the sample proportion (if we already knew the true population proportion, it would be pointless to build an interval to capture it). For hypothesis testing, we have a null population proportion and incorporate it into the standard error calculation. For a numerical variable, the standard error doesn't involve the mean, it uses the standard deviation, so there is no such discrepancy between the confidence interval and hypothesis testing calculations.
When calculating the required sample size for a particular margin of error, if the sample proportion is unknown, we use 0.5. This has two advantages. First, if the categorical variable has only two levels, 0.5 is a fair judgment, the best uniform prior. Second, 0.5 gives the largest required sample size, the conservative choice, since $n \ge \frac{z^{*2}\, p(1-p)}{ME^2}$ is maximized at $p = 0.5$.
Calculating the standard error for the difference of two proportions differs between a confidence interval (or a hypothesis test with a null value other than zero) and a hypothesis test with a null value of zero. For the former, we combine the standard errors of both sample proportions. For a hypothesis test with null value zero, neither population proportion is known, so we use the pooled proportion: the combined successes divided by the combined sample sizes, $\hat{p}_{pool} = \frac{\text{successes}_1 + \text{successes}_2}{n_1 + n_2}$. The reason for this extra discrepancy is that, having assumed the proportions are equal across the two groups, we must use a common proportion that fits both.
We can make inference based on simulation if the success-failure condition is not met.
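A minimal simulation sketch with made-up numbers (n, the observed count, and the null proportion are all illustrative): simulate many samples under the null and see how often the result is at least as extreme as observed.
In [ ]:
# Simulation-based p-value when the success-failure condition fails
set.seed(42)
n = 20; observed = 4       # hypothetical small sample with 4 successes
p_null = 0.05              # hypothetical null proportion
sims = rbinom(10000, size=n, prob=p_null)
mean(sims >= observed)     # one-sided p-value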
In [73]:
n_1 = 83
p_1 = 0.71
n_2 = 1028
p_2 = 0.25
CL = 0.95
avg_diff = p_1-p_2
SE = round(sqrt((p_1*(1-p_1)/n_1)+(p_2*(1-p_2)/n_2)),digits=3)
z_star = round(qnorm((1-CL)/2, lower.tail=F),digits=2)
ME = z_star*SE
print('Confidence Interval is ')
c((p_1-p_2)-ME, (p_1-p_2)+ME)
print('p-value for hypothesis test is')
#ONE SIDED OR TWO SIDED? (one-sided here; for a null of 0, a pooled-proportion SE is the usual choice)
pnorm(avg_diff, mean=0.0,sd=SE, lower.tail=avg_diff < 0.0)
Out[73]:
Out[73]:
We are 95% confident that the proportion of Courserians who believe there should be a law banning gun possession is 36% to 56% higher than among the US population.
Screenshot taken from Coursera 03:53
Screenshot taken from Coursera 06:55
Using the chi-square distribution to calculate the p-value, given the degrees of freedom:
In [3]:
#Given the chi-square statistic and degrees of freedom, compute the p-value
pchisq(31.68,2,lower.tail=F)
Out[3]:
The chi-square GOF test checks whether a single categorical variable follows a hypothesized distribution. The null hypothesis states that the observed proportions follow the hypothesized population proportions (nothing is going on); the alternative states that they do not (something is indeed going on). For a one-way table, the expected count for each cell is the sample size times that cell's hypothesized proportion.
We calculate the chi-square statistic as $\chi^2 = \sum \frac{(O - E)^2}{E}$: for each cell, the squared difference of observed and expected divided by expected, summed over all cells. For one categorical variable, the degrees of freedom are k-1, where k is the number of groups. For two categorical variables, df = (R-1)(C-1), where R is the number of rows and C the number of columns.
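A worked sketch with made-up observed counts and hypothesized proportions:
In [ ]:
# Chi-square GOF on a hypothetical one-way table
observed = c(205, 26, 25, 19)
p_hyp = c(0.72, 0.07, 0.12, 0.09)    # hypothesized distribution
expected = sum(observed) * p_hyp     # expected = n * proportion
chi2 = sum((observed - expected)^2 / expected)
pchisq(chi2, df=length(observed)-1, lower.tail=F)
# equivalently: chisq.test(observed, p=p_hyp)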
The conditions for both the chi-square GOF and independence tests are that the observations are independent of one another, each expected cell count is at least 5, and the degrees of freedom are at least 2 (more than two outcome levels). If these conditions are not met, we use other methods, such as evaluating proportions. We then add up each cell's contribution to the chi-square statistic and use the statistic, the degrees of freedom, and lower.tail=F to obtain the p-value.
For the chi-square independence test, we test whether two categorical variables are independent of one another. We can't use confidence intervals for this problem, since we observe both variables at many levels (rather than one level of one variable). If the p-value is below the significance level, we reject the null hypothesis and conclude that the data provide convincing evidence that the two categorical variables are dependent; if it is above, we fail to reject and cannot claim dependence. A sketch follows.
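A sketch of the independence test on a made-up two-way table; chisq.test computes the expected counts and degrees of freedom itself:
In [ ]:
# Chi-square independence test on a hypothetical 2x3 table
tbl = matrix(c(154, 180, 104,
               132, 126, 131), nrow=2, byrow=TRUE)
chisq.test(tbl)    # df = (R-1)*(C-1) = 1*2 = 2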
In [2]:
#Calculating the slope: slope = R * (sy/sx)
sy = 5
sx = 4
R = 3   # illustrative placeholder; a real correlation lies between -1 and 1
sy/sx*R
#The regression line passes through the point of means:
#y - ybar = slope*(x - xbar)
#Finding the intercept: intercept = ybar - slope*xbar
slope = 0.726
x = 107
y = 129
print('The intercept is ')
y - slope*x
Out[2]:
Out[2]:
When observing the relationship, look at its direction, form, and strength, and check for outliers.
In [5]:
#Calculating adjusted R squared: 1 - (var_e/var_y)*(n-1)/(n-k-1)
var_e = 23.34   # variability of the residuals
var_y = 83.06   # variability of the response
n = 141         # sample size
k = 4           # number of predictors
1 - var_e/var_y * (n-1)/(n-k-1)
Out[5]:
In [8]:
#Calculating adjusted R squared
var_e = 3819.99
var_y = 15079.02
n = 252
k = 8
print('R Squared is ')
1 - var_e/var_y
print('Adjusted R squared is ')
1 - var_e/var_y * (n-1)/(n-k-1)
Out[8]:
Out[8]:
In [9]:
#CI for an MLR slope coefficient
CL = 0.95
SE = 0.12
pt_est = -0.08   # point estimate ('pt' would mask R's pt() function)
# a t critical value with df = n-k-1 would be exact; z is used as an approximation
z = round(qnorm((1-CL)/2,lower.tail=F),digits=2)
ME = z*SE
c(pt_est-ME,pt_est+ME)
Out[9]:
Finally, you should validate the conditions for MLR: a linear relationship between each numerical predictor and the response, nearly normal residuals, constant variability of residuals, and independent residuals.
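A sketch of the usual residual diagnostics on a made-up model (data, names, and coefficients are all illustrative):
In [ ]:
# Checking MLR conditions on a hypothetical fitted model
set.seed(42)
d = data.frame(x1=rnorm(100), x2=rnorm(100))
d$y = 1 + 2*d$x1 - d$x2 + rnorm(100)
m = lm(y ~ x1 + x2, data=d)
plot(m$fitted.values, m$residuals)       # constant variability, no pattern
hist(m$residuals)                        # nearly normal residuals
qqnorm(m$residuals); qqline(m$residuals)
plot(m$residuals)                        # independence: no trend in collection order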