Inference Means

Compare Paired data Means

  • Paired data should have equal number of samples

In [1]:
SL_CI = 0.05

n = 73
s = 14.26
avg_diff = 12.76

SE = round(s/sqrt(n),digits=2)
z = round(qnorm(SL_CI/2,lower.tail=F),digits=2)

print('confidence interval is')
ME = z*SE
print(c(avg_diff-ME,avg_diff+ME))
print('hypothesis testing is')
pnorm(avg_diff, mean=0,sd=SE, lower.tail=F) * 2


[1] "confidence interval is"
[1]  9.4868 16.0332
[1] "hypothesis testing is"
Out[1]:
[1] 2.160128e-14

Compare Independence Means


In [27]:
x1 = 41.8
s_1 = 15.14
n_1 = 505

x2 = 39.4
s_2 = 15.12
n_2 = 667


pt = x1 - x2
SL_CI = 0.05

SE = round(sqrt(s_1**2/n_1 + s_2**2/n_2),digits=2)
z = round(qnorm(SL_CI/2,lower.tail=F),digits=2)

print('confidence interval is')
ME = z*SE
print(c(pt-ME,pt+ME))
print('hypothesis testing is')
pnorm(avg_diff, mean=0,sd=SE, lower.tail=F) * 2


[1] "confidence interval is"
[1] 0.6556 4.1444
[1] "hypothesis testing is"
Out[27]:
[1] 1.283454e-46

HT "If there is no difference of work hours on average between college degree vs non college degree, there is 0.7% chance of obtaining random samples of 505 college and 667 non-college degree give average difference of work hours at least 2.4 hours"

Bootstrap

Screenshot taken from Coursera 03:46

Screenshot taken from Coursera 12:14


In [28]:
#95% = 1.96
#99% = 2.58

n = 100
mu = 882.515
# s = 89.5758
# z = 1.96
CL = 0.9
z = round(qnorm((1-CL)/2,lower.tail=F),digits=2)
# SE = s/sqrt(n)
SE = 89.5759
ME = z*SE


c(mu-ME,mu+ME)


Out[28]:
[1]  735.6105 1029.4195
  • samping distritbuion with replacement from population, where bootsrap with replacement from sample
  • Both are distributions of sample statistics. CLT can explicitly describe the distribution of the population, where bootstrap also describe that using one sample.

Bootstrap can be created by using with replacement in one sample. This is different from sampling distribution, where it takes with replacement from population.We can use percentile method; take 100 sample size bootstrap and cut off the sided for XX% interval, or calculate percentile based on the known condition that the distribution is normal, use point estimate bootstrap and standard error bootstrap.There's one weakness of bootstrap, is that when you have skew and sparse bootstrap distribution, it's not reliable.

Paired data, is when you have one observation dependent on other variable.We can use these set differences as a basis to use hypothesis testing and confidence interval.

t-distribution

Screenshot taken from Coursera 03:42

t-distribution can be two-sided test.

t-distribution is used when we have small sample size, less than 30. We use single parameter, degree of freedom to mitigate poor standard of error. We also verify that eventhough sample size is small, it's less than 10% population, and verify that the outliers it not too extreme(below two standard deviation).

Calculating t-score that is 1.65 standard deviation away, with degree of freedom 20. Looking for outer tail.


In [3]:
df=17
#NOT POINT ESTIMATE, BUT T-STATISTIC
t_statistic = 0.5
pt(t_statistic,df=df,lower.tail=F)


Out[3]:
[1] 0.3117426

Given confidence interval/significance level, and degree of freedom. find the cut off value, which is always positive.


In [1]:
CL = 0.95
n = 19
qt((1-CL)/2, df=n-1,lower.tail=F)


Out[1]:
[1] 2.100922

In [5]:
CL = 0.99
n = 12
qt((1-CL)/2, df=n-1,lower.tail=F)


Out[5]:
[1] 3.105807

Screenshot taken from Coursera 09:57

Screenshot taken from Coursera 15:49


In [24]:
xbar = 56
mu = 0
sd = 8
n = 20
CL = 0.9


SE = round(sd/sqrt(n),digits=2)
t_star = round((xbar-mu)/SE,digits=2)
z = round(qt ((1-CL)/2,lower.tail=F,df=n-1),digits=2)


print('Hypothesis Testing is ')


pt(t_star, df=n-1, lower.tail=xbar < mu)
print('Confidence Interval is ')
ME = round(z*SE, digits=2)
c(xbar-ME,xbar+ME)


[1] "Hypothesis Testing is "
Out[24]:
[1] 0.01900261
[1] "Confidence Interval is "
Out[24]:
[1] 52.9 59.1

In [12]:
t_star


Out[12]:
[1] -1.8

Since null hypothesis is not in the interval, CI agrees with HT in this case. Rejecting the null hypothesis, must be that null value not in the confidence interval.

Comparing Two Means with t-distribution (small sample size)


In [38]:
x1 = 52.1
s_1 = 45.1
n_1 = 22

x2 = 27.1
s_2 = 26.4
n_2 = 22


avg_diff =  x1-x2
SL_CI = 0.02

SE = round(sqrt(s_1**2/n_1 + s_2**2/n_2),digits=2)
z = round(qt(SL_CI/2,df=min(n_1-1,n_2-1),lower.tail=F),digits=2)

print('confidence interval is')
ME = z*SE
print(c(avg_diff-ME,avg_diff+ME))
print('hypothesis testing is')
# pnorm(avg_diff, mean=0,sd=SE, lower.tail=F) * 2
pt(avg_diff, df=min(n_1-1,n_2-1),lower.tail=avg_diff < 0) * 2


[1] "confidence interval is"
[1]  1.8288 48.1712
[1] "hypothesis testing is"
Out[38]:
[1] 4.150183e-17

In [10]:
x1 = 52.1
s_1 = 45.1
n_1 = 22

x2 = 27.1
s_2 = 26.4
n_2 = 22


avg_diff =  x1-x2
SL_CI = 0.02

SE = round(sqrt(s_1**2/n_1 + s_2**2/n_2),digits=2)
z = round(qt(SL_CI/2,df=min(n_1-1,n_2-1),lower.tail=F),digits=2)

print('confidence interval is')
ME = z*SE
print(c(avg_diff-ME,avg_diff+ME))
print('hypothesis testing is')
# pnorm(avg_diff, mean=0,sd=SE, lower.tail=F) * 2
pt(avg_diff, df=min(n_1-1,n_2-1),lower.tail=avg_diff < 0) * 2


[1] "confidence interval is"
[1] -3.0728 53.0728
[1] "hypothesis testing is"
Out[10]:
[1] 4.150183e-17

In [46]:
x1 = 248.3
s_1 = 2
n_1 = 10

x2 = 244.8
s_2 = 3
n_2 = 10


avg_diff =  x1-x2
SL_CI = 0.05

# SE = round(sqrt(s_1**2/n_1 + s_2**2/n_2),digits=2)

spooled = ((s_1**2*(n_1-1)) + (s_2**2*(n_2-1))) / (n_1+n_2-2)

SE = round(sqrt(spooled/n_1 + spooled/n_2),digits=2)

z = round(qt(SL_CI/2,df=min(n_1-1,n_2-1),lower.tail=F),digits=2)

print('confidence interval is')
ME = z*SE
print(c(avg_diff-ME,avg_diff+ME))
print('hypothesis testing is')
# pnorm(avg_diff, mean=0,sd=SE, lower.tail=F) * 2
pt(avg_diff, df=min(n_1-1,n_2-1),lower.tail=avg_diff < 0) * 2


[1] "confidence interval is"
[1] 0.9236 6.0764
[1] "hypothesis testing is"
Out[46]:
[1] 0.006723516

So we can say that we are 95% confident those with eating distractions consume between 1.83 to 48.17 grams more snack to those without distractions, on average.Since null value is not within the interval, CI agrees with HT

Simplify the calculation if population sd of two groups is similar

Screenshot taken from Coursera 08:39

ANOVA

  • the observations should be independent within and across groups
  • the data within each group are nearly normal
  • the variability across the groups is about equal and use graphical diagnostics to check if these conditions are met.

Recognize that the test statistic for ANOVA, the F statistic, is calculated as the ratio of the mean square between groups (MSG, variability between groups) and mean square error (MSE, variability within errors). Also recognize that the F statistic has a right skewed distribution with two different measures of degrees of freedom: one for the numerator (dfG=k−1, where k is the number of groups) and one for the denominator (dfE=n−k, where n is the total sample size). Note that you won’t be expected to calculate MSG or MSE from the raw data, but you should have a conceptual understanding of how they’re calculated and what they measure.

Screenshot taken from Coursera 08:42

F-statistics in ANOVA, larger means smaller p-value.Positive skewed because the variability between groups and among groups can never be negative.


In [42]:
n_group = 4
n_total = 795
f_value = 21.735
dfg = n_group -1
dft = n_total - 1
dfe = dft - dfg

pf(f_value,dfg,dfe,lower.tail=F)


Out[42]:
[1] 1.559855e-13

In [47]:
n_group = 3
n_total = 831
f_value = 3.47
dfg = n_group -1
dft = n_total - 1
dfe = dft - dfg

pf(f_value,dfg,dfe,lower.tail=F)


Out[47]:
[1] 0.0315703

Interpret ANOVA,

If p-value < $\alpha$, the data provide convincing evidence that at least one pair of population are different from each other (we can't tell which one).In this case we reject the null hypothesis.

If p-value > $\alpha$, the data doesn't provide convincing evidence that one pair of population means are different from each other, the observed difference is due to sampling variability(or by chance).In this case we fail to reject null hypothesis.

Since p-value is really small in this problem, we reject the null hypothesis, and conclude that the data is provide convincing evidence that at least one pair of population means are different from each other.

Benferroni Correction Where you want to modify significance level, to adapt with changing type 1 error. After you reject null hypothesis using ANOVA, you want to test which of the pair is contribute


In [44]:
k = 7
SL = 0.05

SL/((k*(k-1))/2)


Out[44]:
[1] 0.002380952

Screenshot taken from Coursera 08:19


In [30]:
x_1 = 6.76
n_1 = 331

x_2 = 5.07
n_2 = 41

MSE = 3.628
df_res = 791
null=0

T = (x_1-x_2-null)/sqrt(MSE/n_1+MSE/n_2)

2*pt(T,df_res,lower.tail=F)


Out[30]:
[1] 1.09828e-07

Therefore we reject the null hypothesis, and the data provide convincing evidence that average of vocabulary scores between self-reported middle and lower class Americans are different.

ANOVA is statiscal tools that lets you analyze many groups at once. Is it due to chance variability of particular groups compared to variability of others? Different from usual HT, in ANOVA you set null hypothesis to be equal means across groups. But in alternative, you want to observe at least one group is different. Two probability, either all equal, or at least one different.

Conditions for ANOVA is at independence for between groups and within groups. The distribution of each group is nearly normal, and you variability across groups also nearly normal. Use graphical tools like boxplot and summary statistics to make intuition about the data. F-statistic will be get by calculating MSG with MSE, the means will calculated by dividing the sum of squares with its corresponding degree of freedom.This is right skewed distribution, you won't be have negative number, which always yield positive number. Therefore we always use one sided p-value.

Recall that we have 5% type 1 error rate for one hypothesis test. Doing multiple test will pile up the error, so fix number of 5% is not enough. We have to always push down the error rate, and by doing that incorporate with K-value.It's possible that when you reject the null hypothesis(statiscally significant) in ANOVA, but end up not finding it when you test it between groups. Remember that ANOVA is about at least one is different. And you may only one group that is different, but not for the rest of the groups.

Inference Categorical

Checking the standard error,


In [ ]:
sqrt(0.08*(1-0.08)/125)

Using, binomial, calculate the probability of success at least 190 out of 200, given that proportion of success is 90%.


In [ ]:
sum(dbinom(190:200,200,0.90))

In [40]:
sum(dbinom(100:131,100,0.11))


Out[40]:
[1] 1.378061e-96

In [51]:
#Required sample size proportion for desired ME

SL_CI = 0.95
p = 0.5
z_star = 1.96
ME = 0.04
z = round(qnorm((1-SL_CI)/2,lower.tail=F),digits=2)


z**2*p*(1-p)/ME**2


Out[51]:
[1] 600.25

observed the likelihood of sample


In [84]:
dbinom(30*0.12,250,0.08)


Warning message:
In dbinom(30 * 0.12, 250, 0.08): non-integer x = 3.600000
Out[84]:
[1] 0

Screenshot taken from Coursera 04:33


In [98]:
n = 2254
p = 0.17
CL = 0.95
p_pop = 0.38

z_star = round(qnorm((1-CL)/2, lower.tail=F),digits=2)

print('Confidence Interval is ')
SE_CI = round(sqrt(p*(1-p)/n),digits=3)
ME = z_star*SE_CI
c(p-ME, p+ME)
print('Hypothesis Testing is ')
SE_HT = round(sqrt((p_pop*(1-p_pop)/n)),digits=3)
pnorm(p, mean=p_pop,sd=SE_HT, lower.tail=p < p_pop) * 2


[1] "Confidence Interval is "
Out[98]:
[1] 0.15432 0.18568
[1] "Hypothesis Testing is "
Out[98]:
[1] 6.558556e-98

In [5]:
n = 3226
p = 0.24
CL = 0.95
p_pop = 0.2

z_star = round(qnorm((1-CL)/2, lower.tail=F),digits=2)

print('Confidence Interval is ')
SE_CI = round(sqrt(p*(1-p)/n),digits=5)
ME = z_star*SE_CI
c(p-ME, p+ME)
print('Hypothesis Testing is ')
SE_HT = round(sqrt((p_pop*(1-p_pop)/n)),digits=5)
pnorm(p, mean=p_pop,sd=SE_HT, lower.tail=p < p_pop) * 2


[1] "Confidence Interval is "
Out[5]:
[1] 0.2252608 0.2547392
[1] "Hypothesis Testing is "
Out[5]:
[1] 1.332703e-08

In [6]:
SE_HT


Out[6]:
[1] 0.00704

In [7]:
SE_CI


Out[7]:
[1] 0.00752

In [36]:
pnorm(4,mean=0.11,df=99,lower.tail=F)


Error in pnorm(4, mean = 0.11, df = 99, lower.tail = F): unused argument (df = 99)

In [34]:
dbinom(7,100,0.11)


Out[34]:
[1] 0.06128364

In [51]:
#Required sample size proportion for desired ME

SL_CI = 0.95
p = 0.5
z_star = 1.96
ME = 0.04
z = round(qnorm((1-SL_CI)/2,lower.tail=F),digits=2)


z**2*p*(1-p)/ME**2


Out[51]:
[1] 600.25

In [ ]:


In [51]:
#Required sample size proportion for desired ME

SL_CI = 0.95
p = 0.11
z_star = 1.96
ME = 0.04
z = round(qnorm((1-SL_CI)/2,lower.tail=F),digits=2)


z**2*p*(1-p)/ME**2


Out[51]:
[1] 600.25

In [96]:
n_1 = 90
p_1 = 0.38

n_2 = 122
p_2 = 0.5

CL = 0.95

avg_diff = p_1-p_2


z_star = round(qnorm((1-CL)/2, lower.tail=F),digits=2)
p_population = 0.3
p_pool = round((p_1*n_1+p_2*n_2)/(n_1+n_2),digits=2)
p_population = p_pool
null = 0


print('Confidence Interval is ')
SE_CI = round(sqrt((p_1*(1-p_1)/n_1)+(p_2*(1-p_2)/n_2)),digits=3)
ME = z_star*SE_CI
c((p_1-p_2)-ME, (p_1-p_2)+ME)
print('p-value for hypothesis test is')
#ONE SIDED OR TWO SIDED?
SE_HT = round(sqrt((p_population*(1-p_population)/n_1)+(p_population*(1-p_population)/n_2)),digits=3)
pnorm(avg_diff, mean=null,sd=SE_HT, lower.tail=avg_diff < null) * 2


[1] "Confidence Interval is "
Out[96]:
[1] -0.25328  0.01328
[1] "p-value for hypothesis test is"
Out[96]:
[1] 0.08201182

In [106]:
n_1 = 819
p_1 = 0.7

n_2 = 783
p_2 = 0.42

CL = 0.95

avg_diff = p_1-p_2


z_star = round(qnorm((1-CL)/2, lower.tail=F),digits=2)
# p_population = 0.3
p_pool = round((p_1*n_1+p_2*n_2)/(n_1+n_2),digits=2)
p_population = p_pool
null = 0


print('Confidence Interval is ')
SE_CI = round(sqrt((p_1*(1-p_1)/n_1)+(p_2*(1-p_2)/n_2)),digits=3)
ME = z_star*SE_CI
c((p_1-p_2)-ME, (p_1-p_2)+ME)
print('p-value for hypothesis test is')
#ONE SIDED OR TWO SIDED?
SE_HT = round(sqrt((p_population*(1-p_population)/n_1)+(p_population*(1-p_population)/n_2)),digits=3)
pnorm(avg_diff, mean=null,sd=SE_HT, lower.tail=avg_diff < null) * 2


[1] "Confidence Interval is "
Out[106]:
[1] 0.23296 0.32704
[1] "p-value for hypothesis test is"
Out[106]:
[1] 4.077335e-29

In [1]:
n_1 = 120
p_1 = 493/n_1

n_2 = 1028
p_2 = 596/n_2



CL = 0.95

SE = sqrt(   (p_1*(1-p_1)/n_1)+(p_2*(1-p_2)/n_2)   )
z_star = round(qnorm((1-CL)/2, lower.tail=F),digits=2)
ME = z_star*SE

c((p_1-p_2)-ME, (p_1-p_2)+ME)


Out[1]:
[1] -0.14718610 -0.06152731

Based on the p-value and 5% significance level, we would failed to reject null hypothesis, and states there is no difference between males and females with respect to likelihood reporting their kids to being bullied

When defining population proportion, you use p. When you define sample proportion, you use $\hat{p}$. Plug population proportion to standard error formula. But since it almost always not known, use sample proportion.

For proportion, CLT states that the distribution of sample distribution will be nearly normal, centered at the true population proportion,with standard error as long as:

* Observations in the sample are independent to one another.
* At least 10 expected success and 10 expected failures in the observations.

For confidence interval, we use sampled proportion (if we already know the true population proportion, it's useless to build an interval to capture it). For hypothesis testing, we have true population,and incorporate it to our standard error calculation.For numerical variable, standard error doesn't incorporate mean, it uses standard deviation. So it doesn't have discrepancy for computing confidence interval and hypothesis testing.

When calculating required sample size for particular margin of error, if sampled proportion is unknown, we use 0.5. This have advantage in two ways. First, if categorical variable only have two levels, we have fair judgement, best prior uniform. Second, 0.5 will gives us the largest sample size.

Calculating for standard error of two categorical variable, testing the difference, is different when we have confidence interval or hypothesis testing that have null value other than zero. We join standard error of both propotion of categorical variable. But for hypothesis testing that have null value zero, both of categorical variable proportion is not known. Hence we use pool proportion, joining successes divided by sample size of both categorical variables. The reason behind another discrepancy for hypothesis testing with null value zero, is that assumed that proportions are equal for levels in categorical variable, we have to use common proportions that fit both levels.

We can make inference based on simulation. If success and failure condition is not met

  • The focus here is the p-value. Remember p-value is probability of observing at least as favorable to the outcome given the null hypothesis is true.
  • Devise the simulation that assumes null hypothesis is true. Since we have two mutually exclusive outcome, we can use head/tail of fair coin(use fair because we assume null hypothesis is true, proportions are equal). So we flip coin 8 times and record the proportion of heads come out(success).
  • Repeat the simulation of N-times takes record relevant sample statistics, in this case the proportion.
  • Calculate p-value as the probability of at least favorable to the observed outcome. What is the probability that the proportions of heads is at least as an observed outcome. So what are the probability that simulations head comes at least 8 times. Remember that proportion hat is correctly guess of all 8 times, so the p hat is 1.

Focusing on one level


In [73]:
n_1 = 83
p_1 = 0.71

n_2 = 1028
p_2 = 0.25

CL = 0.95

avg_diff = p_1-p_2

SE = round(sqrt((p_1*(1-p_1)/n_1)+(p_2*(1-p_2)/n_2)),digits=3)
z_star = round(qnorm((1-CL)/2, lower.tail=F),digits=2)
ME = z_star*SE

print('Confidence Interval is ')
c((p_1-p_2)-ME, (p_1-p_2)+ME)
print('p-value for hypothesis test is')
#ONE SIDED OR TWO SIDED?
pnorm(avg_diff, mean=0.0,sd=SE, lower.tail=avg_diff < 0.0)


[1] "Confidence Interval is "
Out[73]:
[1] 0.35808 0.56192
[1] "p-value for hypothesis test is"
Out[73]:
[1] 4.529368e-19

we are 95% confident that proportion of Courserians is 36% to 56% higher than US that believe there should be law for banning gun possesion

Chi-Square Independence Test

Screenshot taken from Coursera 03:53

Screenshot taken from Coursera 06:55

Using chi-square to calculate the p-value, given degree of freedom


In [3]:
#Given chisquare statistic and degree of freedom, compute the chi-square

pchisq(31.68,2,lower.tail=F)


Out[3]:
[1] 1.320613e-07

Chi-Square GOF

  • Expected cell size at least 5
  • chi-square is the same, except with GOF, df is (n_cells-1)

Chi-Square Independence Test

  • Expected counts for each cell at least 5
  • Degree of freedom calculated is more than 1

Summary:

Chi-square GOF test one single categorical variable whether it follows hypothesized distribution or not. Null hypothesis states that the observed proportion follows population proportion, and there isn't something going on. On the other hand,alternative hypothesis states the observed proportion doesn't follow population proportion, and there is indeed something going on.For one way table, we can calculate expected counts for each cell, by sample sized times each of proportion in hypothesized distribution.

We can calculate chi-square statistics by calculate the difference of observed and expected squared, divided by expected, and sum all of the cells. For one categorical variable, degree of freedom can be calculated by k-1, as k is the number of groups. For two categorical variable, dof calculated as (R-1)x(C-1) where R is number of rows, and C is number of columns.

The conditions for both chisquare GOF and independence test, is that observations are independent of one another. The expected counts for each cell is at least 5. And degree of freedom is at least two(more than 2 levels outcome). If this condition is not met, we use other methods, such as evaluating proportions.We then calculate each cell incorporate it into chi-square statistic, then using the statistics, degree of freedom and lower tail false to obtain p-value.

For chi-square independence test, we test two categorical variables, whether they independent or dependent of one another. We can't use confidence intervals for this problem, since we observe both variables with many levels(not observe one level in one of the variables). If p-value is above significance level, we failed to reject null hypothesis, and conclude that the data provide strong evidence that both categorical variables are indeed dependent.

Linear Regression


In [2]:
#Calculating the slope for x
sy = 5
sx = 4
# mx - 
R = 3

#To calculate the slope....
sy/sx*R

#formula for linear regression
#y-yo = slope(x-xo) where xo/yo is the mean
#Finding the intercept
slope = 0.726
x = 107
y = 129
print('The intercept is ')
y - slope*x


Out[2]:
[1] 3.75
[1] "The intercept is "
Out[2]:
[1] 51.318

When observing the relationship, observe on:

  • Linear/not?
  • Strong/Weak?
  • Positive/Negative?
  • See the sign of correlation by looking at the direction trend of the data.
  • Interpreting the residuals : There are 6.14% more bike riders wearing helment than predicted by the model in this neighborhhood.

Multiple Linear Regression


In [5]:
#Calculating adjusted R squared

var_e = 23.34
var_y = 83.06
n = 141
k = 4

1 - var_e/var_y * (n-1)/(n-k-1)


Out[5]:
[1] 0.7107336

In [8]:
#Calculating adjusted R squared

var_e = 3819.99
var_y = 15079.02
n = 252
k = 8

print('R Squared is ')
1 - var_e/var_y

print('Adjusted R squared is ')
1 - var_e/var_y * (n-1)/(n-k-1)


[1] "R Squared is "
Out[8]:
[1] 0.7466686
[1] "Adjusted R squared is "
Out[8]:
[1] 0.7383284

In [9]:
#CI MLR

CL = 0.95
SE = 0.12
pt = -0.08

z = round(qnorm((1-CL)/2,lower.tail=F),digits=2)

ME = z*SE
c(pt-ME,pt+ME)


Out[9]:
[1] -0.3152  0.1552

Finally, you should validate the conditions for MLR.

  • You want each numerical to have linearity with the response, and validate that residuals are random scatter around zero, constant variability, normally distributed, and each is independent of one another.
  • Scatter plot each numerical and residuals, check linearity.
  • Histogram or probability plot to check residuals distribution normal
  • Scatter plot residuals-predicted, check residuals constant variability.
  • Scatter plot residuals-index check residuals is independent of one another.

In [ ]:


In [ ]: