Statistics Cheat Sheet


In [67]:
%load_ext rpy2.ipython


The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
  • Associated variables are also called dependent variables. Two variables cannot be both associated and independent at the same time.
  • Anecdotal evidence is collected in a haphazard fashion. It represents only one or two cases, it is not known whether it generalizes to the population, and it often reflects extreme cases.
  • A prospective study collects information as events unfold; a retrospective study collects information about events that happened in the past.
  • Stratified sampling. Advantage: cases within each stratum are similar with respect to the outcome of interest. Disadvantage: more complex to analyze than a simple random sample (SRS).
  • Cluster sampling. Advantages: more economical, and does not require strata like stratified sampling; useful when cases within each cluster are diverse but the clusters themselves are similar to one another. Disadvantage: the most complex technique to analyze.
  • Randomized experiment:
    • Controlling. Researchers do their best to control differences between the groups.
    • Randomization. Randomly assign patients to groups, to average out hidden lurking variables.
    • Replication. Collect a large sample, or replicate the entire previous study.
    • Blocking. Divide the suspected variable into blocks, then randomly assign subjects within each block to the treatment and placebo groups.
  • The control group receives a fake treatment, the placebo.
  • The T on a boxplot is called a whisker; it tries to capture the data outside the box, extending at most 1.5 × IQR below Q1 and above Q3. Points outside the whiskers are outliers.
  • A contingency table summarizes (often as frequencies) two categorical variables. A frequency table shows one categorical variable.
  • To identify the most useful row/column proportions in a contingency table: when we classify based on the first variable, we take proportions within the second variable.
  • If two variables are independent, we expect the proportions of one variable to be equal regardless of the other. In practice there will usually be small differences that occur due to chance (see the sketch below).
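
A minimal R sketch of these proportions (the counts below are hypothetical, not from the course): build a contingency table, then take proportions within the second variable.

%%R

# Hypothetical 2x2 contingency table: gender (rows) by handedness (columns)
tbl = matrix(c(43, 9, 44, 4), nrow=2, byrow=TRUE,
             dimnames=list(gender=c("male","female"),
                           hand=c("right","left")))

tbl                 # frequencies
prop.table(tbl, 1)  # row proportions: within each gender, proportion of handedness
prop.table(tbl)     # each cell as a proportion of the total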

Probability

  • For disjoint events, the probability that one of them occurs is P(1) + P(2) + ...; for example, the outcomes of a die are disjoint and their probabilities add up to 1.
  • Event: a set of outcomes.
  • General addition rule, for disjoint vs. not disjoint events. If the events are disjoint, P(A and B) is automatically zero.
P (A or B) = P (A) + P (B) − P (A and B)
  • Probability distributions: the outcomes are all disjoint, each probability is between 0 and 1, and the probabilities add up to 1.
  • Sample space: the list of all possible outcomes. For a die, if the event is {2,3}, the complement is {1,4,5,6}.
  • Independence allows two events to occur at the same time. This gives the multiplication rule for the probability that A and B occur together:
P(A and B) = P(A) × P(B)

  • For example, since gender, handedness, and person order (first vs. third person) are independent events, we can multiply them together: P(a given person is female and right-handed) = P(female) × P(right-handed). For the first two persons male right-handed and the third female left-handed:
P(MR)² × P(FL)


  • Marginal probability: a row/column total as a proportion of the total population; it does not consider the other variable.
  • Joint probability: the intersection of two variables as a proportion of the total population.
  • Conditional probability:
P (A|B) = P (A and B) / P (B), or equivalently P (A and B) = P (A|B) × P (B)

  • We can use the general multiplication rule in conjunction with marginal and conditional probabilities.
  • With replacement, the same case can be picked over and over again, which makes every pick independent. Without replacement, the picks become dependent events (see the sketch below).
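
A quick R sketch of the difference, using the usual deck-of-cards example (my choice of example, not from the course): the probability of drawing two aces.

%%R

# With replacement: the first card is put back, so the picks are independent
(4/52) * (4/52)

# Without replacement: only 3 aces remain among 51 cards, so the picks are dependent
(4/52) * (3/51)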

Distributions

For the z-score:


In [3]:
%%R

mean = 1500
sd = 300
d = 1800      # observation
(d-mean)/sd   # z-score: distance from the mean in standard deviations


[1] 1

To calculate a percentile:


In [9]:
%%R

mean = 1500
sd = 300
point = 2100
LT = T

pnorm(point,mean=mean,sd=sd,lower.tail=LT)


[1] 0.9772499

In [11]:
%%R

mean = 1500
sd = 300
percentile = 0.4
LT = T

qnorm(percentile,mean=mean,sd=sd,lower.tail=LT)


[1] 1423.996

pnorm over a range


In [30]:
%%R

mean = 70
sd = 3.3
lower = 69
upper = 74
LT = T
# IN RANGE: P(lower < X < upper)
1 - pnorm(upper,mean=mean,sd=sd,lower.tail=!LT) - pnorm(lower,mean=mean,sd=sd,lower.tail=LT)

# OUT RANGE: P(X < lower or X > upper)
# pnorm(upper,mean=mean,sd=sd,lower.tail=!LT) + pnorm(lower,mean=mean,sd=sd,lower.tail=LT)


[1] 0.5063336
  • The standard normal distribution has mean 0 and sd 1; it is the default for pnorm and qnorm in R.
  • Always draw the normal distribution and shade the area we're interested in.
  • When evaluating a distribution with a normal probability plot, imagine a straight line. If the data bend away from the line at the high end, the distribution is right skewed; if they bend at the low end, it is left skewed.
  • To construct a normal probability plot, compute each observation's percentile and the corresponding Z-score, then scatter-plot the observation (vertical) against that Z-score (horizontal).
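
R draws this plot directly with qqnorm (note that R also puts the theoretical quantiles on the horizontal axis); a minimal sketch with simulated data:

%%R

set.seed(1)
x = rnorm(100, mean=1500, sd=300)  # simulated sample

qqnorm(x)  # normal probability plot
qqline(x)  # reference line; systematic bends away from it indicate skew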

Binomial

Screenshot taken from Coursera video, 09:53


In [5]:
%%R

k = 8     # number of successes
n = 10    # number of trials
p = 0.3   # probability of success on each trial

dbinom(k, size=n, prob=p)


[1] 0.001446701

Requirements

  • The trials are independent.
  • The number of trials, n, is fixed.
  • Each trial outcome can be classified as a success or failure.
  • The probability of a success, p, is the same for each trial.

For calculating combinations,


In [25]:
%%R
k_success = 8
trials = 8
choose(trials, k_success)  # number of ways to place k successes among n trials


[1] 1

Screenshot taken from Coursera video, 12:07

If the sample is in fact not sufficiently large, we cannot take advantage of the normal approximation. The only way then is to compute the binomial probability for each value manually:

P(K < 2) = P(K = 0) + P(K = 1)
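
In R this manual sum is dbinom evaluated at each value, or equivalently pbinom for the cumulative probability; a sketch using the n and p from the cell above:

%%R

n = 10
p = 0.3

# P(K < 2) = P(0) + P(1), summed manually
sum(dbinom(0:1, size=n, prob=p))

# the same probability from the cumulative distribution
pbinom(1, size=n, prob=p)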

When we know the mean and standard deviation of the binomial distribution, we can approximate the probability with the normal distribution using the z-score.
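
For a binomial distribution with n trials and success probability p:

$$\mu = np, \quad \sigma = \sqrt{np(1-p)}, \quad z = \frac{x - \mu}{\sigma}$$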

Screenshot taken from Coursera video, 09:59

If all the binomial requirements are met, the following also works.


In [75]:
import numpy as np

n = 40
p = 0.35
# sd of a binomial: sqrt(n*p*(1-p))
sd = round(np.sqrt(n*p*(1-p)),2)

# expected value (mean) of a binomial: n*p
print 'Expected Value of point success is', n*p
print 'standard deviation is', sd
print 'variance is' , sd**2


Expected Value of point success is 14.0
standard deviation is 3.02
variance is 9.1204

In [74]:
import numpy as np

n = 2500
p = 0.7
sd = round(np.sqrt(n*p*(1-p)),2)

print 'Expected Value of point success is', n*p
print 'standard deviation is', sd
print 'variance is' , sd**2


Expected Value of point success is 1750.0
standard deviation is 22.91
variance is 524.8681

To check whether the data is sufficiently large,


In [19]:
# Success-failure condition for the normal approximation
p = 0.01
n = 300

assert (n*p >= 10 and n*(1-p) >= 10), 'Not large enough to take advantage of normal approximation'


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-19-171ee718fcdd> in <module>()
      2 n = 300
      3 
----> 4 assert (n*p >= 10 and n*(1-p) >= 10), 'Not large enough to take advantage of normal approximation'

AssertionError: Not large enough to take advantage of normal approximation

For the normal approximation to the binomial, and for taking the probability of a range of values:


In [37]:
%%R

val = 59
mu = 80
sd = 8
tail = T

# Continuity correction: add/subtract 0.5 to val when looking at a small
# range of observations, e.g. pnorm(val + ifsmall, ...)
ifsmall = (-0.5+!tail)

pnorm(val, mean=mu, sd=sd, lower.tail=tail)


[1] 0.004332448

Because the binomial is discrete while the normal is continuous, there is no exact value to plug in; we apply a continuity correction of 0.5 (see the sketch below).
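
A sketch comparing the exact binomial probability with the normal approximation, with and without the 0.5 correction (the numbers here are my own, not from the course):

%%R

n = 100
p = 0.8
mu = n*p              # 80
sd = sqrt(n*p*(1-p))  # 4

pbinom(75, size=n, prob=p)   # exact: P(K <= 75)
pnorm(75, mean=mu, sd=sd)    # normal approximation, no correction
pnorm(75.5, mean=mu, sd=sd)  # with the 0.5 continuity correction (closer to exact)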

CLT

Requirements

  • The observations are independent.
  • Random sampling/assignment.
  • Sample size larger than 30, or the population distribution is normal; when sampling without replacement, sample less than 10% of the population to avoid dependence.
  • Or plot the population/sample distribution and check whether it looks normal.
  • (The larger n is, the less strict the shape requirement; see the simulation sketch below.)
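
A simulation sketch of the CLT (my own illustration, not from the course): sample means drawn from a strongly right-skewed population still look approximately normal.

%%R

set.seed(1)
pop = rexp(100000, rate=1)  # right-skewed population, mean 1

# 1000 sample means with n = 50 each
means = replicate(1000, mean(sample(pop, 50)))

mean(means)                  # close to the population mean, 1
sd(means)                    # close to sigma/sqrt(n) = 1/sqrt(50)
qqnorm(means); qqline(means) # roughly straight line: approximately normal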

Confidence Interval

Snapshot taken from Coursera 04:26

Let us state that:

n = 50
mu  =  3.2
s = 1.74
z = 1.96

$$SE = \frac{s}{\sqrt{n}} = \frac{1.74}{\sqrt{50}} \approx 0.246$$

$$\bar{x} \pm z \times SE = 3.2 \pm 1.96 \times 0.246$$

$$3.2 \pm 0.48 = (2.72, 3.68)$$

CI formula


In [76]:
%%R

#95% = 1.96
#99% = 2.58

n = 50
mu = 3.2
s = 1.74
# z = 1.96
CL = 0.95
z = round(qnorm((1-CL)/2,lower.tail=F),digits=2)
SE = s/sqrt(n)
ME = z*SE


c(mu-ME,mu+ME)


[1] 2.717697 3.682303

Sample Size for ME


In [82]:
%%R

CL = 0.9
ME = 4
sd = 18

#CONFIDENCE LEVEL
#95% = 1.96
#99% = 2.58
#90% = 1.65
#####

z_star = round(qnorm((1-CL)/2,lower.tail=F),digits=2)
#z_star = 1.65
((z_star*sd)/ME)**2


[1] 77.7924
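
Since the sample size must be a whole number, round up: here at least 78 subjects are needed to keep the margin of error within 4.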


Hypothesis Testing

Snapshot taken from Coursera 14:01

P-value

Snapshot taken from Coursera 05:19

Interpreting the p-value:

  • There is an X% chance that a random sample of n observations will yield a sample mean as extreme as the one observed, assuming H0 is true.
  • If that chance is higher than the significance level, the result could have happened by chance, so we fail to reject H0.

This is a one-sided hypothesis test; a two-sided test doubles the tail area: 0.209 × 2 = 0.418.

To calculate a hypothesis test for a population mean, two-sided (drop the ×2 for a one-sided test):


In [71]:
%%R

xbar = 118.2
mu = 100
sd = 6.5
n = 36

SE = round(sd/sqrt(n),digits=2)

# take the tail beyond xbar (away from mu), then double it for the two-sided p-value
pnorm(xbar, mean=mu, sd=SE, lower.tail=xbar < mu) * 2


[1] 1.016802e-63

Type 1 and Type 2 Errors

  • Type 1 error: rejecting H0 when H0 is actually true. Its probability equals the significance level; as the significance level increases, Type 1 errors increase.
  • Type 2 error: failing to reject H0 when HA is actually true. Its probability is the complement of power; as power increases, Type 2 errors decrease.
  • If a Type 1 error is more dangerous, use a lower significance level; if a Type 2 error is more dangerous, use a higher significance level.
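
In symbols:

$$P(\text{Type 1 error}) = \alpha, \quad P(\text{Type 2 error}) = \beta, \quad \text{Power} = 1 - \beta$$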

Statistical vs. Practical Significance

Statistical significance means the z-scores you calculate are different; practical significance asks whether the impact is actually different. Z-scores of 5 and 100 are statistically different, but practically the same: both produce p-values of approximately zero. So be careful not to increase the sample size too much, because beyond some point it is practically useless and wastes resources.

  • Use a proportion when the data are categorical; use a mean when the data are numerical.
  • The median best represents a typical observation, especially when the distribution is skewed.
  • Use stratified sampling as an alternative way to decrease the width of the interval.
  • Remember that the null value is always about the population parameter.
  • If the null value is rejected in a hypothesis test, the confidence interval should not contain the null value.
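
For example, using the interval computed earlier: the 95% CI (2.72, 3.68) contains the null value 3, so a two-sided test at the 5% significance level would fail to reject H0: mu = 3.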