Terminology
- Null Hypothesis
- Alternative Hypothesis
- p-value (the probability, assuming the null hypothesis is true, of observing a statistic at least as extreme as the one computed, just by chance)
- Bootstrap
- Acceptance Region
- Rejection Region
- t-test (see the sketch after this list)
- One-tailed test
- Two-tailed test
- Significance test
- Confidence interval
- Power of a test
- Type I error (rejecting the null hypothesis when it is true). Also called a false positive.
- Type II error (failing to reject the null hypothesis when it is false). Also called a false negative.
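Many of these terms show up together in a single two-sample t-test. Below is a minimal sketch in Python; the simulated sales numbers are an illustrative assumption, borrowing the shoe-sales framing used later in these notes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
before = rng.normal(loc=100, scale=15, size=30)  # sales before the price change
after = rng.normal(loc=110, scale=15, size=30)   # sales after the price change

# Null hypothesis: the two means are equal.
# Two-tailed test: the alternative is "the means differ in either direction".
t_stat, p_two = stats.ttest_ind(after, before)

# One-tailed test: the alternative is "mean after > mean before".
_, p_one = stats.ttest_ind(after, before, alternative='greater')

alpha = 0.05  # significance level: the Type I error rate we accept
print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")
if p_two < alpha:
    print("statistic falls in the rejection region: reject the null hypothesis")
else:
    print("statistic falls in the acceptance region: fail to reject the null")
```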
Some Practical Thoughts
- Data can be biased, and confidence intervals computed from biased data may not be representative.
- One way to handle biased data is to use bias-corrected confidence intervals, e.g., the BCa (bias-corrected and accelerated) bootstrap interval (see the first sketch after this list).
- Outliers can impact confidence intervals.
- People too often simply remove outliers, but outliers may encode necessary information.
- One way to handle outliers is to use ranks instead of the raw values (see the rank-transformation sketch after this list).
- If the sample size is small, bootstrapping underestimates the width of the confidence interval.
- It is better to use significance testing if the sample size is small.
- Bootstrapping should not be used to estimate extreme order statistics (e.g., the maximum sales of shoes, the 5th-largest sales of shoes, etc.).
- Use a rank transformation before bootstrapping if the data has outliers.
- Lack of representativeness is a problem for any statistical technique.
- The experiment should be randomized (e.g., when doing A/B testing, randomly assign subjects to groups, as in the assignment sketch after this list). Experimental bias can lead to wrong inferences.
- Resampling time-series data is tricky: the assumption we relied on, that each data point is independent, does not hold for time series. A block bootstrap, sketched after this list, is one common workaround.
- A rank transformation changes the question being asked. For our shoe-sales example, a rank-transformed analysis asks: "Do sales tend to be higher after price optimization?" (Our original analysis asked: "Do post-price-optimization sales have a higher mean?")
- The power of a test increases as the sample size increases (see the power sketch after this list).
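A minimal sketch of a bias-corrected confidence interval, using SciPy's `stats.bootstrap` with the BCa method; the skewed sales data are an illustrative assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sales = rng.lognormal(mean=4.0, sigma=0.5, size=200)  # skewed sales data

res = stats.bootstrap(
    (sales,),                 # data must be passed as a sequence of samples
    np.mean,                  # the statistic whose interval we want
    n_resamples=9999,
    confidence_level=0.95,
    method='BCa',             # bias-corrected and accelerated interval
    random_state=rng,
)
print(f"95% BCa CI for the mean: "
      f"({res.confidence_interval.low:.2f}, {res.confidence_interval.high:.2f})")
```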
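A minimal sketch of the rank transformation, using `scipy.stats.rankdata`: pool the two samples, replace values with their ranks, and compare mean ranks instead of mean sales. The two small samples, including one huge outlier that would dominate a comparison of raw means, are illustrative assumptions.

```python
import numpy as np
from scipy import stats

before = np.array([120.0, 95.0, 110.0, 130.0, 105.0])
after = np.array([125.0, 140.0, 115.0, 135.0, 9_800.0])  # one huge outlier

ranks = stats.rankdata(np.concatenate([before, after]))  # pooled ranks, 1..10
rank_before, rank_after = ranks[:5], ranks[5:]

# The outlier is now just "rank 10", so it cannot dominate the comparison.
print(f"mean rank before = {rank_before.mean():.1f}, "
      f"mean rank after = {rank_after.mean():.1f}")
```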
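A minimal sketch of random assignment for an A/B test, using a NumPy shuffle; the subject IDs and group sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
subjects = np.arange(1000)            # hypothetical subject IDs
shuffled = rng.permutation(subjects)  # random order makes assignment independent
                                      # of any subject attribute
group_a, group_b = shuffled[:500], shuffled[500:]
```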
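For time series, a block bootstrap resamples contiguous blocks rather than individual points, so short-range dependence is preserved within each block. A minimal sketch of a moving-block bootstrap; the block length and the simulated random walk are illustrative assumptions.

```python
import numpy as np

def moving_block_bootstrap(series, block_len, rng):
    """Return one bootstrap resample built from random contiguous blocks."""
    n = len(series)
    starts = rng.integers(0, n - block_len + 1, size=int(np.ceil(n / block_len)))
    blocks = [series[s:s + block_len] for s in starts]
    return np.concatenate(blocks)[:n]  # trim to the original length

rng = np.random.default_rng(1)
ts = np.cumsum(rng.normal(size=365))  # a random walk: points are NOT independent
resample = moving_block_bootstrap(ts, block_len=14, rng=rng)
print(resample[:5])
```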
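A minimal sketch of the power/sample-size relationship, using statsmodels' `TTestIndPower` for a two-sample t-test; the effect size (Cohen's d = 0.5) and alpha are illustrative assumptions.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n in (10, 30, 100, 300):
    # Leaving power unspecified makes solve_power solve for it.
    power = analysis.solve_power(effect_size=0.5, nobs1=n, alpha=0.05)
    print(f"n per group = {n:>3}: power = {power:.2f}")
```

The same call can be turned around: leave `nobs1=None` and set `power=0.8` to solve for the sample size needed to reach a target power.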
Types of Error
- Sampling Bias (the sample systematically over- or under-represents parts of the population)
- Measurement Error (error introduced by the instrument or procedure used to record the data)
- Random Error (chance variation that tends to average out over repeated measurements)