Terminology
- Null Hypothesis
- Alternative Hypothesis
- p-value (the probability, assuming the null hypothesis is true, of observing a statistic at least as extreme as the one computed, just by chance)
- Bootstrap
- Acceptance Region
- Rejection Region
- t-test (see the sketch after this list)
- One-tailed test
- Two-tailed test
- Significance test
- Confidence interval
- Power of a test
- Type I error (rejecting the null hypothesis when it is true). Also called a false positive.
- Type II error (failing to reject the null hypothesis when it is false). Also called a false negative.
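Many of these terms show up together in a single two-sample t-test. Below is a minimal sketch in Python; the simulated sales numbers are an illustrative assumption, borrowing the shoe-sales framing used later in these notes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
before = rng.normal(loc=100, scale=15, size=30)  # sales before the price change
after = rng.normal(loc=110, scale=15, size=30)   # sales after the price change

# Null hypothesis: the two means are equal.
# Two-tailed test: the alternative is "the means differ in either direction".
t_stat, p_two = stats.ttest_ind(after, before)

# One-tailed test: the alternative is "mean after > mean before".
_, p_one = stats.ttest_ind(after, before, alternative='greater')

alpha = 0.05  # significance level: the Type I error rate we accept
print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")
if p_two < alpha:
    print("statistic falls in the rejection region: reject the null hypothesis")
else:
    print("statistic falls in the acceptance region: fail to reject the null")
```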
Some Practical Thoughts
- Data can be biased, and confidence intervals computed from biased data may not be representative.
- One way to handle biased data is to use bias-corrected confidence intervals, e.g., the BCa (bias-corrected and accelerated) bootstrap interval (see the first sketch after this list).
- Outliers can impact confidence intervals.
- People too often simply remove outliers, but outliers may encode necessary information.
- One way to handle outliers is to use ranks instead of the raw values (see the rank-transformation sketch after this list).
- If the sample size is small, bootstrapping underestimates the width of the confidence interval.
- It is better to use significance testing if the sample size is small.
- Bootstrapping should not be used to estimate extreme order statistics (e.g., the maximum sales of shoes, the 5th-largest sales of shoes, etc.).
- Use a rank transformation before bootstrapping if the data has outliers.
- Lack of representativeness is a problem for any statistical technique.
- The experiment should be randomized (e.g., when doing A/B testing, randomly assign subjects to groups, as in the assignment sketch after this list). Experimental bias can lead to wrong inferences.
- Resampling time-series data is tricky: the assumption we relied on, that each data point is independent, does not hold for time series. A block bootstrap, sketched after this list, is one common workaround.
- A rank transformation changes the question being asked. For our shoe-sales example, a rank-transformed analysis asks: "Do sales tend to be higher after price optimization?" (Our original analysis asked: "Do post-price-optimization sales have a higher mean?")
- The power of a test increases as the sample size increases (see the power sketch after this list).
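A minimal sketch of a bias-corrected confidence interval, using SciPy's `stats.bootstrap` with the BCa method; the skewed sales data are an illustrative assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sales = rng.lognormal(mean=4.0, sigma=0.5, size=200)  # skewed sales data

res = stats.bootstrap(
    (sales,),                 # data must be passed as a sequence of samples
    np.mean,                  # the statistic whose interval we want
    n_resamples=9999,
    confidence_level=0.95,
    method='BCa',             # bias-corrected and accelerated interval
    random_state=rng,
)
print(f"95% BCa CI for the mean: "
      f"({res.confidence_interval.low:.2f}, {res.confidence_interval.high:.2f})")
```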
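A minimal sketch of the rank transformation, using `scipy.stats.rankdata`: pool the two samples, replace values with their ranks, and compare mean ranks instead of mean sales. The two small samples, including one huge outlier that would dominate a comparison of raw means, are illustrative assumptions.

```python
import numpy as np
from scipy import stats

before = np.array([120.0, 95.0, 110.0, 130.0, 105.0])
after = np.array([125.0, 140.0, 115.0, 135.0, 9_800.0])  # one huge outlier

ranks = stats.rankdata(np.concatenate([before, after]))  # pooled ranks, 1..10
rank_before, rank_after = ranks[:5], ranks[5:]

# The outlier is now just "rank 10", so it cannot dominate the comparison.
print(f"mean rank before = {rank_before.mean():.1f}, "
      f"mean rank after = {rank_after.mean():.1f}")
```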
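A minimal sketch of random assignment for an A/B test, using a NumPy shuffle; the subject IDs and group sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
subjects = np.arange(1000)            # hypothetical subject IDs
shuffled = rng.permutation(subjects)  # random order makes assignment independent
                                      # of any subject attribute
group_a, group_b = shuffled[:500], shuffled[500:]
```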
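For time series, a block bootstrap resamples contiguous blocks rather than individual points, so short-range dependence is preserved within each block. A minimal sketch of a moving-block bootstrap; the block length and the simulated random walk are illustrative assumptions.

```python
import numpy as np

def moving_block_bootstrap(series, block_len, rng):
    """Return one bootstrap resample built from random contiguous blocks."""
    n = len(series)
    starts = rng.integers(0, n - block_len + 1, size=int(np.ceil(n / block_len)))
    blocks = [series[s:s + block_len] for s in starts]
    return np.concatenate(blocks)[:n]  # trim to the original length

rng = np.random.default_rng(1)
ts = np.cumsum(rng.normal(size=365))  # a random walk: points are NOT independent
resample = moving_block_bootstrap(ts, block_len=14, rng=rng)
print(resample[:5])
```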
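A minimal sketch of the power/sample-size relationship, using statsmodels' `TTestIndPower` for a two-sample t-test; the effect size (Cohen's d = 0.5) and alpha are illustrative assumptions.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n in (10, 30, 100, 300):
    # Leaving power unspecified makes solve_power solve for it.
    power = analysis.solve_power(effect_size=0.5, nobs1=n, alpha=0.05)
    print(f"n per group = {n:>3}: power = {power:.2f}")
```

The same call can be turned around: leave `nobs1=None` and set `power=0.8` to solve for the sample size needed to reach a target power.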
Types of Error
- Sampling Bias (the sample systematically over- or under-represents parts of the population)
- Measurement Error (error introduced by the instrument or procedure used to record the data)
- Random Error (chance variation that tends to average out over repeated measurements)