Workflow for statistics

Case 1: Suppose that you want to test differences in one variable between groups

  • Step0: iid:
    • The observations are independent of each other (for instance, they are not the same units measured in different years)
  • Step1: Normality:
    • Make sure this variable is normally distributed (qqplot + histogram)
    • If it's lognormally distributed, transform it with the logarithm
    • If it follows any other distribution, you need some other transformation
  • Step2: Create your groups.
  • Step3: Equal variance:
    • Levene test (although it can be a bit strict)
    • As a rule of thumb, the largest group variance shouldn't be more than 4 times the smallest group variance.
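  • Step4: Do ANOVA if the normality and equal-variance checks pass, otherwise Kruskal-Wallis. With only two groups, do a t-test or a Mann-Whitney U (MWU) test instead.
  • Step5: Do a Tukey test if the variable is normally distributed and the variances are equal. Otherwise you can compare each pair of groups with the MWU test, using a significance level of 0.05/number of comparisons (Bonferroni correction).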
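
The steps of Case 1 can be sketched in Python with NumPy and SciPy. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `compare_groups` and the synthetic data are hypothetical, the Shapiro-Wilk test is used only as a numeric stand-in for the visual qqplot + histogram check, the equal-variance decision uses the factor-of-4 rule of thumb (the Levene p-value is only reported), and `scipy.stats.tukey_hsd` assumes SciPy 1.8 or newer.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def compare_groups(groups, alpha=0.05):
    """Sketch of Steps 1-5. `groups` maps a group name to its observations."""
    names = list(groups)
    samples = [np.asarray(groups[name], dtype=float) for name in names]

    # Step1 (assumption): Shapiro-Wilk per group as a numeric complement
    # to the qqplot + histogram inspection.
    normal = all(stats.shapiro(s).pvalue > alpha for s in samples)

    # Step3: Levene test (reported only) and the factor-of-4 rule of thumb.
    variances = [s.var(ddof=1) for s in samples]
    levene_p = stats.levene(*samples).pvalue
    variance_ratio = max(variances) / min(variances)
    equal_var = variance_ratio <= 4

    # Step4: ANOVA when both checks pass, Kruskal-Wallis otherwise.
    parametric = normal and equal_var
    omnibus = stats.f_oneway(*samples) if parametric else stats.kruskal(*samples)

    # Step5: pairwise comparisons.
    pairs = list(combinations(range(len(names)), 2))
    corrected_alpha = alpha / len(pairs)  # 0.05 / number of comparisons
    significant = {}
    if parametric:
        # Tukey already adjusts for multiple comparisons, so compare to alpha.
        tukey = stats.tukey_hsd(*samples)
        for i, j in pairs:
            significant[(names[i], names[j])] = bool(tukey.pvalue[i, j] < alpha)
    else:
        for i, j in pairs:
            p = stats.mannwhitneyu(
                samples[i], samples[j], alternative="two-sided"
            ).pvalue
            significant[(names[i], names[j])] = bool(p < corrected_alpha)

    return {
        "parametric": parametric,
        "levene_p": levene_p,
        "variance_ratio": variance_ratio,
        "omnibus_p": omnibus.pvalue,
        "corrected_alpha": corrected_alpha,
        "significant_pairs": significant,
    }


# Hypothetical synthetic data: three groups, the third one shifted.
rng = np.random.default_rng(0)
example = {
    "a": rng.normal(0.0, 1.0, size=30),
    "b": rng.normal(0.0, 1.0, size=30),
    "c": rng.normal(5.0, 1.0, size=30),
}
print(compare_groups(example))
```

With three groups there are three pairwise comparisons, so the corrected significance level for the MWU branch is 0.05/3 ≈ 0.0167.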

Case 2: Suppose that you want to find relationships between variables

  • Step0:
    • Do a correlation plot and a scatter matrix to understand how your variables correlate with each other.
  • Step1:
    • Run regression
  • Step2:
    • Check assumptions and modify your variables as needed. The two most common things that can help:
      • Transform your variables so all of them have similar distributions
      • Combine or drop some variables (see dimensionality reduction notebook)
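
The steps of Case 2 can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the helper `fit_ols`, the variables `x1`, `x2`, `y`, and the synthetic data, in which `y` depends on the logarithm of a skewed predictor. It computes the correlation matrix (the numbers behind a correlation plot), fits an ordinary least-squares regression, and then refits after transforming the skewed variable.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit of y on the columns of X plus an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    # With an intercept the residuals have zero mean, so this is 1 - SSres/SStot.
    r2 = 1.0 - residuals.var() / y.var()
    return coef, residuals, r2


# Hypothetical synthetic data: x1 is lognormal (skewed), x2 is normal.
rng = np.random.default_rng(0)
x1 = rng.lognormal(mean=0.0, sigma=1.0, size=200)
x2 = rng.normal(size=200)
y = 2.0 * np.log(x1) - 1.0 * x2 + rng.normal(scale=0.5, size=200)

# Step0: pairwise correlations between all variables.
corr = np.corrcoef(np.column_stack([x1, x2, y]), rowvar=False)

# Step1: regression on the raw variables.
_, _, r2_raw = fit_ols(np.column_stack([x1, x2]), y)

# Step2: log-transform the skewed predictor so the variables have
# similar distributions, then refit and inspect the residuals.
coef, residuals, r2_log = fit_ols(np.column_stack([np.log(x1), x2]), y)

print("correlation matrix:\n", corr.round(2))
print("R^2 raw:", round(r2_raw, 3), "R^2 after log transform:", round(r2_log, 3))
print("coefficients (intercept, log x1, x2):", coef.round(2))
```

In this constructed example the transformed fit explains more variance because the data were generated from `log(x1)`. On real data, compare the residual plots before and after the transformation rather than relying on R² alone.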