Homework 4

Due date: Friday 27th 23:59

  • Write your own code in the blanks. Collaborating with other students is fine, but each student must write their own code and list their collaborators' names in this cell. If you adapt code from other sources, you must also credit the source (a comment with a link to it suffices).
  • Complete the blanks, adding comments to explain what you are doing
  • Each plot must have labels

Collaborated with:

Problem 1

Using the data for your project, take a quantitative variable and a categorical variable.

  • Check the normality of the quantitative variable (e.g. x) for the different categories
  • If lognormal, transform it by using x = np.log(x)
  • Calculate confidence intervals of the mean and std of your variable
  • Check whether the variable has the same variance across the different groups (categories)
  • Use the appropriate test (ANOVA / Kruskal-Wallis) to test for differences between groups
  • If normality and equal variance: Do a Tukey's test and plot it
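
The steps above can be sketched roughly as follows. This is only an illustration on synthetic data, not a reference solution: the three groups, the 0.05 threshold, and the helper names mean_ci and std_ci are all assumptions made for the example, and Shapiro-Wilk and Levene are one reasonable choice of normality and equal-variance tests among several.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (an assumption): one quantitative variable
# measured in three hypothetical categories "a", "b" and "c".
rng = np.random.default_rng(0)
groups = {
    "a": rng.normal(10.0, 2.0, size=40),
    "b": rng.normal(11.0, 2.0, size=40),
    "c": rng.normal(13.0, 2.0, size=40),
}
alpha = 0.05  # significance level assumed for every test below

# Normality per category (Shapiro-Wilk; H0: the sample is normal).
normal = all(stats.shapiro(x).pvalue > alpha for x in groups.values())


def mean_ci(x, conf=0.95):
    """Confidence interval of the mean based on the t distribution."""
    half = stats.t.ppf(0.5 + conf / 2, df=len(x) - 1) * stats.sem(x)
    return x.mean() - half, x.mean() + half


def std_ci(x, conf=0.95):
    """Confidence interval of the std based on the chi-square distribution."""
    n = len(x)
    scaled_var = (n - 1) * x.var(ddof=1)
    lower = np.sqrt(scaled_var / stats.chi2.ppf(0.5 + conf / 2, df=n - 1))
    upper = np.sqrt(scaled_var / stats.chi2.ppf(0.5 - conf / 2, df=n - 1))
    return lower, upper


intervals = {name: (mean_ci(x), std_ci(x)) for name, x in groups.items()}

# Equal variance across categories (Levene; H0: variances are equal).
equal_var = stats.levene(*groups.values()).pvalue > alpha

# ANOVA if both assumptions hold, otherwise Kruskal-Wallis.
if normal and equal_var:
    test_name = "ANOVA"
    result = stats.f_oneway(*groups.values())
else:
    test_name = "Kruskal-Wallis"
    result = stats.kruskal(*groups.values())

print(test_name, result.pvalue)
```

For the Tukey step, one option is pairwise_tukeyhsd from statsmodels.stats.multicomp, whose result object offers plot_simultaneous() for the plot. Note that the chi-square interval for the std is itself only valid under normality.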

If your data is not suitable for this analysis, download a dataset and use it instead: https://vincentarelbundock.github.io/Rdatasets/datasets.html


In [ ]:

Problem 2

Using the data for your project, investigate the relationship between several quantitative variables (at least 3):

  • Part 1: Create a pairplot or correlation matrix (see class3b_data_visualization)

  • Part 2: Do one of the following:

    • Linear regression
    • Linear mixed effects regression
    • Logistic regression
    • Any kind of regression using machine learning (more advanced)
  • Part 3: Create a new column with the predicted values and plot predicted vs real values
  • Part 4:
    • If regression
      • Briefly interpret the summary table
      • Plot the residuals vs the predicted values
    • If machine learning (more advanced)
      • Briefly interpret the results
      • Discuss how you set the regularization parameter (usually called C)
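
A minimal sketch of the linear regression route through Parts 1 to 4, again on synthetic data. The column names x1, x2 and y and the true coefficients are assumptions made for the example. The fit uses plain least squares from NumPy as a stand-in, so it produces coefficients and R² but not the full summary table that a statistics package would print.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption): y depends linearly on x1 and x2.
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 2.0 + 1.5 * df["x1"] - 0.5 * df["x2"] + rng.normal(scale=0.3, size=n)

# Part 1: correlation matrix of the quantitative variables.
corr = df[["x1", "x2", "y"]].corr()

# Part 2: ordinary least squares; the column of ones is the intercept.
X = np.column_stack([np.ones(n), df["x1"].to_numpy(), df["x2"].to_numpy()])
coef, *_ = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)

# Part 3: new column with the predicted values.
df["y_pred"] = X @ coef

# Part 4: residuals and R^2 as a simple goodness-of-fit measure.
df["residual"] = df["y"] - df["y_pred"]
ss_res = (df["residual"] ** 2).sum()
ss_tot = ((df["y"] - df["y"].mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

print(corr.round(2))
print("intercept, x1, x2:", coef.round(3), "R^2:", round(r_squared, 3))
```

Scatter plots of df["y_pred"] against df["y"], and of df["residual"] against df["y_pred"], each with labelled axes, would cover the plots asked for in Parts 3 and 4.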

In [ ]: