This assignment lets you experiment with modeling count data. In the data folder of the GitHub repo for this course, or on datahub, you'll find data on Italian football: Serie A, 2013/2014 season. A data dictionary can be found here. Your task is to predict the total number of goals in a given match. Validate your model using cross-validation and an appropriate metric or metrics of your choice.
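As a starting point for the validation step, here is a minimal sketch of k-fold cross-validation scored with mean Poisson deviance, one reasonable metric for count targets. Everything in it is an illustrative assumption: the helper names are hypothetical, the goal totals are made-up toy values rather than Serie A data, and the "model" is only a baseline that predicts the training-fold mean. A real submission would replace that baseline with a fitted model, such as a Poisson GLM.

```python
import math
import random

def mean_poisson_deviance(y_true, y_pred):
    """Mean Poisson deviance; every prediction must be strictly positive."""
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        # The y * log(y / mu) term is defined as 0 when y == 0.
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (term - (y - mu))
    return total / len(y_true)

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate_baseline(goals, k=5, seed=0):
    """Score a predict-the-training-mean baseline on each held-out fold."""
    scores = []
    for held_out in k_fold_indices(len(goals), k, seed):
        held = set(held_out)
        train = [g for i, g in enumerate(goals) if i not in held]
        mu = sum(train) / len(train)  # baseline "model": training mean
        test = [goals[i] for i in held_out]
        scores.append(mean_poisson_deviance(test, [mu] * len(test)))
    return scores

# Made-up total-goal counts for illustration only (not real match results).
toy_goals = [3, 0, 2, 4, 1, 2, 5, 1, 3, 2]
scores = cross_validate_baseline(toy_goals, k=5)
print("mean CV Poisson deviance:", sum(scores) / len(scores))
```

Lower deviance is better, and the baseline's score gives a floor that any real model should beat. Because matches are ordered in time, a shuffled split like this one may not suit every modeling choice. Whichever splitting scheme you use belongs in the justification in your writeup.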
In order to avoid looking ahead and biasing the model, you may not include the following as predictors: FTHG, FTAG, HTHG, HTAG.
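One way to enforce this restriction is to build the target first and then drop the forbidden columns before any modeling. The sketch below assumes the common convention that FTHG and FTAG hold full-time home and away goals, so the target is their sum; check the data dictionary before relying on this. The rows use placeholder teams and made-up values, and the `Referee` column and the helper name are hypothetical.

```python
import csv
import io

# Columns the assignment forbids as predictors.
LEAKY_COLUMNS = {"FTHG", "FTAG", "HTHG", "HTAG"}

# Toy rows with placeholder teams and made-up values; NOT real match data.
# The Referee column is a hypothetical stand-in for an allowed predictor.
TOY_CSV = """HomeTeam,AwayTeam,FTHG,FTAG,HTHG,HTAG,Referee
TeamA,TeamB,2,1,1,0,RefX
TeamC,TeamD,0,0,0,0,RefY
"""

def split_target_and_features(rows):
    """Return (features, targets); target is assumed to be FTHG + FTAG."""
    features, targets = [], []
    for row in rows:
        targets.append(int(row["FTHG"]) + int(row["FTAG"]))
        # Keep only the columns that are allowed as predictors.
        features.append({k: v for k, v in row.items() if k not in LEAKY_COLUMNS})
    return features, targets

rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
features, targets = split_target_and_features(rows)
print(targets)      # [3, 0]
print(features[0])  # only HomeTeam, AwayTeam, Referee remain
```

Splitting off the target in one place keeps the goal columns from reaching the model later through a forgotten column selection.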
Submit a writeup describing your analysis. Your writeup must contain enough detail to allow reproduction of your results, along with justification for the decisions you made while constructing the model. An explanation of your choice of metric or metrics is required. Include a brief discussion of your assessment of the quality of the model based on the metrics chosen.
You may (and are encouraged to) include a code file to enable quick reproduction of your work. However, a code file, no matter how thoroughly commented, does not qualify as a writeup.
If you use Python for this assignment, you may do your work and your writeup in a Jupyter notebook and submit the single notebook file. This is the preferred (but not required) method. (Note that if you prefer to use R, you can install an R kernel for Jupyter and do the writeup in the same way, though this is not required.)
If you use R for this assignment, consider trying glmnet. This package fits regularized generalized linear models quickly.
As usual, submit your completed assignment to me via email. This assignment is due Wednesday, August 3, by 9:00 AM. Each day that the assignment is late incurs a penalty of 1 point, or 5%. A submission at any time after 9:00 AM on August 3 counts as 1 day late.