Regression and Featurization

Question 1

Henry is attempting to predict his math midterm scores (it's more enjoyable than studying), and has decided to use linear regression for the task. As a very thorough data scientist, he has collected a good deal of data on his past study habits, and decides to start by fitting a simple model:

$$ f_\theta(\text{Hours\_Studying}) = \text{Hours\_Studying} \cdot \theta_1 + \theta_0$$

After taking 5 more midterms (he's a busy guy), Henry checks how well his model has predicted his scores. He gets the following results:

Henry decides that, while this line fits the new data fairly well, it might be possible to create a better fit with a more complex regression function. He decides to use a polynomial basis, adding extra features that represent polynomial functions of the amount of time he spent studying. So, now his regression function is as follows:

$$ f_\theta(\text{Hours}) = (\text{Hours})^5 \cdot \theta_5 + (\text{Hours})^4 \cdot \theta_4 + (\text{Hours})^3 \cdot \theta_3 + (\text{Hours})^2 \cdot \theta_2 + \text{Hours} \cdot \theta_1 + \theta_0$$
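
For concreteness, here is one way such a polynomial featurization might be computed in Python. This is a minimal sketch assuming scikit-learn and NumPy; the `hours` values are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical study-hours data: one row per midterm, one column (Hours).
hours = np.array([[1.0], [2.5], [4.0], [6.0], [8.0]])

# Expand each Hours value into [1, Hours, Hours^2, ..., Hours^5],
# one column per term theta_0 ... theta_5 in the function above.
poly = PolynomialFeatures(degree=5, include_bias=True)
X = poly.fit_transform(hours)

print(X.shape)  # (5, 6): five midterms, six polynomial features
```

A linear regression fit on `X` then learns exactly the $\theta_0, \dots, \theta_5$ above - the model is still linear in its parameters, just not in Hours.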

Do you think that Henry's prediction error on his training dataset will decrease? What about his prediction error on new data (that is, his test dataset)?

Hint: If you're unsure, scroll down to see how his prediction line has changed































SOLUTION:

Henry's training error has decreased. However, this more-complex function is unlikely to provide a better approximation for new data points - that is, it has probably overfit the training data. Recall from Data 8 that a more-complex model will often perform worse on test data due to overfitting (diagrammed below). So, Henry's test error has likely increased.
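
To see this concretely, one can fit both a degree-1 and a degree-5 polynomial to simulated data and compare errors on held-out points. A minimal sketch, with entirely made-up data, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: scores are roughly linear in hours studied, plus noise.
hours = rng.uniform(0, 10, size=30)
score = 60 + 3 * hours + rng.normal(0, 5, size=30)

# Simple split: first 20 points for training, last 10 held out as "test".
train_h, test_h = hours[:20], hours[20:]
train_s, test_s = score[:20], score[20:]

for degree in (1, 5):
    coeffs = np.polyfit(train_h, train_s, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, train_h) - train_s) ** 2)
    test_mse = np.mean((np.polyval(coeffs, test_h) - test_s) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.1f}, test MSE {test_mse:.1f}")
```

On a typical run, the degree-5 fit has lower training error but higher test error than the degree-1 fit, matching the overfitting story above (though any single random draw can deviate).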

Question 2

Now, instead of adding functions of an existing feature, Henry tries adding a genuinely new feature. In addition to the number of hours he spent studying for his math midterms, he also includes the number of hours he slept the night before each exam. Henry is fairly sure that his midterm scores are higher when he's slept more, and that the amount he sleeps is not closely correlated with the amount he studies.

Now Henry's regression function is as follows:

$$ f_\theta(\text{Hours\_Sleep}, \text{Hours\_Studying}) = \text{Hours\_Sleep} \cdot \theta_2 + \text{Hours\_Studying} \cdot \theta_1 + \theta_0$$
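
As a sketch, fitting such a two-feature model in scikit-learn might look like the following (the feature values and scores are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: one row per midterm,
# columns = [Hours_Sleep, Hours_Studying].
X = np.array([[8.0, 5.0],
              [6.5, 2.0],
              [7.0, 4.0],
              [5.0, 1.0]])
scores = np.array([88.0, 70.0, 82.0, 55.0])

model = LinearRegression()   # learns theta_1, theta_2 and the intercept theta_0
model.fit(X, scores)
print(model.coef_, model.intercept_)
```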

Given this information, do you expect this to decrease prediction error on Henry's training dataset? What about on his test dataset?
















SOLUTION:

Given what we know, we can expect both training and test error to decrease. Since Hours_Sleep has low correlation with Hours_Studying and high correlation with the target $y$ (the midterm score), we can expect it to improve the overall quality of the regression.

Question 3

Pleased with his improved model, Henry decides to add a third feature. Since the number of hours he spent studying math was predictive of his math midterm scores, he decides to also include a feature representing the number of hours he spent studying Chinese Literature (Henry is a man for all seasons). So, his new regression function is as follows:

$$ f_\theta(\text{Hours\_Sleep}, \text{Hours\_Chinese}, \text{Hours\_Math}) = \text{Hours\_Sleep} \cdot \theta_3 + \text{Hours\_Chinese} \cdot \theta_2 + \text{Hours\_Math} \cdot \theta_1 + \theta_0$$

How do you expect this new feature, Hours_Chinese, to affect prediction error on Henry's training dataset? What about on his test dataset?
















SOLUTION:

Adding a new feature will never increase training error, and will almost always decrease it! But this new feature is unlikely to be predictive of Henry's midterm score - he's not taking a Chinese Literature midterm! Thus, he has probably overfit the training data again.

In general, as the number of features in a linear regression model increases, its training error decreases. Test error may also decrease, depending on the number and quality of the added features. However, as a model becomes more complex, it may lose accuracy on test data as a result of overfitting to the training data (diagrammed below).
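
The first claim can be checked numerically: appending a column to a least-squares problem can only lower the training error, or leave it unchanged, even when the column is pure noise. A minimal sketch with simulated data, assuming NumPy and scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 30
sleep = rng.uniform(4, 9, size=n)
math_hours = rng.uniform(0, 10, size=n)
chinese = rng.uniform(0, 10, size=n)   # irrelevant to the math score
score = 40 + 2 * sleep + 3 * math_hours + rng.normal(0, 3, size=n)

for cols in ([sleep, math_hours], [sleep, math_hours, chinese]):
    X = np.column_stack(cols)
    model = LinearRegression().fit(X, score)
    mse = np.mean((model.predict(X) - score) ** 2)
    print(f"{X.shape[1]} features -> training MSE {mse:.3f}")
```

The three-feature training MSE will never exceed the two-feature one, but the tiny improvement comes from fitting noise, which is exactly what hurts on test data.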

Question 4

Henry decides to add yet another feature - the day of the week on which he took each midterm. Each of his data points thus has one additional dimension - a Day string with possible values "M", "T", "W", "R", or "F".

How could Henry encode this data such that it can be used in a regression model?

Now, assume that Henry uses the encoding you thought of. He then takes a course at UCSD, where some courses have Saturday midterms. What problem will his model have in predicting his scores in this new course? What method might you use to predict this new data point?
















SOLUTION:

One possible method would be one-hot encoding. However, one-hot encoding will not work for this new data point - since the Day is neither "M", "T", "W", "R", nor "F", it can't be encoded. The simplest option for this new data point would be falling back to a different model which doesn't take Day into account, but that might be messy.
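
As a sketch of how this might look with scikit-learn's `OneHotEncoder` (the Saturday value `"Sa"` is hypothetical): setting `handle_unknown="ignore"` encodes an unseen day as all zeros rather than raising an error, which effectively drops the Day feature for that one point.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Training-time Day values: one column, five known categories.
days = np.array([["M"], ["T"], ["W"], ["R"], ["F"]])

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(days)

print(enc.transform([["W"]]).toarray())   # one-hot row for Wednesday
print(enc.transform([["Sa"]]).toarray())  # all zeros: Saturday was never seen
```

Whether an all-zeros row is acceptable, or whether falling back to a Day-free model is safer, depends on how the rest of the model was trained.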

Question 5

Henry wants to be able to test his various models in order to determine which will have the best performance on his test data, but he knows that simply testing on training data will give uninformative results.

What's one method he could use to get a good estimate of how his model will perform on test data, using only his training data? What parameter(s) of that method will he need to consider?
















SOLUTION:

Henry could use cross-validation to measure the quality of his models. He would have to choose $k$ (if using $k$-fold CV; he could also use another method, such as leave-one-out).

A reminder of what CV looks like:
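
As a complement to the diagram, here is a minimal sketch of 5-fold cross-validation in scikit-learn, with made-up data standing in for Henry's training set:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical training data: two features, as in Question 2.
X = rng.uniform(0, 10, size=(40, 2))
y = 50 + 2 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 4, size=40)

# k = 5: each fold takes one turn as the held-out validation set.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print(-scores.mean())  # average validation MSE across the five folds
```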