In [1]:
%load_ext rmagic
In [2]:
%%R
head(iris, 10)
Here we have 4 independent variables and a class label (note: not a dependent variable here!).
Q: What kind of data is the "Species" label? Quantitative or qualitative?
Q: Given these features and label, what do you think we will be trying to predict?
In [3]:
%%R
summary(iris)
Note how each of these variables could, in principle, be predicted from the others; when we do so, what matters is out-of-sample (OOS) prediction error.
Thought experiment:
Suppose instead, we train our model using the entire dataset.
Q: How low can we push the training error?
A: Down to zero!
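A quick illustration (using 1-nearest-neighbor, which is my choice here, not the original's): a model that memorizes the training set makes every training point its own nearest neighbor, so its training error is exactly zero.
In [ ]:
%%R
# 1-NN evaluated on its own training data: each point's nearest
# neighbor is itself, so the training error rate comes out to 0.
library(class)
X <- iris[, 1:4]
preds <- knn(train = X, test = X, cl = iris$Species, k = 1)
mean(preds != iris$Species)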
[Figures comparing a natural fit and an overfit model; source: dtreg.com]
The most significant reason to prefer generalization over overfitting is the cost of the algorithm. Notice how much simpler the natural fit is than the overfit model in this example, at the price of only minor errors on the training data; that simplicity is far more effective for us as data scientists. Note that at times we might deal with underfitting as well, where we generalize our data too much. Let's strive to find the right fit for our data.
Thought experiment:
Different train/test splits will give us different generalization errors.
Q: What if we did a bunch of these and took the average?
A: Now you’re talking!
More accurate estimate of OOS prediction error.
More efficient use of data than single train/test split.
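This averaging procedure is k-fold cross-validation. A minimal sketch (the 3-NN model and the seed are illustrative choices, not from the original):
In [ ]:
%%R
# 5-fold cross-validation: split the data into 5 folds, hold each fold
# out in turn, and average the held-out error rates.
set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(iris)))
errs <- sapply(1:k, function(i) {
  train <- iris[folds != i, ]
  test  <- iris[folds == i, ]
  preds <- class::knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 3)
  mean(preds != test$Species)
})
mean(errs)  # averaged estimate of OOS prediction error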
In [4]:
%%R
c(0.001, 23.4, 17.3, 26.8, 32.8, 31.3, 34.5, 7352.3)
Because of a single outlier (7352.3), standard summary statistics like the mean and standard deviation are dominated by that one value and become misleading.
In [5]:
%%R
mean(c(0.001, 23.4, 17.3, 26.8, 32.8, 31.3, 34.5, 7352.3))
In [6]:
%%R
sd(c(0.001, 23.4, 17.3, 26.8, 32.8, 31.3, 34.5, 7352.3))
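One standard remedy (my addition, not from the original) is to use robust statistics: the median and the median absolute deviation (MAD) barely move in the presence of the outlier.
In [ ]:
%%R
# Robust alternatives to mean/sd: median and MAD shrug off the outlier.
x <- c(0.001, 23.4, 17.3, 26.8, 32.8, 31.3, 34.5, 7352.3)
median(x)  # ~29, a sensible center despite 7352.3
mad(x)     # robust estimate of spread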
What if one dimension is on a very different scale from another? (Say, height in feet and weight in pounds.)
In [7]:
%%R
# Plot the four labeled points, then add a red "query" point at (4.9, 101).
d <- data.frame(heights=c(5.42, 5.67, 5.75, 4.83), weights=c(120, 150, 110, 90))
plot(d, pch=16)
text(d$heights, y = d$weights + 2, labels = seq(4))
points(x=4.9, y=101, col=2, pch=16)
text(4.9, y = 103, labels = "?", col=2)
Which black dot is closest to the red one?
In [8]:
%%R
# Euclidean distance from the red point (4.9, 101) to each black point.
sqrt((4.9 - d$heights)^2 + (101 - d$weights)^2)
In [9]:
%%R
# Standardize each column to mean 0 and sd 1, then compute the same
# distances with the query point expressed in standardized units.
d2 <- data.frame(scale(d))
sqrt((((4.9 - mean(d$heights)) / sd(d$heights)) - d2$heights)^2 +
     (((101 - mean(d$weights)) / sd(d$weights)) - d2$weights)^2)
...or use a different distance metric (depending on the data)
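For instance (an illustrative sketch; the choice of Manhattan distance is my assumption, not the original's), R's dist() supports several metrics:
In [ ]:
%%R
# Compare Euclidean and Manhattan distances from the standardized
# query point to the standardized data; the row named "q" is the query.
q <- c((4.9 - mean(d$heights)) / sd(d$heights),
       (101 - mean(d$weights)) / sd(d$weights))
m <- rbind(q = q, as.matrix(d2))
dist(m, method = "euclidean")  # column "q": distances from the query
dist(m, method = "manhattan")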
Define: Imputation — filling in missing values (NAs) with substituted estimates, such as the column mean or median, so that methods requiring complete data can still be applied.
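A minimal sketch of mean imputation (the vector and the NA positions are made up for illustration):
In [ ]:
%%R
# Mean imputation: replace each NA with the mean of the observed values.
v <- c(5.1, NA, 4.7, 5.0, NA)
v[is.na(v)] <- mean(v, na.rm = TRUE)
v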