Chap 2: Start Learning

ToC

Prediction
- Reducible and irreducible errors
Inference
Parametric and non-parametric methods
- Parametric methods of estimating f
- Non-parametric methods of estimating f
General concepts
Model accuracy - Regression problems
- Bias variance trade-off
Model accuracy - Classification problems

Prediction

In a prediction / regression problem, the inputs (denoted by X) are called predictors, independent variables, or features, and the predicted variable (denoted by Y) is called the response or dependent variable.

The relationship between the inputs and the response is represented as

$$ Y = f(X) + \epsilon $$

where $f$ is some fixed but unknown function that is to be determined, and $\epsilon$ is a random error term that is independent of X and has zero mean.

In reality, $f$ may depend on more than one input variable, for instance two. In that case, $f$ is a two-dimensional surface fitted over the inputs. In general, the process of estimating $f$ is called statistical learning.
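To make the model concrete, the following is a minimal simulation sketch. It assumes a made-up linear `f` and normally distributed noise; neither comes from the notes, and in a real problem `f` is unknown. The names `f`, `simulate`, and `sigma` are illustrative only.

```python
import random

def f(x):
    # Hypothetical "true" function, chosen only for illustration.
    return 2.0 + 3.0 * x

def simulate(n, sigma=1.0, seed=0):
    """Draw n observations with y = f(x) + eps, where eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    eps = [rng.gauss(0.0, sigma) for _ in range(n)]  # independent of xs
    ys = [f(x) + e for x, e in zip(xs, eps)]
    return xs, ys, eps

xs, ys, eps = simulate(10_000)
mean_eps = sum(eps) / len(eps)
print(f"sample mean of eps: {mean_eps:.4f}")  # near 0, though not exactly 0
```

The sample mean of the noise is close to zero, which is what the zero-mean assumption on $\epsilon$ implies for a large sample.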

Reducible and Irreducible errors

Since $f$ is unknown, $Y$ cannot be computed exactly; the best we can do is estimate both. The estimates are denoted $\hat f$ and $\hat Y$:

$$ \hat Y = \hat f(X) $$

The accuracy of $\hat Y$ depends on two kinds of error, reducible and irreducible. The error that comes from $\hat f$ being an imperfect estimate of $f$ is reducible: it can be lowered with more data and better models. However, $Y$ is also a function of $\epsilon$, which cannot be predicted from $X$, so this part of the error is irreducible. Thus, the best our predictions can get is

$$ \hat Y = f(X) $$

The focus of statistical learning is to estimate $f$ as $\hat f$ with the least reducible error. However, the accuracy of $\hat Y$ will always be bounded by the irreducible and unknown error $\epsilon$.
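The split between the two kinds of error can be illustrated with a small simulation. This is a sketch under stated assumptions: the true function `f`, the deliberately imperfect `f_hat`, and the noise level `SIGMA` are all hypothetical choices, not part of the notes.

```python
import random

def f(x):
    # Hypothetical true function (unknown in practice).
    return 2.0 + 3.0 * x

def f_hat(x):
    # Hypothetical imperfect estimate of f: the intercept is off by 0.3.
    return 2.3 + 3.0 * x

def mse(ys, preds):
    """Mean squared difference between responses and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

SIGMA = 0.5  # standard deviation of eps, so Var(eps) = 0.25
rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(20_000)]
ys = [f(x) + rng.gauss(0.0, SIGMA) for x in xs]

mse_perfect = mse(ys, [f(x) for x in xs])       # irreducible error only
mse_estimate = mse(ys, [f_hat(x) for x in xs])  # reducible + irreducible error
print(f"MSE with true f: {mse_perfect:.3f}, MSE with f_hat: {mse_estimate:.3f}")
```

Even when predicting with the true `f`, the MSE stays near Var(eps) = 0.25. The imperfect `f_hat` adds roughly 0.3² = 0.09 on top, which is the part a better model could remove.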

In prediction problems, $\hat f$ can be treated as a black box, as we are only interested in predicting $Y$.

Inference

We are interested in understanding how each of the predictors $X_{1}, \dots, X_{p}$ affects the dependent variable $Y$, hence the name inference. Here, $\hat f$ cannot be treated as a black box; we need to know its exact form. Some questions that inference seeks to answer:

- Which predictor variables are associated with the response?
- What is the relationship between the response and each predictor?
- Is the relationship linear, or is it more complicated?

Parametric and Non-parametric methods for Estimating f

The observations for X and Y can be written as $\{(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\}$, where each $x_{i}$ holds the values of $p$ predictor variables and can be written as $x_{i} = (x_{i1}, x_{i2}, \dots, x_{ip})^{T}$. The goal is to find $\hat f$ such that $Y \approx \hat f(X)$ for any observation $(X, Y)$.

Parametric methods for estimating f

Parametric methods take a model-based approach. We first make an assumption about the functional form of f (whether it is linear, non-linear, higher order, logistic, etc.). For instance, if we assume that f is linear, then

$$ Y \approx f(X) = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \dots + \beta_{p}X_{p} $$

With this assumption, we only need to find the $p+1$ coefficients $\beta_{0}, \dots, \beta_{p}$ instead of an arbitrary function. Through training or fitting (using methods like ordinary least squares), we can estimate the coefficients.
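For the simplest case of a single predictor ($p = 1$), ordinary least squares has a closed form, so the $p+1 = 2$ coefficients can be computed directly. The sketch below assumes that one-predictor setting; the function name `fit_simple_ols` and the data are made up for illustration.

```python
def fit_simple_ols(xs, ys):
    """Least-squares estimates (b0, b1) for the model y ≈ b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope = sample covariance of x and y divided by sample variance of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar  # the fitted line passes through (x_bar, y_bar)
    return b0, b1

# Hypothetical observations scattered around the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
b0, b1 = fit_simple_ols(xs, ys)
print(f"intercept: {b0:.2f}, slope: {b1:.2f}")
```

With more than one predictor the same idea applies, but the coefficients are usually obtained with a linear algebra routine rather than these scalar formulas.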

Notes

- Parametric methods are an approximation of the true functional form of f.
- Simpler (lower order, less flexible) models may lead to poorer estimates of f.
- More flexible (higher order, complex) models may lead to overfitting.
- Since the model is trained on a sample of values, it might be very different from the true nature of f outside that sample. Hence the model should only be trusted within the range of data it was trained on.

Non-parametric methods for estimating f

Non-parametric methods avoid assuming a functional form for f. However, these methods require a very large number of observations, since they do not reduce the problem to estimating a small number of parameters.
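One example of a non-parametric method is K-nearest-neighbours regression, which predicts the response at a point as the average response of the closest training points. The notes do not name a specific non-parametric method here, so this choice, the function name `knn_regress`, and the data are assumptions made for illustration.

```python
def knn_regress(x0, xs, ys, k):
    """Predict y at x0 as the mean response of the k nearest training points."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

# Hypothetical data generated from y = x**2; the method never uses that form.
xs = [0.0, 1.0, 2.0, 3.0, 10.0]
ys = [0.0, 1.0, 4.0, 9.0, 100.0]

pred = knn_regress(1.1, xs, ys, k=2)  # averages the responses at x=1 and x=2
print(pred)
```

No coefficients are estimated: the prediction is built directly from nearby observations, which is why such methods need many observations to cover the input space well.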

General concepts

Model interpretability and complexity: The more complex a model is (higher-order, more flexible models, large ensembles of decision trees, etc.), the less interpretable it tends to be.

Supervised vs Unsupervised algorithms: Supervised methods are used when both the predictor and response variables are measured for each observation. Unsupervised methods are used when only predictor variables are available and there is no response to supervise the analysis. Unsupervised methods are typically applied to clustering problems.

Regression vs Classification: When the response variable is quantitative and continuous, the problem is considered a regression problem. When the response is qualitative and falls within categories, the problem is a classification problem. However, this distinction is not rigid, as many algorithms can be used for both.

Model accuracy - Regression problems

Measuring quality of fit

Here the deviation between predicted and actual values is measured. In regression, Mean Squared Error (MSE) is commonly used.

$$ MSE = \frac{1}{n} \sum_{i=1}^{n}(y_{i} - \hat f(x_{i}))^2 $$

When computed on the data used to fit the model, the MSE is called the training MSE. When computed on unseen test data, we get the test MSE. Our objective is to choose the method with the lowest test MSE. There is no guarantee that a low training MSE will yield a low test MSE.
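The same formula gives either quantity, depending on which observations are passed in. The sketch below uses a made-up fitted model `f_hat` and a handful of invented observations, purely to show the arithmetic.

```python
def mse(ys, preds):
    """Mean squared error between observed responses and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def f_hat(x):
    # Hypothetical fitted model, assumed to have been trained on `train`.
    return 1.0 + 2.0 * x

# Hypothetical (x, y) observations.
train = [(0.0, 1.0), (1.0, 3.5), (2.0, 4.5)]
test = [(3.0, 8.0), (4.0, 7.0)]

train_mse = mse([y for _, y in train], [f_hat(x) for x, _ in train])
test_mse = mse([y for _, y in test], [f_hat(x) for x, _ in test])
print(f"training MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
```

Here the training residuals are 0, 0.5 and -0.5, while the test residuals are 1 and -2, so the test MSE is much larger than the training MSE.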

A fundamental property of statistical learning is that as model flexibility increases, the training MSE decreases, but the test MSE may not. When a given learning method yields a small training MSE but a large test MSE, we are overfitting the data. This happens because the training data contain noise from the irreducible error, and a very flexible model fits that noise as if it were signal.
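Overfitting can be reproduced with a small experiment. The sketch below uses K-nearest-neighbours regression as the learning method, where a smaller K means a more flexible fit. The true function, the noise level, the sample sizes and the seed are all assumptions made for illustration.

```python
import math
import random

def f(x):
    # Hypothetical true function.
    return math.sin(x)

def knn_regress(x0, xs, ys, k):
    """Mean response of the k training points nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def sample(n, rng, sigma=1.0):
    """Draw n noisy observations y = f(x) + eps on [0, 2*pi]."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [f(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys

rng = random.Random(42)
x_train, y_train = sample(200, rng)
x_test, y_test = sample(2000, rng)

train_mse, test_mse = {}, {}
for k in (1, 10):  # k=1 is the most flexible fit
    train_mse[k] = mse(y_train, [knn_regress(x, x_train, y_train, k) for x in x_train])
    test_mse[k] = mse(y_test, [knn_regress(x, x_train, y_train, k) for x in x_test])
    print(f"k={k}: training MSE {train_mse[k]:.3f}, test MSE {test_mse[k]:.3f}")
```

With K=1 every training point is its own nearest neighbour, so the training MSE is zero, yet the fit has memorised the noise and does worse on the test data than the smoother K=10 fit.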

Bias variance trade-off

If you plot the test MSE against model flexibility, it follows a U-shaped curve: it first decreases, then increases. The expected test MSE at a test observation $x_{0}$, written $E(y_{0} - \hat f(x_{0}))^{2}$, can be decomposed into 3 fundamental quantities: (a) the variance of $\hat f(x_{0})$, (b) the squared bias of $\hat f(x_{0})$ and (c) the variance of the irreducible error $\epsilon$. Thus:

$$ E(y_{0} - \hat f(x_{0}))^{2} = Var(\hat f(x_{0})) + [Bias(\hat f(x_{0}))]^{2} + Var(\epsilon) $$

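Since the variance and the squared bias are both non-negative, the expected test MSE can never fall below $Var(\epsilon)$. To minimise it, we need a method that achieves both low variance and low bias for $\hat f$.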
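The decomposition can be derived in two steps. The sketch below assumes, as in the model from the Prediction section, that $y_{0} = f(x_{0}) + \epsilon$ with $E(\epsilon) = 0$ and $\epsilon$ independent of $\hat f(x_{0})$, and that expectations are taken over both training sets and $\epsilon$. First, separate the noise; the cross term vanishes because $E(\epsilon) = 0$:

$$ E(y_{0} - \hat f(x_{0}))^{2} = E(f(x_{0}) - \hat f(x_{0}))^{2} + 2E(\epsilon)E(f(x_{0}) - \hat f(x_{0})) + E(\epsilon^{2}) = E(f(x_{0}) - \hat f(x_{0}))^{2} + Var(\epsilon) $$

Second, add and subtract $E\hat f(x_{0})$ inside the remaining term; again the cross term has zero expectation:

$$ E(f(x_{0}) - \hat f(x_{0}))^{2} = [E\hat f(x_{0}) - f(x_{0})]^{2} + E(\hat f(x_{0}) - E\hat f(x_{0}))^{2} = [Bias(\hat f(x_{0}))]^{2} + Var(\hat f(x_{0})) $$

Combining the two lines gives the three-term decomposition.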

Variance refers to the amount by which $\hat f$ would change if we estimated it using a different training dataset. Ideally, $\hat f$ should not change much if a slightly different dataset is used. A statistical learning method with high variance would yield a very different $\hat f$ for different training datasets. The higher the model flexibility, the higher its variance, as the model closely fits the training data.

Bias refers to the error introduced by approximating a real-life problem, which may be extremely complicated, with a much simpler model. Generally, the higher the model flexibility, the lower the bias. Thus, as we use more flexible methods, the variance increases and the bias decreases.

As model flexibility increases, the bias initially reduces faster than the variance increases, so the test MSE drops. Beyond some point, added flexibility has little effect on the bias but keeps inflating the variance, so the test MSE rises again (U shape). This relationship is called the bias-variance trade-off, and the objective is to pick the model flexibility that minimises the sum of the variance and the squared bias.
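The trade-off can be observed by refitting a method on many simulated training sets and looking at how its prediction at one point varies. The sketch below uses K-nearest-neighbours regression, with K=1 as the flexible method and K=25 as the inflexible one. The true function, noise level, test point and sample sizes are assumptions chosen for illustration.

```python
import math
import random

SIGMA = 0.3  # standard deviation of eps (assumed)
X0 = 0.25    # test point; f(X0) = 1 is the peak of the sine curve

def f(x):
    # Hypothetical true function.
    return math.sin(2.0 * math.pi * x)

def knn_predict(x0, xs, ys, k):
    """Mean response of the k training points nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

def bias_variance(k, n_train=50, n_sets=2000, seed=0):
    """Estimate Var, squared Bias and reducible error of f_hat(X0)."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_sets):  # one prediction per simulated training set
        xs = [rng.random() for _ in range(n_train)]
        ys = [f(x) + rng.gauss(0.0, SIGMA) for x in xs]
        preds.append(knn_predict(X0, xs, ys, k))
    mean_pred = sum(preds) / n_sets
    variance = sum((p - mean_pred) ** 2 for p in preds) / n_sets
    bias_sq = (mean_pred - f(X0)) ** 2
    reducible = sum((p - f(X0)) ** 2 for p in preds) / n_sets
    return variance, bias_sq, reducible

results = {k: bias_variance(k) for k in (1, 25)}
for k, (variance, bias_sq, reducible) in results.items():
    print(f"k={k}: variance {variance:.4f}, bias^2 {bias_sq:.4f}, "
          f"reducible error {reducible:.4f}")
```

The flexible fit (K=1) has high variance and almost no bias, while the inflexible fit (K=25) has low variance but a large squared bias because it averages over points far from the peak. In both cases the reducible error equals variance plus squared bias, and adding Var(eps) = SIGMA² gives the expected test MSE.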

Model accuracy - Classification problems

Measuring quality of classification

The error rate is used to quantify the errors in classification. It is the ratio of the number of misclassifications to the number of observations.

$$ \text{error rate} = \frac{1}{n}\sum_{i=1}^{n}I(y_{i} \ne \hat y_{i}) $$

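where $\hat y_{i}$ is the predicted class label for the $i$th observation using $\hat f$, and $I(\cdot)$ is an indicator that equals 1 when its argument is true and 0 otherwise. When computed against training data, this yields the training error rate. When computed for test data, this yields the test error rate.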
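The error rate formula translates directly into code. The following is a minimal sketch; the function name `error_rate` and the labels are invented for illustration.

```python
def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Each mismatch contributes 1, mirroring the indicator I(y_i != y_hat_i).
    return sum(1 for y, p in zip(y_true, y_pred) if y != p) / len(y_true)

# Hypothetical true and predicted class labels.
y_true = ["spam", "ham", "ham", "spam"]
y_pred = ["spam", "spam", "ham", "ham"]
rate = error_rate(y_true, y_pred)  # 2 of the 4 labels are wrong
print(rate)
```

Passing training labels gives the training error rate, and passing held-out labels gives the test error rate.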

Bayes Classifier

The Bayes classifier is a simple but idealised classifier. It assigns each observation to the most likely class given its predictor values. That is, an observation with predictor vector $x_{0}$ is assigned to the class $j$ for which the following conditional probability is largest:

$$ P(Y=j \ | \ X=x_{0}) $$

Thus, the error rate of the Bayes classifier at $X = x_{0}$ is $1 - \max_{j} P(Y=j \ | \ X=x_{0})$, and the overall Bayes error rate is the average of this quantity over all possible values of $X$. The Bayes error rate is analogous to the irreducible error.
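When the conditional probabilities are known, as they can be for simulated data, the Bayes classifier and its error rate are easy to compute. The sketch below assumes the probabilities are simply given at three points; the values and the function names are hypothetical.

```python
def bayes_classify(posterior):
    """Return the class with the largest conditional probability."""
    return max(posterior, key=posterior.get)

def bayes_error_rate(posteriors):
    """Average of 1 - max_j P(Y = j | X = x) over the given points."""
    return sum(1.0 - max(p.values()) for p in posteriors) / len(posteriors)

# Hypothetical, known conditional probabilities P(Y = j | X = x) at three points.
posteriors = [
    {"blue": 0.9, "orange": 0.1},
    {"blue": 0.4, "orange": 0.6},
    {"blue": 0.3, "orange": 0.7},
]

predictions = [bayes_classify(p) for p in posteriors]
bayes_error = bayes_error_rate(posteriors)
print(predictions, round(bayes_error, 4))
```

Even this ideal classifier is wrong some of the time: at the first point it still misclassifies with probability 0.1, which no other classifier can improve on.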

KNN classifier

In reality, the Bayes classifier cannot be used, as the conditional distribution of Y given X is unknown. Instead, algorithms attempt to estimate the conditional probability. One such algorithm is K-nearest neighbours (KNN).

The KNN classifier identifies the K points in the training data that are closest to the test observation $x_{0}$, represented as $N_{0}$. It then estimates the conditional probability for class $j$ as the fraction of points in $N_{0}$ whose class equals $j$. This can be written as:

$$ P(Y = j \ | \ X = x_{0}) = \frac{1}{K} \sum_{i \in N_{0}} I(y_{i} = j) $$

When K=1, the decision boundary is the most flexible and overfits, with low bias and high variance. When K is very large, the classifier is inflexible and underfits, with high bias and low variance. As in regression, increasing flexibility (here, decreasing K) reduces the training error rate but does not necessarily reduce the test error rate.
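The KNN estimate of the conditional probability can be sketched in a few lines. The example assumes Euclidean distance and two-dimensional predictors, neither of which is specified in the notes; the function names and the data are made up for illustration.

```python
from collections import Counter

def knn_class_probs(x0, points, labels, k):
    """Estimate P(Y = j | X = x0) as the share of the k nearest points in class j."""
    def sq_dist(p):
        # Squared Euclidean distance (assumed metric); ranking matches true distance.
        return sum((a - b) ** 2 for a, b in zip(p, x0))
    nearest = sorted(range(len(points)), key=lambda i: sq_dist(points[i]))[:k]
    counts = Counter(labels[i] for i in nearest)
    return {label: count / k for label, count in counts.items()}

def knn_classify(x0, points, labels, k):
    """Assign x0 to the class with the largest estimated probability."""
    probs = knn_class_probs(x0, points, labels, k)
    return max(probs, key=probs.get)

# Hypothetical training data: two predictors per point, two classes.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["blue", "blue", "orange", "orange", "orange", "blue"]

probs = knn_class_probs((0.2, 0.2), points, labels, k=3)
label_k1 = knn_classify((1.0, 0.1), points, labels, k=1)
label_k3 = knn_classify((1.0, 0.1), points, labels, k=3)
print(probs, label_k1, label_k3)
```

The same test point (1.0, 0.1) is labelled differently for K=1 and K=3: with K=1 the single nearest neighbour decides, while with K=3 the two slightly more distant neighbours outvote it, which shows how K controls the flexibility of the decision boundary.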