In a prediction / regression problem, the inputs (denoted by `X`

) are called as `predictors`

, `independent variables`

, `features`

and the predicted variable is called as `response`

, `dependent variable`

and is denoted by `Y`

.

The relationship betwen input and predicted is represented as

$$ Y = f(X) + \epsilon $$where $f$ is some fixed, unknown function that is to be determined. $\epsilon$ is **random error** term that is independent of `X`

and has **zero mean**.

In reality, $f$ may depend on more than 1 input variable $X$, for instance 2. In this case, $f$ is a `2D`

surface that is fit. In general, the process of **estimating** $f$ is **statistical learning**.

Since $f$ and $Y$ cannot be **calculated**, the best we can get is to **estimate** them. Thus, the estimates are called $\hat f$ and $\hat Y$

The accuracy of $\hat Y$ depends on **reducible** and **irreducible** errors. The error in prediction of $\hat f$ is **reduible** and can be improved wth more data and better models. However, $\hat Y$ is also a function of $\epsilon$ which is **irreducible**. Thus, the best our predictions can get is

Focus of Statistical learning is to estimating $f$ as $\hat f$ with least **reducible** error. However, the accuracy of $\hat Y$ will always be controlled by **irreducible** and **unknown** error $\epsilon$.

In **prediction problems**, $\hat f$ can be treated as **blackbox** as we are only interested in predicting $Y$.

We are interested in understanding how each of the different $X_{1}... X_{p}$ affect the dependent variable $Y$, hence the name **inference**. Here, $\hat f$ **cannot be treated as blackbox** and we need to know its exact form. Some questions that are sought to be answered through inference:

- which predictor variables are associated with the response?
- what is the relationship b/w response and each predictor?
- is the relationship linear or is more complicated?

The observations for X and Y can be written as ${(x_{1}, y_{1}),(x_{2}, y_{2}),...,(x_{n}, y_{n})}$ where each *x* has many predictor variables that can be written as $x_{i} = (x_{i1},x_{i2},..,x_{ip})^{T}$. The goal is to find $\hat f$ such that $Y \approx \hat f (X)$

Parametric methods take a **model based** approach (**deterministic**).
We make an assumption about the functional form of *f* (whether it is linear, non linear, higher order, logistic etc). For instance, if we assume that *f* is linear, then

we only need to find $p+1$ coefficients. Through training or fitting (using methods like *ordinary least squares*), we can estimate the coefficients.

**Notes**

- parametric methods are an approximation of the true functional form of
*f*. - simpler (lower order, less flexible) models may lead to poorer estimates of
*f* - more flexible (higher order, complex) models may lead to
**overfitting**. - Since the model is trained on a subset of values, it might be very different from true nature of
*f*. Hence the model developed is only valid for the range of data it was trained on.

**Model interpretability and complexity**: The more complex a model is (higher order more flexible models, decision trees..), the less interpretable it is.

**Supervised vs Unsupervised algorithms**: Supervised methods are used when both the `predictor`

and `response`

variables can be measured and data is available. Unsupervised methods are used when little is known about the data and only `predictor`

variables are available. Unsupervised are best when put to classification / clustering problems.

**Regression vs Classification**: When the `response`

variable is `quantitative`

and continuous, the problem is considered a regression. When the `response`

is `qualitative`

and falls within categories, then the problem is a classification problem. Howerver, this distinction is not really solid as many algorithms can be used for both.

Here the deviation between `predicted`

and actual values is measured. In regression, `Mean Squared Error (MSE)`

is commonly used.

The MSE obtained is called **training MSE**. When used against unseen test data, we get **test MSE**. Our objective is to choose the method with lowest *test MSE*. There is no guarantee that a low training MSE will yield a low test MSE.

A fundamental property in statistical learning is **as model flexibility increases, training MSE might decrease, but test MSE might not**. When a given learning method yields a **small training MSE** but a **large test MSE**, we are **overfitting** the data. This is because, our data might have noise from **irreducible error** and the model is trying to fit it.

If you plot the **test MSE** against **model flexibility**, it follows a *U shaped curve*. Thus it first reduces then increases. The expected test MSE for observation $x_{0}$ $E(y_{0} - \hat f(x_{0}))$ can be decomposed to `3`

fundamental quantities: (a) the **variance of** $\hat f(x_{0})$, (b) **the squared bias** of $\hat f(x_{0})$ and (c) **variance of irreducible error** $\epsilon$. Thus:

Thus, to reduce the expected test MSE, we need to reduce both the **variance of $\hat f$** and **bias of $\hat f$**.

**Variance** refers to the amount by which $\hat f$ would change if we estimated it using a different training dataset. Ideally, $\hat f$ should not change much if a slightly different data set is used. A statistical learning method with high variance would yield a very different $\hat f$ for different training data sets. **Higher the model flexibility, the higher is its variance** as the model closely fits the training data.

**Bias** refers to the error introduced by approximating a real-life problem. Generally, **higher model flexibility, the lower is the bias**. Thus as we use more flexible methods, the variance will increase and bias would decrease.

As model **flexibility increases**, the **bias reduces faster** than the rate at which **variance increases**. Thus, the **test MSE** drops initially before increasing (`U shape`

). This relationship is called the **bias variance trade-off** and the objective is to pick the model flexibility that has the least of both.

`error rate`

is used to quantify the errors in classification. It is the ratio of `sum of misclassifications`

to `number of observations`

.

where $\hat y_{i}$ is predicted class label for ith observation using $\hat f$. When computed against training data, this yields **training error rate**. When computed for test data, this yields **test error rate**.

Bayes classifier is a simple but idealistic classifier. It assigns each observation to the **most likely class** given its predictor class. This can be written using **conditional probability** as below:

Thus, the error rate with **Bayes classifier** becomes the average of (1 - max probability for different classes). The Bayes error rate is analogous to **irreducible error*.

In reality, Bayes classifier is not possible as the conditional probability is unknown. Instead, algorithms attempt to derive the conditional probability. One such is KNN.

The KNN classifier identifies **K** points in training data that are closest to test observation $x_{0}$, represented at $N_{0}$. It then **estimates** conditional proabability for class $j$ as the fraction of points in $N_{0}$ whose classes equal $j$. This can be written as:

**K=1**, the decision boundary is of highest flexibility and **overfits** with low bias and high variance. When **K** is very large, it **underfits** with low flexibility. It has high bias and low variance. As in regression, with classification, increasing flexibility reduces the training error, but does not affect test error rate.