outcome measurement $Y$ (dependent variable, response, target)
vector of $p$ predictor measurements $X$ (inputs, regressors, covariates, features, independent variables)
regression problem: $Y$ is quantitative
classification problem: $Y$ takes values in a finite, unordered set
training data: $(x_{1}, y_{1}),...,(x_{N}, y_{N})$ (observations [examples, instances] of these measurements)
objectives: accurately predict unseen test cases, understand which inputs affect the outcome and how, assess the quality of our predictions and inferences
no outcome variable, just a set of predictors (features) measured on a set of samples
objective is fuzzier -- find groups of samples that behave similarly, find features that behave similarly, find linear combinations of features with the most variation (see the sketch after this block)
difficult to know how well you are doing
different from supervised learning, but can be useful as a pre-processing step for supervised learning
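a minimal sketch of the "most variation" objective via principal components, assuming NumPy (the toy data matrix and the choice of two components are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # toy data: 100 samples, 5 features

# Center the features, then take the SVD; the right singular vectors
# are the directions of maximal variation, in decreasing order.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = Xc @ Vt[:2].T                 # project samples onto the top-2 directions
explained_var = s**2 / (len(X) - 1)    # variance captured along each direction
```

the rows of `scores` could then feed a supervised method as lower-dimensional inputs, matching the pre-processing use above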
input vector
$X = \begin{pmatrix}X_{1}\\X_{2}\\X_{3}\end{pmatrix}$
model
$Y = f(X) + \epsilon$ (where $\epsilon$ captures errors and other discrepancies)
with a good $f$, can make predictions of $Y$ at new points $X = x$
can understand which components of $X = (X_{1},X_{2},...,X_{p})$ are important in explaining $Y$ and which are irrelevant
depending on complexity of $f$, may be able to understand how each component $X_{j}$ of $X$ affects $Y$
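a toy simulation of this model, to make the pieces concrete (the choice $f(x) = \sin(x)$ and Gaussian $\epsilon$ are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(x)                        # the true (normally unknown) function

x = rng.uniform(0, 2 * np.pi, size=200)     # observed inputs X
eps = rng.normal(scale=0.3, size=200)       # irreducible error term
y = f(x) + eps                              # responses: Y = f(X) + epsilon
```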
ideal $f(X)$?
at a given point, say $X=4$, a good value is $f(4) = E(Y|X=4)$
$E(Y|X=4)$ means expected value (conditional average) of $Y$ given $X=4$
this ideal $f(x) = E(Y|X=x)$ is called the regression function
regression function $f(x)$ is also defined for vector $X$
$f(x) = f(x_{1},x_{2},x_{3}) = E(Y|X_{1} = x_{1}, X_{2} = x_{2}, X_{3} = x_{3})$
it's the ideal or optimal (with regard to the loss function, the mean-squared prediction error) predictor of $Y$
$f(x) = E(Y|X=x)$ is the function that minimizes $E[(Y-g(X))^{2}|X=x]$ over all functions $g$ at all points $X=x$
$\epsilon = Y - f(x)$ is the irreducible error; even if $f(x)$ is known, still errors in prediction, since at each $X=x$ there is typically a distribution of possible $Y$ values
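a quick numerical check of this minimizing property at a single point (the conditional distribution of $Y$ given $X=x$ is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
y = 2.0 + rng.normal(scale=0.5, size=500_000)   # draws of Y given X = x, so E(Y|X=x) = 2.0

# mean squared error for a few candidate predictions g(x);
# it is smallest at the conditional mean 2.0
for c in [1.5, 2.0, 2.5]:
    print(c, np.mean((y - c) ** 2))
```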
for any estimate $\hat{f}(x)$ of $f(x)$:
$E[(Y-\hat{f}(X))^{2}|X=x] = [f(x) - \hat{f}(x)]^{2} + \text{Var}(\epsilon)$
$[f(x) - \hat{f}(x)]^{2}$ is reducible
$\text{Var}(\epsilon)$ is irreducible
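a Monte Carlo check of the decomposition at one point $x_{0}$ (the true $f$, the biased estimate $\hat{f}$, and the noise level are all made-up values):

```python
import numpy as np

rng = np.random.default_rng(3)
x0, sigma = 2.0, 0.3
f_x0 = np.sin(x0)                       # true f(x0)
fhat_x0 = f_x0 + 0.1                    # a deliberately biased estimate of f(x0)

y = f_x0 + rng.normal(scale=sigma, size=1_000_000)  # draws of Y given X = x0
mse = np.mean((y - fhat_x0) ** 2)

# reducible part (0.1**2) plus irreducible part (sigma**2)
print(mse, 0.1**2 + sigma**2)           # the two numbers should nearly agree
```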
typically few if any data points with $X=4$ exactly, so cannot compute $E(Y|X=4)$
thus, relax definition and let $\hat{f}(x) = \text{Ave}(Y|X \in \mathcal{N}(x))$, where $\mathcal{N}(x)$ is some neighborhood of $x$
this is called nearest-neighbor or local averaging
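a minimal local-averaging estimator matching this definition, with $\mathcal{N}(x)$ taken as the $k$ nearest training points (the choice $k=10$ and the helper name `f_hat` are illustrative):

```python
import numpy as np

def f_hat(x0, x_train, y_train, k=10):
    """Estimate E(Y | X = x0) by averaging y over the k nearest x values."""
    idx = np.argsort(np.abs(x_train - x0))[:k]   # indices of the neighborhood N(x0)
    return y_train[idx].mean()

# e.g., with the simulated (x, y) from the toy model above:
# f_hat(4.0, x, y) should be close to f(4.0) = sin(4.0)
```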