outcome measurement $Y$ (dependent variable, response, target)
vector of $p$ predictor measurements $X$ (inputs, regressors, covariates, features, independent variables)
regression problem: $Y$ is quantitative
classification problem: $Y$ takes values in a finite, unordered set
training data: $(x_{1}, y_{1}),...,(x_{N}, y_{N})$ (observations [examples, instances] of these measurements)
objectives: accurately predict unseen test cases, understand which inputs affect the outcome and how, assess the quality of our predictions and inferences
no outcome variable, just a set of predictors (features) measured on a set of samples
objective is fuzzier -- find groups of samples that behave similarly, find features that behave similarly, find linear combinations of features with the most variation (see the sketch after this block)
difficult to know how well you are doing
different from supervised learning, but can be useful as a pre-processing step for supervised learning
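a minimal sketch of the "most variation" objective via principal components, assuming NumPy (the toy data matrix and the choice of two components are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # toy data: 100 samples, 5 features

# Center the features, then take the SVD; the right singular vectors
# are the directions of maximal variation, in decreasing order.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = Xc @ Vt[:2].T                 # project samples onto the top-2 directions
explained_var = s**2 / (len(X) - 1)    # variance captured along each direction
```

the rows of `scores` could then feed a supervised method as lower-dimensional inputs, matching the pre-processing use above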
input vector
$X = \begin{pmatrix}X_{1}\\X_{2}\\X_{3}\end{pmatrix}$
model
$Y = f(X) + \epsilon$ (where $\epsilon$ captures errors and other discrepancies)
with a good $f$, can make predictions of $Y$ at new points $X = x$
can understand which components of $X = (X_{1},X_{2},...,X_{p})$ are important in explaining $Y$ and which are irrelevant
depending on complexity of $f$, may be able to understand how each component $X_{j}$ of $X$ affects $Y$
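a toy simulation of this model, to make the pieces concrete (the choice $f(x) = \sin(x)$ and Gaussian $\epsilon$ are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(x)                        # the true (normally unknown) function

x = rng.uniform(0, 2 * np.pi, size=200)     # observed inputs X
eps = rng.normal(scale=0.3, size=200)       # irreducible error term
y = f(x) + eps                              # responses: Y = f(X) + epsilon
```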
ideal $f(X)$?
at a given point, say $X=4$, a good value is $f(4) = E(Y|X=4)$
$E(Y|X=4)$ means expected value (conditional average) of $Y$ given $X=4$
this ideal $f(x) = E(Y|X=x)$ is called the regression function
regression function $f(x)$ is also defined for vector $X$
$f(x) = f(x_{1},x_{2},x_{3}) = E(Y|X_{1} = x_{1}, X_{2} = x_{2}, X_{3} = x_{3})$
it's the ideal or optimal (with regard to the loss function, the mean-squared prediction error) predictor of $Y$
$f(x) = E(Y|X=x)$ is the function that minimizes $E[(Y-g(X))^{2}|X=x]$ over all functions $g$ at all points $X=x$
$\epsilon = Y - f(x)$ is the irreducible error; even if $f(x)$ is known, still errors in prediction, since at each $X=x$ there is typically a distribution of possible $Y$ values
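a quick numerical check of this minimizing property at a single point (the conditional distribution of $Y$ given $X=x$ is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
y = 2.0 + rng.normal(scale=0.5, size=500_000)   # draws of Y given X = x, so E(Y|X=x) = 2.0

# mean squared error for a few candidate predictions g(x);
# it is smallest at the conditional mean 2.0
for c in [1.5, 2.0, 2.5]:
    print(c, np.mean((y - c) ** 2))
```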
for any estimate $\hat{f}(x)$ of $f(x)$:
$E[(Y-\hat{f}(X))^{2}|X=x] = [f(x) - \hat{f}(x)]^{2} + \text{Var}(\epsilon)$
$[f(x) - \hat{f}(x)]^{2}$ is reducible
$\text{Var}(\epsilon)$ is irreducible
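a Monte Carlo check of the decomposition at one point $x_{0}$ (the true $f$, the biased estimate $\hat{f}$, and the noise level are all made-up values):

```python
import numpy as np

rng = np.random.default_rng(3)
x0, sigma = 2.0, 0.3
f_x0 = np.sin(x0)                       # true f(x0)
fhat_x0 = f_x0 + 0.1                    # a deliberately biased estimate of f(x0)

y = f_x0 + rng.normal(scale=sigma, size=1_000_000)  # draws of Y given X = x0
mse = np.mean((y - fhat_x0) ** 2)

# reducible part (0.1**2) plus irreducible part (sigma**2)
print(mse, 0.1**2 + sigma**2)           # the two numbers should nearly agree
```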
typically few if any data points with $X=4$ exactly, so cannot compute $E(Y|X=4)$
thus, relax definition and let $\hat{f}(x) = \text{Ave}(Y|X \in \mathcal{N}(x))$, where $\mathcal{N}(x)$ is some neighborhood of $x$
this is called nearest-neighbor or local averaging
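a minimal local-averaging estimator matching this definition, with $\mathcal{N}(x)$ taken as the $k$ nearest training points (the choice $k=10$ and the helper name `f_hat` are illustrative):

```python
import numpy as np

def f_hat(x0, x_train, y_train, k=10):
    """Estimate E(Y | X = x0) by averaging y over the k nearest x values."""
    idx = np.argsort(np.abs(x_train - x0))[:k]   # indices of the neighborhood N(x0)
    return y_train[idx].mean()

# e.g., with the simulated (x, y) from the toy model above:
# f_hat(4.0, x, y) should be close to f(4.0) = sin(4.0)
```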