Lecture 01

Before running any fancy algorithm, make some graphs and plots of the data first.

Lecture 02

Objectives

  • Accurately predict unseen test cases
  • Understand which inputs affect outputs and how
  • Assess the quality of our predictions and inferences

Philosophy

  • Know when and how to use what techniques
  • Understand simple methods first and then sophisticated ones
  • Accurately assess how well (and how badly) a method performs

Unsupervised learning

  • No outcome
  • Fuzzy objective
  • Can be useful as a pre-processing step for supervised learning

Statistical Learning vs. Machine Learning

  • ML: large scale applications and prediction accuracy
  • SL: models and their interpretability, precision, and uncertainty

Done

Lecture 03

Notation

  • response: Y; feature, input: X
$$Y = f(X)+\epsilon$$

What is f(X) good for

  • predict on new points
  • Can tell which components are important
  • Maybe can tell how each component affects Y

The regression function f(x)

$$f(x) = f(x_1, x_2, x_3) = E(Y|X=x)$$
  • Reducible error (from estimating f) vs. irreducible error: Var(\epsilon)

How to estimate f(x)

  • Use neighborhood of x: N(x)
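A minimal sketch of the neighborhood idea: estimate f(x) = E(Y|X=x) by averaging Y over the k nearest training points. Data and the helper name are made up for illustration.

```python
# Estimate f(x) = E(Y | X = x) by averaging Y over the k nearest
# training points, i.e. a simple neighborhood N(x). Toy 1-D data.
def knn_regress(x, xs, ys, k=3):
    # Sort training indices by distance to the query point x.
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return sum(ys[i] for i in order[:k]) / k

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.2, 3.9, 6.1, 8.0]
print(knn_regress(2.0, xs, ys, k=3))  # averages the ys at x = 1, 2, 3
```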

Done

Lecture 04

The curse of dimensionality

  • The drawback of the neighborhood method above:
  • We want to reduce variance, so we choose a certain percentage of the data points, say 10%.
  • When the dimension p is large, the radius needed to capture 10% of the points can be large
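A quick check of why: for data uniform on the unit hypercube, the edge length a cubical neighborhood needs to capture 10% of the points grows toward 1 as p grows.

```python
# For data uniform on the unit hypercube [0,1]^p, a cubical
# neighborhood needs edge length 0.10 ** (1/p) to capture 10% of the
# points. For large p that is nearly the full range of each coordinate.
for p in [1, 2, 10, 100]:
    edge = 0.10 ** (1 / p)
    print(f"p = {p:3d}: edge length = {edge:.3f}")
```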

Parametric and structured models

  • Linear model, quadratic model
  • Thin-plate splines (later, ch. 7) do local averaging; they can overfit with a badly chosen parameter.

Trade-off

  • prediction accuracy vs. interpretability
  • good fit vs. underfit vs. overfit
  • Parsimony (simpler model) vs. black box (use all variables)

Done

Lecture 05

Assessing Model Accuracy

Average squared prediction error

  • on training data: biased towards overfit models
  • on test data: better

Bias-Variance Trade-off

$$E[(y_0-\hat{f}(x_0))^2] = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)$$

Note that x_0 is a test point, so why is \hat{f}(x_0) a random variable? Here the method is fixed, but the training data vary according to some probability distribution.

So bias is defined as the following:

$$E(\hat{f}(x_0)) - f(x_0)$$

A more flexible model has less bias but more variance.
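A small simulation of this (a toy setup: true f(x) = x^2, two made-up estimators, training sets redrawn each repetition so \hat{f}(x_0) is random):

```python
import random

# Toy simulation: at a fixed test point x0 the fitted value f_hat(x0)
# is random because the training set is redrawn each time. True
# f(x) = x**2; both "estimators" below are made up for illustration.
random.seed(0)
f = lambda x: x * x
x0, reps = 0.5, 2000

def fit_mean(ys):          # inflexible: predict the overall mean (high bias)
    return sum(ys) / len(ys)

def fit_nearest(xs, ys):   # flexible: copy the nearest point's y (high variance)
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]

preds = {"mean": [], "nearest": []}
for _ in range(reps):
    xs = [random.random() for _ in range(20)]
    ys = [f(x) + random.gauss(0, 0.1) for x in xs]
    preds["mean"].append(fit_mean(ys))
    preds["nearest"].append(fit_nearest(xs, ys))

for name, p in preds.items():
    mu = sum(p) / reps
    var = sum((v - mu) ** 2 for v in p) / reps
    print(f"{name:8s} bias = {mu - f(x0):+.3f}  variance = {var:.4f}")
```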

Lecture 06

Classification

The neighborhood approach can also work for classification in low dimensions.

  • SVM builds a structured model for C(X), the classifier
  • Logistic regression and generalized additive models build a structured model for the probability p_k(x)

Nearest Neighbor

Small K: high variance; large K: high bias.
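A toy sketch of majority-vote KNN classification (made-up 1-D data):

```python
from collections import Counter

# Majority-vote K-nearest-neighbors on made-up 1-D data. Small K
# copies the nearest training point (high variance); large K drifts
# toward the overall majority class (high bias).
def knn_classify(x, xs, labels, k):
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

xs     = [0.0, 0.5, 1.0, 4.0, 4.5, 5.0, 5.5]
labels = ["a", "a", "b", "b", "b", "b", "b"]
print(knn_classify(0.2, xs, labels, k=1))  # 'a': the single nearest point wins
print(knn_classify(0.2, xs, labels, k=7))  # 'b': the overall majority wins
```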

Lecture 3.1

Linear regression

  • Simple is good; useful both conceptually and practically

Questions to ask (applicable to all algorithms, I think)

  • Is there a relationship between a predictor and the response?
  • How strong is the relationship?
  • How accurately can we predict future observations?
  • Is the relationship linear?
  • Is there synergy among the predictors?

Assessing the accuracy of the coefficient estimates

  • SE(\beta_1) depends on the noise level and is inversely related to the spread of X.
    • When we can choose X, it's good to spread the values out to get a good estimate of the slope.
  • 95% confidence interval: roughly \hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1)
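A quick illustration using the formula SE(\hat{\beta}_1) = \sigma / \sqrt{\sum (x_i - \bar{x})^2} (x values are made up):

```python
import math

# SE(beta_1_hat) = sigma / sqrt(sum((x_i - x_bar)**2)): the more
# spread out the x_i, the smaller the slope's standard error.
def slope_se(xs, sigma=1.0):
    x_bar = sum(xs) / len(xs)
    return sigma / math.sqrt(sum((x - x_bar) ** 2 for x in xs))

narrow = [4.9, 5.0, 5.1, 5.0, 4.9, 5.1]
wide   = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
print(slope_se(narrow))  # large: x barely varies
print(slope_se(wide))    # small: x well spread out
```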

Lecture 3.2

Hypothesis testing

Null hypothesis H_0: there is no relationship between X and Y. Alternative hypothesis H_a: there is some relationship.

  • Compute the t-statistic
  • Then compute the p-value
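A sketch of the computation; the \hat{\beta}_1 and SE numbers are example values, and the two-sided p-value uses a normal approximation to the t distribution (fine for moderate n):

```python
import math

# t-statistic for H0: beta_1 = 0. The beta-hat and SE values below are
# example numbers; the two-sided p-value uses the normal approximation
# to the t distribution, which is close for moderate n.
beta_1_hat = 0.0475
se = 0.0027
t_stat = beta_1_hat / se
p_value = math.erfc(abs(t_stat) / math.sqrt(2))  # two-sided normal tail
print(f"t = {t_stat:.2f}, p = {p_value:.2g}")
```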

Assessing the overall accuracy of the model

  • Residual standard error
  • R-squared
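Both quantities from their definitions (toy data, with an assumed fitted line):

```python
import math

# Residual standard error and R^2 for a simple fitted line. The data
# and the fitted coefficients (b0, b1) are made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = 0.0, 2.0                      # assumed fitted coefficients
resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
rss = sum(e * e for e in resid)
y_bar = sum(ys) / len(ys)
tss = sum((y - y_bar) ** 2 for y in ys)
rse = math.sqrt(rss / (len(ys) - 2))   # n - 2 degrees of freedom
r2 = 1 - rss / tss                     # fraction of variance explained
print(f"RSE = {rse:.3f}, R^2 = {r2:.3f}")
```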

Lecture 3.3

Multiple Linear Regression

  • A regression coefficient beta_j estimates the expected change in Y per unit change in X_j, with all other predictors held fixed. BUT predictors usually change together

  • The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively.

Lecture 3.4

Some important questions

  1. Is at least one of the predictors useful in predicting the response?
    • Look at the F-statistic; the bigger, the better
  2. Do all the predictors help to explain Y, or is only a subset of the predictors useful?
    • Examine all subsets: not feasible for big p
    • Forward selection
    • Backward selection
    • Other methods: AIC, BIC, cross-validation
  3. How well does the model fit the data?
  4. Given a set of predictor values, what response value should we predict, and how accurate is that prediction?
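Forward selection can be sketched like this (synthetic data; only columns 0 and 2 actually drive y, and the helper `rss_with` is made up):

```python
import numpy as np

# Forward selection sketch: start from the null model and greedily add
# the predictor whose inclusion gives the lowest RSS. Synthetic data:
# y depends only on columns 0 and 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=100)

def rss_with(cols):
    # RSS of the least-squares fit using an intercept plus `cols`.
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

selected = []
for _ in range(2):                       # greedily add two predictors
    remaining = [j for j in range(4) if j not in selected]
    best = min(remaining, key=lambda j: rss_with(selected + [j]))
    selected.append(best)
print(selected)  # picks the truly relevant columns first
```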

Qualitative variables

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i $$

x_i can take the value 1 or 0.

If a qualitative variable has more than 2 levels, create multiple dummy variables, one fewer than the number of levels: e.g. one for whether Asian or not, one for whether Caucasian or not.
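A sketch of the dummy coding (levels taken from the lecture's ethnicity example; the helper name is made up):

```python
# Dummy coding for a qualitative variable with 3 levels: one fewer
# dummy than levels; the omitted level acts as the baseline. The
# helper name `dummies` is made up for illustration.
def dummies(value, levels):
    # levels[0] is the baseline and gets no column.
    return [1 if value == lev else 0 for lev in levels[1:]]

levels = ["African American", "Asian", "Caucasian"]
print(dummies("Asian", levels))             # [1, 0]
print(dummies("Caucasian", levels))         # [0, 1]
print(dummies("African American", levels))  # [0, 0], the baseline
```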

Lecture 3.5 Interaction and nonlinearity

  • Use the product of two predictors as an additional term
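A sketch of what the interaction term looks like in the design matrix (coefficients are made up):

```python
# One observation's design-matrix row for the model
# Y ~ b0 + b1*x1 + b2*x2 + b3*(x1*x2); the interaction term is just
# the product x1*x2. Coefficients below are made up for illustration.
def design_row(x1, x2):
    return [1.0, x1, x2, x1 * x2]   # intercept, main effects, interaction

b = [2.0, 0.5, 0.3, 0.1]            # hypothetical fitted coefficients
y_hat = sum(bi * xi for bi, xi in zip(b, design_row(10.0, 20.0)))
print(y_hat)  # 2 + 5 + 6 + 20 = 33.0
```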

Hierarchy

  • If we want to include interactions, we also need to include the main effects

Uncovered parts

Section 3.3.3:

  • Outliers
  • Non-constant variance of error terms
  • High leverage points
  • Collinearity

Lecture 8.1

Tree-based methods: stratifying and segmenting the predictor space into a number of simple regions. Simple and easy to explain. Accuracy can be improved by bagging, random forests, and boosting.

Terminology

  • The divided regions are called terminal nodes (leaves)
  • The tree is drawn upside down, so the leaves are at the bottom
  • The points where the input space is split are internal nodes

Tree building process

  • Divide the predictor space into non-overlapping regions R_1, ..., R_J.
  • Everything in the same region gets the same prediction.
  • Although the regions could be any shape, boxes are easy to interpret.
  • Finding the optimal partition is infeasible.
  • Top-down, greedy approach: at each step, select the variable and split point that give the best RSS reduction.
  • The result is piecewise constant
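The greedy split search can be sketched as follows (one predictor, toy data; helper names are made up):

```python
# Greedy search for the single best split on one predictor: try every
# cutpoint and keep the one minimizing the total RSS of the two halves.
def rss(ys):
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(xs, ys):
    best = (float("inf"), None)
    for cut in sorted(set(xs))[1:]:             # candidate cutpoints
        left  = [y for x, y in zip(xs, ys) if x <  cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        total = rss(left) + rss(right)
        if total < best[0]:
            best = (total, cut)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
print(best_split(xs, ys))  # splits between the two obvious clusters
```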

Pruning a tree

  • Large tree: overfits and performs poorly on test data.
  • Better strategy: build a large tree and prune it back
    • Cost-complexity pruning: penalize tree size (i.e. # of terminal nodes) with a parameter alpha.
  • It looks like the lecture does not cover the details of how to prune the tree

Tree algorithm

  1. Use recursive binary splitting to grow a large tree; stop when each terminal node has fewer than some minimum number of observations
    • Note: it doesn't say how to select the optimal split point that results in the best RSS reduction
  2. Prune the tree using cost-complexity pruning for a sequence of alphas.
  3. Use K-fold cross-validation to choose alpha.
  4. Go back to step 2 and choose the subtree for that alpha.

Regression tree

Hitter's data:

* Number of years + number of hits to predict (log) salary
* An internal node is a split on one predictor
* The prediction is the mean of the observations in the leaf node.

Classification tree

  • Terminal node: instead of the mean, use a majority vote.
  • Internal node: splits cannot be based on RSS, and the classification error rate is not sensitive enough for tree growing, so use:
    • Gini index, a measure of node purity
    • cross-entropy, similar to the Gini index
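Both impurity measures from their definitions, on a vector of class proportions p_k:

```python
import math

# Node impurity measures for classification splits: Gini index
# G = sum p_k * (1 - p_k) and cross-entropy D = -sum p_k * log(p_k).
# Both are small when the node is nearly pure.
def gini(ps):
    return sum(p * (1 - p) for p in ps)

def entropy(ps):
    return -sum(p * math.log(p) for p in ps if p > 0)

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # impure node: large values
print(gini([0.9, 0.1]), entropy([0.9, 0.1]))  # nearly pure: small values
```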

Lecture 8.2 Bagging

Bagging

  • Bootstrap aggregation (bagging) is a general procedure for reducing the variance of a statistical learning method.
    • Simple example: if Z_1, ..., Z_n are i.i.d. with variance sigma^2, then the mean sum(Z_i)/n has variance sigma^2/n
  • Generate B different bootstrapped training data sets and average the trees fit to them
    • Can observations be repeated within one? YES
    • Each bootstrap data set has the same size n as the original.
  • OOB error
    • Each observation appears in the training sets of about 2/3 of the trees, so about 1/3 of the trees never see it; those trees are perfect for estimating its test error.
    • Useful when deciding the number of trees.
  • Random forest
    • Choose m = sqrt(p) candidate variables at each split.
    • The purpose is to reduce the correlation between trees.
  • Gene example
    • Chose the 500 genes with the largest variance out of 4,718 genes. Why? It is an intuitively reasonable screening step.
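The roughly 1/3 figure comes from the chance that a bootstrap sample of size n omits a given observation:

```python
import math

# A bootstrap sample of size n omits a given observation with
# probability (1 - 1/n)**n, which tends to 1/e (about 0.368, roughly
# 1/3): those are the trees for which the observation is out-of-bag.
for n in [10, 100, 1000]:
    p_out = (1 - 1 / n) ** n
    print(f"n = {n:4d}: P(left out of one bootstrap sample) = {p_out:.3f}")
print(f"limit 1/e = {1 / math.e:.3f}")
```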

Lecture 8.3 Boosting

  • Bagging builds trees independently
  • Boosting builds trees sequentially

Idea

  • Fit a small tree to the residuals and shrink it by lambda.
  • Improve the fit slowly by adding this new shrunken tree to the fitted function.

  • Gene example: even stumps (d = 1) outperform the random forest

  • Tuning parameters

    • Number of trees: B
    • Shrinkage parameter lambda, usually 0.01 or 0.001
    • Number of splits: d, usually try 1, 2, 4, 8
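The boosting idea above, as a rough sketch with stumps (the helper names are illustrative, not a library API):

```python
# Rough boosting sketch with stumps (one-split trees). Each round fits
# a stump to the current residuals, shrinks it by lam, and adds it to
# the running fit, improving slowly.
def fit_stump(xs, rs):
    # One-split regression stump: returns a prediction function.
    best = None
    for cut in sorted(set(xs))[1:]:                 # candidate split points
        left  = [r for x, r in zip(xs, rs) if x <  cut]
        right = [r for x, r in zip(xs, rs) if x >= cut]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, cut, lm, rm)
    _, cut, lm, rm = best
    return lambda x: lm if x < cut else rm

def boost(xs, ys, B=1000, lam=0.1):
    stumps, rs = [], list(ys)                       # residuals start as y
    for _ in range(B):
        stump = fit_stump(xs, rs)
        stumps.append(stump)
        rs = [r - lam * stump(x) for x, r in zip(xs, rs)]
    return lambda x: sum(lam * s(x) for s in stumps)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 1.2, 3.0, 3.1, 5.0]
f_hat = boost(xs, ys)
print([round(f_hat(x), 2) for x in xs])  # slowly approaches ys
```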

Variable Importance measure

  • Regression: total amount the RSS is decreased by splits on a given predictor, averaged over all B trees.
  • Classification: total amount the Gini index is decreased by splits on a given predictor, averaged over all B trees.