Lecture 01

Before running any fancy algorithm, make some graphs and plots of the data first.

Lecture 02

Objectives

  • Accurately predict unseen test cases
  • Understand which inputs affect outputs and how
  • Assess the quality of our predictions and inferences

Philosophy

  • Know when and how to use what techniques
  • Understand simple methods first and then sophisticated ones
  • Accurately assess how well (and how badly) a method performs

Unsupervised learning

  • No outcome
  • Fuzzy objective
  • Can be useful as a pre-processing step for supervised learning

Statistical Learning vs. Machine Learning

  • ML: large scale applications and prediction accuracy
  • SL: models and their interpretability, precision, and uncertainty

Done

Lecture 03

Notation

  • response: Y; feature, input: X
$$Y = f(X)+\epsilon$$

What is f(X) good for

  • predict on new points
  • Can tell which components are important
  • Maybe can tell how each component affects Y

The regression function f(x)

$$f(x) = f(x_1, x_2, x_3) = E(Y|X=x)$$
  • Reducible error (from estimating f) vs. irreducible error: Var(\epsilon)

How to estimate f(x)

  • Use neighborhood of x: N(x)
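A minimal sketch of the neighborhood idea: estimate f(x) = E(Y|X=x) by averaging Y over the k nearest training points. Data and the helper name are made up for illustration.

```python
# Estimate f(x) = E(Y | X = x) by averaging Y over the k nearest
# training points, i.e. a simple neighborhood N(x). Toy 1-D data.
def knn_regress(x, xs, ys, k=3):
    # Sort training indices by distance to the query point x.
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return sum(ys[i] for i in order[:k]) / k

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.2, 3.9, 6.1, 8.0]
print(knn_regress(2.0, xs, ys, k=3))  # averages the ys at x = 1, 2, 3
```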

Done

Lecture 04

The curse of dimensionality

  • The drawback of the neighborhood method above:
  • We want to reduce variance, so we choose a certain percentage of the data points, say 10%.
  • When the dimension p is large, the radius needed to capture 10% of the points can be large
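A quick check of why: for data uniform on the unit hypercube, the edge length a cubical neighborhood needs to capture 10% of the points grows toward 1 as p grows.

```python
# For data uniform on the unit hypercube [0,1]^p, a cubical
# neighborhood needs edge length 0.10 ** (1/p) to capture 10% of the
# points. For large p that is nearly the full range of each coordinate.
for p in [1, 2, 10, 100]:
    edge = 0.10 ** (1 / p)
    print(f"p = {p:3d}: edge length = {edge:.3f}")
```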

Parametric and structured models

  • Linear model, quadratic model
  • Thin-plate splines (later, ch. 7) do local averaging; they can overfit with a badly chosen parameter.

Trade-off

  • prediction accuracy vs. interpretability
  • good fit vs. underfit vs. overfit
  • Parsimony (simpler model) vs. black box (use all variables)

Done

Lecture 05

Assessing Model Accuracy

Average squared prediction error

  • on training data: biased towards overfit models
  • on test data: better

Bias-Variance Trade-off

$$E[(y_0-\hat{f}(x_0))^2] = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)$$

Note that x_0 is a test point, so why is \hat{f}(x_0) a random variable? Here the method is fixed, but the training data vary according to some probability distribution.

So bias is defined as the following:

$$E(\hat{f}(x_0)) - f(x_0)$$

A more flexible model has less bias but more variance.
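A small simulation of this (a toy setup: true f(x) = x^2, two made-up estimators, training sets redrawn each repetition so \hat{f}(x_0) is random):

```python
import random

# Toy simulation: at a fixed test point x0 the fitted value f_hat(x0)
# is random because the training set is redrawn each time. True
# f(x) = x**2; both "estimators" below are made up for illustration.
random.seed(0)
f = lambda x: x * x
x0, reps = 0.5, 2000

def fit_mean(ys):          # inflexible: predict the overall mean (high bias)
    return sum(ys) / len(ys)

def fit_nearest(xs, ys):   # flexible: copy the nearest point's y (high variance)
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]

preds = {"mean": [], "nearest": []}
for _ in range(reps):
    xs = [random.random() for _ in range(20)]
    ys = [f(x) + random.gauss(0, 0.1) for x in xs]
    preds["mean"].append(fit_mean(ys))
    preds["nearest"].append(fit_nearest(xs, ys))

for name, p in preds.items():
    mu = sum(p) / reps
    var = sum((v - mu) ** 2 for v in p) / reps
    print(f"{name:8s} bias = {mu - f(x0):+.3f}  variance = {var:.4f}")
```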

Lecture 06

Classification

The neighborhood approach can also work for classification in low dimensions.

  • SVM builds a structured model for C(X), the classifier
  • Logistic regression and generalized additive models build a structured model for the probability p_k(x)

Nearest Neighbor

Small K: high variance; large K: high bias.
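A toy sketch of majority-vote KNN classification (made-up 1-D data):

```python
from collections import Counter

# Majority-vote K-nearest-neighbors on made-up 1-D data. Small K
# copies the nearest training point (high variance); large K drifts
# toward the overall majority class (high bias).
def knn_classify(x, xs, labels, k):
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

xs     = [0.0, 0.5, 1.0, 4.0, 4.5, 5.0, 5.5]
labels = ["a", "a", "b", "b", "b", "b", "b"]
print(knn_classify(0.2, xs, labels, k=1))  # 'a': the single nearest point wins
print(knn_classify(0.2, xs, labels, k=7))  # 'b': the overall majority wins
```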

Lecture 3.1

Linear regression

  • Simple is good; useful both conceptually and practically

Questions to ask (applicable to all algorithms, I think)

  • Is there a relationship between a predictor and the response?
  • How strong is the relationship?
  • How accurately can we predict future observations?
  • Is the relationship linear?
  • Is there synergy among the predictors?

Assessing the accuracy of the coefficient estimates

  • SE(\beta_1) depends on the noise level and is inversely related to the spread of X.
    • When we can choose X, it's good to spread the values out to get a good estimate of the slope.
  • 95% confidence interval: roughly \hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1)
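A quick illustration using the formula SE(\hat{\beta}_1) = \sigma / \sqrt{\sum (x_i - \bar{x})^2} (x values are made up):

```python
import math

# SE(beta_1_hat) = sigma / sqrt(sum((x_i - x_bar)**2)): the more
# spread out the x_i, the smaller the slope's standard error.
def slope_se(xs, sigma=1.0):
    x_bar = sum(xs) / len(xs)
    return sigma / math.sqrt(sum((x - x_bar) ** 2 for x in xs))

narrow = [4.9, 5.0, 5.1, 5.0, 4.9, 5.1]
wide   = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
print(slope_se(narrow))  # large: x barely varies
print(slope_se(wide))    # small: x well spread out
```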

Lecture 3.2

Hypothesis testing

Null hypothesis H_0: there is no relationship between X and Y. Alternative hypothesis H_a: there is some relationship.

  • Compute the t-statistic
  • Then compute the p-value
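A sketch of the computation; the \hat{\beta}_1 and SE numbers are example values, and the two-sided p-value uses a normal approximation to the t distribution (fine for moderate n):

```python
import math

# t-statistic for H0: beta_1 = 0. The beta-hat and SE values below are
# example numbers; the two-sided p-value uses the normal approximation
# to the t distribution, which is close for moderate n.
beta_1_hat = 0.0475
se = 0.0027
t_stat = beta_1_hat / se
p_value = math.erfc(abs(t_stat) / math.sqrt(2))  # two-sided normal tail
print(f"t = {t_stat:.2f}, p = {p_value:.2g}")
```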

Assessing the overall accuracy of the model

  • Residual standard error
  • R-squared
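Both quantities from their definitions (toy data, with an assumed fitted line):

```python
import math

# Residual standard error and R^2 for a simple fitted line. The data
# and the fitted coefficients (b0, b1) are made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = 0.0, 2.0                      # assumed fitted coefficients
resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
rss = sum(e * e for e in resid)
y_bar = sum(ys) / len(ys)
tss = sum((y - y_bar) ** 2 for y in ys)
rse = math.sqrt(rss / (len(ys) - 2))   # n - 2 degrees of freedom
r2 = 1 - rss / tss                     # fraction of variance explained
print(f"RSE = {rse:.3f}, R^2 = {r2:.3f}")
```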

Lecture 3.3

Multiple Linear Regression

  • A regression coefficient beta_j estimates the expected change in Y per unit change in X_j, with all other predictors held fixed. BUT predictors usually change together

  • The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively.

Lecture 3.4

Some important questions

  1. Is at least one of the predictors useful in predicting the response?
    • Look at the F-statistic; the bigger, the better
  2. Do all the predictors help to explain Y, or is only a subset of the predictors useful?
    • Examine all subsets: not feasible for big p
    • Forward selection
    • Backward selection
    • Other methods: AIC, BIC, cross-validation
  3. How well does the model fit the data?
  4. Given a set of predictor values, what response value should we predict, and how accurate is that prediction?
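Forward selection can be sketched like this (synthetic data; only columns 0 and 2 actually drive y, and the helper `rss_with` is made up):

```python
import numpy as np

# Forward selection sketch: start from the null model and greedily add
# the predictor whose inclusion gives the lowest RSS. Synthetic data:
# y depends only on columns 0 and 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=100)

def rss_with(cols):
    # RSS of the least-squares fit using an intercept plus `cols`.
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

selected = []
for _ in range(2):                       # greedily add two predictors
    remaining = [j for j in range(4) if j not in selected]
    best = min(remaining, key=lambda j: rss_with(selected + [j]))
    selected.append(best)
print(selected)  # picks the truly relevant columns first
```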

Qualitative variables

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i $$

x_i can take the value 1 or 0.

If a qualitative variable has more than 2 levels, create multiple dummy variables, one fewer than the number of levels: e.g. one for whether Asian or not, one for whether Caucasian or not.
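A sketch of the dummy coding (levels taken from the lecture's ethnicity example; the helper name is made up):

```python
# Dummy coding for a qualitative variable with 3 levels: one fewer
# dummy than levels; the omitted level acts as the baseline. The
# helper name `dummies` is made up for illustration.
def dummies(value, levels):
    # levels[0] is the baseline and gets no column.
    return [1 if value == lev else 0 for lev in levels[1:]]

levels = ["African American", "Asian", "Caucasian"]
print(dummies("Asian", levels))             # [1, 0]
print(dummies("Caucasian", levels))         # [0, 1]
print(dummies("African American", levels))  # [0, 0], the baseline
```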

Lecture 3.5 Interaction and nonlinearity

  • Use the product of two predictors as an additional term
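A sketch of what the interaction term looks like in the design matrix (coefficients are made up):

```python
# One observation's design-matrix row for the model
# Y ~ b0 + b1*x1 + b2*x2 + b3*(x1*x2); the interaction term is just
# the product x1*x2. Coefficients below are made up for illustration.
def design_row(x1, x2):
    return [1.0, x1, x2, x1 * x2]   # intercept, main effects, interaction

b = [2.0, 0.5, 0.3, 0.1]            # hypothetical fitted coefficients
y_hat = sum(bi * xi for bi, xi in zip(b, design_row(10.0, 20.0)))
print(y_hat)  # 2 + 5 + 6 + 20 = 33.0
```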

Hierarchy

  • If we want to include interactions, we also need to include the main effects

Uncovered parts

Section 3.3.3:

  • Outliers
  • Non-constant variance of error terms
  • High leverage points
  • Collinearity

Lecture 8.1

Tree-based methods: stratifying and segmenting the predictor space into a number of simple regions. Simple and easy to explain. Accuracy can be improved by bagging, random forests, and boosting.

Terminology

  • The divided regions are called terminal nodes (leaves)
  • The tree is drawn upside down, so the leaves are at the bottom
  • The points where the input space is split are internal nodes

Tree building process

  • Divide the predictor space into non-overlapping regions R_1, ..., R_J.
  • Everything in the same region gets the same prediction.
  • Although the regions could be any shape, boxes are easy to interpret.
  • Finding the optimal partition is infeasible.
  • Top-down, greedy approach: at each step, select the variable and split point that give the best RSS reduction.
  • The result is piecewise constant
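The greedy split search can be sketched as follows (one predictor, toy data; helper names are made up):

```python
# Greedy search for the single best split on one predictor: try every
# cutpoint and keep the one minimizing the total RSS of the two halves.
def rss(ys):
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(xs, ys):
    best = (float("inf"), None)
    for cut in sorted(set(xs))[1:]:             # candidate cutpoints
        left  = [y for x, y in zip(xs, ys) if x <  cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        total = rss(left) + rss(right)
        if total < best[0]:
            best = (total, cut)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
print(best_split(xs, ys))  # splits between the two obvious clusters
```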

Pruning a tree

  • Large tree: overfits and performs poorly on test data.
  • Better strategy: build a large tree and prune it back
    • Cost-complexity pruning: penalize tree size (i.e. # of terminal nodes) with a parameter alpha.
  • It looks like the lecture does not cover the details of how to prune the tree

Tree algorithm

  1. Use recursive binary splitting to grow a large tree; stop when each terminal node has fewer than some minimum number of observations
    • Note: it doesn't say how to select the optimal split point that results in the best RSS reduction
  2. Prune the tree using cost-complexity pruning for a sequence of alphas.
  3. Use K-fold cross-validation to choose alpha.
  4. Go back to step 2 and choose the subtree for that alpha.

Regression tree

Hitter's data:

* Number of years + number of hits to predict (log) salary
* An internal node is a split on one predictor
* The prediction is the mean of the observations in the leaf node.

Classification tree

  • Terminal node: instead of the mean, use a majority vote.
  • Internal node: splits cannot be based on RSS, and the classification error rate is not sensitive enough for tree growing, so use:
    • Gini index, a measure of node purity
    • cross-entropy, similar to the Gini index
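Both impurity measures from their definitions, on a vector of class proportions p_k:

```python
import math

# Node impurity measures for classification splits: Gini index
# G = sum p_k * (1 - p_k) and cross-entropy D = -sum p_k * log(p_k).
# Both are small when the node is nearly pure.
def gini(ps):
    return sum(p * (1 - p) for p in ps)

def entropy(ps):
    return -sum(p * math.log(p) for p in ps if p > 0)

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # impure node: large values
print(gini([0.9, 0.1]), entropy([0.9, 0.1]))  # nearly pure: small values
```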

Lecture 8.2 Bagging

Bagging

  • Bootstrap aggregation (bagging) is a general procedure for reducing the variance of a statistical learning method.
    • Simple example: if Z_1, ..., Z_n are i.i.d. with variance sigma^2, then the mean sum(Z_i)/n has variance sigma^2/n
  • Generate B different bootstrapped training data sets and average the trees fit to them
    • Can observations be repeated within one? YES
    • Each bootstrap data set has the same size n as the original.
  • OOB error
    • Each observation appears in the training sets of about 2/3 of the trees, so about 1/3 of the trees never see it; those trees are perfect for estimating its test error.
    • Useful when deciding the number of trees.
  • Random forest
    • Choose m = sqrt(p) candidate variables at each split.
    • The purpose is to reduce the correlation between trees.
  • Gene example
    • Chose the 500 genes with the largest variance out of 4,718 genes. Why? It is an intuitively reasonable screening step.
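The roughly 1/3 figure comes from the chance that a bootstrap sample of size n omits a given observation:

```python
import math

# A bootstrap sample of size n omits a given observation with
# probability (1 - 1/n)**n, which tends to 1/e (about 0.368, roughly
# 1/3): those are the trees for which the observation is out-of-bag.
for n in [10, 100, 1000]:
    p_out = (1 - 1 / n) ** n
    print(f"n = {n:4d}: P(left out of one bootstrap sample) = {p_out:.3f}")
print(f"limit 1/e = {1 / math.e:.3f}")
```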

Lecture 8.3 Boosting

  • Bagging builds trees independently
  • Boosting builds trees sequentially

Idea

  • Fit a small tree to the residuals and shrink it by lambda.
  • Improve the fit slowly by adding this new shrunken tree to the fitted function.

  • Gene example: even stumps (d = 1) outperform the random forest

  • Tuning parameters

    • Number of trees: B
    • Shrinkage parameter lambda, usually 0.01 or 0.001
    • Number of splits: d, usually try 1, 2, 4, 8
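The boosting idea above, as a rough sketch with stumps (the helper names are illustrative, not a library API):

```python
# Rough boosting sketch with stumps (one-split trees). Each round fits
# a stump to the current residuals, shrinks it by lam, and adds it to
# the running fit, improving slowly.
def fit_stump(xs, rs):
    # One-split regression stump: returns a prediction function.
    best = None
    for cut in sorted(set(xs))[1:]:                 # candidate split points
        left  = [r for x, r in zip(xs, rs) if x <  cut]
        right = [r for x, r in zip(xs, rs) if x >= cut]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, cut, lm, rm)
    _, cut, lm, rm = best
    return lambda x: lm if x < cut else rm

def boost(xs, ys, B=1000, lam=0.1):
    stumps, rs = [], list(ys)                       # residuals start as y
    for _ in range(B):
        stump = fit_stump(xs, rs)
        stumps.append(stump)
        rs = [r - lam * stump(x) for x, r in zip(xs, rs)]
    return lambda x: sum(lam * s(x) for s in stumps)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 1.2, 3.0, 3.1, 5.0]
f_hat = boost(xs, ys)
print([round(f_hat(x), 2) for x in xs])  # slowly approaches ys
```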

Variable Importance measure

  • Regression: total amount the RSS is decreased by splits on a given predictor, averaged over all B trees.
  • Classification: total amount the Gini index is decreased by splits on a given predictor, averaged over all B trees.