Basic notes on the important points from each lecture of the Coursera Machine Learning class taught by Andrew Ng of Stanford.

The next step is to put these lessons into code in another notebook.

Notes for Andrew Ng's Machine Learning class

Machine Learning:

  • Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.

  • Tom Mitchell (1998) Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Machine Learning Algorithms:

  • Supervised
  • Unsupervised

Others:

  • Reinforcement Learning, recommender systems...

Supervised

  1. Hypothesis function $h(x)$ maps input feature values to output label/target values.
  2. Cost function $J(\theta)$, which needs to be minimized with some algorithm to determine the best-fitting model parameters for the given data (see the sketch after this list).
  3. Testing of the resulting best-fit hypothesis function needs to be performed to determine robustness/generalization and accuracy.
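
A minimal sketch of the first two pieces for linear regression, assuming a design matrix `X` whose first column is all ones (for the intercept) and NumPy as the only dependency:

```python
import numpy as np

def hypothesis(theta, X):
    """Linear hypothesis h(x) = theta^T x, evaluated for every row of X."""
    return X @ theta

def cost(theta, X, y):
    """Squared-error cost J(theta) = (1 / 2m) * sum((h(x) - y)^2)."""
    m = len(y)
    residuals = hypothesis(theta, X) - y
    return residuals @ residuals / (2 * m)

# Toy data: first column of ones for the intercept term.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(cost(np.zeros(2), X, y))  # cost at the all-zeros starting parameters
```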

Regression: Used to predict continuous valued output

  • fitting algorithms for regression include:
    • gradient descent

Non-continuous label values -> classification problem

Unsupervised

Algorithm naturally discovers different classes.

Ex. Cocktail party problem: using recordings from multiple microphones to separate overlapping sources, e.g. a speaker's voice from background sounds (music, general chatter).

Gradient descent

Stochastic Gradient Descent in sklearn (see the sketch after the notes below)

  • learning rate ($\alpha$):
    • Too big and it is likely to overshoot the local minimum, and may even diverge.
    • Too small and it will take a long time to converge.

Plotting $J$ vs. the number of iterations helps visualize which case you are in. Once $\alpha$ is chosen, the slope of $J$ should become shallower as the iterations approach the minimum, so the resulting step sizes decrease even with a fixed $\alpha$.
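
A minimal batch gradient descent sketch that records $J$ at every iteration, so the $J$ vs. iterations curve can be plotted (it reuses the `hypothesis`, `cost`, `X`, and `y` from the earlier sketch; the `alpha` and `n_iters` values are illustrative, not prescribed by the course):

```python
def gradient_descent(X, y, alpha=0.1, n_iters=200):
    """Batch gradient descent for linear regression; returns theta and the J history."""
    m, n = X.shape
    theta = np.zeros(n)
    J_history = []
    for _ in range(n_iters):
        # Every step uses ALL training examples, and all parameters are
        # updated simultaneously from the same gradient.
        gradient = X.T @ (hypothesis(theta, X) - y) / m
        theta = theta - alpha * gradient
        J_history.append(cost(theta, X, y))
    return theta, J_history

theta_fit, J_history = gradient_descent(X, y)
# Plot J_history against the iteration number to check that the cost keeps decreasing.
```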

Notes:

  • Make sure to update all model parameters simultaneously in each iteration; updating them sequentially skews the direction the cost function sends you in.

  • Be careful with the starting position: it can lead to a local rather than the global minimum, so try many different starting positions in the parameter space.

  • 'batch' gradient descent: each step uses ALL the training examples.

  • Use caution in cases where there are more features than training examples.
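
For the scikit-learn stochastic variant mentioned at the top of this section, a hedged sketch using SGDRegressor; `eta0` plays the role of $\alpha$ when the learning rate is held constant, and SGDRegressor fits its own intercept, so the leading column of ones is not needed:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Unlike the 'batch' sketch above, stochastic gradient descent updates the
# parameters from one training example at a time.
X_feat = np.array([[1.0], [2.0], [3.0]])   # raw feature values, no column of ones
y_feat = np.array([1.0, 2.0, 3.0])

sgd = SGDRegressor(learning_rate='constant', eta0=0.01, max_iter=1000)
sgd.fit(X_feat, y_feat)
print(sgd.intercept_, sgd.coef_)
```

Note that stochastic gradient descent is particularly sensitive to feature scaling, which is the next topic.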

Feature scaling

sklearn min/max scaler

Idea:

Make sure features are on similar scales. If some features are tens to tens of thousands of times greater in value or range than others, it can dramatically decrease the speed/effectiveness of gradient descent.

Options (see the sketch after this list):
  • divide each feature value by its range, i.e. $x_{new} = x_{orig}/(x_{max}-x_{min})$
  • Mean normalization: $x_{new} = (x_{orig}-\mu)/(x_{max}-x_{min})$
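
A minimal sketch of both options, plus scikit-learn's MinMaxScaler for comparison; the house size / bedroom numbers are illustrative values chosen so the two features differ by orders of magnitude:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_raw = np.array([[2104.0, 3.0],
                  [1600.0, 3.0],
                  [2400.0, 4.0]])   # e.g. house size (sq ft) and number of bedrooms

col_range = X_raw.max(axis=0) - X_raw.min(axis=0)

# Option 1: divide each feature by its range.
X_scaled = X_raw / col_range

# Option 2: mean normalization.
X_mean_norm = (X_raw - X_raw.mean(axis=0)) / col_range

# scikit-learn equivalent: rescales each feature to the [0, 1] interval.
X_minmax = MinMaxScaler().fit_transform(X_raw)
```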

Gradient descent vs Normal equation

Gradient descent

  • need to choose alpha
  • need many iterations
  • works well even when there are many features being fit

Normal Equation

  • no need to choose alpha
  • don't need to iterate
  • need to invert an $n \times n$ matrix, which is $O(n^3)$ -> can be slow when there are many features (over ~10,000); see the sketch after this list
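
A minimal normal equation sketch, $\theta = (X^T X)^{-1} X^T y$, using the pseudo-inverse so it also works when $X^T X$ is singular (e.g. redundant features); `X` and `y` are as in the gradient descent sketch above:

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least-squares solution theta = (X^T X)^-1 X^T y."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

theta_ne = normal_equation(X, y)   # should closely match the gradient descent result
```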

Python implementation on StackOverflow

Example from course site

Classification

Applying standard linear regression with a threshold on its output can lead to poor classifiers -> use logistic regression.

Logistic regression

Use the sigmoid/logistic function, which is bounded between 0 and 1. The hypothesis $h_\theta(x) = 1/(1 + e^{-\theta^T x})$ gives the estimated probability that the label is 1 for each input example.
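
A minimal sketch of the sigmoid and the resulting logistic hypothesis:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), bounded between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    """Estimated probability that the label is 1 for each row of X."""
    return sigmoid(X @ theta)
```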

Non-linear decision boundaries can be found with higher order polynomials for the hypothesis functions.

A modified cost function for 0/1 label/target values (the log/cross-entropy cost) keeps $J(\theta)$ convex and enables fitting with gradient descent. Specific equations for the cost function and its gradient descent update are discussed in the mid-to-late slides of Chapter 6.
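
The cost in question is $J(\theta) = -\frac{1}{m}\sum_{i}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$; a minimal sketch reusing the `sigmoid` helper above:

```python
def logistic_cost(theta, X, y):
    """Cross-entropy cost for logistic regression; labels y must be 0 or 1."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

def logistic_gradient(theta, X, y):
    """Gradient of the cost; same form as linear regression, but with h = g(X theta)."""
    m = len(y)
    return X.T @ (sigmoid(X @ theta) - y) / m
```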

Optimization Algorithms

Advantages:

  • no need to pick alpha
  • often faster than standard gradient descent
  • commonly included as options in high-level fitting toolboxes (one such option is sketched below)
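
One such option in Python is scipy's `minimize`, which (like the `fminunc` used in the course) picks its own step sizes; a sketch applying it to the logistic cost and gradient above, with BFGS as an illustrative method choice and a small made-up binary data set:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative binary classification data (leading column of ones for the intercept).
X_cls = np.array([[1.0, 0.5], [1.0, 2.5], [1.0, 1.5], [1.0, 4.0]])
y_cls = np.array([0.0, 0.0, 1.0, 1.0])

result = minimize(logistic_cost, x0=np.zeros(X_cls.shape[1]),
                  args=(X_cls, y_cls), jac=logistic_gradient, method='BFGS')
theta_cls = result.x   # fitted logistic regression parameters
```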

Multi-class classification

One-vs-all (one-vs-rest): train a separate binary classifier for each class against all the other classes, then predict the class whose classifier gives the highest probability.
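
A hedged scikit-learn sketch of this strategy: OneVsRestClassifier fits one binary logistic regression per class and, at prediction time, picks the class whose classifier reports the highest score (the three-class toy data is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy three-class data with two features per example.
X_multi = np.array([[0.0, 0.2], [0.1, 0.1],
                    [1.0, 1.1], [1.1, 0.9],
                    [2.0, 2.1], [2.1, 1.9]])
y_multi = np.array([0, 0, 1, 1, 2, 2])

ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X_multi, y_multi)
print(ovr.predict(X_multi))   # one label per example
```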

Overfitting (high variance) vs Underfitting (high bias)

  • overfitting: too many features lets the fit match the training set data very closely but fail to generalize to new examples, i.e. the fit has high variance.
  • underfitting: too few features gives too simple a fit that never matches the data well, i.e. the fit is biased by the overly simple model.

Solutions to overfitting:

  • reduce number of features
  • regularize the cost function: add a penalty term with a strong constant weight on the magnitude of the parameter values. Then only important features get values big enough to impact the resulting fit.

Note: be careful not to make the weighting/regularization parameter ($\lambda$) too small or too big. Too small still leads to overfitting; too big leads to underfitting with nearly flat fits.
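
A minimal sketch of the regularized linear regression cost, $J(\theta) = \frac{1}{2m}\left[\sum_i (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j\ge 1}\theta_j^2\right]$, and its gradient; by convention the intercept parameter $\theta_0$ is not penalized, and `lam` stands in for the $\lambda$ above:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus (lam / 2m) * sum(theta_j^2) for j >= 1 (theta_0 unpenalized)."""
    m = len(y)
    residuals = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)
    return (residuals @ residuals + penalty) / (2 * m)

def regularized_gradient(theta, X, y, lam):
    """Gradient of the regularized cost; theta_0 receives no regularization term."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    grad[1:] += (lam / m) * theta[1:]
    return grad
```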

~SKIPPING NEURAL NETWORKS (Chapters 8 & 9) FOR NOW~

Testing performance and debugging ML

Split training data into 'training' and 'cross validation' sets.
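
A minimal split sketch using scikit-learn's `train_test_split`; the 70/30 ratio and the random data are illustrative choices, not course requirements:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data set: 100 examples, intercept column plus one feature, noisy linear targets.
rng = np.random.default_rng(0)
X_all = np.column_stack([np.ones(100), rng.uniform(0, 5, size=100)])
y_all = 2.0 * X_all[:, 1] + 1.0 + rng.normal(scale=0.5, size=100)

# Hold out 30% of the examples as the cross-validation set.
X_train, X_cv, y_train, y_cv = train_test_split(X_all, y_all, test_size=0.3, random_state=0)
```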

Plot error $J$ vs degree of polynomial.

  • both $J_{train}$ and $J_{cv}$ are high and of similar values -> high bias and you need to add more features
  • $J_{cv}$ >> $J_{train}$ -> high variance, reduce the number of features or implement regularization

Plotting regularization parameter ($\lambda$) vs error ($J$)

  • $\lambda$ small, $J_{train}$ low, $J_{cv}$ high -> high variance, increase $\lambda$
  • $\lambda$ high, $J_{train}$ high, $J_{cv}$ high -> high bias, reduce $\lambda$

Learning curves

Plot training set size ($m$, number of examples) vs error ($J$); a sketch follows the list below.

  • $J_{train}$ low, $J_{cv}$ high -> high variance; increase the size of the training set
  • $J_{cv}$ and $J_{train}$ ~ same, and both high -> high bias, don't need more data, need to decrease $\lambda$ or increase number of features.
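
A minimal learning-curve sketch built on the split above: refit (here with the unregularized normal equation) on the first $m$ training examples for increasing $m$, and record the training and cross-validation errors to plot against $m$:

```python
import numpy as np

def learning_curve(X_train, y_train, X_cv, y_cv):
    """J_train and J_cv as a function of the number of training examples m."""
    m_values = list(range(2, len(y_train) + 1))
    J_train, J_cv = [], []
    for m in m_values:
        Xm, ym = X_train[:m], y_train[:m]
        theta = np.linalg.pinv(Xm.T @ Xm) @ Xm.T @ ym       # normal equation fit
        J_train.append(np.mean((Xm @ theta - ym) ** 2) / 2)
        J_cv.append(np.mean((X_cv @ theta - y_cv) ** 2) / 2)
    return m_values, J_train, J_cv

m_values, J_train, J_cv = learning_curve(X_train, y_train, X_cv, y_cv)
# Plot J_train and J_cv against m_values: a persistent gap suggests high variance,
# two high curves that flatten out together suggest high bias.
```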

debugging 'what to do next' summary

  • Get more training examples -> fixes high variance
  • Try smaller sets of features -> fixes high variance
  • Try getting additional features -> fixes high bias
  • Try adding polynomial features -> fixes high bias
  • Try decreasing $\lambda$ -> fixes high bias
  • Try increasing $\lambda$ -> fixes high variance
