Some basic notes on the important points made during each of the lectures for the Coursera Machine Learning class taught by Andrew Ng of Stanford.
The next step is to put these lessons into code in another notebook.
Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
Tom Mitchell (1998) Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Non-continuous label values -> classification problem
Stochastic Gradient Descent is available in sklearn (e.g., `SGDRegressor` / `SGDClassifier`).
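A minimal sketch of fitting a linear model with sklearn's `SGDRegressor` (the data, learning settings, and pipeline here are purely illustrative, not from the course materials):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data: y = 3x + 2 plus noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 1, size=100)

# SGD is sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X)

model = SGDRegressor(max_iter=1000, tol=1e-3)
model.fit(X_scaled, y)
print(model.coef_, model.intercept_)
```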
Plotting J vs. the number of iterations helps visualize whether the learning rate $\alpha$ is too large (J oscillates or diverges) or small enough (J decreases on every iteration). Once a suitable $\alpha$ is chosen, one expects the slope of J to get shallower as each iteration approaches the minimum, so the resulting step sizes decrease even with a fixed $\alpha$.
Must make sure to update each model parameter simultaneously in each iteration, otherwise the direction of the step no longer follows the gradient of the cost function.
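A minimal NumPy sketch of batch gradient descent for linear regression, where the single vectorized assignment to `theta` performs the simultaneous update (function and variable names are illustrative, not course code):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.

    X : (m, n) feature matrix, y : (m,) targets.
    Returns the fitted parameters theta and the cost history.
    """
    m = len(y)
    X = np.column_stack([np.ones(m), X])   # prepend bias (intercept) column
    theta = np.zeros(X.shape[1])
    J_history = []
    for _ in range(num_iters):
        errors = X @ theta - y
        gradient = (X.T @ errors) / m       # uses ALL m examples -> 'batch'
        theta = theta - alpha * gradient    # one vectorized assignment = simultaneous update
        J_history.append((errors @ errors) / (2 * m))
    return theta, J_history
```

Plotting `J_history` against the iteration number gives the J vs. #iters diagnostic described above.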
Caution on the starting position: gradient descent can converge to a local rather than the global minimum, so try many different starting positions in the parameter space.
'Batch' gradient descent: each step uses ALL the training examples.
Caution in cases where there are more features than training examples.
Make sure features are on similar scales. Should any features be 10's to 10,000's of times greater in value or range than others, it can dramatically decrease the speed/effectiveness of gradient descent.
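A minimal sketch of mean normalization / feature scaling (the helper name is illustrative, not course code):

```python
import numpy as np

def normalize_features(X):
    """Scale each column of an (m, n) feature matrix to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0   # guard against constant features
    return (X - mu) / sigma, mu, sigma
```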
Applying standard linear regression to a classification problem can lead to incorrect threshold classifiers -> use logistic regression instead.
Use the sigmoid/logistic function, $g(z) = \frac{1}{1 + e^{-z}}$, whose output lies between 0 and 1. The hypothesis $h_\theta(x) = g(\theta^T x)$ gives the likelihood/probability that the label is 1 for each input example.
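A one-line NumPy sketch of the logistic function:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))
```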
Non-linear decision boundaries can be found by adding higher-order polynomial features to the hypothesis function.
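For example, sklearn's `PolynomialFeatures` can generate the higher-order terms before a logistic regression fit (a sketch only; the degree and pipeline are arbitrary illustrative choices):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# degree=3 is an arbitrary illustrative choice
model = make_pipeline(PolynomialFeatures(degree=3), LogisticRegression())
# model.fit(X, y) then learns a non-linear decision boundary in the original feature space
```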
A modified cost function, designed for label/target values of only 1 or 0, enables fitting with gradient descent or more advanced optimization routines. The specific equations for the cost function and their use in gradient descent are discussed in the mid-to-late slides of Chapter 6.
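For reference, the logistic regression cost function from the lectures (for labels $y \in \{0, 1\}$) is:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

where $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ is the sigmoid hypothesis; its gradient plugs directly into the usual gradient descent update.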
Note: be careful not to make the weighting/regularization parameter ($\lambda$) too small or too large. Too small still leads to overfitting; too large leads to overly flat (under-fit) hypotheses.
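With regularization, $\lambda$ enters as a penalty on the (non-bias) parameters, e.g. for logistic regression:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

so a larger $\lambda$ pushes the $\theta_j$ toward zero (flatter fits), while a $\lambda$ near zero leaves the original, possibly over-fit, hypothesis essentially unchanged.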
Linear or multivariate regression can become nearly impossible when the number of features is very large and the problem is non-linear.
Moving to a neural network is a good way to fit non-linear models with many features.
Each layer is made of neurons that each compute a simple sigmoid-type hypothesis from their inputs. Their outputs are termed "activations" and feed the next layer. This way, a single set of direct inputs from the data can be converted into increasingly complicated non-linear inputs to deeper layers, which can then make complex decisions.
During 'forward propagation' we go from the inputs, through the layers, to the final output, calculating activations along the way.
During 'backward propagation', the errors that each layer's activations contribute to the output error are calculated and used to adjust the neurons' weights, which is how the network is trained.
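A minimal NumPy sketch of forward propagation through one hidden layer (the weights are random here purely for illustration; in practice they are learned via backpropagation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Theta1, Theta2):
    """Compute the activations of a 2-layer network for one input vector x."""
    a1 = np.concatenate(([1.0], x))      # input layer plus bias unit
    a2 = sigmoid(Theta1 @ a1)            # hidden-layer activations
    a2 = np.concatenate(([1.0], a2))     # add bias unit to hidden layer
    a3 = sigmoid(Theta2 @ a2)            # output-layer activation(s)
    return a3

# Illustrative shapes: 3 inputs, 4 hidden units, 1 output
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(4, 4))         # (hidden units, inputs + bias)
Theta2 = rng.normal(size=(1, 5))         # (outputs, hidden units + bias)
print(forward_propagate(np.array([0.5, -1.2, 2.0]), Theta1, Theta2))
```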
Split the available data into 'training' and 'cross-validation' sets.
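A minimal sketch using sklearn's `train_test_split` (the data and the 70/30 split ratio are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data, purely illustrative
X = np.random.default_rng(0).normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 30% of the examples as a cross-validation set
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_cv.shape)
```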