Some basic notes on the important points made during each of the lectures for the Coursera Machine Learning class taught by Andrew Ng of Stanford.
The next step is to put these lessons into code in another notebook.
Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
Tom Mitchell (1998) Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Discrete (non-continuous) label values -> classification problem.
Stochastic Gradient Descent is available in sklearn (e.g. `SGDRegressor` / `SGDClassifier`); see the sketch below.
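The course itself uses Octave/MATLAB rather than scikit-learn, so this is just a minimal sketch of how SGD could be run via sklearn's `SGDRegressor`; the synthetic data and variable names are illustrative, not from the lectures.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic noisy linear data: y ~ 3x + 4
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 4.0 + rng.normal(0, 1, 100)

# SGD is sensitive to feature scale, so standardize the inputs first.
X_scaled = StandardScaler().fit_transform(X)

model = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)
model.fit(X_scaled, y)
print(model.coef_, model.intercept_)
```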
Plotting J vs. the number of iterations helps visualize/determine which case you are in (e.g. whether $\alpha$ is too large, so J increases or oscillates, or small enough that J decreases every iteration). But once $\alpha$ is chosen, one expects the gradient to get shallower the closer each iteration gets to the minimum, so the resulting step sizes decrease even with a fixed $\alpha$.
Must make sure to update every model parameter simultaneously within each iteration; using already-updated parameters when computing the remaining updates skews the direction of the gradient step (see the sketch below).
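A minimal numpy sketch of batch gradient descent for linear regression that updates all parameters simultaneously (via one vectorized assignment) and records J each iteration so it can be plotted against the iteration count. Function names and the toy data are my own, not from the course code.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Squared-error cost J(theta) = (1/2m) * sum((X @ theta - y)^2)."""
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent: every step uses all m training examples."""
    m = len(y)
    J_history = []
    for _ in range(num_iters):
        gradient = (X.T @ (X @ theta - y)) / m
        theta = theta - alpha * gradient   # all parameters updated at once (simultaneously)
        J_history.append(compute_cost(X, y, theta))
    return theta, J_history

# Toy data: y ~ 2 + 3x, with a column of ones for the intercept term.
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x + np.random.default_rng(1).normal(0, 0.1, 50)

theta, J_history = gradient_descent(X, y, np.zeros(2), alpha=0.5, num_iters=500)
print(theta)   # should approach [2, 3]
# Plotting J_history vs. iteration number should show J decreasing and flattening out.
```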
Caution about the starting position: gradient descent can settle into a local rather than the global minimum, so one may need to try many different starting positions in the parameter space.
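(Linear regression's squared-error cost is convex, so it has a single global minimum; the caution applies to non-convex costs.) A hedged sketch of the multiple-random-starts idea on a toy one-dimensional non-convex function of my own choosing:

```python
import numpy as np

def f(w):
    """Toy non-convex 'cost': local minimum near w ~ 2, global minimum near w ~ -2.4."""
    return 0.05 * w**4 - 0.5 * w**2 + 0.3 * w

def df(w):
    """Derivative of f, used for plain gradient descent."""
    return 0.2 * w**3 - 1.0 * w + 0.3

best_w, best_cost = None, np.inf
rng = np.random.default_rng(0)
for start in rng.uniform(-5, 5, size=10):   # several random starting positions
    w = start
    for _ in range(1000):
        w -= 0.01 * df(w)                   # gradient descent from this start
    if f(w) < best_cost:                    # keep the lowest-cost result
        best_w, best_cost = w, f(w)
print(best_w, best_cost)
```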
'Batch' gradient descent: each step uses ALL the training examples.
Caution in cases where there are more features than training examples: for the normal equation, $X^TX$ is then non-invertible, so use the pseudo-inverse or regularization (and expect overfitting risk).
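For reference, a small sketch of the normal equation $\theta = (X^TX)^{-1}X^Ty$ on the same toy data as above; using `np.linalg.pinv` (the pseudo-inverse) keeps it usable even when $X^TX$ is singular, e.g. with more features than examples or redundant features.

```python
import numpy as np

# Toy noiseless data with an intercept column: y = 2 + 3x.
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x

# Normal equation; pinv still gives a solution when X.T @ X is non-invertible.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)   # ~[2, 3]
```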
Make sure features are on similar scales. Should any feature be tens to thousands of times greater in value or range than the others, it can dramatically decrease the speed/effectiveness of gradient descent.
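A short sketch of feature normalization (subtract each column's mean and divide by its standard deviation, so all features end up on comparable scales); the example values are illustrative.

```python
import numpy as np

def feature_normalize(X):
    """Mean-normalize each column: subtract its mean and divide by its standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Example: house size in square feet (thousands) vs. number of bedrooms (single digits).
X = np.array([[2104, 3],
              [1600, 3],
              [2400, 3],
              [1416, 2],
              [3000, 4]], dtype=float)

X_norm, mu, sigma = feature_normalize(X)
print(X_norm.round(2))   # both columns now have mean 0 and comparable spread
```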
Applying standard linear regression to classification can lead to incorrect threshold classifiers -> use logistic regression.
Use a sigmoid/logistic function, whose output always lies between 0 and 1: $h_\theta(x) = g(\theta^Tx) = \frac{1}{1 + e^{-\theta^Tx}}$. It gives the likelihood/probability that the label is 1 for each input example.
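A tiny numpy sketch of the sigmoid and the resulting logistic hypothesis; the function names are my own.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^-z); output is always between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    """Hypothesis h_theta(x) = g(theta^T x): estimated probability that y = 1."""
    return sigmoid(X @ theta)

# Quick check: large negative inputs -> ~0, zero -> 0.5, large positive -> ~1
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```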
Non-linear decision boundaries can be found by including higher-order polynomial terms of the features in the hypothesis function (see the sketch below).
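One way to see this in practice, sketched with scikit-learn (which the course itself does not use): expanding the inputs with degree-2 polynomial terms lets a plain logistic regression learn a circular decision boundary. The data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: label is 1 inside the unit circle, 0 outside.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 1).astype(int)

# Degree-2 terms (x1^2, x1*x2, x2^2, ...) make the circular boundary linear in the new features.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
clf = LogisticRegression().fit(X_poly, y)
print(clf.score(X_poly, y))   # close to 1.0 on this toy problem
```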
A modified cost function that handles label/target values of only 1 or 0 enables fitting with gradient descent (or more advanced optimizers such as conjugate gradient, BFGS, and L-BFGS; unlike linear regression there is no closed-form normal equation). The specific equations for the cost function and its use in gradient descent are discussed in the mid-to-late slides of Chapter 6.
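A minimal numpy sketch of that cost, $J(\theta) = -\frac{1}{m}\sum\left[y\log h_\theta(x) + (1-y)\log(1-h_\theta(x))\right]$, and its gradient, used in plain batch gradient descent on a toy 1-D classification problem (data and names are illustrative).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """Cross-entropy cost: J = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

def logistic_gradient(theta, X, y):
    """Gradient has the same form as linear regression's: (1/m) * X^T (h - y)."""
    m = len(y)
    return X.T @ (sigmoid(X @ theta) - y) / m

# Toy 1-D problem: label flips from 0 to 1 around x = 0; first column is the intercept term.
x = np.linspace(-3, 3, 40)
X = np.column_stack([np.ones_like(x), x])
y = (x > 0).astype(float)

theta = np.zeros(2)
for _ in range(5000):                        # plain batch gradient descent
    theta -= 0.1 * logistic_gradient(theta, X, y)
print(theta, logistic_cost(theta, X, y))
```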
Note: caution on making the weighting/regularization parameter ($\lambda$) too small or too big. Too small still leads to overfitting; too big drives the parameters toward zero and leads to flat, underfit models.
Split the training data into 'training' and 'cross validation' sets (used, e.g., to choose $\lambda$, as sketched below).
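A hedged sketch of how the split can be used: fit a regularized model for several $\lambda$ values on the training portion and keep the $\lambda$ with the lowest error on the held-out cross-validation portion. scikit-learn's `Ridge` (L2-penalized linear regression, where `alpha` plays the role of $\lambda$) is used here for brevity; the course implements this by hand, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic noisy data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0]**2 + X[:, 0] + rng.normal(0, 1, 200)

# Hold out part of the data as a cross-validation set.
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=0)

best_lmbda, best_err = None, np.inf
for lmbda in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=lmbda).fit(X_train, y_train)        # fit on the training split only
    err = mean_squared_error(y_cv, model.predict(X_cv))     # evaluate on the CV split
    if err < best_err:
        best_lmbda, best_err = lmbda, err
print(best_lmbda, best_err)   # the lambda with the lowest cross-validation error
```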