Some basic notes on the important points made during each of the lectures for the Coursera Machine Learning class taught by Andrew Ng of Stanford.
The next step is to put these lessons into code in another notebook.
Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
Tom Mitchell (1998) Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Non-continuous label values -> classification problem
Stochastic Gradient Descent is available in sklearn (e.g., `SGDRegressor` / `SGDClassifier`).
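A minimal sketch of fitting a linear model with sklearn's `SGDRegressor` (the data, learning settings, and pipeline here are purely illustrative, not from the course materials):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data: y = 3x + 2 plus noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 1, size=100)

# SGD is sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X)

model = SGDRegressor(max_iter=1000, tol=1e-3)
model.fit(X_scaled, y)
print(model.coef_, model.intercept_)
```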
Plotting J vs. the number of iterations helps visualize whether the learning rate $\alpha$ is too large (J oscillates or diverges) or small enough (J decreases on every iteration). Once a suitable $\alpha$ is chosen, one expects the slope of J to get shallower as each iteration approaches the minimum, so the resulting step sizes decrease even with a fixed $\alpha$.
Must make sure to update each model parameter simultaneously in each iteration, otherwise the direction of the step no longer follows the gradient of the cost function.
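A minimal NumPy sketch of batch gradient descent for linear regression, where the single vectorized assignment to `theta` performs the simultaneous update (function and variable names are illustrative, not course code):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.

    X : (m, n) feature matrix, y : (m,) targets.
    Returns the fitted parameters theta and the cost history.
    """
    m = len(y)
    X = np.column_stack([np.ones(m), X])   # prepend bias (intercept) column
    theta = np.zeros(X.shape[1])
    J_history = []
    for _ in range(num_iters):
        errors = X @ theta - y
        gradient = (X.T @ errors) / m       # uses ALL m examples -> 'batch'
        theta = theta - alpha * gradient    # one vectorized assignment = simultaneous update
        J_history.append((errors @ errors) / (2 * m))
    return theta, J_history
```

Plotting `J_history` against the iteration number gives the J vs. #iters diagnostic described above.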
Caution on the starting position: gradient descent can converge to a local rather than the global minimum, so try many different starting positions in the parameter space.
'Batch' gradient descent: each step uses ALL the training examples.
Caution in cases where there are more features than training examples.
Make sure features are on similar scales. Should any features be 10's to 10,000's of times greater in value or range than others, it can dramatically decrease the speed/effectiveness of gradient descent.
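A minimal sketch of mean normalization / feature scaling (the helper name is illustrative, not course code):

```python
import numpy as np

def normalize_features(X):
    """Scale each column of an (m, n) feature matrix to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0   # guard against constant features
    return (X - mu) / sigma, mu, sigma
```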
Applying standard linear regression to a classification problem can lead to incorrect threshold classifiers -> use logistic regression instead.
Use the sigmoid/logistic function, $g(z) = \frac{1}{1 + e^{-z}}$, whose output lies between 0 and 1. The hypothesis $h_\theta(x) = g(\theta^T x)$ gives the likelihood/probability that the label is 1 for each input example.
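A one-line NumPy sketch of the logistic function:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))
```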
Non-linear decision boundaries can be found by adding higher-order polynomial features to the hypothesis function.
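For example, sklearn's `PolynomialFeatures` can generate the higher-order terms before a logistic regression fit (a sketch only; the degree and pipeline are arbitrary illustrative choices):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# degree=3 is an arbitrary illustrative choice
model = make_pipeline(PolynomialFeatures(degree=3), LogisticRegression())
# model.fit(X, y) then learns a non-linear decision boundary in the original feature space
```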
A modified cost function, designed for label/target values of only 1 or 0, enables fitting with gradient descent or more advanced optimization routines. The specific equations for the cost function and their use in gradient descent are discussed in the mid-to-late slides of Chapter 6.
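For reference, the logistic regression cost function from the lectures (for labels $y \in \{0, 1\}$) is:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

where $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ is the sigmoid hypothesis; its gradient plugs directly into the usual gradient descent update.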
Note: be careful not to make the weighting/regularization parameter ($\lambda$) too small or too large. Too small still leads to overfitting; too large leads to overly flat (under-fit) hypotheses.
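With regularization, $\lambda$ enters as a penalty on the (non-bias) parameters, e.g. for logistic regression:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

so a larger $\lambda$ pushes the $\theta_j$ toward zero (flatter fits), while a $\lambda$ near zero leaves the original, possibly over-fit, hypothesis essentially unchanged.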
Linear or multivariate regression can become nearly impossible when the number of features is very large and the problem is non-linear.
Moving to a neural network is a good way to fit non-linear models with many features.
Each layer is made of neurons that each compute a simple sigmoid-type hypothesis from their inputs. Their outputs are termed "activations" and feed the next layer. This way, a single set of direct inputs from the data can be converted into increasingly complicated non-linear inputs to deeper layers, which can then make complex decisions.
During 'forward propagation' we go from the inputs, through the layers, to the final output, calculating activations along the way.
During 'backward propagation', the errors that each layer's activations contribute to the output error are calculated and used to adjust the neurons' weights, which is how the network is trained.
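A minimal NumPy sketch of forward propagation through one hidden layer (the weights are random here purely for illustration; in practice they are learned via backpropagation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Theta1, Theta2):
    """Compute the activations of a 2-layer network for one input vector x."""
    a1 = np.concatenate(([1.0], x))      # input layer plus bias unit
    a2 = sigmoid(Theta1 @ a1)            # hidden-layer activations
    a2 = np.concatenate(([1.0], a2))     # add bias unit to hidden layer
    a3 = sigmoid(Theta2 @ a2)            # output-layer activation(s)
    return a3

# Illustrative shapes: 3 inputs, 4 hidden units, 1 output
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(4, 4))         # (hidden units, inputs + bias)
Theta2 = rng.normal(size=(1, 5))         # (outputs, hidden units + bias)
print(forward_propagate(np.array([0.5, -1.2, 2.0]), Theta1, Theta2))
```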
Split the available data into 'training' and 'cross-validation' sets.
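A minimal sketch using sklearn's `train_test_split` (the data and the 70/30 split ratio are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data, purely illustrative
X = np.random.default_rng(0).normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 30% of the examples as a cross-validation set
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_cv.shape)
```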