Types of ML systems

  • trained w/ or w/o human supervision (supervised, unsupervised, semi-supervised, reinforcement learning)
  • can or cannot learn incrementally on the fly (online v batch learning)
  • comparing new data points to known ones, or detecting patterns in the training data and building a predictive model (instance-based v model-based learning)

Supervised/Unsupervised learning

Supervised

  • labels: solutions fed to algo along with training data
  • classification: algo is trained w/ many examples with their class
  • regression: predict a target numeric value given a set of features called predictors
  • attribute: a data type (eg, "mileage")
  • feature: an attribute plus its value (eg, "mileage = 15,000")
  • logistic regression can be used for classification, as it outputs a value that corresponds to the probability of belonging to a given class

  • most important supervised learning algos

    • k-nearest neighbors
    • linear regression
    • logistic regression
    • support vector machines (SVMs)
    • decision trees and random forests
    • neural networks
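
A minimal sketch tying the ideas above together (training data + labels, classification, class probabilities), assuming scikit-learn and its bundled iris dataset purely for illustration:

    # Supervised classification sketch: features + labels in, class predictions out.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)         # X = features, y = class labels (the "solutions")
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)                             # supervised: trained on examples w/ their class
    print(clf.predict(X[:3]))                 # predicted classes
    print(clf.predict_proba(X[:3]))           # probability of belonging to each class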

Unsupervised learning

  • clustering
    • k-means
    • hierarchical cluster analysis (HCA)
    • expectation maximization
  • visualization and dimensionality reduction
    • principal component analysis (PCA)
    • kernel PCA
    • locally-linear embedding (LLE)
    • t-distributed stochastic neighbor embedding (t-SNE)
  • association rule learning
    • apriori
    • eclat
  • dimensionality reduction: simplify data w/o losing too much info
    • example: a car's mileage highly correlated w/ its age, so merge them into one feature
  • feature extraction: building informative and non-redundant features from an initial set of measured data
  • anomaly detection: identifying outliers
  • association rule learning: discover interesting relations between attributes
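
A minimal unsupervised sketch (clustering + dimensionality reduction), assuming scikit-learn; the data is random and purely illustrative:

    # Unsupervised sketch: no labels, the algo finds structure on its own.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    X = np.random.rand(200, 5)                      # 200 unlabeled instances, 5 features

    kmeans = KMeans(n_clusters=3, n_init=10).fit(X)
    print(kmeans.labels_[:10])                      # cluster assignment per instance

    pca = PCA(n_components=2)                       # dimensionality reduction: 5 features -> 2
    X_2d = pca.fit_transform(X)
    print(pca.explained_variance_ratio_)            # how much info each new feature retains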

Semi-supervised learning

  • algos taking partially labeled data, usu. lots of unlabeled data and a little labeled
    • ex: step 1 unsupervised clustering of faces from an unlabeled set of photos; step 2 human assigns a name label per face
    • most are combinations of supervised and unsupervised algos
  • deep belief networks (DBNs) are based on unsupervised components called restricted Boltzmann machines (RBMs) stacked on top of one another
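
A rough sketch of the two-step photo example, assuming scikit-learn; the digits dataset stands in for the photos and the known labels stand in for the human labeling step:

    # Semi-supervised sketch: unsupervised clustering first, then a handful of
    # human-provided labels are propagated to every instance in each cluster.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_digits

    X, y = load_digits(return_X_y=True)
    k = 10
    kmeans = KMeans(n_clusters=k, n_init=10).fit(X)

    rep_idx = np.argmin(kmeans.transform(X), axis=0)   # instance closest to each cluster center
    cluster_labels = y[rep_idx]                        # step 2: a "human" labels only these k instances
    y_propagated = cluster_labels[kmeans.labels_]      # every other instance inherits its cluster's label
    print((y_propagated == y).mean())                  # rough quality of the propagated labels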

Reinforcement learning

  • very different
  • learning system (called an agent) can observe an environment, select and perform actions; gets rewards or penalties for its choices
  • learns a policy to get the most reward over time; a policy defines what action the agent should take in a given situation
  • ex: robots learning to walk and AlphaGo (just applying the policy it had learned)
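
A toy sketch of the agent / action / reward / policy loop, using a stateless multi-armed bandit as the environment; the reward probabilities are made up:

    # Reinforcement learning sketch: the agent learns which action pays off most
    # purely from rewards, with no labeled examples.
    import random

    reward_prob = [0.2, 0.5, 0.8]       # hidden environment: chance of a reward per action
    value = [0.0, 0.0, 0.0]             # agent's running estimate of each action's value
    counts = [0, 0, 0]

    for step in range(1000):
        if random.random() < 0.1:                   # occasionally explore
            action = random.randrange(3)
        else:                                       # otherwise exploit the current policy
            action = value.index(max(value))
        reward = 1.0 if random.random() < reward_prob[action] else 0.0
        counts[action] += 1
        value[action] += (reward - value[action]) / counts[action]   # incremental average

    print(value)    # the learned policy: always pick the highest-value action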

Batch and online learning

  • whether or not the system can learn incrementally from a stream of incoming data

Batch learning

  • the system is trained on all available data at once; typically takes a lot of time and resources
  • also called offline learning
  • then launched, no more learning

Online learning

  • system trained incrementally, either individually or mini-batches
  • good for systems that receive a continuous flow of data and need to react rapidly (eg, stock prices); also for limited computing resources
  • can also be used for out-of-core learning (training on datasets too big to fit in memory)
  • learning rate: how fast to adapt to changing data (and thus forget old data)
    • high learning rate means adapt quickly but forget old data
    • low means more inertia, less sensitive to noise in new data or non-representative data points
  • danger of bad data input degrading system; may want to monitor input and react to abnormal data
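
A minimal online-learning sketch, assuming scikit-learn's SGDRegressor (which supports incremental updates via partial_fit); the incoming mini-batches are simulated:

    # Online learning sketch: the model is updated one mini-batch at a time,
    # so it can follow a continuous data stream or train out-of-core.
    import numpy as np
    from sklearn.linear_model import SGDRegressor

    model = SGDRegressor(learning_rate="constant", eta0=0.01)   # eta0 ~ the learning rate

    rng = np.random.default_rng(42)
    for _ in range(100):                             # simulate 100 incoming mini-batches
        X_batch = rng.random((32, 3))
        y_batch = X_batch @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 32)
        model.partial_fit(X_batch, y_batch)          # incremental update; old batches are not revisited

    print(model.coef_)    # should drift toward [2, -1, 0.5]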

Instance based v Model based learning

Distinction based on how ML systems generalize (how they perform on instances never seen before)

Instance based learning

  • the system learns the examples by heart, then generalizes to new cases by comparing them to known examples using a measure of similarity
  • measure of similarity: ex: count of words in common between two docs (spam and unknown)
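
A tiny sketch of that word-overlap similarity measure; the documents are made up:

    # Instance-based sketch: classify a new doc by comparing it to stored examples
    # with a similarity measure (here, the count of words in common).
    def similarity(doc_a: str, doc_b: str) -> int:
        return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))

    known_spam = "win money now click here"
    known_ham = "meeting notes for the project tomorrow"
    new_doc = "click here to win a prize"

    print(similarity(new_doc, known_spam))   # 3 words in common -> most similar to spam
    print(similarity(new_doc, known_ham))    # 0 words in common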

Model based learning

  • generalize from a set of examples, then make predictions
  • How to know which model performs best
    • utility function (or fitness function) that measures how good your model is
    • cost function measures how bad it is
    • for linear regression, usu. use a cost function to measure distance between the model's predictions and the training examples
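
A minimal model-based sketch, assuming scikit-learn; the cost here is the mean squared error between predictions and training targets:

    # Model-based sketch: fit a linear model, then measure how bad it is with a cost function.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    X = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = np.array([1.1, 1.9, 3.2, 3.9])

    model = LinearRegression().fit(X, y)              # picks the parameters that minimize the cost
    cost = mean_squared_error(y, model.predict(X))    # distance between predictions and examples
    print(model.coef_, model.intercept_, cost)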

Main challenges of Machine Learning

Insufficient data

On effectiveness of data:

  • Banko and Brill (2001): very diff algos perform similarly w/ enough data
  • Peter Norvig et al (2009). "The Unreasonable Effectiveness of Data". Data is more important than algo for complex problems; thus invest in corpus.

Non-representative training data

  • if the training data is too small, you get sampling noise
  • for large sets: if sampling method is flawed, we get sampling bias

Poor data quality

  • rm or fix outliers
  • if some instances are missing a few features, must decide whether to drop the instance, fill in w/ the average, ignore the attribute for all instances, or train 2 models, one w/ and one w/o the feature
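
A small sketch of two of those options (drop the instance vs fill in w/ the average), assuming pandas; the columns are made up:

    # Handling missing feature values: drop the instance or impute the column mean.
    import pandas as pd

    df = pd.DataFrame({"mileage": [15000, None, 48000, 62000],
                       "age":     [1, 2, None, 5]})

    dropped = df.dropna()              # option 1: drop instances w/ missing values
    imputed = df.fillna(df.mean())     # option 2: fill in w/ the column average
    print(dropped)
    print(imputed)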

Irrelevant features

  • feature engineering: deciding upon good features
    • feature selection: selecting more useful features
    • feature extraction: combining features to produce a more useful one (dimensionality reduction helps)
    • creating new features
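
A small feature-engineering sketch, assuming pandas; the car columns echo the mileage/age example above and are invented:

    # Feature extraction: combine two correlated attributes into one more useful feature.
    import pandas as pd

    cars = pd.DataFrame({"mileage":   [15000, 40000, 90000],
                         "age_years": [1, 3, 8],
                         "price":     [22000, 17000, 9000]})

    cars["wear"] = cars["mileage"] / 10000 + cars["age_years"]   # one crude combined feature
    X = cars[["wear"]]                                           # feature selection: keep only the useful one(s)
    y = cars["price"]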

Overfitting the training data

  • overfitting: model performs well on training data but does not generalize
    • happens when model is too complex relative to amount and/or noisiness of training data
  • solutions:
    • simplify the model by selecting a simpler algo or one w/ fewer parameters, reducing the number of attributes in the training data, or constraining the model
    • get more training data
    • reduce noise in training data (fix errors, rm outliers)
  • regularization: constraining the model to make it simpler, thus reducing the risk of overfitting
  • degrees of freedom: the number of free parameters in a model. Ex: z = Ax + y has two degrees of freedom (A and y); however if A is forced to stay within a small range, the model effectively has between 1 and 2 degrees of freedom
  • hyperparameter: controls the amount of regularization; a parameter of the learning algo, not the model
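
A minimal regularization sketch, assuming scikit-learn's Ridge; alpha is the hyperparameter that controls how strongly the model is constrained:

    # Regularization sketch: Ridge shrinks the linear model's weights toward zero.
    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    X = rng.random((20, 5))
    y = X @ np.array([3.0, 0.0, 0.0, 0.0, 0.0]) + rng.normal(0, 0.5, 20)   # only feature 0 matters

    plain = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=10.0).fit(X, y)      # larger alpha = more regularization

    print(plain.coef_)    # free to chase the noise in the irrelevant features
    print(ridge.coef_)    # weights pulled toward zero: a simpler, more constrained model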

Underfitting the training data

  • underfitting: when the model is too simple to learn the underlying structure of the training data
  • to fix underfitting:
    • use more powerful model with more parameters
    • feature engineering: feed better features into the model
    • reduce constraints on the model (eg reducing regularization hyperparameter)
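
A small sketch of the "more powerful model" fix, assuming scikit-learn's PolynomialFeatures:

    # Underfitting sketch: a straight line can't fit a quadratic target;
    # adding polynomial features gives the model enough parameters.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    X = np.linspace(-3, 3, 50).reshape(-1, 1)
    y = X.ravel() ** 2

    simple = LinearRegression().fit(X, y)
    richer = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

    print(simple.score(X, y))    # near 0: underfits even the training data
    print(richer.score(X, y))    # ~1.0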

Testing and validation

  • training set: what you train the model on
  • test set: what you evaluate the model on
  • generalization error (or out-of-sample error): estimate of the model's error rate on unseen data
  • validation set: a second holdout set in addition to the test set, used to compare models and tune hyperparameters
  • cross-validation: divide the data into complementary train/validation sets, making multiple passes with different splits; average the errors; train the final model on all the data
  • no free lunch theorem: Wolpert (1996): there is no model guaranteed a priori to work better than another; must try them all
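
A minimal sketch of the train/test split plus cross-validation, assuming scikit-learn:

    # Testing sketch: hold out a test set, use cross-validation on the rest to
    # pick/tune a model, then estimate the generalization error on the test set.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LogisticRegression(max_iter=1000)
    scores = cross_val_score(model, X_train, y_train, cv=5)   # 5 different train/validation splits
    print(scores.mean())                                      # averaged validation score

    model.fit(X_train, y_train)               # final model trained on all the training data
    print(model.score(X_test, y_test))        # estimate of the generalization error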