Machine Learning

The intent of this notebook is to document different types of machine learning approaches and algorithms. At this point it's a pile of everything, I expect to break this up into more modular pieces as more and more content is added.

Neural networks are already broken out into nn/neural_networks.ipynb

ML vs. Traditional Programming:

The rules are often expressed as weights/numbers (e.g. the weight matrix in a neural network).

Source: https://www.bouvet.no/bouvet-deler/6-tips-for-getting-started-with-machine-learning

The two images below give a good idea of the different subfields of ML:

Source: https://www.mathworks.com/discovery/machine-learning.html

Note: In many similar hierarchy diagrams, reinforcement learning is another category that exists next to supervised and unsupervised learning.

Supervised learning

Wikipedia has an excellent definition:

Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way.

Classification

Support Vector Machines (SVMs)

SVMs are a set of learning algorithms used for classification. In a nutshell, an SVM tries to find a hyperplane that maximizes the distance between clusters of data.

Visually:

The points/vectors that define where the hyperplane fits are called the support vectors (they're "supporting" the maximum distance between the clusters and the plane).

Computing an SVM comes down to solving an optimization problem: finding the hyperplane that maximizes the margin, i.e. the distance between the hyperplane and the closest points (the support vectors) of each cluster. There are multiple algorithms for solving this.

Note that a tricky part of defining an SVM is often finding the right feature space. What might be hard to separate in one dimension might become easy in a higher dimension that represents the right features:

SVMs in Scikit-learn

In Scikit-learn, the simplest SVM that corresponds to the pictures above is LinearSVC. The algorithm takes a C value, which controls how much margin violation (error) the SVM will allow. The images below visualize the difference between C values (from Hands-On Machine Learning with Scikit-Learn and TensorFlow).

The question is which C value better captures the true relationship in the data. A C value that is too high might cause overfitting: the model becomes too closely tailored to the training data and doesn't generalize well.
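As a rough sketch (not from the book; using scikit-learn's bundled Iris dataset just for illustration), training a LinearSVC with a given C value looks like this:

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
# C is the regularization hyperparameter: smaller C -> wider margin but more margin violations
svm_clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
svm_clf.fit(X, y)
print(svm_clf.predict(X[:5]))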

Detecting multiple classes using SVMs

SVMs are binary classifiers: they can inherently only separate 2 classes (the found function divides the feature space in 2). Still, SVMs are often used to detect multiple classes by training an SVM for each of the target classes and then comparing fit or loss scores. For example, when solving the classic MNIST digit classification problem using an SVM, as done in sk-learn-intro.ipynb, we train 10 SVMs, one for each digit. The SVM for digit 5 can then recognize whether a digit is either 5 or not 5, with a certain associated accuracy. When classifying a random digit, we simply run all 10 SVMs, compare their scores, and pick the one with the highest score.

Note that scikit-learn takes care of running all 10 SVMs and selecting the class with the highest score automatically for us.

This strategy of training x SVMs for x target classes, where each SVM can only distinguish between its class and everything else, is called the one-versus-all (OvA) strategy. This is the default in scikit-learn. An alternative strategy, one-versus-one (OvO) - OneVsOneClassifier - creates an SVM for every pair of classes (e.g. an SVM that distinguishes 5 from 6). A downside of this approach is that you need to create SVMs for all possible pairs. For MNIST this means choosing 2 out of 10:

$$C(10,2) = \frac{10!}{2!\,(10-2)!} = 45$$
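As an illustrative sketch (using scikit-learn's bundled digits dataset as a stand-in for MNIST), the two strategies can be made explicit with the OneVsRestClassifier and OneVsOneClassifier wrappers:

from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)           # 10 classes (digits 0-9)

ova = OneVsRestClassifier(LinearSVC())        # trains 10 binary SVMs, one per digit
ovo = OneVsOneClassifier(LinearSVC())         # trains C(10,2) = 45 pairwise SVMs

ova.fit(X, y); ovo.fit(X, y)
print(len(ova.estimators_), len(ovo.estimators_))  # 10 vs 45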

Decision Trees

Scikit-Learn uses the CART algorithm, which produces only binary trees: nonleaf nodes always have two children (i.e., questions only have yes/no answers). There are other algorithms that produce trees where a node can have 3 or more children.

From Hands-On Machine Learning with Scikit-Learn and TensorFlow, Chapter 6: The idea behind the CART algorithm is quite simple: the algorithm first splits the training set into two subsets using a single feature $k$ and a threshold $t_k$ (e.g., "petal length <= 2.45 cm"). How does it choose $k$ and $t_k$? It searches for the pair $(k, t_k)$ that produces the purest subsets (weighted by their size). Purity is defined by the Gini measure. Once it has successfully split the training set in two, it splits the subsets using the same logic, then the sub-subsets, and so on, recursively. It stops recursing once it reaches the maximum depth (defined by the max_depth hyperparameter), or if it cannot find a split that will reduce impurity.
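A minimal sketch of this in scikit-learn (assuming the Iris dataset; the max_depth value is arbitrary):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree_clf = DecisionTreeClassifier(max_depth=2)   # stop recursing after 2 levels
tree_clf.fit(iris.data, iris.target)
# export_text prints the binary yes/no splits CART found (feature k, threshold t_k)
print(export_text(tree_clf, feature_names=iris.feature_names))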

Regression

The goal of regression is to predict numbers (i.e. continuous values); note that this is different from classification, where we predict class membership.

One exception is logistic regression where we use a regression technique to do classification.
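As a small sketch (the data below is made up), a regression model predicts a number rather than a class label:

import numpy as np
from sklearn.linear_model import LinearRegression

# Predict a continuous value (e.g. a price) from a single feature
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])

reg = LinearRegression().fit(X, y)
print(reg.predict([[5.0]]))   # outputs a number, not a class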

Unsupervised Learning

With unsupervised learning, there are no output values or labels provided; the algorithm finds structure in the data by itself. A good example of unsupervised learning is finding clusters in a dataset: you often don't need to specify how many clusters there are, what they are named, etc. An unsupervised clustering algorithm will find the clusters in the data without further input. A practical example is grouping photos of the same person on e.g. Google Photos or Facebook.

Example learning algorithms:

  • Clustering:
    • k-means
    • Hierarchical Cluster Analysis (HCA)
    • Expectation Maximization
  • Association rule learning:
    • Apriori
    • Eclat

Clustering

k-means

Most of this is from Wikipedia.

k-means clustering aims to partition n observations into k clusters.

Given a set of observations $(x_1, x_2, …, x_n)$, where each observation is a $d$-dimensional real vector, k-means clustering aims to partition the $n$ observations into $k (≤ n)$ sets $S = {S_1, S_2, …, S_k}$ so as to minimize the within-cluster sum of squares (WCSS) (i.e. variance). Formally, the objective is to find:

$${\displaystyle {\underset {\mathbf {S} }{\operatorname {arg\,min} }}\sum _{i=1}^{k}\sum _{\mathbf {x} \in S_{i}}\left\|\mathbf {x} -{\boldsymbol {\mu }}_{i}\right\|^{2}={\underset {\mathbf {S} }{\operatorname {arg\,min} }}\sum _{i=1}^{k}|S_{i}|\operatorname {Var} S_{i}}$$

where $μ_i$ is the mean of points in $S_i$.

Algorithms

The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. So these algorithms don't typically find the best (=optimal) global clustering.
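A quick sketch using scikit-learn's KMeans, which implements such a heuristic (restarted n_init times to reduce the risk of a bad local optimum); the data is made up:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, n_init=10)   # note: k has to be chosen up front
kmeans.fit(X)
print(kmeans.labels_)            # cluster assignment per observation
print(kmeans.cluster_centers_)   # the means mu_i that minimize the WCSS above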

Reinforcement Learning

In reinforcement learning, an agent observes the environment, selects and performs actions, and gets rewards (or penalties) in return. It must then learn by itself what the best strategy or policy is.

This approach has had recent successes with e.g. DeepMind's AlphaGo program.

Batch and Online Learning

In batch learning, a system cannot learn incrementally: it needs to have all the data available when it learns. The system is first trained and then released into production. Typically, batch systems are retrained periodically, e.g. every hour, every 24 hours, or every week. This is also called offline learning.

In online learning, the learning happens simultaneously with outputting values. This is often useful when there is a continuous flow of data (e.g. stock prices). An important parameter here is the learning rate: how fast the system adapts to new data.
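As a sketch (the data stream below is simulated), scikit-learn's SGDRegressor can be trained this way via partial_fit, updating the model as new batches arrive:

import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(learning_rate="constant", eta0=0.01)   # eta0 = the learning rate

# Simulate a continuous flow of data arriving in small batches
for _ in range(100):
    X_batch = np.random.rand(10, 3)
    y_batch = X_batch @ np.array([1.0, 2.0, 3.0]) + 0.1 * np.random.randn(10)
    model.partial_fit(X_batch, y_batch)   # learn incrementally, no full retraining

print(model.coef_)   # should approach [1, 2, 3]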

Ensemble Learning

Building a model on top of many other models is called Ensemble Learning, and it is often a great way to push ML algorithms even further.

An example is a RandomForestRegressor (scikit-learn). Random Forests work by training many Decision Trees on random subsets of the features, then averaging out their predictions.
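A minimal sketch (assuming scikit-learn's California housing dataset, which is downloaded on first use):

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

X, y = fetch_california_housing(return_X_y=True)
# An ensemble of 100 Decision Trees; each tree is trained on a random subset of the data/features,
# and the forest's prediction is the average of the individual trees' predictions
forest_reg = RandomForestRegressor(n_estimators=100)
forest_reg.fit(X, y)
print(forest_reg.predict(X[:3]))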

Transfer Learning

Transfer learning focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks.

Practically, you can for example take a model from TensorFlow Hub and then apply it to your own problem.
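A hedged sketch using Keras instead of TensorFlow Hub: reuse MobileNetV2 pretrained on ImageNet as a frozen feature extractor and put a new head on top for a hypothetical 10-class problem:

import tensorflow as tf

# Reuse a network pretrained on ImageNet as a frozen feature extractor
base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                         include_top=False, weights="imagenet")
base.trainable = False   # keep the transferred knowledge fixed

# Add a small new head for our own (hypothetical) 10-class problem
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")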

Source: https://medium.com/owkin/transfer-learning-and-the-rise-of-collaborative-artificial-intelligence-41f9e2950657

Model Sources

Hyperparameters

A hyperparameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training.

For example, one optimization algorithm that is often used in ML is Stochastic Gradient Descent (SGD). That algorithm takes an input parameter (the learning_rate, which can be more or less interpreted as "aggressiveness") which is set once at the start and then kept constant throughout the learning phase.

In TensorFlow:

import tensorflow as tf  # TensorFlow 1.x API

x = tf.constant([1.0, 2.0, 3.0, 4.0]); y = tf.constant([0.0, -1.0, -2.0, -3.0])  # toy data
W = tf.Variable([0.3]); b = tf.Variable([-0.3])               # model parameters, learned during training
loss = tf.reduce_sum(tf.square(W * x + b - y))                # loss = function to optimize
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.05)  # learning_rate is a hyperparameter
train = optimizer.minimize(loss)                              # op that updates the weights (W) and bias (b)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(train)                                               # one gradient descent step

This is a simple example with just a single hyperparameter, but real-world deep networks often have many hyperparameters (often dozens or more). The art of finding the right values for those hyperparameters is referred to as hyperparameter search and is one of the trickier parts of ML.

This is especially true because doing an automated hyperparameter search is often intractable. That is, usually the amount of learning time required to train a model with a single set of hyperparameter values is already high. Repeating that computation for all possible hyperparameter combinations is just not feasible, especially when some of the hyperparameters can have continuous values.

However, there are some tools in this area, like Katib.
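A basic (exhaustive) form of hyperparameter search is also available in scikit-learn as GridSearchCV; as a sketch, tuning C and gamma for an SVC on the digits dataset:

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
# Try every combination in the grid; cost grows quickly with more hyperparameters/values
param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)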

