History of Machine Learning

The field of machine learning has its roots in Artificial intelligence (AI), which started in 1950 with the seminal paper Computing Machinery and Intelligence. In this paper, Alan Turing posed a key question: "Can machine think?" and also introduced the popular Turing test:

A computer program is said to be intelligent if it could carry on a conversation that was indistinguishable from a conversation with a human being.

In [27]:
from IPython.display import YouTubeVideo


The first generation of AI researchers were John McCarthy (who invented Lisp and established Stanford AI Laboratory), Marvin Minsky and Frank Rosenblatt (who invented perceptron algorithm) and their work consisted of building theorem provers, natural language processing, logic programming, etc.

Knowledge base systems

The next generation of AI research was on developing expert systems where the focus was building a system that emulates the decision-making ability of a human expert. The expert system is an example of a broad class of system, commonly referred to as Knowledge base system. The Knowledge base system consists of two primary components:

  1. Knowledge base Or fact base: A database that stores collection of facts or assertions (and may be even entities and relationships between entities).

  2. Inference engine: applies set of logical rules on the facts available in the knowledge base to deduce new facts.

Here are some popular knowledge base systems:

  1. In 1970, Shortliffe and others at Stanford developed an expert system for medical diagnosis called Mycin. Mycin would ask the physician a series of "yes or no" questions and would return list of (diagnosis (i.e. culprit bacteria), probability, reasoning behind the diagnosis, recommended drug treatment). To specify the "reasoning behind the diagnosis", Mycin would return list of questions/answers along with the set of rules that it used to come to the diagnosis.

  2. Cyc developed by Douglas Lenat in 1984.

  3. Sub-components of IBM Watson

In [1]:
from IPython.display import YouTubeVideo


Neural networks returns

With the invention of back-propogation algorithm, research on neural networks (which was almost stagnant due to criticism by Marvin Minsky on Rosenblatt's perceptron) revived during 1980s.

VC Theory

From 1980 to early 1990s, Vladimir Vapnik introduced Vapnik-Chervonenkis theory (VC theory), that attempts to explain the learning process from a statistical point of view. The key question of this new field machine learning that was based on Vapnik's framework was no longer Turing's grand question, but much more simple and realistic one: "(Given enough examples) Can machine learn something about them ?":

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E ... by Tom Mitchell.

Here experience E refers to data (or examples), performance measure P usually refers to accuracy (or some other metrics) and the task T refers to machine learning algorithms.

In [28]:
from IPython.display import YouTubeVideo
YouTubeVideo('jmMcJ4XlrWM', start=195, end=234)


The success of VC theory for number of problems like handwriting recognition, using support vector machines (SVM), generated a lot of interests in machine learning community.

Bayesian Theory

There is equally interesting parallel timeline in the field of (Bayesian) Statistical Inference. This field is addressed by various names: Probability theory (before 1838), Inverse Probability (from 1838 to 1950) and Bayesian Analysis (after 1950). The concepts of probability were formalized by Pascal and Fermat in 17th century and later made much more rigorous by Kolmogorov in 1930 using measure theory. In 1761, Thomas Bayes proved following theorem (now known as Bayes theorem):

$$p(H|E) = \dfrac{p(E|H)p(H)}{p(E)}$$

This simple equation has extremely important implication. It says that if we believe the hypothesis H to be true with probability P(H) (also called as prior) and if we are given additional evidence E from the experiment, we can then update our belief about H to P(H | E) using the above equation. Wow ... doesn’t this essentially summarize how a scientist should think ?

Based on what prior was used, the research in this field can be classified into four distinct eras. The first era (Objective Bayesians) started with Thomas Bayes and Laplace, who liked to use uniform priors for most of the problems1. This continued till early 19th century2 where intensive efforts were made to circumvent the priors. Important researchers during this era (also called as frequentists) were Fisher (likelihood methods, fiducial inference), Neyman (founded frequentism) and Wald (Decision theory) and the key question they were interested in was how the results would change if you ran a procedure over and over again, with data changing each time. This is considered to be golden era for statistical inference as lot of discoveries were made in extremely short time. Around 1950s, during the third era (Subjective Bayesians), the researchers like Savage and de Finetti proposed that one should sit down with domain expert to find appropriate priors for any problem. In the fourth era, researchers like Harold Jeffreys (who was famous for his critique about p-value: ‘An hypothesis that may be true, may be rejected because it has not predicted observable results that have not occurred’) revived the field of objective bayes. The key idea of objective bayesian methodology is to use frequentist analytic tools to guide their choice of priors.

Both Objective bayesian methodology and Vapnik’s framework are widely used in the field of machine learning.

Deep Learning


Here is a Venn diagram from the deep learning book, describing the relationship between the sub-fields of AI:

When to use machine learning

Machine Learning approach is suitable if the problem satisfies following three criteria:

  1. There exists sufficiently large data Dn = { (Xi, Yi) }i = 1 .. n.
  2. There exists a pattern in the data that you intend to learn. That is, there exist a target function (or distribution) f: X -> Y which maps input feature vector to output label.
  3. We cannot pin that pattern down mathematically or algorithmically using set of rules (i.e. f is unknown).

Other hints that you can use to determine whether machine learning is suitable for your problem:

  1. Don’t know how to calculate the output from the input (eg: medical diagnosis, bioinformatics, biomedical imagery/computer vision).
  2. Requirements change rapidly (eg: spam filtering, search engines).
  3. Environment in which the system operates change rapidly (eg: stock market, video games, robotics).
  4. There exists tremendous individual variability (eg: recommendations, speech/handwriting recognition).

Stages of Machine Learning

Step 1: Data collection

In this step, we capture the information of real world object and store it in the computer either as text, image, audio file, video file or in one of common datastores (i.e. database, knowledge base, etc).

Let's use a simple example of recognizing digits. First, we ask set of volunteers to write numbers from 0 to 9 on piece of black paper and then scan it into a jpeg file.

In [29]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn import datasets
digits = datasets.load_digits()
images_and_labels = list(zip(digits.images,
for index, (image, label) in enumerate(images_and_labels[:10]):
    plt.subplot(2, 10, index + 1)

Step 2: Feature representation

Now, we convert the binary object into a format required for machine learning algorithm. This is usually (but not necessarily) a D-dimensional numeric vector (called as feature):

In [30]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn import datasets
digits = datasets.load_digits()
images_and_labels = list(zip(digits.images,
for index, (image, label) in enumerate(images_and_labels[:1]):
    plt.subplot(2, 1, index + 1)
    plt.title('Feature: ' + str(image.flatten()))

This is an extremely crucial step and if done incorrectly can screw up the whole learning process (atleast for most if not all problems). So, most applied machine learning researchers spend lot of time with domain experts understanding, extracting and tuning the features3.

This step is usually accompanied with data cleaning (which is removing noisy data), but we will skip that for now.

Step 3: Dimensionality reduction

For computational feasibility and for machine learning algorithm to work correctly, number of dimensions D should be reasonable (i.e. D <<< infinity). So, during this process, you will also use the following dimensionality reduction techniques:

  1. Feature selection: chose relevant features for the model
  2. Feature extraction: transform high-dimensional data to lower-dimensions

Both the methods helps to alleviate the curse of dimensionality, improve performance of the model and speed up the learning process. The term "curse of dimensionality" (coined by Richard Bellman) refers to the fact that some problems become intractable as the number of variables increases. In Bayesian statistics, the problem of curse of dimensionality occurs while evaluating posterior distribution, which often has many parameters. However, this is usually overcomed by using MCMC methods.

The above two steps is often referred to as feature engineering.

Step 4: Training and Testing

After performing the above two steps for all the objects, we get the input data, which is divided into three parts:

  1. Training data (about 60 to 70% of the input data): The training data is used to learn the parameters of the model (i.e. parameter selection).
  2. Validation data (about 10 to 20% of the input data): The validation is used to learn the hyperparameters (i.e. model selection).
  3. Test data (about 20% of the input data): The test data is used to test whether the model generalizes or not.

Let's ignore validation for now and see how to split the above data into training and testing.

In [31]:
from sklearn import datasets
from sklearn.utils import shuffle
digits = datasets.load_digits()
X_digits =
y_digits =
X_digits, y_digits  = shuffle(X_digits, y_digits, random_state=0)
n_samples = len(X_digits)
X_train = X_digits[:int(.8 * n_samples)]
y_train = y_digits[:int(.8 * n_samples)]
X_test = X_digits[int(.8 * n_samples):]
y_test = y_digits[int(.8 * n_samples):]

General Machine Learning Setup

  1. Perform feature selection and feature extraction to get the input data;
  2. Partition the input data into training, validation and test data;
  3. foreach model do
    • Model Selection: Learn hyperparameters by optimizing model selection criterion on validation data
    • foreach parameter do
      • Model fitting/training: Learn parameters by optimizing training criterion on training data

After above process, the learned model is m and the learned parameters are p. Use test data to verify the accuracy (or some other performance measure) of the model/parameter (m; p).

To test that we are actually learning, we slowly increase the number of training datapoints (i.e. experience) and see if the performance (i.e. score) increases. As an example, we will use a simple machine learning model (i.e. logistic regression) and plot the performance vs number of training datapoints.


A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.utils import shuffle
from systemml.mllearn import LogisticRegression
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
digits = datasets.load_digits()
X_digits =
y_digits =
X_digits, y_digits  = shuffle(X_digits, y_digits, random_state=0)
n_samples = len(X_digits)
X_test = X_digits[int(.8 * n_samples):]
y_test = y_digits[int(.8 * n_samples):]
training_fraction = [0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
scores = []
for frac in training_fraction:
    X_train = X_digits[:int(frac * n_samples)]
    y_train = y_digits[:int(frac * n_samples)]
    classifier = LogisticRegression(sqlCtx)
    score =, y_train).score(X_test, y_test)
    scores = scores + [ score ]
plt.plot(training_fraction, scores)
plt.xlabel('Fraction of data used for training: E')
plt.ylabel('Prediction score (higher the better): P')


1 Ofcourse this is a slight exaggeration as Laplace was the one who invented conjugate priors and exponential families and used them consistently.

2 Again an exaggeration, as Gauss (and many others) used maximum likelihood methods as early as 18th century.

3 In the cases where feature engineering is expensive or difficult, researchers employ machine learning algorithms that transform the raw input data to feature vector. This approach is commonly referred to as feature learning or representation learning.

Previous Chapter $\qquad \qquad \qquad \qquad $ Main Page $\qquad \qquad \qquad \qquad $ Next Chapter