The Fundamentals of Machine Learning

Quick Definition

The design and study of software artifacts that use past experience to make future decisions.

Definition (Tom Mitchell)

"A program can be said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P improves with experience E"

Supervised Learning Intuition

In supervised learning, the machine learning (ML) system is given examples of the right answers: each input is paired with the 'correct' output, which is often called a label.

For example, the ML system can be provided with data consisting of observations representing people. These observations consist of a set of features like hair length, height, and weight. These features are said to be measures of some explanatory variables.

The accompanying label can be the person's sex, male or female. Suppose that we now have access to a new set of data that are unlabelled, i.e. we don't know which observations are male or female.

The supervised ML system that is trained on the original pair of features and labels is now tasked with generating outputs or responses for the new set of unlabelled data.
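
As a rough sketch of this workflow, here is a minimal example assuming scikit-learn is available; the feature values, the people, and the choice of a nearest-neighbour classifier are all invented purely for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Labelled training data: one row per person, columns are
# [hair_length_cm, height_cm, weight_kg]. The values are made up.
X_train = np.array([
    [35.0, 162.0, 58.0],
    [ 5.0, 180.0, 82.0],
    [40.0, 155.0, 52.0],
    [ 2.0, 175.0, 90.0],
])
y_train = ["female", "male", "female", "male"]   # the labels

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)        # learn from (features, label) pairs

# New, unlabelled observations: the trained model supplies the labels.
X_new = np.array([[30.0, 158.0, 55.0],
                  [ 4.0, 178.0, 85.0]])
print(model.predict(X_new))

Here the call to fit corresponds to training on the labelled pairs, and predict generates responses for the new, unlabelled observations.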

Common Supervised Tasks

  • Classification: Predicting a discrete value for the response variable from one or more explanatory variables.
    • For example, given a dataset of college students, can a machine learning model correctly identify a student's major?
  • Regression: Predicting the value of a continuous response variable.
    • For example, given the same dataset of college students, can a machine learning model correctly predict the student debt accrued over 4 years? (A brief sketch follows this list.)
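
The regression task can be sketched in the same spirit; the student features, the dollar amounts, and the plain linear model below are hypothetical stand-ins for whatever data and model would actually be used.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features per student: [credit_hours_per_term, annual_tuition_usd].
X_train = np.array([
    [12.0,  9000.0],
    [15.0, 22000.0],
    [15.0, 35000.0],
    [18.0, 12000.0],
])
# Continuous response: debt accrued over 4 years (made-up values).
y_train = np.array([18000.0, 45000.0, 70000.0, 24000.0])

reg = LinearRegression().fit(X_train, y_train)
print(reg.predict([[15.0, 18000.0]]))   # estimated debt for a new student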

Unsupervised Learning Intuition

In unsupervised learning, the machine learning system does not have access to labels for a given dataset. Instead, the algorithm tries to find inherent patterns in the data, for example by clustering the observations according to some measure of distance or similarity; a brief clustering sketch follows the task list below.

Common Unsupervised Tasks

  • Clustering
  • Dimensionality Reduction
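
To make the clustering idea concrete, here is a minimal sketch using scikit-learn's KMeans; the data points and the choice of two clusters are invented for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Unlabelled observations: [height_cm, weight_kg]. No labels are given.
X = np.array([
    [150.0, 50.0], [152.0, 54.0], [149.0, 48.0],   # one natural grouping
    [185.0, 90.0], [182.0, 88.0], [188.0, 95.0],   # another grouping
])

# Ask the algorithm to find 2 clusters using only distances in feature space.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster index assigned to each observation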

Training and Testing

Imagine we want to create a machine learning classifier that can classify all sorts of living organisms, and suppose we want to start with just the plant kingdom. Let's call this classifier Linnaeus, after the father of modern biological nomenclature, Carl Linnaeus.

To keep it simple, we want Linnaeus to be able to classify two types of plants: trees, and ferns. We go out into the world and find a bunch of examples of trees, and a bunch of examples of ferns.

Ferns

Wikipedia

"A fern is a member of a group of approximately 12,000 species of vascular plants that reproduce via spores and have neither seeds nor flowers. They differ from mosses by being vascular (i.e. having water-conducting vessels). They have stems and leaves, like other vascular plants. Most ferns have what are called fiddleheads that expand into fronds, which are each delicately divided."

Trees

Wikipedia

"In botany, a tree is a perennial plant with an elongated stem, or trunk, supporting branches and leaves in most species. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are usable as lumber or plants above a specified height. In looser senses, the taller palms, the tree ferns, bananas and bamboos are also trees."

Unfortunately, the features we manage to collect for each individual fern and tree are limited: number_of_leaves, number_of_stems, has_flowers.

Each data point also has a label, category, which takes one of two values: tree or fern.

The first 6 rows of this dataset might look like this:

plant_id  number_of_leaves  number_of_stems  has_flowers  category
1         102,172           7,301            True         tree
2         321,102           9,332            True         tree
3         42,211            1,208            False        fern
4         259,102           7,301            True         tree
5         21,998            966              False        fern
6         81,954            1,566            False        fern
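
As a small sketch, these rows could be held in a pandas DataFrame (pandas is an assumption here; any tabular representation would work) and separated into features and labels.

import pandas as pd

# The six observations from the table above; category is the label.
plants = pd.DataFrame({
    "plant_id":         [1, 2, 3, 4, 5, 6],
    "number_of_leaves": [102172, 321102, 42211, 259102, 21998, 81954],
    "number_of_stems":  [7301, 9332, 1208, 7301, 966, 1566],
    "has_flowers":      [True, True, False, True, False, False],
    "category":         ["tree", "tree", "fern", "tree", "fern", "fern"],
})

X = plants[["number_of_leaves", "number_of_stems", "has_flowers"]]  # features
y = plants["category"]                                              # labels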

The task for Linnaeus is the following: using these features, can Linnaeus create a model that generalizes the notion of tree and fern so that it can correctly discern new instances of trees and ferns?

Training is the process by which Linnaeus creates a general model of the world, in this case of trees and ferns. The Training Set is the set of observations that Linnaeus uses to create this model.

Testing is the process by which we measure how well Linnaeus does in classifying trees and ferns with this model. The Test Set is a separate set of observations, similar in form to the Training Set, that is used to evaluate how well Linnaeus classifies plants into trees and ferns.

Hyperparameter Tuning is the process of adjusting the settings that control how the model is learned, using a Validation Set to compare candidate settings. The Validation Set is a third set of observations, distinct from the Training Set and Test Set, and it helps guard against overfitting or underfitting the model.
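
One common way to obtain the three sets is to split the data twice, as in the sketch below; it assumes scikit-learn, a dataset much larger than the six rows shown above, and the X and y from the earlier DataFrame sketch.

from sklearn.model_selection import train_test_split

# First carve off a held-out Test Set, then split the remainder into a
# Training Set and a Validation Set (roughly 60% / 20% / 20% overall).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)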

Overfitting and Underfitting

Suppose that Linnaeus only uses the number_of_leaves feature to carry out the machine learning task of predicting the category response variable, tree or fern. Let us also suppose that it uses the first 6 rows of the dataset to create the model.

Linnaeus could potentially create a model like this:

Model 1: Overfitting

if number_of_leaves in (102_172, 321_102, 259_102):
    new_observation_prediction = "tree"
elif number_of_leaves in (42_211, 21_998, 81_954):
    new_observation_prediction = "fern"

Model 1 is said to memorize the training set and is probably overfitting the data. The conditional statements above are an extreme example of what it means to overfit: Linnaeus is essentially looking at the examples of trees in the dataset and mapping their exact number_of_leaves values to its notion of a tree, and doing the same for the examples of ferns. A model like this can say nothing sensible about a new plant whose leaf count it has never seen before.

Now that we have a sense of what overfitting is, what would underfitting look like?

Model 2: Underfitting

if number_of_leaves > 100_000:
    new_observation_prediction = "tree"
else:
    new_observation_prediction = "fern"

Model 2 is substantially simpler than Model 1. Instead of memorizing the exact number_of_leaves values of trees and ferns to predict a category for future observations, it sets a single threshold of 100,000: Linnaeus categorizes observations with more than 100,000 leaves as tree and observations with 100,000 or fewer leaves as fern. A single threshold on one feature is probably too crude to capture the real difference between trees and ferns, which is why Model 2 is likely underfitting.
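
A practical way to spot both failure modes is to compare accuracy on the Training Set against accuracy on the Test Set as model complexity varies. The sketch below assumes scikit-learn, a decision-tree classifier (chosen only for illustration, not necessarily what Linnaeus would use), and the train/test split from the earlier sketch.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# High training accuracy but low test accuracy suggests overfitting (Model 1);
# low accuracy on both sets suggests underfitting (Model 2).
for depth in (1, 3, 10, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, tree.predict(X_train))
    test_acc = accuracy_score(y_test, tree.predict(X_test))
    print(depth, round(train_acc, 2), round(test_acc, 2))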

Training Set, Testing Set, Validation Set

Hyperparameter Tuning

Cross Validation

Bias/Variance Tradeoff

Bias

Variance

The Bias/Variance Tradeoff

Performance Measures

Accuracy

Precision

Recall