Lecture 20: Introduction to Machine Learning

CSCI 1360E: Foundations for Informatics and Analytics

Overview and Objectives

We've covered statistics, probability, linear algebra, data munging, and the basics of Python. You're ready. By the end of this lecture, you should be able to

  • Define machine learning, optimization, and artificial intelligence, and how these subfields interact.
  • Understand when to use supervised versus unsupervised learning.
  • Use scikit-learn to create a basic classifier.

Part 1: Machine Learning

What, exactly, is machine learning?

From Arthur Samuel in 1959: Machine Learning is a

"...Field of study that gives computers the ability to learn without being explicitly programmed."

Probably the most succinct definition.

Imagine how you first learned what a "cat" was when you were young. The process probably went something like this:

  • You pointed at a cat and asked "What is that?"
  • Somebody (probably a parent) said "That's a cat."
  • You internalized this experience. Perhaps later, you pointed at a dog and asked "Cat?"
  • Somebody (again, probably a parent) corrected you, saying "No, that's a dog."
  • With some more back and forth, you were able to reliably determine on your own whether or not something in the real world was a cat.

The first process, in which you asked for and received feedback, is the learning step.

The second process, in which you took your experience and identified cats without any feedback, is the generalization step.

These two steps define a machine learning algorithm.

Some other ways of describing machine learning include

  • Inferring knowledge from data
  • Generalizing to unseen or unobserved data
  • Emphasizing the computational challenges (e.g., "big" data)

There's a whole ecosystem of buzzwords around machine learning; we'll only explore a very small subset in this lecture.

Most machine learning algorithms involve the development of a rule to describe the data.

This rule is sometimes referred to as a hypothesis or a decision boundary.

The way in which this rule is formed, and the resulting shape of the decision boundary, is largely what distinguishes one machine learning algorithm from another.

Choosing what algorithm to use is important.

  • Different algorithms embody and quantify different assumptions about the underlying data.
  • A prevalent heuristic throughout machine learning is to use the simplest possible model or algorithm that adequately explains the data.
  • Using extremely complex algorithms will almost always give you lower error rates, but this can potentially lead to overfitting: a situation where your algorithm so closely matches the data that it can't generalize to new data.

Overfitting is a problem in machine learning where your model performs exceptionally well on your training data, but fits so tightly to it that it cannot generalize well to new data.

This can be mitigated through the use of k-fold cross-validation.

Cross-validation is a process through which the data you use to train your model is split into two sets, or folds: training and testing. The model is trained using the training set, and then is tested for generalization accuracy with the testing set.

This process is then repeated with new training and testing sets. The number of times this process is repeated is the number $k$ in k-fold.

Another important problem to keep in mind when designing your algorithms is the concept of the bias-variance tradeoff.

  • Variance is a concept we've seen before; it's related directly to the standard deviation. This quantifies how much a distribution varies from its mean, or average. Small variance means a distribution is concentrated around its mean; large variance means the distribution is very spread out.
  • Bias is a term we've all heard as a colloquialism, but which has a very specific statistical definition: this quantifies the difference between the expected mean of a random variable, and the true mean of that random variable.

In statistical terms, neither of these quantities is inherently "bad"; this is especially true with bias, as in daily language use the term tends to be a pejorative. However, it is critical to understand the effects these quantities can have on any modeling strategy you employ in machine learning.

This picture perfectly describes the effects of the two quantities:

In an ideal world, we'd all build machine learning models with low variance and low bias (upper left corner).

Unfortunately, this is rarely, if ever, possible. In fact, we can explicitly decompose the error rates of virtually any machine learning algorithm into three distinct components: the bias, the variance, and the irreducible error (noise in the problem itself).

The tradeoff comes from the fact that, as we try to decrease the error of one of the terms, we often simultaneously increase the error from the other term.

Different machine learning algorithms and models will make different assumptions and therefore give different preferences to bias versus variance. The specifics of the problem on which you are working will dictate whether your model can tolerate high-bias or high-variance.

Part 2: Supervised and Unsupervised Learning

Learning algorithms can, at a coarse level, be divided into two distinct groups: supervised and unsupervised learning.

  • Supervised learning involves training a model using labeled data, or data for which you know the answer. For example, if you wanted to design an automatic spam filter, you might start with a training dataset containing sample emails that are labeled as either spam or not spam. This is the equivalent of your parent figure providing the correct answer regarding your inquiries about what is a "cat."
  • Unsupervised learning is the process of building a model using data for which you have no ground truth and no explicit set of labels for your data. Usually this implies clustering: the process of grouping similar things together.

Supervised Learning

The basic idea behind supervised learning is that you start with a ground-truth, labeled dataset of $N$ points with the form

$\{(\vec{x}_1, y_1), (\vec{x}_2, y_2), ..., (\vec{x}_N, y_N)\}$

where $\vec{x}_i$ is the $i^{th}$ data point, and $y_i$ is the ground-truth label (e.g. "spam" or "not spam" for email) of that data point.

The goal, then, is to use this training data to learn a function $f$ that most accurately maps the training data $X$ to the correct labels $Y$. In mathematical parlance, this looks like

$f : X \rightarrow Y$

Within the branch of supervised learning, there are two distinct strategies: classification and regression.

  • Classification is the process of mapping an input data point $\vec{x}$ to a discrete label. Spam filters would fall in thsi category: we are explicitly trying to identify an incoming email as either "spam" or "not spam." In mathematical terms: we're mapping continuous input to one of a handful of discrete outputs.
  • Regression is the process of mapping an input data point $\vec{x}$ to a real-valued label. Linear regression, in which the goal is to find a best-fit line to a cloud of data points, is an example. Rather than a handful of possible discrete outputs, your output is instead a continuous floating-point value. An example of regression might be predicting the stock value of a company.

In both instances of supervised learning, we're still concerned with mapping inputs to certain correct outputs. The difference is the form of the output: discrete (one of a possible handful) for classification tasks, and continuous (real-valued; floating-point) for regression tasks.

Unsupervised Learning

Unlike supervised learning, tasks in unsupervised learning deal strictly with unlabeled data of $N$ points of the form

$\{\vec{x}_1, \vec{x}_2, ..., \vec{x}_N\}$

Since all we have are the data, the goal of unsupervised learning is to find a model which best explains the underlying process that gave rise to the data.

One example of unsupervised learning is the identification of communities in social networks.

A social network is well-represented using the graph structure of nodes connected by edges, where each user (you and me) is a node in the graph, and any users who are "friends" are connected by an edge.

Community-finding is decidedly an unsupervised process: you have no idea how many communities there are ahead of time, nor do you know who is part of what community. All you have is your database of users and their connections.

We can use these inter-person connections to identify groups that are highly interconnected with each other, therefore representing a "community." A research paper in 2010 identified such networks of bloggers on different sides of the political spectrum:

Other ways to refer to unsupervised learning include

  • Clustering, since we are effectively "clustering" similar data together
  • Anomaly detection, sort of an antonym for clustering: rather than having the goal of grouping similar things together, we use that as a stepping stone to then try and identify things that don't fit
  • Data mining, though this colloquialism has largely been co-opted by the popular media and may therefore not necessarily equate to unsupervised learning in certain contexts

In all cases, the common thread is that you don't have any kind of ground-truth labeling scheme.

Review Questions

Some questions to discuss and consider:

1: There is such a concept as semi-supervised learning. It usually involves using a small quantity of labeled data and a much larger quantity of unlabeled data. Based on the name, and what you learned from this lecture on supervised and unsupervised learning, can you speculate as to how the unlabeled and labeled data might be used in this context?

2: In many machine learning tasks, it is common to have data that is very high-dimensional; in fact, it is common for the number of dimensions of the data to exceed the number of data points you have. This is referred to as the "curse of dimensionality" and has very real implications for training models that generalize well. Imagine again that you're trying to design a spam filter for email. If words in an email consist of that email's dimensions, explain how this task suffers from the "curse of dimensionality".

3: I'm interested in training a classifier to perform some supervised task. I have built the model and now want to test it; unfortunately, I have very little labeled data available, only 100 data points. When performing $k$-fold cross-validation to test my model, what $k$ should I choose (large, medium, small) and why?

4: Does overfitting negatively impact the learning phase, or the generalization phase? Which phase is more important? Explain.

5: I've designed a model that achieves zero error due to variance, but therefore has high bias. Is my model underfitting or overfitting? What if I instead design a model that achieves zero error due to bias, but has high variance? Explain.

Course Administrivia

  • Last review session of the semester is this coming Friday, July 29 12-1:30pm. Come if you have any questions, or forever hold your peace until the final exam!
  • The final exam will be spread over the 48-hour period of Tuesday-Wednesday, August 2-3. More details after Wednesday's lecture and at Friday's review session.
  • A10 will be out tomorrow! It will be the final assignment of the semester, and be due on Friday July 29 at 11:59:59pm.

Additional Resources

  1. Richert, Willi and Pedro Coelho, Luis. Building Machine Learning Systems with Python. 2013. ISBN-13: 978-1782161400