We've covered statistics, probability, linear algebra, data munging, and the basics of Python. You're ready. By the end of this lecture, you should be able to use scikit-learn to create a basic classifier.

What, exactly, is machine learning?
From Arthur Samuel in 1959: Machine Learning is a
"...Field of study that gives computers the ability to learn without being explicitly programmed."
Probably the most succinct definition.
Imagine how you first learned what a "cat" was when you were young. The process probably went something like this: at first, you pointed at animals, guessed "cat," and someone told you whether you were right; before long, you could spot cats entirely on your own, with no one confirming your answers.
The first process, in which you asked for and received feedback, is the learning step.
The second process, in which you took your experience and identified cats without any feedback, is the generalization step.
These two steps define a machine learning algorithm.
There are many other ways of describing machine learning, and a whole ecosystem of buzzwords surrounding it; we'll only explore a very small subset in this lecture.
Most machine learning algorithms involve the development of a rule to describe the data.
This rule is sometimes referred to as a hypothesis or a decision boundary.
The way in which this rule is formed, and the resulting shape of the decision boundary, is largely what distinguishes one machine learning algorithm from another.
Choosing what algorithm to use is important.
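As a quick illustration (the dataset and the two models below are choices made for this sketch, not anything prescribed by the lecture), here are two classifiers whose decision boundaries have very different shapes, applied to the same data:

```python
# A minimal sketch: two classifiers with very differently shaped decision
# boundaries, fit to the same synthetic dataset. make_moons, logistic
# regression, and k-NN are illustrative choices only.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Two interleaving half-circles: not separable by a straight line.
X, y = make_moons(n_samples=500, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Logistic regression learns a linear decision boundary.
linear_model = LogisticRegression().fit(X_train, y_train)

# k-nearest neighbors learns a flexible, nonlinear boundary.
knn_model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("Linear boundary accuracy:", linear_model.score(X_test, y_test))
print("k-NN boundary accuracy:  ", knn_model.score(X_test, y_test))
```

On data like this, the nonlinear boundary typically wins; on other data, the simpler linear rule may generalize better, which is exactly why the choice of algorithm matters.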
Overfitting is a problem in machine learning where your model performs exceptionally well on your training data, but fits so tightly to it that it cannot generalize well to new data.
This can be detected and mitigated through the use of $k$-fold cross-validation.
Cross-validation is a process in which the data you use to build your model is split into $k$ equally sized subsets, or folds. In each round, one fold is held out as the testing set; the model is trained on the remaining $k - 1$ folds and then tested for generalization accuracy on the held-out fold.
This process is repeated with a different fold held out each time, so that every fold serves as the testing set exactly once; the number of folds, and therefore of rounds, is the $k$ in $k$-fold.
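As a minimal sketch of how this looks in scikit-learn (the dataset and the unconstrained decision tree are illustrative assumptions, not part of the lecture), compare a model's accuracy on its own training data with its cross-validated accuracy:

```python
# A sketch of using 5-fold cross-validation to expose overfitting.
# The dataset and model are illustrative choices only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# An unconstrained decision tree can essentially memorize the training data...
tree = DecisionTreeClassifier(random_state=0)
print("Training accuracy:", tree.fit(X, y).score(X, y))  # typically 1.0

# ...but 5-fold cross-validation reveals how well it actually generalizes.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())
```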
Another important problem to keep in mind when designing your algorithms is the concept of the bias-variance tradeoff.
In statistical terms, neither of these quantities is inherently "bad". This is especially true of bias, since in everyday language the word tends to carry a pejorative connotation. However, it is critical to understand the effects these quantities can have on any modeling strategy you employ in machine learning.
This picture illustrates the effects of the two quantities:
In an ideal world, we'd all build machine learning models with low variance and low bias (upper left corner).
Unfortunately, this is rarely, if ever, possible. In fact, we can explicitly decompose the error rates of virtually any machine learning algorithm into three distinct components: the bias, the variance, and the irreducible error (noise in the problem itself).
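For a model $\hat{f}$ evaluated with squared-error loss, this decomposition is commonly written as follows (a standard result, stated here for reference rather than derived):

$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \left(\text{Bias}\left[\hat{f}(x)\right]\right)^2 + \text{Var}\left[\hat{f}(x)\right] + \sigma^2$

where $\sigma^2$ is the irreducible error: the noise inherent in the problem that no model can eliminate.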
The tradeoff comes from the fact that, as we try to decrease the error of one of the terms, we often simultaneously increase the error from the other term.
Different machine learning algorithms and models make different assumptions and therefore strike different balances between bias and variance. The specifics of the problem you are working on will dictate whether your model can better tolerate high bias or high variance.
Learning algorithms can, at a coarse level, be divided into two distinct groups: supervised and unsupervised learning.
The basic idea behind supervised learning is that you start with a ground-truth, labeled dataset of $N$ points with the form
$\{(\vec{x}_1, y_1), (\vec{x}_2, y_2), ..., (\vec{x}_N, y_N)\}$
where $\vec{x}_i$ is the $i^{th}$ data point, and $y_i$ is the ground-truth label (e.g. "spam" or "not spam" for email) of that data point.
The goal, then, is to use this training data to learn a function $f$ that most accurately maps the training data $X$ to the correct labels $Y$. In mathematical parlance, this looks like
$f : X \rightarrow Y$
Within the branch of supervised learning, there are two distinct strategies: classification and regression.
In both instances of supervised learning, we're still concerned with mapping inputs to correct outputs. The difference is the form of the output: discrete (one of a finite handful of categories) for classification tasks, and continuous (real-valued, floating-point) for regression tasks.
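To make this concrete, here is a minimal classification sketch in scikit-learn; the Iris dataset and the $k$-nearest-neighbors model are illustrative assumptions, not choices mandated by the lecture:

```python
# A minimal sketch of a supervised classification task in scikit-learn.
# The dataset (Iris) and the model (k-NN) are illustrative choices only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Labeled data: each row of X is a feature vector x_i, each entry of y a label y_i.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Learning step: fit f on the labeled training data.
classifier = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Generalization step: predict labels for data the model has never seen.
print("Test accuracy:", classifier.score(X_test, y_test))
```

A regression task would look nearly identical, except that $y$ would hold real-valued targets and you would fit a regressor (for example, `LinearRegression`) instead of a classifier.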
Unlike supervised learning, tasks in unsupervised learning deal strictly with unlabeled data of $N$ points of the form
$\{\vec{x}_1, \vec{x}_2, ..., \vec{x}_N\}$
Since all we have are the data, the goal of unsupervised learning is to find a model which best explains the underlying process that gave rise to the data.
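As a small sketch of what this looks like in practice (the synthetic data and the $k$-means algorithm are illustrative assumptions only), we can ask a clustering algorithm to group unlabeled points:

```python
# A sketch of an unsupervised task: clustering unlabeled points.
# make_blobs and k-means are illustrative choices only.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled data: keep only the points, discarding the generated labels.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Ask k-means to explain the data as 3 groups; no ground truth is used.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignments for the first 10 points:", kmeans.labels_[:10])
print("Cluster centers:\n", kmeans.cluster_centers_)
```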
One example of unsupervised learning is the identification of communities in social networks.
A social network is well-represented using the graph structure of nodes connected by edges, where each user (you and me) is a node in the graph, and any users who are "friends" are connected by an edge.
Community-finding is decidedly an unsupervised process: you have no idea how many communities there are ahead of time, nor do you know who is part of what community. All you have is your database of users and their connections.
We can use these person-to-person connections to identify groups of users who are highly interconnected with one another, and which therefore represent "communities." A research paper in 2010 identified such communities among bloggers on different sides of the political spectrum.
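As a rough sketch of the idea (using the third-party `networkx` library and a modularity-based algorithm on a small built-in social graph; this is not the network or the method from the paper above), community detection might look like this:

```python
# A sketch of community detection on a small social graph.
# networkx's karate-club graph and greedy modularity maximization are
# illustrative choices, not the data or method from the paper cited above.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Each node is a person; each edge is a "friendship."
graph = nx.karate_club_graph()

# Find groups of nodes that are densely connected internally. Note that we
# never specify the number of communities, or who belongs to which, in advance.
communities = greedy_modularity_communities(graph)
for i, community in enumerate(communities):
    print(f"Community {i}: {sorted(community)}")
```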
Unsupervised learning goes by a number of other names as well; in all cases, the common thread is that you don't have any kind of ground-truth labeling scheme.
1: There is such a concept as semi-supervised learning. It usually involves using a small quantity of labeled data and a much larger quantity of unlabeled data. Based on the name, and what you learned from this lecture on supervised and unsupervised learning, can you speculate as to how the unlabeled and labeled data might be used in this context?
2: In many machine learning tasks, it is common to have data that is very high-dimensional; in fact, it is common for the number of dimensions of the data to exceed the number of data points you have. This is referred to as the "curse of dimensionality" and has very real implications for training models that generalize well. Imagine again that you're trying to design a spam filter for email. If the words in an email constitute that email's dimensions, explain how this task suffers from the "curse of dimensionality".
3: I'm interested in training a classifier to perform some supervised task. I have built the model and now want to test it; unfortunately, I have very little labeled data available, only 100 data points. When performing $k$-fold cross-validation to test my model, what $k$ should I choose (large, medium, small) and why?
4: Does overfitting negatively impact the learning phase, or the generalization phase? Which phase is more important? Explain.
5: I've designed a model that achieves zero error due to variance, but therefore has high bias. Is my model underfitting or overfitting? What if I instead design a model that achieves zero error due to bias, but has high variance? Explain.