Classifiers are algorithms that, given two or more groups of samples, find the features that distinguish them the most. Different algorithms define "the most" differently, and we'll look into two of them.
Before we get started with using these algorithms, we're going to talk about overfitting.
In classical machine learning, overfitting is when you've fit your model so perfectly to your training data that it doesn't generalize to other cases. For example, you design a cancer vs. normal classifier that works perfectly on a single breast cancer dataset, but when you try it on a new breast cancer dataset, it fails.
Overfitting is a common newbie mistake, and it's dangerous because not all of the "significant" features you found are actually informative - some of them just happened to be higher or lower in your training data by chance, and so the algorithm mistook them for discriminating features.
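To see this concretely, here's a minimal sketch (the data below are made up, pure noise) of a decision tree that scores perfectly on the samples it was trained on but does no better than chance on new samples:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Pure noise: 50 samples x 1000 "genes", with labels assigned at random
X = rng.rand(50, 1000)
y = rng.randint(0, 2, size=50)

clf = DecisionTreeClassifier().fit(X, y)
print(clf.score(X, y))  # 1.0 -- "perfect" on the data it has already seen

# Fresh noise the classifier has never seen
X_new = rng.rand(50, 1000)
y_new = rng.randint(0, 2, size=50)
print(clf.score(X_new, y_new))  # ~0.5 -- no better than a coin flip
```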
In machine learning, you often have a "gold standard" dataset from which you learn (e.g. a manually curated face recognition dataset) and then a separate dataset that you test on. In biology, we tend not to have a good manually curated dataset for RNA-seq (though that would be very nice!!!), so what we do instead is split our dataset into training and testing sets.
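For example, a split might look like this with scikit-learn's `train_test_split` (the expression matrix `X` and labels `y` below are hypothetical placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples x 500 genes, with binary group labels
X = np.random.rand(100, 500)
y = np.random.randint(0, 2, size=100)

# Hold out 25% of the samples as an untouched test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(X_train.shape, X_test.shape)  # (75, 500) (25, 500)
```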
Another method you can use to guard against overfitting is rerunning the same classifier on your data multiple times, using different random seeds. A "random seed" is a starting point for a random number generator. The code below uses Python's built-in random number generator to return a random integer from 0 to 100.
(Try changing the number inside `random.seed()` - right now it is 9.)

```python
import random

# Sets the random seed, so the "random" numbers come out the same every run
random.seed(9)

# Return a random integer from 0 to 100
random.randint(0, 100)
```
Why does this matter for classifiers? Many classifiers pick a random set of features as an initial guess at the best discriminating features, and sometimes that guess gets lucky and sometimes it doesn't. You want to make sure that the features your classifier picks out are robust to multiple different initializations.
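As a sketch of what that check might look like (the random forest and simulated data below are stand-ins for illustration, not part of the tutorial), you can refit the same classifier with several seeds and compare which features come out on top:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Simulated two-group data: 100 samples, 20 features, 5 of them informative
X, y = make_classification(n_samples=100, n_features=20, n_informative=5,
                           random_state=0)

# Refit the same classifier with several different random seeds
for seed in [0, 1, 2, 3]:
    clf = RandomForestClassifier(random_state=seed).fit(X, y)
    # Top 3 features by importance for this initialization
    top = np.argsort(clf.feature_importances_)[::-1][:3]
    print("seed", seed, "-> top 3 features:", top)
```

If the same features appear regardless of the seed, you can be more confident they are real signal rather than a lucky initialization.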
Given a high-dimensional dataset, SVMs (Support Vector Machines) find the boundary (a hyperplane) that separates your groups of samples with the widest possible margin.
For this topic, we'll use a tutorial on SVM from Jake Vanderplas' excellent day-long tutorial on scikit-learn (`sklearn`), the machine learning library in Python (I highly recommend working through the full tutorial on your own). Follow this link, stopping when you get to the section labeled Quick Example: Moving to Regression.
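If you want a quick taste before diving into the tutorial, here's a minimal sketch (using simulated blobs rather than the tutorial's data) of fitting a linear SVM with sklearn:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two simulated groups of samples in two dimensions
X, y = make_blobs(n_samples=50, centers=2, random_state=0)

# Fit a linear support vector machine
clf = SVC(kernel='linear').fit(X, y)

# The support vectors are the samples closest to the separating line;
# they alone define where the boundary goes
print(clf.support_vectors_)
```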
Decision trees are classifiers that find the features, and the cutoffs on those features, that best define the differences between your groups. For example, below is an example decision tree for teaching a computer how to classify fruit from the grocery store.
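As a companion to that picture, here's a minimal sketch (the weights and labels below are made up for illustration) of training such a fruit classifier with sklearn:

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up fruit data: [weight in grams, skin is smooth (1) or bumpy (0)]
X = [[140, 1], [130, 1], [150, 0], [170, 0]]
y = ['apple', 'apple', 'orange', 'orange']

# The tree learns cutoffs (e.g. on weight or texture) that separate the labels
clf = DecisionTreeClassifier().fit(X, y)

# Predict the fruit for a new 160 g, bumpy sample
print(clf.predict([[160, 0]]))  # ['orange']
```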
Again, for this topic, we'll use Jake's tutorial.