Classifiers

Classifiers are algorithms that, given two or more groups of samples, find the features that best distinguish the groups. Different algorithms define "best" differently, and we'll look into two of them.

Overfitting

Before we get started with using these algorithms, we're going to talk about overfitting.

In classical machine learning, overfitting is when you've fit your model so perfectly to your training data that it doesn't generalize to other cases. For example, say you design a cancer vs. normal classifier that works perfectly on a single breast cancer dataset, but when you try it on a new breast cancer dataset, it fails.

Overfitting is a common beginner mistake, and it leads to misleading results: not all of the "significant" features you found are actually informative. Some happened to be higher or lower in your training data purely by chance, so the algorithm mistook them for discriminating features.

Source: http://scott.fortmann-roe.com/docs/MeasuringError.html
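
To make this concrete, here is a minimal sketch of overfitting. The simulated dataset and the scikit-learn decision tree below are illustrative assumptions, not part of a real analysis: an unconstrained decision tree memorizes its training data perfectly but does much worse on held-out samples.

In [ ]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Simulate a noisy dataset: 100 samples x 200 features, where only
# the first 5 features actually differ between the two groups
rng = np.random.RandomState(0)
X = rng.randn(100, 200)
y = np.array([0] * 50 + [1] * 50)
X[y == 1, :5] += 1.0

# Train on every other sample, test on the rest
X_train, X_test = X[::2], X[1::2]
y_train, y_test = y[::2], y[1::2]

# An unconstrained decision tree can memorize the training data
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# Expect perfect accuracy on the training set but a much lower
# score on the test set; the gap is the overfitting
print("Training accuracy:", tree.score(X_train, y_train))
print("Test accuracy:", tree.score(X_test, y_test))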

How to avoid overfitting

In machine learning, you often have a "gold standard" dataset to learn from (e.g. a manually curated face recognition dataset) and then a separate dataset that you test on. In biology, we tend not to have a good manually curated dataset for RNA-seq (though that would be very nice!!!), so what we do instead is split our dataset into training and test sets.

Training and test datasets
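
For example, one common way to make this split is scikit-learn's train_test_split. This is a minimal sketch: the feature matrix X and label vector y below are hypothetical stand-ins for your own data.

In [ ]:
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples x 20 features, labeled 0 or 1
X = np.random.randn(100, 20)
y = np.array([0] * 50 + [1] * 50)

# Hold out 25% of the samples as a test set; fixing random_state
# makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit the classifier on the training set only; use the test set
# once, at the end, to estimate how well it generalizes
print(X_train.shape, X_test.shape)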

Rerunning classifiers

Another method you can use to guard against overfitting is rerunning the same classifier multiple times with different random seeds. A "random seed" is the starting point for a random number generator: the same seed always produces the same sequence of "random" numbers. The code below uses Python's random number generator to return a random integer from 0 to 100.

  1. Run the code below a few times to see different random numbers from 0 to 100.
  2. Uncomment the random.seed(9) line (remove the "#" hash mark at the beginning of the line) and run the code a few more times. Notice that you now get the same number every time.
  3. Try changing the random seed (the number between the parentheses of random.seed(); right now it is 9) and see how the output changes.

In [112]:
import random

# Uncomment the line below to set the random seed
#random.seed(9)

# Return a random integer from 0 to 100 (both endpoints included)
random.randint(0, 100)


Out[112]:
59

Why does this matter for classifiers? Many classifiers start from a random set of features as an initial guess at the best discriminating features, and sometimes that guess is lucky and sometimes it isn't. You want to make sure that the features your classifier finds are robust across multiple different initializations.
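
As a rough sketch of what that check might look like (the simulated data and the SGDClassifier below are illustrative assumptions; any stochastic classifier would do), you can refit the same model under several seeds and see whether the same features keep getting the largest weights:

In [ ]:
import numpy as np
from sklearn.linear_model import SGDClassifier

# Simulated two-group data: only the first 3 of 20 features carry signal
rng = np.random.RandomState(0)
X = rng.randn(200, 20)
y = np.array([0] * 100 + [1] * 100)
X[y == 1, :3] += 2.0

# Refit the same stochastic classifier under several random seeds and
# record which features get the largest absolute weights each time
for seed in [0, 9, 42, 2016]:
    clf = SGDClassifier(random_state=seed)
    clf.fit(X, y)
    top_features = np.argsort(np.abs(clf.coef_[0]))[::-1][:3]
    print("seed", seed, "-> top features:", sorted(top_features))

# Features that appear for every seed are robust to initialization;
# features that come and go are suspect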