Classifiers are algorithms that, given two or more groups of samples, find the features that distinguish them the most. Different algorithms define "the most" differently, and we'll look into two of them.
Before we get started with using these algorithms, we're going to talk about overfitting.
In classical machine learning, overfitting is when you've fit your model so perfectly to your training data that it doesn't generalize to other cases. For example, you design a cancer vs. normal classifier that works perfectly on a single breast cancer dataset, but when you try it on a new breast cancer dataset, it fails.
Overfitting is a common newbie mistake, and it's dangerous because not all of the "significant" features you found are actually informative - some of them just happened to be higher or lower in your training data by chance, and so the algorithm mistook them for discriminating features.
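To see this concretely, here's a minimal sketch (the data below are made up, pure noise) of a decision tree that scores perfectly on the samples it was trained on but does no better than chance on new samples:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Pure noise: 50 samples x 1000 "genes", with labels assigned at random
X = rng.rand(50, 1000)
y = rng.randint(0, 2, size=50)

clf = DecisionTreeClassifier().fit(X, y)
print(clf.score(X, y))  # 1.0 -- "perfect" on the data it has already seen

# Fresh noise the classifier has never seen
X_new = rng.rand(50, 1000)
y_new = rng.randint(0, 2, size=50)
print(clf.score(X_new, y_new))  # ~0.5 -- no better than a coin flip
```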
In machine learning, you often have a "gold standard" dataset from which you learn (e.g. a manually curated face recognition dataset) and then a separate dataset that you test on. In biology, we tend not to have a good manually curated dataset for RNA-seq (though that would be very nice!!!), so what we do instead is split our dataset into training and testing sets.
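For example, a split might look like this with scikit-learn's `train_test_split` (the expression matrix `X` and labels `y` below are hypothetical placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples x 500 genes, with binary group labels
X = np.random.rand(100, 500)
y = np.random.randint(0, 2, size=100)

# Hold out 25% of the samples as an untouched test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(X_train.shape, X_test.shape)  # (75, 500) (25, 500)
```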
Another method you can use to guard against overfitting is rerunning the same classifier on your data multiple times, using different random seeds. A "random seed" is a starting point for a random number generator. The code below uses Python's built-in random number generator to return a random integer from 0 to 100.
(Try changing the number inside `random.seed()` - right now it is 9.)

```python
import random

# Sets the random seed, so the "random" numbers come out the same every run
random.seed(9)

# Return a random integer from 0 to 100
random.randint(0, 100)
```
Why does this matter for classifiers? Many classifiers pick a random set of features as an initial guess at the best discriminating features, and sometimes that guess gets lucky and sometimes it doesn't. You want to make sure that the features your classifier picks out are robust to multiple different initializations.
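As a sketch of what that check might look like (the random forest and simulated data below are stand-ins for illustration, not part of the tutorial), you can refit the same classifier with several seeds and compare which features come out on top:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Simulated two-group data: 100 samples, 20 features, 5 of them informative
X, y = make_classification(n_samples=100, n_features=20, n_informative=5,
                           random_state=0)

# Refit the same classifier with several different random seeds
for seed in [0, 1, 2, 3]:
    clf = RandomForestClassifier(random_state=seed).fit(X, y)
    # Top 3 features by importance for this initialization
    top = np.argsort(clf.feature_importances_)[::-1][:3]
    print("seed", seed, "-> top 3 features:", top)
```

If the same features appear regardless of the seed, you can be more confident they are real signal rather than a lucky initialization.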
Given a high-dimensional dataset, SVMs (Support Vector Machines) find the boundary (a hyperplane) that separates your groups of samples with the widest possible margin.
For this topic, we'll use a tutorial on SVM from Jake Vanderplas' excellent day-long tutorial on scikit-learn (`sklearn`), the machine learning library in Python (I highly recommend working through the full tutorial on your own). Follow this link, stopping when you get to the section labeled Quick Example: Moving to Regression.
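If you want a quick taste before diving into the tutorial, here's a minimal sketch (using simulated blobs rather than the tutorial's data) of fitting a linear SVM with sklearn:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two simulated groups of samples in two dimensions
X, y = make_blobs(n_samples=50, centers=2, random_state=0)

# Fit a linear support vector machine
clf = SVC(kernel='linear').fit(X, y)

# The support vectors are the samples closest to the separating line;
# they alone define where the boundary goes
print(clf.support_vectors_)
```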
Decision trees are classifiers that find the features, and the cutoffs on those features, that best define the differences between your groups. For example, below is an example decision tree for teaching a computer how to classify fruit from the grocery store.
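As a companion to that picture, here's a minimal sketch (the weights and labels below are made up for illustration) of training such a fruit classifier with sklearn:

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up fruit data: [weight in grams, skin is smooth (1) or bumpy (0)]
X = [[140, 1], [130, 1], [150, 0], [170, 0]]
y = ['apple', 'apple', 'orange', 'orange']

# The tree learns cutoffs (e.g. on weight or texture) that separate the labels
clf = DecisionTreeClassifier().fit(X, y)

# Predict the fruit for a new 160 g, bumpy sample
print(clf.predict([[160, 0]]))  # ['orange']
```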
Again, for this topic, we'll use Jake's tutorial.