Frankengenome challenge

Created by Jacob Pritt with slight changes by Ben Langmead and Ankit Garg

Dr. Frankenstein is trying to create a new species of bacterium by splicing together the genomes of bacteria A and B. Unfortunately, he lost all but the first two pages of his notes describing how they were spliced together. He needs your help to identify which parts of the frankengenome came from which bacterium. You are given the full frankengenome called frankengene1.fasta as well as the first two pages of notes, called trainingData1.txt and testData1.txt, which have correct labels for the first 50,000 and the next 50,000 bases of the frankengenome respectively. The label 0 indicates the base came from genome A and the label 1 indicates it came from genome B. E.g. the label 0000011111110011 indicates the first 5 bases came from genome A, the next 7 from genome B, the next 2 from genome A, and the last 2 from genome B.

Implement a classifier that assigns a label to each base of the frankengenome according to whether it came from genome A or B. You are not expected to get the labels 100% correct; you will be graded based on your classifier's accuracy relative to a simple "reference" classifier. Accuracy is measured as the fraction of genome positions given the correct label.

The frankengenome is available here:

http://www.cs.jhu.edu/~langmea/resources/frankengene1.fasta

Labels for the first 50,000 bases, the training data, are available here:

http://www.cs.jhu.edu/~langmea/resources/trainingData1.txt

Labels for the next 50,000 bases, the test data, are available here:

http://www.cs.jhu.edu/~langmea/resources/testData1.txt

You should submit:

  1. the source code for your solution
  2. instructions on how to run it, and
  3. a file containing a string of 0s and 1s, one per genome position, giving your predictions for whether each base of the Frankengenome came from genome A (0) or genome B (1).

A good way to solve this problem is using an HMM, similarly to how we solved the CpG island labeling problem in class. You may borrow freely from the posted HMM code. While you may choose to experiment with higher-order HMM, a 1st order HMM -- where states correspond to genomes (A or B) and emissions to individual bases -- will do fine here.

When you build your classifier, build it using only the data from trainingData1.txt. When you test your classifier to assess accuracy, test it using only the data from testData1.txt. This will give you a more accurate picture about how your classifier performs on the whole genome than if you build and test using the same data.

Hint: If you build your classifier using trainingData1.txt and use it to make predictions for testData1.txt, you should be able to label at least 98.5% of the positions correctly.

Hint: The read_genome_and_training and read_training_pairs functions defined below in Python can be helpful when building the transition and emission probability tables for the HMM. Feel free to use them:


In [1]:
from itertools import islice, tee

def pairwise(iterable):
    """ Create iterator over adjacent pairs of elements in given
        iterator. """
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


def read_fasta_char_by_char(fn):
    """ Read FASTA file, yielding nucleotides one by one """
    with open(fn) as fh:
        for ln in fh:
            if ln[0] == '>':
                continue
            for c in ln.strip():
                yield c

def read_labels_char_by_char(fn):
    """ Read label file, yielding labels one by one """
    with open(fn) as fh:
        for ln in fh:
            for c in ln.strip():
                yield c

def read_genome_and_training(franken_fn, training_fn):
    """ Given filename for Frankengenome and filename for training data,
        return iterator over labeled positions of the Frankengenome,
        yielding pairs like ('A', 0) """
    return zip(read_fasta_char_by_char(franken_fn),
               read_labels_char_by_char(training_fn))

def read_training_pairs(training_fn):
    """ Return iterator over adjacent label pairs in given label file """
    return pairwise(read_labels_char_by_char(training_fn))

Hint: This Python function can be helpful for evaluating your classifier on the test data. Feel free to use it:


In [2]:
def evaluate_on_test_data(prediction, test_data_fn):
    """ Given an iterator over all the predictions (one per
        genome position), and given the name of the test data
        file, return a pair giving the number of correct and
        the total number of predictions. """
    ncorrect, ntot = 0, 0
    for pred, act in zip(islice(prediction, 50000, None),  # skip first 50K predictions
                         read_labels_char_by_char(test_data_fn)):
        if pred == act:
            ncorrect += 1
        ntot += 1
    return ncorrect, ntot