Kraken is an algorithm that inspects sequence reads from a sample presumably containing multiple species and attempts to identify the species present (Wood & Salzberg 2014). This is most useful for microbiologists trying to identify bacteria species in environments such as a body of water or an animal's gut. This field of study is known as metagenomics.

Kraken stores a database of all $k$-mers from the 2,256 complete reference genomes contained in the NCBI's RefSeq database. The reference genomes' associated species are organized in a phylogenetic tree in which each species' node is attached to that of its most immediate ancestor.

The production code uses a lot of optimization techniques. After all, the novelty of Kraken is not the problem it solved (many metagenomic classification algorithms already existed), but rather its speed and memory efficiency. However, our Python implementation will ignore most incremental optimizations for the sake of readability and concision, while using the same general data formats as the production code.

We will populate our phylogenetic tree using data files supplied for Minikraken, which uses all of the reference genomes supplied by RefSeq but only a small subset of their $k$-mers.


In [6]:
from pprint import pprint

nodes = "/Volumes/8 GB FLASH/minikraken_20140330/taxonomy/nodes.dmp"
names = "/Volumes/8 GB FLASH/minikraken_20140330/taxonomy/names.dmp"


class PhyloNode:
    
    def __init__(self, nodeID, parentID, taxonType):
        self.nodeID = nodeID
        self.parentID = parentID
        self.name = None
        self.taxonType = taxonType
        self.children = set()
        self.parent = None

    def getLCA(descendants):
        if not descendants:
            return self
        if descendants.issubset(self.children):
            return self
        possibleDescendants = descendants.intersect(self.children)
        for child in children:
            childsNondescendants = child.nonchildrenIn(descendants)
            if not childsNondescendants:
                return child.getLCA(descendants)
            else:
                possibleDescendants.intersection_update(
                
        elif not set.intersect([child.nonchildrenIn(descendants.intersect(self.children)) for child in self.children]):
            return True
        else:
            return False
        
    # determines which nodeIDs in the set are not children of this node (inclusive)
    def nonchildrenIn(nodeIDs):
        if not nodeIDs:
            return set()
        else:
            if nodeIDs.contains(self.nodeID):
                nodeIDs.remove(self.nodeID)
            return set.intersection([child.childrenIn(nodeIDs) for child in self.children])


class PhyloTree:
    
    def __init__(self, nodesFileName, namesFileName):
        self.size = 0
        # used to quickly index node by nodeID
        self.nodes = []

        namesFile = open(namesFileName, "r")
        nodesFile = open(nodesFileName, "r")

        # populate nodes
        for line in nodesFile:
            nodeID, parentID, taxonType, _ = map(lambda x: x.strip(), line.split('|', 3))
            nodeID, parentID = int(nodeID), int(parentID)
            #pad for missing nodes
            while nodeID > len(self.nodes):
                self.nodes.append(None)
            self.nodes.append(PhyloNode(nodeID, parentID, taxonType))

        # populate children lists
        for node in self.nodes:
            if node and node.parentID:
                node.parent = self.nodes[node.parentID]
                node.parent.children.add(node)

        # add names to nodes
        for line in namesFile:
            nodeID, name, _, nameType, _ = map(lambda x: x.strip(), line.split('|'))
            nodeID = int(nodeID)
            if nameType == "scientific name":
                self.nodes[nodeID].name = name
        


phyloTree = PhyloTree(nodes, names)

In [ ]: