In [1]:
%%time
# import goetia libraries
from goetia import libgoetia
from goetia.dbg import dBG, Graph
from goetia.hashing import CanLemireShifter
from goetia.storage import SparseppSetStorage
First. some explanation as to these classes. We have, from goetia.dbg:
dBG: This is the wrapped C++ dBG implementation. It is templated with a storage type and a hasher type; the storage type decides the underlying $k$-mer store, while the hashing type decides the hashing policy. Template class wrapped by cppyy are subscriptable, returning a complete class.Graph: A thin wrapper function around dBG. Called with the storage and hashing types, it returns a funtion that takes $K$, a tuple of storage arguments, and a tuple of hashing arguments. When calling the dBG constructor directly, one must pass in an already-constructed storage and hasher. Note that the latter method has uses as well, as it allows multiple dBG's to share the same underling storage.CanLemireShifter: This is a hashing method. It wraps Lemire's rolling hash implementation and returns canonical hashes. The return type is Canonical<uint64_t>, which has the attributes fw_hash and rc_hash for each strand (relative to the given sequence), along with the property value which returns the canonical hash.SparseppSetStorage: A $k$-mer store for the dBG using the lightweight sparsepp hash table. sparsepp is significantly faster than std::map or std::set. Given that it's a regular hash table, this is an exact storage backend.We construct a dBG as follows:
In [2]:
K = 31
hasher = CanLemireShifter(K)
storage = SparseppSetStorage.build()
graph = dBG[SparseppSetStorage, CanLemireShifter](storage, hasher)
Note that we used the .build() method to get the storage. This is a C++ method that returns a std::shared_ptr owning the storage. The hasher is just going to get copied by the dBG in its constructor; it's only used as a prototype. The dBG uses the $K$ of the hasher.
Because graph is a product of the cppyy wrapper, it has some special attributes. In particular, we can get its underlying C++ type:
In [3]:
type(graph)
Out[3]:
That's rather ugly, which is why working with these things from Python is nice!
So, we now have a dBG with $K=31$. What can we do with it?
In [4]:
kmer = 'A' * K
# returns True when the k-mer is new
graph.insert(kmer)
Out[4]:
In [5]:
# Canonical hashing means the reverse complement k-mer reports as seen
graph.insert('T' * K)
Out[5]:
In [6]:
# We query similarly
graph.query(kmer)
Out[6]:
There is also a method called insert_and_query, which does as it sounds: first inserts, then returns the post-insertion count. SparseppSetStorage is a presence-only backend (a set), so the reulst here will always be 1; for backends that can count, this is more useful.
In [7]:
graph.insert_and_query('A' * (K-1) + 'T')
Out[7]:
We can interrogate the dBG for global information, such as the number of unique $k$-mers:
In [8]:
graph.n_unique()
Out[8]:
dBG's with probabilistic storage backends support a false positive reporting function. Ours, however, is exact, and does not:
In [9]:
graph.estimated_fp()
dBG, eventually, derives from whichever hasher we gave it, and so supports its methods. This means we can, for example, hash an individual $k$-mer on it:
In [10]:
graph.hash(kmer)
Out[10]:
Note that only the canonical hash is stored in the underlying storage backend. We can access it through the value property:
In [11]:
h = graph.hash(kmer)
h.value
Out[11]:
The hasher (which is an instance of HashShifter<...>) has a variety of typedefs associated with it, which are also exposed on the dBG and related classes. For example:
In [12]:
# The basic hash type used
graph.hash_type
Out[12]:
In [13]:
# Combines a k-mer string and its hash
graph.kmer_type
graph.kmer_type(graph.hash(kmer), kmer)
Out[13]:
In [14]:
# or the shifter and storage types
graph.shifter_type
Out[14]:
Additionally, there's the alphabet. CanLemireShifter is just a typedef that omits a template parameter that decides the alphabet of the dBG. It's statically accessible on the dBG (and the HashShifter) through the alphabet attribute, and has several useful statically defined methods.
In [15]:
print(graph.alphabet, graph.alphabet.SYMBOLS, graph.alphabet.COMPLEMENTS, sep=', ')
In [16]:
graph.alphabet.reverse_complement(kmer)
Out[16]:
In [17]:
graph.alphabet.complement('G')
Out[17]:
Though not directly related to performing dBG operations, it's useful to know a bit about how goetia structures its hashing, and by extension, its neighbor-finding and traversal, under the hood.
The most basic hasher is the HashShifter<T>, where T is a ShiftPolicy<HashType, Alphabet>. ShiftPolicy wraps the underlying hash functions to conform to a standard interface; HashType decides what sort of hash values are returned (canonical, regular, different value_types (maybe you want uint32_t for some reason?), etc.). HashShifter<T> derives from T: it's a policy-based design model where T provides an optimized implementation, and it avoids anything beinv virtual. The most important methods exposed by the HashShifter are shift_right(char out, char in) and shift_left(char in, char out), which update the current hash value with new symbols. Clearly, this implies they are stateful.
Because dBG is a HashShifter, we can use the existing graph to demostrate its methods.
In [18]:
# kmer is 'A' * K
graph.hash_base(kmer)
Out[18]:
In [19]:
# get rid of an A on the left, add a T on the right
graph.shift_right('A', 'T')
Out[19]:
In [20]:
graph.hash('A' * (K - 1) + 'T') == graph.get_hash()
Out[20]:
Some notes: get_hash(...) gets the current hash of the shifter; hash(...) is non-volatile (does not affect the underlying state); and hash_base(...) primes the shifter with its initial value. Calling shift_{left,right} on an unititialized shifter will raise an exception.
Next up the chain is the HashExtender. The HashExtender once again uses a policy to offer optimized implementions, in this case, for getting potential neighboring hash values in a dBG based on its alphabet: ie, extending. The HashExtender keeps more state in order to enable repeated extensions after shifting. It uses a circular (or ring) buffer to store the symbols of the current $k$-mer, which we call the cursor. In the end, the HashExtender is itself a HashShifter, though it overrides a few methods to stay consistent. It provides set_cursor(str) and get_cursor(str) for setting the state, and hash_base now makes a call to set_cursor(str) under the hood to make sure the buffer gets loaded.
Most importantly, it provides extension methods. Note that dBG is also a HashExtender, so we can use the existing graph to demo:
In [21]:
kmer = 'TCACACCTCTGTGTTGTGCTACTTGCGGCGC'
graph.set_cursor(kmer)
Out[21]:
In [22]:
graph.get_cursor()
Out[22]:
Because we keep state, we can just call shift_right(symbol), without providing the symbol to remove:
In [23]:
graph.shift_right('A')
Out[23]:
In [24]:
graph.get_cursor()
Out[24]:
We can ask for the extensions of the current cursor. If we ask for the left extensions, we should see the earlier hash value amongst them.
In [25]:
graph.left_extensions()
Out[25]:
In [26]:
graph.right_extensions()
Out[26]:
In [27]:
graph.get_hash()
Out[27]:
Also take note that the underlying hash value is still the same as it was before we asked for extensions. The extension methods leave the shifter back in its originaly state.
You might notice that this returned something called a Shift. The Shift is just a hash value with an attached symbol and direction. Just like the Canonical, it has a value property that retrieves the canonical hash value.
In [28]:
shift = graph.right_extensions()[0]
shift.value, shift.symbol
Out[28]:
Just one last abstraction layer! The Final Form of the HashShifter is the dBGWalker. Once again, just as the HashExtender is a HashShifter, the dBGWalker is a HashExtender. The difference is that the dBGWalker can not be instantiated on its own: it can only be instantiated when by a dBG deriving from it! Obviously, if we wanna walk a dBG, we need a dBG to walk.
The dBGWalker provides all the traversal-related methods to the dBG. For example, we can get the neighbors of the node ($k$-mer) pointed to by the current cursor.
NOTE: Due to a temporary regression in cppyy, the return results of several methods that return std::vector cannot be transformed directly into lists, either implicitly or by calling list(ret). So, I'm going to add a little utility function here for it. This should be fixed eventually.
In [29]:
def to_list(vec): return [v for v in vec]
In [30]:
graph.insert(graph.set_cursor(kmer))
to_list(graph.in_neighbors()), to_list(graph.out_neighbors()), kmer
Out[30]:
Well that's sad. Let's give this guy some neighbors.
In [31]:
# The attribute `hash` on `Shift` provides the hash object, which we can insert directly
for node in graph.right_extensions()[0:2]:
graph.insert(node.hash)
graph.insert(graph.left_extensions()[0].hash)
Out[31]:
Now our little node shouldn't be so lonely...
In [32]:
[(node.value, node.symbol) for node in graph.in_neighbors()], [(node.value, node.symbol) for node in graph.out_neighbors()]
Out[32]:
Much better! =)
The dBGWalker uses lower-level primitives for its neighbor-finding and traversal. One such primitive is filter_nodes, which takes the result of an extension call and returns only the nodes that are actually in the dBG.
In [33]:
from cppyy.gbl import std
[(node.value, node.symbol) for node in graph.filter_nodes(std.vector[graph.shift_left_type](graph.left_extensions()))]
Out[33]:
Most users will not need to use such low level interfaces. So now, after that long digression, back to our original programming.
In [34]:
# first let's put our cursor on the far left of what we already inserted...
graph.set_cursor(kmer)
graph.shift_left('A')
Out[34]:
In [35]:
# no left neighbors, one right neighbor
[(node.value, node.symbol) for node in graph.in_neighbors()], [(node.value, node.symbol) for node in graph.out_neighbors()]
Out[35]:
In [36]:
seq = 'GGTAGGAGGA'
for i, base in enumerate(seq[::-1]):
graph.insert(graph.shift_left(base))
print(((len(seq) - i ) * ' ') + graph.get_cursor())
In [37]:
anchor = graph.get_cursor()
anchor, graph.get_hash()
Out[37]:
Okay, now that we've added some $k$-mers, let's walk around. We've set the current position, which is on the far left of the path we inserted, as an "anchor," so that we can go back to it.
Now, for some jargon. A sequence of connected nodes can be called a path, a walk, or a traversal. We're just going to call it a walk. We'll call one transition from node-to-node a step. The API conforms to this terminology:
In [38]:
end_state, options = graph.step_right()
In [39]:
end_state, [n for n in options]
Out[39]:
step_{left,right} is somewhat self-explanatory: it steps right from the current position. The return values, however, require explanation. end_state is an enum of type goetia::TraversalState::State, which is defined as follows:
namespace TraversalState {
enum State {
STOP_FWD, // 0
DECISION_FWD, // 1
DECISION_BKW, // 2
STOP_SEEN, // 3
STOP_MASKED, // 4
BAD_SEED, // 5
GRAPH_ERROR, // 6
STEP // 7
};
}
The states describe how are traversal/walk/step ends. Unfortunately, the current C++ bindings don't convert these to strings, but it's On The List. For now, we can see that our end state was 7, which corresponds to STEP. This quite simply means that we were able to take a step right, and did, because there was only one neighbor and one possible step. options is the list of where we could have gone -- but in order to STEP, this had to be exactly one! Had there been two or more, our end state would have DECISION_FWD. We'd have stayed where we started.
Here is the complete state description:
STOP_FWD: there are no neighbors in this direction.DECISION_FWD: there is more than one neighbor.STOP_SEEN: there is a single neighbor but it has been seen.DECISION_BKW: There is a single neighbor, but it is a decision in the other direction.STOP_MASKED: There is a single neighbor, but it is masked.BAD_SEED: The node you tried to start at does not exist.GRAPH_ERROR: The graph is structural unsound (basically a panic).STEP: None of the above: we can moveNote that there are other states where we can have exactly one neighbor but not move: it could have been seen already, or it could have been a backwards decision node. A backwards d-node (DECISION_BKW) means that, were we to step right, our in-degree would be two or greater (and correspondingly for step_left).
If we check now, we'll see that we've moved, and we've got one in- and one out-neighbor.
In [40]:
[(node.value, node.symbol) for node in graph.in_neighbors()], [(node.value, node.symbol) for node in graph.out_neighbors()]
Out[40]:
If we reset back to the anchor and try again, we'll get a STOP_SEEN:
In [41]:
graph.set_cursor(anchor)
end_state, options = graph.step_right()
end_state, to_list(options), graph.get_hash()
Out[41]:
The dBGWalker maintains a set of already-seen nodes to prevent getting stuck in cycles, so we remain at anchor.
In [42]:
list(graph.seen)
Out[42]:
Not to worry though, we can clear it:
In [43]:
graph.clear_seen()
end_state, options = graph.step_right()
end_state, to_list(options), graph.get_hash()
Out[43]:
Moving anywhere would be pretty hard if we had to manually think about stepping every time we did it, and the same with walking a dBG! For that, we have the walk{left,right} methods. These call step_* repeatedly until they reach a seen $k$-mer or create a partial unitig. We can either walk from our current position or given a seed to start at -- for now we'll just start at our current position.
In [44]:
graph.set_cursor(anchor)
walk = graph.walk_right()
walk* returns a special Walk object. It has the following attributes:
start: A kmer_type for the position we started at.path: A list of shifts holding the extensions for the path.end_state: The traversal state we ended on.head(): The hash value of start.tail(): The has value of the last extension (where we ended).to_string(): The string representation of the walk, resulting from combining the start $k$-mer with the extensions (specialized for direction).glue(Walk): The string resulting from glueing two walks together, specialized on directions.retreat_symbol(): The symbol necessary to shift back left (or right) after moving to one of the end-state neighbors.In this case, we stopped on a forward decision node.
In [45]:
walk.start, walk.end_state
Out[45]:
In [46]:
graph.get_hash() == walk.tail(), walk.tail()
Out[46]:
In [47]:
to_list(walk.path)
Out[47]:
In [48]:
walk.to_string()
Out[48]:
You'll notice that walk.tail() and graph.get_has() correspond to kmer from way up before. We made kmer a decision node when we gave it two friends to its right:
In [49]:
[n for n in graph.out_neighbors()]
Out[49]:
You might also recall that anchor had no in-neighbors. That means congratulations are in order -- we've just assembled a unitig!
So now that we've walked a unitig, it's time to traverse to the rest of its neighbors. The Walk object has retreat_symbol that will help usmove to each neighbor and then back to the decision node.
In [50]:
prev = walk
for n in graph.out_neighbors():
graph.shift_right(n.symbol)
walk = graph.walk_right()
print('start:', walk.start)
print('end state:', walk.end_state)
print('current hash:', graph.get_hash())
print('walk string:', walk.to_string())
# return to the decision_node
graph.set_cursor(walk.start.kmer)
graph.shift_left(prev.retreat_symbol())
So, our two start positions were the two right neighbors, and each time our end state was STOP_FWD -- ie, a dead end. We then returned to the previous root by setting the cursor to the start of the walk and shifting left with the previous walk's retreat symbol.
Both of these "walks" were single $k$-mers, but enh, they're still walks!
In [51]:
prev.to_string(), len(prev.to_string())
Out[51]:
In [52]:
seed = prev.to_string()[5:5+K]
lwalk, rwalk = graph.walk(seed)
In [53]:
print(lwalk.start, lwalk.end_state)
print(lwalk.to_string())
print(rwalk.start, rwalk.end_state)
print(rwalk.to_string())
All fits together: lwalk has an end state of STOP_FWD, as it hit a dead-end in its direction of movement, and rwalk has an end state of DECISION_FWD, as it hit the decision node in its direction of movement. Each has the same start $k$-mer, and each has a to_string() result with that overlapping $k$-mer. If we want to combine them, we can use the glue method, which should give us back our original unitig.
In [54]:
assert lwalk.glue(rwalk) == prev.to_string()
Note that, because lwalk and rwalk here are actually different types based on a template parameter for their direciton, their glue methods are specialized, and we can glue in the other order and still get the correct result:
In [55]:
assert rwalk.glue(lwalk) == prev.to_string()