The dBG: Hashing and Traversal

This notebook demonstrates how to use goetia's $k$-mer hashing and low level de Bruijn graph (dBG) traversal functionality.



In [1]:

    
%%time
# import goetia libraries
from goetia import libgoetia
from goetia.dbg import dBG, Graph
from goetia.hashing import CanLemireShifter
from goetia.storage import SparseppSetStorage









    



CPU times: user 8.8 s, sys: 455 ms, total: 9.25 s
Wall time: 8.78 s

Building a dBG

First. some explanation as to these classes. We have, from goetia.dbg:

dBG: This is the wrapped C++ dBG implementation. It is templated with a storage type and a hasher type; the storage type decides the underlying $k$-mer store, while the hashing type decides the hashing policy. Template class wrapped by cppyy are subscriptable, returning a complete class.
Graph: A thin wrapper function around dBG. Called with the storage and hashing types, it returns a funtion that takes $K$, a tuple of storage arguments, and a tuple of hashing arguments. When calling the dBG constructor directly, one must pass in an already-constructed storage and hasher. Note that the latter method has uses as well, as it allows multiple dBG's to share the same underling storage.
CanLemireShifter: This is a hashing method. It wraps Lemire's rolling hash implementation and returns canonical hashes. The return type is Canonical<uint64_t>, which has the attributes fw_hash and rc_hash for each strand (relative to the given sequence), along with the property value which returns the canonical hash.
SparseppSetStorage: A $k$-mer store for the dBG using the lightweight sparsepp hash table. sparsepp is significantly faster than std::map or std::set. Given that it's a regular hash table, this is an exact storage backend.

We construct a dBG as follows:



In [2]:

    
K = 31
hasher = CanLemireShifter(K)
storage = SparseppSetStorage.build()
graph = dBG[SparseppSetStorage, CanLemireShifter](storage, hasher)

Note that we used the .build() method to get the storage. This is a C++ method that returns a std::shared_ptr owning the storage. The hasher is just going to get copied by the dBG in its constructor; it's only used as a prototype. The dBG uses the $K$ of the hasher.

Because graph is a product of the cppyy wrapper, it has some special attributes. In particular, we can get its underlying C++ type:



In [3]:

    
type(graph)









    Out[3]:





<class goetia.goetia.dBG<goetia::storage::SparseppSetStorage,goetia::hashing::HashShifter<goetia::hashing::LemireShifterPolicy<goetia::hashing::Canonical<unsigned long>,goetia::DNA_SIMPLE> > > at 0x55d249d22110>

That's rather ugly, which is why working with these things from Python is nice!

So, we now have a dBG with $K=31$. What can we do with it?

Basics

The most fundamental operations are probably adding new $k$-mers and asking about existing ones. For that, we have the insert and query methods.



In [4]:

    
kmer = 'A' * K

# returns True when the k-mer is new
graph.insert(kmer)









    Out[4]:





True



In [5]:

    
# Canonical hashing means the reverse complement k-mer reports as seen
graph.insert('T' * K)









    Out[5]:





False



In [6]:

    
# We query similarly
graph.query(kmer)









    Out[6]:





1

There is also a method called insert_and_query, which does as it sounds: first inserts, then returns the post-insertion count. SparseppSetStorage is a presence-only backend (a set), so the reulst here will always be 1; for backends that can count, this is more useful.



In [7]:

    
graph.insert_and_query('A' * (K-1) + 'T')









    Out[7]:





1

We can interrogate the dBG for global information, such as the number of unique $k$-mers:



In [8]:

    
graph.n_unique()









    Out[8]:





2

dBG's with probabilistic storage backends support a false positive reporting function. Ours, however, is exact, and does not:



In [9]:

    
graph.estimated_fp()









    



---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-853026859349> in <module>
----> 1 graph.estimated_fp()

TypeError: Template method resolution failed:
  Failed to instantiate "estimated_fp()"

dBG, eventually, derives from whichever hasher we gave it, and so supports its methods. This means we can, for example, hash an individual $k$-mer on it:



In [10]:

    
graph.hash(kmer)









    Out[10]:





<Canonical fw=1376280362712374913 rc=16264769942947757911 sign=1>

Note that only the canonical hash is stored in the underlying storage backend. We can access it through the value property:



In [11]:

    
h = graph.hash(kmer)
h.value









    Out[11]:





1376280362712374913

The hasher (which is an instance of HashShifter<...>) has a variety of typedefs associated with it, which are also exposed on the dBG and related classes. For example:



In [12]:

    
# The basic hash type used 
graph.hash_type









    Out[12]:





<class goetia.goetia.hashing.Canonical<unsigned long> at 0x55d24e799d30>



In [13]:

    
# Combines a k-mer string and its hash
graph.kmer_type
graph.kmer_type(graph.hash(kmer), kmer)









    Out[13]:





<Kmer hash=<Canonical fw=1376280362712374913 rc=16264769942947757911 sign=1> kmer=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA>



In [14]:

    
# or the shifter and storage types
graph.shifter_type









    Out[14]:





<class goetia.goetia.hashing.HashShifter<goetia::hashing::LemireShifterPolicy<goetia::hashing::Canonical<unsigned long>,goetia::DNA_SIMPLE> > at 0x55d2495a0e50>

Additionally, there's the alphabet. CanLemireShifter is just a typedef that omits a template parameter that decides the alphabet of the dBG. It's statically accessible on the dBG (and the HashShifter) through the alphabet attribute, and has several useful statically defined methods.



In [15]:

    
print(graph.alphabet, graph.alphabet.SYMBOLS, graph.alphabet.COMPLEMENTS, sep=', ')









    



<class goetia.goetia.DNA_SIMPLE at 0x55d24eb20710>, ACGT, TGCA



In [16]:

    
graph.alphabet.reverse_complement(kmer)









    Out[16]:





'TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT'



In [17]:

    
graph.alphabet.complement('G')









    Out[17]:





'C'

The HashShifter

Though not directly related to performing dBG operations, it's useful to know a bit about how goetia structures its hashing, and by extension, its neighbor-finding and traversal, under the hood.

The most basic hasher is the HashShifter<T>, where T is a ShiftPolicy<HashType, Alphabet>. ShiftPolicy wraps the underlying hash functions to conform to a standard interface; HashType decides what sort of hash values are returned (canonical, regular, different value_types (maybe you want uint32_t for some reason?), etc.). HashShifter<T> derives from T: it's a policy-based design model where T provides an optimized implementation, and it avoids anything beinv virtual. The most important methods exposed by the HashShifter are shift_right(char out, char in) and shift_left(char in, char out), which update the current hash value with new symbols. Clearly, this implies they are stateful.

Because dBG is a HashShifter, we can use the existing graph to demostrate its methods.



In [18]:

    
# kmer is 'A' * K
graph.hash_base(kmer)









    Out[18]:





<Canonical fw=1376280362712374913 rc=16264769942947757911 sign=1>



In [19]:

    
# get rid of an A on the left, add a T on the right
graph.shift_right('A', 'T')









    Out[19]:





<Canonical fw=18429189216377515994 rc=11251502874266433223 sign=0>



In [20]:

    
graph.hash('A' * (K - 1) + 'T') == graph.get_hash()









    Out[20]:





True

Some notes: get_hash(...) gets the current hash of the shifter; hash(...) is non-volatile (does not affect the underlying state); and hash_base(...) primes the shifter with its initial value. Calling shift_{left,right} on an unititialized shifter will raise an exception.

The HashExtender

Next up the chain is the HashExtender. The HashExtender once again uses a policy to offer optimized implementions, in this case, for getting potential neighboring hash values in a dBG based on its alphabet: ie, extending. The HashExtender keeps more state in order to enable repeated extensions after shifting. It uses a circular (or ring) buffer to store the symbols of the current $k$-mer, which we call the cursor. In the end, the HashExtender is itself a HashShifter, though it overrides a few methods to stay consistent. It provides set_cursor(str) and get_cursor(str) for setting the state, and hash_base now makes a call to set_cursor(str) under the hood to make sure the buffer gets loaded.

Most importantly, it provides extension methods. Note that dBG is also a HashExtender, so we can use the existing graph to demo:



In [21]:

    
kmer = 'TCACACCTCTGTGTTGTGCTACTTGCGGCGC'
graph.set_cursor(kmer)









    Out[21]:





<Canonical fw=15476993202998089975 rc=8312430231874395202 sign=0>



In [22]:

    
graph.get_cursor()









    Out[22]:





'TCACACCTCTGTGTTGTGCTACTTGCGGCGC'

Because we keep state, we can just call shift_right(symbol), without providing the symbol to remove:



In [23]:

    
graph.shift_right('A')









    Out[23]:





<Canonical fw=7170494598487408972 rc=16043475115312428656 sign=1>



In [24]:

    
graph.get_cursor()









    Out[24]:





'CACACCTCTGTGTTGTGCTACTTGCGGCGCA'

We can ask for the extensions of the current cursor. If we ask for the left extensions, we should see the earlier hash value amongst them.



In [25]:

    
graph.left_extensions()









    Out[25]:





[<Shift hash=<Canonical fw=12345526596623405415 rc=11494181310317096217 sign=0> symbol=A direction=0>,
 <Shift hash=<Canonical fw=6756202371005322454 rc=1628793635298901189 sign=0> symbol=C direction=0>,
 <Shift hash=<Canonical fw=8758330460264747644 rc=5289963226481314134 sign=0> symbol=G direction=0>,
 <Shift hash=<Canonical fw=15476993202998089975 rc=8312430231874395202 sign=0> symbol=T direction=0>]



In [26]:

    
graph.right_extensions()









    Out[26]:





[<Shift hash=<Canonical fw=2234471427066309240 rc=4231802296200205098 sign=1> symbol=A direction=1>,
 <Shift hash=<Canonical fw=2679775005173563244 rc=10806639864229157281 sign=1> symbol=C direction=1>,
 <Shift hash=<Canonical fw=8846023511076178687 rc=12804271227938578187 sign=1> symbol=G direction=1>,
 <Shift hash=<Canonical fw=17571437905152908067 rc=5127116808213764794 sign=0> symbol=T direction=1>]



In [27]:

    
graph.get_hash()









    Out[27]:





<Canonical fw=7170494598487408972 rc=16043475115312428656 sign=1>

Also take note that the underlying hash value is still the same as it was before we asked for extensions. The extension methods leave the shifter back in its originaly state.

You might notice that this returned something called a Shift. The Shift is just a hash value with an attached symbol and direction. Just like the Canonical, it has a value property that retrieves the canonical hash value.



In [28]:

    
shift = graph.right_extensions()[0]
shift.value, shift.symbol









    Out[28]:





(2234471427066309240, 'A')

The dBGWalker

Just one last abstraction layer! The Final Form of the HashShifter is the dBGWalker. Once again, just as the HashExtender is a HashShifter, the dBGWalker is a HashExtender. The difference is that the dBGWalker can not be instantiated on its own: it can only be instantiated when by a dBG deriving from it! Obviously, if we wanna walk a dBG, we need a dBG to walk.

The dBGWalker provides all the traversal-related methods to the dBG. For example, we can get the neighbors of the node ($k$-mer) pointed to by the current cursor.

NOTE: Due to a temporary regression in cppyy, the return results of several methods that return std::vector cannot be transformed directly into lists, either implicitly or by calling list(ret). So, I'm going to add a little utility function here for it. This should be fixed eventually.



In [29]:

    
def to_list(vec): return [v for v in vec]



In [30]:

    
graph.insert(graph.set_cursor(kmer))
to_list(graph.in_neighbors()), to_list(graph.out_neighbors()), kmer









    Out[30]:





([], [], 'TCACACCTCTGTGTTGTGCTACTTGCGGCGC')

Well that's sad. Let's give this guy some neighbors.



In [31]:

    
# The attribute `hash` on `Shift` provides the hash object, which we can insert directly
for node in graph.right_extensions()[0:2]:
    graph.insert(node.hash)
graph.insert(graph.left_extensions()[0].hash)









    Out[31]:





True

Now our little node shouldn't be so lonely...



In [32]:

    
[(node.value, node.symbol) for node in graph.in_neighbors()], [(node.value, node.symbol) for node in graph.out_neighbors()]









    Out[32]:





([(7847775928338814768, 'A')],
 [(7170494598487408972, 'A'), (6462841522242458712, 'C')])

Much better! =)

The dBGWalker uses lower-level primitives for its neighbor-finding and traversal. One such primitive is filter_nodes, which takes the result of an extension call and returns only the nodes that are actually in the dBG.



In [33]:

    
from cppyy.gbl import std
[(node.value, node.symbol) for node in graph.filter_nodes(std.vector[graph.shift_left_type](graph.left_extensions()))]









    Out[33]:





[(7847775928338814768, 'A')]

Most users will not need to use such low level interfaces. So now, after that long digression, back to our original programming.

Walking the dBG

Obviously, the more interesting case is actually traversing the dBG. The dBGWalker provides a rich interface for taking a stroll around the graph. To demo, let's first build up our graph...



In [34]:

    
# first let's put our cursor on the far left of what we already inserted...
graph.set_cursor(kmer)
graph.shift_left('A')









    Out[34]:





<Canonical fw=7847775928338814768 rc=11167344157054741611 sign=1>



In [35]:

    
# no left neighbors, one right neighbor
[(node.value, node.symbol) for node in graph.in_neighbors()], [(node.value, node.symbol) for node in graph.out_neighbors()]









    Out[35]:





([], [(8312430231874395202, 'C')])



In [36]:

    
seq = 'GGTAGGAGGA'
for i, base in enumerate(seq[::-1]):
    graph.insert(graph.shift_left(base))
    print(((len(seq) - i ) * ' ') + graph.get_cursor())









    



          AATCACACCTCTGTGTTGTGCTACTTGCGGC
         GAATCACACCTCTGTGTTGTGCTACTTGCGG
        GGAATCACACCTCTGTGTTGTGCTACTTGCG
       AGGAATCACACCTCTGTGTTGTGCTACTTGC
      GAGGAATCACACCTCTGTGTTGTGCTACTTG
     GGAGGAATCACACCTCTGTGTTGTGCTACTT
    AGGAGGAATCACACCTCTGTGTTGTGCTACT
   TAGGAGGAATCACACCTCTGTGTTGTGCTAC
  GTAGGAGGAATCACACCTCTGTGTTGTGCTA
 GGTAGGAGGAATCACACCTCTGTGTTGTGCT



In [37]:

    
anchor = graph.get_cursor()
anchor, graph.get_hash()









    Out[37]:





('GGTAGGAGGAATCACACCTCTGTGTTGTGCT',
 <Canonical fw=10388554386607035669 rc=1182107035245116275 sign=0>)

Okay, now that we've added some $k$-mers, let's walk around. We've set the current position, which is on the far left of the path we inserted, as an "anchor," so that we can go back to it.

Steppin'

Now, for some jargon. A sequence of connected nodes can be called a path, a walk, or a traversal. We're just going to call it a walk. We'll call one transition from node-to-node a step. The API conforms to this terminology:



In [38]:

    
end_state, options = graph.step_right()



In [39]:

    
end_state, [n for n in options]









    Out[39]:





(7,
 [<Shift hash=<Canonical fw=12736750171723927455 rc=8224377023565475682 sign=0> symbol=A direction=1>])

step_{left,right} is somewhat self-explanatory: it steps right from the current position. The return values, however, require explanation. end_state is an enum of type goetia::TraversalState::State, which is defined as follows:

namespace TraversalState {
    enum State {
        STOP_FWD,        // 0
        DECISION_FWD,    // 1
        DECISION_BKW,    // 2
        STOP_SEEN,       // 3
        STOP_MASKED,     // 4
        BAD_SEED,        // 5
        GRAPH_ERROR,     // 6
        STEP             // 7
    };
}

The states describe how are traversal/walk/step ends. Unfortunately, the current C++ bindings don't convert these to strings, but it's On The List. For now, we can see that our end state was 7, which corresponds to STEP. This quite simply means that we were able to take a step right, and did, because there was only one neighbor and one possible step. options is the list of where we could have gone -- but in order to STEP, this had to be exactly one! Had there been two or more, our end state would have DECISION_FWD. We'd have stayed where we started.

Here is the complete state description:

STOP_FWD: there are no neighbors in this direction.
DECISION_FWD: there is more than one neighbor.
STOP_SEEN: there is a single neighbor but it has been seen.
DECISION_BKW: There is a single neighbor, but it is a decision in the other direction.
STOP_MASKED: There is a single neighbor, but it is masked.
BAD_SEED: The node you tried to start at does not exist.
GRAPH_ERROR: The graph is structural unsound (basically a panic).
STEP: None of the above: we can move

Note that there are other states where we can have exactly one neighbor but not move: it could have been seen already, or it could have been a backwards decision node. A backwards d-node (DECISION_BKW) means that, were we to step right, our in-degree would be two or greater (and correspondingly for step_left).

If we check now, we'll see that we've moved, and we've got one in- and one out-neighbor.



In [40]:

    
[(node.value, node.symbol) for node in graph.in_neighbors()], [(node.value, node.symbol) for node in graph.out_neighbors()]









    Out[40]:





([(1182107035245116275, 'G')], [(7801055356572192737, 'C')])

If we reset back to the anchor and try again, we'll get a STOP_SEEN:



In [41]:

    
graph.set_cursor(anchor)
end_state, options = graph.step_right()
end_state, to_list(options), graph.get_hash()









    Out[41]:





(3,
 [<Shift hash=<Canonical fw=12736750171723927455 rc=8224377023565475682 sign=0> symbol=A direction=1>],
 <Canonical fw=10388554386607035669 rc=1182107035245116275 sign=0>)

The dBGWalker maintains a set of already-seen nodes to prevent getting stuck in cycles, so we remain at anchor.



In [42]:

    
list(graph.seen)









    Out[42]:





[8224377023565475682]

Not to worry though, we can clear it:



In [43]:

    
graph.clear_seen()
end_state, options = graph.step_right()
end_state, to_list(options), graph.get_hash()









    Out[43]:





(7,
 [<Shift hash=<Canonical fw=12736750171723927455 rc=8224377023565475682 sign=0> symbol=A direction=1>],
 <Canonical fw=12736750171723927455 rc=8224377023565475682 sign=0>)

Walkin'

Moving anywhere would be pretty hard if we had to manually think about stepping every time we did it, and the same with walking a dBG! For that, we have the walk{left,right} methods. These call step_* repeatedly until they reach a seen $k$-mer or create a partial unitig. We can either walk from our current position or given a seed to start at -- for now we'll just start at our current position.



In [44]:

    
graph.set_cursor(anchor)
walk = graph.walk_right()

walk* returns a special Walk object. It has the following attributes:

start: A kmer_type for the position we started at.
path: A list of shifts holding the extensions for the path.
end_state: The traversal state we ended on.
head(): The hash value of start.
tail(): The has value of the last extension (where we ended).
to_string(): The string representation of the walk, resulting from combining the start $k$-mer with the extensions (specialized for direction).
glue(Walk): The string resulting from glueing two walks together, specialized on directions.
retreat_symbol(): The symbol necessary to shift back left (or right) after moving to one of the end-state neighbors.

In this case, we stopped on a forward decision node.



In [45]:

    
walk.start, walk.end_state









    Out[45]:





(<Kmer hash=<Canonical fw=10388554386607035669 rc=1182107035245116275 sign=0> kmer=GGTAGGAGGAATCACACCTCTGTGTTGTGCT>,
 1)



In [46]:

    
graph.get_hash() == walk.tail(), walk.tail()









    Out[46]:





(True, <Canonical fw=15476993202998089975 rc=8312430231874395202 sign=0>)



In [47]:

    
to_list(walk.path)









    Out[47]:





[<Shift hash=<Canonical fw=12736750171723927455 rc=8224377023565475682 sign=0> symbol=A direction=1>,
 <Shift hash=<Canonical fw=14637547382867799967 rc=7801055356572192737 sign=0> symbol=C direction=1>,
 <Shift hash=<Canonical fw=13010390999111524039 rc=3221253690059650609 sign=0> symbol=T direction=1>,
 <Shift hash=<Canonical fw=12749353867514261335 rc=18060366745909523060 sign=1> symbol=T direction=1>,
 <Shift hash=<Canonical fw=10703243689465863068 rc=885851925541808064 sign=0> symbol=G direction=1>,
 <Shift hash=<Canonical fw=9490228497282697113 rc=6014014112993225648 sign=0> symbol=C direction=1>,
 <Shift hash=<Canonical fw=6306066054953700407 rc=12958152393930660101 sign=1> symbol=G direction=1>,
 <Shift hash=<Canonical fw=6508651095222951261 rc=12173448819754021496 sign=1> symbol=G direction=1>,
 <Shift hash=<Canonical fw=2162203961694194202 rc=84412478834819436 sign=0> symbol=C direction=1>,
 <Shift hash=<Canonical fw=7847775928338814768 rc=11167344157054741611 sign=1> symbol=G direction=1>,
 <Shift hash=<Canonical fw=15476993202998089975 rc=8312430231874395202 sign=0> symbol=C direction=1>]



In [48]:

    
walk.to_string()









    Out[48]:





'GGTAGGAGGAATCACACCTCTGTGTTGTGCTACTTGCGGCGC'

You'll notice that walk.tail() and graph.get_has() correspond to kmer from way up before. We made kmer a decision node when we gave it two friends to its right:



In [49]:

    
[n for n in graph.out_neighbors()]









    Out[49]:





[<Shift hash=<Canonical fw=7170494598487408972 rc=16043475115312428656 sign=1> symbol=A direction=1>,
 <Shift hash=<Canonical fw=6462841522242458712 rc=8207629909326200059 sign=1> symbol=C direction=1>]

You might also recall that anchor had no in-neighbors. That means congratulations are in order -- we've just assembled a unitig!

Traversin'

So now that we've walked a unitig, it's time to traverse to the rest of its neighbors. The Walk object has retreat_symbol that will help usmove to each neighbor and then back to the decision node.



In [50]:

    
prev = walk
for n in graph.out_neighbors():
    graph.shift_right(n.symbol)
    walk = graph.walk_right()
    print('start:', walk.start)
    print('end state:', walk.end_state)
    print('current hash:', graph.get_hash())
    print('walk string:', walk.to_string())
    # return to the decision_node
    graph.set_cursor(walk.start.kmer)
    graph.shift_left(prev.retreat_symbol())









    



start: <Kmer hash=<Canonical fw=7170494598487408972 rc=16043475115312428656 sign=1> kmer=CACACCTCTGTGTTGTGCTACTTGCGGCGCA>
end state: 0
current hash: <Canonical fw=7170494598487408972 rc=16043475115312428656 sign=1>
walk string: CACACCTCTGTGTTGTGCTACTTGCGGCGCA
start: <Kmer hash=<Canonical fw=6462841522242458712 rc=8207629909326200059 sign=1> kmer=CACACCTCTGTGTTGTGCTACTTGCGGCGCC>
end state: 0
current hash: <Canonical fw=6462841522242458712 rc=8207629909326200059 sign=1>
walk string: CACACCTCTGTGTTGTGCTACTTGCGGCGCC

So, our two start positions were the two right neighbors, and each time our end state was STOP_FWD -- ie, a dead end. We then returned to the previous root by setting the cursor to the start of the walk and shifting left with the previous walk's retreat symbol.

Both of these "walks" were single $k$-mers, but enh, they're still walks!

Bidrectional Walkin'

Other than walk_left and walk_right, there's also plain old walk. This walks in both directions and returns the two walks as a tuple. We can demonstrate this by starting in the middle of unitig we assembled earlier.



In [51]:

    
prev.to_string(), len(prev.to_string())









    Out[51]:





('GGTAGGAGGAATCACACCTCTGTGTTGTGCTACTTGCGGCGC', 42)



In [52]:

    
seed = prev.to_string()[5:5+K]
lwalk, rwalk = graph.walk(seed)



In [53]:

    
print(lwalk.start, lwalk.end_state)
print(lwalk.to_string())
print(rwalk.start, rwalk.end_state)
print(rwalk.to_string())









    



<Kmer hash=<Canonical fw=10703243689465863068 rc=885851925541808064 sign=0> kmer=GAGGAATCACACCTCTGTGTTGTGCTACTTG> 0
GGTAGGAGGAATCACACCTCTGTGTTGTGCTACTTG
<Kmer hash=<Canonical fw=10703243689465863068 rc=885851925541808064 sign=0> kmer=GAGGAATCACACCTCTGTGTTGTGCTACTTG> 1
GAGGAATCACACCTCTGTGTTGTGCTACTTGCGGCGC

All fits together: lwalk has an end state of STOP_FWD, as it hit a dead-end in its direction of movement, and rwalk has an end state of DECISION_FWD, as it hit the decision node in its direction of movement. Each has the same start $k$-mer, and each has a to_string() result with that overlapping $k$-mer. If we want to combine them, we can use the glue method, which should give us back our original unitig.



In [54]:

    
assert lwalk.glue(rwalk) == prev.to_string()

Note that, because lwalk and rwalk here are actually different types based on a template parameter for their direciton, their glue methods are specialized, and we can glue in the other order and still get the correct result:



In [55]:

    
assert rwalk.glue(lwalk) == prev.to_string()