In [1]:

    
%%HTML
<style>
.container { width:100% } 
</style>

Tries

A Trie is a data structure that implements the abstract data type of a dictionary in the case that the keys are strings. The set $\mathbb{T}$ of all tries is defined inductively using the constructor $$ \texttt{Node}: \texttt{Value} \times \texttt{List}(\Sigma) \times\texttt{List}(\mathbb{T}) \rightarrow \mathbb{T}. $$ Here $\Sigma$ is the alphabet, i.e. the set of characters. The inductive definition of the set $\mathbb{T}$ has only a single clause: If

$v \in \texttt{Value} \cup \{\Omega\}$,
$C\!s = [c_1, \cdots, c_n] \in \texttt{List}(\Sigma)$ is a list of different characters of length $n$ and
$T\!s = [t_1, \cdots, t_n] \in \texttt{List}(\mathbb{T})$ is a list of tries of the same length $n$,

then we have $$ \texttt{Node}(v, C\!s, T\!s) \in \mathbb{T}. $$

An object $t$ of the class Trie represents the trie $$ \texttt{Node}(t.\texttt{mValue}, t.\texttt{mChars}, t.\texttt{mTries}). $$ The constructor below creates an empty trie.



In [2]:

    
class Trie:
    def __init__(self):
        self.mValue  = None
        self.mChars  = []
        self.mTries  = []
        self.mParent = '' # only used by graphviz

In order to specify the meaning of the class Trie we define the function find, which can be used to retrieve the values stored in a trie. This function has the signature $$ \texttt{find}: \mathbb{T} \times \Sigma^* \rightarrow \texttt{Value} \cup \{ \Omega\}. $$ For a trie $t$ and a string $s$, the expression $t.\texttt{find}(s)$ returns the value that is associated with the key $s$ in the trie $t$. The expression $\texttt{Node}(v,C\!s,T\!s).\texttt{find}(s)$ is defined by induction on the length of the string $s$:

$\texttt{Node}(v, C\!s, T\!s).\texttt{find}(\varepsilon) = v$.

The value associated with the empty string $\varepsilon$ is stored at the root of the trie.
$c = c_i \rightarrow \texttt{Node}(v, [c_1, \cdots, c_n], [t_1, \cdots, t_n]).\texttt{find}(cr) = t_i.\texttt{find}(r) $

The trie $\texttt{Node}(v, [c_1, \cdots, c_n], [t_1, \cdots, t_n])$ associates a value with the key $cr$ if the list $[c_1, \cdots, c_n]$ has a position $i$ such that $c$ equals $c_i$ and, furthermore, the trie $t_i$ associates a value with the key $r$.
$ c \not\in C\!s \rightarrow \texttt{Node}(v, C\!s, T\!s).\texttt{find}(cr) = \Omega $

If $c$ does not occur in the list $C\!s$, then the trie $\texttt{Node}(v, C\!s, T\!s)$ does not store a value for the key $cr$.



In [4]:

    
def find(self, s):
    if s == '':
        return self.mValue
    c, r = s[0], s[1:]
    for i, ci in enumerate(self.mChars):
        if c == ci:
            return self.mTries[i].find(r)
    return None # not necessary

Trie.find = find
del find



In [3]:

    
list(enumerate(['a', 'b', 'c']))









    Out[3]:





[(0, 'a'), (1, 'b'), (2, 'c')]

The signature of the method $\texttt{insert}$ is given as follows: $$ \texttt{insert}: \mathbb{T} \times \Sigma^* \times \texttt{Value} \rightarrow \mathbb{T}. $$ The result of evaluating $$ \texttt{Node}(v_1, [c_1, \cdots, c_n], [t_1, \cdots, t_n]).\texttt{insert}(w, v_2) $$ for a string $w\in \Sigma^*$ and a value $v_2 \in \texttt{Value}$ is defined by induction on the length of $w$:

$\texttt{Node}(v_1,L,T).\texttt{insert}(\varepsilon, v_2) = \texttt{Node}(v_2,L,T)$.

If a new value $v_2$ is associated with the empty string $\varepsilon$, then the old value $v_1$, which had been stored at the root before, is overwritten.
$\texttt{Node}\bigl(v_1,[c_1,\cdots,c_i,\cdots,c_n], [t_1,\cdots,t_i,\cdots,t_n]\bigr).\texttt{insert}(c_ir,v_2) = \texttt{Node}\bigl(v_1,[c_1,\cdots,c_i,\cdots,c_n], [t_1,\cdots,t_i.\texttt{insert}(r,v_2),\cdots,t_n]\bigr)$.

In order to associate a value $v_2$ with the string $c_ir$ in the trie $$ \texttt{Node}\bigl(v_1,[c_1,\cdots,c_i,\cdots,c_n], [t_1,\cdots,t_i,\cdots,t_n]\bigr) $$ we have to recursively associate the value $v_2$ with the string $r$ in the trie $t_i$.
$c \not\in\{c_1,\cdots,c_n\} \;\rightarrow\;\texttt{Node}\bigl(v_1,[c_1,\cdots,c_n], [t_1,\cdots,t_n]\bigr).\texttt{insert}(cr,v_2) = \texttt{Node}\bigl(v_1,[c_1,\cdots,c_n,c], [t_1,\cdots,t_n,\texttt{Node}(\Omega,[],[]).\texttt{insert}(r,v_2)]\bigr)$.

If we want to associate a value $v$ with the key $cr$ in the trie $\texttt{Node}\bigl(v_1,[c_1,\cdots,c_n], [t_1,\cdots,t_n]\bigr)$ then, if the character $c$ does not occur in the list $[c_1,\cdots,c_n]$, we first have to create a new empty trie. This trie has the form $$ \texttt{Node}(\Omega, [], []). $$ Next, we associate the value $v_2$ with the key $r$ in this empty trie. Finally, we append the character $c$ to the end of the list $[c_1,\cdots,c_n]$ and append the trie $$ \texttt{Node}(\Omega, [], []).\texttt{insert}(r,v_2) $$ to the end of the list $[t_1,\cdots,t_n]$.



In [5]:

    
def insert(self, s, v):
    if s == '':
        self.mValue = v
        return
    c, r = s[0], s[1:]
    for i, ci in enumerate(self.mChars):
        if c == ci:
            self.mTries[i].insert(r, v)
            return
    t = Trie()
    t.insert(r, v)
    t.mParent = c # necessary for visualization
    self.mChars.append(c)
    self.mTries.append(t)
    
Trie.insert = insert
del insert

In order to implement deletion in tries, we need the auxiliary function isEmpty. Its signature is given as $$ \texttt{isEmpty}: \mathbb{T} \rightarrow \mathbb{B}. $$ For a trie $t$, we have $t.\texttt{isEmpty}() = \texttt{True}$ if and only if the trie $t$ does not store any key. The following formula specifies the function $\texttt{isEmpty}$: $$ \texttt{Node}(v, L, T).\texttt{isEmpty}() = \mathtt{True} \Leftrightarrow v = \Omega \wedge L = []. $$



In [6]:

    
def isEmpty(self):
    return self.mValue == None and self.mChars == []

Trie.isEmpty = isEmpty
del isEmpty

For a trie $t \in \mathbb{T}$ and a string $w \in \Sigma^*$ the value $t.\texttt{delete}(w)$ is defined by induction on the length of $w$.

$\texttt{Node}(v,C,T).\texttt{delete}(\varepsilon) = \texttt{Node}(\Omega,C,T)$

The value that is associated with the empty string $\varepsilon$ is stored at the root of the trie where is can be deleted without further ado.
$\begin{array}[t]{ll} t_i.\texttt{delete}(r).\texttt{isEmpty}() & \rightarrow \\ \texttt{Node}(v, [c_1,\cdots,c_i,\cdots,c_n],[t_1,\cdots,t_i,\cdots,t_n]).\texttt{delete}(c_ir) & = \\ \qquad \texttt{Node}(v, [c_1,\cdots,c_{i-1},c_{i+1},\cdots,c_n],[t_1,\cdots,t_{i-1},t_{i+1},\cdots,t_n]). \end{array} $

If the key that is to be deleted starts with the character $c_i$ and if deletion of the key $r$ in the $i$th trie $t_i$ yields an empty trie, then both the $i$th character $c_i$ and the $i$th trie $t_i$ are deleted.
$\begin{array}[t]{ll} \neg t_i.\texttt{delete}(r).\texttt{isEmpty}() & \rightarrow \\ \texttt{Node}(v, [c_1,\cdots,c_i,\cdots,c_n],[t_1,\cdots,t_i,\cdots,t_n]).\texttt{delete}(c_ir) & = \\ \qquad \texttt{Node}(v, [c_1,\cdots,c_i,\cdots,c_n],[t_1,\cdots,t_i.\texttt{delete}(r),\cdots,t_n]). \end{array} $

If the key that is to be deleted starts with the character $c_i$ and if deletion of the key $r$ in the $i$th trie $t_i$ yields a non-empty trie, then the key $r$ has to be deleted recursively in the trie $t_i$.
$c \notin C \rightarrow \texttt{Node}(v, C, T).\texttt{delete}(cr) = \texttt{Node}(v, C, T)$.

If the key that is to be deleted starts with the character $c$ and $c$ does not occur in the list of characters $C$, then the trie does not contain the key $cr$ and therefore there is nothing to do: The trie is left unchanged.



In [7]:

    
def delete(self, s):
    if s == '':
        self.mValue = None
        return
    c, r = s[0], s[1:]
    for i, ci in enumerate(self.mChars):
        if c == ci:
            self.mTries[i].delete(r)
            if self.mTries[i].isEmpty():
                self.mChars.pop(i)
                self.mTries.pop(i)
            return
        
Trie.delete = delete
del delete

String Completion

In order to implement string completion we define the function allKeys, which has the following signature: $$\texttt{allKeys}: \mathbb{T} \times \Sigma^* \rightarrow \texttt{Set}(\Sigma^*)$$ Given a trie $T$ and a string $p$, the function $T.\texttt{allKeys}(p)$ computes the set of all strings that are stored as keys in the trie $T$. Furthermore, the string $p$ is added as a prefix to all these string. Therefore we specify the semantics of the function $\texttt{allKeys}$ as follows: $$ T.\texttt{allKeys}(p) = \{ p+w \mid w \in T \}. $$ Here, $p+w$ denotes the concatenation of the strings $p$ and $w$ and the expression $w \in T$ is true iff $T.\texttt{find}(w) \not= \Omega$.

Given a trie $T$, the value $T.\texttt{allKeys}(p)$ is computed by induction on $T$. There are two cases:

$\texttt{Node}(\Omega, [c_1, \cdots, c_n], [t_1,\cdots,t_n]).\texttt{allKeys}(p) = \bigcup\limits_{i=1}^n t_i.\texttt{allKeys}(p+c_i) $,
$v \not= \Omega \rightarrow \texttt{Node}(v, [c_1, \cdots, c_n], [t_1,\cdots,t_n]).\texttt{allKeys}(p) = \{p\} \cup \bigcup\limits_{i=1}^n t_i.\texttt{allKeys}(p+c_i) $.



In [8]:

    
def allKeys(self, prefix=''):
    Result = set()
    if self.mValue != None:
        Result.add(prefix)
    for i, ci in enumerate(self.mChars):
        Result |= self.mTries[i].allKeys(prefix + ci)
    return Result

Trie.allKeys = allKeys
del allKeys

Next, we specify the auxiliary function _findSuffix that has the following signature: $$ \texttt{_findSuffix}: \mathbf{T} \times \Sigma^* \times \Sigma^* \rightarrow \texttt{Set}(\Sigma^*) $$ Given a trie $T$, a string $s$, and a string $p$, it returns the set of all strings that are used as keys in $T$ and, furthermore, have the prefix $s$. Additionally, it replaces the prefix $s$ with the string $p$. Therefore we have $$ T.\texttt{_findSuffix}(s, p) = \bigl\{ p+r \mid s + r \in T \bigr\}. $$ The second argument $p$ is needed in order to make the inductive definition of the function _findSuffix work out. This definition is given next:

$T.\texttt{_findSuffix}(\varepsilon, p) = T.\texttt{allKeys}(p)$,
$\texttt{Node}\bigl(v, [c_1, \cdots, c_n], [t_1,\cdots,t_n]\bigr).\texttt{_findSuffix}(c_ir, p) = t_i.\texttt{_findSuffix}(r, p) $,
$c \not\in C\!s \rightarrow \texttt{Node}(v, C\!s, T\!s).\texttt{_findSuffix}(cr, p) = \{\}$.



In [9]:

    
def _findSuffix(self, s, p):
    if s == '':
        return self.allKeys(p)
    c, r = s[0], s[1:]
    for i, ci in enumerate(self.mChars):
        if c == ci:
            return self.mTries[i]._findSuffix(r, p)
    return set()

Trie._findSuffix = _findSuffix
del _findSuffix

Finally, we specify the function $$\texttt{findSuffix}:\mathbb{T} \times \Sigma^* \rightarrow \texttt{Set}(\Sigma^*)$$ so that given a trie $T$ and a prefix $p$, the expression $T.\texttt{findSuffix}(p)$ finds all strings $s \in T$ that have the prefix $p$, i.e. that can be written in the form $s=p+r$: $$ T.\texttt{findSuffix}(p) = \bigl\{ p+r \mid p+r \in T \bigr\}. $$ Having defined the auxiliary function \texttt{_findSuffix}, the function \texttt{findSuffix} can be implemented as follows: $$ T.\texttt{findSuffix}(s) = T.\texttt{_findSuffix}(s, s). $$



In [10]:

    
def findSuffix(self, s):
    return self._findSuffix(s, s)

Trie.findSuffix = findSuffix
del findSuffix

Graphical Representation via GraphViz



In [11]:

    
import graphviz as gv



In [12]:

    
def toDot(self):
    Trie.sNodeCount = 0 # this is a static variable of the class Trie
    dot = gv.Digraph(node_attr={'shape': 'record', 'style': 'rounded'})
    NodeDict = {}
    self._assignIDs(NodeDict)
    for id, t in NodeDict.items():
        if t.mValue != None:
            dot.node(id, label='{' + t.mParent + '|' + str(t.mValue) + '}')
        else:
            dot.node(id, label=t.mParent)
    for id, t in NodeDict.items():
        for x in t.mTries:
            dot.edge(id, x.mID)
    return dot

Trie.toDot = toDot
del toDot



In [13]:

    
def _assignIDs(self, NodeDict):
    Trie.sNodeCount += 1
    self.mID = str(Trie.sNodeCount)
    NodeDict[self.mID] = self
    for t in self.mTries:
        t._assignIDs(NodeDict) 
    
Trie._assignIDs = _assignIDs
del _assignIDs

Testing



In [14]:

    
t = Trie()
t.toDot()









    Out[14]:



In [15]:

    
t.insert("Hans", 1)
t.toDot()









    Out[15]:



In [16]:

    
t.insert("Hanna", 2)
t.toDot()









    Out[16]:



In [17]:

    
t.insert("Hilde", 3)
t.toDot()









    Out[17]:



In [18]:

    
t.insert("Hugo", 4)
t.toDot()









    Out[18]:



In [19]:

    
t.findSuffix('Hi')









    Out[19]:





{'Hilde'}



In [20]:

    
t.insert("Harro", 5)
t.toDot()









    Out[20]:



In [21]:

    
t.insert("Heike", 6)
t.toDot()









    Out[21]:



In [22]:

    
t.insert("Heiko", 7)
t.toDot()









    Out[22]:



In [24]:

    
t.find('Hilde')









    Out[24]:





3



In [25]:

    
t.insert("Harald", 8)
t.insert("Hasso", 9)
t.insert("Haley", 1)
t.insert("Hanny", 1)
t.insert("Happy", 1)
t.insert("Harlene", 1)
t.insert("Harley", 1)
t.insert("Harmony", 1)
t.insert("Harriet", 1)
t.insert("Harrietta", 1)
t.insert("Hazel", 1)
t.insert("Heather", 1)
t.insert("Hedwig", 1)
t.insert("Hedy", 1)
t.insert("Heidi", 1)
t.insert("Helaina", 1)
t.insert("Helaine", 1)
t.insert("Helen", 1)
t.insert("Helena", 1)
t.insert("Helene", 1)
t.insert("Helga", 1)
t.insert("Hendrika", 1)
t.insert("Henrietta", 1)
t.insert("Henriette", 1)
t.insert("Hermione", 1)
t.insert("Herta", 1)
t.insert("Hilary", 1)
t.insert("Hilda", 1)
t.toDot()









    Out[25]:



In [26]:

    
t.findSuffix('Hea')









    Out[26]:





{'Heather'}



In [27]:

    
t.findSuffix('Hel')









    Out[27]:





{'Helaina', 'Helaine', 'Helen', 'Helena', 'Helene', 'Helga'}



In [28]:

    
t.delete('Hanny')
t.toDot()









    Out[28]:



In [29]:

    
t.insert('Auto', 7)
t.toDot()









    Out[29]:



In [30]:

    
t.allKeys()









    Out[30]:





{'Auto',
 'Haley',
 'Hanna',
 'Hans',
 'Happy',
 'Harald',
 'Harlene',
 'Harley',
 'Harmony',
 'Harriet',
 'Harrietta',
 'Harro',
 'Hasso',
 'Hazel',
 'Heather',
 'Hedwig',
 'Hedy',
 'Heidi',
 'Heike',
 'Heiko',
 'Helaina',
 'Helaine',
 'Helen',
 'Helena',
 'Helene',
 'Helga',
 'Hendrika',
 'Henrietta',
 'Henriette',
 'Hermione',
 'Herta',
 'Hilary',
 'Hilda',
 'Hilde',
 'Hugo'}



In [32]:

    
t.delete("Hans")
t.delete("Hanna")
t.delete("Hilde")
t.delete("Hugo")
t.delete("Harro")
t.delete("Heike")
t.delete("Heiko")
t.delete("Harald")
t.delete("Hasso")
t.delete("Haley")
t.delete("Hanny")
t.delete("Happy")
t.delete("Harlene")
t.delete("Harley")
t.delete("Harmony")
t.delete("Harriet")
t.delete("Harrietta")
t.delete("Hazel")
t.delete("Heather")
t.delete("Hedwig")
t.delete("Hedy")
t.delete("Heidi")
t.delete("Helaina")
t.delete("Helaine")
t.delete("Helen")
t.delete("Helena")
t.delete("Helene")
t.delete("Helga")
t.delete("Hendrika")
t.delete("Henrietta")
t.delete("Henriette")
t.delete("Hermione")
t.delete("Herta")
t.delete("Hilary")
t.delete("Hilda")
t.toDot()









    Out[32]:

Let us compute the prime numbers less than $100$. Note that we have to convert numbers to strings in order to be able to use tries.



In [33]:

    
S = Trie()
for i in range(2, 101):
    S.insert(str(i), i)
for i in range(2, 51):
    for j in range(i, 100 // i + 1):
        S.delete(str(i * j))
S.toDot()









    Out[33]:

Let us compute all key stored in this trie, as these are the prime numbers.



In [34]:

    
print(sorted([int(p) for p in S.allKeys()]))









    



[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]



In [ ]: