In [1]:
%%HTML
<style>
.container { width:100% }
</style>
A Trie is a data structure that implements the abstract data type of a dictionary in the case that the keys are strings. The set $\mathbb{T}$ of all tries is defined inductively using the constructor $$ \texttt{Node}: \texttt{Value} \times \texttt{List}(\Sigma) \times\texttt{List}(\mathbb{T}) \rightarrow \mathbb{T}. $$ Here $\Sigma$ is the alphabet, i.e. the set of characters. The inductive definition of the set $\mathbb{T}$ has only a single clause: If
then we have $$ \texttt{Node}(v, C\!s, T\!s) \in \mathbb{T}. $$
An object $t$ of the class Trie
represents the trie
$$ \texttt{Node}(t.\texttt{mValue}, t.\texttt{mChars}, t.\texttt{mTries}). $$
The constructor below creates an empty trie.
In [2]:
class Trie:
def __init__(self):
self.mValue = None
self.mChars = []
self.mTries = []
self.mParent = '' # only used by graphviz
In order to specify the meaning of the class Trie
we define the function find
,
which can be used to retrieve the values stored in a trie.
This function has the signature
$$ \texttt{find}: \mathbb{T} \times \Sigma^* \rightarrow \texttt{Value} \cup \{ \Omega\}. $$
For a trie $t$ and a string $s$, the expression $t.\texttt{find}(s)$ returns the
value that is associated with the key $s$ in the trie $t$. The expression
$\texttt{Node}(v,C\!s,T\!s).\texttt{find}(s)$ is defined by induction on the length of the string $s$:
$\texttt{Node}(v, C\!s, T\!s).\texttt{find}(\varepsilon) = v$.
The value associated with the empty string $\varepsilon$ is stored at the root of the trie.
$c = c_i \rightarrow \texttt{Node}(v, [c_1, \cdots, c_n], [t_1, \cdots, t_n]).\texttt{find}(cr) = t_i.\texttt{find}(r) $
The trie $\texttt{Node}(v, [c_1, \cdots, c_n], [t_1, \cdots, t_n])$ associates a value with the key $cr$ if the list $[c_1, \cdots, c_n]$ has a position $i$ such that $c$ equals $c_i$ and, furthermore, the trie $t_i$ associates a value with the key $r$.
$ c \not\in C\!s \rightarrow \texttt{Node}(v, C\!s, T\!s).\texttt{find}(cr) = \Omega $
If $c$ does not occur in the list $C\!s$, then the trie $\texttt{Node}(v, C\!s, T\!s)$ does not store a value for the key $cr$.
In [4]:
def find(self, s):
if s == '':
return self.mValue
c, r = s[0], s[1:]
for i, ci in enumerate(self.mChars):
if c == ci:
return self.mTries[i].find(r)
return None # not necessary
Trie.find = find
del find
In [3]:
list(enumerate(['a', 'b', 'c']))
Out[3]:
The signature of the method $\texttt{insert}$ is given as follows: $$ \texttt{insert}: \mathbb{T} \times \Sigma^* \times \texttt{Value} \rightarrow \mathbb{T}. $$ The result of evaluating $$ \texttt{Node}(v_1, [c_1, \cdots, c_n], [t_1, \cdots, t_n]).\texttt{insert}(w, v_2) $$ for a string $w\in \Sigma^*$ and a value $v_2 \in \texttt{Value}$ is defined by induction on the length of $w$:
$\texttt{Node}(v_1,L,T).\texttt{insert}(\varepsilon, v_2) = \texttt{Node}(v_2,L,T)$.
If a new value $v_2$ is associated with the empty string $\varepsilon$, then the old value $v_1$, which had been stored at the root before, is overwritten.
$\texttt{Node}\bigl(v_1,[c_1,\cdots,c_i,\cdots,c_n], [t_1,\cdots,t_i,\cdots,t_n]\bigr).\texttt{insert}(c_ir,v_2) = \texttt{Node}\bigl(v_1,[c_1,\cdots,c_i,\cdots,c_n], [t_1,\cdots,t_i.\texttt{insert}(r,v_2),\cdots,t_n]\bigr)$.
In order to associate a value $v_2$ with the string $c_ir$ in the trie $$ \texttt{Node}\bigl(v_1,[c_1,\cdots,c_i,\cdots,c_n], [t_1,\cdots,t_i,\cdots,t_n]\bigr) $$ we have to recursively associate the value $v_2$ with the string $r$ in the trie $t_i$.
$c \not\in\{c_1,\cdots,c_n\} \;\rightarrow\;\texttt{Node}\bigl(v_1,[c_1,\cdots,c_n], [t_1,\cdots,t_n]\bigr).\texttt{insert}(cr,v_2) = \texttt{Node}\bigl(v_1,[c_1,\cdots,c_n,c], [t_1,\cdots,t_n,\texttt{Node}(\Omega,[],[]).\texttt{insert}(r,v_2)]\bigr)$.
If we want to associate a value $v$ with the key $cr$ in the trie $\texttt{Node}\bigl(v_1,[c_1,\cdots,c_n], [t_1,\cdots,t_n]\bigr)$ then, if the character $c$ does not occur in the list $[c_1,\cdots,c_n]$, we first have to create a new empty trie. This trie has the form $$ \texttt{Node}(\Omega, [], []). $$ Next, we associate the value $v_2$ with the key $r$ in this empty trie. Finally, we append the character $c$ to the end of the list $[c_1,\cdots,c_n]$ and append the trie $$ \texttt{Node}(\Omega, [], []).\texttt{insert}(r,v_2) $$ to the end of the list $[t_1,\cdots,t_n]$.
In [5]:
def insert(self, s, v):
if s == '':
self.mValue = v
return
c, r = s[0], s[1:]
for i, ci in enumerate(self.mChars):
if c == ci:
self.mTries[i].insert(r, v)
return
t = Trie()
t.insert(r, v)
t.mParent = c # necessary for visualization
self.mChars.append(c)
self.mTries.append(t)
Trie.insert = insert
del insert
In order to implement deletion in tries, we need the auxiliary function isEmpty
. Its signature is given as
$$ \texttt{isEmpty}: \mathbb{T} \rightarrow \mathbb{B}. $$
For a trie $t$, we have $t.\texttt{isEmpty}() = \texttt{True}$ if and only if the trie $t$ does not
store any key. The following formula specifies the function $\texttt{isEmpty}$:
$$ \texttt{Node}(v, L, T).\texttt{isEmpty}() = \mathtt{True} \Leftrightarrow v = \Omega \wedge L = []. $$
In [6]:
def isEmpty(self):
return self.mValue == None and self.mChars == []
Trie.isEmpty = isEmpty
del isEmpty
For a trie $t \in \mathbb{T}$ and a string $w \in \Sigma^*$ the value $t.\texttt{delete}(w)$ is defined by induction on the length of $w$.
$\texttt{Node}(v,C,T).\texttt{delete}(\varepsilon) = \texttt{Node}(\Omega,C,T)$
The value that is associated with the empty string $\varepsilon$ is stored at the root of the trie where is can be deleted without further ado.
$\begin{array}[t]{ll} t_i.\texttt{delete}(r).\texttt{isEmpty}() & \rightarrow \\ \texttt{Node}(v, [c_1,\cdots,c_i,\cdots,c_n],[t_1,\cdots,t_i,\cdots,t_n]).\texttt{delete}(c_ir) & = \\ \qquad \texttt{Node}(v, [c_1,\cdots,c_{i-1},c_{i+1},\cdots,c_n],[t_1,\cdots,t_{i-1},t_{i+1},\cdots,t_n]). \end{array} $
If the key that is to be deleted starts with the character $c_i$ and if deletion of the key $r$ in the $i$th trie $t_i$ yields an empty trie, then both the $i$th character $c_i$ and the $i$th trie $t_i$ are deleted.
$\begin{array}[t]{ll} \neg t_i.\texttt{delete}(r).\texttt{isEmpty}() & \rightarrow \\ \texttt{Node}(v, [c_1,\cdots,c_i,\cdots,c_n],[t_1,\cdots,t_i,\cdots,t_n]).\texttt{delete}(c_ir) & = \\ \qquad \texttt{Node}(v, [c_1,\cdots,c_i,\cdots,c_n],[t_1,\cdots,t_i.\texttt{delete}(r),\cdots,t_n]). \end{array} $
If the key that is to be deleted starts with the character $c_i$ and if deletion of the key $r$ in the $i$th trie $t_i$ yields a non-empty trie, then the key $r$ has to be deleted recursively in the trie $t_i$.
$c \notin C \rightarrow \texttt{Node}(v, C, T).\texttt{delete}(cr) = \texttt{Node}(v, C, T)$.
If the key that is to be deleted starts with the character $c$ and $c$ does not occur in the list of characters $C$, then the trie does not contain the key $cr$ and therefore there is nothing to do: The trie is left unchanged.
In [7]:
def delete(self, s):
if s == '':
self.mValue = None
return
c, r = s[0], s[1:]
for i, ci in enumerate(self.mChars):
if c == ci:
self.mTries[i].delete(r)
if self.mTries[i].isEmpty():
self.mChars.pop(i)
self.mTries.pop(i)
return
Trie.delete = delete
del delete
In order to implement string completion we define the function allKeys
, which
has the following signature:
$$\texttt{allKeys}: \mathbb{T} \times \Sigma^* \rightarrow \texttt{Set}(\Sigma^*)$$
Given a trie $T$ and a string $p$, the function $T.\texttt{allKeys}(p)$ computes the set of all strings that
are stored as keys in the trie $T$. Furthermore, the string $p$ is added as a prefix to all these string.
Therefore we specify the semantics of the function $\texttt{allKeys}$ as follows:
$$ T.\texttt{allKeys}(p) = \{ p+w \mid w \in T \}. $$
Here, $p+w$ denotes the concatenation of the strings $p$ and $w$ and the expression $w \in T$ is true iff
$T.\texttt{find}(w) \not= \Omega$.
Given a trie $T$, the value $T.\texttt{allKeys}(p)$ is computed by induction on $T$. There are two cases:
In [8]:
def allKeys(self, prefix=''):
Result = set()
if self.mValue != None:
Result.add(prefix)
for i, ci in enumerate(self.mChars):
Result |= self.mTries[i].allKeys(prefix + ci)
return Result
Trie.allKeys = allKeys
del allKeys
Next, we specify the auxiliary function _findSuffix
that has the following signature:
$$ \texttt{_findSuffix}: \mathbf{T} \times \Sigma^* \times \Sigma^* \rightarrow \texttt{Set}(\Sigma^*) $$
Given a trie $T$, a string $s$, and a string $p$, it returns the set of all strings that are used as keys in $T$
and, furthermore, have the prefix $s$. Additionally, it replaces the prefix $s$ with the string $p$. Therefore
we have
$$ T.\texttt{_findSuffix}(s, p) = \bigl\{ p+r \mid s + r \in T \bigr\}. $$
The second argument $p$ is needed in order to make the inductive definition of the function
_findSuffix
work out. This definition is given next:
In [9]:
def _findSuffix(self, s, p):
if s == '':
return self.allKeys(p)
c, r = s[0], s[1:]
for i, ci in enumerate(self.mChars):
if c == ci:
return self.mTries[i]._findSuffix(r, p)
return set()
Trie._findSuffix = _findSuffix
del _findSuffix
Finally, we specify the function $$\texttt{findSuffix}:\mathbb{T} \times \Sigma^* \rightarrow \texttt{Set}(\Sigma^*)$$ so that given a trie $T$ and a prefix $p$, the expression $T.\texttt{findSuffix}(p)$ finds all strings $s \in T$ that have the prefix $p$, i.e. that can be written in the form $s=p+r$: $$ T.\texttt{findSuffix}(p) = \bigl\{ p+r \mid p+r \in T \bigr\}. $$ Having defined the auxiliary function \texttt{_findSuffix}, the function \texttt{findSuffix} can be implemented as follows: $$ T.\texttt{findSuffix}(s) = T.\texttt{_findSuffix}(s, s). $$
In [10]:
def findSuffix(self, s):
return self._findSuffix(s, s)
Trie.findSuffix = findSuffix
del findSuffix
In [11]:
import graphviz as gv
In [12]:
def toDot(self):
Trie.sNodeCount = 0 # this is a static variable of the class Trie
dot = gv.Digraph(node_attr={'shape': 'record', 'style': 'rounded'})
NodeDict = {}
self._assignIDs(NodeDict)
for id, t in NodeDict.items():
if t.mValue != None:
dot.node(id, label='{' + t.mParent + '|' + str(t.mValue) + '}')
else:
dot.node(id, label=t.mParent)
for id, t in NodeDict.items():
for x in t.mTries:
dot.edge(id, x.mID)
return dot
Trie.toDot = toDot
del toDot
In [13]:
def _assignIDs(self, NodeDict):
Trie.sNodeCount += 1
self.mID = str(Trie.sNodeCount)
NodeDict[self.mID] = self
for t in self.mTries:
t._assignIDs(NodeDict)
Trie._assignIDs = _assignIDs
del _assignIDs
In [14]:
t = Trie()
t.toDot()
Out[14]:
In [15]:
t.insert("Hans", 1)
t.toDot()
Out[15]:
In [16]:
t.insert("Hanna", 2)
t.toDot()
Out[16]:
In [17]:
t.insert("Hilde", 3)
t.toDot()
Out[17]:
In [18]:
t.insert("Hugo", 4)
t.toDot()
Out[18]:
In [19]:
t.findSuffix('Hi')
Out[19]:
In [20]:
t.insert("Harro", 5)
t.toDot()
Out[20]:
In [21]:
t.insert("Heike", 6)
t.toDot()
Out[21]:
In [22]:
t.insert("Heiko", 7)
t.toDot()
Out[22]:
In [24]:
t.find('Hilde')
Out[24]:
In [25]:
t.insert("Harald", 8)
t.insert("Hasso", 9)
t.insert("Haley", 1)
t.insert("Hanny", 1)
t.insert("Happy", 1)
t.insert("Harlene", 1)
t.insert("Harley", 1)
t.insert("Harmony", 1)
t.insert("Harriet", 1)
t.insert("Harrietta", 1)
t.insert("Hazel", 1)
t.insert("Heather", 1)
t.insert("Hedwig", 1)
t.insert("Hedy", 1)
t.insert("Heidi", 1)
t.insert("Helaina", 1)
t.insert("Helaine", 1)
t.insert("Helen", 1)
t.insert("Helena", 1)
t.insert("Helene", 1)
t.insert("Helga", 1)
t.insert("Hendrika", 1)
t.insert("Henrietta", 1)
t.insert("Henriette", 1)
t.insert("Hermione", 1)
t.insert("Herta", 1)
t.insert("Hilary", 1)
t.insert("Hilda", 1)
t.toDot()
Out[25]:
In [26]:
t.findSuffix('Hea')
Out[26]:
In [27]:
t.findSuffix('Hel')
Out[27]:
In [28]:
t.delete('Hanny')
t.toDot()
Out[28]:
In [29]:
t.insert('Auto', 7)
t.toDot()
Out[29]:
In [30]:
t.allKeys()
Out[30]:
In [32]:
t.delete("Hans")
t.delete("Hanna")
t.delete("Hilde")
t.delete("Hugo")
t.delete("Harro")
t.delete("Heike")
t.delete("Heiko")
t.delete("Harald")
t.delete("Hasso")
t.delete("Haley")
t.delete("Hanny")
t.delete("Happy")
t.delete("Harlene")
t.delete("Harley")
t.delete("Harmony")
t.delete("Harriet")
t.delete("Harrietta")
t.delete("Hazel")
t.delete("Heather")
t.delete("Hedwig")
t.delete("Hedy")
t.delete("Heidi")
t.delete("Helaina")
t.delete("Helaine")
t.delete("Helen")
t.delete("Helena")
t.delete("Helene")
t.delete("Helga")
t.delete("Hendrika")
t.delete("Henrietta")
t.delete("Henriette")
t.delete("Hermione")
t.delete("Herta")
t.delete("Hilary")
t.delete("Hilda")
t.toDot()
Out[32]:
Let us compute the prime numbers less than $100$. Note that we have to convert numbers to strings in order to be able to use tries.
In [33]:
S = Trie()
for i in range(2, 101):
S.insert(str(i), i)
for i in range(2, 51):
for j in range(i, 100 // i + 1):
S.delete(str(i * j))
S.toDot()
Out[33]:
Let us compute all key stored in this trie, as these are the prime numbers.
In [34]:
print(sorted([int(p) for p in S.allKeys()]))
In [ ]: