Trie (Keyword Tree)

A Trie or Keyword tree is a search tree, usually used to search strings. Every child node of a given node in this tree share the same prefix up to that node. Common applications for this data structure include genome analysis, search autocomplete and data analysis. Here we implement a basic trie in C++11 capable of doing insertions and searches.

How does a Trie works?

Tries are usually used on data that can be represented in a string format, as a mean to simplificate the explanation we will only be dealing with strings here. Each node on a Trie represents a character on the string, the nodes are stored in a top to bottom approach where it's ancestors represent it's prefix. We can see in Figure 1 a Trie after adding the word "Ball".

Figure 1: Trie after inserting the word "Ball".

As shown in the tree in Figure 1, in the word "Ball", the node representing "a" in the seccond layer is predecessed by "B", similarly the first "l" has nodes "Ba" as it's ancestors. The trie allow us to interpret the string as a graph and store the data more efficiently by writting only once the common preffix of the strings stored, we can see it on the tree in figure 2 where the words "sweet", "sheep" and "ship were inserted.

Figure 2: Trie after inserting the words "sweet", "sheep" and "ship"

Including necessary libraries

We will represent every chilld node of a given node by a hash and use the node's key as the hash's key which allows for O(1) lookups. I'm personally not a fan of typing "std::" all the damn time and this is a small project so let's use namespace std. The beginning of the file should look something like this.


In [ ]:
#include <unordered_map>

using namespace std;

The Node

Each node will have a children Hash which will map a character in that position to it's following nodes. It is also useful to know if our node represent the end of a word and if it's a leaf node. We also created a default constructor so that the values are preset once the node is initialized. With this in mind our structure must look something like this.


In [ ]:
struct Node{
	unordered_map<char, Node*> children;
	bool endOfWord;
	bool isLeaf;
	Node(){
		endOfWord = false;
		isLeaf = true;
	}
};

The Trie class

We chose to represent our Trie as a class so that we can implement good programming practices such as keeping private methods private. The root Node is represented as a pointer to a node and is allocated once the object is initialized, as said before, we also implemented the insert and search method.

Since we are allocating the Node pointers as a tree we will have to write our own destructor. The destructor should be able to visit every node in the tree deallocating the memory. We can achieve that by visiting all child nodes recursively and deallocating them once all children nodes were visited. This is done by calling the remove method on the destructor which calls itself recursively.

By now, our class is looking like the code below:


In [ ]:
class Trie{
private:
    Node *root;
    void remove(Node *n);
public:
    Trie();
    ~Trie();
    
    void insert(string str);
    bool search(string str);
};

In [ ]:
Trie::Trie(){
    root = new Node();
}
The destructor

The destructor only calls the remove method passing the root node as a parameter, the remove method free the memory recursively as described earlier. Both, the destructor and remove are shown in the code below.


In [ ]:
Trie::~Trie(){
    remove(root);
}

In [ ]:
void Trie::remove(Node *n){
    // Loop through every child node visiting it recursivelly
    for(pair<char,Node*> p: n->children)
        remove(p.second);
    // Delete current node
    delete n;
}
The insert method

We start by visiting the root node and check if any of it's children match the first character on the string, this can be made in $O(1)$ on average with the unordered set. If the node has the character as it's child then you simply visit the child node, otherwise you create that child node and visit it, this process repeats with every single letter on the string until reaching the end of it, after reaching the end, the last node reached is marked as a end of word. When creating a new node the current node is marked as not being a leaf node.

The code for this is presented below:


In [ ]:
void Trie::insert(string str){
    Node* node = root; // Current node being visited
    // Loop through every letter on the String
    for(char c: str){
        // Adds a children node if it doesn't find any with that key
        if(node->children.find(c) == node->children.end()){
            node->children[c] = new Node();
            node->isLeaf = false;
        }
        // Visit next node
        node = node->children[c];
    }
    // Sets the end of the word
    node->endOfWord = true;
}
The search method

For searching we can take two approaches, one of them is to search for a complete word and the other is to search for a word prefix. Let's say I add the word "hello" on a empty trie, doing a full word search of the word "hell" in this trie should return false, while if you are matching only the prefix, it should return true. Our first code in this topic implements a search where only the prefix is matched a slight modification was made on the second one to match the entire word.

The code works by looping through every character on the string and checking if there is a corresponding child node, visiting the node if it was found and returning false otherwise. The function will also return false in case it reaches a leaf node.

Important: if you are reading this on a Jupyter notebook please run only one of the codes. If you are tring to implement both in your code, please implement them with different names.


In [ ]:
bool Trie::search(string str){
    Node *node = root; // Current node
    // Loop through every character on the string
    for(char c:str){
        // Return false if key is not found or reach a leaf node
        if(node->children.find(c) == node->children.end() || node->isLeaf)
            return false;
        // Visit next node
        node = node->children[c];
    }
    // Once all characters matched, returns true
    return true;
}

The following code returns true only when the the entire word is matched


In [ ]:
bool Trie::search(string str){
    Node *node = root; // Current node
    // Loop through every character on the string
    for(char c:str){
        // Return false if key is not found or reach a leaf node
        if(node->children.find(c) == node->children.end() || node->isLeaf)
            return false;
        // Visit next node
        node = node->children[c];
    }
    // Returns true once all character matched and you reached the end of a word
    return node->endOfWord; // <-- MODIFIED!!!
}

With that we conclude our basic Trie.