Data structures

For a refresher on object-oriented programming, see Object-oriented programming.

A simple set implementation

Sets in Python can be specified with set notation:


In [1]:
s = {1,3,2,9}

Or by creating a set object, assigning it to a variable, and then manually adding elements:


In [2]:
s = set()
s.add(1)
s.add(3)

We can build our own set object implementation by creating a class definition:


In [3]:
class MySet:
    def __init__(self):
        self.elements = []
    def add(self, x):
        if x not in self.elements:
            self.elements.append(x)

In [4]:
s = MySet()
s.add(3)  # same as MySet.add(s, 3)
s.add(3)
s.add(2)
s.add('cat')
s.elements


Out[4]:
[3, 2, 'cat']

In [5]:
from lolviz import *
objviz(s)


Out[5]:
[objviz rendering: MySet object whose elements field points at the list [3, 2, 'cat']]

Question: How expensive is it to add an element to a set with this implementation?
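
One rough way to get a feel for the cost is to time how long it takes to build sets of increasing size. A minimal sketch, assuming the MySet class above and Python's standard timeit module: because add checks x not in self.elements against a plain list, each add is a linear scan, so build time grows much faster than the number of elements.

import timeit

def build(n):
    s = MySet()
    for i in range(n):
        s.add(i)   # each add scans the whole elements list

for n in (1000, 2000, 4000):
    print(n, round(timeit.timeit(lambda: build(n), number=3), 2))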

Exercise

Add a method called hasmember() that returns true or false according to whether parameter x is a member of the set.


In [6]:
class MySet:
    def __init__(self):
        self.elements = []
    def add(self, x):
        if x not in self.elements:
            self.elements.append(x)
    def hasmember(self, x):
        return x in self.elements

In [7]:
s = MySet()
s.add(3)  # same as MySet.add(s, 3)
s.add(3)
s.add(2)
s.add('cat')
s.hasmember(3), s.hasmember(99)


Out[7]:
(True, False)

Linked lists -- the gateway drug

We've studied arrays/lists that are built into Python but they are not always the best kind of list to use. Sometimes, we are inserting and deleting things from the head or middle of the list. If we do this in lists made up of contiguous cells in memory, we have to move a lot of cells around to make room for a new element or to close a hole made by a deletion. Most importantly, linked lists are the degenerate form of a general object graph. So, it makes sense to start with the simple versions and move up to general graphs.

Linked lists allow us to efficiently insert and remove things anywhere we want, at the cost of more memory.

A linked list associates a next pointer with each value. We call these things nodes and here's a simple implementation for node objects:


In [8]:
class LLNode:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

In [9]:
head = LLNode('tombu')
callsviz(varnames='head')


Out[9]:
[callsviz rendering: head → LLNode 'tombu' (next is None)]

In [10]:
head = LLNode('parrt', head)
callsviz(varnames='head')


Out[10]:
[callsviz rendering: head → 'parrt' → 'tombu']

In [11]:
head = LLNode("xue", head)
callsviz(varnames='head')


Out[11]:
[callsviz rendering: head → 'xue' → 'parrt' → 'tombu']

Walk list

To walk a list, we use the notion of a cursor, which we can think of as a finger that moves along a data structure from node to node. We initialize the cursor to point to the first node of the list, the head, and then walk the cursor through the list via the next fields:


In [12]:
p = head
while p is not None:
    print(p.value)
    p = p.next


xue
parrt
tombu

Question: How fast can we walk the linked list?

Exercise

Modify the walking code so that it lives in a method of LLNode called exists(self, x) that looks for a node with value x starting at self. If we test with head.exists('parrt') then self would be our global head variable. Have the function return true if x exists in the list, else return false. You can test it with:

head = LLNode('tombu')
head = LLNode('parrt', head)
head = LLNode("xue", head)
head.exists('parrt'), head.exists('part')

In [13]:
class LLNode:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next
        
    def exists(self, x):
        p = self # start looking at this node
        while p is not None:
            if x==p.value:
                return True
            p = p.next
        return False
    
head = LLNode('tombu')
head = LLNode('parrt', head)
head = LLNode("xue", head)
head.exists('parrt'), head.exists('part')


Out[13]:
(True, False)

Insertion at head

If we want to insert an element at the front of a linked list, we create a node to hold the value and set its next pointer to point to the old head. Then we have the head variable point at the new node. Here is the sequence.

Create new node


In [14]:
x = LLNode('mary')
callviz(varnames=['head','x'])


Out[14]:
[callviz rendering: head → 'xue' → 'parrt' → 'tombu'; x → 'mary' (next is None)]

Make next field of new node point to head


In [15]:
x.next = head
callviz(varnames=['head','x'])


Out[15]:
[callviz rendering: x → 'mary' → 'xue' → 'parrt' → 'tombu'; head still points at the 'xue' node]

Make head point at new node


In [16]:
head = x
callviz(varnames=['head','x'])


Out[16]:
[callviz rendering: head and x both point at 'mary' → 'xue' → 'parrt' → 'tombu']

Deletion of node


In [17]:
# to delete xue, make previous node skip over xue
xue = head.next
callviz(varnames=['head','x','xue'])


Out[17]:
[callviz rendering: head and x → 'mary' → 'xue' → 'parrt' → 'tombu'; xue points at the 'xue' node]

In [18]:
head.next = xue.next
callviz(varnames=['head','x'])


Out[18]:
[callviz rendering: head and x → 'mary' → 'parrt' → 'tombu'; the 'xue' node has been skipped]

Notice that xue still points at the node, but we are going to ignore that variable from now on. Walking from the head of the list, we can no longer reach the node with 'xue' in it.


In [19]:
head.next = xue.next
callviz(varnames=['head','x','xue'])


Out[19]:
[callviz rendering: head and x → 'mary' → 'parrt' → 'tombu'; xue still points at the orphaned 'xue' node]

Exercise

Get a pointer to the node with value tombu and then delete it from the list using the same technique we just saw.


In [20]:
before_tombu = head.next
callviz(varnames=['head','x','before_tombu'])


Out[20]:
[callviz rendering: head and x → 'mary' → 'parrt' → 'tombu'; before_tombu points at the 'parrt' node]

In [21]:
before_tombu.next = None
callviz(varnames=['head','x','before_tombu'])


Out[21]:
[callviz rendering: head and x → 'mary' → 'parrt'; the 'tombu' node is no longer reachable]

Binary trees

The tree data structure is one of the most important in computer science and is extremely common in data science as well. Decision trees, which form the core of gradient boosting machines and random forests (machine learning algorithms), are naturally represented as trees in memory. When we process HTML and XML files, those are generally represented by trees. For example:


<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="web">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>

We're going to look at a simple kind of tree that has at most two children: a binary tree. A node that has no children is called a leaf and non-leaves are called internal nodes.

In general, trees with $n$ nodes have $n-1$ edges. Each node has a single incoming edge and the root has none.

Nodes have parents and children and siblings (at the same level).

Sometimes nodes have links back to their parents for programming convenience. Strictly speaking, that back-link makes the structure a graph rather than a tree, but we still treat it as a tree.
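
As a rough sketch of what such a back-link might look like (the class and field names here are illustrative only; the Tree class we use below does not keep a parent field):

class TreeWithParent:
    def __init__(self, value, parent=None):
        self.value = value
        self.left = None
        self.right = None
        self.parent = parent   # convenience back-link to the node that points at us

root = TreeWithParent('parrt')
root.left = TreeWithParent('mary', parent=root)    # child remembers its parent
root.right = TreeWithParent('april', parent=root)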


In [22]:
class Tree:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

In [23]:
root = Tree('parrt')
root.left = Tree('mary')
root.right = Tree('april')
treeviz(root)


Out[23]:
[treeviz rendering: 'parrt' with left child 'mary' and right child 'april']

In [24]:
root = Tree('parrt', Tree('mary'), Tree('april'))
treeviz(root)


Out[24]:
[treeviz rendering: 'parrt' with left child 'mary' and right child 'april']

In [25]:
root = Tree('parrt')
mary = Tree('mary')
april = Tree('april')
jim = Tree('jim')
sri = Tree('sri')
mike = Tree('mike')

root.left = mary
root.right = april
mary.left = jim
mary.right = mike
april.right = sri

treeviz(root)


Out[25]:
[treeviz rendering: 'parrt' with children 'mary' and 'april'; 'mary' has children 'jim' and 'mike'; 'april' has right child 'sri']

Exercise

Create a class definition for NTree that allows arbitrary numbers of children. (Use a list for field children rather than left and right.) The constructor should init an empty children list. Test your code using:

from lolviz import objviz
root2 = NTree('parrt')
mary = NTree('mary')
april = NTree('april')
jim = NTree('jim')
sri = NTree('sri')
mike = NTree('mike')

root2.addchild(mary)
root2.addchild(jim)
root2.addchild(sri)
sri.addchild(mike)
sri.addchild(april)

objviz(root2)

Solution


In [27]:
class NTree:
    def __init__(self, value):
        self.value = value
        self.children = []
        
    def addchild(self, child):
        if isinstance(child, NTree):
            self.children.append(child)

In [28]:
root2 = NTree('parrt')
mary = NTree('mary')
april = NTree('april')
jim = NTree('jim')
sri = NTree('sri')
mike = NTree('mike')

root2.addchild(mary)
root2.addchild(jim)
root2.addchild(sri)
sri.addchild(mike)
sri.addchild(april)

objviz(root2)


Out[28]:
[objviz rendering: NTree 'parrt' with children ['mary', 'jim', 'sri']; 'sri' has children ['mike', 'april']]

Walking trees

Walking a tree is a matter of moving a cursor like we did with the linked lists above. The goal is to visit each node in the tree. We start out by having the cursor point at the root of the tree and then walk downwards until we hit leaves, and then we come back up and try other alternatives.

A good physical analogy: imagine a person (cursor) from HR needing to speak with (visit) each person in a company, starting with the president/CEO. Here's a sample org chart:

The general visit algorithm starting at node p is: meet with p, then visit each of p's direct reports, then all of their direct reports, one level of the tree at a time. The node visitation sequence would be A,B,C,F,H,J,... This is a breadth-first search of the tree; it is easy to describe but a bit more work to implement than a depth-first search. Depth-first means visiting a person, then their first direct report, then that person's first direct report, and so on until you reach a leaf node; then back up a level and move to the next direct report. That visitation sequence is A,B,C,D,E,F,G,H,I,J,K,L.

If you'd like to start at node B, not A, what is the procedure? The same, of course. So visiting A means, say, printing A then visiting B. Visiting B means visiting C, and when that completes, visiting F, etc... The key is that the procedure for visiting a node is exactly the same regardless of which node you start with. This is generally true for any self-similar data structure like a tree.

Another easy way to think about binary tree visitation in particular is positioning yourself in a room with a bunch of doors as choices. Each door leads to other rooms, which might also have doors leading to other rooms. We can think of a room as a node and doors as pointers to other nodes. Each room is identical and has 0, 1, or 2 doors (for a binary tree). At the root node we might see two choices and, to explore all nodes, we can visit each door in turn. Let's go left:

After exploring all possible rooms by taking the left door, we come all the way back out to the root room and try the next alternative on the right:

Algorithmically, what we're doing in each room is

procedure visit room:
    if left door exists, visit rooms accessible through left door
    if right door exists, visit rooms accessible through right door

Or in code notation:

def visit(room):
    if room.left: visit(room.left)
    if room.right: visit(room.right)

This mechanism works from any room. Imagine waking up and finding yourself in a room with two doors. You have no idea whether you are at the root or somewhere in the middle of a labyrinth (maze) of rooms.

This approach is called backtracking.

Let's code this up, but as a regular function rather than a method of the tree class, to keep things simple. Let's look at that tree again:


In [29]:
treeviz(root)


Out[29]:
[treeviz rendering: same tree as above: 'parrt' with children 'mary' ('jim', 'mike') and 'april' ('sri')]

In [44]:
def walk(t):
    "Depth-first walk of binary tree"
    if t is None: return
#     if t.left is None: callsviz(varnames=['t']).view()
    print(t.value) # "visit" or process this node
    walk(t.left)   # walk into the left door
    walk(t.right)  # after visiting all those, enter right door
    
walk(root)


parrt
mary
jim
mike
april
sri

That is a recursive function, meaning that walk calls itself. It's really no different than the recurrence relations we use in mathematics, such as the gradient descent recurrence:

$x_{t+1} = x_t - \eta f'(x_t)$

Variable $x$ is a function of previous incarnations of itself.
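
To make that concrete, here is a minimal sketch of the recurrence applied to a toy function, $f(x) = x^2$ with derivative $f'(x) = 2x$; the learning rate and starting point are arbitrary choices for the illustration:

def fprime(x):
    return 2*x              # derivative of f(x) = x**2

eta = 0.1                   # learning rate (arbitrary for this sketch)
x = 5.0                     # starting point x_0
for t in range(20):
    x = x - eta * fprime(x) # x_{t+1} = x_t - eta * f'(x_t)
print(x)                    # approaches the minimum at x = 0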


In [31]:
def fact(n):
    print(f"fact({n})")
    if n==0: return 1
    return n * fact(n-1)

fact(10)


fact(10)
fact(9)
fact(8)
fact(7)
fact(6)
fact(5)
fact(4)
fact(3)
fact(2)
fact(1)
fact(0)
Out[31]:
3628800

Don't let the recursion scare you; just pretend that you are calling a different function, or that you are calling the same function except that it is known to be correct. We call that the "recursive leap of faith." (See Fundamentals of Recursion, although that one uses C++ rather than Python.)

As the old joke goes: "To truly understand recursion, you must first understand recursion."

The order in which we reach (enter/exit) each node during the search is always the same for a given search strategy, such as depth first search. Here is a visualization from Wikipedia:

We always try to go as deep as possible before exploring siblings.

Now, notice the black dots on the traversal. That signifies processing or "visiting" a node, and in this case it is done before visiting the children. When we process a node and then its children, we call that a preorder traversal. If we process a node after walking the children, we call it a postorder traversal:

In code, that just means moving the processing step to after the walk of the children:


In [32]:
def walk(t):
    if t is None: return
    walk(t.left)
    walk(t.right)
    print(t.value) # process after visiting children
    
walk(root)


jim
mike
mary
sri
april
parrt

In both cases we are performing a depth-first walk of the tree, which means that we are immediately seeking the leaves rather than siblings. A depth-first walk follows the left child fields of the nodes down until it hits a leaf, then goes back up a level and looks for remaining children at that level.

In contrast, a breadth-first walk processes all children before looking at grandchildren. This is a less common walk but, for our tree, would give the sequence parrt, mary, april, jim, mike, sri. In a sense, breadth-first processes one level of the tree at a time.
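
A breadth-first walk is usually written with a queue of nodes to visit rather than with recursion. Here is a minimal sketch (the name walk_bfs is just illustrative; it assumes the Tree class and the root tree built above, plus deque from the standard library):

from collections import deque

def walk_bfs(t):
    "Breadth-first walk of a binary tree using a queue"
    queue = deque([t])
    while queue:
        node = queue.popleft()     # oldest node waiting to be visited
        if node is None: continue
        print(node.value)          # process this node
        queue.append(node.left)    # its children line up behind
        queue.append(node.right)   # everything else at this level

walk_bfs(root)  # parrt, mary, april, jim, mike, sri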

Exercise

Alter the depth-first recursive tree walk above to sum the values in a binary tree. Have walk() return the sum of a node's value and the values of all nodes below it. Test with:

a = Tree(3)
b = Tree(5)
c = Tree(10)
d = Tree(9)
e = Tree(4)
f = Tree(1)

a.left = b
a.right = c
b.left = d
b.right = e
e.right = f
treeviz(a)

print(walk(a), walk(b), walk(c))

In [33]:
class Tree:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right       
        
def walk(t:Tree) -> int:
    if t is None: return 0
    return t.value + walk(t.left) + walk(t.right)

a = Tree(3)
b = Tree(5)
c = Tree(10)
d = Tree(9)
e = Tree(4)
f = Tree(1)

a.left = b
a.right = c
b.left = d
b.right = e
e.right = f
treeviz(a)

print(walk(a), walk(b), walk(c))


32 19 10

Graphs

Trees are actually a subset of the class of directed acyclic graphs. If we remove the acyclic restriction and the restriction that nodes have a single incoming edge, we get a general directed graph. These are also extremely common in computer science and are used to represent graphs of users in a social network, locations on a map, or a graph of webpages, which is how Google does page ranking.

graphviz

You might find it useful to display graphs visually, and graphviz is an excellent way to do that. Here's an example:


In [34]:
import graphviz as gv

gv.Source("""
digraph G {
    node [shape=box penwidth="0.6" margin="0.0" fontname="Helvetica" fontsize=10]
    edge [arrowsize=.4 penwidth="0.6"]
    rankdir=LR;
    ranksep=.25;
    cat->dog
    dog->cat
    dog->horse
    dog->zebra
    horse->zebra
    zebra->llama
}
""")


Out[34]:
[graphviz rendering: cat→dog, dog→cat, dog→horse, dog→zebra, horse→zebra, zebra→llama]

Once again, it's very convenient to represent a node in this graph as an object, which means we need a class definition:


In [35]:
class GNode:
    def __init__(self, value):
        self.value = value
        self.edges = [] # outgoing edges
        
    def connect(self, other):
        self.edges.append(other)

In [36]:
cat = GNode('cat')
dog = GNode('dog')
horse = GNode('horse')
zebra = GNode('zebra')
llama = GNode('llama')

cat.connect(dog)
dog.connect(cat)
dog.connect(horse)
dog.connect(zebra)
horse.connect(zebra)
zebra.connect(llama)

In [37]:
objviz(cat)


Out[37]:
[objviz rendering: GNode 'cat' with edges ['dog']; 'dog' → ['cat', 'horse', 'zebra']; 'horse' → ['zebra']; 'zebra' → ['llama']; 'llama' → []]

Walking graphs

Walking a graph (depth-first) is just like walking a tree in that we use backtracking to try all possible branches out of every node until we have reached all reachable nodes. When we run into a dead end, we back up to the most recent unvisited path and try that. That's how you get from the entrance to the exit of a maze.

The only difference between walking a tree and walking a graph is that we have to watch out for cycles when walking a graph, so that we don't get stuck in an infinite loop. We leave a trail of breadcrumbs or candies or string to help us keep track of where we have visited and where we have not. If we run into our trail, we have hit a cycle and must also backtrack to avoid an infinite loop. This is a depth-first search.

Here's a nice visualization website for graph walking.

In code, here is how we perform a depth-first search on a graph:


In [38]:
def walk(g, visited):
    "Depth-first walk of a graph"
    if g is None or g in visited: return
    visited.add(g) # mark as visited
    print(g.value) # process before visiting outgoing edges
    for node in g.edges:
        walk(node, visited) # walk all outgoing edge targets
    
walk(cat, set())


cat
dog
horse
zebra
llama

Where we start the walk of the graph matters:


In [39]:
walk(llama, set())


llama

In [40]:
walk(horse, set())


horse
zebra
llama

Operator overloading

(Note: We overload operators but override methods in a subclass definition)

Python allows class definitions to implement functions that are called when standard operator symbols such as + and / are applied to objects of that type. This is extremely useful for mathematical libraries such as numpy, but is often abused. Note that you could redefine subtraction to be multiplication when someone uses the - sign. (Yikes!)

Here's an extension to Point that supports + for Point addition:


In [41]:
import numpy as np

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def distance(self, other):
        return np.sqrt( (self.x - other.x)**2 + (self.y - other.y)**2 )
    
    def __add__(self,other):
        x = self.x + other.x
        y = self.y + other.y
        return Point(x,y)
    
    def __str__(self):
        return f"({self.x},{self.y})"

In [42]:
p = Point(3,4)
q = Point(5,6)
print(p, q)
print(p + q) # calls p.__add__(q) or Point.__add__(p,q)
print(Point.__add__(p,q))


(3,4) (5,6)
(8,10)
(8,10)

Exercise

Add a method to implement the - subtraction operator for Point so that the following code works:

p = Point(5,4)
q = Point(1,5)
print(p, q)
print(p - q)

In [43]:
import numpy as np

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def distance(self, other):
        return np.sqrt( (self.x - other.x)**2 + (self.y - other.y)**2 )
    
    def __add__(self,other):
        x = self.x + other.x
        y = self.y + other.y
        return Point(x,y)
    
    def __sub__(self,other):
        x = self.x - other.x
        y = self.y - other.y
        return Point(x,y)
    
    def __str__(self):
        return f"({self.x},{self.y})"
    
p = Point(5,4)
q = Point(1,5)
print(p, q)
print(p - q)


(5,4) (1,5)
(4,-1)