For a refresher on object-oriented programming, see Object-oriented programming.
In [1]:
s = {1,3,2,9}
Or by creating a set object, assigning it to a variable, and then manually adding elements:
In [2]:
s = set()
s.add(1)
s.add(3)
We can build our own set object implementation by creating a class definition:
In [3]:
class MySet:
    def __init__(self):
        self.elements = []
    def add(self, x):
        if x not in self.elements:
            self.elements.append(x)
In [4]:
s = MySet()
s.add(3) # same as MySet.add(s,3)
s.add(3)
s.add(2)
s.add('cat')
s.elements
Out[4]:
In [5]:
from lolviz import *
objviz(s)
Out[5]:
Question: How expensive is it to add an element to a set with this implementation?
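Adding an element is linear in the set size: the test x not in self.elements scans the whole list when x is new. As a sketch of why, here is a hypothetical instrumented variant (CountingSet is not part of the lecture code) that counts element comparisons:

```python
class CountingSet:
    "Like MySet, but counts how many element comparisons add() performs."
    def __init__(self):
        self.elements = []
        self.comparisons = 0
    def add(self, x):
        # `x not in self.elements` scans the list linearly; count each step
        for e in self.elements:
            self.comparisons += 1
            if e == x:
                return
        self.elements.append(x)

s = CountingSet()
for i in range(100):
    s.add(i)
print(s.comparisons)  # 0+1+...+99 = 4950, roughly n^2/2 for n adds
```

So inserting n distinct elements costs about n^2/2 comparisons overall, which is why real set implementations use hashing instead.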
In [6]:
class MySet:
    def __init__(self):
        self.elements = []
    def add(self, x):
        if x not in self.elements:
            self.elements.append(x)
    def hasmember(self, x):
        return x in self.elements
In [7]:
s = MySet()
s.add(3) # same as MySet.add(s,3)
s.add(3)
s.add(2)
s.add('cat')
s.hasmember(3), s.hasmember(99)
Out[7]:
We've studied arrays/lists that are built into Python but they are not always the best kind of list to use. Sometimes, we are inserting and deleting things from the head or middle of the list. If we do this in lists made up of contiguous cells in memory, we have to move a lot of cells around to make room for a new element or to close a hole made by a deletion. Most importantly, linked lists are the degenerate form of a general object graph. So, it makes sense to start with the simple versions and move up to general graphs.
Linked lists allow us to efficiently insert and remove things anywhere we want, at the cost of more memory.
A linked list associates a next
pointer with each value
. We call these things nodes and here's a simple implementation for node objects:
In [8]:
class LLNode:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next
In [9]:
head = LLNode('tombu')
callsviz(varnames=['head'])
Out[9]:
In [10]:
head = LLNode('parrt', head)
callsviz(varnames=['head'])
Out[10]:
In [11]:
head = LLNode("xue", head)
callsviz(varnames=['head'])
Out[11]:
In [12]:
p = head
while p is not None:
    print(p.value)
    p = p.next
Question: How fast can we walk the linked list?
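Walking is linear: we visit each node exactly once by following next pointers. A minimal sketch that counts the steps (the steps counter is added here for illustration):

```python
class LLNode:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

# build the same three-node list as above
head = None
for v in ['tombu', 'parrt', 'xue']:
    head = LLNode(v, head)

steps = 0
p = head
while p is not None:
    steps += 1   # one pointer hop per node
    p = p.next
print(steps)     # 3 -- one step per node, so walking n nodes is O(n)
```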
Modify the walking code so that it lives in a method of LLNode
called exists(self, x)
that looks for a node with value x
starting at self
. If we test with head.exists('parrt')
then self
would be our global head
variable. Have the function return true if x
exists in the list, else return false. You can test it with:
head = LLNode('tombu')
head = LLNode('parrt', head)
head = LLNode("xue", head)
head.exists('parrt'), head.exists('part')
In [13]:
class LLNode:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next
    def exists(self, x):
        p = self # start looking at this node
        while p is not None:
            if x==p.value:
                return True
            p = p.next
        return False
head = LLNode('tombu')
head = LLNode('parrt', head)
head = LLNode("xue", head)
head.exists('parrt'), head.exists('part')
Out[13]:
Create new node
In [14]:
x = LLNode('mary')
callviz(varnames=['head','x'])
Out[14]:
Make next field of new node point to head
In [15]:
x.next = head
callviz(varnames=['head','x'])
Out[15]:
Make head point at new node
In [16]:
head = x
callviz(varnames=['head','x'])
Out[16]:
In [17]:
# to delete xue, make previous node skip over xue
xue = head.next
callviz(varnames=['head','x','xue'])
Out[17]:
In [18]:
head.next = xue.next
callviz(varnames=['head','x'])
Out[18]:
Notice that xue still points at the node, but we are going to ignore that variable from now on. Moving from the head of the list, we can no longer reach the node with 'xue' in it.
In [19]:
head.next = xue.next
callviz(varnames=['head','x','xue'])
Out[19]:
In [20]:
before_tombu = head.next
callviz(varnames=['head','x','before_tombu'])
Out[20]:
In [21]:
before_tombu.next = None
callviz(varnames=['head','x','before_tombu'])
Out[21]:
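The pointer manipulations above can be bundled into a single helper. This is a sketch, not code from the lecture; the function name delete and its return-the-new-head convention are assumptions:

```python
class LLNode:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def delete(head, x):
    "Return the (possibly new) head after unlinking the first node with value x."
    if head is None:
        return None
    if head.value == x:          # deleting the head: the new head is the next node
        return head.next
    prev, p = head, head.next
    while p is not None:
        if p.value == x:
            prev.next = p.next   # make the previous node skip over p
            break
        prev, p = p, p.next
    return head

head = LLNode('tombu')
head = LLNode('parrt', head)
head = LLNode('xue', head)
head = delete(head, 'parrt')

values = []
p = head
while p is not None:
    values.append(p.value)
    p = p.next
print(values)  # ['xue', 'tombu']
```

Returning the head from delete handles the special case where the deleted node is the head itself, which the caller could not fix otherwise.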
The tree data structure is one of the most important in computer science and is extremely common in data science as well. Decision trees, which form the core of gradient boosting machines and random forests (machine learning algorithms), are naturally represented as trees in memory. When we process HTML and XML files, those are generally represented by trees. For example:
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="web">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
We're going to look at a simple kind of tree that has at most two children: a binary tree. A node that has no children is called a leaf and non-leaves are called internal nodes.
In general, trees with $n$ nodes have $n-1$ edges. Each node has a single incoming edge and the root has none.
Nodes have parents and children and siblings (at the same level).
Sometimes nodes have links back to their parents for programming convenience. Strictly speaking, that makes the structure a graph rather than a tree, but we still consider it a tree.
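As a sketch of that convenience, here is a hypothetical variant (PTree is my name, not from the lecture) whose constructor wires up a parent field automatically:

```python
class PTree:
    "Binary tree node that also records its parent, for upward navigation."
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.parent = None     # the root's parent stays None
        self.left = left
        self.right = right
        for child in (left, right):
            if child is not None:
                child.parent = self  # back-link from child to this node

root = PTree('parrt', PTree('mary'), PTree('april'))
print(root.left.parent.value)  # 'parrt' -- we can climb back up from a child
```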
In [22]:
class Tree:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right
In [23]:
root = Tree('parrt')
root.left = Tree('mary')
root.right = Tree('april')
treeviz(root)
Out[23]:
In [24]:
root = Tree('parrt', Tree('mary'), Tree('april'))
treeviz(root)
Out[24]:
In [25]:
root = Tree('parrt')
mary = Tree('mary')
april = Tree('april')
jim = Tree('jim')
sri = Tree('sri')
mike = Tree('mike')
root.left = mary
root.right = april
mary.left = jim
mary.right = mike
april.right = sri
treeviz(root)
Out[25]:
Create a class definition for NTree
that allows arbitrary numbers of children. (Use a list for field children
rather than left
and right
.) The constructor should init an empty children list. Test your code using:
from lolviz import objviz
root2 = NTree('parrt')
mary = NTree('mary')
april = NTree('april')
jim = NTree('jim')
sri = NTree('sri')
mike = NTree('mike')
root2.addchild(mary)
root2.addchild(jim)
root2.addchild(sri)
sri.addchild(mike)
sri.addchild(april)
objviz(root2)
In [27]:
class NTree:
    def __init__(self, value):
        self.value = value
        self.children = []
    def addchild(self, child):
        if isinstance(child, NTree):
            self.children.append(child)
In [28]:
root2 = NTree('parrt')
mary = NTree('mary')
april = NTree('april')
jim = NTree('jim')
sri = NTree('sri')
mike = NTree('mike')
root2.addchild(mary)
root2.addchild(jim)
root2.addchild(sri)
sri.addchild(mike)
sri.addchild(april)
objviz(root2)
Out[28]:
Walking a tree is a matter of moving a cursor like we did with the linked lists above. The goal is to visit each node in the tree. We start out by having the cursor point at the root of the tree and then walk downwards until we hit leaves, and then we come back up and try other alternatives.
A good physical analogy: imagine a person (cursor) from HR needing to speak (visit) each person in a company starting with the president/CEO. Here's a sample org chart:
The general visit algorithm starting at node p is: meet with p, then visit each direct report, then all of their direct reports, one level of the tree at a time. The node visitation sequence would be A,B,C,F,H,J,... This is a breadth-first search of the tree: easy to describe, but a bit more work to implement than a depth-first search. Depth-first means visiting a person, then their first direct report, then that person's first direct report, and so on until you reach a leaf node; then you back up a level and move to the next direct report. That visitation sequence is A,B,C,D,E,F,G,H,I,J,K,L.
If you'd like to start at node B, not A, what is the procedure? The same, of course. So visiting A means, say, printing A
then visiting B. Visiting B means visiting C, and when that completes, visiting F, etc... The key is that the procedure for visiting a node is exactly the same regardless of which node you start with. This is generally true for any self-similar data structure like a tree.
Another easy way to think about binary tree visitation in particular is positioning yourself in a room with a bunch of doors as choices. Each door leads to other rooms, which might also have doors leading to other rooms. We can think of a room as a node and doors as pointers to other nodes. Each room is identical and has 0, 1, or 2 doors (for a binary tree). At the root node we might see two choices and, to explore all nodes, we can visit each door in turn. Let's go left:
After exploring all possible rooms by taking the left door, we come all the way back out to the root room and try the next alternative on the right:
Algorithmically, what we're doing in each room is:
procedure visit room:
    if left door exists, visit rooms accessible through left door
    if right door exists, visit rooms accessible through right door
Or in code notation:
def visit(room):
    if room.left: visit(room.left)
    if room.right: visit(room.right)
This mechanism works from any room. Imagine waking up and finding yourself in a room with two doors. You have no idea whether you are at the root or somewhere in the middle of a labyrinth (maze) of rooms.
This approach is called backtracking.
Let's code this up but make a regular function not a method of the tree class to keep things simple. Let's look at that tree again:
In [29]:
treeviz(root)
Out[29]:
In [44]:
def walk(t):
    "Depth-first walk of binary tree"
    if t is None: return
    # if t.left is None: callsviz(varnames=['t']).view()
    print(t.value)  # "visit" or process this node
    walk(t.left)    # walk into the left door
    walk(t.right)   # after visiting all those, enter right door
walk(root)
That is a recursive function, meaning that walk
calls itself. It's really no different than the recurrence relations we use in mathematics, such as the gradient descent recurrence:
$x_{t+1} = x_t - \eta f'(x_t)$
Variable $x$ is a function of previous incarnations of itself.
In [31]:
def fact(n):
    print(f"fact({n})")
    if n==0: return 1
    return n * fact(n-1)
fact(10)
Out[31]:
Don't let the recursion scare you; just pretend that you are calling a different function, or that you are calling the same function except that it is known to be correct. We call that the "recursive leap of faith." (See Fundamentals of Recursion, although that one uses C++ rather than Python.)
As the old joke goes: "To truly understand recursion, you must first understand recursion."
The order in which we reach (enter/exit) each node during the search is always the same for a given search strategy, such as depth first search. Here is a visualization from Wikipedia:
We always try to go as deep as possible before exploring siblings.
Now, notice the black dots on the traversal. A dot signifies processing or "visiting" a node, which in this case is done before visiting the children. When we process a node and then its children, we call that a preorder traversal. If we process a node after walking the children, we call it a postorder traversal:
In code, that just means moving the processing step to after the walk of the children:
In [32]:
def walk(t):
    if t is None: return
    walk(t.left)
    walk(t.right)
    print(t.value) # process after visiting children
walk(root)
In both cases we are performing a depth-first walk of the tree, which means that we are immediately seeking the leaves rather than siblings. A depth first walk scans down all of the left child fields of the nodes until it hits a leaf and then goes back up a level, looking for children at that level.
In contrast, a breadth-first walk processes all children before looking at grandchildren. This is a less common walk but, for our tree, would be the sequence parrt, mary, april, jim, mike, sri. In a sense, breadth first processes one level of the tree at a time:
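A breadth-first walk can be sketched with a queue instead of recursion: pop a node, process it, and enqueue its children so each level drains before the next begins. The function name walk_bfs is mine, not from the lecture:

```python
from collections import deque

class Tree:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def walk_bfs(t):
    "Breadth-first walk: process each level fully before descending."
    visited = []
    worklist = deque([t])        # FIFO queue of nodes still to process
    while worklist:
        node = worklist.popleft()
        if node is None:
            continue
        visited.append(node.value)
        worklist.append(node.left)   # children queue up behind
        worklist.append(node.right)  # the rest of this level
    return visited

# the same tree as above: parrt -> (mary -> jim, mike), (april -> sri)
root = Tree('parrt',
            Tree('mary', Tree('jim'), Tree('mike')),
            Tree('april', None, Tree('sri')))
print(walk_bfs(root))  # ['parrt', 'mary', 'april', 'jim', 'mike', 'sri']
```

Swapping the deque for a plain stack (append/pop) would turn this same loop into a depth-first walk, which is a nice way to see that the two searches differ only in the order the worklist yields nodes.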
Alter the depth-first recursive tree walk above to sum the values in a binary tree. Have walk() return the sum of a node's value and all its children's values. Test with:
a = Tree(3)
b = Tree(5)
c = Tree(10)
d = Tree(9)
e = Tree(4)
f = Tree(1)
a.left = b
a.right = c
b.left = d
b.right = e
e.right = f
treeviz(a)
print(walk(a), walk(b), walk(c))
In [33]:
class Tree:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def walk(t:Tree) -> int:
    if t is None: return 0
    return t.value + walk(t.left) + walk(t.right)
a = Tree(3)
b = Tree(5)
c = Tree(10)
d = Tree(9)
e = Tree(4)
f = Tree(1)
a.left = b
a.right = c
b.left = d
b.right = e
e.right = f
treeviz(a)
print(walk(a), walk(b), walk(c))
Trees are actually a subset of the class of directed, acyclic graphs. If we remove the acyclic restriction and the restriction that nodes have a single incoming edge, we get a general, directed graph. These are also extremely common in computer science and are used to represent graphs of users in a social network, locations on a map, or a graph of webpages, which is how Google does page ranking.
You might find it useful to display graphs visually and graphviz is an excellent way to do that. Here's an example
In [34]:
import graphviz as gv
gv.Source("""
digraph G {
node [shape=box penwidth="0.6" margin="0.0" fontname="Helvetica" fontsize=10]
edge [arrowsize=.4 penwidth="0.6"]
rankdir=LR;
ranksep=.25;
cat->dog
dog->cat
dog->horse
dog->zebra
horse->zebra
zebra->llama
}
""")
Out[34]:
Once again, it's very convenient to represent a node in this graph as an object, which means we need a class definition:
In [35]:
class GNode:
    def __init__(self, value):
        self.value = value
        self.edges = [] # outgoing edges
    def connect(self, other):
        self.edges.append(other)
In [36]:
cat = GNode('cat')
dog = GNode('dog')
horse = GNode('horse')
zebra = GNode('zebra')
llama = GNode('llama')
cat.connect(dog)
dog.connect(cat)
dog.connect(horse)
dog.connect(zebra)
horse.connect(zebra)
zebra.connect(llama)
In [37]:
objviz(cat)
Out[37]:
Walking a graph (depth-first) is just like walking a tree in that we use backtracking to try all possible branches out of every node until we have reached all reachable nodes. When we run into a dead end, we back up to the most recent unvisited branch and try that. That's how you get from the entrance to the exit of a maze.
The only difference between walking a tree and walking a graph is that we have to watch out for cycles when walking a graph, so that we don't get stuck in an infinite loop. We leave a trail of breadcrumbs or candies or string to help us keep track of where we have visited and where we have not. If we run into our trail, we have hit a cycle and must also backtrack to avoid an infinite loop. This is a depth first search.
Here's a nice visualization website for graph walking.
In code, here is how we perform a depth-first search on a graph:
In [38]:
def walk(g, visited):
    "Depth-first walk of a graph"
    if g is None or g in visited: return
    visited.add(g) # mark as visited
    print(g.value) # process before visiting outgoing edges
    for node in g.edges:
        walk(node, visited) # walk all outgoing edge targets
walk(cat, set())
Where we start the walk of the graph matters:
In [39]:
walk(llama, set())
In [40]:
walk(horse, set())
(Note: We overload operators but override methods in a subclass definition)
Python allows class definitions to implement functions that are called when standard operator symbols such as +
and /
are applied to objects of that type. This is extremely useful for mathematical libraries such as numpy, but is often abused. Note that you could redefine subtraction to be multiplication when someone used the -
sign. (Yikes!)
Here's an extension to Point
that supports +
for Point
addition:
In [41]:
import numpy as np
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def distance(self, other):
        return np.sqrt( (self.x - other.x)**2 + (self.y - other.y)**2 )
    def __add__(self, other):
        x = self.x + other.x
        y = self.y + other.y
        return Point(x,y)
    def __str__(self):
        return f"({self.x},{self.y})"
In [42]:
p = Point(3,4)
q = Point(5,6)
print(p, q)
print(p + q) # calls p.__add__(q) or Point.__add__(p,q)
print(Point.__add__(p,q))
In [43]:
import numpy as np
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def distance(self, other):
        return np.sqrt( (self.x - other.x)**2 + (self.y - other.y)**2 )
    def __add__(self, other):
        x = self.x + other.x
        y = self.y + other.y
        return Point(x,y)
    def __sub__(self, other):
        x = self.x - other.x
        y = self.y - other.y
        return Point(x,y)
    def __str__(self):
        return f"({self.x},{self.y})"
p = Point(5,4)
q = Point(1,5)
print(p, q)
print(p - q)
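Arithmetic operators are not the only ones we can overload. For instance, defining __eq__ changes what == means for Point objects; by default, == on objects compares identity, like is. A minimal sketch (this addition is mine, not from the lecture):

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __eq__(self, other):
        # value equality: two Points are equal if their coordinates match
        return isinstance(other, Point) and \
               self.x == other.x and self.y == other.y
    def __str__(self):
        return f"({self.x},{self.y})"

p = Point(3, 4)
q = Point(3, 4)
print(p == q)  # True: calls p.__eq__(q), comparing coordinates
print(p is q)  # False: they are still two distinct objects in memory
```

Note that defining __eq__ without a matching __hash__ makes instances unhashable, so they could no longer be used as dictionary keys or set elements.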