List intersection algorithms

In this notebook, you can experiment with two algorithms (the naive one and the one using galloping search) to perform list intersection, which is an essential operator in supporting conjunction keyword queries.


In [67]:
## define the function p() that outputs debugging messages if DEBUG flag is True. 

#DEBUG = True
DEBUG = False

def p(msg):
    if DEBUG:
        print(".. {}".format(msg))

The InvertedList class

We implement this class which encapsulates all the important operations one can perform on an inverted list.


In [1]:
from pprint import pprint
from bisect import *

## You may find many similarity between this class and the file operation (in C/Java). 
class InvertedList:
    def __init__(self, l):
        self.data = l[:] # make a copy
        self.cur = 0     # the cursor 

    def get_list(self):
        return self.data
 
    def eol(self):
        # we use cur == len(list) to indicate EOL
        return False if self.cur < len(self.data) else True
    
    def next(self, val = 1):
        # does not allow cur to be out-of-range, but use len(list) to indicate EOL
        self.cur = min(self.cur + val, len(self.data)) 
            
    def elem(self):
        if self.eol():
            return None
        else: 
            return self.data[self.cur]

    def peek(self, pos):
        # look at the element under the current cursor, but does not advance the cursor. 
        if pos < len(self.data):
            return self.data[pos]
        else:
            return None
        
    def reset(self):
        self.cur = 0
    
    def dump(self):
        print("* {} : {} (cur = {})".format(self.data[:self.cur], self.data[self.cur:], self.cur))
        
    def skip_to(self, val):
        ##!! This is a naive and inefficient implementation. 
        ##!! Once you have implemented gallop_to(), you can just call it here instead.
        # move the cursor to the first element no smaller than `val`

        while not self.eol() and self.elem() < val:
            self.next()               
            
    def gallop_to(self, val):
        ##!! Not implemented yet. 
        pass

Test the class


In [69]:
a = InvertedList([2, 4, 6, 8, 10, 12, 14, 16, 18])
b = InvertedList([1, 2, 4, 8, 16, 32])

while not a.eol():
    e = a.elem()
    a.next()
    print(e)

a.dump()


2
4
6
8
10
12
14
16
18
* [2, 4, 6, 8, 10, 12, 14, 16, 18] : [] (cur = 9)

If you have implemented the gallop_to() correctly, you should pass the following test. You may change the call to skip_to() for now to see how it should work.


In [70]:
def test_gallop(l, val):
    print("=> gallop_to({})".format(val))
    l.reset()
    l.gallop_to(val)
    l.dump()

In [71]:
test_a = [val - 1 for val in a.get_list()]
test_a.append(a.get_list()[-1] + 100)

DEBUG = False # to show the overview
a.dump()
for t_a in test_a:
    test_gallop(a, t_a)


* [2, 4, 6, 8, 10, 12, 14, 16, 18] : [] (cur = 9)
=> gallop_to(1)
* [] : [2, 4, 6, 8, 10, 12, 14, 16, 18] (cur = 0)
=> gallop_to(3)
* [2] : [4, 6, 8, 10, 12, 14, 16, 18] (cur = 1)
=> gallop_to(5)
* [2, 4] : [6, 8, 10, 12, 14, 16, 18] (cur = 2)
=> gallop_to(7)
* [2, 4, 6] : [8, 10, 12, 14, 16, 18] (cur = 3)
=> gallop_to(9)
* [2, 4, 6, 8] : [10, 12, 14, 16, 18] (cur = 4)
=> gallop_to(11)
* [2, 4, 6, 8, 10] : [12, 14, 16, 18] (cur = 5)
=> gallop_to(13)
* [2, 4, 6, 8, 10, 12] : [14, 16, 18] (cur = 6)
=> gallop_to(15)
* [2, 4, 6, 8, 10, 12, 14] : [16, 18] (cur = 7)
=> gallop_to(17)
* [2, 4, 6, 8, 10, 12, 14, 16] : [18] (cur = 8)
=> gallop_to(118)
* [2, 4, 6, 8, 10, 12, 14, 16, 18] : [] (cur = 9)

In [72]:
DEBUG = True # to show the details. Your detailed debugging info may vary. 
test_gallop(a, test_a[-2])


=> gallop_to(17)
.. moved to pos = 2, elem = 6
.. moved to pos = 6, elem = 14
.. moved to pos = 14, elem = None
.. Stage 1 complete: interval = [6, 8], elems = [14, 18]
.. Stage 2 complete: cur = 8, sub_pos = 1, sub_list = [16, 18]
* [2, 4, 6, 8, 10, 12, 14, 16] : [18] (cur = 8)

As you can see above, the galloping search algorithm quickly skip over the list to reposition the cursor. In the additional note, it is proved that its complexity is $O(\log x)$, where $x$ is the difference between the correct cursor position and the initial cursor position.

The Naive Intersection Algorithm


In [73]:
def intersection_naive(a, b):
    # just in case these lists have been traversed.
    a.reset()
    b.reset()
    
    ret = []
    while not a.eol() and not b.eol():
        if a.elem() < b.elem():
            p("move a")
            a.next()
        elif a.elem() > b.elem():
            p("move b")
            b.next()
        else:
            p("found one")
            ret.append(a.elem())
            a.next()
            b.next()
    return ret

Test


In [74]:
intersection_naive(a, b)


.. move b
.. found one
.. found one
.. move a
.. found one
.. move a
.. move a
.. move a
.. found one
.. move a
Out[74]:
[2, 4, 8, 16]

Intersection Algorithm using gallop_to()

Here, we call gallop_to explicitly to show some detailed info if you have it implemented correctly. You can change the calls to skip_to in a more general setting.


In [75]:
def intersection_galloping(a, b):
    # just in case these lists have been traversed.
    a.reset()
    b.reset()
    
    ret = []
    while not a.eol() and not b.eol():
        if a.elem() == b.elem():
            p("found one")
            ret.append(a.elem())
            a.next()
        else:
            if a.elem() < b.elem():
                a.gallop_to(b.elem())
            else:
                b.gallop_to(a.elem())
    # end_while
    return ret

In [76]:
intersection_galloping(a, b)


.. moved to pos = 2, elem = 4
.. Stage 1 complete: interval = [0, 2], elems = [1, 4]
.. Stage 2 complete: cur = 1, sub_pos = 0, sub_list = [2, 4]
.. found one
.. moved to pos = 3, elem = 8
.. Stage 1 complete: interval = [1, 3], elems = [2, 8]
.. Stage 2 complete: cur = 2, sub_pos = 0, sub_list = [4, 8]
.. found one
.. moved to pos = 4, elem = 16
.. Stage 1 complete: interval = [2, 4], elems = [4, 16]
.. Stage 2 complete: cur = 3, sub_pos = 0, sub_list = [8, 16]
.. moved to pos = 4, elem = 10
.. Stage 1 complete: interval = [2, 4], elems = [6, 10]
.. Stage 2 complete: cur = 3, sub_pos = 0, sub_list = [8, 10]
.. found one
.. moved to pos = 5, elem = 32
.. Stage 1 complete: interval = [3, 5], elems = [8, 32]
.. Stage 2 complete: cur = 4, sub_pos = 0, sub_list = [16, 32]
.. moved to pos = 6, elem = 14
.. moved to pos = 10, elem = None
.. Stage 1 complete: interval = [6, 8], elems = [14, 18]
.. Stage 2 complete: cur = 7, sub_pos = 0, sub_list = [16, 18]
.. found one
.. moved to pos = 6, elem = None
.. Stage 1 complete: interval = [4, 6], elems = [16, None]
.. Stage 2 complete: cur = 5, sub_pos = 0, sub_list = [32]
.. moved to pos = 10, elem = None
.. Stage 1 complete: interval = [8, 10], elems = [18, None]
Out[76]:
[2, 4, 8, 16]

Exercise

You need to correctly implement gallop_to using the galloping search algorithm, and then play with the intersection algorithm to appreciate how efficient it works in action. Note that the implementation is not hard, but you do need to take care of a few corner cases. That's why the test_gallop() evaluates almost all possible situation.


In [ ]:


In [ ]:


In [ ]:


In [ ]: