Introduction

In this notebook, we experiment with the optimal histogram algorithm. We will implement a simple version based on recursion and you will do the hard job of implementing a dynamic programming-based version.

References:

H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Kenneth C. Sevcik, Torsten Suel: Optimal Histograms with Quality Guarantees. VLDB 1998: 275-286. (url: http://engineering.nyu.edu/~suel/papers/vopt.pdf)
Dynamic Programming (wikipedia): https://en.wikipedia.org/wiki/Dynamic_programming



In [41]:

    
LARGE_NUM = 1000000000.0
EMPTY = -1

DEBUG = 2
#DEBUG = 1

import numpy as np

def sse(arr):
    if len(arr) == 0: # deal with arr == []
        return 0.0

    avg = np.average(arr)
    val = sum( [(x-avg)*(x-avg) for x in arr] )
    return val

def calc_depth(b):
    return 5 - b

def v_opt_rec(xx, b):
    mincost = LARGE_NUM
    n = len(xx)

    # check boundary condition:
    if n < b:
        return LARGE_NUM + 1
    elif b == 1:
        return sse(xx)
    else:  # the general case
        if DEBUG > 1:
            #print('.. BEGIN: input = {!s:<30}, b = {}'.format(xx, b))
            print('..{}BEGIN: input = {!s:<30}, b = {}'.format('  '*calc_depth(b), xx, b))

            
        for t in range(n):
            prefix = xx[0 : t+1]
            suffix = xx[t+1 : ]
            cost = sse(prefix) + v_opt_rec(suffix, b - 1)
            mincost = min(mincost, cost)

        if DEBUG > 0:
            #print('.. END: input = {!s:<32}, b = {}, mincost = {}'.format(xx, b, mincost))
            print('..{}END: input = {!s:<32}, b = {}, mincost = {}'.format('  '*calc_depth(b), xx, b, mincost))

        return mincost

Now, try to understand how the algorithm works -- feel free to modify the code to output more if you need. Specifically,

Observe and understand how the recursion works (set DEBUG = 2)
Observe and understand how many sub-problems are being solved again and again (set DEBUG = 1), especially when the input array is longer.



In [42]:

    
x = [7, 9, 13, 5]
b = 3

c = v_opt_rec(x, b)
print('optimal cost = {}'.format(c))









    



..    BEGIN: input = [7, 9, 13, 5]                 , b = 3
..      BEGIN: input = [9, 13, 5]                    , b = 2
..      END: input = [9, 13, 5]                      , b = 2, mincost = 8.0
..      BEGIN: input = [13, 5]                       , b = 2
..      END: input = [13, 5]                         , b = 2, mincost = 0.0
..    END: input = [7, 9, 13, 5]                   , b = 3, mincost = 2.0
optimal cost = 2.0



In [43]:

    
x = [1, 3, 9, 13, 17]
b = 4

c = v_opt_rec(x, b)
print('c = {}'.format(c))









    



..  BEGIN: input = [1, 3, 9, 13, 17]             , b = 4
..    BEGIN: input = [3, 9, 13, 17]                , b = 3
..      BEGIN: input = [9, 13, 17]                   , b = 2
..      END: input = [9, 13, 17]                     , b = 2, mincost = 8.0
..      BEGIN: input = [13, 17]                      , b = 2
..      END: input = [13, 17]                        , b = 2, mincost = 0.0
..    END: input = [3, 9, 13, 17]                  , b = 3, mincost = 8.0
..    BEGIN: input = [9, 13, 17]                   , b = 3
..      BEGIN: input = [13, 17]                      , b = 2
..      END: input = [13, 17]                        , b = 2, mincost = 0.0
..    END: input = [9, 13, 17]                     , b = 3, mincost = 0.0
..  END: input = [1, 3, 9, 13, 17]               , b = 4, mincost = 2.0
c = 2.0



In [44]:

    
x = [3, 1, 18, 9, 13, 17]
b = 4

c = v_opt_rec(x, b)
print('c = {}'.format(c))









    



..  BEGIN: input = [3, 1, 18, 9, 13, 17]         , b = 4
..    BEGIN: input = [1, 18, 9, 13, 17]            , b = 3
..      BEGIN: input = [18, 9, 13, 17]               , b = 2
..      END: input = [18, 9, 13, 17]                 , b = 2, mincost = 32.0
..      BEGIN: input = [9, 13, 17]                   , b = 2
..      END: input = [9, 13, 17]                     , b = 2, mincost = 8.0
..      BEGIN: input = [13, 17]                      , b = 2
..      END: input = [13, 17]                        , b = 2, mincost = 0.0
..    END: input = [1, 18, 9, 13, 17]              , b = 3, mincost = 32.0
..    BEGIN: input = [18, 9, 13, 17]               , b = 3
..      BEGIN: input = [9, 13, 17]                   , b = 2
..      END: input = [9, 13, 17]                     , b = 2, mincost = 8.0
..      BEGIN: input = [13, 17]                      , b = 2
..      END: input = [13, 17]                        , b = 2, mincost = 0.0
..    END: input = [18, 9, 13, 17]                 , b = 3, mincost = 8.0
..    BEGIN: input = [9, 13, 17]                   , b = 3
..      BEGIN: input = [13, 17]                      , b = 2
..      END: input = [13, 17]                        , b = 2, mincost = 0.0
..    END: input = [9, 13, 17]                     , b = 3, mincost = 0.0
..  END: input = [3, 1, 18, 9, 13, 17]           , b = 4, mincost = 10.0
c = 10.0



In [45]:

    
x = [1, 2, 3, 4, 5, 6]
b = 4

c = v_opt_rec(x, b)
print('c = {}'.format(c))









    



..  BEGIN: input = [1, 2, 3, 4, 5, 6]            , b = 4
..    BEGIN: input = [2, 3, 4, 5, 6]               , b = 3
..      BEGIN: input = [3, 4, 5, 6]                  , b = 2
..      END: input = [3, 4, 5, 6]                    , b = 2, mincost = 1.0
..      BEGIN: input = [4, 5, 6]                     , b = 2
..      END: input = [4, 5, 6]                       , b = 2, mincost = 0.5
..      BEGIN: input = [5, 6]                        , b = 2
..      END: input = [5, 6]                          , b = 2, mincost = 0.0
..    END: input = [2, 3, 4, 5, 6]                 , b = 3, mincost = 1.0
..    BEGIN: input = [3, 4, 5, 6]                  , b = 3
..      BEGIN: input = [4, 5, 6]                     , b = 2
..      END: input = [4, 5, 6]                       , b = 2, mincost = 0.5
..      BEGIN: input = [5, 6]                        , b = 2
..      END: input = [5, 6]                          , b = 2, mincost = 0.0
..    END: input = [3, 4, 5, 6]                    , b = 3, mincost = 0.5
..    BEGIN: input = [4, 5, 6]                     , b = 3
..      BEGIN: input = [5, 6]                        , b = 2
..      END: input = [5, 6]                          , b = 2, mincost = 0.0
..    END: input = [4, 5, 6]                       , b = 3, mincost = 0.0
..  END: input = [1, 2, 3, 4, 5, 6]              , b = 4, mincost = 1.0
c = 1.0

Exercise

Now you need to implement the same algorithm using dynamic programming. The idea is to fill in a table, named opt, of $b \times n$, where opt[i][j] records the optimal cost for building a histogram of $[x_j, x_{j+1}, \ldots, x_n]$ using $i$ bins.

The first step is to work out the general recursive formula, in the form of: $$ opt[i][j] = \min_{t} f(t) $$ You need to work out what is the domain of $t$ and what exactly is $f()$, which should depend on $opt[u][v]$ that has already been computed. If you cannot work it out directly, you can observe the sub-problems being solved in the recursive algorithm and see if you can schedule the computation of table cells such that every cell required on the right hand side (i.e., $f(t)$) is always scheduled before the current cell $opt[i][j]$.



In [ ]:



In [ ]:



In [ ]:



In [ ]: