In this notebook, we implement a simpler version of the BUC (Bottom up cubing) algorithm.
The original BUC algorithm is a highly optimized and practically efficient algorithm to materialize the entire data cube given a fact table. We instead experiment with a simpler version, called BUC_rec
(BUC recursive), based on our teaching slides. The implementation here emphasize more on the conceptual simplicity rather than efficiency.
Once you understand BUC_rec
, you may try to understand BUC
if you want to learn more by yourself.
The recursive formula used in BUC_rec
is:
where $R_i$ is $\pi_{B,C,\ldots,M}(\sigma_{A = a_i}(R))$ and $R_{\text{ALL}}$ is $\pi_{B,C,\ldots,M}(R)$.
In [1]:
import sys
import pandas as pd
import numpy as np
ALL = -1
# DEBUG = True
DEBUG = False
##============================================================
# Data file format:
# * tab-delimited input file
# * 1st line: dimension names and the last dimension is assumed to be the measure
# * rest of the lines: data values.
def read_data(filename):
df = pd.read_csv(filename, sep='\t')
dims = df.shape[1] - 1 # the last dim is the measure
return (df, dims)
def dump_input2(input):
if DEBUG:
print("\n.. BUC_rec invoked on:")
print(input)
print("......................\n")
# helper functions
def project_data(input, d):
# Return only the d-th column of INPUT
return input.iloc[:, d]
def select_data(input, d, val):
# SELECT * FROM INPUT WHERE INPUT.d = VAL
col_name = input.columns[d]
return input[input[col_name] == val]
def remove_first_dim(input):
# Remove the first dim of the input
return input.iloc[:, 1:]
def slice_data_dim0(input, v):
# syntactic sugar to get R_{ALL} in a less verbose way
df_temp = select_data(input, 0, v)
return remove_first_dim(df_temp)
def output(val):
print('=>\t{}'.format(val))
In [2]:
data, d = read_data('./asset/a_.txt')
print(d)
data
Out[2]:
In [3]:
project_data(data, 0)
Out[3]:
In [4]:
select_data(data, 1, 2)
Out[4]:
In [5]:
slice_data_dim0(data, 2)
Out[5]:
Now we can implement the buc_rec()
algorithm and test it.
In [6]:
# Note that input is a DataFrame
def buc_rec(input):
# Note that input is a DataFrame
dump_input2(input)
dims = input.shape[1]
if dims == 1:
# only the measure dim
input_sum = sum( project_data(input, 0) )
output(input_sum)
else:
# the general case
dim0_vals = set(project_data(input, 0).values)
for dim0_v in dim0_vals:
sub_data = slice_data_dim0(input, dim0_v)
buc_rec(sub_data)
## for R_{ALL}
sub_data = remove_first_dim(input)
buc_rec(sub_data)
In [7]:
buc_rec(data)
With the following pivot table, we can easily see the output is correct (i.e., all the (non-empty) aggregates are computed).
But did you notice anything else interesting?
In [8]:
data.pivot_table(index = ['A'], columns = ['B'], aggfunc = np.sum, margins = True)
Out[8]:
Your task is to enhance the implementation of buc_rec
so that you can generate the above more readable output.
In [ ]:
In [ ]:
In [ ]:
In [ ]: