MODL

A Bayes-optimal algorithm for the supervised discretization of a continuous attribute based on its discrete class values. The resulting discretization maximizes the posterior probability:

$$ P(Model\,|\, Data) \propto P(Data\,|\, Model) P(Model) $$

In this README we first demonstrate by example how to compute the posterior probability of a discretization, then present the formal mathematical derivation and the three MODL algorithms that obtain the optimal solution.

Citation

The MODL implementations are based on the original paper by Marc Boulle, "MODL: a Bayes Optimal Discretization Method for Continuous Attributes", published in Machine Learning in 2006.

Example and Demonstration

Consider a toy problem in which a car's acceleration depends on its speed. We have data as follows:

Speed (km/hr)   Action (1=Decelerate, 2=Coast, 3=Accelerate)
200.0           1
100.5           2
30.8            3
90.5            2
150.2           1
80.7            1
50.9            3
...             ...

and the underlying distribution of speed instances is:

We would like to leverage the class probabilities to obtain bins and conditional probability tables such as:

The resulting bins and class ratios are used to compute $$P(Data\,|\, Model)$$. Furthermore, $$P(Model)$$ is computed from the prior and the discretization structure. We obtain the posterior probability of the discretization model (up to a normalizing constant) by multiplying these two terms.

Mathematical Expression of The Objective Function

Before writing down the equations, we define the variables: $n$ is the number of instances, $J$ the number of class values, $I$ the number of intervals in the discretization of the continuous attribute, $n_i$ the number of instances in interval $i$, and $n_{ij}$ the number of instances of class $j$ in interval $i$. The objective function of each discretization can then be expressed in terms of these variables.

For convenience, in the algorithms we compute the objective function as $log(1/P(Model)) + log(1/P(Data\,|\, Model))$, i.e. the negative log of the posterior probability (up to an additive constant). The first term, as mentioned above, is based on the three-stage prior distribution:

  • the number of intervals $I$ is uniformly distributed between $1$ and $n$.
  • for a given number of intervals $I$, every division of the string to discretize into $I$ intervals is equiprobable.
  • for a given interval, every distribution of class values in the interval is equiprobable.
  • the distributions of the class values in each interval are independent of each other.

Thus the first term of the objective function can be written as:

$$ log(n) + log{{n+I-1}\choose{I-1}} + \sum_{i=1}^{I}{log{{n_i+J-1}\choose{J-1}}} $$

The second term of the objective function is simply

$$ \sum_{i=1}^{I} log{\frac{n_i!}{n_{i,1}!\,n_{i,2}! \cdots n_{i,J}!}} $$

In summary, the objective function of each discretization is

$$ log(n) + log{{n+I-1}\choose{I-1}} + \sum_{i=1}^{I}{log{{n_i+J-1}\choose{J-1}}} + \sum_{i=1}^{I} log{\frac{n_i!}{n_{i,1}!\,n_{i,2}! \cdots n_{i,J}!}} $$
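
As a concrete reference, here is a minimal Julia sketch that evaluates this objective function given the class counts of each interval. This is only an illustration, not the package's internal code; the names logfact, logbinom, interval_cost, and modl_cost are our own.

In [ ]:
# log(x!) computed as a sum of logs, to avoid overflowing large factorials
logfact(x) = sum(log, 1:x; init=0.0)
# log of the binomial coefficient C(a, b)
logbinom(a, b) = logfact(a) - logfact(b) - logfact(a - b)

# cost of one interval: log C(n_i + J - 1, J - 1) + log(n_i! / (n_i1! n_i2! ... n_iJ!))
function interval_cost(counts::Vector{Int})
    ni, J = sum(counts), length(counts)
    logbinom(ni + J - 1, J - 1) + logfact(ni) - sum(logfact, counts; init=0.0)
end

# full objective: log(n) + log C(n + I - 1, I - 1) + the per-interval costs
function modl_cost(interval_counts::Vector{Vector{Int}})
    n, I = sum(sum, interval_counts), length(interval_counts)
    log(n) + logbinom(n + I - 1, I - 1) + sum(interval_cost, interval_counts)
end

# example: two intervals over three classes, counts[j] = number of instances of class j
modl_cost([[5, 1, 0], [0, 2, 4]])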

Three MODL Algorithms in Discretizers.jl

We provide three algorithms that minimize the objective function. The first, implemented with dynamic programming, always returns the optimal discretization, but it requires far more runtime ($O(n^3)$) than the other two. The second is a greedy heuristic. It only takes $O(n\log(n))$, but may get stuck in a local minimum. The third is a post-greedy algorithm; as its name indicates, it refines the result of the second algorithm with additional steps. Its runtime is still asymptotically $O(n\log(n))$, although it takes more time than the second one.

Algorithm 1: Dynamic Programming

We use the fact that, if a partition of $S$ (the instances sorted by the continuous attribute) into $I$ intervals $S_1, S_2, \ldots, S_I$ is an optimal discretization, then the partition of $S-S_1$ into the $I-1$ intervals $S_2, \ldots, S_I$ is an optimal discretization of $S-S_1$.

Thus, we have the algorithm:

        Let $S^{i,j}$ be the substring of S consisting of the instances from i to j

        Let Disc($S^{i,j}$,$k$) be the optimal discretization of $S^{i,j}$ into exactly $k$ intervals

        For each $k$, $1 \leq k \leq n$

                For each $j$, $1 \leq j \leq n$

                If $k=1$, Disc($S^{1,j}$,1) = {$S^{1,j}$}

                If $k>1$, Disc($S^{1,j}$,$k$) is obtained by minimizing the objective function over all discretizations Disc($S^{1,i}$,$k-1$) $\cup$ {$S^{i+1,j}$} for $k-1 \leq i \leq j-1$.
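
The pseudocode can be written out directly. Below is a compact Julia sketch of the $O(n^3)$ dynamic program; it is our own illustration rather than the package's implementation, and all names are ours. 'labels' holds the class value of each instance after sorting by the continuous attribute, with class values in 1:J.

In [ ]:
logfact(x) = sum(log, 1:x; init=0.0)
logbinom(a, b) = logfact(a) - logfact(b) - logfact(a - b)

function modl_optimal_cuts(labels::Vector{Int}, J::Int)
    n = length(labels)
    # cost[i, j]: per-interval cost of the interval covering instances i..j
    cost = fill(Inf, n, n)
    for i in 1:n
        counts = zeros(Int, J)
        for j in i:n
            counts[labels[j]] += 1
            ni = j - i + 1
            cost[i, j] = logbinom(ni + J - 1, J - 1) +
                         logfact(ni) - sum(logfact, counts; init=0.0)
        end
    end
    # best[j, k]: minimal total per-interval cost of splitting instances 1..j into k intervals
    best = fill(Inf, n, n)
    back = zeros(Int, n, n)          # back[j, k]: last instance of interval k-1
    for j in 1:n
        best[j, 1] = cost[1, j]
    end
    for k in 2:n, j in k:n, i in (k-1):(j-1)
        v = best[i, k-1] + cost[i+1, j]
        if v < best[j, k]
            best[j, k] = v
            back[j, k] = i
        end
    end
    # add the terms that depend only on the interval count I, then pick the best I
    total(I) = log(n) + logbinom(n + I - 1, I - 1) + best[n, I]
    Ibest = argmin(map(total, 1:n))
    cuts = Int[]                     # a new interval starts after each of these indices
    j, k = n, Ibest
    while k > 1
        i = back[j, k]
        pushfirst!(cuts, i)
        j, k = i, k - 1
    end
    return cuts
end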

Here is how Algorithm 1 works in the package:


In [ ]:
edges = binedges(DiscretizeMODL_Optimal(),continuous_data,class_data)

'continuous_data' is an array that contains the continuous feature we want to discretize. 'class_data' is an array that contains the discrete class value corresponding to each instance in 'continuous_data'. 'edges' is an array of bin boundaries.
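
As a concrete (if tiny) illustration, here is the call applied to the toy speed data from the example above. We sort by speed first in case the implementation expects attribute-sorted input; with only seven instances the prior dominates, so the result may well be a single interval.

In [ ]:
using Discretizers

speed  = [200.0, 100.5, 30.8, 90.5, 150.2, 80.7, 50.9]   # continuous_data
action = [1, 2, 3, 2, 1, 1, 3]                           # class_data (1=Decelerate, 2=Coast, 3=Accelerate)
p      = sortperm(speed)                                  # sort instances by the continuous attribute
edges  = binedges(DiscretizeMODL_Optimal(), speed[p], action[p])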

Algorithm 2: Greedy Heuristic

In the greedy bottom-up algorithm, we first treat each single instance as its own bin, then search for the best merge between adjacent intervals. A merge is performed if it decreases the objective function, and the process is repeated until no further merge can decrease it. We call the change in the objective function caused by a merge $\Delta$. All $\Delta$ values are stored in a priority queue. In each iteration, we dequeue the smallest $\Delta$ and the corresponding merge. After the merge is performed, we update the $\Delta$ values of the new interval and its neighbors to prepare for the next dequeue step.

        Initialization

                Create a bin for each instance of continuous attribute

                Compute the objective value of this initial discretization

                Compute the $\Delta$ values related to all possible merges

                Insert all the $\Delta$ values into a priority queue

        Optimization of the discretization: Repeat the following steps

                Dequeue a $\Delta$ value and the corresponding merge policy

                If this $\Delta$ < 0

                        Update the current objective function by adding this $\Delta$.

                        Update the priority queue by recomputing the $\Delta$ values of the new interval and its neighbors
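
For illustration, here is a simplified Julia sketch of this bottom-up merge. It is not the package's implementation: instead of maintaining a priority queue it rescans all adjacent pairs on every iteration, which is $O(n^2)$ but easier to read, and all names are ours.

In [ ]:
logfact(x) = sum(log, 1:x; init=0.0)
logbinom(a, b) = logfact(a) - logfact(b) - logfact(a - b)
function interval_cost(c::Vector{Int})
    ni, J = sum(c), length(c)
    logbinom(ni + J - 1, J - 1) + logfact(ni) - sum(logfact, c; init=0.0)
end

# labels: class value of each instance, sorted by the continuous attribute (values in 1:J)
function greedy_merge(labels::Vector{Int}, J::Int)
    n = length(labels)
    counts = [[j == y ? 1 : 0 for j in 1:J] for y in labels]   # one bin per instance
    # change of log C(n+I-1, I-1) when the interval count drops from I to I-1
    Δstruct(I) = logbinom(n + I - 2, I - 2) - logbinom(n + I - 1, I - 1)
    while length(counts) > 1
        I = length(counts)
        best_i, best_Δ = 0, Inf
        for i in 1:I-1                                          # scan adjacent pairs
            merged = counts[i] .+ counts[i+1]
            Δ = Δstruct(I) + interval_cost(merged) -
                interval_cost(counts[i]) - interval_cost(counts[i+1])
            if Δ < best_Δ
                best_i, best_Δ = i, Δ
            end
        end
        best_Δ >= 0 && break                                    # no merge improves the objective
        counts[best_i] = counts[best_i] .+ counts[best_i+1]
        deleteat!(counts, best_i + 1)
    end
    return counts                    # class-count table of the resulting intervals
end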

Here is how Algorithm 2 works in the package:


In [ ]:
edges = binedges(DiscretizeMODL_Greedy(),continuous_data,class_data)

The definitions of 'continuous_data', 'class_data', and 'edges' are the same as in the example for Algorithm 1.

Algorithm 3: Post-greedy Algorithm

There are two stages in the post-greedy algorithm. The first stage is called exhaustive merge: we run the greedy algorithm unconditionally, i.e. without the condition $\Delta < 0$, until the discretization consists of only one interval, and memorize the best discretization encountered along the way. This discretization is typically a local minimum, and we try to escape from it in the second stage.

In the second stage, for each interval we consider not only the Merge policy but also three other policies: Split, MergeSplit, and MergeMergeSplit. They are illustrated below (figure quoted from Boulle (2006)):

That is to say, we insert into the priority queue not only the $\Delta$ of Merge for each interval, but also the $\Delta$ of Split, MergeSplit, and MergeMergeSplit. Then, as before, we iterate the second stage by dequeuing and updating until all the $\Delta$ values are non-negative. The first stage has runtime $O(n\log(n))$. The second stage, although each iteration takes $O(n)$, does not increase the runtime asymptotically because it converges in a few iterations.
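
As an illustration of one of these policies, here is a sketch of how the $\Delta$ of a Split can be evaluated for a single interval by scanning its possible inner boundaries. Again the names are ours, not the package's.

In [ ]:
logfact(x) = sum(log, 1:x; init=0.0)
logbinom(a, b) = logfact(a) - logfact(b) - logfact(a - b)
function interval_cost(c::Vector{Int})
    ni, J = sum(c), length(c)
    logbinom(ni + J - 1, J - 1) + logfact(ni) - sum(logfact, c; init=0.0)
end

# labels: class values of the instances inside one interval, in attribute-sorted order;
# n is the total number of instances and I the current number of intervals
function best_split_delta(labels::Vector{Int}, J::Int, n::Int, I::Int)
    total = zeros(Int, J)
    for y in labels; total[y] += 1; end
    old = interval_cost(total)
    Δstruct = logbinom(n + I, I) - logbinom(n + I - 1, I - 1)   # the interval count grows to I+1
    left = zeros(Int, J)
    best_Δ, best_pos = Inf, 0
    for pos in 1:length(labels)-1                               # split after position `pos`
        left[labels[pos]] += 1
        Δ = Δstruct + interval_cost(left) + interval_cost(total .- left) - old
        if Δ < best_Δ
            best_Δ, best_pos = Δ, pos
        end
    end
    return best_Δ, best_pos
end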

Here is how Algorithm 3 works in the package:


In [ ]:
edges = binedges(DiscretizeMODL_PostGreedy(),continuous_data,class_data)

The definitions of 'continuous_data', 'class_data', and 'edges' are the same as in the example for Algorithm 1.

Algorithm 4: Post-greedy Algorithm with a Constraint on the Number of Bins

The previous algorithms place no restriction on the number of bins. In practice, however, we may want the number of bins to stay small, since large bin counts greatly increase the cost of learning and inference in a Bayesian network. Algorithm 4 is adapted from Algorithm 3 and limits the number of bins. It has two stages. Assume the bin limit is $m$. In the first stage, we initialize the discretization of the continuous attribute with the uniform-count algorithm using $m$ bins. In the second stage, we improve the discretization by applying the four policies described in Algorithm 3. Note that the Split policy increases the number of bins, so once the current number of bins equals the limit, Split is excluded from the next dequeue step.

        Initialization:

                Set the initial discretization with the uniform-count algorithm using $m$ bins

                Compute the objective value of this initial discretization

                Compute the $\Delta$ values related to all possible policies

                Insert all the $\Delta$ values into a priority queue

        Optimization: Repeat the following process

                Dequeue a $\Delta$ value and the corresponding policy

                If this $\Delta$ < 0

                        Update the current objective function by adding this $\Delta$.

                        Update the priority queue by recomputing the $\Delta$ values of the new interval and its neighbors

                        If current bin number is equal to $m$

                                All 'Split' methods are set to have infinite $\Delta$ value


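Usage mirrors Algorithm 3. As a sketch, assuming the post-greedy constructor accepts the maximum bin count as an argument (check the package documentation for the exact signature):

In [ ]:
# hypothetical: pass the bin limit m (here 5) to the post-greedy discretizer
edges = binedges(DiscretizeMODL_PostGreedy(5), continuous_data, class_data)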