MODL is a Bayes-optimal algorithm for the supervised discretization of a continuous attribute based on its discrete class values. The resulting discretization maximizes the posterior probability:
$$ P(Model\,|\, Data) \propto P(Data\,|\, Model)\, P(Model) $$

In this README, we first demonstrate by example how to compute the posterior probability of a discretization, then show the formal mathematical derivation and the three algorithms behind MODL that obtain the optimal solution.
Citation
The MODL implementations are based on the original paper by Marc Boulle, "MODL: a Bayes Optimal Discretization Method for Continuous Attributes", published in Machine Learning in 2006.
Example and Demonstration
Consider a toy problem in which the action a driver takes depends on the car's speed. We have data as follows:
Speed (km/hr) | Action (1 = Decelerate, 2 = Coast, 3 = Accelerate) |
---|---|
200.0 | 1 |
100.5 | 2 |
30.8 | 3 |
90.5 | 2 |
150.2 | 1 |
80.7 | 1 |
50.9 | 3 |
... | ... |
Given the underlying distribution of speed instances, we would like to leverage the class probabilities to obtain bins and a conditional probability table over the actions in each bin.
The resulting bins and class ratios are used to compute $P(Data\,|\, Model)$, while $P(Model)$ is computed from the prior over discretization structures. Multiplying these two terms gives the posterior probability of the discretization model, up to a normalization constant.
Mathematical Expression of the Objective Function
Before writing down the equations, we define the variables: $n$, the number of instances; $J$, the number of class values; $I$, the number of intervals in the discretization of the continuous attribute; $n_i$, the number of instances in interval $i$; and $n_{ij}$, the number of instances of class $j$ in interval $i$. The objective function of each discretization can then be expressed in terms of these variables.
For convenience, in the algorithm we compute the objective function as $\log(1/P(Model)) + \log(1/P(Data\,|\, Model))$, the negative log of the unnormalized posterior, so that maximizing the posterior becomes minimizing the objective. The first term, as mentioned above, is based on the three-stage prior distribution:

1. the number of intervals $I$ is uniformly distributed between $1$ and $n$;
2. given $I$, every division of the $n$ instances into $I$ intervals is equiprobable;
3. given an interval of $n_i$ instances, every distribution of its $J$ class frequencies is equiprobable.

Thus the first term of the objective function can be written as:
$$ \log(n) + \log{{n+I-1}\choose{I-1}} + \sum_{i=1}^{I}{\log{{n_i+J-1}\choose{J-1}}} $$

The second term of the objective function is simply:
$$ \sum_{i=1}^{I} \log{\frac{n_i!}{n_{i,1}!\,n_{i,2}!\cdots n_{i,J}!}} $$

In summary, the objective function of each discretization is:
$$ \log(n) + \log{{n+I-1}\choose{I-1}} + \sum_{i=1}^{I}{\log{{n_i+J-1}\choose{J-1}}} + \sum_{i=1}^{I} \log{\frac{n_i!}{n_{i,1}!\,n_{i,2}!\cdots n_{i,J}!}} $$
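To make the objective concrete, here is a minimal Julia sketch that evaluates it for a given discretization, represented by its per-interval class counts $n_{ij}$. The names `modl_cost`, `logbinomial`, and `logfactorial` are illustrative helpers, not part of the Discretizers.jl API.

```julia
# Sketch: MODL objective (negative log posterior) for a discretization,
# given an I×J matrix `counts` with counts[i, j] = n_ij.
function modl_cost(counts::AbstractMatrix{Int})
    I, J = size(counts)
    n = sum(counts)
    cost = log(n)                               # choice of the number of intervals
    cost += logbinomial(n + I - 1, I - 1)       # choice of the interval bounds
    for i in 1:I
        nᵢ = sum(counts[i, :])
        cost += logbinomial(nᵢ + J - 1, J - 1)  # class distribution in interval i
        cost += logfactorial(nᵢ)                # log(nᵢ! / ∏ⱼ nᵢⱼ!), the likelihood term
        for j in 1:J
            cost -= logfactorial(counts[i, j])
        end
    end
    return cost
end

# log-factorial and log-binomial via sums of logs, to avoid overflow
logfactorial(k) = sum(log, 1:k; init = 0.0)
logbinomial(a, b) = logfactorial(a) - logfactorial(b) - logfactorial(a - b)
```

For example, `modl_cost([10 0 0; 0 8 2])` evaluates the objective for a two-interval discretization of $n = 20$ instances with $J = 3$ classes.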
Three MODL Algorithms in Discretizers.jl

We provide three algorithms that minimize the objective function. The first, implemented with dynamic programming, always returns the optimal discretization, but its $O(n^3)$ runtime is much higher than that of the other two. The second is a greedy heuristic; it takes only $O(n\log(n))$ but may get stuck in a local minimum. The third is a post-greedy algorithm: as its name indicates, it improves the result of the second algorithm with further refinement steps, and its runtime is still asymptotically $O(n\log(n))$, although it takes more time than the second.
Algorithm 1: Dynamic Programming
We follow the observation that if a partition of $S$, the instances sorted by the continuous attribute, into $I$ intervals $S_1, S_2, \ldots, S_I$ is an optimal discretization, then the partition of $S - S_1$ into the $I-1$ intervals $S_2, \ldots, S_I$ is an optimal discretization of $S - S_1$.
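In terms of the objective, this gives a recurrence. Writing $c(j,k)$ for the minimal sum of per-interval terms when the first $j$ instances are cut into $k$ intervals, and $cost(i{+}1, j)$ for the per-interval terms of a single interval containing instances $i+1$ through $j$ (notation introduced here for illustration):

$$ c(j,k) = \min_{k-1 \,\leq\, i \,\leq\, j-1} \left[\, c(i, k-1) + cost(i{+}1, j) \,\right] $$

The $I$-dependent prior terms $\log(n) + \log{{n+I-1}\choose{I-1}}$ are added once the candidate interval counts are compared.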
Thus, we have the algorithm:
Let $S^{i,j}$ be the subsequence of $S$ consisting of the instances $i$ through $j$.
Let Disc($S^{i,j}$, $k$) be the optimal discretization of $S^{i,j}$ into exactly $k$ intervals.
For each $k$, $1 \leq k \leq n$:
For each $j$, $k \leq j \leq n$:
If $k = 1$, Disc($S^{1,j}$, $1$) = {$S^{1,j}$}.
If $k > 1$, Disc($S^{1,j}$, $k$) is obtained by minimizing the objective function over all discretizations Disc($S^{1,i}$, $k-1$) ∪ {$S^{i+1,j}$} for $k-1 \leq i \leq j-1$.
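Below is a minimal Julia sketch of this dynamic program. It assumes a helper `interval_cost(i, j)` returning the per-interval terms of the objective for instances $i$ through $j$ (hypothetical, not part of the package), and reuses `logbinomial` from the sketch above.

```julia
# Sketch: O(n^3) dynamic program over prefixes of the sorted instances.
# cost[j, k] = minimal sum of per-interval terms for instances 1..j in k intervals.
function modl_dp(n::Int, interval_cost::Function)
    cost  = fill(Inf, n, n)
    split = zeros(Int, n, n)              # split[j, k]: end of the (k-1)-interval prefix
    for j in 1:n
        cost[j, 1] = interval_cost(1, j)  # a single interval covering 1..j
    end
    for k in 2:n, j in k:n
        for i in (k-1):(j-1)              # last interval covers instances i+1..j
            c = cost[i, k-1] + interval_cost(i + 1, j)
            if c < cost[j, k]
                cost[j, k], split[j, k] = c, i
            end
        end
    end
    # add the I-dependent prior terms, then pick the best interval count
    total = [log(n) + logbinomial(n + k - 1, k - 1) + cost[n, k] for k in 1:n]
    return argmin(total), cost, split
end
```

The optimal bin boundaries can then be recovered by walking `split` backward from `(n, argmin(total))`.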
Here is how Algorithm 1 works in the package:
`edges = binedges(DiscretizeMODL_Optimal(), continuous_data, class_data)`
`continuous_data` is an array containing the continuous feature we want to discretize, `class_data` is an array containing the discrete class value for each instance in `continuous_data`, and `edges` is the resulting array of bin boundaries.
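As a minimal end-to-end sketch with the toy data above (assuming Discretizers.jl is installed; sorting by the continuous attribute first is a safe precaution, since MODL operates on sorted instances):

```julia
using Discretizers

# toy speed/action data from the table above
continuous_data = [200.0, 100.5, 30.8, 90.5, 150.2, 80.7, 50.9]
class_data      = [1, 2, 3, 2, 1, 1, 3]

# sort the instances by speed, keeping the class labels aligned
p = sortperm(continuous_data)
edges = binedges(DiscretizeMODL_Optimal(), continuous_data[p], class_data[p])
```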
Algorithm 2: Greedy Heuristic
In the greedy bottom-up algorithm, we first treat each single instance as its own bin and then search for the best merge between adjacent intervals. A merge is performed if it decreases the objective function, and this process is repeated until no further merge decreases the objective. We call the change in the objective function caused by a merge $\Delta$. All $\Delta$ values are stored in a priority queue; in each iteration we dequeue the smallest $\Delta$ and its corresponding merge. After the merge is performed, we update the $\Delta$ values of the new interval and its neighbours to prepare for the next dequeue step.
Initialization:
- Create a bin for each instance of the continuous attribute.
- Compute the objective value of this initial discretization.
- Compute the $\Delta$ values of all possible merges.
- Sort all the $\Delta$ values into a priority queue.

Optimization of the discretization: repeat the following steps.
- Dequeue a $\Delta$ value and the corresponding merge.
- If this $\Delta < 0$:
  - Perform the merge and update the current objective function by adding this $\Delta$.
  - Update the priority queue by computing the $\Delta$ values of the new interval and its neighbours.
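Here is a minimal Julia sketch of the merge loop. For clarity it re-scans adjacent pairs with `findmin` instead of maintaining a priority queue, which makes this sketch $O(n^2)$; the heap-based version described above achieves $O(n\log(n))$. `merge_delta` is a hypothetical helper returning the change in the objective if bins $b$ and $b+1$ are merged.

```julia
# Greedy bottom-up merging; `bins` holds the per-bin class-count vectors.
function greedy_merge!(bins::Vector{Vector{Int}}, merge_delta::Function)
    while length(bins) > 1
        Δ, a = findmin([merge_delta(bins, b) for b in 1:length(bins)-1])
        Δ >= 0 && break                  # no merge improves the objective
        bins[a] = bins[a] .+ bins[a+1]   # merge bin a+1 into bin a
        deleteat!(bins, a + 1)
    end
    return bins
end
```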
Here is how Algorithm 2 works in the package:
`edges = binedges(DiscretizeMODL_Greedy(), continuous_data, class_data)`
The definitions of `continuous_data`, `class_data`, and `edges` are the same as in the example for Algorithm 1.
Algorithm 3: Post-greedy Algorithm
There are two stages in the post-greedy algorithm. The first stage is an exhaustive merge: we run the greedy algorithm unconditionally, i.e., without the condition $\Delta < 0$, until the discretization consists of a single interval, and memorize the best discretization encountered along the way. This discretization may be a local minimum, which we try to escape in the second stage.
In the second stage, we consider not only the Merge policy but also three other policies for each interval: Split, MergeSplit, and MergeMergeSplit (illustrated in a figure in Boulle, Marc (2006)).
That is to say, we sort into the priority queue not only the $\Delta$ of Merge for each interval, but also the $\Delta$ of Split, MergeSplit, and MergeMergeSplit. Then, again, we iterate the second stage by dequeuing and updating until all the $\Delta$ values are non-negative. The first stage has runtime $O(n\log(n))$. The second stage, although each iteration takes $O(n)$, does not increase the runtime asymptotically because it converges in a few iterations.
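As an illustration of how a policy's $\Delta$ can be computed, here is a Julia sketch for Split. It scans the candidate split positions of one interval and returns the best change in the per-interval terms of the objective; `interval_cost` is a hypothetical helper returning those terms for a class-count vector, and the caller must also account for the change in the $I$-dependent prior terms, since a Split increases the interval count by one.

```julia
# Best Δ (per-interval terms only) over all ways to split one interval in two.
# `labels` holds the class label of each instance in the interval, in order.
function best_split_delta(labels::Vector{Int}, J::Int, interval_cost::Function)
    left, right = zeros(Int, J), zeros(Int, J)
    foreach(c -> right[c] += 1, labels)
    base  = interval_cost(right)         # cost of the unsplit interval
    bestΔ = Inf
    for s in 1:length(labels)-1          # split after position s
        left[labels[s]]  += 1
        right[labels[s]] -= 1
        bestΔ = min(bestΔ, interval_cost(left) + interval_cost(right) - base)
    end
    return bestΔ
end
```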
Here is how Algorithm 3 works in the package:
`edges = binedges(DiscretizeMODL_PostGreedy(), continuous_data, class_data)`
The definitions of `continuous_data`, `class_data`, and `edges` are the same as in the example for Algorithm 1.
Algorithm 4: Post-greedy Algorithm with a Constraint on the Number of Bins
In the previous algorithms, we place no restriction on the number of bins. In practice, however, we may wish to keep the number of bins small, since it exponentially affects the runtime of learning and inference in a Bayesian network. Algorithm 4 is adapted from Algorithm 3 and limits the number of bins. It has two stages; assume the bin limit is $m$. In the first stage, we simply initialize the discretization of the continuous attribute with a uniform-count algorithm using $m$ bins. In the second stage, we improve the discretization by applying the four policies described in Algorithm 3. Note that the Split policy increases the bin count, so once the current bin count equals the limit, Split is banned from the next dequeue step.
Initialization:
- Set the discretization with a uniform-count algorithm and limit $m$.
- Compute the objective value of this initial discretization.
- Compute the $\Delta$ values of all possible policies.
- Sort all the $\Delta$ values into a priority queue.

Optimization: repeat the following steps.
- Dequeue a $\Delta$ value and the corresponding policy.
- If this $\Delta < 0$:
  - Apply the policy and update the current objective function by adding this $\Delta$.
  - Update the priority queue by computing the $\Delta$ values of the new intervals.
- If the current bin count equals $m$:
  - All Split policies are set to have infinite $\Delta$ value.
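The first-stage initialization can be sketched as follows; this is a self-contained illustration, and the package's own uniform-count discretizer may differ in tie-breaking details.

```julia
# Sketch: cut a continuous attribute into m bins of (roughly) equal count.
function uniform_count_edges(x::Vector{Float64}, m::Int)
    xs = sort(x)
    n  = length(xs)
    idx   = round.(Int, range(0, n; length = m + 1))         # cumulative-count cuts
    inner = [(xs[idx[k]] + xs[idx[k] + 1]) / 2 for k in 2:m]  # midpoints between neighbours
    return [xs[1]; inner; xs[end]]                            # edges include the extremes
end
```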