Lets say we have five classes: mammalas, reptiles, birds, fish, and amphibians, with the following class frequencies
| class | freq |
|---|---|
| mammals | 50 |
| reptiles | 100 |
| birds | 20 |
| fish | 30 |
| amphibians | 10 |
So, we should start with amphibians since ther are the least frequent class, (the hardest class to detetct). We keep the default class to be the most frequent class, i.e. reptiles.
amphibians --> birds --> fish --> mammals (default: reptiles)
At each step, we deal with a binary classification, positive means ??? and negative means evrything else.
Extended rule: $R_1: AB \rightarrow + $
$Gain(R_0, R_1) = p_1 \left[\log \frac{p_1}{p_1+n_1} - \log\frac{p_0}{p_0+n_0}\right]$
$p_0$: number of positive instances covered by rule $R_0$
$p_1$: number of positive instances covered by rule $R_1$
$n_0$: number of negative instances covered by rule $R_0$
$n_1$: number of negative instances covered by rule $R_1$
Rationale Gain is higher when the extended rule has high coverage and accuracy.
Why should we remove the instances covered by $R_1$?
Should we also remove the neative classes that are missclasisfied by $R_1$?
Yes
$R_1$: $p_1=12, n_1=3 $
$R_2$: without removing any instance: $p_2 = 8, n_2 = 4 $
$R_2'$: removing only the positive instances $p2' = 6 , n_2' = 4 $
$R_2''$: remove both positive and negative instances $p_2'' = 6 n_2''=2 $