Data matrix: rows are instances/items/patterns and columns are features/attributes.
Common rule for flagging outliers: values more than $3\times \text{std}$ away from the mean.
Two common ways of handling them (see the sketch below):
- Get rid of them
- Replace them with the attribute's mean value
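A minimal sketch of both options for a single numeric feature, assuming NumPy is available; the helper name `handle_outliers` and the synthetic data are illustrative:

```python
import numpy as np

def handle_outliers(values, strategy="remove"):
    """Flag values more than 3 standard deviations from the mean,
    then either drop them or replace them with the mean."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    is_outlier = np.abs(values - mean) > 3 * std
    if strategy == "remove":
        return values[~is_outlier]                 # get rid of them
    if strategy == "mean":
        return np.where(is_outlier, mean, values)  # replace with the mean
    raise ValueError("strategy must be 'remove' or 'mean'")

rng = np.random.default_rng(0)
feature = np.append(rng.normal(30, 5, size=200), 300.0)  # inject one extreme value
print(handle_outliers(feature, "remove").shape)  # (200,): the extreme value is dropped
print(handle_outliers(feature, "mean")[-1])      # extreme value replaced by the mean
```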
Basic assumptions:
Statistical independence: knowledge about the value of one attribute doesn't tell us anything about the value of other attributes.
Naive Bayes assumption: evidence terms are conditionally independent of each other. As a result, the evidence can be split into independent parts:
$$\text{Pr}(\text{Evidence} \ | \ \text{Hypothesis}) = \text{Pr}(\text{E}_1 \ | \ \text{Hypothesis}) \times \text{Pr}(\text{E}_2 \ | \ \text{Hypothesis}) \times \cdots \times \text{Pr}(\text{E}_d \ | \ \text{Hypothesis})$$

Applicable to non-linear relationships.
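As an illustration, a minimal sketch of this factorisation with made-up prior and conditional probability tables (the values below are hypothetical, not estimated from any data in these notes):

```python
# Hypothetical prior and conditional probability tables (illustrative values only).
priors = {"High": 0.4, "Low": 0.6}
conditionals = {
    "High": {"Single": 0.75, "BS": 0.75},  # Pr(E_i | Hypothesis = High)
    "Low":  {"Single": 0.17, "BS": 0.17},  # Pr(E_i | Hypothesis = Low)
}

def posterior(evidence, priors, conditionals):
    """Pr(H | E) up to normalization: Pr(H) * prod_i Pr(E_i | H)."""
    scores = {}
    for h, prior in priors.items():
        score = prior
        for e in evidence:
            score *= conditionals[h][e]  # conditional independence of evidence terms
        scores[h] = score
    total = sum(scores.values())
    return {h: s / total for h, s in scores.items()}

print(posterior(["Single", "BS"], priors, conditionals))
```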
#### Basic Framework
Training a decision tree aims to include the minimum set of attributes needed to fit the data.
#### Construction of a decision tree
#### Algorithmic Procedure
### Criterion for selecting the best attribute
Define entropy: $$\text{entropy}(p_1, p_2, \ldots, p_n) = -p_1\log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n$$
Information Gain: the entropy before splitting at a node minus the total entropy after splitting (the weighted average of the children's entropies) $$\text{Gain} = \text{entropy}_\text{before splitting} - \text{entropy}_\text{after splitting}$$
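A minimal sketch of both quantities computed from class counts, using base-2 logarithms to match the worked example below; the function names are illustrative:

```python
import math

def entropy(counts):
    """entropy(p_1, ..., p_n) with the p_i taken from class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Entropy before the split minus the weighted average entropy of the children."""
    n = sum(parent_counts)
    after = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - after

print(entropy([4, 6]))                             # ~0.971
print(information_gain([4, 6], [[3, 1], [1, 5]]))  # ~0.256
```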
#### Gini Index
### Overfitting in Decision Trees
#### Gain Ratio
Information Gain divided by the intrinsic information.
Intrinsic information: the entropy of an attribute's value distribution (not of the class).
The intrinsic information of attribute $v_a$ is higher than that of $v_b$ if $v_a$ has a larger number of possible values.
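A minimal sketch of the gain ratio under the same conventions as the entropy sketch above, where the children's sizes give the attribute's value distribution used for the intrinsic information:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, children_counts):
    """Information gain divided by the intrinsic information of the split."""
    n = sum(parent_counts)
    child_sizes = [sum(child) for child in children_counts]
    after = sum(size / n * entropy(child)
                for size, child in zip(child_sizes, children_counts))
    gain = entropy(parent_counts) - after
    intrinsic = entropy(child_sizes)  # entropy of the attribute values, not the class
    return gain / intrinsic if intrinsic > 0 else 0.0

print(gain_ratio([4, 6], [[3, 1], [1, 5]]))  # ~0.26
```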
### Pruning Decision Trees
#### Pre-pruning
#### Post-pruning
| Index | Age | Marital Status | Highest Degree | Risk Factor |
|---|---|---|---|---|
| 1 | 22 | Single | BS | High |
| 2 | 27 | Married | MS | Low |
| 3 | 34 | Single | MS | High |
| 4 | 45 | Married | PhD | Low |
| 5 | 28 | Married | PhD | Low |
| 6 | 32 | Single | BS | High |
| 7 | 31 | Married | PhD | Low |
| 8 | 38 | Married | BS | High |
| 9 | 36 | Single | MS | Low |
| 10 | 41 | Married | MS | Low |
Initial entropy (4 High, 6 Low, using base-2 logarithms):
$\text{entropy} = -\frac{4}{10}\log \frac{4}{10} - \frac{6}{10} \log \frac{6}{10} = 0.971$
Split on Marital Status:
Child 1: Single (3H, 1L)
$\text{entropy}=-\frac{3}{4}\log \frac{3}{4} - \frac{1}{4}\log \frac{1}{4} = 0.811$
Child 2: Married (1H, 5L) $\text{entropy}=-\frac{1}{6}\log \frac{1}{6} - \frac{5}{6}\log \frac{5}{6} = 0.650$
Total entropy: $\text{entropy}_{tot}=\frac{4\times 0.811 + 6\times 0.650}{10} = 0.715$
Information Gain for this node: $\text{Gain} = 0.971 - 0.715 = 0.256$
Split on Degree:
Child 1: BS (3H, 0L) $\text{entropy}=-\frac{3}{3}\log \frac{3}{3} = 0$
Child 2: MS (1H, 3L) $\text{entropy}=-\frac{1}{4}\log \frac{1}{4} - \frac{3}{4}\log \frac{3}{4} = 0.811$
Child 3: PhD (0H, 3L) $\text{entropy}=-\frac{3}{3}\log \frac{3}{3} = 0$
Total entropy: $\text{entropy}_{tot}=\frac{3\times 0 + 4\times 0.811 + 3\times 0}{10} = 0.325$
Information Gain for this node: $\text{Gain} = 0.971 - 0.325 = 0.646$
Therefore, splitting on Highest Degree results in greater information gain.
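As a check, the two gains can be reproduced from the class counts in the table above (a minimal sketch, counts listed as High, Low per child):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(child) / n * entropy(child) for child in children)

parent  = [4, 6]                     # 4 High, 6 Low
marital = [[3, 1], [1, 5]]           # Single, Married
degree  = [[3, 0], [1, 3], [0, 3]]   # BS, MS, PhD

print(round(information_gain(parent, marital), 3))  # 0.256
print(round(information_gain(parent, degree), 3))   # 0.646
```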
In the second step, we add a branch for each possible value of the selected attribute and repeat the procedure on the resulting subsets.