Evaluation

  • Holdout method (2/3 for training, 1/3 for testing)
  • Repeated holdout
  • Cross-validation
    • k-fold cross-validation
    • stratified k-fold cross-validation (equal representation of classes in each fold)
    • leave-one-out
  • Bootstrap (sample with replacement)

Most commonly used: 10-fold cross-validation
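
A minimal sketch of 10-fold (stratified) cross-validation with scikit-learn; the data set and classifier below are placeholders, not part of the notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder data and model -- substitute your own
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# 10 stratified folds: each fold keeps (approximately) the class proportions of the full set
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)   # one accuracy score per fold

print(scores.mean(), scores.std())
```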



Preprocessing Steps

Data matrix: rows are instances/items/patterns and columns are features/attributes.

Statistical Noise

Common rule for flagging outliers: values outside $\text{mean} \pm 3\times \text{std}$
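
A small NumPy sketch of this 3-sigma rule; the data here is synthetic, with one extreme value injected purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=1.0, size=1000)   # synthetic feature values
x[10] = 50.0                                     # inject one extreme value

# Flag values more than 3 standard deviations away from the mean
mean, std = x.mean(), x.std()
is_outlier = np.abs(x - mean) > 3 * std

x_clean = x[~is_outlier]
print(f"removed {is_outlier.sum()} outlier(s), kept {x_clean.size} values")
```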

Dealing with missing values

  • Get rid of them

    • Deleting the instances (records)
    • Deleting features with many missing values
  • Replacement with mean values

  • Inference / imputation (predicting the missing value from the other attributes)
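
A small pandas sketch of the first strategies above (deleting instances or features, and mean replacement), on a toy data frame made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data frame; NaN marks a missing value
df = pd.DataFrame({
    "age":    [22, 27, 34, 45],
    "income": [30, np.nan, 52, 61],
})

drop_rows = df.dropna()           # delete instances (records) with missing values
drop_cols = df.dropna(axis=1)     # delete features containing missing values
mean_fill = df.fillna(df.mean())  # replace missing values with the column mean
```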

Feature Transformation

  • Combining variables
  • Scaling data (see the sketch after this list)
    • Mean centering: $x_{new} = x - \text{mean}(x)$
    • Z-score (mean centering and standardization): $x_{new} = \frac{x - \text{mean}(x)}{\text{std}(x)}$
    • Scale to the range $[0, 1]$: $x_{new} = \frac{x - \min(x)}{\max(x) - \min(x)}$
    • Log-scaling: $x_{new} = \log(x)$
  • Discretization
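
A minimal NumPy sketch of the scaling transformations listed above, applied to a made-up feature vector.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])   # made-up feature values

centered = x - x.mean()                          # mean centering
z_scored = (x - x.mean()) / x.std()              # z-score standardization
min_max  = (x - x.min()) / (x.max() - x.min())   # scale to the range [0, 1]
logged   = np.log(x)                             # log-scaling (requires x > 0)
```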

Feature selection

  • Removing features with a low correlation to the outcome
  • Forward selection: start from an empty set of features and add one feature at a time, testing performance on a held-out sample set (a minimal sketch follows)
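
A minimal sketch of this greedy forward-selection procedure. The data set and classifier are placeholders, and cross-validated accuracy is used as the performance score (a detail not specified in the notes).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Placeholder data and model -- substitute your own
X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

selected, best_score = [], 0.0
while len(selected) < n_features:
    best_candidate = None
    for f in range(n_features):
        if f in selected:
            continue
        score = cross_val_score(GaussianNB(), X[:, selected + [f]], y, cv=5).mean()
        if score > best_score:
            best_score, best_candidate = score, f
    if best_candidate is None:   # no remaining feature improves the score: stop
        break
    selected.append(best_candidate)

print("selected feature indices:", selected)
```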

Statistical Modeling

Basic assumptions:

  • Variables are equally important
  • Variables are statistically independent

Statistical Independence: knowledge about the value of one attribute doesn't tell us anything about the values of the other attributes.

Naive Bayes Rule

  • $\text{Pr}(\text{Hypothesis}\ |\ \text{Evidence})=\displaystyle \frac{\text{Pr}(\text{Evidence} \ | \ \text{Hypothesis})\times \text{Pr}(\text{Hypothesis})}{\text{Pr}(\text{Evidence})}$

Naive Bayes assumption: the evidence terms are conditionally independent of each other. As a result, the evidence can be split into independent parts:

$$\text{Pr}(\text{Evidence} \ | \ \text{Hypothesis}) = \text{Pr}(\text{E}_1 \ | \ \text{Hypothesis}) \times \text{Pr}(\text{E}_2 \ | \ \text{Hypothesis}) \times \dots \times \text{Pr}(\text{E}_d \ | \ \text{Hypothesis})$$
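
A tiny pure-Python sketch of this factorization: the posterior for each hypothesis is proportional to the prior times the product of per-attribute conditionals. The probability numbers below are made up purely for illustration.

```python
# Made-up priors and per-attribute conditional probabilities
priors = {"H1": 0.4, "H2": 0.6}
likelihoods = {
    "H1": {"E1=a": 0.75, "E2=b": 0.20},
    "H2": {"E1=a": 0.10, "E2=b": 0.50},
}
evidence = ["E1=a", "E2=b"]

# Pr(H) * Pr(E1 | H) * Pr(E2 | H) * ... for each hypothesis
unnormalized = {}
for h, prior in priors.items():
    p = prior
    for e in evidence:
        p *= likelihoods[h][e]   # naive conditional-independence assumption
    unnormalized[h] = p

# Divide by Pr(Evidence) (the sum over hypotheses) to get the posteriors
total = sum(unnormalized.values())
posterior = {h: p / total for h, p in unnormalized.items()}
print(posterior)
```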

Decision Tree Induction

  • A method for estimating discrete-valued functions
  • Robust to noise and missing data
  • Applicable to non-linear relationships

    #### Basic Framework

    • Each node tests an attribute
    • Each branch corresponds to a possible attribute value
    • Each leaf node assigns a class
  • Training a decision tree aims to use the smallest set of attributes needed to fit the data

    #### Construction of a decision tree

    • The first attribute is selected as the root node based on a given criterion, and branches are created for its values
    • The instances are split into subsets according to the condition at the node (divide & conquer)
    • The procedure is repeated recursively for each branch, creating new nodes and dividing the data into subsets
    • Stop when all the instances at a node have the same class (pure node = leaf node), or when we run out of attributes on the path from the root to the node.

    #### Algorithmic Procedure

    • Loop while not done (a minimal recursive sketch follows this list)
      • At each step, pick the best attribute to split the data
      • For each value of the selected attribute, create branches (child nodes)
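
A minimal ID3-style sketch of this recursive procedure. The row representation (tuples whose last element is the class label) and the tuple return value are choices made here for illustration; `best_attribute` stands in for the selection criterion discussed in the next section.

```python
def build_tree(rows, attributes, best_attribute):
    """Recursively grow a decision tree; each row is a tuple ending with the class label."""
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1:                      # pure node -> leaf
        return labels[0]
    if not attributes:                             # out of attributes -> majority class
        return max(set(labels), key=labels.count)
    attr = best_attribute(rows, attributes)        # pick the best attribute (e.g. by gain)
    remaining = [a for a in attributes if a != attr]
    children = {}
    for value in {row[attr] for row in rows}:      # one branch per attribute value
        subset = [row for row in rows if row[attr] == value]
        children[value] = build_tree(subset, remaining, best_attribute)
    return (attr, children)
```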

    ### Criterion for selecting the best attribute

Information gain

Define entropy (logarithms are base 2): $$\text{entropy}(p_1, p_2, \dots, p_n) = -p_1\log p_1 - p_2 \log p_2 - \dots - p_n \log p_n$$

  • Total entropy: the weighted average of the entropies of the child nodes
  • Information Gain: the entropy before splitting at a node minus the total entropy after splitting $$\text{Gain} = \text{entropy}_\text{before splitting} - \text{entropy}_\text{after splitting}$$
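
A short Python sketch of these two quantities, assuming each data row is a tuple whose last element is the class label.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (base-2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(data, attr_index):
    """Entropy before splitting minus the weighted entropy after splitting on one attribute."""
    before = entropy([row[-1] for row in data])
    after = 0.0
    for value in {row[attr_index] for row in data}:
        subset = [row[-1] for row in data if row[attr_index] == value]
        after += len(subset) / len(data) * entropy(subset)
    return before - after
```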

    #### Gini Index

    ### Overfitting in Decision Trees

    #### Gain Ratio

    Information Gain divided by intrinsic information

  • Intrinsic information: the entropy of an attribute's value distribution (not of the class)

  • The intrinsic information of attribute $v_a$ is higher than that of $v_b$ if $v_a$ has a larger number of possible values
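
A short sketch of the gain ratio, reusing the `entropy` and `information_gain` helpers sketched above; the intrinsic information is simply the entropy of the attribute's own values.

```python
def intrinsic_information(data, attr_index):
    """Entropy of the attribute's value distribution (not of the class)."""
    return entropy([row[attr_index] for row in data])

def gain_ratio(data, attr_index):
    return information_gain(data, attr_index) / intrinsic_information(data, attr_index)
```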

    ### Pruning Decision Trees

    #### Pre-pruning

    #### Post-pruning


  • Example

    | Index | Age | Marital Status | Highest Degree | Risk Factor |
    |-------|-----|----------------|----------------|-------------|
    | 1     | 22  | Single         | BS             | High        |
    | 2     | 27  | Married        | MS             | Low         |
    | 3     | 34  | Single         | MS             | High        |
    | 4     | 45  | Married        | PhD            | Low         |
    | 5     | 28  | Married        | PhD            | Low         |
    | 6     | 32  | Single         | BS             | High        |
    | 7     | 31  | Married        | PhD            | Low         |
    | 8     | 38  | Married        | BS             | High        |
    | 9     | 36  | Single         | MS             | Low         |
    | 10    | 41  | Married        | MS             | Low         |
  • Initial Entropy:

    $\text{entropy} = -\frac{4}{10}\log \frac{4}{10} - \frac{6}{10} \log \frac{6}{10} = 0.971$

  • Split on Marital Status:

    Child 1: Single (3H, 1L) $\text{entropy}=-\frac{3}{4}\log \frac{3}{4} - \frac{1}{4}\log \frac{1}{4} = 0.811$
    Child 2: Married (1H, 5L) $\text{entropy}=-\frac{1}{6}\log \frac{1}{6} - \frac{5}{6}\log \frac{5}{6} = 0.650$

    • Total entropy: $\text{entropy}_{tot}=\frac{4\times 0.811 + 6\times 0.650}{10} = 0.714$

    • Information Gain for this node: $\text{Gain} = 0.971 - 0.714 = 0.257$

  • Split on Degree:

    Child 1: BS (3H, 0L) $\text{entropy}=-\frac{3}{3}\log \frac{3}{3} = 0$
    Child 2: MS (1H, 3L) $\text{entropy}=-\frac{1}{4}\log \frac{1}{4} - \frac{3}{4}\log \frac{3}{4} = 0.811$
    Child 3: PhD (0H, 3L) $\text{entropy}=-\frac{3}{3}\log \frac{3}{3} = 0$

    • Total entropy: $\text{entropy}_{tot} = \frac{0 + 4\times 0.811 + 0}{10} = 0.324$
    • Information Gain for this node: $\text{Gain} = 0.971 - 0.324 = 0.647$
  • Therefore, splitting on Highest Degree results in the greater information gain ($0.647 > 0.257$), so it is selected for this node.

    In the next step, branches are created for each possible value of Highest Degree, and the procedure is repeated on the resulting subsets.
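
As a check, the same numbers can be recomputed with the `entropy` and `information_gain` helpers sketched earlier; the rows below encode the example table as (Marital Status, Highest Degree, Risk Factor). Small differences in the third decimal place come from rounding in the hand calculation above.

```python
data = [
    ("Single", "BS", "High"), ("Married", "MS", "Low"),
    ("Single", "MS", "High"), ("Married", "PhD", "Low"),
    ("Married", "PhD", "Low"), ("Single", "BS", "High"),
    ("Married", "PhD", "Low"), ("Married", "BS", "High"),
    ("Single", "MS", "Low"),  ("Married", "MS", "Low"),
]

print("Initial entropy:     ", round(entropy([row[-1] for row in data]), 3))  # ~0.971
print("Gain(Marital Status):", round(information_gain(data, 0), 3))           # ~0.256
print("Gain(Highest Degree):", round(information_gain(data, 1), 3))           # ~0.646
```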

