Evaluation

  • Holdout method (2/3 for training, 1/3 for testing)
  • Repeated holdout
  • Cross-validation
    • k-fold cross-validation
    • stratified k-fold cross-validation (equal representation of classes in each fold)
    • leave-one-out
  • Bootstrap (sample with replacement)

Most commonly used: 10-fold cross-validation
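
A minimal sketch of 10-fold (stratified) cross-validation with scikit-learn; the data set and classifier below are placeholders, not part of the notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder data and model -- substitute your own
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# 10 stratified folds: each fold keeps (approximately) the class proportions of the full set
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)   # one accuracy score per fold

print(scores.mean(), scores.std())
```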



Preprocessing Steps

Data matrix: rows are instances/items/patterns and columns are features/attributes.

Statistical Noise

Common rule for flagging outliers: values outside $\text{mean} \pm 3\times \text{std}$
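
A small NumPy sketch of this 3-sigma rule; the data here is synthetic, with one extreme value injected purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=1.0, size=1000)   # synthetic feature values
x[10] = 50.0                                     # inject one extreme value

# Flag values more than 3 standard deviations away from the mean
mean, std = x.mean(), x.std()
is_outlier = np.abs(x - mean) > 3 * std

x_clean = x[~is_outlier]
print(f"removed {is_outlier.sum()} outlier(s), kept {x_clean.size} values")
```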

Dealing with missing values

  • Get rid of them

    • Deleting the instances (records)
    • Deleting features with many missing values
  • Replacement with mean values

  • Inference / imputation (predicting the missing value from the other attributes)
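
A small pandas sketch of the first strategies above (deleting instances or features, and mean replacement), on a toy data frame made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data frame; NaN marks a missing value
df = pd.DataFrame({
    "age":    [22, 27, 34, 45],
    "income": [30, np.nan, 52, 61],
})

drop_rows = df.dropna()           # delete instances (records) with missing values
drop_cols = df.dropna(axis=1)     # delete features containing missing values
mean_fill = df.fillna(df.mean())  # replace missing values with the column mean
```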

Feature Transformation

  • Combining variables
  • Scaling data (see the sketch after this list)
    • Mean centering: $x_{new} = x - \text{mean}(x)$
    • Z-score (mean centering and standardization): $x_{new} = \frac{x - \text{mean}(x)}{\text{std}(x)}$
    • Scale to the range $[0, 1]$: $x_{new} = \frac{x - \min(x)}{\max(x) - \min(x)}$
    • Log-scaling: $x_{new} = \log(x)$
  • Discretization
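
A minimal NumPy sketch of the scaling transformations listed above, applied to a made-up feature vector.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])   # made-up feature values

centered = x - x.mean()                          # mean centering
z_scored = (x - x.mean()) / x.std()              # z-score standardization
min_max  = (x - x.min()) / (x.max() - x.min())   # scale to the range [0, 1]
logged   = np.log(x)                             # log-scaling (requires x > 0)
```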

Feature selection

  • Removing features with a low correlation to the outcome
  • Forward selection: start from an empty set of features and add one feature at a time, testing performance on a held-out sample set (a minimal sketch follows)
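
A minimal sketch of this greedy forward-selection procedure. The data set and classifier are placeholders, and cross-validated accuracy is used as the performance score (a detail not specified in the notes).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Placeholder data and model -- substitute your own
X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

selected, best_score = [], 0.0
while len(selected) < n_features:
    best_candidate = None
    for f in range(n_features):
        if f in selected:
            continue
        score = cross_val_score(GaussianNB(), X[:, selected + [f]], y, cv=5).mean()
        if score > best_score:
            best_score, best_candidate = score, f
    if best_candidate is None:   # no remaining feature improves the score: stop
        break
    selected.append(best_candidate)

print("selected feature indices:", selected)
```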

Statistical Modeling

Basic assumptions:

  • Variables are equally important
  • Variables are statistically independent

Statistical Independence: knowledge about the value of one attribute doesn't tell us anything about the values of the other attributes.

Naive Bayes Rule

  • $\text{Pr}(\text{Hypothesis}\ |\ \text{Evidence})=\displaystyle \frac{\text{Pr}(\text{Evidence} \ | \ \text{Hypothesis})\times \text{Pr}(\text{Hypothesis})}{\text{Pr}(\text{Evidence})}$

Naive Bayes assumption: the evidence terms are conditionally independent of each other. As a result, the evidence can be split into independent parts:

$$\text{Pr}(\text{Evidence} \ | \ \text{Hypothesis}) = \text{Pr}(\text{E}_1 \ | \ \text{Hypothesis}) \times \text{Pr}(\text{E}_2 \ | \ \text{Hypothesis}) \times \dots \times \text{Pr}(\text{E}_d \ | \ \text{Hypothesis})$$
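
A tiny pure-Python sketch of this factorization: the posterior for each hypothesis is proportional to the prior times the product of per-attribute conditionals. The probability numbers below are made up purely for illustration.

```python
# Made-up priors and per-attribute conditional probabilities
priors = {"H1": 0.4, "H2": 0.6}
likelihoods = {
    "H1": {"E1=a": 0.75, "E2=b": 0.20},
    "H2": {"E1=a": 0.10, "E2=b": 0.50},
}
evidence = ["E1=a", "E2=b"]

# Pr(H) * Pr(E1 | H) * Pr(E2 | H) * ... for each hypothesis
unnormalized = {}
for h, prior in priors.items():
    p = prior
    for e in evidence:
        p *= likelihoods[h][e]   # naive conditional-independence assumption
    unnormalized[h] = p

# Divide by Pr(Evidence) (the sum over hypotheses) to get the posteriors
total = sum(unnormalized.values())
posterior = {h: p / total for h, p in unnormalized.items()}
print(posterior)
```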

Decision Tree Induction

  • A method for estimating discrete-valued functions
  • Robust to noise and missing data
  • Applicable to non-linear relationships

    #### Basic Framework

    • Each node tests an attribute
    • Each branch corresponds to a possible attribute value
    • Each leaf node assigns a class
  • Training a decision tree aims to use the smallest set of attributes needed to fit the data

    #### Construction of a decision tree

    • The first attribute is selected as the root node based on a given criterion, and branches are created for its values
    • The instances are split into subsets according to the condition at the node (divide & conquer)
    • The procedure is repeated recursively for each branch, creating new nodes and dividing the data into subsets
    • Stop when all the instances at a node have the same class (pure node = leaf node), or when we run out of attributes on the path from the root to the node.

    #### Algorithmic Procedure

    • Loop while not done (a minimal recursive sketch follows this list)
      • At each step, pick the best attribute to split the data
      • For each value of the selected attribute, create branches (child nodes)
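
A minimal ID3-style sketch of this recursive procedure. The row representation (tuples whose last element is the class label) and the tuple return value are choices made here for illustration; `best_attribute` stands in for the selection criterion discussed in the next section.

```python
def build_tree(rows, attributes, best_attribute):
    """Recursively grow a decision tree; each row is a tuple ending with the class label."""
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1:                      # pure node -> leaf
        return labels[0]
    if not attributes:                             # out of attributes -> majority class
        return max(set(labels), key=labels.count)
    attr = best_attribute(rows, attributes)        # pick the best attribute (e.g. by gain)
    remaining = [a for a in attributes if a != attr]
    children = {}
    for value in {row[attr] for row in rows}:      # one branch per attribute value
        subset = [row for row in rows if row[attr] == value]
        children[value] = build_tree(subset, remaining, best_attribute)
    return (attr, children)
```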

    ### Criterion for selecting the best attribute

Information gain

Define entropy (logarithms are base 2): $$\text{entropy}(p_1, p_2, \dots, p_n) = -p_1\log p_1 - p_2 \log p_2 - \dots - p_n \log p_n$$

  • Total entropy: the weighted average of the entropies of the child nodes
  • Information Gain: the entropy before splitting at a node minus the total entropy after splitting $$\text{Gain} = \text{entropy}_\text{before splitting} - \text{entropy}_\text{after splitting}$$
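
A short Python sketch of these two quantities, assuming each data row is a tuple whose last element is the class label.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (base-2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(data, attr_index):
    """Entropy before splitting minus the weighted entropy after splitting on one attribute."""
    before = entropy([row[-1] for row in data])
    after = 0.0
    for value in {row[attr_index] for row in data}:
        subset = [row[-1] for row in data if row[attr_index] == value]
        after += len(subset) / len(data) * entropy(subset)
    return before - after
```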

    #### Gini Index

    ### Overfitting in Decision Trees

    #### Gain Ratio

    Information Gain divided by intrinsic information

  • Intrinsic information: the entropy of an attribute's value distribution (not of the class)

  • The intrinsic information of attribute $v_a$ is higher than that of $v_b$ if $v_a$ has a larger number of possible values
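
A short sketch of the gain ratio, reusing the `entropy` and `information_gain` helpers sketched above; the intrinsic information is simply the entropy of the attribute's own values.

```python
def intrinsic_information(data, attr_index):
    """Entropy of the attribute's value distribution (not of the class)."""
    return entropy([row[attr_index] for row in data])

def gain_ratio(data, attr_index):
    return information_gain(data, attr_index) / intrinsic_information(data, attr_index)
```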

    ### Pruning Decision Trees

    #### Pre-pruning

    #### Post-pruning


  • Example

    | Index | Age | Marital Status | Highest Degree | Risk Factor |
    |-------|-----|----------------|----------------|-------------|
    | 1     | 22  | Single         | BS             | High        |
    | 2     | 27  | Married        | MS             | Low         |
    | 3     | 34  | Single         | MS             | High        |
    | 4     | 45  | Married        | PhD            | Low         |
    | 5     | 28  | Married        | PhD            | Low         |
    | 6     | 32  | Single         | BS             | High        |
    | 7     | 31  | Married        | PhD            | Low         |
    | 8     | 38  | Married        | BS             | High        |
    | 9     | 36  | Single         | MS             | Low         |
    | 10    | 41  | Married        | MS             | Low         |
  • Initial Entropy:

    $\text{entropy} = -\frac{4}{10}\log \frac{4}{10} - \frac{6}{10} \log \frac{6}{10} = 0.971$

  • Split on Marital Status:

    Child 1: Single (3H, 1L) $\text{entropy}=-\frac{3}{4}\log \frac{3}{4} - \frac{1}{4}\log \frac{1}{4} = 0.811$
    Child 2: Married (1H, 5L) $\text{entropy}=-\frac{1}{6}\log \frac{1}{6} - \frac{5}{6}\log \frac{5}{6} = 0.650$

    • Total entropy: $\text{entropy}_{tot}=\frac{4\times 0.811 + 6\times 0.650}{10} = 0.714$

    • Information Gain for this node: $\text{Gain} = 0.971 - 0.714 = 0.257$

  • Split on Degree:

    Child 1: BS (3H, 0L) $\text{entropy}=-\frac{3}{3}\log \frac{3}{3} = 0$
    Child 2: MS (1H, 3L) $\text{entropy}=-\frac{1}{4}\log \frac{1}{4} - \frac{3}{4}\log \frac{3}{4} = 0.811$
    Child 3: PhD (0H, 3L) $\text{entropy}=-\frac{3}{3}\log \frac{3}{3} = 0$

    • Total entropy: $\text{entropy}_{tot} = \frac{0 + 4\times 0.811 + 0}{10} = 0.324$
    • Information Gain for this node: $\text{Gain} = 0.971 - 0.324 = 0.647$
  • Therefore, splitting on Highest Degree results in the greater information gain ($0.647 > 0.257$), so it is selected for this node.

    In the next step, branches are created for each possible value of Highest Degree, and the procedure is repeated on the resulting subsets.
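
As a check, the same numbers can be recomputed with the `entropy` and `information_gain` helpers sketched earlier; the rows below encode the example table as (Marital Status, Highest Degree, Risk Factor). Small differences in the third decimal place come from rounding in the hand calculation above.

```python
data = [
    ("Single", "BS", "High"), ("Married", "MS", "Low"),
    ("Single", "MS", "High"), ("Married", "PhD", "Low"),
    ("Married", "PhD", "Low"), ("Single", "BS", "High"),
    ("Married", "PhD", "Low"), ("Married", "BS", "High"),
    ("Single", "MS", "Low"),  ("Married", "MS", "Low"),
]

print("Initial entropy:     ", round(entropy([row[-1] for row in data]), 3))  # ~0.971
print("Gain(Marital Status):", round(information_gain(data, 0), 3))           # ~0.256
print("Gain(Highest Degree):", round(information_gain(data, 1), 3))           # ~0.646
```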

