Lesson 11 - Feature Selection

  • an algorithm will only be as good as the features that are put into it, so this is something we, as machine learners, need to spend time on
  • need a way to
    • select the best features
    • add new features

A New Feature

  • use my human intuition
  • code up the new feature
  • visualize
  • repeat

A New Enron Feature

  • we create new features for the number of emails from a POI to a person and from that person to a POI, and visualize them (as sketched below) to see whether there is any relationship
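
A minimal sketch of that count-based scatter plot, assuming the project's data_dict loaded from final_project_dataset.pkl and its usual field names; treat the file path and field names as assumptions if your setup differs.

```python
# Sketch: scatter the raw email counts, with POIs in red.
import pickle
import matplotlib.pyplot as plt

with open("final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

for name, record in data_dict.items():
    from_poi = record.get("from_poi_to_this_person")
    to_poi = record.get("from_this_person_to_poi")
    if from_poi == "NaN" or to_poi == "NaN":
        continue  # skip people with missing email counts
    color = "r" if record["poi"] else "b"
    plt.scatter(from_poi, to_poi, color=color)

plt.xlabel("emails from POI to this person")
plt.ylabel("emails from this person to POI")
plt.show()
```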

We made the POIs red so we can spot trends. We don't see any clear trend here. Katie's intuition says that maybe it is not the exact number of emails that matters; maybe the fraction of a person's emails that go to or come from a POI is what's important.
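
A hedged sketch of that fraction idea, reusing data_dict from the snippet above; the helper name compute_fraction and the "NaN" handling are illustrative, not the course's exact code.

```python
# Build fraction features: what share of a person's emails involve a POI.
def compute_fraction(poi_messages, all_messages):
    """Fraction of all_messages that involve a POI; 0 if counts are missing."""
    if poi_messages == "NaN" or all_messages == "NaN" or all_messages == 0:
        return 0.0
    return float(poi_messages) / float(all_messages)

for name, record in data_dict.items():
    record["fraction_from_poi"] = compute_fraction(
        record.get("from_poi_to_this_person", "NaN"),
        record.get("to_messages", "NaN"))
    record["fraction_to_poi"] = compute_fraction(
        record.get("from_this_person_to_poi", "NaN"),
        record.get("from_messages", "NaN"))
```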

Plotting those fractions, we can see that although we don't get a very tight cluster of POIs, they are not that spread out either. There are some regions with no POIs at all, which suggests that people who fall there are unlikely to be POIs.

Why might we want to ignore a feature?

  • it's noisy
  • it's causing overfitting
  • it is strongly correlated with another feature that is already present
  • additional features slow down the training/testing process

Features and information are not the same thing. We want information; features are just an attempt to get at it.

Univariate Feature Selection

There are several go-to methods of automatically selecting your features in sklearn. Many of them fall under the umbrella of univariate feature selection, which treats each feature independently and asks how much power it gives you in classifying or regressing.

There are two big univariate feature selection tools in sklearn: SelectPercentile and SelectKBest. The difference is pretty apparent by the names: SelectPercentile selects the X% of features that are most powerful (where X is a parameter) and SelectKBest selects the K features that are most powerful (where K is a parameter).
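
A quick illustration of both selectors on a synthetic dataset, just to show the API (the numbers here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=42)

# keep the 5 most powerful features
X_k = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# keep the top 10% of features (here, 2 of the 20)
X_p = SelectPercentile(score_func=f_classif, percentile=10).fit_transform(X, y)

print(X_k.shape, X_p.shape)   # (200, 5) (200, 2)
```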

A clear candidate for feature reduction is text learning, since the data has such high dimension. We actually did feature selection in the Sara/Chris email classification problem during the first few mini-projects; you can see it in the code in tools/email_preprocess.py.
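
Roughly the shape of what that preprocessing does (a sketch, not the exact code from email_preprocess.py): vectorize the text with TF-IDF, then keep only the top 10% of the word features. The toy documents and labels below are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

docs = ["the project deadline slipped again",
        "stock options vest next quarter",
        "lunch meeting moved to noon",
        "transfer the funds before the audit"]
labels = [0, 1, 0, 1]

# turn raw text into tf-idf word features (very high dimensional)
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
features = vectorizer.fit_transform(docs)

# keep only the most discriminating 10% of those word features
selector = SelectPercentile(f_classif, percentile=10)
features_reduced = selector.fit_transform(features, labels)
print(features.shape, "->", features_reduced.shape)
```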

Bias-Variance Dilemma and the Number of Features

High Bias

  • pays little attention to data
  • high error on training data
  • low r², large sum of squared errors

High Variance

  • pays too much attention to data (does not generalize well)
  • much higher error on test set than training set

There may be many features you need in order to fully describe the patterns in your data, but if you use too few of them you are only capturing part of the picture. This is a typical high-bias situation.

If we tune the algorithm very carefully, squeezing out every last bit of information in the training set, we end up in a high-variance situation: the fit doesn't generalize.

There is a trade-off between the goodness of the fit and the simplicity of the fit.

We want to use as few features as possible while still getting a large r² / low sum of squared errors.

Some algorithms can automatically find the sweet spot between the number of features and the quality of the model. This process is called regularisation.

Regularisation in regression

  • method for automatically penalising extra features

Lasso regression tries to minimize the SSE (sum of squared errors) while also penalising the size of the coefficients, which effectively penalises the number of features used.
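
In symbols (a standard way to write the Lasso objective; λ is the regularization strength and the βⱼ are the regression coefficients):

\min_{\beta}\; \sum_i \Big( y_i - \sum_j \beta_j x_{ij} \Big)^2 \;+\; \lambda \sum_j |\beta_j|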

A feature is only worth keeping if the decrease in SSE it provides outweighs the penalty its coefficient adds. For features that do not satisfy this, Lasso sets their coefficients to zero, effectively removing them from the model.
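
A hedged sketch with sklearn's Lasso on toy data where only a couple of features actually matter, to show coefficients being zeroed out (alpha and the data here are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
# only features 0 and 3 drive the target; the other 8 are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.randn(100)

regression = Lasso(alpha=0.1)   # alpha controls how hard extra features are penalised
regression.fit(X, y)

print(regression.coef_)                          # most entries are (near) zero
print("features kept:", np.flatnonzero(regression.coef_))
```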