Title: Variance Thresholding Binary Features
Slug: variance_thresholding_binary_features
Summary: How to select the best binary features for machine learning using variance thresholding in Python.
Date: 2017-09-14 12:00
Category: Machine Learning
Tags: Feature Selection
Authors: Chris Albon

Preliminaries


In [1]:
from sklearn.feature_selection import VarianceThreshold

Load Data


In [2]:
# Create feature matrix with: 
# Feature 0: 80% class 0
# Feature 1: 80% class 1
# Feature 2: 60% class 0, 40% class 1
X = [[0, 1, 0],
     [0, 1, 1],
     [0, 1, 0],
     [0, 1, 1],
     [1, 0, 0]]

Conduct Variance Thresholding

For binary features (i.e., Bernoulli random variables), the variance is calculated as:

$$\operatorname{Var}(x) = p(1-p)$$

where $p$ is the proportion of observations belonging to class 1. Therefore, by choosing a variance threshold based on $p$, we can remove features where the vast majority of observations belong to a single class.
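
As a quick check of the math (a minimal sketch that assumes NumPy is available, since it is not imported in the preliminaries above), we can compute $p(1-p)$ for each feature by hand:

import numpy as np

# Proportion of class-1 observations in each feature (column)
p = np.mean(X, axis=0)          # array([0.2, 0.8, 0.4])

# Bernoulli variance p(1 - p) of each feature
variances = p * (1 - p)         # array([0.16, 0.16, 0.24])

# The threshold used below is .75 * (1 - .75) = 0.1875, so only
# Feature 2 (variance 0.24) exceeds it and will be kept.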


In [3]:
# Remove features with variance below the threshold (.75 * (1 - .75))
thresholder = VarianceThreshold(threshold=(.75 * (1 - .75)))
thresholder.fit_transform(X)


Out[3]:
array([[0],
       [1],
       [0],
       [1],
       [0]])
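
If it helps to see why the first two features were dropped, the fitted thresholder exposes the per-feature variances and the boolean support mask (a short sketch using standard scikit-learn attributes):

# Variance of each original feature, as computed during fitting
thresholder.variances_
# array([0.16, 0.16, 0.24])

# Boolean mask of the features that passed the threshold
thresholder.get_support()
# array([False, False,  True])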