Feature Selection with R

Jaganadh Gopinadhan http://jaganadhg.in

Feature selection is one of the most important tasks in Machine Learning and Data Science. This notebook is a continuation of my notes on Feature Selection with sklearn. In this notebook we will discuss various feature selection utilities available in R and how to use them, with examples.

Boruta Algorithm and Package

One of the most widely used packages for feature selection in R is Boruta. This package wraps the randomForest package. A detailed note on the package and the algorithm is available in the paper "Feature Selection with the Boruta Package" [1]. I am not going to repeat that discussion here; instead, we will focus on usage.

We will use the Boston house price data here. Before starting the exercises, make sure that the required libraries are installed. To access the data we need the 'MASS' package, and for the feature selection we need the 'Boruta' package (installed with its 'Depends' and 'Suggests' dependencies). The install commands are collected below.
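
Here are the install commands for the packages used in this notebook ('randomForest', 'party', and 'mlbench' appear in later sections); run them once before executing the cells.

install.packages('MASS')
install.packages('Boruta', dependencies = c('Depends', 'Suggests'))
install.packages('randomForest')
install.packages('party')
install.packages('mlbench')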

First, let's load and examine the data.


In [26]:
library(MASS)
data(Boston)
head(Boston)


Out[26]:
  crim    zn indus chas nox   rm    age  dis    rad tax ptratio black  lstat medv
1 0.00632 18 2.31  0    0.538 6.575 65.2 4.09   1   296 15.3    396.9  4.98  24
2 0.02731 0  7.07  0    0.469 6.421 78.9 4.9671 2   242 17.8    396.9  9.14  21.6
3 0.02729 0  7.07  0    0.469 7.185 61.1 4.9671 2   242 17.8    392.83 4.03  34.7
4 0.03237 0  2.18  0    0.458 6.998 45.8 6.0622 3   222 18.7    394.63 2.94  33.4
5 0.06905 0  2.18  0    0.458 7.147 54.2 6.0622 3   222 18.7    396.9  5.33  36.2
6 0.02985 0  2.18  0    0.458 6.43  58.7 6.0622 3   222 18.7    394.12 5.21  28.7

The data contains 13 attributes and 'medv' is the target variable. Now let's try to find the feature importance.


In [2]:
library(MASS)
library(Boruta)

data(Boston)


boruta_feat_imp <- function(data,formula){
    #Compute feature importance with Boruta algorithm
    #:param data: dataframe containing the data
    #:param formula: formula for randomForest 
    #:returns imp_feats: returns report on feature importance
    imp_feats <- Boruta(formula,data=data, doTrace = 2, ntree = 500)
    return(imp_feats)
} 

feats <- boruta_feat_imp(Boston,medv ~ .)
feats


 1. run of importance source...
 2. run of importance source...
 3. run of importance source...
 4. run of importance source...
 5. run of importance source...
 6. run of importance source...
 7. run of importance source...
 8. run of importance source...
 9. run of importance source...
 10. run of importance source...
 11. run of importance source...
Confirmed 13 attributes: age, black, chas, crim, dis and 8 more.
Out[2]:
Boruta performed 11 iterations in 24.11364 secs.
 13 attributes confirmed important: age, black, chas, crim, dis and 8 more.
 No attributes deemed unimportant.
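
The returned Boruta object can be inspected further. Here is a minimal sketch using two helpers from the Boruta package, applied to the 'feats' object computed above: getSelectedAttributes() lists the confirmed features and attStats() summarises the importance statistics and decisions.

# Names of the attributes Boruta confirmed as important
getSelectedAttributes(feats)

# Per-attribute importance statistics (mean, median, min, max, hit rate)
# together with the final decision
attStats(feats)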

What Just Happened?

We loaded the Boston data set first. Then we defined a generic function which accepts a data set and a formula as arguments. The function passes the data and formula to the Boruta algorithm, which eventually invokes the randomForest package. Once the computation is finished, it returns the feature importance report. In the Boston case the algorithm found that all the features are important :-) . Now it is time to check the same with some other data; try the 'HouseVotes84' data from the 'mlbench' package, as sketched below.
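
Here is a minimal sketch of that exercise, assuming the 'mlbench' package is installed. Note that HouseVotes84 contains missing values, which Boruta does not accept, so the incomplete rows are dropped first.

library(mlbench)
library(Boruta)

data(HouseVotes84)

# Boruta cannot handle NA values, so keep only complete rows
votes <- na.omit(HouseVotes84)

# 'Class' (democrat / republican) is the target variable
vote_feats <- Boruta(Class ~ ., data = votes, doTrace = 2)
vote_feats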

Feature Selection with randomForest

Now let's see how we can use the randomForest package to compute the feature importance.


In [9]:
library(MASS)
library(randomForest)

data(Boston)


rf_feat_imp <- function(data,formula){
    #Compute feature importance with randomForest algorithm
    #:param data: dataframe containing the data
    #:param formula: formula for randomForest 
    #:returns imp_feats: returns report on feature importance
    imp_feats <- randomForest(formula,data=data, mtry=2, ntree = 500,importance=TRUE)
    imp_feats_res <- importance(imp_feats,type=1)
    return(imp_feats_res)
} 

feats <- rf_feat_imp(Boston,medv ~ .)
feats


Out[9]:
        %IncMSE
crim    18.09281
zn       6.570366
indus   13.27625
chas     5.204331
nox     17.64775
rm      32.02581
age     13.74005
dis     17.3245
rad      9.837306
tax     13.75862
ptratio 16.58041
black   12.58019
lstat   24.55349

What Just Happened?

Similar to the previous example, we created a generic function to compute the feature importance. The result is a one-column matrix with the features as row names and the percentage increase in MSE (for this regression example). If we pass type=2 to the importance function, it will report the increase in node impurity instead, as in the sketch below.
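
Here is a minimal sketch of the type=2 variant, reusing the Boston setup from above.

library(MASS)
library(randomForest)

data(Boston)

rf_model <- randomForest(medv ~ ., data = Boston, mtry = 2, ntree = 500,
                         importance = TRUE)

# type=2 reports the total decrease in node impurity (IncNodePurity)
# instead of the %IncMSE reported by type=1
importance(rf_model, type = 2)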

It is party time: Feature importance with the 'party' package

The next package we explore is 'party', which provides conditional inference forests via cforest() and conditional variable importance via varimp().


In [ ]:
library(party)
library(MASS)

data(Boston)


party_feat_imp <- function(data,formula){
    #Compute feature importance with party package
    #:param data: dataframe containing the data
    #:param formula: formula for cforest 
    #:returns imp_feats: returns report on feature importance
    imp_feats <- cforest(formula,data=data, control=cforest_unbiased(mtry=2,ntree=50))
    imp_feats_res <- varimp(imp_feats,conditional=TRUE)
    return(imp_feats_res)
} 

feats <- party_feat_imp(Boston,medv ~ .)
feats


Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric

Loading required package: sandwich
Note: this code takes considerable memory and time to finish; conditional variable importance with cforest is expensive to compute.
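
Once varimp() finishes, the conditional importance scores come back as a named numeric vector, so ranking the features is a one-liner; here is a small sketch using the 'feats' object computed above.

# Sort the conditional importance scores, highest first
sort(feats, decreasing = TRUE)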


References

[1] Miron B. Kursa and Witold R. Rudnicki, "Feature Selection with the Boruta Package", Journal of Statistical Software, Volume 36, Issue 11, September 2010. http://www.jstatsoft.org/v36/i11/paper