Feature Selection with R

Jaganadh Gopinadhan http://jaganadhg.in

Feature selection is one of the most important tasks in Machine Learning and Data Science. This notebook is a continuation of my notes on Feature Selection with sklearn. In this notebook we will discuss various feature selection utilities available in R and how to use them, with examples.

Boruta Algorithm and Package

One of the most widely used packages for feature selection in R is Boruta. This package wraps the randomForest package. A detailed note on the package and the algorithm is available in the paper "Feature Selection with the Boruta Package" [1]. I am not going to repeat that discussion here; instead, we will focus on usage.

We will use the Boston house price data here. Before starting the exercises, make sure that the required libraries are installed. To access the data we need the 'MASS' package, and for the feature selection we need the 'Boruta' package (installed with its 'Depends' and 'Suggests' dependencies). The install commands are collected below.
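
Here are the install commands for the packages used in this notebook ('randomForest', 'party', and 'mlbench' appear in later sections); run them once before executing the cells.

install.packages('MASS')
install.packages('Boruta', dependencies = c('Depends', 'Suggests'))
install.packages('randomForest')
install.packages('party')
install.packages('mlbench')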

First, let's load and examine the data.


In [26]:
library(MASS)
data(Boston)
head(Boston)


Out[26]:
  crim    zn indus chas nox   rm    age  dis    rad tax ptratio black  lstat medv
1 0.00632 18 2.31  0    0.538 6.575 65.2 4.09   1   296 15.3    396.9  4.98  24
2 0.02731 0  7.07  0    0.469 6.421 78.9 4.9671 2   242 17.8    396.9  9.14  21.6
3 0.02729 0  7.07  0    0.469 7.185 61.1 4.9671 2   242 17.8    392.83 4.03  34.7
4 0.03237 0  2.18  0    0.458 6.998 45.8 6.0622 3   222 18.7    394.63 2.94  33.4
5 0.06905 0  2.18  0    0.458 7.147 54.2 6.0622 3   222 18.7    396.9  5.33  36.2
6 0.02985 0  2.18  0    0.458 6.43  58.7 6.0622 3   222 18.7    394.12 5.21  28.7

The data contains 13 attributes and 'medv' is the target variable. Now let's try to find the feature importance.


In [2]:
library(MASS)
library(Boruta)

data(Boston)


boruta_feat_imp <- function(data,formula){
    #Compute feature importance with Boruta algorithm
    #:param data: dataframe containing the data
    #:param formula: formula for randomForest 
    #:returns imp_feats: returns report on feature importance
    imp_feats <- Boruta(formula,data=data, doTrace = 2, ntree = 500)
    return(imp_feats)
} 

feats <- boruta_feat_imp(Boston,medv ~ .)
feats


 1. run of importance source...
 2. run of importance source...
 3. run of importance source...
 4. run of importance source...
 5. run of importance source...
 6. run of importance source...
 7. run of importance source...
 8. run of importance source...
 9. run of importance source...
 10. run of importance source...
 11. run of importance source...
Confirmed 13 attributes: age, black, chas, crim, dis and 8 more.
Out[2]:
Boruta performed 11 iterations in 24.11364 secs.
 13 attributes confirmed important: age, black, chas, crim, dis and 8 more.
 No attributes deemed unimportant.
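
The returned Boruta object can be inspected further. Here is a minimal sketch using two helpers from the Boruta package, applied to the 'feats' object computed above: getSelectedAttributes() lists the confirmed features and attStats() summarises the importance statistics and decisions.

# Names of the attributes Boruta confirmed as important
getSelectedAttributes(feats)

# Per-attribute importance statistics (mean, median, min, max, hit rate)
# together with the final decision
attStats(feats)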

What Just Happened?

We loaded the Boston data set first. Then we defined a generic function which accepts a data set and a formula as arguments. The function passes the data and formula to the Boruta algorithm, which eventually invokes the randomForest package. Once the computation is finished, it returns the feature importance report. In the Boston case the algorithm found that all the features are important :-) . Now it is time to check the same with some other data; try the 'HouseVotes84' data from the 'mlbench' package, as sketched below.
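
Here is a minimal sketch of that exercise, assuming the 'mlbench' package is installed. Note that HouseVotes84 contains missing values, which Boruta does not accept, so the incomplete rows are dropped first.

library(mlbench)
library(Boruta)

data(HouseVotes84)

# Boruta cannot handle NA values, so keep only complete rows
votes <- na.omit(HouseVotes84)

# 'Class' (democrat / republican) is the target variable
vote_feats <- Boruta(Class ~ ., data = votes, doTrace = 2)
vote_feats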

Feature Selection with randomForest

Now let's see how we can use the randomForest package to compute the feature importance.


In [9]:
library(MASS)
library(randomForest)

data(Boston)


rf_feat_imp <- function(data,formula){
    #Compute feature importance with randomForest algorithm
    #:param data: dataframe containing the data
    #:param formula: formula for randomForest 
    #:returns imp_feats: returns report on feature importance
    imp_feats <- randomForest(formula,data=data, mtry=2, ntree = 500,importance=TRUE)
    imp_feats_res <- importance(imp_feats,type=1)
    return(imp_feats_res)
} 

feats <- rf_feat_imp(Boston,medv ~ .)
feats


Out[9]:
        %IncMSE
crim    18.09281
zn       6.570366
indus   13.27625
chas     5.204331
nox     17.64775
rm      32.02581
age     13.74005
dis     17.3245
rad      9.837306
tax     13.75862
ptratio 16.58041
black   12.58019
lstat   24.55349

What Just Happened?

Similar to the previous example, we created a generic function to compute the feature importance. The result is a one-column matrix with the features as row names and the percentage increase in MSE (for this regression example). If we pass type=2 to the importance function, it will report the increase in node impurity instead, as in the sketch below.
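
Here is a minimal sketch of the type=2 variant, reusing the Boston setup from above.

library(MASS)
library(randomForest)

data(Boston)

rf_model <- randomForest(medv ~ ., data = Boston, mtry = 2, ntree = 500,
                         importance = TRUE)

# type=2 reports the total decrease in node impurity (IncNodePurity)
# instead of the %IncMSE reported by type=1
importance(rf_model, type = 2)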

It is party time: Feature importance with the 'party' package

The next package we explore is 'party', which provides conditional inference forests via cforest() and conditional variable importance via varimp().


In [ ]:
library(party)
library(MASS)

data(Boston)


party_feat_imp <- function(data,formula){
    #Compute feature importance with party package
    #:param data: dataframe containing the data
    #:param formula: formula for cforest 
    #:returns imp_feats: returns report on feature importance
    imp_feats <- cforest(formula,data=data, control=cforest_unbiased(mtry=2,ntree=50))
    imp_feats_res <- varimp(imp_feats,conditional=TRUE)
    return(imp_feats_res)
} 

feats <- party_feat_imp(Boston,medv ~ .)
feats


Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric

Loading required package: sandwich
Note: this code takes considerable memory and time to finish; conditional variable importance with cforest is expensive to compute.
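
Once varimp() finishes, the conditional importance scores come back as a named numeric vector, so ranking the features is a one-liner; here is a small sketch using the 'feats' object computed above.

# Sort the conditional importance scores, highest first
sort(feats, decreasing = TRUE)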


References

[1] Miron B. Kursa and Witold R. Rudnicki, "Feature Selection with the Boruta Package", Journal of Statistical Software, Volume 36, Issue 11, September 2010. http://www.jstatsoft.org/v36/i11/paper