Jaganadh Gopinadhan http://jaganadhg.in
Feature selection is one of the most important tasks in Machine Learning and Data Science. This notebook is a continuation of my notes on Feature Selection with sklearn. In this notebook we will discuss various feature selection utilities available in R and how to use them, with examples.
One of the most widely used packages for feature selection in R is Boruta. This package wraps the randomForest package. A detailed note on the package and the algorithm is available in the paper "Feature Selection with the Boruta Package" [1]. I am not going to repeat the theory here; instead we will focus on usage.
We will use the Boston house price data here. Before starting the exercises, make sure that the required libraries are installed. To access the data we need the 'MASS' package; install it with install.packages('MASS'). The next package we require is 'Boruta'; install it with install.packages('Boruta', dependencies = c('Depends', 'Suggests')).
First let's load and examine the data.
In [26]:
library(MASS)
data(Boston)
head(Boston)
Out[26]:
The data contains 13 attributes and 'medv' is the target variable. Now let's try to find the feature importance.
In [2]:
library(MASS)
library(Boruta)
data(Boston)
boruta_feat_imp <- function(data, formula) {
    # Compute feature importance with the Boruta algorithm
    # :param data: data frame containing the data
    # :param formula: formula for randomForest
    # :returns imp_feats: report on feature importance
    imp_feats <- Boruta(formula, data = data, doTrace = 2, ntree = 500)
    return(imp_feats)
}
feats <- boruta_feat_imp(Boston,medv ~ .)
feats
Out[2]:
We loaded the Boston data set first, then defined a generic function that accepts a data set and a formula as arguments. The function passes the data and formula to the Boruta algorithm, which eventually invokes the randomForest package. Once the computation is over it returns the feature importance report. In the Boston case the algorithm found that all the features are important :-) . Now it is time to check the same with some other data; try the 'HouseVotes84' data from the 'mlbench' package.
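As a quick sketch of that suggestion, here is how the same helper could be applied to the 'HouseVotes84' classification data, assuming the 'mlbench' package is installed. Note that Boruta does not accept missing values, and HouseVotes84 contains them, so incomplete rows are dropped first.

```r
library(mlbench)
library(Boruta)

data(HouseVotes84)
# Boruta cannot handle NAs, so drop incomplete rows first
votes <- na.omit(HouseVotes84)

# 'Class' (democrat/republican) is the target; all votes are predictors
votes_feats <- Boruta(Class ~ ., data = votes, doTrace = 0, ntree = 500)
print(votes_feats)
```

Dropping rows with na.omit is the simplest option for a sketch like this; for real work, imputation would preserve more of the data.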
Now let's see how we can use the randomForest package directly to compute feature importance.
In [9]:
library(MASS)
library(randomForest)
data(Boston)
rf_feat_imp <- function(data, formula) {
    # Compute feature importance with the randomForest algorithm
    # :param data: data frame containing the data
    # :param formula: formula for randomForest
    # :returns imp_feats_res: report on feature importance
    imp_feats <- randomForest(formula, data = data, mtry = 2, ntree = 500, importance = TRUE)
    imp_feats_res <- importance(imp_feats, type = 1)
    return(imp_feats_res)
}
feats <- rf_feat_imp(Boston,medv ~ .)
feats
Out[9]:
Similar to the previous example, we created a generic function to compute feature importance. The result is a matrix with one row per feature; for a regression like this, type = 1 gives the permutation importance (%IncMSE, the percentage increase in mean squared error when the feature's values are permuted). If we pass type = 2 to the importance function, it returns the total decrease in node impurity (IncNodePurity) instead.
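A minimal sketch comparing the two importance types side by side on the same fitted forest:

```r
library(MASS)
library(randomForest)

data(Boston)
# Fit once with importance = TRUE so both measures are available
fit <- randomForest(medv ~ ., data = Boston, mtry = 2, ntree = 500, importance = TRUE)

# type = 1: permutation importance (%IncMSE for regression)
importance(fit, type = 1)

# type = 2: total decrease in node impurity (IncNodePurity for regression)
importance(fit, type = 2)
```

The two measures usually agree on the strongest features (e.g. rm and lstat for Boston) but can rank the weaker ones differently, since impurity-based importance tends to favour variables with many possible split points.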
The next package we are exploring is 'party', which provides conditional inference forests via its cforest function.
In [ ]:
library(party)
library(MASS)
data(Boston)
party_feat_imp <- function(data, formula) {
    # Compute feature importance with the party package
    # :param data: data frame containing the data
    # :param formula: formula for cforest
    # :returns imp_feats_res: report on feature importance
    imp_feats <- cforest(formula, data = data, control = cforest_unbiased(mtry = 2, ntree = 50))
    imp_feats_res <- varimp(imp_feats, conditional = TRUE)
    return(imp_feats_res)
}
feats <- party_feat_imp(Boston,medv ~ .)
feats
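Note that conditional = TRUE can be quite slow, since it permutes each variable within a grid defined by its correlated covariates. A faster sketch, using the default unconditional permutation importance and sorting the scores for readability:

```r
library(party)
library(MASS)

data(Boston)
# Small forest to keep the example fast; use more trees in practice
cf <- cforest(medv ~ ., data = Boston,
              control = cforest_unbiased(mtry = 2, ntree = 50))

# Unconditional permutation importance (much faster than conditional = TRUE)
vi <- varimp(cf)

# Rank the features, most important first
sort(vi, decreasing = TRUE)
```

The conditional variant is preferable when predictors are strongly correlated, as in Boston, because it reduces the inflated importance that correlated variables receive under plain permutation.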
[1] Miron B. Kursa and Witold R. Rudnicki, "Feature Selection with the Boruta Package", Journal of Statistical Software, September 2010, Volume 36, Issue 11. http://www.jstatsoft.org/v36/i11/paper