Using Boruta on the Madelon Data Set

Author: Mike Bernico

This example demonstrates using Boruta to find all relevant features in the Madelon dataset, an artificial dataset used in the NIPS 2003 feature selection challenge and cited in the Boruta paper.

This dataset has 2000 observations and 500 features. We will use Boruta to identify the features that are relevant to the classification task.



In [1]:

    
# Installation
#!pip install boruta



In [2]:

    
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy



In [3]:

    
def load_data():
    # URLs for the Madelon training data and labels in the UCI repository
    train_data_url='https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_train.data'
    train_label_url='https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_train.labels'

    X_data = pd.read_csv(train_data_url, sep=" ", header=None)
    y_data = pd.read_csv(train_label_url, sep=" ", header=None)
    # Keep only the 500 feature columns (labels 0..499); copy so that adding
    # the target column does not write into a slice of X_data
    data = X_data.loc[:, :499].copy()
    data['target'] = y_data[0]
    return data
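
The slice X_data.loc[:, :499] in load_data suggests that parsing with sep=" " yields more than 500 columns. One common cause, assumed here rather than verified against the Madelon files, is a trailing space at the end of each line. The sketch below reproduces that effect on a small made-up string (the values in raw are illustrative only) and shows that a whitespace regex separator avoids the extra column.

```python
import io

import pandas as pd

# Two made-up rows, each ending with a trailing space (an assumption about
# the file layout, not a copy of the real Madelon data).
raw = "485 477 537 \n483 458 460 \n"

# A single-space separator treats the trailing space as one more delimiter,
# so an extra, all-NaN column appears at the end.
with_single_space = pd.read_csv(io.StringIO(raw), sep=" ", header=None)

# A whitespace regex separator ignores the trailing whitespace.
with_regex = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)

print(with_single_space.shape)  # (2, 4)
print(with_regex.shape)         # (2, 3)
```

With the regex separator the explicit column slice would no longer be needed, although keeping it is harmless.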



In [4]:

    
data = load_data()



In [5]:

    
data.head()









    Out[5]:

         0    1    2    3    4    5    6    7    8    9  ...  491  492  493  494  495  496  497  498  499  target
    0  485  477  537  479  452  471  491  476  475  473  ...  481  477  485  511  485  481  479  475  496      -1
    1  483  458  460  487  587  475  526  479  485  469  ...  478  487  338  513  486  483  492  510  517      -1
    2  487  542  499  468  448  471  442  478  480  477  ...  481  492  650  506  501  480  489  499  498      -1
    3  480  491  510  485  495  472  417  474  502  476  ...  480  474  572  454  469  475  482  494  461       1
    4  484  502  528  489  466  481  402  478  487  468  ...  479  452  435  486  508  481  504  495  511       1

    5 rows × 501 columns



In [6]:

    
y = data.pop('target')
X = data.copy().values

Boruta conforms to the scikit-learn API and can be used in a Pipeline as well as on its own. Here we will demonstrate standalone operation.

First we will instantiate an estimator for Boruta to use. Then we will instantiate a Boruta object.



In [7]:

    
rf = RandomForestClassifier(n_jobs=-1, class_weight=None, max_depth=7, random_state=0)
# Define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=0)

Once built, we can use this object to identify the relevant features in our dataset.



In [ ]:

    
feat_selector.fit(X, y)

Boruta has confirmed only a few features as useful. When our run ended, Boruta was undecided on 2 features.

We can interrogate .support_ to see which features were selected. .support_ is an array of booleans that we can use to slice our feature matrix so that it includes only the relevant columns. Of course, .transform can also be used, as expected in the scikit-learn API.



In [ ]:

    
# Check selected features
print(feat_selector.support_)
# Select the chosen features from our dataframe.
selected = X[:, feat_selector.support_]
print()
print("Selected Feature Matrix Shape")
print(selected.shape)
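
The boolean-mask slicing used in the cell above can be tried on a toy array without running Boruta. In this minimal sketch, X_demo and support_demo are hypothetical stand-ins for X and feat_selector.support_, not real Boruta output.

```python
import numpy as np

# Hypothetical 4x5 feature matrix and selection mask (illustrative values only).
X_demo = np.arange(20).reshape(4, 5)
support_demo = np.array([True, False, False, True, False])

# Boolean indexing along axis 1 keeps only the columns marked True.
selected_demo = X_demo[:, support_demo]

# Positions of the kept columns, useful for mapping back to feature names.
kept_columns = np.flatnonzero(support_demo)

print(selected_demo.shape)  # (4, 2)
print(kept_columns)         # [0 3]
```

Keeping the column positions alongside the sliced matrix makes it easy to report which of the original 500 features survived selection.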

We can also interrogate the ranking of every feature with .ranking_. Confirmed features are assigned rank 1, tentative features rank 2, and rejected features higher ranks.



In [ ]:

    
feat_selector.ranking_
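
As a minimal sketch of how a ranking array can be read, the snippet below groups feature indices by rank. It follows the BorutaPy convention that rank 1 means confirmed and rank 2 means tentative. ranking_demo is a hypothetical array standing in for feat_selector.ranking_, not output from the run above.

```python
import numpy as np

# Hypothetical ranking for six features (illustrative values only).
ranking_demo = np.array([1, 4, 2, 1, 3, 5])

# Rank 1: confirmed features. Rank 2: tentative features.
confirmed = np.flatnonzero(ranking_demo == 1)
tentative = np.flatnonzero(ranking_demo == 2)

# Remaining features, ordered from best (lowest) to worst rank.
order = np.argsort(ranking_demo, kind="stable")
rejected = order[ranking_demo[order] > 2]

print(confirmed)  # [0 3]
print(tentative)  # [2]
print(rejected)   # [4 1 5]
```

Sorting the rejected features by rank gives a rough order in which to reconsider them if the confirmed set turns out to be too small.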


