Using Boruta on the Madelon Data Set

Author: Mike Bernico

This example demonstrates using Boruta to find all relevant features in the Madelon dataset, an artificial dataset from the NIPS 2003 feature selection challenge that is also cited in the Boruta paper.

This dataset has 2000 observations and 500 features. We will use Boruta to identify the features that are relevant to the classification task.
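
For intuition about what "relevant" means here, the following is a minimal sketch of a single round of the shadow-feature comparison that Boruta is built on. It is not BorutaPy's implementation: the toy data, its sizes, and every variable name are assumptions made for illustration, and the full method repeats this comparison over many rounds and applies a statistical test before confirming or rejecting a feature.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic stand-in for Madelon (hypothetical sizes, not the real data).
# shuffle=False keeps the 3 informative columns first, followed by 7 noise columns.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

rng = np.random.default_rng(0)
# Shadow features: each column permuted independently, which breaks any
# relationship with the target while keeping the column's value distribution.
X_shadow = np.column_stack([rng.permutation(col) for col in X_toy.T])

# Fit one forest on the real and shadow features together.
forest = RandomForestClassifier(n_estimators=200, max_depth=7, random_state=0)
forest.fit(np.hstack([X_toy, X_shadow]), y_toy)

n_real = X_toy.shape[1]
real_importance = forest.feature_importances_[:n_real]
shadow_max = forest.feature_importances_[n_real:].max()

# A real feature scores a "hit" in this round if it beats the best shadow feature.
hits = real_importance > shadow_max
print("Features beating the best shadow feature:", np.flatnonzero(hits))
```

A feature that cannot outperform a shuffled copy of the data in round after round is unlikely to carry real signal, which is the reasoning behind the confirmed, tentative, and rejected labels seen later in this example.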


In [1]:
# Installation
#!pip install boruta

In [2]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

In [3]:
def load_data():
    # URLs for the Madelon training set in the UCI repository
    train_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_train.data'
    train_label_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_train.labels'

    X_data = pd.read_csv(train_data_url, sep=" ", header=None)
    y_data = pd.read_csv(train_label_url, sep=" ", header=None)
    # Keep the 500 feature columns (labels 0..499); .copy() avoids assigning into a slice
    data = X_data.loc[:, :499].copy()
    data['target'] = y_data[0]
    return data

In [4]:
data = load_data()

In [5]:
data.head()


Out[5]:
0 1 2 3 4 5 6 7 8 9 ... 491 492 493 494 495 496 497 498 499 target
0 485 477 537 479 452 471 491 476 475 473 ... 481 477 485 511 485 481 479 475 496 -1
1 483 458 460 487 587 475 526 479 485 469 ... 478 487 338 513 486 483 492 510 517 -1
2 487 542 499 468 448 471 442 478 480 477 ... 481 492 650 506 501 480 489 499 498 -1
3 480 491 510 485 495 472 417 474 502 476 ... 480 474 572 454 469 475 482 494 461 1
4 484 502 528 489 466 481 402 478 487 468 ... 479 452 435 486 508 481 504 495 511 1

5 rows × 501 columns


In [6]:
y = data.pop('target')
X = data.copy().values

Boruta conforms to the scikit-learn API and can be used in a Pipeline as well as on its own. Here we demonstrate standalone operation.

First we instantiate an estimator for Boruta to use. Then we instantiate a BorutaPy object.


In [7]:
rf = RandomForestClassifier(n_jobs=-1, class_weight=None, max_depth=7, random_state=0)
# Define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=0)

Once built, we can use this object to identify the relevant features in our dataset.


In [ ]:
feat_selector.fit(X, y)

Boruta has confirmed only a few features as useful. When our run ended, Boruta was undecided on 2 features.

We can inspect .support_ to see which features were selected. .support_ is an array of booleans that we can use to slice our feature matrix so that it includes only the relevant columns. Of course, .transform can also be used, as expected in the scikit-learn API.


In [ ]:
# Check selected features
print(feat_selector.support_)
# Select the chosen features from our feature matrix.
selected = X[:, feat_selector.support_]
print("")
print("Selected Feature Matrix Shape")
print(selected.shape)
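
The boolean-mask slicing used above can be tried without running Boruta. The sketch below uses small made-up arrays (`X_demo` and `support_demo` are hypothetical stand-ins for `X` and `feat_selector.support_`) and also shows how to recover the positions of the kept columns.

```python
import numpy as np

# Hypothetical stand-ins for X and feat_selector.support_ (6 features, 3 selected).
X_demo = np.arange(24).reshape(4, 6)
support_demo = np.array([True, False, False, True, False, True])

# Boolean masking keeps only the columns flagged True.
selected_demo = X_demo[:, support_demo]

# np.flatnonzero recovers the original column positions of the kept features.
selected_idx = np.flatnonzero(support_demo)

print(selected_demo.shape)
print(selected_idx)
```

On the real data, `np.flatnonzero(feat_selector.support_)` would give positions in the range 0 to 499, which match the integer column labels of the DataFrame returned by `load_data`.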

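We can also inspect the ranking of every feature with .ranking_. Confirmed features are assigned rank 1 and tentative features rank 2; higher ranks belong to the features that were not selected.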


In [ ]:
feat_selector.ranking_
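
As a rough illustration of how the ranking array can be read, the sketch below groups features by rank. `ranking_demo` is a made-up array standing in for `feat_selector.ranking_`; the rank convention it assumes (1 for confirmed, 2 for tentative) follows the BorutaPy documentation.

```python
import numpy as np

# Hypothetical stand-in for feat_selector.ranking_ on a 6-feature problem.
# Rank 1 marks confirmed features and rank 2 marks tentative ones.
ranking_demo = np.array([1, 4, 3, 1, 2, 5])

confirmed = np.flatnonzero(ranking_demo == 1)
tentative = np.flatnonzero(ranking_demo == 2)

# Order the remaining (rejected) features from best rank to worst.
rejected = np.flatnonzero(ranking_demo > 2)
rejected_by_rank = rejected[np.argsort(ranking_demo[rejected], kind="stable")]

print("confirmed:", confirmed)
print("tentative:", tentative)
print("rejected, best first:", rejected_by_rank)
```

Sorting the rejected features this way gives a quick view of which ones came closest to being kept.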
