This week in my Advanced Business Intelligence class we took a look at boosting and bagging, two concepts well known to everybody in the machine learning world. The example in class was to take a simple Decision Tree and compare it to a bagged ensemble of Decision Trees; for what it's worth, the example failed: SAS claimed the simple DT was "better" than the bagged one, because of some error in SAS Enterprise Miner that we could not track down. Time to see if Python can do better.

This first part is just a recap of Post #1: I am using the same donors.csv that I use for my class. We import the data and set roles for the variables.


In [1]:
import copper
copper.project.path = '../'
ds = copper.Dataset()
ds.load('data.csv')
ds.role['TARGET_D'] = ds.REJECTED
ds.role['TARGET_B'] = ds.TARGET
ds.type['ID'] = ds.CATEGORY

Since scikit-learn can't handle NaNs we need to fill the missing values, so I created a simple method on the Dataset class to fill the values of numerical columns with the mean.


In [2]:
ds.fillna('DemAge', 'mean')
ds.fillna('GiftAvgCard36', 'mean')
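
For reference, this is roughly what such a mean fill amounts to in plain pandas (a sketch only; it assumes the data lives in a regular DataFrame, which may not match how the copper Dataset stores it internally):

import pandas as pd

# hypothetical stand-in for the Dataset internals: replace NaNs in a numeric column with its mean
df = pd.read_csv('data.csv')
for col in ['DemAge', 'GiftAvgCard36']:
    df[col] = df[col].fillna(df[col].mean())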

Let's see if we are good


In [3]:
ds.inputs


Out[3]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9686 entries, 0 to 9685
Data columns:
GiftCnt36            9686  non-null values
GiftCntAll           9686  non-null values
GiftCntCard36        9686  non-null values
GiftCntCardAll       9686  non-null values
GiftAvgLast          9686  non-null values
GiftAvg36            9686  non-null values
GiftAvgAll           9686  non-null values
GiftAvgCard36        9686  non-null values
GiftTimeLast         9686  non-null values
GiftTimeFirst        9686  non-null values
PromCnt12            9686  non-null values
PromCnt36            9686  non-null values
PromCntAll           9686  non-null values
PromCntCard12        9686  non-null values
PromCntCard36        9686  non-null values
PromCntCardAll       9686  non-null values
StatusCat96NK [A]    9686  non-null values
StatusCat96NK [E]    9686  non-null values
StatusCat96NK [F]    9686  non-null values
StatusCat96NK [L]    9686  non-null values
StatusCat96NK [N]    9686  non-null values
StatusCat96NK [S]    9686  non-null values
StatusCatStarAll     9686  non-null values
DemCluster           9686  non-null values
DemAge               9686  non-null values
DemGender [F]        9686  non-null values
DemGender [M]        9686  non-null values
DemGender [U]        9686  non-null values
DemHomeOwner [H]     9686  non-null values
DemHomeOwner [U]     9686  non-null values
DemMedHomeValue      9686  non-null values
DemPctVeterans       9686  non-null values
DemMedIncome         9686  non-null values
dtypes: float64(7), int64(26)

OK, no missing values so we are good to go.

Machine Learning

Time to see if boosting and bagging are as good as they promise to be. We create a new machine learning instance, set the dataset, and tell it to sample half of the data for training and half for testing.


In [4]:
ml = copper.MachineLearning()
ml.dataset = ds
ml.sample(trainSize=0.5)
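
Under the hood this is just a random 50/50 split; a minimal scikit-learn equivalent would look something like the sketch below (it assumes the Dataset exposes the target column as ds.target, which I am only guessing from the API shown above; the actual copper implementation may differ):

from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

# hypothetical equivalent of ml.sample(trainSize=0.5): split inputs and target in half
X_train, X_test, y_train, y_test = train_test_split(
    ds.inputs.values, ds.target.values, train_size=0.5)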

Create a new Decision Tree, add it to the models to compare, and fit the models.


In [5]:
from sklearn import tree
tree_clf = tree.DecisionTreeClassifier(max_depth=10)
ml.add_clf(tree_clf, 'Decision Tree')
ml.fit()

Since I started coding this library I have wanted to stay compatible with pandas and scikit-learn and their respective APIs. For that reason it is possible to add already-fitted classifiers to the ML class and then compare them with its utilities; this is useful when doing bootstrapping.

In the next lines I create 20 different Decision Tree classifiers using 20 different bootstrap samples, fit each classifier, and add them to the ML class.

One has to be careful here: the bootstrapping must use only the training part, so I only resample the training portion of ds.inputs (above I called ml.sample(trainSize=0.5) to split the inputs into half training and half testing), drawing 20 samples from it and fitting 20 classifiers.

The first time I used all the inputs and the results were amazing, but obviously some records were being used for both training and testing, which is wrong.

So in this case each new Decision Tree is trained on only about a quarter of the original inputs.


In [6]:
from sklearn import cross_validation

# draw 20 bootstrap samples from the training half only and fit one tree per sample
bs = cross_validation.Bootstrap(len(ml.X_train), n_iter=20)
for i, (train_index, test_index) in enumerate(bs):
    X_train = ml.X_train[train_index]
    y_train = ml.y_train[train_index]
    clf = tree.DecisionTreeClassifier(max_depth=10)
    clf.fit(X_train, y_train)
    ml.add_clf(clf, "DT" + str(i + 1))
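
As a side note, cross_validation.Bootstrap was later deprecated and removed from scikit-learn, so on a recent version the same resampling loop can be written with sklearn.utils.resample; a rough equivalent, drawing half-sized samples with replacement to mirror the quarter-of-the-data behaviour above, would be:

from sklearn.utils import resample

# alternative to the deprecated Bootstrap class: 20 bootstrap samples from the training half
for i in range(20):
    X_bs, y_bs = resample(ml.X_train, ml.y_train,
                          n_samples=len(ml.X_train) // 2)  # sample with replacement
    clf = tree.DecisionTreeClassifier(max_depth=10)
    clf.fit(X_bs, y_bs)
    ml.add_clf(clf, "DT" + str(i + 1))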

Let's see some results.


In [7]:
ml.accuracy().head()


Out[7]:
Decision Tree    0.550279
DT17             0.539955
DT14             0.536031
DT11             0.535205
DT19             0.534999
Name: Accuracy

In [9]:
ml.roc(legend=False, retList=True)


Out[9]:
Decision Tree    0.541008
DT15             0.521935
DT19             0.520583
DT11             0.516007
DT16             0.515312
DT14             0.514619
DT8              0.513731
DT17             0.509060
DT9              0.506149
DT3              0.504594
DT13             0.504411
DT5              0.501457
DT2              0.499923
DT6              0.497339
DT10             0.494504
DT12             0.494483
DT7              0.493177
DT18             0.492243
DT4              0.491786
DT1              0.488879
DT20             0.483900

As expected that by itself didn't do much; most of the new models are even worse than the original, but now we have a few models ready to be bagged.

Bagging

Since scikit-learn does not have an implementation of bagging, I made a really simple (and inefficient) one: the bag predicts the class with the mode of the individual predictions and predicts the probabilities with their mean. I then added some methods to the ML class to make that easy to use.
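
The idea is roughly the sketch below (a hypothetical SimpleBagging class for illustration only, not the actual copper.core.ensemble.Bagging code):

import numpy as np
from scipy import stats

class SimpleBagging(object):
    """Illustrative sketch: combine a list of already-fitted classifiers."""

    def __init__(self, clfs):
        self.clfs = clfs

    def predict(self, X):
        # stack the predictions of every classifier and answer with the mode
        preds = np.array([clf.predict(X) for clf in self.clfs])
        return stats.mode(preds, axis=0)[0].ravel()

    def predict_proba(self, X):
        # average the predicted probabilities of every classifier
        probas = np.array([clf.predict_proba(X) for clf in self.clfs])
        return probas.mean(axis=0)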

Creating a bag of all the models is as simple as calling the method and passing a name as a parameter. In future releases it will be possible to pass a list of target models to include in the bag, which will make it easy to create several different bags.


In [10]:
ml.bagging("Bag 1")

In [11]:
ml.clfs # Checking the classifiers


Out[11]:
DT14             DecisionTreeClassifier(compute_importances=Fal...
Decision Tree    DecisionTreeClassifier(compute_importances=Fal...
DT13             DecisionTreeClassifier(compute_importances=Fal...
DT15             DecisionTreeClassifier(compute_importances=Fal...
DT18             DecisionTreeClassifier(compute_importances=Fal...
DT12             DecisionTreeClassifier(compute_importances=Fal...
DT17             DecisionTreeClassifier(compute_importances=Fal...
DT9              DecisionTreeClassifier(compute_importances=Fal...
DT8              DecisionTreeClassifier(compute_importances=Fal...
DT20             DecisionTreeClassifier(compute_importances=Fal...
DT11             DecisionTreeClassifier(compute_importances=Fal...
DT19             DecisionTreeClassifier(compute_importances=Fal...
DT16             DecisionTreeClassifier(compute_importances=Fal...
DT3              DecisionTreeClassifier(compute_importances=Fal...
DT2              DecisionTreeClassifier(compute_importances=Fal...
DT1              DecisionTreeClassifier(compute_importances=Fal...
DT10             DecisionTreeClassifier(compute_importances=Fal...
DT7              DecisionTreeClassifier(compute_importances=Fal...
DT6              DecisionTreeClassifier(compute_importances=Fal...
DT5              DecisionTreeClassifier(compute_importances=Fal...
DT4              DecisionTreeClassifier(compute_importances=Fal...
Bag 1            <copper.core.ensemble.Bagging object at 0x7937...

Let's see the results


In [12]:
ml.accuracy().head()


Out[12]:
Bag 1            0.558951
Decision Tree    0.550279
DT17             0.539955
DT14             0.536031
DT11             0.535205
Name: Accuracy

In [13]:
ml.roc(legend=False, retList=True)


Out[13]:
Bag 1            0.578210
Decision Tree    0.541008
DT15             0.521935
DT19             0.520583
DT11             0.516007
DT16             0.515312
DT14             0.514619
DT8              0.513731
DT17             0.509060
DT9              0.506149
DT3              0.504594
DT13             0.504411
DT5              0.501457
DT2              0.499923
DT6              0.497339
DT10             0.494504
DT12             0.494483
DT7              0.493177
DT18             0.492243
DT4              0.491786
DT1              0.488879
DT20             0.483900

Well, it is an improvement. Not a huge one, but for something that only takes the mode and the mean of each model it is not bad at all: the bag scores a better accuracy than the other 20 classifiers and also a better area under the ROC curve.

Conclusion

Bagging is good. Even a very simple implementation gave better results.

I did not tune any parameters for the Decision Trees, only max_depth=10; playing with more models and more parameters I am sure bagging would do even better. Next I want to take a look at Grid Search and see how it can help me improve these results.

I really believe the potential of bagging lies in conditional scoring, for example only asking the models that are good with high income when the incoming record's income is above $10,000. It is probably also a good idea to use different classifiers, for instance 5 Decision Trees, 5 SVMs, and so on, instead of 20 Decision Trees. But that is just something that crossed my mind.

As usual the code is on github: copper