This week in my Advanced Business Intelligence class we took a look at boosting and bagging, two concepts well known to everybody in the machine learning world. The class example was to take a simple decision tree and compare it to a bagged decision tree; for the record, the example failed in class: SAS reported that the simple decision tree was "better" than the bagged one, due to some error in SAS Enterprise Miner that we could not find. Time to see if Python is capable of doing it.
This first part is just a recap of Post #1; I am using the same donors.csv that I use for my class. We import the data and set some roles for the variables.
In [1]:
import copper
copper.project.path = '../'
ds = copper.Dataset()
ds.load('data.csv')
ds.role['TARGET_D'] = ds.REJECTED
ds.role['TARGET_B'] = ds.TARGET
ds.type['ID'] = ds.CATEGORY
Since scikit-learn can't handle NaNs, we need to fill in the missing values, so I created a simple method on the Dataset class to fill the values of numerical columns with the mean.
In [2]:
ds.fillna('DemAge', 'mean')
ds.fillna('GiftAvgCard36', 'mean')
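The fillna helper above presumably wraps pandas; a minimal sketch of the same mean-filling idea with plain pandas (the column name and data here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in a numeric column
df = pd.DataFrame({'DemAge': [25.0, np.nan, 35.0, np.nan, 60.0]})

# Replace NaNs with the column mean -- what ds.fillna(col, 'mean')
# presumably does internally for numerical columns
df['DemAge'] = df['DemAge'].fillna(df['DemAge'].mean())
```

The mean is computed over the non-missing values only, so here the two NaNs both become 40.0.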
Let's see if we are good
In [3]:
ds.inputs
Out[3]:
OK, no missing values so we are good to go.
Time to see if boosting and bagging are as good as they promise to be. We create a new machine learning instance, set the dataset, and tell it to sample half for training and half for testing.
In [4]:
ml = copper.MachineLearning()
ml.dataset = ds
ml.sample(trainSize=0.5)
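Under the hood, a 50/50 sample like this is presumably just a random train/test split. A sketch of the equivalent with scikit-learn directly (in modern versions the function lives in `sklearn.model_selection`; the toy arrays stand in for ds.inputs and the target):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy inputs and target standing in for ds.inputs / TARGET_B
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Half for training, half for testing -- roughly what
# ml.sample(trainSize=0.5) is assumed to do
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=0)
```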
Create a new decision tree, add it to the models to compare, and fit the models:
In [5]:
from sklearn import tree
tree_clf = tree.DecisionTreeClassifier(max_depth=10)
ml.add_clf(tree_clf, 'Decision Tree')
ml.fit()
Since I started coding this library I have wanted to maintain full compatibility with pandas and scikit-learn and their respective APIs. For that reason it is possible to simply add already-fitted classifiers to the ML class and then compare them with its utilities; this is useful when using bootstrapping.
In the next lines I create 20 different decision tree classifiers using 20 different samples (via bootstrapping), fit each classifier, and add them to the ML class.
I have to be careful here: the bootstrapping should use only the training part, so I only use the training portion of ds.inputs (above I did ml.sample(trainSize=0.5) to split the inputs into half training and half testing), and then re-sample that training part (ml.train) 20 times to fit 20 classifiers. The first time I used all the inputs and the results were amazing, but obviously some records were being used for both training and testing, which is wrong. So in this case each new decision tree is trained on only about a quarter of the inputs.
In [6]:
from sklearn import cross_validation
bs = cross_validation.Bootstrap(len(ml.X_train), n_iter=20)
i = 0
for train_index, test_index in bs:
    X_train = ml.X_train[train_index]
    y_train = ml.y_train[train_index]
    clf = tree.DecisionTreeClassifier(max_depth=10)
    clf.fit(X_train, y_train)
    ml.add_clf(clf, "DT" + str(i + 1))
    i += 1
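The `cross_validation.Bootstrap` class was removed from later scikit-learn releases, but the same idea — drawing row indices with replacement and fitting one tree per resample — can be sketched with plain NumPy (the toy data here stands in for ml.X_train / ml.y_train):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Toy training data standing in for ml.X_train / ml.y_train
X_train = rng.rand(100, 4)
y_train = rng.randint(0, 2, 100)

clfs = []
for i in range(20):
    # Draw row indices with replacement: the bootstrap resample
    idx = rng.randint(0, len(X_train), len(X_train))
    clf = DecisionTreeClassifier(max_depth=10)
    clf.fit(X_train[idx], y_train[idx])
    clfs.append(clf)
```

Note one difference: the classic bootstrap resamples n rows out of n with replacement, whereas the old `Bootstrap` class by default handed each iteration only half of the rows it was given.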
Let's see some results:
In [7]:
ml.accuracy().head()
Out[7]:
In [9]:
ml.roc(legend=False, retList=True)
Out[9]:
As expected that didn't do much; most of the new models are even worse than the original, but now we have a few models ready to be bagged.
Since scikit-learn does not have an implementation of bagging, I made a really simple (and inefficient) one: it uses the mode of the individual predictions to predict the class, and the mean of the individual probabilities to predict the probabilities. Finally, I created some methods on the ML class to make that possible and easy.
Creating a bag of all the models is as simple as calling the method and passing a name as a parameter. In future releases it will be possible to pass a list of target models to use in the bag, which will make it easy to create more bags.
In [10]:
ml.bagging("Bag 1")
In [11]:
ml.clfs # Checking the classifiers
Out[11]:
Let's see the results
In [12]:
ml.accuracy().head()
Out[12]:
In [13]:
ml.roc(legend=False, retList=True)
Out[13]:
Well, it is an improvement; not a huge one, but for something that only takes the mode and mean of each model's outputs it is not bad at all. At least the bag scored a better accuracy than the other 20 classifiers, and also a better area under the curve.
Bagging is good. Even a very simple implementation gave better results.
I did not tune any parameters for the decision trees, only max_depth=10; playing with more models and more parameters, I am sure bagging will do even better. To pursue that, next I want to take a look at grid search and see how it can help me improve these results.
I really believe that the potential of bagging lies in conditional scoring: for example, only asking the models that are good with high incomes when the entry's income is higher than $10,000. It is probably also a good idea to use different classifiers: instead of 20 decision trees, use 5 decision trees, 5 SVMs, and so on. But that is just something that crossed my mind.
As usual, the code is on GitHub: copper