I wanted to see how fast and good the package's API really is, so I decided to repeat project #3 of my basic "Business Intelligence" class. That project took me at least 16 hours to complete; of course that includes writing the report, but I remember spending 5 to 6 hours on SAS doing the data mining.
The data is about the customers of a catalog; the objective is to predict which customers are most likely to respond to the catalog campaign and send a catalog to those customers. There is a dataset for training (2000 rows) and another for testing (2000 rows). Each file has these columns:
We start by importing the training data and taking a look at the first few items and the default metadata.
In [1]:
import copper
In [2]:
copper.project.path = '../'
In [3]:
ds_train = copper.Dataset()
ds_train.load('training.csv')
In [4]:
ds_train.frame.head()
Out[4]:
In [5]:
ds_train.metadata
Out[5]:
Some things that need to be fixed are:
In [6]:
ds_train.role['CustomerID'] = ds_train.ID
ds_train.type['RFA1'] = ds_train.NUMBER
ds_train.type['RFA2'] = ds_train.NUMBER
ds_train.role['Order'] = ds_train.TARGET
ds_train.type['Order'] = ds_train.NUMBER
There is another problem with the LASD column. It is a number, but it actually represents a date in YYYYMM format. For that reason we need to transform it into the number of months since the last transaction using this formula: 12*(2007 - YEAR) - MONTH + 2.
We can do this with the same apply API as pandas:
In [7]:
ds_train.frame['LASD'].head()
Out[7]:
In [8]:
fnc = lambda x: 12*(2007 - int(str(x)[0:4])) - int(str(x)[4:6]) + 2
ds_train['LASD'] = ds_train['LASD'].apply(fnc)
In [9]:
ds_train.frame['LASD'].head()
Out[9]:
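The transformation can be sanity-checked in plain Python. For example, for LASD = 200611 (November 2006) the formula gives 12*(2007 - 2006) - 11 + 2 = 3 months. A minimal check (the helper name is just for illustration):

```python
# Sanity check of the YYYYMM -> "months since last transaction" formula
def months_since(lasd):
    """Convert a YYYYMM integer into months elapsed (relative to early 2007)."""
    year = int(str(lasd)[0:4])
    month = int(str(lasd)[4:6])
    return 12 * (2007 - year) - month + 2

print(months_since(200611))  # 12*(2007-2006) - 11 + 2 = 3
print(months_since(200701))  # 12*0 - 1 + 2 = 1
```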
Perfect, we have to do the same for the testing dataset:
In [10]:
ds_test = copper.Dataset()
ds_test.load('testing.csv')
ds_test.role['CustomerID'] = ds_test.ID
ds_test.type['RFA1'] = ds_test.NUMBER
ds_test.type['RFA2'] = ds_test.NUMBER
ds_test.role['Order'] = ds_test.TARGET
ds_test.type['Order'] = ds_test.NUMBER
ds_test['LASD'] = ds_test['LASD'].apply(fnc)
A very simple exploration is to plot a histogram of the new LASD column:
In [12]:
ds_train.histogram('LASD', legend=False)
We can see that the data is based on people who made purchases a long time ago, more than 10 months ago. That can be good or bad, but it is what we have, so let's use it.
OK, we can finally do some machine learning. In the previous post I showed how the data is transformed into inputs for scikit-learn; in this example that is not necessary because all the data are numbers, but I will show how to do that in a later post.
Anyway, to do some ML we need to create a MachineLearning instance and assign the training and testing datasets:
In [13]:
ml = copper.MachineLearning()
ml.train = ds_train
ml.test = ds_test
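Conceptually this maps onto the standard scikit-learn workflow of fitting on a training set and scoring on a held-out test set. A toy sketch with random stand-in data (the arrays below are made up, they are not copper's internals):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins for the train/test feature matrices and the 'Order' target
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(100, 5), rng.randint(0, 2, 100)
X_test, y_test = rng.rand(50, 5), rng.randint(0, 2, 50)

clf = DecisionTreeClassifier(max_depth=6)
clf.fit(X_train, y_train)          # fit on the training set
print(clf.score(X_test, y_test))   # accuracy on the held-out test set
```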
Create a few models to fit and compare: SVM, Decision Tree, GaussianNB and GradientBoosting.
In [14]:
from sklearn import svm
svm_clf = svm.SVC(probability=True)
In [15]:
from sklearn import tree
tree_clf = tree.DecisionTreeClassifier(max_depth=6)
In [16]:
from sklearn.naive_bayes import GaussianNB
gnb_clf = GaussianNB()
In [17]:
from sklearn.ensemble import GradientBoostingClassifier
gr_bst_clf = GradientBoostingClassifier()
Add the models to the ML instance and fit them all:
In [18]:
ml.add_clf(svm_clf, 'SVM')
ml.add_clf(tree_clf, 'Decision Tree')
ml.add_clf(gnb_clf, 'GaussianNB')
ml.add_clf(gr_bst_clf, 'Grad Boosting')
In [19]:
ml.fit()
We can see the accuracy of the models and plot the ROC curve.
In [20]:
ml.accuracy()
Out[20]:
In [21]:
ml.roc()
We can see that most of the models are very similar. That is expected because I did not set any parameters on any model. Obviously I need to spend some time tuning the different models, but I am not going to do that right now.
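When I do get around to tuning, scikit-learn's grid search is one way to do it. A minimal sketch (the parameter grid is illustrative; in recent scikit-learn versions GridSearchCV lives in sklearn.model_selection):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Made-up data standing in for the catalog features/target
rng = np.random.RandomState(0)
X, y = rng.rand(120, 4), rng.randint(0, 2, 120)

# Try several tree depths with 3-fold cross-validation
grid = GridSearchCV(DecisionTreeClassifier(), {'max_depth': [2, 4, 6, 8]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```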
Since I started learning data mining I have loved the confusion matrix and the possibilities it offers. First, it is very simple to understand and tells more than simple accuracy, which can lead to incorrect interpretations. Second, and most important, combined with costs it can tell companies what they love to hear: how much money we are saving or losing.
We can see all the confusion matrices with a simple command:
In [22]:
ml.cm()
Out[22]:
That does not give us very clear information, but we can look at each single value like this:
In [23]:
ml.cm_table(0)
Out[23]:
In [24]:
ml.cm_table(1)
Out[24]:
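As an aside, scikit-learn itself can compute the raw matrix; a minimal sketch with hypothetical labels (rows are actual classes, columns are predicted classes):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# Rows: actual class; columns: predicted class
print(confusion_matrix(y_true, y_pred))  # [[2 1]
                                         #  [1 2]]
```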
That is better. We can even see a pretty picture of the matrix for each model (e.g. SVM), useful to include in a report ;)
In [25]:
ml.plot_cm('SVM')
Finally we can define the costs of each prediction:
In [26]:
ml.costs = [[0, 4], [12, 16]]
With the costs defined we can take a look at the revenues and the opportunity cost:
In [27]:
ml.oportunity_cost()
Out[27]:
In [28]:
ml.income()
Out[28]:
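Just to illustrate the general idea of cost-weighted evaluation (not necessarily copper's exact implementation): a confusion matrix can be weighted elementwise by a cost matrix and summed. The numbers below are made up:

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual class, columns = predicted class
cm = np.array([[1500, 200],
               [100, 200]])
# Per-cell weights, mirroring the shape of ml.costs = [[0, 4], [12, 16]]
costs = np.array([[0, 4], [12, 16]])

# Weight each cell and sum to get a single cost-aware score
score = (cm * costs).sum()
print(score)  # 0*1500 + 4*200 + 12*100 + 16*200 = 5200
```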
OK, that is good; we now know that GNB generates more revenue, but what are we comparing against? It is always useful to compare using a model against not using any model, or what I call being an idiot.
In [29]:
ml.income_no_ml()
Out[29]:
With that we can see that by sending catalogs to every person in the testing dataset we would have generated a loss of -8096; by using any model we can improve that, hell, even by doing nothing and playing FIFA all day we can improve that :P
Well, it took me around 4 to 5 hours to code the MachineLearning class and another hour or so to write this. I can say with confidence (being a user of both) that using this package is easier and faster than using Enterprise Miner.
At least I can say that all the work I did in 6 hours on Enterprise Miner would have taken about one hour using this; obviously I had to code it first, but now it is done.
I know that SAS has a lot more features, but I am just getting started :P. I also learned today that scikit-learn has tons of features I had never heard about, such as cross-validation; I will definitely take a closer look at those.
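For what it's worth, cross-validation in scikit-learn can be very short. A sketch with made-up data (the exact import path varies by scikit-learn version; this is the sklearn.model_selection form):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Made-up data standing in for the catalog features/target
rng = np.random.RandomState(0)
X, y = rng.rand(90, 3), rng.randint(0, 2, 90)

# 5-fold cross-validated accuracy: one score per fold
scores = cross_val_score(GaussianNB(), X, y, cv=5)
print(scores.mean())
```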
For now this will only work for prototyping some machine learning, but it is what I have learned in my Business Intelligence classes so far; I will update the package as I learn more.
The code is on GitHub: copper