I wanted to see how fast and how good the package's API really is, so I decided to repeat project #3 of my basic "Business Intelligence" class. That project took me at least 16 hours to complete. Of course that includes writing the report, but I remember spending 5 to 6 hours in SAS doing the data mining.

The data is about the customers of a catalog. The objective is to predict which customers are more likely to respond to the catalog campaign and to send a catalog to those customers. There is a dataset for training (2000 rows) and another one for testing (2000 rows). Each file has these columns:

  1. NGIF: Number of orders in the last 24 months
  2. RAMN: Total order amounts in dollars in the last 24 months
  3. LASG: Amount of last order
  4. LASD: Date of last order (you may need to convert it to number of months elapsed to 01/01/2007)
  5. RFA1: Frequency of order
  6. RFA2: Order amount category (as defined by the firm) of the last order
  7. Order: Actual response (1: response, 0: no response)

We start by importing the training data and taking a look at the first few items and the default metadata.

In [1]:
import copper

In [2]:
copper.project.path = '../'

In [3]:
ds_train = copper.Dataset()

In [4]:

   CustomerID  NGIF  RAMN  LASG    LASD  RFA1  RFA2  Order
0           1     2    30    20  200503     1     6      1
1           2    25   207    20  200503     1     6      0
2           3     5    52    15  200503     1     6      0
3           4    11   105    15  200503     1     6      0
4           5     2    32    17  200503     1     6      0

In [5]:

            Role   Type
CustomerID  Input  Number
NGIF        Input  Number
RAMN        Input  Number
LASG        Input  Number
LASD        Input  Number
RFA1        Input  Number
RFA2        Input  Number
Order       Input  Number

Some things that need to be fixed are:

  1. RFA1 type should be a number, not a category
  2. RFA2 type should be a number, not a category
  3. CustomerID role should be ID, not input
  4. Order role should be target, not input

In [6]:
ds_train.role['CustomerID'] = ds_train.ID
ds_train.type['RFA1'] = ds_train.NUMBER
ds_train.type['RFA2'] = ds_train.NUMBER
ds_train.role['Order'] = ds_train.TARGET
ds_train.type['Order'] = ds_train.NUMBER

There is another problem with the LASD column. It is stored as a number, but it actually represents a date in YYYYMM format. For that reason we need to transform it into the number of months since the last transaction using this formula: 12*(2007 - YEAR) - MONTH + 2. We can do this with apply, which has the same API as in pandas.

In [7]:

0    200503
1    200503
2    200503
3    200503
4    200503
Name: LASD

In [8]:
fnc = lambda x: 12*(2007 - int(str(x)[0:4])) - int(str(x)[4:6]) + 2
ds_train['LASD'] = ds_train['LASD'].apply(fnc)

In [9]:

0    23
1    23
2    23
3    23
4    23
Name: LASD

Perfect. We have to do the same for the testing dataset:

In [10]:
ds_test = copper.Dataset()
ds_test.role['CustomerID'] = ds_test.ID
ds_test.type['RFA1'] = ds_test.NUMBER
ds_test.type['RFA2'] = ds_test.NUMBER
ds_test.role['Order'] = ds_test.TARGET
ds_test.type['Order'] = ds_test.NUMBER
ds_test['LASD'] = ds_test['LASD'].apply(fnc)
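As a side note, the same LASD formula can be written with integer arithmetic instead of string slicing. This is only a sketch in plain Python 3; the name months_since is mine and is not part of copper or pandas.

```python
def months_since(yyyymm):
    """Convert a YYYYMM integer (e.g. 200503) with the post's formula."""
    year, month = divmod(int(yyyymm), 100)  # 200503 -> (2005, 3)
    return 12 * (2007 - year) - month + 2

# The string-slicing version from the post, kept for comparison.
fnc = lambda x: 12*(2007 - int(str(x)[0:4])) - int(str(x)[4:6]) + 2

print(months_since(200503))  # 23, same as the output above
print(all(months_since(v) == fnc(v) for v in (200503, 200612, 200401)))  # True
```

Since it takes one value and returns one value, it should be usable with apply in the same way as fnc.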

Simple exploration

A very simple exploration we can do is to plot a histogram of the new LASD column.

In [12]:
ds_train.histogram('LASD', legend=False)

We can see that the data is based on people who last purchased a long time ago, more than 10 months ago. That can be good or bad, but it is what we have, so let's use it.

Machine Learning!

OK, we can finally do some machine learning. In the previous post I showed how the data is transformed into inputs for scikit-learn. In this example that is not necessary because all the data are numbers, but I will show how to do that in a later post.

Anyway, to do some ML we need to:

  1. create a MachineLearning instance
  2. set the train data for the models
  3. set the testing data for the models

In [13]:
ml = copper.MachineLearning()
ml.train = ds_train
ml.test = ds_test

Create a few models to fit and compare: SVM, Decision Tree, GaussianNB and GradientBoosting.

In [14]:
from sklearn import svm
svm_clf = svm.SVC(probability=True)

In [15]:
from sklearn import tree
tree_clf = tree.DecisionTreeClassifier(max_depth=6)

In [16]:
from sklearn.naive_bayes import GaussianNB
gnb_clf = GaussianNB()

In [17]:
from sklearn.ensemble import GradientBoostingClassifier
gr_bst_clf = GradientBoostingClassifier()

Add the models to the ML instance and fit all the models

In [18]:
ml.add_clf(svm_clf, 'SVM')
ml.add_clf(tree_clf, 'Decision Tree')
ml.add_clf(gnb_clf, 'GaussianNB')
ml.add_clf(gr_bst_clf, 'Grad Boosting')

In [19]:

Comparing models

We can see the accuracy of the models and plot the ROC curve.

In [20]:

Grad Boosting    0.7200
Decision Tree    0.7135
SVM              0.6995
GaussianNB       0.6790
Name: Accuracy
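For reference, here is roughly what fitting the four models and ranking them by accuracy looks like in plain scikit-learn, without copper. This is a sketch under assumptions: the data is synthetic (make_classification), not the catalog data, so the scores will not match the ones above, and the dictionary loop is my own stand-in, not how the MachineLearning class is implemented.

```python
from sklearn import svm, tree
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the catalog data: 6 numeric inputs, binary target.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Same classifiers and parameters as in the post.
models = {
    'SVM': svm.SVC(probability=True),
    'Decision Tree': tree.DecisionTreeClassifier(max_depth=6),
    'GaussianNB': GaussianNB(),
    'Grad Boosting': GradientBoostingClassifier(),
}

# Fit each model on the training half and score it on the testing half.
scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))

# Print the models from best to worst accuracy.
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(name, round(score, 4))
```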

In [21]:

We can see that most of the models perform very similarly. That is expected because I did not tune the parameters of any model. Obviously I need to spend some time playing with different models, but I am not going to do that right now.
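In case the ROC curve is new to you: it is built by sorting the predictions by score and, as the decision threshold is lowered one item at a time, tracking the true positive rate against the false positive rate. The sketch below does that in plain Python 3 on a tiny made-up example. The function names and the toy data are mine, it assumes all scores are distinct, and it is not how copper or scikit-learn compute the curve.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Sort by descending score so each step lowers the threshold by one item.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy data: 1 = responded, 0 = did not respond; scores are made up.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

points = roc_points(y_true, scores)
print(points)
print(auc(points))
```

A model that ranks every responder above every non-responder gets an area of 1, and random ranking gets about 0.5.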

Confusion matrix

Ever since I started learning data mining I have loved the confusion matrix and the possibilities it offers. First, it is very simple to understand and gives more information than accuracy alone, which can lead to incorrect interpretations. Second, and most important, combined with costs it can tell companies what they love to hear: how much money we are saving or losing.

We can see all the confusion matrices with a simple command.

In [22]:

{'Decision Tree': array([[1365,   67],
       [ 506,   62]]),
 'GaussianNB': array([[1196,  236],
       [ 406,  162]]),
 'Grad Boosting': array([[1387,   45],
       [ 515,   53]]),
 'SVM': array([[1362,   70],
       [ 531,   37]])}

That is not very easy to read, but we can look at each value separately like this:

In [23]:

               Predicted 0's  Correct 0's   Rate 0's
GaussianNB              1602         1196  0.7465668
Decision Tree           1871         1365  0.7295564
Grad Boosting           1902         1387  0.7292324
SVM                     1893         1362  0.7194929

In [24]:

               Predicted 1's  Correct 1's   Rate 1's
Grad Boosting             98           53  0.5408163
Decision Tree            129           62  0.4806202
GaussianNB               398          162  0.4070352
SVM                      107           37  0.3457944
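The numbers in these tables, and the accuracies from before, can all be derived from the raw confusion matrices. Here is a small sketch in plain Python 3. The matrices are copied from the output above; the layout (rows are actual classes, columns are predicted classes) is my reading of the tables and matches the scikit-learn convention. The summarize helper is hypothetical and not part of copper.

```python
# Confusion matrices copied from the output above.
# Assumed layout: rows are actual classes, columns are predicted classes.
matrices = {
    'Decision Tree': [[1365, 67], [506, 62]],
    'GaussianNB': [[1196, 236], [406, 162]],
    'Grad Boosting': [[1387, 45], [515, 53]],
    'SVM': [[1362, 70], [531, 37]],
}

def summarize(cm):
    """Derive accuracy and per-class prediction rates from one matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        'accuracy': (tn + tp) / total,
        'predicted_0': tn + fn,        # everything in the first column
        'rate_0': tn / (tn + fn),      # share of predicted 0's that are right
        'predicted_1': fp + tp,        # everything in the second column
        'rate_1': tp / (fp + tp),      # share of predicted 1's that are right
    }

summary = {name: summarize(cm) for name, cm in matrices.items()}
for name, row in sorted(summary.items()):
    print(name, row)
```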

That is better. We can even see a pretty picture of the matrix of each model (e.g. SVM), which is useful to include in a report ;)

In [25]:

Let's talk about money

Finally we can define the costs of each prediction:

  1. True Negatives: 0 - catalog not sent and the person was not a customer, so we are good
  2. False Positives: 4 - catalog sent but the customer didn't respond
  3. False Negatives: 12 - catalog not sent but the person was a customer (opportunity cost)
  4. True Positives: 16 - catalog sent and the customer responded

In [26]:
ml.costs = [[0, 4], [12, 16]]

With the costs defined we can take a look at the revenues and the opportunity cost.

In [27]:

SVM              6372
Grad Boosting    6180
Decision Tree    6072
GaussianNB       4872
Name: Oportuniy cost

In [28]:

               Loss from False Positive  Revenue  Income
GaussianNB                          944     2592    1648
Decision Tree                       268      992     724
Grad Boosting                       180      848     668
SVM                                 280      592     312

OK, that is good: we now know that GNB generates the most revenue. But what are we comparing it to? It is always useful to compare using a model against not using any model, or what I call being an idiot.

In [29]:

Expense        17184
Revenue         9088
Net revenue    -8096
Name: Revenue of not using ML

With that we can see that sending catalogs to every person in the testing dataset would generate a loss of 8096. By using any model we can improve on that; hell, even by doing nothing and playing FIFA all day we can improve on that :P
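The per-model money figures in this section follow directly from the confusion matrices and the cost matrix. Below is a minimal sketch in plain Python 3 that reproduces them. I am assuming the cost matrix uses the same layout as the confusion matrices, [[TN, FP], [FN, TP]], which is what the order of the list above suggests. The money helper is a hypothetical name, not part of copper.

```python
# Cost matrix as assigned to ml.costs; assumed layout [[TN, FP], [FN, TP]].
costs = [[0, 4], [12, 16]]

# Confusion matrices copied from the earlier output (rows: actual class).
matrices = {
    'Decision Tree': [[1365, 67], [506, 62]],
    'GaussianNB': [[1196, 236], [406, 162]],
    'Grad Boosting': [[1387, 45], [515, 53]],
    'SVM': [[1362, 70], [531, 37]],
}

def money(cm, costs):
    """Turn one confusion matrix into the money figures shown in the post."""
    (tn, fp), (fn, tp) = cm
    loss_fp = fp * costs[0][1]   # catalogs sent to people who did not respond
    revenue = tp * costs[1][1]   # catalogs sent to people who responded
    return {
        'opportunity_cost': fn * costs[1][0],  # responders we never contacted
        'loss_from_fp': loss_fp,
        'revenue': revenue,
        'income': revenue - loss_fp,
    }

results = {name: money(cm, costs) for name, cm in matrices.items()}
# Print the models from highest to lowest income.
for name, row in sorted(results.items(), key=lambda item: -item[1]['income']):
    print(name, row)
```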


Well, it took me around 4 to 5 hours to code the MachineLearning class and another hour or so to write this. I can say with confidence (being a user of both) that using this package is easier and faster than using Enterprise Miner.

At least I can say that all the work I did in 6 hours on Enterprise Miner would have taken about one hour using this. Obviously I had to code it first, but now that is done.

I know that SAS has a lot more features, but I am just getting started :P. I also learned today that scikit-learn has tons of features that I had never heard about, such as cross-validation. I will definitely take a closer look at those features.

For now this will only work for prototyping some machine learning, but that is what I have learned in my Business Intelligence classes so far. I will update the package as I learn more.

The code is on GitHub: copper