I wanted to see how fast and good the package's API really is, so I decided to repeat project #3 of my basic "Business Intelligence" class. That project took me at least 16 hours to complete; of course that includes writing the report, but I remember spending 5 to 6 hours on SAS doing the data mining.
The data is about the customers of a catalog; the objective is to predict which customers are most likely to respond to the catalog campaign and send a catalog to those customers. There is a dataset for training (2000 rows) and another for testing (2000 rows). Each file has these columns:
We start by importing the training data and taking a look at the first few items and the default metadata.
In [1]:
import copper
In [2]:
copper.project.path = '../'
In [3]:
ds_train = copper.Dataset()
ds_train.load('training.csv')
In [4]:
ds_train.frame.head()
Out[4]:
In [5]:
ds_train.metadata
Out[5]:
Some things that need to be fixed are:
In [6]:
ds_train.role['CustomerID'] = ds_train.ID
ds_train.type['RFA1'] = ds_train.NUMBER
ds_train.type['RFA2'] = ds_train.NUMBER
ds_train.role['Order'] = ds_train.TARGET
ds_train.type['Order'] = ds_train.NUMBER
There is another problem with the LASD column. It is a number, but it actually represents a date in YYYYMM format. For that reason we need to transform it into the number of months since the last transaction using this formula: 12*(2007 - YEAR) - MONTH + 2.
We can do this with the same apply API as pandas:
In [7]:
ds_train.frame['LASD'].head()
Out[7]:
In [8]:
fnc = lambda x: 12*(2007 - int(str(x)[0:4])) - int(str(x)[4:6]) + 2
ds_train['LASD'] = ds_train['LASD'].apply(fnc)
In [9]:
ds_train.frame['LASD'].head()
Out[9]:
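The transformation can be sanity-checked in plain Python. For example, for LASD = 200611 (November 2006) the formula gives 12*(2007 - 2006) - 11 + 2 = 3 months. A minimal check (the helper name is just for illustration):

```python
# Sanity check of the YYYYMM -> "months since last transaction" formula
def months_since(lasd):
    """Convert a YYYYMM integer into months elapsed (relative to early 2007)."""
    year = int(str(lasd)[0:4])
    month = int(str(lasd)[4:6])
    return 12 * (2007 - year) - month + 2

print(months_since(200611))  # 12*(2007-2006) - 11 + 2 = 3
print(months_since(200701))  # 12*0 - 1 + 2 = 1
```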
Perfect, we have to do the same for the testing dataset:
In [10]:
ds_test = copper.Dataset()
ds_test.load('testing.csv')
ds_test.role['CustomerID'] = ds_test.ID
ds_test.type['RFA1'] = ds_test.NUMBER
ds_test.type['RFA2'] = ds_test.NUMBER
ds_test.role['Order'] = ds_test.TARGET
ds_test.type['Order'] = ds_test.NUMBER
ds_test['LASD'] = ds_test['LASD'].apply(fnc)
A very simple exploration is to plot a histogram of the new LASD column:
In [12]:
ds_train.histogram('LASD', legend=False)
We can see that the data is based on people who made purchases a long time ago, more than 10 months ago. That can be good or bad, but it is what we have, so let's use it.
OK, we can finally do some machine learning. In the previous post I showed how the data is transformed into inputs for scikit-learn; in this example that is not necessary because all the data are numbers, but I will show how to do that in a later post.
Anyway, to do some ML we need to create a MachineLearning instance and assign the training and testing datasets:
In [13]:
ml = copper.MachineLearning()
ml.train = ds_train
ml.test = ds_test
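Conceptually this maps onto the standard scikit-learn workflow of fitting on a training set and scoring on a held-out test set. A toy sketch with random stand-in data (the arrays below are made up, they are not copper's internals):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins for the train/test feature matrices and the 'Order' target
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(100, 5), rng.randint(0, 2, 100)
X_test, y_test = rng.rand(50, 5), rng.randint(0, 2, 50)

clf = DecisionTreeClassifier(max_depth=6)
clf.fit(X_train, y_train)          # fit on the training set
print(clf.score(X_test, y_test))   # accuracy on the held-out test set
```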
Create a few models to fit and compare: SVM, Decision Tree, GaussianNB and GradientBoosting.
In [14]:
from sklearn import svm
svm_clf = svm.SVC(probability=True)
In [15]:
from sklearn import tree
tree_clf = tree.DecisionTreeClassifier(max_depth=6)
In [16]:
from sklearn.naive_bayes import GaussianNB
gnb_clf = GaussianNB()
In [17]:
from sklearn.ensemble import GradientBoostingClassifier
gr_bst_clf = GradientBoostingClassifier()
Add the models to the ML instance and fit them all:
In [18]:
ml.add_clf(svm_clf, 'SVM')
ml.add_clf(tree_clf, 'Decision Tree')
ml.add_clf(gnb_clf, 'GaussianNB')
ml.add_clf(gr_bst_clf, 'Grad Boosting')
In [19]:
ml.fit()
We can see the accuracy of the models and plot the ROC curve.
In [20]:
ml.accuracy()
Out[20]:
In [21]:
ml.roc()
We can see that most of the models are very similar. That is expected because I did not set any parameters on any model. Obviously I need to spend some time tuning the different models, but I am not going to do that right now.
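When I do get around to tuning, scikit-learn's grid search is one way to do it. A minimal sketch (the parameter grid is illustrative; in recent scikit-learn versions GridSearchCV lives in sklearn.model_selection):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Made-up data standing in for the catalog features/target
rng = np.random.RandomState(0)
X, y = rng.rand(120, 4), rng.randint(0, 2, 120)

# Try several tree depths with 3-fold cross-validation
grid = GridSearchCV(DecisionTreeClassifier(), {'max_depth': [2, 4, 6, 8]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```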
Since I started learning data mining I have loved the confusion matrix and the possibilities it offers. First, it is very simple to understand and tells more than simple accuracy, which can lead to incorrect interpretations. Second, and most important, combined with costs it can tell companies what they love to hear: how much money we are saving or losing.
We can see all the confusion matrices with a simple command:
In [22]:
ml.cm()
Out[22]:
That does not give us very clear information, but we can look at each single value like this:
In [23]:
ml.cm_table(0)
Out[23]:
In [24]:
ml.cm_table(1)
Out[24]:
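As an aside, scikit-learn itself can compute the raw matrix; a minimal sketch with hypothetical labels (rows are actual classes, columns are predicted classes):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# Rows: actual class; columns: predicted class
print(confusion_matrix(y_true, y_pred))  # [[2 1]
                                         #  [1 2]]
```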
That is better. We can even see a pretty picture of the matrix for each model (e.g. SVM), useful to include in a report ;)
In [25]:
ml.plot_cm('SVM')
Finally we can define the costs of each prediction:
In [26]:
ml.costs = [[0, 4], [12, 16]]
With the costs defined we can take a look at the revenues and the opportunity cost:
In [27]:
ml.oportunity_cost()
Out[27]:
In [28]:
ml.income()
Out[28]:
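Just to illustrate the general idea of cost-weighted evaluation (not necessarily copper's exact implementation): a confusion matrix can be weighted elementwise by a cost matrix and summed. The numbers below are made up:

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual class, columns = predicted class
cm = np.array([[1500, 200],
               [100, 200]])
# Per-cell weights, mirroring the shape of ml.costs = [[0, 4], [12, 16]]
costs = np.array([[0, 4], [12, 16]])

# Weight each cell and sum to get a single cost-aware score
score = (cm * costs).sum()
print(score)  # 0*1500 + 4*200 + 12*100 + 16*200 = 5200
```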
OK, that is good; we now know that GNB generates more revenue, but what are we comparing against? It is always useful to compare using a model against not using any model, or what I call being an idiot.
In [29]:
ml.income_no_ml()
Out[29]:
With that we can see that by sending catalogs to every person in the testing dataset we would have generated a loss of -8096; by using any model we can improve that, hell, even by doing nothing and playing FIFA all day we can improve that :P
Well, it took me around 4 to 5 hours to code the MachineLearning class and another hour or so to write this. I can say with confidence (being a user of both) that using this package is easier and faster than using Enterprise Miner.
At least I can say that all the work I did in 6 hours on Enterprise Miner would have taken about one hour using this; obviously I had to code it first, but now it is done.
I know that SAS has a lot more features, but I am just getting started :P. I also learned today that scikit-learn has tons of features I had never heard about, such as cross-validation; I will definitely take a closer look at those.
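For what it's worth, cross-validation in scikit-learn can be very short. A sketch with made-up data (the exact import path varies by scikit-learn version; this is the sklearn.model_selection form):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Made-up data standing in for the catalog features/target
rng = np.random.RandomState(0)
X, y = rng.rand(90, 3), rng.randint(0, 2, 90)

# 5-fold cross-validated accuracy: one score per fold
scores = cross_val_score(GaussianNB(), X, y, cv=5)
print(scores.mean())
```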
For now this will only work for prototyping some machine learning, but it is what I have learned in my Business Intelligence classes so far; I will update the package as I learn more.
The code is on GitHub: copper