As part of my Master in IT & Management this semester I am taking a class called "Advanced Business Intelligence" which is the same as machine learning or data analysis to the Business people; in general they teach how to use SAS (Enteprise Miner) to do data mining.

SAS and Enterprise Miner is a good software - with some problems:

Costs a lot of money
Is slow as hell, seriously
Good for business people, not so much for engineers
Costs a lot of money

I was able to get it for $100 dollars because the university has an arrangement with SAS but without that is impossible to buy for personal use. For that reason I start learning R (from coursera) and Python Data analysis packages (pandas and scikit-learn), while I was taking my "(basic) Business Intelligence".

I learned that is possible to replace SAS with R or Python but some easy tasks can take a long time. I want to make a contribution by making some tasks easier and keep learning python. For that reason I am going to try to do everything we do on my BI class with Python and try to make it into a package for making some tasks more easy.

The first week was review of how to import Data into SAS Enterprise Miner and explore a little bit the data.

Copper

After thinking a lot how to call the package I decide to call it copper (inspired by the dog of The Fox and the Hound a.k.a. the saddest movie ever).

Importing

Note: I am going to use the same data that as on my class, a dataset from donations, available here: donors.csv

One thing that SAS does really good and pandas does not have is meta-data:

Understands the type of each column: for example, columns with money symbols are converted to numbers that later can be used for doing machine learning.
But also can change the type (or level how SAS call it) to be categorical or number variables, also can define roles: for example Rejected columns are not used for Machine Learning and that way is very easy to try different combinations.

So I create a class called DataSet which is a wrapper around a few pandas DataFrames to introduce meta-data.

To load data have to import copper and then configure the directory path for the project. Inside the project directory needs to be another folder called 'data' with the data (csv files for example)



In [1]:

    
import copper
copper.project.path = '../'

Then create a new Dataset and load the data.csv file from the data folder.



In [2]:

    
ds = copper.Dataset()



In [3]:

    
ds.load('data.csv')

Metadata

By default copper tries to find the best match for each column, similar at what SAS does.

Depending on the name of the column decides what the target and ID columns are; the rest are inputs.
Also tries to figure out the type of each column depending on the dtype (from pandas/numpy) and the content: for example if the dtype is object but most of the values starts with a $ symbols defines the column as a money column.



In [4]:

    
ds.metadata









    Out[4]:






  
    
      
      Role
      Type
    
  
  
    
      TARGET_B
       Input
         Number
    
    
      ID
          ID
         Number
    
    
      TARGET_D
       Input
          Money
    
    
      GiftCnt36
       Input
         Number
    
    
      GiftCntAll
       Input
         Number
    
    
      GiftCntCard36
       Input
         Number
    
    
      GiftCntCardAll
       Input
         Number
    
    
      GiftAvgLast
       Input
          Money
    
    
      GiftAvg36
       Input
          Money
    
    
      GiftAvgAll
       Input
          Money
    
    
      GiftAvgCard36
       Input
          Money
    
    
      GiftTimeLast
       Input
         Number
    
    
      GiftTimeFirst
       Input
         Number
    
    
      PromCnt12
       Input
         Number
    
    
      PromCnt36
       Input
         Number
    
    
      PromCntAll
       Input
         Number
    
    
      PromCntCard12
       Input
         Number
    
    
      PromCntCard36
       Input
         Number
    
    
      PromCntCardAll
       Input
         Number
    
    
      StatusCat96NK
       Input
       Category
    
    
      StatusCatStarAll
       Input
         Number
    
    
      DemCluster
       Input
         Number
    
    
      DemAge
       Input
         Number
    
    
      DemGender
       Input
       Category
    
    
      DemHomeOwner
       Input
       Category
    
    
      DemMedHomeValue
       Input
          Money
    
    
      DemPctVeterans
       Input
         Number
    
    
      DemMedIncome
       Input
          Money

Of course is possible to change the defaults role and type of each column, lets fix some of the metadata



In [5]:

    
ds.role['TARGET_D'] = ds.REJECTED
ds.role['TARGET_B'] = ds.TARGET
ds.type['ID'] = ds.CATEGORY



In [6]:

    
ds.metadata.head(3)









    Out[6]:






  
    
      
      Role
      Type
    
  
  
    
      TARGET_B
         Target
         Number
    
    
      ID
             ID
       Category
    
    
      TARGET_D
       Rejected
          Money

Depending of the metadata copper transforms the data. Mainly it transforms non-numbers into numbers to make machine learning possible; in scikit-learn is necessary to enter only numbers. But more on that on a later post.

Histograms

Before going into Machine Learning is a good idea to explore the data, the usual way is with a histogram. Is easy to explore money (numerical) columns. I remove the legend, because is to big but the method also returns a list with the information of each bin.



In [8]:

    
ds.histogram('DemMedIncome', legend=False, retList=True)









    Out[8]:





0          0.0 - 10000.05: 2358
1         10000.0 - 20000.10: 9
2       20000.1 - 30000.15: 304
3      30000.1 - 40000.20: 1397
4      40000.2 - 50000.25: 2187
5      50000.2 - 60000.30: 1303
6       60000.3 - 70000.35: 921
7       70000.3 - 80000.40: 550
8       80000.4 - 90000.45: 290
9      90000.4 - 100000.50: 130
10    100000.5 - 110000.55: 110
11     110000.5 - 120000.60: 39
12     120000.6 - 130000.65: 34
13     130000.6 - 140000.70: 18
14      140000.7 - 150000.75: 7
15     150000.8 - 160000.80: 13
16      160000.8 - 170000.85: 7
17      170000.8 - 180000.90: 7
18      180000.9 - 190000.95: 0
19      190000.9 - 200001.00: 2

Is also possible to explore categorical variables.



In [9]:

    
ds.histogram('DemGender')

Inputs

We can take a look at how the data is transformed.



In [10]:

    
ds.inputs









    Out[10]:





<class 'pandas.core.frame.DataFrame'>
Int64Index: 9686 entries, 0 to 9685
Data columns:
GiftCnt36            9686  non-null values
GiftCntAll           9686  non-null values
GiftCntCard36        9686  non-null values
GiftCntCardAll       9686  non-null values
GiftAvgLast          9686  non-null values
GiftAvg36            9686  non-null values
GiftAvgAll           9686  non-null values
GiftAvgCard36        7906  non-null values
GiftTimeLast         9686  non-null values
GiftTimeFirst        9686  non-null values
PromCnt12            9686  non-null values
PromCnt36            9686  non-null values
PromCntAll           9686  non-null values
PromCntCard12        9686  non-null values
PromCntCard36        9686  non-null values
PromCntCardAll       9686  non-null values
StatusCat96NK [A]    9686  non-null values
StatusCat96NK [E]    9686  non-null values
StatusCat96NK [F]    9686  non-null values
StatusCat96NK [L]    9686  non-null values
StatusCat96NK [N]    9686  non-null values
StatusCat96NK [S]    9686  non-null values
StatusCatStarAll     9686  non-null values
DemCluster           9686  non-null values
DemAge               7279  non-null values
DemGender [F]        9686  non-null values
DemGender [M]        9686  non-null values
DemGender [U]        9686  non-null values
DemHomeOwner [H]     9686  non-null values
DemHomeOwner [U]     9686  non-null values
DemMedHomeValue      9686  non-null values
DemPctVeterans       9686  non-null values
DemMedIncome         9686  non-null values
dtypes: float64(7), int64(26)

inputs is a pandas DataFrame. We can see that each categorical variables are divided into more columns that are filled with one's and zero's for doing machine learning possible also money columns are converted to be numbers only.

See that the dtypes are float and int so is possible to enter that on scikit-learn by calling inputs.values to get a numpy array.

Conclusion

Thats it for now, the next week I hope to get the integration with scikit-learn to make comparison of models as easy (and why not easier) than with SAS.

The code is on github: copper

	Role	Type
TARGET_B	Input	Number
ID	ID	Number
TARGET_D	Input	Money
GiftCnt36	Input	Number
GiftCntAll	Input	Number
GiftCntCard36	Input	Number
GiftCntCardAll	Input	Number
GiftAvgLast	Input	Money
GiftAvg36	Input	Money
GiftAvgAll	Input	Money
GiftAvgCard36	Input	Money
GiftTimeLast	Input	Number
GiftTimeFirst	Input	Number
PromCnt12	Input	Number
PromCnt36	Input	Number
PromCntAll	Input	Number
PromCntCard12	Input	Number
PromCntCard36	Input	Number
PromCntCardAll	Input	Number
StatusCat96NK	Input	Category
StatusCatStarAll	Input	Number
DemCluster	Input	Number
DemAge	Input	Number
DemGender	Input	Category
DemHomeOwner	Input	Category
DemMedHomeValue	Input	Money
DemPctVeterans	Input	Number
DemMedIncome	Input	Money

	Role	Type
TARGET_B	Target	Number
ID	ID	Category
TARGET_D	Rejected	Money