As part of my Master's in IT & Management, this semester I am taking a class called "Advanced Business Intelligence", which to business people is the same as machine learning or data analysis; in general, they teach how to use SAS (Enterprise Miner) to do data mining.
SAS and Enterprise Miner are good software, with some problems:
I was able to get it for $100 because the university has an arrangement with SAS, but without that it is impossible to buy for personal use. For that reason I started learning R (from Coursera) and the Python data analysis packages (pandas and scikit-learn) while I was taking my "(basic) Business Intelligence" class.
I learned that it is possible to replace SAS with R or Python, but some tasks that are easy in SAS can take a long time. I want to contribute by making some of those tasks easier while I keep learning Python. For that reason I am going to try to do everything we do in my BI class with Python, and to turn it into a package that makes those tasks easier.
The first week was a review of how to import data into SAS Enterprise Miner and explore the data a little.
After thinking a lot about what to call the package, I decided to call it copper (inspired by the dog from The Fox and the Hound, a.k.a. the saddest movie ever).
Note: I am going to use the same data as in my class, a dataset of donations, available here: donors.csv
One thing that SAS does really well and pandas does not have is metadata.
So I created a class called Dataset, which is a wrapper around a few pandas DataFrames that introduces metadata.
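To give an idea of what such a wrapper could look like, here is a minimal sketch. It is not copper's actual code: the MetaFrame class, its labels, and the default rules are all made up for illustration, assuming only that roles and types are stored per column next to the DataFrame.

```python
import pandas as pd


class MetaFrame:
    """Hypothetical metadata-aware wrapper; an illustration, not copper's code."""

    INPUT, TARGET, REJECTED = "Input", "Target", "Rejected"
    NUMBER, CATEGORY = "Number", "Category"

    def __init__(self, frame):
        self.frame = frame
        # One entry per column: numeric dtypes become Number, the rest Category.
        self.type = pd.Series(
            [self.NUMBER if pd.api.types.is_numeric_dtype(dtype) else self.CATEGORY
             for dtype in frame.dtypes],
            index=frame.columns,
        )
        # Every column starts as an input until the user says otherwise.
        self.role = pd.Series(self.INPUT, index=frame.columns)

    @property
    def metadata(self):
        # Side-by-side view of role and type, indexed by column name.
        return pd.DataFrame({"Role": self.role, "Type": self.type})


frame = pd.DataFrame({"ID": [1, 2, 3], "Gender": ["F", "M", "F"], "Target": [0, 1, 0]})
mf = MetaFrame(frame)
mf.role["Target"] = mf.TARGET   # mark the column to predict
mf.type["ID"] = mf.CATEGORY     # an ID is a label, not a quantity
print(mf.metadata)
```

Keeping role and type as plain pandas Series indexed by column name means they can be edited with ordinary item assignment, which is the style used in the cells further down.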
To load data, you have to import copper and then configure the directory path for the project. Inside the project directory there needs to be another folder called 'data' containing the data (csv files, for example).
In [1]:
import copper
copper.project.path = '../'
Then create a new Dataset and load the data.csv file from the data folder.
In [2]:
ds = copper.Dataset()
In [3]:
ds.load('data.csv')
By default copper tries to find the best match for each column, similar to what SAS does.
In [4]:
ds.metadata
Out[4]:
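How the guessing works inside copper is not shown here, but a rough sketch of one possible heuristic follows. The guess_type function, the three labels, and the "looks like money" rule are my own assumptions for illustration, using only standard pandas calls.

```python
import pandas as pd


def guess_type(column):
    """Guess a column type; the labels and rules are illustrative assumptions."""
    if pd.api.types.is_numeric_dtype(column):
        return "Number"
    values = column.dropna().astype(str)
    # Treat the column as money when every non-missing value looks like "$1,234.50".
    if len(values) > 0 and values.str.fullmatch(r"\$[\d,]+(\.\d+)?").all():
        return "Money"
    return "Category"


# Made-up data, only to exercise the three branches.
frame = pd.DataFrame({
    "Age": [34, 51, 27],
    "Income": ["$52,000", "$38,500.50", None],
    "Gender": ["F", "M", "F"],
})
guesses = {name: guess_type(frame[name]) for name in frame.columns}
print(guesses)
```

A guess like this will sometimes be wrong (an ID column of integers is guessed as a number, for example), which is why being able to override the defaults matters.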
Of course, it is possible to change the default role and type of each column. Let's fix some of the metadata:
In [5]:
ds.role['TARGET_D'] = ds.REJECTED
ds.role['TARGET_B'] = ds.TARGET
ds.type['ID'] = ds.CATEGORY
In [6]:
ds.metadata.head(3)
Out[6]:
Depending on the metadata, copper transforms the data. Mainly it transforms non-numbers into numbers to make machine learning possible, because scikit-learn accepts only numeric input. But more on that in a later post.
Before going into machine learning it is a good idea to explore the data; the usual way is with a histogram. It is easy to explore money (numerical) columns. I removed the legend because it is too big, but the method also returns a list with the information for each bin.
In [8]:
ds.histogram('DemMedIncome', legend=False, retList=True)
Out[8]:
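For reference, the same kind of per-bin information can be computed without copper using numpy.histogram. This is only a sketch with made-up income values; the exact format of the list copper returns may be different.

```python
import numpy as np

# Made-up income values, used only to illustrate the binning.
incomes = np.array([12000, 18000, 25000, 31000, 47000, 52000, 58000, 90000])

# counts has one entry per bin; edges has one more entry than counts.
counts, edges = np.histogram(incomes, bins=4)

# Pair each bin's lower and upper edge with the number of values in it.
bins = [(float(lo), float(hi), int(n))
        for lo, hi, n in zip(edges[:-1], edges[1:], counts)]
for lo, hi, n in bins:
    print(f"{lo:>8.0f} to {hi:>8.0f}: {n}")
```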
It is also possible to explore categorical variables.
In [9]:
ds.histogram('DemGender')
We can take a look at how the data is transformed.
In [10]:
ds.inputs
Out[10]:
inputs is a pandas DataFrame. We can see that each categorical variable is divided into several columns filled with ones and zeros to make machine learning possible; money columns are also converted to plain numbers.
Note that the dtypes are float and int, so it is possible to feed that into scikit-learn by calling inputs.values to get a numpy array.
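The transformation described above can be approximated with plain pandas. The sketch below is not what copper runs internally; it assumes money values are stored as strings like "$52,000" and uses made-up rows whose column names only echo the donors data.

```python
import pandas as pd

# Made-up rows; the column names only echo the donors data.
frame = pd.DataFrame({
    "DemGender": ["F", "M", "U", "F"],
    "DemMedIncome": ["$52,000", "$38,500", "$0", "$61,250"],
    "DemAge": [34, 51, 27, 45],
})

# Strip the currency symbol and thousands separators, then parse as float.
income = frame["DemMedIncome"].str.replace(r"[$,]", "", regex=True).astype(float)

# One indicator column per category, filled with ones and zeros.
gender = pd.get_dummies(frame["DemGender"], prefix="DemGender", dtype=int)

inputs = pd.concat([gender, income, frame[["DemAge"]]], axis=1)
print(inputs.dtypes)

# A purely numeric DataFrame converts cleanly to a numpy array.
X = inputs.values
print(X.shape)
```

Because every column is now int or float, the resulting array is something a scikit-learn estimator can take as its input matrix.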
That's it for now. Next week I hope to get the integration with scikit-learn working, to make comparing models as easy as (and why not easier than) with SAS.
The code is on GitHub: copper