The problem I want to solve is this: I have a huge pandas.DataFrame and I want to rebin some columns and get the output of custom functions on those bins.
The Multibinner module accepts as input:
pandas.DataFrame
with columns
The output is a shorter pandas.DataFrame, called multibinner.MBDataFrame, with a MultiIndex addressing which row goes in which bin.
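To make the idea concrete, here is a minimal sketch of multidimensional rebinning done with plain pandas (`pd.cut` plus `groupby`). It is not how Multibinner is implemented; the synthetic data and the column names `x`, `y`, `value` are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# small synthetic frame standing in for the "huge" DataFrame (illustrative data)
rng = np.random.RandomState(0)
df = pd.DataFrame({
    'x': rng.uniform(0, 1, 1000),
    'y': rng.uniform(0, 1, 1000),
    'value': rng.normal(size=1000),
})

# assign every row to a bin along each grouping column (4 bins on x, 3 on y)
x_bin = pd.cut(df['x'], bins=np.linspace(0, 1, 5), labels=False, include_lowest=True)
y_bin = pd.cut(df['y'], bins=np.linspace(0, 1, 4), labels=False, include_lowest=True)

# group by the pair of bin indices and aggregate the column of interest
binned = (df.groupby([x_bin.rename('x_bin'), y_bin.rename('y_bin')])['value']
            .agg(['count', 'mean']))
```

The result has one row per occupied bin and a two-level index, which is the same shape of output described above.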
There are also two useful functions:
In [129]:
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import collections
import multibinner as mb
The irisdf shape is (150, 4). The bins below are defined on the columns:
sepal_length: 10 bins
sepal_width: 8 bins
That defines a 10x8 array, each bin containing a variable number of data points from the original DataFrame.
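As a rough check of what "a 10x8 array with a variable number of points per bin" means, the sketch below counts points per bin with `np.histogram2d`. The data are random numbers drawn in the same ranges used for the bins, not the iris measurements, so the counts themselves are only illustrative.

```python
import numpy as np

# 150 illustrative points in the ranges used for sepal_length and sepal_width
rng = np.random.RandomState(42)
length_like = rng.uniform(4, 8, 150)
width_like = rng.uniform(1, 5, 150)

# 10 bins on the first axis, 8 on the second
counts, length_edges, width_edges = np.histogram2d(
    length_like, width_like, bins=[10, 8], range=[[4, 8], [1, 5]])

print(counts.shape)  # one cell per (length bin, width bin) pair
```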
Passing a dictionary of function_names -> function reference, we get as output the functions calculated on each bin, on the output columns requested.
For example, here we have:
functions: {'elements' : len, 'average' : np.average}
output columns = petal_length, petal_width
The output columns will be:
petal_length: first column in output + all the functions: petal_length_average, petal_length_elements
petal_width: second column in output + all the functions: petal_width_average, petal_width_elements
sepal_length
sepal_width
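The naming scheme (output column name, underscore, function name) can be sketched with a plain pandas groupby. This is only an illustration of the convention under my own assumptions: the tiny frame and the `group` column are made up, and the code is not taken from Multibinner.

```python
import numpy as np
import pandas as pd

# tiny illustrative frame; 'group' plays the role of the bin label
df = pd.DataFrame({
    'group': [0, 0, 1, 1, 1],
    'petal_length': [1.0, 2.0, 3.0, 4.0, 5.0],
    'petal_width': [0.1, 0.3, 1.0, 1.5, 2.0],
})

functions = {'elements': len, 'average': np.average}
out_columns = ['petal_length', 'petal_width']

# every function is applied to every output column, one result column each
result = pd.DataFrame({
    '{}_{}'.format(col, name): df.groupby('group')[col].apply(func)
    for col in out_columns
    for name, func in functions.items()
})
```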
In [130]:
from sklearn import datasets
iris = datasets.load_iris()
irisdf = pd.DataFrame(iris.data,columns=iris.feature_names)
irisdf.columns = ['_'.join(i.split()[0:2]) for i in irisdf.columns]
In [131]:
irisdf.describe()
Out[131]:
In [132]:
# Let's do some multibinning!
# functions we want to apply on the data in a single multidimensional bin:
aggregated_functions = {'elements' : len ,'average' : np.average}
# the columns we want to have in output:
out_columns = ['petal_length','petal_width']
# define the bins for sepal_length
sepal_length_bins = { 'start' : 4 , 'stop' : 8, 'n_bins' : 10}
# convenience function generating the bins for you
sepal_length_bins = mb.bingenerator(sepal_length_bins)
# again
sepal_width_bins = { 'start' : 1 , 'stop' : 5, 'n_bins' : 8 }
sepal_width_bins = mb.bingenerator(sepal_width_bins)
# use the dataframe column name as key: this links the bin definition to the column
group_variables = collections.OrderedDict([
    ('sepal_length', sepal_length_bins),
    ('sepal_width', sepal_width_bins)
])
# I use OrderedDict to have a fixed order; a normal dict is fine too.
print(20*'=')
for key, val in group_variables.items():
    print(key + ' :')
    for sub_key, sub_val in val.items():
        print('{:>15s}: {}'.format(sub_key, sub_val))
    print(20*'=')
# apply aggregated_functions on the dataframe and get the out_columns;
# this is the object collecting all the data that define the multi binning
mbdf = mb.MultiBinnedDataFrame(binstocolumns = False,
                               dataframe = irisdf,
                               group_variables = group_variables,
                               aggregated_functions = aggregated_functions,
                               out_columns = out_columns)
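The bin definitions in the cell above are given as start, stop and n_bins. What mb.bingenerator adds to each dictionary is not shown here, so the following is only a sketch, assuming equal-width bins, of how bin edges follow from those three numbers. The helper `equal_width_edges` is hypothetical and not part of Multibinner.

```python
import numpy as np

def equal_width_edges(start, stop, n_bins):
    """Hypothetical helper (not part of multibinner):
    n_bins equal-width bins need n_bins + 1 edges between start and stop."""
    return np.linspace(start, stop, n_bins + 1)

edges = equal_width_edges(start=4, stop=8, n_bins=10)

# np.digitize returns 1-based positions for values inside the range,
# so subtract 1 to get a 0-based bin index
values = np.array([4.3, 5.8, 7.9])
bin_index = np.digitize(values, edges) - 1
```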
In [133]:
mbdf.MBDataFrame.head()
Out[133]:
The standard behaviour is to calculate all the passed functions on all the columns requested for output. If we want different functions on different columns it is easy: just define a dictionary mapping
DataFrame columns
-> functions
where the latter can itself be a dictionary of function_names
-> function reference
like in the definition before.
Behind the scenes, the method multibinner.multibin
checks whether all the keys of the passed dictionary are columns of the input DataFrame:
set(aggregated_functions.keys()) <= set(DataFrame.columns)
You can easily define your own functions and the names that identify them, see below.
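The subset check above is enough to tell the two dictionary layouts apart. A minimal sketch of that dispatch follows; the function `normalize_functions` is hypothetical, written only to illustrate the idea, and is not Multibinner's code.

```python
import numpy as np

def normalize_functions(aggregated_functions, out_columns, dataframe_columns):
    """Hypothetical helper: always return a dict column -> {name: function}."""
    if set(aggregated_functions.keys()) <= set(dataframe_columns):
        # keys are DataFrame columns: already a per-column definition
        return dict(aggregated_functions)
    # keys are function names: apply every function to every output column
    return {col: dict(aggregated_functions) for col in out_columns}

columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
flat = {'elements': len, 'average': np.average}
per_column = {'petal_length': {'average': np.average}}

from_flat = normalize_functions(flat, ['petal_length', 'petal_width'], columns)
from_per_column = normalize_functions(per_column, ['petal_length'], columns)
```

One consequence of this kind of check is that a function name that happens to coincide with a column name would be ambiguous, so distinct names are the safer choice.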
In [134]:
# Define a custom function to get the first element per group,
# we are working on DataFrames after all
first_func = lambda x: x.iloc[0]
aggregated_functions = {
    'petal_length' : {'average' : np.average, 'first' : first_func},
    'petal_width' : {'elements' : len ,'average' : np.average},
}
mbdf = mb.MultiBinnedDataFrame(binstocolumns = True,
                               dataframe = irisdf,
                               group_variables = group_variables,
                               aggregated_functions = aggregated_functions,
                               out_columns = out_columns)
In [135]:
mbdf.MBDataFrame.head()
Out[135]:
In [136]:
# reconstruct the multidimensional array defined by group_variables
outstring = []
for key, val in mbdf.group_variables.items():
    outstring.append('{} bins ({})'.format(val['n_bins'], key))
key = 'petal_width_elements'
print('{} array = {}'.format(key, ' x '.join(outstring)))
print('')
print(mbdf.col_df_to_array(key))
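The binned DataFrame only has rows for bins that contain data, while the array has one cell per bin. A minimal sketch of that sparse-to-dense step with NumPy is below. It shows the general technique only, on made-up bin indices, and is not the code behind col_df_to_array.

```python
import numpy as np
import pandas as pd

# illustrative result: only 3 of the 3 x 4 bins contain data
index = pd.MultiIndex.from_tuples([(0, 1), (2, 0), (2, 2)],
                                  names=['x_bin', 'y_bin'])
counts = pd.Series([5.0, 3.0, 7.0], index=index)

# start from an array of NaN (empty bins) and fill the occupied cells
shape = (3, 4)
dense = np.full(shape, np.nan)
rows = counts.index.get_level_values('x_bin').values
cols = counts.index.get_level_values('y_bin').values
dense[rows, cols] = counts.values
```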
In [137]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(figsize=[16,10], ncols=2, nrows=2)
cm = plt.get_cmap('jet')
key = 'petal_width_elements'
imgplot = ax1.imshow(mbdf.col_df_to_array(key), cmap = cm,
                     interpolation='none',origin='lower')
plt.colorbar(imgplot, orientation='vertical', ax = ax1)
ax1.set_title(key)
ax1.grid(False)
key = 'petal_width_average'
imgplot = ax2.imshow(mbdf.col_df_to_array(key), cmap = cm,
                     interpolation='none',origin='lower')
plt.colorbar(imgplot, orientation='vertical', ax = ax2)
ax2.set_title(key)
ax2.grid(False)
key = 'petal_length_average'
imgplot = ax3.imshow(mbdf.col_df_to_array(key), cmap = cm,
                     interpolation='none',origin='lower')
plt.colorbar(imgplot, orientation='vertical', ax = ax3)
ax3.set_title(key)
ax3.grid(False)
# leverage pandas.DataFrame plot methods
key = 'petal_length_average'
scatterplot = mbdf.MBDataFrame.plot(kind='scatter', x='sepal_length', y='sepal_width',
                                    c = mbdf.MBDataFrame[key],s = 450, ax=ax4, cmap = cm)
ax4.set_title('color:{}'.format(key));
In [138]:
# define the bins for sepal_length
sepal_length_bins = { 'start' : 4 , 'stop' : 8, 'n_bins' : 30}
sepal_length_bins = mb.bingenerator(sepal_length_bins)
sepal_width_bins = { 'start' : 1 , 'stop' : 5, 'n_bins' : 25}
sepal_width_bins = mb.bingenerator(sepal_width_bins)
# use the dataframe column name as key: this links the bin definition to the column
group_variables = collections.OrderedDict([
    ('sepal_length', sepal_length_bins),
    ('sepal_width', sepal_width_bins)
])
# this is the object collecting all the data that define the multi binning
mbdf = mb.MultiBinnedDataFrame(binstocolumns = True,
                               dataframe = irisdf,
                               group_variables = group_variables,
                               aggregated_functions = aggregated_functions,
                               out_columns = out_columns)
print('Output dataframe number of bins: {} '.format(mbdf.col_df_to_array(mbdf.MBDataFrame.columns[0]).size))
print('Output dataframe number of bins with actual data: {} '.format(mbdf.MBDataFrame.shape[0]))
print('It is completely redundant, but it works.')
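With 30 x 25 bins there are 750 cells but only 150 data points, so most bins must be empty. The sketch below shows the same kind of comparison with `np.histogram2d` on random points in the same ranges. The data are illustrative, so the exact number of occupied bins will differ from the iris result.

```python
import numpy as np

# 150 illustrative points in the ranges used for the bins
rng = np.random.RandomState(0)
x = rng.uniform(4, 8, 150)
y = rng.uniform(1, 5, 150)

counts, _, _ = np.histogram2d(x, y, bins=[30, 25], range=[[4, 8], [1, 5]])

total_bins = counts.size                   # every cell of the 30 x 25 grid
occupied_bins = np.count_nonzero(counts)   # cells holding at least one point
print('bins: {}, bins with data: {}'.format(total_bins, occupied_bins))
```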
In [139]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(figsize=[16,10], ncols=2, nrows=2)
cm = plt.get_cmap('jet')
key = 'petal_width_elements'
imgplot = ax1.imshow(mbdf.col_df_to_array(key), cmap = cm,
                     interpolation='none',origin='lower')
plt.colorbar(imgplot, orientation='vertical', ax = ax1)
ax1.set_title(key)
ax1.grid(False)
key = 'petal_width_average'
imgplot = ax2.imshow(mbdf.col_df_to_array(key), cmap = cm,
                     interpolation='none',origin='lower')
plt.colorbar(imgplot, orientation='vertical', ax = ax2)
ax2.set_title(key)
ax2.grid(False)
key = 'petal_length_average'
imgplot = ax3.imshow(mbdf.col_df_to_array(key), cmap = cm,
                     interpolation='none',origin='lower')
plt.colorbar(imgplot, orientation='vertical', ax = ax3)
ax3.set_title(key)
ax3.grid(False)
# leverage pandas.DataFrame plot methods
key = 'petal_length_average'
scatterplot = mbdf.MBDataFrame.plot(kind='scatter', x='sepal_length', y='sepal_width',
                                    c = mbdf.MBDataFrame[key],s = 100, ax=ax4, cmap = cm)
ax4.set_title('color:{}'.format(key));