Multibinner

The problem I want to solve is:

I have a huge pandas.DataFrame and I want to rebin some columns and get the oupt of custom function on those bins

The Multibinner module accept in input:

  • a pandas.DataFrame with columns
  • bins definition on some columns
  • the functions I want to calculate on the bins and on which columns

The output is a shorted pandas.DataFrame called multibinner.MBDataFrame with multindex, addressing which row goes in which bin.

There are also two useful functions:

  • 'multibinner.col_df_to_array' to get an actual numpy array of a single columns
  • 'multibinner.all_df_to_array' a dictionary of all the columns -> numpy arrays .

In [129]:
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline

import numpy as np
import pandas as pd
import collections
import multibinner as mb

Dataset definition

The irisdf shape is (150, 4), the bins below are defined on the columns:

  • sepal_length : 10 bins
  • sepal_width : 8 bins

That defines a 10x8 array, each bins containig a variable number of data point from the original DataFrame.

Passing a dictionary of function_names -> function reference we get as ouput the functions calculated on each bin, on the the output columns requested.

For example, here we have:

  • functions : {'elements' : len ,'average' : np.average}
  • output columns = petal_length,petal_width

The output columns will be :

  • petal_length : first columns in output + all the functions
    • petal_length_average
    • petal_length_elements
  • petal_width : second columns in output + all the functions
    • petal_width_average
    • petal_width_elements
  • central bin of each output row (could be switched off):
    • sepal_length
    • sepal_width

In [130]:
from sklearn import datasets
iris = datasets.load_iris()
irisdf = pd.DataFrame(iris.data,columns=iris.feature_names)
irisdf.columns = ['_'.join(i.split()[0:2]) for i in irisdf.columns]

In [131]:
irisdf.describe()


Out[131]:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

In [132]:
# Let's multibinning!

# functions we want to apply on the data in a single multidimensional bin:
aggregated_functions = {'elements' : len ,'average' : np.average}

# the columns we want to have in output:
out_columns = ['petal_length','petal_width']

# define the bins for sepal_length
sepal_length_bins = { 'start'  : 4  , 'stop'   : 8, 'n_bins' : 10}
# conformt function generating the bins for you
sepal_length_bins = mb.bingenerator(sepal_length_bins)

# again
sepal_width_bins = { 'start'  : 1  , 'stop'   : 5, 'n_bins' : 8 }
sepal_width_bins = mb.bingenerator(sepal_width_bins)

# use the dataframe column name as key this links the definition to the columns
group_variables = collections.OrderedDict([
                    ('sepal_length',sepal_length_bins),
                    ('sepal_width',sepal_width_bins)
                    ])
# I use OrderedDict to have fixed order, a normal dict is fine too.

print 20*'='
for key,val in group_variables.iteritems():
    print key+' :'
    for sub_key,sub_val in val.iteritems():
        print '{:>15s}: {}'.format(sub_key,sub_val)
    print 20*'='

# apply aggregated_functions on the dataframe and get the out_columns

# that is the object colelcting all the data that define the multi binning
mbdf =  mb.MultiBinnedDataFrame(binstocolumns = False,
                                dataframe = irisdf,
                                group_variables = group_variables,
                                aggregated_functions = aggregated_functions,
                                out_columns = out_columns)


====================
sepal_length :
         n_bins: 10
    bins_center: [ 4.2  4.6  5.   5.4  5.8  6.2  6.6  7.   7.4  7.8]
           stop: 8
          start: 4
       size_bin: 0.4
        n_edges: 11
           bins: [ 4.   4.4  4.8  5.2  5.6  6.   6.4  6.8  7.2  7.6  8. ]
====================
sepal_width :
         n_bins: 8
    bins_center: [ 1.25  1.75  2.25  2.75  3.25  3.75  4.25  4.75]
           stop: 5
          start: 1
       size_bin: 0.5
        n_edges: 9
           bins: [ 1.   1.5  2.   2.5  3.   3.5  4.   4.5  5. ]
====================

In [133]:
mbdf.MBDataFrame.head()


Out[133]:
petal_length_average petal_length_elements petal_width_average petal_width_elements
sepal_length_binned sepal_width_binned
0 4 1.1 1 0.100000 1
1 2 1.3 1 0.300000 1
3 1.4 1 0.200000 1
4 1.4 7 0.214286 7
5 1.0 1 0.200000 1

Define fucntion on specific columns

The standard behaviour is to calculate all the passed function on all the columns requested for output. If we want different fucntions on different columns it is easy, just define a dictionary containig

DataFrame columns -> functions

the latter could a dictionary function_names -> function reference too, like in the definition before.

Behind the scenes, the method multibinner.multibin checks if all keys in the passed dictionary in are in the columns of the input DataFrame:

set(aggregated_functions.keys()) <= set(DataFrame.columns)

One could easily define his own functions and the name to identify it , see below.


In [134]:
# Define a custom functions to get the first element per group,
# we are working on DataFrames after all
first_func = lambda x: x.iloc[0]

aggregated_functions = {
    'petal_length' : {'average' : np.average, 'first' : first_func},
    'petal_width'  : {'elements' : len ,'average' : np.average},    
    }

mbdf =  mb.MultiBinnedDataFrame(binstocolumns = True,
                                dataframe = irisdf,
                                group_variables = group_variables,
                                aggregated_functions = aggregated_functions,
                                out_columns = out_columns)

In [135]:
mbdf.MBDataFrame.head()


Out[135]:
petal_width_average petal_width_elements petal_length_average petal_length_first sepal_length sepal_width
sepal_length_binned sepal_width_binned
0 4 0.100000 1 1.1 1.1 4.2 3.25
1 2 0.300000 1 1.3 1.3 4.6 2.25
3 0.200000 1 1.4 1.4 4.6 2.75
4 0.214286 7 1.4 1.3 4.6 3.25
5 0.200000 1 1.0 1.0 4.6 3.75

In [136]:
# reconstruct the multidimensional array defined by group_variables
outstring = []

for key,val in mbdf.group_variables.iteritems():
    outstring.append('{} bins ({})'.format(val['n_bins'],key))

key = 'petal_width_elements'

print '{} array = {}'.format(key,' x '.join(outstring))
print 
print mbdf.col_df_to_array(key)


petal_width_elements array = 10 bins (sepal_length) x 8 bins (sepal_width)

[[ nan  nan  nan  nan   1.  nan  nan  nan]
 [ nan  nan   1.   1.   7.   1.  nan  nan]
 [ nan  nan   3.   2.  16.   9.  nan  nan]
 [ nan  nan   3.   3.   4.   6.   2.  nan]
 [ nan  nan  nan  15.   6.   1.   2.  nan]
 [ nan  nan   4.  13.   8.  nan  nan  nan]
 [ nan  nan  nan   8.  17.  nan  nan  nan]
 [ nan  nan  nan  nan   6.  nan  nan  nan]
 [ nan  nan  nan   2.   2.   1.  nan  nan]
 [ nan  nan  nan   2.   2.   2.  nan  nan]]

Visualization and wrap up

Now the visual part.

The question is:

which is :

  • the number of element per bin
  • the petal_width_average per bin

if we rebin the irisdf on

  • sepal_length_bins = { 'start' : 4 , 'stop' : 8, 'n_bins' : 10}
  • sepal_width_bins = { 'start' : 1 , 'stop' : 5, 'n_bins' : 8 } ?

In [137]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(figsize=[16,10], ncols=2, nrows=2)

cm = plt.get_cmap('jet')

key = 'petal_width_elements'
imgplot = ax1.imshow(mbdf.col_df_to_array(key), cmap = cm, 
                     interpolation='none',origin='lower')
plt.colorbar(imgplot, orientation='vertical', ax = ax1)
ax1.set_title(key)
ax1.grid(False) 

key = 'petal_width_average'
imgplot = ax2.imshow(mbdf.col_df_to_array(key), cmap = cm,
                     interpolation='none',origin='lower')
plt.colorbar(imgplot, orientation='vertical', ax = ax2)
ax2.set_title(key)
ax2.grid(False) 

key = 'petal_length_average'
imgplot = ax3.imshow(mbdf.col_df_to_array(key), cmap = cm, 
                     interpolation='none',origin='lower')
plt.colorbar(imgplot, orientation='vertical', ax = ax3)
ax3.set_title(key)
ax3.grid(False) 

# leverage pandas.DataFrame plot methods
key = 'petal_length_average'
scatterplot = mbdf.MBDataFrame.plot(kind='scatter', x='sepal_length', y='sepal_width',
                      c = mbdf.MBDataFrame[key],s = 450, ax=ax4, cmap = cm)
ax4.set_title('color:{}'.format(key));


We want more bins!

Here you are!


In [138]:
# define the bins for sepal_length
sepal_length_bins = { 'start'  : 4  , 'stop'   : 8, 'n_bins' : 30}

sepal_length_bins = mb.bingenerator(sepal_length_bins)

sepal_width_bins = { 'start'  : 1  , 'stop'   : 5, 'n_bins' : 25}

sepal_width_bins = mb.bingenerator(sepal_width_bins)

# use the dataframe column name as key this links the definition to the columns
group_variables = collections.OrderedDict([
                    ('sepal_length',sepal_length_bins),
                    ('sepal_width',sepal_width_bins)
                    ])

# that is the object colelcting all the data that define the multi binning
mbdf =  mb.MultiBinnedDataFrame(binstocolumns = True,
                                dataframe = irisdf,
                                group_variables = group_variables,
                                aggregated_functions = aggregated_functions,
                                out_columns = out_columns)

print 'Output dataframe number of bins: {} '.format(mbdf.col_df_to_array(mbdf.MBDataFrame.columns[0]).size)
print 'Output dataframe number of bins with actual data: {} '.format(mbdf.MBDataFrame.shape[0])
print 'It is completaly redundant, but it works.'


Output dataframe number of bins: 750 
Output dataframe number of bins with actual data: 89 
It is completaly redundant, but it works.

In [139]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(figsize=[16,10], ncols=2, nrows=2)

cm = plt.get_cmap('jet')

key = 'petal_width_elements'
imgplot = ax1.imshow(mbdf.col_df_to_array(key), cmap = cm, 
                     interpolation='none',origin='lower')
plt.colorbar(imgplot, orientation='vertical', ax = ax1)
ax1.set_title(key)
ax1.grid(False) 

key = 'petal_width_average'
imgplot = ax2.imshow(mbdf.col_df_to_array(key), cmap = cm,
                     interpolation='none',origin='lower')
plt.colorbar(imgplot, orientation='vertical', ax = ax2)
ax2.set_title(key)
ax2.grid(False) 

key = 'petal_length_average'
imgplot = ax3.imshow(mbdf.col_df_to_array(key), cmap = cm, 
                     interpolation='none',origin='lower')
plt.colorbar(imgplot, orientation='vertical', ax = ax3)
ax3.set_title(key)
ax3.grid(False) 

# leverage pandas.DataFrame plot methods
key = 'petal_length_average'
scatterplot = mbdf.MBDataFrame.plot(kind='scatter', x='sepal_length', y='sepal_width',
                      c = mbdf.MBDataFrame[key],s = 100, ax=ax4, cmap = cm)
ax4.set_title('color:{}'.format(key));