In [1]:

    
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline

import numpy as np
import pandas as pd
import collections
import multibinner as mb



In [2]:

    
from skimage import io
image = np.flipud(io.imread('https://media4.giphy.com/media/S3mBspMr0r5HW/200_s.gif'))

Dataset

The initial data are read from an image; n_data samples are then drawn from it.

The image contains 200x200 = 40k pixels.

We will extract 400k random points from the image and build a pandas.DataFrame from them.

This mimics, for example, the sampling process of a spacecraft looking at a target (Earth or another body) and collecting far more data points than are needed to reconstruct a coherent representation.

Moreover, visualizing 400k points x 3 columns is difficult, so we will multibin the DataFrame into 200 bins along x and 200 bins along y, calculate the average in each bin, and get a 200x200 array of data as output.

multibinner.MultiBinnedDataFrame can bin along as many dimensions as one likes; the 2D example here is used because it is easy to display.
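To make the idea of "bin, then average" concrete before bringing in multibinner, here is a minimal sketch that uses only NumPy on a tiny synthetic grid. The 4x5 `truth` array, the seed and all variable names are assumptions made for illustration; this is not how multibinner is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical stand-in for the image: a 4x5 grid of known values.
truth = np.arange(20, dtype=float).reshape(4, 5)
n_rows, n_cols = truth.shape

# Oversample: 50 random points per pixel on average, plus small noise.
n_points = truth.size * 50
x = rng.random(n_points) * n_cols
y = rng.random(n_points) * n_rows
value = truth[y.astype(int), x.astype(int)] + rng.random(n_points) * 0.1

# One bin per pixel; histogram2d sums the weights that fall in each bin.
bins = [np.linspace(0, n_rows, n_rows + 1), np.linspace(0, n_cols, n_cols + 1)]
sums, _, _ = np.histogram2d(y, x, bins=bins, weights=value)
counts, _, _ = np.histogram2d(y, x, bins=bins)

# Per-bin average; a bin that received no point would come out as NaN.
with np.errstate(invalid="ignore", divide="ignore"):
    binned_mean = sums / counts
```

Because the noise is uniform in [0, 0.1), each entry of `binned_mean` should sit slightly above the matching entry of `truth`, which is the same effect visible in the binned averages later in this notebook.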



In [3]:

    
image_df = pd.DataFrame(image.reshape(-1,image.shape[-1]),columns=['red','green','blue'])
image_df.describe()









    Out[3]:

                red         green          blue
count  40000.000000  40000.000000  40000.000000
mean      47.090475     25.723300     34.231950
std       82.890291     54.544805     72.300951
min        0.000000      0.000000      0.000000
25%        0.000000      0.000000      0.000000
50%        0.000000      0.000000      0.000000
75%       47.000000     20.000000     20.000000
max      254.000000    254.000000    254.000000



In [4]:

    
n_data = image.reshape(-1,image.shape[-1]).shape[0]*10 # 10 times the original number of pixels : overkill!
x = np.random.random_sample(n_data)*image.shape[1]
y = np.random.random_sample(n_data)*image.shape[0]



In [5]:

    
data =  pd.DataFrame({'x' : x, 'y' : y })

# sample the original image at each random point and add a little uniform noise (0 to 0.1)
for index, name in enumerate(['red', 'green', 'blue']):
    data[name] = image[data.y.astype(int), data.x.astype(int), index] + np.random.rand(n_data)*.1
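The lookup in the cell above relies on truncating continuous coordinates to integer pixel indices. A small sketch on a made-up 2x3 single-channel array (the array and the sample positions are assumptions chosen for illustration) shows which pixel each point picks:

```python
import numpy as np

# Hypothetical 2x3 single-channel image (rows = y, columns = x).
tiny = np.array([[10, 20, 30],
                 [40, 50, 60]])

# Continuous sample positions inside the image extent.
xs = np.array([0.2, 1.7, 2.9, 0.5])
ys = np.array([0.9, 0.1, 1.5, 1.99])

# Truncating to int selects the pixel that contains each point;
# the row index (y) comes first, then the column index (x).
samples = tiny[ys.astype(int), xs.astype(int)]
print(samples)  # [10 20 60 40]
```

Note the index order: the array is addressed as [row, column], that is [y, x], which is why the notebook passes data.y before data.x.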



In [6]:

    
data.describe().T









    Out[6]:

        count        mean        std           min        25%         50%         75%         max
x      400000  100.263406  57.754107  7.381100e-06  50.280294  100.353366  150.265268  199.999664
y      400000   99.962667  57.699804  5.099276e-04  50.069307  100.014117  149.758329  199.999850
red    400000   47.192283  82.955866  2.707846e-09   0.040020    0.079993   47.014229  254.099999
green  400000   25.748771  54.477645  2.671977e-08   0.039998    0.080019   20.095695  254.099992
blue   400000   34.213376  72.194960  4.496242e-08   0.040098    0.079982   20.056783  254.099996

Data Visualization

[This is a downsampled version of the dataset; the full version would take around 1 minute per plot to visualize.]

Does this dataset make sense to you? Can you guess the original image?



In [7]:

    
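# pd.plotting.scatter_matrix in recent pandas; older releases exposed it as pd.tools.plotting.scatter_matrix
pd.plotting.scatter_matrix(data.sample(n=1000), alpha=0.5, lw=0, figsize=(12, 12), diagonal='hist');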



In [8]:

    
# Let's multibin!

# functions we want to apply on the data in a single multidimensional bin:
aggregated_functions = {
    'red'   : {'elements' : len ,'average' : np.average},
    'green' : {'average' : np.average},
    'blue'  : {'average' : np.average}
    }

# the columns we want to have in output:
out_columns = ['red','green','blue']

# define the bins for y and x (one bin per pixel row / column)
group_variables = collections.OrderedDict([
                    ('y',mb.bingenerator({ 'start' : 0 ,'stop' : image.shape[0], 'n_bins' : image.shape[0]})),
                    ('x',mb.bingenerator({ 'start' : 0 ,'stop' : image.shape[1], 'n_bins' : image.shape[1]}))
                    ])
# An OrderedDict keeps the order of the binning variables fixed; a normal dict is fine too.

# this object collects all the data that define the multibinning
mbdf =  mb.MultiBinnedDataFrame(binstocolumns = True,
                                dataframe = data,
                                group_variables = group_variables,
                                aggregated_functions = aggregated_functions,
                                out_columns = out_columns)
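For readers without multibinner installed, the same kind of result (one row per bin, with an element count and an average) can be approximated with plain pandas. The sketch below is an assumption-laden stand-in, not multibinner's implementation: the small 4x5 extent, the seed and the column names `y_bin` / `x_bin` are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical small dataset: 2000 points over a 4 (y) x 5 (x) extent.
n_points, n_y, n_x = 2000, 4, 5
df = pd.DataFrame({
    "x": rng.random(n_points) * n_x,
    "y": rng.random(n_points) * n_y,
    "red": rng.random(n_points) * 255,
})

# Label every point with the index of the bin it falls in.
# right=False makes the bins half-open: [edge_i, edge_i+1).
df["y_bin"] = pd.cut(df["y"], bins=np.linspace(0, n_y, n_y + 1), labels=False, right=False)
df["x_bin"] = pd.cut(df["x"], bins=np.linspace(0, n_x, n_x + 1), labels=False, right=False)

# One row per (y_bin, x_bin) with the requested aggregates.
binned = (
    df.groupby(["y_bin", "x_bin"])["red"]
      .agg(red_elements="count", red_average="mean")
      .reset_index()
)

# Bin centres; the bins are 1 unit wide here, so centre = index + 0.5.
binned["y"] = binned["y_bin"] + 0.5
binned["x"] = binned["x_bin"] + 0.5
```

The bin-centre columns mirror the x and y columns of the binned frame summarized in the next cell, which run from 0.5 to 199.5 for 200 unit-wide bins.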



In [9]:

    
mbdf.MBDataFrame.describe().T









    Out[9]:

               count        mean        std       min        25%         50%         75%         max
blue_average   40000   34.281937  72.300991  0.000087   0.047613    0.057960   20.051544  254.095256
green_average  40000   25.773277  54.544800  0.003442   0.047643    0.057995   20.065312  254.079269
red_average    40000   47.140474  82.890276  0.005333   0.047774    0.058014   47.039703  254.082816
red_elements   40000   10.000000   3.158148  1.000000   8.000000   10.000000   12.000000   25.000000
y              40000  100.000000  57.735027  0.500000  50.250000  100.000000  149.750000  199.500000
x              40000  100.000000  57.735027  0.500000  50.250000  100.000000  149.750000  199.500000



In [10]:

    
# reconstruct the multidimensional array defined by group_variables
outstring = []

# .items() and print(...) work on both Python 2 and Python 3
for key, val in mbdf.group_variables.items():
    outstring.append('{} bins ({})'.format(val['n_bins'], key))

key = 'red_average'

print('{} array = {}'.format(key, ' x '.join(outstring)))
print('')
print(mbdf.col_df_to_array(key))









    



red_average array = 200 bins (y) x 200 bins (x)

[[ 0.05127712  0.0652712   0.04811496 ...,  0.04172704  0.05627255
   0.0473071 ]
 [ 0.04585184  0.04674475  0.04773251 ...,  0.04038853  0.04045124
   0.05153442]
 [ 0.06196471  0.05289756  0.0470766  ...,  0.05191445  0.04677261
   0.04892047]
 ..., 
 [ 0.05312529  0.04845916  0.0540913  ...,  0.05445298  0.04990296
   0.05690767]
 [ 0.04087269  0.03389485  0.05597557 ...,  0.05773319  0.05013263
   0.04475328]
 [ 0.03370772  0.03759188  0.06188008 ...,  0.03841522  0.05908078
   0.05958728]]
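Turning one column of a long-form binned table back into a 2D array can also be done with plain pandas. The sketch below shows one possible way using pivot; the 3x4 table is a made-up example, and nothing here is claimed to match what col_df_to_array does internally.

```python
import numpy as np
import pandas as pd

# Hypothetical long-form result: one row per (y, x) bin centre.
n_y, n_x = 3, 4
yy, xx = np.meshgrid(np.arange(n_y) + 0.5, np.arange(n_x) + 0.5, indexing="ij")
long_df = pd.DataFrame({
    "y": yy.ravel(),
    "x": xx.ravel(),
    "red_average": np.arange(n_y * n_x, dtype=float),
})

# Shuffle the rows to show that the result does not depend on row order.
long_df = long_df.sample(frac=1.0, random_state=0)

# pivot puts y on the rows and x on the columns, sorted by bin centre.
grid = long_df.pivot(index="y", columns="x", values="red_average").to_numpy()
print(grid.shape)  # (3, 4)
```

Row 0 of `grid` corresponds to the lowest y bin, which is why the plots in this notebook pass origin='lower' to imshow.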



In [11]:

    
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(figsize=[16,10], ncols=2, nrows=2)

cm = plt.get_cmap('jet')

key = 'red_elements'
imgplot = ax1.imshow(mbdf.col_df_to_array(key), cmap = cm, 
                     interpolation='none',origin='lower')
plt.colorbar(imgplot, orientation='vertical', ax = ax1)
ax1.set_title('elements per bin')
ax1.grid(False) 

key = 'red_average'
imgplot = ax2.imshow(mbdf.col_df_to_array(key), cmap = cm,
                     interpolation='none',origin='lower')
plt.colorbar(imgplot, orientation='vertical', ax = ax2)
ax2.set_title(key)
ax2.grid(False) 

key = 'green_average'
imgplot = ax3.imshow(mbdf.col_df_to_array(key), cmap = cm, 
                     interpolation='none',origin='lower')
plt.colorbar(imgplot, orientation='vertical', ax = ax3)
ax3.set_title(key)
ax3.grid(False) 

key = 'blue_average'
imgplot = ax4.imshow(mbdf.col_df_to_array(key), cmap = cm, 
                     interpolation='none',origin='lower')
plt.colorbar(imgplot, orientation='vertical', ax = ax4)
ax4.set_title(key)
ax4.grid(False)



In [12]:

    
rgb_image_dict = mbdf.all_df_to_array()

# stack the three per-channel 2D arrays into one (rows, columns, 3) RGB array
rgb_image = np.dstack([rgb_image_dict[name]
                       for name in ['red_average', 'green_average', 'blue_average']])



In [13]:

    
fig, (ax1,ax2) = plt.subplots(figsize=[16,10], ncols=2)
ax1.imshow(255-rgb_image,interpolation='bicubic',origin='lower')
ax1.set_title('MultiBinnedDataFrame')

ax2.imshow(image    ,interpolation='bicubic',origin='lower')
ax2.set_title('Original Image')










In the images above, the original is on the right; on the left is the result of picking 400k random points from the image, rebinning them to 200x200 on the (x, y) columns, and calculating the average in each of the resulting 40k bins.

The bins contain from 1 to 25 points (10 on average), as reported by the red_elements row in Out[9].
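The spread of points per bin is what one would expect from uniform random sampling: with 400k points over 40k bins the count in each bin is approximately Poisson with mean 10, so its standard deviation should be close to sqrt(10) ≈ 3.16, in line with the 3.158 reported for red_elements in Out[9]. The short simulation below is an illustrative sketch with an arbitrary seed, not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(2)

n_points, n_bins = 400_000, 200 * 200

# Assign each point to one of the 40k bins uniformly at random,
# then count how many points landed in each bin.
bin_index = rng.integers(0, n_bins, size=n_points)
counts = np.bincount(bin_index, minlength=n_bins)

print(counts.mean())                # 10.0 by construction (400k / 40k)
print(counts.std(), np.sqrt(10.0))  # the two values should be close
```

The minimum and maximum counts vary from run to run, so they are not asserted here.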

Thanks from me and Mario!
