The Setup

This may change, but it will get you going with the buckysoap package.

From the command line:

pip install numpy
pip install lazy
pip install filelock
git clone https://github.com/leonhardbrenner/buckysoap.git

At the top of your script include the following:


In [1]:
import sys
# Adjust this path to the src directory of your own clone
sys.path += ['/home/lbrenner/buckysoap/src']
import buckysoap as bs
from buckysoap import Atom, Element, Ring, Field

# Monkey patch Element.display to also print the row count and return the element
element_display = Element.display
def display(element, *a, **kw):
    element_display(element, *a, **kw)
    print "(%s rows)" % len(element)
    return element
Element.display = display

Meet the Atom.

First let's create a few and prove that they are Atoms.


In [2]:
print bs.zeros(10, int)
print bs.ones(10, int)
print bs.arange(10)
print type(bs.zeros(10, int))


[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
<class 'buckysoap.atom.Atom'>

Here are some typical operations:


In [3]:
print bs.ones(10, int) + bs.arange(10)
print bs.ones(10) / (bs.arange(10) + 1)


[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1.0, 0.5, 0.33333333333333331, 0.25, 0.20000000000000001, 0.16666666666666666, 0.14285714285714285, 0.125, 0.1111111111111111, 0.10000000000000001]

The previous Atoms are a lot like np.ndarray because Atom extends np.ndarray. Let's look at what functionality Atom adds. We are going to pass a list of lists in which some inner lists contain None, one is empty, and the root list itself contains None.


In [4]:
x = bs.Atom.fromlist([[0, 1, 2, None, 4], None, [], [5, 6]])
print 'x = ', x
print 'x.asarray() = ', x.asarray() # The np.ndarray portion of the data
print 'x.mask() = ',    x.mask      # Exists on all Atoms; False signifies None
print 'x.bincounts = ', x.bincounts # The bincounts on each axis of our Atom


x =  [[0, 1, 2, None, 4], None, [], [5, 6]]
x.asarray() =  [0 1 2 0 4 5 6]
x.mask() =  [ True  True  True False  True  True  True]
x.bincounts =  [Atom([5, None, 0, 2])]

In this last example we can see that an Atom is made up of:

x.asarray() - the data in the form of a flat ndarray; a None is stored as a placeholder 0
x.mask - the mask to be applied to the data (an attribute, not a call); False marks a None
x.bincounts - the number of values in each bin; bins is a common term used for aggregation. Notice that it is a list. You will see why in the next example.
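
To make these three parts concrete, here is a minimal sketch in plain Python 3 of how a nested list could be flattened into data, mask and bincounts. The flatten helper and its depth argument are hypothetical and are not part of buckysoap; Atom.fromlist does the real work on NumPy arrays.

```python
def flatten(nested, depth):
    """Flatten a nested list of the given depth into (data, mask, bincounts).

    Hypothetical helper for illustration only; not buckysoap's implementation.
    """
    levels = []
    current = nested
    for _ in range(depth - 1):
        # Record the length of every list at this level; a None stays None.
        levels.append([None if item is None else len(item) for item in current])
        # Descend one level, skipping the None entries.
        current = [child for item in current if item is not None for child in item]
    data = [0 if value is None else value for value in current]  # None is stored as 0
    mask = [value is not None for value in current]              # False marks a None
    return data, mask, levels[::-1]                              # innermost counts first


x_list = [[0, 1, 2, None, 4], None, [], [5, 6]]
data, mask, bincounts = flatten(x_list, depth=2)
print(data)       # [0, 1, 2, 0, 4, 5, 6]
print(mask)       # [True, True, True, False, True, True, True]
print(bincounts)  # [[5, None, 0, 2]]
```

The three printed lists line up with x.asarray(), x.mask and x.bincounts in the output above.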

In the next example we will look at how a list of lists of lists is represented by an Atom.


In [5]:
y = bs.Atom.fromlist([[[0, 1, 2, None, 4], None, [5, 6]], [[7, 8, 9, 4], [None, 10, 11]]])
print 'y = ', y
print 'y.asarray() = ', y.asarray()
print 'y.mask() = ',    y.mask
print 'y.bincounts = ', y.bincounts
print 'x.cardinality = ', x.cardinality
print 'y.cardinality = ', y.cardinality
print 'x.counts = ', x.counts
print 'y.counts = ', y.counts


y =  [[[0, 1, 2, None, 4], None, [5, 6]], [[7, 8, 9, 4], [None, 10, 11]]]
y.asarray() =  [ 0  1  2  0  4  5  6  7  8  9  4  0 10 11]
y.mask() =  [ True  True  True False  True  True  True  True  True  True  True False
  True  True]
y.bincounts =  [Atom([5, None, 2, 4, 3]), Atom([3, 2])]
x.cardinality =  2
y.cardinality =  3
x.counts =  [5, None, 0, 2]
y.counts =  [[5, None, 2], [4, 3]]

Let's operate on these Atoms.


In [6]:
print "x * 2 = ", x * 2
print "x + 2 = ", x + 2
print "y - 2 = ", y - 2
print "(y + 1) / 2 = ", (y + 1) / 2
print "x = ", x
print "x.sum() = ", x.sum()
print "x.sum().sum() = ", x.sum().sum()
print "y = ", y
print "y.sum() = ", y.sum()
print "y.sum().sum() = ", y.sum().sum()
print "y.sum().sum().sum() = ", y.sum().sum().sum()
print "y.average() = ", y.average()


x * 2 =  [[0, 2, 4, None, 8], None, [], [10, 12]]
x + 2 =  [[2, 3, 4, None, 6], None, [], [7, 8]]
y - 2 =  [[[-2, -1, 0, None, 2], None, [3, 4]], [[5, 6, 7, 2], [None, 8, 9]]]
(y + 1) / 2 =  [[[0.5, 1.0, 1.5, None, 2.5], None, [3.0, 3.5]], [[4.0, 4.5, 5.0, 2.5], [None, 5.5, 6.0]]]
x =  [[0, 1, 2, None, 4], None, [], [5, 6]]
x.sum() =  [7, None, 0, 11]
x.sum().sum() =  18
y =  [[[0, 1, 2, None, 4], None, [5, 6]], [[7, 8, 9, 4], [None, 10, 11]]]
y.sum() =  [[7, None, 11], [28, 21]]
y.sum().sum() =  [18, 49]
y.sum().sum().sum() =  67
y.average() =  [[1.75, None, 5.5], [7.0, 10.5]]
/home/lbrenner/buckysoap/src/buckysoap/atom.py:485: RuntimeWarning: invalid value encountered in divide
  return self.__class__(getattr(self.asarray(), name)(other.asarray()),
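
Each sum() above collapses the innermost level of bins, which is why x needs two calls and y needs three to reach a single number. Here is a minimal sketch of that reduction in plain Python 3, using prefix sums over the flat data. The name bin_sum is hypothetical; buckysoap works on NumPy arrays and its actual code may differ.

```python
from itertools import accumulate


def bin_sum(data, mask, counts):
    """Sum the unmasked values of each bin; a None bin stays None."""
    masked = [value if keep else 0 for value, keep in zip(data, mask)]
    prefix = [0] + list(accumulate(masked))  # prefix[k] = sum of the first k values
    sums, start = [], 0
    for count in counts:
        if count is None:
            sums.append(None)                # a None bin occupies no data
            continue
        sums.append(prefix[start + count] - prefix[start])
        start += count
    return sums


data = [0, 1, 2, 0, 4, 5, 6]
mask = [True, True, True, False, True, True, True]
print(bin_sum(data, mask, [5, None, 0, 2]))  # [7, None, 0, 11]
```

Applying the same function to the result, with the None bin masked out, gives the second level: bin_sum([7, 0, 0, 11], [True, False, True, True], [4]) returns [18].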

We can stack Atoms together. Note: Atoms must have the same cardinality.


In [7]:
print "y.vstack(y) = ", y.vstack(y)


y.vstack(y) =  [[[0, 1, 2, None, 4], None, [5, 6]], [[7, 8, 9, 4], [None, 10, 11]], [[0, 1, 2, None, 4], None, [5, 6]], [[7, 8, 9, 4], [None, 10, 11]]]
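
One plausible way to see why the cardinalities must match: if stacking concatenates the flat data, the mask and the bincounts at every level, then both inputs need the same number of levels. The sketch below is plain Python 3 on (data, mask, bincounts) triples; the vstack function here is a hypothetical stand-in, not the Atom method.

```python
def vstack(first, second):
    """Stack two flattened arrays given as (data, mask, bincounts) triples."""
    data_a, mask_a, bins_a = first
    data_b, mask_b, bins_b = second
    if len(bins_a) != len(bins_b):
        raise ValueError("both inputs must have the same cardinality")
    # Flat data and mask are concatenated; so are the counts at every level.
    stacked_bins = [a + b for a, b in zip(bins_a, bins_b)]
    return data_a + data_b, mask_a + mask_b, stacked_bins


x_parts = ([0, 1, 2, 0, 4, 5, 6],
           [True, True, True, False, True, True, True],
           [[5, None, 0, 2]])
data, mask, bincounts = vstack(x_parts, x_parts)
print(bincounts)  # [[5, None, 0, 2, 5, None, 0, 2]]
```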

Indexing is the big trick in Atom. Using the bincounts, I am able to index these multidimensional arrays at each level of the tree. The indexing is mostly handled by run_length.index, which uses run_length.range. Here is a link to the source code: https://github.com/leonhardbrenner/buckysoap/blob/master/src/buckysoap/run_length.py

run_length led to the development of Atom and then Element. run_length.index is used in Atom's __getitem__. Take a look: https://github.com/leonhardbrenner/buckysoap/blob/master/src/buckysoap/atom.py
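
One way to picture a run-length range: given a start offset and a length for each bin, produce the flat positions of every value in those bins. The sketch below is plain Python 3 with hypothetical names and signatures; it only illustrates the idea and is not the code in run_length.py.

```python
from itertools import accumulate


def bin_starts(counts):
    """Offset of each bin in the flat data; a None bin occupies no space."""
    sizes = [0 if count is None else count for count in counts]
    return list(accumulate([0] + sizes))[:-1]


def run_length_range(starts, lengths):
    """Concatenate range(start, start + length) for every (start, length) pair."""
    positions = []
    for start, length in zip(starts, lengths):
        positions.extend(range(start, start + length))
    return positions


counts = [1, 2, 3, 4]        # bincounts of [[0], [1, 2], [3, 4, 5], [6, 7, 8, 9]]
starts = bin_starts(counts)  # [0, 1, 3, 6]
wanted = [2, 0]              # pick bin 2, then bin 0
positions = run_length_range([starts[i] for i in wanted],
                             [counts[i] for i in wanted])
print(positions)             # [3, 4, 5, 0]
```

The resulting positions can be used to pull whole bins out of the flat data in one step, which is what lets indexing avoid walking the nested structure.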

Now let's do some indexing.


In [8]:
z = bs.arange(12) + 100
print 'z =', z
print 'y =', y
print 'z[y] =', z[y]


z = [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111]
y = [[[0, 1, 2, None, 4], None, [5, 6]], [[7, 8, 9, 4], [None, 10, 11]]]
z[y] = [[[100, 101, 102, None, 104], None, [105, 106]], [[107, 108, 109, 104], [None, 110, 111]]]

In this last example the cardinality of the index is passed on to the Atom being indexed. In this next example the Atom being indexed will have a cardinality of 2 and the index will have a cardinality of 3. The resulting Atom will have a cardinality of 4. All of this happens without looping through lists.


In [9]:
z2 = bs.arange(78)
z2.bincounts.append((bs.arange(12) + 1))
print 'z2 =', z2
print 'z2[y] =', z2[y]
print 'z2.cardinality = ', z2.cardinality
print 'y.cardinality = ', y.cardinality
print 'z2[y].cardinality = ', z2[y].cardinality


z2 = [[0], [1, 2], [3, 4, 5], [6, 7, 8, 9], [10, 11, 12, 13, 14], [15, 16, 17, 18, 19, 20], [21, 22, 23, 24, 25, 26, 27], [28, 29, 30, 31, 32, 33, 34, 35], [36, 37, 38, 39, 40, 41, 42, 43, 44], [45, 46, 47, 48, 49, 50, 51, 52, 53, 54], [55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65], [66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77]]
z2[y] = [[[[0], [1, 2], [3, 4, 5], None, [10, 11, 12, 13, 14]], None, [[15, 16, 17, 18, 19, 20], [21, 22, 23, 24, 25, 26, 27]]], [[[28, 29, 30, 31, 32, 33, 34, 35], [36, 37, 38, 39, 40, 41, 42, 43, 44], [45, 46, 47, 48, 49, 50, 51, 52, 53, 54], [10, 11, 12, 13, 14]], [None, [55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65], [66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77]]]]
z2.cardinality =  2
y.cardinality =  3
z2[y].cardinality =  4
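
The effect of indexing a nested Atom can be sketched on flat data plus bincounts. The take_bins helper below is hypothetical, written in plain Python 3 with loops for readability, and assumes the source bincounts contain no None; buckysoap itself works on NumPy arrays.

```python
def take_bins(data, counts, index):
    """Select whole bins of a ragged array by position; None in the index stays None."""
    starts, offset = [], 0
    for count in counts:
        starts.append(offset)  # where each bin begins in the flat data
        offset += count
    out_data, out_counts = [], []
    for i in index:
        if i is None:
            out_counts.append(None)  # a masked index yields a masked bin
            continue
        out_data.extend(data[starts[i]:starts[i] + counts[i]])
        out_counts.append(counts[i])
    return out_data, out_counts


data = list(range(10))
counts = [1, 2, 3, 4]  # [[0], [1, 2], [3, 4, 5], [6, 7, 8, 9]]
index = [0, 1, None, 3]
print(take_bins(data, counts, index))
# ([0, 1, 2, 6, 7, 8, 9], [1, 2, None, 4])
```

The new innermost bincounts come from the source, one entry per index value. If the index's own bincounts are kept as the outer levels, the cardinalities combine as 2 + 3 - 1 = 4, which is consistent with the z2[y] output above.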

Combining Atoms to form an Element

An Element can combine multiple Atoms and can be combined with other Atoms or Elements. In this first example we will construct an Element and then assign three columns: range, zero and one.


In [10]:
e1 = Element(name='Sample')(
    range = bs.arange(10),
    zero = bs.zeros(10, int),
    one = bs.ones(10, int))
e1.display()


zero range one
   0     0   1
   0     1   1
   0     2   1
   0     3   1
   0     4   1
   0     5   1
   0     6   1
   0     7   1
   0     8   1
   0     9   1
(10 rows)
Out[10]:
<buckysoap.element.Element at 0x7fc90b441ed0>

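Now we will use currying to add two more Atoms, group_id1 and group_id2. Calling an Element with keyword arguments returns an Element with the extra columns; each lambda receives the Element and computes its column from the existing ones.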


In [11]:
e2 = e1(
    group_id1 = lambda x: (x.range % 2) * 2,
    group_id2 = lambda x: (x.range % 3) * 2
)
e2.display()


zero range one group_id1 group_id2
   0     0   1         0         0
   0     1   1         2         2
   0     2   1         0         4
   0     3   1         2         0
   0     4   1         0         2
   0     5   1         2         4
   0     6   1         0         0
   0     7   1         2         2
   0     8   1         0         4
   0     9   1         2         0
(10 rows)
Out[11]:
<buckysoap.element.Element at 0x7fc90b452110>

Now for group operations. In the Atom indexing example we showed how an index imposes its cardinality on the Atom being indexed. This makes it very easy to build a group operation. The implementation is only a few lines, so take a look. The group index consists of the sort_index over the two group columns and the bincounts, which are calculated by comparing x[:-1] and x[1:]. The index is applied in __getattr__ when Element needs to delegate to its source. This is more than you need to know.


In [12]:
e3 = e2.group('group_id1,group_id2')
e3.display()
print "e3.__index__ = ", e3.__index__
print "e3.__source__.display():"
e3.__source__.display()


Sample[0]
    zero=[0, 0]
    range=[0, 6]
    one=[1, 1]
    group_id1=0
    group_id2=0
Sample[1]
    zero=[0]
    range=[4]
    one=[1]
    group_id1=0
    group_id2=2
Sample[2]
    zero=[0, 0]
    range=[2, 8]
    one=[1, 1]
    group_id1=0
    group_id2=4
Sample[3]
    zero=[0, 0]
    range=[3, 9]
    one=[1, 1]
    group_id1=2
    group_id2=0
Sample[4]
    zero=[0, 0]
    range=[1, 7]
    one=[1, 1]
    group_id1=2
    group_id2=2
Sample[5]
    zero=[0]
    range=[5]
    one=[1]
    group_id1=2
    group_id2=4
(6 rows)
e3.__index__ =  [[0, 6], [4], [2, 8], [3, 9], [1, 7], [5]]
e3.__source__.display():
zero range one group_id1 group_id2
   0     0   1         0         0
   0     1   1         2         2
   0     2   1         0         4
   0     3   1         2         0
   0     4   1         0         2
   0     5   1         2         4
   0     6   1         0         0
   0     7   1         2         2
   0     8   1         0         4
   0     9   1         2         0
(10 rows)
Out[12]:
<buckysoap.element.Element at 0x7fc90b452110>
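
The group index described above (a sort index plus bincounts found by comparing each sorted key with its neighbour) can be sketched in plain Python 3. The group_index function is a hypothetical illustration, not the Element.group implementation, which does the comparison on NumPy arrays.

```python
def group_index(*keys):
    """Return (sort_index, bincounts) for grouping rows by one or more key columns."""
    rows = list(zip(*keys))
    # sorted() is stable, so rows with equal keys keep their original order.
    sort_index = sorted(range(len(rows)), key=rows.__getitem__)
    ordered = [rows[i] for i in sort_index]
    bincounts = []
    for previous, current in zip([None] + ordered[:-1], ordered):
        if bincounts and previous == current:
            bincounts[-1] += 1   # same key as the row before: extend the bin
        else:
            bincounts.append(1)  # key changed: start a new bin
    return sort_index, bincounts


values = list(range(10))
group_id1 = [(v % 2) * 2 for v in values]
group_id2 = [(v % 3) * 2 for v in values]
sort_index, bincounts = group_index(group_id1, group_id2)
print(sort_index)  # [0, 6, 4, 2, 8, 3, 9, 1, 7, 5]
print(bincounts)   # [2, 1, 2, 2, 2, 1]
```

Splitting sort_index by bincounts gives [[0, 6], [4], [2, 8], [3, 9], [1, 7], [5]], the same nesting that e3.__index__ prints above.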

Now let's try sum.


In [13]:
e4 = (
    e3(range_sum = lambda x: x.range.sum(),
       zero_sum = lambda x: x.zero.sum(),
       one_sum = lambda x: x.one.sum())
    ('group_id1,group_id2,range_sum,zero_sum,one_sum,range,zero,one'))
e4.display()


Sample[0]
    group_id1=0
    group_id2=0
    range_sum=6
    zero_sum=0
    one_sum=2
    range=[0, 6]
    zero=[0, 0]
    one=[1, 1]
Sample[1]
    group_id1=0
    group_id2=2
    range_sum=4
    zero_sum=0
    one_sum=1
    range=[4]
    zero=[0]
    one=[1]
Sample[2]
    group_id1=0
    group_id2=4
    range_sum=10
    zero_sum=0
    one_sum=2
    range=[2, 8]
    zero=[0, 0]
    one=[1, 1]
Sample[3]
    group_id1=2
    group_id2=0
    range_sum=12
    zero_sum=0
    one_sum=2
    range=[3, 9]
    zero=[0, 0]
    one=[1, 1]
Sample[4]
    group_id1=2
    group_id2=2
    range_sum=8
    zero_sum=0
    one_sum=2
    range=[1, 7]
    zero=[0, 0]
    one=[1, 1]
Sample[5]
    group_id1=2
    group_id2=4
    range_sum=5
    zero_sum=0
    one_sum=1
    range=[5]
    zero=[0]
    one=[1]
(6 rows)
Out[13]:
<buckysoap.element.Element at 0x7fc90b45bad0>

In [14]:
e4.display()
e3.display()
e2.display()
e1.display()


Sample[0]
    group_id1=0
    group_id2=0
    range_sum=6
    zero_sum=0
    one_sum=2
    range=[0, 6]
    zero=[0, 0]
    one=[1, 1]
Sample[1]
    group_id1=0
    group_id2=2
    range_sum=4
    zero_sum=0
    one_sum=1
    range=[4]
    zero=[0]
    one=[1]
Sample[2]
    group_id1=0
    group_id2=4
    range_sum=10
    zero_sum=0
    one_sum=2
    range=[2, 8]
    zero=[0, 0]
    one=[1, 1]
Sample[3]
    group_id1=2
    group_id2=0
    range_sum=12
    zero_sum=0
    one_sum=2
    range=[3, 9]
    zero=[0, 0]
    one=[1, 1]
Sample[4]
    group_id1=2
    group_id2=2
    range_sum=8
    zero_sum=0
    one_sum=2
    range=[1, 7]
    zero=[0, 0]
    one=[1, 1]
Sample[5]
    group_id1=2
    group_id2=4
    range_sum=5
    zero_sum=0
    one_sum=1
    range=[5]
    zero=[0]
    one=[1]
(6 rows)
Sample[0]
    zero=[0, 0]
    range=[0, 6]
    one=[1, 1]
    group_id1=0
    group_id2=0
Sample[1]
    zero=[0]
    range=[4]
    one=[1]
    group_id1=0
    group_id2=2
Sample[2]
    zero=[0, 0]
    range=[2, 8]
    one=[1, 1]
    group_id1=0
    group_id2=4
Sample[3]
    zero=[0, 0]
    range=[3, 9]
    one=[1, 1]
    group_id1=2
    group_id2=0
Sample[4]
    zero=[0, 0]
    range=[1, 7]
    one=[1, 1]
    group_id1=2
    group_id2=2
Sample[5]
    zero=[0]
    range=[5]
    one=[1]
    group_id1=2
    group_id2=4
(6 rows)
zero range one group_id1 group_id2
   0     0   1         0         0
   0     1   1         2         2
   0     2   1         0         4
   0     3   1         2         0
   0     4   1         0         2
   0     5   1         2         4
   0     6   1         0         0
   0     7   1         2         2
   0     8   1         0         4
   0     9   1         2         0
(10 rows)
zero range one
   0     0   1
   0     1   1
   0     2   1
   0     3   1
   0     4   1
   0     5   1
   0     6   1
   0     7   1
   0     8   1
   0     9   1
(10 rows)
Out[14]:
<buckysoap.element.Element at 0x7fc90b441ed0>

Now let's look at the expand method, which unrolls list-valued columns into one row per value. The examples should make this clear.


In [15]:
e4.expand('range').display()


Sample[0]
    group_id1=0
    group_id2=0
    range_sum=6
    zero_sum=0
    one_sum=2
    range=0
    zero=[0, 0]
    one=[1, 1]
Sample[1]
    group_id1=0
    group_id2=0
    range_sum=6
    zero_sum=0
    one_sum=2
    range=6
    zero=[0, 0]
    one=[1, 1]
Sample[2]
    group_id1=0
    group_id2=2
    range_sum=4
    zero_sum=0
    one_sum=1
    range=4
    zero=[0]
    one=[1]
Sample[3]
    group_id1=0
    group_id2=4
    range_sum=10
    zero_sum=0
    one_sum=2
    range=2
    zero=[0, 0]
    one=[1, 1]
Sample[4]
    group_id1=0
    group_id2=4
    range_sum=10
    zero_sum=0
    one_sum=2
    range=8
    zero=[0, 0]
    one=[1, 1]
Sample[5]
    group_id1=2
    group_id2=0
    range_sum=12
    zero_sum=0
    one_sum=2
    range=3
    zero=[0, 0]
    one=[1, 1]
Sample[6]
    group_id1=2
    group_id2=0
    range_sum=12
    zero_sum=0
    one_sum=2
    range=9
    zero=[0, 0]
    one=[1, 1]
Sample[7]
    group_id1=2
    group_id2=2
    range_sum=8
    zero_sum=0
    one_sum=2
    range=1
    zero=[0, 0]
    one=[1, 1]
Sample[8]
    group_id1=2
    group_id2=2
    range_sum=8
    zero_sum=0
    one_sum=2
    range=7
    zero=[0, 0]
    one=[1, 1]
Sample[9]
    group_id1=2
    group_id2=4
    range_sum=5
    zero_sum=0
    one_sum=1
    range=5
    zero=[0]
    one=[1]
(10 rows)
Out[15]:
<buckysoap.element.Element at 0x7fc90b4521d0>

In [16]:
e4.expand('range,zero,one').display()


group_id1 group_id2 range_sum zero_sum one_sum range zero one
        0         0         6        0       2     0    0   1
        0         0         6        0       2     6    0   1
        0         2         4        0       1     4    0   1
        0         4        10        0       2     2    0   1
        0         4        10        0       2     8    0   1
        2         0        12        0       2     3    0   1
        2         0        12        0       2     9    0   1
        2         2         8        0       2     1    0   1
        2         2         8        0       2     7    0   1
        2         4         5        0       1     5    0   1
(10 rows)
Out[16]:
<buckysoap.element.Element at 0x7fc90b4525d0>
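
The behaviour of expand in the two outputs above can be mimicked on plain dictionaries. This is a hypothetical sketch in plain Python 3, not the Element.expand implementation; it assumes every expanded column in a row has the same length.

```python
def expand(rows, names):
    """Unroll the list-valued columns in `names` into one output row per element."""
    out = []
    for row in rows:
        length = len(row[names[0]])  # expanded columns are assumed to share a length
        for k in range(length):
            new_row = dict(row)      # other columns are repeated unchanged
            for name in names:
                new_row[name] = row[name][k]
            out.append(new_row)
    return out


grouped = [
    {"group_id2": 0, "range_sum": 6, "range": [0, 6], "one": [1, 1]},
    {"group_id2": 2, "range_sum": 4, "range": [4], "one": [1]},
]
for row in expand(grouped, ["range"]):
    print(row)
```

Expanding only range leaves one as a list in every row, as in Out[15]; expanding both columns yields fully flat rows, as in Out[16].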

Now let's introduce the join, which is also implemented using nothing but NumPy. Take a look at the code:

https://github.com/leonhardbrenner/buckysoap/blob/master/src/buckysoap/join.py

Element uses this package's join method. Let's start with the setup. I am creating Element(groups), which I will join to Element(e4).


In [17]:
groups = (
    bs.Element(name='Groups')
    (group_id = bs.arange(6))
    (name = lambda x: ['group_%d' % y for y in x.group_id[::-1]]))
groups.display()
e4.display()


group_id    name
       0 group_5
       1 group_4
       2 group_3
       3 group_2
       4 group_1
       5 group_0
(6 rows)
Sample[0]
    group_id1=0
    group_id2=0
    range_sum=6
    zero_sum=0
    one_sum=2
    range=[0, 6]
    zero=[0, 0]
    one=[1, 1]
Sample[1]
    group_id1=0
    group_id2=2
    range_sum=4
    zero_sum=0
    one_sum=1
    range=[4]
    zero=[0]
    one=[1]
Sample[2]
    group_id1=0
    group_id2=4
    range_sum=10
    zero_sum=0
    one_sum=2
    range=[2, 8]
    zero=[0, 0]
    one=[1, 1]
Sample[3]
    group_id1=2
    group_id2=0
    range_sum=12
    zero_sum=0
    one_sum=2
    range=[3, 9]
    zero=[0, 0]
    one=[1, 1]
Sample[4]
    group_id1=2
    group_id2=2
    range_sum=8
    zero_sum=0
    one_sum=2
    range=[1, 7]
    zero=[0, 0]
    one=[1, 1]
Sample[5]
    group_id1=2
    group_id2=4
    range_sum=5
    zero_sum=0
    one_sum=1
    range=[5]
    zero=[0]
    one=[1]
(6 rows)
Out[17]:
<buckysoap.element.Element at 0x7fc90b45bad0>

Now for the join:


In [18]:
join = e4.inner(groups, group_id='group_id2')
join.display()


Sample[0]
    group_id1=0
    group_id2=0
    range_sum=6
    zero_sum=0
    one_sum=2
    range=[0, 6]
    zero=[0, 0]
    one=[1, 1]
    Groups
        group_id=0
        name=group_5
Sample[1]
    group_id1=2
    group_id2=0
    range_sum=12
    zero_sum=0
    one_sum=2
    range=[3, 9]
    zero=[0, 0]
    one=[1, 1]
    Groups
        group_id=0
        name=group_5
Sample[2]
    group_id1=0
    group_id2=2
    range_sum=4
    zero_sum=0
    one_sum=1
    range=[4]
    zero=[0]
    one=[1]
    Groups
        group_id=2
        name=group_3
Sample[3]
    group_id1=2
    group_id2=2
    range_sum=8
    zero_sum=0
    one_sum=2
    range=[1, 7]
    zero=[0, 0]
    one=[1, 1]
    Groups
        group_id=2
        name=group_3
Sample[4]
    group_id1=0
    group_id2=4
    range_sum=10
    zero_sum=0
    one_sum=2
    range=[2, 8]
    zero=[0, 0]
    one=[1, 1]
    Groups
        group_id=4
        name=group_1
Sample[5]
    group_id1=2
    group_id2=4
    range_sum=5
    zero_sum=0
    one_sum=1
    range=[5]
    zero=[0]
    one=[1]
    Groups
        group_id=4
        name=group_1
(6 rows)
Out[18]:
<buckysoap.element.Element at 0x7fc90b45bd90>
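
An inner join comes down to finding the pairs of rows whose keys match. The sketch below does this with a dictionary lookup in plain Python 3; that is a swapped-in technique chosen for brevity, not the NumPy-only approach in join.py, and inner_join_pairs is a hypothetical name.

```python
from collections import defaultdict


def inner_join_pairs(left_keys, right_keys):
    """Return (left_row, right_row) pairs whose keys match, ordered by key."""
    right_rows = defaultdict(list)
    for j, key in enumerate(right_keys):
        right_rows[key].append(j)
    pairs = []
    # Visit the left rows in key order; the stable sort keeps ties in row order.
    for i in sorted(range(len(left_keys)), key=left_keys.__getitem__):
        for j in right_rows.get(left_keys[i], []):
            pairs.append((i, j))
    return pairs


group_id2 = [0, 2, 4, 0, 2, 4]  # the group_id2 column of e4
group_id = [0, 1, 2, 3, 4, 5]   # the group_id column of groups
print(inner_join_pairs(group_id2, group_id))
# [(0, 0), (3, 0), (1, 2), (4, 2), (2, 4), (5, 4)]
```

The left row numbers 0, 3, 1, 4, 2, 5 match the order of the Sample rows in the join output above. Groups with ids 1, 3 and 5 have no match and are dropped.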

We can sort these new elements and we can stack them together:


In [19]:
join = join.sort_by('Groups.name').display()


Sample[0]
    group_id1=0
    group_id2=4
    range_sum=10
    zero_sum=0
    one_sum=2
    range=[2, 8]
    zero=[0, 0]
    one=[1, 1]
    Groups
        group_id=4
        name=group_1
Sample[1]
    group_id1=2
    group_id2=4
    range_sum=5
    zero_sum=0
    one_sum=1
    range=[5]
    zero=[0]
    one=[1]
    Groups
        group_id=4
        name=group_1
Sample[2]
    group_id1=0
    group_id2=2
    range_sum=4
    zero_sum=0
    one_sum=1
    range=[4]
    zero=[0]
    one=[1]
    Groups
        group_id=2
        name=group_3
Sample[3]
    group_id1=2
    group_id2=2
    range_sum=8
    zero_sum=0
    one_sum=2
    range=[1, 7]
    zero=[0, 0]
    one=[1, 1]
    Groups
        group_id=2
        name=group_3
Sample[4]
    group_id1=0
    group_id2=0
    range_sum=6
    zero_sum=0
    one_sum=2
    range=[0, 6]
    zero=[0, 0]
    one=[1, 1]
    Groups
        group_id=0
        name=group_5
Sample[5]
    group_id1=2
    group_id2=0
    range_sum=12
    zero_sum=0
    one_sum=2
    range=[3, 9]
    zero=[0, 0]
    one=[1, 1]
    Groups
        group_id=0
        name=group_5
(6 rows)

In [20]:
join.vstack(join).display()


Sample[0]
    group_id1=0
    group_id2=4
    range_sum=10
    zero_sum=0
    one_sum=2
    range=[2, 8]
    zero=[0, 0]
    one=[1, 1]
    Groups
        group_id=4
        name=group_1
Sample[1]
    group_id1=2
    group_id2=4
    range_sum=5
    zero_sum=0
    one_sum=1
    range=[5]
    zero=[0]
    one=[1]
    Groups
        group_id=4
        name=group_1
Sample[2]
    group_id1=0
    group_id2=2
    range_sum=4
    zero_sum=0
    one_sum=1
    range=[4]
    zero=[0]
    one=[1]
    Groups
        group_id=2
        name=group_3
Sample[3]
    group_id1=2
    group_id2=2
    range_sum=8
    zero_sum=0
    one_sum=2
    range=[1, 7]
    zero=[0, 0]
    one=[1, 1]
    Groups
        group_id=2
        name=group_3
Sample[4]
    group_id1=0
    group_id2=0
    range_sum=6
    zero_sum=0
    one_sum=2
    range=[0, 6]
    zero=[0, 0]
    one=[1, 1]
    Groups
        group_id=0
        name=group_5
Sample[5]
    group_id1=2
    group_id2=0
    range_sum=12
    zero_sum=0
    one_sum=2
    range=[3, 9]
    zero=[0, 0]
    one=[1, 1]
    Groups
        group_id=0
        name=group_5
Sample[6]
    group_id1=0
    group_id2=4
    range_sum=10
    zero_sum=0
    one_sum=2
    range=[2, 8]
    zero=[0, 0]
    one=[1, 1]
    Groups
        group_id=4
        name=group_1
Sample[7]
    group_id1=2
    group_id2=4
    range_sum=5
    zero_sum=0
    one_sum=1
    range=[5]
    zero=[0]
    one=[1]
    Groups
        group_id=4
        name=group_1
Sample[8]
    group_id1=0
    group_id2=2
    range_sum=4
    zero_sum=0
    one_sum=1
    range=[4]
    zero=[0]
    one=[1]
    Groups
        group_id=2
        name=group_3
Sample[9]
    group_id1=2
    group_id2=2
    range_sum=8
    zero_sum=0
    one_sum=2
    range=[1, 7]
    zero=[0, 0]
    one=[1, 1]
    Groups
        group_id=2
        name=group_3
Sample[10]
    group_id1=0
    group_id2=0
    range_sum=6
    zero_sum=0
    one_sum=2
    range=[0, 6]
    zero=[0, 0]
    one=[1, 1]
    Groups
        group_id=0
        name=group_5
Sample[11]
    group_id1=2
    group_id2=0
    range_sum=12
    zero_sum=0
    one_sum=2
    range=[3, 9]
    zero=[0, 0]
    one=[1, 1]
    Groups
        group_id=0
        name=group_5
(12 rows)
Out[20]:
<buckysoap.element.Element at 0x7fc90b452190>

Notice the hierarchy looks kind of like XML or JSON. Check this out:


In [21]:
print join.toxml()


<?xml version="1.0" ?>
<root>
    <Sample group_id1="0" group_id2="4" one="1 1" one_sum="2" range="2 8" range_sum="10" zero="0 0" zero_sum="0">
        <Groups group_id="4" name="group_1"/>
    </Sample>
    <Sample group_id1="2" group_id2="4" one="1" one_sum="1" range="5" range_sum="5" zero="0" zero_sum="0">
        <Groups group_id="4" name="group_1"/>
    </Sample>
    <Sample group_id1="0" group_id2="2" one="1" one_sum="1" range="4" range_sum="4" zero="0" zero_sum="0">
        <Groups group_id="2" name="group_3"/>
    </Sample>
    <Sample group_id1="2" group_id2="2" one="1 1" one_sum="2" range="1 7" range_sum="8" zero="0 0" zero_sum="0">
        <Groups group_id="2" name="group_3"/>
    </Sample>
    <Sample group_id1="0" group_id2="0" one="1 1" one_sum="2" range="0 6" range_sum="6" zero="0 0" zero_sum="0">
        <Groups group_id="0" name="group_5"/>
    </Sample>
    <Sample group_id1="2" group_id2="0" one="1 1" one_sum="2" range="3 9" range_sum="12" zero="0 0" zero_sum="0">
        <Groups group_id="0" name="group_5"/>
    </Sample>
</root>


In [22]:
print join.toxml(use_attributes=False)


<?xml version="1.0" ?>
<root>
    <Sample>
        <group_id1>0</group_id1>
        <group_id2>4</group_id2>
        <range_sum>10</range_sum>
        <zero_sum>0</zero_sum>
        <one_sum>2</one_sum>
        <range>2 8</range>
        <zero>0 0</zero>
        <one>1 1</one>
        <Groups>
            <group_id>4</group_id>
            <name>group_1</name>
        </Groups>
    </Sample>
    <Sample>
        <group_id1>2</group_id1>
        <group_id2>4</group_id2>
        <range_sum>5</range_sum>
        <zero_sum>0</zero_sum>
        <one_sum>1</one_sum>
        <range>5</range>
        <zero>0</zero>
        <one>1</one>
        <Groups>
            <group_id>4</group_id>
            <name>group_1</name>
        </Groups>
    </Sample>
    <Sample>
        <group_id1>0</group_id1>
        <group_id2>2</group_id2>
        <range_sum>4</range_sum>
        <zero_sum>0</zero_sum>
        <one_sum>1</one_sum>
        <range>4</range>
        <zero>0</zero>
        <one>1</one>
        <Groups>
            <group_id>2</group_id>
            <name>group_3</name>
        </Groups>
    </Sample>
    <Sample>
        <group_id1>2</group_id1>
        <group_id2>2</group_id2>
        <range_sum>8</range_sum>
        <zero_sum>0</zero_sum>
        <one_sum>2</one_sum>
        <range>1 7</range>
        <zero>0 0</zero>
        <one>1 1</one>
        <Groups>
            <group_id>2</group_id>
            <name>group_3</name>
        </Groups>
    </Sample>
    <Sample>
        <group_id1>0</group_id1>
        <group_id2>0</group_id2>
        <range_sum>6</range_sum>
        <zero_sum>0</zero_sum>
        <one_sum>2</one_sum>
        <range>0 6</range>
        <zero>0 0</zero>
        <one>1 1</one>
        <Groups>
            <group_id>0</group_id>
            <name>group_5</name>
        </Groups>
    </Sample>
    <Sample>
        <group_id1>2</group_id1>
        <group_id2>0</group_id2>
        <range_sum>12</range_sum>
        <zero_sum>0</zero_sum>
        <one_sum>2</one_sum>
        <range>3 9</range>
        <zero>0 0</zero>
        <one>1 1</one>
        <Groups>
            <group_id>0</group_id>
            <name>group_5</name>
        </Groups>
    </Sample>
</root>
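
The two XML layouts above differ only in where each value goes: into an attribute or into a child element. Here is a rough sketch of that choice using the standard library's xml.etree.ElementTree in Python 3. The record_to_xml helper is hypothetical and is not how toxml is implemented; it only mirrors the use_attributes switch.

```python
import xml.etree.ElementTree as ET


def record_to_xml(tag, record, use_attributes=True):
    """Build an XML element from a dict; nested dicts become child elements."""
    node = ET.Element(tag)
    for key, value in record.items():
        if isinstance(value, dict):
            node.append(record_to_xml(key, value, use_attributes))
            continue
        # Lists are written as space-separated values, as in the output above.
        if isinstance(value, list):
            text = " ".join(str(item) for item in value)
        else:
            text = str(value)
        if use_attributes:
            node.set(key, text)
        else:
            ET.SubElement(node, key).text = text
    return node


record = {"group_id2": 4, "range": [2, 8],
          "Groups": {"group_id": 4, "name": "group_1"}}
print(ET.tostring(record_to_xml("Sample", record), encoding="unicode"))
print(ET.tostring(record_to_xml("Sample", record, use_attributes=False),
                  encoding="unicode"))
```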


In [23]:
print join.tojson()


{
    "Sample": [
        {
            "group_id1": "0",
            "group_id2": "4",
            "range_sum": "10",
            "zero_sum": "0",
            "one_sum": "2",
            "range": [
                2,
                8
            ],
            "zero": [
                0,
                0
            ],
            "one": [
                1,
                1
            ],
            "Groups": {
                "group_id": "4",
                "name": "group_1"
            }
        },
        {
            "group_id1": "2",
            "group_id2": "4",
            "range_sum": "5",
            "zero_sum": "0",
            "one_sum": "1",
            "range": [
                5
            ],
            "zero": [
                0
            ],
            "one": [
                1
            ],
            "Groups": {
                "group_id": "4",
                "name": "group_1"
            }
        },
        {
            "group_id1": "0",
            "group_id2": "2",
            "range_sum": "4",
            "zero_sum": "0",
            "one_sum": "1",
            "range": [
                4
            ],
            "zero": [
                0
            ],
            "one": [
                1
            ],
            "Groups": {
                "group_id": "2",
                "name": "group_3"
            }
        },
        {
            "group_id1": "2",
            "group_id2": "2",
            "range_sum": "8",
            "zero_sum": "0",
            "one_sum": "2",
            "range": [
                1,
                7
            ],
            "zero": [
                0,
                0
            ],
            "one": [
                1,
                1
            ],
            "Groups": {
                "group_id": "2",
                "name": "group_3"
            }
        },
        {
            "group_id1": "0",
            "group_id2": "0",
            "range_sum": "6",
            "zero_sum": "0",
            "one_sum": "2",
            "range": [
                0,
                6
            ],
            "zero": [
                0,
                0
            ],
            "one": [
                1,
                1
            ],
            "Groups": {
                "group_id": "0",
                "name": "group_5"
            }
        },
        {
            "group_id1": "2",
            "group_id2": "0",
            "range_sum": "12",
            "zero_sum": "0",
            "one_sum": "2",
            "range": [
                3,
                9
            ],
            "zero": [
                0,
                0
            ],
            "one": [
                1,
                1
            ],
            "Groups": {
                "group_id": "0",
                "name": "group_5"
            }
        }
    ]
}

Now let's do some caching. For this I am relying on numpy.savez and numpy.savez_compressed. I have yet to find a faster way of storing a billion values. Here is the documentation:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.savez.html

I use it in Atom.load and Atom.persist.

I splay columns into individual npz files:

lbrenner@josie:~/buckysoap$ ls cache/sample
one.npz  range.npz  zero.npz

Each Atom's npz file contains the data, the mask and any necessary bincounts. The listing below is for a flat column, so it holds only data.npy and mask.npy:

 Filemode      Length  Date         Time      File
- ----------  --------  -----------  --------  --------
  -rw-------       160   4-Jun-2015  12:23:08  data.npy
  -rw-------        90   4-Jun-2015  12:23:08  mask.npy
- ----------  --------  -----------  --------  --------
               250                         2 files

In [24]:
def cache_sample():
    def factory():
        print "Running Factory"
        return bs.Element(name='Sample')(
            range = bs.arange(10),
            zero = bs.zeros(10, int),
            one = bs.ones(10, int))
    return bs.Element(
        name='sample',
        cnames='range,zero,one',
        cachedir='cache/sample',
        factory=factory)

# Delete the cache directory so the demonstration starts from an empty cache
import os
if os.path.exists('cache/sample'):
    print "Deleting cache/sample"
    import shutil
    shutil.rmtree('cache/sample')
cache_sample().display()
cache_sample().display()


Deleting cache/sample
Running Factory
cache/sample = 10
range zero one
    0    0   1
    1    0   1
    2    0   1
    3    0   1
    4    0   1
    5    0   1
    6    0   1
    7    0   1
    8    0   1
    9    0   1
(10 rows)
range zero one
    0    0   1
    1    0   1
    2    0   1
    3    0   1
    4    0   1
    5    0   1
    6    0   1
    7    0   1
    8    0   1
    9    0   1
(10 rows)
Running Factory
cache/sample = 10
Out[24]:
<buckysoap.element.Element at 0x7fc8e9fc0dd0>
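
The general load-or-build pattern behind a cachedir and a factory can be sketched as follows. This is a hypothetical illustration in Python 3, not buckysoap's code: it stores each column as a JSON file instead of using numpy.savez, so that it only needs the standard library, and cached_columns is an invented name.

```python
import json
import os
import tempfile


def cached_columns(cachedir, cnames, factory):
    """Load one file per column from cachedir, or call factory() and persist the result."""
    paths = {name: os.path.join(cachedir, name + ".json") for name in cnames}
    if all(os.path.exists(path) for path in paths.values()):
        columns = {}
        for name, path in paths.items():
            with open(path) as handle:
                columns[name] = json.load(handle)
        return columns
    columns = factory()  # cache miss: build the columns
    os.makedirs(cachedir, exist_ok=True)
    for name, path in paths.items():
        with open(path, "w") as handle:
            json.dump(columns[name], handle)
    return columns


calls = []

def factory():
    calls.append("run")  # record every time the factory is invoked
    return {"range": list(range(10)), "zero": [0] * 10, "one": [1] * 10}


with tempfile.TemporaryDirectory() as tmp:
    cachedir = os.path.join(tmp, "cache", "sample")
    first = cached_columns(cachedir, ["range", "zero", "one"], factory)
    second = cached_columns(cachedir, ["range", "zero", "one"], factory)

print(len(calls))  # 1
```

In this sketch the second call finds all three column files and never touches the factory.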

If you want to map to and from pandas, use element.to_pandas() and Element.from_pandas(). I will provide examples, but my current instance does not have pandas running. I would recommend using a distribution from Enthought or Continuum.