This may change, but it will get you going with the buckysoap package.
From the command line:
pip install numpy
pip install lazy
pip install filelock
git clone https://github.com/leonhardbrenner/buckysoap.git
In [1]:
import sys
# Point this at the src directory of your own clone.
sys.path += ['/home/lbrenner/buckysoap/src']
import buckysoap as bs
from buckysoap import Atom, Element, Ring, Field
# Monkey patch Element.display to also print the row count and
# return the element, so that display() calls can be chained.
element_display = Element.display
def display(element, *a, **kw):
    element_display(element, *a, **kw)
    print "(%s rows)" % len(element)
    return element
Element.display = display
First, let's create a few arrays and show that they are Atoms.
In [2]:
print bs.zeros(10, int)
print bs.ones(10, int)
print bs.arange(10)
print type(bs.zeros(10, int))
Here are some typical operations:
In [3]:
print bs.ones(10, int) + bs.arange(10)
print bs.ones(10) / (bs.arange(10) + 1)
The previous Atoms are a lot like np.ndarray because Atom simply extends np.ndarray. Let's look at what functionality Atom adds to np.ndarray. We are going to pass a list of lists in which some of the inner lists contain None, some are empty, and the root list itself may contain None.
In [4]:
x = bs.Atom.fromlist([[0, 1, 2, None, 4], None, [], [5, 6]])
print 'x = ', x
print 'x.asarray() = ', x.asarray() #This is the np.ndarray portion of the data
print 'x.mask = ', x.mask #This exists on all Atoms to signify None
print 'x.bincounts = ', x.bincounts #These are the bincounts on each axis of our Atom
In this last example we can see that an Atom is made up of:
x.asarray() - the data in the form of an ndarray
x.mask - the mask to be applied to the data
x.bincounts - the number of items in each bin ("bin" is a common term in aggregation). Notice that it is a list. You will see why in the next example.
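To make the three parts listed above more concrete, here is a minimal sketch in plain NumPy that flattens a list of lists into a data array, a mask and one array of bin counts. It is only an illustration: the name flatten_ragged, the fill value of 0 for masked entries, and the choice to treat a None row like an empty row are all assumptions of this sketch, and buckysoap's actual storage layout may differ.

```python
import numpy as np

def flatten_ragged(rows, fill=0):
    """Flatten a list of lists into (data, mask, bincounts).

    Hypothetical helper for illustration only. A None row is treated
    like an empty row here; masked values are stored as `fill`.
    """
    # One count per row: how many values that row contributes.
    bincounts = np.array([len(r) if r is not None else 0 for r in rows])
    # All values of all rows, in order, with None still in place.
    flat = [v for r in rows if r is not None for v in r]
    # True wherever the original value was None.
    mask = np.array([v is None for v in flat], dtype=bool)
    data = np.array([fill if v is None else v for v in flat])
    return data, mask, bincounts

data, mask, bincounts = flatten_ragged([[0, 1, 2, None, 4], None, [], [5, 6]])
print(data)
print(mask)
print(bincounts)
```

The point of this layout is that the values live in one contiguous array, so arithmetic can stay vectorized while the bin counts remember where each row starts and ends.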
In this example we will look at how a list of lists of lists is represented by an Atom.
In [5]:
y = bs.Atom.fromlist([[[0, 1, 2, None, 4], None, [5, 6]], [[7, 8, 9, 4], [None, 10, 11]]])
print 'y = ', y
print 'y.asarray() = ', y.asarray()
print 'y.mask = ', y.mask
print 'y.bincounts = ', y.bincounts
print 'x.cardinality = ', x.cardinality
print 'y.cardinality = ', y.cardinality
print 'x.counts = ', x.counts
print 'y.counts = ', y.counts
Let's operate on these Atoms.
In [6]:
print "x * 2 = ", x * 2
print "x + 2 = ", x + 2
print "y - 2 = ", y - 2
print "(y + 1) / 2 = ", (y + 1) / 2
print "x = ", x
print "x.sum() = ", x.sum()
print "x.sum().sum() = ", x.sum().sum()
print "y = ", y
print "y.sum() = ", y.sum()
print "y.sum().sum() = ", y.sum().sum()
print "y.sum().sum().sum() = ", y.sum().sum().sum()
print "y.average() = ", y.average()
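One way to picture what a sum over the innermost level has to do is a segmented sum: the values of each bin are added up and masked values are skipped. The sketch below does this with np.repeat and np.bincount. It is not taken from buckysoap, and segmented_sum is a hypothetical name used only here.

```python
import numpy as np

def segmented_sum(data, mask, bincounts):
    """Sum each bin of a flat array, ignoring masked entries (illustrative)."""
    data = np.asarray(data)
    mask = np.asarray(mask, dtype=bool)
    bincounts = np.asarray(bincounts)
    # Label every value with the number of the bin it belongs to.
    bin_ids = np.repeat(np.arange(len(bincounts)), bincounts)
    # Masked values contribute nothing to the sum.
    weights = np.where(mask, 0, data)
    # minlength keeps empty bins in the result with a sum of 0.
    return np.bincount(bin_ids, weights=weights, minlength=len(bincounts))

sums = segmented_sum(
    data=[0, 1, 2, 0, 4, 5, 6],
    mask=[False, False, False, True, False, False, False],
    bincounts=[5, 0, 0, 2])
print(sums)
```

Because np.bincount is given weights, the result comes back as floats. Applying the same idea again to the bin counts of the next level would give the sum of sums.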
We can stack Atoms together. Note: Atoms must have the same cardinality.
In [7]:
print "y.vstack(y) = ", y.vstack(y)
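The cardinality requirement is easier to see on a flat representation. Assuming an Atom can be pictured as a data array, a mask and one array of bin counts per level (an assumption of this sketch, not a statement about buckysoap's internals), stacking amounts to concatenating each part, which only lines up when both sides have the same number of levels. The name stack_flat is hypothetical.

```python
import numpy as np

def stack_flat(a, b):
    """Stack two (data, mask, bincounts_per_level) triples (illustrative)."""
    data_a, mask_a, levels_a = a
    data_b, mask_b, levels_b = b
    # Different depths cannot be paired up level by level.
    if len(levels_a) != len(levels_b):
        raise ValueError("both inputs need the same number of levels")
    levels = [np.concatenate([la, lb]) for la, lb in zip(levels_a, levels_b)]
    data = np.concatenate([data_a, data_b])
    mask = np.concatenate([mask_a, mask_b])
    return data, mask, levels

a = (np.array([0, 1, 2]), np.array([False, False, True]), [np.array([2, 1])])
b = (np.array([3]), np.array([False]), [np.array([1])])
stacked_data, stacked_mask, stacked_levels = stack_flat(a, b)
print(stacked_data)
print(stacked_levels[0])
```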
Indexing is the big trick in Atom. Using the bincounts, I am able to index these multidimensional arrays at each level of the tree. The indexing is mostly handled by run_length.index, which uses run_length.range. Here is a link to the source code: https://github.com/leonhardbrenner/buckysoap/blob/master/src/buckysoap/run_length.py
This module led to the development of Atom and then Element. run_length.index is used in the getitem of Atom. Take a look: https://github.com/leonhardbrenner/buckysoap/blob/master/src/buckysoap/atom.py
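The core idea behind a run-length range fits in a few lines of NumPy: given a start and a count for each run, build all the ranges at once without a Python loop. The following is a minimal sketch written for this walkthrough rather than a copy of the code in run_length.py; the function name and its signature are assumptions.

```python
import numpy as np

def run_length_range(starts, counts):
    """Concatenate range(start, start + count) for every run, vectorized."""
    starts = np.asarray(starts)
    counts = np.asarray(counts)
    # Position in the output where each run begins.
    run_offsets = np.cumsum(counts) - counts
    # Position of every output element inside its own run: 0, 1, 2, ...
    within = np.arange(counts.sum()) - np.repeat(run_offsets, counts)
    # Shift each in-run position by the start of its run.
    return np.repeat(starts, counts) + within

positions = run_length_range(starts=[0, 10, 20], counts=[2, 0, 3])
print(positions)
```

With the starts taken from the cumulative bin counts of a nested array, the returned positions select whole bins in one fancy-indexing step, which is what makes indexing by bin possible without looping.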
Now let's do some indexing.
In [8]:
z = bs.arange(12) + 100
print 'z =', z
print 'y =', y
print 'z[y] =', z[y]
In this last example the cardinality of the index is passed on to the Atom being indexed. In the next example the Atom being indexed will have a cardinality of 2 and the index will have a cardinality of 3. The resulting Atom will have a cardinality of 4. All of this happens without looping through lists.
In [9]:
z2 = bs.arange(78)
z2.bincounts.append((bs.arange(12) + 1))
print 'z2 =', z2
print 'z2[y] =', z2[y]
print 'z2.cardinality = ', z2.cardinality
print 'y.cardinality = ', y.cardinality
print 'z2[y].cardinality = ', z2[y].cardinality
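The arithmetic behind 2, 3 and 4 can be checked with ordinary nested lists. The sketch below assumes that cardinality corresponds to nesting depth, which is consistent with the numbers above but is my reading rather than a definition taken from the source. Unlike Atom, it deliberately uses plain recursion so the structure is easy to follow; depth and take are hypothetical helper names.

```python
def depth(value):
    """Nesting depth of a list of lists; a flat list has depth 1."""
    if not isinstance(value, list):
        return 0
    return 1 + max([depth(v) for v in value] + [0])

def take(source, index):
    """Replace every integer leaf of index by source[leaf]; None stays None."""
    if index is None:
        return None
    if isinstance(index, list):
        return [take(source, i) for i in index]
    return source[index]

source = [[0], [1, 2], [3, 4, 5]]     # depth 2
index = [[[0, 1], None], [[2]]]       # depth 3
result = take(source, index)
print(result)
print(depth(source), depth(index), depth(result))
```

Every leaf of the index is replaced by a whole bin of the source, so the result has the index's levels plus the source's levels minus the one they share: 3 + 2 - 1 = 4.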
An Element can combine multiple Atoms and can be combined with other Atoms or Elements. In this first example we will construct an Element and then assign the columns range, zero and one.
In [10]:
e1 = Element(name='Sample')(
    range = bs.arange(10),
    zero = bs.zeros(10, int),
    one = bs.ones(10, int))
e1.display()
Out[10]:
Now we will use currying to create the Atoms group_id1 and group_id2.
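If the call-to-extend pattern is unfamiliar, the following toy class shows the general idea: calling an object with keyword arguments returns a new object with the extra columns, and a callable argument is evaluated against the existing columns. Table is a hypothetical stand-in written for this explanation and says nothing about how Element is actually implemented.

```python
class Table(object):
    """Tiny stand-in for the call-to-extend pattern (hypothetical, not Element)."""

    def __init__(self, **columns):
        self.__dict__.update(columns)

    def __call__(self, **new_columns):
        # Start from a copy so the original table is left untouched.
        columns = dict(self.__dict__)
        for name, value in new_columns.items():
            # A callable is evaluated against the existing table.
            columns[name] = value(self) if callable(value) else value
        return Table(**columns)

t1 = Table()(range=list(range(10)))
t2 = t1(group_id1=lambda t: [(v % 2) * 2 for v in t.range])
print(t2.group_id1)
```

Returning a new object on every call is what lets the steps be chained while earlier results such as t1 stay usable.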
In [11]:
e2 = e1(
    group_id1 = lambda x: (x.range % 2) * 2,
    group_id2 = lambda x: (x.range % 3) * 2
)
e2.display()
Out[11]:
Now for group operations. In the Atom indexing example we showed how an index imposes its cardinality on the Atom being indexed. This makes it very easy to build a group operation. The implementation is only a few lines, so take a look, but in short the group index consists of the sort_index over the two group columns and the bincounts, which are calculated by comparing x[:-1] and x[1:]. The index is applied in getattr when the Element needs to delegate to its source. This is more than you need to know.
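The two ingredients just described, a sort index over the key columns and bin counts found by comparing x[:-1] with x[1:], can be sketched in plain NumPy as follows. This is an illustration under my own assumptions (the name group_index, np.lexsort as the sorting routine, at least one input row), not the code buckysoap runs.

```python
import numpy as np

def group_index(*keys):
    """Return (sort_index, bincounts) for grouping by one or more key columns.

    Illustrative sketch; assumes at least one row.
    """
    keys = [np.asarray(k) for k in keys]
    # np.lexsort treats the LAST key as the primary one, so reverse the order.
    sort_index = np.lexsort(keys[::-1])
    sorted_keys = [k[sort_index] for k in keys]
    # A new group starts wherever any key differs from the previous row.
    changed = np.zeros(len(sort_index) - 1, dtype=bool)
    for k in sorted_keys:
        changed |= k[:-1] != k[1:]
    starts = np.concatenate([[0], np.flatnonzero(changed) + 1])
    # The distance between consecutive group starts is the group size.
    bincounts = np.diff(np.concatenate([starts, [len(sort_index)]]))
    return sort_index, bincounts

rng = np.arange(10)
sort_index, bincounts = group_index((rng % 2) * 2, (rng % 3) * 2)
print(sort_index)
print(bincounts)
```

Indexing any column with sort_index and attaching bincounts as its bin counts turns a flat column into one bin per group, which is why a grouped column can then be summed per group like any nested Atom.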
In [12]:
e3 = e2.group('group_id1,group_id2')
e3.display()
print "e3.__index__ = ", e3.__index__
print "e3.__source__.display():"
e3.__source__.display()
Out[12]:
Now let's try sum.
In [13]:
e4 = (
    e3(range_sum = lambda x: x.range.sum(),
       zero_sum = lambda x: x.zero.sum(),
       one_sum = lambda x: x.one.sum())
    ('group_id1,group_id2,range_sum,zero_sum,one_sum,range,zero,one'))
e4.display()
Out[13]:
In [14]:
e4.display()
e3.display()
e2.display()
e1.display()
Out[14]:
Now let's look at the expand method. The examples should be clear.
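For a mental model before running the cells, I am assuming here that expand turns a grouped column back into one row per value and repeats the group-level columns to match. Under that assumption the operation can be pictured with np.repeat; the function name repeat_per_group and the sample values are made up for this sketch and are not taken from buckysoap.

```python
import numpy as np

def repeat_per_group(group_values, bincounts):
    """Repeat each group-level value once per member of its group."""
    return np.repeat(np.asarray(group_values), np.asarray(bincounts))

# Six groups with 2, 1, 2, 2, 2 and 1 members respectively.
bincounts = [2, 1, 2, 2, 2, 1]
group_id2 = [0, 2, 4, 0, 2, 4]
expanded = repeat_per_group(group_id2, bincounts)
print(expanded)
```

The expanded column has one entry per original row again, so it lines up with the flattened values of the column that was expanded.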
In [15]:
e4.expand('range').display()
Out[15]:
In [16]:
e4.expand('range,zero,one').display()
Out[16]:
Now let's introduce the join, which is also implemented using nothing but NumPy. Take a look at the code:
https://github.com/leonhardbrenner/buckysoap/blob/master/src/buckysoap/join.py
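To give a feel for how a join can be written with nothing but NumPy, here is one possible approach based on np.argsort and np.searchsorted. It is a sketch of the general technique under my own assumptions, not a summary of join.py, and inner_join_indices is a hypothetical name.

```python
import numpy as np

def inner_join_indices(left_keys, right_keys):
    """Return (left_idx, right_idx) pairs of rows whose keys are equal."""
    left_keys = np.asarray(left_keys)
    right_keys = np.asarray(right_keys)
    # Sort the right side once so matches form contiguous runs.
    order = np.argsort(right_keys, kind='mergesort')
    sorted_right = right_keys[order]
    # For every left key, the run [lo, hi) of equal keys on the right.
    lo = np.searchsorted(sorted_right, left_keys, side='left')
    hi = np.searchsorted(sorted_right, left_keys, side='right')
    counts = hi - lo
    # Each left row appears once per match; rows without a match drop out.
    left_idx = np.repeat(np.arange(len(left_keys)), counts)
    # Expand every run [lo, hi) into its individual positions.
    offsets = np.cumsum(counts) - counts
    within = np.arange(counts.sum()) - np.repeat(offsets, counts)
    right_idx = order[np.repeat(lo, counts) + within]
    return left_idx, right_idx

left_idx, right_idx = inner_join_indices([0, 2, 7], [2, 0, 2])
print(left_idx)
print(right_idx)
```

The two index arrays can then be used to index the columns of each side, so the joined table is built by fancy indexing rather than by looping over rows.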
The Element uses this package's join method. Let's start with the setup: I am creating the Element groups, which I will join to the Element e4.
In [17]:
groups = (
    bs.Element(name='Groups')
    (group_id = bs.arange(6))
    (name = lambda x: ['group_%d' % y for y in x.group_id[::-1]]))
groups.display()
e4.display()
Out[17]:
Now for the join:
In [18]:
join = e4.inner(groups, group_id='group_id2')
join.display()
Out[18]:
We can sort these new elements and we can stack them together:
In [19]:
join = join.sort_by('Groups.name').display()
In [20]:
join.vstack(join).display()
Out[20]:
Notice the hierarchy looks kind of like XML or JSON. Check this out:
In [21]:
print join.toxml()
In [22]:
print join.toxml(use_attributes=False)
In [23]:
print join.tojson()
Now let's do some caching. For this I am relying on numpy.savez and numpy.savez_compressed. I have yet to find a faster way of storing a billion values. Here is the documentation:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.savez.html
I use it in Atom.load and Atom.persist. Columns are splayed into individual npz files:
lbrenner@josie:~/buckysoap$ ls cache/sample
one.npz range.npz zero.npz
Each Atom's npz file contains the data, the mask and any necessary bincounts.
 Filemode    Length  Date         Time      File
 ----------  ------  -----------  --------  --------
 -rw-------     160   4-Jun-2015  12:23:08  data.npy
 -rw-------      90   4-Jun-2015  12:23:08  mask.npy
 ----------  ------  -----------  --------  --------
                250                         2 files
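The load-or-build behaviour can be sketched with numpy.savez_compressed and numpy.load alone. The helper below stores one column's data and mask in a single npz file and calls the factory only when that file is missing. The name load_or_build, the temporary directory and the absence of bincounts are simplifications of this sketch; Atom.load and Atom.persist may work differently.

```python
import os
import tempfile
import numpy as np

def load_or_build(path, factory):
    """Load (data, mask) from an .npz file, calling factory only on a cache miss."""
    if not os.path.exists(path):
        data, mask = factory()
        # Each keyword becomes one .npy member inside the archive.
        np.savez_compressed(path, data=data, mask=mask)
    with np.load(path) as archive:
        return archive['data'], archive['mask']

calls = []

def factory():
    # Record the call so we can see how often the factory runs.
    calls.append(1)
    return np.arange(10), np.zeros(10, dtype=bool)

cachedir = tempfile.mkdtemp()
path = os.path.join(cachedir, 'range.npz')
first = load_or_build(path, factory)
second = load_or_build(path, factory)
print(len(calls))
```

Keeping one file per column means a reader that needs only a single column never has to touch the others.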
In [24]:
def cache_sample():
    def factory():
        print "Running Factory"
        return bs.Element(name='Sample')(
            range = bs.arange(10),
            zero = bs.zeros(10, int),
            one = bs.ones(10, int))
    return bs.Element(
        name='sample',
        cnames='range,zero,one',
        cachedir='cache/sample',
        factory=factory)
#We are deleting the directory just for the sake of demonstration
import os
if os.path.exists('cache/sample'):
    print "Deleting cache/sample"
    import shutil
    shutil.rmtree('cache/sample')
cache_sample().display()
cache_sample().display()
Out[24]:
If you want to map to and from pandas, use element.to_pandas() and Element.from_pandas(). I will provide examples later, but my current instance does not have pandas running. I would recommend using Enthought or Continuum.