Tutorial



In [1]:

    
import os
import sys
import logging
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)



In [2]:

    
import hurraypy as hurray
import numpy as np



In [3]:

    
hurray.__version__









    Out[3]:





'0.0.3'

First, get the hurraypy logger. To send all log messages to the console, uncomment the handler setup below:



In [4]:

    
logger = logging.getLogger('hurraypy')

# console = logging.StreamHandler()
# console.setLevel(logging.DEBUG)
# console.setFormatter(logging.Formatter('%(levelname)s --- %(message)s'))
# logger.addHandler(console)
# logger.setLevel(logging.DEBUG)



In [5]:

    
logger.handlers









    Out[5]:





[<logging.NullHandler at 0x7ff9a452a748>]



In [6]:

    
hurray.log.log.debug("bla")
hurray.log.log.info("bla")
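The handler setup in the logging cell above is commented out, which is why only a NullHandler is attached. As a minimal sketch using only the standard library, the logger could be wired to stdout as follows. The logger name and format string are taken from the cell above; passing sys.stdout explicitly is an addition here, because logging.StreamHandler() writes to stderr when called without arguments.

```python
import logging
import sys

# 'hurraypy' is the logger name used earlier in this tutorial.
logger = logging.getLogger('hurraypy')

# StreamHandler() defaults to sys.stderr, so pass sys.stdout explicitly.
console = logging.StreamHandler(sys.stdout)
console.setLevel(logging.DEBUG)
console.setFormatter(logging.Formatter('%(levelname)s --- %(message)s'))

logger.addHandler(console)
logger.setLevel(logging.DEBUG)

logger.debug("logging to stdout is configured")
```

Both the logger and the handler have a level; a message is only printed if it passes both.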

Connecting to a hurray server

Make sure you have a Hurray server running, e.g.:

$ hurray --logging=debug --debug=1 --socket=~/hurray.sock
[I 170620 09:46:50 __main__:180] Listening on localhost:2222
[I 170620 09:46:50 __main__:184] Listening on /home/rg/hurray.sock
[I 170619 11:16:50 process:132] Starting 8 processes



In [7]:

    
# conn = hurray.connect('localhost:2222')
conn = hurray.connect('~/hurray.sock')
conn









    Out[7]:





<Connection (udsocket=/home/rg/hurray.sock)>
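connect() was given a socket path here, while the commented-out line uses a host:port string. The sketch below shows one way a client could tell the two address styles apart. parse_address is a hypothetical helper written for this tutorial, not part of hurraypy, and the library's actual rules may differ.

```python
import os


def parse_address(address):
    """Classify an address string (hypothetical helper, not part of hurraypy).

    Returns ('tcp', host, port) for 'host:port' strings and
    ('unix', path) for everything else, with '~' expanded.
    """
    host, sep, port = address.rpartition(':')
    if sep and host and port.isdigit():
        return ('tcp', host, int(port))
    return ('unix', os.path.abspath(os.path.expanduser(address)))


print(parse_address('localhost:2222'))
print(parse_address('~/hurray.sock'))
```

Expanding '~' up front is consistent with the connection repr above, which shows an absolute socket path.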

Working with files

Let's create a file test.h5 (overwrite=True replaces the file if it already exists):



In [8]:

    
f = conn.create_file("test.h5", overwrite=True)

Note that Hurray objects (files, datasets, groups) display nicely in Jupyter notebooks.



In [9]:

    
f









    Out[9]:






File test.h5 (800b)

Existing files are opened like this (optionally as a context manager):



In [10]:

    
f = conn.File("test.h5")
print(f)

with conn.File("test.h5") as f:
    print(f)









    



<File (db=test.h5, path=/)>
<File (db=test.h5, path=/)>

Deleting and renaming files is also possible:



In [11]:

    
f.delete()

Note that the object referenced by f becomes unusable after deleting the file.

Let's create another file and rename it to test.h5:



In [12]:

    
f2 = conn.create_file("test2.h5", overwrite=True)



In [13]:

    
f2









    Out[13]:






File test2.h5 (800b)



In [14]:

    
f = f2.rename("test.h5")



In [15]:

    
f









    Out[15]:






File /home/rg/hurray_data/test.h5 (800b)

Note that rename() is not "in place". We must (re-)assign its return value.
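The same "return a new object" convention is familiar from Python strings, whose methods never modify the original. The toy class below illustrates the pattern; FileHandle is a hypothetical stand-in defined only for this sketch and is not the hurraypy file class.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FileHandle:
    """Toy stand-in for a file object (hypothetical, not the hurraypy class)."""
    name: str

    def rename(self, new_name):
        # Return a new handle instead of mutating this one.
        return replace(self, name=new_name)


old = FileHandle("test2.h5")
new = old.rename("test.h5")
print(old.name, new.name)
```

Calling rename() without keeping the result would leave only the stale handle, which is why the return value is assigned in the cell above.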



In [16]:

    
f3 = conn.create_file("test3.h5", overwrite=True)



In [17]:

    
try:
    f3.rename("test.h5")
except hurray.exceptions.DatabaseError as e:
    print(e)









    



(300, 'file already exists', '')

Files can be in subdirectories:



In [18]:

    
f4 = conn.create_file("project1/data.h5", overwrite=True)
f4









    Out[18]:






File project1/data.h5 (800b)



In [19]:

    
conn.list_files("project1/")









    Out[19]:





{'data.h5': {'filesize': 800}}



In [21]:

    
conn.list_files("")









    Out[21]:





{'test.h5': {'filesize': 800}, 'test3.h5': {'filesize': 800}}

Working with datasets

A file can contain two kinds of objects: groups and datasets. Essentially, groups work like Python dictionaries and datasets work like NumPy arrays.

Every group and dataset has a name. First, let's try to create a dataset. We must specify the dataset either by passing a NumPy array or by passing a shape and a datatype:



In [30]:

    
dst = f.create_dataset("mydata", shape=(400, 300), dtype=np.float64)



In [31]:

    
dst









    Out[31]:




Dataset (400, 300) float64  (file=/home/rg/hurray_data/test.h5, path=/mydata)

A dataset has a shape and a dtype, just like NumPy arrays:



In [32]:

    
dst.shape, dst.dtype









    Out[32]:





((400, 300), 'float64')

It also has a path, which is the name of the dataset prefixed by the names of its containing groups. Our dataset is not contained in any group we created, so it appears directly under the root node / (strictly speaking, it is in a group: the file itself is the root group).



In [33]:

    
dst.path









    Out[33]:





'/mydata'
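Paths are composed like POSIX filesystem paths. As a small illustration using the standard library's posixpath module (node_path is a helper made up for this sketch, and the nested names are only examples):

```python
import posixpath


def node_path(*names):
    """Join group and dataset names into an absolute path (illustrative helper)."""
    return posixpath.join('/', *names)


print(node_path('mydata'))
print(node_path('mygroup', 'subgroup', 'randomdata'))
```

A dataset directly under the root therefore gets a path consisting of a single slash followed by its name.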

Let's check what data our dataset contains. NumPy-style indexing lets you read from and write to a dataset. A [:] index reads the whole dataset into memory. Apparently, our dataset has been initialized with zeros:



In [34]:

    
dst[:]









    Out[34]:





array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

Let's overwrite this dataset with increasing floating point numbers:



In [35]:

    
arr = np.linspace(0, 1, num=dst.shape[0] * dst.shape[1]).reshape(dst.shape)
arr.shape == dst.shape









    Out[35]:





True



In [36]:

    
dst[:] = arr



In [37]:

    
dst[:]









    Out[37]:





array([[  0.00000000e+00,   8.33340278e-06,   1.66668056e-05, ...,
          2.47502063e-03,   2.48335403e-03,   2.49168743e-03],
       [  2.50002083e-03,   2.50835424e-03,   2.51668764e-03, ...,
          4.97504146e-03,   4.98337486e-03,   4.99170826e-03],
       [  5.00004167e-03,   5.00837507e-03,   5.01670847e-03, ...,
          7.47506229e-03,   7.48339569e-03,   7.49172910e-03],
       ..., 
       [  9.92508271e-01,   9.92516604e-01,   9.92524938e-01, ...,
          9.94983292e-01,   9.94991625e-01,   9.94999958e-01],
       [  9.95008292e-01,   9.95016625e-01,   9.95024959e-01, ...,
          9.97483312e-01,   9.97491646e-01,   9.97499979e-01],
       [  9.97508313e-01,   9.97516646e-01,   9.97524979e-01, ...,
          9.99983333e-01,   9.99991667e-01,   1.00000000e+00]])
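The printed values can be checked by hand. np.linspace(0, 1, num=N) spaces its values 1 / (N - 1) apart, and reshape fills the array row by row, so the element at (row, col) is the flat index row * 300 + col multiplied by that step. A plain-Python sketch of the arithmetic (value_at is a helper name invented here):

```python
ROWS, COLS = 400, 300
N = ROWS * COLS

# Spacing between consecutive linspace values on [0, 1].
step = 1 / (N - 1)


def value_at(row, col):
    """Value stored at (row, col) after the row-major reshape."""
    return (row * COLS + col) * step


print(step)
print(value_at(0, 1), value_at(ROWS - 1, COLS - 1))
```

The second entry of the first row is exactly one step, and the last element is 1.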

Creating the dataset has increased the file size:



In [38]:

    
f









    Out[38]:






File /home/rg/hurray_data/test.h5 (977K)

Slicing lets you read or write only a portion of a dataset. In the following example, only columns 50 to 54 of rows 10 and 11 are sent over the wire (the end index of a slice is exclusive):



In [39]:

    
dst[10:12, 50:55]









    Out[39]:





array([[ 0.02541688,  0.02542521,  0.02543355,  0.02544188,  0.02545021],
       [ 0.0279169 ,  0.02792523,  0.02793357,  0.0279419 ,  0.02795023]])

We can also overwrite the above cells using the same notation:



In [40]:

    
dst[10:12, 50:55] = 999
dst[9:13, 50:55]









    Out[40]:





array([[  2.29168576e-02,   2.29251910e-02,   2.29335244e-02,
          2.29418578e-02,   2.29501913e-02],
       [  9.99000000e+02,   9.99000000e+02,   9.99000000e+02,
          9.99000000e+02,   9.99000000e+02],
       [  9.99000000e+02,   9.99000000e+02,   9.99000000e+02,
          9.99000000e+02,   9.99000000e+02],
       [  3.04169201e-02,   3.04252535e-02,   3.04335869e-02,
          3.04419203e-02,   3.04502538e-02]])
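The slice semantics are the same as for an in-memory NumPy array, so they can be tried out locally without a server. The sketch below uses a plain array named local as a stand-in for the remote dataset; it only demonstrates NumPy behaviour and says nothing about how hurraypy transfers the data.

```python
import numpy as np

# Local stand-in for the remote dataset: same shape and contents as `arr` above.
local = np.linspace(0, 1, num=400 * 300).reshape(400, 300)

# Rows 10-11 and columns 50-54: the end of each slice is exclusive.
block = local[10:12, 50:55]
print(block.shape)

# Assigning a scalar broadcasts it over the selected cells only.
local[10:12, 50:55] = 999
print(local[9:13, 50:55])
```

Rows 9 and 12 keep their original values, matching the output of the cell above.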

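require_dataset() returns the existing dataset if the requested shape and dtype match it. The values written above are still in place: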



In [41]:

    
dst = f.require_dataset("mydata", shape=(400, 300), dtype=np.float64, exact=True)



In [42]:

    
dst[9:13, 50:55]









    Out[42]:





array([[  2.29168576e-02,   2.29251910e-02,   2.29335244e-02,
          2.29418578e-02,   2.29501913e-02],
       [  9.99000000e+02,   9.99000000e+02,   9.99000000e+02,
          9.99000000e+02,   9.99000000e+02],
       [  9.99000000e+02,   9.99000000e+02,   9.99000000e+02,
          9.99000000e+02,   9.99000000e+02],
       [  3.04169201e-02,   3.04252535e-02,   3.04335869e-02,
          3.04419203e-02,   3.04502538e-02]])

This should result in an error because the dtypes do not match:



In [43]:

    
f.require_dataset("mydata", shape=(400, 300), dtype=np.int16, exact=True)









    



---------------------------------------------------------------------------
MessageError                              Traceback (most recent call last)
<ipython-input-43-bed5346ceb06> in <module>()
----> 1 f.require_dataset("mydata", shape=(400, 300), dtype=np.int16, exact=True)

~/workspace/hurray-py/hurraypy/nodes.py in require_dataset(self, name, shape, dtype, data, chunks, compression, compression_opts, fillvalue, exact)
    292             args[CMD_KW_FILLVALUE] = fillvalue
    293         result = self.conn.send_rcv(CMD_REQUIRE_DATASET, h5file=self.h5file,
--> 294                                     args=args, data=data)
    295 
    296         dst = result["data"]  # Dataset

~/workspace/hurray-py/hurraypy/client.py in send_rcv(self, cmd, h5file, args, data)
    221             error_msg = result.get(CMD_KW_DATA, "")
    222             if 200 <= status < 300:
--> 223                 raise MessageError(status, error_msg)
    224             elif 300 <= status < 400:
    225                 raise DatabaseError(status, error_msg)

MessageError: (204, 'incompatible dtype and/or shape ', '')
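The client.py lines in the traceback show how status codes are turned into exceptions: codes from 200 to 299 raise MessageError and codes from 300 to 399 raise DatabaseError. The sketch below reproduces just that dispatch with locally defined stand-in classes, so it runs without hurraypy. Treating every other code as success is an assumption made here.

```python
class MessageError(Exception):
    """Stand-in for hurraypy's MessageError (defined locally for this sketch)."""


class DatabaseError(Exception):
    """Stand-in for hurraypy's DatabaseError (defined locally for this sketch)."""


def raise_for_status(status, message=""):
    """Raise the exception class matching a server status code.

    The two ranges mirror the client.py lines visible in the traceback;
    anything else is treated as success here, which is an assumption.
    """
    if 200 <= status < 300:
        raise MessageError(status, message)
    if 300 <= status < 400:
        raise DatabaseError(status, message)


for status, message in [(204, 'incompatible dtype and/or shape'),
                        (300, 'file already exists')]:
    try:
        raise_for_status(status, message)
    except (MessageError, DatabaseError) as exc:
        print(type(exc).__name__, exc.args)
```

This is why the failed rename earlier was caught as a DatabaseError (status 300), while the dtype mismatch surfaces as a MessageError (status 204).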

Working with groups

Datasets can be organised in groups (and subgroups). A group is like a folder and acts like a Python dictionary. Let's create a group named "mygroup":



In [44]:

    
f.create_group("mygroup")









    Out[44]:






Group /mygroup (file=/home/rg/hurray_data/test.h5)

Recall that every file object is also a group and therefore acts like a dictionary. Its keys() now lists our newly created group:



In [45]:

    
f.keys()









    Out[45]:





('mydata', 'mygroup')

Let's create a subgroup (note that groups follow POSIX filesystem conventions):



In [46]:

    
f.create_group("mygroup/subgroup")









    Out[46]:






Group /mygroup/subgroup (file=/home/rg/hurray_data/test.h5)



In [47]:

    
subgrp = f["mygroup/subgroup"]
subgrp









    Out[47]:






Group /mygroup/subgroup (file=/home/rg/hurray_data/test.h5)
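To make the dictionary analogy concrete, here is a small sketch that resolves a slash-separated path against nested Python dicts. resolve and the tree layout are made up for illustration; hurraypy resolves paths on the server, not like this.

```python
def resolve(root, path):
    """Walk a nested dict along a '/'-separated path (illustrative only)."""
    node = root
    for part in path.strip('/').split('/'):
        if part:  # skip empty components such as in '/' or 'a//b'
            node = node[part]
    return node


tree = {'mydata': 'dataset', 'mygroup': {'subgroup': {}}}
print(resolve(tree, 'mygroup/subgroup'))
print(sorted(tree.keys()))
```

Indexing with a path is thus shorthand for indexing one level at a time.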

Now let's put a dataset in our subgroup:



In [48]:

    
data = np.random.random((600, 400))



In [49]:

    
dst = subgrp.create_dataset("randomdata", data=data)



In [50]:

    
dst









    Out[50]:




Dataset (600, 400) float64  (file=/home/rg/hurray_data/test.h5, path=/mygroup/subgroup/randomdata)

Every group has a tree() method that displays its subgroups and datasets as a tree:



In [51]:

    
f.tree()









    Out[51]:







/
    Dataset (400, 300) float64  (file=/home/rg/hurray_data/test.h5, path=/mydata)
    mygroup
        subgroup
            Dataset (600, 400) float64  (file=/home/rg/hurray_data/test.h5, path=/mygroup/subgroup/randomdata)

If you're not in a notebook or IPython console, tree() will give you a text-based representation:



In [52]:

    
print(f.tree())









    



── /
    ├─ <Dataset (400, 300) float64 (db=/home/rg/hurray_data/test.h5, path=/mydata)>
    └─ mygroup
        └─ subgroup
            └─ <Dataset (600, 400) float64 (db=/home/rg/hurray_data/test.h5, path=/mygroup/subgroup/randomdata)>
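A text tree like the one above can be produced with a short recursive function. The following sketch draws a nested dict in a similar style. It is not hurraypy's renderer, the dataset labels are shortened placeholders, and it omits the vertical connector lines a full implementation would draw next to non-final branches.

```python
def render(children, indent="    "):
    """Yield one line per entry of a nested dict, drawn with box characters."""
    items = list(children.items())
    for i, (name, child) in enumerate(items):
        last = i == len(items) - 1
        yield indent + ("└─ " if last else "├─ ") + name
        if isinstance(child, dict):
            # Recurse into groups; leaves (datasets) are stored as None.
            yield from render(child, indent + "    ")


layout = {
    "<Dataset mydata>": None,
    "mygroup": {"subgroup": {"<Dataset randomdata>": None}},
}
text = "\n".join(["── /"] + list(render(layout)))
print(text)
```

Each level of nesting adds four spaces of indentation, and the last child of a node gets the corner connector.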

Attributes

Every group and dataset can be assigned a number of key/value pairs, so-called attributes:



In [53]:

    
dst = f["mygroup/subgroup/randomdata"]
dst.attrs["unit"] = "celsius"
dst.attrs["max_value"] = 50

In the notebook display, objects that have attributes are marked with a red "A" (the marker is not visible in plain-text output):



In [54]:

    
dst









    Out[54]:




Dataset (600, 400) float64  (file=/home/rg/hurray_data/test.h5, path=/mygroup/subgroup/randomdata)



In [55]:

    
dst.attrs.keys()









    Out[55]:





('unit', 'max_value')



In [56]:

    
dst.attrs["unit"], dst.attrs["max_value"]









    Out[56]:





('celsius', 50)



In [57]:

    
f.tree()









    Out[57]:







/
    Dataset (400, 300) float64  (file=/home/rg/hurray_data/test.h5, path=/mydata)
    mygroup
        subgroup
            Dataset (600, 400) float64  (file=/home/rg/hurray_data/test.h5, path=/mygroup/subgroup/randomdata)


