Tutorial


In [1]:
import os
import sys
import logging
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
import hurraypy as hurray
import numpy as np

In [3]:
hurray.__version__


Out[3]:
'0.0.3'

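First, set up logging. To see hurraypy's log messages in the notebook, uncomment the handler setup below. As shipped, the logger only has a NullHandler, so messages are discarded: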


In [4]:
logger = logging.getLogger('hurraypy')

# console = logging.StreamHandler()
# console.setLevel(logging.DEBUG)
# console.setFormatter(logging.Formatter('%(levelname)s --- %(message)s'))
# logger.addHandler(console)
# logger.setLevel(logging.DEBUG)

In [5]:
logger.handlers


Out[5]:
[<logging.NullHandler at 0x7ff9a452a748>]

In [6]:
hurray.log.log.debug("bla")
hurray.log.log.info("bla")
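The commented-out lines in In [4] can be wrapped in a small helper. The following is a minimal sketch using only the standard logging module. The helper name attach_stream_handler and the logger name "hurraypy.demo" are made up for illustration, and a StringIO buffer stands in for sys.stdout so the result can be inspected:

```python
import io
import logging


def attach_stream_handler(logger_name, stream, level=logging.DEBUG):
    """Attach a stream handler with the tutorial's format to the named logger."""
    logger = logging.getLogger(logger_name)
    handler = logging.StreamHandler(stream)
    handler.setLevel(level)
    handler.setFormatter(logging.Formatter('%(levelname)s --- %(message)s'))
    logger.addHandler(handler)
    logger.setLevel(level)
    return handler


# Hypothetical logger name; a buffer replaces sys.stdout for demonstration.
buffer = io.StringIO()
attach_stream_handler('hurraypy.demo', buffer)
logging.getLogger('hurraypy.demo').debug("bla")
captured = buffer.getvalue()
```

In a notebook, passing sys.stdout as the stream should make the messages appear below the cell. Note that logging.StreamHandler() without arguments writes to sys.stderr.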

Connecting to a hurray server

Make sure a Hurray server is running, e.g.:

$ hurray --logging=debug --debug=1 --socket=~/hurray.sock
[I 170620 09:46:50 __main__:180] Listening on localhost:2222
[I 170620 09:46:50 __main__:184] Listening on /home/rg/hurray.sock
[I 170619 11:16:50 process:132] Starting 8 processes

In [7]:
# conn = hurray.connect('localhost:2222')
conn = hurray.connect('~/hurray.sock')
conn


Out[7]:
<Connection (udsocket=/home/rg/hurray.sock)>

Working with files

Let's create a file test.h5 (overwrite=True replaces the file if it already exists):


In [8]:
f = conn.create_file("test.h5", overwrite=True)

Note that Hurray objects (files, datasets, groups) display nicely in Jupyter notebooks.


In [9]:
f


Out[9]:
File test.h5 (800b)

Existing files are opened like this (File objects can also be used as context managers):


In [10]:
f = conn.File("test.h5")
print(f)

with conn.File("test.h5") as f:
    print(f)


<File (db=test.h5, path=/)>
<File (db=test.h5, path=/)>

Deleting and renaming files is also possible:


In [11]:
f.delete()

Note that the object referenced by f becomes unusable after deleting the file.

Let's create another file and rename it to test.h5:


In [12]:
f2 = conn.create_file("test2.h5", overwrite=True)

In [13]:
f2


Out[13]:
File test2.h5 (800b)

In [14]:
f = f2.rename("test.h5")

In [15]:
f


Out[15]:
File /home/rg/hurray_data/test.h5 (800b)

Note that rename() is not "in place". We must (re-)assign its return value.


In [16]:
f3 = conn.create_file("test3.h5", overwrite=True)

In [17]:
try:
    f3.rename("test.h5")
except hurray.exceptions.DatabaseError as e:
    print(e)


(300, 'file already exists', '')

Files can be in subdirectories:


In [18]:
f4 = conn.create_file("project1/data.h5", overwrite=True)
f4


Out[18]:
File project1/data.h5 (800b)

In [19]:
conn.list_files("project1/")


Out[19]:
{'data.h5': {'filesize': 800}}

In [21]:
conn.list_files("")


Out[21]:
{'test.h5': {'filesize': 800}, 'test3.h5': {'filesize': 800}}

Working with datasets

A file can contain two kinds of objects: groups and datasets. Essentially, groups work like Python dictionaries and datasets work like NumPy arrays.

Every group and dataset has a name. First, let's try to create a dataset. We must specify the dataset either by passing a NumPy array or by passing a shape and a datatype:


In [30]:
dst = f.create_dataset("mydata", shape=(400, 300), dtype=np.float64)

In [31]:
dst


Out[31]:
Dataset (400, 300) float64 (file=/home/rg/hurray_data/test.h5, path=/mydata)

A dataset has a shape and a dtype, just like NumPy arrays:


In [32]:
dst.shape, dst.dtype


Out[32]:
((400, 300), 'float64')

It also has a path, which is the name of the dataset prefixed by the names of its containing groups. Our dataset is not contained in any named group, so it appears directly under the root node / (strictly speaking, the file itself is the root group).


In [33]:
dst.path


Out[33]:
'/mydata'

Let's check what data our dataset contains. NumPy-style indexing lets us read from and write to a dataset. A [:] index reads the whole dataset into memory. Apparently, our dataset has been initialized with zeros:


In [34]:
dst[:]


Out[34]:
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

Let's overwrite this dataset with increasing floating point numbers:


In [35]:
arr = np.linspace(0, 1, num=dst.shape[0] * dst.shape[1]).reshape(dst.shape)
arr.shape == dst.shape


Out[35]:
True

In [36]:
dst[:] = arr

In [37]:
dst[:]


Out[37]:
array([[  0.00000000e+00,   8.33340278e-06,   1.66668056e-05, ...,
          2.47502063e-03,   2.48335403e-03,   2.49168743e-03],
       [  2.50002083e-03,   2.50835424e-03,   2.51668764e-03, ...,
          4.97504146e-03,   4.98337486e-03,   4.99170826e-03],
       [  5.00004167e-03,   5.00837507e-03,   5.01670847e-03, ...,
          7.47506229e-03,   7.48339569e-03,   7.49172910e-03],
       ..., 
       [  9.92508271e-01,   9.92516604e-01,   9.92524938e-01, ...,
          9.94983292e-01,   9.94991625e-01,   9.94999958e-01],
       [  9.95008292e-01,   9.95016625e-01,   9.95024959e-01, ...,
          9.97483312e-01,   9.97491646e-01,   9.97499979e-01],
       [  9.97508313e-01,   9.97516646e-01,   9.97524979e-01, ...,
          9.99983333e-01,   9.99991667e-01,   1.00000000e+00]])
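The values above follow directly from how linspace spaces its samples: with N = 400 * 300 points between 0 and 1, the element at flat index k equals k / (N - 1). A small sketch in plain Python (the function name expected_value is only for illustration):

```python
rows, cols = 400, 300
n_points = rows * cols


def expected_value(row, col):
    """Value stored at (row, col) for a row-major reshape of linspace(0, 1, n_points)."""
    flat_index = row * cols + col
    return flat_index / (n_points - 1)


second_value = expected_value(0, 1)      # first step of the sequence
last_value = expected_value(399, 299)    # final element
```

This reproduces the second entry of the first row (about 8.3334e-06) and the final entry of 1.0.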

Creating the dataset has increased the file size:


In [38]:
f


Out[38]:
File /home/rg/hurray_data/test.h5 (977K)
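The reported size is plausible given the raw payload of the dataset. As a rough check (how the "K" suffix is computed and how much metadata the file carries are not stated here, so treat the comparison as an estimate):

```python
import struct

rows, cols = 400, 300
itemsize = struct.calcsize("<d")          # a float64 occupies 8 bytes
payload_bytes = rows * cols * itemsize    # raw array data only, no file metadata
```

The payload alone is 960,000 bytes, so a file of roughly 977K leaves a small remainder for file-format overhead.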

Slicing allows us to read or write only portions of a dataset. In the following example, only columns 50 to 54 of rows 10 and 11 are sent over the wire (the end of a slice is exclusive):


In [39]:
dst[10:12, 50:55]


Out[39]:
array([[ 0.02541688,  0.02542521,  0.02543355,  0.02544188,  0.02545021],
       [ 0.0279169 ,  0.02792523,  0.02793357,  0.0279419 ,  0.02795023]])

We can also overwrite the above cells using the same notation:


In [40]:
dst[10:12, 50:55] = 999
dst[9:13, 50:55]


Out[40]:
array([[  2.29168576e-02,   2.29251910e-02,   2.29335244e-02,
          2.29418578e-02,   2.29501913e-02],
       [  9.99000000e+02,   9.99000000e+02,   9.99000000e+02,
          9.99000000e+02,   9.99000000e+02],
       [  9.99000000e+02,   9.99000000e+02,   9.99000000e+02,
          9.99000000e+02,   9.99000000e+02],
       [  3.04169201e-02,   3.04252535e-02,   3.04335869e-02,
          3.04419203e-02,   3.04502538e-02]])
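The same slicing rules can be tried on a local NumPy array without a server. This sketch only illustrates NumPy's half-open slices and scalar assignment; it assumes, as the outputs above suggest, that Hurray datasets follow the same conventions:

```python
import numpy as np

# Local stand-in for the dataset contents written earlier.
local = np.linspace(0, 1, num=400 * 300).reshape(400, 300)

# Rows 10 and 11, columns 50 through 54 (both slice ends are exclusive).
block = local[10:12, 50:55].copy()

# Assigning a scalar fills every cell of the selected region.
local[10:12, 50:55] = 999
neighbours_untouched = (
    local[9, 50] != 999 and local[12, 50] != 999
    and local[10, 49] != 999 and local[10, 55] != 999
)
```

The copy() call matters locally because a NumPy slice is a view onto the original array. A slice read from a remote dataset arrives as a separate array.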

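require_dataset() returns a dataset that already exists, provided the requested shape and dtype are compatible with it. The existing data is left untouched: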


In [41]:
dst = f.require_dataset("mydata", shape=(400, 300), dtype=np.float64, exact=True)

In [42]:
dst[9:13, 50:55]


Out[42]:
array([[  2.29168576e-02,   2.29251910e-02,   2.29335244e-02,
          2.29418578e-02,   2.29501913e-02],
       [  9.99000000e+02,   9.99000000e+02,   9.99000000e+02,
          9.99000000e+02,   9.99000000e+02],
       [  9.99000000e+02,   9.99000000e+02,   9.99000000e+02,
          9.99000000e+02,   9.99000000e+02],
       [  3.04169201e-02,   3.04252535e-02,   3.04335869e-02,
          3.04419203e-02,   3.04502538e-02]])

This should result in an error because the dtypes do not match:


In [43]:
f.require_dataset("mydata", shape=(400, 300), dtype=np.int16, exact=True)


---------------------------------------------------------------------------
MessageError                              Traceback (most recent call last)
<ipython-input-43-bed5346ceb06> in <module>()
----> 1 f.require_dataset("mydata", shape=(400, 300), dtype=np.int16, exact=True)

~/workspace/hurray-py/hurraypy/nodes.py in require_dataset(self, name, shape, dtype, data, chunks, compression, compression_opts, fillvalue, exact)
    292             args[CMD_KW_FILLVALUE] = fillvalue
    293         result = self.conn.send_rcv(CMD_REQUIRE_DATASET, h5file=self.h5file,
--> 294                                     args=args, data=data)
    295 
    296         dst = result["data"]  # Dataset

~/workspace/hurray-py/hurraypy/client.py in send_rcv(self, cmd, h5file, args, data)
    221             error_msg = result.get(CMD_KW_DATA, "")
    222             if 200 <= status < 300:
--> 223                 raise MessageError(status, error_msg)
    224             elif 300 <= status < 400:
    225                 raise DatabaseError(status, error_msg)

MessageError: (204, 'incompatible dtype and/or shape ', '')
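The error above is consistent with a shape and dtype check on the existing dataset. The sketch below is an assumption modelled on h5py's require_dataset semantics (an exact dtype match when exact=True, otherwise a safe-cast check). The function is_compatible is hypothetical and not part of hurraypy:

```python
import numpy as np


def is_compatible(existing_shape, existing_dtype, shape, dtype, exact=False):
    """Hypothetical check: can an existing dataset satisfy the requested shape/dtype?"""
    if tuple(existing_shape) != tuple(shape):
        return False
    existing_dtype = np.dtype(existing_dtype)
    dtype = np.dtype(dtype)
    if exact:
        # exact=True demands identical dtypes.
        return existing_dtype == dtype
    # Otherwise accept any requested dtype that casts safely to the stored one.
    return bool(np.can_cast(dtype, existing_dtype))


same = is_compatible((400, 300), "float64", (400, 300), np.float64, exact=True)
strict_mismatch = is_compatible((400, 300), "float64", (400, 300), np.int16, exact=True)
relaxed = is_compatible((400, 300), "float64", (400, 300), np.int16, exact=False)
```

Under these assumed rules, the float64 request succeeds, the int16 request with exact=True fails, and the same request with exact=False would be accepted because int16 casts safely to float64.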

Working with groups

Datasets can be organised in groups (and subgroups). A group is like a folder and acts like a Python dictionary. Let's create a group named "mygroup":


In [44]:
f.create_group("mygroup")


Out[44]:
Group /mygroup (file=/home/rg/hurray_data/test.h5)

Recall that every file object is also a group and therefore acts like a dictionary. Its keys() method now lists our newly created group:


In [45]:
f.keys()


Out[45]:
('mydata', 'mygroup')

Let's create a subgroup (note that groups follow POSIX filesystem conventions):


In [46]:
f.create_group("mygroup/subgroup")


Out[46]:
Group /mygroup/subgroup (file=/home/rg/hurray_data/test.h5)

In [47]:
subgrp = f["mygroup/subgroup"]
subgrp


Out[47]:
Group /mygroup/subgroup (file=/home/rg/hurray_data/test.h5)
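Because group paths follow POSIX conventions, Python's standard posixpath module can be used to build and take apart such paths on the client side. This is only a convenience sketch and does not call into hurraypy:

```python
import posixpath

# Compose an absolute group path from its components.
group_path = posixpath.join("/", "mygroup", "subgroup")

# Split it again into the containing group and the node's own name.
parent_path = posixpath.dirname(group_path)
node_name = posixpath.basename(group_path)
```

Using posixpath rather than os.path keeps the separator a forward slash on every platform.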

Now let's put a dataset in our subgroup:


In [48]:
data = np.random.random((600, 400))

In [49]:
dst = subgrp.create_dataset("randomdata", data=data)

In [50]:
dst


Out[50]:
Dataset (600, 400) float64 (file=/home/rg/hurray_data/test.h5, path=/mygroup/subgroup/randomdata)

Every group has a tree() method that displays its subgroups and datasets as a tree.


In [51]:
f.tree()


Out[51]:
  • /
    • Dataset (400, 300) float64 (file=/home/rg/hurray_data/test.h5, path=/mydata)
    • mygroup
      • subgroup
        • Dataset (600, 400) float64 (file=/home/rg/hurray_data/test.h5, path=/mygroup/subgroup/randomdata)

If you're not in a notebook or IPython console, tree() will give you a text-based representation:


In [52]:
print(f.tree())


── /
    ├─ <Dataset (400, 300) float64 (db=/home/rg/hurray_data/test.h5, path=/mydata)>
    └─ mygroup
        └─ subgroup
            └─ <Dataset (600, 400) float64 (db=/home/rg/hurray_data/test.h5, path=/mygroup/subgroup/randomdata)>
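To illustrate how a text tree like the one above can be produced, here is a hypothetical renderer over nested dictionaries. It is not hurraypy's implementation: the function render_tree and the layout dictionary are invented for this sketch, with None marking a dataset (leaf):

```python
def render_tree(name, children):
    """Return the lines of a text tree for a node and its nested children."""
    lines = [name]
    items = list((children or {}).items())
    for index, (child_name, grandchildren) in enumerate(items):
        is_last = index == len(items) - 1
        branch = "└─ " if is_last else "├─ "
        indent = "    " if is_last else "│   "
        child_lines = render_tree(child_name, grandchildren)
        lines.append(branch + child_lines[0])
        lines.extend(indent + line for line in child_lines[1:])
    return lines


# Hypothetical layout mirroring the file built in this tutorial.
layout = {"mydata": None, "mygroup": {"subgroup": {"randomdata": None}}}
tree_lines = render_tree("/", layout)
tree_text = "\n".join(tree_lines)
```

Printing tree_text yields one line per node, with the last child of each group drawn using the corner connector.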

Attributes

Every group and dataset can be assigned a number of key/value pairs, so-called attributes:


In [53]:
dst = f["mygroup/subgroup/randomdata"]
dst.attrs["unit"] = "celsius"
dst.attrs["max_value"] = 50

Objects that have attributes get a red "A":


In [54]:
dst


Out[54]:
Dataset (600, 400) float64 (file=/home/rg/hurray_data/test.h5, path=/mygroup/subgroup/randomdata)

In [55]:
dst.attrs.keys()


Out[55]:
('unit', 'max_value')

In [56]:
dst.attrs["unit"], dst.attrs["max_value"]


Out[56]:
('celsius', 50)

In [57]:
f.tree()


Out[57]:
  • /
    • Dataset (400, 300) float64 (file=/home/rg/hurray_data/test.h5, path=/mydata)
    • mygroup
      • subgroup
        • Dataset (600, 400) float64 (file=/home/rg/hurray_data/test.h5, path=/mygroup/subgroup/randomdata)
