Low Level DyND

DyND arrays are built out of three components: a type, arrmeta (array metadata) associated with each array, and a data pointer/reference. This notebook pulls these components apart in several example objects, showing how the different aspects of the library work together.

Note that the dynd._lowlevel submodule used in this notebook exposes low-level details of dynd, which makes it easy to violate invariants of the dynd object system, such as the immutability of dynd types or of arrays flagged as immutable.


In [1]:
from __future__ import print_function
import sys, ctypes
from pprint import pprint
import dynd
from dynd import nd, ndt, _lowlevel
import numpy as np
print('Python:', sys.version)
print('DyND:', dynd.__version__)
print('LibDyND:', dynd.__libdynd_version__)


Python: 3.3.3 |Anaconda 1.8.0 (64-bit)| (default, Dec  3 2013, 11:56:40) [MSC v.1600 64 bit (AMD64)]
DyND: 0.6.2.post65.g1b0112f.dirty
LibDyND: 0.6.2.post133.g6c9266c.dirty

A NumPy Array As DyND

The NumPy Array

Let's begin with a NumPy array, reviewing what its structure is and then looking at how it gets represented when we convert it to dynd. We'll do a simple 2-dimensional array, enough to show the dimension structure a bit.


In [2]:
a = np.arange(1, 7, dtype=np.int32).reshape(2,3)
a


Out[2]:
array([[1, 2, 3],
       [4, 5, 6]])

The way this array's memory is arranged is specified by two attributes of the object, the shape and the strides. Given an integer tuple of indices, where each index satisfies 0 <= index_tuple[i] < shape[i], the memory offset in bytes from element zero to the element at that index is the dot product of the index tuple with the strides.


In [3]:
print('shape:  ', a.shape)
print('strides:', a.strides)


shape:   (2, 3)
strides: (12, 4)

To illustrate this, we'll use a low level attribute of NumPy arrays, a.ctypes.data, and the from_address() method of ctypes types to view values at pointer addresses. If we look at the data address of a, we see the element for index (0,0).


In [4]:
addr = a.ctypes.data
print('address:        ', hex(addr))
print('memory contents:', ctypes.c_int32.from_address(addr).value)
print('a[0,0] value:   ', a[0,0])


address:         0x3eae860
memory contents: 1
a[0,0] value:    1
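It is worth noting that from_address() does not copy anything: the ctypes object it returns aliases the memory at that address. The following is a minimal sketch of that behaviour using only the standard library, with a small ctypes array standing in for the NumPy buffer (the names buf and view are illustrative and not part of dynd or NumPy).

```python
import ctypes

# A stand-in buffer of three int32 values
buf = (ctypes.c_int32 * 3)(10, 20, 30)
addr = ctypes.addressof(buf)

# View the second element (one int32 past the start) without copying
view = ctypes.c_int32.from_address(addr + ctypes.sizeof(ctypes.c_int32))
before = view.value

# Writing through the buffer is visible through the view
buf[1] = 99
after = view.value

print('before:', before)
print('after: ', after)
```

Because the view is only an alias, it stays valid only as long as the underlying buffer is kept alive.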

Now to get the address of another element, say for index (1,2), let's take its dot product with the strides to get an offset.


In [5]:
offset = int(np.dot([1,2], a.strides)) # sum(x*y for x,y in zip((1,2), a.strides))
addr = a.ctypes.data + offset
print('offset:         ', hex(offset), '(%d)' % offset)
print('address:        ', hex(addr))
print('memory contents:', ctypes.c_int32.from_address(addr).value)
print('a[1,2] value:   ', a[1,2])


offset:          0x14 (20)
address:         0x3eae874
memory contents: 6
a[1,2] value:    6
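The offset computation above can be wrapped in a small helper that also enforces the bounds condition on the indices. This is only a sketch: byte_offset is a hypothetical name, and a plain ctypes array is used in place of the NumPy buffer so that the example depends on nothing but the standard library.

```python
import ctypes

def byte_offset(index, shape, strides):
    """Return the byte offset of `index`, checking bounds first."""
    if len(index) != len(shape):
        raise IndexError('expected %d indices' % len(shape))
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError('index %d out of bounds for size %d' % (i, n))
    # Dot product of the index tuple with the strides
    return sum(i * s for i, s in zip(index, strides))

# Stand-in for the array data: the values 1..6 as contiguous int32,
# interpreted as a row-major 2x3 array
buf = (ctypes.c_int32 * 6)(1, 2, 3, 4, 5, 6)
shape, strides = (2, 3), (12, 4)

base = ctypes.addressof(buf)
offset = byte_offset((1, 2), shape, strides)
value = ctypes.c_int32.from_address(base + offset).value
print('offset:', offset)
print('value: ', value)
```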

As a DyND Array

Let's now convert this array into dynd, and look at the same addresses using the _lowlevel submodule.


In [6]:
b = nd.view(a)
print('b address:     ', hex(_lowlevel.data_address_of(b)))
print('b[1,2] address:', hex(_lowlevel.data_address_of(b[1,2])))


b address:      0x3eae860
b[1,2] address: 0x3eae874

These addresses should be the same as the ones we just got from NumPy, since DyND is providing a view of the same memory that NumPy is. Let's take a look at the type and arrmeta dynd has created.


In [7]:
nd.debug_repr(b)


Out[7]:
------ array
 address: 000000000046BFE0
 refcount: 1
 type:
  pointer: 0000000000489470
  type: strided * strided * int32
 arrmeta:
  flags: 3 (read_access write_access )
  type-specific arrmeta:
   strided_dim arrmeta
    stride: 12
    size: 2
    strided_dim arrmeta
     stride: 4
     size: 3
 data:
   pointer: 0000000003EAE860
   reference: 0000000000475ED0
    ------ memory_block at 0000000000475ED0
     reference count: 1
     type: external
     object void pointer: 00000000043ECF80
     free function: 000007FEE4CE2743
    ------
------

The _lowlevel submodule allows us to peek directly at the dynd array structure, so we can access everything here directly via ctypes as well. Reading values this way works fine, but writing is not safe in general: updating the reference count, for example, requires atomic operations to support multi-threaded access. Let's take a look at the values this way. To show the ctypes structure a bit, we're also printing the fields of the ctypes.Structure type used.


In [8]:
ndp = _lowlevel.array_preamble_of(b)
pprint(ndp._fields_)


[('memblockdata', <class 'dynd._lowlevel.ctypes_types.MemoryBlockData'>),
 ('dtype', <class 'ctypes.c_void_p'>),
 ('data_pointer', <class 'ctypes.c_void_p'>),
 ('flags', <class 'ctypes.c_ulonglong'>),
 ('data_reference', <class 'ctypes.c_void_p'>)]

In [9]:
# The dynd type
print('dtype ptr:', hex(ndp.dtype))


dtype ptr: 0x489470

In [10]:
# Part of the arrmeta
print('flags:    ', ndp.flags)


flags:     3
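The raw flags value can be decoded into names. The sketch below is an assumption rather than the library's definition: the bit assignments are inferred only from the two values that debug_repr prints in this notebook (3 as "read_access write_access" and 5 as "read_access immutable"), and FLAG_NAMES and decode_flags are hypothetical names.

```python
# Bit values inferred from the debug_repr output in this notebook
# (an assumption, not taken from the dynd sources)
FLAG_NAMES = [
    (0x1, 'read_access'),
    (0x2, 'write_access'),
    (0x4, 'immutable'),
]

def decode_flags(flags):
    """Return the names of the flag bits set in `flags`."""
    return [name for bit, name in FLAG_NAMES if flags & bit]

print(decode_flags(3))
print(decode_flags(5))
```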

In [11]:
# The data
print('data ptr: ', hex(ndp.data_pointer))
print('data ref: ', hex(ndp.data_reference))


data ptr:  0x3eae860
data ref:  0x475ed0

The rest of the arrmeta has a structure specified by the dynd type. In the case of a NumPy-like array, as we have here, each dimension of the array gets a size and a stride. Let's create a ctypes Structure and take a look.


In [12]:
class StridedMetadata(ctypes.Structure):
    _fields_ = [('size', ctypes.c_ssize_t),
                ('stride', ctypes.c_ssize_t)]
meta = (StridedMetadata * 2).from_address(_lowlevel.arrmeta_address_of(b))
print('meta[0].size:  ', meta[0].size)
print('meta[0].stride:', meta[0].stride)
print('meta[1].size:  ', meta[1].size)
print('meta[1].stride:', meta[1].stride)


meta[0].size:   2
meta[0].stride: 12
meta[1].size:   3
meta[1].stride: 4

In [13]:
# Rearranged to match NumPy
print('shape:  ', (meta[0].size, meta[1].size))
print('strides:', (meta[0].stride, meta[1].stride))


shape:   (2, 3)
strides: (12, 4)

DType Comparison

To understand why dynd is structuring this arrmeta into a size/stride for each dimension, instead of as separate shape and strides arrays like numpy does it, let's first compare the dtypes between the systems.


In [14]:
a.dtype


Out[14]:
dtype('int32')

In [15]:
nd.type_of(b)


Out[15]:
ndt.type('strided * strided * int32')

Observe that the numpy dtype only represents the data type; it contains no reference to the dimensions of the array. In dynd, information about the dimensions has moved into the array type, and each of the two dimensions is a strided_dim. A strided_dim always gets a corresponding size/stride arrmeta, while the int32 requires no arrmeta, so the arrmeta is an array of two size/stride structures, as we saw above.

DyND Dimension Type/Metadata Correspondence

Let's take another look at the arrmeta, but now using a ctypes structure that has been generated from the dynd type, and then see how its hierarchy matches the one in the type.


In [16]:
meta = _lowlevel.arrmeta_struct_of(b)
pprint(meta._fields_)


[('size', <class 'ctypes.c_longlong'>),
 ('stride', <class 'ctypes.c_longlong'>),
 ('element',
  <class 'dynd._lowlevel.arrmeta_struct.build_arrmeta_struct.<locals>.StridedMetadata'>)]

For the first dimension, we have a size and a stride as before. The type id of the type at this level is, correspondingly, strided_dim.


In [17]:
print('type id:', nd.type_of(b).type_id)
print('size:   ', meta.size)
print('stride: ', meta.stride)


type id: strided_dim
size:    2
stride:  12

To get to the second dimension, we look at the element field of the arrmeta, or the element_type attribute of the type.


In [18]:
print(nd.type_of(b).element_type)
pprint(meta.element._fields_)


strided * int32
[('size', <class 'ctypes.c_longlong'>),
 ('stride', <class 'ctypes.c_longlong'>),
 ('element', <class 'dynd._lowlevel.arrmeta_struct.EmptyMetadata'>)]

Now one dimension has been stripped off, and at this level we once again have a size and a stride.


In [19]:
print('type id:', nd.type_of(b).element_type.type_id)
print('size:   ', meta.element.size)
print('stride: ', meta.element.stride)


type id: strided_dim
size:    3
stride:  4

If we strip away the second dimension, we end at the scalar type. The arrmeta structure at this level is using a placeholder empty structure.


In [20]:
print('type id:', nd.type_of(b).element_type.element_type.type_id)


type id: int32
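The dimension-by-dimension walk above can be written as a loop. The sketch below does not use dynd: it hand-builds nested ctypes structures that mimic the size/stride/element hierarchy shown above, and recovers NumPy-style shape and strides tuples by following the element field. The class names and shape_and_strides are hypothetical, and the innermost level simply omits the element field instead of using an empty placeholder structure.

```python
import ctypes

class InnerStrided(ctypes.Structure):
    # Innermost dimension; the scalar below it needs no arrmeta,
    # so this sketch leaves out the 'element' field entirely
    _fields_ = [('size', ctypes.c_ssize_t),
                ('stride', ctypes.c_ssize_t)]

class OuterStrided(ctypes.Structure):
    _fields_ = [('size', ctypes.c_ssize_t),
                ('stride', ctypes.c_ssize_t),
                ('element', InnerStrided)]

# Same values as the 2x3 int32 array in this notebook
meta = OuterStrided(2, 12, InnerStrided(3, 4))

def shape_and_strides(level):
    """Peel off one dimension at a time, collecting size and stride."""
    shape, strides = [], []
    while level is not None:
        shape.append(level.size)
        strides.append(level.stride)
        level = getattr(level, 'element', None)
    return tuple(shape), tuple(strides)

print(shape_and_strides(meta))
```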

A Ragged Array Example

To show how this way of matching up the type with an arrmeta struct can work, let's do an example of a ragged array, where the second dimension is a variable-sized dimension. If we construct a dynd object from a ragged list, this is what we will get.


In [21]:
a = nd.array([[1], [2,3,4], [5,6]])

In [22]:
nd.type_of(a)


Out[22]:
ndt.type('strided * var * int32')

The second dimension is now a var_dim instead of a strided_dim. This dimension gets a different corresponding arrmeta associated with it. Let's first look at the debug_repr as before.


In [23]:
nd.debug_repr(a)


Out[23]:
------ array
 address: 000000000046D7B0
 refcount: 1
 type:
  pointer: 000000000048D640
  type: strided * var * int32
 arrmeta:
  flags: 5 (read_access immutable )
  type-specific arrmeta:
   strided_dim arrmeta
    stride: 16
    size: 3
    var_dim arrmeta
     stride: 4
     offset: 0
     ------ memory_block at 000000000048D8C0
      reference count: 1
      type: pod
      finalized: 24
     ------
 data:
   pointer: 000000000046D808
   reference: 0000000000000000 (embedded in array memory)
------

We can see that the strided_dim has the same arrmeta as before, but the var_dim has different entries. It's got a stride, an offset, and another memory block. What's going on here is that the variable-sized data goes in another reference-counted buffer, and the array data for the first dimension gets pointers into this second buffer.


In [24]:
meta = _lowlevel.arrmeta_struct_of(a)
pprint(meta._fields_)


[('size', <class 'ctypes.c_longlong'>),
 ('stride', <class 'ctypes.c_longlong'>),
 ('element',
  <class 'dynd._lowlevel.arrmeta_struct.build_arrmeta_struct.<locals>.VarMetadata'>)]

The first dimension type/arrmeta is as before, for strided_dim.


In [25]:
print('type id:', nd.type_of(a).type_id)
print('size:   ', meta.size)
print('stride: ', meta.stride)


type id: strided_dim
size:    3
stride:  16

In [26]:
pprint(meta.element._fields_)


[('blockref', <class 'dynd._lowlevel.arrmeta_struct.LP_MemoryBlockData'>),
 ('stride', <class 'ctypes.c_longlong'>),
 ('offset', <class 'ctypes.c_longlong'>),
 ('element', <class 'dynd._lowlevel.arrmeta_struct.EmptyMetadata'>)]

The second dimension type/arrmeta is now for var_dim.


In [27]:
print('type id:', nd.type_of(a).element_type.type_id)
print('stride: ', meta.element.stride)
print('offset: ', meta.element.offset)
print('blockref: ', hex(ctypes.cast(meta.element.blockref, ctypes.c_void_p).value))


type id: var_dim
stride:  4
offset:  0
blockref:  0x48d8c0

The data elements of the first dimension are different from the strided case. You may recall that the data address of the strided array is the same as the data address at index zero. For the var_dim, this is not the case, something we can illustrate by showing the pointers.


In [28]:
print('a data address:     ', hex(_lowlevel.data_address_of(a)))
print('a[0] data address:  ', hex(_lowlevel.data_address_of(a[0])))
print('a[0,0] data address:', hex(_lowlevel.data_address_of(a[0,0])))


a data address:      0x46d808
a[0] data address:   0x46d808
a[0,0] data address: 0x499490

The reason for this is that the elements visible at the outer level are pointer/size pairs which point into the memory block held by the var_dim arrmeta. We can examine this directly by creating a ctypes structure corresponding to the first dimension's data.


In [29]:
class VarData(ctypes.Structure):
    _fields_ = [('data', ctypes.POINTER(ctypes.c_int32)),
                ('size', ctypes.c_ssize_t)]
data = (VarData * 3).from_address(_lowlevel.data_address_of(a))

To refresh your memory about the data we populated a with, let's print it out again.


In [30]:
a


Out[30]:
nd.array([[1], [2, 3, 4], [5, 6]], type="strided * var * int32")

Let's compare this with the sizes specified in the data.


In [31]:
print(data[0].size, data[1].size, data[2].size)


1 3 2

We can access the first element of each array by dereferencing the pointer directly. Recall that there was an additional arrmeta property called offset, which was zero. If this offset were not zero, we would have to add it to the pointer before dereferencing.


In [32]:
print(data[0].data.contents.value, data[1].data.contents.value, data[2].data.contents.value)


1 2 5
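To make the layout concrete, here is a sketch that rebuilds the same ragged structure by hand with ctypes alone: one flat int32 buffer plays the role of the var_dim memory block, and an array of pointer/size pairs plays the role of the outer dimension's data. It is a model of the layout described above, not dynd's implementation. The helper var_element is hypothetical, and it assumes that the arrmeta stride is the byte distance between consecutive elements of a row.

```python
import ctypes

class VarData(ctypes.Structure):
    _fields_ = [('data', ctypes.c_void_p),
                ('size', ctypes.c_ssize_t)]

rows = [[1], [2, 3, 4], [5, 6]]

# The "memory block": all values stored back to back
flat = [v for row in rows for v in row]
block = (ctypes.c_int32 * len(flat))(*flat)
base = ctypes.addressof(block)
itemsize = ctypes.sizeof(ctypes.c_int32)

# The outer data: one pointer/size pair per row, pointing into the block
outer = (VarData * len(rows))()
start = 0
for i, row in enumerate(rows):
    outer[i].data = base + start * itemsize
    outer[i].size = len(row)
    start += len(row)

def var_element(outer, i, j, stride=4, offset=0):
    """Read element [i][j]: pointer + offset + j * stride (assumed layout)."""
    entry = outer[i]
    if not 0 <= j < entry.size:
        raise IndexError('index %d out of bounds for size %d' % (j, entry.size))
    return ctypes.c_int32.from_address(entry.data + offset + j * stride).value

print([outer[i].size for i in range(len(rows))])
print(var_element(outer, 1, 2))
```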

A String Example

The scalar type we've been using so far, int32, has no arrmeta. This is not always the case, and the string type is a good example of this. Let's create a simple one-dimensional array of strings.


In [33]:
a = nd.array([u'this is the first string', u'second', 'third'])

In [34]:
nd.type_of(a)


Out[34]:
ndt.type('strided * string')

The string type has one important property, the string encoding. For default strings, the encoding is UTF-8.


In [35]:
nd.type_of(a).element_type.encoding


Out[35]:
'utf8'

Once again it's useful to look at the debug_repr first.


In [36]:
nd.debug_repr(a)


Out[36]:
------ array
 address: 0000000000496640
 refcount: 1
 type:
  pointer: 000000000048D960
  type: strided * string
 arrmeta:
  flags: 5 (read_access immutable )
  type-specific arrmeta:
   strided_dim arrmeta
    stride: 16
    size: 3
    string arrmeta
     ------ memory_block at 000000000048DBE0
      reference count: 1
      type: pod
      finalized: 35
     ------
 data:
   pointer: 0000000000496688
   reference: 0000000000000000 (embedded in array memory)
------

Similar to the var_dim arrmeta, the string type has a memory block. In fact, that's all it contains.

The data element structure is slightly different from var_dim, however. Instead of a pointer and a length, the data includes begin and end pointers which define a half-open interval of bytes for the string. Let's use ctypes to look into these bytes, and construct a string from them directly.


In [37]:
class StringData(ctypes.Structure):
    _fields_ = [('begin', ctypes.c_void_p),
                ('end', ctypes.c_void_p)]
data = (StringData * 3).from_address(_lowlevel.data_address_of(a))

The extents of the first string are


In [38]:
print('begin:', hex(data[0].begin))
print('end:  ', hex(data[0].end))


begin: 0x499e00
end:   0x499e18

To build our own string out of it, we can create a ctypes char array, convert it to a bytearray, then decode it using UTF-8, as that's the encoding given by the dynd type.


In [39]:
length = data[0].end - data[0].begin
print('length:', length)


length: 24

In [40]:
bytearray((ctypes.c_char * length).from_address(data[0].begin)).decode('utf-8')


Out[40]:
'this is the first string'
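The same begin/end layout can be modelled without dynd. In the sketch below, a single ctypes byte buffer stands in for the string memory block, and each array element is a pair of pointers delimiting a half-open interval of UTF-8 bytes. The helper read_string is a hypothetical name, and the way the bytes are packed back to back is an assumption made for illustration.

```python
import ctypes

class StringData(ctypes.Structure):
    _fields_ = [('begin', ctypes.c_void_p),
                ('end', ctypes.c_void_p)]

strings = [u'this is the first string', u'second', u'third']
encoded = [s.encode('utf-8') for s in strings]

# One buffer holding all the encoded bytes back to back
block = ctypes.create_string_buffer(b''.join(encoded))
base = ctypes.addressof(block)

# One begin/end pair per string, pointing into the buffer
data = (StringData * len(strings))()
pos = 0
for i, raw in enumerate(encoded):
    data[i].begin = base + pos
    pos += len(raw)
    data[i].end = base + pos

def read_string(entry, encoding='utf-8'):
    """Decode the bytes in the half-open interval [begin, end)."""
    length = entry.end - entry.begin
    return ctypes.string_at(entry.begin, length).decode(encoding)

print([read_string(data[i]) for i in range(len(strings))])
```

Using ctypes.string_at with an explicit length matters here, because the strings are delimited by the end pointer and are not NUL-terminated individually.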

Conclusion

More remains to be illustrated before we have a full picture of how dynd's type/arrmeta system describes multi-dimensional data in a way that is quite general, yet useful for computation. So far we have only seen how the data is laid out, peeking directly at the internal structures of dynd using ctypes.

Some next steps will be to demonstrate how indexing in the dynd __getitem__ implementation works, how struct types work, and how these ideas can apply in a JIT setting with LLVM. There were some unanswered questions, like why there is an offset field in the var_dim arrmeta, which will get answered by exploring these topics.