DyND arrays are built out of three components: a type, arrmeta (array metadata) associated with each array, and a data pointer/reference. This notebook pulls apart these components in several example objects, showing how the different aspects of the library work together.
Note that the dynd._lowlevel submodule used in this notebook exposes low-level details of dynd, and makes it easy to violate invariants of the dynd object system, such as the immutability of dynd types or of arrays flagged as immutable.
In [1]:
from __future__ import print_function
import sys, ctypes
from pprint import pprint
import dynd
from dynd import nd, ndt, _lowlevel
import numpy as np
print('Python:', sys.version)
print('DyND:', dynd.__version__)
print('LibDyND:', dynd.__libdynd_version__)
Let's begin with a NumPy array, reviewing what its structure is and then looking at how it gets represented when we convert it to dynd. We'll do a simple 2-dimensional array, enough to show the dimension structure a bit.
In [2]:
a = np.arange(1, 7, dtype=np.int32).reshape(2,3)
a
Out[2]:
The way this array's memory is arranged is specified by two attributes of the object, the shape and strides. If given an integer tuple of indices, where each index satisfies 0 <= index_tuple[i] < shape[i], the memory offset from element zero to the element at that index is the dot product of the index tuple with the strides.
In [3]:
print('shape: ', a.shape)
print('strides:', a.strides)
To illustrate this, we'll use a low-level attribute of NumPy arrays, a.ctypes.data, and the ctypes type.from_address() method to view values at pointer addresses. If we look at the data address of a, we see the element for index (0,0).
In [4]:
addr = a.ctypes.data
print('address: ', hex(addr))
print('memory contents:', ctypes.c_int32.from_address(addr).value)
print('a[0,0] value: ', a[0,0])
Now to get the address of another element, say for index (1,2), let's take its dot product with the strides to get an offset.
In [5]:
offset = int(np.dot([1,2], a.strides)) # sum(x*y for x,y in zip((1,2), a.strides))
addr = a.ctypes.data + offset
print('offset: ', hex(offset), '(%d)' % offset)
print('address: ', hex(addr))
print('memory contents:', ctypes.c_int32.from_address(addr).value)
print('a[1,2] value: ', a[1,2])
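The offset rule can also be checked without NumPy. The following is a minimal sketch using only ctypes; the helper names c_contiguous_strides and byte_offset are hypothetical, and it assumes a row-major (C-order) layout like the one np.arange(...).reshape(2,3) produces.

```python
import ctypes

def c_contiguous_strides(shape, itemsize):
    """Byte strides for a row-major (C-order) array with the given shape."""
    strides = []
    step = itemsize
    # The last dimension varies fastest, so walk the shape from the end.
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))

def byte_offset(index, strides):
    """Dot product of an index tuple with the byte strides."""
    return sum(i * s for i, s in zip(index, strides))

# Same values and layout as the 2x3 int32 array above, in a plain ctypes buffer.
buf = (ctypes.c_int32 * 6)(1, 2, 3, 4, 5, 6)
shape = (2, 3)
strides = c_contiguous_strides(shape, ctypes.sizeof(ctypes.c_int32))
offset = byte_offset((1, 2), strides)
value = ctypes.c_int32.from_address(ctypes.addressof(buf) + offset).value

print('strides:', strides)
print('offset: ', offset)
print('value:  ', value)
```

For a 4-byte element type, the strides come out as (12, 4), so index (1,2) lies 1*12 + 2*4 = 20 bytes past element zero.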
Let's now convert this array into dynd, and look at the same addresses using the _lowlevel submodule.
In [6]:
b = nd.view(a)
print('b address: ', hex(_lowlevel.data_address_of(b)))
print('b[1,2] address:', hex(_lowlevel.data_address_of(b[1,2])))
These addresses should be the same as the ones we just got from NumPy, because DyND is providing a view of the same memory that NumPy is. Let's take a look at the type and arrmeta dynd has created.
In [7]:
nd.debug_repr(b)
Out[7]:
The _lowlevel submodule allows us to peek directly at the dynd array structure, so we can access everything here via ctypes as well. Reading values this way works fine, but writing does not in general: the reference count, for example, requires atomic operations to support multi-threaded access. Let's take a look at the values this way. To show the ctypes structure a bit, we're also printing the fields of the ctypes.Structure type used.
In [8]:
ndp = _lowlevel.array_preamble_of(b)
pprint(ndp._fields_)
In [9]:
# The dynd type
print('dtype ptr:', hex(ndp.dtype))
In [10]:
# Part of the metadata
print('flags: ', ndp.flags)
In [11]:
# The data
print('data ptr: ', hex(ndp.data_pointer))
print('data ref: ', hex(ndp.data_reference))
The rest of the arrmeta has a structure specified by the dynd type. In the case of a NumPy-like array, as we have here, each dimension of the array gets a size and a stride. Let's create a ctypes Structure and take a look.
In [12]:
class StridedMetadata(ctypes.Structure):
    _fields_ = [('size', ctypes.c_ssize_t),
                ('stride', ctypes.c_ssize_t)]
meta = (StridedMetadata * 2).from_address(_lowlevel.arrmeta_address_of(b))
print('meta[0].size: ', meta[0].size)
print('meta[0].stride:', meta[0].stride)
print('meta[1].size: ', meta[1].size)
print('meta[1].stride:', meta[1].stride)
In [13]:
# Rearranged to match NumPy
print('shape: ', (meta[0].size, meta[1].size))
print('strides:', (meta[0].stride, meta[1].stride))
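To make the difference from NumPy's separate shape and strides tuples concrete, here is a small standalone sketch. It fills two StridedMetadata records by hand with the values for the 2x3 int32 array (an assumption made for illustration; nothing here is read from a real dynd array) and then reinterprets the same bytes as a flat array of integers.

```python
import ctypes

class StridedMetadata(ctypes.Structure):
    _fields_ = [('size', ctypes.c_ssize_t),
                ('stride', ctypes.c_ssize_t)]

# Hand-filled records for a 2x3 array of 4-byte elements.
meta = (StridedMetadata * 2)()
meta[0].size, meta[0].stride = 2, 12
meta[1].size, meta[1].stride = 3, 4

# View the same memory as four consecutive ssize_t values.
flat = (ctypes.c_ssize_t * 4).from_address(ctypes.addressof(meta))
interleaved = list(flat)

# Regroup into the NumPy-style pair of tuples.
shape = tuple(m.size for m in meta)
strides = tuple(m.stride for m in meta)

print('interleaved:', interleaved)
print('shape:      ', shape)
print('strides:    ', strides)
```

The flat view reads size, stride, size, stride: each dimension's information sits together, rather than all sizes first and all strides second.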
To understand why dynd is structuring this arrmeta into a size/stride for each dimension, instead of as separate shape and strides arrays like numpy does it, let's first compare the dtypes between the systems.
In [14]:
a.dtype
Out[14]:
In [15]:
nd.type_of(b)
Out[15]:
Observe that the numpy dtype only represents the data type; it contains no reference to the dimensions of the array. In dynd, information about the dimensions has moved into the array type, and the two dimensions have the name strided_dim. The way it works is that a strided_dim always gets a corresponding size/stride arrmeta, while the int32 requires no arrmeta. Thus the arrmeta is an array of two size/stride structures, as we saw above.
Let's take another look at the arrmeta, but now using a ctypes structure that has been generated from the dynd type, and then see how its hierarchy matches the one in the type.
In [16]:
meta = _lowlevel.arrmeta_struct_of(b)
pprint(meta._fields_)
For the first dimension, we have a size and a stride as before. The type id of the dtype is the corresponding strided_dim.
In [17]:
print('type id:', nd.type_of(b).type_id)
print('size: ', meta.size)
print('stride: ', meta.stride)
To get to the second dimension, we look at the element field of the arrmeta, or the element_type attribute of the type.
In [18]:
print(nd.type_of(b).element_type)
pprint(meta.element._fields_)
Now one dimension has been stripped off, and at this level we once again have a size and a stride.
In [19]:
print('type id:', nd.type_of(b).element_type.type_id)
print('size: ', meta.element.size)
print('stride: ', meta.element.stride)
If we strip away the second dimension, we end at the scalar type. The arrmeta structure at this level is using a placeholder empty structure.
In [20]:
print('type id:', nd.type_of(b).element_type.element_type.type_id)
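The nesting seen above can be imitated with plain ctypes. This is only a sketch of the idea, not how _lowlevel.arrmeta_struct_of is implemented: nested_strided_meta and walk are hypothetical helpers, and the innermost scalar level is simply omitted here where dynd uses an empty placeholder structure.

```python
import ctypes

def nested_strided_meta(ndim):
    """Build a ctypes Structure type with `ndim` nested size/stride levels."""
    meta_type = None
    # Build from the innermost dimension outwards.
    for _ in range(ndim):
        fields = [('size', ctypes.c_ssize_t),
                  ('stride', ctypes.c_ssize_t)]
        if meta_type is not None:
            fields.append(('element', meta_type))
        meta_type = type('StridedMeta', (ctypes.Structure,),
                         {'_fields_': fields})
    return meta_type

def walk(meta):
    """Collect (size, stride) per level by following the `element` field."""
    levels = []
    while meta is not None:
        levels.append((meta.size, meta.stride))
        meta = getattr(meta, 'element', None)
    return levels

Meta2D = nested_strided_meta(2)
meta = Meta2D()
meta.size, meta.stride = 2, 12                  # outer dimension
meta.element.size, meta.element.stride = 3, 4   # inner dimension

levels = walk(meta)
print(levels)
```

Because each level embeds the next one directly, the nested structure occupies the same bytes as the flat array of two size/stride records used earlier.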
To show how this way of matching up the type with an arrmeta struct can work, let's do an example of a ragged array, where the second dimension is a variable-sized dimension. If we construct a dynd object from a ragged list, this is what we will get.
In [21]:
a = nd.array([[1], [2,3,4], [5,6]])
In [22]:
nd.type_of(a)
Out[22]:
The second dimension is now a var_dim instead of a strided_dim. This dimension gets a different corresponding arrmeta associated with it. Let's first look at the debug_repr as before.
In [23]:
nd.debug_repr(a)
Out[23]:
We can see that the strided_dim has the same arrmeta as before, but the var_dim has different entries. It's got a stride, an offset, and another memory block. What's going on here is that the variable-sized data goes in another reference-counted buffer, and the array data for the first dimension gets pointers into this second buffer.
In [24]:
meta = _lowlevel.arrmeta_struct_of(a)
pprint(meta._fields_)
The first dimension type/arrmeta is as before, for strided_dim.
In [25]:
print('type id:', nd.type_of(a).type_id)
print('size: ', meta.size)
print('stride: ', meta.stride)
In [26]:
pprint(meta.element._fields_)
The second dimension type/arrmeta is now for var_dim.
In [27]:
print('type id:', nd.type_of(a).element_type.type_id)
print('stride: ', meta.element.stride)
print('offset: ', meta.element.offset)
print('blockref: ', hex(ctypes.cast(meta.element.blockref, ctypes.c_void_p).value))
The data elements of the first dimension are different from the strided case. You may recall that the data address of the strided array is the same as the data address at index zero. For the var_dim, this is not the case, something we can illustrate by showing the pointers.
In [28]:
print('a data address: ', hex(_lowlevel.data_address_of(a)))
print('a[0] data address: ', hex(_lowlevel.data_address_of(a[0])))
print('a[0,0] data address:', hex(_lowlevel.data_address_of(a[0,0])))
The reason for this is that the elements visible at the outer level are pointer/size pairs which point into the memory block held by the var_dim arrmeta. We can examine this directly by creating a ctypes structure corresponding to the first dimension's data.
In [29]:
class VarData(ctypes.Structure):
    _fields_ = [('data', ctypes.POINTER(ctypes.c_int32)),
                ('size', ctypes.c_ssize_t)]
data = (VarData * 3).from_address(_lowlevel.data_address_of(a))
To refresh your memory about the data we populated a with, let's print it out again.
In [30]:
a
Out[30]:
Let's compare this with the sizes specified in the data.
In [31]:
print(data[0].size, data[1].size, data[2].size)
We can access the first element of each array by dereferencing the pointer directly. Recall that there was an additional arrmeta property called offset, which was zero. If this offset were not zero, we would have to add it to the pointer before dereferencing.
In [32]:
print(data[0].data.contents.value, data[1].data.contents.value, data[2].data.contents.value)
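The pointer/size layout can be reproduced by hand to see how all the elements of a row are reached, not just the first. The sketch below is an illustration built from scratch with ctypes, not a view of a real dynd array: the element buffer, the read_row helper, and its offset and stride parameters are assumptions that mirror the arrmeta fields of the same names, and the pointer is stored as a c_void_p so it can be used in address arithmetic.

```python
import ctypes

class VarData(ctypes.Structure):
    _fields_ = [('data', ctypes.c_void_p),
                ('size', ctypes.c_ssize_t)]

rows = [[1], [2, 3, 4], [5, 6]]
itemsize = ctypes.sizeof(ctypes.c_int32)

# Second buffer: all elements stored back to back.
flat = [v for row in rows for v in row]
buf = (ctypes.c_int32 * len(flat))(*flat)
base = ctypes.addressof(buf)

# Outer dimension: one pointer/size pair per row, pointing into `buf`.
outer = (VarData * len(rows))()
pos = 0
for i, row in enumerate(rows):
    outer[i].data = base + pos * itemsize
    outer[i].size = len(row)
    pos += len(row)

def read_row(item, offset=0, stride=itemsize):
    """Read one variable-sized row; `offset` is added before dereferencing."""
    addr = item.data + offset
    return [ctypes.c_int32.from_address(addr + k * stride).value
            for k in range(item.size)]

decoded = [read_row(outer[i]) for i in range(len(rows))]
print(decoded)
```

The outer array has a fixed element size (one pointer plus one size), which is why the first dimension can stay a regular strided_dim even though the rows have different lengths.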
The scalar type we've been using so far, int32, has no arrmeta. This is not always the case, and the string type is a good example of this. Let's create a simple one-dimensional array of strings.
In [33]:
a = nd.array([u'this is the first string', u'second', 'third'])
In [34]:
nd.type_of(a)
Out[34]:
The string type has one important property, the string encoding. For default strings, the encoding is UTF-8.
In [35]:
nd.type_of(a).element_type.encoding
Out[35]:
Once again it's useful to look at the debug_repr first.
In [36]:
nd.debug_repr(a)
Out[36]:
Similar to the var_dim arrmeta, the string type has a memory block. In fact, that's all it contains.
The data element structure is slightly different from var_dim, however. Instead of a pointer and a length, the data includes begin and end pointers which define a half-open interval of bytes for the string. Let's use ctypes to look into these bytes, and construct a string from them directly.
In [37]:
class StringData(ctypes.Structure):
    _fields_ = [('begin', ctypes.c_void_p),
                ('end', ctypes.c_void_p)]
data = (StringData * 3).from_address(_lowlevel.data_address_of(a))
The extents of the first string are:
In [38]:
print('begin:', hex(data[0].begin))
print('end: ', hex(data[0].end))
To build our own string out of it, we can create a ctypes char array, convert it to bytes, then decode it using UTF-8, as that's the encoding reported by the dynd string type.
In [39]:
length = data[0].end - data[0].begin
print('length:', length)
In [40]:
bytearray((ctypes.c_char * length).from_address(data[0].begin)).decode('utf-8')
Out[40]:
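The same begin/end decoding can be tried on a hand-built buffer. In this sketch the blob of UTF-8 bytes, the sample words, and the read_string helper are all assumptions made for illustration; it uses ctypes.string_at to copy the bytes in the half-open interval before decoding.

```python
import ctypes

class StringData(ctypes.Structure):
    _fields_ = [('begin', ctypes.c_void_p),
                ('end', ctypes.c_void_p)]

words = [u'na\u00efve', u'second', u'third']
encoded = [w.encode('utf-8') for w in words]

# One shared buffer holding every string's bytes back to back.
blob = ctypes.create_string_buffer(b''.join(encoded))
base = ctypes.addressof(blob)

# Each element records where its bytes begin and end inside the buffer.
data = (StringData * len(words))()
pos = 0
for i, raw in enumerate(encoded):
    data[i].begin = base + pos
    data[i].end = base + pos + len(raw)
    pos += len(raw)

def read_string(item):
    """Decode the bytes in the half-open interval [begin, end)."""
    length = item.end - item.begin
    return ctypes.string_at(item.begin, length).decode('utf-8')

decoded = [read_string(data[i]) for i in range(len(words))]
first_length = data[0].end - data[0].begin

print(decoded)
print('bytes in first string:     ', first_length)
print('characters in first string:', len(words[0]))
```

Note that end minus begin is a byte count, not a character count: the first sample word has five characters but occupies six bytes in UTF-8, because the non-ASCII letter takes two bytes.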
There are more things which need to be illustrated to more fully gain an understanding of how dynd's type/arrmeta system describes multi-dimensional data in a way which is quite general, yet useful for computation. So far we have just seen how the data is laid out, and have directly peeked at the internal structures of dynd using ctypes.
Some next steps will be to demonstrate how indexing in the dynd __getitem__ implementation works, how struct types work, and how these ideas can apply in a JIT setting with LLVM. There were some unanswered questions, like why there is an offset field in the var_dim arrmeta, which will get answered by exploring these topics.