This notebook provides a short demonstration of the categorical type, which represents a set of values of a given type as an integer in the range [0, N). This type is used in dynd's groupby function, and can be used to save memory when storing a large amount of data with only a small number of possible values.
Let's start by importing dynd, and printing out some version numbers.
In [1]:
from __future__ import print_function
import sys
import dynd
from dynd import nd, ndt
print('Python: ', sys.version)
print('DyND: ', dynd.__version__)
print('LibDyND:', dynd.__libdynd_version__)
There are two functions for creating a categorical type, ndt.make_categorical and ndt.factor_categorical. If you want to control exactly what categories there are, and in what order they appear, use the ndt.make_categorical function. Let's make a rainbow type as our first example.
In [2]:
rainbow = ndt.make_categorical(['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet'])
rainbow
Out[2]:
We can look at some properties of the rainbow object to see what we have created. For instance, with only 7 values, the type needs only one byte of storage.
In [3]:
print('type id: ', rainbow.type_id)
print('data size: ', rainbow.data_size)
print('data alignment:', rainbow.data_alignment)
We can get the integer storage type and the category type.
In [4]:
print('storage type: ', rainbow.storage_type)
print('category type:', rainbow.category_type)
The list of categories is itself an immutable dynd array.
In [5]:
rainbow.categories
Out[5]:
Let's go ahead and make an array with the rainbow type. Note that we're using the dtype parameter to the nd.array constructor, indicating we want an array whose element type is rainbow. If we were specifying the full array type instead, we would have to explicitly include the dimensionality as well.
In [6]:
colors = nd.array(['red', 'red', 'violet', 'blue', 'yellow', 'yellow', 'red', 'indigo'], dtype=rainbow)
colors
Out[6]:
To access the stored integers of the categorical array, an ints property is exposed.
In [7]:
colors.ints
Out[7]:
The values are always 0-based indices into the categories array.
In [8]:
[str(rainbow.categories[i]) for i in colors.ints]
Out[8]:
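The relationship between the stored integers and the categories can be sketched in plain Python. This is only an illustration of the encoding idea, not dynd's implementation; the names code_of, codes, and decoded are made up for this example.

```python
# Pure-Python illustration of categorical encoding (not dynd's implementation).
categories = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']
values = ['red', 'red', 'violet', 'blue', 'yellow', 'yellow', 'red', 'indigo']

# Map each category to its 0-based position in the categories list.
code_of = {cat: i for i, cat in enumerate(categories)}

# Encode: store one small integer per value instead of the string itself.
codes = [code_of[v] for v in values]

# Decode: index back into the categories list to recover the values.
decoded = [categories[i] for i in codes]

print(codes)
print(decoded == values)
```

Every code falls in the range [0, N), where N is the number of categories, which is what allows a compact integer storage type.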
If we have an array of integers that we want to view as a rainbow, we can do a view operation to get this. To make things interesting, let's also have the integers be of a different size, which means we'll have to cast to the correct integer type first.
In [9]:
myints = nd.array([5, 1, 3, 2, 0, 3, 3], access='rw')
myints
Out[9]:
Let's first cast to the correct integer type, using the ucast method.
In [10]:
my_cat_ints = myints.ucast(rainbow.storage_type)
my_cat_ints
Out[10]:
Then view it as a rainbow.
In [11]:
mycolors = my_cat_ints.view_scalars(rainbow)
mycolors
Out[11]:
This resulting object is still a view of the original integer data in myints, so if we modify mycolors, we are actually modifying myints in place.
In [12]:
print(repr(myints))
mycolors[1::2] = 'red'
print(repr(myints))
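The idea of writing through a categorical view into shared integer storage can be mimicked in plain Python. The CategoricalView class below is a hypothetical helper written for this sketch; it is not part of dynd and says nothing about how dynd implements views or casts.

```python
class CategoricalView:
    """Hypothetical helper: presents a shared list of integer codes as labels."""

    def __init__(self, codes, categories):
        self.codes = codes            # shared reference to the caller's list, not a copy
        self.categories = categories

    def __len__(self):
        return len(self.codes)

    def __getitem__(self, i):
        # Reading translates the stored integer into its category label.
        return self.categories[self.codes[i]]

    def __setitem__(self, i, label):
        # Writing translates the label back into an integer in the shared list.
        self.codes[i] = self.categories.index(label)


categories = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']
ints = [5, 1, 3, 2, 0, 3, 3]
view = CategoricalView(ints, categories)

# Assign 'red' to every second element, starting at index 1.
for i in range(1, len(view), 2):
    view[i] = 'red'

print(ints)                                   # the underlying integers changed
print([view[i] for i in range(len(view))])
```

Because the view holds a reference to the same list, no data is copied, and the assignment shows up in the original integers.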
In the previous example, we saw how to create a categorical type by using ndt.make_categorical. Let's take a look at how the related ndt.factor_categorical works. If we have an array of data whose categories we don't necessarily know ahead of time, this function allows us to create a categorical type with a deduced list of categories.
Let's create an example array using a structure of (gender, age).
In [13]:
myarr = nd.array([('M', 13), ('F', 17), ('F', 34), ('M', 19), ('M', 13), ('F', 34), ('F', 22)],
                 dtype='{gender: string[1], age: int32}')
myarr
Out[13]:
There were a few repeated pairs in the data, so when we factor it into a categorical type, the list becomes slightly smaller. The categories list is also in lexicographic order based on the fields of the structure.
In [14]:
catdt = ndt.factor_categorical(myarr)
catdt
Out[14]:
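What factoring does conceptually can be sketched with plain Python tuples: collect the distinct records, sort them lexicographically by field, and use each record's position as its code. This is an assumption-laden illustration of the idea described above, not the algorithm dynd uses internally.

```python
# Pure-Python illustration of deducing categories from data (not dynd's implementation).
records = [('M', 13), ('F', 17), ('F', 34), ('M', 19), ('M', 13), ('F', 34), ('F', 22)]

# Distinct records, sorted lexicographically: first by gender, then by age.
categories = sorted(set(records))

# Position of each category in the sorted list becomes its integer code.
code_of = {rec: i for i, rec in enumerate(categories)}
codes = [code_of[rec] for rec in records]

print(categories)
print(codes)
```

The seven input records collapse to five categories, because ('M', 13) and ('F', 34) each appear twice.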
We can now use the ucast method to cast the uniform type of the array into the categorical type.
In [15]:
mycats = myarr.ucast(catdt)
mycats
Out[15]:
The result of this cast is once again a view into the original data. If we want to make a concrete array that doesn't include any transformations, we need to do an eval on the array.
In [16]:
mycats = mycats.eval()
mycats
Out[16]:
Once again we can take a look at the ints storage to see what this is under the hood.
In [17]:
mycats.ints
Out[17]:
In our first examples, we created categorical types with fewer than 256 categories in them. This meant that the storage for them fit in one byte. When there are more categories, the storage type grows as needed to fit the larger number of values. We'll illustrate this with a simple categorical type based on a range of values.
In [18]:
catdt = ndt.make_categorical(nd.range(-20000.0, 200000.0, 6))
If we were to print out this type, it would produce a very long list, because dynd doesn't yet shorten its repr output the way numpy does. To see that the storage type is bigger, let's look at it and the category type.
In [ ]:
print('storage type: ', catdt.storage_type)
print('category type:', catdt.category_type)
print('number of categories:', len(catdt.categories))
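How the storage width relates to the number of categories can be estimated with a small helper. The function min_storage_bytes is hypothetical and assumes unsigned integer storage in widths of 1, 2, 4, or 8 bytes; the exact thresholds and integer types dynd chooses may differ.

```python
def min_storage_bytes(n_categories):
    """Smallest unsigned integer width (in bytes) able to index n_categories values.

    Assumes unsigned storage; this is a sketch, not dynd's selection logic.
    """
    for nbytes in (1, 2, 4, 8):
        if n_categories <= 2 ** (8 * nbytes):
            return nbytes
    raise ValueError('too many categories')


# The rainbow type has 7 categories.
print(min_storage_bytes(7))

# A range from -20000 up to (but not including) 200000 in steps of 6.
n = len(range(-20000, 200000, 6))
print(n, min_storage_bytes(n))
```

Under these assumptions the range has too many values for a single byte, so two bytes per element are needed.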