DyND Categorical Type

This notebook provides a short demonstration of the categorical type, which represents each value drawn from a fixed set of N categories as an integer in the range [0, N). This type is used in dynd's groupby function, and can be used to save memory when storing a large amount of data with only a small number of possible values.

Let's start by importing dynd, and printing out some version numbers.


In [1]:
from __future__ import print_function
import sys
import dynd
from dynd import nd, ndt
print('Python: ', sys.version)
print('DyND:   ', dynd.__version__)
print('LibDyND:', dynd.__libdynd_version__)


Python:  3.3.3 |Anaconda 1.8.0 (64-bit)| (default, Dec  3 2013, 11:56:40) [MSC v.1600 64 bit (AMD64)]
DyND:    0.6.1.post70.gc6ca7b4
LibDyND: 0.6.1.post298.gc6d68ea

Rainbow Example: ndt.make_categorical

There are two functions for creating a categorical type, ndt.make_categorical and ndt.factor_categorical. If you want to control exactly what categories there are, and in what order they appear, you want to use the ndt.make_categorical function. Let's make a rainbow type as our first example.


In [2]:
rainbow = ndt.make_categorical(['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet'])
rainbow


Out[2]:
ndt.type('categorical[string, ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]]')

We can look at some properties of the rainbow object to see what we have created. For instance, with only 7 values, the storage of the type is only one byte.


In [3]:
print('type id:       ', rainbow.type_id)
print('data size:     ', rainbow.data_size)
print('data alignment:', rainbow.data_alignment)


type id:        categorical
data size:      1
data alignment: 1

We can get the integer storage type and the category type.


In [4]:
print('storage type: ', rainbow.storage_type)
print('category type:', rainbow.category_type)


storage type:  uint8
category type: string

The list of categories is itself an immutable dynd array.


In [5]:
rainbow.categories


Out[5]:
nd.array(["red", "orange", "yellow", "green", "blue", "indigo", "violet"], type="strided * string")

Let's go ahead and make an array with the rainbow type. Note that we're using the dtype parameter to the nd.array constructor, indicating we want an array whose element type is rainbow. If we used the type parameter, we would have to explicitly include the dimensionality as well.


In [6]:
colors = nd.array(['red', 'red', 'violet', 'blue', 'yellow', 'yellow', 'red', 'indigo'], dtype=rainbow)
colors


Out[6]:
nd.array(["red", "red", "violet", "blue", "yellow", "yellow", "red", "indigo"], type="strided * categorical[string, ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]]")

To access the stored integers of the categorical array, an ints property is exposed.


In [7]:
colors.ints


Out[7]:
nd.array([0, 0, 6, 4, 2, 2, 0, 5], type="strided * uint8")

The values are always 0-based indices into the categories array.


In [8]:
[str(rainbow.categories[i]) for i in colors.ints]


Out[8]:
['red', 'red', 'violet', 'blue', 'yellow', 'yellow', 'red', 'indigo']
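
To make the mapping concrete, the encoding can be mimicked in plain Python without dynd. This is only an illustrative sketch: the helper names encode and decode are hypothetical and are not part of dynd's API.

```python
# Plain-Python illustration of categorical encoding (not dynd's implementation).
categories = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']

# Map each category to its 0-based position in the categories list.
index_of = {value: i for i, value in enumerate(categories)}

def encode(values):
    """Replace each value with its index; raises KeyError for unknown values."""
    return [index_of[v] for v in values]

def decode(codes):
    """Look each index up in the categories list to recover the values."""
    return [categories[i] for i in codes]

data = ['red', 'red', 'violet', 'blue', 'yellow', 'yellow', 'red', 'indigo']
codes = encode(data)
print(codes)
print(decode(codes) == data)
```

The codes produced here are the same integers that colors.ints reported above.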

If we have an array of integers that we want to view as a rainbow, we can use a view operation. To make things interesting, let's also have the integers be of a different size, which means we'll have to cast them to the correct integer type first.


In [9]:
myints = nd.array([5, 1, 3, 2, 0, 3, 3], access='rw')
myints


Out[9]:
nd.array([5, 1, 3, 2, 0, 3, 3], type="strided * int32")

Let's first cast to the correct integer type, using the ucast method.


In [10]:
my_cat_ints = myints.ucast(rainbow.storage_type)
my_cat_ints


Out[10]:
nd.array([5, 1, 3, 2, 0, 3, 3], type="strided * convert[to=uint8, from=int32]")

Then view it as a rainbow.


In [11]:
mycolors = my_cat_ints.view_scalars(rainbow)
mycolors


Out[11]:
nd.array(["indigo", "orange", "green", "yellow", "red", "green", "green"], type="strided * view[as=categorical[string, ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]], original=convert[to=uint8, from=int32]]")

This resulting object is still a view of the original integer data in myints, so if we modify mycolors, we are actually modifying myints in place.


In [12]:
print(repr(myints))
mycolors[1::2] = 'red'
print(repr(myints))


nd.array([5, 1, 3, 2, 0, 3, 3], type="strided * int32")
nd.array([5, 0, 3, 0, 0, 0, 3], type="strided * int32")
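
The write-through behaviour can be sketched in plain Python as a thin wrapper that shares the underlying list of integers rather than copying it. The class CategoricalView is a hypothetical illustration, not a dynd type, and it leaves out the int32-to-uint8 conversion step that dynd chains in between.

```python
# Plain-Python illustration of write-through view semantics (not dynd's implementation).
categories = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']

class CategoricalView:
    """Hypothetical helper: presents a list of integer codes as category values."""

    def __init__(self, codes, categories):
        self.codes = codes            # shared with the caller, not copied
        self.categories = categories

    def __getitem__(self, key):
        selected = self.codes[key]
        if isinstance(key, slice):
            return [self.categories[i] for i in selected]
        return self.categories[selected]

    def __setitem__(self, key, value):
        # Translate the category value into its integer code before storing.
        code = self.categories.index(value)
        if isinstance(key, slice):
            # Broadcast the single value across every selected position.
            for i in range(*key.indices(len(self.codes))):
                self.codes[i] = code
        else:
            self.codes[key] = code

myints = [5, 1, 3, 2, 0, 3, 3]
mycolors = CategoricalView(myints, categories)
mycolors[1::2] = 'red'
print(myints)
```

Because the wrapper holds a reference to myints, assigning 'red' through the view stores the code 0 directly in the original list.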

Automatically Deducing The Categories: ndt.factor_categorical

In the previous example, we saw how to create a categorical type using ndt.make_categorical. Let's take a look at how the related ndt.factor_categorical works. If we have an array of data whose categories we don't necessarily know ahead of time, this function lets us create a categorical type with a deduced list of categories.

Let's create an example array using a structure of (gender, age).


In [13]:
myarr = nd.array([('M', 13), ('F', 17), ('F', 34), ('M', 19), ('M', 13), ('F', 34), ('F', 22)],
                dtype='{gender: string[1], age: int32}')
myarr


Out[13]:
nd.array([["M", 13], ["F", 17], ["F", 34], ["M", 19], ["M", 13], ["F", 34], ["F", 22]], type="strided * {gender : string[1], age : int32}")

There were a few repeated pairs in the data, so when we factor it into a categorical type, the list of categories is shorter than the array itself: 5 distinct pairs out of 7 elements. The categories list is also in lexicographic order based on the fields of the structure.


In [14]:
catdt = ndt.factor_categorical(myarr)
catdt


Out[14]:
ndt.type('categorical[{gender : string[1], age : int32}, [["F", 17], ["F", 22], ["F", 34], ["M", 13], ["M", 19]]]')

We can now use the ucast method to cast the uniform type of the array into the categorical type.


In [15]:
mycats = myarr.ucast(catdt)
mycats


Out[15]:
nd.array([["M", 13], ["F", 17], ["F", 34], ["M", 19], ["M", 13], ["F", 34], ["F", 22]], type="strided * convert[to=categorical[{gender : string[1], age : int32}, [["F", 17], ["F", 22], ["F", 34], ["M", 13], ["M", 19]]], from={gender : string[1], age : int32}]")

The result of this cast is once again a view into the original data. If we want to make a concrete array that doesn't include any transformations, we need to do an eval on the array.


In [16]:
mycats = mycats.eval()
mycats


Out[16]:
nd.array([["M", 13], ["F", 17], ["F", 34], ["M", 19], ["M", 13], ["F", 34], ["F", 22]], type="strided * categorical[{gender : string[1], age : int32}, [["F", 17], ["F", 22], ["F", 34], ["M", 13], ["M", 19]]]")

Once again we can take a look at the ints storage to see what this is under the hood.


In [17]:
mycats.ints


Out[17]:
nd.array([3, 0, 2, 4, 3, 2, 1], type="strided * uint8")
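
The factoring step can be approximated in plain Python by collecting the distinct values, sorting them, and replacing each element with its position in the sorted list. The function name factor is hypothetical, and tuples stand in for dynd's struct values; Python's tuple ordering happens to be lexicographic by field, which matches the ordering described above.

```python
# Plain-Python illustration of factoring data into categories and codes
# (not dynd's implementation).
records = [('M', 13), ('F', 17), ('F', 34), ('M', 19),
           ('M', 13), ('F', 34), ('F', 22)]

def factor(values):
    """Return (sorted distinct values, per-element index into that list)."""
    categories = sorted(set(values))
    index_of = {value: i for i, value in enumerate(categories)}
    return categories, [index_of[v] for v in values]

categories, codes = factor(records)
print(categories)
print(codes)
```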

Categorical Example With Larger Ints

In our first examples, we created categorical types with fewer than 256 categories in them. This meant that the storage for them fit in one byte. When there are more categories, the storage type grows as needed to fit the larger number of values. We'll illustrate this with a simple categorical type based on a range of values.


In [18]:
catdt = ndt.make_categorical(nd.range(-20000.0, 200000.0, 6))

If we were to print out this type, it would produce a very long list, because dynd doesn't yet shorten its repr output the way numpy does. To see that the storage type is bigger, let's look at it and the category type.


In [ ]:
print('storage type: ', catdt.storage_type)
print('category type:', catdt.category_type)
print('number of categories:', len(catdt.categories))
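
As a rough sketch of how a storage width might be chosen, the helper below picks the smallest unsigned integer type whose range can hold every index in [0, N). The function storage_type_for and its uint8/uint16/uint32 thresholds are assumptions made for illustration; they are not taken from dynd's source. The category count is computed with Python's integer range, which is assumed to have the same length as the float-valued nd.range call above.

```python
# Plain-Python illustration of choosing a storage width (thresholds are assumed,
# not taken from dynd).
def storage_type_for(num_categories):
    """Hypothetical helper: smallest unsigned type able to index the categories."""
    for name, bits in (('uint8', 8), ('uint16', 16), ('uint32', 32)):
        # Indices run from 0 to num_categories - 1, so 2**bits values suffice.
        if num_categories <= 2 ** bits:
            return name
    raise ValueError('too many categories')

print(storage_type_for(7))        # the rainbow type has 7 categories

# Integer stand-in for the range used above: start, stop, step.
n = len(range(-20000, 200000, 6))
print(n, storage_type_for(n))
```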