Dataset module introduction


In [1]:
# Initial setup following http://docs.chainer.org/en/stable/tutorial/basic.html
import numpy as np
import chainer
from chainer import cuda, Function, gradient_check, report, training, utils, Variable
from chainer import datasets, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions
import chainer.dataset
import chainer.datasets

Built-in dataset modules

Several common dataset formats are already implemented in chainer.datasets

TupleDataset


In [2]:
from chainer.datasets import TupleDataset

x = np.arange(10)
t = x * x

data = TupleDataset(x, t)

print('data type: {}, len: {}'.format(type(data), len(data)))


data type: <class 'chainer.datasets.tuple_dataset.TupleDataset'>, len: 10

In [3]:
# Unlike a numpy array, TupleDataset has no shape attribute.
data.shape


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-c2d494b81f2d> in <module>()
      1 # Unlike numpy, it does not have shape property.
----> 2 data.shape

AttributeError: 'TupleDataset' object has no attribute 'shape'

The i-th example can be accessed as data[i], which is a tuple of the form ($x_i$, $t_i$, ...)


In [4]:
# Get the 4th example -> x=3, t=9
data[3]


Out[4]:
(3, 9)

Slice access

When a TupleDataset is accessed with a slice, e.g. data[i:j], the returned value is a list of tuples $[(x_i, t_i), \ldots, (x_{j-1}, t_{j-1})]$


In [5]:
# Get the 1st through 4th examples at once.
examples = data[0:4]

print(examples)
print('examples type: {}, len: {}'
      .format(type(examples), len(examples)))


[(0, 0), (1, 1), (2, 4), (3, 9)]
examples type: <class 'list'>, len: 4

To convert a list of examples into minibatch format, you can use the concat_examples function in chainer.dataset.

Its return value has the format (x_array, t_array, ...)


In [6]:
from chainer.dataset import concat_examples

data_minibatch = concat_examples(examples)

#print(data_minibatch)
#print('data_minibatch type: {}, len: {}'
#      .format(type(data_minibatch), len(data_minibatch)))

x_minibatch, t_minibatch = data_minibatch
# Now it is array format, which has shape
print('x_minibatch = {}, type: {}, shape: {}'.format(x_minibatch, type(x_minibatch), x_minibatch.shape))
print('t_minibatch = {}, type: {}, shape: {}'.format(t_minibatch, type(t_minibatch), t_minibatch.shape))


x_minibatch = [0 1 2 3], type: <class 'numpy.ndarray'>, shape: (4,)
t_minibatch = [0 1 4 9], type: <class 'numpy.ndarray'>, shape: (4,)

DictDataset

Each example of a DictDataset is a dictionary instead of a tuple. You construct it by passing key=value pairs as keyword arguments.


In [10]:
from chainer.datasets import DictDataset

x = np.arange(10)
t = x * x

# To construct `DictDataset`, you can specify each key-value pair by passing "key=value" in kwargs.
data = DictDataset(x=x, t=t)

print('data type: {}, len: {}'.format(type(data), len(data)))


data type: <class 'chainer.datasets.dict_dataset.DictDataset'>, len: 10

In [16]:
# Get the 3rd example.
example = data[2]
          
print(example)
print('example type: {}, len: {}'
      .format(type(example), len(example)))

# You can access each value via key
print('x: {}, t: {}'.format(example['x'], example['t']))


{'t': 4, 'x': 2}
example type: <class 'dict'>, len: 2
x: 2, t: 4
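
concat_examples in chainer.dataset also understands dict examples: given a list of dicts, it returns a single dict of batched arrays. A minimal sketch using the data defined above:

from chainer.dataset import concat_examples

# Batch the first 4 dict examples; the result is a dict whose values are arrays.
examples = [data[i] for i in range(4)]
minibatch = concat_examples(examples)
print(minibatch['x'])  # expected: [0 1 2 3]
print(minibatch['t'])  # expected: [0 1 4 9]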

ImageDataset

This is a utility class for image datasets.

When a dataset grows very large (for example ImageNet), it is not practical to load all the images into memory at once, as is done for CIFAR-10 or CIFAR-100.

In this case, the ImageDataset class can be used to load each image from storage at minibatch-creation time.

[Note] ImageDataset loads only the images. If you also need label information (for example for an image classification task), use LabeledImageDataset instead.

To use ImageDataset, you need to create a text file containing the list of image paths. See data/images.dat for what such a file looks like.
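
The file is plain text with one image path per line, resolved relative to the root argument. A minimal sketch that writes such a file, assuming hypothetical image names image_0.jpg through image_9.jpg:

# Hypothetical example: write one relative image path per line.
with open('./data/images.dat', 'w') as f:
    for i in range(10):
        f.write('image_{}.jpg\n'.format(i))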


In [28]:
import os

from chainer.datasets import ImageDataset

# print('Current directory: ', os.path.abspath(os.curdir))

filepath = './data/images.dat'
image_dataset = ImageDataset(filepath, root='./data/images')

print('image_dataset type: {}, len: {}'.format(type(image_dataset), len(image_dataset)))


image_dataset type: <class 'chainer.datasets.image_dataset.ImageDataset'>, len: 10

We have created image_dataset above; however, the images have not been loaded into memory yet.

For memory efficiency, the image data is loaded from storage only when it is accessed by index.


In [31]:
# Access the i-th image with image_dataset[i].
# The image data is loaded here, for the 0th image only.
img = image_dataset[0]

# img is a numpy array, already aligned as (channels, height, width),
# which is the standard shape format to feed into convolutional layers.
print('img', type(img), img.shape)


img <class 'numpy.ndarray'> (3, 426, 640)

LabeledImageDataset

This is a utility class for labeled image datasets.

Like ImageDataset, it loads image files from storage into memory at training time. The difference is that each example also carries label information, which is typically used for image classification tasks.

To use LabeledImageDataset, you need to create a text file containing the list of image paths and their labels. See data/images_labels.dat for what such a file looks like.
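
Here each line holds an image path and its integer label, separated by whitespace. A minimal sketch that writes such a file, reusing the hypothetical image names from above with made-up labels:

# Hypothetical example: "<path> <label>" per line; the labels are arbitrary here.
with open('./data/images_labels.dat', 'w') as f:
    for i in range(10):
        f.write('image_{}.jpg {}\n'.format(i, i % 2))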


In [32]:
import os

from chainer.datasets import LabeledImageDataset

# print('Current directory: ', os.path.abspath(os.curdir))

filepath = './data/images_labels.dat'
labeled_image_dataset = LabeledImageDataset(filepath, root='./data/images')

print('labeled_image_dataset type: {}, len: {}'.format(type(labeled_image_dataset), len(labeled_image_dataset)))


labeled_image_dataset type: <class 'chainer.datasets.image_dataset.LabeledImageDataset'>, len: 10

We have created labeled_image_dataset above; however, the images have not been loaded into memory yet.

For memory efficiency, the image data is loaded from storage only when it is accessed by index.


In [34]:
# Access the i-th image and label with labeled_image_dataset[i].
# The image data is loaded here, for the 0th image only.
img, label = labeled_image_dataset[0]

print('img', type(img), img.shape)
print('label', type(label), label)


img <class 'numpy.ndarray'> (3, 426, 640)
label <class 'numpy.ndarray'> 0

SubDataset

SubDataset wraps a view of a subset of an existing dataset. You usually do not construct it directly; instead, helper functions such as split_dataset_random and split_dataset_n_random split a dataset into SubDataset views.

These splits can be used for hold-out validation and cross validation (see the sketch after the cell below).


In [9]:
from chainer.datasets import split_dataset_n_random

x = np.arange(10)
dataset = TupleDataset(x, x * x)

# Split the dataset into 2 randomly shuffled, equally sized subsets.
subsets = split_dataset_n_random(dataset, 2)

print('number of subsets: {}'.format(len(subsets)))
print('subset lengths: {}'.format([len(s) for s in subsets]))


number of subsets: 2
subset lengths: [5, 5]
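
For k-fold cross validation, chainer.datasets also provides get_cross_validation_datasets_random, which returns a list of (train, test) SubDataset pairs. A minimal sketch with 5 folds over the 10-element dataset above:

from chainer.datasets import get_cross_validation_datasets_random

# Each fold holds out a different random 1/5 of the data for testing.
folds = get_cross_validation_datasets_random(dataset, 5)
for k, (train, test) in enumerate(folds):
    print('fold {}: train size = {}, test size = {}'.format(k, len(train), len(test)))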

Implement your own custom dataset

You can define your own dataset by implementing a subclass of DatasetMixin in chainer.dataset.

DatasetMixin

If you want to define a custom dataset, DatasetMixin provides the base interface that makes it compatible with the other dataset formats.

Another important use of DatasetMixin is preprocessing the input data, including data augmentation.

To implement a subclass of DatasetMixin, you usually need to implement these 3 methods:

  • Override __init__(self, *args): not compulsory, but typically used to store the underlying data.
  • Override __len__(self): iterators need to know the dataset length to detect the end of an epoch.
  • Override get_example(self, i): returns the i-th example; this is where preprocessing and data augmentation go.

In [10]:
from chainer.dataset import DatasetMixin


print_debug = True
class SimpleDataset(DatasetMixin):
    def __init__(self, values):
        self.values = values
        
    def __len__(self):
        return len(self.values)

    def get_example(self, i):
        if print_debug: 
            print('get_example, i = {}'.format(i))
        return self.values[i]

The important method in DatasetMixin is get_example(self, i). It is called whenever the data is accessed as data[i]


In [11]:
simple_data = SimpleDataset([0, 1, 4, 9, 16, 25])

In [12]:
# get_example(self, i) is called when data is accessed by data[i]
simple_data[3]


get_example, i = 3
Out[12]:
9

In [13]:
# data can be accessed using slice indexing as well

simple_data[1:3]


get_example, i = 1
get_example, i = 2
Out[13]:
[1, 4]

The important point is that get_example is called every time the data is accessed by [] indexing.

Thus you can put random value generation for data augmentation in get_example.


In [14]:
import numpy as np
from chainer.dataset import DatasetMixin

print_debug = False


def calc(x):
    return x * x


class SquareNoiseDataset(DatasetMixin):
    def __init__(self, values):
        self.values = values
        
    def __len__(self):
        return len(self.values)

    def get_example(self, i):
        if print_debug: 
            print('get_example, i = {}'.format(i))
        x = self.values[i]
        t = calc(x) 
        t_noise = t + np.random.normal(0, 0.1)
        return x, t_noise

In [15]:
square_noise_data = SquareNoiseDataset(np.arange(10))

SquareNoiseDataset above adds small Gaussian noise to the squared value. Every time the value is accessed, get_example is called and different noise is added, even if you access the data at the same index, as shown below.


In [16]:
# Accessing the same index, but the value is different!
print('Accessing square_noise_data[3]', )
print('1st: ', square_noise_data[3])
print('2nd: ', square_noise_data[3])
print('3rd: ', square_noise_data[3])


Accessing square_noise_data[3]
1st:  (3, 8.9710277341058227)
2nd:  (3, 9.0598517818294599)
3rd:  (3, 9.070345838019648)

In [17]:
# The same applies to slice access.
print('Accessing square_noise_data[0:4]')
print('1st: ', square_noise_data[0:4])
print('2nd: ', square_noise_data[0:4])
print('3rd: ', square_noise_data[0:4])


Accessing square_noise_data[0:4]
1st:  [(0, -0.14427626899774257), (1, 0.85360656988561656), (2, 3.9732713008069145), (3, 8.9809134295500979)]
2nd:  [(0, -0.078466401141966013), (1, 0.85183819235205771), (2, 3.9409961378011142), (3, 9.0302699062379599)]
3rd:  [(0, 0.071952579879583839), (1, 1.025589783563474), (2, 4.10475859520119), (3, 9.0260985190124767)]

To convert examples into minibatch format, you can use the concat_examples function in chainer.dataset, in the same way as explained for TupleDataset.


In [19]:
from chainer.dataset import concat_examples

examples = square_noise_data[0:4]
print('examples = {}'.format(examples))
data_minibatch = concat_examples(examples)

x_minibatch, t_minibatch = data_minibatch
# Now it is array format, which has shape
print('x_minibatch = {}, type: {}, shape: {}'.format(x_minibatch, type(x_minibatch), x_minibatch.shape))
print('t_minibatch = {}, type: {}, shape: {}'.format(t_minibatch, type(t_minibatch), t_minibatch.shape))


examples = [(0, -0.011602612124293674), (1, 0.99882934589681571), (2, 3.9934740900218269), (3, 9.1771593657407067)]
x_minibatch = [0 1 2 3], type: <class 'numpy.ndarray'>, shape: (4,)
t_minibatch = [-0.01160261  0.99882935  3.99347409  9.17715937], type: <class 'numpy.ndarray'>, shape: (4,)

TransformDataset

TransformDataset can be used to create a new (modified) dataset from an existing one. The new dataset is constructed as TransformDataset(original_dataset, transform_function).

Let's see a concrete example that creates a new dataset from an original TupleDataset by adding small noise.


In [23]:
from chainer.datasets import TransformDataset

x = np.arange(10)
t = x * x - x

original_dataset = TupleDataset(x, t)

def transform_function(in_data):
    x_i, t_i = in_data
    new_t_i = t_i + np.random.normal(0, 0.1)
    return x_i, new_t_i

transformed_dataset = TransformDataset(original_dataset, transform_function)

In [24]:
original_dataset[:3]


Out[24]:
[(0, 0), (1, 0), (2, 2)]

In [26]:
# Now Gaussian noise is added (in transform_function) to the original_dataset.
transformed_dataset[:3]


Out[26]:
[(0, -0.045087054953009055),
 (1, -0.042453714041671031),
 (2, 1.9586232300993316)]
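
TransformDataset also combines naturally with ImageDataset for on-the-fly data augmentation, since the transform runs each time an example is accessed. A minimal sketch, assuming CHW image arrays and a hypothetical crop size of 224:

def random_crop(img, crop_size=224):
    # Crop a random (crop_size x crop_size) patch from a CHW image array.
    _, h, w = img.shape
    top = np.random.randint(0, h - crop_size + 1)
    left = np.random.randint(0, w - crop_size + 1)
    return img[:, top:top + crop_size, left:left + crop_size]

augmented_image_dataset = TransformDataset(image_dataset, random_crop)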
