In [ ]:
from PIL import Image, ImageOps
import seqtools
In [ ]:
! [[ -f owl.jpg ]] || curl -s "https://cdn.pixabay.com/photo/2017/04/07/01/05/owl-2209827_640.jpg" -o owl.jpg
! [[ -f rooster.jpg ]] || curl -s "https://cdn.pixabay.com/photo/2018/08/26/14/05/hahn-3632299_640.jpg" -o rooster.jpg
! [[ -f duck.jpg ]] || curl -s "https://cdn.pixabay.com/photo/2018/09/02/10/03/violet-duck-3648415_640.jpg" -o duck.jpg
! [[ -f bird.jpg ]] || curl -s "https://cdn.pixabay.com/photo/2018/08/21/05/15/tit-3620632_640.jpg" -o bird.jpg
! [[ -f dog.jpg ]] || curl -s "https://cdn.pixabay.com/photo/2018/09/04/18/07/pug-3654360_640.jpg" -o dog.jpg
! [[ -f hedgehog.jpg ]] || curl -s "https://cdn.pixabay.com/photo/2018/09/04/18/52/hedgehog-3654434_640.jpg" -o hedgehog.jpg
SeqTools works with list-like indexable objects, so the first step is to create one that maps to our samples; this object is then passed to functions that apply the desired transformations. In this example, each sample is represented by its file name, and the file names are stored in a list.
In [ ]:
labels = ['owl', 'rooster', 'duck', 'bird', 'dog', 'hedgehog']
# We artificially increase the size of the dataset for the example
labels = [labels[i % len(labels)] for i in range(200)]
image_files = [l + '.jpg' for l in labels]
Let's load the full-resolution images. The result would not normally fit into memory, but with SeqTools the evaluation is delayed until the images are actually accessed.
In [ ]:
raw_images = seqtools.smap(Image.open, image_files)
We can verify the result for one sample; this triggers its evaluation and returns it:
In [ ]:
raw_images[0]
In [ ]:
def normalize_size(im):
    # Crop the largest centered square and resize it to 200x200.
    w, h = im.size
    left_crop = w // 2 - h // 2
    return im.resize((200, 200), Image.BILINEAR, box=(left_crop, 0, left_crop + h, h))

small_images = seqtools.smap(normalize_size, raw_images)
small_images[1]
We then apply common preprocessing steps:
In [ ]:
contrasted = seqtools.smap(ImageOps.autocontrast, small_images)
equalized = seqtools.smap(ImageOps.equalize, contrasted)
grayscale = seqtools.smap(ImageOps.grayscale, equalized)
grayscale[0]
That preprocessing seems a bit over the top... let's check where it went wrong:
In [ ]:
equalized[0]
In [ ]:
contrasted[0]
For each sample, only the minimal set of computations needed to produce the requested item was run. We find here that equalization is inappropriate and that autocontrast alone is too weak, so let's fix this.
In [ ]:
grayscale = seqtools.smap(ImageOps.grayscale, small_images)
contrasted = seqtools.smap(lambda im: ImageOps.autocontrast(im, cutoff=3), grayscale)
contrasted[0]
In [ ]:
# Generate flipped versions of the images
flipped = seqtools.smap(ImageOps.mirror, contrasted)
# Combine with the original dataset
augmented_dataset = seqtools.concatenate([contrasted, flipped])
augmented_dataset[-1]
In [ ]:
%time computed_values = list(augmented_dataset);
The evaluation above is a bit slow, probably due to the IO operations when loading the images from the hard drive. Maybe using multiple threads could help keep the CPU busy?
In [ ]:
fast_dataset = seqtools.prefetch(augmented_dataset, max_buffered=10, nworkers=2)
%time computed_values = list(fast_dataset)
The CPU time is the same because the computations are the same (plus some threading overhead), but wall time is cut down because image processing continues for some images while others are being loaded.
However, we could spare some IO by not reading the same image twice when generating the augmented version, and by the same token save some shared transformations. To avoid having one thread take the regular image while another takes the flipped one in parallel, which would incur a cache miss for the latter, we simply compute the transformed image and its flipped version in one step:
In [ ]:
regular_and_flipped = seqtools.smap(lambda im: (im, ImageOps.mirror(im)), contrasted)
fast_dataset = seqtools.prefetch(regular_and_flipped, max_buffered=10, nworkers=2)
The output now contains pairs of images; to flatten them into a sequence of images, we can "unbatch" them:
In [ ]:
fast_dataset = seqtools.add_cache(fast_dataset, cache_size=1)
flat_dataset = seqtools.unbatch(fast_dataset, batch_size=2)
The cache avoids recomputing the pair of images when its second element is accessed; indeed, SeqTools works in a stateless, on-demand fashion and recomputes everything by default.
Please note that concatenation would not be an appropriate replacement for unbatching here. Concatenation checks the length of the sequences to join, which in this situation would require computing each element of fast_dataset beforehand. Besides, unbatching has slightly faster access times because it assumes that all batches have the same size.
In [ ]:
%time computed_values = list(flat_dataset)
CPU time is now cut by half, leaving more room for other processes.
SeqTools facilitates the manipulation of sequences and reduces the boilerplate of using multi-threading or multi-processing. However, users must remain conscious of the mechanics under the hood to avoid pitfalls. Here are a few points to keep in mind while programming with this library: