In [1]:
import os
import sys
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import matplotlib.pyplot as plt
# the following line is not required if BatchFlow is installed as a python package.
sys.path.append("../..")
from batchflow import Pipeline, B, C, D, F, V
from batchflow.opensets import MNIST, CIFAR10, CIFAR100
from batchflow.models.torch import VGG7
BATCH_SIZE can be increased for modern GPUs with plenty of memory (4 GB and higher).
In [2]:
BATCH_SIZE = 64
MNIST is a dataset of handwritten digits frequently used as a baseline for machine learning tasks.
Downloading the MNIST database might take a few minutes to complete.
In [3]:
dataset = MNIST(bar=True)
There are also predefined CIFAR10 and CIFAR100 datasets.
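For example, one of the CIFAR datasets could be created in exactly the same way (a sketch; the download is larger and takes longer than MNIST):
# A sketch: CIFAR10 is instantiated just like MNIST above; bar=True shows a download progress bar.
cifar_dataset = CIFAR10(bar=True)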
A config allows you to create flexible pipelines which take parameters.
For instance, if you put a model type into the config, you can run the same pipeline against different models.
See the list of available models to choose the one that fits your task best.
In [4]:
config = dict(model=VGG7)
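As a sketch of that flexibility, the config could point to another architecture from batchflow.models.torch (ResNet18 here is an assumption; pick any model available in your version) without changing the pipeline itself:
# A hypothetical alternative config: a different model, same pipeline template.
from batchflow.models.torch import ResNet18
alt_config = dict(model=ResNet18)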
A template pipeline is not linked to any dataset. It's just an abstract sequence of actions, so it cannot be executed, but it serves as a convenient building block.
In [5]:
train_template = (Pipeline()
    .init_variable('loss_history', [])
    .init_model('dynamic', C('model'), 'conv_nn',
                config={'inputs/images/shape': B.image_shape,
                        'inputs/labels/classes': D.num_classes,
                        'initial_block/inputs': 'images',
                        'device': 'gpu:0'})
    .to_array(channels='first', dtype='float32')
    .train_model('conv_nn', B.images, B.labels,
                 fetches='loss', save_to=V('loss_history', mode='a'), use_lock=True)
)
Apply a dataset and a config to a template pipeline to create a runnable pipeline:
In [6]:
train_pipeline = (train_template << dataset.train) << config
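Since the template itself is not tied to any particular dataset or config, it could be linked to another dataset just as easily, for instance (a sketch; downloading CIFAR10 takes additional time):
# A sketch: the very same template, bound to a different dataset and the same config.
cifar_train_pipeline = (train_template << CIFAR10(bar=True).train) << config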
Run the pipeline (it might take from a few minutes to a few hours depending on your hardware).
In [7]:
train_pipeline.run(BATCH_SIZE, shuffle=True, n_epochs=1, drop_last=True, bar=True, prefetch=1)
Note that the progress bar often increments by 2 at a time: that's prefetch in action.
It does not help much here, though, since almost all of the time is spent in model training, which runs under a thread lock, one batch after another, without any parallelism (otherwise different batches would overwrite one another's weight updates and the model would not learn anything).
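If you prefer to control the loop yourself instead of calling run, the pipeline can also be iterated batch by batch (a minimal sketch, assuming the standard next_batch interface of your BatchFlow version):
# A sketch: explicit iteration; each call prepares one batch and executes all pipeline actions on it.
for _ in range(100):  # an arbitrary number of iterations
    batch = train_pipeline.next_batch(BATCH_SIZE, shuffle=True)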
In [8]:
plt.figure(figsize=(15, 5))
plt.plot(train_pipeline.v('loss_history'))
plt.xlabel("Iterations")
plt.ylabel("Loss")
plt.show()
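The raw per-batch loss is noisy; a simple moving average (a sketch using the already-imported numpy) makes the trend easier to see:
# A sketch: smooth the loss curve with a moving average over 50 iterations.
loss = np.array(train_pipeline.v('loss_history'))
window = 50
smoothed = np.convolve(loss, np.ones(window) / window, mode='valid')
plt.figure(figsize=(15, 5))
plt.plot(smoothed)
plt.xlabel("Iterations")
plt.ylabel("Smoothed loss")
plt.show()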
Now let's test the model. Testing is much faster than training, but if you don't have a GPU it will still take some patience.
In [9]:
test_pipeline = (dataset.test.p
    .import_model('conv_nn', train_pipeline)
    .init_variable('predictions')
    .init_variable('metrics')
    .to_array(channels='first', dtype='float32')
    .predict_model('conv_nn', B.images,
                   fetches='predictions', save_to=V('predictions'))
    .gather_metrics('class', targets=B.labels, predictions=V('predictions'),
                    fmt='logits', axis=-1, save_to=V('metrics', mode='w'))
    .run(BATCH_SIZE, shuffle=True, n_epochs=1, drop_last=False, bar=True)
)
Let's get the accumulated metrics information:
In [10]:
metrics = test_pipeline.get_variable('metrics')
Or a shorter version: metrics = test_pipeline.v('metrics')
Now we can easily calculate any metric we need.
In [11]:
metrics.evaluate('accuracy')
In [12]:
metrics.evaluate(['false_positive_rate', 'false_negative_rate'], multiclass=None)
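Other classification metrics can be computed from the same accumulated object in exactly the same way, for example (a sketch; the exact set of supported metric names depends on your BatchFlow version):
# A sketch: per-class true positive and true negative rates.
metrics.evaluate(['true_positive_rate', 'true_negative_rate'], multiclass=None)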
Finally, the trained model can be saved to disk:
In [13]:
train_pipeline.save_model_now('conv_nn', path='path/to/model.torch')
See the image augmentation tutorial or return to the table of contents.