Getting Started with models

This notebook introduces you to the modelling features of the BatchFlow library by building a few simple regression models.


In [1]:
import sys
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import tensorflow as tf

import matplotlib.pyplot as plt
%matplotlib inline

# the following line is not required if BatchFlow is installed as a python package.
sys.path.append('../..')
from batchflow import Dataset, V, F, B, action, Batch
from batchflow.models.tf import TFModel
from batchflow.models.metrics import ClassificationMetrics

Define a batch class

Creating a specific batch class is not required, though it is convenient. Besides, it helps you get an idea of what batch components are.


In [2]:
class MyBatch(Batch):
    """ Batch class for regression models """
    components = 'features', 'labels'

All the batches will have 2 components: features and labels. Batch components might be thought of as columns in a table, while batch items are rows.

Linear regression

First, we consider linear regression, which solves tasks where the targets are continuous variables.

To demonstrate this, we generate features from a uniform or normal distribution, multiply them by normally distributed weights, add normally distributed noise, and then try to predict the resulting targets.


In [3]:
def generate_linear_data(size, dist='unif', shape=13):
    """ Generation of data to fit linear regression.

    Parameters
    ----------
    size: int
        data length

    dist: {'unif', 'norm'}
        sample distribution 'unif' or 'norm'. Default is 'unif'
    
    shape: int
        a length of a feature vector

    Returns
    -------
    x: numpy array
        Uniformly or normally distributed array

    y: numpy array
        array with some random noise
    """
    if dist == 'unif':
        x = np.random.uniform(0, 2, size=(size, shape))
    elif dist == 'norm':
        x = np.random.normal(size=(size, shape))

    w = np.random.normal(loc=1., size=(shape, 1))
    error = np.random.normal(loc=0., scale=0.1, size=(size, 1))

    y = np.dot(x, w) + error

    return x, y

In this case, x and y are numpy arrays:

  • x is a matrix with size rows and 13 columns
  • y is a vector of size items.

In [4]:
size = 1000
linear_x, linear_y = generate_linear_data(size)
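
A quick shape check (an extra sanity check, not one of the original cells) confirms the dimensions described above:

print(linear_x.shape)   # (1000, 13)
print(linear_y.shape)   # (1000, 1)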

Create a dataset

Now it's time to create a dataset (an instance of the Dataset class) which generates batches of the MyBatch class.

Even though the dataset does not have any data yet, it contains the full index of dataset items. So we can split it into train and test parts.


In [5]:
linear_dset = Dataset(size, batch_class=MyBatch)
linear_dset.split()
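
split() partitions the dataset index into train and test subsets; a quick check of the partition sizes (an extra cell, not in the original notebook):

print(len(linear_dset.train.indices), len(linear_dset.test.indices))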

After creation the dataset is empty until data is loaded with a pipeline, which also allows you to call action-methods defined in the batch class.


In [6]:
pipeline = (linear_dset.train.p
                       .load(src=(linear_x, linear_y)))
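
To see the batch components in action, you can pull one batch out of this pipeline and inspect it (an extra check, assuming the next_batch method described in the BatchFlow docs):

batch = pipeline.next_batch(10)
print(batch.features.shape)   # expected: (10, 13)
print(batch.labels.shape)     # expected: (10, 1)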

Define a model

The pipeline above only loads data. Clearly, it doesn't train a linear regression, so it is not enough for us.

Hence, we need to create a linear regression model. For more details on how to create your own model see the documentation.


In [7]:
class RegressionModel(TFModel):
    """ A universal regression model """

    @classmethod
    def body(cls, inputs, units, name='body', **kwargs):
        """ A simple one layer neural network

        Parameters
        ----------
        inputs : tf.Tensor
            input tensor
        units : int
            a number of neurons
        name : str
            scope name

        """
        with tf.variable_scope(name):
            dense = tf.layers.dense(inputs, units=units, name='dense')
        return dense

After the model is ready, you need to train it. The pipeline allows you to do this with just two methods: init_model and train_model.

init_model's config contains an inputs section that configures the parameters of input tensors (placeholders):

  • shape
  • tensor's name
  • typical transformations (like one-hot-encoding) and so on.

To configure 'inputs' use the inputs_config dict shown in the cell below. This dict has two keys:

  • features - the name of the placeholder for the input data.
  • labels - the name of the placeholder for the answers before all transformations.

Values for these keys are dicts themselves which describe the features and labels placeholders (the cell below uses the equivalent flattened 'key/subkey' spelling).

For more information see the documentation and API.


In [8]:
inputs_config = {
    'features/shape': 13,
    'labels/shape': 1
}
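
The flattened 'key/subkey' notation above is just a convenient spelling; the same configuration can also be written with nested dicts (an equivalent form, assuming BatchFlow's usual config semantics):

inputs_config = {
    'features': {'shape': 13},
    'labels': {'shape': 1}
}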

Other model configuration parameters include the loss function, the optimizer, and model-specific parameters (number of blocks, type of activation function, etc.):


In [9]:
config = {
    'inputs': inputs_config,
    'initial_block/inputs': 'features',
    'body/units': 1,
    'loss': 'mse',
    'optimizer': {'name':'GradientDescent', 'learning_rate': .01},
}

config keys:

  • inputs - input data configuration (described above).
  • initial_block/inputs - the name of the input tensor which should be fed into the model.
  • body/units - the number of neurons in the dense layer defined in body above (it can be a list if your body builds more than one dense layer).
  • loss - a loss function to optimize.
  • optimizer - an optimization algorithm and its parameters.

As you can see, some configuration options have a hierarchical structure. See the documentation for more info about the models and model configuration.


In [10]:
feed_dict = {
    'features': B('features'),
    'labels': B('labels')
}

feed_dict is a dict in which:

  • key is a placeholder name
  • value is input data for that placeholder.

The letter B in the values is a named expression: at run time it is substituted with the value of the batch attribute or component whose name it wraps.

In this case, these are the component names defined in the batch class.
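
Conceptually, you can think of B('features') being resolved for every batch roughly like this (a simplified sketch, not BatchFlow's actual implementation):

def resolve_b(name, batch):
    # a B named expression is substituted with the batch attribute or component of that name
    return getattr(batch, name)   # e.g. resolve_b('features', batch) -> batch.features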

Training pipeline

And now let's create a pipeline, which can generate batches and train a model.


In [11]:
BATCH_SIZE = 100
train_linear = (linear_dset.train.p
                .load(src=(linear_x, linear_y))
                .init_model('dynamic',
                            RegressionModel,
                            name='linear',
                            config=config)
                .train_model('linear',
                             feed_dict=feed_dict)
                .run(BATCH_SIZE, shuffle=True, n_epochs=10))

Prediction pipeline

A prediction pipeline would also be helpful. For this purpose the predict_model method is used.

Pay attention to the argument named fetches: it returns the value of the tensor with the specified name, so you can always get any tensor you want from the model.

As you might already know, in the output configuration parameter (passed to init_model) you can specify a list of useful outputs, such as 'proba', 'sigmoid', etc. (if you don't know about it, read the documentation). predict_model's fetches argument allows you to get those outputs from the model.
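
For instance, a classification model could declare extra output ops in its config and later fetch them by name at prediction time; a sketch of the idea (op names taken from the list above, see the documentation for the exact behaviour):

config_with_outputs = {
    'inputs': {'features/shape': 2, 'labels/classes': 2},
    'initial_block/inputs': 'features',
    'body/units': 2,
    'loss': 'ce',
    'output': dict(ops=['proba', 'accuracy'])   # extra tensors to fetch by name
}
# a prediction pipeline could then use, e.g.:
#     .predict_model('some_model', fetches='proba', feed_dict=feed_dict,
#                    save_to=V('proba', mode='a'))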

Another important method is import_model that loads a model from another pipeline.


In [12]:
test_linear = (linear_dset.test.p
                 .load(src=(linear_x, linear_y))
                 .import_model('linear', train_linear)
                 .init_variable('predict', init_on_each_run=list)
                 .predict_model('linear', 
                                fetches='predictions',
                                feed_dict=feed_dict,
                                save_to=V('predict', mode='a'))
                 .run(BATCH_SIZE, shuffle=False, n_epochs=1))

In the last pipeline, we test our model. Let's see how well it works.


In [13]:
predict = np.array(test_linear.get_variable('predict')).reshape(-1, 1)
target = np.array(linear_y[linear_dset.test.indices])

error = np.mean(np.abs((target - predict) * 100 / target))

print('Average error: {}%'.format(round(error, 3)))


Average error: 5.294%

The accuracy is far from perfect because the training was too short (in order not to make you wait too long). Increase the number of epochs in the training pipeline up to 1000 to get more accurate predictions.

Logistic regression

Logistic regression solves tasks with a binary target (0 or 1, -1 or 1, etc.).

To train a logistic regression we generate two-dimensional data from two linearly separable clusters and fit a model to predict a cluster for a given point.


In [14]:
def generate_logistic_data(size, first_params, second_params):
    """ Generation of data for fit logistic regression.
    Parameters
    ----------
    size: int
        number of data items

    first_params: list of list
        distribution params for cloud #0

    second_params: list of list
        distribution params for cloud #1

    Returns
    -------
    x: numpy array
        coordinates in two-dimensional space

    y: numpy array
        labels {0, 1}
    """
    first = np.random.multivariate_normal(first_params[0], first_params[1], size)
    second = np.random.multivariate_normal(second_params[0], second_params[1], size)

    x = np.vstack((first, second))
    y = np.hstack((np.zeros(size), np.ones(size)))
    shuffle = np.arange(len(x))
    np.random.shuffle(shuffle)
    x = x[shuffle]
    y = y[shuffle]

    return x, y

In [15]:
size = 500
logistic_x, logistic_y = generate_logistic_data(size, [[1,2],[[15,0],[0,15]]], [[10,17],[[15,0],[0,15]]])

In [16]:
plt.style.use('seaborn-poster')
plt.style.use('ggplot')
plt.scatter(logistic_x[:,0], logistic_x[:,1], c=logistic_y)
plt.title('Cloud points distribution', fontsize=18)
plt.show()


One of the most important things that you need to know is that it really doesn't matter which model you want to train and what data you will use for it. The procedure stays the same.

First of all, a dataset is created and split into train/test parts.


In [17]:
logistic_dset = Dataset(size, batch_class=MyBatch)
logistic_dset.split()

As you can see, the pipeline and configuration also do not change much:

  • shape has changed from 13 to 2 since now we have two-dimensional data.

  • for labels, new parameters appeared:

    • classes - the number of classes. We have a binary classification, hence 2 classes.
    • transform - the type of transformation for labels. We use one-hot encoding, or 'ohe' (see the sketch after the next cell for how to spell it out explicitly).

In [18]:
inputs_config = {
    'features/shape': 2,
    'labels/classes': 2
}
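
If you prefer to spell the label transformation out explicitly instead of relying on classes alone, the config could also include the transform key mentioned above (a sketch; see the documentation for the full set of input options):

inputs_config = {
    'features/shape': 2,
    'labels/classes': 2,
    'labels/transform': 'ohe'   # one-hot encode the labels
}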

Let's create and execute the pipeline with run:


In [19]:
BATCH_SIZE = 100

train_logistic = (logistic_dset.train.p
                .load(src=(logistic_x, logistic_y))
                .init_variable('loss_history', init_on_each_run=list)
                .init_model('dynamic',
                            RegressionModel,
                            'logistic',
                            config={
                                'inputs': inputs_config,
                                'loss': 'ce',
                                'optimizer': {'name':'Adam', 'learning_rate': 0.01},
                                'initial_block/inputs': 'features',
                                'body/units': 2,
                                'output': dict(ops=['accuracy'])})
                .train_model('logistic',
                             fetches='loss',
                             feed_dict={
                                 'features': B('features'),
                                 'labels': B('labels')},
                            save_to=V('loss_history', mode='a'))
                .run(BATCH_SIZE, shuffle=True, n_epochs=10))
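
Since the training pipeline appends the loss to loss_history on every iteration, you can plot it to check that the optimization converges (an extra cell, not in the original notebook):

plt.plot(train_logistic.get_variable('loss_history'))
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.show()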

In the same way, create a test pipeline and run it too


In [20]:
test_logistic = (logistic_dset.test.p
                .import_model('logistic', train_logistic)
                .load(src=(logistic_x, logistic_y))
                .init_variables(['predictions', 'metrics'])
                .predict_model('logistic', 
                             fetches='predictions' ,
                             feed_dict={
                                 'features': B('features'),
                                 'labels': B('labels')},
                             save_to=V('predictions'))
                .gather_metrics(ClassificationMetrics, targets=B('labels'), predictions=V('predictions'),
                                fmt='logits', axis=-1, save_to=V('metrics', mode='a'))
                .run(BATCH_SIZE, shuffle=False, n_epochs=1))

Afterwards, measure the quality of the training


In [21]:
accuracy = test_logistic.get_variable('metrics').evaluate('accuracy')
print('Percentage of accurate predictions: {:.2%}'.format(accuracy))


Percentage of accurate predictions: 25.00%

And again, to improve the accuracy, increase the number of epochs in the training pipeline.
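
Besides accuracy, the same metrics object can report other classification metrics, for example (metric names assumed from ClassificationMetrics' usual set, check the API for the full list):

metrics = test_logistic.get_variable('metrics')
print(metrics.evaluate('f1_score'))
print(metrics.evaluate('precision'))
print(metrics.evaluate('recall'))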

Poisson regression

Poisson regression is used when the targets are counts.

The example shows how to train a Poisson regression on data generated from a Poisson distribution.


In [22]:
def generate_poisson_data(lam, size=10, shape=13):
    """ Generation of data for fit poisson regression

    Parameters
    ----------
    size : int
        number of data items

    lam : numpy array
        regression coefficients; the Poisson rate is exp(x @ lam + b)
    
    shape : int
        number of features for each data item

    Returns
    -------
    x: numpy array
        Matrix with random numbers from the uniform distribution
    y: numpy array
        random Poisson distributed numbers
    """
    x = np.random.random(size=(size, shape))
    b = np.random.random(1)

    y_obs = np.random.poisson(np.exp(np.dot(x, lam) + b))

    shuffle = np.arange(len(x))
    np.random.shuffle(shuffle)
    x = x[shuffle]
    y = y_obs[shuffle].reshape(-1, 1)

    return x, y

In [23]:
size = 1000
NUM_DIM = 13
poisson_x, poisson_y = generate_poisson_data(np.random.random(NUM_DIM), size, NUM_DIM)

Below you can see the same cell as before, just with different names:


In [24]:
poisson_dset = Dataset(size, batch_class=MyBatch)
poisson_dset.split()

We have to create our own loss function:


In [25]:
def loss_poisson(target, predictions):
    # predictions are the model's raw outputs, treated as log-rates;
    # that is why the model's 'output' config applies tf.exp to them
    loss = tf.reduce_mean(tf.nn.log_poisson_loss(target, predictions))
    tf.losses.add_loss(loss)
    return loss
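
For reference, with its default arguments tf.nn.log_poisson_loss(targets, log_input) computes exp(log_input) - targets * log_input, i.e. the Poisson negative log-likelihood up to a constant; the same quantity written out in plain numpy (a sketch for intuition only):

def poisson_nll(target, log_rate):
    # matches log_poisson_loss without its optional Stirling correction term
    return np.mean(np.exp(log_rate) - target * log_rate)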

Again, shape equals 13, matching the number of features in the input data


In [26]:
inputs_config = {
    'features/shape': NUM_DIM,
    'labels/shape': 1
}

Create and run a training pipeline


In [27]:
BATCH_SIZE = 100

train_poisson = (poisson_dset.train.p
                .load(src=(poisson_x, poisson_y))
                .init_variable('shape')
                .init_model('dynamic', 
                            RegressionModel, 
                            'poisson',
                            config={
                                'inputs': inputs_config,
                                'loss': loss_poisson,
                                'optimizer': {'name': 'GradientDescent', 'learning_rate': 5e-5},
                                'initial_block/inputs': 'features',
                                'body/units': 1,
                                'output': tf.exp})
                .train_model('poisson',
                             fetches='loss',
                             feed_dict={
                                 'features': B('features'),
                                 'labels': B('labels')})
                .run(BATCH_SIZE, shuffle=True, n_epochs=100, bar=True))


100%|█████████▉| 799/800 [00:00<00:00, 870.72it/s]

Create a test pipeline and make predictions. Note that fetches='exp' refers to the output created by the 'output': tf.exp option in the model config, i.e. the predicted Poisson rate.


In [28]:
test_poisson = (poisson_dset.test.p
                .load(src=(poisson_x, poisson_y))
                .import_model('poisson', train_poisson)
                .init_variable('predictions', init_on_each_run=list)
                .predict_model('poisson', 
                               fetches='exp',
                               feed_dict={
                                   'features': B('features'),
                                   'labels': B('labels')},
                               save_to=V('predictions', mode='a'))
                .run(BATCH_SIZE, shuffle=True, n_epochs=1))

Measure the quality


In [29]:
pred = np.array(test_poisson.get_variable('predictions')).reshape(-1, 1)
target = np.array(poisson_y[poisson_dset.test.indices])

true_var = np.mean((target - np.mean(target))**2)
predict_var = np.mean((pred - np.mean(pred))**2)

error = np.mean(np.abs(pred - target)) / np.mean(target) * 100
print('Average error: {}%'.format(round(error, 3)), 'Variance ratio: %.3f' % (predict_var / true_var))


Average error: 73.113% Variance ratio: 0.752

Conclusion

  • No matter what you want to train and what data you use for it, the pipeline always looks the same.
  • It takes time to train an accurate model.
  • Now you know how to use BatchFlow and specifically how to:
    • create pipelines to train and test models
    • train linear regression with multi-dimensional data
    • train logistic regression (with another type of data - two-dimensional clouds of points)
    • create your own loss function and train a Poisson regression.

What's next?

You might hone your new skills by:

  • creating a multi-class regression model;
  • adding additional features to the model in order to improve the quality.

You might also want to dig deeper into batch operations.

Or choose another topic from the table of contents.