This notebook introduces you to the modelling features of the BatchFlow library by building a few simple regression models.
In [1]:
import sys
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
# the following line is not required if BatchFlow is installed as a python package.
sys.path.append('../..')
from batchflow import Dataset, V, F, B, action, Batch
from batchflow.models.tf import TFModel
from batchflow.models.metrics import ClassificationMetrics
Creating a specific batch class is not required, though it is convenient. Besides, it helps you get an idea of what batch components are.
In [2]:
class MyBatch(Batch):
    """ Batch class for regression models """
    components = 'features', 'labels'
All the batches will have 2 components: features and labels. Batch components might be thought of as columns in a table, while batch items are rows.
First, we consider a linear regression, which solves tasks where the targets are continuous variables.
For this reason we draw features from a uniform or normal distribution, multiply them by normally distributed weights, add normally distributed noise, and then try to predict the result.
In [3]:
def generate_linear_data(size, dist='unif', shape=13):
    """ Generate data to fit a linear regression.

    Parameters
    ----------
    size : int
        data length
    dist : {'unif', 'norm'}
        sample distribution, 'unif' or 'norm'. Default is 'unif'.
    shape : int
        length of a feature vector

    Returns
    -------
    x : numpy array
        uniformly or normally distributed array
    y : numpy array
        array with some random noise
    """
    if dist == 'unif':
        x = np.random.uniform(0, 2, size=(size, shape))
    elif dist == 'norm':
        x = np.random.normal(size=(size, shape))
    else:
        raise ValueError("dist must be 'unif' or 'norm'")
    w = np.random.normal(loc=1., size=(shape, 1))
    error = np.random.normal(loc=0., scale=0.1, size=(size, 1))
    y = np.dot(x, w) + error
    return x, y
In this case, x and y are numpy arrays:
- x is a matrix with size rows and 13 columns;
- y is a vector with size items.
In [4]:
size = 1000
linear_x, linear_y = generate_linear_data(size)
Now it's time to create a dataset (an instance of the Dataset class) which generates batches of the MyBatch class.
Even though the dataset does not have any data yet, it contains the full index of dataset items. So we can split it into train and test parts.
In [5]:
linear_dset = Dataset(size, batch_class=MyBatch)
linear_dset.split()
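By default, split holds out 20% of the items as the test set. You can pass your own proportions (a minimal sketch; check the split signature in the BatchFlow docs):
linear_dset.split(0.7)              # 70% train, 30% test
linear_dset.split([0.6, 0.2, 0.2])  # train, test and validation parts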
Right after creation the dataset is empty; data is loaded with a pipeline, which lets you apply action-methods from the batch class.
In [6]:
pipeline = (linear_dset.train.p
            .load(src=(linear_x, linear_y)))
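To see the components in action, you can execute this pipeline one batch at a time (a minimal sketch; next_batch runs all pipeline actions for a single batch):
batch = pipeline.next_batch(5)
print(batch.features.shape)   # (5, 13) - a component acts like a table column
print(batch.labels.shape)     # (5, 1)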
The pipeline above only loads data. Clearly, it doesn't train a linear regression, so it is not enough for us.
Hence, we need to create a linear regression model. For more details on how to create your own model see the documentation.
In [7]:
class RegressionModel(TFModel):
    """ A universal regression model """
    @classmethod
    def body(cls, inputs, units, name='body', **kwargs):
        """ A simple one-layer neural network

        Parameters
        ----------
        inputs : tf.Tensor
            input tensor
        units : int
            number of neurons
        name : str
            scope name
        """
        with tf.variable_scope(name):
            dense = tf.layers.dense(inputs, units=units, name='dense')
        return dense
After the model is ready, you need to train it. The pipeline allows you to do this with just two methods: init_model and train_model.
init_model's config contains an inputs section which configures the parameters of the input tensors (placeholders). To configure 'inputs', use the inputs_config dict shown in the cell below. This dict has two keys, features and labels, and the values for these keys describe the corresponding placeholders (the flattened 'features/shape' notation is a shorthand for nested dicts).
For more information see the documentation and API.
In [8]:
inputs_config = {
    'features/shape': 13,
    'labels/shape': 1
}
Other model configuration parameters include the loss function, the optimizer, and model-specific parameters (number of blocks, type of activation function, etc.):
In [9]:
config = {
    'inputs': inputs_config,
    'initial_block/inputs': 'features',
    'body/units': 1,
    'loss': 'mse',
    'optimizer': {'name': 'GradientDescent', 'learning_rate': .01},
}
config keys:
- inputs — the placeholder configuration described above;
- initial_block/inputs — the input tensor the model starts from;
- body/units — an argument passed to our body method;
- loss — the loss function ('mse' stands for mean squared error);
- optimizer — the optimizer and its parameters.

As you can see, some configuration options have a hierarchical structure. See the documentation for more info about the models and model configuration.
In [10]:
feed_dict = {
    'features': B('features'),
    'labels': B('labels')
}
feed_dict is a dict in which keys are the names of input tensors and values are the data fed into them.
The letter B in the values is a named expression which is replaced at run time with the value of a batch attribute or component of that name. In this case, these are the component names defined in the batch class.
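B is one of several named expressions: V refers to a pipeline variable, and F wraps a function evaluated at run time. A quick sketch (fetch_scaled is a hypothetical helper, just for illustration; the exact call signature F uses is described in the BatchFlow docs):
def fetch_scaled(batch):
    # hypothetical helper: scale the features of the current batch
    return batch.features / np.abs(batch.features).max()

# B('features')   -> the 'features' component of the current batch
# V('predict')    -> the current value of the pipeline variable 'predict'
# F(fetch_scaled) -> the result of calling fetch_scaled on the current batch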
And now let's create a pipeline, which can generate batches and train a model.
In [11]:
BATCH_SIZE = 100

train_linear = (linear_dset.train.p
                .load(src=(linear_x, linear_y))
                .init_model('dynamic', RegressionModel,
                            name='linear', config=config)
                .train_model('linear', feed_dict=feed_dict)
                .run(BATCH_SIZE, shuffle=True, n_epochs=10))
A prediction pipeline would also be helpful. For this purpose the predict_model method is used.
Pay attention to the fetches argument: it returns the value of the tensor with the specified name, so you can always get any tensor you want from the model.
As you might already know, the output configuration parameter (when calling init_model) lets you specify a list of useful outputs, such as 'proba', 'sigmoid', etc. (if you don't know about it, read the documentation), and predict_model's fetches argument allows you to retrieve those outputs from the model.
Another important method is import_model, which loads a model from another pipeline.
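As an aside, fetches can also take a list of tensor names, with save_to holding one named expression per fetched tensor. A sketch under that assumption (see the BatchFlow docs for the exact behavior):
multi_fetch = (linear_dset.test.p
               .load(src=(linear_x, linear_y))
               .import_model('linear', train_linear)
               .init_variable('predict', init_on_each_run=list)
               .init_variable('loss_history', init_on_each_run=list)
               .predict_model('linear',
                              fetches=['predictions', 'loss'],
                              feed_dict=feed_dict,
                              save_to=[V('predict', mode='a'), V('loss_history', mode='a')]))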
In [12]:
test_linear = (linear_dset.test.p
               .load(src=(linear_x, linear_y))
               .import_model('linear', train_linear)
               .init_variable('predict', init_on_each_run=list)
               .predict_model('linear',
                              fetches='predictions',
                              feed_dict=feed_dict,
                              save_to=V('predict', mode='a'))
               .run(BATCH_SIZE, shuffle=False, n_epochs=1))
In the last pipeline we test our model. Let's see how well it works.
In [13]:
predict = np.array(test_linear.get_variable('predict')).reshape(-1, 1)
target = np.array(linear_y[linear_dset.test.indices])
error = np.mean(np.abs((target - predict) * 100 / target))
print('Average error: {}%'.format(round(error, 3)))
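The error printed above is the mean absolute percentage error:

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$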
The accuracy is far from perfect because the training was very short (so that you don't have to wait too long). Increase the number of epochs in the training pipeline to 1000 to get more accurate predictions.
Next, let's turn to logistic regression, which solves classification tasks. We generate two normally distributed clouds of points labeled 0 and 1 and try to tell them apart.
In [14]:
def generate_logistic_data(size, first_params, second_params):
    """ Generate data to fit a logistic regression.

    Parameters
    ----------
    size : int
        number of data items in each cloud
    first_params : list of lists
        distribution parameters for cloud #0
    second_params : list of lists
        distribution parameters for cloud #1

    Returns
    -------
    x : numpy array
        coordinates in two-dimensional space
    y : numpy array
        labels {0, 1}
    """
    first = np.random.multivariate_normal(first_params[0], first_params[1], size)
    second = np.random.multivariate_normal(second_params[0], second_params[1], size)
    x = np.vstack((first, second))
    y = np.hstack((np.zeros(size), np.ones(size)))
    shuffle = np.arange(len(x))
    np.random.shuffle(shuffle)
    x = x[shuffle]
    y = y[shuffle]
    return x, y
In [15]:
size = 500
logistic_x, logistic_y = generate_logistic_data(size, [[1,2],[[15,0],[0,15]]], [[10,17],[[15,0],[0,15]]])
In [16]:
plt.style.use('seaborn-poster')
plt.style.use('ggplot')
plt.scatter(logistic_x[:,0], logistic_x[:,1], c=logistic_y)
plt.title('Cloud points distribution', fontsize=18)
plt.show()
One of the most important things that you need to know is that it really doesn't matter which model you want to train and what data you will use for it. The procedure stays the same.
First of all, a dataset is created and split into train/test parts.
In [17]:
logistic_dset = Dataset(len(logistic_x), batch_class=MyBatch)
logistic_dset.split()
As you can see, the pipeline and the configuration do not change much:
- features/shape has changed from 13 to 2, since now the data is two-dimensional;
- labels now has a classes parameter, the number of target classes, instead of shape.
In [18]:
inputs_config = {
    'features/shape': 2,
    'labels/classes': 2
}
Let's create and execute the pipeline with run:
In [19]:
BATCH_SIZE = 100

train_logistic = (logistic_dset.train.p
                  .load(src=(logistic_x, logistic_y))
                  .init_variable('loss_history', init_on_each_run=list)
                  .init_model('dynamic', RegressionModel, 'logistic',
                              config={'inputs': inputs_config,
                                      'loss': 'ce',
                                      'optimizer': {'name': 'Adam', 'learning_rate': 0.01},
                                      'initial_block/inputs': 'features',
                                      'body/units': 2,
                                      'output': dict(ops=['accuracy'])})
                  .train_model('logistic',
                               fetches='loss',
                               feed_dict={'features': B('features'),
                                          'labels': B('labels')},
                               save_to=V('loss_history', mode='a'))
                  .run(BATCH_SIZE, shuffle=True, n_epochs=10))
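The loss_history variable now holds one loss value per training iteration, so you can eyeball convergence right away:
plt.plot(train_logistic.get_variable('loss_history'))
plt.xlabel('iteration')
plt.ylabel('cross-entropy loss')
plt.show()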
In the same way, create a test pipeline and run it too.
In [20]:
test_logistic = (logistic_dset.test.p
                 .import_model('logistic', train_logistic)
                 .load(src=(logistic_x, logistic_y))
                 .init_variables(['predictions', 'metrics'])
                 .predict_model('logistic',
                                fetches='predictions',
                                feed_dict={'features': B('features'),
                                           'labels': B('labels')},
                                save_to=V('predictions'))
                 .gather_metrics(ClassificationMetrics,
                                 targets=B('labels'), predictions=V('predictions'),
                                 fmt='logits', axis=-1, save_to=V('metrics', mode='a'))
                 .run(BATCH_SIZE, shuffle=False, n_epochs=1))
Afterwards, measure the quality of training:
In [21]:
accuracy = test_logistic.get_variable('metrics').evaluate('accuracy')
print('Percentage of accurate predictions: {:.2%}'.format(accuracy))
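Accuracy is not the only metric the collected ClassificationMetrics object can report; the same evaluate call accepts other metric names (a hedged sketch; see the metrics API for the supported names):
metrics = test_logistic.get_variable('metrics')
print(metrics.evaluate('f1_score'))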
And again, to improve the accuracy, increase the number of epochs in the training pipeline.
The last model is a Poisson regression, which suits count targets: we generate uniform features, exponentiate a linear combination of them and sample Poisson-distributed counts.
In [22]:
def generate_poisson_data(lam, size=10, shape=13):
    """ Generate data to fit a Poisson regression.

    Parameters
    ----------
    lam : numpy array
        weights of the linear combination under the exponent
    size : int
        number of data items
    shape : int
        number of features for each data item

    Returns
    -------
    x : numpy array
        matrix with random numbers from the uniform distribution
    y : numpy array
        Poisson-distributed random numbers
    """
    x = np.random.random(size=(size, shape))
    b = np.random.random(1)
    y_obs = np.random.poisson(np.exp(np.dot(x, lam) + b))
    shuffle = np.arange(len(x))
    np.random.shuffle(shuffle)
    x = x[shuffle]
    y = y_obs[shuffle].reshape(-1, 1)
    return x, y
In [23]:
size = 1000
NUM_DIM = 13
poisson_x, poisson_y = generate_poisson_data(np.random.random(NUM_DIM), size, NUM_DIM)
Below is the same kind of cell as before, just for the Poisson data.
In [24]:
poisson_dset = Dataset(size, batch_class=MyBatch)
poisson_dset.split()
We have to create our own loss function:
In [25]:
def loss_poisson(target, predictions):
    """ Poisson loss: predictions are log-rates compared against observed counts """
    loss = tf.reduce_mean(tf.nn.log_poisson_loss(target, predictions))
    tf.losses.add_loss(loss)
    return loss
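For reference, tf.nn.log_poisson_loss(targets, log_input) computes exp(log_input) - targets * log_input, the negative Poisson log-likelihood up to a constant, so the model's dense layer learns the log-rate and the tf.exp output turns it back into a rate. The same formula in plain numpy:
log_rate = np.array([0.5, 1.0])   # model outputs, i.e. log-rates
targets = np.array([1.0, 3.0])    # observed counts
manual_loss = np.exp(log_rate) - targets * log_rate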
Again, shape equals 13, matching the dimensionality of the input data:
In [26]:
inputs_config = {
    'features/shape': NUM_DIM,
    'labels/shape': 1
}
Create and run a train pipeline
In [27]:
BATCH_SIZE = 100

train_poisson = (poisson_dset.train.p
                 .load(src=(poisson_x, poisson_y))
                 .init_variable('shape')
                 .init_model('dynamic', RegressionModel, 'poisson',
                             config={'inputs': inputs_config,
                                     'loss': loss_poisson,
                                     'optimizer': {'name': 'GradientDescent', 'learning_rate': 5e-5},
                                     'initial_block/inputs': 'features',
                                     'body/units': 1,
                                     'output': tf.exp})
                 .train_model('poisson',
                              fetches='loss',
                              feed_dict={'features': B('features'),
                                         'labels': B('labels')})
                 .run(BATCH_SIZE, shuffle=True, n_epochs=100, bar=True))
Create a test pipeline and make predictions:
In [28]:
test_poisson = (poisson_dset.test.p
                .load(src=(poisson_x, poisson_y))
                .import_model('poisson', train_poisson)
                .init_variable('predictions', init_on_each_run=list)
                .predict_model('poisson',
                               fetches='exp',
                               feed_dict={'features': B('features'),
                                          'labels': B('labels')},
                               save_to=V('predictions', mode='a'))
                .run(BATCH_SIZE, shuffle=False, n_epochs=1))
Measure the quality: the average percentage error and the ratio of the prediction variance to the target variance (a ratio close to 1 means the predictions are as dispersed as the targets).
In [29]:
pred = np.array(test_poisson.get_variable('predictions')).reshape(-1, 1)
target = np.array(poisson_y[poisson_dset.test.indices])

true_var = np.mean((target - np.mean(target)) ** 2)
predict_var = np.mean((pred - np.mean(pred)) ** 2)
error = np.mean(np.abs(pred - target)) / np.mean(target) * 100

print('Average error: {}%'.format(round(error, 3)))
print('Variance ratio: {:.3f}'.format(predict_var / true_var))
In this notebook you have seen the modelling basics of BatchFlow, and specifically how to:
- create a batch class with components;
- build a dataset and split it into train and test parts;
- configure, initialize, train and import models within pipelines;
- collect predictions and measure the quality of a model.

You might hone your new skills by training the models above longer or with other optimizers and losses. You might also want to dig deeper into batch operations, or choose another topic from the table of contents.