Out-of-Core Learning

author: Jacob Schreiber
contact: jmschreiber91@gmail.com

Out-of-core learning refers to the process of training a model on an amount of data that cannot fit in memory. There are several approaches that can be described as out-of-core, but here we refer to the ability to derive exact updates to a model from a massive data set, despite not being able to fit the entire thing in memory.

This out-of-core learning approach is implemented for all of pomegranate's models using two methods. The first is the summarize method, which takes in a batch of data and reduces it to additive sufficient statistics. Because the summaries are additive, each call after the first adds its statistics to those already stored. Once the entire data set has been seen, the stored sufficient statistics are identical to those that would have been derived from the entire data set at once. The second is the from_summaries method, which uses the stored sufficient statistics to derive parameter updates for the model.
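
To make "additive sufficient statistics" concrete, here is a minimal numpy sketch for a multivariate Gaussian. It is an illustration only and not pomegranate's internal code: the function names mirror the two methods described above, but the three statistics kept here (count, sum, and sum of outer products) and the maximum likelihood covariance are assumptions chosen for this example.

```python
import numpy

def summarize(batch, summaries=None):
    """Fold one batch into running statistics: count, sum of x, sum of x x^T."""
    batch = numpy.asarray(batch, dtype=float)
    if summaries is None:
        d = batch.shape[1]
        summaries = [0.0, numpy.zeros(d), numpy.zeros((d, d))]

    # Each statistic is a plain sum over samples, so batches simply add up.
    summaries[0] += batch.shape[0]
    summaries[1] += batch.sum(axis=0)
    summaries[2] += batch.T.dot(batch)
    return summaries

def from_summaries(summaries):
    """Turn the accumulated statistics into a mean and (biased) covariance."""
    n, sum_x, sum_xxt = summaries
    mean = sum_x / n
    cov = sum_xxt / n - numpy.outer(mean, mean)
    return mean, cov

rng = numpy.random.RandomState(0)
data = rng.normal([5, 7], [1.5, 0.4], size=(1000, 2))

# Feed the data in four batches of 250 rows, never touching it all at once.
summaries = None
for start in range(0, len(data), 250):
    summaries = summarize(data[start:start + 250], summaries)

mean, cov = from_summaries(summaries)
```

Under these assumptions, the resulting mean and covariance should agree with the estimates computed from the full array in one pass, up to floating point error.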

A common solution to having too much data is to randomly select an amount of data that does fit in memory and use it in place of the full data set. While simple to implement, this approach is likely to yield lower-performing models because the model is exposed to less data. By using out-of-core learning, however, one can train models on a massive amount of data without being limited by the amount of memory the computer has.


In [1]:
%matplotlib inline
import time
import pandas
import random
import numpy
import matplotlib.pyplot as plt
import seaborn; seaborn.set_style('whitegrid')
import itertools

from pomegranate import *

random.seed(0)
numpy.random.seed(0)
numpy.set_printoptions(suppress=True)

%load_ext watermark
%watermark -m -n -p numpy,scipy,pomegranate


Tue Nov 27 2018 

numpy 1.15.1
scipy 1.1.0
pomegranate 0.10.0

compiler   : GCC 7.2.0
system     : Linux
release    : 4.15.0-39-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit

1. Training a Probability Distribution

Let's start off simple with training a multivariate Gaussian distribution in an out-of-core manner. First, we'll generate some random data.


In [2]:
X = numpy.random.normal([5, 7], [1.5, 0.4], size=(1000, 2))

Then we can make a blank distribution with 2 dimensions. This is equivalent to filling in the mean and covariance with dummy values that will be overwritten and do not affect the calculation.


In [3]:
d1 = MultivariateGaussianDistribution.blank(2)
d1


Out[3]:
{
    "frozen" :false,
    "class" :"Distribution",
    "parameters" :[
        [
            0.0,
            0.0
        ],
        [
            [
                1.0,
                0.0
            ],
            [
                0.0,
                1.0
            ]
        ]
    ],
    "name" :"MultivariateGaussianDistribution"
}

Now let's summarize through a few batches of data.


In [4]:
d1.summarize(X[:250])
d1.summarize(X[250:500])
d1.summarize(X[500:750])
d1.summarize(X[750:])

In [5]:
d1.summaries


Out[5]:
[]

Now that we've seen the entire data set, let's use the from_summaries method to update the parameters.


In [6]:
d1.from_summaries()
d1


Out[6]:
{
    "frozen" :false,
    "class" :"Distribution",
    "parameters" :[
        [
            4.967395273422254,
            6.996038686884455
        ],
        [
            [
                2.1362097536595757,
                -0.007992485878115985
            ],
            [
                -0.007992485878115985,
                0.15420875108571636
            ]
        ]
    ],
    "name" :"MultivariateGaussianDistribution"
}

And what do we get if we learn directly from the data?


In [7]:
MultivariateGaussianDistribution.from_samples(X)


Out[7]:
{
    "frozen" :false,
    "class" :"Distribution",
    "parameters" :[
        [
            4.967395273422257,
            6.996038686884458
        ],
        [
            [
                2.136209753659561,
                -0.007992485878137813
            ],
            [
                -0.007992485878137813,
                0.1542087510856727
            ]
        ]
    ],
    "name" :"MultivariateGaussianDistribution"
}

The same model, up to floating point error.

2. Training a Mixture Model

This summarization option enables a variety of different training strategies that can be written by hand. This notebook focuses on out-of-core learning, so let's make a data set and "read it in" one batch at a time to train a mixture model with a custom training function. We'll make another data set here, but one could easily have a function that read through some number of lines in a CSV, or loaded up a chunk from a numpy memory map, or whatever other massive data store you had.
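
As one example of such a data store, here is a sketch of a batch reader built on a numpy memory map. The helper name iter_batches, the file name data.bin, and the raw float64 file layout are all assumptions made for this illustration; a real data set would need its own shape and dtype.

```python
import os
import tempfile
import numpy

def iter_batches(path, n_rows, n_cols, batch_size, dtype=numpy.float64):
    """Yield successive row blocks from a raw binary file, one block at a time."""
    data = numpy.memmap(path, dtype=dtype, mode='r', shape=(n_rows, n_cols))
    for start in range(0, n_rows, batch_size):
        # numpy.array copies only this slice into memory.
        yield numpy.array(data[start:start + batch_size])

# Write a small stand-in data set to disk so the sketch is self-contained.
rng = numpy.random.RandomState(0)
full = rng.normal(0, 1, size=(1000, 10))
path = os.path.join(tempfile.mkdtemp(), 'data.bin')
full.tofile(path)

# Collected into a list here only so the result can be inspected; in real use
# each batch would be consumed and discarded inside the loop.
batches = list(iter_batches(path, 1000, 10, batch_size=300))
```

Each yielded batch could then be passed to the summarize method in place of a slice of an in-memory array. The last batch is shorter when the row count is not a multiple of the batch size.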


In [8]:
X = numpy.concatenate([numpy.random.normal(0, 1, size=(5000, 10)), numpy.random.normal(1, 1, size=(7500, 10))])
n = X.shape[0]

idx = numpy.arange(n)
numpy.random.shuffle(idx)

X = X[idx]

First we have to initialize our model. We can do that either by hand to some value we think is good, or by fitting to the first chunk of data, anticipating that it will be a decent representation of the remainder. We can also calculate the log probability of the data set now to see how much we improved.


In [9]:
# First we initialize our model on some small chunk of data.
model = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, 2, X[:200], max_iterations=1, init='first-k')

# The base performance on the data set.
base_logp = model.log_probability(X).sum()

In [10]:
from tqdm import tqdm_notebook as tqdm

# Now we write our own iterator. This outer loop will be the number of times we iterate---hard coded to 5 in this case.
for iteration in tqdm(range(5)):

    # This internal loop goes over chunks from the data set. We're just loading chunks of a fixed size iteratively
    # until we've seen the entire data set.
    for i in range(10):
        model.summarize(X[i * (n // 10):(i + 1) * (n // 10)])
    
    # Now we've seen the entire data set and summarized it. We can update the parameters now.
    model.from_summaries()
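
The inner loop above relies on n being divisible by the number of chunks (12500 / 10 = 1250 here); with any other n, the trailing rows would never be summarized. A small sketch of chunk boundaries that also cover a remainder follows. The helper name batch_bounds is hypothetical and not part of pomegranate.

```python
def batch_bounds(n, batch_size):
    """Return (start, end) index pairs that cover range(n) in order."""
    return [(start, min(start + batch_size, n))
            for start in range(0, n, batch_size)]

# Ten even chunks for the data set used here.
even = batch_bounds(12500, 1250)

# A size that does not divide evenly gets a shorter final chunk.
uneven = batch_bounds(10, 4)
```

Looping over these pairs and calling model.summarize(X[start:end]) would visit every row exactly once per epoch, whatever the data set size.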



How well did our model do on the data originally, and how well does it do now?


In [11]:
base_logp, model.log_probability(X).sum()


Out[11]:
(-188806.57618011418, -184074.12225611554)

Looks like a decent improvement.

Now, let's compare to having fit our model to the entire loaded data set for five epochs.


In [12]:
model = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, 2, X[:200], max_iterations=1, init='first-k')
base_logp = model.log_probability(X).sum()

model.fit(X, max_iterations=5)
base_logp, model.log_probability(X).sum()


Out[12]:
(-188806.57618011418, -184074.12225611557)

Looks like the same values, up to floating point error in the last digit.

You may ask why we bothered to write a summarization function for data that did fit in memory. The purpose here was entirely illustrative. Our function that uses the summarize method would scale to any amount of data that can be loaded in batches, whereas the fit function can only scale to the amount of data that fits in memory. Because the two yield the same answer, the summarize approach is the way to go if one wants to scale to massive data sets and still get the same performance.