General Structured Output Models with Shogun Machine Learning Toolbox

Shell Hu (GitHub ID: hushell)

Thanks to Patrick Pletscher and Fernando J. Iglesias García for taking the time to help me finish the project! Shoguners = awesome! Me = grateful!

Introduction

This notebook illustrates the training of a factor graph model using a structured SVM in Shogun. We begin with a brief outline of factor graphs and structured output learning, followed by the corresponding API in Shogun. Finally, we test the scalability by performing an experiment on a real OCR data set for handwritten character recognition.

Factor Graph

A factor graph explicitly represents the factorization of an undirected graphical model in terms of a set of factors (potentials), each of which is defined on a clique of the original graph [1]. For example, an MRF distribution can be factorized as

$$ P(\mathbf{y}) = \frac{1}{Z} \prod_{F \in \mathcal{F}} \theta_F(\mathbf{y}_F), $$

where $F$ is the factor index and $\theta_F(\mathbf{y}_F)$ is the energy with respect to the assignment $\mathbf{y}_F$. In this demo, we focus only on the table representation of factors: each factor holds an energy table $\theta_F$, which can be viewed as an unnormalized CPD. Depending on the factorization, there are different types of factors. Usually we assume the Markovian property holds, that is, factors of the same type share the same parameterization, regardless of their location in the graph or in time. There are also parameter-free factor types, for which nothing has to be learned. The implementation details are explained later.
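
As a toy illustration of the table representation (a plain numpy sketch, not the Shogun API), a pairwise factor over two binary variables can be stored as a $2 \times 2$ energy table, and evaluating the factor simply means indexing the table with the assignment of its variables:


In [ ]:
import numpy as np

# a 2x2 energy table for a pairwise factor over two binary variables;
# low energy on the diagonal favours equal states
theta_F = np.array([[0.2, 1.5],
                    [1.5, 0.2]])

y_F = (0, 0)            # an assignment to the factor's two variables
print(theta_F[y_F])     # energy of this assignment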

Structured Prediction

Structured prediction typically involves an input $\mathbf{x}$ (which may itself be structured) and a structured output $\mathbf{y}$. A joint feature map $\Phi(\mathbf{x},\mathbf{y})$ is defined to incorporate the structure of the labels, such as chains, trees or general graphs. A linear parameterization is used to give the prediction rule (we leave the kernelized version for future work):

$$ \hat{\mathbf{y}} = \underset{\mathbf{y} \in \mathcal{Y}}{\operatorname{argmax}} \langle \mathbf{w}, \Phi(\mathbf{x},\mathbf{y}) \rangle $$

where $\Phi(\mathbf{x},\mathbf{y})$ is the joint feature vector obtained by mapping local factor features to their corresponding locations according to $\mathbf{y}$, and $\mathbf{w}$ is the global parameter vector. In the factor graph model, parameters are associated with a set of factor types, so $\mathbf{w}$ is a concatenation of the local parameters.

The parameters are learned by regularized risk minimization. The risk defined by the user-provided loss function $\Delta(\mathbf{y},\mathbf{\hat{y}})$, e.g. the Hamming loss, is usually non-convex and non-differentiable, so the empirical risk is defined in terms of the surrogate hinge loss $H_i(\mathbf{w}) = \max_{\mathbf{y} \in \mathcal{Y}} \Delta(\mathbf{y}_i,\mathbf{y}) - \langle \mathbf{w}, \Psi_i(\mathbf{y}) \rangle$, which is an upper bound of the user-defined loss. Here $\Psi_i(\mathbf{y}) = \Phi(\mathbf{x}_i,\mathbf{y}_i) - \Phi(\mathbf{x}_i,\mathbf{y})$. The training objective is given by

$$ \min_{\mathbf{w}} \frac{\lambda}{2} ||\mathbf{w}||^2 + \frac{1}{N} \sum_{i=1}^N H_i(\mathbf{w}). $$
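
Evaluating $H_i(\mathbf{w})$ amounts to solving the so-called loss-augmented inference (max oracle) problem. Dropping the constant term $\langle \mathbf{w}, \Phi(\mathbf{x}_i,\mathbf{y}_i) \rangle$, the maximizer is

$$ \bar{\mathbf{y}}_i = \underset{\mathbf{y} \in \mathcal{Y}}{\operatorname{argmax}} \left( \Delta(\mathbf{y}_i,\mathbf{y}) + \langle \mathbf{w}, \Phi(\mathbf{x}_i,\mathbf{y}) \rangle \right), $$

which has the same form as the prediction rule with the loss added to the score, so the same MAP inference machinery can be reused during training.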

In Shogun's factor graph model, the corresponding quantities are implemented as methods of FactorGraphModel:

  • get_joint_feature_vector() computes the joint feature vector $\Phi(\mathbf{x}_i,\mathbf{y})$.
  • argmax() solves the loss-augmented MAP inference problem, i.e. evaluates the max oracle $H_i(\mathbf{w})$.
  • delta_loss() evaluates the structured loss $\Delta(\mathbf{y}_i,\mathbf{y})$.

Experiment: OCR

Show Data

First of all, we load the OCR data from a prepared mat file. The raw data can be downloaded from http://www.seas.upenn.edu/~taskar/ocr/. It contains 6876 handwritten words, with an average length of 8 letters, written by 150 different people. Each letter is rasterized into a binary image of 16 by 8 pixels. Thus, each $\mathbf{y}$ is a chain, and each node has 26 possible states denoting $\{a,\cdots,z\}$.


In [ ]:
%matplotlib inline
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
import numpy as np
import scipy.io

In [ ]:
dataset = scipy.io.loadmat(os.path.join(SHOGUN_DATA_DIR, 'ocr/ocr_taskar.mat'))
# patterns for training
p_tr = dataset['patterns_train']
# patterns for testing
p_ts = dataset['patterns_test']
# labels for training
l_tr = dataset['labels_train']
# labels for testing
l_ts = dataset['labels_test']

# feature dimension
n_dims = p_tr[0,0].shape[0]
# number of states
n_stats = 26
# number of training samples
n_tr_samples = p_tr.shape[1]
# number of testing samples
n_ts_samples = p_ts.shape[1]
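
As a quick sanity check of the layout described above (assuming the .mat file stores cell arrays of per-word matrices, with one column per letter), we can print a few shapes:


In [ ]:
print('feature dimension: %d, number of states: %d' % (n_dims, n_stats))
print('training words: %d, testing words: %d' % (n_tr_samples, n_ts_samples))
print('letters in the first training word: %d' % p_tr[0,0].shape[1])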

A few examples of the handwritten words are shown below. Note that the first, capitalized letter of each word has been removed.


In [ ]:
import matplotlib.pyplot as plt

def show_word(patterns, index):
    """show a word with padding"""
    plt.rc('image', cmap='binary')
    letters = patterns[0,index][:128,:]
    n_letters = letters.shape[1]
    for l in range(n_letters):
        lett = np.transpose(np.reshape(letters[:,l], (8,16)))
        lett = np.hstack((np.zeros((16,1)), lett, np.zeros((16,1))))
        lett = np.vstack((np.zeros((1,10)), lett, np.zeros((1,10))))
        plt.subplot(1,n_letters,l+1)
        plt.imshow(lett)
        plt.xticks(())
        plt.yticks(())
    plt.tight_layout()

In [ ]:
show_word(p_tr, 174)

In [ ]:
show_word(p_tr, 471)

In [ ]:
show_word(p_tr, 57)

Define Factor Types and Build Factor Graphs

Let's define 4 factor types, so that a word can be modeled as a chain graph.

  • The unary factor type defines the unary potentials that capture the appearance likelihood of each letter. In our case, each letter has $16 \times 8$ pixels, so there are $(16 \times 8 + 1) \times 26$ parameters, where the additional entries in the parameter vector are bias terms, one for each state.
  • The pairwise factor type defines the pairwise potentials between each pair of neighboring letters. This corresponds to a Potts-like (fully parameterized) pairwise potential with $26 \times 26$ parameters.
  • The bias factor type for the first letter is a compensation factor type, since the pairwise interaction of the first node is one-sided. It has $26$ parameters.
  • The bias factor type for the last letter plays the same role for the last node and also has $26$ parameters.

Putting all parameters together, the global parameter vector $\mathbf{w}$ has length $4082$.
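
As a quick sanity check of this count (a sketch reusing the n_dims and n_stats variables loaded above, assuming n_dims equals $16 \times 8 + 1 = 129$):


In [ ]:
# unary + pairwise + first-letter bias + last-letter bias
n_params = n_dims*n_stats + n_stats*n_stats + n_stats + n_stats
print('total number of parameters: %d' % n_params)   # expected: 4082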


In [ ]:
from shogun import TableFactorType

# unary, type_id = 0
cards_u = np.array([n_stats], np.int32)
w_gt_u = np.zeros(n_stats*n_dims)
fac_type_u = TableFactorType(0, cards_u, w_gt_u)

# pairwise, type_id = 1
cards = np.array([n_stats,n_stats], np.int32)
w_gt = np.zeros(n_stats*n_stats)
fac_type = TableFactorType(1, cards, w_gt)

# first bias, type_id = 2
cards_s = np.array([n_stats], np.int32)
w_gt_s = np.zeros(n_stats)
fac_type_s = TableFactorType(2, cards_s, w_gt_s)

# last bias, type_id = 3
cards_t = np.array([n_stats], np.int32)
w_gt_t = np.zeros(n_stats)
fac_type_t = TableFactorType(3, cards_t, w_gt_t)

# all initial parameters
w_all = [w_gt_u,w_gt,w_gt_s,w_gt_t]

# all factor types
ftype_all = [fac_type_u,fac_type,fac_type_s,fac_type_t]

Next, we write a function to construct the factor graphs and prepare the labels for training. Each factor graph instance is a chain, but the number of nodes and edges depends on the number of letters: a unary factor is added for each letter, and a pairwise factor for each pair of neighboring letters. In addition, the first and the last letter each get an extra bias factor.


In [ ]:
def prepare_data(x, y, ftype, num_samples):
    """prepare FactorGraphFeatures and FactorGraphLabels """
    from shogun import Factor, TableFactorType, FactorGraph
    from shogun import FactorGraphObservation, FactorGraphLabels, FactorGraphFeatures

    samples = FactorGraphFeatures(num_samples)
    labels = FactorGraphLabels(num_samples)

    for i in range(num_samples):
        n_vars = x[0,i].shape[1]
        data = x[0,i].astype(np.float64)

        vc = np.array([n_stats]*n_vars, np.int32)
        fg = FactorGraph(vc)

        # add unary factors
        for v in range(n_vars):
            datau = data[:,v]
            vindu = np.array([v], np.int32)
            facu = Factor(ftype[0], vindu, datau)
            fg.add_factor(facu)

        # add pairwise factors
        for e in range(n_vars-1):
            datap = np.array([1.0])
            vindp = np.array([e,e+1], np.int32)
            facp = Factor(ftype[1], vindp, datap)
            fg.add_factor(facp)

        # add bias factor to first letter
        datas = np.array([1.0])
        vinds = np.array([0], np.int32)
        facs = Factor(ftype[2], vinds, datas)
        fg.add_factor(facs)

        # add bias factor to last letter
        datat = np.array([1.0])
        vindt = np.array([n_vars-1], np.int32)
        fact = Factor(ftype[3], vindt, datat)
        fg.add_factor(fact)

        # add factor graph
        samples.add_sample(fg)

        # add corresponding label
        states_gt = y[0,i].astype(np.int32)
        states_gt = states_gt[0,:]  # convert 1 x n matrix to vector
        loss_weights = np.array([1.0/n_vars]*n_vars)
        fg_obs = FactorGraphObservation(states_gt, loss_weights)
        labels.add_label(fg_obs)

    return samples, labels

In [ ]:
# prepare training pairs (factor graph, node states)
n_tr_samples = 350 # choose a subset of training data to avoid time out on buildbot
samples, labels = prepare_data(p_tr, l_tr, ftype_all, n_tr_samples)

An example of the graph structure is visualized below, which should give a better sense of how a factor graph is built. Note that different colors are used to represent different factor types.


In [ ]:
try:
    import networkx as nx # pip install networkx
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--user', 'networkx'])
    import networkx as nx

import matplotlib.pyplot as plt

# create a graph
G = nx.Graph()
node_pos = {}

# add variable nodes, assuming there are 3 letters
G.add_nodes_from(['v0','v1','v2'])
for i in range(3):
    node_pos['v%d' % i] = (2*i,1)

# add factor nodes
G.add_nodes_from(['F0','F1','F2','F01','F12','Fs','Ft'])
for i in range(3):
    node_pos['F%d' % i] = (2*i,1.006)
    
for i in range(2):
    node_pos['F%d%d' % (i,i+1)] = (2*i+1,1)
    
node_pos['Fs'] = (-1,1)
node_pos['Ft'] = (5,1)

# add edges to connect variable nodes and factor nodes
G.add_edges_from([('v%d' % i,'F%d' % i) for i in range(3)])
G.add_edges_from([('v%d' % i,'F%d%d' % (i,i+1)) for i in range(2)])
G.add_edges_from([('v%d' % (i+1),'F%d%d' % (i,i+1)) for i in range(2)])
G.add_edges_from([('v0','Fs'),('v2','Ft')])

# draw graph
fig, ax = plt.subplots(figsize=(6,2))
nx.draw_networkx_nodes(G,node_pos,nodelist=['v0','v1','v2'],node_color='white',node_size=700,ax=ax)
nx.draw_networkx_nodes(G,node_pos,nodelist=['F0','F1','F2'],node_color='yellow',node_shape='s',node_size=300,ax=ax)
nx.draw_networkx_nodes(G,node_pos,nodelist=['F01','F12'],node_color='blue',node_shape='s',node_size=300,ax=ax)
nx.draw_networkx_nodes(G,node_pos,nodelist=['Fs'],node_color='green',node_shape='s',node_size=300,ax=ax)
nx.draw_networkx_nodes(G,node_pos,nodelist=['Ft'],node_color='purple',node_shape='s',node_size=300,ax=ax)
nx.draw_networkx_edges(G,node_pos,alpha=0.7)
plt.axis('off')
plt.tight_layout()

Training

Now we can create the factor graph model and start training. We will use tree max-product belief propagation, which is exact on chain (and tree) graphs, to do MAP inference; a didactic sketch of this kind of dynamic program is shown below.
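
For intuition, here is a minimal numpy sketch of MAP inference on a chain (the min-sum variant, i.e. minimizing total energy; minimizing energies versus maximizing scores is only a sign convention). The function chain_map_min_sum and its hand-made unary and pairwise tables are illustrative assumptions, not the Shogun implementation:


In [ ]:
import numpy as np

def chain_map_min_sum(unary, pairwise):
    """MAP inference on a chain by min-sum dynamic programming (Viterbi).

    unary    : (n_vars, n_states) array of unary energies
    pairwise : (n_states, n_states) array of pairwise energies shared by all edges
    returns the minimum-energy state sequence
    """
    n_vars, n_states = unary.shape
    cost = unary[0].copy()                           # best energy of prefixes ending in each state
    backptr = np.zeros((n_vars, n_states), dtype=int)
    for v in range(1, n_vars):
        # trans[t,s]: best prefix energy ending in state t, plus edge (t,s) and unary of s
        trans = cost[:, None] + pairwise + unary[v][None, :]
        backptr[v] = np.argmin(trans, axis=0)
        cost = np.min(trans, axis=0)
    states = np.zeros(n_vars, dtype=int)
    states[-1] = int(np.argmin(cost))
    for v in range(n_vars - 1, 0, -1):               # backtrack
        states[v - 1] = backptr[v, states[v]]
    return states

# tiny example: 4 nodes, 3 states, pairwise table favouring equal neighbouring states
print(chain_map_min_sum(np.random.rand(4, 3), 1.0 - np.eye(3)))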


In [ ]:
from shogun import FactorGraphModel, TREE_MAX_PROD

# create model and register factor types
model = FactorGraphModel(samples, labels, TREE_MAX_PROD)
model.add_factor_type(ftype_all[0])
model.add_factor_type(ftype_all[1])
model.add_factor_type(ftype_all[2])
model.add_factor_type(ftype_all[3])

In Shogun, we have implemented several batch solvers and online solvers. Let's first try to train the model using a batch solver. We choose the dual bundle method solver (DualLibQPBMSOSVM) [2], since in practice it is slightly faster than the primal n-slack cutting plane solver (PrimalMosekSOSVM) [3]. However, it will still take a while to converge. Briefly, in each iteration a gradually tighter piecewise-linear lower bound of the objective function is constructed by adding more cutting planes (most violated constraints), and then the resulting approximate QP is solved. Finding a cutting plane involves calling the max oracle $H_i(\mathbf{w})$, and on average $N$ oracle calls are required per iteration, which is basically why the training is time consuming. A sketch of what a single max-oracle call computes is given below.
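
To make the role of the max oracle concrete, here is a minimal sketch of a single oracle call for a hypothetical toy problem whose label set is small enough to enumerate. The helpers phi, delta and label_set are assumed placeholders, and this is not how Shogun implements the oracle (Shogun runs loss-augmented inference on the factor graph instead of enumerating):


In [ ]:
import numpy as np

def max_oracle(w, x_i, y_i, phi, delta, label_set):
    """Sketch of one max-oracle call: returns H_i(w) and a subgradient of H_i at w.

    phi(x, y)    : joint feature map, returns a numpy vector (placeholder)
    delta(y, yp) : structured loss, returns a float (placeholder)
    label_set    : an enumerable set of candidate outputs -- a toy assumption
    """
    best_y, best_score = None, -np.inf
    for y in label_set:
        # loss-augmented score, cf. the loss-augmented inference problem above
        score = delta(y_i, y) + np.dot(w, phi(x_i, y))
        if score > best_score:
            best_y, best_score = y, score
    H_i = best_score - np.dot(w, phi(x_i, y_i))
    subgrad = phi(x_i, best_y) - phi(x_i, y_i)   # = -Psi_i(y*), a subgradient of H_i
    return H_i, subgrad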


In [ ]:
from shogun import DualLibQPBMSOSVM
from shogun import BmrmStatistics
import pickle
import time

# create a bundle method SOSVM; several variants can be chosen:
# BMRM, Proximal Point BMRM, Proximal Point P-BMRM, NCBM
# usually the default one, i.e. BMRM, is good enough
# lambda is set to 1e-2
bmrm = DualLibQPBMSOSVM(model, labels, 0.01)

bmrm.set_TolAbs(20.0)
bmrm.set_verbose(True)
bmrm.set_store_train_info(True)
    
# train
t0 = time.time()
bmrm.train()
t1 = time.time()

w_bmrm = bmrm.get_w()

print "BMRM took", t1 - t0, "seconds."

Let's check the duality gap to see whether the training has converged. We aim at minimizing the primal problem while maximizing the dual problem. By the weak duality theorem, the optimal value of the primal problem is always greater than or equal to that of the dual problem, so we expect the duality gap to decrease over time. A relatively small and stable duality gap indicates convergence. In fact, the gap does not have to reach zero: a small gap already tells us that the current solution is not far from the optimum.


In [ ]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,4))

primal_bmrm = bmrm.get_helper().get_primal_values()
dual_bmrm = bmrm.get_result().get_hist_Fd_vector()

len_iter = min(primal_bmrm.size, dual_bmrm.size)
primal_bmrm = primal_bmrm[1:len_iter]
dual_bmrm = dual_bmrm[1:len_iter]

# plot duality gaps
xs = range(dual_bmrm.size)
axes[0].plot(xs, (primal_bmrm-dual_bmrm), label='duality gap')
axes[0].set_xlabel('iteration')
axes[0].set_ylabel('duality gap')
axes[0].legend(loc=1)
axes[0].set_title('duality gaps');
axes[0].grid(True)

# plot primal and dual values
xs = range(dual_bmrm.size-1)
axes[1].plot(xs, primal_bmrm[1:], label='primal')
axes[1].plot(xs, dual_bmrm[1:], label='dual')
axes[1].set_xlabel('iteration')
axes[1].set_ylabel('objective')
axes[1].legend(loc=1)
axes[1].set_title('primal vs dual');
axes[1].grid(True)

Other statistics may also help to check whether the solution is good, such as the number of cutting planes, which gives a sense of how tight the piecewise-linear lower bound is. In general, the number of cutting planes should be much smaller than the dimension of the parameter vector.


In [ ]:
# statistics
bmrm_stats = bmrm.get_result()
nCP = bmrm_stats.nCP
nzA = bmrm_stats.nzA

print('number of cutting planes: %d' % nCP)
print('number of active cutting planes: %d' % nzA)

In our case, we have 101 active cutting planes, which is much less than 4082, the number of parameters. From these statistics we can expect a good model. Now let's come to the online solvers. Unlike the cutting-plane algorithm, which re-optimizes over all previously added dual variables, an online solver updates the solution based on a single training example. This difference results in a faster convergence rate, i.e. fewer oracle calls; please refer to Table 1 in [4] for more detail. Here, we use the stochastic subgradient descent solver (StochasticSOSVM) and compare it with the BMRM algorithm shown before; its basic update step is sketched below.
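
For intuition, here is a minimal sketch of a plain stochastic subgradient step on the objective above, assuming a Pegasos-style step size $1/(\lambda t)$; $g_i$ denotes a subgradient of $H_i$ at $\mathbf{w}$, e.g. $\Phi(\mathbf{x}_i,\bar{\mathbf{y}}_i) - \Phi(\mathbf{x}_i,\mathbf{y}_i)$ as returned by a max-oracle call (cf. the sketch after the BMRM description). StochasticSOSVM additionally supports weighted averaging of the iterates, which is not shown here:


In [ ]:
def ssg_step(w, g_i, lam, t):
    """One stochastic subgradient step on lambda/2 ||w||^2 + H_i(w), step size 1/(lam*t)."""
    # a subgradient of the regularized single-example objective is lam*w + g_i
    return w - (lam*w + g_i) / (lam*t)

# in a full run: for t = 1, 2, ..., pick a random example i, obtain a subgradient g_i
# of H_i at the current w from the max oracle, and apply one step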


In [ ]:
from shogun import StochasticSOSVM

# the 3rd parameter is do_weighted_averaging; turning it on
# may give a faster convergence rate
# the 4th parameter enables verbose training output
sgd = StochasticSOSVM(model, labels, True, True)

sgd.set_num_iter(100)
sgd.set_lambda(0.01)
    
# train
t0 = time.time()
sgd.train()
t1 = time.time()
    
w_sgd = sgd.get_w()
    
print "SGD took", t1 - t0, "seconds."

We compare SGD and BMRM in terms of the primal objective versus effective passes. We first plot the whole training progress (until both algorithms converge) and then zoom in on the first 100 passes. To make the comparison fair, the regularization constant is set to 1e-2 for both algorithms.


In [ ]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,4))

primal_sgd = sgd.get_helper().get_primal_values()

xs = range(dual_bmrm.size-1)
axes[0].plot(xs, primal_bmrm[1:], label='BMRM')
axes[0].plot(range(99), primal_sgd[1:100], label='SGD')
axes[0].set_xlabel('effective passes')
axes[0].set_ylabel('primal objective')
axes[0].set_title('whole training progress')
axes[0].legend(loc=1)
axes[0].grid(True)

axes[1].plot(range(99), primal_bmrm[1:100], label='BMRM')
axes[1].plot(range(99), primal_sgd[1:100], label='SGD')
axes[1].set_xlabel('effective passes')
axes[1].set_ylabel('primal objective')
axes[1].set_title('first 100 effective passes')
axes[1].legend(loc=1)
axes[1].grid(True)

As shown above, the SGD solver needs fewer oracle calls to converge. Note that the reported times are about twice as long as the solvers actually need, since the primal objective and the training error are additionally computed in each pass. The training errors of both algorithms for each pass are shown below.


In [ ]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,4))

terr_bmrm = bmrm.get_helper().get_train_errors()
terr_sgd = sgd.get_helper().get_train_errors()

xs = range(terr_bmrm.size-1)
axes[0].plot(xs, terr_bmrm[1:], label='BMRM')
axes[0].plot(range(99), terr_sgd[1:100], label='SGD')
axes[0].set_xlabel('effective passes')
axes[0].set_ylabel('training error')
axes[0].set_title('whole training progress')
axes[0].legend(loc=1)
axes[0].grid(True)

axes[1].plot(range(99), terr_bmrm[1:100], label='BMRM')
axes[1].plot(range(99), terr_sgd[1:100], label='SGD')
axes[1].set_xlabel('effective passes')
axes[1].set_ylabel('training error')
axes[1].set_title('first 100 effective passes')
axes[1].legend(loc=1)
axes[1].grid(True)

Interestingly, the training error of the SGD solver is lower than BMRM's in the first 100 passes, but in the end the BMRM solver obtains the better training performance. A probable explanation is that BMRM uses only a very limited number of cutting planes at the beginning, which form a poor approximation of the objective function. As the number of cutting planes increases, the piecewise-linear lower bound becomes tighter, and the performance improves.

In addition, we would like to inspect the pairwise weights, which may capture important co-occurrences of letters. The Hinton diagram is a useful tool for visualizing 2D data: positive and negative values are represented by white and black squares, respectively, and the size of each square represents the magnitude of the value. In our case, a more negative weight, i.e. a larger black square, indicates that the two letters tend to co-occur.


In [ ]:
def hinton(matrix, max_weight=None, ax=None):
    """Draw Hinton diagram for visualizing a weight matrix."""
    ax = ax if ax is not None else plt.gca()

    if not max_weight:
        max_weight = 2**np.ceil(np.log(np.abs(matrix).max())/np.log(2))

    ax.patch.set_facecolor('gray')
    ax.set_aspect('equal', 'box')
    ax.xaxis.set_major_locator(plt.NullLocator())
    ax.yaxis.set_major_locator(plt.NullLocator())

    for (x,y),w in np.ndenumerate(matrix):
        color = 'white' if w > 0 else 'black'
        size = np.sqrt(np.abs(w))
        rect = plt.Rectangle([x - size / 2, y - size / 2], size, size,
                             facecolor=color, edgecolor=color)
        ax.add_patch(rect)

    ax.autoscale_view()
    ax.invert_yaxis()

In [ ]:
# get pairwise parameters, also accessible from
# w[n_dims*n_stats:n_dims*n_stats+n_stats*n_stats]
model.w_to_fparams(w_sgd) # update factor parameters
w_p = ftype_all[1].get_w()
w_p = np.reshape(w_p,(n_stats,n_stats))
hinton(w_p)

Inference

Next, we show how to do inference with the learned model parameters for a given data point.


In [ ]:
# get testing data
samples_ts, labels_ts = prepare_data(p_ts, l_ts, ftype_all, n_ts_samples)

In [ ]:
from shogun import FactorGraphFeatures, FactorGraphObservation, TREE_MAX_PROD, MAPInference

# get a factor graph instance from test data
fg0 = samples_ts.get_sample(100)
fg0.compute_energies()
fg0.connect_components()

# create a MAP inference using tree max-product
infer_met = MAPInference(fg0, TREE_MAX_PROD)
infer_met.inference()

# get inference results
y_pred = infer_met.get_structured_outputs()
y_truth = FactorGraphObservation.obtain_from_generic(labels_ts.get_label(100))
print(y_pred.get_data())
print(y_truth.get_data())

Evaluation

Finally, we check the average training error and the average testing error. The evaluation can be done in two ways: we can either use the apply() function of the structured output machine, or use the SOSVMHelper.


In [ ]:
from shogun import LabelsFactory, SOSVMHelper

# training error of BMRM method
bmrm.set_w(w_bmrm)
model.w_to_fparams(w_bmrm)
lbs_bmrm = bmrm.apply()
acc_loss = 0.0
for i in range(n_tr_samples):
    y_pred = lbs_bmrm.get_label(i)
    y_truth = labels.get_label(i)
    acc_loss = acc_loss + model.delta_loss(y_truth, y_pred)

ave_loss = acc_loss / n_tr_samples
print('BMRM: Average training error is %.4f' % ave_loss)

In [ ]:
# training error of stochastic method
print('SGD: Average training error is %.4f' % SOSVMHelper.average_loss(w_sgd, model))

In [ ]:
# testing error
bmrm.set_features(samples_ts)
bmrm.set_labels(labels_ts)

lbs_bmrm_ts = bmrm.apply()
acc_loss = 0.0
for i in range(n_ts_samples):
    y_pred = lbs_bmrm_ts.get_label(i)
    y_truth = labels_ts.get_label(i)
    acc_loss = acc_loss + model.delta_loss(y_truth, y_pred)

ave_loss_ts = acc_loss / n_ts_samples
print('BMRM: Average testing error is %.4f' % ave_loss_ts)

In [ ]:
# testing error of stochastic method
print('SGD: Average testing error is %.4f' % SOSVMHelper.average_loss(sgd.get_w(), model))

References

[1] Kschischang, F.R., Frey, B.J. and Loeliger, H.-A., Factor Graphs and the Sum-Product Algorithm, IEEE Transactions on Information Theory 2001.

[2] Teo, C.H., Vishwanathan, S.V.N., Smola, A. and Le, Q.V., Bundle Methods for Regularized Risk Minimization, JMLR 2010.

[3] Tsochantaridis, I., Hofmann, T., Joachims, T. and Altun, Y., Support Vector Machine Learning for Interdependent and Structured Output Spaces, ICML 2004.

[4] Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletscher, P., Block-Coordinate Frank-Wolfe Optimization for Structural SVMs, ICML 2013.