VAEs on sparse data

  • The following notebook provides an example of how to load a dataset, set up its parameters, create the model, and train it for a few epochs.
  • In the notebook, we will use the RCV1 dataset (assuming it has already been set up). To set it up, run python rcv2.py in the optvaedatasets folder.

In [1]:
import sys,os,glob
from collections import OrderedDict
import numpy as np
from utils.misc import readPickle, createIfAbsent
sys.path.append('../')
from optvaedatasets.load import loadDataset as loadDataset_OVAE
from sklearn.feature_extraction.text import TfidfTransformer

Model Parameters

  • The default model parameters are stored in a pickle file; we'll load them and print them
  • The model will be built from these settings

In [2]:
default_params = readPickle('../optvaeutils/default_settings.pkl')[0]
for k in default_params:
    print '(',k,default_params[k],')',
print


Read  1  objects
( q_dim_hidden 400 ) ( grad_noise 0.0 ) ( opt_type none ) ( dataset binarized_mnist ) ( epochs 500 ) ( seed 1 ) ( n_steps 200 ) ( q_layers 2 ) ( init_weight 0.1 ) ( reg_spec _ ) ( reg_value 0.01 ) ( input_type normalize ) ( p_dim_hidden 400 ) ( reloadFile ./NOSUCHFILE ) ( dim_stochastic 100 ) ( lr 0.0008 ) ( p_layers 2 ) ( init_scheme uniform ) ( input_dropout 0.0001 ) ( reg_type l2 ) ( optimizer adam ) ( batch_size 500 ) ( opt_method adam ) ( savedir ./chkpt ) ( param_lr 0.01 ) ( likelihood mult ) ( savefreq 5 ) ( paramFile ./NOSUCHFILE ) ( emission_type mlp ) ( nonlinearity relu ) ( anneal_rate 0 ) ( unique_id VAE_lr-8_0e-04-ph-400-qh-400-ds-100-pl-2-ql-2-nl-relu-bs-500-ep-500-plr-1_0e-02-ar-0-otype-none-ns-200-etype-mlp-ll-mult-itype-normalizel20_01_-uid ) ( leaky_param 0.0 )

For the moment, we will leave most settings as they are. Two parameters are worth noting:

  • n_steps: Number of optimization steps applied to $\psi(x)$, the local variational parameters output by the inference network. We'll set this to 5 below for the moment.
  • dim_stochastic: Number of latent dimensions.
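
To make n_steps concrete, here is a minimal sketch of refining local variational parameters on a toy problem. It is not the notebook's model: we assume a scalar model with prior z ~ N(0, 1) and likelihood x | z ~ N(z, 1), a Gaussian q(z) = N(mu, exp(logvar)), and plain gradient ascent rather than ADAM. The name refine_local_params and its arguments are hypothetical.

```python
import numpy as np

def refine_local_params(x, mu, logvar, n_steps, lr=0.1):
    """Gradient ascent on the ELBO of a toy model (an assumption, not the notebook's VAE).

    With var = exp(logvar), the ELBO is, up to a constant,
        -0.5*((x - mu)**2 + var) - 0.5*(mu**2 + var) + 0.5*logvar
    """
    for _ in range(n_steps):
        var = np.exp(logvar)
        grad_mu = x - 2.0 * mu        # d ELBO / d mu
        grad_logvar = 0.5 - var       # d ELBO / d logvar
        mu = mu + lr * grad_mu
        logvar = logvar + lr * grad_logvar
    return mu, logvar

x = 2.0
mu0, logvar0 = 0.0, 0.0  # stand-in for the inference network's initial output
for n in (0, 5, 200):
    mu, logvar = refine_local_params(x, mu0, logvar0, n_steps=n)
    print(n, mu, np.exp(logvar))
```

In this toy model the exact posterior is N(x/2, 1/2), so more steps move (mu, var) from the initial guess toward (x/2, 0.5). Fewer steps trade accuracy for speed, which is the role n_steps plays here.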

In [3]:
default_params['opt_type'] = 'finopt' #set to finopt to optimize var. params, none otherwise
default_params['n_steps'] = 5
#temporary directory where checkpoints are saved
default_params['savedir'] = './tmp'

Load dataset

  • Let's load the RCV1(v2) dataset and look at how the dataset dict is structured
  • We'll need to copy some parameters from the dataset into the default parameters dict that we will use to create the model
  • Also, compute the idf vector from the training set (it is multiplied with the term frequencies dynamically inside the model)
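
As a rough illustration of the idf step, the sketch below computes smoothed idf weights with NumPy on a tiny made-up count matrix and applies them to term counts. The formula idf = ln((1 + n) / (1 + df)) + 1 is the one scikit-learn's TfidfTransformer uses with its default smooth_idf=True. The matrix and the helper name smoothed_idf are illustrative assumptions, and the way the model combines idf with term frequencies internally may differ.

```python
import numpy as np

def smoothed_idf(counts):
    """Smoothed inverse document frequency for a (documents x words) count matrix."""
    n_docs = counts.shape[0]
    doc_freq = (counts > 0).sum(axis=0)   # number of documents containing each word
    return np.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0

# Toy bag-of-words counts: 3 documents, 4 vocabulary words (made-up data)
counts = np.array([[2, 0, 1, 0],
                   [1, 0, 0, 0],
                   [3, 1, 0, 0]])

idf = smoothed_idf(counts)
tfidf_batch = counts * idf                # idf is broadcast across each document row
print(idf)
```

Words that appear in every document get the minimum weight of 1, while rarer words are weighted up.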

In [4]:
dset = loadDataset_OVAE('rcv2')

#Visualize structure of dataset dict
for k in dset:
    print k, type(dset[k]),
    if hasattr(dset[k],'shape'):
        print dset[k].shape
    elif type(dset[k]) is not list:
        print dset[k]
    else:
        print

#Add parameters to default_params
for k in ['dim_observations','data_type']:
    default_params[k] = dset[k]
default_params['max_word_count'] =dset['train'].max()


#Create IDF
additional_attrs        = {}
tfidf                   = TfidfTransformer(norm=None) 
tfidf.fit(dset['train'])
additional_attrs['idf'] = tfidf.idf_


vocabulary <type 'list'>
data_type <type 'str'> bow
dim_observations <type 'int'> 10000
train <class 'scipy.sparse.csr.csr_matrix'> (789414, 10000)
test <class 'scipy.sparse.csr.csr_matrix'> (10000, 10000)
valid <class 'scipy.sparse.csr.csr_matrix'> (5000, 10000)

In [5]:
from optvaemodels.vae import VAE as Model
import optvaemodels.vae_learn as Learn
import optvaemodels.vae_evaluate as Evaluate

Setup

  • Create a directory for configuration files. The configuration for a single experiment is stored in one pickle file.
  • We will use the same directory to save checkpoint files as well
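
The directory and file naming described above can be sketched with the standard library alone. create_if_absent below is a hypothetical stand-in for utils.misc.createIfAbsent (whose implementation is not shown in this notebook), the unique_id is shortened for illustration, and a temporary directory is used so nothing is written next to the notebook.

```python
import os
import tempfile

def create_if_absent(path):
    """Create the directory if it does not already exist (hypothetical helper)."""
    if not os.path.isdir(path):
        os.makedirs(path)

params = {'savedir': os.path.join(tempfile.mkdtemp(), 'tmp'),
          'opt_type': 'finopt',
          'unique_id': 'VAE-example-uid'}  # shortened, illustrative id

# Same naming scheme as the notebook: <savedir>-rcv2-<opt_type>/<unique_id>-config.pkl
params['savedir'] += '-rcv2-' + params['opt_type']
create_if_absent(params['savedir'])
config_path = os.path.join(params['savedir'], params['unique_id'] + '-config.pkl')
print(config_path)
```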

In [6]:
default_params['savedir']+='-rcv2-'+default_params['opt_type']
createIfAbsent(default_params['savedir'])
pfile= default_params['savedir']+'/'+default_params['unique_id']+'-config.pkl'
print 'Training model from scratch. Parameters in: ',pfile
model = Model(default_params, paramFile = pfile, additional_attrs = additional_attrs)


Training model from scratch. Parameters in:  ./tmp-rcv2-finopt/VAE_lr-8_0e-04-ph-400-qh-400-ds-100-pl-2-ql-2-nl-relu-bs-500-ep-500-plr-1_0e-02-ar-0-otype-none-ns-200-etype-mlp-ll-mult-itype-normalizel20_01_-uid-config.pkl
	<<Nparameters: 8451800>>
	<<Setting idf as theano shared variable>>
	<<WARNING: iter_ctr will not differentiated with respect to>>
	<<WARNING: anneal will not differentiated with respect to>>
	<<WARNING: lr will not differentiated with respect to>>
	<<Building Functions for Evaluation>>
	<<Inference with dropout :0.0000>>
	<<Optimizing variational parameters w/ ADAM>>
	<<Evaluation: Setting opt_method: ADAM, 100 steps w/ 8e-3 lr>>
	<<Inference with dropout :0.0000>>
	<<Optimizing variational parameters w/ ADAM>>
	<<Building Functions for Training>>
	<<Inference with dropout :0.0000>>
	<<Optimizing variational parameters w/ ADAM>>
	<<Modifying : [p_0_W,p_0_b,p_1_W,p_1_b,p_mean_W,p_mean_b]>>
	<<# additional updates: 0>>
	<<Modifying : [q_0_W,q_0_b,q_1_W,q_1_b,q_mu_W,q_logcov_W,q_mu_b,q_logcov_b]>>
	<<Inference with dropout :0.0000>>
	<<Done creating functions for training>>
	<<_buildModel took : 36.2146 seconds>>
	<<Modifying : [p_0_W,p_0_b,p_1_W,p_1_b,p_mean_W,p_mean_b]>>

Training the model

  • We can now train the model we created
  • This mirrors the overall setup in the file train.py
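
To give a feel for the size of one epoch, the sketch below splits the training set into minibatch index ranges. It only illustrates the bookkeeping: batch_ranges is a hypothetical helper, and Learn.learn may shuffle or batch the data differently.

```python
def batch_ranges(n_examples, batch_size):
    """Return (start, end) index pairs that cover n_examples in order."""
    return [(start, min(start + batch_size, n_examples))
            for start in range(0, n_examples, batch_size)]

# Training set size and batch size taken from the outputs above
ranges = batch_ranges(789414, 500)
print(len(ranges), ranges[-1])
```

With 789,414 training documents and a batch size of 500, one epoch consists of 1,579 minibatches, the last of which holds the remaining 414 documents.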

In [ ]:
savef      = os.path.join(default_params['savedir'],default_params['unique_id']) #Prefix for saving in checkpoint directory
savedata   = Learn.learn( model, 
                                dataset     = dset['train'],
                                epoch_start = 0 , 
                                epoch_end   = 3,  #epochs -- set w/ default_params['epochs'] 
                                batch_size  = default_params['batch_size'], #batch size 
                                savefreq    = default_params['savefreq'], #frequency of saving
                                savefile    = savef,
                                dataset_eval= dset['valid']
                                )

In [ ]:
for k in savedata:
    print k, type(savedata[k]), savedata[k].shape