Topic Modeling via Method of Moments

This code performs learning and inference of topic models via the method of moments, using tensor decomposition on a single machine. Specifically, the code first performs a pre-processing step that implements a whitening transformation for matrix/tensor orthogonalisation and dimensionality reduction, and then runs Alternating Least Squares (ALS).

Now let us give a sketch of the method of moments approach to learning topic models. We consider the mixed membership (Latent Dirichlet Allocation) topic model. The hidden topic proportions vector $\bf{h}\in \Delta^{k}$ is drawn from a Dirichlet distribution $\bf{h}\sim \text{Dir}(\alpha_1,\alpha_2,\ldots,\alpha_k)$, where $\alpha_0 := \sum_{i\in[k]} {\alpha_i}$ is defined as the concentration parameter. Here the memberships $\bf{h}$ are not pure but mixed, and the concentration parameter $\alpha_0$ controls to what extent the memberships are mixed: a small $\alpha_0$ yields documents dominated by a single topic, while a large $\alpha_0$ yields documents that mix many topics.

After drawing $\bf{h}$ from the Dirichlet distribution, the $v^{th}$ word in the document (for $v \in [\ell] := \{1, 2, \dots , \ell\}$) is generated by: (i) drawing a topic $t \in [k]$ according to the distribution specified by $\bf{h}$, then (ii) drawing $\bf{x}_v$ according to the conditional distribution $a_t$ given topic $t$. We represent words via one-hot encoding, i.e. word $\bf{x}_v = e_j$ if the $j^{th}$ word of the vocabulary is drawn. Here $e_j$ is the basis vector along the $j^{th}$ coordinate.
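To make the generative process concrete, here is a minimal, self-contained C++ sketch of it (illustration only, not part of the library; the Dirichlet parameters and topic-word distributions used here are hypothetical stand-ins for the unknowns the algorithm below learns):

In [ ]:
#include <iostream>
#include <random>
#include <vector>

int main() {
    const int k = 3, d = 100, ell = 50;  // topics, vocabulary size, doc length
    std::mt19937 rng(0);

    // Hypothetical Dirichlet parameters (alpha0 = 1) and uniform topic-word
    // distributions a_t; in practice these are the quantities to be learned.
    std::vector<double> alpha(k, 1.0 / k);
    std::vector<std::vector<double>> a(k, std::vector<double>(d, 1.0 / d));

    // Draw h ~ Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws.
    std::vector<double> h(k);
    double sum = 0.0;
    for (int i = 0; i < k; ++i) {
        std::gamma_distribution<double> gamma(alpha[i], 1.0);
        h[i] = gamma(rng);
        sum += h[i];
    }
    for (int i = 0; i < k; ++i) h[i] /= sum;

    // For each word: (i) draw topic t ~ h, then (ii) draw the word from a_t.
    for (int v = 0; v < ell; ++v) {
        std::discrete_distribution<int> topic(h.begin(), h.end());
        int t = topic(rng);
        std::discrete_distribution<int> word(a[t].begin(), a[t].end());
        int j = word(rng);
        std::cout << "word " << v << " = e_" << j << " (topic " << t << ")\n";
    }
    return 0;
}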

The first, second and third order moments for this model are given by:

$\mu := \mathbb{E}[\bf{x}_1] \in \mathbb{R}^d$

$\text{Pairs}_{\alpha_0} := \mathbb{E}[\bf{x}_1 \otimes \bf{x}_2] - \frac{\alpha_0}{\alpha_0+1} \mu \otimes \mu \in \mathbb{R}^{d \times d}$

$\text{Triplets}_{\alpha_0} := \mathbb{E}[\bf{x}_1 \otimes \bf{x}_2 \otimes \bf{x}_3] - \frac{\alpha_0}{\alpha_0+2}\big( \mathbb{E}[\bf{x}_1 \otimes \bf{x}_2 \otimes \mu] + \mathbb{E}[\bf{x}_1 \otimes \mu \otimes \bf{x}_3] + \mathbb{E}[\mu \otimes \bf{x}_2 \otimes \bf{x}_3] \big) + \frac{2\alpha_0^2}{(\alpha_0+2)(\alpha_0+1)} \mu \otimes \mu \otimes \mu \in \mathbb{R}^{d \times d \times d}$
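The reason these particular combinations are useful is that, under the model, they reduce to low-rank symmetric forms in the topic-word distributions $a_i$ (these are the standard method-of-moments identities for LDA):

$\text{Pairs}_{\alpha_0} = \sum_{i \in [k]} \frac{\alpha_i}{(\alpha_0+1)\alpha_0} \, a_i \otimes a_i, \qquad \text{Triplets}_{\alpha_0} = \sum_{i \in [k]} \frac{2\alpha_i}{(\alpha_0+2)(\alpha_0+1)\alpha_0} \, a_i \otimes a_i \otimes a_i$

Hence recovering the $a_i$ and $\alpha_i$ amounts to decomposing these moments, which is exactly what the whitening and ALS steps below accomplish.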

The code computes the above moments from input frequency vectors of documents, estimates the parameters of the model, and performs decoding of topics on a set of test data to assess goodness of fit. The code also provides the option of supplying the second and third order moments directly as inputs.

The first section of the document describes how to compile and run the code. It also defines the input arguments to the model. The second section defines the tensor manipulation functions available in the code.

Compilation and Usage

This document will walk you through the compilation steps and teach you how to define the inputs. We will then generate a set of synthetic data, pass it to the code, and inspect the output.

First clone the repository and pull its contents:

The source files are contained in the TopicModelingSingleNodeALS/TopicModel/TopicModel folder. There is also a readme.md file which provides an overall description of the code and its inputs. The SyntheticDataGenerator.m file creates synthetic data from the model that can be passed to the code.

We will first generate a set of synthetic data and then compile the source code and run it on that data. First cd into the directory of the code:


In [ ]:
cd (PATH_TO_topicModeling)/TopicModelingSingleNodeALS/TopicModel

First of all, generate some synthetic data to play with. Run the MATLAB script SyntheticDataGenerator.m. This will create 10000 (NX) training and 100 (NX_test) test documents. The vocabulary size (NA) is set to 100 and the number of topics (KHID) is set to 3. The mixing parameter $\alpha_0$ is set to 0.01. You can easily modify these values through the m-file. The output files will be written to "(PATH_TO_topicModeling)/TopicModelingSingleNodeALS/TopicModel/datasets/synthetic/" and are called "samples_train.txt" and "samples_test.txt". Both files contain three columns: the first column is the document ID, the second column is the word ID and the third column is the word count. Each line in either output file therefore indicates the number of times a specific word appears in a specific document.
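For instance, the first few lines of "samples_train.txt" could look like the following (the values shown here are made up for illustration; the columns are document ID, word ID and word count):

1 4 2
1 17 1
1 23 3
2 4 1
2 98 2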


In [ ]:
cd (PATH_TO_topicModeling)/TopicModelingSingleNodeALS/TopicModel/TopicModel
Next, compile the files. Note that while making the executable you should have a folder named dependency that contains the Eigen library in the following path: "../../dependency/Eigen". For topic modeling purposes, compile using the following command, which produces the output shown below it:

In [ ]:
make exe-topicmodel

g++ -o exe-topicmodel TopicModel.cpp IOfn.cpp Probability.cpp Pvalue.cpp Spectral.cpp stdafx.cpp Util.cpp -O3 -g -I. -Wno-write-strings -DNDEBUG -std=c++0x

This will create an executable file named "exe-topicmodel" in the source directory (your current directory). You now need to specify the input arguments to the executable, which will give you the desired outputs. There are a total of 15 input arguments, listed below in order. The last four arguments (Arguments 12 through 15) are optional; leave them blank if you do not wish to provide those inputs. An example invocation that uses the optional arguments follows the list.

InputArgument 1: (NX) is the training sample size
InputArgument 2: (NX_test) is the test sample size
InputArgument 3: (NA) is the vocabulary size
InputArgument 4: (KHID) is the number of topics you want to learn
InputArgument 5: (alpha0) is the mixing parameter, usually set to < 1
InputArgument 6: (DATATYPE) denotes the index convention. Set it to 1 if the data is compatible with MATLAB (array indexing starts from 1) and set it to 0 if the data is compatible with C++ (array indexing starts from 0)

InputArgument 7: (trainSamplePATH) Path to the training samples text file
InputArgument 8: (testSamplePATH) Path to the test samples text file

InputArgument 9: (outputAlphaPATH) denotes the filename/filepath for the estimated topic marginal distribution
InputArgument 10: (outputBetaPATH) denotes the filename/filepath for estimated topic-word probability matrix
InputArgument 11: (outputLogPATH) denotes the filename/filepath for the inferred topic proportions per document for the test data.

InputArgument 12: (M2_yes) is a boolean input indicating whether you have direct access to the second order moment. This parameter is set to "false" by default. Set it to "true" if you want to directly input the second order moment.
InputArgument 13: (M2PATH) indicates the filename/filepath of the file containing the second order moment. Give this input only if you have set M2_yes to "true".
InputArgument 14: (M3_yes) is a boolean input indicating whether you have direct access to the third order moment. This parameter is set to "false" by default. Set it to "true" if you want to directly input the third order moment.
InputArgument 15: (M3PATH) indicates the filename/filepath of the file containing the third order moment. Give this input only if you have set M3_yes to "true".
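For example, if you have precomputed moment files, an invocation supplying the optional arguments could look like the following (the two moment file paths here are hypothetical):

In [ ]:
./exe-topicmodel 10000 100 100 3 0.01 1 ../datasets/synthetic/samples_train.txt ../datasets/synthetic/samples_test.txt ../datasets/synthetic/result/corpus_topic_weights.txt ../datasets/synthetic/result/topic_word_matrix.txt ../datasets/synthetic/result/inferred_topic_weights_per_document.txt true ../datasets/synthetic/M2.txt true ../datasets/synthetic/M3.txt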

You can now run the topic modeling example on the data generated by the MATLAB script by typing the following in the command line:


In [ ]:
make runtopic

./exe-topicmodel 10000 100 100 3 0.01 1 ../datasets/synthetic/samples_train.txt ../datasets/synthetic/samples_test.txt ../datasets/synthetic/result/corpus_topic_weights.txt ../datasets/synthetic/result/topic_word_matrix.txt ../datasets/synthetic/result/inferred_topic_weights_per_document.txt

This will create three text files in the paths specified in arguments 9 through 11.

reading Word Counts Training Data
Exec Time reading matrices before preproc = 5.8794000000e-01 (Seconds)
(1) Whitening--------------------------
time taken by whitening = 1.2250300000e-01 (Seconds)
(1.5) Matricization---------------------
(2) Tensor decomposition----------------
Running ALS 0
replace current eigenvalue and eigenvectors with this run
Running ALS 1
Running ALS 2
FINAL ERROR (0-th run) : 0.000259517
time taken by ALS = 1.0919700000e-01 (Seconds)
(3) Unwhitening-----------
time taken for post processing = 1.0100000000e-04 (Seconds)
(4) Writing results----------
(5) Decoding-----------
reading Word Counts Test Data
time taken for decoding = 8.1884000000e-02 (Seconds)
(6) Program over------------

time taken for execution of the whole program = 2.3180100000e-01 (Seconds)

As stated, these three files contain the estimated parameters of the model: the topic marginal distribution ("corpus_topic_weights.txt"), the topic-word probability matrix ("topic_word_matrix.txt") and the inferred topic proportions for the test documents ("inferred_topic_weights_per_document.txt").

Tensor manipulation functions

It is also useful to give an overview of some of the functions in the code. Most of the functions that perform tensor operations are contained in the file "Util.cpp" in the source directory. We use the Eigen library for matrix operations. The main functions are listed below:

Accumulate_M_mul_S:

This function computes the second order moment of the input matrix inputData and returns the output in matrix M2. The function's syntax is given in the code below. inputData is the input data in SparseMatrix format, RandomMat is a random matrix used for performing random projection, M2 is the returned second order moment, mu is the returned first order moment, and Lengths indicates the number of words in each document.


In [ ]:
void accumulate_M_mul_S(SparseMatrix<double> inputData, SparseMatrix<double> RandomMat, SparseMatrix<double> &M2, VectorXd &mu, VectorXd &Lengths)
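To make the estimated quantity concrete, here is a simplified, self-contained Eigen sketch of what this computation amounts to (dense matrices, no random projection; an illustration, not the library's implementation). It uses the standard unbiased estimator $\big(\bf{c}\bf{c}^\top - \text{diag}(\bf{c})\big)/\big(\ell(\ell-1)\big)$ of $\mathbb{E}[\bf{x}_1 \otimes \bf{x}_2]$ for a document with word-count vector $\bf{c}$ and length $\ell$:

In [ ]:
#include <Eigen/Dense>
using namespace Eigen;

// Sketch: estimate Pairs_{alpha0} = E[x1 (x) x2] - alpha0/(alpha0+1) mu (x) mu
// from per-document word-count vectors stored as the columns of `counts`.
MatrixXd estimate_pairs(const MatrixXd &counts, double alpha0) {
    const int d = counts.rows();              // vocabulary size
    const int n = counts.cols();              // number of documents
    MatrixXd E12 = MatrixXd::Zero(d, d);      // estimate of E[x1 (x) x2]
    VectorXd mu = VectorXd::Zero(d);          // estimate of E[x1]
    for (int t = 0; t < n; ++t) {
        VectorXd c = counts.col(t);
        double ell = c.sum();                 // document length
        MatrixXd cc = c * c.transpose();
        cc.diagonal() -= c;                   // drop "same word drawn twice" terms
        E12 += cc / (ell * (ell - 1.0));
        mu += c / ell;
    }
    E12 /= n;
    mu /= n;
    return E12 - (alpha0 / (alpha0 + 1.0)) * mu * mu.transpose();
}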
second_whiten_topic:

The pre-processing step mentioned in the first section is called the whitening transformation. It involves a truncated SVD (Singular Value Decomposition) of the second order moment matrix $\text{Pairs}_{\alpha_0}$, which results in whitened data denoted as:

$\bf{y}_i^{(t)} := \langle \bf{W},\bf{x}_i^{(t)} \rangle \in \mathbb{R}^k$

This function computes the whitening matrix of inputData, using either the data itself or its second order moment, and returns the computed whitening matrix in W. The function's syntax is given below. mu is the first order moment of the input data, Uw is the matrix passed as input to the unwhitening function that computes the eigenvectors (described below), Lengths indicates the number of words in each document, and M2_yes indicates whether you are directly providing the second order moment. If you are, input_M2 is the input second order moment.


In [ ]:
void second_whiten_topic(SparseMatrix<double> inputData, SparseMatrix<double> &W, VectorXd &mu, SparseMatrix<double> &Uw, SparseMatrix<double> &diag_Lw_sqrt_inv_s, VectorXd &Lengths, bool M2_yes, SparseMatrix<double> &input_M2)
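The following is a simplified, self-contained Eigen sketch of the core whitening computation, under the simplifying assumption that a dense estimate of $\text{Pairs}_{\alpha_0}$ is available (the library itself works on sparse data with randomized projections):

In [ ]:
#include <Eigen/Dense>
using namespace Eigen;

// Sketch: compute a whitening matrix W from a dense symmetric estimate M2 of
// Pairs_{alpha0}, so that W^T M2 W = I_k on the top-k eigenspace.
MatrixXd whitening_matrix(const MatrixXd &M2, int k) {
    SelfAdjointEigenSolver<MatrixXd> eig(M2);      // eigenvalues in increasing order
    MatrixXd U = eig.eigenvectors().rightCols(k);  // top-k eigenvectors (d x k)
    VectorXd s = eig.eigenvalues().tail(k);        // top-k eigenvalues
    return U * s.cwiseSqrt().cwiseInverse().asDiagonal();  // W = U * S^{-1/2}
}

Each whitened word $\bf{y} = \bf{W}^\top \bf{x}$ then lives in $\mathbb{R}^k$ rather than $\mathbb{R}^d$, which is what makes the subsequent tensor decomposition tractable.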
compute_M3_topic:

This function computes the third order moment of the input whitenedData based on the model parameters and the mixing weights of the Dirichlet distribution, and returns the output in tensor Ta. The code essentially computes the following quantity, where $\bf{y}_i$ is the whitened version of $\bf{x}_i$:

$T = \frac{(\alpha_0 + 1)(\alpha_0 + 2)}{2} \frac{1}{\vert \bf{X} \vert} \sum_{t \in [n]} \bf{y}_1^{(t)} \otimes \bf{y}_2^{(t)} \otimes \bf{y}_3^{(t)} + \alpha_0^2 \, \bf{\bar{y}}_1 \otimes \bf{\bar{y}}_2 \otimes \bf{\bar{y}}_3 - \frac{\alpha_0 (\alpha_0 + 1)}{2 \vert \bf{X} \vert} \sum_{t \in [n]} \big[ \bf{y}_1^{(t)} \otimes \bf{y}_2^{(t)} \otimes \bf{\bar{y}}_3 + \bf{y}_1^{(t)} \otimes \bf{\bar{y}}_2 \otimes \bf{y}_3^{(t)} + \bf{\bar{y}}_1 \otimes \bf{y}_2^{(t)} \otimes \bf{y}_3^{(t)} \big]$

In [ ]:
Compute_M3_topic(MatrixXd whitenedData, VectorXd whitenedMean, VectorXd Lengths, MatrixXd &Ta)
tensorDecom_batchALS:

Let $T$ have the following factorization. The goal of ALS is to perform this tensor decomposition and return $\bf{A}$ and $\Lambda$, a vector whose entries are the $\lambda_i$'s:

$\bf{T} = \sum_{i \in [k]} \lambda_i \bf{A}_i \otimes \bf{A}_i \otimes \bf{A}_i$

The ALS algorithm is an iterative procedure which in each iteration updates $\bf{A}$ using its previous value. The update equations are given below ($t$ indicates the iteration index):

$\bf{A}^{t+1} \longleftarrow \bf{A}^t + \beta \frac{1}{\vert \bf{X} \vert} \bf{y}_1 \bigg(\bf{y}_3^\top \bf{A}^t * \bf{y}_2^\top \bf{A}^{t} \bigg) \Bigg( \bigg( (\bf{A}^t)^\top \bf{A}^t * (\bf{A}^t)^\top \bf{A}^t \bigg)^\top \Bigg)^\dagger$

$\lambda_i^{t+1} \longleftarrow \| \bf{A}_i^{t+1} \| \, \| \bf{A}_i^{t+1} \| \, \| \bf{A}_i^{t+1} \|$ for each column $\bf{A}_i^{t+1}$ of $\bf{A}^{t+1}$

This function performs tensor decomposition of the input tensor T using batch ALS (Alternating Least Squares) and returns the components of the tensor in A_new and lambda. The following gives the syntax of the function used in the code. lambda is the vector of eigenvalues and A_new is the matrix of eigenvectors.


In [ ]:
void tensorDecom_batchALS(MatrixXd T, VectorXd &lambda, MatrixXd &A_new)
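To illustrate the update above in a self-contained way, here is a simplified Eigen sketch of one symmetric ALS sweep that operates on an explicit mode-1 matricization $T_{(1)} \in \mathbb{R}^{k \times k^2}$ of the whitened tensor (the library's batch implementation streams over the data instead; this is an illustration, not its code):

In [ ]:
#include <Eigen/Dense>
using namespace Eigen;

// Sketch: one ALS sweep for T = sum_i lambda_i A_i (x) A_i (x) A_i, given the
// mode-1 matricization T1 (k x k^2) and the current factor matrix A (k x k).
void als_step(const MatrixXd &T1, MatrixXd &A, VectorXd &lambda) {
    const int k = A.rows();
    // Khatri-Rao product: column i of KR is A_i (x) A_i, a k^2-vector.
    MatrixXd KR(k * k, k);
    for (int i = 0; i < k; ++i)
        for (int r = 0; r < k; ++r)
            KR.block(r * k, i, k, 1) = A(r, i) * A.col(i);
    // Least-squares update: A_new = T1 * KR * ((A^T A) .* (A^T A))^+ .
    MatrixXd G = (A.transpose() * A).array().square().matrix();
    MatrixXd A_new =
        T1 * KR * G.completeOrthogonalDecomposition().pseudoInverse();
    // Normalize columns; the column norms become the eigenvalue estimates.
    lambda.resize(k);
    for (int i = 0; i < k; ++i) {
        lambda(i) = A_new.col(i).norm();
        A_new.col(i) /= lambda(i);
    }
    A = A_new;
}

Iterating this step until the change in $\bf{A}$ is small gives the decomposition; the unwhitening step then maps the recovered components back to the word space.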

*Document prepared by Forough Arabshahi and edited by Animashree Anandkumar - Fall 2014*