Factor analysis: the big picture

The model

Factor analysis tries to account for the distribution of objects in a high-dimensional feature space by locating each object in a low-dimensional space, whose dimensions are traditionally called the factors. The set of high-dimensional vectors is a linear function of the set of low-dimensional ones, plus independent Gaussian noise. The intuition is that, in the original representation, lots of features are correlated with each other, so there are fewer true dimensions of variation than there are features.

Factor analysis is an old standby in statistics, and is implemented in most statistical software packages. Here, however, we can take advantage of our probabilistic programming language to define the FA model from scratch.

To get a bit more formal, we have to define some things:

  • $d$ is the number of features in the original representation, and $m$ is the number of factors. $n$ is the number of objects represented in the data. For the moment, we are choosing $m$ by guess-work.
  • $x$ is the $n$-by-$d$ matrix of data.
  • $z$ is the $n$-by-$m$ matrix of latent representations.
  • $\mu$ is the $d$-by-$1$ vector of feature means.
  • $W$ is a $d$-by-$m$ matrix of weights, which we will assume are independent draws from a zero-mean normal distribution with a shared, stipulated (non-random) variance, $\xi_W^2$.
  • The $i$th of the $d$ features has a zero-mean noise distribution with a randomly-drawn, feature-specific variance $\sigma_{i}^2$. Each of these is drawn from a gamma distribution with shape parameter $\alpha$ and scale parameter $\beta$, both stipulated. We collect the set of $\sigma_{i}^2$ into a $d$-by-$d$ diagonal matrix $\Sigma$.
$$z_{object,factor} \sim N(0,1)$$

$$\sigma_{feature}^2 \sim \mathrm{gamma}(\alpha,\beta)$$

$$x_{object,\cdot} \sim N(W z_{object,\cdot} + \mu,\ \Sigma)$$

If that's a little too terse, hold on; things will become clearer as we proceed. For now, it's important to keep in mind that we observe the data $x$, and we infer the latent representations $z$ and the "factor loadings" $W$.

What we want to do with the model

Intuitively, factor analysis is doing a good job of compressing high-dimensional representations to low-dimensional ones only if pair-wise similarities between objects are preserved, that is, accurately represented by distances between the low-dimensional representations. One very useful summary of similarities is a dendrogram: a graphical representation of a tree formed by repeatedly linking the subtrees (initially $n$ single-object "trees") that are closest together, until one tree remains.

So, as a first, "sanity check" test of our model, we will try to make a tree that looks reasonable, using distances between latent variable representations of objects.
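To make "repeatedly linking" concrete, here is a minimal sketch of single-linkage agglomerative clustering over a hash of pair-wise distances. The conventions are mine (distance keys as (cons name1 name2) pairs, clusters as nested lists of names), not anything from the utilities used below.

;; Minimal single-linkage clustering sketch (hypothetical helper, not
;; from table_utils.rkt). `dists` maps (cons name1 name2) to a distance;
;; clusters are represented as nested lists of names.
(define (cluster-members c)
  (if (pair? c) (flatten c) (list c)))

(define (cluster-distance dists c1 c2)
  ;; Single linkage: distance between the closest pair of members.
  (apply min
         (for*/list ([a (cluster-members c1)]
                     [b (cluster-members c2)])
           (hash-ref dists (cons a b)
                     (lambda () (hash-ref dists (cons b a)))))))

(define (closest-pair dists clusters)
  ;; The two distinct clusters at minimum linkage distance.
  (define best #f)
  (for* ([x clusters] [y clusters] #:unless (equal? x y))
    (define d (cluster-distance dists x y))
    (when (or (not best) (< d (first best)))
      (set! best (list d x y))))
  (values (second best) (third best)))

(define (agglomerate dists items)
  ;; Repeatedly merge the two closest clusters until one tree remains.
  (let loop ([clusters items])
    (if (<= (length clusters) 1)
        (first clusters)
        (let-values ([(c1 c2) (closest-pair dists clusters)])
          (loop (cons (list c1 c2)
                      (remove c2 (remove c1 clusters))))))))

With the distance table computed near the end of this notebook, (agglomerate mean-dists animal-names) would return one nested list whose bracketing is the dendrogram, assuming mean-dists is keyed by animal pairs.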

Test data

To build and test a factor analysis model, we will use a small, intuitively interpretable data set, collected by psychologist Daniel Osherson (and...): the average rating, across many human subjects, of the degree to which each of 50 species has each of 85 features. These data are publicly available at (...).

A few "requires"

We will need a few special functions that don't come with Gamble; they live in table_utils.rkt. This section is not very interesting, but should be completed at some point.


In [1]:
(require racket/string
         racket/list
         racket/vector
         "table_utils.rkt")

Some mysterious parameters

I need a story about where these numbers are coming from. For now: shape and scale parametrize the gamma prior on the per-attribute noise, and factor-load-sigma is the stipulated standard deviation of the factor-loading prior (the $\xi_W$ above).


In [2]:
(define shape 0.5)
(define scale 7)
(define factor-load-sigma 5)

Reading in the data

The Osherson files are in an odd format, so I have to take the second field from each row.

After loading from these files, the top-level environment will contain

  • animal-names, a list of 50 animal species.
  • attribute-names, a list of 85 animal attributes.
  • animal-data-table, a hash from animal/attribute pairs to numerical values.

TODO: Maybe just put all of this info in one R-readable tsv file. Then we can use standard functions from file_utils.rkt, and hide this one-off junk. This will require an import-tsv function parallel to the export-tsv one I have now.
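For reference, an import-tsv along those lines might look like this sketch (hypothetical; it assumes a header row naming the attributes, one row per animal with the animal name in the first column, and missing values written as a non-numeric token such as "NA"):

;; Hypothetical import-tsv sketch, parallel to export-tsv. Returns a
;; hash from (animal . attribute) pairs to numbers, like the table
;; built below; non-numeric fields (missing values) are skipped.
(define (import-tsv filename)
  (with-input-from-file filename
    (lambda ()
      (define attributes (rest (string-split (read-line) "\t")))
      (define tbl (make-hash))
      (for ([line (in-lines)])
        (define fields (string-split line "\t"))
        (define animal (first fields))
        (for ([attribute attributes]
              [val (rest fields)])
          (define num (string->number val))
          (when num
            (hash-set! tbl (cons animal attribute) num))))
      tbl)))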


In [3]:
(define animal-names
  (with-input-from-file "animal_classes.txt"
    (lambda ()
      (for/list ([line (in-lines)])
        (let ([parts (string-split line #px"\\s+")])
          (second parts))))))

(define attribute-names
  (with-input-from-file "animal_attributes.txt"
    (lambda ()
      (for/list ([line (in-lines)])
        (let ([parts (string-split line #px"\\s+")])
          (second parts))))))


;; Load the data once, into a hash table
(define animal-data-table
  (let ([tbl (make-hash)])
    (with-input-from-file "animal_features.txt"
      (lambda ()
        ;; Each row is an animal.
        (for ([line (in-lines)]
              [animal animal-names])
          (let ([attr-vals (string-split line #px"\\s+")])
            ;; Each column is an attribute.
            (for ([attribute attribute-names]
                  [val attr-vals])
              (unless (missing-value? val)
                (hash-set! tbl (cons animal attribute) (string->number val))))))))
    tbl))
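
A quick sanity check on the load, using the counts from the data-set description above:

;; Expect 50 animals and 85 attributes (see the data-set description).
(length animal-names)     ; => 50
(length attribute-names)  ; => 85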

Setting up a few things

Factor analysis assumes zero-centered data, so we will need the attribute means.

I index by factor name strings rather than by integers, as a traditional array-based implementation would. The hash table doesn't care.
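
mean-value-of comes from table_utils.rkt; a version consistent with how it is used here might look like this sketch (mine, not the actual utility):

;; Sketch of mean-value-of: average an attribute's recorded values
;; over all animals (missing pairs are simply absent from the table).
(define (mean-value-of attribute tbl)
  (define vals
    (for*/list ([animal animal-names]
                [val (in-value (hash-ref tbl (cons animal attribute) #f))]
                #:when val)
      val))
  (/ (apply + vals) (length vals)))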


In [4]:
(define mean-value
  (make-hash
   (map (lambda (name) (cons name (mean-value-of name animal-data-table)))
        attribute-names)))

(define factors (list "factor1" "factor2" "factor3" "factor4" "factor5"))
;; "factor6" "factor7" "factor8" "factor9" "factor10" "factor11" "factor12"

Collecting samples from the posterior

What should we sample? What exactly are we interested in? Ultimately, the latent representations of the animals, or some function of them that survives the symmetries discussed next.

Sampling may move between symmetric modes: the factor-analysis likelihood is unchanged under rotations and sign flips of the factors, so averaging the latent vectors themselves may not behave well. Two options:

  1. Use MAP to pick one set of latent vectors.
  2. Compute per-sample inter-animal distances instead of raw latent vectors; distances are preserved across the symmetries (see the sketch after this list).
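
Option 2 is what mean-distance-table will compute below. Here is a minimal sketch of the idea, assuming each sample is, as returned by cross-prod-hash, a hash from (animal . factor) pairs to latent values, and that samples is a list (vector->list the output of generate-samples if necessary):

;; Sketch of option 2 (the real mean-distance-table lives in the
;; utilities). Euclidean distance between two animals' latent vectors
;; in a single sample:
(define (latent-distance sample a1 a2 factors)
  (sqrt (for/sum ([f factors])
          (expt (- (hash-ref sample (cons a1 f))
                   (hash-ref sample (cons a2 f)))
                2))))

;; Average each pair-wise distance over samples; unlike the raw latent
;; vectors, these averages are unaffected by rotations and sign flips.
(define (mean-distances samples factors names)
  (define tbl (make-hash))
  (for* ([a1 names] [a2 names] #:unless (equal? a1 a2))
    (hash-set! tbl (cons a1 a2)
               (/ (for/sum ([s samples])
                    (latent-distance s a1 a2 factors))
                  (length samples))))
  tbl)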

Finally, the model!


In [6]:
(define animals-fa-sampler
  (mh-sampler
   
   ;; We don't know how noisy each observable attribute is.
   (defmem (attr-sigma attribute) (gamma shape scale))
   
   ;; Factors are always assumed independent standard normal.
   (defmem (factor-value object factor) (normal 0 1))
   
   ;; We don't know how variable the factor loads are. But we are guessing.
   (defmem (factor-load attribute factor) (normal 0 factor-load-sigma))
   
   ;; The observable value is the up-projection of the factor vector, plus noise.
   (defmem (attribute-value object attribute)
     (normal (+ (hash-ref mean-value attribute)
                (for/sum ([factor factors])
                  (* (factor-load attribute factor)
                     (factor-value object factor)))) 
             (attr-sigma attribute)))
   
   
   (for ([animal animal-names])
     (for ([attribute attribute-names])
       (let ([val (hash-ref animal-data-table (cons animal attribute) #f)])
         (when val
           (observe (attribute-value animal attribute) val)))))                
                 
   ;; Return the latent representations for all animals.
   (cross-prod-hash animal-names factors factor-value)
   #:transition (slice)))



Musings on the results

It took hours of sampling (a 12K-sample burn-in) to get nice samples.

(define smpls (generate-samples animals-fa-sampler 100 #:burn 1000 #:thin 50))
(define mean-dists (mean-distance-table smpls factors animal-names))


Here's the clustering produced:

[dendrogram figure omitted]

Two lessons:

  1. We should really be using gradients when we can, i.e., HMC.

  2. We should allow ad hoc initialization; in this case, a fast ML or MAP fit from some standard stats package.

Now that I have the plain-vanilla version working, it would be nice to try one with a sparse factor-loading matrix, which should be more interpretable (in the spirit of ICA, I think).
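
As a sketch of that direction, the loading prior in the model above could be replaced with a spike-and-slab version, one standard sparsity prior (the 0.2 inclusion probability is an arbitrary choice of mine, and this assumes Gamble's flip):

;; Hypothetical spike-and-slab replacement for the factor-load prior:
;; each loading is exactly zero unless its inclusion coin comes up true.
(defmem (factor-load attribute factor)
  (if (flip 0.2)  ; arbitrary inclusion probability
      (normal 0 factor-load-sigma)
      0))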


In [22]:
;; Small settings so the notebook re-runs quickly; the run discussed
;; above used 100 samples with #:burn 1000 and #:thin 50.
(define smpls (generate-samples animals-fa-sampler 10 #:burn 10 #:thin 2))





In [24]:
(define mean-dists (mean-distance-table smpls factors animal-names))




