Factor analysis tries to account for the distribution of objects in a high-dimensional feature space by locating each object in a low-dimensional space, whose dimensions are traditionally called the factors. The set of high-dimensional vectors is a linear function of the set of low-dimensional ones, plus independent Gaussian noise. The intuition is that, in the original representation, lots of features are correlated with each other, so there are fewer true dimensions of variation than there are features.
Factor analysis is an old standby in statistics, and is implemented in most statistical software packages. Here, however, we can take advantage of our probabilistic programming language to define the FA model from scratch.
To get a bit more formal, we have to define some things:
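Writing $x_{id}$ for the observed value of attribute $d$ on object $i$, $z_{ik}$ for the $k$-th coordinate of that object's latent representation, and $W_{dk}$ for the loading of attribute $d$ on factor $k$, the model we will code below is

$$ z_{ik} \sim \mathcal{N}(0, 1), \qquad W_{dk} \sim \mathcal{N}(0, \sigma_W^2), \qquad \sigma_d \sim \mathrm{Gamma}(a, b), \qquad x_{id} \sim \mathcal{N}\Big(\mu_d + \sum_k W_{dk}\, z_{ik},\; \sigma_d^2\Big), $$

where $\mu_d$ is the empirical mean of attribute $d$, and the Gamma shape $a$, the Gamma scale $b$, and the loading scale $\sigma_W$ are fixed hyperparameters.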
If that's a little too terse, hold on; things will become more clear as we proceed. For now, it's important to keep in mind that we observe the data $x$, and we infer the latent representations $z$ and the "factor loadings" $W$.
Intuitively, factor analysis is doing a good job of compressing high-dimensional representations into low-dimensional ones only if pairwise similarities between objects are preserved, i.e. accurately reflected by the distances between their low-dimensional representations. One very useful summary of similarities is a dendrogram: a graphical representation of a tree formed by repeatedly linking the subtrees (initially $n$ single-object "trees") that are closest together.
So, as a first, "sanity check" test of our model, we will try to make a tree that looks reasonable, using distances between latent variable representations of objects.
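To make "repeatedly linking subtrees" concrete, here is a minimal single-linkage agglomerative clustering routine. It is only a sketch: the function names and the nested-list tree representation are mine, and this is not the clustering code used to draw the dendrogram later.

;; A minimal single-linkage agglomerative clustering routine (a sketch only).
;; A "tree" is either a single object or a two-element list of subtrees.

;; All leaves (original objects) under a tree.
(define (tree-leaves t)
  (if (pair? t)
      (append (tree-leaves (first t)) (tree-leaves (second t)))
      (list t)))

;; Single linkage: the distance between two trees is the distance between
;; their closest pair of leaves, where dist maps two objects to a number.
(define (tree-distance dist t1 t2)
  (apply min
         (for*/list ([a (tree-leaves t1)]
                     [b (tree-leaves t2)])
           (dist a b))))

;; The pair of distinct trees with the smallest single-linkage distance.
(define (closest-pair trees dist)
  (argmin (lambda (p) (tree-distance dist (first p) (second p)))
          (for*/list ([u trees] [v trees] #:unless (eq? u v))
            (list u v))))

;; Repeatedly merge the two closest trees until only one remains.
(define (agglomerate items dist)
  (let loop ([trees items])
    (if (<= (length trees) 1)
        (first trees)
        (let ([p (closest-pair trees dist)])
          (loop (cons p (remove (second p) (remove (first p) trees))))))))

Given the mean latent distances computed later, this could be called as (agglomerate animal-names dist) for a suitable two-argument distance function dist.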
To build and test a factor analysis model, we will use a small, intuitively interpretable data set collected by psychologist Daniel Osherson (and...): the average rating, over many human subjects, of the degree to which each of 50 animal species has each of 85 features. The data are publicly available at (...).
We will need a few helper functions that don't come with Gamble (imported from table_utils.rkt below). This part is not very interesting, but should be completed at some point...
In [1]:
(require racket/string
         racket/list
         racket/vector
         "table_utils.rkt")
In [2]:
;; Shape and scale of the Gamma prior on each attribute's noise level.
(define shape 0.5)
(define scale 7)
;; Prior standard deviation of the factor loadings.
(define factor-load-sigma 5)
The Osherson files are in an odd format, so I have to take the second field from each row.
After loading from these files, the top-level environment will contain:
- animal-names, a list of 50 animal species.
- attribute-names, a list of 85 animal attributes.
- animal-data-table, a hash from animal/attribute pairs to numerical values.

TODO: Maybe just put all of this info in one R-readable tsv file. Then we can use standard functions from file_utils.rkt, and hide this one-off junk. This will require an import-tsv function parallel to the export-tsv one I have now; a rough sketch follows.
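For what it's worth, an import-tsv along those lines could be as simple as the sketch below. It assumes a file whose first row is a tab-separated header of attribute names, whose first column is the object name, and whose missing values are written as a marker recognized by missing-value? (all assumptions about a file that doesn't exist yet).

;; Sketch of an import-tsv: returns a hash from (object . attribute) pairs
;; to numbers.  The file layout described above is an assumption.
(define (import-tsv path)
  (with-input-from-file path
    (lambda ()
      (let ([attributes (rest (string-split (read-line) "\t"))]
            [tbl (make-hash)])
        (for ([line (in-lines)])
          (let ([fields (string-split line "\t")])
            (for ([attribute attributes]
                  [val (rest fields)])
              (unless (missing-value? val)
                (hash-set! tbl (cons (first fields) attribute)
                           (string->number val))))))
        tbl))))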
In [3]:
;; Keep the second whitespace-separated field from each line (the species name).
(define animal-names
  (with-input-from-file "animal_classes.txt"
    (lambda ()
      (for/list ([line (in-lines)])
        (let ([parts (string-split line #px"\\s+")])
          (second parts))))))

;; Same format: keep the second field from each line (the attribute name).
(define attribute-names
  (with-input-from-file "animal_attributes.txt"
    (lambda ()
      (for/list ([line (in-lines)])
        (let ([parts (string-split line #px"\\s+")])
          (second parts))))))
;; Load the data once, into a hash table
(define animal-data-table
  (let ([tbl (make-hash)])
    (begin
      (with-input-from-file "animal_features.txt"
        (lambda ()
          ;; Each row is an animal.
          (for ([line (in-lines)]
                [animal animal-names])
            (let ([attr-vals (string-split line #px"\\s+")])
              ;; Each column is an attribute.
              (for ([attribute attribute-names]
                    [val attr-vals])
                (unless (missing-value? val)
                  (hash-set! tbl (cons animal attribute) (string->number val))))))))
      tbl)))
In [4]:
;; Empirical mean of each attribute; used below as the offset in the observation model.
(define mean-value
  (make-hash
   (map (lambda (name) (cons name (mean-value-of name animal-data-table)))
        attribute-names)))

;; Names of the latent factors; more can be enabled by uncommenting.
(define factors (list "factor1" "factor2" "factor3" "factor4" "factor5")) ; "factor6"))
; "factor7" "factor8" "factor9" "factor10" "factor11" "factor12"))
What should we sample? What exactly are we interested in? [...]
Sampling may move between symmetric modes (the latent space is only identified up to rotations and reflections of the factors), so averaging the latent vectors themselves may not behave well. Two options: [...] One symmetry-respecting choice is sketched below.
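One summary that respects these symmetries is the table of pairwise distances between latent representations, since distances are unchanged by rotating or reflecting the factor axes. Here is a sketch of how such a table could be computed, assuming each posterior sample is a hash from (animal . factor) pairs to values; the mean-distance-table helper actually used below may be implemented differently.

;; Sketch of a symmetry-invariant summary.  Assumes each posterior sample is a
;; hash from (animal . factor) pairs to values (an assumption about the
;; sampler's return format).
(define (latent-distance sample factors a b)
  ;; Euclidean distance between the latent vectors of animals a and b.
  (sqrt (for/sum ([factor factors])
          (expt (- (hash-ref sample (cons a factor))
                   (hash-ref sample (cons b factor)))
                2))))

(define (mean-latent-distances samples factors names)
  ;; Average the pairwise latent distances over all samples.
  (let ([smpls (if (vector? samples) (vector->list samples) samples)]
        [tbl (make-hash)])
    (for* ([a names] [b names])
      (hash-set! tbl (cons a b)
                 (/ (for/sum ([s smpls]) (latent-distance s factors a b))
                    (length smpls))))
    tbl))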
In [6]:
(define animals-fa-sampler
  (mh-sampler
   ;; We don't know how noisy each observable attribute is.
   (defmem (attr-sigma attribute) (gamma shape scale))
   ;; Factors are always assumed independent standard normal.
   (defmem (factor-value object factor) (normal 0 1))
   ;; We don't know how variable the factor loadings are, so we just fix a prior scale.
   (defmem (factor-load attribute factor) (normal 0 factor-load-sigma))
   ;; The observable value is the up-projection of the factor vector, plus noise.
   (defmem (attribute-value object attribute)
     (normal (+ (hash-ref mean-value attribute)
                (for/sum ([factor factors])
                  (* (factor-load attribute factor)
                     (factor-value object factor))))
             (attr-sigma attribute)))
   ;; Condition on the observed (non-missing) attribute values.
   (for ([animal animal-names])
     (for ([attribute attribute-names])
       (let ([val (hash-ref animal-data-table (cons animal attribute) #f)])
         (when val
           (observe (attribute-value animal attribute) val)))))
   ;; Return the latent representations for all animals.
   (cross-prod-hash animal-names factors factor-value)
   #:transition (slice)))
It took hours of sampling (a 12K-sample burn-in) to get nice samples.
(define smpls (generate-samples animals-fa-sampler 100 #:burn 1000 #:thin 50))
(define mean-dists (mean-distance-table smpls factors animal-names))
Here's the clustering produced:
Two lessons:
- We should really be using gradients when we can, i.e. HMC.
- We should allow ad hoc initialization; in this case, using a fast ML or MAP fit from some standard stats package.
Now that the plain vanilla version works, it would be nice to try one with a sparse factor-loading matrix, which should be more interpretable (ICA, I think); a rough sketch of one way to do that follows.
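One simple way to sketch that is a spike-and-slab style mixture prior on each loading, as a drop-in replacement for the factor-load definition inside animals-fa-sampler. The 0.8 mixture weight and the 0.1 spike width below are made-up values, the sketch assumes Gamble's flip primitive, and this is only a gesture toward sparse factor analysis rather than a proper sparse or ICA model.

;; Sketch: a sparser prior on the loadings.  With probability 0.8 a loading
;; comes from a narrow "spike" around zero, otherwise from the wide "slab"
;; used above.  The 0.8 weight and 0.1 spike width are made-up values.
(defmem (factor-load attribute factor)
  (if (flip 0.8)
      (normal 0 0.1)                  ; spike: loading effectively switched off
      (normal 0 factor-load-sigma)))  ; slab: the wide prior used before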
In [22]:
(define smpls (generate-samples animals-fa-sampler 10 #:burn 10 #:thin 2))
In [24]:
(define mean-dists (mean-distance-table smpls factors animal-names))