In [1]:
import os
import sys
sys.path.append(os.path.pardir)
%matplotlib inline
import numpy as np
from fa_kit import FactorAnalysis
from fa_kit import plotting as fa_plotting
This function generates samples of fake data and stores them in an n_samples-by-n_features matrix, data.
The data is constructed so that there's structure for the factor analysis to reveal. Specifically, there are 'bands' of 10 adjacent features that are positively correlated with one another. And just to spice things up a bit more, each of these bands slightly overlaps with its neighbor.
In [2]:
def make_random_data(n_samp=10000, n_feat=100):
    """
    Make some random data with correlated features.
    """
    data = np.random.randn(n_samp, n_feat)

    signal_width = 10
    signal_overlap = 2
    step_size = signal_width - signal_overlap

    for i in range(0, data.shape[1], step_size):
        shared_signal = 0.3 * np.random.randn(n_samp, 1)
        data[:, i:(i + signal_width)] += shared_signal

    return data

data = make_random_data()
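As a quick sanity check (this snippet is my addition, not part of the original analysis), we can confirm the banded structure from the empirical correlation matrix: adjacent features share a band's common signal, so their correlation should sit around 0.09/1.09, or about 0.08, while features in different bands should be roughly uncorrelated.

# Sanity check: compare within-band vs. across-band correlations
corr = np.corrcoef(data.T)

# adjacent features fall within the same 10-wide band
within_band = np.mean([corr[i, i + 1] for i in range(9)])

# features 50 apart always fall in different bands
across_bands = np.mean([corr[i, i + 50] for i in range(9)])

print('within-band corr: %.3f' % within_band)   # around 0.08
print('across-band corr: %.3f' % across_bands)  # around 0.00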
The function run_pipeline will take a set of data and run through each of the steps in a factor analysis:
- Calculate the n_features-by-n_features correlation matrix
- Extract the components
- Determine how many components to retain
- Re-extract the retained components using principal axis factoring (PAF)
- Apply a factor rotation
- Plot a summary figure
In [3]:
def run_pipeline(data, retain_method='broken_stick',
                 rotation_method='varimax', **kwargs):

    # Set up the factor analysis object, indicate how to calculate
    # the correlation matrix out of this input data
    fa = FactorAnalysis.load_data_samples(
        data,
        preproc_demean=True,
        preproc_scale=True
        )

    # Extract the components
    fa.extract_components()

    # Calculate how many components to retain
    # You can use any of these methods:
    #   'top_n', 'top_pct', 'kaiser', 'broken_stick'
    fa.find_comps_to_retain(
        method=retain_method,
        **kwargs
        )

    # Once you know how many to retain, re-extract with PAF
    fa.reextract_using_paf()

    # Apply factor rotation
    # Right now there are both 'varimax' and 'quartimax'
    fa.rotate_components(
        method=rotation_method
        )

    # Plot summary figure
    fig_summary = fa_plotting.graph_summary(fa)

    return fig_summary
Below, we see the results when we run the pipeline with the 'keep the top 5 components' retention method. The top panel in the figure is a scree plot that shows how the first 30 eigenvalues are distributed, with a dotted line marking our top-5 cutoff. Looking at the plot, we should probably extract more than 5 components -- there's a natural dropoff after the 12th component.
The middle panel shows the five unrotated components, and the lower panel shows the five retained components after applying PAF and factor rotation. Even though we've under-extracted the number of components we need, there's still some clear structure, with adjacent features loading onto the same components.
In [4]:
fig_topn = run_pipeline(data, retain_method='top_n', num_keep=5)
Up next, we run the analysis using top_pct, which keeps however many components are needed to capture a given proportion of the overall variance. We've set our cutoff to retain the top 20% of the overall signal.
The top panel has changed: it now shows the cumulative proportion of variance explained, with our cutoff placed at the 20% mark. In this view it's harder to see the natural cutoff after the 12th component, but it's still there, and our 20% retention criterion almost reaches it, keeping 11 components. The lower two panels are the same as in the previous figure, but with more components retained, the banded structure of the input becomes more apparent.
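For intuition, here's a minimal numpy sketch of the idea behind top_pct (my illustration, not fa_kit's actual implementation): retain components until the cumulative eigenvalue proportion reaches the cutoff.

# Sketch of the 'top_pct' idea (illustration only, not fa_kit's code):
# retain components until cumulative variance explained reaches pct_keep
corr = np.corrcoef(data.T)
eigvals = np.linalg.eigvalsh(corr)[::-1]         # eigenvalues, descending
cum_prop = np.cumsum(eigvals) / np.sum(eigvals)  # cumulative proportion
n_retain_pct = int(np.searchsorted(cum_prop, 0.2)) + 1
print('top_pct(0.2) retains %d components' % n_retain_pct)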
In [5]:
fig_toppct = run_pipeline(data, retain_method='top_pct', pct_keep=0.2)
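Next up is the 'kaiser' retention method, which keeps every component whose eigenvalue exceeds 1.0, the average eigenvalue of a correlation matrix. Here's a minimal sketch of that rule (again my illustration, not fa_kit's code):

# Sketch of the Kaiser criterion (illustration only, not fa_kit's code):
# a correlation matrix has average eigenvalue 1.0, so keep anything above it
corr = np.corrcoef(data.T)
eigvals = np.linalg.eigvalsh(corr)
n_retain_kaiser = int(np.sum(eigvals > 1.0))
print('kaiser retains %d components' % n_retain_kaiser)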
In [6]:
fig_kaiser = run_pipeline(data, retain_method='kaiser')
The distribution of eigenvalues in a covariance/correlation matrix often follows what looks like a broken-stick distribution. Anecdotally, I've had a lot of luck with this method, but I often found that the broken stick didn't quite line up the way you'd expect. So I've cooked up this variant that does a best fit of a broken stick to the eigenvalue distribution. There's a little bit of magic in how it works to make sure that the fit isn't overly sensitive to the values of the first couple of components (because those are the ones we expect to deviate from the broken stick and capture structure/signal).
Another benefit of this modified broken stick is that the method can be used with other types of n_features-by-n_features feature similarity matrices, such as co-occurrence matrices, mutual information scores, etc.
However, as you'll see in Episode 2, it doesn't always do what you want. It's still a work in progress, but overall I'm really happy with how well it does as a fully automatic method for calculating the number of components to retain.
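For reference, the classic (unfitted) broken-stick rule expects the k-th of n eigenvalue proportions to be b_k = (1/n) * sum_{i=k..n} 1/i, and retains components while their observed proportion exceeds that expectation. Here's a minimal sketch of that baseline rule (my illustration; fa_kit's 'broken_stick' method fits the stick to the eigenvalue distribution rather than using this fixed form):

# Classic broken-stick baseline (illustration; fa_kit's method instead
# does a best fit of the stick to the eigenvalue distribution)
def broken_stick_expectation(n):
    # b_k = (1/n) * sum_{i=k}^{n} 1/i, for k = 1..n
    return np.array([np.sum(1.0 / np.arange(k, n + 1)) / n
                     for k in range(1, n + 1)])

corr = np.corrcoef(data.T)
eigvals = np.linalg.eigvalsh(corr)[::-1]
props = eigvals / np.sum(eigvals)
expected = broken_stick_expectation(len(props))

# retain up to the first component that falls below the stick
n_retain_bs = int(np.argmax(props < expected))
print('classic broken stick retains %d components' % n_retain_bs)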
In [7]:
fig_bs = run_pipeline(data, retain_method='broken_stick', rotation_method='varimax')