In [1]:
%pylab inline
from catcorr import *


Populating the interactive namespace from numpy and matplotlib
Using Cython-powered Fisher's exact test

In [2]:
df = generateTestData(nrows = 100)

# Parenthesized print calls run under both Python 2 and Python 3
print(df.head())
print(df.groupby(['ColA','ColB','ColC']).agg(len))


  ColA ColB ColC
0    Y    B  bar
1    X    B  bar
2    Y    B  bar
3    X    B  bar
4    X    A  bar

[5 rows x 3 columns]
ColA  ColB  ColC
X     A     bar     11
            foo      7
      B     bar     25
            foo      3
Y     A     bar     28
      B     bar     26
dtype: int64

Networks for visualizing categorical correlation

The plot represents three columns (red, green and blue) of "paired" categorical data. Each column has many unique values, quite a few of which are repeated. Each unique value is drawn as a node whose diameter is proportional to the frequency of that value in its column.
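As a rough illustration of the node sizes described above, the frequency of each value within its column can be tallied with the standard library. The `rows` list simply repeats the five rows printed by `df.head()`, and the `node_frequencies` helper is a hypothetical name used for illustration; it is not part of catcorr.

```python
from collections import Counter

# The five rows shown by df.head() above, as (ColA, ColB, ColC) tuples.
rows = [
    ("Y", "B", "bar"),
    ("X", "B", "bar"),
    ("Y", "B", "bar"),
    ("X", "B", "bar"),
    ("X", "A", "bar"),
]
columns = ["ColA", "ColB", "ColC"]


def node_frequencies(rows, columns):
    """Count how often each value occurs in each column.

    Returns a Counter mapping (column, value) -> count. In a plot like
    the one described, a node's diameter would scale with this count.
    """
    freq = Counter()
    for row in rows:
        for col, val in zip(columns, row):
            freq[(col, val)] += 1
    return freq


freq = node_frequencies(rows, columns)
```

Keying each node by `(column, value)` rather than by value alone keeps identical labels in different columns from being merged into one node.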

Two nodes representing unique values in two different columns are connected by an edge if they ever appear together (i.e. are paired) in a row of the data. The width of the edge is proportional to the number of times that pairing is observed in the data. If the pairing is significantly more or less common than one would expect by chance (based on the marginal frequencies), then the edge is colored orange (p < 0.01 by Fisher’s exact test).
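The edge weights and the significance test can be sketched in plain Python. This is a minimal illustration under stated assumptions, not catcorr's own implementation (which reports using a Cython-powered Fisher's exact test): `edge_counts` and `fisher_exact_two_sided` are hypothetical helpers, the two-sided p-value sums the probabilities of all tables no more likely than the observed one, and `math.comb` requires Python 3.8 or later. The 2x2 table at the end is derived from the `groupby` counts printed above, collapsed over ColC.

```python
from collections import Counter
from itertools import combinations
from math import comb


def edge_counts(rows, columns):
    """Count co-occurrences of values from two different columns.

    Returns a Counter mapping ((col_u, val_u), (col_v, val_v)) -> count;
    the edge width would scale with this count.
    """
    edges = Counter()
    for row in rows:
        nodes = list(zip(columns, row))
        for u, v in combinations(nodes, 2):
            edges[(u, v)] += 1
    return edges


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    a = rows with both values, b = first value only,
    c = second value only,    d = neither value.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(k):
        # Hypergeometric probability of k joint occurrences,
        # with the marginal totals held fixed.
        return comb(row1, k) * comb(n - row1, col1 - k) / denom

    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    p_obs = prob(a)
    # Sum every table at most as likely as the observed one
    # (small tolerance guards against floating-point ties).
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Edge weights for the five rows shown by df.head().
rows = [("Y", "B", "bar"), ("X", "B", "bar"), ("Y", "B", "bar"),
        ("X", "B", "bar"), ("X", "A", "bar")]
edges = edge_counts(rows, ["ColA", "ColB", "ColC"])

# Pairing ColA=X with ColB=A, from the groupby output above:
# X&A = 11 + 7 = 18, X&B = 25 + 3 = 28, Y&A = 28, Y&B = 26.
p_value = fisher_exact_two_sided(18, 28, 28, 26)
is_orange = p_value < 0.01  # threshold quoted in the text
```

In practice, `scipy.stats.fisher_exact` computes the same kind of test on a 2x2 table and would be the usual choice when SciPy is available.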

The layout of the nodes is not directly driven by the data; it is generated automatically by the network layout algorithms in the pygraphviz and Graphviz software libraries.

Plots like this one show the correlation amongst the categories in each column and can be generated using the “catcorr” Python package with any set of categorical data.

You can download the code from my GitHub repository: https://github.com/agartland/utils/


In [10]:
figure(1,figsize=(12,5))
catcorr(df)