Correlation of numeric columns

Recently a colleague who works with very large data frames (millions of rows, hundreds of columns) mentioned that one task, evaluating a correlation matrix of the numeric columns, was taking a long time.

DataFrame columns may be Arrays or DataArrays.DataArrays or, more recently, NullableArrays or CategoricalArrays or DataArrays.PooledDataArrays. All of these are "parameterized types". That is, they are array-like containers, the contents of which could be numeric or character strings or ...

The objective is a function that takes two arguments, a DataFrame and a (possibly empty) vector of names (as Symbols) of columns to consider. If the vector of names is empty then all columns are considered.

First load the packages to be used


In [1]:
using BenchmarkTools, DataFrames, NamedArrays, RCall

Before creating the final function consider the task of determining if a column is "numeric". This, obviously, is based on the column type not the contents. It is common to write a predicate function, which just means a function that returns true or false, in such cases. The default predicate is false. Other methods are defined for the types we wish to consider as numeric. The definitions are easier than the explanations.


In [2]:
isnumeric{T<:Number}(::AbstractArray{T}) = true
isnumeric(::Any) = false


Out[2]:
isnumeric (generic function with 2 methods)

The second method will be the default. The first method overrides the second for any type of array of numbers.

To provide an example data frame


In [3]:
R" library(mlmRev) ";
chem97 = rcopy("Chem97")  # at present this output is messy in IJulia


WARNING: RCall.jl: Loading required package: lme4
Loading required package: Matrix
Out[3]:
leaschoolstudentscoregenderagegcsescoregcsecnt
1CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "1"4.0CategoricalArrays.CategoricalValue{String,UInt32} "F"3.06.6250.3393157114303298
2CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "2"10.0CategoricalArrays.CategoricalValue{String,UInt32} "F"-3.07.6251.3393157114303298
3CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "3"10.0CategoricalArrays.CategoricalValue{String,UInt32} "F"-4.07.250.9643157114303298
4CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "4"10.0CategoricalArrays.CategoricalValue{String,UInt32} "F"-2.07.51.2143157114303298
5CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "5"8.0CategoricalArrays.CategoricalValue{String,UInt32} "F"-1.06.4440.15831571143032974
6CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "6"10.0CategoricalArrays.CategoricalValue{String,UInt32} "F"4.07.751.4643157114303298
7CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "7"6.0CategoricalArrays.CategoricalValue{String,UInt32} "F"1.06.750.4643157114303298
8CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "8"8.0CategoricalArrays.CategoricalValue{String,UInt32} "F"4.06.9090.6233157114303296
9CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "9"4.0CategoricalArrays.CategoricalValue{String,UInt32} "F"3.06.3750.08931571143032979
10CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "10"10.0CategoricalArrays.CategoricalValue{String,UInt32} "F"0.07.751.4643157114303298
11CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "11"10.0CategoricalArrays.CategoricalValue{String,UInt32} "F"-1.07.8571.57131571143033
12CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "12"8.0CategoricalArrays.CategoricalValue{String,UInt32} "F"1.07.3331.04731571143033
13CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "13"10.0CategoricalArrays.CategoricalValue{String,UInt32} "F"1.07.751.4643157114303298
14CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "2"CategoricalArrays.CategoricalValue{String,UInt32} "14"10.0CategoricalArrays.CategoricalValue{String,UInt32} "M"0.07.71.41431571143033
15CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "2"CategoricalArrays.CategoricalValue{String,UInt32} "15"10.0CategoricalArrays.CategoricalValue{String,UInt32} "M"-4.06.30.014315711430329614
16CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "2"CategoricalArrays.CategoricalValue{String,UInt32} "16"10.0CategoricalArrays.CategoricalValue{String,UInt32} "M"5.07.31.0143157114303296
17CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "2"CategoricalArrays.CategoricalValue{String,UInt32} "17"8.0CategoricalArrays.CategoricalValue{String,UInt32} "M"-3.06.6360.3503157114303299
18CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "2"CategoricalArrays.CategoricalValue{String,UInt32} "18"10.0CategoricalArrays.CategoricalValue{String,UInt32} "M"4.07.2720.98631571143033
19CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "2"CategoricalArrays.CategoricalValue{String,UInt32} "19"10.0CategoricalArrays.CategoricalValue{String,UInt32} "M"0.07.20.91431571143033
20CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "2"CategoricalArrays.CategoricalValue{String,UInt32} "20"4.0CategoricalArrays.CategoricalValue{String,UInt32} "M"-3.06.4540.16831571143032953
21CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "2"CategoricalArrays.CategoricalValue{String,UInt32} "21"6.0CategoricalArrays.CategoricalValue{String,UInt32} "M"4.06.8180.5323157114303294
22CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "2"CategoricalArrays.CategoricalValue{String,UInt32} "22"10.0CategoricalArrays.CategoricalValue{String,UInt32} "M"-5.07.31.0143157114303296
23CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "2"CategoricalArrays.CategoricalValue{String,UInt32} "23"2.0CategoricalArrays.CategoricalValue{String,UInt32} "M"-1.06.2-0.08568428856967003
24CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "2"CategoricalArrays.CategoricalValue{String,UInt32} "24"10.0CategoricalArrays.CategoricalValue{String,UInt32} "M"-2.07.1110.8253157114303296
25CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "2"CategoricalArrays.CategoricalValue{String,UInt32} "25"10.0CategoricalArrays.CategoricalValue{String,UInt32} "M"2.06.80.5143157114303296
26CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "2"CategoricalArrays.CategoricalValue{String,UInt32} "26"8.0CategoricalArrays.CategoricalValue{String,UInt32} "M"-4.06.50.2143157114303298
27CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "2"CategoricalArrays.CategoricalValue{String,UInt32} "27"10.0CategoricalArrays.CategoricalValue{String,UInt32} "M"-5.06.7270.4413157114303301
28CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "2"CategoricalArrays.CategoricalValue{String,UInt32} "28"6.0CategoricalArrays.CategoricalValue{String,UInt32} "M"-6.07.00.7143157114303298
29CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "2"CategoricalArrays.CategoricalValue{String,UInt32} "29"10.0CategoricalArrays.CategoricalValue{String,UInt32} "M"-2.07.71.41431571143033
30CategoricalArrays.CategoricalValue{String,UInt32} "1"CategoricalArrays.CategoricalValue{String,UInt32} "2"CategoricalArrays.CategoricalValue{String,UInt32} "30"10.0CategoricalArrays.CategoricalValue{String,UInt32} "M"3.07.31.0143157114303296
&vellip&vellip&vellip&vellip&vellip&vellip&vellip&vellip&vellip

In [4]:
for (n, v) in eachcol(chem97)
    println(rpad(n, 10), typeof(v))
end


lea       CategoricalArrays.CategoricalArray{String,1,UInt32}
school    CategoricalArrays.CategoricalArray{String,1,UInt32}
student   CategoricalArrays.CategoricalArray{String,1,UInt32}
score     Array{Float64,1}
gender    CategoricalArrays.CategoricalArray{String,1,UInt32}
age       Array{Float64,1}
gcsescore Array{Float64,1}
gcsecnt   Array{Float64,1}

In [5]:
show(isnumeric.(chem97.columns))


Bool[false,false,false,true,false,true,true,true]

Creating the intermediate array

It is an advantage to create a numeric array from the numeric columns then evaluate the correlation matrix, possibly using the cor function. To provide a default value for the names to consider, create two methods, the more general method and the special case.

The more general method examines the columns in turn, using the eachcol iterator and appends them to a numeric vector if they satisfy the conditions. Julia vectors allow for efficient extension by "over-allocating" on individual extensions. (Don't try this in R.)

The correlation matrix is created (using cor) as a named array and the names installed.


In [6]:
dfcor(dfr::DataFrame) = dfcor(dfr, names(dfr))

function dfcor(dfr::DataFrame, cols::Vector{Symbol})
    nms = Symbol[]
    arr = Float64[]
    for (n, v) in eachcol(dfr)
        if n  cols && isnumeric(v)
            push!(nms, n)
            append!(arr, v)
        end
    end
    result = NamedArray(cor(reshape(arr, (size(dfr, 1), length(nms)))))
    NamedArrays.setnames!(result, string.(nms), 1)
    NamedArrays.setnames!(result, string.(nms), 2)
    result
end


Out[6]:
dfcor (generic function with 2 methods)

In [7]:
dfcor(chem97)


Out[7]:
4×4 Named Array{Float64,2}
    A ╲ B │       score          age    gcsescore      gcsecnt
──────────┼───────────────────────────────────────────────────
score     │         1.0  -0.00362303     0.662248     0.662248
age       │ -0.00362303          1.0    0.0519797    0.0519797
gcsescore │    0.662248    0.0519797          1.0          1.0
gcsecnt   │    0.662248    0.0519797          1.0          1.0

In [8]:
dfcor(chem97, [:lea, :score, :age, :gcsescore])


Out[8]:
3×3 Named Array{Float64,2}
    A ╲ B │       score          age    gcsescore
──────────┼──────────────────────────────────────
score     │         1.0  -0.00362303     0.662248
age       │ -0.00362303          1.0    0.0519797
gcsescore │    0.662248    0.0519797          1.0

A super whiz-bang version of cor

In its current state dfcor is perfectly adequate but some of us just can't leave well enough alone. An alternative calculation of the correlation of the columns of the intermediate array is based on a QR factorization of the intermediate array, with a column of 1's prepended. Without going in to the details, the effect of having a column of 1's (or any non-zero constant) as the first column is to subtract the column means from the other columns. Then extract the upper triangular matrix R for the other columns, divide each column by its Euclidean length (the norm function) and create the R'R product. Proof that this should work is left as an exercise for the reader.

The point here is that the intermediate matrix is going to be constructed by appending columns so there is no penalty in starting with a column of ones.


In [9]:
dfcor1(dfr::DataFrame) = dfcor1(dfr, names(dfr))

function dfcor1(dfr::DataFrame, cols::Vector{Symbol})
    m = size(dfr, 1)
    nms = Symbol[]
    arr = ones(m)
    for (n, v) in eachcol(dfr)
        if n  cols && isnumeric(v)
            push!(nms, n)
            append!(arr, v)
        end
    end
    n = length(nms) + 1
    R = full(UpperTriangular(view(qrfact!(reshape(arr, (m, n)))[:R], 
                2:n, 2:n)))
    for j in 1:size(R, 2)
        colj = view(R, 1:j, j)
        colj ./= norm(colj)
    end
    Rt = UpperTriangular(R)
    result = NamedArray(Rt'Rt)
    NamedArrays.setnames!(result, string.(nms), 1)
    NamedArrays.setnames!(result, string.(nms), 2)
    result
end


Out[9]:
dfcor1 (generic function with 2 methods)

In [10]:
dfcor1(chem97)


Out[10]:
4×4 Named Array{Float64,2}
    A ╲ B │       score          age    gcsescore      gcsecnt
──────────┼───────────────────────────────────────────────────
score     │         1.0  -0.00362303     0.662248     0.662248
age       │ -0.00362303          1.0    0.0519797    0.0519797
gcsescore │    0.662248    0.0519797          1.0          1.0
gcsecnt   │    0.662248    0.0519797          1.0          1.0

Unfortunately, on benchmarks dfcor is faster - about 2 to 3 times faster - than dfcor1. In a way, I think this reflects the fact that cor is very carefully written. There may be some slight advantage in accuracy for dfcor1, but it is unlikely to be noticeable.


In [11]:
@benchmark dfcor(chem97)


Out[11]:
BenchmarkTools.Trial: 
  memory estimate:  2.61 mb
  allocs estimate:  188
  --------------
  minimum time:     485.034 μs (0.00% GC)
  median time:      524.309 μs (0.00% GC)
  mean time:        641.506 μs (15.22% GC)
  maximum time:     2.442 ms (52.99% GC)
  --------------
  samples:          7778
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [12]:
@benchmark dfcor1(chem97)


Out[12]:
BenchmarkTools.Trial: 
  memory estimate:  1.90 mb
  allocs estimate:  196
  --------------
  minimum time:     1.226 ms (0.00% GC)
  median time:      1.526 ms (0.00% GC)
  mean time:        1.576 ms (4.48% GC)
  maximum time:     3.984 ms (20.77% GC)
  --------------
  samples:          3170
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%