Recently a colleague who works with very large data frames (millions of rows, hundreds of columns) mentioned that one task, evaluating a correlation matrix of the numeric columns, was taking a long time.
DataFrame
columns may be Array
s or DataArrays.DataArray
s or, more recently, NullableArray
s or CategoricalArray
s or DataArrays.PooledDataArray
s. All of these are "parameterized types". That is, they are array-like containers, the contents of which could be numeric or character strings or ...
The objective is a function that takes two arguments, a DataFrame
and a (possibly empty) vector of names (as Symbol
s) of columns to consider. If the vector of names is empty then all columns are considered.
First load the packages to be used
In [1]:
using BenchmarkTools, DataFrames, NamedArrays, RCall
Before creating the final function consider the task of determining if a column is "numeric". This, obviously, is based on the column type not the contents. It is common to write a predicate function, which just means a function that returns true
or false
, in such cases. The default predicate is false
. Other methods are defined for the types we wish to consider as numeric. The definitions are easier than the explanations.
In [2]:
isnumeric{T<:Number}(::AbstractArray{T}) = true
isnumeric(::Any) = false
Out[2]:
The second method will be the default. The first method overrides the second for any type of array of numbers.
To provide an example data frame
In [3]:
R" library(mlmRev) ";
chem97 = rcopy("Chem97") # at present this output is messy in IJulia
Out[3]:
In [4]:
for (n, v) in eachcol(chem97)
println(rpad(n, 10), typeof(v))
end
In [5]:
show(isnumeric.(chem97.columns))
It is an advantage to create a numeric array from the numeric columns then evaluate the correlation matrix, possibly using the cor
function. To provide a default value for the names to consider, create two methods, the more general method and the special case.
The more general method examines the columns in turn, using the eachcol
iterator and appends them to a numeric vector if they satisfy the conditions. Julia vectors allow for efficient extension by "over-allocating" on individual extensions. (Don't try this in R
.)
The correlation matrix is created (using cor
) as a named array and the names installed.
In [6]:
dfcor(dfr::DataFrame) = dfcor(dfr, names(dfr))
function dfcor(dfr::DataFrame, cols::Vector{Symbol})
nms = Symbol[]
arr = Float64[]
for (n, v) in eachcol(dfr)
if n ∈ cols && isnumeric(v)
push!(nms, n)
append!(arr, v)
end
end
result = NamedArray(cor(reshape(arr, (size(dfr, 1), length(nms)))))
NamedArrays.setnames!(result, string.(nms), 1)
NamedArrays.setnames!(result, string.(nms), 2)
result
end
Out[6]:
In [7]:
dfcor(chem97)
Out[7]:
In [8]:
dfcor(chem97, [:lea, :score, :age, :gcsescore])
Out[8]:
In its current state dfcor
is perfectly adequate but some of us just can't leave well enough alone. An alternative calculation of the correlation of the columns of the intermediate array is based on a QR factorization of the intermediate array, with a column of 1's prepended. Without going in to the details, the effect of having a column of 1's (or any non-zero constant) as the first column is to subtract the column means from the other columns. Then extract the upper triangular matrix R for the other columns, divide each column by its Euclidean length (the norm
function) and create the R'R
product. Proof that this should work is left as an exercise for the reader.
The point here is that the intermediate matrix is going to be constructed by appending columns so there is no penalty in starting with a column of ones.
In [9]:
dfcor1(dfr::DataFrame) = dfcor1(dfr, names(dfr))
function dfcor1(dfr::DataFrame, cols::Vector{Symbol})
m = size(dfr, 1)
nms = Symbol[]
arr = ones(m)
for (n, v) in eachcol(dfr)
if n ∈ cols && isnumeric(v)
push!(nms, n)
append!(arr, v)
end
end
n = length(nms) + 1
R = full(UpperTriangular(view(qrfact!(reshape(arr, (m, n)))[:R],
2:n, 2:n)))
for j in 1:size(R, 2)
colj = view(R, 1:j, j)
colj ./= norm(colj)
end
Rt = UpperTriangular(R)
result = NamedArray(Rt'Rt)
NamedArrays.setnames!(result, string.(nms), 1)
NamedArrays.setnames!(result, string.(nms), 2)
result
end
Out[9]:
In [10]:
dfcor1(chem97)
Out[10]:
Unfortunately, on benchmarks dfcor
is faster - about 2 to 3 times faster - than dfcor1
. In a way, I think this reflects the fact that cor
is very carefully written. There may be some slight advantage in accuracy for dfcor1
, but it is unlikely to be noticeable.
In [11]:
@benchmark dfcor(chem97)
Out[11]:
In [12]:
@benchmark dfcor1(chem97)
Out[12]: