Ordering a pedigree by LAP

We represent a pedigree with two integer vectors, sire and dam, of the same length, say n. The elements of these vectors must be in the range 0:n with 0 representing an unknown parent.

For each of the n animals the longest ancestral path (LAP) is the maximum length of a path that can be traced from the animal through its ancestors. If both sire[i] and dam[i] are zero then the longest ancestral path for animal i is zero.

In general we want a pedigree to be sorted so that the parents of an animal have lower indices than the animal itself. If it is possible to sort the pedigree so that all the animals with LAP = 0 occur first followed by all the animals with LAP = 1 followed by all the animals with LAP = 2, etc. then we are guaranteed that the parents occur before the offspring. (The LAP of the offspring is always greater than the LAP of both of its parents.)

Outline of the algorithm

The algorithm is straightforward but, as always, getting the details right can be a bit tedious. The function laporder takes the sire and dam vectors and returns the re-ordering permutation, ord, the indices of the last element in the permuted ordering before a change in the LAP, lappt, and modified sire and dam vectors, ss and dd, under the new ordering.

Those animals with LAP = 0 are easy to identify because both sire and dam must be zero. Their indices are the leading part of the ord vector. The convention for the lappt vector is that the first element is always zero so the second element is the number of animals in the LAP0 group. The set of possible ancestors at the next stage is the union of [0] and the LAP0 group.

The algorithm proceeds iteratively from the LAP0 group to the LAP1 group to the LAP2 group and so on. At each stage the next LAP group is determined according to both the sire and dam being in the current ancestor, anc, set.

Using IntSets in Julia

The Julia language provides an Intset data type to represent a set of integers as a string of bits. Operations like union, intersection and set difference are expressed as logical operations on bits in a register, making this type very efficient. In the example shown below there are 6547 animals, sets of which are represented as a vector of 103 64-bit integers.

Many Julia functions are mutating functions, meaning that they modify one or more of their arguments. The iteration involves three IntSets, the current set of ancestors, anc, the current unassigned population, pop, and the next generation, nextgen. The update step is to detect and record the next generation then move those indices from pop to anc. The function source code is in the file types.jl. Notice that once nextgen has been detected and recorded, the rest of the update consists of

union!(anc,nextgen)
        setdiff!(pop,nextgen)
        empty!(nextgen)

An interesting characteristic of the algorithm is that there is essentially no storage allocated during the iterative loop. Elements are pushed onto the end of the ord and lappt vectors but those have been overallocated to allow for the additional elements.

An Example

The pedigreeR package for R contains a dataset named pedCows. After suitable conversion of the NA's to zeros, the sire and dam vectors are stored in an HDF5 file as shown below.


In [1]:
using pedigree,HDF5
fid = h5open(Pkg.dir("pedigree","data","pedCows.h5"))
sire = readmmap(fid["sire"])
dam = readmmap(fid["dam"])
close(fid)
dump(sire)


Array(Int16,(6547,)) Int16[0,0,0,0,0,0,0,0,0,0  …  1630,1630,1630,1630,1630,1630,2793,2793,2793,1630]

Because the arrays sire and dam will be read-only we memory-map them, which provides more efficient storage. For this size of pedigree this isn't important but for very large data sets memory mapping can be very effective.

Finally we call the ordering function.


In [2]:
ord,lappt,ss,dd = laporder(sire,dam)


Out[2]:
([1,2,3,4,5,6,7,8,9,10  …  6439,6472,6476,6477,6484,6485,4951,5797,5877,6022],[0,1866,2277,2606,3227,4124,4851,5860,6438,6543,6547],Int16[0,0,0,0,0,0,0,0,0,0  …  3541,4209,5865,5865,3541,3541,4858,4430,4428,4209],Int16[0,0,0,0,0,0,0,0,0,0  …  5917,5920,4710,3917,5916,5919,6442,6440,6441,6439])

This is very fast and uses only a small amount of memory in total.


In [3]:
@time laporder(sire,dam);


elapsed time: 0.003499049 seconds (568952 bytes allocated)

We can check that the resulting pedigree of ss and dd is indeed sorted by LAP order.


In [4]:
p = Pedigree(ss,dd)
dump(p.lap)


Array(Int16,(6547,)) Int16[0,0,0,0,0,0,0,0,0,0  …  8,8,8,8,8,8,9,9,9,9]

The constructor for the Pedigree type currently performs several consistency checks and evaluates the lap member separately. The differences in the lappt elements are the number of animals in each group.


In [5]:
show(diff(lappt))


[1866,411,329,621,897,727,1009,578,105,4]

The reason for storing the lappt vector in the form that is used here is so that the LAP group of any animal can be easily determined.