Sibship clustering

Tom Ellis, March 2017

FAPS uses information in a paternityArray to generate plausible full-sibship configurations. This information is stored as a sibshipCluster object, and is the basis for almost all inference about biological processes in FAPS.

This notebook will examine how to:

  1. Use a paternityArray to cluster offspring into plausible full-sibships.
  2. Compare the relative support for different partition structures.
  3. Make some basic inferences about the size and number of full sibships, and who is related to whom.

If the goal of your study is to say something about how specific phenotypes or geography influence patterns of mating, you can then use your sibshipCluster object to sample likely mating events, accounting for uncertainty in sibship structure.

Generating a sibshipCluster object

We will begin by generating a population of 100 adults with 50 loci.


In [1]:
from faps import *
import numpy as np

np.random.seed(867)
allele_freqs = np.random.uniform(0.3,0.5,50)
adults = make_parents(100, allele_freqs, family_name='a')

We take the first individual as the mother and mate her to three males to create three full sibships of five offspring each. We then generate a paternityArray based on the genotype data.


In [2]:
progeny = make_sibships(adults, 0, [1,2,3], 5, 'x')
mothers = adults.subset(progeny.parent_index('m', adults.names))
patlik  = paternity_array(progeny, mothers, adults, mu = 0.0015)

It is straightforward to cluster offspring into full sibships. For now we'll stick with the default number of Monte Carlo draws.


In [4]:
sc = sibship_clustering(patlik, ndraws=1000)

Sibship clustering calculates likelihoods that each pair of offspring are full siblings, then builds a dendrogram from this. We can visualise this dendrogram if we so wish, although the output is not pretty.


In [5]:
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt

dendrogram(sc.linkage_matrix)
plt.show()


[dendrogram plot of the offspring clustering]

Offspring individuals are labelled by their index in the array. Since full sibships are of size five we should expect to see clusters of {0,1,2,3,4}, {5,6,7,8,9} and {10,11,12,13,14}. This is indeed what we do see. What is difficult to see on the dendrogram are the branches connecting full siblings at the very bottom of the plot. If we bisect this dendrogram at different places on the y-axis we can infer different ways to partition the offspring into full siblings.
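
To make the bisection idea concrete, we could cut the dendrogram ourselves with scipy's fcluster. This is purely illustrative: the cut heights below are arbitrary values chosen for this example, and FAPS performs this step internally and stores the resulting partitions, as we see next.

from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram at a few (arbitrary, illustrative) heights and print the
# cluster label assigned to each offspring at that height.
for height in [0.001, 0.01, 0.1]:
    print(fcluster(sc.linkage_matrix, t=height, criterion='distance'))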

sc is an object of class sibshipCluster that contains various information about the array. Of primary interest are the set of partition structures inferred from the dendrogram. There are fifteen partitions - one for each individual in the array (i.e. one for each level at which the dendrogram can be cut).


In [6]:
sc.partitions


Out[6]:
array([[ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 2,  2,  2,  2,  2,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2],
       [ 3,  3,  3,  3,  3,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2],
       [ 4,  4,  4,  4,  4,  1,  1,  2,  1,  1,  3,  3,  3,  3,  3],
       [ 5,  5,  5,  5,  5,  1,  1,  3,  1,  2,  4,  4,  4,  4,  4],
       [ 5,  6,  5,  5,  5,  1,  1,  3,  1,  2,  4,  4,  4,  4,  4],
       [ 6,  7,  6,  6,  6,  1,  1,  3,  1,  2,  4,  5,  4,  4,  4],
       [ 7,  8,  7,  7,  7,  1,  1,  4,  2,  3,  5,  6,  5,  5,  5],
       [ 8,  9,  8,  8,  8,  1,  2,  5,  3,  4,  6,  7,  6,  6,  6],
       [ 8, 10,  9,  8,  8,  1,  2,  5,  3,  4,  6,  7,  6,  6,  6],
       [ 8, 11, 10,  9,  8,  1,  2,  5,  3,  4,  6,  7,  6,  6,  6],
       [ 9, 12, 11, 10,  9,  1,  2,  5,  3,  4,  7,  8,  6,  6,  6],
       [10, 13, 12, 11, 10,  1,  2,  5,  3,  4,  8,  9,  6,  6,  7],
       [11, 14, 13, 12, 11,  1,  2,  5,  3,  4,  9, 10,  6,  7,  8],
       [11, 15, 14, 13, 12,  1,  2,  5,  3,  4,  9, 10,  6,  7,  8]],
      dtype=int32)

What is key about partition structures is that each symbol represents a unique but arbitrary family identifier. For example in the third row we see the true partition structure, with individuals grouped into three groups of five individuals.


In [7]:
sc.partitions[2]


Out[7]:
array([3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2], dtype=int32)

Beyond denoting who is in a family with whom, the labels are arbitrary, with no meaningful order. This partition would be identical to [0,0,0,0,0,1,1,1,1,1,2,2,2,2,2] or [10,10,10,10,10,7,7,7,7,7,22,22,22,22,22] for example.
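
To make this concrete, the sketch below relabels a partition by order of first appearance, so that any two labellings of the same structure map to the same canonical form. The canonical helper is written just for this illustration and is not part of FAPS.

# Relabel family identifiers in order of first appearance (illustrative helper,
# not part of FAPS).
def canonical(partition):
    labels = {}
    return np.array([labels.setdefault(x, len(labels)) for x in partition])

a = np.array([3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
b = np.array([10, 10, 10, 10, 10, 7, 7, 7, 7, 7, 22, 22, 22, 22, 22])
print(np.array_equal(canonical(a), canonical(b)))  # True: same partition structure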

Each partition is associated with a log likelihood and equivalent log probability. We can see from both scores that the third partition is most consistent with the data. This is of course the true partition.


In [8]:
print(sc.lik_partitions)           # log likelihood of each partition
print(np.exp(sc.prob_partitions))  # probabilities of each partition


[-4.23560188e+02 -1.94067281e+02 -2.70500804e-04 -8.55784873e+00
            -inf            -inf            -inf            -inf
            -inf            -inf            -inf            -inf
            -inf            -inf            -inf]
[1.12248824e-184 5.22016966e-085 9.99807953e-001 1.92047026e-004
 0.00000000e+000 0.00000000e+000 0.00000000e+000 0.00000000e+000
 0.00000000e+000 0.00000000e+000 0.00000000e+000 0.00000000e+000
 0.00000000e+000 0.00000000e+000 0.00000000e+000]

We also see that the first, second and fourth partitions have non-zero, but small, likelihoods. The remaining partitions have log likelihoods of negative infinity - they are incompatible with the data. These partitions split up true full siblings, and there is no way to reconcile this with the data. In real-world situations such partitions might have non-zero likelihoods if an unrelated candidate male happened to be compatible with one or more offspring through chance alone.

In some cases there can be rounding error when log probabilities are exponentiated, so that the probabilities do not sum exactly to one. This is classic floating-point error, and is the reason it is good to work with log values wherever possible. We can check:


In [9]:
np.exp(sc.prob_partitions).sum()


Out[9]:
0.9999999999999999
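
For illustration, this is roughly how the normalisation can be kept in log space using scipy's logsumexp. It is a sketch of the arithmetic only, and assumes prob_partitions is simply lik_partitions normalised by its log-sum, which may not match the FAPS internals exactly.

from scipy.special import logsumexp

# Normalise log likelihoods entirely in log space; -inf entries contribute nothing.
log_probs = sc.lik_partitions - logsumexp(sc.lik_partitions)
print(np.exp(log_probs).sum())  # very close to 1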

You can directly call the most likely partition. This is somewhat against the spirit of fractional analyses though...


In [10]:
sc.mlpartition


Out[10]:
array([3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2], dtype=int32)

How many Monte Carlo draws?

Calculating the likelihood of a partition structure is challenging in a fractional framework because we need to allow for the possibility that any candidate could be the sire of each putative full sibship, while disallowing configurations in which two full sibships share a single father. In the absence of a simple closed-form estimator, FAPS uses Monte Carlo simulation: it draws possible fathers for each sibship proportional to rows in the paternityArray, removes draws in which multiple sibships share a father, and calculates likelihoods over the remaining configurations.
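
The sketch below illustrates the draw-and-reject idea for a single partition. It is not the FAPS implementation: prob_paternity is a hypothetical (sibships x candidates) matrix of the probability that each candidate fathered each putative full sibship, and the function simply discards draws in which two sibships are assigned the same father.

import numpy as np

# Illustrative Monte Carlo step for one partition (not the FAPS internals).
# prob_paternity: hypothetical (n_sibships x n_candidates) matrix; assumes every
# sibship has at least one compatible candidate.
def draw_valid_configurations(prob_paternity, ndraws=1000, rng=None):
    rng = rng or np.random.default_rng()
    nsib, ncand = prob_paternity.shape
    p = prob_paternity / prob_paternity.sum(axis=1, keepdims=True)
    # Draw a father for each sibship, proportional to its row of probabilities.
    draws = np.stack([rng.choice(ncand, size=ndraws, p=row) for row in p])
    # Keep only draws in which no two sibships share a father.
    valid = np.array([len(np.unique(draws[:, j])) == nsib for j in range(ndraws)])
    return draws[:, valid]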

The downside of this approach is that when the number of candidate fathers becomes large, many Monte Carlo draws are needed to properly sample the space of possible fathers. In the example above with 100 candidates, increasing the number of draws makes no difference. If we make a much larger example with 5000 candidates, 40 SNP loci, and a fairly high genotyping error rate, then increasing the number of draws means we are able to find valid configurations for more partition structures.


In [13]:
np.random.seed(625)
mu = 0.003 # genotype error rate
allele_freqs = np.random.uniform(0.3,0.5,40)
adults = make_parents(5000, allele_freqs, family_name='a').mutations(mu)
progeny = make_sibships(adults, 0, [1,2,3,4], 5, 'x').mutations(mu)
mothers = adults.subset(progeny.parent_index('m', adults.names))
patlik  = paternity_array(progeny, mothers, adults, mu)

print(sibship_clustering(patlik, ndraws=100).lik_partitions)
print(sibship_clustering(patlik, ndraws=1000).lik_partitions)
print(sibship_clustering(patlik, ndraws=10000).lik_partitions)
print(sibship_clustering(patlik, ndraws=100000).lik_partitions)


[-300.14878498 -223.32087212 -114.4050134    -4.11643151   -3.75532011
   -3.78106756   -4.54111533   -5.58483364   -6.55225317          -inf
          -inf          -inf          -inf          -inf          -inf
          -inf          -inf          -inf          -inf          -inf]
[-300.14878498 -223.32087212 -114.4050134    -4.11643151   -3.63074801
   -3.63470269   -4.32477986   -5.62514679   -7.11246111   -8.84318051
          -inf          -inf          -inf          -inf          -inf
          -inf          -inf          -inf          -inf          -inf]
[-300.14878498 -223.32087212 -114.4050134    -4.11643151   -3.62413702
   -3.60996847   -4.23569295   -5.33015851   -6.20040007   -7.86546219
  -11.37870605  -14.8146326           -inf          -inf          -inf
          -inf          -inf          -inf          -inf          -inf]
[-300.14878498 -223.32087212 -114.4050134    -4.11643151   -3.62012365
   -3.59877473   -4.21624287   -5.30205236   -6.04506731   -7.39214618
   -8.92866663  -18.97122213  -21.14795632  -34.57247516          -inf
          -inf          -inf          -inf          -inf          -inf]

Notice that the extra partitions identified are towards the end of the list. These tend to be partitions where true full-sib families are (erroneously) split into smaller groups, especially singleton families. Likelihoods for these extra partitions remain low, so most of the probability weight stays on partitions which are quite close to the true partition.

Now consider a slightly different case where every offspring really is in a full sibship of its own (i.e. the true partition is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], which will be at the end of the lists of partitions). Likelihoods increase as we look from the start to the end of each list, and the final partition is the most likely for every number of Monte Carlo draws.


In [24]:
np.random.seed(7636)
mu = 0.003 # genotype error rate
allele_freqs = np.random.uniform(0.3,0.5,40)
adults = make_parents(5000, allele_freqs, family_name='a').mutations(mu)
progeny = make_sibships(adults, 0, range(20), 1, 'x').mutations(mu)
mothers = adults.subset(progeny.parent_index('m', adults.names))
patlik  = paternity_array(progeny, mothers, adults, mu)

print(sibship_clustering(patlik, ndraws=100).lik_partitions)
print(sibship_clustering(patlik, ndraws=1000).lik_partitions)
print(sibship_clustering(patlik, ndraws=10000).lik_partitions)
print(sibship_clustering(patlik, ndraws=100000).lik_partitions)


[-416.01100325 -379.99362752 -356.58214515 -321.73375477 -295.76199498
 -262.84565561 -233.23990514 -202.46259871 -151.32028313 -132.52095045
 -107.84368486  -87.116699    -72.55838541  -58.80581086  -45.13526799
  -34.52188616  -24.90701804  -15.18539673   -6.24542751   -1.76187955]
[-416.01100325 -379.99362752 -356.58195734 -321.72968256 -295.76188442
 -262.82770667 -233.22795883 -202.4468761  -151.31535186 -132.47893826
 -107.78473665  -87.03103877  -72.48431536  -58.73350681  -44.98254529
  -34.26318889  -24.398553    -14.66285795   -5.79014055   -0.90425392]
[-416.01100325 -379.99362752 -356.58143992 -321.72943378 -295.76144521
 -262.82719638 -233.22038811 -202.44508888 -151.30750387 -132.46557852
 -107.7691896   -87.00655403  -72.45765733  -58.68801417  -44.93090915
  -34.15881097  -24.19512812  -14.42335531   -5.53412026   -0.49542979]
[-4.16011003e+02 -3.79993479e+02 -3.56581167e+02 -3.21729213e+02
 -2.95760913e+02 -2.62826871e+02 -2.33219930e+02 -2.02443752e+02
 -1.51306437e+02 -1.32460620e+02 -1.07764392e+02 -8.69977352e+01
 -7.24463954e+01 -5.86736155e+01 -4.49080325e+01 -3.41092351e+01
 -2.41014620e+01 -1.43062675e+01 -5.40899143e+00 -2.64920178e-01]

So how many Monte Carlo draws are necessary? The short answer is: it probably doesn't matter. In the original FAPS paper (figure 5 in that paper) we found that increasing the number of Monte Carlo draws does increase the amount of probability space explored, but these regions tend to be areas of low probability. Likely configurations are found first, and increasing the number of draws doesn't increase overall accuracy of inferred sibship relationships.

A good rule is to use the default setting of 1000 draws. If you are concerned about this effect you can also change the number to 100 or 10,000 and see if this alters your downstream analyses. You can also test this explicitly with simulations using the power analysis tools.
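
For example, a quick sensitivity check might look like the sketch below: rerun the clustering with different values of ndraws and compare the most likely partition (or whatever downstream summary you actually care about).

# Quick check (sketch): does the most likely partition change with ndraws?
for ndraws in [100, 1000, 10000]:
    sc_test = sibship_clustering(patlik, ndraws=ndraws)
    print(ndraws, sc_test.mlpartition)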

Inferring family structure

For this section we will simulate a slightly more interesting family structure. This block of code creates a half-sib array of 15 offspring from five fathers, where the fathers contribute five, four, three, two and one offspring respectively. It then performs sibship clustering on the array. We use 1000 candidate males and 30 loci.


In [32]:
# Lists indexing sires and dams
sires = [1]*5 + [2]*4 + [3]*3 + [4]*2 +[5]*1
dam   = [0] * len(sires)

np.random.seed(542)
allele_freqs = np.random.uniform(0.3,0.5,30)
adults  = make_parents(1000, allele_freqs, family_name='a')
progeny = make_offspring(adults, dam_list=dam, sire_list=sires)
mothers = adults.subset(progeny.parent_index('m', adults.names))

patlik  = paternity_array(progeny, mothers, adults, mu) # re-uses mu = 0.003 from the previous example
sc = sibship_clustering(patlik)

Number of families

We saw before that we could call a list of valid partitions for sc using sc.partitions. The output is not terribly enlightening on its own, however. We could instead ask how probable it is that there are $x$ full sibships in the array, integrating over all partition structures. Here each number is the probability that there are 1, 2, ..., 15 families.


In [34]:
sc.nfamilies()


Out[34]:
array([1.53113909e-94, 2.55473148e-60, 9.10941998e-44, 7.65780993e-06,
       7.40956442e-01, 2.23357677e-01, 3.47258104e-02, 9.51716337e-04,
       6.96251149e-07, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00])

We could show the same information graphically. It's clear that almost all the probability density is around $x=5$ families.


In [38]:
%matplotlib inline
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
ax.bar(np.arange(1,16), sc.nfamilies())
ax.set_xlabel('Number of full sibships')
ax.set_ylabel('Posterior probability')
plt.show()


Family size

We can also get the distribution of family sizes within the array, averaged over all partitions. family_size returns a vector of the same length as the number of offspring in the array, giving the posterior probability of observing one or more families of size 1, 2, ..., 15. Note that this cannot distinguish a single sibship of a given size with high probability from multiple families of that size, each with low probability; this is the price we pay for integrating out uncertainty in partition structure.


In [14]:
sc.family_size()


Out[14]:
array([4.80117564e-005, 0.00000000e+000, 0.00000000e+000, 4.80117564e-005,
       9.99903976e-001, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
       0.00000000e+000, 2.61008483e-085, 0.00000000e+000, 0.00000000e+000,
       0.00000000e+000, 0.00000000e+000, 1.12248824e-184])

Plotting this shows that we are roughly equally likely to observe a family of sizes one, two, three, four and five.


In [43]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.bar(np.arange(len(sires))+0.5, sc.family_size())

plt.show()


Sibling relationships

Often we want to know who is related to whom. sc.full_sib_matrix() returns an $n \times n$ matrix, where $n$ is the number of offspring. Each element describes the log probability that a pair of individuals are full siblings. If we plot this using a heatmap you can clearly see the five full sibships jump out as blocks of yellow (>90% probability of being full siblings) against a sea of purple (near-zero probability of being full siblings).


In [45]:
sibmat = sc.full_sib_matrix()
plt.pcolor(np.exp(sibmat))
plt.colorbar()
plt.show()


Note that real datasets seldom look this tidy!