Tom Ellis, March 2018, updated June 2020
In the previous sections on genotype arrays, paternity arrays and sibship clustering we considered only a single half-sibling array. In most real-world situations, you would probably have multiple half-sibling arrays from multiple mothers.
FAPS assumes that these families are independent, which seems a reasonable assumption for most application, so dealing with multiple families boils down to performing the same operation on these families through a loop. This guide outlines some tricks to automate this.
This notebook will examine how to:
To illustrate this we will use data from wild-pollinated seed capsules of the snapdragon Antirrhinum majus. Each capsule represents a single maternal family, which may contain mixtures of full- and half-siblings. Each maternal family can be treated as independent.
These are the raw data described in Ellis et al. (2018), and are available from the IST Austria data repository (DOI:10.15479/AT:ISTA:95). For the analysis presented in that paper we did extensive data cleaning and checking, which is given as a case study later in this guide. Here, we will skip this process, since it primarily concerns accuracy of results.
There are two ways to divide data into families: by splitting up a genotypeArray
into families, and making a paternityArray
for each, or create a single paternityArray
and split up that.
Frequently offspring have been genotyped from multiple half-sibling arrays, and it is convenient to store these data together in a single file on disk. However, it (usually) only makes sense to look for sibling relationships within known half-sib families, so we need to split those data up into half-sibling famililes.
First, import the required packages and data for the sample of candidate fathers and the progeny.
In [1]:
import numpy as np
from faps import *
import matplotlib.pyplot as plt
%pylab inline
adults = read_genotypes('../../data/parents_2012_genotypes.csv', genotype_col=1)
progeny = read_genotypes('../../data/offspring_2012_genotypes.csv', genotype_col=2, mothers_col=1)
For simplicity, let's restrict the progeny to those offspring belonging to three maternal families.
In [2]:
ix = np.isin(progeny.mothers, ['J1246', 'K1809', 'L1872'])
progeny = progeny.subset(individuals=ix)
We also need to define an array of genotypes for the mothers, and a genotyping error rate.
In [3]:
mothers = adults.subset(individuals=np.unique(progeny.mothers))
mu= 0.0015
Pull out the numbers of adults and progeny in the dataset, as well as the number of maternal families.
In [4]:
print(adults.size)
print(progeny.size)
print(len(np.unique(progeny.mothers)))
Most maternal families are between 20 and 30, with some either side.
In the data import we specified that the ID of the mother of each offspring individual was given in column 2 of the data file (i.e. column 1 for Python, which starts counting from zero). Currently this information is contained in progeny.mothers
.
To separate a genotypeArray
into separate families you can use split
, and the vector of maternal names. This returns a dictionary of genotypeArray
objects for each maternal family.
In [5]:
progeny2 = progeny.split(progeny.mothers)
mothers2 = mothers.split(mothers.names)
If we inspect progeny2
we can see the structure of the dictionary. Python dictionaries are indexed by a key, which in this case is the maternal family name. Each key refers to some values, which in this case is a genotypeArray
object for each maternal family.
In [6]:
progeny2
Out[6]:
You can pull attributes about an individual family by indexing the key like you would for any other python dictionary.
In [7]:
progeny2["J1246"].size
Out[7]:
To do this for all families you can iterate with a dictionary comprehension, or loop over the dictionary. Here are three ways to get the number of offspring in each maternal family:
In [8]:
{k: v.size for k,v in progeny2.items()} # the .items() suffix needed to separate keys and values
Out[8]:
In [9]:
{k : progeny2[k].size for k in progeny2.keys()} # using only the keys.
Out[9]:
In [10]:
# Using a for loop.
for k,v in progeny2.items():
print(k, v.size)
The previous section divided up a genotypeArray
containing data for offspring from multiple mothers and split that up into maternal families. You can then pass this dictionary of genotypeArray
objects to paternity_array
directly, just as if they were single objects. paternity_array
detects that these are dictionaries, and returns a dictionary of paternityArray
objects.
In [21]:
%time patlik1 = paternity_array(progeny2, mothers2, adults, mu, missing_parents=0.01)
The alternative way to do this is to pass the entire arrays for progeny and mothers to
paternity_array
. A word of caution is needed here, because paternity_array
is quite memory hungry, and for large datasets there is a very real chance you could exhaust the RAM on your computer and the machine will grind to a halt. By splitting up the genotype data first you can deal with small chunks at a time.
In [22]:
mothers_full = adults.subset(progeny.mothers)
%time patlik2 = paternity_array(progeny, mothers_full, adults, mu)
patlik2
Out[22]:
There doesn't seem to be any difference in speed the two methods, although in other cases I have found that creating a single paternityArray
is slower. Your mileage may vary.
We split up the paternity_array
in the same way as a genotype_array
. It returns a list of paternityArray
objects.
In [23]:
patlik3 = patlik2.split(progeny.mothers)
patlik3
Out[23]:
We would hope that patlik
and patlik3
are identical lists of paternityArray
objects. We can inspect family J1246 to check:
In [24]:
patlik1['J1246'].offspring
Out[24]:
In [25]:
patlik3['J1246'].offspring
Out[25]:
sibship_clustering
is also able to detect when a list of paternityArray
objects is being passed, and treat each independently. It returns a dictionary of sibshipCluster
objects.
In [26]:
%%time
sc = sibship_clustering(patlik1)
sc
Out[26]:
This time there is quite a substantial speed advantage to performing sibship clustering on each maternal family separately rather than on all individuals together. This advanatge is modest here, but gets substantial quickly as you add more families and offspring, because the number of pairs of relationships to consider scales quadratically.
In [17]:
%time sibship_clustering(patlik2)
Out[17]:
You can index any single family to extract information about it in the same way as was explained in the section on sibship clustering. For example, the posterior distribution of full-sibship sizes for the first maternal family.
In [18]:
sc['J1246'].family_size()
Out[18]:
As with genotypeArray
objects, to extract information about each sibshipCluster
object it is straightforward to set up a list comprehension. For example, this cell pulls out the number of partition structures for each maternal family.
In [19]:
{k : v.npartitions for k,v in sc.items()}
Out[19]:
Since paternity is likely to be a common aim, there is a handy function for calling the sires
method on a list of sibshipCluster
objects.
In [20]:
summarise_sires(sc)
Out[20]:
The output is similar to that of sires()
except that it gives labels for mother and father separately, replacing the label
column. The prob
and offspring
columns have the same interpretation as for single sibshipCluster
objects.