Tom Ellis, February 2017
FAPS stands for Fractional Analysis of Sibships and Paternity. It is a Python package for reconstructing genealogical relationships in wild populations, and making inference about biological processes. The sections of this document are intended as a user's guide to introduce how FAPS works.
The motivation for developing FAPS was to provide a package to investigate biological processes in a large population of snapdragons in the wild. Existing packages to do this relied on computationally intense Markov-chain algorithms, which limited the scope for subsequent analysis and for checking assumptions through simulations. As such, most of the examples in this guide relate to snapdragons. That said, FAPS addresses general issues in pedigree reconstruction of wild populations, and it is hoped that FAPS will be useful for other plant and animal systems.
The specific aims of FAPS were to provide a package which would allow us to:
FAPS reconstructs relationships for one or more half-sibling arrays, that is to say a sample of offspring from one or more mothers whose identity is known. A half-sibling array consists of seedlings from the same maternal plant, or a family of lambs from the same ewe, to give two examples. The paternity of each offspring is unknown, and hence it is unknown whether pairs of offspring are full or half siblings.
There is also a sample of males, each of whom is a candidate to be the true sire of each offspring individual. FAPS is used to identify likely sibling relationships between offspring and their shared fathers based on typed genetic markers, and to use this information to make meaningful conclusions about population or mating biology. The procedure can be summarised as follows:
Like all statistical analyses, FAPS makes a number of assumptions about your data. It is good to state these explicitly so everyone is aware of the limitations of the data and method:
Depending on your biological questions, your data may not fit the assumptions listed above, and an alternative approach might be more appropriate. For example:
It is assumed you have read Ellis (2016) for the basic background to the method. It would also be useful to read Devlin (1988) for an overview of the motivation and basic methods of fractional assignment. It is also assumed you have a basic understanding of probability and likelihood; see Bolker (2006, chapter 6) for an example of a general introduction.
FAPS uses Python as an interface, but it is hoped that this guide will allow users who aren't familiar with Python to adapt the code to their needs. It would be worthwhile to at least familiarise yourself with Python's data types, especially lists and NumPy arrays, and with how list comprehensions work. A general introduction to Python concepts can be found here. I recommend interacting with FAPS through IPython/Jupyter, which allows you to test small pieces of code and annotate analyses as you go. This document, for example, is written in IPython.
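For readers new to Python, here is a minimal sketch of the three ideas mentioned above: a list, a list comprehension, and a NumPy array. It does not use FAPS itself, and the sample names and genotype values are made up purely for illustration.

```python
import numpy as np

# A list holds an ordered collection of items (made-up sample names).
offspring_names = ["seedling_1", "seedling_2", "seedling_3"]

# A list comprehension builds a new list by applying an expression
# to every item of an existing one.
upper_names = [name.upper() for name in offspring_names]

# A NumPy array is a rectangular block of numbers. Here each row is a
# hypothetical individual and each column a hypothetical marker.
genotypes = np.array([[0, 1, 2],
                      [1, 1, 0]])

n_individuals, n_markers = genotypes.shape  # number of rows, number of columns
first_marker = genotypes[:, 0]              # all rows, first column

print(upper_names)
print(n_individuals, n_markers)
print(first_marker)
```

Slicing with `[:, 0]` selects a whole column at once, which is the kind of operation that makes arrays more convenient than nested lists for genotype-like data.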
You will of course need to have Python installed on your machine. If you do not already have this, instructions can be found here. You will also need to install the NumPy, fastcluster and Pandas libraries. These should be installed automatically if you install FAPS using pip (see below), but if for some reason they are not, the easiest way to do this is to install one of the scientific Python bundles. Some of the simulation tools also make use of Jupyter widgets, but these are optional. There are no specific hardware requirements beyond what is needed to run Python, but it is possible that RAM will be a limiting factor if you are dealing with large samples (for example ~100 offspring and 10,000 candidate males).
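If you are unsure whether the dependencies were installed, one way to check is sketched below. It uses only Python's standard library and assumes the import names are numpy, fastcluster and pandas; it is not part of FAPS.

```python
from importlib.util import find_spec

# Import names of the libraries listed above (assumed to match the package names).
required = ["numpy", "fastcluster", "pandas"]

# find_spec returns None when a top-level module cannot be found.
missing = [name for name in required if find_spec(name) is None]

if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries were found.")
```

Any names reported as missing can then be installed with pip or through a scientific Python bundle.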
All testing and development of FAPS was done on Linux and Mac machines. I have not tested it on Windows, nor do I intend to. That said, an advantage of Python is that it ought to work on any operating system, so in principle FAPS ought to run as well on Windows as on a Unix machine. One important difference is that Windows uses '\' instead of '/' in its file paths, so you will need to edit any file paths accordingly.
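One way to sidestep the separator issue is to build paths with Python's standard pathlib module rather than typing them as strings. The sketch below assumes a made-up project layout; the folder and file names are hypothetical and are not something FAPS expects.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Path picks the right separator for whatever system the code runs on.
# "myproject", "data" and "genotypes.csv" are hypothetical names.
data_file = Path("myproject") / "data" / "genotypes.csv"
print(data_file)

# The "pure" classes show how the same path is written on each system,
# whichever system you happen to be using.
posix_style = str(PurePosixPath("myproject", "data", "genotypes.csv"))
windows_style = str(PureWindowsPath("myproject", "data", "genotypes.csv"))
print(posix_style)    # myproject/data/genotypes.csv
print(windows_style)  # myproject\data\genotypes.csv
```

Functions that accept file names as strings can be given `str(data_file)`.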
You can download the development version of FAPS from the project GitHub repository. The best way to install FAPS is to use Python's package manager, pip. Instructions for installing pip can be found at that project's documentation page. You can then download the package from this repository, and run pip install .
from the package directory. At some point I will also endeavour to get the package on the PyPI database.
If for some reason that doesn't work, you could also unzip the package contents to your working directory, and import it from there. For example, if your working directory is /home/Documents/myproject
where you will save your analyses, you need a folder in that directory called faps
containing the functions and classes in FAPS. This is not recommended.
Once in Python/IPython you'll need to import the package, as well as the NumPy library on which it is based. In the rest of this document, I'll assume you've run the following lines to do this if this isn't explicitly stated.
In [1]:
from faps import *
import numpy as np
The asterisk on the first line is a shortcut to tell Python to import all the functions and classes in FAPS. This is somewhat lazy, but saves us having to give the package name every time we call something.
The basic unit on which analyses are built is a matrix of likelihoods of paternities, with a row for each offspring individual and a column for each candidate father (matrix G in Ellis 2016). Each element represents the likelihood that a single candidate male is the father of a single offspring individual based on alleles shared between them and the offspring's mother. One of the aims of FAPS was to create a method which did not depend on marker type, mating system, ploidy, or genotyping technology, so that it should be applicable to as broad a range of datasets as exist, or may yet exist. As such, the optimum way to estimate G will vary from case to case.
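To make the shape of G concrete, here is a toy sketch in plain NumPy. It does not use the FAPS API, the likelihood values are invented, and the normalisation step assumes every candidate is equally likely a priori. It only illustrates the row-by-column layout and the idea behind fractional assignment.

```python
import numpy as np

# Toy likelihood matrix: 2 offspring (rows) x 3 candidate fathers (columns).
# These numbers are made up for illustration only.
G = np.array([[0.02, 0.10, 0.08],
              [0.30, 0.05, 0.15]])

# Working with logs avoids numerical underflow when likelihoods are tiny.
log_G = np.log(G)

# Subtract each row's maximum before exponentiating, then scale each row
# to sum to one. Each offspring is thereby shared out among the candidates
# in proportion to their likelihoods.
shifted = np.exp(log_G - log_G.max(axis=1, keepdims=True))
prob_paternity = shifted / shifted.sum(axis=1, keepdims=True)

print(prob_paternity)
print(prob_paternity.sum(axis=1))  # each row sums to one
```

In this toy case the second candidate receives half of the paternity of the first offspring, because his likelihood (0.10) is half of that row's total (0.20).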
Most genealogical studies have considered diploid organisms typed with microsatellite or SNP markers. FAPS can directly estimate G for SNP markers, and the examples given here concern SNPs. In contrast with microsatellites, SNP markers have only two alleles per locus, which means that around ten-fold more SNPs are required to achieve the same statistical power as a given set of microsatellites. However, SNP markers are abundant in the genome, and are considerably cheaper to genotype per locus than microsatellites. Moreover, genotyping error rates for SNPs are a tiny fraction of those for microsatellites, and SNPs do not require the enormous time investment needed for visually checking autoradiograms. For these reasons, we have used SNPs for our own work, and FAPS has been developed to reflect that.
Calculating G for microsatellite markers is not currently implemented, so at present it would be necessary to create a G matrix yourself and import this into FAPS. I recognise, however, that SNP markers are not available for many biological systems, and I am open to incorporating a function for microsatellites in the future. Please feel free to contact me if you are interested in doing this.
FAPS will also work given an appropriate G matrix for a polyploid species, but again you will need to provide the G matrix yourself. See Wang 2016 or Field 2017 for inspiration. This topic is rather involved, and I personally do not feel comfortable implementing anything in this area myself, but I would be interested to hear from anyone who is willing to try it.
See the sections on Importing genotype data and Paternity arrays for more details on how to import data.