Introduction

Tom Ellis, February 2017

FAPS stands for Fractional Analysis of Sibships and Paternity. It is a Python package for reconstructing genealogical relationships in wild populations, and making inference about biological processes. The sections of this document are intended as a user's guide to introduce how FAPS works.

Background

The motivation for developing of FAPS was to provide a package to investigate biological processes in a large population of snapdragons in the wild. Existing packages to do this relied on computationally intense Markov-chain algorithms, which limited the scope for subsequent analysis and for checking assumptions through simulations. As such, most of the examples in this guide relate to snapdragons. That said, FAPS addresses general issues in pedigree reconstruction of wild populations, and it is hoped that FAPS will be useful for other plant and animal systems.

The specific aims of FAPS were to provide a package which would allow us to:

  1. Jointly inferring sibship and paternity relationships, and hence gain better estimates of both (Wang 2007), without using a random-walk algorithm such as MCMC or simulated annealing.
  2. Account for uncertainty about relationships (i.e. to integrate fractional paternity assignment with sibship inference).
  3. Provide tools for testing biological hypotheses with minimal hassle.
  4. Do this efficiently for large populations.
  5. Not depend on mating system or genetic marker technology.

Overview of the method

FAPS reconstructs relationships for one or more half-sibling arrays, that is to say a sample of offspring from one or more mothers whose identity is uncertain. A half-sibling array consistent of seedlings from the same maternal plant, or a family of lambs from the same ewe, to give two examples. The paternity of each offspring is unkown, and hence it is unknown whether pairs of offspring are full or half siblings.

There is also a sample of males, each of whom is a candidate to be the true sire of each offspring individual. FAPS is used to identify likely sibling relationships between offspring and their shared fathers based on typed genetic markers, and to use this information to make meaningful conclusions about population or matig biology. The procedure can be summarised as follows:

  1. FAPS begins with an matrix with a row for each offspring and a column for each candidate male. Each element represents the likelihood that the corresponding male is the sire of the offspring individual.
  2. Based on this matrix, we apply hierarchical clustering to group individuals into possible configurations of full-siblings.
  3. Calculate statistics of interest on each configuration, weighted by the probability of each candidate male and each configuration.

Like all statistical analyses, FAPS makes a number of assumptions about your data. It is good to state these explicitly so everyone is aware of the limitations of the data and method:

  1. Individuals come from two generations: a sample of offspring, and a sample of reproductive adults.
  2. At present FAPS is intended to deal with paternity assignment only, i.e. when one parent is known and the second must be inferred. That said, it would not take too much work to modify FAPS to accommodate bi-parental assignment.
  3. Sampling of the adults does not need to be complete, but the more true parents are missing the more likely it is that true full-siblings are inferred to be half-siblings.
  4. Genotype data are biallelic, and genotyping errors can be characterised as a haploid genotype being either correct or incorrect with some probability. Multi-allelic markers with complex error patterns such as microsatellites are not supported at present - see the sections on genotype data and paternity arrays for more on this.

Other software

Depending on your biological questions, your data may not fit the assumptions listed above, and an alternative approach might be more appropriate. For example:

  • Cervus (Marshall 1998, Kalinowski 200?) was the first widely-used program for parentage/paternity assignment. It is restricted to categorical assignment of parent-offspring pairs and relies on substantial prior information about the demography of the population.
  • Colony2 is a dedicated Fortran program for sibship inference (Wang 2004, Wang and Santure 2009). Analyses can be run with or without a sample of parents. Originally designed for microsatellites, it can be slow for large SNP datasets, although Wang et al. 2012 updated the software to support SNP data with a slight cost to accuracy.
  • As argued eloquently by Hadfield (2006), frequently the aim of genealogical reconstruction is very often not the pedigree itself, but some biological parameter of the population. His MasterBayes package for R allows the user to dircetly estimate the posterior distribution for the statistic of interest. It is usually slow for SNP data, can does not incorporate information about siblings.
  • SNPPIT (Anderson & Garza, 2006) provides very fast and accurate assignments of parentage using SNPs.
  • Almudevar (2003), Huisman (2017) and Anderson and Ng (2016) provide software for estimating multigenerational pedigrees.

Prerequisites

Python packages

It is assumed you have read Ellis. (2016) for the basic background to the method. It would also be useful to read Devlin (1988) for an overview of the motivation and basic methods of fractional assignment. It is also assumed you have a basic understanding of probability and likelihood; see Bolker (2006, chapter 6) for an example of a general introduction.

FAPS uses Python as an interface, but it is hoped that this guide should allow users who aren't familiar with Python to adapt the code to their needs. It would be worthwhile to at least familiarise yourself with Python's data types, especially lists and NumPy arrays, and how list comprehensions work. A general introduction to Python concepts can be found here. I recommend interacting with FAPS through IPython/Jupyter, which allows you to test small pieces of code and annotate analyses as you go. This document, for example, is written in IPython.

You will of course need to have Python installed on your machine. If you do not already have this, instructions can be found here. You will also need to install the NumPy, fastcluster and Pandas libraries. These should be installed automatically if you intall FAPS using pip (see below), but if for some reason they are not, the easiest way to do this is to install one of the scientific Python bundles. Some of the simulation tools also make use of Jupyter widgets, but these are optional. There are no specific hardware requirements beyond what is needed to run Python, but it is possible that RAM will be a limiting factor if you are dealing with large samples (for example ~100 offspring and 10,000 candidate males).

All testing and development of FAPS was done on Linux and Mac machines. I have not tested it on Windows, nor do I intend to. That said, an advantage of Python is that it ought to work on any operating system, so in principle FAPS ought to run as well as on a Unix machine. One important difference is that Windows uses '\' instead of '/' in its file paths, so you will need to edit accordingly.

Installing FAPS

You can download the development version of FAPS from the project github repository. The best way to install FAPS is to use Python's package manager, Pip. Instructions to do so can be found at that projects documentation page. You can then either download the package form this repository, and run pip install . from the package directory. At some point I will also endeavour to get the package on the PyPi database.

If for some reason that doesn't work, you could also unzip the package contents to your working directory, and import it from there. For example, if you're working directory is /home/Documents/myproject where you will save your analyses, you need a folder in that directory called faps containing the functions and classes contained in FAPS. This is not recommended.

Once in Python/IPython you'll need to import the package, as well as the NumPy library on which it is based. In the rest of this document, I'll assume you've run the following lines to do this if this isn't explicitly stated.


In [1]:
from faps import *
import numpy as np

The asterisk on the first line is a shortcut to tell Python to import all the functions and classes in FAPS. This is somewhat lazy, but saves us having to give the package name every time we call something.

Marker data types

The basic unit on which analyses are built is a matrix of likelihoods of paternities, with a row for each offspring individual and a column for each candidate father (matrix G in Ellis 2016). Each element represents the likelihood that a single candidate male is the father of a single offspring individual based on alleles shared between them and the offspring's mother. One of the aims of FAPS was to create a method which did not depend on marker type, mating system, ploidy, or genotyping technology, with the aim that it should be applicable to as broad a range of datasets that exist, or may yet exist. As such, the optimum way to estimate G will vary from case to case.

Most genealogical studies have considered diploid organisms typed with microsatellite or SNP markers. FAPS can directly estimate G for SNP markers, and the examples given here concern SNPs. In contrast with microsatellites, SNP markers have only two alleles per locus, which means that around ten-fold more markers are required to gain the same statistical power as would be needed for microsatellites. However, SNP markers are abundant in the genome, and are considerably cheaper to genotype per locus than microsatellites. Moreover, genotyping error rates for SNPs are a tiny fraction of those for microsatellites, and do not require the enormous time investment needed for visually checking autoradiograms. For these reasons, we have used SNPs for our own work, and FAPS has developed to reflect that.

Calculating G is not currently implemented, so at present it would be necessary to create a G matrix yourself and import this into FAPS. However, I recognise that SNP markers are frequently not available for many biological systems, and I am open to incorporating a function for microsatellites in the future. Please feel free to contact me if you are interested in doing this.

FAPS will also work given an appropriate G matrix for a polyploid species, , but you will also need to provide a G matrix yourself. See Wang 2016 or Field 2017 for inspiration. This topic is rather involved, and I personally do not feel comfortable implementing anything in this area myself, but I would be interested to hear from anyone who is willing to try it.

See the sections on Importing genotype data and Paternity arrays for more details on how to import data.