Solution of 4.10.1, Jiang et al. 2013

Write a function that takes as input the desired Taxon, and returns the mean value of r.

First, we're going to import the csv module, and read the data. We store the taxon name in the list Taxa, and the corresponding r value in the list r_values. Note that we need to convert the values to float (we need numbers, and they are read as strings).


In [1]:
import csv

In [2]:
with open('../data/Jiang2013_data.csv') as csvfile:
    # set up csv reader and specify correct delimiter
    reader = csv.DictReader(csvfile, delimiter = '\t')
    taxa = []
    r_values = []
    for row in reader:
        taxa.append(row['Taxon'])
        r_values.append(float(row['r']))

We check the first five entries to make sure that everything went well:


In [3]:
taxa[:5]


Out[3]:
['Fish', 'Fish', 'Fish', 'Amphibian', 'Amphibian']

In [4]:
r_values[:5]


Out[4]:
[-0.11, 0.38, 0.51, 0.868, 0.297]

Now we write a function that, given a list of taxa names and corresponding r values, calculates the mean r for a given category of taxa:


In [5]:
def get_mean_r(names, values, target_taxon = 'Fish'):
    n = len(names)
    mean_r = 0.0
    sample_size = 0
    for i in range(n):
        if names[i] == target_taxon:
            mean_r = mean_r + values[i]
            sample_size = sample_size + 1
    return mean_r / sample_size

Test the function using Fish as target taxon:


In [6]:
get_mean_r(taxa, r_values, target_taxon = 'Fish')


Out[6]:
0.39719005173783783

Let's try to run this on all taxa. We can write a separate function that returns the set of unique taxa in the database:


In [7]:
def get_taxa_list(names):
    return(set(names))

In [8]:
get_taxa_list(taxa)


Out[8]:
{'Amphibian',
 'Annelids',
 'Bird',
 'Chelicerate',
 'Crustacean',
 'Fish',
 'Gastropod',
 'Insect',
 'Mammal',
 'Protist',
 'Reptile'}

Calculate the mean r for each taxon:


In [9]:
for t in get_taxa_list(taxa):
    print(t, get_mean_r(taxa, r_values, target_taxon = t))


Fish 0.39719005173783783
Chelicerate 0.49113529650000004
Annelids 0.2
Mammal 0.009
Insect 0.19664531553867934
Amphibian 0.18552824175524468
Reptile 0.11750000000000002
Bird 0.13175671104423078
Protist 0.61402
Crustacean 0.40302827731946345
Gastropod 0.40099999999999997

You should see that fish have a positive value of r, but that this is also true for other taxa. Is the mean value of r especially high for fish? To test this, compute a p-value by repeatedly sampling 37 values of r at random (37 experiments on fish are reported in the database), and calculating the probability of observing a higher mean value of r. To get an accurate estimate of the p-value, use 50,000 randomizations.

Are these values of assortative mating high, compared to what is expected by chance? We can try associating a p-value to each r value by repeatedly computing the mean r of randomized taxa and observing how often we obtain a mean r larger than the observed value. There are many other ways of obtaining such an emperical p-value, for example counting how many times a certain taxon is represented, and sampling the values at random.


In [10]:
import scipy # scipy for random shuffle

def get_p_value_for_mean_r(names, 
                           values, 
                           target_taxon = 'Fish', 
                           num_simulations = 1000):
    # compute the (observed) mean_r
    obs_mean_r = get_mean_r(names, values, target_taxon)
    # create a copy of the names, to be randomized
    rnd_names = names[:]
    # create counter for observations that are higher than obs_mean_r
    count_mean_r = 0.0
    for i in range(num_simulations):
        # shuffle the taxa names
        scipy.random.shuffle(rnd_names)
        # calculate mean r value of randomized data
        rnd_mean_r = get_mean_r(rnd_names, values, target_taxon)
        # count number of rdn_mean_r that are larger or equal to obs_mean_r
        if rnd_mean_r >= obs_mean_r:
            count_mean_r = count_mean_r + 1.0
    # calculate p_value: chance of observing rnd_r_mean larger than r_mean
    p_value = count_mean_r / num_simulations
    return [target_taxon, round(obs_mean_r, 3), round(p_value, 5)]

Let's try the function on Fish:


In [11]:
get_p_value_for_mean_r(taxa, r_values, 'Fish', 50000)


Out[11]:
['Fish', 0.397, 0.00328]

A very small p-value: this means that the observed mean r value (0.397) is larger than what we would expect by chance. Note that your calculated p-value might deviate slightly from ours given the randomness in a simulation.

Repeat the procedure for all taxa.


In [12]:
for t in get_taxa_list(taxa):
    print(get_p_value_for_mean_r(taxa, r_values, t, 50000))


['Fish', 0.397, 0.00348]
['Chelicerate', 0.491, 0.01142]
['Annelids', 0.2, 0.5905]
['Mammal', 0.009, 0.84474]
['Insect', 0.197, 0.99854]
['Amphibian', 0.186, 0.99998]
['Reptile', 0.118, 0.93066]
['Bird', 0.132, 0.99988]
['Protist', 0.614, 0.00394]
['Crustacean', 0.403, 0.0]
['Gastropod', 0.401, 0.07638]

Fish, Protists and Crustaceans have higher mean r values than expected by chance (p-value $\leq$ 0.01). Insects, Amphibians and Birds have lower values than expected by chance (p-value $\geq$ 0.99).