First, we're going to import the csv module, and read the data. We store the taxon name in the list Taxa
, and the corresponding r
value in the list r_values
. Note that we need to convert the values to float
(we need numbers, and they are read as strings).
In [1]:
import csv
In [2]:
with open('../data/Jiang2013_data.csv') as csvfile:
reader = csv.DictReader(csvfile, delimiter = '\t')
taxa = []
r_values = []
for row in reader:
taxa.append(row['Taxon'])
r_values.append(float(row['r']))
Make sure that everything went well:
In [3]:
taxa[:5]
Out[3]:
In [4]:
r_values[:5]
Out[4]:
Now we write a function that, given a list of taxa names and corresponding r values, calculates the mean r for a given category of taxa:
In [5]:
def get_mean_r(names, values, target_taxon = 'Fish'):
n = len(names)
mean_r = 0.0
sample_size = 0
for i in range(n):
if names[i] == target_taxon:
mean_r = mean_r + values[i]
sample_size = sample_size + 1
return mean_r / sample_size
Testing using Fish
:
In [6]:
get_mean_r(taxa, r_values, target_taxon = 'Fish')
Out[6]:
Let's try to run this on all taxa. We can write a little function that returns the set of unique taxa in the database:
In [7]:
def get_taxa_list(names):
return(set(names))
In [8]:
get_taxa_list(taxa)
Out[8]:
Calculate the mean r for each taxon:
In [9]:
for t in get_taxa_list(taxa):
print(t, get_mean_r(taxa, r_values, target_taxon = t))
r
, but that this is also true for other taxa. Is the mean value of r
especially high for fish? To test this, compute a p-value by repeatedly sampling 37 values of r
(37 experiments on fish are reported in the database) at random, and calculating the probability of observing a higher mean value of r
. To get an accurate estimate of the p-value, use 50,000 randomizations.Are these values of assortative mating high, compared to what expected at random? We can try associating a p-value to each r
value by repeatedly computing the mean r
value for the taxa, once we scrambled the taxa names! (There are many other ways of doing the same thing, for example counting how many times a certain taxon is represented, and sampling the values at random).
In [10]:
import scipy
def get_p_value_for_mean_r(names,
values,
target_taxon = 'Fish',
num_simulations = 1000):
# first, compute the observed mean
observed = get_mean_r(names, values, target_taxon)
# now create a copy of the names, to be randomized
rnd_names = names[:]
p_value = 0.0
for i in range(num_simulations):
# shuffle the fake names
scipy.random.shuffle(rnd_names)
tmp = get_mean_r(rnd_names, values, target_taxon)
if tmp >= observed:
p_value = p_value + 1.0
p_value = p_value / num_simulations
return [target_taxon, round(observed, 3), round(p_value, 5)]
Let's try the function on Fish
:
In [11]:
get_p_value_for_mean_r(taxa, r_values, 'Fish', 50000)
Out[11]:
A very small p-value: this means that the observed value (0.397) is larger than what we would expect by chance.
In [12]:
for t in get_taxa_list(taxa):
print(get_p_value_for_mean_r(taxa, r_values, t, 50000))
Meaning that Fish, Protists and Crustaceans have high values, while Amphibians and Birds lower values than expected by chance.