genetic distances

This is not the true genetic distance between samples since we start by sampling sites that are already variable. Rather, we can use this to calculate relative genetic distance matrices among samples. To calculate true genetic distances we would need to divide by the total number of sites examined (and should take into account missing data again). We could also apply a substitution model correction for homoplasy. None of that is implemented yet.

required software



In [1]:

    
# conda install ipyrad -c bioconda
# conda install toyplot -c eaton-lab (optional)



In [1]:

    
import ipyrad.analysis as ipa
import toyplot

Short tutorial

Setup input files and params



In [6]:

    
# the path to your VCF or HDF5 formatted snps file
data = "/home/deren/Downloads/cranos_pop4.snps.hdf5"
names = ipa.snps_extracter(data).names



In [7]:

    
# load object again this time entering imap and minmap using names
imap = {
    "DE_1":  [i for i in names if "DE_1." in i],
    "DE_4":  [i for i in names if "DE_4." in i],
    "DE_6":  [i for i in names if "DE_6." in i],
    "DE_8":  [i for i in names if "DE_8." in i],
    "DE_9":  [i for i in names if "DE_9." in i],
    "DE_10":  [i for i in names if "DE_10." in i],
    "DE_11":  [i for i in names if "DE_11." in i],
    "DE_12":  [i for i in names if "DE_12." in i],
    "DE_15":  [i for i in names if "DE_15." in i],
    "DE_18":  [i for i in names if "DE_18." in i],
    "DE_19":  [i for i in names if "DE_19." in i],
    "DE_22":  [i for i in names if "DE_22." in i],
    "DE_23":  [i for i in names if "DE_23." in i],
    "DE_24":  [i for i in names if "DE_24." in i],
    "DE_26":  [i for i in names if "DE_26." in i],
}

minmap={
    "DE_1": 4,
    "DE_4": 4,
    "DE_6": 4,
    "DE_8": 4,
    "DE_9": 4,
    "DE_10": 4,
    "DE_11": 4,
    "DE_12": 4,
    "DE_15": 4,
    "DE_18": 4,
    "DE_19": 4,
    "DE_22": 4,
    "DE_23": 4,
    "DE_24": 4,
    "DE_26": 4,
}

calculate distances



In [8]:

    
# load the snp data into distance tool with arguments
from ipyrad.analysis.distance import Distance
dist = Distance(
    data=data, 
    imap=imap,
    minmap=minmap,
    mincov=0.5,
    impute_method="sample",
    subsample_snps=False,
)
dist.run()









    



Samples: 102
Sites before filtering: 92109
Filtered (indels): 15037
Filtered (bi-allel): 6345
Filtered (mincov): 657
Filtered (minmap): 7580
Filtered (combined): 20625
Sites after filtering: 71484
Sites containing missing values: 62367 (87.25%)
Missing values in SNP matrix: 314975 (4.32%)
Imputation (sampled by freq. within pops): 94.2%, 2.9%, 2.9%

save results



In [10]:

    
# save to a CSV file
dist.dists.to_csv("cranolopha_distances.csv")



In [33]:

    
# save to a CSV file with no labels (eems style)
dist.dists.to_csv(
    "cranolopha_distances_eems.csv",
    header=None,
    index=False,
    sep=" ",
)

Draw the matrix

Hover over cells to see values in a pop-up.



In [28]:

    
toyplot.matrix(
    dist.dists, 
    bshow=False,
    tshow=False,
);

Order samples



In [14]:

    
# get list of concatenated names from each group
ordered_names = []
for group in dist.imap.values():
    ordered_names += group

# reorder matrix to match name order    
ordered_matrix = dist.dists[ordered_names].T[ordered_names]



In [24]:

    
c, a = toyplot.matrix(
    ordered_matrix,
    margin=20,
    #lshow=False,
    tshow=False,
    llabel="population",
    llocator=toyplot.locator.Explicit(
        range(len(ordered_names)),
        [i.split("_")[1].split(".")[0] for i in ordered_names],
))