Notebook to analyse if dosage compensated genes on the Z chromosome are more likely to be ohnologs

In this notebook I will test if ohnologs located on the Z chromosome are more likely to be dosage compensated.

I use a list of Z ohnologs from http://ohnologs.curie.fr/ as an input set and validate this with Ensembl paralog data.



In [286]:

    
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact, mannwhitneyu, wilcoxon, ranksums, chi2_contingency, chisquare
from tabulate import tabulate
import requests
import time

import matplotlib.pyplot as plt
from matplotlib.lines import Line2D 
import matplotlib
%matplotlib inline
from matplotlib_venn import venn3, venn3_circles, venn3_unweighted, venn2_unweighted

Read in EdgeR files and filter expression for all chromosomes

Dosage compensated genes are defined as not differentially expressed between males and females using the internal edgeR calculation.

First read in the csv files:



In [487]:

    
def read_detags(infile):
    ''' Wrap the Pandas csv function'''
    detags = pd.read_csv(infile)
    return detags

de_gonad = read_detags("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-GONAD-detags-edgeR.csv")
de_spleen = read_detags("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-SPLEEN-detags-edgeR.csv")
de_heart = read_detags("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-HEART-detags-edgeR.csv")
de_liver = read_detags("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-LIVER-detags-edgeR.csv")

de_gonad.head()









    Out[487]:






  
    
      
      logFC
      logCPM
      PValue
      FDR
    
  
  
    
      ENSGALG00000000386
        8.108556
       6.594766
        0.000000e+00
        0.000000e+00
    
    
      ENSGALG00000001756
        8.684406
       7.036651
       9.761462e-316
       7.589536e-312
    
    
      ENSGALG00000013312
        9.834955
       5.384945
       9.475218e-253
       4.911321e-249
    
    
      ENSGALG00000022674
       12.677968
       5.112837
       3.101470e-219
       1.205696e-215
    
    
      ENSGALG00000022679
       12.374174
       4.810069
       2.111286e-215
       6.566100e-212



In [288]:

    
def anti_log2(i):
    return np.power(2, i)

def log2(i):
    return np.log2(i)

A note on the logFC

The edegR logFC value is calculated internally using some additional processing.(see below)

Also "note that the first group listed in the pair is the baseline for the comparison—so if exactTest the pair is c("A","B") then the comparison is B - A, so genes with positive log-fold change are up-regulated in group B compared with group A (and vice versa for genes with negative log-fold change)." Source EdgeR: Manual

In our case this means that the Male expression is substracted from the Female expression (F-M): Positive logFC is upregulation in females, negative logFC is upregulation in Males.

On the Z that means the closer we get to 0.0 logFC the more DC occurs.



In [289]:

    
rpkm_gonad = pd.read_csv("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-GONAD-RPKM-NORM-log2-edgeR.csv")
rpkm_spleen = pd.read_csv("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-SPLEEN-RPKM-NORM-log2-edgeR.csv")
rpkm_heart = pd.read_csv("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-HEART-RPKM-NORM-log2-edgeR.csv")
rpkm_liver = pd.read_csv("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-LIVER-RPKM-NORM-log2-edgeR.csv")

Here I did a quick test to see if the logCPM from edgeR and manual calculations are the same. It turns out that because of internal EdgeR processing the log values are very similar, but not the same.

https://stat.ethz.ch/pipermail/bioconductor/2014-February/057453.html According to this the logCPM is across all libraries. "> Note that the logCPM column of the topTags output represents average logCPM over all the libraries." Same here: http://grokbase.com/t/r/bioconductor/1449tbyet6/bioc-question-about-output-of-edger

Also "Given an RNA-seq dataset, the overall expression level of each gene is 8 calculated as an average across all samples and expressed as an average log countper-million (logCPM) using the aveLogCPM function." http://www.statsci.org/smyth/pubs/edgeRChapterPreprint.pdf

TopTags uses input from an exactTest object, wich contains an avelogCPM column. So the TopTags logCPM refers to that.

de.out <- data.frame(logFC = logFC, logCPM = AveLogCPM, PValue = exact.pvals)



In [420]:

    
cpm_gonad = pd.read_csv("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-GONAD-CPM-NORM-log2-edgeR.csv")
cpm_spleen = pd.read_csv("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-SPLEEN-CPM-NORM-log2-edgeR.csv")
cpm_heart = pd.read_csv("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-HEART-CPM-NORM-log2-edgeR.csv")
cpm_liver = pd.read_csv("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-LIVER-CPM-NORM-log2-edgeR.csv")

# Check logFC calculation
avgCPM = cpm_gonad[:5].applymap(anti_log2).mean(axis=1).map(log2)
# print cpm_heart[:5]
# avgCPM = cpm_spleen[:5].mean(axis=1)
m_mean = cpm_gonad.ix[:5,0:4].applymap(anti_log2).mean(axis=1).map(log2)
f_mean =  cpm_gonad.ix[:5,4:8].applymap(anti_log2).mean(axis=1).map(log2)
# print f_mean - m_mean
print avgCPM
# print m_mean
# print f_mean

print de_gonad.loc["ENSGALG00000000003"]
print de_gonad.loc["ENSGALG00000000004"]
print de_gonad.loc["ENSGALG00000000011"]
print de_gonad.loc["ENSGALG00000000013"]
print de_gonad.loc["ENSGALG00000000014"]









    



ENSGALG00000000003    3.691593
ENSGALG00000000004    4.411972
ENSGALG00000000011    4.471795
ENSGALG00000000013    3.450193
ENSGALG00000000014   -0.056549
dtype: float64
logFC    -0.934248
logCPM    3.711351
PValue    0.000004
FDR       0.000025
Name: ENSGALG00000000003, dtype: float64
logFC    -0.479235
logCPM    4.422986
PValue    0.003436
FDR       0.010554
Name: ENSGALG00000000004, dtype: float64
logFC     0.084192
logCPM    4.482635
PValue    0.618495
FDR       0.740100
Name: ENSGALG00000000011, dtype: float64
logFC    -0.244049
logCPM    3.471562
PValue    0.161962
FDR       0.267738
Name: ENSGALG00000000013, dtype: float64
logFC     0.403703
logCPM    0.152203
PValue    0.270645
FDR       0.397481
Name: ENSGALG00000000014, dtype: float64

Here we add some positional information (where are genes located):



In [483]:

    
def read_gid_to_chr(infile):
    '''Reads in a file obtained from Ensembl mapping each gene to its chromosomal location'''
    gid_to_chr = {}
    with open(infile, "rb") as handle:
        for line in handle:
            if line.startswith("Ensembl"):
                continue
            else:
                line = line.rstrip().split()
                gid_to_chr[line[0]] = line[1]
    return gid_to_chr

gid_to_chr = read_gid_to_chr("/data/ucbtzim/ohnolog-dc/dc-comparison/Gallus_gallus.75.gid.chr.proteincoding.tsv")
# Get a list of Z linked gene IDs
Z_genes = [g for g, pos in gid_to_chr.items() if pos == "Z"]
print "Number of genes (protein coding) on the Z chromosome:", len(Z_genes)
print "Number of all genes (protein coding + pseudogenes): ", len(gid_to_chr)

chicken_chromosomes = [str(i) for i in range(29)] + ["32", "Z"]

autosomal_genes = [g for g, pos in gid_to_chr.items() if pos in chicken_chromosomes]

rest = [g for g, pos in gid_to_chr.items() if pos not in chicken_chromosomes]


print "Number of genes only on the autosomes + Z: ", len(autosomal_genes)
# print rest









    



Number of genes (protein coding) on the Z chromosome: 764
Number of all genes (protein coding + pseudogenes):  15550
Number of genes only on the autosomes + Z:  14496



In [292]:

    
# Z genes that have a W ortholog
Z_W_orthologs = ["ENSGALG00000013809",
                 "ENSGALG00000001668",
                 "ENSGALG00000018639",
                 "ENSGALG00000014697",
                 "ENSGALG00000003049",
                 "ENSGALG00000026192",
                 "ENSGALG00000001808",
                 "ENSGALG00000015391",
                 "ENSGALG00000014734",
                 "ENSGALG00000025641", #sno RNA
                 "ENSGALG00000025660"] #sno RNA
Z_genes_no_W = [g for g in Z_genes if g not in Z_W_orthologs]
print "Number of genes on the Z chromosome without W orthologs:", len(Z_genes_no_W)









    



Number of genes on the Z chromosome without W orthologs: 755

11 Z genes have a paralog on the W, which may be excluded, because they have no selection for dosage compensation. If excluded, recovered results are similar. Many of these genes are Ohnologs!

Now we filter out all scaffolds and the W chromosome. Only keep autosomes 1-28 and the Z chromosome



In [293]:

    
# Create a list of chicken chromosomes
# Converts the integers into strings, as they are names and to match the read_gid_to_chr()
chicken_chromosomes = [str(i) for i in range(29)] + ["32", "Z"]

# Similar to http://pandas.pydata.org/pandas-docs/stable/indexing.html
# search for lambda
de_gonad = de_gonad[de_gonad.index.map(lambda x: gid_to_chr[x] in chicken_chromosomes)]
de_spleen = de_spleen[de_spleen.index.map(lambda x: gid_to_chr[x] in chicken_chromosomes)]
de_heart = de_heart[de_heart.index.map(lambda x: gid_to_chr[x] in chicken_chromosomes)]
de_liver = de_liver[de_liver.index.map(lambda x: gid_to_chr[x] in chicken_chromosomes)]

print de_gonad.shape
print de_spleen.shape
print de_heart.shape
print de_liver.shape









    



(14496, 4)
(14496, 4)
(14496, 4)
(14496, 4)



In [431]:

    
def filter_detags(detags, mincpm):
    '''Filter out lowly expressed genes based on a fixed minimum CPM threshold'''
    filtered = detags[detags["logCPM"] > mincpm]
    return filtered

# Beware average CPM is log2 transformed.
# For avgCPM of 2 use 1 | log2(2) = 1
de_gonad_min2 = filter_detags(de_gonad, 1)
de_spleen_min2 = filter_detags(de_spleen, 1)
de_heart_min2 = filter_detags(de_heart, 1)
de_liver_min2 = filter_detags(de_liver, 1)

print de_gonad_min2.shape
print de_heart_min2.shape
print de_liver_min2.shape
print de_spleen_min2.shape

print de_gonad_min2.dtypes
print de_heart_min2.dtypes
print de_liver_min2.dtypes
print de_spleen_min2.dtypes









    



(11944, 4)
(9971, 4)
(9383, 4)
(10847, 4)
logFC     float64
logCPM    float64
PValue    float64
FDR       float64
dtype: object
logFC     float64
logCPM    float64
PValue    float64
FDR       float64
dtype: object
logFC     float64
logCPM    float64
PValue    float64
FDR       float64
dtype: object
logFC     float64
logCPM    float64
PValue    float64
FDR       float64
dtype: object



In [432]:

    
def get_chromosomal_expr(detags, genes):
    '''Returns a dataframe that contains only genes given in the gene list.
       All genes are expected to be from the same chromosome.
    '''
    return detags[detags.index.isin(genes)]

def filter_foldchange(detags):
    '''Filter by foldchange, here hardcoded for our dosage compensation criterion'''
    DC = detags[(detags["logFC"] >= -0.5) & (detags["logFC"] <= 0.5)]
    biased = detags[(detags["logFC"] < -0.5) | (detags["logFC"] > 0.5)]
    return biased, DC

# Filter the expression so it only contains Z expression
Z_de_gonad_min2 = get_chromosomal_expr(de_gonad_min2, Z_genes)
Z_de_spleen_min2 = get_chromosomal_expr(de_spleen_min2, Z_genes)
Z_de_heart_min2 = get_chromosomal_expr(de_heart_min2, Z_genes)
Z_de_liver_min2 = get_chromosomal_expr(de_liver_min2, Z_genes)

print "GONAD", Z_de_gonad_min2.shape
print "HEART", Z_de_heart_min2.shape
print "LIVER", Z_de_liver_min2.shape
print "SPLEEN",Z_de_spleen_min2.shape
# Filter expression based on log fold change. if within -0.5 to 0.5 call DC
# Else is biased in some form.
Z_de_gonad_min2_biased, Z_de_gonad_min2_dc = filter_foldchange(Z_de_gonad_min2)
Z_de_spleen_min2_biased, Z_de_spleen_min2_dc = filter_foldchange(Z_de_spleen_min2)
Z_de_heart_min2_biased, Z_de_heart_min2_dc = filter_foldchange(Z_de_heart_min2)
Z_de_liver_min2_biased, Z_de_liver_min2_dc = filter_foldchange(Z_de_liver_min2)

print tabulate([["Tissue", "[Sex]-biased", "Dosage compensated"],
                ["Gonad", Z_de_gonad_min2_biased.shape[0], Z_de_gonad_min2_dc.shape[0]],
                ["Spleen", Z_de_spleen_min2_biased.shape[0], Z_de_spleen_min2_dc.shape[0]],
                ["Heart", Z_de_heart_min2_biased.shape[0], Z_de_heart_min2_dc.shape[0]],
                ["Liver", Z_de_liver_min2_biased.shape[0], Z_de_liver_min2_dc.shape[0]]],
                headers="firstrow")

print
print "Degree of dosage compensation vs rest in all tissues"
print "Fisher's exact test (spleen vs heart): ", fisher_exact([[Z_de_spleen_min2_biased.shape[0],Z_de_spleen_min2_dc.shape[0]],
                                             [Z_de_heart_min2_biased.shape[0], Z_de_heart_min2_dc.shape[0]]])

print "Fisher's exact test (spleen vs liver): ", fisher_exact([[Z_de_spleen_min2_biased.shape[0],Z_de_spleen_min2_dc.shape[0]],
                                             [Z_de_liver_min2_biased.shape[0], Z_de_liver_min2_dc.shape[0]]])

print "Fisher's exact test (spleen vs gonad): ", fisher_exact([[Z_de_spleen_min2_biased.shape[0],Z_de_spleen_min2_dc.shape[0]],
                                             [Z_de_gonad_min2_biased.shape[0], Z_de_gonad_min2_dc.shape[0]]])

print "Fisher's exact test (heart vs liver): ", fisher_exact([[Z_de_heart_min2_biased.shape[0],Z_de_heart_min2_dc.shape[0]],
                                             [Z_de_liver_min2_biased.shape[0], Z_de_liver_min2_dc.shape[0]]])

print "Fisher's exact test (heart vs gonad): ", fisher_exact([[Z_de_heart_min2_biased.shape[0],Z_de_heart_min2_dc.shape[0]],
                                             [Z_de_gonad_min2_biased.shape[0], Z_de_gonad_min2_dc.shape[0]]])

print "Fisher's exact test (liver vs gonad): ", fisher_exact([[Z_de_liver_min2_biased.shape[0],Z_de_liver_min2_dc.shape[0]],
                                             [Z_de_gonad_min2_biased.shape[0], Z_de_gonad_min2_dc.shape[0]]])

print "The two 189 numbers in the table are not the same! (paranoia check)"
len(set(Z_de_liver_min2_biased.index).intersection(set(Z_de_gonad_min2_dc.index)))









    



GONAD (604, 4)
HEART (493, 4)
LIVER (445, 4)
SPLEEN (529, 4)
Tissue      [Sex]-biased    Dosage compensated
--------  --------------  --------------------
Gonad                415                   189
Spleen               223                   306
Heart                230                   263
Liver                193                   252

Degree of dosage compensation vs rest in all tissues
Fisher's exact test (spleen vs heart):  (0.83331912475134984, 0.16566695847350732)
Fisher's exact test (spleen vs liver):  (0.95153916488875345, 0.74513868887312928)
Fisher's exact test (spleen vs gonad):  (0.33189227498228208, 2.2849385903242662e-19)
Fisher's exact test (heart vs liver):  (1.1418664670304774, 0.32478521303199592)
Fisher's exact test (heart vs gonad):  (0.39827752072930506, 1.7346123144608079e-13)
Fisher's exact test (liver vs gonad):  (0.34879518072289156, 2.2551727741897068e-16)
The two 189 numbers in the table are not the same! (paranoia check)






    Out[432]:





38

OHNOLOG DB candidates

Using candidates from http://ohnologs.curie.fr/
We use the relaxed set for the main analysis, because we focus on Z ohnologs only and we would loose power.
The analysis with the strict set returns similar results (other notebook)



In [343]:

    
def read_ohnolog_pairs(infile):
    ohnolog_pairs = pd.read_csv(infile,  sep='\t')
    return ohnolog_pairs


ohnolog_curie_pairs = read_ohnolog_pairs("/data/ucbtzim/ohnolog-dc/ohnologs-curie/CHICKEN.Pairs.Relaxed.2R.txt")

# ohnolog_curie_pairs = read_ohnolog_pairs("/data/ucbtzim/ohnolog-dc/ohnologs-curie/CHICKEN.Pairs.Strict.2R.txt")
ohno1 = list(ohnolog_curie_pairs["Ohnolog-1 Id"])
print len(ohno1)
ohno2 = list(ohnolog_curie_pairs["Ohnolog-2 Id"])
# We are interested in unique ohnologs, not pairs
combined_ohno = ohno1+ohno2
ohnolog_curie_pairs_relaxed_ids = list(set(combined_ohno))
print "Number of ohnologs detected in the chicken genome: ", len(ohnolog_curie_pairs_relaxed_ids)









    



4282
Number of ohnologs detected in the chicken genome:  5228



In [468]:

    
Z_ohnologs_curie = []
for gid in ohnolog_curie_pairs_relaxed_ids:
    # Some genes in the DB are not in this genome release Ensembl 75
    if gid in gid_to_chr:
        if gid_to_chr[gid] == "Z":
            Z_ohnologs_curie.append(gid)

mapped_ohnologs_curie = []

for gid in ohnolog_curie_pairs_relaxed_ids:
    # Some genes in the DB are not in this genome release Ensembl 75
    if gid in gid_to_chr:
        mapped_ohnologs_curie.append(gid)
print "Mapped ohnologs: ", len(set(mapped_ohnologs_curie))
print "Number of ohnologs on Z according to DB Curie:", len(set(Z_ohnologs_curie))
# print Z_ohnologs_curie









    



Mapped ohnologs:  4871
Number of ohnologs on Z according to DB Curie: 223



In [345]:

    
# Difference between human and chicken genome

human_ohnolog_curie_pairs = read_ohnolog_pairs("/data/ucbtzim/ohnolog-dc/ohnologs-curie/HUMAN.Pairs.Relaxed.2R.txt")
human_ohno1 = list(human_ohnolog_curie_pairs["Ohnolog-1 Id"])
print len(human_ohno1)
human_ohno2 = list(human_ohnolog_curie_pairs["Ohnolog-2 Id"])
human_ohnolog_curie_pairs_relaxed_ids = list(set(human_ohno1+human_ohno2))
print "Number of ohnologs detected in the human genome: ", len(human_ohnolog_curie_pairs_relaxed_ids)

# Protein coding numbers from Ensembl 82
print chi2_contingency(np.array([[5228, 7831],[(15508-5228),(20296-7831)]]),correction=True)
# Percentages
print "Chicken: ", 5228/15508.
print "Human: ", 7831./20296
print human_ohnolog_curie_pairs.shape









    



8178
Number of ohnologs detected in the human genome:  7831
(89.858641137822175, 2.557988980939207e-21, 1, array([[  5656.3225338,   7402.6774662],
       [  9851.6774662,  12893.3225338]]))
Chicken:  0.337116327057
Human:  0.3858395743
(8178, 16)

Test for Ohnolog dosage compensation



In [358]:

    
def get_categories(ohnologs, biased, dc):
    
    results={"DC_OHNO": [],
             "NO_DC_OHNO": [],
             "DC_NO_OHNO": [],
             "NO_DC_NO_OHNO": []
            }
    
    for g in dc.index:
        if g in ohnologs:
            results["DC_OHNO"].append(g)
        else:
            results["DC_NO_OHNO"].append(g)
            
    for g in biased.index:
        if g in ohnologs:
            results["NO_DC_OHNO"].append(g)
        else:
            results["NO_DC_NO_OHNO"].append(g)

    print tabulate([["", "OHNO","NO-OHNO"],
                    ["DC", len(results["DC_OHNO"]), len(results["DC_NO_OHNO"])],
                    ["NODC",len(results["NO_DC_OHNO"]), len(results["NO_DC_NO_OHNO"])]],
                    headers="firstrow")
    print "Fisher's exact test: ", fisher_exact([[len(results["DC_OHNO"]),len(results["DC_NO_OHNO"])],
                                                 [len(results["NO_DC_OHNO"]), len(results["NO_DC_NO_OHNO"])]])
    #Rotate the table, no difference
#     print "Fisher's exact test: ", fisher_exact([[len(results["DC_OHNO"]),len(results["NO_DC_OHNO"])],
#                                              [len(results["DC_NO_OHNO"]), len(results["NO_DC_NO_OHNO"])]])
    print "Sum: ", sum([len(results["DC_OHNO"]),len(results["DC_NO_OHNO"]),
                        len(results["NO_DC_OHNO"]), len(results["NO_DC_NO_OHNO"])])
    print ""
    
    return results

print "Gonad"
gonad_result = get_categories(Z_ohnologs_curie, Z_de_gonad_min2_biased, Z_de_gonad_min2_dc)
print "Spleen"
spleen_result = get_categories(Z_ohnologs_curie, Z_de_spleen_min2_biased, Z_de_spleen_min2_dc)
print "Heart"
heart_result = get_categories(Z_ohnologs_curie, Z_de_heart_min2_biased, Z_de_heart_min2_dc)
print "Liver"
liver_result = get_categories(Z_ohnologs_curie, Z_de_liver_min2_biased, Z_de_liver_min2_dc)









    



Gonad
        OHNO    NO-OHNO
----  ------  ---------
DC        86        103
NODC     115        300
Fisher's exact test:  (2.1781342338539469, 2.5656590279759648e-05)
Sum:  604

Spleen
        OHNO    NO-OHNO
----  ------  ---------
DC       126        180
NODC      51        172
Fisher's exact test:  (2.3607843137254902, 1.0758431419184137e-05)
Sum:  529

Heart
        OHNO    NO-OHNO
----  ------  ---------
DC       111        152
NODC      54        176
Fisher's exact test:  (2.3801169590643276, 1.0612190707958937e-05)
Sum:  493

Liver
        OHNO    NO-OHNO
----  ------  ---------
DC       105        147
NODC      41        152
Fisher's exact test:  (2.6480836236933798, 6.5250335207362417e-06)
Sum:  445

Overlap between dosage-compensated ohnologs in different tissues



In [370]:

    
spleen_dc_ohno = set(spleen_result["DC_OHNO"])
heart_dc_ohno = set(heart_result["DC_OHNO"])
liver_dc_ohno = set(liver_result["DC_OHNO"])
gonad_dc_ohno = set(gonad_result["DC_OHNO"])
plt.figure(figsize=(3,3))
venn = venn3_unweighted([spleen_dc_ohno, heart_dc_ohno, liver_dc_ohno], ('Spleen', 'Heart', 'Liver'))
plt.savefig("2015-10-27-ohnolog-dc-soma-venn-overlap.pdf", dpi=300)



In [348]:

    
soma_dc_ohno = spleen_dc_ohno.intersection(heart_dc_ohno,liver_dc_ohno)
print len(soma_dc_ohno)
plt.figure(figsize=(3,3))
venn = venn2_unweighted([soma_dc_ohno, gonad_dc_ohno], ('Soma', 'Gonad'))
plt.savefig("2015-10-27-ohnolog-dc-venn-soma-gonad-overlap.pdf", dpi=300)

Overlap between dosage-compensated non-ohnologs in different tissues



In [349]:

    
spleen_dc_no_ohno = set(spleen_result["DC_NO_OHNO"])
heart_dc_no_ohno = set(heart_result["DC_NO_OHNO"])
liver_dc_no_ohno = set(liver_result["DC_NO_OHNO"])
gonad_dc_no_ohno = set(gonad_result["DC_NO_OHNO"])

plt.figure(figsize=(3,3))
venn = venn3_unweighted([spleen_dc_no_ohno, heart_dc_no_ohno, liver_dc_no_ohno], ('Spleen', 'Heart', 'Liver'))
# plt.savefig("2015-10-06-ohnolog-dc-noohno-soma-venn-overlap.pdf", dpi=300)



In [350]:

    
soma_dc_no_ohno = spleen_dc_no_ohno.intersection(heart_dc_no_ohno,liver_dc_no_ohno)
print len(soma_dc_ohno)
plt.figure(figsize=(3,3))
venn = venn2_unweighted([soma_dc_no_ohno, gonad_dc_no_ohno], ('Soma', 'Gonad'))
# plt.savefig("2015-10-06-ohnolog-dc-venn-soma-gonad-overlap.pdf", dpi=300)

How many ohnologs have partners on the Z chromosome?

If both ohnologs are located on the Z chromosome this may be reason to exclude these pairs?



In [351]:

    
Z_ohnolog_pairs = ohnolog_curie_pairs[(ohnolog_curie_pairs["Ohnolog-1 Id"].isin(Z_genes)) & (ohnolog_curie_pairs["Ohnolog-2 Id"].isin(Z_genes))]
print Z_ohnolog_pairs.shape[0], " pairs (8 genes) are ohnologs with partners on the Z chromosome"









    



4  pairs (8 genes) are ohnologs with partners on the Z chromosome

Compare ohnolog vs other paralogs

Ohnologs are a subset of all paralogs on the Z chromosome. Ohnologs should be more likely to be dosage compensated, but regular paralogs should not be more likely to be dosage compensated. I therefore query the Ensembl REST API to get a dictionary with all genes on the Z chromosome and their status Paralog (duplicates on other chromosomes) ohnologs.



In [458]:

    
def ensembl_paralogs(identifier):
    '''Access to the Ensembl REST API'''
    payload = {"content-type": "application/json",
               "type": "paralogues"}
    r = requests.get('http://rest.ensembl.org/homology/id/'+identifier, params=payload)
    return r.json()

Z_paralogs_no_ohno = []

paralog_types = []
for g in Z_genes:
    # Be polite - don't hammer the API
    time.sleep(0.05)
    query_result = ensembl_paralogs(g)
    if len(query_result["data"][0]["homologies"]) != 0:
        for p in query_result["data"][0]["homologies"]:
            paralog_types.append(p["type"])
            if p["type"] == "within_species_paralog":
                # There can be multiple paralogs
                if g not in Z_ohnologs_curie and g not in Z_paralogs_no_ohno:
                    Z_paralogs_no_ohno.append(g)
    else:
        Z_solo += 1

print len(Z_paralogs_no_ohno)

See for an explanation: http://www.ensembl.org/info/genome/compara/homology_method.html Paralogues Any gene pairwise relation where the ancestor node is a duplication event. We predict several descriptions of paralogues:

same-species paralogies (within_species_paralog)

fragments of the same ‐predicted‐ gene (gene_split)

A paralog labelled as a gene_split is an artefactual type of paralogy. It is commonly related to fragmented genome assemblies or a gene prediction that is poor in supporting evidences (cDNA, ESTs, proteins, etc.).



In [459]:

    
print set(paralog_types)
# print Z_paralogs_no_ohno









    



set([u'gene_split', u'within_species_paralog'])



In [460]:

    
print len(Z_paralogs_no_ohno)
def get_categories_para(paralogs, biased, dc):
    results={"DC_PARA": [],
             "NO_DC_PARA": [],
             "DC_NO_PARA": [],
             "NO_DC_NO_PARA": []
            }
    
    for g in dc.index:
        if g in paralogs:
            results["DC_PARA"].append(g)
        else:
            results["DC_NO_PARA"].append(g)
            
    for g in biased.index:
        if g in paralogs:
            results["NO_DC_PARA"].append(g)
        else:
            results["NO_DC_NO_PARA"].append(g)
    print tabulate([["", "PARA","NO-PARA"],
                    ["DC", len(results["DC_PARA"]), len(results["DC_NO_PARA"])],
                    ["NODC",len(results["NO_DC_PARA"]), len(results["NO_DC_NO_PARA"])]],
                    headers="firstrow")
    print "Fisher's exact test: ", fisher_exact([[len(results["DC_PARA"]),len(results["DC_NO_PARA"])],
                                                 [len(results["NO_DC_PARA"]), len(results["NO_DC_NO_PARA"])]])
    print ""
    print sum([len(results["DC_PARA"]),len(results["DC_NO_PARA"]),len(results["NO_DC_PARA"]), len(results["NO_DC_NO_PARA"])])
    return results
    
print "Gonad"
gonad_result = get_categories_para(Z_paralogs_no_ohno, Z_de_gonad_min2_biased, Z_de_gonad_min2_dc)
print "Spleen"
spleen_result = get_categories_para(Z_paralogs_no_ohno, Z_de_spleen_min2_biased, Z_de_spleen_min2_dc)
print "Heart"
heart_result = get_categories_para(Z_paralogs_no_ohno, Z_de_heart_min2_biased, Z_de_heart_min2_dc)
print "Liver"
liver_result = get_categories_para(Z_paralogs_no_ohno, Z_de_liver_min2_biased, Z_de_liver_min2_dc)









    



291
Gonad
        PARA    NO-PARA
----  ------  ---------
DC        55        134
NODC     141        274
Fisher's exact test:  (0.79760770615010057, 0.26105686847626286)

604
Spleen
        PARA    NO-PARA
----  ------  ---------
DC        96        210
NODC      61        162
Fisher's exact test:  (1.2140515222482435, 0.3362557340235639)

529
Heart
        PARA    NO-PARA
----  ------  ---------
DC        76        187
NODC      71        159
Fisher's exact test:  (0.91014536416359115, 0.69316888593745341)

493
Liver
        PARA    NO-PARA
----  ------  ---------
DC        76        176
NODC      57        136
Fisher's exact test:  (1.0303030303030303, 0.91702999186184264)

445



In [ ]:

Ohnolog expression fold change vs non-ohnolog fold change on the Z chromosome



In [488]:

    
def split_expr(ohnologs, expr):
    '''Split the detags into those that are ohnologs and those that are not'''
    ohno_index = expr.index.isin(ohnologs)
    ohnolog_fc = expr[ohno_index]
    other_fc = expr[~ohno_index]
    
    return ohnolog_fc, other_fc


fc_tissues = {"Gonad":{},"Spleen":{},"Heart":{},"Liver":{}}
print "Gonad"
gonad_ohnolog_fc, gonad_other_fc = split_expr(Z_ohnologs_curie, Z_de_gonad_min2)
print np.mean(gonad_ohnolog_fc["logFC"]), np.mean(gonad_other_fc["logFC"])
print np.median(gonad_ohnolog_fc["logFC"]), np.median(gonad_other_fc["logFC"])
print ranksums(gonad_ohnolog_fc["logFC"], gonad_other_fc["logFC"])
print '%.2f' % round(ranksums(gonad_ohnolog_fc["logFC"], gonad_other_fc["logFC"])[0], 2)
fc_tissues["Gonad"]["OHNO"] = gonad_ohnolog_fc["logFC"]
fc_tissues["Gonad"]["NO-OHNO"] = gonad_other_fc["logFC"]
print mannwhitneyu(gonad_ohnolog_fc["logFC"], gonad_other_fc["logFC"])[1]*2
print len(set(fc_tissues["Gonad"]["OHNO"])), len(set(fc_tissues["Gonad"]["NO-OHNO"]))

print "Spleen"
spleen_ohnolog_fc, spleen_other_fc = split_expr(Z_ohnologs_curie, Z_de_spleen_min2)
print np.mean(spleen_ohnolog_fc["logFC"]), np.mean(spleen_other_fc["logFC"])
print np.median(spleen_ohnolog_fc["logFC"]), np.median(spleen_other_fc["logFC"])
print ranksums(spleen_ohnolog_fc["logFC"], spleen_other_fc["logFC"])
print '%.2f' % round(ranksums(spleen_ohnolog_fc["logFC"], spleen_other_fc["logFC"])[0], 2)
fc_tissues["Spleen"]["OHNO"] = spleen_ohnolog_fc["logFC"]
fc_tissues["Spleen"]["NO-OHNO"] = spleen_other_fc["logFC"]
print mannwhitneyu(spleen_ohnolog_fc["logFC"], spleen_other_fc["logFC"])[1]*2
print len(set(fc_tissues["Spleen"]["OHNO"])), len(set(fc_tissues["Spleen"]["NO-OHNO"]))

print "Heart"
heart_ohnolog_fc, heart_other_fc = split_expr(Z_ohnologs_curie, Z_de_heart_min2)
print np.mean(heart_ohnolog_fc["logFC"]), np.mean(heart_other_fc["logFC"])
print np.median(heart_ohnolog_fc["logFC"]), np.median(heart_other_fc["logFC"])
print ranksums(heart_ohnolog_fc["logFC"], heart_other_fc["logFC"])
print '%.2f' % round(ranksums(heart_ohnolog_fc["logFC"], heart_other_fc["logFC"])[0], 2)
fc_tissues["Heart"]["OHNO"] = heart_ohnolog_fc["logFC"]
fc_tissues["Heart"]["NO-OHNO"] = heart_other_fc["logFC"]
print mannwhitneyu(heart_ohnolog_fc["logFC"], heart_other_fc["logFC"])[1]*2
print len(set(fc_tissues["Heart"]["OHNO"])), len(set(fc_tissues["Heart"]["NO-OHNO"]))

print "Liver"
liver_ohnolog_fc, liver_other_fc = split_expr(Z_ohnologs_curie, Z_de_liver_min2)
print np.mean(liver_ohnolog_fc["logFC"]), np.mean(liver_other_fc["logFC"])
print np.median(liver_ohnolog_fc["logFC"]), np.median(liver_other_fc["logFC"])
print ranksums(liver_ohnolog_fc["logFC"], liver_other_fc["logFC"])
print '%.2f' % round(ranksums(liver_ohnolog_fc["logFC"], liver_other_fc["logFC"])[0], 2)
fc_tissues["Liver"]["OHNO"] = liver_ohnolog_fc["logFC"]
fc_tissues["Liver"]["NO-OHNO"] = liver_other_fc["logFC"]
print mannwhitneyu(liver_ohnolog_fc["logFC"], liver_other_fc["logFC"])[1]*2
print len(set(fc_tissues["Liver"]["OHNO"])), len(set(fc_tissues["Liver"]["NO-OHNO"]))









    



Gonad
-0.308491592183 -0.632631281576
-0.359633272283 -0.708270459907
(4.8922084633809444, 9.9710749944626391e-07)
4.89
9.98362077247e-07
201 403
Spleen
-0.266165090405 -0.47009898039
-0.307510805893 -0.473413334629
(5.9529444178903672, 2.6336075963346287e-09)
5.95
2.63846415746e-09
177 352
Heart
-0.286722794218 -0.422457071305
-0.3425500658 -0.519701061507
(4.5671108073486959, 4.9449264296078773e-06)
4.57
4.95283184319e-06
165 328
Liver
-0.239076821377 -0.423827801882
-0.266854795553 -0.480449795971
(5.2168236602498457, 1.8201739056461707e-07)
5.22
1.82403357162e-07
146 299



In [356]:

    
def color_boxes(bp):
    red = False
    for box in bp['boxes']:
        # change outline color
        box.set( color='black', linewidth=1)
#         box.set( color='black', linewidth=1)

        if red == True:
            box.set( facecolor = 'gray' )
            red = False
        else:
            box.set( facecolor = '#379F7A' )
            red = True

    for median in bp['medians']:
        median.set(color='black', linewidth=1)

    for whisker in bp['whiskers']:
        whisker.set(color='black', linewidth=1, linestyle="-")

    ## change color and linewidth of the caps
    for cap in bp['caps']:
        cap.set(color='black', linewidth=1)

        
font = {'family' : 'Source Sans Pro',
'weight' : 'normal',
'size'   : 11}

matplotlib.rc('font', **font)
        
fig, axes = plt.subplots(ncols=4, sharey=True, figsize=(6, 4))
fig.subplots_adjust(wspace=0.1)

for ax, name in zip(axes, ['Spleen', 'Heart', 'Liver', "Gonad"]):
    bp = ax.boxplot([fc_tissues[name][item] for item in ['OHNO', 'NO-OHNO']], sym="", widths=0.7, patch_artist=True, notch=True)
    lengths = [len(fc_tissues[name][item]) for item in ['OHNO', 'NO-OHNO']]
    ax.set(xticklabels=['Ohno\n(%i)' %lengths[0], 'non-Ohno\n(%i)' %lengths[1]])
    color_boxes(bp)
    ax.grid(False)
    
    ax.patch.set_facecolor('None')
    # Remove the ticks on the right and top side
    ax.yaxis.set_ticks_position('left')
    ax.xaxis.set_ticks_position('bottom')

    
    
    ax.axhline(y=0, c="black",linewidth=1,zorder=0, linestyle="--")
    if name == "Gonad":
        ax.set_title("(d) Gonad")
        ax.text(1.30, 1.2, "***", fontsize=15)
        l = Line2D([1,2],[1.2,1.2],color="black")                                    
        ax.add_line(l)
        l = Line2D([1,1],[1.2,1.15],color="black")                                    
        ax.add_line(l)
        l = Line2D([2,2],[1.2,1.15],color="black")
        ax.add_line(l)
    elif name == "Spleen":
        ax.set_title("(a) Spleen")
        ax.set_ylabel(r'$log_2$'+"FC")
        ax.text(1.30, 0.7, "***", fontsize=15)
        l = Line2D([1,2],[0.7,0.7],color="black")                                    
        ax.add_line(l)
        l = Line2D([1,1],[0.7,0.65],color="black")                                    
        ax.add_line(l)
        l = Line2D([2,2],[0.7,0.65],color="black")
        ax.add_line(l)
    elif name == "Liver":
        ax.set_title("(c) Liver")
        l = Line2D([1,2],[0.7,0.7],color="black")                                    
        ax.add_line(l)
        l = Line2D([1,1],[0.7,0.65],color="black")                                    
        ax.add_line(l)
        l = Line2D([2,2],[0.7,0.65],color="black")                                    
        ax.add_line(l)
        ax.text(1.3, 0.7, "***", fontsize=15)
    elif name == "Heart":
        ax.set_title("(b) Heart")
        l = Line2D([1,2],[0.7,0.7],color="black")                                    
        ax.add_line(l)
        l = Line2D([1,1],[0.7,0.65],color="black")                                    
        ax.add_line(l)
        l = Line2D([2,2],[0.7,0.65],color="black")                                    
        ax.add_line(l)
        ax.text(1.3, 0.7, "***", fontsize=15)
    else:
        continue
fig.tight_layout()

plt.savefig("2015-10-27-ohnolog-foldchange.png", dpi=300)
plt.savefig("2015-10-27-ohnolog-foldchange.pdf", dpi=300)

Ohnolog overrepresentation in strata? Positional analysis.

We may expect a higher amount of DC ohnologs in older strata

Calculate the number of ohnologs per stratum (are there more ohnologs in one in comparison to others?)
Calculate the proportion of DC ohnologs in old strata (1 and 2) vs new strata (3+4). Test with chi square test, if proportion is higher than expected (roughly 50% DC ohnologs should be in old and 50% in new



In [464]:

    
strata = [{"stratum": 1, "start": 44762622, "stop":51097769, "ohnologs": []}, # Old
          {"stratum": 2, "start": 53653310, "stop":60835253, "ohnologs": []}, # Old
          {"stratum": 3, "start": 17217589, "stop":43106716, "ohnologs": []}, #Young
          {"stratum": 4, "start": 440703, "stop": 11217083, "ohnologs": []}] #Young

def ensembl_id_lookup(identifier):
    payload = {"content-type": "application/json"}
    r = requests.get('http://rest.ensembl.org/lookup/id/'+identifier, params=payload)
    return r.json()


mhm = []
no_mhm = []

old = []
young = []




for g in Z_ohnologs_curie:
    # Requests turns this into the correct integer
    # Added explicit type conversion to be save
    gstart = int(ensembl_id_lookup(g)["start"])
#     print g, type(gstart), gstart
    time.sleep(0.1)
    
    # Group ohnologs into strata
    for stratum in strata:
        if stratum["start"] <= gstart <= stratum["stop"]:
            stratum["ohnologs"].append(g)
            
    # Find those ohnologs located in the MHM region
    if 25000000 <= gstart <= 35000000:
        mhm.append(g)
    else:
        no_mhm.append(g)
        
    # Devide Ohnologs into young and old
    if gstart > 43106716: # End of last gene in stratum 3 
        old.append(g)
    else:
        young.append(g)

all_old = []
all_young = []
for g in Z_genes:
    gstart = int(ensembl_id_lookup(g)["start"])
    if gstart > 43106716: # End of last gene in stratum 3 
        all_old.append(g)
    else:
        all_young.append(g)



In [467]:

    
print len(set(Z_ohnologs_curie))
for s in strata:
    print "Stratum", s["stratum"], len(s["ohnologs"])

print "MHM", len(mhm)
print "OLD", len(old)
print "YOUNG", len(young)
print "ALL OLD", len(all_old)
print "ALL Young", len(all_young)

print "Chisquare testing distribution of OHNO genes in old and young parts of chromosome"
print chisquare(np.array([76, 147]), f_exp=np.array([111.5, 111.5]))

print "Fisher's exact test: ", fisher_exact([[len(old), len(young)],
                                            [len(all_old)-len(old), len(all_young)-len(young)]])
print tabulate([["", "OLD","YOUNG"],
                ["Ohno", len(old), len(young)],
                ["No-ohno",len(all_old)-len(old), len(all_young)-len(young)]])









    



223
Stratum 1 8
Stratum 2 15
Stratum 3 93
Stratum 4 33
MHM 31
OLD 76
YOUNG 147
ALL OLD 322
ALL Young 442
Chisquare testing distribution of OHNO genes in old and young parts of chromosome
(22.605381165919283, 1.9892665714285975e-06)
Fisher's exact test:  (0.61998783253138656, 0.0037588888700665771)
-------  ---  -----
         OLD  YOUNG
Ohno     76   147
No-ohno  246  295
-------  ---  -----

The chisquare test indicates that less ohnologs are located in the old part of the Z chromosome.



In [466]:

    
def comp_dc_old_young(young, old, biased, dc):
    young_dc_ohno = []
    young_no_dc_ohno = [] 
    old_dc_ohno = []
    old_no_dc_ohno = []

    for g in dc.index:
        if g in young:
            young_dc_ohno.append(g)
        elif g in old:
            old_dc_ohno.append(g)

    for g in biased.index:
        if g in young:
            young_no_dc_ohno.append(g)
        elif g in old:
            old_no_dc_ohno.append(g)

    print 
    print tabulate([["", "DC-OHNO","NODC-OHNO"],
                    ["YOUNG", len(young_dc_ohno), len(young_no_dc_ohno)],
                    ["OLD",len(old_dc_ohno), len(old_no_dc_ohno)]],
                    headers="firstrow")
    print "Sum: ", sum([len(young_dc_ohno), len(young_no_dc_ohno), len(old_dc_ohno), len(old_no_dc_ohno)])
    print "Fisher's exact test: ", fisher_exact([[len(young_dc_ohno), len(young_no_dc_ohno)],
                                                    [len(old_dc_ohno), len(old_no_dc_ohno)]])

comp_dc_old_young(young, old, Z_de_gonad_min2_biased, Z_de_gonad_min2_dc)
comp_dc_old_young(young, old, Z_de_spleen_min2_biased, Z_de_spleen_min2_dc)
comp_dc_old_young(young, old, Z_de_heart_min2_biased, Z_de_heart_min2_dc)
comp_dc_old_young(young, old, Z_de_liver_min2_biased, Z_de_liver_min2_dc)

print
print "Expected sums:"
for i in ['Spleen', 'Heart', 'Liver', "Gonad"]:
    print i, len(fc_tissues[i]["OHNO"])









    



         DC-OHNO    NODC-OHNO
-----  ---------  -----------
YOUNG         59           75
OLD           27           40
Sum:  201
Fisher's exact test:  (1.1654320987654321, 0.65214756582715583)

         DC-OHNO    NODC-OHNO
-----  ---------  -----------
YOUNG         87           32
OLD           39           19
Sum:  177
Fisher's exact test:  (1.3245192307692308, 0.48023931084928095)

         DC-OHNO    NODC-OHNO
-----  ---------  -----------
YOUNG         74           35
OLD           37           19
Sum:  165
Fisher's exact test:  (1.0857142857142856, 0.86166743332693108)

         DC-OHNO    NODC-OHNO
-----  ---------  -----------
YOUNG         72           26
OLD           33           15
Sum:  146
Fisher's exact test:  (1.2587412587412588, 0.56188169909249619)

Expected sums:
Spleen 177
Heart 165
Liver 146
Gonad 201

In no tissue ohnologs are more likely to be dosage compensated in the old parts or vice versa.

Are ohnologs less likely to be on the Z chromosome than on the autosomes?

One possible explanation for the incomplete dosage compensation mechanism in birds may be that the Z chromosome has a lower than expected number of dosage-sensitive ohnologs. Here I compare the number of Z-linked ohnologs to the number of ohnologs located on the autosomes.



In [485]:

    
len(ohnolog_curie_pairs_relaxed_ids)

Z_OHNO = []
AUTO_OHNO = []

# Source Ensembl v 75 chicken karyotypes
# total_protein_coding = 14496
# Protein coding genes across the entire genome
total_protein_coding = 15508
Z_protein_coding = 764
# also not significant using coding + non coding genes
# total_genes = 17108
# Z_genes = 859
auto_protein_coding = total_protein_coding - Z_protein_coding

for g in ohnolog_curie_pairs_relaxed_ids:
    if g in gid_to_chr:
        if gid_to_chr[g] == "Z":
            Z_OHNO.append(g)
        else:
            AUTO_OHNO.append(g)

print len(Z_OHNO)
print len(AUTO_OHNO)
print len(Z_OHNO) + len(AUTO_OHNO)
print tabulate([["", "OHNO","NO-OHNO"],
                ["Z", len(Z_OHNO), Z_protein_coding - len(Z_OHNO)],
                ["AUTO", len(AUTO_OHNO), auto_protein_coding - len(AUTO_OHNO)]],
                headers="firstrow")
print "Fisher exact test: \n", fisher_exact([[len(Z_OHNO),(Z_protein_coding-len(Z_OHNO))],
                                             [len(AUTO_OHNO),  (auto_protein_coding-len(AUTO_OHNO))]])









    



223
4648
4871
        OHNO    NO-OHNO
----  ------  ---------
Z        223        541
AUTO    4648      10096
Fisher exact test: 
(0.8953458407169741, 0.18711639651628434)

There is no difference in the distribution of ohnologs. There are almost as many as expected given the number of protein coding genes.

	logFC	logCPM	PValue	FDR
ENSGALG00000000386	8.108556	6.594766	0.000000e+00	0.000000e+00
ENSGALG00000001756	8.684406	7.036651	9.761462e-316	7.589536e-312
ENSGALG00000013312	9.834955	5.384945	9.475218e-253	4.911321e-249
ENSGALG00000022674	12.677968	5.112837	3.101470e-219	1.205696e-215
ENSGALG00000022679	12.374174	4.810069	2.111286e-215	6.566100e-212