In this notebook I will test if ohnologs located on the Z chromosome are more likely to be dosage compensated.
I use a list of Z ohnologs from http://ohnologs.curie.fr/ as an input set and validate this with Ensembl paralog data.
In [286]:
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact, mannwhitneyu, wilcoxon, ranksums, chi2_contingency, chisquare
from tabulate import tabulate
import requests
import time
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import matplotlib
%matplotlib inline
from matplotlib_venn import venn3, venn3_circles, venn3_unweighted, venn2_unweighted
In [487]:
def read_detags(infile):
''' Wrap the Pandas csv function'''
detags = pd.read_csv(infile)
return detags
de_gonad = read_detags("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-GONAD-detags-edgeR.csv")
de_spleen = read_detags("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-SPLEEN-detags-edgeR.csv")
de_heart = read_detags("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-HEART-detags-edgeR.csv")
de_liver = read_detags("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-LIVER-detags-edgeR.csv")
de_gonad.head()
Out[487]:
In [288]:
def anti_log2(i):
return np.power(2, i)
def log2(i):
return np.log2(i)
The edegR logFC value is calculated internally using some additional processing.(see below)
Also "note that the first group listed in the pair is the baseline for the comparison—so if exactTest the pair is c("A","B") then the comparison is B - A, so genes with positive log-fold change are up-regulated in group B compared with group A (and vice versa for genes with negative log-fold change)." Source EdgeR: Manual
In our case this means that the Male expression is substracted from the Female expression (F-M): Positive logFC is upregulation in females, negative logFC is upregulation in Males.
On the Z that means the closer we get to 0.0 logFC the more DC occurs.
In [289]:
rpkm_gonad = pd.read_csv("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-GONAD-RPKM-NORM-log2-edgeR.csv")
rpkm_spleen = pd.read_csv("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-SPLEEN-RPKM-NORM-log2-edgeR.csv")
rpkm_heart = pd.read_csv("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-HEART-RPKM-NORM-log2-edgeR.csv")
rpkm_liver = pd.read_csv("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-LIVER-RPKM-NORM-log2-edgeR.csv")
Here I did a quick test to see if the logCPM from edgeR and manual calculations are the same. It turns out that because of internal EdgeR processing the log values are very similar, but not the same.
https://stat.ethz.ch/pipermail/bioconductor/2014-February/057453.html According to this the logCPM is across all libraries. "> Note that the logCPM column of the topTags output represents average logCPM over all the libraries." Same here: http://grokbase.com/t/r/bioconductor/1449tbyet6/bioc-question-about-output-of-edger
Also "Given an RNA-seq dataset, the overall expression level of each gene is 8 calculated as an average across all samples and expressed as an average log countper-million (logCPM) using the aveLogCPM function." http://www.statsci.org/smyth/pubs/edgeRChapterPreprint.pdf
TopTags uses input from an exactTest object, wich contains an avelogCPM column. So the TopTags logCPM refers to that.
de.out <- data.frame(logFC = logFC, logCPM = AveLogCPM, PValue = exact.pvals)
In [420]:
cpm_gonad = pd.read_csv("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-GONAD-CPM-NORM-log2-edgeR.csv")
cpm_spleen = pd.read_csv("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-SPLEEN-CPM-NORM-log2-edgeR.csv")
cpm_heart = pd.read_csv("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-HEART-CPM-NORM-log2-edgeR.csv")
cpm_liver = pd.read_csv("/data/ucbtzim/ohnolog-dc/differential-expression/2015-10-27-MF-LIVER-CPM-NORM-log2-edgeR.csv")
# Check logFC calculation
avgCPM = cpm_gonad[:5].applymap(anti_log2).mean(axis=1).map(log2)
# print cpm_heart[:5]
# avgCPM = cpm_spleen[:5].mean(axis=1)
m_mean = cpm_gonad.ix[:5,0:4].applymap(anti_log2).mean(axis=1).map(log2)
f_mean = cpm_gonad.ix[:5,4:8].applymap(anti_log2).mean(axis=1).map(log2)
# print f_mean - m_mean
print avgCPM
# print m_mean
# print f_mean
print de_gonad.loc["ENSGALG00000000003"]
print de_gonad.loc["ENSGALG00000000004"]
print de_gonad.loc["ENSGALG00000000011"]
print de_gonad.loc["ENSGALG00000000013"]
print de_gonad.loc["ENSGALG00000000014"]
Here we add some positional information (where are genes located):
In [483]:
def read_gid_to_chr(infile):
'''Reads in a file obtained from Ensembl mapping each gene to its chromosomal location'''
gid_to_chr = {}
with open(infile, "rb") as handle:
for line in handle:
if line.startswith("Ensembl"):
continue
else:
line = line.rstrip().split()
gid_to_chr[line[0]] = line[1]
return gid_to_chr
gid_to_chr = read_gid_to_chr("/data/ucbtzim/ohnolog-dc/dc-comparison/Gallus_gallus.75.gid.chr.proteincoding.tsv")
# Get a list of Z linked gene IDs
Z_genes = [g for g, pos in gid_to_chr.items() if pos == "Z"]
print "Number of genes (protein coding) on the Z chromosome:", len(Z_genes)
print "Number of all genes (protein coding + pseudogenes): ", len(gid_to_chr)
chicken_chromosomes = [str(i) for i in range(29)] + ["32", "Z"]
autosomal_genes = [g for g, pos in gid_to_chr.items() if pos in chicken_chromosomes]
rest = [g for g, pos in gid_to_chr.items() if pos not in chicken_chromosomes]
print "Number of genes only on the autosomes + Z: ", len(autosomal_genes)
# print rest
In [292]:
# Z genes that have a W ortholog
Z_W_orthologs = ["ENSGALG00000013809",
"ENSGALG00000001668",
"ENSGALG00000018639",
"ENSGALG00000014697",
"ENSGALG00000003049",
"ENSGALG00000026192",
"ENSGALG00000001808",
"ENSGALG00000015391",
"ENSGALG00000014734",
"ENSGALG00000025641", #sno RNA
"ENSGALG00000025660"] #sno RNA
Z_genes_no_W = [g for g in Z_genes if g not in Z_W_orthologs]
print "Number of genes on the Z chromosome without W orthologs:", len(Z_genes_no_W)
11 Z genes have a paralog on the W, which may be excluded, because they have no selection for dosage compensation. If excluded, recovered results are similar. Many of these genes are Ohnologs!
Now we filter out all scaffolds and the W chromosome. Only keep autosomes 1-28 and the Z chromosome
In [293]:
# Create a list of chicken chromosomes
# Converts the integers into strings, as they are names and to match the read_gid_to_chr()
chicken_chromosomes = [str(i) for i in range(29)] + ["32", "Z"]
# Similar to http://pandas.pydata.org/pandas-docs/stable/indexing.html
# search for lambda
de_gonad = de_gonad[de_gonad.index.map(lambda x: gid_to_chr[x] in chicken_chromosomes)]
de_spleen = de_spleen[de_spleen.index.map(lambda x: gid_to_chr[x] in chicken_chromosomes)]
de_heart = de_heart[de_heart.index.map(lambda x: gid_to_chr[x] in chicken_chromosomes)]
de_liver = de_liver[de_liver.index.map(lambda x: gid_to_chr[x] in chicken_chromosomes)]
print de_gonad.shape
print de_spleen.shape
print de_heart.shape
print de_liver.shape
In [431]:
def filter_detags(detags, mincpm):
'''Filter out lowly expressed genes based on a fixed minimum CPM threshold'''
filtered = detags[detags["logCPM"] > mincpm]
return filtered
# Beware average CPM is log2 transformed.
# For avgCPM of 2 use 1 | log2(2) = 1
de_gonad_min2 = filter_detags(de_gonad, 1)
de_spleen_min2 = filter_detags(de_spleen, 1)
de_heart_min2 = filter_detags(de_heart, 1)
de_liver_min2 = filter_detags(de_liver, 1)
print de_gonad_min2.shape
print de_heart_min2.shape
print de_liver_min2.shape
print de_spleen_min2.shape
print de_gonad_min2.dtypes
print de_heart_min2.dtypes
print de_liver_min2.dtypes
print de_spleen_min2.dtypes
In [432]:
def get_chromosomal_expr(detags, genes):
'''Returns a dataframe that contains only genes given in the gene list.
All genes are expected to be from the same chromosome.
'''
return detags[detags.index.isin(genes)]
def filter_foldchange(detags):
'''Filter by foldchange, here hardcoded for our dosage compensation criterion'''
DC = detags[(detags["logFC"] >= -0.5) & (detags["logFC"] <= 0.5)]
biased = detags[(detags["logFC"] < -0.5) | (detags["logFC"] > 0.5)]
return biased, DC
# Filter the expression so it only contains Z expression
Z_de_gonad_min2 = get_chromosomal_expr(de_gonad_min2, Z_genes)
Z_de_spleen_min2 = get_chromosomal_expr(de_spleen_min2, Z_genes)
Z_de_heart_min2 = get_chromosomal_expr(de_heart_min2, Z_genes)
Z_de_liver_min2 = get_chromosomal_expr(de_liver_min2, Z_genes)
print "GONAD", Z_de_gonad_min2.shape
print "HEART", Z_de_heart_min2.shape
print "LIVER", Z_de_liver_min2.shape
print "SPLEEN",Z_de_spleen_min2.shape
# Filter expression based on log fold change. if within -0.5 to 0.5 call DC
# Else is biased in some form.
Z_de_gonad_min2_biased, Z_de_gonad_min2_dc = filter_foldchange(Z_de_gonad_min2)
Z_de_spleen_min2_biased, Z_de_spleen_min2_dc = filter_foldchange(Z_de_spleen_min2)
Z_de_heart_min2_biased, Z_de_heart_min2_dc = filter_foldchange(Z_de_heart_min2)
Z_de_liver_min2_biased, Z_de_liver_min2_dc = filter_foldchange(Z_de_liver_min2)
print tabulate([["Tissue", "[Sex]-biased", "Dosage compensated"],
["Gonad", Z_de_gonad_min2_biased.shape[0], Z_de_gonad_min2_dc.shape[0]],
["Spleen", Z_de_spleen_min2_biased.shape[0], Z_de_spleen_min2_dc.shape[0]],
["Heart", Z_de_heart_min2_biased.shape[0], Z_de_heart_min2_dc.shape[0]],
["Liver", Z_de_liver_min2_biased.shape[0], Z_de_liver_min2_dc.shape[0]]],
headers="firstrow")
print
print "Degree of dosage compensation vs rest in all tissues"
print "Fisher's exact test (spleen vs heart): ", fisher_exact([[Z_de_spleen_min2_biased.shape[0],Z_de_spleen_min2_dc.shape[0]],
[Z_de_heart_min2_biased.shape[0], Z_de_heart_min2_dc.shape[0]]])
print "Fisher's exact test (spleen vs liver): ", fisher_exact([[Z_de_spleen_min2_biased.shape[0],Z_de_spleen_min2_dc.shape[0]],
[Z_de_liver_min2_biased.shape[0], Z_de_liver_min2_dc.shape[0]]])
print "Fisher's exact test (spleen vs gonad): ", fisher_exact([[Z_de_spleen_min2_biased.shape[0],Z_de_spleen_min2_dc.shape[0]],
[Z_de_gonad_min2_biased.shape[0], Z_de_gonad_min2_dc.shape[0]]])
print "Fisher's exact test (heart vs liver): ", fisher_exact([[Z_de_heart_min2_biased.shape[0],Z_de_heart_min2_dc.shape[0]],
[Z_de_liver_min2_biased.shape[0], Z_de_liver_min2_dc.shape[0]]])
print "Fisher's exact test (heart vs gonad): ", fisher_exact([[Z_de_heart_min2_biased.shape[0],Z_de_heart_min2_dc.shape[0]],
[Z_de_gonad_min2_biased.shape[0], Z_de_gonad_min2_dc.shape[0]]])
print "Fisher's exact test (liver vs gonad): ", fisher_exact([[Z_de_liver_min2_biased.shape[0],Z_de_liver_min2_dc.shape[0]],
[Z_de_gonad_min2_biased.shape[0], Z_de_gonad_min2_dc.shape[0]]])
print "The two 189 numbers in the table are not the same! (paranoia check)"
len(set(Z_de_liver_min2_biased.index).intersection(set(Z_de_gonad_min2_dc.index)))
Out[432]:
Using candidates from http://ohnologs.curie.fr/
We use the relaxed set for the main analysis, because we focus on Z ohnologs only and we would loose power.
The analysis with the strict set returns similar results (other notebook)
In [343]:
def read_ohnolog_pairs(infile):
ohnolog_pairs = pd.read_csv(infile, sep='\t')
return ohnolog_pairs
ohnolog_curie_pairs = read_ohnolog_pairs("/data/ucbtzim/ohnolog-dc/ohnologs-curie/CHICKEN.Pairs.Relaxed.2R.txt")
# ohnolog_curie_pairs = read_ohnolog_pairs("/data/ucbtzim/ohnolog-dc/ohnologs-curie/CHICKEN.Pairs.Strict.2R.txt")
ohno1 = list(ohnolog_curie_pairs["Ohnolog-1 Id"])
print len(ohno1)
ohno2 = list(ohnolog_curie_pairs["Ohnolog-2 Id"])
# We are interested in unique ohnologs, not pairs
combined_ohno = ohno1+ohno2
ohnolog_curie_pairs_relaxed_ids = list(set(combined_ohno))
print "Number of ohnologs detected in the chicken genome: ", len(ohnolog_curie_pairs_relaxed_ids)
In [468]:
Z_ohnologs_curie = []
for gid in ohnolog_curie_pairs_relaxed_ids:
# Some genes in the DB are not in this genome release Ensembl 75
if gid in gid_to_chr:
if gid_to_chr[gid] == "Z":
Z_ohnologs_curie.append(gid)
mapped_ohnologs_curie = []
for gid in ohnolog_curie_pairs_relaxed_ids:
# Some genes in the DB are not in this genome release Ensembl 75
if gid in gid_to_chr:
mapped_ohnologs_curie.append(gid)
print "Mapped ohnologs: ", len(set(mapped_ohnologs_curie))
print "Number of ohnologs on Z according to DB Curie:", len(set(Z_ohnologs_curie))
# print Z_ohnologs_curie
In [345]:
# Difference between human and chicken genome
human_ohnolog_curie_pairs = read_ohnolog_pairs("/data/ucbtzim/ohnolog-dc/ohnologs-curie/HUMAN.Pairs.Relaxed.2R.txt")
human_ohno1 = list(human_ohnolog_curie_pairs["Ohnolog-1 Id"])
print len(human_ohno1)
human_ohno2 = list(human_ohnolog_curie_pairs["Ohnolog-2 Id"])
human_ohnolog_curie_pairs_relaxed_ids = list(set(human_ohno1+human_ohno2))
print "Number of ohnologs detected in the human genome: ", len(human_ohnolog_curie_pairs_relaxed_ids)
# Protein coding numbers from Ensembl 82
print chi2_contingency(np.array([[5228, 7831],[(15508-5228),(20296-7831)]]),correction=True)
# Percentages
print "Chicken: ", 5228/15508.
print "Human: ", 7831./20296
print human_ohnolog_curie_pairs.shape
In [358]:
def get_categories(ohnologs, biased, dc):
results={"DC_OHNO": [],
"NO_DC_OHNO": [],
"DC_NO_OHNO": [],
"NO_DC_NO_OHNO": []
}
for g in dc.index:
if g in ohnologs:
results["DC_OHNO"].append(g)
else:
results["DC_NO_OHNO"].append(g)
for g in biased.index:
if g in ohnologs:
results["NO_DC_OHNO"].append(g)
else:
results["NO_DC_NO_OHNO"].append(g)
print tabulate([["", "OHNO","NO-OHNO"],
["DC", len(results["DC_OHNO"]), len(results["DC_NO_OHNO"])],
["NODC",len(results["NO_DC_OHNO"]), len(results["NO_DC_NO_OHNO"])]],
headers="firstrow")
print "Fisher's exact test: ", fisher_exact([[len(results["DC_OHNO"]),len(results["DC_NO_OHNO"])],
[len(results["NO_DC_OHNO"]), len(results["NO_DC_NO_OHNO"])]])
#Rotate the table, no difference
# print "Fisher's exact test: ", fisher_exact([[len(results["DC_OHNO"]),len(results["NO_DC_OHNO"])],
# [len(results["DC_NO_OHNO"]), len(results["NO_DC_NO_OHNO"])]])
print "Sum: ", sum([len(results["DC_OHNO"]),len(results["DC_NO_OHNO"]),
len(results["NO_DC_OHNO"]), len(results["NO_DC_NO_OHNO"])])
print ""
return results
print "Gonad"
gonad_result = get_categories(Z_ohnologs_curie, Z_de_gonad_min2_biased, Z_de_gonad_min2_dc)
print "Spleen"
spleen_result = get_categories(Z_ohnologs_curie, Z_de_spleen_min2_biased, Z_de_spleen_min2_dc)
print "Heart"
heart_result = get_categories(Z_ohnologs_curie, Z_de_heart_min2_biased, Z_de_heart_min2_dc)
print "Liver"
liver_result = get_categories(Z_ohnologs_curie, Z_de_liver_min2_biased, Z_de_liver_min2_dc)
In [370]:
spleen_dc_ohno = set(spleen_result["DC_OHNO"])
heart_dc_ohno = set(heart_result["DC_OHNO"])
liver_dc_ohno = set(liver_result["DC_OHNO"])
gonad_dc_ohno = set(gonad_result["DC_OHNO"])
plt.figure(figsize=(3,3))
venn = venn3_unweighted([spleen_dc_ohno, heart_dc_ohno, liver_dc_ohno], ('Spleen', 'Heart', 'Liver'))
plt.savefig("2015-10-27-ohnolog-dc-soma-venn-overlap.pdf", dpi=300)
In [348]:
soma_dc_ohno = spleen_dc_ohno.intersection(heart_dc_ohno,liver_dc_ohno)
print len(soma_dc_ohno)
plt.figure(figsize=(3,3))
venn = venn2_unweighted([soma_dc_ohno, gonad_dc_ohno], ('Soma', 'Gonad'))
plt.savefig("2015-10-27-ohnolog-dc-venn-soma-gonad-overlap.pdf", dpi=300)
In [349]:
spleen_dc_no_ohno = set(spleen_result["DC_NO_OHNO"])
heart_dc_no_ohno = set(heart_result["DC_NO_OHNO"])
liver_dc_no_ohno = set(liver_result["DC_NO_OHNO"])
gonad_dc_no_ohno = set(gonad_result["DC_NO_OHNO"])
plt.figure(figsize=(3,3))
venn = venn3_unweighted([spleen_dc_no_ohno, heart_dc_no_ohno, liver_dc_no_ohno], ('Spleen', 'Heart', 'Liver'))
# plt.savefig("2015-10-06-ohnolog-dc-noohno-soma-venn-overlap.pdf", dpi=300)
In [350]:
soma_dc_no_ohno = spleen_dc_no_ohno.intersection(heart_dc_no_ohno,liver_dc_no_ohno)
print len(soma_dc_ohno)
plt.figure(figsize=(3,3))
venn = venn2_unweighted([soma_dc_no_ohno, gonad_dc_no_ohno], ('Soma', 'Gonad'))
# plt.savefig("2015-10-06-ohnolog-dc-venn-soma-gonad-overlap.pdf", dpi=300)
In [351]:
Z_ohnolog_pairs = ohnolog_curie_pairs[(ohnolog_curie_pairs["Ohnolog-1 Id"].isin(Z_genes)) & (ohnolog_curie_pairs["Ohnolog-2 Id"].isin(Z_genes))]
print Z_ohnolog_pairs.shape[0], " pairs (8 genes) are ohnologs with partners on the Z chromosome"
Ohnologs are a subset of all paralogs on the Z chromosome. Ohnologs should be more likely to be dosage compensated, but regular paralogs should not be more likely to be dosage compensated. I therefore query the Ensembl REST API to get a dictionary with all genes on the Z chromosome and their status Paralog (duplicates on other chromosomes) ohnologs.
In [458]:
def ensembl_paralogs(identifier):
'''Access to the Ensembl REST API'''
payload = {"content-type": "application/json",
"type": "paralogues"}
r = requests.get('http://rest.ensembl.org/homology/id/'+identifier, params=payload)
return r.json()
Z_paralogs_no_ohno = []
paralog_types = []
for g in Z_genes:
# Be polite - don't hammer the API
time.sleep(0.05)
query_result = ensembl_paralogs(g)
if len(query_result["data"][0]["homologies"]) != 0:
for p in query_result["data"][0]["homologies"]:
paralog_types.append(p["type"])
if p["type"] == "within_species_paralog":
# There can be multiple paralogs
if g not in Z_ohnologs_curie and g not in Z_paralogs_no_ohno:
Z_paralogs_no_ohno.append(g)
else:
Z_solo += 1
print len(Z_paralogs_no_ohno)
See for an explanation: http://www.ensembl.org/info/genome/compara/homology_method.html Paralogues Any gene pairwise relation where the ancestor node is a duplication event. We predict several descriptions of paralogues:
same-species paralogies (within_species_paralog)
fragments of the same ‐predicted‐ gene (gene_split)
A paralog labelled as a gene_split is an artefactual type of paralogy. It is commonly related to fragmented genome assemblies or a gene prediction that is poor in supporting evidences (cDNA, ESTs, proteins, etc.).
In [459]:
print set(paralog_types)
# print Z_paralogs_no_ohno
In [460]:
print len(Z_paralogs_no_ohno)
def get_categories_para(paralogs, biased, dc):
results={"DC_PARA": [],
"NO_DC_PARA": [],
"DC_NO_PARA": [],
"NO_DC_NO_PARA": []
}
for g in dc.index:
if g in paralogs:
results["DC_PARA"].append(g)
else:
results["DC_NO_PARA"].append(g)
for g in biased.index:
if g in paralogs:
results["NO_DC_PARA"].append(g)
else:
results["NO_DC_NO_PARA"].append(g)
print tabulate([["", "PARA","NO-PARA"],
["DC", len(results["DC_PARA"]), len(results["DC_NO_PARA"])],
["NODC",len(results["NO_DC_PARA"]), len(results["NO_DC_NO_PARA"])]],
headers="firstrow")
print "Fisher's exact test: ", fisher_exact([[len(results["DC_PARA"]),len(results["DC_NO_PARA"])],
[len(results["NO_DC_PARA"]), len(results["NO_DC_NO_PARA"])]])
print ""
print sum([len(results["DC_PARA"]),len(results["DC_NO_PARA"]),len(results["NO_DC_PARA"]), len(results["NO_DC_NO_PARA"])])
return results
print "Gonad"
gonad_result = get_categories_para(Z_paralogs_no_ohno, Z_de_gonad_min2_biased, Z_de_gonad_min2_dc)
print "Spleen"
spleen_result = get_categories_para(Z_paralogs_no_ohno, Z_de_spleen_min2_biased, Z_de_spleen_min2_dc)
print "Heart"
heart_result = get_categories_para(Z_paralogs_no_ohno, Z_de_heart_min2_biased, Z_de_heart_min2_dc)
print "Liver"
liver_result = get_categories_para(Z_paralogs_no_ohno, Z_de_liver_min2_biased, Z_de_liver_min2_dc)
In [ ]:
In [488]:
def split_expr(ohnologs, expr):
'''Split the detags into those that are ohnologs and those that are not'''
ohno_index = expr.index.isin(ohnologs)
ohnolog_fc = expr[ohno_index]
other_fc = expr[~ohno_index]
return ohnolog_fc, other_fc
fc_tissues = {"Gonad":{},"Spleen":{},"Heart":{},"Liver":{}}
print "Gonad"
gonad_ohnolog_fc, gonad_other_fc = split_expr(Z_ohnologs_curie, Z_de_gonad_min2)
print np.mean(gonad_ohnolog_fc["logFC"]), np.mean(gonad_other_fc["logFC"])
print np.median(gonad_ohnolog_fc["logFC"]), np.median(gonad_other_fc["logFC"])
print ranksums(gonad_ohnolog_fc["logFC"], gonad_other_fc["logFC"])
print '%.2f' % round(ranksums(gonad_ohnolog_fc["logFC"], gonad_other_fc["logFC"])[0], 2)
fc_tissues["Gonad"]["OHNO"] = gonad_ohnolog_fc["logFC"]
fc_tissues["Gonad"]["NO-OHNO"] = gonad_other_fc["logFC"]
print mannwhitneyu(gonad_ohnolog_fc["logFC"], gonad_other_fc["logFC"])[1]*2
print len(set(fc_tissues["Gonad"]["OHNO"])), len(set(fc_tissues["Gonad"]["NO-OHNO"]))
print "Spleen"
spleen_ohnolog_fc, spleen_other_fc = split_expr(Z_ohnologs_curie, Z_de_spleen_min2)
print np.mean(spleen_ohnolog_fc["logFC"]), np.mean(spleen_other_fc["logFC"])
print np.median(spleen_ohnolog_fc["logFC"]), np.median(spleen_other_fc["logFC"])
print ranksums(spleen_ohnolog_fc["logFC"], spleen_other_fc["logFC"])
print '%.2f' % round(ranksums(spleen_ohnolog_fc["logFC"], spleen_other_fc["logFC"])[0], 2)
fc_tissues["Spleen"]["OHNO"] = spleen_ohnolog_fc["logFC"]
fc_tissues["Spleen"]["NO-OHNO"] = spleen_other_fc["logFC"]
print mannwhitneyu(spleen_ohnolog_fc["logFC"], spleen_other_fc["logFC"])[1]*2
print len(set(fc_tissues["Spleen"]["OHNO"])), len(set(fc_tissues["Spleen"]["NO-OHNO"]))
print "Heart"
heart_ohnolog_fc, heart_other_fc = split_expr(Z_ohnologs_curie, Z_de_heart_min2)
print np.mean(heart_ohnolog_fc["logFC"]), np.mean(heart_other_fc["logFC"])
print np.median(heart_ohnolog_fc["logFC"]), np.median(heart_other_fc["logFC"])
print ranksums(heart_ohnolog_fc["logFC"], heart_other_fc["logFC"])
print '%.2f' % round(ranksums(heart_ohnolog_fc["logFC"], heart_other_fc["logFC"])[0], 2)
fc_tissues["Heart"]["OHNO"] = heart_ohnolog_fc["logFC"]
fc_tissues["Heart"]["NO-OHNO"] = heart_other_fc["logFC"]
print mannwhitneyu(heart_ohnolog_fc["logFC"], heart_other_fc["logFC"])[1]*2
print len(set(fc_tissues["Heart"]["OHNO"])), len(set(fc_tissues["Heart"]["NO-OHNO"]))
print "Liver"
liver_ohnolog_fc, liver_other_fc = split_expr(Z_ohnologs_curie, Z_de_liver_min2)
print np.mean(liver_ohnolog_fc["logFC"]), np.mean(liver_other_fc["logFC"])
print np.median(liver_ohnolog_fc["logFC"]), np.median(liver_other_fc["logFC"])
print ranksums(liver_ohnolog_fc["logFC"], liver_other_fc["logFC"])
print '%.2f' % round(ranksums(liver_ohnolog_fc["logFC"], liver_other_fc["logFC"])[0], 2)
fc_tissues["Liver"]["OHNO"] = liver_ohnolog_fc["logFC"]
fc_tissues["Liver"]["NO-OHNO"] = liver_other_fc["logFC"]
print mannwhitneyu(liver_ohnolog_fc["logFC"], liver_other_fc["logFC"])[1]*2
print len(set(fc_tissues["Liver"]["OHNO"])), len(set(fc_tissues["Liver"]["NO-OHNO"]))
In [356]:
def color_boxes(bp):
red = False
for box in bp['boxes']:
# change outline color
box.set( color='black', linewidth=1)
# box.set( color='black', linewidth=1)
if red == True:
box.set( facecolor = 'gray' )
red = False
else:
box.set( facecolor = '#379F7A' )
red = True
for median in bp['medians']:
median.set(color='black', linewidth=1)
for whisker in bp['whiskers']:
whisker.set(color='black', linewidth=1, linestyle="-")
## change color and linewidth of the caps
for cap in bp['caps']:
cap.set(color='black', linewidth=1)
font = {'family' : 'Source Sans Pro',
'weight' : 'normal',
'size' : 11}
matplotlib.rc('font', **font)
fig, axes = plt.subplots(ncols=4, sharey=True, figsize=(6, 4))
fig.subplots_adjust(wspace=0.1)
for ax, name in zip(axes, ['Spleen', 'Heart', 'Liver', "Gonad"]):
bp = ax.boxplot([fc_tissues[name][item] for item in ['OHNO', 'NO-OHNO']], sym="", widths=0.7, patch_artist=True, notch=True)
lengths = [len(fc_tissues[name][item]) for item in ['OHNO', 'NO-OHNO']]
ax.set(xticklabels=['Ohno\n(%i)' %lengths[0], 'non-Ohno\n(%i)' %lengths[1]])
color_boxes(bp)
ax.grid(False)
ax.patch.set_facecolor('None')
# Remove the ticks on the right and top side
ax.yaxis.set_ticks_position('left')
ax.xaxis.set_ticks_position('bottom')
ax.axhline(y=0, c="black",linewidth=1,zorder=0, linestyle="--")
if name == "Gonad":
ax.set_title("(d) Gonad")
ax.text(1.30, 1.2, "***", fontsize=15)
l = Line2D([1,2],[1.2,1.2],color="black")
ax.add_line(l)
l = Line2D([1,1],[1.2,1.15],color="black")
ax.add_line(l)
l = Line2D([2,2],[1.2,1.15],color="black")
ax.add_line(l)
elif name == "Spleen":
ax.set_title("(a) Spleen")
ax.set_ylabel(r'$log_2$'+"FC")
ax.text(1.30, 0.7, "***", fontsize=15)
l = Line2D([1,2],[0.7,0.7],color="black")
ax.add_line(l)
l = Line2D([1,1],[0.7,0.65],color="black")
ax.add_line(l)
l = Line2D([2,2],[0.7,0.65],color="black")
ax.add_line(l)
elif name == "Liver":
ax.set_title("(c) Liver")
l = Line2D([1,2],[0.7,0.7],color="black")
ax.add_line(l)
l = Line2D([1,1],[0.7,0.65],color="black")
ax.add_line(l)
l = Line2D([2,2],[0.7,0.65],color="black")
ax.add_line(l)
ax.text(1.3, 0.7, "***", fontsize=15)
elif name == "Heart":
ax.set_title("(b) Heart")
l = Line2D([1,2],[0.7,0.7],color="black")
ax.add_line(l)
l = Line2D([1,1],[0.7,0.65],color="black")
ax.add_line(l)
l = Line2D([2,2],[0.7,0.65],color="black")
ax.add_line(l)
ax.text(1.3, 0.7, "***", fontsize=15)
else:
continue
fig.tight_layout()
plt.savefig("2015-10-27-ohnolog-foldchange.png", dpi=300)
plt.savefig("2015-10-27-ohnolog-foldchange.pdf", dpi=300)
We may expect a higher amount of DC ohnologs in older strata
In [464]:
strata = [{"stratum": 1, "start": 44762622, "stop":51097769, "ohnologs": []}, # Old
{"stratum": 2, "start": 53653310, "stop":60835253, "ohnologs": []}, # Old
{"stratum": 3, "start": 17217589, "stop":43106716, "ohnologs": []}, #Young
{"stratum": 4, "start": 440703, "stop": 11217083, "ohnologs": []}] #Young
def ensembl_id_lookup(identifier):
payload = {"content-type": "application/json"}
r = requests.get('http://rest.ensembl.org/lookup/id/'+identifier, params=payload)
return r.json()
mhm = []
no_mhm = []
old = []
young = []
for g in Z_ohnologs_curie:
# Requests turns this into the correct integer
# Added explicit type conversion to be save
gstart = int(ensembl_id_lookup(g)["start"])
# print g, type(gstart), gstart
time.sleep(0.1)
# Group ohnologs into strata
for stratum in strata:
if stratum["start"] <= gstart <= stratum["stop"]:
stratum["ohnologs"].append(g)
# Find those ohnologs located in the MHM region
if 25000000 <= gstart <= 35000000:
mhm.append(g)
else:
no_mhm.append(g)
# Devide Ohnologs into young and old
if gstart > 43106716: # End of last gene in stratum 3
old.append(g)
else:
young.append(g)
all_old = []
all_young = []
for g in Z_genes:
gstart = int(ensembl_id_lookup(g)["start"])
if gstart > 43106716: # End of last gene in stratum 3
all_old.append(g)
else:
all_young.append(g)
In [467]:
print len(set(Z_ohnologs_curie))
for s in strata:
print "Stratum", s["stratum"], len(s["ohnologs"])
print "MHM", len(mhm)
print "OLD", len(old)
print "YOUNG", len(young)
print "ALL OLD", len(all_old)
print "ALL Young", len(all_young)
print "Chisquare testing distribution of OHNO genes in old and young parts of chromosome"
print chisquare(np.array([76, 147]), f_exp=np.array([111.5, 111.5]))
print "Fisher's exact test: ", fisher_exact([[len(old), len(young)],
[len(all_old)-len(old), len(all_young)-len(young)]])
print tabulate([["", "OLD","YOUNG"],
["Ohno", len(old), len(young)],
["No-ohno",len(all_old)-len(old), len(all_young)-len(young)]])
The chisquare test indicates that less ohnologs are located in the old part of the Z chromosome.
In [466]:
def comp_dc_old_young(young, old, biased, dc):
young_dc_ohno = []
young_no_dc_ohno = []
old_dc_ohno = []
old_no_dc_ohno = []
for g in dc.index:
if g in young:
young_dc_ohno.append(g)
elif g in old:
old_dc_ohno.append(g)
for g in biased.index:
if g in young:
young_no_dc_ohno.append(g)
elif g in old:
old_no_dc_ohno.append(g)
print
print tabulate([["", "DC-OHNO","NODC-OHNO"],
["YOUNG", len(young_dc_ohno), len(young_no_dc_ohno)],
["OLD",len(old_dc_ohno), len(old_no_dc_ohno)]],
headers="firstrow")
print "Sum: ", sum([len(young_dc_ohno), len(young_no_dc_ohno), len(old_dc_ohno), len(old_no_dc_ohno)])
print "Fisher's exact test: ", fisher_exact([[len(young_dc_ohno), len(young_no_dc_ohno)],
[len(old_dc_ohno), len(old_no_dc_ohno)]])
comp_dc_old_young(young, old, Z_de_gonad_min2_biased, Z_de_gonad_min2_dc)
comp_dc_old_young(young, old, Z_de_spleen_min2_biased, Z_de_spleen_min2_dc)
comp_dc_old_young(young, old, Z_de_heart_min2_biased, Z_de_heart_min2_dc)
comp_dc_old_young(young, old, Z_de_liver_min2_biased, Z_de_liver_min2_dc)
print
print "Expected sums:"
for i in ['Spleen', 'Heart', 'Liver', "Gonad"]:
print i, len(fc_tissues[i]["OHNO"])
In no tissue ohnologs are more likely to be dosage compensated in the old parts or vice versa.
One possible explanation for the incomplete dosage compensation mechanism in birds may be that the Z chromosome has a lower than expected number of dosage-sensitive ohnologs. Here I compare the number of Z-linked ohnologs to the number of ohnologs located on the autosomes.
In [485]:
len(ohnolog_curie_pairs_relaxed_ids)
Z_OHNO = []
AUTO_OHNO = []
# Source Ensembl v 75 chicken karyotypes
# total_protein_coding = 14496
# Protein coding genes across the entire genome
total_protein_coding = 15508
Z_protein_coding = 764
# also not significant using coding + non coding genes
# total_genes = 17108
# Z_genes = 859
auto_protein_coding = total_protein_coding - Z_protein_coding
for g in ohnolog_curie_pairs_relaxed_ids:
if g in gid_to_chr:
if gid_to_chr[g] == "Z":
Z_OHNO.append(g)
else:
AUTO_OHNO.append(g)
print len(Z_OHNO)
print len(AUTO_OHNO)
print len(Z_OHNO) + len(AUTO_OHNO)
print tabulate([["", "OHNO","NO-OHNO"],
["Z", len(Z_OHNO), Z_protein_coding - len(Z_OHNO)],
["AUTO", len(AUTO_OHNO), auto_protein_coding - len(AUTO_OHNO)]],
headers="firstrow")
print "Fisher exact test: \n", fisher_exact([[len(Z_OHNO),(Z_protein_coding-len(Z_OHNO))],
[len(AUTO_OHNO), (auto_protein_coding-len(AUTO_OHNO))]])
There is no difference in the distribution of ohnologs. There are almost as many as expected given the number of protein coding genes.