An important question that I also wanted to address was the cell-wide effects of HIF-1. Although the hypoxia response by itself is informative as to what HIF-1 actually turns on and off in C. elegans, enrichment analyses are not the only way to get information out of transcriptomes.
Another way to get information about these effects is to change what biological units we are studying. In this paper, we have focused a lot on single genes. However, we could also ask what pathways, or what entities, are represented in our dataset.
The way I will look at pathways is by identifying the genes that are in a 'pathway' or biological process of interest. I will extract the genes within this process that are differentially expressed in each mutant. Then, I will look at how the pathway changes overall. If a pathway is being down-regulated in a given set of mutants, we would expect that all of the genes that are D.E. in this pathway would show up as down-regulated. However, we no longer require that ALL of the genes in this pathway be D.E. in our dataset.
When a pathway is mainly changing in one direction, with the exception of a single gene that is changing in the opposite direction, I only consider that gene to be informative if and only if it was represented in 2 samples or more. Why? Because false positives exist, but we also need to take into consideration that pathways are human constructs that are likely to be incomplete. Branching may be ocurring, and there could be specific reasons for why a single node changes in opposite direction to the rest of the pathway.
In [1]:
# important stuff:
import os
import pandas as pd
import numpy as np
# morgan
import morgan as morgan
import tissue_enrichment_analysis as tea
import matplotlib.pyplot as plt
import seaborn as sns
# Magic function to make matplotlib inline;
# other style specs must come AFTER
%matplotlib inline
# This enables SVG graphics inline.
# There is a bug, so uncomment if it works.
%config InlineBackend.figure_formats = {'png', 'retina'}
import genpy
import seqplotter
import gvars
import epistasis as epi
q = 0.1
genvar = gvars.genvars()
I will load the respiratory complexes and central dogma complexes, which I obtained from a manual curation of Wormbase using WormMine
In [2]:
respiratory_complexes = pd.read_excel('../input/respiratory_complexes.xlsx')
central_dogma = pd.read_excel('../input/central_dogma.xlsx')
In [3]:
tissue_df = tea.fetch_dictionary()
phenotype_df = pd.read_csv('../input/phenotype_ontology.csv')
go_df = pd.read_csv('../input/go_dictionary.csv')
In [4]:
melted_tissue = pd.melt(tissue_df, id_vars='wbid',
var_name='term', value_name='expressed')
melted_tissue = melted_tissue[melted_tissue.expressed == 1]
melted_phenotype = pd.melt(phenotype_df, id_vars='wbid',
var_name='term', value_name='expressed')
melted_phenotype = melted_phenotype[melted_phenotype.expressed == 1]
melted_go = pd.melt(go_df, id_vars='wbid',
var_name='term', value_name='expressed')
melted_go = melted_go[melted_go.expressed == 1]
In [5]:
tidy_data = pd.read_csv('../output/temp_files/DE_genes.csv')
tidy_data.sort_values('target_id', inplace=True)
tidy_data.dropna(subset=['ens_gene'], inplace=True)
# tidy_data.sort_values('sort_order', inplace=True)
# drop the fog-2 data:
tidy_data = tidy_data[tidy_data.genotype != 'fog-2']
# tidy_data = tidy_data[tidy_data.qval < q] # keep only sig data.
Before we start, I will define a function, called gene_compactifier
which will make visualization of gene representation much easier. How does it work?
Given a gene list, it:
So if a gene list contains unc-119, unc-15 and unc-1, the program will output:
Gene "Family", Number Found
unc, 3 ['1', '15', '119']
Moreover, if a gene list contains unc-119, unc-119 and unc-119 (the same gene repeated $n$ times), the program will output:
Gene "Family", Number Found
unc, 3 ['119', '119', '119']
This makes it quite easy to visualize what genes within a pathway are represented in all of the mutants (coverage), as well as how many times each gene is represented in the dataset (coverage).
In [6]:
def gene_compactifier(ext_gene):
"""Given a list of ext_gene names, compactify them and print"""
d = {}
ext_gene = sorted(ext_gene)
for gene in ext_gene:
ind = gene.find('-')
if ind > 1:
name = gene[:ind]
number = gene[ind+1:]
else:
name = gene
number = ''
if name in d.keys():
d[name] += [number]
else:
d[name] = [number]
print('Gene "Family", Number Found')
for name, numbers in d.items():
if len(numbers) > 1:
print(name + ', ', len(numbers), sorted(numbers))
else:
if len(numbers[0]) > 0:
print(name + '-' + numbers[0])
else:
print(name)
First, let me extract all the genes that are overrepresented in mitochondria. The way I do this is via a function call plot_by_term
which, given a string, a dataframe to search, and the kind of ontology that the string should be found in, plots for each genotype the perturbation values of the significantly altered genes and returns the axis that contains that plot, as well as the list of genes that are annotated with the desired string.
In [7]:
ax, mito = seqplotter.plot_by_term('mitochondrion', df=tidy_data,
kind='go', swarm=True)
Visually, it looks like maybe 1/3 of the mitochondrial genes go up, and the rest go down. What genes are most represented in this pathway? How many points consistently show up in mutants that have a constitutive HIF-1 response? Let's find out.
Next, I find out what genes that are annotated with the term 'mitochondria' go up across genotypes with a constitutive HIF-1 response:
In [8]:
common = epi.find_overlap(['e', 'b', 'd', 'a'], tidy_data)
trial = tidy_data[(tidy_data.ens_gene.isin(mito)) &
(tidy_data.target_id.isin(common)) &
(tidy_data.b > 0)].ext_gene
gene_compactifier(trial)
What about the genes that go DOWN in all genotypes with a constitutive HIF-1 response?
In [9]:
trial = tidy_data[(tidy_data.ens_gene.isin(mito)) &
(tidy_data.target_id.isin(common)) &
(tidy_data.b < 0)].ext_gene
gene_compactifier(trial)
In [10]:
term = 'structural constituent of ribosome GO:0003735'
ax, ribosome = seqplotter.plot_by_term(term, df=tidy_data, kind='go')
In [11]:
trial = tidy_data[(tidy_data.ens_gene.isin(ribosome)) & (tidy_data.qval < q)].ext_gene.unique()
gene_compactifier(trial)
What about the effects of HIF-1 on the Electron Transport Chain? Or the TCA cycle?
To explore this, I will make a new dataframe, that contains only the genes in the ETC. I will also annotate each gene with the complex it belongs to, and then I will add a column called sort_order so I can sort the dataframe at my pleasure.
In [12]:
resp = tidy_data[tidy_data.ens_gene.isin(respiratory_complexes.ens_gene)
& (~tidy_data.code.isin(['f', 'c']))].copy()
f = lambda x: respiratory_complexes[respiratory_complexes.ens_gene == x].complex.values[0]
resp['complex'] = resp.ens_gene.map(f)
f = lambda x: respiratory_complexes[respiratory_complexes.ens_gene == x].sort_order.values[0]
resp['sort_order'] = resp.ens_gene.map(f)
resp.sort_values('sort_order', inplace=True)
resp = resp[resp.complex != 'Ubiquinone Biosynthesis']
This is what the dataframe looks like:
In [13]:
resp[['ext_gene', 'genotype', 'complex', 'sort_order']].head()
Out[13]:
Let's plot the dataframe, see what comes out. We would expect all genes in the ETC and TCA to go down:
In [14]:
fig, ax = plt.subplots()
ax = sns.swarmplot(x='complex', y='b', hue='ens_gene', data=resp, size=7)
plt.xticks(rotation=45)
ax.legend_.remove()
plt.title('HIF-1 mediated bioenergetics changes')
plt.ylabel(r'\beta')
plt.xlabel('TCA, ETC or Energy Reserve')
ax.hlines(0, xmin=-2, xmax=10, lw=2, linestyle='--')
plt.ylim(-4, 1)
plt.savefig('../output/mito_function.pdf')
Well, we can definitely see that not all genes in the ETC and TCA go down. Let's figure out what genes go UP in each cycle/complex.
In [15]:
gene_compactifier(resp[(resp.complex == 'TCA') & (resp.b > 0)].ext_gene)
In [16]:
gene_compactifier(resp[(resp.complex == 'Complex I') & (resp.b > 0)].ext_gene)
In [17]:
gene_compactifier(resp[(resp.complex == 'Complex II') & (resp.b > 0)].ext_gene)
In [18]:
gene_compactifier(resp[(resp.complex == 'energy reserve') & (resp.b > 0)].ext_gene)
Notice that for complex I, the genes nuo-2 and nduo-4 are up-regulated. But those exact same genes are also down-regulated (see below). Therefore, there is insufficient information to conclude whether these genes are going up, or down as a result of HIF-1. However, for other genes, namely fum-1 and sdha-1 we can conclude that those are significantly and consistently up-regulated in mutants that have a constitutive HIF-1 mutant.
In [19]:
gene_compactifier(resp[(resp.complex == 'TCA') & (resp.b < 0)].ext_gene)
In [20]:
gene_compactifier(resp[(resp.complex == 'Complex I') & (resp.b < 0)].ext_gene)
In [21]:
gene_compactifier(resp[(resp.complex == 'energy reserve') & (resp.b < 0)].ext_gene)
In [22]:
prot = tidy_data[tidy_data.ens_gene.isin(central_dogma.ens_gene)].copy()
prot['complex'] = prot.ens_gene.map(lambda x: central_dogma[central_dogma.ens_gene == x].complex.values[0])
In [23]:
fig, ax = plt.subplots()
ax = sns.swarmplot(x='complex', y='b', hue='ens_gene', data=prot, size=7)
plt.xticks(rotation=45)
ax.legend_.remove()
# plt.title('HIF-1 mediated changes in ETC expression')
plt.ylabel(r'\beta')
# plt.xlabel('Electron Transport Chain Complexes')
Out[23]:
In [24]:
ax, negregproteolysis = seqplotter.plot_by_term('protein catabolic process GO:0030163',
df=tidy_data, kind='go')
In [25]:
temp = tidy_data[(tidy_data.ens_gene.isin(negregproteolysis)) &
(tidy_data.target_id.isin(common)) &
(tidy_data.b > 0)
].ext_gene.unique()
gene_compactifier(temp)
In [26]:
temp = tidy_data[(tidy_data.ens_gene.isin(negregproteolysis)) &
(tidy_data.target_id.isin(common)) &
(tidy_data.b < 0)
].ext_gene.unique()
gene_compactifier(temp)
In [27]:
ax, folding = seqplotter.plot_by_term('protein folding', df=tidy_data, kind='go')
In [28]:
temp = tidy_data[(tidy_data.ens_gene.isin(folding)) & (tidy_data.b > 0)].ext_gene.unique()
gene_compactifier(temp)
In [29]:
temp = tidy_data[(tidy_data.ens_gene.isin(folding)) & (tidy_data.b < 0)].ext_gene.unique()
gene_compactifier(temp)
In [30]:
ax, immune = seqplotter.plot_by_term('immune system process', df=tidy_data, kind='go')
In [31]:
temp = tidy_data[(tidy_data.ens_gene.isin(immune)) & (tidy_data.target_id.isin(common)) &
(tidy_data.b > 0)].ext_gene.unique()
gene_compactifier(temp)
In [32]:
temp = tidy_data[(tidy_data.ens_gene.isin(immune)) & (tidy_data.target_id.isin(common)) &
(tidy_data.b < 0)].ext_gene.unique()
gene_compactifier(temp)