Visualize and analyze differential expression data in a network

In analysis of differential expression data, it is often useful to analyze properties of the local neighborhood of specific genes. I developed a simple interactive tool for this purpose, which takes as input diferential expression data, and gene interaction data (from http://www.genemania.org/). The network is then plotted in an interactive widget, where the node properties, edge properties, and layout can be mapped to different network properties. The interaction type (of the 6 options from genemania) can also be selected.

This tool will also serve as an example for how to create, modify, visualize and analyze weighted and unweighted gene interaction networks using the highly useful and flexible python package NetworkX (https://networkx.github.io/)

This tool is most useful if you have a reasonably small list of genes (~100) with differential expression data, and want to explore properties of their interconnections and their local neighborhoods.


In [154]:
# import some useful packages

import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

% matplotlib inline

Import a real network (from this experiment http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS4419)

This experiment contains fold change information for genes in an experiment studying 'alveolar macrophage response to bacterial endotoxin lipopolysaccharide exposure in vivo'. We selected a list of genes from the experiment which had high differential expression, and were enriched for 'immune response' and 'response to external biotic stimulus' in the gene ontology. This experiment and gene list were selected purely as examples for how to use this tool for an initial exploration of differential expression data.

NOTE: change paths/filenames in this cell to apply network visualizer to other datasets. Network format comes from genemania (e.g. columns are 'Entity 1', 'Entity 2', 'Weight', 'Network_group', 'Networks')

NOTE: File format is tsv, and needs to contain columns for 'IDENTIFIER', 'DiffExp', and 'absDiffExp'. Other columns are optional


In [155]:
dataDE = pd.read_csv('DE_data/DE_ayyagari_data_genename_foldchange.csv',sep='\t')
print(dataDE.head())

# genes in dataDE
gene_list = list(dataDE['IDENTIFIER'])

# only use the average fold-change (because there are multiple entries for some genes
#dataDE_mean = dataDE.DiffExp.groupby(dataDE['IDENTIFIER']).mean()
dataDE_mean = dataDE['fold_change']
dataDE_mean.index=gene_list
print(dataDE_mean)

# load the gene-gene interactions (from genemania)
#filename = 'DE_data/DE_experiment_interactions.txt'
filename = 'DE_data/DE_ayyagari_interactions.txt'

DE_network = pd.read_csv(filename, sep='\t', header=6)
DE_network.columns = ['Entity 1','Entity 2', 'Weight','Network_group','Networks']


   Unnamed: 0 IDENTIFIER  fold_change
0           0     PKD1P5       5.5476
1           1     S100A9       7.2441
2           2       FPR1       6.0959
3           3       ALX1      -3.0315
4           4     SIRPB1       5.0527
PKD1P5            5.5476
S100A9            7.2441
FPR1              6.0959
ALX1             -3.0315
SIRPB1            5.0527
SERPINA1          4.8047
MASP1            -3.9672
LINC01139        -4.2947
HCK               3.0721
VCAN              3.1264
LCP1              3.8147
CTSS              2.7935
ALOX5             3.3878
LGR5             -3.8928
KIAA0226L         2.9026
PTPRC             3.5355
ITGAX             3.9179
GRM7              3.5688
LAT2              3.0188
AQP9              3.8259
LAPTM5            3.2281
MNDA              3.7688
IKZF1             2.9142
CD53              3.1240
CCL2              3.5117
ACTA2-AS1         2.6210
RP11-1415C14.4   -2.8011
ESPNL            -3.2067
GFRA1            -2.5162
CORO1A            2.8800
TMEM154           3.3761
CTGF              3.1240
FCGR2A            3.1842
MYO1G             2.7531
HSPA6             2.8088
SELL              3.5322
CSF3R             3.5002
RADIL            -2.0305
EMR2              3.0068
WFDC1             3.2072
TLR2              3.1900
MMP25             3.3605
SAMSN1            3.2828
IGLC3             3.2324
PTPRJ             2.4107
TYROBP            2.4113
RHBDF2            1.9232
DDIT4             1.8672
LCP2              2.4603
RPS6KA1           2.5812
IL1B              2.5280
LILRA5            2.5347
Name: fold_change, Length: 52, dtype: float64

In [156]:
# create the graph, and add some edges (and nodes)
G_DE = nx.Graph()
idxCE = DE_network['Network_group']=='Co-expression'
edge_list = zip(list(DE_network['Entity 1'][idxCE]),list(DE_network['Entity 2'][idxCE]))
G_DE.add_edges_from(edge_list)

print('number of edges = ' + str(len(G_DE.edges())))
print('number of nodes = '+ str(len(G_DE.nodes())))


number of edges = 392
number of nodes = 44

In [157]:
# create version with weighted edges
G_DE_w = nx.Graph()
edge_list_w = zip(list(DE_network['Entity 1']),list(DE_network['Entity 2']),list(DE_network['Weight']))
    
G_DE_w.add_weighted_edges_from(edge_list_w)

Import plotting tool

(and reload it if changes have been made)


In [158]:
import imp
import plot_network
imp.reload(plot_network)


Out[158]:
<module 'plot_network' from 'plot_network.pyc'>

Run the plotting tool on prepared data

Description of options:

  • focal_node_name: Select gene to focus on (a star will be drawn on this node)
  • edge_threshold: Change the number of edges included in the network by moving the edge_threshold slider. Higher values of edge_threshold means fewer edges will be included in the graph (and may improve interpretability). The threshold is applied to the 'Weight' column of DE_network, so the less strongly weighted edges are not included as the threshold increases
  • network_algo: Select the network algorithm to apply to the graph. Choices are:
    • 'spl' (shortest path length): Plot the network in a circular tree layout, with the focal gene at the center, with nodes color-coded by log fold-change.
    • 'clustering coefficient': Plot the network in a circular tree layout, with nodes color-coded by the local clustering coefficient (see https://en.wikipedia.org/wiki/Clustering_coefficient).
    • 'pagerank': Plot the network in a spring layout, with nodes color-coded by page rank score (see https://en.wikipedia.org/wiki/PageRank for algorithm description)
    • 'community': Group the nodes in the network into communities, using the Louvain modularity maximization algorithm, which finds groups of nodes optimizing for modularity (a metric which measures the number of edges within communities compared to number of edges between communities, see https://en.wikipedia.org/wiki/Modularity_(networks) for more information). The nodes are then color-coded by these communities, and the total modularity of the partition is printed above the graph (where the maximal value for modularity is 1 which indicates a perfectly modular network so that there are no edges connecting communities). Below the network the average fold-change in each community is shown with box-plots, where the focal node's community is indicated by a white star, and the colors of the boxes correspond to the colors of the communities above.
  • map_degree: Choose whether to map the node degree to node size
  • plot_border_col: Choose whether to plot the log fold-change as the node border color
  • draw_shortest_paths: If checked, draw the shortest paths between the focal node and all other nodes in blue transparent line. More opaque lines indicate that section of path was traveled more often.
  • coexpression, colocalization, other, physical_interactions, predicted_interactions, shared_protein_domain: Select whether to include interactions of these types (types come from GeneMania- http://pages.genemania.org/data/)

In [159]:
from IPython.html.widgets import interact
from IPython.html import widgets
import matplotlib.colorbar as cb
import seaborn as sns
import community

# import network plotting module
from plot_network import *

# temporary graph variable
Gtest = nx.Graph()

# check whether you have differential expression data
diff_exp_analysis=True

# replace G_DE_w with G_DE in these two lines if unweighted version is desired
Gtest.add_nodes_from(G_DE_w.nodes())  
Gtest.add_edges_from(G_DE_w.edges(data=True))

# prep border colors
nodes = Gtest.nodes()
#gene_list = gene_list

if diff_exp_analysis:
    diff_exp = dataDE_mean
    genes_intersect = np.intersect1d(gene_list,nodes)
    border_cols = Series(index=nodes)

    for i in genes_intersect:
        if diff_exp[i]=='Unmeasured':
            border_cols[i] = np.nan
        else:
            border_cols[i] = diff_exp[i] 
else:  # if no differential expression data
    border_cols = [None]


numnodes = len(Gtest)

# make these three global to feed into widget
global Gtest

global boder_cols  

global DE_network

def plot_network_shell(focal_node_name,edge_thresh=.5,network_algo='spl', map_degree=True,
                       plot_border_col=False, draw_shortest_paths=True,
                       coexpression=True, colocalization=True, other=False,physical_interactions=False,
                       predicted_interactions=False,shared_protein_domain=False):
    
    # this is the main plotting function, called from plot_network module
    fig = plot_network(Gtest, border_cols, DE_network,
                 focal_node_name, edge_thresh, network_algo, map_degree, plot_border_col, draw_shortest_paths,
                 coexpression, colocalization, other, physical_interactions, predicted_interactions, shared_protein_domain)


    return fig

# threshold slider parameters
min_thresh = np.min(DE_network['Weight'])
max_thresh = np.max(DE_network['Weight']/10)
thresh_step = (max_thresh-min_thresh)/1000.0

interact(plot_network_shell, focal_node_name=list(np.sort(nodes)),
         edge_thresh=widgets.FloatSliderWidget(min=min_thresh,max=max_thresh,step=thresh_step,value=min_thresh,description='edge threshold'),
         network_algo = ['community','clustering_coefficient','pagerank','spl']);


Some examples

First let's look at the graph when 'spl' (shortest path length) is selected as the network algo. ADA is the focal node in this case, and it has 4 nearest neighbors (MX1, CD44, FITM1, and CD80). Note that CD44 connects the focal node ADA to many other nodes in the network, as it is an opaque blue line. Also note that there is only one gene with a negative fold change in this gene set (CCL13). The white nodes are genes included by genemania- they are the 20 genes nearest to the input genelist.

Community detection

When the network_algo button is switched to 'community', the louvain modularity maximization algorithm runs on the network, and partitions the nodes into communities which maximize the modularity. In this case (with CXCL10 as the focal node), the nodes are partitioned into 5 groups, with the three largest groups indicated by red, green, and teal circles. While you can see some support for this grouping by eye, the overall graph modularity is 0.33, which is a relatively low value. This means that although groups were found in the graph, the graph itself is not very modular. As a rule of thumb, very modular graphs have modularities of about 0.5 or 0.6.

Below the graph, there is a panel showing the average fold change for the nodes in this community. Since most of the nodes in the input gene list have positive fold changes here, all communities also have positive average fold changes. Were the input gene list to have fewer large fold changes, this would enable you to see if a particular grouping of nodes had significantly higher (or lower) levels of differential expression than alternative groupings.


In [ ]:


In [ ]:


In [ ]:


In [ ]: