ENES use case graph example

  • A graph is generated representing a server infrastructure providing different types of data services hosted by data servers
  • The data is organized in collections
  • Collections can be hosted by multiple servers (replication)
  • Servers provide different average bandwidth to different geographical regions

Setup Connection to neo4j instance


In [1]:
import ENESNeoTools 
from py2neo import Graph, Node, Relationship, authenticate
authenticate("localhost:7474", ENESNeoTools.user_name, ENESNeoTools.pass_word)

# connect to authenticated graph database
graph = Graph("http://localhost:7474/db/data/")

also rest client possible


In [8]:
from neo4jrestclient.client import GraphDatabase
from neo4jrestclient.query import Q
gdb = GraphDatabase("http://localhost:7474/db/data/",username="neo4j",password="prolog16")

Set up a data collection graph

  • data is organized in collections
  • collections are hierarchically organized according to levels (file directory analogon)

In [2]:
# collection organization reflects directory structure:
# e.g. cordex/output/EUR-11/MPI-CSC/MPI-M-MPI-ESM-LR/rcp85/r1i1p1/MPI-CSC-REMO2009/v1/day/tas
# generic structure: <activity>/<product>/<Domain>/<Institution>/<GCMModelName>/<CMIP5ExperimentName>
#           /<CMIP5EnsembleMember>/<RCMModelName>/<RCMVersionID>/<Frequency>/<VariableName>.

# facets describing collection

facet_nodes = []
for key, value in ENESNeoTools.facet_list1.iteritems():
    facet_node = Node("Collection",name=value[1], level=value[0])
    facet_nodes.append(facet_node)
       
facet_chain = []
for i in range(1,len(facet_nodes)):
    rel = Relationship(facet_nodes[i],"belongs_to",facet_nodes[i-1])
    facet_chain.append(rel)
        
for rel in facet_chain:
    graph.create(rel)

cordex_file_set1 = ENESNeoTools.get_files(ENESNeoTools.facet_list1)

#cordex_set1 = []
cordex_rel1 = []

for cordexfile in cordex_file_set1:
    node = Node("File", name=cordexfile, group="file")
   # cordex_set1.append(node)
    cordex_rel1.append(Relationship(node,"belongs_to",facet_nodes[0]))
                       
for rel in cordex_rel1:    
   graph.create(rel)

Data servers graph setup

  • servers and expose three types of data access services (http, globus, opendap)
  • services and servers can be non-operational ("down")

In [3]:
server_list = ENESNeoTools.get_servers()


service_rels = []
server_nodes = []
for (sname, surl) in server_list:
    new_node = Node('data_server',name=sname, url=surl)
   
    server_nodes.append(new_node)
    data_services = ENESNeoTools.data_service_nodes(sname)
   
    for data_service in data_services:
          service_rels.append(Relationship(data_service,"service",new_node)) 
            
for rel in service_rels:
    graph.create(rel)

Combine data set graph with server graph

  • a data collection is "served_by" a data_server

In [4]:
orig1 = Relationship(facet_nodes[1],"served_by",server_nodes[0])
replica1 = Relationship(facet_nodes[1],"served_by",server_nodes[1])

graph.create(orig1)
graph.create(replica1)


Out[4]:
(<Relationship graph=u'http://localhost:7474/db/data/' ref=u'relationship/0' start=u'node/134' end=u'node/169' type=u'served_by' properties={}>,)

Data servers provide different bandwidth to different regions / countries and end users belong to different regions (temporarily)


In [5]:
region_germany = Node("country", name="Germany", provider="DFN")
region_australia = Node("country", name="Australia", provider="RNet")
region_sweden = Node("country", name="Sweden", provider="SweNet")

user1 = Node("user",name="Stephan Kindermann")
user2 =  Node("user",name="Mr Spock")
user3 = Node("user",name="Michael Kolax")

home1 = Relationship(user1,"connects_to",region_germany)
home2 = Relationship(user2,"connects_to",region_australia)
home3 = Relationship(user3,"connects_to",region_sweden)

link1 = Relationship(server_nodes[0],"nw_link",region_germany,  bandwidth=2000000)
link2 = Relationship(server_nodes[0],"nw_link",region_sweden,   bandwidth=1000000)
link3 = Relationship(server_nodes[0],"nw_link",region_australia,bandwidth=500000)

link4 = Relationship(server_nodes[1],"nw_link",region_germany,   bandwidth=1500000)
link5 = Relationship(server_nodes[1],"nw_link",region_sweden,    bandwidth=3000000)
link6 = Relationship(server_nodes[1],"nw_link",region_australia, bandwidth=400000)

graph.create(link1,link2,link3,link4,link5,link6)


Out[5]:
(<Relationship graph=u'http://localhost:7474/db/data/' ref=u'relationship/1' start=u'node/165' end=u'node/3' type=u'nw_link' properties={'bandwidth': 2000000}>,
 <Relationship graph=u'http://localhost:7474/db/data/' ref=u'relationship/2' start=u'node/165' end=u'node/4' type=u'nw_link' properties={'bandwidth': 1000000}>,
 <Relationship graph=u'http://localhost:7474/db/data/' ref=u'relationship/3' start=u'node/165' end=u'node/5' type=u'nw_link' properties={'bandwidth': 500000}>,
 <Relationship graph=u'http://localhost:7474/db/data/' ref=u'relationship/4' start=u'node/169' end=u'node/3' type=u'nw_link' properties={'bandwidth': 1500000}>,
 <Relationship graph=u'http://localhost:7474/db/data/' ref=u'relationship/5' start=u'node/169' end=u'node/4' type=u'nw_link' properties={'bandwidth': 3000000}>,
 <Relationship graph=u'http://localhost:7474/db/data/' ref=u'relationship/6' start=u'node/169' end=u'node/5' type=u'nw_link' properties={'bandwidth': 400000}>)

Data servers are sometimes down (not operational and thus do not serve data to users)


In [6]:
server_nodes[0].properties["status"] = "UP"
server_nodes[1].properties["status"] = "UP"
server_nodes[0].push()
server_nodes[1].push()
server_nodes[0].properties


Out[6]:
{'name': u'carbon.dkrz', 'status': u'UP', 'url': u'http://carbon.dkrz.de'}


Interactive cells to play with graph


In [8]:
from IPython.display import HTML
HTML('<iframe src=http://localhost:7474/browser/ width=1000 height=800> </iframe>')


Out[8]:

In [9]:
%load_ext cypher

In [10]:
statement = """MATCH (myfile:File {name:"tas_EUR-11_MPI-M-MPI-ESM-LR_rcp85_r1i1p1_MPI-CSC-REMO2009_v1_day_20660101-20701231.nc"}) RETURN myfile"""
results = graph.cypher.execute(statement)
results


Out[10]:
   | myfile                                                                                                                 
---+-------------------------------------------------------------------------------------------------------------------------
 1 | (n157:File {group:"file",name:"tas_EUR-11_MPI-M-MPI-ESM-LR_rcp85_r1i1p1_MPI-CSC-REMO2009_v1_day_20660101-20701231.nc"})

In [11]:
results = %cypher http://neo4j:prolog16@localhost:7474/db/data MATCH (myfile:File {name:"tas_EUR-11_MPI-M-MPI-ESM-LR_rcp85_r1i1p1_MPI-CSC-REMO2009_v1_day_20660101-20701231.nc"}) RETURN myfile
results.get_dataframe()


1 rows affected.
Out[11]:
myfile
0 {u'group': u'file', u'name': u'tas_EUR-11_MPI-...

In [ ]:
graph.open_browser()

return operational servers for a specific file


In [12]:
%%cypher  http://neo4j:prolog16@localhost:7474/db/data

MATCH (a:File)-[:belongs_to*]-(b:Collection) -[:served_by]- (c:data_server)  
WHERE c.status = 'UP' AND a.name = 'tas_EUR-11_MPI-M-MPI-ESM-LR_rcp85_r1i1p1_MPI-CSC-REMO2009_v1_day_20760101-20801231.nc'
RETURN c


2 rows affected.
Out[12]:
c
{u'url': u'http://cordex.shmhi.se', u'status': u'UP', u'name': u'cordex.smhi'}
{u'url': u'http://carbon.dkrz.de', u'status': u'UP', u'name': u'carbon.dkrz'}

switch off a server and rerun query


In [13]:
server_nodes[1].properties["status"] = "DOWN"
server_nodes[1].push()

In [14]:
%%cypher  http://neo4j:prolog16@localhost:7474/db/data

MATCH (a:File)-[:belongs_to*]-(b:Collection) -[:served_by]- (c:data_server)  
WHERE c.status = 'UP' AND a.name = 'tas_EUR-11_MPI-M-MPI-ESM-LR_rcp85_r1i1p1_MPI-CSC-REMO2009_v1_day_20760101-20801231.nc'
RETURN c


1 rows affected.
Out[14]:
c
{u'url': u'http://carbon.dkrz.de', u'status': u'UP', u'name': u'carbon.dkrz'}

In [ ]:


In [ ]:
results = %cypher http://neo4j:prolog16@localhost:7474/db/data MATCH (a)-[r]-(b) RETURN a,r, b

In [ ]:
%%bash
ls

Simple cells to clean graphdb


In [10]:
%%cypher http://neo4j:prolog16@localhost:7474/db/data

MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE n,r


0 rows affected.
Out[10]:
[]

In [2]:
graph.delete_all()

simple graph visualizations


In [ ]:
%matplotlib inline
results.get_graph()

In [ ]:
results.draw()

ToDo / Ideas

Add data server access log information (e.g.) users who downloaded - also downloaded

.. number of downloads of specific data sets etc.


In [ ]: