In [2]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

0.0. Visualizing Metadata

RDF (Resource Description Framework) is a data model for describing resources on the web. It can be used to describe just about anything, but is often applied to bibliographic collections: representing metadata about published documents.

RDF/XML is a serialization format for representing RDF. There are other ways of representing RDF, like N-Triples, Turtle, and JSON-LD.

One of the core concepts of RDF is representing (meta)data as a graph. Every element of the RDF document (a file containing RDF statements) is a node in the graph: articles, people, journals, literals (like volume numbers), etc. These nodes are called resources. Resources are linked together in triples: tri-partite statements consisting of a subject, a predicate, and an object.
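To make the idea of triples-as-a-graph concrete, here is a minimal sketch in plain Python. The URIs, predicate names, and values are made up for illustration and are not taken from the sample file; a real RDF library such as rdflib uses dedicated classes for resources and literals rather than bare strings.

```python
# Each statement is a (subject, predicate, object) triple.
# All identifiers below are hypothetical.
triples = [
    ('http://example.org/article/1', 'dc:title', 'An Example Article'),
    ('http://example.org/article/1', 'dc:creator', 'http://example.org/person/1'),
    ('http://example.org/person/1', 'foaf:surname', 'Doe'),
    ('http://example.org/article/1', 'dcterms:isPartOf', 'http://example.org/journal/1'),
    ('http://example.org/journal/1', 'prism:volume', '12'),
]

# Subjects and objects become nodes; each triple becomes a directed edge
# from its subject to its object, labelled with the predicate.
nodes = set()
edges = []
for subject, predicate, obj in triples:
    nodes.add(subject)
    nodes.add(obj)
    edges.append((subject, obj, predicate))

print('%i nodes, %i edges' % (len(nodes), len(edges)))
```

Note that a resource appearing in several triples (like the article above) is still a single node, which is how separate statements join up into one connected graph.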

In this notebook, we'll convert a simple Zotero RDF/XML document into GraphML (a graph serialization format) and visualize the resulting graph using Cytoscape.


Note: This exercise is intended only to introduce you to RDF and graphs, and isn't something that you are likely to do as part of an analysis. There is a sample RDF/XML file included in the data subdirectory, describing a single document. Use that file to start. You can try this with your own RDF if you want, but even a moderate number of documents will lead to extremely large and unwieldy graphs. So, be careful.



In [3]:
import rdflib
import networkx as nx
import os

In [1]:
rdf_path = 'data/example.rdf'

Correct Zotero RDF

Zotero isn't exactly a pro at creating valid RDF/XML. The code cell below fixes a known issue with Zotero RDF documents.


In [4]:
with open(rdf_path, 'r') as f:
    corrected = f.read().replace('rdf:resource rdf:resource',
                                 'link:link rdf:resource')

# The corrected RDF will be saved to a file with `_corrected`
#  added to the name. E.g. if the original RDF document was
#  called `example.rdf`, the new file will be called
#  `example_corrected.rdf`.
base, name = os.path.split(rdf_path)
corrected_name = '%s_corrected.%s' % tuple(name.split('.'))
corrected_rdf_path = os.path.join(base, corrected_name)

with open(corrected_rdf_path, 'w') as f:
    f.write(corrected)
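Building the new name with `name.split('.')` assumes the filename contains exactly one dot. A slightly more forgiving sketch uses `os.path.splitext`, which splits only at the last dot. The helper name `corrected_path` and the second example filename are hypothetical, introduced here only for illustration.

```python
import os

def corrected_path(path, suffix='_corrected'):
    """Insert `suffix` just before the file extension of `path`."""
    root, ext = os.path.splitext(path)    # Splits on the last dot only.
    return root + suffix + ext

print(corrected_path('data/example.rdf'))
print(corrected_path('data/my.library.rdf'))
```

This also works for a name with no extension at all, in which case the suffix is simply appended.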

Parse RDF

We use the rdflib Python package to parse the corrected RDF document. The code cell below creates an empty RDF graph, and then reads the triples from the corrected RDF document created above.


In [5]:
rdf_graph = rdflib.Graph()
rdf_graph.parse(corrected_rdf_path, format='xml')    # `xml` means RDF/XML.

Create a GraphML file

GraphML is a popular graph serialization format. My favorite graph visualization tool, Cytoscape, can read GraphML. The NetworkX Python package makes it easy to create GraphML files.


In [6]:
graph = nx.DiGraph()    # Metadata is `directed`.

for s, p, o in rdf_graph.triples((None, None, None)):
    # The .toPython() method converts rdflib objects into objects
    #  that any Python module can understand (e.g. str, int, float).
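    # The predicate is stored as an edge attribute, so that it ends
    #  up in the GraphML file as a property of the edge.
    graph.add_edge(s.toPython(),
                   o.toPython(),
                   predicate=p.toPython())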

print('Added %i nodes and %i edges to the graph' % (graph.order(),
                                                    graph.size()))


Added 12 nodes and 11 edges to the graph

The code cell below will create a new GraphML file that we can import into Cytoscape. The output directory must already exist.


In [7]:
graphml_path = 'output/example.graphml'
nx.write_graphml(graph, graphml_path)
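Before opening Cytoscape, it can be useful to check that each edge in a GraphML file actually carries a predicate. The sketch below reads GraphML with the standard library only. The embedded document is a hand-written fragment for illustration; the file that NetworkX writes may differ in details such as key ids, and the node names here are hypothetical.

```python
import xml.etree.ElementTree as ET

# A hand-written GraphML fragment (hypothetical content).
graphml = '''<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="edge" attr.name="predicate" attr.type="string"/>
  <graph edgedefault="directed">
    <node id="http://example.org/article/1"/>
    <node id="An Example Article"/>
    <edge source="http://example.org/article/1" target="An Example Article">
      <data key="d0">http://purl.org/dc/elements/1.1/title</data>
    </edge>
  </graph>
</graphml>'''

NS = {'g': 'http://graphml.graphdrawing.org/xmlns'}
root = ET.fromstring(graphml)

# Map key ids (e.g. "d0") to the attribute names they stand for.
key_names = {k.get('id'): k.get('attr.name')
             for k in root.findall('g:key', NS)}

# Collect (source, target, attributes) for every edge.
edges = []
for edge in root.findall('g:graph/g:edge', NS):
    attrs = {key_names[d.get('key')]: d.text
             for d in edge.findall('g:data', NS)}
    edges.append((edge.get('source'), edge.get('target'), attrs))

for source, target, attrs in edges:
    print('%s -> %s %r' % (source, target, attrs))
```

To inspect a file on disk instead of the embedded string, `ET.parse(graphml_path).getroot()` can take the place of `ET.fromstring(graphml)`.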

In Cytoscape:

  • Select File > Import > Network File and choose example.graphml.
  • Select Layout > Apply Preferred Layout to apply a force-directed layout.
  • Select the Style panel at left, and select the "Directed" style from the dropdown menu.
  • Select the Edge tab at the bottom left; expand the Label property, and select predicate in the Column field.