DH3501: Advanced Social Networks

Class 14: Graph Data and Storage

Western University
Department of Modern Languages and Literatures
Digital Humanities – DH 3501

Instructor: David Brown
E-mail: dbrow52@uwo.ca
Office: AHB 1R14

Today marks the beginning of the Hacking On Graphs unit, in which we will be learning about graph databases and tools for processing big graphs. Get ready to party!

So up until now we've used flat files and in-memory processing to interact with graphs...

What are some of the advantages and disadvantages of this approach? Really think about it, the disadvantages can be obvious, but the incredible advantages may not be quite as noticable until you've seen some other approaches. There are also some more subtle disadvantages...do you have any criticisms about NetworkX? What is hard to do with NetworkX?

Flat file formats

Let's review the formats we have seen for storing graph data in flat files:

  • GEXF
  • GraphML
  • CSV/TSV
  • GraphSON
  • ...

GEXF

<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.gexf.net/1.2draft http://www.gexf.net/1.2draft/gexf.xsd" version="1.2">
    <meta lastmodifieddate="2009-03-20">
        <creator>Gephi.org</creator>
        <description>A Web network</description>
    </meta>
    <graph defaultedgetype="directed">
        <attributes class="node">
            <attribute id="0" title="url" type="string"/>
            <attribute id="1" title="indegree" type="float"/>
            <attribute id="2" title="frog" type="boolean">
                <default>true</default>
            </attribute>
        </attributes>
        <nodes>
            <node id="0" label="Gephi">
                <attvalues>
                    <attvalue for="0" value="http://gephi.org"/>
                    <attvalue for="1" value="1"/>
                </attvalues>
            </node>
            <node id="1" label="Webatlas">
                <attvalues>
                    <attvalue for="0" value="http://webatlas.fr"/>
                    <attvalue for="1" value="2"/>
                </attvalues>
            </node>
            <node id="2" label="RTGI">
                <attvalues>
                    <attvalue for="0" value="http://rtgi.fr"/>
                    <attvalue for="1" value="1"/>
                </attvalues>
            </node>
            <node id="3" label="BarabasiLab">
                <attvalues>
                    <attvalue for="0" value="http://barabasilab.com"/>
                    <attvalue for="1" value="1"/>
                    <attvalue for="2" value="false"/>
                </attvalues>
            </node>
        </nodes>
        <edges>
            <edge id="0" source="0" target="1"/>
            <edge id="1" source="0" target="2"/>
            <edge id="2" source="1" target="0"/>
            <edge id="3" source="2" target="1"/>
            <edge id="4" source="0" target="3"/>
        </edges>
    </graph>
</gexf>

GraphSON

{
    "mode":"EXTENDED",
    "vertices": [
        {
            "name": {
                "type": "string",
                "value": "lop"
            },
            "lang": {
                "type": "string",
                "value": "java"
            },
            "_id": "3",
            "_type": "vertex"
        },
        {
            "name": {
                "type": "string",
                "value": "vadas"
            },
            "age": {
                "type": "integer",
                "value": 27
            },
            "_id": "2",
            "_type": "vertex"
        },
        {
            "name": {
                "type": "string",
                "value": "marko"
            },
            "age": {
                "type": "integer",
                "value": 29
            },
            "_id": "1",
            "_type": "vertex"
        },
        {
            "name": {
                "type": "string",
                "value": "peter"
            },
            "age": {
                "type": "integer",
                "value": 35
            },
            "_id": "6",
            "_type": "vertex"
        },
        {
            "name": {
                "type": "string",
                "value": "ripple"
            },
            "lang": {
                "type": "string",
                "value": "java"
            },
            "_id": "5",
            "_type": "vertex"
        },
        {
            "name": {
                "type": "string",
                "value": "josh"
            },
            "age": {
                "type": "integer",
                "value": 32
            },
            "_id": "4",
            "_type": "vertex"
        }
    ],
    "edges": [
        {
            "weight": {
                "type": "float",
                "value": 1
            },
            "_id": "10",
            "_type": "edge",
            "_outV": "4",
            "_inV": "5",
            "_label": "created"
        },
        {
            "weight": {
                "type": "float",
                "value": 0.5
            },
            "_id": "7",
            "_type": "edge",
            "_outV": "1",
            "_inV": "2",
            "_label": "knows"
        },
        {
            "weight": {
                "type": "float",
                "value": 0.4000000059604645
            },
            "_id": "9",
            "_type": "edge",
            "_outV": "1",
            "_inV": "3",
            "_label": "created"
        },
        {
            "weight": {
                "type": "float",
                "value": 1
            },
            "_id": "8",
            "_type": "edge",
            "_outV": "1",
            "_inV": "4",
            "_label": "knows"
        },
        {
            "weight": {
                "type": "float",
                "value": 0.4000000059604645
            },
            "_id": "11",
            "_type": "edge",
            "_outV": "4",
            "_inV": "3",
            "_label": "created"
        },
        {
            "weight": {
                "type": "float",
                "value": 0.20000000298023224
            },
            "_id": "12",
            "_type": "edge",
            "_outV": "6",
            "_inV": "3",
            "_label": "created"
        }
    ]
}

NetworkX and memory limitations

According to T & K, ~1% of Twitter traffic for one day would create a graph that occupies 1.5 gigs of memory space in NetworkX. Processing real world graphs in memory quickly becomes unrealistic.

Let's check out http://tweetping.net/

Intro. to relational databases

So we don't focus much on relational databases (RDBs) in this course, but it is important to know what they are and the basics of how they work, as graph databases (GDBs) borrow many concepts from RDBs.

Tables and the E-R model

  • Relational databases store data in tables...you can think about a table kind of like an excel sheet.

    • MySQL
    • PostGRESQL
    • SQLite
    • ...
  • Generally, a table represents an entity, a person, place or thing.
  • Relationships between things are stored in references to other tables.

    • A foreign key stores the primary key of a related entry in another table.
    • One-to-one relationships
    • One-to-many relationships
    • Many-to-many relationships

Challenge: Our first property graph example.

Below you will see an image representing a property graph. Using what we just learned about relational database and your intuitive understanding of an image to come up with an E-R diagram representing the data contained in the graph.

Hint: Ask yourself, what are the nodes that have the names "lop" and "ripple".

Structured Query Language

SQL is a declarative DSL that allows you to compose queries against a relational database.

  • It allows you to perform operations such as creating tables, inserting data into tables, retrieving data and changing or deleting data.

CRUD: Create Read Update Delete

Examples:

# Table creation
CREATE TABLE TeachingAssistant (
    firstname varchar(30) NOT NULL,
    lastname varchar(30) NOT NULL,
    student_number char(9) NOT NULL UNIQUE,
    western_id varchar(8) NOT NULL UNIQUE,
    degree varchar(7) NOT NULL,
    image_url varchar(2083), # http://stackoverflow.com/questions/219569/best-database-field-type-for-a-url
    head_supervisor varchar(8),
    FOREIGN KEY (head_supervisor) REFERENCES Instructor(western_id),
    PRIMARY KEY (western_id)
);

# Insertion
INSERT INTO TeachingAssistant
VALUES ('Homer', 'Simpson', '250011111', 'hsimpson', 'Masters', '', 'krabppl');

INSERT INTO TeachingAssistant
VALUES ('Ned', 'Flanders', '250000666', 'nflander', 'PhD', '', 'skinner');

# Retrieval
select * from TeachingAssistant;

# Get all TeachingAssistants that have XXX as head_supervisor
select * from TeachingAssistant WHERE head_supervisor in (SELECT western_id from Instructor where western_id=250XXXXXX)

Cursors and lazy evaluation

  • Querysets aren't loaded into memory until they are needed.

Transactions

Wikipedia provide a wonderful description of the concept of a database transaction:

A transaction symbolizes a unit of work performed within a database management system (or similar system) against a database, and treated in a coherent and reliable way independent of other transactions. A transaction generally represents any change in database. Transactions in a database environment have two main purposes:

  • To provide reliable units of work that allow correct recovery from failures and keep a database consistent even in cases of system failure, when execution stops (completely or partially) and many operations upon a database remain uncompleted, with unclear status.

  • To provide isolation between programs accessing a database concurrently. If this isolation is not provided, the program's outcome are possibly erroneous.

NoSQL databases

  • Key-value store
  • Document based
  • Column family

GraphDBs

Some graph vendors:

GraphDBs are an up and coming technology...Some prominent GraphDBs:

  • Neo4j (Neo Technology)

  • Titan (Aurelius)

  • OrientDB (Orient Technologies)

  • ArangoDB

GraphDBs leverage the property graph model. A property graph has the following characteristics (RWE, 3):

  • It contains nodes and relationships.
  • Nodes contain properties (key-value pairs)
  • Relationships are named and directed, and always have a start and an end node.
  • Relationships can also contain properties.

GraphDB can be used fairly loosely, however we distinguish between a native and non-native storage techniques...

  • In a database with native graph storage, nodes physically point to one another.

What are the advantages of GraphDBs...?

  • Relational databases are actually really bad with relationships (so are many NoSQLs)

  • In the end, the relational database is unaware of the connections in the data, and application code is responsible for creating a network out of tabular results

  • Graphs store connected data as, surprise surprise, connected data!

  • Optimized for traversals!!!

    • Traversals in SQL (joins) are quite costly
  • Flexible

  • Semantically rich

The big data revolution

There are many more technologies being developed for managing large data sets including graphs...

Hadoop, Spark, Cassandra...

Over the next few classes, we will become familiar with the most common tools and techniques for working with large quantities of graph data...see you soon!