Today marks the beginning of the Hacking On Graphs unit, in which we will be learning about graph databases and tools for processing big graphs. Get ready to party!
What are some of the advantages and disadvantages of this approach? Really think about it, the disadvantages can be obvious, but the incredible advantages may not be quite as noticable until you've seen some other approaches. There are also some more subtle disadvantages...do you have any criticisms about NetworkX? What is hard to do with NetworkX?
Let's review the formats we have seen for storing graph data in flat files:
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.gexf.net/1.2draft http://www.gexf.net/1.2draft/gexf.xsd" version="1.2">
<meta lastmodifieddate="2009-03-20">
<creator>Gephi.org</creator>
<description>A Web network</description>
</meta>
<graph defaultedgetype="directed">
<attributes class="node">
<attribute id="0" title="url" type="string"/>
<attribute id="1" title="indegree" type="float"/>
<attribute id="2" title="frog" type="boolean">
<default>true</default>
</attribute>
</attributes>
<nodes>
<node id="0" label="Gephi">
<attvalues>
<attvalue for="0" value="http://gephi.org"/>
<attvalue for="1" value="1"/>
</attvalues>
</node>
<node id="1" label="Webatlas">
<attvalues>
<attvalue for="0" value="http://webatlas.fr"/>
<attvalue for="1" value="2"/>
</attvalues>
</node>
<node id="2" label="RTGI">
<attvalues>
<attvalue for="0" value="http://rtgi.fr"/>
<attvalue for="1" value="1"/>
</attvalues>
</node>
<node id="3" label="BarabasiLab">
<attvalues>
<attvalue for="0" value="http://barabasilab.com"/>
<attvalue for="1" value="1"/>
<attvalue for="2" value="false"/>
</attvalues>
</node>
</nodes>
<edges>
<edge id="0" source="0" target="1"/>
<edge id="1" source="0" target="2"/>
<edge id="2" source="1" target="0"/>
<edge id="3" source="2" target="1"/>
<edge id="4" source="0" target="3"/>
</edges>
</graph>
</gexf>
{
"mode":"EXTENDED",
"vertices": [
{
"name": {
"type": "string",
"value": "lop"
},
"lang": {
"type": "string",
"value": "java"
},
"_id": "3",
"_type": "vertex"
},
{
"name": {
"type": "string",
"value": "vadas"
},
"age": {
"type": "integer",
"value": 27
},
"_id": "2",
"_type": "vertex"
},
{
"name": {
"type": "string",
"value": "marko"
},
"age": {
"type": "integer",
"value": 29
},
"_id": "1",
"_type": "vertex"
},
{
"name": {
"type": "string",
"value": "peter"
},
"age": {
"type": "integer",
"value": 35
},
"_id": "6",
"_type": "vertex"
},
{
"name": {
"type": "string",
"value": "ripple"
},
"lang": {
"type": "string",
"value": "java"
},
"_id": "5",
"_type": "vertex"
},
{
"name": {
"type": "string",
"value": "josh"
},
"age": {
"type": "integer",
"value": 32
},
"_id": "4",
"_type": "vertex"
}
],
"edges": [
{
"weight": {
"type": "float",
"value": 1
},
"_id": "10",
"_type": "edge",
"_outV": "4",
"_inV": "5",
"_label": "created"
},
{
"weight": {
"type": "float",
"value": 0.5
},
"_id": "7",
"_type": "edge",
"_outV": "1",
"_inV": "2",
"_label": "knows"
},
{
"weight": {
"type": "float",
"value": 0.4000000059604645
},
"_id": "9",
"_type": "edge",
"_outV": "1",
"_inV": "3",
"_label": "created"
},
{
"weight": {
"type": "float",
"value": 1
},
"_id": "8",
"_type": "edge",
"_outV": "1",
"_inV": "4",
"_label": "knows"
},
{
"weight": {
"type": "float",
"value": 0.4000000059604645
},
"_id": "11",
"_type": "edge",
"_outV": "4",
"_inV": "3",
"_label": "created"
},
{
"weight": {
"type": "float",
"value": 0.20000000298023224
},
"_id": "12",
"_type": "edge",
"_outV": "6",
"_inV": "3",
"_label": "created"
}
]
}
According to T & K, ~1% of Twitter traffic for one day would create a graph that occupies 1.5 gigs of memory space in NetworkX. Processing real world graphs in memory quickly becomes unrealistic.
Let's check out http://tweetping.net/
So we don't focus much on relational databases (RDBs) in this course, but it is important to know what they are and the basics of how they work, as graph databases (GDBs) borrow many concepts from RDBs.
Relational databases store data in tables...you can think about a table kind of like an excel sheet.
Relationships between things are stored in references to other tables.
Below you will see an image representing a property graph. Using what we just learned about relational database and your intuitive understanding of an image to come up with an E-R diagram representing the data contained in the graph.
Hint: Ask yourself, what are the nodes that have the names "lop" and "ripple".
SQL is a declarative DSL that allows you to compose queries against a relational database.
CRUD: Create Read Update Delete
Examples:
# Table creation
CREATE TABLE TeachingAssistant (
firstname varchar(30) NOT NULL,
lastname varchar(30) NOT NULL,
student_number char(9) NOT NULL UNIQUE,
western_id varchar(8) NOT NULL UNIQUE,
degree varchar(7) NOT NULL,
image_url varchar(2083), # http://stackoverflow.com/questions/219569/best-database-field-type-for-a-url
head_supervisor varchar(8),
FOREIGN KEY (head_supervisor) REFERENCES Instructor(western_id),
PRIMARY KEY (western_id)
);
# Insertion
INSERT INTO TeachingAssistant
VALUES ('Homer', 'Simpson', '250011111', 'hsimpson', 'Masters', '', 'krabppl');
INSERT INTO TeachingAssistant
VALUES ('Ned', 'Flanders', '250000666', 'nflander', 'PhD', '', 'skinner');
# Retrieval
select * from TeachingAssistant;
# Get all TeachingAssistants that have XXX as head_supervisor
select * from TeachingAssistant WHERE head_supervisor in (SELECT western_id from Instructor where western_id=250XXXXXX)
Wikipedia provide a wonderful description of the concept of a database transaction:
A transaction symbolizes a unit of work performed within a database management system (or similar system) against a database, and treated in a coherent and reliable way independent of other transactions. A transaction generally represents any change in database. Transactions in a database environment have two main purposes:
To provide reliable units of work that allow correct recovery from failures and keep a database consistent even in cases of system failure, when execution stops (completely or partially) and many operations upon a database remain uncompleted, with unclear status.
To provide isolation between programs accessing a database concurrently. If this isolation is not provided, the program's outcome are possibly erroneous.
Some graph vendors:
GraphDBs are an up and coming technology...Some prominent GraphDBs:
Neo4j (Neo Technology)
Titan (Aurelius)
OrientDB (Orient Technologies)
ArangoDB
GraphDBs leverage the property graph model. A property graph has the following characteristics (RWE, 3):
GraphDB can be used fairly loosely, however we distinguish between a native and non-native storage techniques...
Relational databases are actually really bad with relationships (so are many NoSQLs)
In the end, the relational database is unaware of the connections in the data, and application code is responsible for creating a network out of tabular results
Graphs store connected data as, surprise surprise, connected data!
Optimized for traversals!!!
Flexible
Semantically rich
There are many more technologies being developed for managing large data sets including graphs...
Hadoop, Spark, Cassandra...
Over the next few classes, we will become familiar with the most common tools and techniques for working with large quantities of graph data...see you soon!