DH3501: Advanced Social Networks

Assignment 2: Getting Going with Graph Storage and Analytics

Western University
Department of Modern Languages and Literatures
Digital Humanities – DH 3501

Instructor: David Brown
E-mail: dbrow52@uwo.ca
Office: AHB 1R14

Description

You're hired for a temp gig at a new startup to get their graph storage and analytics system up and running. If you do a good job, they might just hire you as chief analyst. They just got a big funding round, and the future is bright. Plus, your junior developer job isn't all it's cracked up to be. Too many last-minute assignments: YOU WANT THAT JOB!

They've already got their data stored in some huge csv files, but they need you to model it as a property graph and import it into a Neo4j graph database. If you are really into Gremlin with the Titan backend, they may consider that as an alternative to Neo4j, but you'd really have to know what you are doing. If you botch it, that job is history!

What you need to do:

1. Download the dataset here.

The dataset is packaged as a tarball, and it is up to you to extract it and store it somewhere accessible. The tarball contains several csv files with node and edge lists, along with several metadata files that specify the format of the csv files. Make sure you know what you've got before you move on to step 2.
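
If it helps, here is a minimal Python sketch of the extraction step. The file name dataset.tar.gz and the data/ directory are placeholders; use whatever the download actually gives you.

    import tarfile

    # Open the downloaded archive; "r:*" auto-detects the compression.
    with tarfile.open("dataset.tar.gz", "r:*") as tar:
        tar.list()                    # inspect the contents first
        tar.extractall(path="data")   # then extract into a local data/ directory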

2. Model the data as a property graph.

This is something you will want to do using pen and paper (or the electronic equivalent). Take some time to consider alternative models, and choose the one that you think will be most appropriate for the data at hand. In your final report, you will need to justify your model.

3. Import the data into Neo4j in a way that complies with your data model.

You may use a Python driver to upload the data, or you can execute raw Cypher statements in the Neo4j web browser console. This is up to you. Warning: this data set has a large number of edges, on the order of $10^7$, so be sure to choose your import method carefully. The import will take a long time, but it shouldn't take more than 3-4 hours. Advice: USE NEO4J INDEXES combined with some batch import technique!!!
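
One possible approach, sketched below with the official neo4j Python driver (pip install neo4j), is to create an index up front and then send the csv rows in large batches through a single parameterized UNWIND statement. The node label (Person), property names, file path, and credentials are all placeholders; adapt them to your own data model and setup.

    import csv
    from neo4j import GraphDatabase

    BATCH_SIZE = 10000

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    def load_batch(session, rows):
        # One UNWIND per batch is far faster than one CREATE per row.
        session.run(
            "UNWIND $rows AS row "
            "CREATE (:Person {id: toInteger(row.id), name: row.name})",
            rows=rows,
        )

    with driver.session() as session:
        # Index first, so MERGE/MATCH lookups during the edge import are
        # index hits instead of full label scans. (Neo4j 3.x syntax; on
        # 4.x/5.x use CREATE INDEX FOR (n:Person) ON (n.id) instead.)
        session.run("CREATE INDEX ON :Person(id)")

        with open("data/nodes.csv", newline="") as f:
            batch = []
            for row in csv.DictReader(f):
                batch.append(row)
                if len(batch) >= BATCH_SIZE:
                    load_batch(session, batch)
                    batch = []
            if batch:
                load_batch(session, batch)

    driver.close()

The same batching pattern works for the edge lists: UNWIND the rows, MATCH both endpoints by their indexed id, and CREATE the relationship.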

4. Do your best to confirm that all data has been successfully imported.

It is important to make sure that the data has been successfully loaded into the graph. Write some Cypher queries to check: count the number of nodes and relationships, and run some traversals across the different node and relationship types. Show sample results.
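
As a rough sketch of what those sanity checks might look like through the Python driver (the labels and credentials below are placeholders, as in the import sketch above):

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    with driver.session() as session:
        # Total counts -- compare these against the line counts of the csv files.
        nodes = session.run("MATCH (n) RETURN count(n) AS n").single()["n"]
        rels = session.run("MATCH ()-[r]->() RETURN count(r) AS r").single()["r"]
        print("nodes:", nodes, "relationships:", rels)

        # Per-label breakdown, to catch a file that silently failed to load.
        for record in session.run("MATCH (n) RETURN labels(n) AS labels, count(*) AS c"):
            print(record["labels"], record["c"])

    driver.close()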

5. Turn in your work.

This assignment isn't necessarily notebook based, so if you want to turn in separate files, that is okay. However, you must include the following:

  • A legible, hand- or computer-drawn diagram showing the property graph model you've created. This must be accompanied by a written description of the data model, along with the reasoning behind your choices. How did you determine node/edge types and attributes, and why did you choose to model them that way? ~ 300 words

  • All scripts used for pre-processing and data import into Neo4j. These can be pure Python, pure Cypher, or a mix of the two.

  • A file containing 4 Cypher queries:

    • Number of nodes in the graph.
    • Number of relationships in the graph.
    • Two traversals that generate results encompassing all of the possible node/edge types (a sketch of one such traversal follows below).
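
For the traversal queries, something along these lines is the shape to aim for. The labels (Person, City) and relationship types (KNOWS, LIVES_IN) here are invented placeholders; your queries must cover the node/edge types in your own model.

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    with driver.session() as session:
        # A two-hop traversal touching two node labels and two relationship types.
        for record in session.run(
            "MATCH (a:Person)-[:KNOWS]->(b:Person)-[:LIVES_IN]->(c:City) "
            "RETURN a.name AS who, b.name AS friend, c.name AS city LIMIT 10"
        ):
            print(record["who"], record["friend"], record["city"])

    driver.close()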

Grading

This assignment will be evaluated as follows:

Percentage of Final Grade: 15%

Data model w/written description: 40% - Data models will be graded on completeness (i.e., the inclusion of all relevant node/edge types and properties) and accuracy. Entities (node types), relationships, and properties should be appropriate based on the justification presented in the written description of the model, taking into account paths of access to the data through queries and analysis objectives.

Import script: 40% - The import script will be graded on import technique (batch vs. serial), index creation, and, most importantly, efficacy. Scripts should be able to import the full data set in less than 5 hours. All scripts will be run and benchmarked against a smaller replica of the data set. Furthermore, all pre-processing (presumably performed in Python) should be written in a concise, idiomatic style, and produce data structures that conform to the student's data model. Finally, the resulting database must conform to the declared data model.

Cypher traversal queries: 20% - Queries should run without throwing errors and produce the required output. Students are encouraged to avoid verbosity and write efficient queries.