This notebook walks through a detailed example of the typical workflow Graft aims to support. The dataset used here, Google+, was obtained from the Stanford Network Analysis Platform (SNAP).
Reference : J. McAuley and J. Leskovec. Learning to Discover Social Circles in Ego Networks. NIPS, 2012.
vertices : 107614
edges : 51127

The Google+ data is essentially a network of professionals across the world. Each vertex, or person, has the following attributes attached:
gender : enum
institute : An array of keywords describing the person's workplace
job_title : An array of keywords describing the person's role
last_name
place : An array of places the person has worked or lived
university : An array of keywords describing the universities the person has attended

The dataset is in the form of an ego network, and contains a set of files for each ego node:
nodeId.edges : The edges in the ego network for the node 'nodeId'. The 'ego' node does not appear, but it is assumed that they follow every node id that appears in this file.
nodeId.feat : The features for each of the nodes that appear in the edge file.
nodeId.egofeat : The features for the ego user.
nodeId.featnames : The names of each of the feature dimensions. Features are '1' if the user has this property in their profile, and '0' otherwise.

The structure of the vertex metadata is quite awkward, but nothing a bit of preprocessing can't handle:
In [ ]:
using Graft
using StatsBase
import LightGraphs
# Fetch the dataset
# Uncompress the vertex metadata and convert to TSV
# Write the vertex metadata to vertex_data.txt
# Initialize the graph file, Graph.txt, with a header
include(joinpath(Pkg.dir("Graft"), "examples", "build_dataset.jl"))
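The build script itself is not reproduced in this notebook. As a rough illustration of the decoding step that the file descriptions above imply, the sketch below turns one row of 0/1 flags into named attribute arrays. The sample lines, the `<index> <attribute>:<keyword>` layout, and the helper names `parse_featnames` and `decode_vertex` are assumptions made for this example (Julia 1.x, Base only); they are not taken from `build_dataset.jl`.

```julia
# Hypothetical contents of a nodeId.featnames file, one "<index> <attribute>:<keyword>" per line.
featnames = ["0 gender:1", "1 gender:2", "2 institution:Google",
             "3 place:New York", "4 university:Stanford"]

# Hypothetical row of a nodeId.feat file: "<vertex id> <flag 0> <flag 1> ..."
featline = "1001 1 0 1 0 1"

# Turn the featnames lines into (attribute, keyword) pairs, in feature order.
function parse_featnames(lines)
    names = Tuple{String,String}[]
    for line in lines
        # Drop the leading index, then split at the first colon only,
        # so keywords containing spaces or colons stay intact.
        _, rest = split(line, ' '; limit=2)
        attr, keyword = split(rest, ':'; limit=2)
        push!(names, (String(attr), String(keyword)))
    end
    return names
end

# Collect, per attribute, the keywords whose flag is "1" for this vertex.
function decode_vertex(names, line)
    fields = split(line)
    id = String(fields[1])
    props = Dict{String,Vector{String}}()
    for (i, flag) in enumerate(fields[2:end])
        if flag == "1"
            attr, keyword = names[i]
            push!(get!(props, attr, String[]), keyword)
        end
    end
    return id, props
end

id, props = decode_vertex(parse_featnames(featnames), featline)
```

Attributes with no set flag are simply absent from the dictionary, which corresponds to the empty arrays that the queries later in the notebook have to filter out.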
In [3]:
;awk '!seen[$1]++' vertex_data.txt > vdata.txt
In [4]:
;awk '!seen[$0]++' gplus_combined.txt | tr ' ' '\t' > edata.txt
In [5]:
;cat vdata.txt edata.txt >> Graph.txt
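The two `awk` commands above use the `!seen[key]++` idiom, which keeps only the first line for each distinct key: the first field (`$1`) for vertices, and the whole line (`$0`) for edges. For readers less familiar with `awk`, the same logic can be sketched in plain Julia. The helper names `dedup_by` and `first_field` and the sample lines are made up for this illustration.

```julia
# Keep the first line seen for each distinct key, preserving input order.
# Mirrors awk's `!seen[key]++` pattern.
function dedup_by(key, lines)
    seen = Set{String}()
    kept = String[]
    for line in lines
        k = String(key(line))
        if !(k in seen)
            push!(seen, k)
            push!(kept, line)
        end
    end
    return kept
end

# awk's $1: the first whitespace-separated field (empty string for a blank line).
function first_field(line)
    fields = split(line)
    return isempty(fields) ? "" : String(fields[1])
end

# Invented sample data: the vertex id "1" appears twice, and one edge is repeated.
vertex_lines = ["1\tfemale\tSmith", "2\tmale\tJones", "1\tfemale\tSmyth"]
edge_lines = ["1 2", "1 2", "2 1"]

vdata = dedup_by(first_field, vertex_lines)          # like awk '!seen[$1]++'
edata = [replace(l, ' ' => '\t')                     # like tr ' ' '\t'
         for l in dedup_by(identity, edge_lines)]    # like awk '!seen[$0]++'
```

For files of this size the `awk` one-liners are likely the more practical choice; the Julia version is only meant to make the deduplication rule explicit.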
In [6]:
# The graph dataset is now stored in Graph.txt
countlines("Graph.txt")
Out[6]:
In [ ]:
g = loadgraph("Graph.txt"; verbose=true)
In [8]:
# Get the graph's size
size(g)
Out[8]:
In [9]:
# Function to fetch the 5 most frequent entries
top5(x) = sort(collect(countmap(vcat(filter(y->length(y) > 0, collect(x))...))), by=x->x[2], rev=true)[1 : 5]
Out[9]:
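The one-line `top5` above packs several steps together: drop empty arrays, flatten the rest, count occurrences, sort by count, and take the first five. A more explicit sketch of the same idea follows, using only Base rather than `StatsBase.countmap`. The name `topk` and the sample data are hypothetical; unlike the one-liner, this version also tolerates inputs with fewer than `k` distinct keywords.

```julia
# Count keyword occurrences across a collection of keyword arrays and
# return the k most frequent as keyword => count pairs.
function topk(values, k=5)
    counts = Dict{String,Int}()
    for keywords in values        # one keyword array per vertex or edge
        for kw in keywords        # empty arrays contribute nothing
            counts[kw] = get(counts, kw, 0) + 1
        end
    end
    ranked = sort(collect(counts), by = p -> p[2], rev = true)
    # Guard against fewer than k distinct keywords.
    return ranked[1:min(k, length(ranked))]
end

# Invented sample: each inner array stands for one vertex's `university` property.
sample = [["MIT", "Stanford"], String[], ["MIT"], ["MIT", "Stanford", "UCLA"]]
top2 = topk(sample, 2)
```

Keywords with equal counts may come back in any order, since the counts are held in a `Dict`.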
In [10]:
# Find the universities where alumni are well connected
@query(g |> filter(s.university == t.university) |> eachedge(s.university)) |> top5
Out[10]:
In [11]:
# If you work for Google, which schools did people in your network go to?
network = hopgraph(g, @query(g |> filter("Google" in v.institution) |> eachvertex(v.label)), 1)
@query(network |> eachvertex(v.university)) |> top5
Out[11]:
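To make the neighbourhood step above concrete, here is a small sketch of collecting every vertex within a given number of hops of a seed set. It assumes that `hopgraph` builds its subgraph from such a vertex set; that reading, the adjacency-dictionary representation, and the name `within_hops` are assumptions of this example, not a description of Graft's internals.

```julia
# Breadth-first expansion: return the seeds plus every vertex reachable
# from them in at most `hops` steps along outgoing edges.
function within_hops(adj::Dict{Int,Vector{Int}}, seeds, hops::Int)
    visited = Set{Int}(seeds)
    frontier = collect(seeds)
    for _ in 1:hops
        next = Int[]
        for v in frontier
            for w in get(adj, v, Int[])
                if !(w in visited)
                    push!(visited, w)
                    push!(next, w)
                end
            end
        end
        frontier = next   # only newly discovered vertices are expanded next
    end
    return visited
end

# Invented toy graph: 1 -> 2, 1 -> 3, 2 -> 4, 4 -> 5
adj = Dict(1 => [2, 3], 2 => [4], 3 => Int[], 4 => [5], 5 => Int[])
one_hop = within_hops(adj, [1], 1)
```

With the seed set being the vertices that list "Google" as a workplace, a single hop yields those vertices together with everyone they follow.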
In [12]:
# Find the most popular schools in Los Angeles
@query(g |> filter("Los Angeles" in v.place) |> eachvertex(v.university)) |> top5
Out[12]:
In [13]:
# Find cities that are well connected to New York
@query(g |> filter("New York" in s.place) |> eachedge(t.place)) |> top5
Out[13]:
In [15]:
# Run PageRank, using LightGraphs, and set the result as a vertex property
M = export_adjacency(g)
setvprop!(g, :, LightGraphs.pagerank(LightGraphs.DiGraph(M)), :pagerank);
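For intuition about what the PageRank call computes from the adjacency matrix, the following is a minimal power-iteration sketch. It is not the LightGraphs implementation: the function name `pagerank_sketch`, the damping factor of 0.85, the fixed iteration count, and the convention that a nonzero `M[i, j]` means an edge from `i` to `j` are all assumptions of this example.

```julia
# Simple PageRank by power iteration on a dense adjacency matrix.
# Assumes M[i, j] != 0 means an edge from vertex i to vertex j.
function pagerank_sketch(M::AbstractMatrix; damping=0.85, iterations=100)
    n = size(M, 1)
    outdeg = vec(sum(M, dims=2))          # total outgoing weight per vertex
    rank = fill(1.0 / n, n)               # start from the uniform distribution
    for _ in 1:iterations
        next = fill((1.0 - damping) / n, n)   # teleportation term
        for i in 1:n
            if outdeg[i] == 0
                # Dangling vertex: spread its rank evenly over all vertices.
                next .+= damping * rank[i] / n
            else
                for j in 1:n
                    if M[i, j] != 0
                        next[j] += damping * rank[i] * M[i, j] / outdeg[i]
                    end
                end
            end
        end
        rank = next
    end
    return rank
end

# Invented toy graph: vertices 1 and 2 both point at vertex 3.
ranks = pagerank_sketch([0 0 1; 0 0 1; 0 0 0])
```

The dense double loop is only suitable for toy inputs; a graph the size of Google+ needs the sparse routines that a graph library provides.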
In [16]:
# Print out the vertex descriptor with a few properties
VertexDescriptor(@query(g |> select(v.gender, v.last_name, v.pagerank)))
Out[16]:
In [17]:
# Find the number of mutual friends between the source and target vertices for each edge
seteprop!(g, :, @query(g |> eachedge(e.mutualcount)), :mutual_friends);
EdgeDescriptor(g)
Out[17]:
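The mutual-friends count stored above can be pictured as the size of the intersection of two neighbour sets. The sketch below assumes that `e.mutualcount` counts vertices adjacent to both endpoints of an edge, and uses out-neighbours for that purpose; both points are assumptions, as are the name `mutual_counts` and the toy graph.

```julia
# For every edge (u, v), count the vertices that both u and v point to.
function mutual_counts(adj::Dict{Int,Vector{Int}})
    neighbours = Dict(v => Set(ns) for (v, ns) in adj)
    counts = Dict{Tuple{Int,Int},Int}()
    for (u, ns) in adj, v in ns
        # A target with no adjacency entry is treated as having no neighbours.
        shared = intersect(neighbours[u], get(neighbours, v, Set{Int}()))
        counts[(u, v)] = length(shared)
    end
    return counts
end

# Invented toy graph: 1 -> 2, 1 -> 3, 2 -> 3
toy = Dict(1 => [2, 3], 2 => [3], 3 => Int[])
mutual = mutual_counts(toy)
```

In the toy graph, vertices 1 and 2 both point to vertex 3, so the edge from 1 to 2 has one mutual friend and the other two edges have none.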