Google+ Demo

This notebook walks through a detailed example of the typical workflow Graft aims to support. The dataset used here, Google+, was obtained from the Stanford Network Analysis Platform (SNAP). Reference: J. McAuley and J. Leskovec. Learning to Discover Social Circles in Ego Networks. NIPS, 2012.

Dataset summary:

  • vertices : 107614
  • edges : 13673453

The Google+ data is essentially a network of professionals across the world. Each vertex, or person, has the following attributes attached:

  1. gender : enum
  2. institution : An array containing keywords describing the person's workplace
  3. job_title : An array containing keywords describing the person's role
  4. last_name
  5. place : An array containing places the person has worked/lived
  6. university : An array containing keywords describing the universities a person has attended

Preprocessing

The dataset is organized as a collection of ego networks, with a set of files for each ego node:

  1. nodeId.edges : The edges in the ego network for the node 'nodeId'. The 'ego' node does not appear, but it is assumed that they follow every node id that appears in this file.
  2. nodeId.feat : The features for each of the nodes that appears in the edge file.
  3. nodeId.egofeat : The features for the ego user.
  4. nodeId.featnames : The names of each of the feature dimensions. Features are '1' if the user has this property in their profile, and '0' otherwise.
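
The one-hot encoding described in the list above can be decoded roughly as follows. This is a minimal sketch in plain Julia (1.x assumed), not the code in build_dataset.jl: the sample lines, the `<index> <attribute>:<value>` layout, and the helper names `parse_featname` and `decode_features` are all assumptions made for illustration.

```julia
# Hypothetical sample of a `.featnames` file: "<index> <attribute>:<value>".
# The exact layout is an assumption; check it against the downloaded files.
featnames = [
    "0 gender:1",
    "1 gender:2",
    "2 institution:Google",
    "3 place:New York",
    "4 university:Stanford University",
]

# Split one line into (attribute, value), dropping the leading index.
function parse_featname(line)
    _, rest = split(line, ' '; limit = 2)
    attr, value = split(rest, ':'; limit = 2)
    return String(attr), String(value)
end

# Turn one 0/1 feature row into attribute => list of values.
function decode_features(bits, featnames)
    props = Dict{String,Vector{String}}()
    for (bit, line) in zip(bits, featnames)
        bit == 1 || continue
        attr, value = parse_featname(line)
        push!(get!(props, attr, String[]), value)
    end
    return props
end

# A user with gender 1 who works at Google and lists New York as a place.
props = decode_features([1, 0, 1, 1, 0], featnames)
```

Attributes with no set bits are simply absent from the result, which is one plausible reason many vertices later show NA for a property.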

The structure of the vertex metadata is quite awkward, but nothing a bit of preprocessing can't handle:


In [ ]:
using Graft
using StatsBase
import LightGraphs

# The build script below:
#   1. fetches the dataset,
#   2. uncompresses the vertex metadata and converts it to TSV,
#   3. writes the vertex metadata to vertex_data.txt, and
#   4. initializes the graph file, Graph.txt, with a header.
include(joinpath(Pkg.dir("Graft"), "examples", "build_dataset.jl"))


Progress:  95%  ETA: 0:00:18

In [3]:
;awk '!seen[$1]++' vertex_data.txt > vdata.txt

In [4]:
;awk '!seen[$0]++' gplus_combined.txt | tr ' ' '\t' > edata.txt

In [5]:
;cat vdata.txt edata.txt >> Graph.txt
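
The two awk commands above keep only the first line seen for each key: the first field for vertices, and the whole line for edges. For readers less familiar with the `!seen[key]++` idiom, here is a rough Julia equivalent (1.x assumed) on toy data. The names `dedupe_by` and `first_field` are hypothetical, and the sketch skips the space-to-tab conversion that `tr` performs.

```julia
# Keep the first line seen for each key, mirroring awk's `!seen[key]++`.
function dedupe_by(keyfn, lines)
    seen = Set{String}()
    kept = String[]
    for line in lines
        key = String(keyfn(line))
        if !(key in seen)
            push!(seen, key)
            push!(kept, line)
        end
    end
    return kept
end

# First whitespace-separated field, like awk's `$1`.
function first_field(line)
    fields = split(line)
    return isempty(fields) ? "" : String(fields[1])
end

lines = ["101\tdan", "102\tadam", "101\tdaniel", "102\tadam"]

by_vertex = dedupe_by(first_field, lines)  # one line per vertex ID
by_line   = dedupe_by(identity, lines)     # drops only exact duplicates
```

Keying on the first field drops the second `101` record even though its contents differ, so the order of records in the input decides which metadata survives.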

In [6]:
# The graph dataset is now stored in Graph.txt
countlines("Graph.txt")


Out[6]:
13781072

Loading the Graph into memory

Graft provides the loadgraph function for reading graphs from files. At the moment it supports only a TSV-based format:


In [ ]:
g = loadgraph("Graph.txt"; verbose=true)


Fetching Graph Header
Loading Vertex Data
Progress: 100% Time: 0:00:43
Loading Edge Data
Progress:  98%  ETA: 0:00:01

In [8]:
# Get the graph's size
size(g)


Out[8]:
(107614,13673453)

Graph Queries

Now that the graph is loaded into memory, we can start querying it for interesting information:


In [9]:
# Flatten a collection of arrays, skipping empty ones, and return the 5 most frequent entries
top5(x) = sort(collect(countmap(vcat(filter(y->length(y) > 0, collect(x))...))), by=x->x[2], rev=true)[1 : 5]


Out[9]:
top5 (generic function with 1 method)
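
To make the one-liner above easier to follow, here is a Base-only sketch of the same idea on toy data (Julia 1.x assumed). The name `topk` and the sample data are hypothetical. Unlike `top5`, it does not use StatsBase's `countmap`, it breaks ties alphabetically, and it tolerates inputs with fewer than `k` distinct entries.

```julia
# Count every entry across a collection of arrays and return the k most
# frequent as value => count pairs. Ties are broken alphabetically.
function topk(arrays, k)
    counts = Dict{String,Int}()
    for arr in arrays, item in arr
        counts[item] = get(counts, item, 0) + 1
    end
    ranked = sort(collect(counts); by = p -> (-p.second, p.first))
    return ranked[1:min(k, length(ranked))]
end

# One array of universities per vertex; an empty array means "none listed".
universities = [
    ["Stanford University"],
    String[],
    ["Stanford University", "Harvard University"],
    ["Harvard University", "Stanford University", "MIT"],
]

top2 = topk(universities, 2)
```

Here `top2` holds `"Stanford University" => 3` followed by `"Harvard University" => 2`.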

In [10]:
# Find the universities whose alumni are well connected, by counting edges whose endpoints share a university
@query(g |> filter(s.university == t.university) |> eachedge(s.university)) |> top5


Out[10]:
5-element Array{Pair{String,Int64},1}:
 "Stanford University"=>452                  
 "Polytechnic University of Puerto Rico"=>105
 "East Carolina University"=>91              
 "University of Utah"=>86                    
 "Colorado State University"=>84             

In [11]:
# If you work for Google, which schools did people in your network go to?
network = hopgraph(g, @query(g |> filter("Google" in v.institution) |> eachvertex(v.label)), 1)
@query(network |> eachvertex(v.university)) |> top5


Out[11]:
5-element Array{Pair{String,Int64},1}:
 "Stanford University"=>182               
 "University of California, Berkeley"=>103
 "University of Phoenix"=>87              
 "University of Michigan"=>82             
 "Harvard University"=>75                 

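The query above first selects a set of seed vertices and then expands it by one hop. The following is a rough sketch of that expansion on a toy adjacency list (Julia 1.x assumed). It illustrates the idea only and is not Graft's hopgraph implementation: the name `hop_neighborhood` is hypothetical, and following only out-edges is an assumption.

```julia
# Collect every vertex reachable from `seeds` in at most `hops` steps,
# following out-edges in an adjacency list (vertex => out-neighbors).
function hop_neighborhood(adj::Dict{Int,Vector{Int}}, seeds::Vector{Int}, hops::Int)
    visited = Set(seeds)
    frontier = copy(seeds)
    for _ in 1:hops
        next = Int[]
        for v in frontier, w in get(adj, v, Int[])
            if !(w in visited)
                push!(visited, w)
                push!(next, w)
            end
        end
        frontier = next
    end
    return visited
end

adj = Dict(1 => [2, 3], 2 => [4], 3 => Int[], 4 => [5], 5 => Int[])

one_hop = hop_neighborhood(adj, [1], 1)  # vertices 1, 2, 3
two_hop = hop_neighborhood(adj, [1], 2)  # vertices 1, 2, 3, 4
```

A subgraph induced on the returned vertex set would then play the role of `network` in the query above.
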
In [12]:
# Find the universities most often attended by people who list Los Angeles as a place
@query(g |> filter("Los Angeles" in v.place) |> eachvertex(v.university)) |> top5


Out[12]:
5-element Array{Pair{String,Int64},1}:
 "University of Southern California"=>13    
 "University of California, Los Angeles"=>12
 "University of California, Berkeley"=>8    
 "Columbia University"=>7                   
 "New York University"=>7                   

In [13]:
# Find cities that are well connected to New York
@query(g |> filter("New York" in s.place) |> eachedge(t.place)) |> top5


Out[13]:
5-element Array{Pair{String,Int64},1}:
 "London"=>2105           
 "New York"=>1577         
 "San Francisco, CA"=>1381
 "Chicago, IL"=>1163      
 "San Francisco"=>1131    

In [15]:
# Run PageRank, using LightGraphs, and set the result as a vertex property
M = export_adjacency(g)
setvprop!(g, :, LightGraphs.pagerank(LightGraphs.DiGraph(M)), :pagerank);
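
For intuition about the scores stored above, here is a small power-iteration PageRank on a dense toy matrix (Julia 1.x assumed). It is a sketch of the general technique, not LightGraphs' implementation. The name `pagerank_sketch`, the damping factor of 0.85, and the fixed iteration count are choices made for this example.

```julia
# Power-iteration PageRank on a dense 0/1 adjacency matrix, where
# A[i, j] == 1 means there is an edge i -> j.
function pagerank_sketch(A; damping = 0.85, iters = 100)
    n = size(A, 1)
    outdeg = vec(sum(A, dims = 2))
    scores = fill(1.0 / n, n)
    for _ in 1:iters
        # Every vertex starts each round with the "teleport" share.
        next = fill((1 - damping) / n, n)
        for i in 1:n
            if outdeg[i] == 0
                # A vertex with no out-edges spreads its score uniformly.
                next .+= damping * scores[i] / n
            else
                for j in 1:n
                    next[j] += damping * scores[i] * A[i, j] / outdeg[i]
                end
            end
        end
        scores = next
    end
    return scores
end

# Toy graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
A = [0 1 1;
     0 0 1;
     1 0 0]

r = pagerank_sketch(A)
```

The scores sum to 1, and vertex 3, which receives links from both other vertices, ranks highest. A dense matrix is only practical for toy inputs; a graph of this notebook's size needs a sparse representation.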

In [16]:
# Print out the vertex descriptor with a few properties
VertexDescriptor(@query(g |> select(v.gender, v.last_name, v.pagerank)))


Out[16]:
│ VertexID │ Labels                │ gender │ last_name       │ pagerank   │
├──────────┼───────────────────────┼────────┼─────────────────┼────────────┤
│ 1        │ 114985346359714431656 │ 1      │ "non-ascii_620" │ 6.48789e-6 │
│ 2        │ 111065108889012087599 │ 1      │ "dan"           │ 7.756e-5   │
│ 3        │ 113204882497955654314 │ 1      │ NA              │ 6.3443e-6  │
│ 4        │ 116860750964767060846 │ 1      │ "steven"        │ 1.08936e-5 │
│ 5        │ 109870053628419941069 │ 0      │ NA              │ 4.64961e-5 │
│ 6        │ 108249232416813189685 │ 1      │ "jonathan"      │ 3.61508e-5 │
│ 7        │ 111193388731102401849 │ 1      │ "scott"         │ 4.4071e-6  │
│ 8        │ 111737859526639530840 │ 2      │ NA              │ 8.02871e-6 │
│ 9        │ 112299972688047529628 │ 1      │ "todd"          │ 4.42207e-6 │
│ 10       │ 102945758979783986480 │ 1      │ "adam"          │ 4.87197e-5 │
│ 11       │ 100710846514306801296 │ 1      │ "daniel"        │ 2.36998e-5 │
⋮
│ 107603   │ 106569813518583803263 │ 1      │ NA              │ 3.88311e-6 │
│ 107604   │ 108762954268468707753 │ 1      │ NA              │ 1.00141e-5 │
│ 107605   │ 115866686692802419945 │ 1      │ NA              │ 5.79192e-6 │
│ 107606   │ 110854930685324287139 │ 1      │ NA              │ 6.4456e-6  │
│ 107607   │ 116320812345106679256 │ 2      │ NA              │ 5.4473e-6  │
│ 107608   │ 114679782050104286479 │ 1      │ NA              │ 6.20278e-6 │
│ 107609   │ 109172913211802298398 │ 1      │ "chris"         │ 3.88311e-6 │
│ 107610   │ 109553288557762375881 │ 1      │ NA              │ 4.83773e-6 │
│ 107611   │ 102953973878117962383 │ 1      │ "adam"          │ 8.05326e-6 │
│ 107612   │ 116863879615717543881 │ 1      │ "wolfgang"      │ 5.96955e-6 │
│ 107613   │ 102977664233671324384 │ 1      │ NA              │ 3.88311e-6 │
│ 107614   │ 109995768273040439562 │ 1      │ NA              │ 1.5836e-5  │

In [17]:
# Count the mutual friends of the source and target vertices of each edge, and store the count as an edge property
seteprop!(g, :, @query(g |> eachedge(e.mutualcount)), :mutual_friends);
EdgeDescriptor(g)


Out[17]:
│ Index    │ Source                │ Target                │ mutual_friends │
├──────────┼───────────────────────┼───────────────────────┼────────────────┤
│ 1        │ 114985346359714431656 │ 107211094246725549672 │ 38             │
│ 2        │ 114985346359714431656 │ 106467891941793155295 │ 2              │
│ 3        │ 114985346359714431656 │ 100043057758270223301 │ 84             │
│ 4        │ 114985346359714431656 │ 107831413340697273273 │ 4              │
│ 5        │ 114985346359714431656 │ 114839638425953508537 │ 68             │
│ 6        │ 114985346359714431656 │ 113114462378360775452 │ 71             │
│ 7        │ 114985346359714431656 │ 116407635616074189669 │ 30             │
│ 8        │ 114985346359714431656 │ 104428814384443083380 │ 50             │
│ 9        │ 114985346359714431656 │ 111159036121102171686 │ 20             │
│ 10       │ 114985346359714431656 │ 106898588952511738977 │ 71             │
│ 11       │ 114985346359714431656 │ 105228342880444036996 │ 63             │
⋮
│ 13673442 │ 109995768273040439562 │ 111754564023086319115 │ 51             │
│ 13673443 │ 109995768273040439562 │ 112198806072857130680 │ 61             │
│ 13673444 │ 109995768273040439562 │ 110060537346086731221 │ 21             │
│ 13673445 │ 109995768273040439562 │ 107841579136101149046 │ 8              │
│ 13673446 │ 109995768273040439562 │ 114294515409520683622 │ 11             │
│ 13673447 │ 109995768273040439562 │ 117362202493439608589 │ 89             │
│ 13673448 │ 109995768273040439562 │ 118332792877295434028 │ 71             │
│ 13673449 │ 109995768273040439562 │ 101941479331974800200 │ 66             │
│ 13673450 │ 109995768273040439562 │ 106815259454822754885 │ 66             │
│ 13673451 │ 109995768273040439562 │ 105657594395895234072 │ 71             │
│ 13673452 │ 109995768273040439562 │ 102367310168069130255 │ 38             │
│ 13673453 │ 109995768273040439562 │ 102953973878117962383 │ 38             │
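
As a rough illustration of what a mutual-friend count measures, the sketch below computes, for each edge, the number of vertices that both endpoints point to (Julia 1.x assumed). The name `mutual_counts` and the toy data are hypothetical. Whether Graft's `e.mutualcount` uses out-neighbors, in-neighbors, or both is an assumption here.

```julia
# For each directed edge (s, t), count the vertices that both s and t
# point to, using sets of out-neighbors.
function mutual_counts(adj, edges)
    return [length(intersect(adj[s], adj[t])) for (s, t) in edges]
end

adj = Dict(
    1 => Set([2, 3, 4]),
    2 => Set([3, 4]),
    3 => Set([4]),
    4 => Set{Int}(),
)
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]

counts = mutual_counts(adj, edges)
```

For this toy graph `counts` is `[2, 1, 1, 0]`: vertices 1 and 2 share neighbors 3 and 4, while vertex 4 has no out-neighbors to share.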