In [1]:
import networkx as nx
import matplotlib.pyplot as plt # for plotting graphs
%matplotlib inline
In [2]:
GA = nx.read_gexf('../data/ga_graph.gexf')
The idea of link prediction was first proposed by Liben-Nowell and Kleinberg in 2004 as the following question:
"Given a snapshot of a social network, can we infer which new interactions among its members are likely to occur in the near future?"
It's an enticing idea and has led to many interesting developments in the network literature. For our example, the question could be rephrased as:
"Given a snapshot of the Grey's Anatomy relationship network, can we infer which new relationships are likely to occur in the near future?"
Sounds awesome, but how does it work?
The most popular measures for link prediction analyze the "proximity" of nodes in a network. One way to measure proximity is to see what proportion of neighbors a pair of nodes share. This can be captured succinctly with the Jaccard index.
In the context of a network, we're comparing sets of neighbors:
$$ Jaccard = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|} $$ where $\Gamma(u)$ denotes the set of neighbors of $u$.
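Before calling the NetworkX helper, it may help to compute the definition by hand on a tiny toy graph (hypothetical data, not the show's network):

```python
import networkx as nx

# Toy graph (hypothetical, not the Grey's Anatomy data):
# edges a-b, a-c, b-c, b-d
G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")])

def jaccard(G, u, v):
    """Jaccard coefficient computed directly from the definition."""
    nu, nv = set(G[u]), set(G[v])
    return len(nu & nv) / len(nu | nv)

# c's neighbors are {a, b}; d's only neighbor is b,
# so the pair shares 1 of 2 distinct neighbors.
print(jaccard(G, "c", "d"))  # -> 0.5
```

A score of 1 means two nodes share exactly the same neighborhood; 0 means they share no neighbors at all.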
In [3]:
preds_jc = nx.jaccard_coefficient(GA)
In [4]:
pred_jc_dict = {}
for u, v, p in preds_jc:
    pred_jc_dict[(u, v)] = p
In [5]:
sorted(pred_jc_dict.items(), key=lambda x:x[1], reverse=True)[:10]
Out[5]:
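As an aside, `nx.jaccard_coefficient` scores every non-adjacent pair by default; if you only care about candidates involving one node, you can pass an explicit `ebunch` of pairs instead. A minimal sketch on a toy graph (not the show's data):

```python
import networkx as nx

G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")])

# Restrict scoring to candidate pairs involving "d" that are not yet edges
candidates = [("d", n) for n in G.nodes
              if n != "d" and not G.has_edge("d", n)]
scores = {(u, v): p for u, v, p in nx.jaccard_coefficient(G, candidates)}
print(scores)
```

This is handy on larger graphs, where scoring all non-edges can get expensive.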
In [6]:
extra_attrs = {'finn':('Finn Dandridge','M','S'),
'olivia':('Olivia Harper','F','S'),
'steve':('Steve Murphy','M','S'),
'torres':('Callie Torres','F','B'),
'colin':('Colin Marlow','M','S'),
'grey':('Meredith Grey','F','S'),
'mrs. seabury':('Dana Seabury','F','S'),
'altman':('Teddy Altman','F','S'),
'tucker':('Tucker Jones','M','S'),
'ben':('Ben Warren','M','S'),
"o'malley":("George O'Malley",'M','S'),
'thatch grey':('Thatcher Grey','M','S'),
'susan grey':('Susan Grey','F','S'),
'derek':('Derek Shepherd','M','S'),
'chief':('Richard Webber','M','S'),
'addison':('Addison Montgomery','F','S'),
'karev':('Alex Karev','M','S'),
'hank':('Hank','M','S'),
'lexi':('Lexie Grey','F','S'),
'adele':('Adele Webber','F','S'),
'owen':('Owen Hunt','M','S'),
'sloan':('Mark Sloan','M','S'),
'arizona':('Arizona Robbins','F','G'),
'izzie':('Izzie Stevens','F','S'),
'preston':('Preston Burke','M','S'),
'kepner':('April Kepner','F','S'),
'bailey':('Miranda Bailey','F','S'),
'ellis grey':('Ellis Grey','F','S'),
'denny':('Denny Duquette','M','S'),
'yang':('Cristina Yang','F','S'),
'nancy':('Nancy Shepherd','F','S'),
'avery':('Jackson Avery','M','S')}
In [7]:
for i in GA.nodes():
    GA.nodes[i]["full_name"] = extra_attrs[i][0]
    GA.nodes[i]["gender"] = extra_attrs[i][1]
    GA.nodes[i]["orientation"] = extra_attrs[i][2]
In [8]:
GA.nodes['grey']
Out[8]:
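The attribute loop above can also be expressed with `nx.set_node_attributes`, which accepts a dict-of-dicts keyed by node. A sketch using a one-node stand-in rather than the full graph:

```python
import networkx as nx

G = nx.Graph()
G.add_node("grey")

# Same tuple layout as extra_attrs: (full name, gender, orientation)
extra = {"grey": ("Meredith Grey", "F", "S")}
attrs = {n: {"full_name": f, "gender": g, "orientation": o}
         for n, (f, g, o) in extra.items()}
nx.set_node_attributes(G, attrs)
print(G.nodes["grey"])  # all three attributes attached in one call
```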
The preferential attachment method mirrors the "rich get richer" phenomenon -- nodes that already have many connections are the ones most likely to gain future connections.
Essentially, the measure is the product of a node pair's degrees:
$$ PA = |\Gamma(u)| \cdot |\Gamma(v)| $$ where $\Gamma(u)$ denotes the set of neighbors of $u$, so $|\Gamma(u)|$ is the degree of $u$.
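On the same kind of toy graph as before (hypothetical data again), the measure is just a degree product:

```python
import networkx as nx

# Toy graph: b is the hub with degree 3
G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")])

def preferential_attachment(G, u, v):
    """Degree product |Gamma(u)| * |Gamma(v)|."""
    return G.degree(u) * G.degree(v)

# deg(a) = 2, deg(b) = 3, so the pair scores 6
print(preferential_attachment(G, "a", "b"))  # -> 6
```

Note how pairs involving the hub `b` dominate: preferential attachment always favors high-degree nodes, whereas Jaccard rewards overlapping neighborhoods.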
In [9]:
preds_pa = nx.preferential_attachment(GA)
In [10]:
pred_pa_dict = {}
for u, v, p in preds_pa:
    pred_pa_dict[(u, v)] = p
In [11]:
sorted(pred_pa_dict.items(), key=lambda x:x[1], reverse=True)[:10]
Out[11]: