Link Prediction

In this notebook, we demonstrate how to use GraphLab Create to construct a simple link prediction classifier. A link prediction classifier predicts the probability of the existence (or non-existence) of a link in a social network (SN). Namely, given a social network, such as Facebook, a link prediction classifier can help determine whether a link between two users exists. Our construction is general and not limited to SNs. For example, in a phone-call network our classifier can predict whether two subscribers will call each other.

There are many diverse methods for constructing a link prediction classifier, such as using the SN topology, using the SN users' personal details, or using posts and other published online content.

The main goal of this notebook is to demonstrate how to use GraphLab's SFrame and SGraph to extract various SN topological features. These topological features can later be utilized to perform various prediction tasks, such as link prediction or predicting students' exam scores.

This notebook is organized as follows: First, we download a link prediction training dataset and construct a graph from it. Next, we illustrate how GraphLab's SFrame and SGraph objects can be utilized to extract various SN topological features for each link. Finally, we demonstrate how to use those features inside a link prediction classifier.

Downloading the Dataset

In this notebook, we will use a dataset published by the BGU Social Network Research Group. We will download the Google+ dataset, which consists of around 3 million links, and generate an SGraph object containing the social links.


In [1]:
import graphlab as gl

# Loading the links dataset into a SFrame object
sf_links = gl.SFrame.read_csv("https://static.turi.com/datasets/bgu_directed_network_googleplus/g_plus_pos_and_neg_links.csv.gz")

# Let's view the data
print sf_links.head(3)

# Creating SGraph object from the SFrame object
g = gl.SGraph().add_edges(sf_links, src_field='src', dst_field='dst')


[INFO] This commercial license of GraphLab Create is assigned to engr@turi.com.

[INFO] Start server at: ipc:///tmp/graphlab_server-3600 - Server binary: /Library/Python/2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1441218013.log
[INFO] GraphLab Server Version: 1.5.2
PROGRESS: Downloading https://static.turi.com/datasets/bgu_directed_network_googleplus/g_plus_pos_and_neg_links.csv.gz to /var/tmp/graphlab-bickson/3600/000000.gz
PROGRESS: Finished parsing file https://static.turi.com/datasets/bgu_directed_network_googleplus/g_plus_pos_and_neg_links.csv.gz
PROGRESS: Parsing completed. Parsed 100 lines in 1.35685 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file https://static.turi.com/datasets/bgu_directed_network_googleplus/g_plus_pos_and_neg_links.csv.gz
PROGRESS: Parsing completed. Parsed 3013792 lines in 1.60591 secs.
+-------+-------+-------+
|  src  |  dst  | class |
+-------+-------+-------+
| 16400 | 60081 |   1   |
| 55441 | 34187 |   1   |
| 59472 | 67007 |   1   |
+-------+-------+-------+
[3 rows x 3 columns]

Let's use the SGraph's summary function to get details on the loaded SGraph object:


In [2]:
g.summary()


Out[2]:
{'num_edges': 3013792, 'num_vertices': 211187}

Our directed graph has around 3 million edges (referred to as links) and over 200,000 vertices (referred to as users). The links are roughly evenly divided into two classes: positive links (labeled 1) and negative links (labeled 0). The positive links are observed links in the social network, while the negative links were added at random so that our binary classifier can learn from negative examples as well. Note that if you would like to use your own dataset, you will need to add negative (unobserved) edges at random for our construction to work.
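If you need to generate such negative examples for your own dataset, one simple approach is rejection sampling: draw random user pairs and keep only those that are not observed edges. Below is a minimal sketch in plain Python; the function and variable names are ours for illustration and are not part of GraphLab Create, and it assumes node ids are 0..num_nodes-1.

```python
import random

def sample_negative_links(edges, num_nodes, n_samples, seed=42):
    """Sample random (src, dst) pairs that are not observed edges.

    `edges` is a set of (src, dst) tuples; node ids are assumed
    to be integers in the range 0..num_nodes-1.
    """
    rng = random.Random(seed)
    existing = set(edges)
    negatives = set()
    while len(negatives) < n_samples:
        u = rng.randrange(num_nodes)
        v = rng.randrange(num_nodes)
        # reject self-loops, observed edges, and duplicates
        if u != v and (u, v) not in existing and (u, v) not in negatives:
            negatives.add((u, v))
    return [(u, v, 0) for (u, v) in negatives]  # label 0 = negative example

positive = {(0, 1), (1, 2), (2, 0)}
neg = sample_negative_links(positive, num_nodes=4, n_samples=3)
```

For sparse real-world graphs, almost all random pairs are non-edges, so the rejection loop terminates quickly.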

Topological Feature Extraction

Let's create an SFrame object with all user ids.


In [3]:
users_sf = g.get_vertices()
users_sf.rename({"__id": "id"}) 
users_sf.head(5)


Out[3]:
id
5
7
8
10
27
[5 rows x 1 columns]

The users SFrame can be used to calculate several topological features for each link in the dataset. We will start by calculating various simple topological features, such as each user's in-degree (i.e., the user's number of followers) and out-degree (i.e., the number of users the user is following).

To calculate each vertex's topological features, we will first create an SFrame object that contains each user's in-friends (i.e., the users in the network that follow the user) and out-friends (i.e., the users in the network that the user follows).


In [4]:
# Calculating each vertices in and out degree
out_friends_sf = sf_links.groupby("src", {"out_friends": gl.aggregate.CONCAT("dst")})
out_friends_sf.rename({"src": "id"})
in_friends_sf = sf_links.groupby("dst", {"in_friends": gl.aggregate.CONCAT("src")})
in_friends_sf.rename({"dst": "id"})


Out[4]:
+--------+----------------------------------------------------+
|   id   |                     in_friends                     |
+--------+----------------------------------------------------+
| 211023 | [175821, 191454, 42927, 19425, 20037, 84186, ...]  |
| 21855  | [117830, 202635, 7922, 148051, 98174, 10749, ...]  |
| 88004  | [184267, 165121, 156061, 25801, 82540, 200953, ...]|
| 79732  | [166236, 156056, 42127, 173758, 44003, 113404, ...]|
| 63664  | [133355, 135427, 189724, 52481, 74786, 148446, ...]|
| 127950 | [108832, 124060, 191707, 64471, 18708, 31194, ...] |
| 7899   | [19009, 14262, 166769, 92858, 45224, 202978, ...]  |
| 25263  | [47241, 145136, 57020, 2642, 47271, 60974, ...]    |
| 130872 | [132401, 79291, 158521, 197742, 6452, 48452, ...]  |
| 208837 | [122872, 14865, 23427, 86661, 108812, 42584, ...]  |
+--------+----------------------------------------------------+
[211187 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
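For readers without GraphLab Create, the groupby/CONCAT aggregation above can be sketched in plain Python with dictionaries; the toy edge list below stands in for sf_links and the variable names are ours:

```python
from collections import defaultdict

# Toy edge list of (src, dst) pairs, analogous to the sf_links rows above.
links = [(1, 2), (1, 3), (2, 3), (3, 1)]

out_friends = defaultdict(list)  # src -> list of dst (users this user follows)
in_friends = defaultdict(list)   # dst -> list of src (this user's followers)
for src, dst in links:
    out_friends[src].append(dst)
    in_friends[dst].append(src)

# User 1 follows 2 and 3, and is followed by 3.
```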

Using the SFrame join operation, we create a single SFrame that contains each user's in-friends and out-friends.


In [5]:
users_sf = users_sf.join(in_friends_sf, on="id", how="outer")
users_sf = users_sf.join(out_friends_sf, on="id", how="outer")

# we replace missing values with empty lists
users_sf = users_sf.fillna('in_friends',[])
users_sf = users_sf.fillna('out_friends',[])
users_sf.head(10)


Out[5]:
+--------+------------------------------------------+------------------------------------------+
|   id   |               in_friends                 |               out_friends                |
+--------+------------------------------------------+------------------------------------------+
| 211023 | [175821, 191454, 42927, 19425, ...]      | [178359, 191454, 114855, 117085, ...]    |
| 21855  | [117830, 202635, 7922, 148051, ...]      | [41086, 49199, 75658, 55823, ...]        |
| 88004  | [184267, 165121, 156061, 25801, ...]     | [194693, 53575, 36045, 162977, ...]      |
| 79732  | [166236, 156056, 42127, 173758, ...]     | [42127, 56032, 190023, 137866, ...]      |
| 63664  | [133355, 135427, 189724, 52481, ...]     | [189724, 148446, 74702, 42791, ...]      |
| 127950 | [108832, 124060, 191707, 64471, ...]     | [179818, 185573, 85706, 207053, ...]     |
| 7899   | [19009, 14262, 166769, 92858, ...]       | [45224, 82615, 202978, 68792, ...]       |
| 25263  | [47241, 145136, 57020, 2642, ...]        | [91704, 23796, 71943, 32645, ...]        |
| 130872 | [132401, 79291, 158521, 197742, ...]     | [108790, 43254, 114610, 79291, ...]      |
| 208837 | [122872, 14865, 23427, 86661, ...]       | [193657, 203867, 177456, 107268, ...]    |
+--------+------------------------------------------+------------------------------------------+
[10 rows x 3 columns]

Using the created SFrame with each user's in-friends and out-friends, we can calculate several simple topological features for each user, such as the in-degree (i.e., number of followers) and the out-degree (i.e., number of users the user is following).


In [6]:
#out_degree - number of users each vertex is following
users_sf['out_degree'] = users_sf["out_friends"].apply(lambda l: len(l))

#in_degree - number of users following each vertex
users_sf['in_degree'] = users_sf["in_friends"].apply(lambda l: len(l))

#all_degree - number of unique users that follow or are followed by each user
users_sf['all_friends'] = users_sf[['in_friends', 'out_friends']].apply(lambda r: list(set(r['in_friends']) | set(r['out_friends'])))
users_sf['all_degree'] = users_sf["all_friends"].apply(lambda l: len(l))

#bi_degree - number of unique users that both follow and are followed by each user
users_sf['bi_friends'] = users_sf[['in_friends', 'out_friends']].apply(lambda r: list(set(r['in_friends']) & set(r['out_friends'])))
users_sf['bi_degree'] = users_sf["bi_friends"].apply(lambda l: len(l))

users_sf.head(10)


Out[6]:
+--------+------------+-----------+------------+-----------+
|   id   | out_degree | in_degree | all_degree | bi_degree |
+--------+------------+-----------+------------+-----------+
| 211023 |     8      |     7     |     14     |     1     |
| 21855  |     8      |     7     |     15     |     0     |
| 88004  |     10     |     11    |     18     |     3     |
| 79732  |     9      |     9     |     16     |     2     |
| 63664  |     11     |     13    |     20     |     4     |
| 127950 |     19     |     29    |     44     |     4     |
| 7899   |     11     |     10    |     18     |     3     |
| 25263  |     7      |     9     |     16     |     0     |
| 130872 |     8      |     9     |     16     |     1     |
| 208837 |     7      |     17    |     24     |     0     |
+--------+------------+-----------+------------+-----------+
[10 rows x 9 columns]
(The in_friends, out_friends, all_friends, and bi_friends list columns are omitted here for readability.)
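Note that the four degree features are tied together by inclusion-exclusion: all_degree = in_degree + out_degree - bi_degree, assuming each (src, dst) pair appears at most once (for instance, the first row of the output above satisfies 8 + 7 - 1 = 14). A tiny sanity-check sketch with made-up friend sets:

```python
# |A ∪ B| = |A| + |B| - |A ∩ B|, applied to a user's friend sets.
in_friends = {2, 3, 5}   # followers
out_friends = {3, 4}     # followed users

in_degree = len(in_friends)                   # 3
out_degree = len(out_friends)                 # 2
all_degree = len(in_friends | out_friends)    # |{2, 3, 4, 5}| = 4
bi_degree = len(in_friends & out_friends)     # |{3}| = 1

assert all_degree == in_degree + out_degree - bi_degree  # 4 == 3 + 2 - 1
```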

Now we have several degree features for each user. Let's utilize these user features to create features for each link in our positive and negative links dataset. Namely, for each link in the dataset, which consists of two users u and v, we will create an SFrame with both users' degree features.

Note: running the following code block can take several minutes.


In [7]:
sf_links = sf_links.join(users_sf, on={"src": "id"}, how="right")
sf_links.rename({"in_friends": "src_in_friends", "out_friends": "src_out_friends",
           "all_friends": "src_all_friends", "all_degree": "src_all_degree",
           "bi_friends": "src_bi_friends", "bi_degree": "src_bi_degree",
           "in_degree": "src_in_degree", "out_degree": "src_out_degree"
           })

sf_links = sf_links.join(users_sf, on={"dst": "id"}, how="right")
sf_links.rename({"in_friends": "dst_in_friends", "out_friends": "dst_out_friends",
           "all_friends": "dst_all_friends", "all_degree": "dst_all_degree",
           "bi_friends": "dst_bi_friends", "bi_degree": "dst_bi_degree",
           "in_degree": "dst_in_degree", "out_degree": "dst_out_degree"})

sf_links.head(10)


Out[7]:
+--------+--------+-------+----------------+---------------+----------------+---------------+
|  src   |  dst   | class | src_out_degree | src_in_degree | src_all_degree | src_bi_degree |
+--------+--------+-------+----------------+---------------+----------------+---------------+
| 16400  | 60081  |   1   |      187       |       60      |      210       |       37      |
| 55441  | 34187  |   1   |       12       |       9       |       20       |       1       |
| 59472  | 67007  |   1   |      101       |      102      |      131       |       72      |
| 79102  | 174248 |   1   |      378       |      125      |      410       |       93      |
| 135415 | 63941  |   1   |       86       |       54      |      101       |       39      |
| 21650  | 154098 |   1   |      571       |      221      |      608       |      184      |
| 140358 | 152017 |   1   |      142       |      169      |      233       |       78      |
| 131272 | 19556  |   1   |      101       |       72      |      132       |       41      |
| 49627  | 29754  |   1   |       9        |       11      |       18       |       2       |
| 44402  | 82787  |   1   |      307       |      533      |      585       |      255      |
+--------+--------+-------+----------------+---------------+----------------+---------------+
+--------+--------+----------------+---------------+----------------+---------------+
|  src   |  dst   | dst_out_degree | dst_in_degree | dst_all_degree | dst_bi_degree |
+--------+--------+----------------+---------------+----------------+---------------+
| 16400  | 60081  |       13       |       26      |       35       |       4       |
| 55441  | 34187  |       54       |       51      |       68       |       37      |
| 59472  | 67007  |      475       |       81      |      500       |       56      |
| 79102  | 174248 |       57       |      148      |      161       |       44      |
| 135415 | 63941  |      100       |       32      |      110       |       22      |
| 21650  | 154098 |       33       |       20      |       44       |       9       |
| 140358 | 152017 |      249       |      192      |      295       |      146      |
| 131272 | 19556  |       38       |       38      |       55       |       21      |
| 49627  | 29754  |       9        |       10      |       17       |       2       |
| 44402  | 82787  |       12       |       99      |      106       |       5       |
+--------+--------+----------------+---------------+----------------+---------------+
[10 rows x 19 columns]
(The friend-list columns are omitted here for readability.)

Besides each link's user degree features, a decent link prediction classifier also needs features based on the strength of the connection between the two users. In this notebook, we will add three simple types of features for each link:

  • Common-Friends features - the number of friends u and v have in common (the size of the intersection of their friend sets).
  • Total-Friends features - the number of friends u and v have together (the size of the union of their friend sets).
  • Jaccard coefficient - the number of common friends divided by the number of total friends.

Let's define a function for each feature type:


In [8]:
def common_friends(u, v, u_friends, v_friends):
    u_friends = set(u_friends)
    if v in u_friends:
        u_friends.remove(v)

    v_friends = set(v_friends)
    if u in v_friends:
        v_friends.remove(u)
    return len(u_friends & v_friends)

def total_friends(u, v, u_friends, v_friends):
    u_friends = set(u_friends)
    if v in u_friends:
        u_friends.remove(v)

    v_friends = set(v_friends)
    if u in v_friends:
        v_friends.remove(u)

    return len(u_friends | v_friends)

def jacc_coef(u, v, u_friends, v_friends):
    t = total_friends(u, v, u_friends, v_friends)
    if t == 0:
        return 0
    return common_friends(u, v, u_friends, v_friends) / float(t)
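As a quick sanity check, here is a small worked example (with made-up friend lists) of what these functions compute. It mirrors their logic directly: u and v are removed from each other's friend sets before comparing.

```python
# Worked example for a hypothetical link u = 1, v = 2.
u, v = 1, 2
u_friends = set([2, 3, 4, 5]) - {v}   # -> {3, 4, 5}
v_friends = set([1, 3, 4, 6]) - {u}   # -> {3, 4, 6}

common = len(u_friends & v_friends)   # |{3, 4}| = 2
total = len(u_friends | v_friends)    # |{3, 4, 5, 6}| = 4
jacc = common / float(total)          # 2 / 4 = 0.5
```

The same arithmetic applies to any of the four friend-set variants (in, out, all, bi) used below.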

Using these three feature types, we create 12 new features (4 features per feature type) that are based on the direction of the friendship between u and v and their friends.

Please note that the formal mathematical definitions of the features presented throughout this section can be found in Fire et al. 2014.

The following code block may take a few minutes.


In [9]:
sf_links['common_friends'] = sf_links[['src','dst', 'src_all_friends', 'dst_all_friends']].apply(lambda r: common_friends(r['src'], r['dst'],r['src_all_friends'], r['dst_all_friends']))
sf_links['common_bi_friends'] = sf_links[['src','dst', 'src_bi_friends', 'dst_bi_friends']].apply(lambda r: common_friends(r['src'], r['dst'],r['src_bi_friends'], r['dst_bi_friends']))
sf_links['common_in_friends'] = sf_links[['src','dst', 'src_in_friends', 'dst_in_friends']].apply(lambda r: common_friends(r['src'], r['dst'],r['src_in_friends'], r['dst_in_friends']))
sf_links['common_out_friends'] = sf_links[['src','dst', 'src_out_friends', 'dst_out_friends']].apply(lambda r: common_friends(r['src'], r['dst'],r['src_out_friends'], r['dst_out_friends']))

sf_links['total_friends'] = sf_links[['src','dst', 'src_all_friends', 'dst_all_friends']].apply(lambda r: total_friends(r['src'], r['dst'],r['src_all_friends'], r['dst_all_friends']))
sf_links['total_bi_friends'] = sf_links[['src','dst', 'src_bi_friends', 'dst_bi_friends']].apply(lambda r: total_friends(r['src'], r['dst'],r['src_bi_friends'], r['dst_bi_friends']))
sf_links['total_in_friends'] = sf_links[['src','dst', 'src_in_friends', 'dst_in_friends']].apply(lambda r: total_friends(r['src'], r['dst'],r['src_in_friends'], r['dst_in_friends']))
sf_links['total_out_friends'] = sf_links[['src','dst', 'src_out_friends', 'dst_out_friends']].apply(lambda r: total_friends(r['src'], r['dst'],r['src_out_friends'], r['dst_out_friends']))


sf_links['jacc_coef'] = sf_links[['src','dst', 'src_all_friends', 'dst_all_friends']].apply(lambda r: jacc_coef(r['src'], r['dst'],r['src_all_friends'], r['dst_all_friends']))
sf_links['bi_jacc_coef'] = sf_links[['src','dst', 'src_bi_friends', 'dst_bi_friends']].apply(lambda r: jacc_coef(r['src'], r['dst'],r['src_bi_friends'], r['dst_bi_friends']))
sf_links['in_jacc_coef'] = sf_links[['src','dst', 'src_in_friends', 'dst_in_friends']].apply(lambda r: jacc_coef(r['src'], r['dst'],r['src_in_friends'], r['dst_in_friends']))
sf_links['out_jacc_coef'] = sf_links[['src','dst', 'src_out_friends', 'dst_out_friends']].apply(lambda r: jacc_coef(r['src'], r['dst'],r['src_out_friends'], r['dst_out_friends']))
sf_links.head(10)


Out[9]:
+--------+--------+----------------+-------------------+-------------------+--------------------+
|  src   |  dst   | common_friends | common_bi_friends | common_in_friends | common_out_friends |
+--------+--------+----------------+-------------------+-------------------+--------------------+
| 16400  | 60081  |       6        |         0         |         3         |         1          |
| 55441  | 34187  |       3        |         0         |         1         |         1          |
| 59472  | 67007  |       17       |         3         |         3         |         17         |
| 79102  | 174248 |       67       |         8         |         36        |         27         |
| 135415 | 63941  |       19       |         3         |         3         |         19         |
| 21650  | 154098 |       25       |         3         |         5         |         22         |
| 140358 | 152017 |      134       |         55        |         88        |        100         |
| 131272 | 19556  |       34       |         18        |         23        |         26         |
| 49627  | 29754  |       0        |         0         |         0         |         0          |
| 44402  | 82787  |       75       |         4         |         74        |         4          |
+--------+--------+----------------+-------------------+-------------------+--------------------+
+---------------+------------------+------------------+-------------------+
| total_friends | total_bi_friends | total_in_friends | total_out_friends |
+---------------+------------------+------------------+-------------------+
|      237      |        41        |        82        |        198        |
|       83      |        38        |        58        |         64        |
|      612      |       123        |       178        |        557        |
|      502      |       129        |       236        |        407        |
|      190      |        56        |        81        |        165        |
|      625      |       190        |       235        |        581        |
|      392      |       167        |       271        |        289        |
|      151      |        42        |        85        |        111        |
|       33      |        2         |        19        |         16        |
|      614      |       256        |       557        |        314        |
+---------------+------------------+------------------+-------------------+
+-----------------+-----------------+-----------------+------------------+
|    jacc_coef    |   bi_jacc_coef  |   in_jacc_coef  |  out_jacc_coef   |
+-----------------+-----------------+-----------------+------------------+
| 0.0253164556962 | 0.0             | 0.0365853658537 | 0.00505050505051 |
| 0.0361445783133 | 0.0             | 0.0172413793103 | 0.015625         |
| 0.0277777777778 | 0.0243902439024 | 0.0168539325843 | 0.0305206463196  |
| 0.133466135458  | 0.062015503876  | 0.152542372881  | 0.0663390663391  |
| 0.1             | 0.0535714285714 | 0.037037037037  | 0.115151515152   |
| 0.04            | 0.0157894736842 | 0.0212765957447 | 0.0378657487091  |
| 0.341836734694  | 0.329341317365  | 0.324723247232  | 0.346020761246   |
| 0.225165562914  | 0.428571428571  | 0.270588235294  | 0.234234234234   |
| 0.0             | 0.0             | 0.0             | 0.0              |
| 0.122149837134  | 0.015625        | 0.132854578097  | 0.0127388535032  |
+-----------------+-----------------+-----------------+------------------+
[10 rows x 31 columns]
(The user degree and friend-list columns shown in the previous output are omitted here for readability; rows appear in the same order as above, starting with src 16400, dst 60081.)

Similar to the above method, we can also extract additional features for each link, such as each user's PageRank and neighborhood subgraph size. We leave it to the reader to try adding these features (and maybe some additional features) by themselves.

Let us move to the next section, in which we explain how the extracted link features can be utilized to create a link prediction classifier.

In order to create a link prediction classifier using our constructed links dataset (sf_links), let's first randomly split it into a training set that contains 20% of the dataset and a test set that contains the remaining 80%.


In [10]:
train, test = sf_links.random_split(0.2)

We can now use GraphLab Create's classification toolkit to create and evaluate a link prediction classifier based only on the u and v degree features:


In [11]:
degree_features_list = [c for c in train.column_names() if "degree" in c]
print "Degree Features %s" % degree_features_list
cls = gl.classifier.create(train,features=degree_features_list, target="class")
results = cls.evaluate(test)
print results


Degree Features ['src_out_degree', 'src_in_degree', 'src_all_degree', 'src_bi_degree', 'dst_out_degree', 'dst_in_degree', 'dst_all_degree', 'dst_bi_degree']
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: BoostedTreesClassifier, RandomForestClassifier, SVMClassifier, LogisticClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 572642
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 8
PROGRESS: Number of unpacked features : 8
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0   8.777e-01   8.784e-01        2.43s
PROGRESS:      1   8.788e-01   8.792e-01        4.73s
PROGRESS:      2   8.790e-01   8.795e-01        7.43s
PROGRESS:      3   8.795e-01   8.799e-01       10.22s
PROGRESS:      4   8.800e-01   8.804e-01       15.51s
PROGRESS:      5   8.800e-01   8.808e-01       18.61s
PROGRESS:      6   8.801e-01   8.804e-01       20.28s
PROGRESS:      7   8.804e-01   8.806e-01       22.26s
PROGRESS:      8   8.803e-01   8.807e-01       24.04s
PROGRESS:      9   8.802e-01   8.809e-01       25.56s
PROGRESS: Random forest classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 572642
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 8
PROGRESS: Number of unpacked features : 8
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0   8.778e-01   8.785e-01        1.17s
PROGRESS:      1   8.758e-01   8.761e-01        2.11s
PROGRESS:      2   8.789e-01   8.794e-01        3.06s
PROGRESS:      3   8.786e-01   8.792e-01        4.03s
PROGRESS:      4   8.784e-01   8.789e-01        5.10s
PROGRESS:      5   8.782e-01   8.790e-01        5.96s
PROGRESS:      6   8.782e-01   8.789e-01        6.85s
PROGRESS:      7   8.782e-01   8.790e-01        7.77s
PROGRESS:      8   8.782e-01   8.790e-01        9.28s
PROGRESS:      9   8.783e-01   8.794e-01       10.30s
PROGRESS: SVM:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 572642
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 8
PROGRESS: Number of unpacked features : 8
PROGRESS: Number of coefficients    : 9
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 3        | 0.000002  | 1.483738     | 0.500084          | 0.498494            |
PROGRESS: | 2         | 6        | 5.000000  | 2.059202     | 0.759162          | 0.760997            |
PROGRESS: | 3         | 7        | 5.000000  | 2.363243     | 0.788224          | 0.788502            |
PROGRESS: | 4         | 10       | 0.500000  | 2.767786     | 0.818641          | 0.819813            |
PROGRESS: | 5         | 11       | 0.500000  | 2.999682     | 0.820097          | 0.821170            |
PROGRESS: | 6         | 12       | 0.500000  | 3.238497     | 0.820701          | 0.821931            |
PROGRESS: | 10        | 17       | 1.000000  | 5.005246     | 0.815630          | 0.816536            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 572642
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 8
PROGRESS: Number of unpacked features : 8
PROGRESS: Number of coefficients    : 9
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 2        | 2.047448     | 0.758268          | 0.758779            |
PROGRESS: | 2         | 3        | 3.694924     | 0.788889          | 0.790752            |
PROGRESS: | 3         | 4        | 5.178241     | 0.806286          | 0.807202            |
PROGRESS: | 4         | 5        | 6.195404     | 0.814799          | 0.816106            |
PROGRESS: | 5         | 6        | 6.890665     | 0.817518          | 0.819085            |
PROGRESS: | 6         | 7        | 7.453886     | 0.817960          | 0.819713            |
PROGRESS: | 7         | 8        | 8.026372     | 0.817986          | 0.819746            |
PROGRESS: | 8         | 9        | 8.587899     | 0.817986          | 0.819746            |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: BoostedTreesClassifier          : 0.880912190117
PROGRESS: RandomForestClassifier          : 0.879422765035
PROGRESS: SVMClassifier                   : 0.816536
PROGRESS: LogisticClassifier              : 0.819746
PROGRESS: ---------------------------------------------
PROGRESS: Selecting BoostedTreesClassifier based on validation set performance.
{'confusion_matrix': Columns:
	target_label	int
	predicted_label	int
	count	int

Rows: 4

Data:
+--------------+-----------------+---------+
| target_label | predicted_label |  count  |
+--------------+-----------------+---------+
|      0       |        1        |  62376  |
|      0       |        0        | 1143095 |
|      1       |        0        |  227131 |
|      1       |        1        |  978335 |
+--------------+-----------------+---------+
[4 rows x 3 columns]
, 'accuracy': 0.8799193010850138}
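As a quick sanity check, the reported accuracy can be recovered directly from the printed confusion matrix: it is the number of correctly classified rows (true negatives plus true positives) divided by the total row count. The counts below are copied from the output above.

```python
# Counts taken from the confusion matrix printed above.
tn = 1143095  # target 0, predicted 0
fp = 62376    # target 0, predicted 1
fn = 227131   # target 1, predicted 0
tp = 978335   # target 1, predicted 1

accuracy = float(tp + tn) / (tn + fp + fn + tp)
print(round(accuracy, 4))  # -> 0.8799
```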

We get a pretty good accuracy of about 0.88. Let us also add the link-based features and use the Boosted Trees classifier to create and evaluate a link prediction classifier.


In [12]:
link_features_list = ['common_friends', 'common_in_friends', 'common_out_friends',
                      'common_bi_friends', 'total_friends', 'total_in_friends',
                      'total_out_friends', 'total_bi_friends', 'jacc_coef',
                      'bi_jacc_coef', 'in_jacc_coef', 'out_jacc_coef']
cls = gl.classifier.boosted_trees_classifier.create(train, target="class",
                                                    features=degree_features_list + link_features_list)
results = cls.evaluate(test)
print results


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 572759
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 20
PROGRESS: Number of unpacked features : 20
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0   9.208e-01   9.185e-01        2.91s
PROGRESS:      1   9.238e-01   9.219e-01        5.34s
PROGRESS:      2   9.241e-01   9.221e-01        7.61s
PROGRESS:      3   9.334e-01   9.323e-01        9.99s
PROGRESS:      4   9.370e-01   9.349e-01       12.45s
PROGRESS:      5   9.386e-01   9.369e-01       14.79s
PROGRESS:      6   9.384e-01   9.370e-01       17.03s
PROGRESS:      7   9.393e-01   9.379e-01       19.54s
PROGRESS:      8   9.414e-01   9.405e-01       21.95s
PROGRESS:      9   9.424e-01   9.414e-01       24.39s
{'confusion_matrix': Columns:
	target_label	int
	predicted_label	int
	count	int

Rows: 4

Data:
+--------------+-----------------+---------+
| target_label | predicted_label |  count  |
+--------------+-----------------+---------+
|      0       |        1        |   9645  |
|      0       |        0        | 1195826 |
|      1       |        0        |  128186 |
|      1       |        1        | 1077280 |
+--------------+-----------------+---------+
[4 rows x 3 columns]
, 'accuracy': 0.9428309408333773}

Using the additional link features, we got a considerably better accuracy of around 0.94. We can try to further improve the accuracy by adding additional features or by increasing the size of the training dataset.
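One candidate for such an additional feature is the Adamic-Adar index, which, like the common-friends count used above, is computed from the two endpoints' neighborhoods, but weights each common friend by the inverse log of its degree so that rarely-connected common friends count more. The sketch below uses plain Python on a small hypothetical friendship dictionary rather than the SGraph API, just to illustrate the computation.

```python
import math

def adamic_adar(friends, u, v):
    """Adamic-Adar index of the candidate link (u, v).

    `friends` maps each node to the set of its neighbors.
    Each common neighbor w contributes 1 / log(degree(w));
    degree-1 neighbors are skipped to avoid division by log(1) = 0.
    """
    common = friends[u] & friends[v]
    return sum(1.0 / math.log(len(friends[w]))
               for w in common if len(friends[w]) > 1)

# Toy undirected friendship lists (hypothetical data).
friends = {
    'a': {'b', 'c', 'd'},
    'b': {'a', 'c'},
    'c': {'a', 'b', 'd'},
    'd': {'a', 'c'},
}
# b and d share the common friends a and c, each of degree 3,
# so the score is 2 / log(3).
print(adamic_adar(friends, 'b', 'd'))
```

To use this as a model feature at scale, the same per-edge computation would be expressed over the SGraph's edge triples, analogously to how `jacc_coef` was derived earlier.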

Further Readings

Further reading materials can be found in the following links:

  • Stanford Large Network Dataset Collection (SNAP)
  • BGU Social Networks Security Research Group
  • Liben-Nowell, David, and Jon Kleinberg. "The link-prediction problem for social networks." Journal of the American Society for Information Science and Technology 58.7 (2007): 1019-1031.
  • Al Hasan, Mohammad, and Mohammed J. Zaki. "A survey of link prediction in social networks." Social Network Data Analytics. Springer US, 2011. 243-275.
  • Fire, Michael, et al. "Link prediction in social networks using computationally efficient topological features." 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom). IEEE, 2011.