Team members:
In order to read this properly (especially the graph visualization at the end), you might want to use this link to NBViewer.
Please go through the README.md file beforehand.
NB: the file shortest_paths.pkl should be downloaded from this link and placed in the data/ directory, which should then contain:
data/
    network.pkl
    shortest_paths.pkl
    cross_val.pkl
This is a reminder of the project proposal, adapted to how the project actually turned out:
Graph: Wikipedia hyperlink network
Problem: Does the structure of the graph bear information on the content of the nodes? We would like to find out whether it is possible to detect communities of pages just by looking at the hyperlink connections, and to match these communities with real-world data such as the categories of the pages. Is spectral clustering a viable alternative to proven community detection methods?
Steps of the project:
In [1]:
import numpy as np
import seaborn as sns
import networkx as nx
import matplotlib.pyplot as plt
import operator
import community
import plotly
import plotly.graph_objs as go
import plotly.plotly as py
from networkx.drawing.nx_agraph import graphviz_layout
from scipy import linalg, cluster, sparse
from tqdm import tqdm_notebook
from utils import load_obj, save_obj
In [2]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
We want to acquire a sub-network of the Wikipedia hyperlink network. In such a graph, each node is a Wikipedia page and there is a link between node $a$ and node $b$ if there is a link to page $b$ on page $a$. This is a directed network but we will make it undirected later on.
The acquisition process is the following:
We use the Wikipedia API, which allows us to scrape pages and get the links and categories of each one. We chose to include only real pages (not disambiguation ones) in our network. Disambiguation pages are nonetheless useful during the scraping because they allow us to reach a larger sample of the real graph: they act like bridges between pages that have nothing to do with each other.
For each node we need to get URL, title, categories and links to other pages.
We use as root_node the disambiguation page Jaguar (disambiguation), as it lists a very wide variety of themes (animals, cars, music, films, weapons...) and can therefore lead us to a large selection of categories.
The function explore_page is implemented in the file utils.py. All implementation details are provided there.
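For reference, here is a minimal sketch of what one exploration step could look like, querying the MediaWiki API directly with requests. The actual explore_page in utils.py handles more details (pagination, disambiguation handling, inner-link filtering), and the helper name fetch_links_and_categories below is purely illustrative.
In [ ]:
import requests

API_URL = 'https://en.wikipedia.org/w/api.php'

def fetch_links_and_categories(title):
    # Ask the MediaWiki API for the links and categories of a single page
    # (first batch of results only; a real scraper must follow the 'continue' field).
    params = {'action': 'query', 'titles': title, 'prop': 'links|categories',
              'pllimit': 'max', 'cllimit': 'max', 'format': 'json'}
    pages = requests.get(API_URL, params=params).json()['query']['pages']
    page = next(iter(pages.values()))
    links = [l['title'] for l in page.get('links', [])]
    categories = [c['title'].replace('Category:', '') for c in page.get('categories', [])]
    return links, categories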
In [ ]:
from utils import explore_page
In [ ]:
root_node = 'Jaguar (disambiguation)'
network = {} # This dict stores, for each page, a dictionary containing the keys [url, links, categories]
first_nodes = []
explore_page(root_node, network, first_nodes)
second_nodes = []
for node in first_nodes:
    explore_page(node, network, second_nodes)
Look for connections between second nodes and the rest of the nodes.
In [ ]:
all_nodes = list(network.keys()) + second_nodes
for link in tqdm_notebook(second_nodes):
    explore_page(link, network, [], inner=True, all_nodes=all_nodes)
The above cell took 2 hours and 47 minutes to run (duration of scraping).
Now we need to go through all the nodes in order to remove from their neighbors any page that has not been scraped.
In [ ]:
all_nodes = list(network.keys())
for title in tqdm_notebook(network.keys()):
    network[title]['links'] = list(set(network[title]['links']).intersection(set(all_nodes)))
The previous step can lead to pages with no neighbors (if a second node comes from a disambiguation page and none of its neighbors are first or second nodes).
In [ ]:
l = list(network.keys())
for i in l:
    if len(network[i]['links']) == 0:
        del network[i]
In [ ]:
# Filter out Wikipedia maintenance / housekeeping categories
maintenance_prefixes = ('Redundant', 'Pages', 'Webarchive', 'Wikipedia', 'Articles',
                        'Coordinates on Wikidata', 'CS1', 'EngvarB', 'All',
                        'Good articles', 'Use dmy')
for title in network.keys():
    cats = network[title]['categories']
    network[title]['categories'] = [c for c in cats
                                    if not c.startswith(maintenance_prefixes)
                                    and 'Wikidata' not in c]
As the scraping of the network takes quite some time ($\sim$ 3 hours, especially getting the inner connections), we store the results in pickle files.
In [3]:
# save_obj(network, 'network')
network = load_obj('network')
In [4]:
neighbors = {}
for i in network.keys():
    neighbors[i] = network[i]['links']
In [5]:
g = nx.Graph(neighbors) # undirected graph
In [6]:
print('Total number of nodes : {}'.format(len(g.nodes)))
print('Total number of edges : {}'.format(len(g.edges)))
if nx.is_connected(g):
print('The graph is connected.')
else:
print('The graph is not connected.')
In [7]:
adj = nx.adjacency_matrix(g)
In [8]:
plt.spy(adj.todense())
Out[8]:
Check that the adjacency matrix is symmetric:
In [9]:
(adj != adj.T).count_nonzero() == 0
Out[9]:
In [10]:
degrees = np.array(adj.sum(axis=1)).squeeze()
degrees_truncated = degrees[degrees < 700]
In [11]:
fig, ax = plt.subplots(ncols=2, sharey=True, figsize=(15,5))
ax[0].set_title('Degree distribution')
ax[0].hist(degrees, bins=50)
ax[1].set_title('Truncated degree distribution')
ax[1].hist(degrees_truncated, bins=20)
plt.tight_layout()
plt.show()
In [12]:
fig, ax = plt.subplots(ncols=2, sharey=True, figsize=(15,5))
ax[0].set_title('Degree box plot')
sns.boxplot(degrees, ax=ax[0])
ax[1].set_title('Truncated degree box plot')
sns.boxplot(degrees_truncated, ax=ax[1])
plt.tight_layout()
plt.show()
In [13]:
avg_degree = np.mean(degrees)
print('The average degree of the network is {}.'.format(np.round(avg_degree, 2)))
First we compute the shortest-path lengths. NetworkX does the computation and returns a dictionary. This will be useful later on.
In [14]:
# shortest_paths = dict(nx.shortest_path_length(g))
# save_obj(shortest_paths, 'shortest_paths')
As this computation is quite long ($\sim$ 5 hours), we dumped the resulting dictionary into a pickle file.
In [15]:
shortest_paths = load_obj('shortest_paths')
Now computing the diameter of the network comes down to finding the largest distance. Let's turn the dictionary into a NumPy array, which is faster to manipulate.
In [16]:
nodes = list(network.keys())
In [17]:
distances = np.zeros(shape=(len(nodes), len(nodes)))
for i in range(len(nodes)):
    for j in range(len(nodes)):
        distances[i, j] = shortest_paths[nodes[i]][nodes[j]]
In [18]:
diameter = np.amax(distances)
print('The diameter of the network is {}.'.format(int(diameter)))
At first sight, since we scraped first nodes and then their neighbors (second nodes), every node should be at distance at most 2 from the root node, so we might expect a diameter of at most 4.
Here, thanks to the use of disambiguation pages (which are used for scraping but not kept in the network), we manage to get nodes that are further away from the root node, but surprisingly our graph is connected anyway.
In [19]:
plt.figure(figsize=(15, 15))
nx.draw(g, node_size=5)
In this section, we try to model the collected network with a simpler one, aiming to reproduce its main features such as the number of nodes, the number of edges, the degree distribution, the shape of the giant component, and so on. Such a model is particularly useful to understand the original structure and compare it to other famous, well-studied networks.
In this modeling part, we use functions implemented in utils.py to plot degree distributions and to get the regression coefficient of a power law.
In [20]:
from utils import get_distribution, linear_regression_coefficient
from utils import plot_degree_distribution, print_distribution, print_denoised_degree_distribution
In [21]:
nNodes = len(g.nodes())
nEdges = g.size()
print('The network has {0} nodes and {1} edges.'.format(nNodes, nEdges))
print('The minimum and the maximum degrees are respectively {0} and {1}.'.format(np.min(degrees), np.max(degrees)))
In [22]:
print_distribution(g, a=1000)
The previous plots show that the degree distribution of the network is complicated and does not exactly fit any of the basic network structures studied during the semester. However, the last log-log plot suggests that a scale-free network with a power-law degree distribution could approximate it. Let's run a regression to see which coefficient would fit. We use the linear_model module from sklearn.
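As an illustration, such a log-log regression could be sketched as follows; the exact binning, normalization and cutoffs used by linear_regression_coefficient live in utils.py and may differ.
In [ ]:
from sklearn.linear_model import LinearRegression

def power_law_exponent_sketch(graph):
    # Empirical degree distribution: frequency of each observed degree
    degs = np.array([d for _, d in graph.degree()])
    values, counts = np.unique(degs, return_counts=True)
    X = np.log(values).reshape(-1, 1)
    y = np.log(counts / counts.sum())
    # In p(k) ~ C * k^(-gamma), the slope of the log-log fit estimates -gamma
    reg = LinearRegression().fit(X, y)
    return reg.coef_[0], reg.score(X, y)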
In [23]:
linear_regression_coefficient(g, title='Linear regression of the original degree distribution')
The value of $R^2$ is not really close to 1, but a scale-free network model does not seem too bad anyway.
We will later use this regression to build an approximation of the network, under the assumption that the degree distribution follows a power law of coefficient -1.0693.
The Erdős–Rényi graph models a random network where each pair of nodes is linked with a fixed probability $p$. We want this network to have the same number of nodes as the original one and to approximate its number of edges as closely as possible. Since the expected number of edges of such a graph is $p \, n(n-1)/2$, matching the observed number of edges $m$ gives $p = 2m / (n(n-1))$.
In [24]:
p = 2 * nEdges / nNodes / (nNodes - 1)
print('The probability hyper-parameter giving the best approximation of the number of edges is {}'.format(np.round(p, 4)))
In [25]:
er = nx.erdos_renyi_graph(nNodes, p)
plot_degree_distribution(er, title='Degree distribution of the Erdős–Rényi graph')
As expected, it clearly does not match the distribution of our network: random networks have a Poisson-like degree distribution (when the number of nodes is large), which does not fit the observed one.
The Barabási-Albert graph follows a power-law distribution (in theory $p(k) = C \times k^{-3}$), so we can hope for much better results than with the Erdős–Rényi model. The number of nodes is fixed, and we can only play with the parameter specifying the number of edges attached from each new node to existing nodes. By trial and error, we found that setting this parameter to 54 gives the closest number of edges to our original graph (see the sketch below).
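The trial-and-error search can be automated: in NetworkX, a Barabási-Albert graph on $n$ nodes with attachment parameter $m$ has $m(n-m)$ edges, so we can pick the $m$ whose edge count is closest to the collected network's (the search range below is an arbitrary choice).
In [ ]:
# Pick the attachment parameter whose edge count m * (n - m) best matches
# the number of edges of the collected network.
candidate_ms = range(30, 80)
best_m = min(candidate_ms, key=lambda m: abs(m * (nNodes - m) - nEdges))
print('Best attachment parameter: {}'.format(best_m))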
In [26]:
ba = nx.barabasi_albert_graph(nNodes, 54)
print('This Barabási-Albert network has {0} edges while our original network has {1} edges.'.format(ba.size(), nEdges))
plot_degree_distribution(ba, title='Degree distribution of the Barabási-Albert graph')
It indeed seems to follow a power-law distribution. Let's take a closer look and try to measure the exponent of this power law; in theory, the coefficient of such a random graph should be 3.
In [27]:
print_denoised_degree_distribution(ba, b=200, d=200)
Regression to measure the coefficient of the model's power law:
In [28]:
linear_regression_coefficient(ba, limit=300, title='Linear regression of the Barabási-Albert degree distribution')
We get a coefficient of about 2.7, close to the expected value of 3. This network is thus a better approximation than the random network presented previously, but it is still not ideal: we would like a power-law network whose coefficient is closer to the value of 1.0693 computed earlier.
In [29]:
ba_degrees = list(dict(ba.degree()).values())
f, ax = plt.subplots(figsize=(15, 6))
sns.distplot(degrees_truncated, label='Collected network', ax=ax)
sns.distplot(ba_degrees, label='Barabási-Albert network', ax=ax)
plt.legend(loc='upper right')
plt.show()
We clearly see here that it is a better approximation than the Erdős–Rényi graph but is still not ideal.
In this section we try to build a power-law network with an exponent closer to the one measured in the regression of the original network. We did not find a method to build a graph with the exact exponent, but we approximated it with the following code.
The configuration_model method from NetworkX allows us to create a graph from a given list of degrees. In order to create a list of degrees following a power-law distribution of coefficient $\gamma$, we use the NetworkX function powerlaw_sequence. However, this function can return zeros, which we do not want because our graph is connected. So we generate the degrees one at a time and check that each one is not 0.
In [31]:
while True:
    # Iterate the construction of a degree sequence until we find one with an even sum
    # (this is a requirement of the function configuration_model)
    s = []
    while len(s) < nNodes:
        # generate degrees one at a time
        nextval = int(nx.utils.powerlaw_sequence(1, 1.6)[0])
        if nextval != 0:
            s.append(nextval)
    if sum(s) % 2 == 0:
        # we found a proper degree sequence, we can stop
        break
power_law = nx.configuration_model(s)
power_law = nx.Graph(power_law)  # remove parallel edges
power_law.remove_edges_from(power_law.selfloop_edges())
print('This power law network has {0} nodes and {1} edges.'.format(len(power_law), power_law.size()))
We note right away that the number of edges in this model is much lower than in the collected network (367483).
It seems that the lowest coefficient we can set for the power law is 1.6; all attempts with a smaller coefficient crashed.
We can check that it indeed follows a power-law distribution:
In [32]:
print_denoised_degree_distribution(power_law, a=100, b=200, d=200)
And we calculate here the coefficient of the power law:
In [33]:
linear_regression_coefficient(power_law, limit=79, title='Linear regression of the power-law degree distribution')
It is indeed closer to the original network, but there is still a small gap (reminder: the objective is 1.0693). However, as noted earlier, the number of edges of this power-law network is extremely low compared to the original network. It seems that the Barabási-Albert network is a better approximation even if its fit of the degree distribution is not as good.
The following plot comparing the obtained degree distribution to the original one confirms that the Barabási-Albert network is a better approximation.
In [34]:
pl_degrees = list(dict(power_law.degree()).values())
f, ax = plt.subplots(figsize=(15, 6))
sns.distplot(degrees_truncated, label='Collected network', ax=ax)
sns.distplot(pl_degrees, label='Power-law network', ax=ax)
plt.legend(loc='upper right')
axes = plt.gca()
axes.set_xlim([-100, 1000])
plt.show()
In [35]:
giant_g = max(nx.connected_component_subgraphs(g), key=len)
giant_er = max(nx.connected_component_subgraphs(er), key=len)
giant_ba = max(nx.connected_component_subgraphs(ba), key=len)
giant_pl = max(nx.connected_component_subgraphs(power_law), key=len)
print('Size of the giant component / Size of the network ')
print('Collected network : \t {}/{}'.format(len(giant_g.nodes()), len(g.nodes())))
print('Erdős–Rényi model : \t {}/{}'.format(len(giant_er.nodes()), len(er.nodes())))
print('Barabási-Albert model : {}/{}'.format(len(giant_ba.nodes()), len(ba.nodes())))
print('Power law model : \t {}/{}'.format(len(giant_pl.nodes), len(power_law.nodes)))
The original network, the Erdős–Rényi graph and the Barabási-Albert graph are fully connected. The model based on the last power-law network also has a very large giant component and is almost fully connected. We can conclude that the connectedness of the network is respected.
In [36]:
avg_clustering_g = nx.average_clustering(g)
avg_clustering_er = nx.average_clustering(er)
avg_clustering_ba = nx.average_clustering(ba)
avg_clustering_pl = nx.average_clustering(power_law)
print('Clustering coefficients')
print('Collected network : \t {}'.format(np.round(avg_clustering_g, 3)))
print('Erdős–Rényi model : \t {}'.format(np.round(avg_clustering_er, 3)))
print('Barabási-Albert model : {}'.format(np.round(avg_clustering_ba, 3)))
print('Power law model : \t {}'.format(np.round(avg_clustering_pl, 3)))
The last model, built from a power-law degree sequence, has the closest clustering coefficient. However, its very low number of edges prevents it from being a good model.
Most scale-free networks follow a distribution of the form $p(k) = C\times k^{-\gamma}$ where usually $2 < \gamma < 3$. In the power-law approximation we made, we found $\gamma \simeq 1.0693$, which is not a common value, as the following table shows (values seen during the lectures).
Network | Gamma |
---|---|
WWW in | 2.00 |
WWW out | 2.31 |
Emails in | 3.43 |
Emails out | 2.03 |
Actor | 2.12 |
Protein interactions | 2.12 |
Citations in | 3.03 |
Citations out | 4.00 |
Is a scale-free network really such a good model for our collected network? We saw that the fit is not too bad, but there are also more empirical reasons for such a model.
We may wonder why a scale-free network seems to be a good approximation of the collected network. One of the most notable characteristics of a scale-free network is the presence of nodes with a degree much larger than the average, which is indeed the case here:
In [37]:
print('Average degree : {}'.format(np.round(np.mean(degrees), 1)))
print('{} nodes with a degree more than 5 times the average.'.format(np.count_nonzero(degrees > 5*np.mean(degrees))))
print('{} nodes with a degree more than 10 times the average.'.format(np.count_nonzero(degrees > 10*np.mean(degrees))))
This is in fact quite intuitive when you remember that the nodes represent Wikipedia pages, linked when one page contains a hyperlink to the other. We expect a few large hubs (Wikipedia pages covering important subjects), followed by a larger proportion of smaller ones (moderately important subjects), and finally quite a lot of minor ones. The plots above show that the distribution follows this trend fairly well, except that there are fewer minor and very important topics than in a real scale-free network.
This difference likely comes directly from our sampling method. Indeed, since we start from a central node and stop the collection at some point, central nodes look important and, at the boundary of the network, we get what look like minor pages. Those could have turned out to be important if we had pushed the collection one hop further from the root node.
The preferential attachment process is another intuitive way to understand why the scraped network looks like a scale-free network. This process is also known as "the rich get richer": a quantity (here the links between the nodes) is distributed according to how much the nodes already have. It has been shown that such a process produces scale-free networks, and most generation algorithms (like the Barabási-Albert one) use this principle. Regarding Wikipedia, the more popular a page and the more important its topic, the more links it will receive, and conversely for lesser-known pages. This is exactly a preferential attachment phenomenon.
We will now use the collected data to answer our problem: can we isolate communities of pages just by looking at the hyperlink graph?
This is the famous community detection problem for which a popular method is the Louvain Algorithm.
The performance measure we will use for community detection is modularity. Modularity measures the strength of the division of a network into sub-groups: a network with high modularity has dense intra-group connections (within sub-groups) and sparse inter-group connections (between different groups).
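For reference, the standard definition of modularity for an unweighted graph is
$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j),$$
where $m$ is the number of edges, $A$ the adjacency matrix, $k_i$ the degree of node $i$ and $c_i$ the community assigned to node $i$. This is the quantity computed below by community.modularity (our graph being unweighted).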
Louvain was presented in 2008 and, though it has since been improved [1], we will use it as a baseline against which to compare the performance of spectral clustering for community detection.
The steps are the following:
In [38]:
from utils import get_bag_of_communities
We use the Python library community, which implements the Louvain algorithm.
This library also allows us to compute the modularity of a given partition of the nodes.
In [39]:
louvain_partition = community.best_partition(g)
In [40]:
louvain_modularity = community.modularity(louvain_partition, g)
louvain_modularity
Out[40]:
In [41]:
k_louvain = len(set(louvain_partition.values()))
print('Louvain algorithm found {} communities'.format(k_louvain))
We can try to visualize the categories of the nodes in each of these communities. From the scraping, we have for each page the list of categories it belongs to. Let's compute for each community what we will call a bag of categories: the list of all the categories of the nodes it contains, together with the number of nodes belonging to each category.
The function can be found in utils.py; its implementation is quite straightforward.
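As an illustration, a minimal version of such a bag-of-categories computation could look like the sketch below (the actual get_bag_of_communities may differ in signature and return type).
In [ ]:
def get_bag_of_communities_sketch(network, partition):
    # For each community, count how many of its pages carry each category.
    n_communities = len(set(partition.values()))
    bags = [{} for _ in range(n_communities)]
    for title, comm in partition.items():
        for cat in network[title]['categories']:
            bags[comm][cat] = bags[comm].get(cat, 0) + 1
    return bags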
In [42]:
louvain_bag = get_bag_of_communities(network, louvain_partition)
Let's get the number of pages in each community.
In [43]:
louvain_counts = [0 for _ in range(k_louvain)]
for i, title in enumerate(louvain_partition.keys()):
    louvain_counts[louvain_partition[title]] += 1
Now we want to visualize the categories of the nodes in each community. For each community, we print its 10 most represented categories.
In [44]:
for i in range(k_louvain):
    sorted_bag = sorted(louvain_bag[i].items(), key=operator.itemgetter(1), reverse=True)
    print(' ')
    print('Community {}/{} ({} pages) : '.format(i+1, k_louvain, louvain_counts[i]))
    for ind in range(10):
        print(sorted_bag[ind])
We get some nice results: it seems that a general topic can be inferred for each community. The topics are:
Alphabetical Order |
---|
Aircraft |
American Football |
Animals / mammals |
Apple inc. |
British ships |
Cars |
Comics and fictional characters |
Electronics |
Car racing |
Luxury in Britain |
Mexican soccer |
Music instruments |
Rugby |
Science |
Social science |
Songwriters |
Weapons |
We define the graph Laplacian using the formula $L = D - A$, where $D$ is the diagonal matrix containing the degrees and $A$ is the adjacency matrix.
In [45]:
laplacian = np.diag(degrees) - adj.todense()
laplacian = sparse.csr_matrix(laplacian)
In [46]:
plt.spy(laplacian.todense())
Out[46]:
In order to do spectral clustering with this Laplacian, we need to compute its $k$ smallest eigenvalues and the corresponding eigenvectors. We obtain a matrix $U \in \mathbb{R}^{n \times k}$, where $n$ is the number of nodes in the graph. Applying a k-means algorithm to the $n$ vectors of $\mathbb{R}^k$ given by the rows of $U$ then yields a clustering of the $n$ nodes.
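The next cells perform these steps one by one; gathered into a single function, the pipeline could be sketched as follows (assuming a sparse Laplacian as input).
In [ ]:
def spectral_clustering_sketch(lap, k):
    # k smallest eigenvalues/eigenvectors of the sparse Laplacian,
    # then k-means on the rows of the resulting n x k matrix.
    eigenvalues, eigenvectors = sparse.linalg.eigsh(lap.asfptype(), k=k, which='SM')
    centroids, labels = cluster.vq.kmeans2(eigenvectors, k)
    return labels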
Here we need to specify the number of clusters (communities) we want to look for. As a reminder, Louvain returned 17 communities (sometimes it gives 16); it seems to maximize the modularity, but let's recall that Louvain is a heuristic, so we cannot be sure of that.
Later in this notebook (at the end of the development of the model), we run a kind of cross-validation on the parameter k_spectral: for different values, we run the algorithm 5 times and take the mean and standard deviation of the modularity. It seems that 21 gives the best results. Please see below for details.
In [47]:
k_spectral = 21
In [48]:
eigenvalues, eigenvectors = sparse.linalg.eigsh(laplacian.asfptype(), k=k_spectral, which='SM')
In [49]:
plt.plot(eigenvalues, '.-', markersize=15)
Out[49]:
In [50]:
eigenvalues[:2]
Out[50]:
We check that the first eigenvalue is 0 but the second is not: since the multiplicity of the eigenvalue 0 equals the number of connected components, this confirms that the graph is connected.
Now we cluster the resulting vectors in $\mathbb{R}^k$.
In [51]:
centroids, labels = cluster.vq.kmeans2(eigenvectors, k_spectral)
This warning shows that at least one of the clusters is empty.
In order to get a first idea of how this algorithm did, let's look at the number of nodes in each cluster.
In [52]:
cc = [0 for i in range(k_spectral)]
for i in labels:
    cc[i] += 1
', '.join([str(i) for i in cc])
Out[52]:
We can see that, with almost all the clusters containing fewer than 3 nodes, this first algorithm did not perform well.
As we have seen in class and in one of the homework assignments, for spectral clustering to work well we need to assign edge weights that are stronger the closer the nodes are.
Let's build another graph with the same vertices but new weighted edges between them.
We have already computed the shortest-path distances in the graph, so let's define edge weights using a kernel (e.g. the Gaussian kernel).
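Concretely, with $d_{ij}$ the shortest-path distance between nodes $i$ and $j$ and $\sigma$ the kernel width (taken in the next cell as the mean of the distances), the weights are
$$w_{ij} = \exp\left(-\frac{d_{ij}^2}{\sigma^2}\right), \qquad w_{ii} = 0.$$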
In [53]:
kernel_width = distances.mean()
weights = np.exp(-np.square(distances)/kernel_width**2)
np.fill_diagonal(weights, 0)
This creates a complete graph. We could sparsify it for faster computations, but the computation is not very long and experiments seem to show that results are better with the full graph.
In [54]:
degrees = np.sum(weights, axis=0)
plt.hist(degrees, bins=50);
In [55]:
laplacian = np.diag(1/np.sqrt(degrees)).dot((np.diag(degrees) - weights).dot(np.diag(1/np.sqrt(degrees))))
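The cell above builds the symmetric normalized Laplacian of this weighted graph,
$$L_{\mathrm{sym}} = D^{-1/2} (D - W) D^{-1/2} = I - D^{-1/2} W D^{-1/2},$$
where $W$ is the weight matrix and $D$ the diagonal matrix of the weighted degrees.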
We can check that the obtained Laplacian matrix is symmetric.
In [56]:
tol = 1e-8
np.allclose(laplacian, laplacian.T, atol=tol)
Out[56]:
In [57]:
eigenvalues, eigenvectors = linalg.eigh(laplacian, eigvals=(0, k_spectral-1))
In [58]:
plt.plot(eigenvalues, '.-', markersize=15)
Out[58]:
In [59]:
centroids, labels = cluster.vq.kmeans2(eigenvectors, k_spectral)
In [60]:
cc = [0 for i in range(k_spectral)]
for i in labels:
    cc[i] += 1
', '.join([str(i) for i in cc])
Out[60]:
This seems better: we get pages distributed among all the clusters (with some clusters larger than others, of course).
First let's have a look at the categories of each cluster.
In [61]:
spectral_partition = {}
for i, title in enumerate(network.keys()):
    spectral_partition[title] = labels[i]
In [62]:
spectral_bag = get_bag_of_communities(network, spectral_partition)
In [63]:
spectral_counts = [0 for _ in range(k_spectral)]
for i, title in enumerate(spectral_partition.keys()):
    spectral_counts[spectral_partition[title]] += 1
In [64]:
for i in range(k_spectral):
    sorted_bag = sorted(spectral_bag[i].items(), key=operator.itemgetter(1), reverse=True)
    print(' ')
    print('Community {}/{} ({} pages) : '.format(i+1, k_spectral, spectral_counts[i]))
    if spectral_counts[i] > 0:
        for ind in range(10):
            print(sorted_bag[ind])
It seems that we get roughly the same results. As we asked for more communities than Louvain found, some of them are split, but each split is either a duplicate or a finer separation.
There are some inconsistencies in the partition we get, but the electronics community, for instance, is now split into video games and computer hardware.
So we get more communities: sometimes they are just duplicates, but sometimes they correspond to a finer separation of two groups.
In [65]:
spectral_modularity = community.modularity(spectral_partition, g)
In [66]:
spectral_modularity
Out[66]:
The modularity is lower than the one obtained with Louvain.
Cross-validation on k_spectral
Here we test different values of k. After some testing, it seems that there is a high variance in the modularity of the partitions returned by the algorithm (for a given k_spectral). In order to find out whether one value is really better than the others, we compute the mean and standard deviation of the modularity for each value by running the algorithm 5 times.
In [67]:
"""cross_val = {}
for k in tqdm_notebook(range(10, 30)):
tmp = []
for _ in range(5):
eigenvalues, eigenvectors = linalg.eigh(laplacian, eigvals=(0, k-1))
centroids, labels = cluster.vq.kmeans2(eigenvectors, k)
spectral_partition = {}
for i, title in enumerate(network.keys()):
spectral_partition[title] = labels[i]
spectral_modularity = community.modularity(spectral_partition, g)
tmp.append(spectral_modularity)
cross_val[k] = [np.mean(tmp), np.std(tmp)]
save_obj(d, 'cross_val')"""
Out[67]:
As this computation takes approximately one hour to terminate, the results have been stored in a pickle file.
In [68]:
cross_val = load_obj('cross_val')
cross_val
Out[68]:
We see that the best modularity seems to be achieved with k = 21. However, we note that the standard deviation is quite high in all cases.
It seems that no matter the value of k we choose, we won't be able to reach a higher modularity than the one achieved by the Louvain algorithm. So what could be the advantages of the spectral approach?
Spectral clustering also seems much more costly than the Louvain method, something we noticed throughout our study.
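A rough way to back up this cost comparison is to time both approaches directly; this is only a sketch (absolute timings depend on the machine, and only the eigendecomposition step of the spectral method is measured).
In [ ]:
import time

t0 = time.time()
community.best_partition(g)
print('Louvain: {:.1f} s'.format(time.time() - t0))

t0 = time.time()
linalg.eigh(laplacian, eigvals=(0, k_spectral - 1))
print('Spectral clustering (eigendecomposition only): {:.1f} s'.format(time.time() - t0))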
First we want to visualize the graph in the notebook using plotly.
In order to get a clean representation of how our nodes form communities, we define a color for each community that will help us tell the clusters apart in our network.
In [69]:
community2color = {
0: sns.xkcd_rgb["peach"],
1: sns.xkcd_rgb["powder blue"],
2: sns.xkcd_rgb["light pink"],
3: sns.xkcd_rgb["chocolate"],
4: sns.xkcd_rgb["orange"],
5: sns.xkcd_rgb["magenta"],
6: sns.xkcd_rgb["purple"],
7: sns.xkcd_rgb["blue"],
8: sns.xkcd_rgb["deep blue"],
9: sns.xkcd_rgb["sky blue"],
10: sns.xkcd_rgb["olive"],
11: sns.xkcd_rgb["seafoam green"],
12: sns.xkcd_rgb["tan"],
13: sns.xkcd_rgb["mauve"],
14: sns.xkcd_rgb["hot pink"],
15: sns.xkcd_rgb["pale green"],
16: sns.xkcd_rgb["indigo"],
17: sns.xkcd_rgb["lavender"],
18: sns.xkcd_rgb["eggplant"],
19: sns.xkcd_rgb["brick"],
20: sns.xkcd_rgb["light blue"],
}
A simple representation of our graph can be obtained with the NetworkX drawing tool. Even here, we get an idea of how each category of Wikipedia articles forms a densely connected group of nodes in our graph.
In [70]:
position = nx.spring_layout(g)
In [71]:
for comm in set(louvain_partition.values()):
    # draw the nodes of each community with its own color
    list_nodes = [node for node in louvain_partition.keys() if louvain_partition[node] == comm]
    nx.draw_networkx_nodes(g, position, list_nodes, node_size=20, node_color=community2color[int(comm)])
nx.draw_networkx_edges(g, position, alpha=0.5)
plt.show()
Now, to put this into a more interactive perspective, we will use plotly scatter plots to explore our network. For each node, we set an attribute indicating which community it belongs to according to the Louvain and spectral partitions. We also assign a position to each node; this is important in order to find a good representation of the network. The NetworkX and community packages come with built-in functions for positioning networks according to various algorithms. After trying several of them, we chose spring_layout: the result is not perfect, but it is easy to use and to implement.
In [72]:
nx.set_node_attributes(g, spectral_partition, 'spectral')
nx.set_node_attributes(g, louvain_partition, 'louvain')
nx.set_node_attributes(g, position, 'position')
We implemented two functions in utils that allow us to plot interactive graphs.
In [73]:
from utils import build_communities, set_layout
In [74]:
data = build_communities('louvain', 'position', G=g, community2color=community2color)
layout = set_layout('Louvain')
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)
Out[74]:
In [75]:
data = build_communities('spectral', 'position', G=g, community2color=community2color)
layout = set_layout('spectral clustering')
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)
Out[75]:
The layout of the nodes is not ideal but we could not make it as clear as on Gephi (see below).
We can go through the nodes of the graph and see the titles of the pages that belong to each cluster, marked with different colors. As we have already observed, we managed to match the communities to categories of articles using only the hyperlink network.
Though the previous visualizations can be handy, they are quite slow and the layout is not really representative of the structure of the graph. This is why we chose Gephi as our main visualization tool. Here are some of the resulting visualizations.
Using Gephi, we were able to get some nice visualizations of our clustering.
In [76]:
for i in louvain_partition.keys():
    louvain_partition[i] = str(louvain_partition[i])
for i in spectral_partition.keys():
    spectral_partition[i] = str(spectral_partition[i])
NetworkX requires all the node attributes to be strings to export them.
In [77]:
nx.set_node_attributes(g, louvain_partition, 'louvain')
nx.set_node_attributes(g, spectral_partition, 'spectral')
In [80]:
for n, d in g.nodes(data=True):
    del d['position']
In [81]:
nx.write_graphml(g, 'data/full_graph.graphml')
Now opened in Gephi, this is what our graph looks like.
We used a repulsive layout in order to make the graph as readable as possible. Nodes in the same community share the same color.
We can look more closely at some communities, for example the Apple one:
or the Mexican soccer one:
Some additional screenshots are provided in the folder images/. Feel free to have a look at different communities.
In this study, we first collected a sample of the Wikipedia hyperlink network. We tried different approaches, using either the Wikipedia API or scraping of the raw pages. Eventually we managed to get a significant sample (6830 nodes) in a reasonable time (3 hours). The use of disambiguation pages proved handy to get a large variety of pages from a single root node.
Then we tried to model this collected network using usual network structures. The various plots of the degree distribution suggest that a power-law distribution of the degrees can be a good approximation. The Barabási-Albert model turned out to be the best approximation, as it matches the collected network on many of its properties (number of nodes and edges, degree distribution, shape of the giant component).
Eventually we answered our initial question, which was whether the structure of the graph bears information on the content of the nodes. That proved to be true, as we could extract communities sharing common categories only by looking at the hyperlink structure. The Louvain algorithm provided a better community detection than spectral clustering: it gives a higher modularity and is faster to run. We checked that the extracted communities share a common topic by visualizing the clusters, for example with Gephi.
To continue our study, we could try to properly measure how well the detected communities match the categories of the nodes by implementing a natural language processing pipeline on those categories in order to extract topics.