In this tutorial, we learn how to use cycleindex package to calculate balance ratios $R_l$ for Gahuku-Gama network which is a signed network of tribes of Gahuku-Gama aliance structure 1.
First, let's import packages we use throughout this tutorial.
In [1]:
import time
import numpy as np
from cycleindex.sampling import nrsampling, vxsampling
from cycleindex import clean_matrix, cycle_count, balance_ratio
Now, define the network
In [2]:
gama_pos = np.array(
[[0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1],
[1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1],
[0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0],
[0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0],
[0,0,1,0,0,0,1,1,0,0,1,1,0,0,0,0],
[0,0,1,0,1,1,0,1,0,0,1,1,1,0,0,0],
[0,0,1,1,0,1,1,0,0,0,1,1,0,0,0,0],
[0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0],
[0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0],
[0,0,0,0,0,1,1,1,0,0,0,1,0,0,0,0],
[0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,0],
[0,0,0,0,0,0,1,0,1,1,0,0,0,1,0,0],
[0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0],
[1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1],
[1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0]]
)
gama_neg = np.array(
[[0,0,1,1,1,1,0,0,0,0,0,1,0,0,0,0],
[0,0,1,0,1,1,0,0,1,1,0,0,0,0,0,0],
[1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
[1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1],
[1,1,0,0,0,0,0,0,1,0,0,0,1,0,0,1],
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0],
[0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0],
[0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0],
[0,0,0,0,0,0,0,0,1,1,0,0,1,0,1,1],
[1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1],
[0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,1],
[0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1],
[0,0,0,0,1,0,0,0,1,1,1,1,1,0,0,0],
[0,0,0,0,1,1,0,0,0,0,1,1,1,1,0,0]]
)
gama = gama_pos - gama_neg
print("# nodes: {}".format(len(gama)))
print("# positive edges: {}".format(np.sum(np.where(gama > 0))))
print("# negative edges: {}".format(np.sum(np.where(gama < 0))))
We know that isolated vertices, sinks and sources do not contribute in our calculations. So it is better to remove them. The function clean_matrix helps us with that. Gama network is not a good example, as it contains no isolated vertices nor sinks or sources. But we do it for demonstration purposes.
In [3]:
gama_reduced = clean_matrix(gama)
print("# nodes: {}".format(len(gama_reduced)))
print("# positive edges: {}".format(np.sum(np.where(gama_reduced > 0))))
print("# negative edges: {}".format(np.sum(np.where(gama_reduced < 0))))
We start by counting cycles in the network. To do that, we use cycle_count function. It gets the adjacency matrix and the maximum cycle length we need.
In [4]:
?cycle_count # run to see the documentation on the pager.
In [5]:
start = time.time()
counts = cycle_count(gama_reduced,5)
print("Runtime: {:.2f}s".format(time.time() - start))
In [6]:
print(counts)
print(np.array(counts)) # Numpy deals with floating-point issues better.
The first list shows $N_l^+ - N_l^-$ and the second list shows $N_l^+ + N_l^-$ for $l \in \{0,1,...,5\}$ where $N_l^+$ and $N_l^-$ are number of positive and negative simple cycles of length l. For weighted networks, the weight of a cycle is equal to multiplication of the weights of the edges in the cycle.
It is easy to calculate $N_l^+$ and $N_l^-$ using these two lists.
Use balance_ratio function to calculate $R_l = \dfrac{N_l^-}{N_l^+ + N_l^-} $. This function has a few tricky parameters that we will discuss later. For now, we want exact ratios as the network is small.
In [7]:
start = time.time()
ratios = balance_ratio(gama_reduced, 5, exact=True)
print("Runtime: {:.2f}s".format(time.time() - start))
print(ratios)
Cycleindex provides two functions for graph sampling:
vxsampling is an implementation of vertex expansion algorithm. It chooses a node at random and tries to expand the forming subgraph by selecting the neighbouring nodes at random and adding them to the subgraph 2.nrsampling which tries to sample subgraphs uniformly at random 2. Choose this algorithm if the degree distribution of the network at hand is skewed.Using these two functions, we are able to estimate balance ratios where exact calculation is not feasible.
In [8]:
start = time.time()
ratios = balance_ratio(gama_reduced, 5, exact=False, n_samples=3000, parallel=False, sampling_func=vxsampling)
print("Runtime: {:.2f}s".format(time.time() - start))
print(ratios)
As you can see, the ratios are not accurate, but good enough. We can also use multiple processes to calculate the ratio.
In [9]:
start = time.time()
ratios = balance_ratio(gama_reduced, 5, exact=False, n_samples=3000, parallel=True, sampling_func=vxsampling)
print("Runtime: {:.2f}s".format(time.time() - start))
print(ratios)
The PC I am using has only two cores, so the improvement is not that much. When more cores are available, the algorithm uses all of them.
In the previous section, we used n_samples argument to specify how many samples to use for estimation. Often, we are not sure how many samples we need to have an accurate estimation. We can use accuracy parameter to specify how accurate we expect the result to be. The function then samples the graph until the ratios converge, i.e. the standard deviation falls below the accuracy specified.
In [10]:
start = time.time()
ratios = balance_ratio(gama_reduced, 5, exact=False, accuracy=0.01, parallel=True, sampling_func=vxsampling)
print("Runtime: {:.2f}s".format(time.time() - start))
print(ratios)
[1] http://konect.uni-koblenz.de/networks/ucidata-gama
[2] Lu X., Bressan S. (2012) Sampling Connected Induced Subgraphs Uniformly at Random. In: Ailamaki A., Bowers S. (eds) Scientific and Statistical Database Management. SSDBM 2012. Lecture Notes in Computer Science, vol 7338. Springer, Berlin, Heidelberg
In [ ]: