The latest version of each current legal document is available on the Serbian Legal Information System website. A crawler and a scraper were developed to collect all republic-level legislation together with its ID card and list of related regulations. Only legislation with an ID card was scraped. The original collected data can be found in dataset/original_data/.
To create the legislation network, nodes and their links need to be defined.
Every collected document is a node. The list of document names extracted from their ID cards, each assigned a custom-made ID number, can be found in dataset/new_data/.
Links between nodes are full-explicit references. The reference-extraction process for every document:
Afterwards, all detected references were aggregated into a single document. A second document was then created by replacing document names with their custom-made ID numbers (a minimal sketch of this step is given below).
The result of this process is one list of references by document name and one by document ID, and these two documents form our legislation network. The references detected in each individual document can be found in dataset/new_data/graph/.
To evaluate the accuracy of the process, full-explicit references were manually identified in 10 randomly selected documents. After validation, the obtained accuracy of the reference-extraction process was: %.
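The replacement step can be pictured with a minimal sketch (an illustrative helper, not the project's actual script; it only assumes the triple-tab-separated source/destination line format that the notebook later reads from dataset/new_data/all_text_lines.txt, plus a name_to_id mapping from document names to their custom-made ID numbers):
# Illustrative sketch only: convert aggregated name-based reference lines
# ("src\t\t\tdest") into whitespace-separated pairs of custom-made ids.
def names_to_id_edges(ref_lines, name_to_id):
    id_lines = []
    for line in ref_lines:
        src, dest = line.rstrip('\r\n').split('\t\t\t')
        id_lines.append('%d %d' % (name_to_id[src], name_to_id[dest]))
    return id_lines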
In [1]:
# imports
import pandas as pd
import codecs
import graphistry
import warnings
import networkx as nx
from networkx.algorithms import assortativity as assort
from networkx.algorithms import centrality as central
import matplotlib.pyplot as plt
from collections import OrderedDict
from operator import itemgetter
from collections import Counter
from itertools import islice
import numpy as np
import igraph as ig
# setup
warnings.filterwarnings('ignore')
api_key = open('API_key.txt').read()
graphistry.register(key=api_key)
%matplotlib inline
In [2]:
def load_num_doc(file_path):
    """
    Read the num - doc dictionary that maps a document id to its name
    """
    f = codecs.open(file_path, 'r', 'utf8')
    lines = f.readlines()
    f.close()
    num_doc_mapper = {}
    # strip the trailing '\r\n' from all lines; the last line is kept as-is
    clean_lines = [line[:-2] for line in lines if line.endswith('\r\n')]
    clean_lines.append(lines[-1])
    for line in clean_lines:
        number = int(line.split(',')[0])
        text = line[len(str(number)) + 1:]
        num_doc_mapper[number] = text
    return num_doc_mapper
In [3]:
def sort_dictionary_by_value_asc(input_dict):
    output_dict = OrderedDict(sorted(input_dict.items(), key=itemgetter(1)))
    return output_dict

def sort_dictionary_by_value_desc(input_dict):
    # descending order of value
    output_dict = OrderedDict(sorted(input_dict.items(), key=itemgetter(1), reverse=True))
    return output_dict
In [4]:
# reading graph edges (document-name pairs separated by a triple tab)
edges = pd.read_csv('dataset/new_data/all_text_lines.txt', sep='\t\t\t',
                    engine='python', names=['src', 'dest'])
print(edges.head())
# reading the id-based edge list into igraph
this_igraph = ig.Graph.Read_Ncol('dataset/new_data/all_num_lines.txt', directed=True)
In [5]:
# visualization using graphistry
graphistry.bind(source='src', destination='dest').plot(edges)
Out[5]:
In [6]:
# reading num - name dictionary
num_doc_mapper = load_num_doc('dataset/new_data/new_num_doc_sorted.txt')
# reading and creating network using networkx
graph = nx.read_edgelist('dataset/new_data/all_num_lines.txt', create_using=nx.DiGraph(), nodetype=int)
print(nx.info(graph))
In [7]:
# highest degrees
print("Nodes with highest degrees: (in + out)\n")
degrees_high = sort_dictionary_by_value_desc(graph.degree())
degrees_high_count = Counter(degrees_high)
for k, v in degrees_high_count.most_common(5):
    print('%s: %i (%i + %i)\n' % (num_doc_mapper[k], v, graph.in_degree(k), graph.out_degree(k)))
In [8]:
# lowest degrees
print("Nodes with lowest degrees: (in + out)\n")
degrees_low = sort_dictionary_by_value_asc(graph.degree())
degrees_low_count = islice(degrees_low.items(), 0, 5)
for k, v in degrees_low_count:
    print('%s: %i (%i + %i)\n' % (num_doc_mapper[k], v, graph.in_degree(k), graph.out_degree(k)))
In [9]:
# highest in_degrees
print("Nodes with highest in degrees:\n")
in_degrees_high = sort_dictionary_by_value_desc(graph.in_degree())
in_degrees_high_count = Counter(in_degrees_high)
for k, v in in_degrees_high_count.most_common(5):
    print('%s: %i\n' % (num_doc_mapper[k], v))
In [10]:
# highest out_degrees
print("Nodes with highest out degrees:\n")
out_degrees_high = sort_dictionary_by_value_desc(graph.out_degree())
out_degrees_high_count = Counter(out_degrees_high)
for k, v in out_degrees_high_count.most_common(5):
    print('%s: %i\n' % (num_doc_mapper[k], v))
The diameter of a graph is the maximum eccentricity of any vertex in the graph; that is, the diameter d is the greatest shortest-path distance between any pair of vertices.
For directed graphs, the graph density is defined as D = |E| / (|V| (|V| - 1)), where |E| is the number of edges and |V| is the number of vertices in the graph.
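For intuition, a tiny toy check of these two quantities (a hand-made three-node graph, not the legislation network), using the nx and ig aliases imported above:
# Toy check of D = |E| / (|V| * (|V| - 1)) and of the diameter on a directed path 1 -> 2 -> 3.
toy = nx.DiGraph([(1, 2), (2, 3)])
print(nx.density(toy))    # 2 / (3 * 2) = 0.333...
toy_ig = ig.Graph.TupleList([(1, 2), (2, 3)], directed=True)
print(toy_ig.diameter())  # longest shortest path is 1 -> 2 -> 3, so 2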
In [11]:
print("Graph diameter: {0}".format(this_igraph.diameter()))
print("Average path length: {0}".format(this_igraph.average_path_length()))
print("Graph density: {0}".format(this_igraph.density()))
In [12]:
print("Cycles: \n")
cycles = nx.algorithms.find_cycle(graph)
for cycle in cycles:
print('%s --> %s\n' % (num_doc_mapper[cycle[0]], num_doc_mapper[cycle[1]]))
The average degree connectivity is the average nearest neighbor degree of nodes with degree k.
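As a toy illustration (an undirected four-node star, not the legislation network): the hub has degree 3 and neighbours of average degree 1, while each leaf has degree 1 and a single neighbour of degree 3.
# Toy illustration of average degree connectivity on an undirected star.
star = nx.star_graph(3)   # hub 0 connected to leaves 1, 2, 3
print(assort.average_degree_connectivity(star))   # k=3 -> 1.0, k=1 -> 3.0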
In [13]:
print("Nodes with highest average degree connectivity: \n")
adc_high = sort_dictionary_by_value_desc(assort.average_degree_connectivity(graph))
adc_high_count = Counter(adc_high)
for k, v in adc_high_count.most_common(5):
    print('k=%i: %f\n' % (k, v))
In graph theory and network analysis, indicators of centrality identify the most important vertices within a graph.
The degree centrality for a node v is the fraction of nodes it is connected to.
The in-degree centrality for a node v is the fraction of nodes its incoming edges are connected to.
The out-degree centrality for a node v is the fraction of nodes its outgoing edges are connected to.
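A toy worked example of these definitions (a hand-made four-node directed graph, not the legislation network): with n = 4 nodes, each centrality is the corresponding degree divided by n - 1 = 3.
# Toy worked example: degree centralities are degrees divided by n - 1 = 3.
toy = nx.DiGraph([(0, 1), (0, 2), (0, 3), (1, 0)])
print(central.degree_centrality(toy))      # node 0: (3 + 1) / 3 = 1.33...
print(central.in_degree_centrality(toy))   # node 0: 1 / 3 = 0.33...
print(central.out_degree_centrality(toy))  # node 0: 3 / 3 = 1.0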
In [14]:
print("Nodes with highest degree centrality: \n")
dc_high = sort_dictionary_by_value_desc(central.degree_centrality(graph))
dc_high_count = Counter(dc_high)
for k, v in dc_high_count.most_common(5):
    print('%s: %f\n' % (num_doc_mapper[k], v))
In [15]:
print("Nodes with highest in-degree centrality: \n")
idc_high = sort_dictionary_by_value_desc(central.in_degree_centrality(graph))
idc_high_count = Counter(idc_high)
for k, v in idc_high_count.most_common(5):
    print('%s: %f\n' % (num_doc_mapper[k], v))
print("\n\nNodes with highest out-degree centrality: \n")
odc_high = sort_dictionary_by_value_desc(central.out_degree_centrality(graph))
odc_high_count = Counter(odc_high)
for k, v in odc_high_count.most_common(5):
    print('%s: %f\n' % (num_doc_mapper[k], v))
In [16]:
print("Nodes with highest closeness centrality: \n")
cc_high = sort_dictionary_by_value_desc(central.closeness_centrality(graph))
cc_high_count = Counter(cc_high)
for k, v in cc_high_count.most_common(5):
    print('%s: %f\n' % (num_doc_mapper[k], v))
Here we compute the shortest-path betweenness centrality of nodes. The betweenness centrality of a node v is the sum of the fractions of all-pairs shortest paths that pass through v.
In other words, betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes.
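As a toy illustration (a directed four-node path, not the legislation network), only the two inner nodes lie on shortest paths between other nodes:
# Toy illustration: raw (unnormalized) betweenness on a directed path 0 -> 1 -> 2 -> 3.
path = nx.DiGraph([(0, 1), (1, 2), (2, 3)])
print(central.betweenness_centrality(path, normalized=False))
# {0: 0.0, 1: 2.0, 2: 2.0, 3: 0.0}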
In [17]:
print("Nodes with lowest betweenness centrality: \n")
bc_low = sort_dictionary_by_value_asc(central.betweenness_centrality(graph))
bc_low_count = islice(bc_low.items(), 0, 5)
for k, v in bc_low_count:
    print('%s: %f\n' % (num_doc_mapper[k], v))
In [18]:
# highest PR
print("Nodes with highest PageRank values: \n")
pr_high = sort_dictionary_by_value_desc(nx.algorithms.pagerank(graph))
pr_high_count = Counter(pr_high)
for k, v in pr_high_count.most_common(10):
    print('%s: %f\n' % (num_doc_mapper[k], v))
In [19]:
# lowest PR
print("Nodes with lowest PageRank values: \n")
pr_low = sort_dictionary_by_value_asc(nx.algorithms.pagerank(graph))
pr_low_count = islice(pr_low.items(), 0, 10)
for k, v in pr_low_count:
    print('%s: %f\n' % (num_doc_mapper[k], v))
In [20]:
# highest K
print("Nodes with highest Katz centrality: \n")
k_high = sort_dictionary_by_value_desc(nx.katz_centrality_numpy(graph))
k_high_count = Counter(k_high)
for k, v in k_high_count.most_common(10):
    print('%s: %f\n' % (num_doc_mapper[k], v))
In [21]:
# lowest Katz
print("Nodes with lowest Katz centrality: \n")
k_low = sort_dictionary_by_value_asc(nx.katz_centrality_numpy(graph))
k_low_count = islice(k_low.items(), 0, 10)
for k, v in k_low_count:
    print('%s: %f\n' % (num_doc_mapper[k], v))
Eigenvector centrality computes the centrality of a node based on the centrality of its neighbors. It assigns relative scores to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes.
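As a toy illustration (an undirected four-node star, not the legislation network), the hub scores highest because every leaf's score flows entirely into it:
# Toy illustration of eigenvector centrality on an undirected star.
star = nx.star_graph(3)                 # hub 0 connected to leaves 1, 2, 3
print(nx.eigenvector_centrality(star))  # hub ~0.71, each leaf ~0.41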
In [22]:
# highest ev
print("Nodes with highest Eigenvector centrality: \n")
ev_high = sort_dictionary_by_value_desc(nx.eigenvector_centrality(graph))
ev_high_count = Counter(ev_high)
for k, v in ev_high_count.most_common(10):
    print('%s: %f\n' % (num_doc_mapper[k], v))
In [23]:
# lowest ev
print("Nodes with lowest Eigenvector centrality: \n")
ev_low = sort_dictionary_by_value_asc(nx.eigenvector_centrality(graph))
ev_low_count = islice(ev_low.items(), 0, 10)
for k, v in ev_low_count:
    print('%s: %f\n' % (num_doc_mapper[k], v))
In graph theory, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. The global version of this measure was designed to give an overall indication of the clustering in the network.
The global clustering coefficient is based on triplets of nodes. A triplet is three nodes connected by either two edges (an open triplet) or three edges (a closed triplet). A triangle therefore includes three closed triplets, one centered on each of its nodes. The global clustering coefficient is the number of closed triplets (or 3 x triangles) over the total number of triplets (both open and closed):
C = 3 x number of triangles / number of connected triplets of vertices = number of closed triplets / number of connected triplets of vertices.
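A toy worked example (an undirected graph made of one triangle plus a pendant edge, not the legislation network): there is 1 triangle, hence 3 closed triplets, and 5 connected triplets in total, so C = 3 / 5 = 0.6.
# Toy worked example: triangle 0-1-2 plus pendant edge 2-3.
toy = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])
print(nx.transitivity(toy))   # 3 * 1 / 5 = 0.6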
In [24]:
print("Global Clustering Coefficient: {0}".format(nx.transitivity(graph)))
Latapy and Pons' algorithm for community detection is based on random walks. This algorithm (Walktrap) can be adapted to directed edges and weights, and its resolution can be altered via the walk length. Basic idea: simulate many short random walks on the network, compute pairwise similarity measures based on these walks, and use these similarity values to aggregate vertices into communities. Time complexity: depends on the walk length, typically O(|V|^2 log |V|).
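For reference, the same community_walktrap / as_clustering call pattern on a small standard graph bundled with igraph (Zachary's karate club), separate from the legislation network:
# Toy run of Walktrap on Zachary's karate club (bundled with igraph).
karate = ig.Graph.Famous("Zachary")
karate_clusters = karate.community_walktrap(steps=4).as_clustering()
print("communities:", len(karate_clusters))
print("membership:", karate_clusters.membership)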
In [25]:
# Walktrap community detection on the legislation network
dendrogram = this_igraph.community_walktrap(steps=10)
clusters = dendrogram.as_clustering()
membership = clusters.membership
memberships_dict = {}
for name, community in zip(this_igraph.vs["name"], membership):
    if community not in memberships_dict:
        memberships_dict[community] = []
    memberships_dict[community].append(num_doc_mapper[int(name)])
print("Total number of communities: {0}".format(len(memberships_dict)))
# number of members per community
membership_nums = {}
for k in memberships_dict:
    membership_nums[k] = len(memberships_dict[k])
In [26]:
# top communities
print("Communities with most members: \n")
mem_num_high = sort_dictionary_by_value_desc(membership_nums)
mem_num_high_count = Counter(mem_num_high)
for k, v in mem_num_high_count.most_common(5):
    print('\nCommunity %i with %i members:' % (k, v))
    #print(memberships_dict[k])
In [27]:
# bottom communities
print("Communities with least members: \n")
mem_num_low = sort_dictionary_by_value_asc(membership_nums)
mem_num_low_count = islice(mem_num_low.items(), 0, 5)
for k, v in mem_num_low_count:
    print('\nCommunity %i with %i members:' % (k, v))
    print(memberships_dict[k])