Ordering by groups or modules gives us a visual indication of how well the system accomplishes the design goal of loosely coupled, highly cohesive modules. We can quantify this idea.
Clustering is a type of assignment problem that seeks the optimal allocation of N components to M clusters. One of the prominent heuristics of system architecting is to choose modules such that they are as independent as possible: low coupling and high cohesion.
We can objectively score a candidate clustering with an objective function that considers both the size of the clusters ($C_i$) and the number of interactions outside the clusters ($I_0$) according to the following equation, where typical weights are $\alpha = 10$, $\beta = 100$ or $\alpha = 1$, $\beta = 10$, and $M$ is the number of clusters:
$Obj = \alpha \sum_{i=1}^{M}C_i^2 + \beta I_0$
Clustering objectives balance two competing extremes: putting every component into a single cluster minimizes external interactions but maximizes cluster size, while putting every component into its own cluster minimizes cluster size but maximizes external interactions.
The objective function can be evaluated for any number of potential designs that were manually or automatically created. This provides a real-time feedback loop about the potential quality of a design. The range of the function is immediately bound by the two extremes. Your job as an architect and designer is to minimize this function while preserving semantically meaningful modules.
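To make the scoring concrete, here is a minimal sketch of this objective function in plain Python. This is a toy reimplementation for illustration only, not the scoring code used below; the example design and its edges are made up.

```python
# Sketch of Obj = alpha * sum(C_i^2) + beta * I_0 for a cluster assignment.
# `membership[v]` is the cluster id of component v; `edges` are interactions.
from collections import Counter

def objective(membership, edges, alpha=1, beta=10):
    sizes = Counter(membership)                       # C_i for each cluster i
    size_term = alpha * sum(c * c for c in sizes.values())
    # I_0: interactions whose endpoints land in different clusters
    i0 = sum(1 for u, v in edges if membership[u] != membership[v])
    return size_term + beta * i0

# Toy system: 4 components in 2 clusters, with one cross-cluster interaction.
membership = [0, 0, 1, 1]
edges = [(0, 1), (2, 3), (1, 2)]
print(objective(membership, edges))  # 1*(2^2 + 2^2) + 10*1 = 18
```

Evaluating the same edges with all components in one cluster, or each in its own, reproduces the two extremes described above.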
For more information, see Eppinger & Browning, Design Structure Matrix Methods and Applications, MIT Press, Cambridge, 2012, p. 25.
In [2]:
import sand
network_collection = "lein-topology"
network_name = "57af741"
data_path = "./data/" + network_collection + "-" + network_name
edge_file = data_path + ".csv"
edgelist = sand.csv_to_dicts(edge_file, header=['source', 'target', 'weight'])
g = sand.from_edges(edgelist)
g.summary()
Out[2]:
Namespaces are the modules of the system and will be used in the modularity score:
In [16]:
g.vs['group'] = sand.fqn_to_groups(g.vs['label'])
In [17]:
len(set(g.vs['group']))
Out[17]:
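The grouping works because fully-qualified Clojure names encode the namespace before the `/`. A minimal sketch of the idea, not sand's actual implementation of `fqn_to_groups`:

```python
# Map each fully-qualified name to an integer id for its namespace,
# e.g. "topology.core/add" -> namespace "topology.core" -> group 0.
def fqn_to_groups(labels):
    groups = {}
    membership = []
    for label in labels:
        ns = label.split('/')[0]
        membership.append(groups.setdefault(ns, len(groups)))
    return membership

print(fqn_to_groups(['topology.core/add', 'topology.core/sub', 'topology.io/read']))
# [0, 0, 1]
```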
For us to apply this scoring methodology meaningfully, we'll make a couple of simplifying assumptions:
- External functions, e.g. those in clojure.core, aren't moving to a different namespace, so we exclude them.
- Test namespaces are excluded as well, since we only want to score the core design.
With these, we can apply the filtering from above a bit more strictly to get an even smaller subgraph of the function call network:
In [18]:
v_to_keep = g.vs(lambda v: 'topology' in v['label'] and not 'test' in v['label'])
tg = g.subgraph(v_to_keep)
# Recompute degree after building the subgraph:
tg.vs['indegree'] = tg.degree(mode="in")
tg.vs['outdegree'] = tg.degree(mode="out")
tg.summary()
Out[18]:
The baseline modularity score of lein-topology's core function dependency graph is:
In [20]:
import sand.modularity as mod
mod.objective(tg, tg.vs['group'])
Out[20]:
Where is this on the range of possibilities?
Suppose all functions were in the same namespace. We'll simulate this by setting the group membership vector to all 1's:
In [21]:
mod.objective(tg, [1 for _ in range(len(tg.vs))])
Out[21]:
This is the degenerate case of M = 1: no interactions leave the single cluster, so with $\alpha = 1$ the objective function simply returns the square of the number of vertices:
In [22]:
len(tg.vs) * len(tg.vs)
Out[22]:
The other extreme occurs at M = N, with every function in its own namespace. We can simulate this by providing a unique group membership id for each vertex:
In [23]:
mod.objective(tg, range(len(tg.vs)))
Out[23]:
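In this case each $C_i = 1$ and every interaction crosses a cluster boundary, so the equation above reduces to:

$Obj = \alpha \sum_{i=1}^{N} 1^2 + \beta I_0 = \alpha N + \beta E$

where $E$ is the total number of edges in the subgraph.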
Finally, we can compare our actual modularity score to a computational result. We can use Girvan-Newman edge-betweenness community detection to generate a modular design based on the network structure alone:
In [25]:
eb_membership = sand.edge_betweenness(tg, directed=True)
len(set(eb_membership))
Out[25]:
In [26]:
len(set(tg.vs['group']))
Out[26]:
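Girvan-Newman works by repeatedly removing the edge that the most shortest paths pass through, splitting the graph into communities. A compact sketch of one such step on an undirected toy graph illustrates the idea behind the detection used here; this is not sand's code, and it counts a single shortest path per node pair rather than the weighted counts a full implementation uses:

```python
# Two triangles joined by one bridge: Girvan-Newman should target the bridge.
from collections import deque
from itertools import combinations

def bfs_path(adj, src, dst):
    """Edges along one shortest path from src to dst (ties broken arbitrarily)."""
    prev = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            break
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                q.append(v)
    path, node = [], dst
    while prev.get(node) is not None:
        path.append(frozenset((node, prev[node])))
        node = prev[node]
    return path

def most_central_edge(nodes, adj):
    """The edge appearing on the most shortest paths (highest betweenness)."""
    counts = {}
    for s, t in combinations(nodes, 2):
        for e in bfs_path(adj, s, t):
            counts[e] = counts.get(e, 0) + 1
    return max(counts, key=counts.get)

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {n: [] for n in range(6)}
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

bridge = most_central_edge(range(6), adj)
print(sorted(bridge))  # [2, 3] -- the bridge carries all cross-triangle paths
```

Removing that edge and repeating yields the nested community structure; the actual call network is directed and weighted, which the real algorithm accounts for.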
So the edge betweenness algorithm comes up with fewer communities, i.e. namespaces in this context. Let's see how the modularity score compares:
In [27]:
mod.objective(tg, eb_membership)
Out[27]:
If this score is lower than our actual baseline, then the computational community structure may represent an improvement over the current structure. Which namespaces have changed groups? We may wish to refactor the code to reflect this structure.
If the edge betweenness modularity score is higher than our baseline, that serves as a quantitative defense of our current design.