Organizing the system by scoring coupling and cohesion

Intuition

Ordering by group/module gives us a visual indication of how well the system achieves the design goal of loosely coupled, highly cohesive modules. We can quantify this idea.

Clustering is a type of assignment problem seeking the optimal allocation of N components to M clusters. One of the prominent heuristics of system architecting is to choose modules such that they are as independent as possible: low coupling and high cohesion.

We can score a clustering with an objective function that considers both the sizes of the clusters ($C_i$) and the number of interactions that fall outside the clusters ($I_0$), according to the following equation, where typical weightings are $\alpha = 1$, $\beta = 10$ or $\alpha = 10$, $\beta = 100$, and $M$ is the number of clusters:

$Obj = \alpha \sum_{i=1}^{M}C_i^2 + \beta I_0$

Clustering objectives work against two competing extremes:

  • M=1 => We want to minimize the size of the largest modules; otherwise, we could just take the trivial result of putting everything into one module.
  • M=N => We want to minimize the number and/or strength of interactions among components that cross module boundaries. As components are split across more modules, more and more interactions must cross module boundaries.

The objective function can be evaluated for any number of potential designs, whether created manually or automatically. This provides a real-time feedback loop on the potential quality of a design. The function's range is bounded by the two extremes above. Your job as an architect and designer is to minimize this function while preserving semantically meaningful modules.
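To make the scoring concrete, here is a minimal sketch of this objective in plain Python. The function name `score_clustering` and the toy edge list are illustrative, not part of any library's API:

```python
from collections import Counter

def score_clustering(edges, membership, alpha=1, beta=10):
    """Obj = alpha * sum(C_i^2) + beta * I_0, where C_i is the size of
    cluster i and I_0 counts edges that cross cluster boundaries."""
    cluster_sizes = Counter(membership.values())
    size_term = alpha * sum(c * c for c in cluster_sizes.values())
    crossing = sum(1 for u, v in edges if membership[u] != membership[v])
    return size_term + beta * crossing

# A 4-node cycle illustrates the two extremes:
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
one_module = {n: 0 for n in "abcd"}                   # M = 1
all_separate = {n: i for i, n in enumerate("abcd")}   # M = N

score_clustering(edges, one_module)    # 1*4^2 + 10*0 = 16: size term dominates
score_clustering(edges, all_separate)  # 1*(4*1^2) + 10*4 = 44: crossing term dominates
```

A good clustering lands between these extremes by keeping modules small while keeping most edges internal.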

For more information, see Eppinger & Browning, Design Structure Matrix Methods and Applications, MIT Press, Cambridge, 2012, p. 25.

Scoring lein-topology

Let's start by loading the included sample network data from lein-topology:


In [2]:
import sand

network_collection = "lein-topology"
network_name = "57af741"

data_path = "./data/" + network_collection + "-" + network_name

edge_file = data_path + ".csv"
edgelist = sand.csv_to_dicts(edge_file, header=['source', 'target', 'weight'])
g = sand.from_edges(edgelist)
g.summary()


Out[2]:
'IGRAPH DNW- 107 206 -- \n+ attr: group (v), indegree (v), label (v), name (v), outdegree (v), weight (e)'

Namespaces are the modules of the system and will be used in the modularity score:


In [16]:
g.vs['group'] = sand.fqn_to_groups(g.vs['label'])

In [17]:
len(set(g.vs['group']))


Out[17]:
20
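Conceptually, `fqn_to_groups` assigns each vertex a group id derived from the namespace portion of its fully-qualified name. A minimal sketch of the idea, assuming Clojure-style `namespace/function` labels; the helper name below is hypothetical, not sand's implementation:

```python
def fqns_to_group_ids(fqns):
    """Map fully-qualified names to integer group ids, one id per namespace."""
    namespaces = {}
    for fqn in fqns:
        ns = fqn.rsplit("/", 1)[0]          # "topology.core/go" -> "topology.core"
        namespaces.setdefault(ns, len(namespaces))
    return [namespaces[fqn.rsplit("/", 1)[0]] for fqn in fqns]

fqns_to_group_ids(["topology.core/go", "topology.core/run", "clojure.core/map"])
# [0, 0, 1]: the two topology.core functions share a group
```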

For us to apply this scoring methodology meaningfully, we'll make a couple of simplifying assumptions:

  • clojure.core functions aren't moving to a different namespace.
  • tests shouldn't factor into the score of how the production code is organized.

With these, we can apply the filtering from above a bit more strictly to get an even smaller subgraph of the function call network:


In [18]:
v_to_keep = g.vs(lambda v: 'topology' in v['label'] and 'test' not in v['label'])
tg = g.subgraph(v_to_keep)

# Recompute degree after building the subgraph:
tg.vs['indegree'] = tg.degree(mode="in")
tg.vs['outdegree'] = tg.degree(mode="out")

tg.summary()


Out[18]:
'IGRAPH DNW- 19 18 -- \n+ attr: group (v), indegree (v), label (v), name (v), outdegree (v), weight (e)'

The baseline modularity score of lein-topology's core function dependency graph is:


In [19]:
%load_ext autoreload
%autoreload 2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

In [20]:
import sand.modularity as mod

mod.objective(tg, tg.vs['group'])


Out[20]:
121

Where is this on the range of possibilities?

Suppose all functions were in the same namespace. We'll simulate this by setting the group membership vector to all 1's:


In [21]:
mod.objective(tg, [1 for _ in range(len(tg.vs))])


Out[21]:
361

This is the degenerate case of M=1: with $\alpha = 1$ and no boundary-crossing edges, the objective function simply returns the square of the number of vertices:


In [22]:
len(tg.vs) * len(tg.vs)


Out[22]:
361

The other extreme is M=N, or all functions in their own namespaces. We can simulate this by providing a unique group membership id for each vertex:


In [23]:
mod.objective(tg, range(len(tg.vs)))


Out[23]:
199
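This value follows directly from the objective function. With $M = N$, every cluster has size 1 and all 18 edges cross a module boundary, so with $\alpha = 1$ and $\beta = 10$ (the weights consistent with the scores above):

$Obj = \alpha \sum_{i=1}^{N} 1^2 + \beta I_0 = 1 \cdot 19 + 10 \cdot 18 = 199$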

Finally, we can compare our actual modularity score to a computational result. We can use Girvan-Newman edge-betweenness community detection to generate a modular design based on the network structure alone:


In [25]:
eb_membership = sand.edge_betweenness(tg, directed=True)
len(set(eb_membership))


Out[25]:
4

In [26]:
len(set(tg.vs['group']))


Out[26]:
7

So the edge betweenness algorithm comes up with fewer communities, i.e. namespaces in this context. Let's see how the modularity score compares:


In [27]:
mod.objective(tg, eb_membership)


Out[27]:
133

If this score is lower than our actual baseline, then the computational community structure may represent an improvement over the current structure. Which namespaces have changed groups? We may wish to refactor the code to reflect this structure.
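To answer that question, we can group vertex labels by their suggested community and compare against the namespace-derived groups. A small sketch, assuming parallel lists of labels and memberships like those above; the helper name and toy data are illustrative:

```python
from collections import defaultdict

def communities_by_label(labels, membership):
    """Group labels by community id so each proposed module can be inspected."""
    groups = defaultdict(list)
    for label, community in zip(labels, membership):
        groups[community].append(label)
    return dict(groups)

# Toy data standing in for tg.vs['label'], tg.vs['group'], and eb_membership:
labels = ["topology.core/go", "topology.core/run", "topology.deps/read"]
suggested = [0, 1, 1]   # a hypothetical algorithmic membership

communities_by_label(labels, suggested)
# {0: ['topology.core/go'], 1: ['topology.core/run', 'topology.deps/read']}
```

Reading the two groupings side by side shows which functions the algorithm would move out of their current namespaces.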

If the edge betweenness modularity score is higher than our baseline, this fact acts as a quantitative defense of our design.

The novelty here is receiving an algorithmic recommendation about how to improve the organization of the code.