ACM Digital Library bibliometric analysis of legacy software

An incomplete bibliographical inquiry into what the ACM Digital Library has to say about legacy.

Basically two research questions come to mind immediately. Firstly, have legacy-related publications been on the rise? Secondly, what subtopics can be identified?

The first question could afford analysing how knowledge gets captured into new concepts and practices, e.g. refactoring or SOA. The second question could afford validation against qualitative methods.


In [3]:
import pandas as pd
import networkx as nx
import community
import itertools
import matplotlib.pyplot as plt
import numpy as np
import re
%matplotlib inline

Data loading, sanitization and massage

Search "legacy" at ACM Digital Library. Just a simple search, to which the web interface gives 1541 results in mid November 2016. The number of items in the library is ~460000.

A CSV is downloaded from ACM DL, in default sorting order of the library's own idea of relevance... whatever that means for them. BibTeX is also available.


In [4]:
legacybib = pd.read_csv("ACMDL201612108240806.csv")

The available data columns are


In [9]:
legacybib.columns


Out[9]:
Index(['type', 'id', 'author', 'editor', 'advisor', 'note', 'title', 'pages',
       'article_no', 'num_pages', 'keywords', 'doi', 'journal', 'issue_date',
       'volume', 'issue_no', 'description', 'month', 'year', 'issn',
       'booktitle', 'acronym', 'edition', 'isbn', 'conf_loc', 'publisher',
       'publisher_loc'],
      dtype='object')

A peek at the topmost data items.


In [10]:
legacybib.head(3)


Out[10]:
type id author editor advisor note title pages article_no num_pages ... month year issn booktitle acronym edition isbn conf_loc publisher publisher_loc
0 article 505877 Sakib Abdul Mondal and Kingshuk Das Gupta NaN NaN NaN Choosing a Middleware for Web-integration of a... 50--53 NaN 4.0 ... May 2000.0 0163-5948 NaN NaN NaN NaN NaN ACM New York, NY, USA
1 article 2487308 Brandon Kyle Phillips and Sherry Ryan and Gin... NaN NaN NaN Motivating Students to Acquire Mainframe Skills 73--78 NaN 6.0 ... NaN 2013.0 NaN Proceedings of the 2013 Annual Conference on C... SIGMIS-CPR '13 NaN 978-1-4503-1975-1 Cincinnati, Ohio, USA ACM New York, NY, USA
2 article 2048229 Dennis Mancl and Steven D. Fraser and Bill O... NaN NaN NaN Workshop: Beyond Green-field Software Developm... 321--322 NaN 2.0 ... NaN 2011.0 NaN Proceedings of the ACM International Conferenc... OOPSLA '11 NaN 978-1-4503-0942-4 Portland, Oregon, USA ACM New York, NY, USA

3 rows × 27 columns

Does the id field uniquely identify items on the search list? If so, using it as index could be a good idea.


In [11]:
assert legacybib.id.duplicated().sum() == 0, legacybib.id.duplicated().sum()


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-11-2f98843d8004> in <module>()
----> 1 assert legacybib.id.duplicated().sum() == 0, legacybib.id.duplicated().sum()

AssertionError: 77

In [5]:
legacybib[legacybib.id.duplicated(keep=False)].head(2 * 2)


Out[5]:
type id author editor advisor note title pages article_no num_pages ... month year issn booktitle acronym edition isbn conf_loc publisher publisher_loc
21 article 1352602 Carsten Weinhold and Hermann H&#228;rtig NaN NaN NaN VPFS: Building a Virtual Private File System w... 81--93 NaN 13.0 ... April 2008.0 0163-5980 NaN NaN NaN NaN NaN ACM New York, NY, USA
22 article 1352602 Carsten Weinhold and Hermann H&#228;rtig NaN NaN NaN VPFS: Building a Virtual Private File System w... 81--93 NaN 13.0 ... NaN 2008.0 NaN Proceedings of the 3rd ACM SIGOPS/EuroSys Euro... Eurosys '08 NaN 978-1-60558-013-5 Glasgow, Scotland UK ACM New York, NY, USA
35 article 1402988 Maxim Podlesny and Sergey Gorinsky NaN NaN NaN Rd Network Services: Differentiation Through P... 255--266 NaN 12.0 ... August 2008.0 0146-4833 NaN NaN NaN NaN NaN ACM New York, NY, USA
36 article 1402988 Maxim Podlesny and Sergey Gorinsky NaN NaN NaN Rd Network Services: Differentiation Through P... 255--266 NaN 12.0 ... NaN 2008.0 NaN Proceedings of the ACM SIGCOMM 2008 Conference... SIGCOMM '08 NaN 978-1-60558-175-0 Seattle, WA, USA ACM New York, NY, USA

4 rows × 27 columns

OK, so a duplicated id refers to the same publication exported twice, once with journal issue metadata and once with conference proceedings metadata, so the id field can be used for deduplication. Who knows why the items appear twice in the downloaded list.
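
If deduplication were wanted, it could be done on the id field; a minimal sketch, not applied below since the rest of the notebook works on the list as downloaded:

legacybib_unique = legacybib.drop_duplicates(subset="id", keep="first")
len(legacybib_unique)  # should match len(legacybib.id.unique())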

What datatypes did Pandas infer from the CSV?


In [14]:
legacybib.dtypes


Out[14]:
type              object
id                object
author            object
editor            object
advisor          float64
note              object
title             object
pages             object
article_no       float64
num_pages        float64
keywords          object
doi               object
journal           object
issue_date        object
volume            object
issue_no          object
description       object
month             object
year             float64
issn              object
booktitle         object
acronym           object
edition          float64
isbn              object
conf_loc          object
publisher         object
publisher_loc     object
dtype: object

Massage the keywords into lists. Note that ''.split(',') returns [''], hence the little if filter in there.


In [6]:
legacybib.keywords.fillna('', inplace=True)
legacybib.keywords = legacybib.keywords.map(lambda l: [k.lower().strip() for k in l.split(',') if k])

Are any items missing the year?


In [7]:
legacybib[legacybib.year.isnull()].year


Out[7]:
116   NaN
561   NaN
Name: year, dtype: float64

Complementary data

To contextualize the legacy search results, get the number of total publications in ACM per year.

These were semimanually extracted from the ACM DL search results listing DOM, with the following Javascript

acmYearly = {};
theChartData.labels.forEach(function(y) {acmYearly[y] = theChartData.datasets[0].data[theChartData.labels.indexOf(y)]});
console.log(acmYearly);

In [8]:
acmPerYearData = { 1951: 43, 1952: 77, 1953: 34, 1954: 71, 1955: 72, 1956: 162, 1957: 144, 1958: 234, 1959: 335,
              1960: 302, 1961: 521, 1962: 519, 1963: 451, 1964: 537, 1965: 561, 1966: 633, 1967: 754, 1968: 669, 1969: 907,
              1970: 800, 1971: 1103, 1972: 1304, 1973: 1704, 1974: 1698, 1975: 1707, 1976: 2086, 1977: 1943, 1978: 2235, 1979: 1687,
              1980: 2152, 1981: 2241, 1982: 2578, 1983: 2485, 1984: 2531, 1985: 2608, 1986: 3143, 1987: 3059, 1988: 3827, 1989: 4155,
              1990: 4313, 1991: 4551, 1992: 5019, 1993: 5107, 1994: 5939, 1995: 6179, 1996: 6858, 1997: 7181, 1998: 8003, 1999: 7628,
              2000: 9348, 2001: 8691, 2002: 10965, 2003: 11624, 2004: 14493, 2005: 16715, 2006: 19222, 2007: 19865, 2008: 21631, 2009: 23827,
              2010: 27039, 2011: 25985, 2012: 27737, 2013: 25832, 2014: 26928, 2015: 27131, 2016: 25557, 2017: 39}
acmPerYear = pd.Series(acmPerYearData)

Data overview

Let's check what percentage of the 1541 search results reported by the web interface we actually received in the download. It would be great if this were 100%.


In [18]:
round(len(legacybib) / 1541 * 100, 2)


Out[18]:
69.89

With the above peek at the ID field, how many unique items did we receive in the download?


In [19]:
len(legacybib.id.unique())


Out[19]:
1000

Ok capped at 1000 I guess, which brings the percentage of the website search results available to us down to


In [20]:
round(len(legacybib.id.unique()) / 1541 * 100, 2)


Out[20]:
64.89

Data exploration

Fraction of published items per year which ACM identifies as relevant for legacy search.

Histogram of publication years


In [21]:
legacybib.year.hist(bins=int(legacybib.year.max() - legacybib.year.min()), figsize=(10,2))


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x10325d470>

What about the ACM Digital Library total, what does its profile look like over time?


In [22]:
acmPerYear.plot(figsize=(10, 2))


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x10996d860>

Similar overall shape, which isn't a surprise. Overlay the two, with the total arbitrarily scaled down by a factor of 0.003.


In [23]:
#plt.hist(legacybib.year.dropna(), label="Year histogram")
plt.plot(legacybib.year.groupby(legacybib.year).count(), label='legacy publication')
plt.plot(acmPerYear * 0.003, label="total publications * 0.003")
plt.legend()
plt.legend(loc='best')


Out[23]:
<matplotlib.legend.Legend at 0x10908ed30>

Right, so they have a somewhat similar shape. Legacy as a concept lagged behind the overall ACM DL until it caught up through accelerating growth during the 1990s.

What about the ratio of this subset of the whole ACM DL? Has it increased or decreased over time? Ie. has the proportion of publications about legacy changed?


In [24]:
plt.plot(pd.Series(legacybib.groupby(legacybib.year).year.count() / acmPerYear), 'o')


Out[24]:
[<matplotlib.lines.Line2D at 0x109bb9f98>]

All the pre-1990 publications are:


In [25]:
legacybib[legacybib.year <= 1990][["year", "title"]].sort_values("year")


Out[25]:
year title
101 1971.0 The Legacy of MATHLAB 68
716 1981.0 V-Compiler: A Next-generation Tool for Micropr...
92 1984.0 Redocumentation: Addressing the Maintenance Le...
105 1986.0 Ada: A Life and Legacy: Dorothy Stein Book Review
329 1989.0 Saving Legacy with Objects
330 1989.0 Saving Legacy with Objects

And over 1000 publications from 1991 through 2016, the first 10 of which are


In [26]:
legacybib[legacybib.year > 1990][["year", "title"]].sort_values("year").head(10)


Out[26]:
year title
475 1991.0 On the Semantic Equivalence of Heterogeneous R...
378 1991.0 A Software Reverse Engineering Experience
665 1991.0 From Under the Rubble: Computing and the Resus...
154 1992.0 Assessing Design-quality Metrics on Legacy Sof...
984 1992.0 Experiences in Program Understanding
487 1993.0 Engineering an SQL Gateway to IMS
335 1993.0 Re-engineering Design Trade-offs in a Legacy C...
314 1993.0 The Development of a Partial Design Recovery E...
309 1993.0 Issues and Approaches for Migration/Cohabitati...
308 1993.0 Issues and Approaches for Migration/Cohabitati...

Did something happen around the 1990s, when the fraction of publications related to legacy started increasing? Let's plot a global linear regression model, as well as separate linear regression models for before and after 1990.


In [27]:
pre1990range = np.arange(legacybib.year.min(), 1991)
post1990range = np.arange(1990, legacybib.year.max())

# Linear regression models
# note the use of np.polyfit
propLm = np.polyfit(pd.Series(legacybib.groupby(legacybib.year).year.count() / acmPerYear).dropna().index, pd.Series(legacybib.groupby(legacybib.year).year.count() / acmPerYear).dropna(), 1)
pre1990 = np.polyfit(pd.Series(legacybib.groupby(legacybib.year).year.count() / acmPerYear)[pre1990range].dropna().index, pd.Series(legacybib.groupby(legacybib.year).year.count() / acmPerYear)[pre1990range].dropna(), 1)
post1990 = np.polyfit(pd.Series(legacybib.groupby(legacybib.year).year.count() / acmPerYear)[post1990range].dropna().index, pd.Series(legacybib.groupby(legacybib.year).year.count() / acmPerYear)[post1990range].dropna(), 1)

# Plot the fractions of legacy vs. all publications, the models, and a legend
plt.plot(pd.Series(legacybib.groupby(legacybib.year).year.count() / acmPerYear), 'o')
plt.plot(np.arange(legacybib.year.min(), legacybib.year.max()), np.poly1d(propLm)(np.arange(legacybib.year.min(), legacybib.year.max())), label="global lm")
plt.plot(pre1990range, np.poly1d(pre1990)(pre1990range), linestyle="dashed", label="pre 1990 lm")
plt.plot(post1990range, np.poly1d(post1990)(post1990range), linestyle="dashed", label="post 1990 lm")
plt.title("Fraction of legacy related publications against ACM")
plt.legend(loc="best")


Out[27]:
<matplotlib.legend.Legend at 0x109c3feb8>

Statistical validation of the above would be good, of course, to check against randomness.
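
One way to do that would be scipy.stats.linregress, which reports a p-value for the null hypothesis of zero slope; a minimal sketch, assuming scipy is installed:

from scipy import stats

# Fraction of legacy publications per year, post-1990 part only
prop = (legacybib.groupby(legacybib.year).year.count() / acmPerYear).dropna()
post = prop[prop.index >= 1990]
slope, intercept, r, p, stderr = stats.linregress(post.index, post.values)
print("slope", slope, "p-value", p)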

A histogram of keywords

The keywords are interesting. All keywords in this dataset are already related to legacy one way or another, since the data under inspection here is a subset of the total ACM Digital Library.

Keywords of course live a life of their own, and I guess their number keeps increasing forever.

Which keywords are popular?


In [10]:
# this could be a pandas.Series instead of dict
keywordhist = {}
for kws in legacybib.keywords:
    for k in kws:
        if k in keywordhist:
            keywordhist[k] = keywordhist[k] + 1
        else:
            keywordhist[k] = 1
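
As the comment above says, this could just as well be a pandas.Series; a sketch of the equivalent, not used below:

keywordhist_series = pd.Series([k for kws in legacybib.keywords for k in kws]).value_counts()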

How many keywords does each item have?


In [11]:
legacybib.keywords.map(lambda kws: len(kws)).describe()


Out[11]:
count    1077.000000
mean        3.065924
std         3.181662
min         0.000000
25%         0.000000
50%         3.000000
75%         5.000000
max        50.000000
Name: keywords, dtype: float64

In [30]:
plt.title("Histogram of numbers of keywords per item")
plt.hist(legacybib.keywords.map(lambda kws: len(kws)), bins=max(legacybib.keywords.map(lambda kws: len(kws))) - 1)


Out[30]:
(array([ 385.,   50.,  156.,  174.,  136.,   79.,   37.,   25.,   19.,
           6.,    4.,    2.,    0.,    0.,    1.,    0.,    0.,    0.,
           0.,    0.,    0.,    1.,    0.,    0.,    0.,    0.,    0.,
           0.,    1.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
           0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
           0.,    0.,    0.,    1.]),
 array([  0.        ,   1.02040816,   2.04081633,   3.06122449,
          4.08163265,   5.10204082,   6.12244898,   7.14285714,
          8.16326531,   9.18367347,  10.20408163,  11.2244898 ,
         12.24489796,  13.26530612,  14.28571429,  15.30612245,
         16.32653061,  17.34693878,  18.36734694,  19.3877551 ,
         20.40816327,  21.42857143,  22.44897959,  23.46938776,
         24.48979592,  25.51020408,  26.53061224,  27.55102041,
         28.57142857,  29.59183673,  30.6122449 ,  31.63265306,
         32.65306122,  33.67346939,  34.69387755,  35.71428571,
         36.73469388,  37.75510204,  38.7755102 ,  39.79591837,
         40.81632653,  41.83673469,  42.85714286,  43.87755102,
         44.89795918,  45.91836735,  46.93877551,  47.95918367,
         48.97959184,  50.        ]),
 <a list of 49 Patch objects>)

OK, almost 400 items have no keywords at all. There are also some outliers; let's inspect the ones with more than 15 keywords, which sounds excessive...


In [31]:
legacybib[legacybib.keywords.map(lambda kws: len(kws)) > 15][["id", "title", "author", "keywords"]]


Out[31]:
id title author keywords
371 372119 'TSUPDOOD?: Repackaged Problems for You and MMI Rebecca G. Bace and Marvin Schaefer [commercial products, computer interconnection...
802 2857706 Interoperability of Relationship- and Role-Bas... Syed Zain R. Rizvi and Philip W.L. Fong [access control, authorization graph, authoriz...
895 1138284 Customer Driven Innovation: Quicken&Reg; Renta... Suzanne Pellican and Matt Homier [advanced prototypes, analysis, architecture, ...

And the keyword lists for the above


In [32]:
[keywordlist for keywordlist in legacybib[legacybib.keywords.map(lambda kws: len(kws)) > 15].keywords]


Out[32]:
[['commercial products',
  'computer interconnection',
  'computer security',
  'computer takeover',
  'computer usage',
  'computer viruses',
  'confidentiality compromises',
  'data destruction',
  'information security',
  'inter-computer links',
  'interconnected systems',
  'interconnected workstations',
  'internetworking',
  'legacy systems',
  'man-machine interfaces',
  'mutual suspicion',
  'network security',
  'security breaches',
  'security of data',
  'security protection',
  'separation kernel concept',
  'service denial',
  'software distribution',
  'trapdoors',
  'uncontrolled activity',
  'user interfaces',
  'user perception',
  'willful promiscuity',
  'worms'],
 ['access control',
  'authorization graph',
  'authorization principal',
  'authorization rules',
  'constraints',
  'core rebac2015',
  'demarcation',
  'guards',
  'hybrid logic',
  'liberal-grant',
  'meap',
  'mutually exclusive authorization principals',
  'papr',
  'prerequisite authorization principal requirement',
  'principal matching',
  'rebac2015',
  'rebac2015/constraints',
  'rebac2015/demarcation',
  'relationship predicate',
  'relationship-based access control',
  'role-based access control',
  'string-grant'],
 ['advanced prototypes',
  'analysis',
  'architecture',
  'business case',
  'business strategy',
  'concept design',
  'concept evaluation',
  'concept generation',
  'concepts',
  'conceptual model',
  'customer interviews',
  'customer-driven innovation',
  'customer-focus',
  'design planning',
  'desktop software',
  'detailed design',
  'experience strategy',
  'focus groups',
  'generative research',
  'interaction design',
  'interdisciplinary design',
  'launch and learn',
  'macromedia flash',
  'marketing / market research',
  'office',
  'organizational culture',
  'organizational planning',
  'participatory design',
  'performance metrics',
  'personal finance',
  'pilot testing',
  'polish',
  'process',
  'process improvement',
  'product design',
  'product management',
  'prototyping',
  'simple',
  'tax',
  'trade-offs',
  'usability research',
  'user experience',
  'user feedback',
  'user interface design',
  'user research',
  'user studies',
  'user types',
  'user-centered design / human-centered design',
  'version 1.0',
  'visual design']]

That is excessive, but seems legit to me.

Total number of unique keywords:


In [33]:
len(keywordhist)


Out[33]:
2284

Of these, the ones that occur in 10 or more items in the subset are


In [34]:
[(k, keywordhist[k]) for k in sorted(keywordhist, key=keywordhist.get, reverse=True) if keywordhist[k] >= 10]


Out[34]:
[('legacy systems', 27),
 ('reverse engineering', 24),
 ('legacy software', 19),
 ('reengineering', 18),
 ('refactoring', 17),
 ('java', 16),
 ('migration', 14),
 ('software evolution', 14),
 ('software architecture', 12),
 ('cloud computing', 12),
 ('security', 12),
 ('legacy', 11),
 ('legacy system', 10),
 ('middleware', 10),
 ('interoperability', 10),
 ('architecture', 10),
 ('design', 10)]

and further those that occur in 3 to 9 items


In [35]:
[(k, keywordhist[k]) for k in sorted(keywordhist, key=keywordhist.get, reverse=True) if keywordhist[k] < 10 and keywordhist[k] >= 3]


Out[35]:
[('c', 9),
 ('software engineering', 9),
 ('soa', 9),
 ('performance', 9),
 ('legacy code', 8),
 ('reuse', 8),
 ('code generation', 7),
 ('software product lines', 7),
 ('measurement', 7),
 ('virtualization', 7),
 ('corba', 7),
 ('trusted computing', 7),
 ('parallelism', 7),
 ('fortran', 7),
 ('c++', 7),
 ('multimedia', 7),
 ('transactional memory', 7),
 ('sdn', 6),
 ('personalization', 6),
 ('integration', 6),
 ('openflow', 6),
 ('usability', 6),
 ('static analysis', 6),
 ('evolution', 5),
 ('metadata', 5),
 ('death', 5),
 ('digital legacy', 5),
 ('semantic web', 5),
 ('storage', 5),
 ('reflection', 5),
 ('type inference', 5),
 ('privacy', 5),
 ('patterns', 5),
 ('android', 5),
 ('user experience', 5),
 ('memory', 5),
 ('eclipse', 5),
 ('software visualization', 4),
 ('late launch', 4),
 ('web services', 4),
 ('secure execution', 4),
 ('software reengineering', 4),
 ('object-oriented', 4),
 ('maintenance', 4),
 ('transformation', 4),
 ('information retrieval', 4),
 ('embedded systems', 4),
 ('software reuse', 4),
 ('compilers', 4),
 ('mapping', 4),
 ('testing', 4),
 ('bytecode', 4),
 ('software-defined networking', 4),
 ('business process', 4),
 ('gpu', 4),
 ('domain-specific languages', 4),
 ('concurrency', 4),
 ('open source', 4),
 ('evolutionary computation', 4),
 ('case study', 4),
 ('software architectures', 4),
 ('fault localization', 4),
 ('energy efficiency', 4),
 ('hierarchical scheduling', 4),
 ('software maintenance', 4),
 ('assembly code', 4),
 ('web applications', 4),
 ('machine learning', 4),
 ('facebook', 4),
 ('real-time systems', 4),
 ('mobility', 4),
 ('software product line', 4),
 ('simulation', 4),
 ('relational databases', 4),
 ('architecture recovery', 4),
 ('service oriented architecture', 4),
 ('software product line engineering', 4),
 ('mobile devices', 3),
 ('component', 3),
 ('middlebox', 3),
 ('atomicity', 3),
 ('clustering', 3),
 ('parallel simulation', 3),
 ('social network sites', 3),
 ('annotations', 3),
 ('runtime systems', 3),
 ('software modernization', 3),
 ('gpu acceleration', 3),
 ('web service', 3),
 ('legacy integration', 3),
 ('domain-specific language', 3),
 ('software migration', 3),
 ('legacy modernization', 3),
 ('imap', 3),
 ('certification', 3),
 ('gcc', 3),
 ('features', 3),
 ('groupware', 3),
 ('ontology', 3),
 ('relaxed memory models', 3),
 ('context-awareness', 3),
 ('cscw', 3),
 ('model-driven engineering', 3),
 ('cross-site scripting', 3),
 ('802.11', 3),
 ('uml', 3),
 ('quality of service', 3),
 ('fence placement', 3),
 ('inheritance', 3),
 ('opencl', 3),
 ('framework', 3),
 ('frameworks', 3),
 ('software quality', 3),
 ('foreign function interface', 3),
 ('multicore', 3),
 ('rest', 3),
 ('modernization', 3),
 ('stewardship', 3),
 ('experimentation', 3),
 ('user interface design', 3),
 ('social media', 3),
 ('tools', 3),
 ('innovation', 3),
 ('dns', 3),
 ('identity management', 3),
 ('context-free grammars', 3),
 ('real-time', 3),
 ('high-performance computing', 3),
 ('behavioral intervals', 3),
 ('rootkit detection', 3),
 ('software clustering', 3),
 ('re-engineering', 3),
 ('automated program repair', 3),
 ('languages', 3),
 ('memcached', 3),
 ('dtn', 3),
 ('heterogeneous', 3),
 ('aop', 3),
 ('ims', 3),
 ('portability', 3),
 ('method', 3),
 ('schema evolution', 3),
 ('parallelization', 3),
 ('dynamic analysis', 3),
 ('libraries', 3),
 ('risk management', 3),
 ('interaction design', 3),
 ('resource management', 3),
 ('design patterns', 3),
 ('software execution cost analysis', 3),
 ('parallel programming', 3),
 ('probes', 3),
 ('mde', 3),
 ('linux', 3),
 ('open standards', 3),
 ('incremental deployment', 3),
 ('model driven engineering', 3),
 ('system-level timing validation', 3),
 ('modularization', 3),
 ('personal information management', 3)]

Of the remainder, the number of keywords which appear in only two items is


In [36]:
len([k for k in keywordhist if keywordhist[k] == 2])


Out[36]:
305

and in only one item


In [37]:
len([k for k in keywordhist if keywordhist[k] == 1])


Out[37]:
1802

Keywords starting with 'legacy' (re.match anchors at the start of the string)


In [12]:
sorted([(k, keywordhist[k]) for k in keywordhist if re.match("legacy", k)], key=lambda k: k[1], reverse=True)


Out[12]:
[('legacy systems', 27),
 ('legacy software', 19),
 ('legacy', 11),
 ('legacy system', 10),
 ('legacy code', 8),
 ('legacy modernization', 3),
 ('legacy integration', 3),
 ('legacy programs', 2),
 ('legacy infrastructure', 2),
 ('legacy traffic', 2),
 ('legacy reuse', 2),
 ('legacy data', 2),
 ('legacy system maintenance', 1),
 ('legacy devices', 1),
 ('legacy information systems', 1),
 ('legacy bias', 1),
 ('legacy technology', 1),
 ('legacy networks', 1),
 ('legacy system analysis', 1),
 ('legacy models', 1),
 ('legacy database', 1),
 ('legacy software product lines', 1),
 ('legacy study', 1),
 ('legacy data conversion', 1),
 ('legacy contact', 1),
 ('legacy code wrapping', 1),
 ('legacy support', 1),
 ('legacy assets mining', 1),
 ('legacy file formats', 1),
 ('legacy migration', 1),
 ('legacy application', 1),
 ('legacy applications', 1),
 ('legacy system integration', 1),
 ('legacy c program parallelization', 1),
 ('legacy document conversion', 1),
 ('legacy systems analysis', 1)]

Network analysis of keywords

The keywords, already massaged into lists above, are pulled out into a co-occurrence graph: two keywords are connected if they appear on the same item.

An analysis of which keywords are actually plentiful, their temporal distribution, centrality metrics, subgraph overlap etc. would be great; a first sketch of a centrality computation follows after the graph has been built below.


In [13]:
keywordg = nx.Graph()
legacybib.keywords.map(lambda item: keywordg.add_edges_from([p for p in itertools.permutations(item, 2)]), na_action='ignore')
print("Number of components", len([comp for comp in nx.connected_components(keywordg)]))
print("Largest ten components sizes", sorted([len(comp) for comp in nx.connected_components(keywordg)], reverse=True)[:10])


Number of components 156
Largest ten components sizes [1624, 13, 11, 11, 11, 10, 9, 9, 9, 9]
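
As a first stab at the centrality metrics mentioned above, a sketch of a degree centrality computation on the dominant component (networkx 1.x API, as used elsewhere in this notebook):

# Degree centrality of the largest connected component
largest = max(nx.connected_component_subgraphs(keywordg), key=len)
pd.Series(nx.degree_centrality(largest)).sort_values(ascending=False).head(5)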

So there is one dominant component and 155 small ones. It's best to explore them interactively with Gephi.


In [40]:
nx.write_gexf(keywordg, "keywordg.gexf")

Degree distribution of the keyword graph, ie. are there a few nodes with huge degree and then a large number of nodes with fewer connections, as in a power-law network? Additionally, let's see where the keywords starting with the word legacy are placed, by indicating their degrees with green vertical lines. In the left diagram below, hubs are towards the right.


In [14]:
fig, (ax1, ax2) = plt.subplots(1,2)
fig.set_size_inches(10, 2)
ax1.set_title("Keyword degree histogram")
ax1.plot(nx.degree_histogram(keywordg))
ax1.vlines([keywordg.degree(l) for l in keywordg if re.match('legacy', l)], ax1.get_ylim()[0], ax1.get_ylim()[1], colors='green')
ax2.set_title("Keyword degree diagram, log/log")
ax2.loglog(nx.degree_histogram(keywordg))


Out[14]:
[<matplotlib.lines.Line2D at 0x10413ad30>]

Eyeballing the above, most of the legacy keywords are where the mass of the distribution is, ie. at low degrees. One of the legacy nodes is a top hub, and there are some in the mid-ranges.
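
To pin down where exactly the legacy keywords sit, their degrees can be listed directly; a sketch:

# Degrees of the keywords starting with 'legacy', highest first
sorted(((k, keywordg.degree(k)) for k in keywordg if re.match('legacy', k)),
       key=lambda t: t[1], reverse=True)[:5]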

The top 3 keywords with the highest degree, ie. towards the right of the above graph are:


In [42]:
keywordgDegrees = pd.Series(keywordg.degree()).sort_values(ascending=False)
keywordgDegrees.head(3)


Out[42]:
legacy systems         132
architecture            85
reverse engineering     81
dtype: int64

Let's plot out the top hub's neighbourhood.


In [43]:
def plotNeighborhood(graph, ego, color = "green", includeEgo = False):
    """
    Plot neighbourhood of keyword in graph, after possibly removing the ego.
    
    graph : networkx.Graph-like graph
        The graph to get the neighbourhood from
    ego : node in graph
        The node whose neighbourhood to plot
    color : string
        Name of the color to use for plotting
    includeEgo : bool
        Include the ego node
        
    The function defaults to removing the ego node, because by definition
    it is connected to each of the nodes in the subgraph. With the ego
    removed, the result basically tells how the neighbours are connected
    with one another.
    """
    plt.rcParams["figure.figsize"] = (10, 10)
    subgraph = nx.Graph()
    if includeEgo:
        subgraph = graph.subgraph(graph.neighbors(ego) + [ego])
    else:
        subgraph = graph.subgraph(graph.neighbors(ego))
    plt.title("Neighbourhood of " + ego + " (" + str(len(subgraph)) + ")")
    plt.axis('off')
    pos = nx.spring_layout(subgraph, k = 1/sqrt(len(subgraph) * 2))
    nx.draw_networkx(subgraph,
                     pos = pos,
                     font_size = 9,
                     node_color = color,
                     alpha = 0.8,
                     edge_color = "light" + color)
    plt.show()

In [45]:
plotNeighborhood(keywordg, "legacy systems")



In [48]:
plotNeighborhood(keywordg, "legacy software")


Communities

Community detection with the Louvain algorithm, explained in Blondel, Guillaume, Lambiotte, Lefebvre: Fast unfolding of communities in large networks (2008). For weighted networks, the modularity of a partition is $Q = \frac{1}{2m}\sum_{i, j} \Big[A_{ij} - \frac{k_i k_j}{2m}\Big] \delta(c_i, c_j)$, where $A_{ij}$ is the edge weight between nodes $i$ and $j$, $k_i = \sum_j A_{ij}$ is the sum of the weights of the edges attached to node $i$, $c_i$ is the community of $i$, $\delta(c_i, c_j)$ is 1 if $c_i = c_j$ and 0 otherwise, and $m = \frac{1}{2}\sum_{i,j}A_{ij}$.
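
The modularity $Q$ of the partition found can also be read off directly from the python-louvain package (imported above as community); a minimal sketch on the dominant component:

# Modularity Q of the best partition on the largest connected component
largest = max(nx.connected_component_subgraphs(keywordg), key=len)
partition = community.best_partition(largest)
community.modularity(partition, largest)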


In [46]:
def plotCommunities(graph):
    """Plot community information from a graph.
    
    Basically just copied from http://perso.crans.org/aynaud/communities/index.html
    at this point, while in development
    """
    # zoom in on something, for dev. purposes
    graph = graph.subgraph(graph.neighbors('legacy software'))
    # graph = [c for c in nx.connected_component_subgraphs(graph)][0]
    graph = max(nx.connected_component_subgraphs(graph), key=len) # I love you Python
    partition = community.best_partition(graph)
    size = float(len(set(partition.values())))
    pos = nx.spring_layout(graph)
    count = 0
    for com in set(partition.values()):
        count = count + 1
        list_nodes = [nodes for nodes in partition.keys() if partition[nodes] == com]
        plt.axis('off')
        nx.draw_networkx_nodes(graph, pos, list_nodes, node_size = 40, node_color = str(count/size), alpha=0.4)
        nx.draw_networkx_labels(graph, pos, font_size = 9)
        
    nx.draw_networkx_edges(graph, pos, alpha=0.1)
    plt.show()

In [45]:
plotCommunities(keywordg)