ACM Digital Library bibliometric analysis of legacy software

An incomplete bibliographical inquiry into what the ACM Digital Library has to say about legacy.

Basically two research questions come to mind immediately. Firstly, have legacy-related publications been on the rise? Secondly, what subtopics can be identified?

The first question could afford analysing how knowledge gets captured into new concepts and practices, e.g. refactoring or SOA. The second question could afford validation against qualitative methods.


In [3]:
import pandas as pd
import networkx as nx
import community
import itertools
import matplotlib.pyplot as plt
import numpy as np
import re
%matplotlib inline

Data loading, sanitization and massage

Search "legacy" at ACM Digital Library. Just a simple search, to which the web interface gives 1541 results in mid November 2016. The number of items in the library is ~460000.

A CSV is downloaded from ACM DL, in default sorting order of the library's own idea of relevance... whatever that means for them. BibTeX is also available.


In [4]:
legacybib = pd.read_csv("ACMDL201612108240806.csv")

The available data columns are


In [9]:
legacybib.columns


Out[9]:
Index(['type', 'id', 'author', 'editor', 'advisor', 'note', 'title', 'pages',
       'article_no', 'num_pages', 'keywords', 'doi', 'journal', 'issue_date',
       'volume', 'issue_no', 'description', 'month', 'year', 'issn',
       'booktitle', 'acronym', 'edition', 'isbn', 'conf_loc', 'publisher',
       'publisher_loc'],
      dtype='object')

A peek at the topmost data items.


In [10]:
legacybib.head(3)


Out[10]:
type id author editor advisor note title pages article_no num_pages ... month year issn booktitle acronym edition isbn conf_loc publisher publisher_loc
0 article 505877 Sakib Abdul Mondal and Kingshuk Das Gupta NaN NaN NaN Choosing a Middleware for Web-integration of a... 50--53 NaN 4.0 ... May 2000.0 0163-5948 NaN NaN NaN NaN NaN ACM New York, NY, USA
1 article 2487308 Brandon Kyle Phillips and Sherry Ryan and Gin... NaN NaN NaN Motivating Students to Acquire Mainframe Skills 73--78 NaN 6.0 ... NaN 2013.0 NaN Proceedings of the 2013 Annual Conference on C... SIGMIS-CPR '13 NaN 978-1-4503-1975-1 Cincinnati, Ohio, USA ACM New York, NY, USA
2 article 2048229 Dennis Mancl and Steven D. Fraser and Bill O... NaN NaN NaN Workshop: Beyond Green-field Software Developm... 321--322 NaN 2.0 ... NaN 2011.0 NaN Proceedings of the ACM International Conferenc... OOPSLA '11 NaN 978-1-4503-0942-4 Portland, Oregon, USA ACM New York, NY, USA

3 rows × 27 columns

Does the id field uniquely identify items on the search list? If so, using it as index could be a good idea.


In [11]:
assert legacybib.id.duplicated().sum() == 0, legacybib.id.duplicated().sum()


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-11-2f98843d8004> in <module>()
----> 1 assert legacybib.id.duplicated().sum() == 0, legacybib.id.duplicated().sum()

AssertionError: 77

In [5]:
legacybib[legacybib.id.duplicated(keep=False)].head(2 * 2)


Out[5]:
type id author editor advisor note title pages article_no num_pages ... month year issn booktitle acronym edition isbn conf_loc publisher publisher_loc
21 article 1352602 Carsten Weinhold and Hermann H&#228;rtig NaN NaN NaN VPFS: Building a Virtual Private File System w... 81--93 NaN 13.0 ... April 2008.0 0163-5980 NaN NaN NaN NaN NaN ACM New York, NY, USA
22 article 1352602 Carsten Weinhold and Hermann H&#228;rtig NaN NaN NaN VPFS: Building a Virtual Private File System w... 81--93 NaN 13.0 ... NaN 2008.0 NaN Proceedings of the 3rd ACM SIGOPS/EuroSys Euro... Eurosys '08 NaN 978-1-60558-013-5 Glasgow, Scotland UK ACM New York, NY, USA
35 article 1402988 Maxim Podlesny and Sergey Gorinsky NaN NaN NaN Rd Network Services: Differentiation Through P... 255--266 NaN 12.0 ... August 2008.0 0146-4833 NaN NaN NaN NaN NaN ACM New York, NY, USA
36 article 1402988 Maxim Podlesny and Sergey Gorinsky NaN NaN NaN Rd Network Services: Differentiation Through P... 255--266 NaN 12.0 ... NaN 2008.0 NaN Proceedings of the ACM SIGCOMM 2008 Conference... SIGCOMM '08 NaN 978-1-60558-175-0 Seattle, WA, USA ACM New York, NY, USA

4 rows × 27 columns

OK, so a duplicated id refers to the same publication exported twice, once with journal issue metadata and once with conference proceedings metadata, so the id field can be used for deduplication. Who knows why the items appear twice in the downloaded list.
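
If deduplication were wanted, it could be done on the id field; a minimal sketch, not applied below since the rest of the notebook works on the list as downloaded:

legacybib_unique = legacybib.drop_duplicates(subset="id", keep="first")
len(legacybib_unique)  # should match len(legacybib.id.unique())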

What datatypes did Pandas infer from the CSV?


In [14]:
legacybib.dtypes


Out[14]:
type              object
id                object
author            object
editor            object
advisor          float64
note              object
title             object
pages             object
article_no       float64
num_pages        float64
keywords          object
doi               object
journal           object
issue_date        object
volume            object
issue_no          object
description       object
month             object
year             float64
issn              object
booktitle         object
acronym           object
edition          float64
isbn              object
conf_loc          object
publisher         object
publisher_loc     object
dtype: object

Massage the keywords into lists. Note that ''.split(',') returns [''], hence the little if filter in there.


In [6]:
legacybib.keywords.fillna('', inplace=True)
legacybib.keywords = legacybib.keywords.map(lambda l: [k.lower().strip() for k in l.split(',') if k])

Are any items missing the year?


In [7]:
legacybib[legacybib.year.isnull()].year


Out[7]:
116   NaN
561   NaN
Name: year, dtype: float64

Complementary data

To contextualize the legacy search results, get the number of total publications in ACM per year.

These were semimanually extracted from the ACM DL search results listing DOM, with the following Javascript

acmYearly = {};
theChartData.labels.forEach(function(y) {acmYearly[y] = theChartData.datasets[0].data[theChartData.labels.indexOf(y)]});
console.log(acmYearly);

In [8]:
acmPerYearData = { 1951: 43, 1952: 77, 1953: 34, 1954: 71, 1955: 72, 1956: 162, 1957: 144, 1958: 234, 1959: 335,
              1960: 302, 1961: 521, 1962: 519, 1963: 451, 1964: 537, 1965: 561, 1966: 633, 1967: 754, 1968: 669, 1969: 907,
              1970: 800, 1971: 1103, 1972: 1304, 1973: 1704, 1974: 1698, 1975: 1707, 1976: 2086, 1977: 1943, 1978: 2235, 1979: 1687,
              1980: 2152, 1981: 2241, 1982: 2578, 1983: 2485, 1984: 2531, 1985: 2608, 1986: 3143, 1987: 3059, 1988: 3827, 1989: 4155,
              1990: 4313, 1991: 4551, 1992: 5019, 1993: 5107, 1994: 5939, 1995: 6179, 1996: 6858, 1997: 7181, 1998: 8003, 1999: 7628,
              2000: 9348, 2001: 8691, 2002: 10965, 2003: 11624, 2004: 14493, 2005: 16715, 2006: 19222, 2007: 19865, 2008: 21631, 2009: 23827,
              2010: 27039, 2011: 25985, 2012: 27737, 2013: 25832, 2014: 26928, 2015: 27131, 2016: 25557, 2017: 39}
acmPerYear = pd.Series(acmPerYearData)

Data overview

Let's check what percentage of the 1541 search results reported by the web interface we actually received in the download. It would be great if this were 100%.


In [18]:
round(len(legacybib) / 1541 * 100, 2)


Out[18]:
69.89

With the above peek at the ID field, how many unique items did we receive in the download?


In [19]:
len(legacybib.id.unique())


Out[19]:
1000

Ok capped at 1000 I guess, which brings the percentage of the website search results available to us down to


In [20]:
round(len(legacybib.id.unique()) / 1541 * 100, 2)


Out[20]:
64.89

Data exploration

Fraction of published items per year which ACM identifies as relevant for legacy search.

Histogram of publication years


In [21]:
legacybib.year.hist(bins=int(legacybib.year.max() - legacybib.year.min()), figsize=(10,2))


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x10325d470>

What about the ACM Digital Library total, what does its profile look like over time?


In [22]:
acmPerYear.plot(figsize=(10, 2))


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x10996d860>

Similar overall shape, which isn't a surprise. Overlay the two, with the total arbitrarily scaled down by a factor of 0.003.


In [23]:
#plt.hist(legacybib.year.dropna(), label="Year histogram")
plt.plot(legacybib.year.groupby(legacybib.year).count(), label='legacy publication')
plt.plot(acmPerYear * 0.003, label="total publications * 0.003")
plt.legend()
plt.legend(loc='best')


Out[23]:
<matplotlib.legend.Legend at 0x10908ed30>

Right, so they have a somewhat similar shape. Legacy as a concept lagged behind the overall ACM DL until it caught up through accelerating growth during the 1990s.

What about the ratio of this subset of the whole ACM DL? Has it increased or decreased over time? Ie. has the proportion of publications about legacy changed?


In [24]:
plt.plot(pd.Series(legacybib.groupby(legacybib.year).year.count() / acmPerYear), 'o')


Out[24]:
[<matplotlib.lines.Line2D at 0x109bb9f98>]

All the pre-1990 publications are:


In [25]:
legacybib[legacybib.year <= 1990][["year", "title"]].sort_values("year")


Out[25]:
year title
101 1971.0 The Legacy of MATHLAB 68
716 1981.0 V-Compiler: A Next-generation Tool for Micropr...
92 1984.0 Redocumentation: Addressing the Maintenance Le...
105 1986.0 Ada: A Life and Legacy: Dorothy Stein Book Review
329 1989.0 Saving Legacy with Objects
330 1989.0 Saving Legacy with Objects

And over 1000 publications from 1991 through 2016, the first 10 of which are


In [26]:
legacybib[legacybib.year > 1990][["year", "title"]].sort_values("year").head(10)


Out[26]:
year title
475 1991.0 On the Semantic Equivalence of Heterogeneous R...
378 1991.0 A Software Reverse Engineering Experience
665 1991.0 From Under the Rubble: Computing and the Resus...
154 1992.0 Assessing Design-quality Metrics on Legacy Sof...
984 1992.0 Experiences in Program Understanding
487 1993.0 Engineering an SQL Gateway to IMS
335 1993.0 Re-engineering Design Trade-offs in a Legacy C...
314 1993.0 The Development of a Partial Design Recovery E...
309 1993.0 Issues and Approaches for Migration/Cohabitati...
308 1993.0 Issues and Approaches for Migration/Cohabitati...

Did something happen around the 1990s, when the fraction of publications related to legacy started increasing? Let's plot a global linear regression model, as well as separate linear regression models for before and after 1990.


In [27]:
pre1990range = np.arange(legacybib.year.min(), 1991)
post1990range = np.arange(1990, legacybib.year.max())

# Linear regression models
# note the use of np.polyfit
propLm = np.polyfit(pd.Series(legacybib.groupby(legacybib.year).year.count() / acmPerYear).dropna().index, pd.Series(legacybib.groupby(legacybib.year).year.count() / acmPerYear).dropna(), 1)
pre1990 = np.polyfit(pd.Series(legacybib.groupby(legacybib.year).year.count() / acmPerYear)[pre1990range].dropna().index, pd.Series(legacybib.groupby(legacybib.year).year.count() / acmPerYear)[pre1990range].dropna(), 1)
post1990 = np.polyfit(pd.Series(legacybib.groupby(legacybib.year).year.count() / acmPerYear)[post1990range].dropna().index, pd.Series(legacybib.groupby(legacybib.year).year.count() / acmPerYear)[post1990range].dropna(), 1)

# Plot the fractions of legacy vs. all publications, the models, and a legend
plt.plot(pd.Series(legacybib.groupby(legacybib.year).year.count() / acmPerYear), 'o')
plt.plot(np.arange(legacybib.year.min(), legacybib.year.max()), np.poly1d(propLm)(np.arange(legacybib.year.min(), legacybib.year.max())), label="global lm")
plt.plot(pre1990range, np.poly1d(pre1990)(pre1990range), linestyle="dashed", label="pre 1990 lm")
plt.plot(post1990range, np.poly1d(post1990)(post1990range), linestyle="dashed", label="post 1990 lm")
plt.title("Fraction of legacy related publications against ACM")
plt.legend(loc="best")


Out[27]:
<matplotlib.legend.Legend at 0x109c3feb8>

Statistical validation of the above would be good, of course, to check against randomness.
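
One way to do that would be scipy.stats.linregress, which reports a p-value for the null hypothesis of zero slope; a minimal sketch, assuming scipy is installed:

from scipy import stats

# Fraction of legacy publications per year, post-1990 part only
prop = (legacybib.groupby(legacybib.year).year.count() / acmPerYear).dropna()
post = prop[prop.index >= 1990]
slope, intercept, r, p, stderr = stats.linregress(post.index, post.values)
print("slope", slope, "p-value", p)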

A histogram of keywords

The keywords are interesting. All keywords in this dataset are already related to legacy one way or another, since the data under inspection here is a subset of the total ACM Digital Library.

Keywords of course live a life of their own, and I guess their number keeps increasing forever.

Which keywords are popular?


In [10]:
# this could be a pandas.Series instead of dict
keywordhist = {}
for kws in legacybib.keywords:
    for k in kws:
        if k in keywordhist:
            keywordhist[k] = keywordhist[k] + 1
        else:
            keywordhist[k] = 1
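
As the comment above says, this could just as well be a pandas.Series; a sketch of the equivalent, not used below:

keywordhist_series = pd.Series([k for kws in legacybib.keywords for k in kws]).value_counts()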

How many keywords does each item have?


In [11]:
legacybib.keywords.map(lambda kws: len(kws)).describe()


Out[11]:
count    1077.000000
mean        3.065924
std         3.181662
min         0.000000
25%         0.000000
50%         3.000000
75%         5.000000
max        50.000000
Name: keywords, dtype: float64

In [30]:
plt.title("Histogram of numbers of keywords per item")
plt.hist(legacybib.keywords.map(lambda kws: len(kws)), bins=max(legacybib.keywords.map(lambda kws: len(kws))) - 1)


Out[30]:
(array([ 385.,   50.,  156.,  174.,  136.,   79.,   37.,   25.,   19.,
           6.,    4.,    2.,    0.,    0.,    1.,    0.,    0.,    0.,
           0.,    0.,    0.,    1.,    0.,    0.,    0.,    0.,    0.,
           0.,    1.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
           0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
           0.,    0.,    0.,    1.]),
 array([  0.        ,   1.02040816,   2.04081633,   3.06122449,
          4.08163265,   5.10204082,   6.12244898,   7.14285714,
          8.16326531,   9.18367347,  10.20408163,  11.2244898 ,
         12.24489796,  13.26530612,  14.28571429,  15.30612245,
         16.32653061,  17.34693878,  18.36734694,  19.3877551 ,
         20.40816327,  21.42857143,  22.44897959,  23.46938776,
         24.48979592,  25.51020408,  26.53061224,  27.55102041,
         28.57142857,  29.59183673,  30.6122449 ,  31.63265306,
         32.65306122,  33.67346939,  34.69387755,  35.71428571,
         36.73469388,  37.75510204,  38.7755102 ,  39.79591837,
         40.81632653,  41.83673469,  42.85714286,  43.87755102,
         44.89795918,  45.91836735,  46.93877551,  47.95918367,
         48.97959184,  50.        ]),
 <a list of 49 Patch objects>)

OK, almost 400 items have no keywords at all. There are also some outliers; let's inspect the ones with more than 15 keywords, which sounds excessive...


In [31]:
legacybib[legacybib.keywords.map(lambda kws: len(kws)) > 15][["id", "title", "author", "keywords"]]


Out[31]:
id title author keywords
371 372119 'TSUPDOOD?: Repackaged Problems for You and MMI Rebecca G. Bace and Marvin Schaefer [commercial products, computer interconnection...
802 2857706 Interoperability of Relationship- and Role-Bas... Syed Zain R. Rizvi and Philip W.L. Fong [access control, authorization graph, authoriz...
895 1138284 Customer Driven Innovation: Quicken&Reg; Renta... Suzanne Pellican and Matt Homier [advanced prototypes, analysis, architecture, ...

And the keyword lists for the above


In [32]:
[keywordlist for keywordlist in legacybib[legacybib.keywords.map(lambda kws: len(kws)) > 15].keywords]


Out[32]:
[['commercial products',
  'computer interconnection',
  'computer security',
  'computer takeover',
  'computer usage',
  'computer viruses',
  'confidentiality compromises',
  'data destruction',
  'information security',
  'inter-computer links',
  'interconnected systems',
  'interconnected workstations',
  'internetworking',
  'legacy systems',
  'man-machine interfaces',
  'mutual suspicion',
  'network security',
  'security breaches',
  'security of data',
  'security protection',
  'separation kernel concept',
  'service denial',
  'software distribution',
  'trapdoors',
  'uncontrolled activity',
  'user interfaces',
  'user perception',
  'willful promiscuity',
  'worms'],
 ['access control',
  'authorization graph',
  'authorization principal',
  'authorization rules',
  'constraints',
  'core rebac2015',
  'demarcation',
  'guards',
  'hybrid logic',
  'liberal-grant',
  'meap',
  'mutually exclusive authorization principals',
  'papr',
  'prerequisite authorization principal requirement',
  'principal matching',
  'rebac2015',
  'rebac2015/constraints',
  'rebac2015/demarcation',
  'relationship predicate',
  'relationship-based access control',
  'role-based access control',
  'string-grant'],
 ['advanced prototypes',
  'analysis',
  'architecture',
  'business case',
  'business strategy',
  'concept design',
  'concept evaluation',
  'concept generation',
  'concepts',
  'conceptual model',
  'customer interviews',
  'customer-driven innovation',
  'customer-focus',
  'design planning',
  'desktop software',
  'detailed design',
  'experience strategy',
  'focus groups',
  'generative research',
  'interaction design',
  'interdisciplinary design',
  'launch and learn',
  'macromedia flash',
  'marketing / market research',
  'office',
  'organizational culture',
  'organizational planning',
  'participatory design',
  'performance metrics',
  'personal finance',
  'pilot testing',
  'polish',
  'process',
  'process improvement',
  'product design',
  'product management',
  'prototyping',
  'simple',
  'tax',
  'trade-offs',
  'usability research',
  'user experience',
  'user feedback',
  'user interface design',
  'user research',
  'user studies',
  'user types',
  'user-centered design / human-centered design',
  'version 1.0',
  'visual design']]

That is excessive, but seems legit to me.

Total number of unique keywords:


In [33]:
len(keywordhist)


Out[33]:
2284

Of these, the ones that occur in 10 or more items in the subset are


In [34]:
[(k, keywordhist[k]) for k in sorted(keywordhist, key=keywordhist.get, reverse=True) if keywordhist[k] >= 10]


Out[34]:
[('legacy systems', 27),
 ('reverse engineering', 24),
 ('legacy software', 19),
 ('reengineering', 18),
 ('refactoring', 17),
 ('java', 16),
 ('migration', 14),
 ('software evolution', 14),
 ('software architecture', 12),
 ('cloud computing', 12),
 ('security', 12),
 ('legacy', 11),
 ('legacy system', 10),
 ('middleware', 10),
 ('interoperability', 10),
 ('architecture', 10),
 ('design', 10)]

and further those that occur in 3 to 9 items


In [35]:
[(k, keywordhist[k]) for k in sorted(keywordhist, key=keywordhist.get, reverse=True) if keywordhist[k] < 10 and keywordhist[k] >= 3]


Out[35]:
[('c', 9),
 ('software engineering', 9),
 ('soa', 9),
 ('performance', 9),
 ('legacy code', 8),
 ('reuse', 8),
 ('code generation', 7),
 ('software product lines', 7),
 ('measurement', 7),
 ('virtualization', 7),
 ('corba', 7),
 ('trusted computing', 7),
 ('parallelism', 7),
 ('fortran', 7),
 ('c++', 7),
 ('multimedia', 7),
 ('transactional memory', 7),
 ('sdn', 6),
 ('personalization', 6),
 ('integration', 6),
 ('openflow', 6),
 ('usability', 6),
 ('static analysis', 6),
 ('evolution', 5),
 ('metadata', 5),
 ('death', 5),
 ('digital legacy', 5),
 ('semantic web', 5),
 ('storage', 5),
 ('reflection', 5),
 ('type inference', 5),
 ('privacy', 5),
 ('patterns', 5),
 ('android', 5),
 ('user experience', 5),
 ('memory', 5),
 ('eclipse', 5),
 ('software visualization', 4),
 ('late launch', 4),
 ('web services', 4),
 ('secure execution', 4),
 ('software reengineering', 4),
 ('object-oriented', 4),
 ('maintenance', 4),
 ('transformation', 4),
 ('information retrieval', 4),
 ('embedded systems', 4),
 ('software reuse', 4),
 ('compilers', 4),
 ('mapping', 4),
 ('testing', 4),
 ('bytecode', 4),
 ('software-defined networking', 4),
 ('business process', 4),
 ('gpu', 4),
 ('domain-specific languages', 4),
 ('concurrency', 4),
 ('open source', 4),
 ('evolutionary computation', 4),
 ('case study', 4),
 ('software architectures', 4),
 ('fault localization', 4),
 ('energy efficiency', 4),
 ('hierarchical scheduling', 4),
 ('software maintenance', 4),
 ('assembly code', 4),
 ('web applications', 4),
 ('machine learning', 4),
 ('facebook', 4),
 ('real-time systems', 4),
 ('mobility', 4),
 ('software product line', 4),
 ('simulation', 4),
 ('relational databases', 4),
 ('architecture recovery', 4),
 ('service oriented architecture', 4),
 ('software product line engineering', 4),
 ('mobile devices', 3),
 ('component', 3),
 ('middlebox', 3),
 ('atomicity', 3),
 ('clustering', 3),
 ('parallel simulation', 3),
 ('social network sites', 3),
 ('annotations', 3),
 ('runtime systems', 3),
 ('software modernization', 3),
 ('gpu acceleration', 3),
 ('web service', 3),
 ('legacy integration', 3),
 ('domain-specific language', 3),
 ('software migration', 3),
 ('legacy modernization', 3),
 ('imap', 3),
 ('certification', 3),
 ('gcc', 3),
 ('features', 3),
 ('groupware', 3),
 ('ontology', 3),
 ('relaxed memory models', 3),
 ('context-awareness', 3),
 ('cscw', 3),
 ('model-driven engineering', 3),
 ('cross-site scripting', 3),
 ('802.11', 3),
 ('uml', 3),
 ('quality of service', 3),
 ('fence placement', 3),
 ('inheritance', 3),
 ('opencl', 3),
 ('framework', 3),
 ('frameworks', 3),
 ('software quality', 3),
 ('foreign function interface', 3),
 ('multicore', 3),
 ('rest', 3),
 ('modernization', 3),
 ('stewardship', 3),
 ('experimentation', 3),
 ('user interface design', 3),
 ('social media', 3),
 ('tools', 3),
 ('innovation', 3),
 ('dns', 3),
 ('identity management', 3),
 ('context-free grammars', 3),
 ('real-time', 3),
 ('high-performance computing', 3),
 ('behavioral intervals', 3),
 ('rootkit detection', 3),
 ('software clustering', 3),
 ('re-engineering', 3),
 ('automated program repair', 3),
 ('languages', 3),
 ('memcached', 3),
 ('dtn', 3),
 ('heterogeneous', 3),
 ('aop', 3),
 ('ims', 3),
 ('portability', 3),
 ('method', 3),
 ('schema evolution', 3),
 ('parallelization', 3),
 ('dynamic analysis', 3),
 ('libraries', 3),
 ('risk management', 3),
 ('interaction design', 3),
 ('resource management', 3),
 ('design patterns', 3),
 ('software execution cost analysis', 3),
 ('parallel programming', 3),
 ('probes', 3),
 ('mde', 3),
 ('linux', 3),
 ('open standards', 3),
 ('incremental deployment', 3),
 ('model driven engineering', 3),
 ('system-level timing validation', 3),
 ('modularization', 3),
 ('personal information management', 3)]

Of the remainder, the number of keywords which appear in only two items is


In [36]:
len([k for k in keywordhist if keywordhist[k] == 2])


Out[36]:
305

and in only one item


In [37]:
len([k for k in keywordhist if keywordhist[k] == 1])


Out[37]:
1802

Keywords starting with 'legacy' (re.match anchors at the start of the string)


In [12]:
sorted([(k, keywordhist[k]) for k in keywordhist if re.match("legacy", k)], key=lambda k: k[1], reverse=True)


Out[12]:
[('legacy systems', 27),
 ('legacy software', 19),
 ('legacy', 11),
 ('legacy system', 10),
 ('legacy code', 8),
 ('legacy modernization', 3),
 ('legacy integration', 3),
 ('legacy programs', 2),
 ('legacy infrastructure', 2),
 ('legacy traffic', 2),
 ('legacy reuse', 2),
 ('legacy data', 2),
 ('legacy system maintenance', 1),
 ('legacy devices', 1),
 ('legacy information systems', 1),
 ('legacy bias', 1),
 ('legacy technology', 1),
 ('legacy networks', 1),
 ('legacy system analysis', 1),
 ('legacy models', 1),
 ('legacy database', 1),
 ('legacy software product lines', 1),
 ('legacy study', 1),
 ('legacy data conversion', 1),
 ('legacy contact', 1),
 ('legacy code wrapping', 1),
 ('legacy support', 1),
 ('legacy assets mining', 1),
 ('legacy file formats', 1),
 ('legacy migration', 1),
 ('legacy application', 1),
 ('legacy applications', 1),
 ('legacy system integration', 1),
 ('legacy c program parallelization', 1),
 ('legacy document conversion', 1),
 ('legacy systems analysis', 1)]

Network analysis of keywords

The keywords, already massaged into lists above, are pulled out into a co-occurrence graph: two keywords are connected if they appear on the same item.

An analysis of which keywords are actually plentiful, their temporal distribution, centrality metrics, subgraph overlap etc. would be great; a first sketch of a centrality computation follows after the graph has been built below.


In [13]:
keywordg = nx.Graph()
legacybib.keywords.map(lambda item: keywordg.add_edges_from([p for p in itertools.permutations(item, 2)]), na_action='ignore')
print("Number of components", len([comp for comp in nx.connected_components(keywordg)]))
print("Largest ten components sizes", sorted([len(comp) for comp in nx.connected_components(keywordg)], reverse=True)[:10])


Number of components 156
Largest ten components sizes [1624, 13, 11, 11, 11, 10, 9, 9, 9, 9]
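
As a first stab at the centrality metrics mentioned above, a sketch of a degree centrality computation on the dominant component (networkx 1.x API, as used elsewhere in this notebook):

# Degree centrality of the largest connected component
largest = max(nx.connected_component_subgraphs(keywordg), key=len)
pd.Series(nx.degree_centrality(largest)).sort_values(ascending=False).head(5)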

So there is one dominant component and 155 small ones. It's best to explore them interactively with Gephi.


In [40]:
nx.write_gexf(keywordg, "keywordg.gexf")

Degree distribution of the keyword graph, ie. are there a few nodes with huge degree and then a large number of nodes with fewer connections, as in a power-law network? Additionally, let's see where the keywords starting with the word legacy are placed, by indicating their degrees with green vertical lines. In the left diagram below, hubs are towards the right.


In [14]:
fig, (ax1, ax2) = plt.subplots(1,2)
fig.set_size_inches(10, 2)
ax1.set_title("Keyword degree histogram")
ax1.plot(nx.degree_histogram(keywordg))
ax1.vlines([keywordg.degree(l) for l in keywordg if re.match('legacy', l)], ax1.get_ylim()[0], ax1.get_ylim()[1], colors='green')
ax2.set_title("Keyword degree diagram, log/log")
ax2.loglog(nx.degree_histogram(keywordg))


Out[14]:
[<matplotlib.lines.Line2D at 0x10413ad30>]

Eyeballing the above, most of the legacy keywords are where the mass of the distribution is, ie. at low degrees. One of the legacy nodes is a top hub, and there are some in the mid-ranges.
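
To pin down where exactly the legacy keywords sit, their degrees can be listed directly; a sketch:

# Degrees of the keywords starting with 'legacy', highest first
sorted(((k, keywordg.degree(k)) for k in keywordg if re.match('legacy', k)),
       key=lambda t: t[1], reverse=True)[:5]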

The top 3 keywords with the highest degree, ie. towards the right of the above graph are:


In [42]:
keywordgDegrees = pd.Series(keywordg.degree()).sort_values(ascending=False)
keywordgDegrees.head(3)


Out[42]:
legacy systems         132
architecture            85
reverse engineering     81
dtype: int64

Let's plot out the top hub's neighbourhood.


In [43]:
def plotNeighborhood(graph, ego, color = "green", includeEgo = False):
    """
    Plot neighbourhood of keyword in graph, after possibly removing the ego.
    
    graph : networkx.Graph-like graph
        The graph to get the neighbourhood from
    ego : node in graph
        The node whose neighbourhood to plot
    color : string
        Name of the color to use for plotting
    includeEgo : bool
        Include the ego node
        
    The function defaults to removing the ego node, because by definition
    it is connected to each of the nodes in the subgraph. With the ego
    removed, the result basically tells how the neighbours are connected
    with one another.
    """
    plt.rcParams["figure.figsize"] = (10, 10)
    subgraph = nx.Graph()
    if includeEgo:
        subgraph = graph.subgraph(graph.neighbors(ego) + [ego])
    else:
        subgraph = graph.subgraph(graph.neighbors(ego))
    plt.title("Neighbourhood of " + ego + " (" + str(len(subgraph)) + ")")
    plt.axis('off')
    pos = nx.spring_layout(subgraph, k = 1/sqrt(len(subgraph) * 2))
    nx.draw_networkx(subgraph,
                     pos = pos,
                     font_size = 9,
                     node_color = color,
                     alpha = 0.8,
                     edge_color = "light" + color)
    plt.show()

In [45]:
plotNeighborhood(keywordg, "legacy systems")



In [48]:
plotNeighborhood(keywordg, "legacy software")


Communities

Community detection with the Louvain algorithm, explained in Blondel, Guillaume, Lambiotte, Lefebvre: Fast unfolding of communities in large networks (2008). For weighted networks, the modularity of a partition is $Q = \frac{1}{2m}\sum_{i, j} \Big[A_{ij} - \frac{k_i k_j}{2m}\Big] \delta(c_i, c_j)$, where $A_{ij}$ is the edge weight between nodes $i$ and $j$, $k_i = \sum_j A_{ij}$ is the sum of the weights of the edges attached to node $i$, $c_i$ is the community of $i$, $\delta(c_i, c_j)$ is 1 if $c_i = c_j$ and 0 otherwise, and $m = \frac{1}{2}\sum_{i,j}A_{ij}$.
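
The modularity $Q$ of the partition found can also be read off directly from the python-louvain package (imported above as community); a minimal sketch on the dominant component:

# Modularity Q of the best partition on the largest connected component
largest = max(nx.connected_component_subgraphs(keywordg), key=len)
partition = community.best_partition(largest)
community.modularity(partition, largest)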


In [46]:
def plotCommunities(graph):
    """Plot community information from a graph.
    
    Basically just copied from http://perso.crans.org/aynaud/communities/index.html
    at this point, while in development
    """
    # zoom in on something, for dev. purposes
    graph = graph.subgraph(graph.neighbors('legacy software'))
    # graph = [c for c in nx.connected_component_subgraphs(graph)][0]
    graph = max(nx.connected_component_subgraphs(graph), key=len) # I love you Python
    partition = community.best_partition(graph)
    size = float(len(set(partition.values())))
    pos = nx.spring_layout(graph)
    count = 0
    for com in set(partition.values()):
        count = count + 1
        list_nodes = [nodes for nodes in partition.keys() if partition[nodes] == com]
        plt.axis('off')
        nx.draw_networkx_nodes(graph, pos, list_nodes, node_size = 40, node_color = str(count/size), alpha=0.4)
        nx.draw_networkx_labels(graph, pos, font_size = 9)
        
    nx.draw_networkx_edges(graph, pos, alpha=0.1)
    plt.show()

In [45]:
plotCommunities(keywordg)