The script below iterates over all commits and extracts basic information about each commit like the sha, the author and the commit date (into log.txt) as well as the file list of that specific version (into files.txt). For each commit, a new directory with the sha as its unique identifier is created.
cd $1
sha_list=`git rev-list master`
for sha in $sha_list
do
    # one output directory per commit, named after the sha
    data_dir="../data/$1/$sha"
    mkdir -p $data_dir
    # check out this version and dump the commit metadata and the file list
    git checkout $sha
    git log -n 1 $sha > $data_dir/log.txt
    git ls-files > $data_dir/files.txt
done
You can store this script e.g. into extract.sh
and execute it for a repository with
sh extract.sh <path_git_repo>
and you'll get a directory/file structure like this:
.
├── data
│   ├── lerna
│   │   ├── 001ec5882630cedd895f2c95a56a755617bb036c
│   │   │   ├── files.txt
│   │   │   └── log.txt
│   │   ├── 00242afa1efa43a98dc84815ac8f554ffa58d472
│   │   │   ├── files.txt
│   │   │   └── log.txt
│   │   ├── 007f20b89ae33721bd08f8bcdd0768923bcc6bc5
│   │   │   ├── files.txt
│   │   │   └── log.txt
The content is as follows:
files.txt
.babelrc
.editorconfig
.eslintrc.yaml
.github/ISSUE_TEMPLATE.md
.github/PULL_REQUEST_TEMPLATE.md
.gitignore
.npmignore
.travis.yml
CHANGELOG.md
CODE_OF_CONDUCT.md
CONTRIBUTING.md
FAQ.md
LICENSE
README.md
appveyor.yml
bin/lerna.js
doc/hoist.md
doc/troubleshooting.md
lerna.json
package.json
src/ChildProcessUtilities.js
src/Command.js
src/ConventionalCommitUtilities.js
src/FileSystemUtilities.js
src/GitUtilities.js
src/NpmUtilities.js
...
log.txt
commit 001ec5882630cedd895f2c95a56a755617bb036c
Author: Daniel Stockman <daniels@zillowgroup.com>
Date: Thu Aug 10 09:56:14 2017 -0700
chore: fs-extra 4.x
With this data, we have the basis for analysing a presumably similar directory structure layout over time. For the import into Python, the information of each commit is combined into a single file: the first line holds the sha and the commit timestamp, the following lines the file list of that version:
abd83718682d7496426bb35f2f9ca20f10c2468d,2015-12-04 23:29:27 +1100
.gitignore
LICENSE
README.md
bin/lerna.js
lib/commands/bootstrap.js
lib/commands/index.js
lib/commands/publish.js
lib/init.js
lib/progress-bar.js
package.json
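A minimal sketch of how such a combined file could be written per commit, a small variation of the extraction script above (the layout data/<project>/<sha>.txt is an assumption, matching what the glob expression below searches for):
cd $1
sha_list=`git rev-list master`
for sha in $sha_list
do
    data_dir="../data/$1"
    mkdir -p $data_dir
    git checkout $sha
    # first line: sha and commit timestamp, followed by the file list of this version
    echo "$sha,`git log -n 1 --pretty=format:%ci $sha`" > $data_dir/$sha.txt
    git ls-files >> $data_dir/$sha.txt
done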
In [79]:
import glob
file_list = glob.glob(r'C:/dev/forensic/data/**/*.txt', recursive=True)
file_list = [x.replace("\\", "/") for x in file_list]
file_list[:5]
Out[79]:
We can then import the data by looping over all the files and reading in their content. On the fly, we extract the information items we need from the path as well as from the first line of each file (the sha and the timestamp). The result is stored in a Pandas DataFrame for further analysis.
In [80]:
import pandas as pd

dfs = []

for files_file in file_list:
    try:
        # first line of each file: sha and timestamp; remaining lines: the file list
        files_df = pd.read_csv(files_file, names=['sha', 'timestamp'])
        # the project name is the parent directory of the per-commit file
        files_df['project'] = files_file.split("/")[-2]
        files_df['file'] = files_df.sha
        # broadcast the sha and the timestamp of the first row to all rows
        files_df['sha'] = files_df.sha[0]
        files_df['timestamp'] = pd.to_datetime(files_df.timestamp[0])
        # drop the first row that holds only the sha and the timestamp
        files_df = files_df[1:]
        dfs.append(files_df)
    except OSError as e:
        print((e, files_file))

file_log = pd.concat(dfs, ignore_index=True)
file_log.head()
Out[80]:
In [81]:
file_log.file = pd.Categorical(file_log.file)
file_log.info()
In [141]:
dir_log = file_log[
    (file_log.project=='lerna') & (file_log.file.str.endswith(".js")) |
    (file_log.project=='web-build-tools') & (file_log.file.str.endswith(".ts"))
]
dir_log.project.value_counts()
Out[141]:
In [145]:
dir_log = dir_log[dir_log.file.str.contains("/")].copy()
dir_log['last_dir'] = dir_log.file.str.split("/").str[-2]
dir_log['last_dir_id'] = pd.factorize(dir_log.last_dir)[0]
dir_log.head()
Out[145]:
In [143]:
dir_log['date'] = dir_log.timestamp.dt.normalize()  # day precision, but keeps the datetime dtype for the Grouper below
dir_log.head()
Out[143]:
In [133]:
grouped = dir_log.groupby(['project', pd.Grouper(key='date', freq="D"), 'last_dir_id'])[['sha']].last()
grouped.head()
In [121]:
grouped['existent'] = 1
grouped.head()
Out[121]:
In [129]:
test = grouped.pivot_table('existent', ['project', 'date'], 'last_dir_id').fillna(0)
test.head()
In [127]:
lerna = test.loc['lerna'][0]
lerna
Out[127]:
In [105]:
%matplotlib inline
test.plot()
Out[105]:
In [83]:
timed_log = dir_log.set_index(['timestamp', 'project'])
timed_log.head()
Out[83]:
In [84]:
timed_log.resample("W", level='timestamp').first()
In [ ]:
%matplotlib inline
timed_log.\
    pivot_table('last_dir_id', timed_log.index, 'project')\
    .fillna(method='ffill').dropna().plot()
For each file, we now have a row with the complete commit information available for both repositories.
In [ ]:
file_log[file_log.project == "lerna"].iloc[0]
In [ ]:
file_log[file_log.project == "web-build-tools"].iloc[0]
Let's take a look at our read-in data.
In [ ]:
file_log.info()
These are the numbers of entries for each repository.
In [ ]:
file_log.project.value_counts()
And this is the number of commits for each repository:
In [ ]:
file_log.groupby('project').sha.nunique()
We need to adapt the data to the domain we analyze. We want to create a similarity measure between the directory structure of the lerna repository and the rush component of the web-build-tools repository. The latter is a little bit tricky, because the rush directories were renamed over time.
In [ ]:
file_log[file_log.project=="web-build-tools"].iloc[0]
In [ ]:
file_log[file_log.project=="web-build-tools"].file.iloc[-10:]
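One way to deal with the renaming would be to strip the project-specific directory prefixes from the web-build-tools paths before comparing. This is only a sketch: the prefixes below are hypothetical placeholders and would have to be replaced by the ones visible in the inspection above.
In [ ]:
# hypothetical prefixes -- look them up in the real data before using this
rush_prefixes = ["libraries/rush/", "apps/rush/"]

def normalize_rush_path(path):
    # strip the first matching prefix so the paths line up with lerna's layout
    for prefix in rush_prefixes:
        if path.startswith(prefix):
            return path[len(prefix):]
    return path

# leave the Categorical dtype behind so the paths can be rewritten
file_log['file'] = file_log.file.astype(str)
mask = file_log.project == "web-build-tools"
file_log.loc[mask, 'file'] = file_log.loc[mask, 'file'].map(normalize_rush_path)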
In [ ]:
lerna = file_log[file_log.project == "lerna"]
lerna.info()
In [ ]:
rush = file_log[file_log.project == "web-build-tools"]
rush.info()
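The comparison below works on a DataFrame comp that holds, per point in time, the newline-separated file lists of both projects in the columns file_list_lerna and file_list_rush. Its construction isn't shown in this section; a minimal sketch of how it could be assembled from file_log (an assumption: both projects are resampled to weekly file lists) looks like this:
In [ ]:
# Sketch (assumption): one row per week with the newline-joined file lists of both
# projects as columns, as calculate_hamming below expects.
lists = (file_log
         .groupby(['project', pd.Grouper(key='timestamp', freq='W')])['file']
         .apply(lambda files: "\n".join(files.astype(str))))
comp = (lists
        .unstack('project')
        .dropna()
        .rename(columns={'lerna': 'file_list_lerna',
                         'web-build-tools': 'file_list_rush'}))
comp.head()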
In [ ]:
from scipy.spatial.distance import hamming

def calculate_hamming(row):
    # compare the file lists of both projects while ignoring the file extensions
    # (lerna is written in JavaScript, rush in TypeScript)
    lerna = row.file_list_lerna.split("\n")
    lerna = [x.rsplit(".", maxsplit=1)[0] for x in lerna]
    rush = row.file_list_rush.split("\n")
    rush = [x.rsplit(".", maxsplit=1)[0] for x in rush]
    # count how many of lerna's files also exist in rush
    count = 0
    for i in lerna:
        if i in rush:
            count = count + 1
    return count

comp["amount"] = comp.apply(calculate_hamming, axis=1)
comp.head()
In [ ]:
%matplotlib inline
comp.amount.plot()
In [ ]:
comp.resample("W").amount.mean().plot()
In [ ]:
comp[comp.amount == comp.amount.max()]