The script below iterates over all commits and extracts basic information about each commit like the sha, the author and the commit date (into log.txt) as well as the file list of that specific version (into files.txt). For each commit, a new directory with the sha as its unique identifier is created.
cd $1
sha_list=`git rev-list master`
for sha in $sha_list
do
    # one output directory per commit, named after the sha
    data_dir="../data/$1/$sha"
    mkdir -p $data_dir
    # check out this version and dump the commit metadata and the file list
    git checkout $sha
    git log -n 1 $sha > $data_dir/log.txt
    git ls-files > $data_dir/files.txt
done
You can store this script e.g. into extract.sh
and execute it for a repository with
sh extract.sh <path_git_repo>
and you'll get a directory/file structure like this:
.
├── data
│   ├── lerna
│   │   ├── 001ec5882630cedd895f2c95a56a755617bb036c
│   │   │   ├── files.txt
│   │   │   └── log.txt
│   │   ├── 00242afa1efa43a98dc84815ac8f554ffa58d472
│   │   │   ├── files.txt
│   │   │   └── log.txt
│   │   ├── 007f20b89ae33721bd08f8bcdd0768923bcc6bc5
│   │   │   ├── files.txt
│   │   │   └── log.txt
The content is as follows:
files.txt
.babelrc
.editorconfig
.eslintrc.yaml
.github/ISSUE_TEMPLATE.md
.github/PULL_REQUEST_TEMPLATE.md
.gitignore
.npmignore
.travis.yml
CHANGELOG.md
CODE_OF_CONDUCT.md
CONTRIBUTING.md
FAQ.md
LICENSE
README.md
appveyor.yml
bin/lerna.js
doc/hoist.md
doc/troubleshooting.md
lerna.json
package.json
src/ChildProcessUtilities.js
src/Command.js
src/ConventionalCommitUtilities.js
src/FileSystemUtilities.js
src/GitUtilities.js
src/NpmUtilities.js
...
log.txt
commit 001ec5882630cedd895f2c95a56a755617bb036c
Author: Daniel Stockman <daniels@zillowgroup.com>
Date: Thu Aug 10 09:56:14 2017 -0700
chore: fs-extra 4.x
With this data, we have the basis for analysing a presumably similar directory structure layout over time. For the import into Python, the information of each commit is combined into a single file: the first line holds the sha and the commit timestamp, the following lines the file list of that version:
abd83718682d7496426bb35f2f9ca20f10c2468d,2015-12-04 23:29:27 +1100
.gitignore
LICENSE
README.md
bin/lerna.js
lib/commands/bootstrap.js
lib/commands/index.js
lib/commands/publish.js
lib/init.js
lib/progress-bar.js
package.json
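A minimal sketch of how such a combined file could be written per commit, a small variation of the extraction script above (the layout data/<project>/<sha>.txt is an assumption, matching what the glob expression below searches for):
cd $1
sha_list=`git rev-list master`
for sha in $sha_list
do
    data_dir="../data/$1"
    mkdir -p $data_dir
    git checkout $sha
    # first line: sha and commit timestamp, followed by the file list of this version
    echo "$sha,`git log -n 1 --pretty=format:%ci $sha`" > $data_dir/$sha.txt
    git ls-files >> $data_dir/$sha.txt
done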
In [79]:
import glob
file_list = glob.glob(r'C:/dev/forensic/data/**/*.txt', recursive=True)
file_list = [x.replace("\\", "/") for x in file_list]
file_list[:5]
Out[79]:
We can then import the data by looping over all the files and reading in their content. On the fly, we extract the information items we need from the path as well as from the first line of each file (the sha and the timestamp). The result is stored in a Pandas DataFrame for further analysis.
In [80]:
import pandas as pd

dfs = []

for files_file in file_list:
    try:
        # first line of each file: sha and timestamp; remaining lines: the file list
        files_df = pd.read_csv(files_file, names=['sha', 'timestamp'])
        # the project name is the parent directory of the per-commit file
        files_df['project'] = files_file.split("/")[-2]
        files_df['file'] = files_df.sha
        # broadcast the sha and the timestamp of the first row to all rows
        files_df['sha'] = files_df.sha[0]
        files_df['timestamp'] = pd.to_datetime(files_df.timestamp[0])
        # drop the first row that holds only the sha and the timestamp
        files_df = files_df[1:]
        dfs.append(files_df)
    except OSError as e:
        print((e, files_file))

file_log = pd.concat(dfs, ignore_index=True)
file_log.head()
Out[80]:
In [81]:
file_log.file = pd.Categorical(file_log.file)
file_log.info()
In [141]:
dir_log = file_log[
    (file_log.project=='lerna') & (file_log.file.str.endswith(".js")) |
    (file_log.project=='web-build-tools') & (file_log.file.str.endswith(".ts"))
]
dir_log.project.value_counts()
Out[141]:
In [145]:
dir_log = dir_log[dir_log.file.str.contains("/")].copy()
dir_log['last_dir'] = dir_log.file.str.split("/").str[-2]
dir_log['last_dir_id'] = pd.factorize(dir_log.last_dir)[0]
dir_log.head()
Out[145]:
In [143]:
dir_log['date'] = dir_log.timestamp.dt.normalize()  # day precision, but keeps the datetime dtype for the Grouper below
dir_log.head()
Out[143]:
In [133]:
grouped = dir_log.groupby(['project', pd.Grouper(key='date', freq="D"), 'last_dir_id'])[['sha']].last()
grouped.head()
In [121]:
grouped['existent'] = 1
grouped.head()
Out[121]:
In [129]:
test = grouped.pivot_table('existent', ['project', 'date'], 'last_dir_id').fillna(0)
test.head()
In [127]:
lerna = test.loc['lerna'][0]
lerna
Out[127]:
In [105]:
%matplotlib inline
test.plot()
Out[105]:
In [83]:
timed_log = dir_log.set_index(['timestamp', 'project'])
timed_log.head()
Out[83]:
In [84]:
timed_log.resample("W", level='timestamp').first()
In [ ]:
%matplotlib inline
timed_log.\
    pivot_table('last_dir_id', timed_log.index, 'project')\
    .fillna(method='ffill').dropna().plot()
For each file, we now have a row with the complete commit information available for both repositories.
In [ ]:
file_log[file_log.project == "lerna"].iloc[0]
In [ ]:
file_log[file_log.project == "web-build-tools"].iloc[0]
Let's take a look at our read-in data.
In [ ]:
file_log.info()
These are the numbers of entries for each repository.
In [ ]:
file_log.project.value_counts()
And this is the number of commits for each repository:
In [ ]:
file_log.groupby('project').sha.nunique()
We need to adapt the data to the domain we analyze. We want to create a similarity measure between the directory structure of the lerna repository and the rush component of the web-build-tools repository. The latter is a little bit tricky, because the rush directories were renamed over time.
In [ ]:
file_log[file_log.project=="web-build-tools"].iloc[0]
In [ ]:
file_log[file_log.project=="web-build-tools"].file.iloc[-10:]
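One way to deal with the renaming would be to strip the project-specific directory prefixes from the web-build-tools paths before comparing. This is only a sketch: the prefixes below are hypothetical placeholders and would have to be replaced by the ones visible in the inspection above.
In [ ]:
# hypothetical prefixes -- look them up in the real data before using this
rush_prefixes = ["libraries/rush/", "apps/rush/"]

def normalize_rush_path(path):
    # strip the first matching prefix so the paths line up with lerna's layout
    for prefix in rush_prefixes:
        if path.startswith(prefix):
            return path[len(prefix):]
    return path

# leave the Categorical dtype behind so the paths can be rewritten
file_log['file'] = file_log.file.astype(str)
mask = file_log.project == "web-build-tools"
file_log.loc[mask, 'file'] = file_log.loc[mask, 'file'].map(normalize_rush_path)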
In [ ]:
lerna = file_log[file_log.project == "lerna"]
lerna.info()
In [ ]:
rush = file_log[file_log.project == "web-build-tools"]
rush.info()
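The comparison below works on a DataFrame comp that holds, per point in time, the newline-separated file lists of both projects in the columns file_list_lerna and file_list_rush. Its construction isn't shown in this section; a minimal sketch of how it could be assembled from file_log (an assumption: both projects are resampled to weekly file lists) looks like this:
In [ ]:
# Sketch (assumption): one row per week with the newline-joined file lists of both
# projects as columns, as calculate_hamming below expects.
lists = (file_log
         .groupby(['project', pd.Grouper(key='timestamp', freq='W')])['file']
         .apply(lambda files: "\n".join(files.astype(str))))
comp = (lists
        .unstack('project')
        .dropna()
        .rename(columns={'lerna': 'file_list_lerna',
                         'web-build-tools': 'file_list_rush'}))
comp.head()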
In [ ]:
from scipy.spatial.distance import hamming

def calculate_hamming(row):
    # compare the file lists of both projects while ignoring the file extensions
    # (lerna is written in JavaScript, rush in TypeScript)
    lerna = row.file_list_lerna.split("\n")
    lerna = [x.rsplit(".", maxsplit=1)[0] for x in lerna]
    rush = row.file_list_rush.split("\n")
    rush = [x.rsplit(".", maxsplit=1)[0] for x in rush]
    # count how many of lerna's files also exist in rush
    count = 0
    for i in lerna:
        if i in rush:
            count = count + 1
    return count

comp["amount"] = comp.apply(calculate_hamming, axis=1)
comp.head()
In [ ]:
%matplotlib inline
comp.amount.plot()
In [ ]:
comp.resample("W").amount.mean().plot()
In [ ]:
comp[comp.amount == comp.amount.max()]