This data was generated in the Git repository
JavaOnAutobahn/spring-petclinic
with
git log --numstat > git_log_numstat.log.
This exports the history of the Git repository including some information about the file changes per commit.
Here is an excerpt from the resulting dataset:
commit 4d3d9de655faa813781027d8b1baed819c6a56fe
Author: Markus Harrer <feststelltaste@googlemail.com>
Date:   Tue Mar 5 22:32:20 2019 +0100

    add virtual bounded contexts

20	1	jqassistant/business.adoc
The output isn't tabular. Instead, each commit is spread over several rows of different kinds: header lines, message lines and one line per changed file (hint: if you want a tabular layout, you can use Git's --format option to create one).
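To illustrate the hint, here is a minimal sketch of how such a tabular export could be read. It assumes the log was written with something like `git log --pretty=format:"%H%x09%an%x09%ad%x09%s"`, where `%x09` puts a tab between the fields. The sample text, the second commit and the column names are made up for illustration and are not taken from the actual export.

```python
import csv
import io

import pandas as pd

# Assumed output of: git log --pretty=format:"%H%x09%an%x09%ad%x09%s"
# One tab-separated line per commit; the second row is a made-up example.
sample = (
    "4d3d9de655faa813781027d8b1baed819c6a56fe\tMarkus Harrer\t"
    "Tue Mar 5 22:32:20 2019 +0100\tadd virtual bounded contexts\n"
    "0123456789abcdef0123456789abcdef01234567\tJane Doe\t"
    "Mon Mar 4 10:00:00 2019 +0100\thypothetical commit\n"
)

commits = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    names=["sha", "author", "timestamp", "message"],
    dtype=str,               # keep everything as text, including the SHA
    quoting=csv.QUOTE_NONE,  # quote characters in messages are plain text
)
commits.head()
```

With one line per commit, no further reshaping is needed, but the per-file change counts are lost. The rest of this notebook sticks with the row-based export.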
The question is: Can we get this kind of data into a pandas DataFrame?
Warning: Only read on if you can stand the brain pain that follows.
In [1]:
import pandas as pd

# read every line of the log file into a single column named 'raw'
log = pd.read_csv(
    "../../joa_spring-petclinic/git_log_numstat.log",
    sep="\n",
    names=['raw'])
log.head()
Out[1]:
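Newer pandas releases may reject "\n" as a separator in read_csv, because the line terminator cannot double as the delimiter there. The following is a minimal sketch of an alternative that builds the same one-column DataFrame from plain Python lines. The helper name `read_raw_log` is hypothetical, and blank lines are dropped to mimic the default behavior of read_csv.

```python
import io

import pandas as pd


def read_raw_log(handle):
    """Return a one-column DataFrame ('raw') with one row per non-blank line."""
    lines = [line.rstrip("\n") for line in handle]
    # keep leading whitespace: message lines are recognized by their indentation
    return pd.DataFrame({"raw": [line for line in lines if line.strip()]})


# small in-memory stand-in for the real log file
sample = io.StringIO(
    "commit 4d3d9de655faa813781027d8b1baed819c6a56fe\n"
    "Author: Markus Harrer <feststelltaste@googlemail.com>\n"
    "Date:   Tue Mar 5 22:32:20 2019 +0100\n"
    "\n"
    "    add virtual bounded contexts\n"
    "\n"
    "20\t1\tjqassistant/business.adoc\n"
)
raw_log = read_raw_log(sample)
raw_log.head()
```

For the real file, the same helper could be fed an open file handle, e.g. `with open(path, encoding="utf-8") as f: log = read_raw_log(f)`.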
In [2]:
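# pick the header rows by their prefix and keep only the part after it;
# all other rows get NaN in the new columns
log['sha'] = log.loc[log['raw'].str.startswith("commit ")]['raw'].str.split("commit ").str[1]
log['author'] = log.loc[log['raw'].str.startswith("Author: ")]['raw'].str.split("Author: ").str[1]
# Git pads the date line with extra spaces, so strip them
log['timestamp'] = log.loc[log['raw'].str.startswith("Date: ")]['raw'].str.split("Date: ").str[1].str.strip()
log.head()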
Out[2]:
In [3]:
# commit message lines are indented by four spaces
log['message'] = log.loc[log['raw'].str.startswith(" "*4)]['raw'].str[4:]
log.head()
Out[3]:
In [4]:
# rows that are neither SHA, author, date nor message are the file entries
log['no_entry'] = \
    log['sha'].isna() & \
    log['author'].isna() & \
    log['timestamp'].isna() & \
    log['message'].isna()
log.head()
Out[4]:
In [5]:
# propagate each commit's SHA down to all rows that belong to it
log['sha'] = log['sha'].ffill()
log.head()
Out[5]:
In [6]:
# merge multi-line commit messages into one string per commit
sha_msg = log.dropna(subset=['message']).groupby('sha')['message'].apply(' '.join)
sha_msg.head()
Out[6]:
In [7]:
# keep only the file entries, indexed by the commit they belong to
sha_files = log[log['no_entry']][['sha', 'raw']]
sha_files = sha_files.set_index('sha')
sha_files.head()
Out[7]:
In [8]:
# numstat lines are tab-separated: added lines, deleted lines, file path
sha_files[['additions', 'deletions', 'filename']] = sha_files['raw'].str.split("\t", expand=True)
del sha_files['raw']
sha_files.head()
Out[8]:
In [9]:
# one row per commit; first() takes the first non-null author and timestamp
df = log.groupby('sha')[['author', 'timestamp']].first()
df.head()
Out[9]:
In [10]:
df = df.join(sha_msg)
df.head()
Out[10]:
In [11]:
# right join: one row per changed file, with the commit data repeated
df = df.join(sha_files, how='right')
df.head()
Out[11]:
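All columns in the joined DataFrame are still plain strings. As a possible follow-up that is not part of the steps above, the counts and timestamps could be converted to proper types. The sketch below uses a tiny hand-made frame named `demo` in place of `df`. It assumes that binary files show up with `-` in both numstat columns and that the dates use Git's default format. The second file name is hypothetical.

```python
import pandas as pd

# hand-made stand-in for the joined DataFrame from the last step
demo = pd.DataFrame(
    {
        "author": ["Markus Harrer <feststelltaste@googlemail.com>"] * 2,
        "timestamp": ["Tue Mar 5 22:32:20 2019 +0100"] * 2,
        "additions": ["20", "-"],
        "deletions": ["1", "-"],
        "filename": ["jqassistant/business.adoc", "hypothetical/image.png"],
    }
)

# "-" (binary file) cannot be parsed, so errors="coerce" turns it into NaN
for col in ["additions", "deletions"]:
    demo[col] = pd.to_numeric(demo[col], errors="coerce")

# Git's default date format, e.g. "Tue Mar 5 22:32:20 2019 +0100";
# utc=True normalizes the different offsets to one time zone
demo["timestamp"] = pd.to_datetime(
    demo["timestamp"].str.strip(),
    format="%a %b %d %H:%M:%S %Y %z",
    utc=True,
)
demo.head()
```

Converting to UTC is a design choice made here to get a single datetime dtype for the whole column. If the authors' local offsets matter for the analysis, they would have to be kept separately.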