This data was generated in the Git repository

JavaOnAutobahn/spring-petclinic

with

git log --stat > git_log_stat.log.

This exports the history of the Git repository including some information about the file changes per commit.

Here is an excerpt form this created dataset:

commit 4d3d9de655faa813781027d8b1baed819c6a56fe
Author: Markus Harrer <feststelltaste@googlemail.com>
Date:   Tue Mar 5 22:32:20 2019 +0100

    add virtual bounded contexts

20  1   jqassistant/business.adoc

It doesn't contain also any tabular structured data but more a row-based style of data (hint: if you want this, you can use Git's --format options to create such things).

The question is: Can we get this kind of data into a pandas DataFrame?

Warning: Please just read on if you can stand all the brain pain that follows.



In [1]:

    
import pandas as pd

log = pd.read_csv(
    "../../joa_spring-petclinic/git_log_numstat.log",
    sep="\n",
    names=['raw'])
log.head()









    Out[1]:







  
    
      
      raw
    
  
  
    
      0
      commit 4d3d9de655faa813781027d8b1baed819c6a56fe
    
    
      1
      Author: Markus Harrer <feststelltaste@googlema...
    
    
      2
      Date:   Tue Mar 5 22:32:20 2019 +0100
    
    
      3
      add virtual bounded contexts
    
    
      4
      test\ttest\t



In [2]:

    
log['sha'] = log.loc[log['raw'].str.startswith("commit ")]['raw'].str.split("commit ").str[1]
log['author'] = log.loc[log['raw'].str.startswith("Author: ")]['raw'].str.split("Author: ").str[1]
log['timestamp'] = log.loc[log['raw'].str.startswith("Date: ")]['raw'].str.split("Date: ").str[1]
log.head()









    Out[2]:







  
    
      
      raw
      sha
      author
      timestamp
    
  
  
    
      0
      commit 4d3d9de655faa813781027d8b1baed819c6a56fe
      4d3d9de655faa813781027d8b1baed819c6a56fe
      NaN
      NaN
    
    
      1
      Author: Markus Harrer <feststelltaste@googlema...
      NaN
      Markus Harrer <feststelltaste@googlemail.com>
      NaN
    
    
      2
      Date:   Tue Mar 5 22:32:20 2019 +0100
      NaN
      NaN
      Tue Mar 5 22:32:20 2019 +0100
    
    
      3
      add virtual bounded contexts
      NaN
      NaN
      NaN
    
    
      4
      test\ttest\t
      NaN
      NaN
      NaN



In [3]:

    
log['message'] = log.loc[log['raw'].str.startswith(" "*4)]['raw'].str[4:]
log.head()









    Out[3]:







  
    
      
      raw
      sha
      author
      timestamp
      message
    
  
  
    
      0
      commit 4d3d9de655faa813781027d8b1baed819c6a56fe
      4d3d9de655faa813781027d8b1baed819c6a56fe
      NaN
      NaN
      NaN
    
    
      1
      Author: Markus Harrer <feststelltaste@googlema...
      NaN
      Markus Harrer <feststelltaste@googlemail.com>
      NaN
      NaN
    
    
      2
      Date:   Tue Mar 5 22:32:20 2019 +0100
      NaN
      NaN
      Tue Mar 5 22:32:20 2019 +0100
      NaN
    
    
      3
      add virtual bounded contexts
      NaN
      NaN
      NaN
      add virtual bounded contexts
    
    
      4
      test\ttest\t
      NaN
      NaN
      NaN
      test\ttest\t



In [4]:

    
log['no_entry'] = \
    log['sha'].isna() & \
    log['author'].isna() & \
    log['timestamp'].isna() & \
    log['message'].isna()
log.head()









    Out[4]:







  
    
      
      raw
      sha
      author
      timestamp
      message
      no_entry
    
  
  
    
      0
      commit 4d3d9de655faa813781027d8b1baed819c6a56fe
      4d3d9de655faa813781027d8b1baed819c6a56fe
      NaN
      NaN
      NaN
      False
    
    
      1
      Author: Markus Harrer <feststelltaste@googlema...
      NaN
      Markus Harrer <feststelltaste@googlemail.com>
      NaN
      NaN
      False
    
    
      2
      Date:   Tue Mar 5 22:32:20 2019 +0100
      NaN
      NaN
      Tue Mar 5 22:32:20 2019 +0100
      NaN
      False
    
    
      3
      add virtual bounded contexts
      NaN
      NaN
      NaN
      add virtual bounded contexts
      False
    
    
      4
      test\ttest\t
      NaN
      NaN
      NaN
      test\ttest\t
      False



In [5]:

    
log['sha'] = log['sha'].fillna(method="ffill")
log.head()









    Out[5]:







  
    
      
      raw
      sha
      author
      timestamp
      message
      no_entry
    
  
  
    
      0
      commit 4d3d9de655faa813781027d8b1baed819c6a56fe
      4d3d9de655faa813781027d8b1baed819c6a56fe
      NaN
      NaN
      NaN
      False
    
    
      1
      Author: Markus Harrer <feststelltaste@googlema...
      4d3d9de655faa813781027d8b1baed819c6a56fe
      Markus Harrer <feststelltaste@googlemail.com>
      NaN
      NaN
      False
    
    
      2
      Date:   Tue Mar 5 22:32:20 2019 +0100
      4d3d9de655faa813781027d8b1baed819c6a56fe
      NaN
      Tue Mar 5 22:32:20 2019 +0100
      NaN
      False
    
    
      3
      add virtual bounded contexts
      4d3d9de655faa813781027d8b1baed819c6a56fe
      NaN
      NaN
      add virtual bounded contexts
      False
    
    
      4
      test\ttest\t
      4d3d9de655faa813781027d8b1baed819c6a56fe
      NaN
      NaN
      test\ttest\t
      False



In [6]:

    
sha_msg = log.dropna(subset=['message']).groupby('sha')['message'].apply(' '.join)
sha_msg.head()









    Out[6]:





sha
024811d252f8d8218e6795d46203cff25971bc19                        simplifying access to Integer
0365d34d2977dd24ec0bb3e8b0edff5694908c80             downgrade jqassistant due to weird error
0504ec9fe345d9d34b15c374333f709fb147e6d6    Update petclinic_db_setup_mysql.txt Correct in...
053c84ecc95b246ef4a40fb3d4304e8908604af4                             migrated to Spring 4.0.1
057015c14cce4791ff309419de8a8bd6339fd6e7    Spring MVC Test Framework and migration to Spr...
Name: message, dtype: object



In [7]:

    
sha_files = log[log['no_entry']][['sha', 'raw']]
sha_files = sha_files.set_index('sha')
sha_files.head()









    Out[7]:







  
    
      
      raw
    
    
      sha
      
    
  
  
    
      4d3d9de655faa813781027d8b1baed819c6a56fe
      20\t1\tjqassistant/business.adoc
    
    
      4d3d9de655faa813781027d8b1baed819c6a56fe
      1\t1\tsrc/main/java/org/springframework/sample...
    
    
      4d3d9de655faa813781027d8b1baed819c6a56fe
      2\t0\tsrc/main/java/org/springframework/sample...
    
    
      4d3d9de655faa813781027d8b1baed819c6a56fe
      2\t1\tsrc/main/java/org/springframework/sample...
    
    
      4d3d9de655faa813781027d8b1baed819c6a56fe
      2\t0\tsrc/main/java/org/springframework/sample...



In [8]:

    
sha_files[['additions', 'deletions', 'filename']] = sha_files['raw'].str.split("\t", expand=True)
del(sha_files['raw'])
sha_files.head()









    Out[8]:







  
    
      
      additions
      deletions
      filename
    
    
      sha
      
      
      
    
  
  
    
      4d3d9de655faa813781027d8b1baed819c6a56fe
      20
      1
      jqassistant/business.adoc
    
    
      4d3d9de655faa813781027d8b1baed819c6a56fe
      1
      1
      src/main/java/org/springframework/samples/petc...
    
    
      4d3d9de655faa813781027d8b1baed819c6a56fe
      2
      0
      src/main/java/org/springframework/samples/petc...
    
    
      4d3d9de655faa813781027d8b1baed819c6a56fe
      2
      1
      src/main/java/org/springframework/samples/petc...
    
    
      4d3d9de655faa813781027d8b1baed819c6a56fe
      2
      0
      src/main/java/org/springframework/samples/petc...



In [9]:

    
df = log.groupby('sha')[['author', 'timestamp']].first()
df.head()









    Out[9]:







  
    
      
      author
      timestamp
    
    
      sha
      
      
    
  
  
    
      024811d252f8d8218e6795d46203cff25971bc19
      Mic <misvy@vmware.com>
      Thu Mar 14 18:04:36 2013 +0800
    
    
      0365d34d2977dd24ec0bb3e8b0edff5694908c80
      Markus Harrer <feststelltaste@googlemail.com>
      Mon Nov 12 10:28:34 2018 +0100
    
    
      0504ec9fe345d9d34b15c374333f709fb147e6d6
      thinksh <thinkshihang@gmail.com>
      Wed Feb 3 23:19:46 2016 -0500
    
    
      053c84ecc95b246ef4a40fb3d4304e8908604af4
      Mic <misvy@vmware.com>
      Mon Feb 3 09:31:44 2014 +0800
    
    
      057015c14cce4791ff309419de8a8bd6339fd6e7
      Mic <misvy@vmware.com>
      Fri Feb 15 15:31:04 2013 +0800



In [10]:

    
df = df.join(sha_msg)
df.head()









    Out[10]:







  
    
      
      author
      timestamp
      message
    
    
      sha
      
      
      
    
  
  
    
      024811d252f8d8218e6795d46203cff25971bc19
      Mic <misvy@vmware.com>
      Thu Mar 14 18:04:36 2013 +0800
      simplifying access to Integer
    
    
      0365d34d2977dd24ec0bb3e8b0edff5694908c80
      Markus Harrer <feststelltaste@googlemail.com>
      Mon Nov 12 10:28:34 2018 +0100
      downgrade jqassistant due to weird error
    
    
      0504ec9fe345d9d34b15c374333f709fb147e6d6
      thinksh <thinkshihang@gmail.com>
      Wed Feb 3 23:19:46 2016 -0500
      Update petclinic_db_setup_mysql.txt Correct in...
    
    
      053c84ecc95b246ef4a40fb3d4304e8908604af4
      Mic <misvy@vmware.com>
      Mon Feb 3 09:31:44 2014 +0800
      migrated to Spring 4.0.1
    
    
      057015c14cce4791ff309419de8a8bd6339fd6e7
      Mic <misvy@vmware.com>
      Fri Feb 15 15:31:04 2013 +0800
      Spring MVC Test Framework and migration to Spr...



In [11]:

    
df = df.join(sha_files, how='right')
df.head()









    Out[11]:







  
    
      
      author
      timestamp
      message
      additions
      deletions
      filename
    
    
      sha
      
      
      
      
      
      
    
  
  
    
      024811d252f8d8218e6795d46203cff25971bc19
      Mic <misvy@vmware.com>
      Thu Mar 14 18:04:36 2013 +0800
      simplifying access to Integer
      1
      1
      src/main/java/org/springframework/samples/petc...
    
    
      0365d34d2977dd24ec0bb3e8b0edff5694908c80
      Markus Harrer <feststelltaste@googlemail.com>
      Mon Nov 12 10:28:34 2018 +0100
      downgrade jqassistant due to weird error
      1
      1
      pom.xml
    
    
      0504ec9fe345d9d34b15c374333f709fb147e6d6
      thinksh <thinkshihang@gmail.com>
      Wed Feb 3 23:19:46 2016 -0500
      Update petclinic_db_setup_mysql.txt Correct in...
      1
      1
      src/main/resources/db/mysql/petclinic_db_setup...
    
    
      053c84ecc95b246ef4a40fb3d4304e8908604af4
      Mic <misvy@vmware.com>
      Mon Feb 3 09:31:44 2014 +0800
      migrated to Spring 4.0.1
      1
      1
      pom.xml
    
    
      057015c14cce4791ff309419de8a8bd6339fd6e7
      Mic <misvy@vmware.com>
      Fri Feb 15 15:31:04 2013 +0800
      Spring MVC Test Framework and migration to Spr...
      1
      18
      .springBeans

	raw
0	commit 4d3d9de655faa813781027d8b1baed819c6a56fe
1	Author: Markus Harrer <feststelltaste@googlema...
2	Date: Tue Mar 5 22:32:20 2019 +0100
3	add virtual bounded contexts
4	test\ttest\t

	raw	sha	author	timestamp
0	commit 4d3d9de655faa813781027d8b1baed819c6a56fe	4d3d9de655faa813781027d8b1baed819c6a56fe	NaN	NaN
1	Author: Markus Harrer <feststelltaste@googlema...	NaN	Markus Harrer <feststelltaste@googlemail.com>	NaN
2	Date: Tue Mar 5 22:32:20 2019 +0100	NaN	NaN	Tue Mar 5 22:32:20 2019 +0100
3	add virtual bounded contexts	NaN	NaN	NaN
4	test\ttest\t	NaN	NaN	NaN

	raw	sha	author	timestamp	message	no_entry
0	commit 4d3d9de655faa813781027d8b1baed819c6a56fe	4d3d9de655faa813781027d8b1baed819c6a56fe	NaN	NaN	NaN	False
1	Author: Markus Harrer <feststelltaste@googlema...	NaN	Markus Harrer <feststelltaste@googlemail.com>	NaN	NaN	False
2	Date: Tue Mar 5 22:32:20 2019 +0100	NaN	NaN	Tue Mar 5 22:32:20 2019 +0100	NaN	False
3	add virtual bounded contexts	NaN	NaN	NaN	add virtual bounded contexts	False
4	test\ttest\t	NaN	NaN	NaN	test\ttest\t	False

	raw
sha
4d3d9de655faa813781027d8b1baed819c6a56fe	20\t1\tjqassistant/business.adoc
4d3d9de655faa813781027d8b1baed819c6a56fe	1\t1\tsrc/main/java/org/springframework/sample...
4d3d9de655faa813781027d8b1baed819c6a56fe	2\t0\tsrc/main/java/org/springframework/sample...
4d3d9de655faa813781027d8b1baed819c6a56fe	2\t1\tsrc/main/java/org/springframework/sample...
4d3d9de655faa813781027d8b1baed819c6a56fe	2\t0\tsrc/main/java/org/springframework/sample...

	additions	deletions	filename
sha
4d3d9de655faa813781027d8b1baed819c6a56fe	20	1	jqassistant/business.adoc
4d3d9de655faa813781027d8b1baed819c6a56fe	1	1	src/main/java/org/springframework/samples/petc...
4d3d9de655faa813781027d8b1baed819c6a56fe	2	0	src/main/java/org/springframework/samples/petc...
4d3d9de655faa813781027d8b1baed819c6a56fe	2	1	src/main/java/org/springframework/samples/petc...
4d3d9de655faa813781027d8b1baed819c6a56fe	2	0	src/main/java/org/springframework/samples/petc...

	author	timestamp
sha
024811d252f8d8218e6795d46203cff25971bc19	Mic <misvy@vmware.com>	Thu Mar 14 18:04:36 2013 +0800
0365d34d2977dd24ec0bb3e8b0edff5694908c80	Markus Harrer <feststelltaste@googlemail.com>	Mon Nov 12 10:28:34 2018 +0100
0504ec9fe345d9d34b15c374333f709fb147e6d6	thinksh <thinkshihang@gmail.com>	Wed Feb 3 23:19:46 2016 -0500
053c84ecc95b246ef4a40fb3d4304e8908604af4	Mic <misvy@vmware.com>	Mon Feb 3 09:31:44 2014 +0800
057015c14cce4791ff309419de8a8bd6339fd6e7	Mic <misvy@vmware.com>	Fri Feb 15 15:31:04 2013 +0800