Get base data

This is the data from which we want to derive another, synthetic dataset. The original log is just there to provide some realistic file names.


In [1]:
from lib.ozapfdis import git_tc

log = git_tc.log_numstat("C:/dev/repos/buschmais-spring-petclinic")
log.head()


Out[1]:
additions deletions file sha timestamp author
1 1 1 pom.xml f96d80e 2018-06-12 08:32:28 Dirk Mahler
3 5 5 jqassistant/layer.adoc d6e9509 2018-05-30 14:59:44 Dirk Mahler
5 1 1 jqassistant/layer.adoc 87b88d9 2018-05-18 22:43:32 Dirk Mahler
7 4 0 pom.xml ebb50e0 2018-05-17 20:51:14 Dirk Mahler
9 1 1 jqassistant/index.adoc b9b6dcf 2018-05-16 21:32:29 Dirk Mahler

In [2]:
# keep only Java source files
log = log[log.file.str.contains(".java", regex=False)]
log.loc[log.file.str.contains("/jdbc/"), 'type'] = "jdbc"
log.loc[log.file.str.contains("/jpa/"), 'type'] = "jpa"
log.loc[log.type.isna(), 'type'] = "other"
log.head()


Out[2]:
additions deletions file sha timestamp author type
234 4 5 src/test/java/org/springframework/samples/petc... e525415 2016-08-19 16:54:56 Antoine Rey other
235 25 7 src/test/java/org/springframework/samples/petc... e525415 2016-08-19 16:54:56 Antoine Rey other
236 21 9 src/test/java/org/springframework/samples/petc... e525415 2016-08-19 16:54:56 Antoine Rey other
237 23 3 src/test/java/org/springframework/samples/petc... e525415 2016-08-19 16:54:56 Antoine Rey other
238 10 6 src/test/java/org/springframework/samples/petc... e525415 2016-08-19 16:54:56 Antoine Rey other
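
To double-check the classification, we can count how many entries ended up in each category:

log.type.value_counts()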

Create synthetic dataset 1

For the first technology, where "JDBC" was used.

Create committed lines


In [3]:
import numpy as np
import pandas as pd

np.random.seed(0)
# growth period: commits that on average add lines
added_lines = [int(np.random.normal(30,50)) for i in range(0,600)]
# deletion period: commits that on average remove lines
added_lines.extend([int(np.random.normal(-50,100)) for i in range(0,200)])
added_lines.extend([int(np.random.normal(-2,20)) for i in range(0,200)])
added_lines.extend([int(np.random.normal(-3,10)) for i in range(0,200)])
df_jdbc = pd.DataFrame()
df_jdbc['lines'] = added_lines
df_jdbc.head()


Out[3]:
lines
0 118
1 50
2 78
3 142
4 123
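
In expectation, these parameters produce 600*30 - 200*50 - 200*2 - 200*3 = 7000 net committed lines, so the simulated code base grows strongly at first and then slowly shrinks. Individual random draws can still push the running total below zero towards the end; we take care of that further below.

# expected net committed lines for the chosen parameters
600 * 30 + 200 * -50 + 200 * -2 + 200 * -3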

Add timestamp


In [4]:
times = pd.timedelta_range("00:00:00","23:59:59", freq="s")
times = pd.Series(times)
times.head()


Out[4]:
0   00:00:00
1   00:00:01
2   00:00:02
3   00:00:03
4   00:00:04
dtype: timedelta64[ns]
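
This gives us one time offset for every second of a day:

len(times)  # 86400 seconds in a day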

In [5]:
dates = pd.date_range('2013-05-15', '2017-07-23')
dates = pd.to_datetime(dates)
dates = dates[~dates.dayofweek.isin([5,6])]
dates = pd.Series(dates)
dates = dates.add(times.sample(len(dates), replace=True).values)
dates.head()


Out[5]:
0   2013-05-15 03:35:33
1   2013-05-16 02:15:44
2   2013-05-17 15:12:26
3   2013-05-20 00:16:06
4   2013-05-21 17:43:53
dtype: datetime64[ns]
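
Because Saturdays and Sundays were filtered out, a quick check can confirm that no timestamp falls on a weekend:

dates.dt.dayofweek.max()  # should be at most 4 (= Friday)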

In [6]:
df_jdbc['timestamp'] = dates.sample(len(df_jdbc), replace=True).sort_values().reset_index(drop=True)
df_jdbc = df_jdbc.sort_index()
df_jdbc.head()


Out[6]:
lines timestamp
0 118 2013-05-15 03:35:33
1 50 2013-05-16 02:15:44
2 78 2013-05-17 15:12:26
3 142 2013-05-24 05:52:31
4 123 2013-05-28 08:15:35

Treat first commit separately

Set a fixed value for the first commit because the project has to start with some code.


In [7]:
df_jdbc.loc[0, 'lines'] = 250
df_jdbc.head()


Out[7]:
lines timestamp
0 250 2013-05-15 03:35:33
1 50 2013-05-16 02:15:44
2 78 2013-05-17 15:12:26
3 142 2013-05-24 05:52:31
4 123 2013-05-28 08:15:35


Add file names

Sample file names including their paths from an existing dataset


In [9]:
df_jdbc['file'] = log[log['type'] == 'jdbc']['file'].sample(len(df_jdbc), replace=True).values
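
All sampled paths should now come from the JDBC package:

df_jdbc.file.str.contains("/jdbc/").all()  # should be True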

Check dataset


In [10]:
%matplotlib inline
df_jdbc.lines.hist()


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e7b24efe10>

Sum up the data and check whether it was created as intended.


In [11]:
df_jdbc_timed = df_jdbc.set_index('timestamp')
df_jdbc_timed['count'] = df_jdbc_timed.lines.cumsum()
df_jdbc_timed['count'].plot()


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e7b25926d8>

Because the deletion period can push the cumulative line count below zero, we determine the last timestamp at which the running count is still non-negative and truncate the data there.


In [12]:
last_non_zero_timestamp = df_jdbc_timed[df_jdbc_timed['count'] >= 0].index.max()
last_non_zero_timestamp


Out[12]:
Timestamp('2017-07-21 19:02:47')

In [13]:
df_jdbc = df_jdbc[df_jdbc.timestamp <= last_non_zero_timestamp]
df_jdbc.head()


Out[13]:
lines timestamp file
0 250 2013-05-15 03:35:33 src/main/java/org/springframework/samples/petc...
1 50 2013-05-16 02:15:44 src/main/java/org/springframework/samples/petc...
2 78 2013-05-17 15:12:26 src/main/java/org/springframework/samples/petc...
3 142 2013-05-24 05:52:31 src/main/java/org/springframework/samples/petc...
4 123 2013-05-28 08:15:35 src/main/java/org/springframework/samples/petc...
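
After this truncation, the cumulative sum of the committed lines ends non-negative again:

df_jdbc.lines.cumsum().iloc[-1]  # should be >= 0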

Create synthetic dataset 2
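
For the second technology, where "JPA" was used. These commits start two years later and consist only of a growth period, so the whole dataset can be created in a single cell analogous to the steps above.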


In [14]:
df_jpa = pd.DataFrame([int(np.random.normal(20,50)) for i in range(0,600)], columns=['lines'])
df_jpa.loc[0,'lines'] = 150
df_jpa['timestamp'] = pd.DateOffset(years=2) + dates.sample(len(df_jpa), replace=True).sort_values().reset_index(drop=True)
df_jpa = df_jpa.sort_index()
df_jpa['file'] = log[log['type'] == 'jpa']['file'].sample(len(df_jpa), replace=True).values
df_jpa.head()


Out[14]:
lines timestamp file
0 150 2015-05-17 15:12:26 src/main/java/org/springframework/samples/petc...
1 86 2015-05-20 00:16:06 src/main/java/org/springframework/samples/petc...
2 -27 2015-05-24 05:52:31 src/main/java/org/springframework/samples/petc...
3 14 2015-06-04 21:09:15 src/main/java/org/springframework/samples/petc...
4 66 2015-06-06 19:22:39 src/main/java/org/springframework/samples/petc...

Check dataset


In [15]:
df_jpa.lines.hist()


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e7b2613898>

In [16]:
df_jpa_timed = df_jpa.set_index('timestamp')
df_jpa_timed['count'] = df_jpa_timed.lines.cumsum()
df_jpa_timed['count'].plot()


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e7b372c6d8>

Add some noise


In [17]:
dates_other = pd.date_range(df_jdbc.timestamp.min(), df_jpa.timestamp.max())
dates_other = pd.to_datetime(dates_other)
dates_other = dates_other[~dates_other.dayofweek.isin([5,6])]
dates_other = pd.Series(dates_other)
dates_other = dates_other.add(times.sample(len(dates_other), replace=True).values)
dates_other.head()


Out[17]:
0   2013-05-15 17:36:46
1   2013-05-16 22:05:34
2   2013-05-17 19:27:07
3   2013-05-20 06:28:34
4   2013-05-21 03:46:00
dtype: datetime64[ns]

In [18]:
df_other = pd.DataFrame([int(np.random.normal(5,100)) for i in range(0,40000)], columns=['lines'])
df_other['timestamp'] = dates_other.sample(len(df_other), replace=True).sort_values().reset_index(drop=True)
df_other = df_other.sort_index()
df_other['file'] = log[log['type'] == 'other']['file'].sample(len(df_other), replace=True).values
df_other.head()


Out[18]:
lines timestamp file
0 38 2013-05-15 17:36:46 src/test/java/org/springframework/samples/petc...
1 74 2013-05-15 17:36:46 src/main/java/org/springframework/samples/petc...
2 143 2013-05-15 17:36:46 src/main/java/org/springframework/samples/petc...
3 -54 2013-05-15 17:36:46 src/test/java/org/springframework/samples/petc...
4 -46 2013-05-15 17:36:46 src/main/java/org/springframework/samples/petc...

Check dataset


In [19]:
df_other.lines.hist()


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e7b380af28>

In [20]:
df_other_timed = df_other.set_index('timestamp')
df_other_timed['count'] = df_other_timed.lines.cumsum()
df_other_timed['count'].plot()


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e7b387e6a0>

Concatenate all datasets


In [21]:
df = pd.concat([df_jpa, df_jdbc, df_other], ignore_index=True).sort_values(by='timestamp')
# split the signed line counts into separate additions and deletions columns
df.loc[df.lines > 0, 'additions'] = df.lines
df.loc[df.lines < 0, 'deletions'] = df.lines * -1
df = df.fillna(0).reset_index(drop=True)
df = df[['additions', 'deletions', 'file', 'timestamp']]
# the very first commit cannot delete anything, so treat its deletions as additions
df.loc[(df.deletions > 0) & (df.loc[0].timestamp == df.timestamp),'additions'] = df.deletions
df.loc[df.loc[0].timestamp == df.timestamp,'deletions'] = 0
df['additions'] = df.additions.astype(int)
df['deletions'] = df.deletions.astype(int)
# order like a real git log: newest commits first
df = df.sort_values(by='timestamp', ascending=False)
df.head()


Out[21]:
additions deletions file timestamp
41799 56 0 src/main/java/org/springframework/samples/petc... 2019-07-19 19:16:44
41798 0 62 src/main/java/org/springframework/samples/petc... 2019-07-19 19:16:44
41770 0 117 src/main/java/org/springframework/samples/petc... 2019-07-19 06:09:16
41777 0 0 src/main/java/org/springframework/samples/petc... 2019-07-19 06:09:16
41776 0 46 src/main/java/org/springframework/samples/petc... 2019-07-19 06:09:16
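
If the special treatment of the first commit worked, there are no deletions left at the very first timestamp:

df[df.timestamp == df.timestamp.min()].deletions.sum()  # should be 0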

Truncate data at a fixed date


In [22]:
df = df[df.timestamp < pd.Timestamp('2018-01-01')]
df.head()


Out[22]:
additions deletions file timestamp
31486 19 0 src/main/java/org/springframework/samples/petc... 2017-12-31 19:41:29
31485 55 0 src/main/java/org/springframework/samples/petc... 2017-12-30 12:48:20
31484 29 0 src/main/java/org/springframework/samples/petc... 2017-12-30 12:48:20
31461 0 99 src/main/java/org/springframework/samples/petc... 2017-12-30 00:38:54
31467 19 0 src/main/java/org/springframework/samples/petc... 2017-12-30 00:38:54

Export the data


In [23]:
df.to_csv("datasets/git_log_refactoring.gz", index=False, compression='gzip')

Check loaded data


In [24]:
df_loaded = pd.read_csv("datasets/git_log_refactoring.gz")
df_loaded.head()


Out[24]:
additions deletions file timestamp
0 19 0 src/main/java/org/springframework/samples/petc... 2017-12-31 19:41:29
1 55 0 src/main/java/org/springframework/samples/petc... 2017-12-30 12:48:20
2 29 0 src/main/java/org/springframework/samples/petc... 2017-12-30 12:48:20
3 0 99 src/main/java/org/springframework/samples/petc... 2017-12-30 00:38:54
4 19 0 src/main/java/org/springframework/samples/petc... 2017-12-30 00:38:54

In [25]:
df_loaded.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31487 entries, 0 to 31486
Data columns (total 4 columns):
additions    31487 non-null int64
deletions    31487 non-null int64
file         31487 non-null object
timestamp    31487 non-null object
dtypes: int64(2), object(2)
memory usage: 984.0+ KB
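
Note that the timestamp column comes back as plain strings (dtype object). If real datetimes are needed later on, the dataset can also be loaded with the parse_dates option:

df_loaded = pd.read_csv("datasets/git_log_refactoring.gz", parse_dates=['timestamp'])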