Get base data

This is the data from which we want to derive another, synthetic dataset. The original log is just there to provide some realistic file names.


In [1]:
from lib.ozapfdis import git_tc

log = git_tc.log_numstat("C:/dev/repos/buschmais-spring-petclinic")
log.head()


Out[1]:
additions deletions file sha timestamp author
1 1 1 pom.xml f96d80e 2018-06-12 08:32:28 Dirk Mahler
3 5 5 jqassistant/layer.adoc d6e9509 2018-05-30 14:59:44 Dirk Mahler
5 1 1 jqassistant/layer.adoc 87b88d9 2018-05-18 22:43:32 Dirk Mahler
7 4 0 pom.xml ebb50e0 2018-05-17 20:51:14 Dirk Mahler
9 1 1 jqassistant/index.adoc b9b6dcf 2018-05-16 21:32:29 Dirk Mahler

In [2]:
# keep only Java source files
log = log[log.file.str.contains(".java", regex=False)]
log.loc[log.file.str.contains("/jdbc/"), 'type'] = "jdbc"
log.loc[log.file.str.contains("/jpa/"), 'type'] = "jpa"
log.loc[log.type.isna(), 'type'] = "other"
log.head()


Out[2]:
additions deletions file sha timestamp author type
234 4 5 src/test/java/org/springframework/samples/petc... e525415 2016-08-19 16:54:56 Antoine Rey other
235 25 7 src/test/java/org/springframework/samples/petc... e525415 2016-08-19 16:54:56 Antoine Rey other
236 21 9 src/test/java/org/springframework/samples/petc... e525415 2016-08-19 16:54:56 Antoine Rey other
237 23 3 src/test/java/org/springframework/samples/petc... e525415 2016-08-19 16:54:56 Antoine Rey other
238 10 6 src/test/java/org/springframework/samples/petc... e525415 2016-08-19 16:54:56 Antoine Rey other
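
To double-check the classification, we can count how many entries ended up in each category:

log.type.value_counts()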

Create synthetic dataset 1

For the first technology, where "JDBC" was used.

Create committed lines


In [3]:
import numpy as np
import pandas as pd

np.random.seed(0)
# growth period: commits that on average add lines
added_lines = [int(np.random.normal(30,50)) for i in range(0,600)]
# deletion period: commits that on average remove lines
added_lines.extend([int(np.random.normal(-50,100)) for i in range(0,200)])
added_lines.extend([int(np.random.normal(-2,20)) for i in range(0,200)])
added_lines.extend([int(np.random.normal(-3,10)) for i in range(0,200)])
df_jdbc = pd.DataFrame()
df_jdbc['lines'] = added_lines
df_jdbc.head()


Out[3]:
lines
0 118
1 50
2 78
3 142
4 123
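
In expectation, these parameters produce 600*30 - 200*50 - 200*2 - 200*3 = 7000 net committed lines, so the simulated code base grows strongly at first and then slowly shrinks. Individual random draws can still push the running total below zero towards the end; we take care of that further below.

# expected net committed lines for the chosen parameters
600 * 30 + 200 * -50 + 200 * -2 + 200 * -3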

Add timestamp


In [4]:
times = pd.timedelta_range("00:00:00","23:59:59", freq="s")
times = pd.Series(times)
times.head()


Out[4]:
0   00:00:00
1   00:00:01
2   00:00:02
3   00:00:03
4   00:00:04
dtype: timedelta64[ns]
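
This gives us one time offset for every second of a day:

len(times)  # 86400 seconds in a day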

In [5]:
dates = pd.date_range('2013-05-15', '2017-07-23')
dates = pd.to_datetime(dates)
dates = dates[~dates.dayofweek.isin([5,6])]
dates = pd.Series(dates)
dates = dates.add(times.sample(len(dates), replace=True).values)
dates.head()


Out[5]:
0   2013-05-15 03:35:33
1   2013-05-16 02:15:44
2   2013-05-17 15:12:26
3   2013-05-20 00:16:06
4   2013-05-21 17:43:53
dtype: datetime64[ns]
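
Because Saturdays and Sundays were filtered out, a quick check can confirm that no timestamp falls on a weekend:

dates.dt.dayofweek.max()  # should be at most 4 (= Friday)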

In [6]:
df_jdbc['timestamp'] = dates.sample(len(df_jdbc), replace=True).sort_values().reset_index(drop=True)
df_jdbc = df_jdbc.sort_index()
df_jdbc.head()


Out[6]:
lines timestamp
0 118 2013-05-15 03:35:33
1 50 2013-05-16 02:15:44
2 78 2013-05-17 15:12:26
3 142 2013-05-24 05:52:31
4 123 2013-05-28 08:15:35

Treat first commit separately

Set a fixed value for the first commit because the project has to start with some code.


In [7]:
df_jdbc.loc[0, 'lines'] = 250
df_jdbc.head()


Out[7]:
lines timestamp
0 250 2013-05-15 03:35:33
1 50 2013-05-16 02:15:44
2 78 2013-05-17 15:12:26
3 142 2013-05-24 05:52:31
4 123 2013-05-28 08:15:35


Add file names

Sample file names including their paths from an existing dataset


In [9]:
df_jdbc['file'] = log[log['type'] == 'jdbc']['file'].sample(len(df_jdbc), replace=True).values
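
All sampled paths should now come from the JDBC package:

df_jdbc.file.str.contains("/jdbc/").all()  # should be True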

Check dataset


In [10]:
%matplotlib inline
df_jdbc.lines.hist()


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e7b24efe10>

Sum up the data and check whether it was created as intended.


In [11]:
df_jdbc_timed = df_jdbc.set_index('timestamp')
df_jdbc_timed['count'] = df_jdbc_timed.lines.cumsum()
df_jdbc_timed['count'].plot()


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e7b25926d8>

Because the deletion period can push the cumulative line count below zero, we determine the last timestamp at which the running count is still non-negative and truncate the data there.


In [12]:
last_non_zero_timestamp = df_jdbc_timed[df_jdbc_timed['count'] >= 0].index.max()
last_non_zero_timestamp


Out[12]:
Timestamp('2017-07-21 19:02:47')

In [13]:
df_jdbc = df_jdbc[df_jdbc.timestamp <= last_non_zero_timestamp]
df_jdbc.head()


Out[13]:
lines timestamp file
0 250 2013-05-15 03:35:33 src/main/java/org/springframework/samples/petc...
1 50 2013-05-16 02:15:44 src/main/java/org/springframework/samples/petc...
2 78 2013-05-17 15:12:26 src/main/java/org/springframework/samples/petc...
3 142 2013-05-24 05:52:31 src/main/java/org/springframework/samples/petc...
4 123 2013-05-28 08:15:35 src/main/java/org/springframework/samples/petc...
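
After this truncation, the cumulative sum of the committed lines ends non-negative again:

df_jdbc.lines.cumsum().iloc[-1]  # should be >= 0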

Create synthetic dataset 2
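
For the second technology, where "JPA" was used. These commits start two years later and consist only of a growth period, so the whole dataset can be created in a single cell analogous to the steps above.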


In [14]:
df_jpa = pd.DataFrame([int(np.random.normal(20,50)) for i in range(0,600)], columns=['lines'])
df_jpa.loc[0,'lines'] = 150
df_jpa['timestamp'] = pd.DateOffset(years=2) + dates.sample(len(df_jpa), replace=True).sort_values().reset_index(drop=True)
df_jpa = df_jpa.sort_index()
df_jpa['file'] = log[log['type'] == 'jpa']['file'].sample(len(df_jpa), replace=True).values
df_jpa.head()


Out[14]:
lines timestamp file
0 150 2015-05-17 15:12:26 src/main/java/org/springframework/samples/petc...
1 86 2015-05-20 00:16:06 src/main/java/org/springframework/samples/petc...
2 -27 2015-05-24 05:52:31 src/main/java/org/springframework/samples/petc...
3 14 2015-06-04 21:09:15 src/main/java/org/springframework/samples/petc...
4 66 2015-06-06 19:22:39 src/main/java/org/springframework/samples/petc...

Check dataset


In [15]:
df_jpa.lines.hist()


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e7b2613898>

In [16]:
df_jpa_timed = df_jpa.set_index('timestamp')
df_jpa_timed['count'] = df_jpa_timed.lines.cumsum()
df_jpa_timed['count'].plot()


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e7b372c6d8>

Add some noise


In [17]:
dates_other = pd.date_range(df_jdbc.timestamp.min(), df_jpa.timestamp.max())
dates_other = pd.to_datetime(dates_other)
dates_other = dates_other[~dates_other.dayofweek.isin([5,6])]
dates_other = pd.Series(dates_other)
dates_other = dates_other.add(times.sample(len(dates_other), replace=True).values)
dates_other.head()


Out[17]:
0   2013-05-15 17:36:46
1   2013-05-16 22:05:34
2   2013-05-17 19:27:07
3   2013-05-20 06:28:34
4   2013-05-21 03:46:00
dtype: datetime64[ns]

In [18]:
df_other = pd.DataFrame([int(np.random.normal(5,100)) for i in range(0,40000)], columns=['lines'])
df_other['timestamp'] = dates_other.sample(len(df_other), replace=True).sort_values().reset_index(drop=True)
df_other = df_other.sort_index()
df_other['file'] = log[log['type'] == 'other']['file'].sample(len(df_other), replace=True).values
df_other.head()


Out[18]:
lines timestamp file
0 38 2013-05-15 17:36:46 src/test/java/org/springframework/samples/petc...
1 74 2013-05-15 17:36:46 src/main/java/org/springframework/samples/petc...
2 143 2013-05-15 17:36:46 src/main/java/org/springframework/samples/petc...
3 -54 2013-05-15 17:36:46 src/test/java/org/springframework/samples/petc...
4 -46 2013-05-15 17:36:46 src/main/java/org/springframework/samples/petc...

Check dataset


In [19]:
df_other.lines.hist()


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e7b380af28>

In [20]:
df_other_timed = df_other.set_index('timestamp')
df_other_timed['count'] = df_other_timed.lines.cumsum()
df_other_timed['count'].plot()


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e7b387e6a0>

Concatenate all datasets


In [21]:
df = pd.concat([df_jpa, df_jdbc, df_other], ignore_index=True).sort_values(by='timestamp')
# split the signed line counts into separate additions and deletions columns
df.loc[df.lines > 0, 'additions'] = df.lines
df.loc[df.lines < 0, 'deletions'] = df.lines * -1
df = df.fillna(0).reset_index(drop=True)
df = df[['additions', 'deletions', 'file', 'timestamp']]
# the very first commit cannot delete anything, so treat its deletions as additions
df.loc[(df.deletions > 0) & (df.loc[0].timestamp == df.timestamp),'additions'] = df.deletions
df.loc[df.loc[0].timestamp == df.timestamp,'deletions'] = 0
df['additions'] = df.additions.astype(int)
df['deletions'] = df.deletions.astype(int)
# order like a real git log: newest commits first
df = df.sort_values(by='timestamp', ascending=False)
df.head()


Out[21]:
additions deletions file timestamp
41799 56 0 src/main/java/org/springframework/samples/petc... 2019-07-19 19:16:44
41798 0 62 src/main/java/org/springframework/samples/petc... 2019-07-19 19:16:44
41770 0 117 src/main/java/org/springframework/samples/petc... 2019-07-19 06:09:16
41777 0 0 src/main/java/org/springframework/samples/petc... 2019-07-19 06:09:16
41776 0 46 src/main/java/org/springframework/samples/petc... 2019-07-19 06:09:16
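
If the special treatment of the first commit worked, there are no deletions left at the very first timestamp:

df[df.timestamp == df.timestamp.min()].deletions.sum()  # should be 0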

Truncate data at a fixed date


In [22]:
df = df[df.timestamp < pd.Timestamp('2018-01-01')]
df.head()


Out[22]:
additions deletions file timestamp
31486 19 0 src/main/java/org/springframework/samples/petc... 2017-12-31 19:41:29
31485 55 0 src/main/java/org/springframework/samples/petc... 2017-12-30 12:48:20
31484 29 0 src/main/java/org/springframework/samples/petc... 2017-12-30 12:48:20
31461 0 99 src/main/java/org/springframework/samples/petc... 2017-12-30 00:38:54
31467 19 0 src/main/java/org/springframework/samples/petc... 2017-12-30 00:38:54

Export the data


In [23]:
df.to_csv("datasets/git_log_refactoring.gz", index=False, compression='gzip')

Check loaded data


In [24]:
df_loaded = pd.read_csv("datasets/git_log_refactoring.gz")
df_loaded.head()


Out[24]:
additions deletions file timestamp
0 19 0 src/main/java/org/springframework/samples/petc... 2017-12-31 19:41:29
1 55 0 src/main/java/org/springframework/samples/petc... 2017-12-30 12:48:20
2 29 0 src/main/java/org/springframework/samples/petc... 2017-12-30 12:48:20
3 0 99 src/main/java/org/springframework/samples/petc... 2017-12-30 00:38:54
4 19 0 src/main/java/org/springframework/samples/petc... 2017-12-30 00:38:54

In [25]:
df_loaded.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31487 entries, 0 to 31486
Data columns (total 4 columns):
additions    31487 non-null int64
deletions    31487 non-null int64
file         31487 non-null object
timestamp    31487 non-null object
dtypes: int64(2), object(2)
memory usage: 984.0+ KB
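
Note that the timestamp column comes back as plain strings (dtype object). If real datetimes are needed later on, the dataset can also be loaded with the parse_dates option:

df_loaded = pd.read_csv("datasets/git_log_refactoring.gz", parse_dates=['timestamp'])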