Generating the data



In [1]:

    
cd ~/projekte/openbsd/openbsd-src/









    



/home/dloss/projekte/openbsd/openbsd-src



In [2]:

    
!git log --format=format:"%ai, %an" > ../commits



In [3]:

    
cd ..









    



/home/dloss/projekte/openbsd



In [4]:

    
ls









    



commits  genucommits.pdf  openbsd-src/



In [5]:

    
!head commits









    



2013-04-30 13:57:02 +0000, patrick
2013-04-30 13:23:52 +0000, patrick
2013-04-30 13:10:45 +0000, patrick
2013-04-30 13:05:44 +0000, jmc
2013-04-30 13:04:25 +0000, patrick
2013-04-30 13:00:21 +0000, patrick
2013-04-30 12:30:40 +0000, florian
2013-04-30 12:29:04 +0000, florian
2013-04-30 12:07:21 +0000, eric
2013-04-30 12:02:39 +0000, eric
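Each line of `commits` has the shape `<timestamp> <UTC offset>, <author>`, as produced by the `%ai, %an` format above. As a minimal sketch (the helper name `parse_commit_line` is made up for illustration), one such line can be parsed with the standard library alone:

```python
from datetime import datetime

def parse_commit_line(line):
    """Split one 'timestamp, author' line into (datetime, author)."""
    # Split on the first comma only, in case an author name contains a comma.
    stamp, author = line.split(",", 1)
    # %z parses the numeric UTC offset such as +0000.
    when = datetime.strptime(stamp.strip(), "%Y-%m-%d %H:%M:%S %z")
    return when, author.strip()

when, author = parse_commit_line("2013-04-30 13:05:44 +0000, jmc")
print(when.isoformat(), author)
```

The blank after the comma ends up in the author field unless it is stripped, which is why the sketch calls `strip()` on both parts.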

Reading the CSV



In [6]:

    
import pandas as pd



In [7]:

    
%time df=pd.read_csv("commits", header=None, names=["time", "author"])









    



CPU times: user 0.39 s, sys: 0.04 s, total: 0.43 s
Wall time: 0.43 s



In [8]:

    
df









    Out[8]:




<class 'pandas.core.frame.DataFrame'>
Int64Index: 142758 entries, 0 to 142757
Data columns (total 2 columns):
time      142758  non-null values
author    142758  non-null values
dtypes: object(2)

There are more than 142,000 commits since October 1995.



In [9]:

    
df.head()









    Out[9]:






  
    
      
                        time   author
0  2013-04-30 13:57:02 +0000  patrick
1  2013-04-30 13:23:52 +0000  patrick
2  2013-04-30 13:10:45 +0000  patrick
3  2013-04-30 13:05:44 +0000      jmc
4  2013-04-30 13:04:25 +0000  patrick

Authors



In [10]:

    
df.author.value_counts()









    Out[10]:





 deraadt     20602
 miod         7559
 millert      7046
 jmc          6274
 mickey       5860
 espie        4561
 henning      3327
 brad         3321
 niklas       3039
 art          3012
 kettenis     2939
 jason        2732
 markus       2368
 claudio      2216
 krw          2108
...
 hsuenaga    1
 rfreeman    1
 tomo        1
 gonzalo     1
 kristaps    1
 joey        1
 mhitch      1
 fgont       1
 ehrhardt    1
 tqbf        1
 yanick      1
 matt        1
 obecian     1
 tobiasu     1
 felix       1
Length: 307, dtype: int64

Cleaning the data

Convert the strings to real timestamps (this takes a long time):



In [11]:

    
%time df.index = pd.to_datetime(df['time'])









    



CPU times: user 69.55 s, sys: 1.32 s, total: 70.87 s
Wall time: 70.18 s
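The 70 seconds above probably come from the date layout being guessed row by row. A sketch of the same conversion with an explicit `format` follows; the sample values are copied from the `time` column, and `utc=True` is an added assumption that yields timezone-aware values, unlike the naive index used in the rest of this notebook. Passing the format is usually considerably faster, though the exact gain depends on the pandas version.

```python
import pandas as pd

# A few sample rows in the same shape as the `time` column above.
times = pd.Series([
    "2013-04-30 13:57:02 +0000",
    "1995-10-18 08:38:49 +0000",
])

# An explicit format avoids guessing the layout for every row;
# utc=True yields timezone-aware UTC timestamps.
parsed = pd.to_datetime(times, format="%Y-%m-%d %H:%M:%S %z", utc=True)
print(parsed)
```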



In [12]:

    
df.tail()









    Out[12]:






  
    
      
                                          time   author
time
1995-10-18 08:43:09  1995-10-18 08:43:09 +0000  deraadt
1995-10-18 08:43:09  1995-10-18 08:43:09 +0000  deraadt
1995-10-18 08:38:49  1995-10-18 08:38:49 +0000  deraadt
1995-10-18 08:38:49  1995-10-18 08:38:49 +0000  deraadt
1995-10-18 08:53:39  1995-10-18 08:53:39 +0000  deraadt



In [13]:

    
df.sort_index(inplace=True)



In [14]:

    
df.tail()









    Out[14]:






  
    
      
                                          time   author
time
2013-04-30 13:04:25  2013-04-30 13:04:25 +0000  patrick
2013-04-30 13:05:44  2013-04-30 13:05:44 +0000      jmc
2013-04-30 13:10:45  2013-04-30 13:10:45 +0000  patrick
2013-04-30 13:23:52  2013-04-30 13:23:52 +0000  patrick
2013-04-30 13:57:02  2013-04-30 13:57:02 +0000  patrick



In [15]:

    
del df['time']



In [16]:

    
df["c"]=1



In [17]:

    
df['author'].iloc[2343]









    Out[17]:





' tholo'

Strip leading and trailing whitespace from the author names:



In [18]:

    
df['author'] = df['author'].map(lambda x: x.strip())
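The stray blank comes from the `", "` separator in the `git log` format. As a sketch on two sample rows (the names `sample` and `df_sample` are illustrative), the blank can be dropped while reading, or removed afterwards with the vectorised string accessor instead of a lambda:

```python
import io
import pandas as pd

sample = io.StringIO(
    "2013-04-30 13:57:02 +0000, patrick\n"
    "2013-04-30 13:05:44 +0000, jmc\n"
)

# skipinitialspace=True drops the blank that follows each comma.
df_sample = pd.read_csv(sample, header=None, names=["time", "author"],
                        skipinitialspace=True)

# Equivalent clean-up after loading: vectorised strip on the column.
df_sample["author"] = df_sample["author"].str.strip()
print(df_sample["author"].tolist())
```

Either step alone is enough; both are shown only for comparison.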



In [19]:

    
df.head()









    Out[19]:






  
    
      
                      author  c
time
1995-10-18 08:38:49  deraadt  1
1995-10-18 08:38:49  deraadt  1
1995-10-18 08:43:09  deraadt  1
1995-10-18 08:43:09  deraadt  1
1995-10-18 08:43:49  deraadt  1

Commits per person



In [20]:

    
commits_per_person = df.author.value_counts()



In [21]:

    
commits_per_person.describe()









    Out[21]:





count      307.000000
mean       465.009772
std       1495.620881
min          1.000000
25%         13.000000
50%         68.000000
75%        350.000000
max      20602.000000
dtype: float64

307 committers? (At least there are 307 distinct commit author names.)

Graphically



In [22]:

    
%pylab inline









    



Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

Increase the figure size:



In [23]:

    
import matplotlib.pyplot as plt
# Default figure size in inches (width, height)
plt.rcParams["figure.figsize"] = (10, 6)



In [24]:

    
commits_per_person.plot()









    Out[24]:





<matplotlib.axes.AxesSubplot at 0x1679a10>



In [25]:

    
top30=commits_per_person[:30]
top30









    Out[25]:





deraadt     20602
miod         7559
millert      7046
jmc          6274
mickey       5860
espie        4561
henning      3327
brad         3321
niklas       3039
art          3012
kettenis     2939
jason        2732
markus       2368
claudio      2216
krw          2108
dlg          2053
drahn        1840
jsg          1790
angelos      1671
aaron        1643
nicm         1551
otto         1550
itojun       1520
fgsch        1401
damien       1349
tedu         1306
djm          1269
reyk         1120
downsj       1109
schwarze     1069
dtype: int64
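The counts are heavily skewed (mean about 465, median 68). A small worked calculation makes that concrete; the numbers are copied from the outputs above, and the helper `share` is only illustrative:

```python
# Commit counts copied from the outputs above (first five rows only).
top5 = {"deraadt": 20602, "miod": 7559, "millert": 7046,
        "jmc": 6274, "mickey": 5860}
total_commits = 142758  # total number of rows in df

def share(count, total):
    """Return count as a fraction of total."""
    return count / total

deraadt_share = share(top5["deraadt"], total_commits)
top5_share = share(sum(top5.values()), total_commits)
print(f"deraadt: {deraadt_share:.1%}, top 5: {top5_share:.1%}")
```

With the full Series, the same figure would be `commits_per_person[:5].sum() / commits_per_person.sum()`.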



In [26]:

    
top30.plot(kind="barh")









    Out[26]:





<matplotlib.axes.AxesSubplot at 0x52f3f90>

Genuesen



In [27]:

    
commits_per_person['markus']









    Out[27]:





2368



In [28]:

    
genuesen = ["markus", "bluhm", "mpf", "hshoexer", "grunk"]
genu_commits = commits_per_person.loc[genuesen]
genu_commits









    Out[28]:





markus      2368
bluhm        338
mpf          252
hshoexer     591
grunk         83
dtype: int64



In [29]:

    
bluhm = df[df.author == "bluhm"]



In [30]:

    
mpf = df[df.author == "mpf"]



In [31]:

    
bluhm.c.cumsum().plot(style="r", label="bluhm")
mpf.c.cumsum().plot(style="--", label="mpf")
title("mpf vs. bluhm")
legend(loc="best")









    Out[31]:





<matplotlib.legend.Legend at 0x552ded0>



In [32]:

    
markus = df[df.author == "markus"]
hshoexer = df[df.author == "hshoexer"]
grunk = df[df.author == "grunk"]



In [33]:

    
bluhm.c.cumsum().plot(style="r.", label="bluhm", alpha=0.2)
mpf.c.cumsum().plot(style="--.", label="mpf", alpha=0.2)

hshoexer.c.cumsum().plot(style="g-..", label="hshoexer")
grunk.c.cumsum().plot(style="m--.", label="grunk")
markus.c.cumsum().plot(style='k:.', label="markus", alpha=0.2)
legend(loc=0)
savefig("genucommits.pdf")
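The per-author cells above repeat the same filter-and-cumsum step. A sketch of a more compact variant uses `groupby(...).cumcount()` to number each author's commits in one pass. The frame `demo` is a tiny stand-in for `df`; its dates are illustrative apart from the three first-commit dates, which are taken from this notebook.

```python
import pandas as pd

# Tiny stand-in for df: a sorted DatetimeIndex plus an author column.
demo = pd.DataFrame(
    {"author": ["markus", "mpf", "markus", "bluhm", "mpf", "markus"]},
    index=pd.to_datetime([
        "1999-10-03", "2004-01-05", "2005-06-01",
        "2007-01-24", "2008-03-09", "2009-01-29",
    ]),
)

# Running commit number per author: 1 for the first commit, 2 for the next...
demo["n"] = demo.groupby("author").cumcount() + 1

# One cumulative series per author, ready for series.plot(label=name).
curves = {name: grp["n"] for name, grp in demo.groupby("author")}
for name, series in curves.items():
    print(name, series.tolist())
```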

The first commits



In [34]:

    
bluhm.head(1)









    Out[34]:






  
    
      
                    author  c
time
2007-01-24 13:24:58  bluhm  1



In [35]:

    
mpf.head(1)









    Out[35]:






  
    
      
                    author  c
time
2004-01-05 23:53:24    mpf  1



In [36]:

    
markus.head(1)









    Out[36]:






  
    
      
                     author  c
time
1999-10-03 19:17:40  markus  1

The last commits



In [37]:

    
hshoexer.tail(1)









    Out[37]:






  
    
      
                       author  c
time
2009-01-29 10:05:50  hshoexer  1



In [38]:

    
grunk.tail(1)









    Out[38]:






  
    
      
                    author  c
time
2009-10-01 11:37:33  grunk  1

Commits 2013



In [39]:

    
df.loc['2013']









    Out[39]:




<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2208 entries, 2013-01-01 00:02:01 to 2013-04-30 13:57:02
Data columns (total 2 columns):
author    2208  non-null values
c         2208  non-null values
dtypes: int64(1), object(1)

Commits over time



In [40]:

    
df.c.cumsum().plot()









    Out[40]:





<matplotlib.axes.AxesSubplot at 0x5986890>

Fairly steady growth, but it is slowing down. Hmm...?

Peak OpenBSD?



In [41]:

    
cmon=df.resample("M", how="sum").c.cumsum()
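The `how=` argument used above belongs to the pandas version this notebook was run with; current releases dropped it in favour of `.resample(...).sum()`. As a sketch that does not depend on resample frequency aliases, the monthly totals can also be obtained by grouping on monthly periods. The frame `demo` is a made-up stand-in for `df`.

```python
import pandas as pd

# Small stand-in for df: one row per commit, c = 1, DatetimeIndex.
demo = pd.DataFrame(
    {"c": 1},
    index=pd.to_datetime([
        "1995-10-18 08:38:49", "1995-10-19 10:00:00",
        "1995-11-02 12:00:00", "1995-12-24 18:30:00",
    ]),
)

# Group by calendar month and add up the per-commit counter.
monthly = demo.groupby(demo.index.to_period("M"))["c"].sum()

# Running total over the months, analogous to cmon above.
cumulative = monthly.cumsum()
print(cumulative)
```

Unlike resampling, grouping by period omits months without any commits, so the two approaches agree only when every month has at least one commit.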



In [42]:

    
y=cmon.values
y









    Out[42]:





array([   196,    394,    799,   1126,   1346,   1533,   1717,   2333,
         2927,   3335,   4039,   4735,   5330,   5652,   6147,   6673,
         7358,   7750,   8560,   9297,   9932,  10709,  11221,  11802,
        12294,  12741,  13141,  13417,  13852,  14484,  14816,  15259,
        15621,  16203,  16686,  17279,  17585,  18047,  18438,  18854,
        19585,  20143,  20553,  21058,  21493,  22162,  22726,  23338,
        23865,  24392,  25050,  25898,  26408,  27032,  27949,  28426,
        29173,  29757,  30220,  30767,  31434,  31930,  32382,  33241,
        34017,  34937,  35577,  36533,  38236,  39212,  40442,  41478,
        42122,  42838,  43539,  44341,  45358,  46182,  46972,  47909,
        49421,  50387,  50882,  51505,  52007,  52619,  53370,  54163,
        54931,  55779,  56657,  57733,  58839,  59647,  60389,  60960,
        61640,  62143,  62986,  64162,  65225,  65917,  66718,  67519,
        68404,  69346,  70226,  71031,  71647,  72483,  73487,  74358,
        74941,  76016,  77245,  78447,  79155,  80005,  80827,  81584,
        82619,  83757,  84943,  85879,  86559,  87673,  88622,  89883,
        91002,  91834,  92737,  93233,  94028,  95103,  95887,  96582,
        97511,  98444,  99362, 100740, 101582, 102054, 102470, 103309,
       104166, 105154, 105726, 106269, 106742, 107278, 107791, 108603,
       109685, 110339, 110897, 111336, 111906, 112698, 113396, 114075,
       114624, 115236, 115970, 116685, 117807, 118466, 119210, 119672,
       120410, 121375, 122024, 122700, 123310, 123819, 124533, 125124,
       126019, 127017, 127556, 128249, 128767, 129158, 129603, 130165,
       130300, 130706, 131552, 132037, 132555, 133557, 133780, 134245,
       134582, 134995, 135294, 135748, 135950, 136323, 136820, 137161,
       137500, 138145, 138737, 139254, 139748, 140222, 140550, 141060,
       141355, 142187, 142758])



In [43]:

    
x=arange(cmon.size)
x









    Out[43]:





array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
       182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
       195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,
       208, 209, 210])



In [44]:

    
p=np.polyfit(x,y, 3)
p









    Out[44]:





array([ -1.87535016e-02,   6.07155155e+00,   2.17719734e+02,
         1.83929708e+03])



In [45]:

    
x1=arange(300)



In [46]:

    
plot(x,y)
plot(x1,np.polyval(p,x1), "r")
xlabel("months")









    Out[46]:





<matplotlib.text.Text at 0x5ef8210>



In [47]:

    
np.polyval(p,x1).max()









    Out[47]:





144967.34021697595



In [48]:

    
y.max()









    Out[48]:





142758



In [49]:

    
pd.Series(np.polyval(p,x1)).idxmax()









    Out[49]:





232



In [50]:

    
pd.Series(x).idxmax()









    Out[50]:





210
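The peak of the fitted cubic can also be found analytically by setting its derivative to zero, instead of scanning the evaluated curve. The sketch below uses the coefficients printed in Out[44] and assumes month 0 is October 1995, the first monthly bin. It is an extrapolation of a polynomial fit, not a forecast.

```python
import math

# Cubic coefficients a*x^3 + b*x^2 + c*x + d, copied from Out[44].
a, b, c, d = -1.87535016e-02, 6.07155155e+00, 2.17719734e+02, 1.83929708e+03

# The derivative 3a*x^2 + 2b*x + c is zero at the extrema.
qa, qb, qc = 3 * a, 2 * b, c
disc = qb * qb - 4 * qa * qc
roots = [(-qb + s * math.sqrt(disc)) / (2 * qa) for s in (1, -1)]

# The maximum is the root where the second derivative 6a*x + 2b is negative.
peak_x = next(r for r in roots if 6 * a * r + 2 * b < 0)

# Month 0 is October 1995, so convert the month offset to a calendar month.
months = 9 + int(peak_x)          # 9 = zero-based index of October
peak_year, peak_month = 1995 + months // 12, months % 12 + 1
print(round(peak_x, 1), peak_year, peak_month)
```

The root lies between months 232 and 233, consistent with the `idxmax()` result of 232 on the integer grid.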

Commits per day



In [51]:

    
per_day=df.resample("D", how="sum")



In [52]:

    
per_day.head()

The most productive day



In [53]:

    
per_day.c.max()









    Out[53]:





221.0



In [54]:

    
per_day.c.idxmax()









    Out[54]:





<Timestamp: 2001-06-25 00:00:00>



In [55]:

    
df.loc["2001-06-25"]









    Out[55]:




<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 221 entries, 2001-06-25 00:00:44 to 2001-06-25 23:49:28
Data columns (total 2 columns):
author    221  non-null values
c         221  non-null values
dtypes: int64(1), object(1)
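With the method-style resample API of current pandas, the busiest day and its count can be computed as sketched here. The frame `demo` is a made-up stand-in for `df`; only the two timestamps that bound 2001-06-25 are taken from the output above.

```python
import pandas as pd

demo = pd.DataFrame(
    {"c": 1},
    index=pd.to_datetime([
        "2001-06-24 23:10:00", "2001-06-25 00:00:44",
        "2001-06-25 12:00:00", "2001-06-25 23:49:28",
        "2001-06-27 09:00:00",
    ]),
)

# Daily bins; days without commits get a count of 0.
per_day_demo = demo["c"].resample("D").sum()

busiest = per_day_demo.idxmax()
print(busiest.date(), per_day_demo.max())
```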



In [56]:

    
per_month=df.resample("M", how="sum")
per_month.max()









    Out[56]:





c    1703
dtype: int64



In [57]:

    
per_month.idxmax()









    Out[57]:





c   2001-06-30 00:00:00
dtype: datetime64[ns]

Most commits were made in June 2001: 1703.



In [58]:

    
per_month.c.describe()









    Out[58]:





count     211.000000
mean      676.578199
std       261.998541
min       135.000000
25%       495.000000
50%       645.000000
75%       840.500000
max      1703.000000
dtype: float64

The number of commits per month fluctuates considerably: between 135 and 1703, with a standard deviation of about 262.



In [59]:

    
per_month.c.plot()









    Out[59]:





<matplotlib.axes.AxesSubplot at 0x688fb50>



In [60]:

    
pd.rolling_mean(per_month.c, 15).plot()









    Out[60]:





<matplotlib.axes.AxesSubplot at 0x6eb2490>
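`pd.rolling_mean` is the API of the pandas version used here; in current pandas the same smoothing is written with the `.rolling()` method. A sketch on the first six monthly counts (differences of the cumulative values in Out[42]), with a window of 3 instead of 15 so that the short sample shows some results:

```python
import pandas as pd

# First six monthly commit counts, derived from the cumulative array above.
monthly_demo = pd.Series([196, 198, 405, 327, 220, 187])

# Moving average; the first window-1 entries are NaN.
smoothed = monthly_demo.rolling(window=3).mean()
print(smoothed.tolist())
```

For the full data the equivalent of the cell above would be `per_month.c.rolling(15).mean().plot()`.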

When do commits typically happen (weekday, time of day)?

Create a new column holding the timestamp, because the index is awkward to work with.



In [61]:

    
df["datetime"]=df.index
df.head()









    Out[61]:






  
    
      
                      author  c             datetime
time
1995-10-18 08:38:49  deraadt  1  1995-10-18 08:38:49
1995-10-18 08:38:49  deraadt  1  1995-10-18 08:38:49
1995-10-18 08:43:09  deraadt  1  1995-10-18 08:43:09
1995-10-18 08:43:09  deraadt  1  1995-10-18 08:43:09
1995-10-18 08:43:49  deraadt  1  1995-10-18 08:43:49



In [62]:

    
df['weekday'] = df['datetime'].apply(lambda x: x.isoweekday())

Monday = 1, Sunday = 7
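The weekday and hour can also be read directly from a `DatetimeIndex` without a per-row `apply`. In this sketch `demo` is a made-up stand-in for `df`; the times on the two 2013 dates are illustrative.

```python
import pandas as pd

demo = pd.DataFrame(
    {"c": 1},
    index=pd.to_datetime([
        "1995-10-18 08:38:49",   # a Wednesday
        "2013-04-29 21:15:00",   # a Monday
        "2013-04-28 21:40:00",   # a Sunday
    ]),
)

# dayofweek counts Monday as 0, so add 1 to match isoweekday().
demo["weekday"] = demo.index.dayofweek + 1
demo["hour"] = demo.index.hour

print(demo.groupby("weekday")["c"].sum())
```

This keeps the Monday = 1, Sunday = 7 convention and avoids the helper `datetime` column.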



In [63]:

    
df.head()









    Out[63]:






  
    
      
                      author  c             datetime  weekday
time
1995-10-18 08:38:49  deraadt  1  1995-10-18 08:38:49        3
1995-10-18 08:38:49  deraadt  1  1995-10-18 08:38:49        3
1995-10-18 08:43:09  deraadt  1  1995-10-18 08:43:09        3
1995-10-18 08:43:09  deraadt  1  1995-10-18 08:43:09        3
1995-10-18 08:43:49  deraadt  1  1995-10-18 08:43:49        3



In [64]:

    
wd=df.loc[:,["c","weekday"]]



In [65]:

    
per_weekday=wd.groupby("weekday")



In [66]:

    
per_weekday.sum()

Most commits are made early in the week; fewer on the weekend (Saturday in particular).



In [67]:

    
per_weekday.sum().plot(kind="bar")









    Out[67]:





<matplotlib.axes.AxesSubplot at 0x694aa10>



In [68]:

    
df['hour'] = df['datetime'].apply(lambda x: x.hour)



In [69]:

    
per_hour = df.groupby('hour')['c'].sum()
per_hour.plot(kind="bar")









    Out[69]:





<matplotlib.axes.AxesSubplot at 0x7d137d0>

Most commits happen in the evening (going by the +0000 timestamps, i.e. UTC).

Output of per_weekday.sum() (In [66]):

             c
weekday
1        22131
2        22230
3        21833
4        21516
5        20204
6        16665
7        18179

Output of per_day.head() (In [52]):

             c
time
1995-10-18  52
1995-10-19  16
1995-10-20   4
1995-10-21   8
1995-10-22  14