INTRODUCTION

This notebook aims at providing a first approach to manage the new code review database. The usual code review process is done through the mailing lists. This approach migrates that code review into a database schema closer to how Gerrit or another review system works.

The list of metrics for this notebook are the following ones:

Activity General Overview

  • Evolution of patch series (1 or more patches)
  • Evolution of patch series submitters
  • Evolution of comments
  • Evolution of people commenting the patch series
  • Evolution of people reviewing patches (using the flag reviewed by)

Time analysis

A patch can be divided into several steps:

1---------2----------3---------4--------5---------A--------C

And each version can be divided into several steps:

1----a)review------b)review------c)review-----------d)review--------------2----------------------------------------------3

Where 1, 2, 3, .., are the several iterations, A the point where all of the patches were 'Acked-by' and C the commit action into master. And a, b, c, d comments and reviews.

  • Time to merge: time between 1 and C
  • Time to commit: time between A and C
  • Time to re-work a patch: time between 1_d and the new iteration. Time between the last comment and a new patch.
  • Cycle time: time between each pair of iterations: 1&2, 2&3, etc.
  • Time to first review: time between 1, 2, 3, etc and its first review.

Backlog analysis

  • Review (of series/patch) completed: all of the patch series merged
  • Review (of series/patch) active: patches that were recently reviewed, we'll take the last 7 days as a potential timeframe.
  • Review (of series/patch) stalled: patches older than 1 year
  • Review (of series/patch) ongoing: patches younger than 1 year. This would include the active reviews.

Patch series complexity analysis

  • Number of versions per patch serie
  • From patches merged: check number of 'touched' files plus added and removed lines
  • Comments received per patch
  • Number of patches per patch serie

Patch series community

  • Top people sending patches
  • Top people reviewing patches
  • For all of those, basic analysis with organization info based on email domain

Analysis

0) LIBRARIES NEEDED FOR THIS NOTEBOOK


In [1]:
%pylab inline
# Enabling libraries to manage data and plot
import pandas as pd
import numpy as np
from pandas import Series, DataFrame, Panel


Populating the interactive namespace from numpy and matplotlib

0) DATABASE CONNECTION


In [2]:
# Working with the xen-database with code review info
import MySQLdb


def connect():
   user = "root"
   password = ""
   host = "localhost"
   db = "xen_reports_patchsets3"

   try:
      db = MySQLdb.connect(user = user, passwd = password, db = db)
      return db, db.cursor()
   except:
      logging.error("Database connection error")
      raise
        


def execute_query(connector, query):
   results = int (connector.execute(query))
   cont = 0
   if results > 0:
      result1 = connector.fetchall()
      return result1
   else:
      return []
    

db, cursor = connect()

In [3]:
def draw_distribution(data_df, key_field, hist, boxplot, boxplot_log=True):
    # This function draws as histogram and boxplot the distribution 
    # of the specified charts.
    # data: raw data coming from database
    # columns_list: list of columns represented in 'data'
    # key_field: key field to be represented in the histogram and boxplot
    # hist: contains the xlabel, ylabel and title fields
    # boxplot: contains the ylabel and title fields
    
    fig, axs = plt.subplots(1,2, figsize=(12,5))

    hist_chart = data_df[key_field].plot(kind="hist", ax=axs[0])
    boxplot_chart = data_df[key_field].plot(kind="box", ax=axs[1], logy=boxplot_log)

    #chart labels
    hist_chart.set_xlabel(hist["xlabel"])
    hist_chart.set_ylabel(hist["ylabel"])
    hist_chart.set_title(hist["title"])

    boxplot_chart.set_ylabel(boxplot["ylabel"])
    boxplot_chart.set_title(boxplot["title"])

1) ACTIVITY GENERAL OVERVIEW

Metrics to retrieve:

  • Evolution of patch series (1 or more patches)
  • Evolution of patch series submitters
  • Evolution of comments
  • Evolution of people commenting the patch series
  • Evolution of people reviewing patches (using the flag reviewed by)

Evolution of Patch Series

Xen community uses the term patch or patch serie when sending pieces of code to the mailing list. In order to simplify this process, a patch serie is now conformed of one or more patches.


In [4]:
query = """ SELECT year,
                   month,
                   count(distinct(ps)) as patch_series
            FROM 
                 (SELECT ps_version_id as ps,
                         year(min(date_utc)) as year,
                         month(min(date_utc)) as month
                  FROM patches
                  GROUP BY ps_version_id
                  ORDER BY min(date_utc)) t
            GROUP BY year, month
        """
data = execute_query(cursor, query)

In [5]:
np_data = np.array(data)
np_data_filtered = np_data[1:-2] # This remove 2001 and 2016 data
#Migrating to use Series in pandas (nicer view and easier to manage series)
dates = pd.date_range('2004-08', '2015-11', freq='M') # data starting in 2004-08 and ending in 2015-08
#Checking size of both arrays: dates.shape and np_data_filtered.shape
patchserieTS = Series(np_data_filtered[:,2], index=dates)

In [6]:
patchserieschart = patchserieTS.plot()
patchserieschart.set_xlabel("Time evolution")
patchserieschart.set_ylabel("Number of patches")
patchserieschart.set_title("Evolution of patch series per month")


Out[6]:
<matplotlib.text.Text at 0x7fcfb1d53c50>

Evolution of Patch Series Submitters


In [7]:
query = """ SELECT year,
                   month,
                   count(distinct(submitter)) as submitters
            FROM 
                 (SELECT submitter_id as submitter,
                         year(min(date_utc)) as year,
                         month(min(date_utc)) as month
                  FROM patches
                  GROUP BY ps_version_id
                  ORDER BY min(date_utc)) t
            GROUP BY year, month
        """
data_submitters = execute_query(cursor, query)

In [8]:
np_data = np.array(data_submitters)
np_data_filtered = np_data[1:-2] # This remove 2001 and 2016 data
patchserie_submittersTS = Series(np_data_filtered[:,2], index=dates)

In [9]:
patchserie_submitters_chart = patchserie_submittersTS.plot()
patchserie_submitters_chart.set_xlabel("Time evolution")
patchserie_submitters_chart.set_ylabel("Number of patch submitters")
patchserie_submitters_chart.set_title("Evolution of unique patch submitters per month")


Out[9]:
<matplotlib.text.Text at 0x7fcfb1bf3410>

Evolution of comments


In [10]:
# Let's calculate evolution of comments in general when they were sent
# We should ignore 1970 emails with wrong sent date
query = """ SELECT year(date_utc) as year,
                   month(date_utc) as month,
                   count(distinct(comments.id)) as comments
            FROM comments
            WHERE year(date_utc) > 2004
            GROUP BY year(date_utc),
                     month(date_utc)
        """
comments = execute_query(cursor, query)

In [11]:
dates = pd.date_range('2005-01', '2015-11', freq='M') # data starting in 2005-01 and ending in 2015-08
np_data = np.array(comments)
commentsTS = Series(np_data[:,2], index=dates)

In [12]:
comments_chart = commentsTS.plot()
comments_chart.set_xlabel("Time evolution")
comments_chart.set_ylabel("Number of comments")
comments_chart.set_title("Evolution of comments per month")


Out[12]:
<matplotlib.text.Text at 0x7fcfb1b5cb10>

Evolution of people commenting the patch series


In [13]:
query = """ SELECT psv.ps_id as patch_serie,
                   MIN(psv.date_utc) as patch_serie_sent_date,
                   COUNT(DISTINCT(c.submitter_id)) as commenters
            FROM patch_series_version psv,
                 patches p,
                 comments c 
            WHERE psv.id = p.ps_version_id AND 
                  p.id=c.patch_id 
            GROUP BY psv.ps_id """
peoplecomments_ps = execute_query(cursor, query)

In [14]:
peoplecomments_list = list(peoplecomments_ps)
peoplecomments_df = DataFrame(peoplecomments_list, columns=["patch_serie", "partch_serie_sent_date", "commenters"])

In [15]:
hist = {"xlabel":"People commenting", "ylabel":"Number of patch series", "title":"Histogram: number of reviews"}
boxplot = {"ylabel":"People commenting per patch", "title":"Log scale: Dist. people commenting"}
draw_distribution(peoplecomments_df, "commenters", hist, boxplot)



In [16]:
mean_peoplecomments = peoplecomments_df.set_index("partch_serie_sent_date").resample('M', how={"commenters":np.mean})
median_peoplecomments = peoplecomments_df.set_index("partch_serie_sent_date").resample('M', how={"commenters":np.median})
dates = pd.date_range('2010-10', '2015-11', freq='M') # data starting in 2010-10 and ending in 2015-08
peoplecommentsTS = Series(mean_peoplecomments["commenters"], index=dates)
peoplecomments_medianTS = Series(median_peoplecomments["commenters"], index=dates)

peoplecommentsTS_chart = peoplecommentsTS.plot()
peoplecomments_medianTS.plot()
peoplecommentsTS_chart.set_xlabel("Time evolution")
peoplecommentsTS_chart.set_ylabel("People commenting per patch serie")
peoplecommentsTS_chart.set_title("Evolution of the mean (blue) and median (green) number of people per patch serie")


Out[16]:
<matplotlib.text.Text at 0x7fcfb1473d50>

Evolution of people reviewing patches (using the flag reviewed by)


In [17]:
#TODO

2) TIME ANALYSIS

Metrics to retrieve:

  • Time to merge: time between 1 and C
  • Time to commit: time between A and C
  • Time to re-work a patch: time between 1_d and the new iteration. Time between the last comment and a new patch.
  • Cycle time: time between each pair of iterations: 1&2, 2&3, etc.
  • Time to first review: time between 1, 2, 3, etc and its first review.
  • Time to ack

Time to merge

Time to merge needs to be calculated as the time when a patch was initially sent till this is merge into code. As each patch is associated to a commit, this is obtained as the first date when the patch was sent till the last commit for that patch serie is merged into code.


In [18]:
query = """ SELECT psv.ps_id as patch_serie,
                   TIMESTAMPDIFF(SECOND, MIN(psv.date_utc), MAX(c.committer_date_utc)) as time2merge,
                   MIN(psv.date_utc) as first_patch_date,
                   MAX(c.committer_date_utc) as merge_time
            FROM patch_series_version psv,
                 patches p,
                 commits c
            WHERE psv.id=p.ps_version_id AND 
                  p.commit_id = c.id 
            GROUP BY psv.ps_id
            HAVING time2merge >=0 """
time2merge = execute_query(cursor, query)

In [19]:
time2merge_list = list(time2merge)
time2merge_df = DataFrame(time2merge_list, columns=["patch_serie", "time2merge", "first_patch_date", "merge_time"])
fig, axs = plt.subplots(1,2, figsize=(12,5))


#Migrate seconds to days
time2merge_df["time2merge_days"] = (time2merge_df["time2merge"]/3600)/24


hist_chart = time2merge_df["time2merge_days"].plot(kind="hist", ax=axs[0])
boxplot_chart = time2merge_df["time2merge_days"].plot(kind="box", ax=axs[1], logy=True)

#chart labels
hist_chart.set_xlabel("Days to merge")
hist_chart.set_ylabel("Number of reviews")
hist_chart.set_title("Histogram: number of reviews")

boxplot_chart.set_ylabel("Days to merge")
boxplot_chart.set_title("Log scale: Dist. time2merge")


Out[19]:
<matplotlib.text.Text at 0x7fcfb16cfad0>

In [20]:
mean_time2merge = time2merge_df.set_index("merge_time").resample('M', how={"time2merge_days":np.mean})
median_time2merge = time2merge_df.set_index("merge_time").resample('M', how={"time2merge_days":np.median})
dates = pd.date_range('2010-10', '2015-11', freq='M') # data starting in 2010-10 and ending in 2015-08
time2mergeTS = Series(mean_time2merge["time2merge_days"], index=dates)
time2merge_medianTS = Series(median_time2merge["time2merge_days"], index=dates)

time2mergeTS_chart = time2mergeTS.plot()
time2merge_medianTS.plot()
time2mergeTS_chart.set_xlabel("Time evolution")
time2mergeTS_chart.set_ylabel("Number of days")
time2mergeTS_chart.set_title("Evolution of the mean (blue) and median (green) time to merge a patch serie")


Out[20]:
<matplotlib.text.Text at 0x7fcfb12cfb10>

Time to commit


In [21]:
query = """ SELECT psv.ps_id as patch_serie,
                   TIMESTAMPDIFF(SECOND, MAX(c.date_utc), MAX(commits.committer_date_utc)) as time2commit,
                   MAX(c.date_utc) as last_comment_date, 
                   MAX(commits.committer_date_utc) as commit_time 
            FROM patch_series_version psv, 
                 patches p, 
                 comments c, 
                 commits 
            WHERE psv.id = p.ps_version_id AND 
                  p.commit_id = commits.id AND 
                  p.id=c.patch_id 
            GROUP BY psv.ps_id
            HAVING time2commit >= 0 and time2commit < 2000*3600*24"""
time2commit = execute_query(cursor, query)

In [22]:
time2commit_list = list(time2commit)
time2commit_df = DataFrame(time2commit_list, columns=["patch_serie", "time2commit", "last_comment_date", "commit_time"])
fig, axs = plt.subplots(1,2, figsize=(12,5))


#Migrate seconds to days
time2commit_df["time2commit_days"] = (time2commit_df["time2commit"]/3600)/24


hist_chart = time2commit_df["time2commit_days"].plot(kind="hist", ax=axs[0])
boxplot_chart = time2commit_df["time2commit_days"].plot(kind="box", ax=axs[1], logy=True)

#chart labels
hist_chart.set_xlabel("Days to commit")
hist_chart.set_ylabel("Number of patch series")
hist_chart.set_title("Histogram: days to commit")

boxplot_chart.set_ylabel("Days to commit")
boxplot_chart.set_title("Log scale: Dist. time2commit")


Out[22]:
<matplotlib.text.Text at 0x7fcfb1135c90>

In [23]:
mean_time2commit = time2commit_df.set_index("commit_time").resample('M', how={"time2commit_days":np.mean})
median_time2commit = time2commit_df.set_index("commit_time").resample('M', how={"time2commit_days":np.median})
dates = pd.date_range('2010-10', '2015-11', freq='M') # data starting in 2010-10 and ending in 2015-08
time2commitTS = Series(mean_time2commit["time2commit_days"], index=dates)
time2commit_medianTS = Series(median_time2commit["time2commit_days"], index=dates)

time2commitTS_chart = time2commitTS.plot()
time2commit_medianTS.plot()
time2commitTS_chart.set_xlabel("Time evolution")
time2commitTS_chart.set_ylabel("Number of days")
time2commitTS_chart.set_title("Evolution of the mean (blue) and median (green) time to commit a patch serie")


Out[23]:
<matplotlib.text.Text at 0x7fcfb0edaf50>

Time to re-work a patch


In [24]:
query = """ SELECT psv.ps_id as patch_serie,
                   psv.version as version,
                   psv.date_utc as rework_time, 
                   t.last_activity as last_version_activity, 
                   timestampdiff(SECOND, psv.date_utc, t.last_activity) as time2rework 
            FROM patch_series_version psv, 
                 (SELECT psv.ps_id, 
                         psv.version, 
                         greatest(max(c.date_utc), max(p.date_utc)) as last_activity 
                  FROM patch_series_version psv, 
                       patches p, 
                       comments c 
                  WHERE psv.id = p.ps_version_id AND 
                        c.patch_id=p.id 
                  GROUP BY psv.ps_id, psv.version) t 
            WHERE psv.ps_id = t.ps_id AND 
                  psv.version = t.version -1
            HAVING time2rework >= 0 and time2rework < 1000*3600*24"""
time2rework = execute_query(cursor, query)

In [25]:
time2rework_list = list(time2rework)
time2rework_df = DataFrame(time2rework_list, columns=["patch_serie", "version", "rework_time", "last_version_activity", "time2rework"])
fig, axs = plt.subplots(1,2, figsize=(12,5))


#Migrate seconds to days
time2rework_df["time2rework_days"] = (time2rework_df["time2rework"]/3600)/24


hist_chart = time2rework_df["time2rework_days"].plot(kind="hist", ax=axs[0])
boxplot_chart = time2rework_df["time2rework_days"].plot(kind="box", ax=axs[1], logy=True)

#chart labels
hist_chart.set_xlabel("Days to re-work")
hist_chart.set_ylabel("Number of patch series")
hist_chart.set_title("Histogram: days to patch serie re-work")

boxplot_chart.set_ylabel("Days to re-work")
boxplot_chart.set_title("Log scale: Dist. time2rework")


Out[25]:
<matplotlib.text.Text at 0x7fcfb0d3ac50>

In [26]:
mean_time2rework = time2rework_df.set_index("rework_time").resample('M', how={"time2rework_days":np.mean})
median_time2rework = time2rework_df.set_index("rework_time").resample('M', how={"time2rework_days":np.median})
dates = pd.date_range('2010-10', '2015-09', freq='M') # data starting in 2010-10 and ending in 2015-08
time2reworkTS = Series(mean_time2rework["time2rework_days"], index=dates)
time2rework_medianTS = Series(median_time2rework["time2rework_days"], index=dates)

time2reworkTS_chart = time2reworkTS.plot()
time2rework_medianTS.plot()
time2reworkTS_chart.set_xlabel("Time evolution")
time2reworkTS_chart.set_ylabel("Number of days")
time2reworkTS_chart.set_title("Evolution of the mean (blue) and median (green) time to re-work a patch serie")


Out[26]:
<matplotlib.text.Text at 0x7fcfb0a8ae50>

Cycle time


In [27]:
query = """ SELECT p1.ps_id as patch_serie,
                   p1.version as version,
                   p1.date_utc as cycle1,
                   p2.date_utc as cycle2,
                   timestampdiff(SECOND, p1.date_utc, p2.date_utc) as time2newcycle
            FROM patch_series_version p1,
                 patch_series_version p2
            WHERE p1.ps_id = p2.ps_id AND 
                  p1.version = p2.version -1 
            HAVING time2newcycle >= 0 and time2newcycle<1000*3600*24
            ORDER BY p1.ps_id  """
time2newcycle = execute_query(cursor, query)

In [28]:
time2newcycle_list = list(time2newcycle)
time2newcycle_df = DataFrame(time2newcycle_list, columns=["patch_serie", "version", "cycle1", "cycle2", "time2newcycle"])
fig, axs = plt.subplots(1,2, figsize=(12,5))


#Migrate seconds to days
time2newcycle_df["time2newcycle_days"] = (time2newcycle_df["time2newcycle"]/3600)/24


hist_chart = time2newcycle_df["time2newcycle_days"].plot(kind="hist", ax=axs[0])
boxplot_chart = time2newcycle_df["time2newcycle_days"].plot(kind="box", ax=axs[1], logy=True)

#chart labels
hist_chart.set_xlabel("Days to new cycle")
hist_chart.set_ylabel("Number of patch series")
hist_chart.set_title("Histogram: days to new cycle")

boxplot_chart.set_ylabel("Days to new cycle")
boxplot_chart.set_title("Log scale: Dist. time2newcycle")


Out[28]:
<matplotlib.text.Text at 0x7fcfb088fa10>

In [29]:
mean_time2newcycle = time2newcycle_df.set_index("cycle2").resample('M', how={"time2newcycle_days":np.mean})
median_time2newcycle = time2newcycle_df.set_index("cycle2").resample('M', how={"time2newcycle_days":np.median})
dates = pd.date_range('2010-10', '2015-09', freq='M') # data starting in 2010-10 and ending in 2015-08
time2newcycleTS = Series(mean_time2newcycle["time2newcycle_days"], index=dates)
time2newcycle_medianTS = Series(median_time2newcycle["time2newcycle_days"], index=dates)

time2newcycleTS_chart = time2newcycleTS.plot()
time2newcycle_medianTS.plot()
time2newcycleTS_chart.set_xlabel("Time evolution")
time2newcycleTS_chart.set_ylabel("Number of days")
time2newcycleTS_chart.set_title("Evolution of the mean (blue) and median (green) time to have a new cycle in a patch serie")


Out[29]:
<matplotlib.text.Text at 0x7fcfb097ec10>

Time to first review

Either this is a patch or a patch serie, this analysis is focused at the level of patch. This ignores all of the patches that are not reviewed. At the level of 'flags', this counts the time between a patch is sent till a 'Reviewed-by' flag appears.

Some issues found:

Assumption:

  • Let's ignore first time reviews, so let's analyze only those times whose time to review is greater than 0 and when a patch has more than 1 review, selecting the minimum time among them. In this way, we'll avoid to have more than one repeated review.

In [30]:
query = """ SELECT p.subject as subject,
                   p.date_utc as sent_date,
                   f.flag as flag,
                   f.date_utc as reviewed_date,
                   TIMESTAMPDIFF(SECOND, p.date_utc, min(f.date_utc)) as time2review
            FROM patches p,
                 flags f
            WHERE p.id = f.patch_id  and
                  f.flag='Reviewed-by' 
            GROUP BY p.subject 
            HAVING TIMESTAMPDIFF(SECOND, p.date_utc, min(f.date_utc)) > 0 and TIMESTAMPDIFF(SECOND, p.date_utc, min(f.date_utc))
        """
time2review = execute_query(cursor, query)

In [31]:
time2review_list = list(time2review)
time2review_df = DataFrame(time2review_list, columns=["subject", "date", "flag", "review_time", "time2review"])
fig, axs = plt.subplots(1,2, figsize=(12,5))


#Migrate seconds to days
time2review_df["time2review_days"] = (time2review_df["time2review"]/3600)/24


hist_chart = time2review_df["time2review_days"].plot(kind="hist", ax=axs[0])
boxplot_chart = time2review_df["time2review_days"].plot(kind="box", ax=axs[1], logy=True)

#chart labels
hist_chart.set_xlabel("Days to review")
hist_chart.set_ylabel("Number of reviews")
hist_chart.set_title("Histogram: number of reviews")

boxplot_chart.set_ylabel("Days to review")
boxplot_chart.set_title("Log scale: Dist. time2review")


Out[31]:
<matplotlib.text.Text at 0x7fcfb052c750>

In [32]:
mean_time2review = time2review_df.set_index("review_time").resample('M', how={"time2review_days":np.mean})
median_time2review = time2review_df.set_index("review_time").resample('M', how={"time2review_days":np.median})
dates = pd.date_range('2010-10', '2015-11', freq='M') # data starting in 2010-10 and ending in 2015-08
time2reviewTS = Series(mean_time2review["time2review_days"], index=dates)
time2review_medianTS = Series(median_time2review["time2review_days"], index=dates)

time2reviewTS_chart = time2reviewTS.plot()
time2review_medianTS.plot()
time2reviewTS_chart.set_xlabel("Time evolution")
time2reviewTS_chart.set_ylabel("Number of days")
time2reviewTS_chart.set_title("Evolution of the mean (blue) and median (green) time to first-review a patch")


Out[32]:
<matplotlib.text.Text at 0x7fcfb04ac810>

Time to ack


In [33]:
query = """ SELECT t1.ps_id as patch_serie,
                   t1.first_patch_date,
                   f.flag as flag,
                   f.date_utc as acked_date,
                   TIMESTAMPDIFF(SECOND, t1.first_patch_date, min(f.date_utc)) as time2ack
            FROM patches p,
                 flags f,
                 (select psv.ps_id,
                         p.id as patch_id, 
                         min(psv.date_utc) as first_patch_date
                  from patch_series_version psv,
                       patches p
                  where p.ps_version_id=psv.id
                  group by psv.ps_id) t1
            WHERE p.id = f.patch_id  and
                  f.flag='Acked-by' and
                  p.id = t1.patch_id
            GROUP BY t1.ps_id 
            HAVING TIMESTAMPDIFF(SECOND, t1.first_patch_date, min(f.date_utc)) > 0 and TIMESTAMPDIFF(SECOND, t1.first_patch_date, min(f.date_utc))
        """
time2ack = execute_query(cursor, query)

In [34]:
time2ack_list = list(time2ack)
time2ack_df = DataFrame(time2ack_list, columns=["subject", "date", "flag", "ack_time", "time2ack"])
fig, axs = plt.subplots(1,2, figsize=(12,5))


#Migrate seconds to days
time2ack_df["time2ack_days"] = (time2ack_df["time2ack"]/3600)/24


hist_chart = time2ack_df["time2ack_days"].plot(kind="hist", ax=axs[0])
boxplot_chart = time2ack_df["time2ack_days"].plot(kind="box", ax=axs[1], logy=True)

#chart labels
hist_chart.set_xlabel("Days to ack")
hist_chart.set_ylabel("Number of reviews")
hist_chart.set_title("Histogram: number of reviews")

boxplot_chart.set_ylabel("Days to ack")
boxplot_chart.set_title("Log scale: Dist. time2ack")


Out[34]:
<matplotlib.text.Text at 0x7fcfb01144d0>

In [35]:
mean_time2ack = time2ack_df.set_index("ack_time").resample('M', how={"time2ack_days":np.mean})
median_time2ack = time2ack_df.set_index("ack_time").resample('M', how={"time2ack_days":np.median})
dates = pd.date_range('2011-10', '2015-11', freq='M') # data starting in 2010-10 and ending in 2015-08
time2ackTS = Series(mean_time2ack["time2ack_days"], index=dates)
time2ack_medianTS = Series(median_time2ack["time2ack_days"], index=dates)

time2ackTS_chart = time2ackTS.plot()
time2ack_medianTS.plot()
time2ackTS_chart.set_xlabel("Time evolution")
time2ackTS_chart.set_ylabel("Number of days")
time2ackTS_chart.set_title("Evolution of the mean (blue) and median (green) time to ack a patch")


Out[35]:
<matplotlib.text.Text at 0x7fcfaff82910>

3) BACKLOG ANALYSIS

Metrics to retrieve:

  • Review (of series/patch) completed: all of the patch series merged
  • Review (of series/patch) active: patches that were recently reviewed, we'll take the last 7 days as a potential timeframe.
  • Review (of series/patch) stalled: patches older than 1 year
  • Review (of series/patch) ongoing: patches younger than 1 year. This would include the active reviews.

Reviews completed


In [36]:
# List of the most recent and completed reviews up to the date of this analysis
query = """ SELECT distinct ps.id as patch_serie, 
                   ps.subject, 
                   MAX(c.committer_date_utc) as merge_time
            FROM patch_series ps,
                 patch_series_version psv,
                 patches p,
                 commits c
            WHERE ps.id = psv.ps_id AND 
                  psv.id = p.ps_version_id AND p.commit_id = c.id 
            GROUP BY ps.id 
            ORDER BY merge_time desc """
completed_reviews_list = execute_query(cursor, query)

In [37]:
completed_reviews = list(completed_reviews_list)
completed_reviews_df = DataFrame(completed_reviews, columns=["patch_serie", "subject", "merge_time"])
completed_reviews_df


Out[37]:
patch_serie subject merge_time
0 17668 VT-d: x2APIC + IR adjustments 2015-10-13 19:17:16
1 17711 build: don't shadow debug with "@debug@" in to... 2015-10-12 18:02:53
2 17709 x86/shadow: Fix missing newline in dprintk() 2015-10-12 18:01:56
3 17703 xen: sched: fix locking of {insert, remove}_vc... 2015-10-12 18:01:22
4 17672 xen: sched: improve (a lot! :-D) Credit2 runq... 2015-10-12 18:00:52
5 17701 MAINTAINERS: Tam 2015-10-12 17:59:28
6 17666 x86/p2m-pt: delay freeing of intermediate page... 2015-10-08 14:45:08
7 17647 vt-d: Fix IM bit mask and unmask of Fault Even... 2015-10-08 14:44:27
8 17605 xen/xsm: Make p->policyvers be a local variabl... 2015-10-08 14:43:53
9 17601 x86/sysctl: Don't clobber memory if NCAPINTS >... 2015-10-08 14:43:18
10 17591 x86/MSI: fail if no hardware support 2015-10-08 14:42:50
11 17579 x86/p2m: fix mismatched unlock 2015-10-08 14:42:16
12 16661 x86: multiboot2 protocol support 2015-10-08 13:26:37
13 17653 xen/arm: vgic: Support 32-bit access for 64-bi... 2015-10-08 13:22:46
14 17654 xen/arm: vgic: Support 32-bit access for 64-bi... 2015-10-08 13:22:46
15 17341 Introduce HVM without dm and new boot ABI 2015-10-08 13:22:45
16 17700 x86/p2m: fix typo "populete" 2015-10-08 13:02:47
17 17249 Porting the intel_pstate driver to Xen 2015-10-08 13:01:58
18 17253 x86/intel_pstate: relocate the driver register... 2015-10-08 13:01:58
19 17593 libxl: use LOG() macro where appropriate 2015-10-07 13:30:36
20 17699 Add Libc multiarch package as build prerequisi... 2015-10-07 13:29:10
21 17691 tools/libxc: Improve efficiency of xc_cpuid_ap... 2015-10-07 13:26:25
22 17694 docs: xl.cfg: permissive option is not PV only. 2015-10-07 13:25:11
23 17676 Block script performance with shared image files 2015-10-07 13:23:51
24 17690 xen/arm: psci: use SMC64 function ID when avai... 2015-10-07 13:16:16
25 17697 gitingore: ignore extras/mini-os* 2015-10-07 13:13:26
26 17692 build: drop unused config variable CONFIG_HVM 2015-10-06 19:39:33
27 17693 xen/vcpu: add missing dummy_vcpu_info to compa... 2015-10-06 19:38:41
28 17689 flask: Allow initial domain to use XENPF_get_s... 2015-10-06 10:23:36
29 17331 [RESEND] x86/intel_pstate: APERF/MPERF feature... 2015-10-05 20:16:39
... ... ... ...
5384 1015 enable xm console for vmx guest 2005-11-10 13:04:29
5385 1001 fix VMX decoder for MOVZ instruction 2005-11-02 12:19:48
5386 991 Make check_pkgs work with the Bourne shell 2005-10-31 18:11:24
5387 992 Fix PAE shadow on a machine with RAM above 4G ... 2005-10-31 18:08:54
5388 923 Fix NAT for domU checksum offload 2005-10-15 09:33:19
5389 760 Call dominfo.device_delete instead of non-exis... 2005-09-14 12:49:13
5390 778 Call dominfo.device_delete instead of non-exis... 2005-09-14 12:49:13
5391 755 tpmback compilation fix 2005-09-09 08:56:14
5392 706 Hook up sysrq for xencons 2005-09-01 10:38:57
5393 709 Fix closing of /var/log/xend.log fd 2005-09-01 10:26:17
5394 683 Vnet update 2005-08-26 10:52:53
5395 682 Fix vnets in xm 2005-08-26 10:51:10
5396 676 Don't put vide in .setup.init section 2005-08-24 09:16:52
5397 655 Xenstore testsuite robustness: save output rat... 2005-08-23 19:58:59
5398 530 gcc-3.4/ia64 compilation fixes. 2005-08-03 09:33:09
5399 551 Don't abort on untranslatable errors: EINVAL a... 2005-08-02 17:59:20
5400 516 fix x86_64 domain0 /dev/mem issue 2005-07-29 10:22:03
5401 546 Test and fix acknowedge_watch from returning E... 2005-07-26 15:09:43
5402 491 Fix problem of can not create two or more vmx ... 2005-07-21 14:15:35
5403 490 Save per cpu IDT information into the VMCS hos... 2005-07-21 13:59:49
5404 489 Propagate guest MSR writes to machine MSRs imm... 2005-07-21 13:59:08
5405 469 Enable VMX domains on a SMP dom0. 2005-07-15 08:50:48
5406 471 Cleanup mov to CR4 handling. 2005-07-14 08:00:35
5407 470 Enable CR4.VME to improve the performance of v... 2005-07-14 08:00:23
5408 465 Need to save and restore MSRs for VMX domains ... 2005-07-13 08:36:38
5409 459 Device model SXP cleanup 2005-07-11 20:09:10
5410 458 The 1:1 page table should be a 3 level PAE pag... 2005-07-11 14:39:10
5411 5274 2005-07-01 13:17:53
5412 7061 Update and Resend: 2005-07-01 13:17:53
5413 424 use hg changeset info for xen banner 2005-07-01 10:15:25

5414 rows × 3 columns


In [38]:
# Total number of completed reviews up to the date of this analysis
query = """ SELECT count(distinct(ps.id)) as merged_patch_series
            FROM patch_series ps,
                 patch_series_version psv,
                 patches p,
                 commits c
            WHERE ps.id = psv.ps_id AND 
                  psv.id = p.ps_version_id AND p.commit_id = c.id 
            """
total_completed_reviews = execute_query(cursor, query)

In [39]:
total_completed_reviews


Out[39]:
((5414L,),)

Reviews active


In [40]:
# List of recent reviewed patch series but not merged.
query = """ SELECT DISTINCT ps.id as reviewed_patch_serie,
                   ps.subject,
                   max(f.date_utc) as last_reviewed_date
            FROM patch_series ps, 
                 patch_series_version psv, 
                 patches p, 
                 flags f 
            WHERE ps.id = psv.ps_id AND 
                  psv.id = p.ps_version_id AND 
                  f.patch_id = p.id AND 
                  f.flag='Reviewed-by' AND 
                  ps.id not in 
                  (SELECT COUNT(DISTINCT(ps.id)) as merged_patch_series
                   FROM patch_series ps,
                        patch_series_version psv,
                        patches p,
                        commits c
                   WHERE ps.id = psv.ps_id AND
                         psv.id = p.ps_version_id AND
                         p.commit_id = c.id) AND 
                  f.date_utc BETWEEN DATE_SUB(NOW(), INTERVAL 60 DAY) AND NOW() 
            GROUP BY ps.id
            ORDER BY f.date_utc desc """ # INITIAL VALUE OF 60 DAYS
recent_reviews = execute_query(cursor, query)

In [41]:
recent_reviews_list = list(recent_reviews)
recent_reviews_df = DataFrame(recent_reviews_list, columns=["patch_serie", "subject", "last_reviewed_date"])
recent_reviews_df


Out[41]:
patch_serie subject last_reviewed_date
0 17705 libxc: remove superpages option for pv domains 2015-10-08 17:36:46
1 17703 xen: sched: fix locking of {insert, remove}_vc... 2015-10-09 09:31:28
2 17700 x86/p2m: fix typo "populete" 2015-10-07 11:47:37
3 17698 tools/python: remove broken xl binding 2015-10-06 19:13:04
4 17692 build: drop unused config variable CONFIG_HVM 2015-10-06 16:53:28
5 17690 xen/arm: psci: use SMC64 function ID when avai... 2015-10-05 19:09:48
6 17684 Various fixes - libelf, paging, blktap. 2015-10-05 10:49:07
7 17689 flask: Allow initial domain to use XENPF_get_s... 2015-10-05 07:16:18
8 17680 use mask operations instead of test_bit() 2015-10-05 16:31:23
9 17679 x86/p2m-pt: tighten conditions of IOMMU mappi... 2015-10-01 18:16:40
10 17672 xen: sched: improve (a lot! :-D) Credit2 runq... 2015-10-01 09:48:57
11 17674 xen: Introduce VM_EVENT_FLAG_SET_REGISTERS 2015-09-30 14:34:53
12 17657 x86/PoD: shorten certain operations on higher... 2015-10-01 17:50:03
13 17660 x86/EPT: defer enabling of A/D maintenance un... 2015-09-29 14:51:21
14 17668 VT-d: x2APIC + IR adjustments 2015-09-29 14:49:37
15 17667 x86/p2m-pt: ignore pt-share flag for shadow m... 2015-09-29 12:55:08
16 17666 x86/p2m-pt: delay freeing of intermediate page... 2015-09-29 12:54:40
17 17615 detect and initialize CDP (Code/Data Prioritiz... 2015-10-10 22:11:53
18 17624 x86: further P2M adjustments 2015-10-01 18:31:16
19 17658 VT-d: don't suppress invalidation address writ... 2015-09-28 18:01:07
20 17659 x86/EPT: adjust types in ept_split_super_page() 2015-09-28 16:46:39
21 17595 Introspection optimization helpers 2015-09-28 12:36:53
22 17661 x86/xen: Do not clip xen_e820_map to xen_e820_... 2015-09-28 08:56:47
23 17655 PVH Dom0 RMRR IOMMU mapping regression fix 2015-09-27 19:04:48
24 17656 cleanup domain builder declarations and relate... 2015-09-28 11:44:16
25 17239 xen pvusb toolstack work 2015-09-25 18:11:37
26 17649 xen: sched: adjustments to some performance co... 2015-09-25 14:19:31
27 17651 various clean-ups 2015-09-25 14:02:21
28 17648 xen: credit1: fix tickling when it happens fr... 2015-09-29 15:47:44
29 17646 xen: sched: rename vcpu_destroy perf counter t... 2015-09-25 08:44:29
... ... ... ...
48 17582 Intel Code and Data Prioritization (CDP) featu... 2015-09-15 01:04:14
49 17587 KVM: arm64: add workaround for Cortex-A57 erra... 2015-09-14 17:46:28
50 17580 x86/PoD: use clear_domain_page() 2015-09-14 12:27:28
51 17569 libxl: slightly refine pci-assignable-{add, r... 2015-09-14 11:50:35
52 17579 x86/p2m: fix mismatched unlock 2015-09-14 12:49:25
53 17554 xen: use correct type for HYPERVISOR_memory_op() 2015-09-14 06:06:06
54 17512 Add VT-d Posted-Interrupts support 2015-10-13 00:55:03
55 17564 Handle unsupported distros with a prettier mes... 2015-09-11 16:52:56
56 17571 xen/swiotlb: Add support for 64KB page granula... 2015-09-10 18:30:04
57 17567 x86/hvm: fix saved pmtimer value 2015-09-10 17:01:05
58 17570 xen: arm: Give PTE bits explict values 2015-09-10 04:56:28
59 17562 x86/VPMU: Set VPMU context pointer to NULL whe... 2015-09-09 11:45:19
60 17349 Remove XenPTReg->data and use dev.config for g... 2015-09-08 17:25:00
61 17348 Cleanups + various fixes due to libxl ABI + mo... 2015-09-08 05:55:33
62 17558 xen/arm64: Add support for 64KB page in Linux 2015-09-30 12:45:37
63 17529 Gracefully handle an unsupported distro. 2015-09-07 16:46:39
64 17556 xen-block: support multi hardware-queues/rings 2015-09-06 04:39:34
65 17555 Migration v2 fix 2015-09-07 11:07:43
66 17305 toolstack-assisted approach to PVHVM guest kexec 2015-09-04 17:39:49
67 17289 32-bit domU PVH support 2015-09-04 09:05:13
68 17520 support gzipped kernels on arm 2015-09-22 00:51:33
69 17536 x86: wrap kexec feature with CONFIG_KEXEC 2015-09-01 04:35:33
70 17341 Introduce HVM without dm and new boot ABI 2015-10-05 12:36:32
71 17533 x86/IO-APIC: don't create pIRQ mapping from ma... 2015-08-21 16:58:08
72 17388 xen/arm64: Add support for 64KB page 2015-08-20 12:08:56
73 17352 Refactor ioreq server for better performance. 2015-08-24 01:33:18
74 17530 xen/tmem: Pass page instead of pfn to xen_tmem... 2015-08-19 10:23:55
75 17516 More vNUMA fixes 2015-08-17 20:57:01
76 17527 xen: arm re-order assignments in mfn_to_xen_en... 2015-08-17 04:24:55
77 17313 mm, xen/balloon: memory hotplug improvements 2015-08-17 03:41:09

78 rows × 3 columns


In [42]:
query = """ SELECT COUNT(DISTINCT(ps.id)) as reviewed_patch_serie
            FROM patch_series ps, 
                 patch_series_version psv, 
                 patches p, 
                 flags f 
            WHERE ps.id = psv.ps_id AND 
                  psv.id = p.ps_version_id AND 
                  f.patch_id = p.id AND 
                  f.flag='Reviewed-by' AND 
                  ps.id not in 
                  (SELECT COUNT(DISTINCT(ps.id)) as merged_patch_series
                   FROM patch_series ps,
                        patch_series_version psv,
                        patches p,
                        commits c
                   WHERE ps.id = psv.ps_id AND
                         psv.id = p.ps_version_id AND
                         p.commit_id = c.id) AND 
                  f.date_utc BETWEEN DATE_SUB(NOW(), INTERVAL 60 DAY) AND NOW()  """ # INITIAL VALUE OF 60 DAYS
total_recent_reviews = execute_query(cursor, query)

In [43]:
total_recent_reviews


Out[43]:
((78L,),)

Reviews ongoing


In [44]:
# List of last year reviewed patch series but not merged (and not recent reviewed)
query = """ SELECT DISTINCT ps.id as reviewed_patch_serie,
                   ps.subject,
                   max(f.date_utc) as last_reviewed_date
            FROM patch_series ps, 
                 patch_series_version psv, 
                 patches p, 
                 flags f 
            WHERE ps.id = psv.ps_id AND 
                  psv.id = p.ps_version_id AND 
                  f.patch_id = p.id AND 
                  f.flag='Reviewed-by' AND 
                  ps.id not in 
                  (SELECT COUNT(DISTINCT(ps.id)) as merged_patch_series
                   FROM patch_series ps,
                        patch_series_version psv,
                        patches p,
                        commits c
                   WHERE ps.id = psv.ps_id AND
                         psv.id = p.ps_version_id AND
                         p.commit_id = c.id) AND 
                  f.date_utc BETWEEN DATE_SUB(NOW(), INTERVAL 365 DAY) AND DATE_SUB(NOW(), INTERVAL 60 DAY)
            GROUP BY ps.id
            ORDER BY f.date_utc desc """ # INITIAL VALUE OF 60 DAYS
last_year_reviews = execute_query(cursor, query)

In [45]:
last_year_reviews_list = list(last_year_reviews)
last_year_reviews_df = DataFrame(last_year_reviews_list, columns=["patch_serie", "subject", "last_reviewed_date"])
last_year_reviews_df


Out[45]:
patch_serie subject last_reviewed_date
0 17352 Refactor ioreq server for better performance. 2015-08-14 13:32:17
1 17516 More vNUMA fixes 2015-08-14 14:19:07
2 17517 add page_get_owner_and_reference() related AS... 2015-08-12 22:49:41
3 17515 x86/hvm: Fix non-onchangeonly CR write events ... 2015-08-12 19:45:13
4 17512 Add VT-d Posted-Interrupts support 2015-08-13 13:06:01
5 17509 x86/p2m: clear_identity_p2m_entry() must cope ... 2015-08-12 18:06:26
6 17508 x86/HVM: hvm_map_guest_frame_rw() adjustments 2015-08-12 17:44:26
7 17510 gitignore: Don't ignore *.rej 2015-08-12 22:09:37
8 17505 Update microcode driver 2015-08-11 09:11:28
9 17438 multiboot2: Add two extensions and fix some is... 2015-08-11 10:59:56
10 17504 xen/xenbus: Don't leak memory when unmapping t... 2015-08-10 20:46:37
11 17503 xen/events/fifo: Handle linked events when clo... 2015-08-10 08:09:50
12 17498 Use correctly the Xen memory terminologies 2015-08-10 12:06:55
13 17492 x86/mm: Make {hap, shadow}_teardown() preemptible 2015-08-05 19:00:57
14 17483 x86/gdt: Drop write-only, xalloc()'d array fro... 2015-08-04 16:17:28
15 17478 x86, amd_ucode: Skip microcode updates for fin... 2015-08-03 18:43:27
16 17481 Migration regressions with Xen. 2015-08-03 16:53:10
17 17479 xl/libxl: disable PV vNUMA 2015-07-30 22:00:42
18 17476 x86/p2m.c: fix missed off-by-one in altp2m commit 2015-07-29 18:46:01
19 17475 x86/hvm.c: fix regression in guest destruction 2015-07-29 18:44:38
20 17471 Use correctly the Xen memory terminologies in ... 2015-08-05 12:08:55
21 17470 tools/hvmloader: sync memory map[] 2015-07-28 12:49:08
22 17451 tools: fixes inspired by Coverity scan 2015-07-28 13:49:21
23 17466 xen/events: Support event channel rebind on ARM 2015-07-28 11:10:42
24 17463 libxc: fix memory leak in migration v2 2015-07-26 23:36:05
25 17433 VT-d: add iommu=igfx_off option to workaround ... 2015-07-26 18:47:53
26 17445 x86: modify_ldt improvement, test, and config ... 2015-07-30 11:49:43
27 17455 xen: arm: Document xenheap_megabytes limitation 2015-07-23 18:49:33
28 17442 hvmloader: don't build with __XEN_TOOLS__ defined 2015-07-23 13:51:55
29 17450 x86/MSI: drop bogus NULL check from pci_resto... 2015-07-23 13:45:53
... ... ... ...
373 16340 tools: libxl: do not overrun input buffer in l... 2014-11-06 14:40:22
374 16337 xen/arm: Add support for Huawei hip04-d01 plat... 2014-11-06 10:12:05
375 16129 xen: arm: configure correct dom0_gnttab_start/... 2014-11-04 12:46:52
376 16303 xen/blkfront: improve protection against issui... 2014-12-03 06:57:32
377 16320 x86: Full support of PAT 2014-10-31 16:00:33
378 16319 x86/PVH: replace bogus assertion with conditional 2014-10-31 10:45:25
379 16315 VMX: values written to MSR_IA32_SYSENTER_E[IS]... 2014-10-30 14:46:18
380 16316 x86/HVM: only kill guest when unknown VM exit ... 2014-10-30 14:57:14
381 16311 xen/common: Process softirqs while dumping dom... 2014-10-29 19:11:33
382 16298 xen: arm: Parse PCI DT nodes' ranges and inte... 2014-10-29 19:03:36
383 16308 blktap: CONFIG_GCRYPT detection 2014-10-28 14:35:21
384 16306 Use MSI controller framework to configure MSI/... 2014-10-28 05:22:22
385 16062 x86: Full support of PAT 2014-11-03 18:44:40
386 16305 pygrub: fix non-interactive parsing of grub1 c... 2014-10-27 08:35:37
387 16294 EFI: runtime services related improvements 2014-10-24 11:52:01
388 16287 xen/public: Correct the documentation of where... 2014-10-24 09:07:21
389 16268 Sanity check xsave area when migrating or rest... 2014-10-23 23:59:54
390 16234 introduce GNTTABOP_cache_flush 2014-11-12 11:40:54
391 16293 tools/libxl: Fix libxl_list_vcpu() following c... 2014-10-23 15:08:50
392 16292 hvm/load: Correct length checks for zeroextend... 2014-10-24 17:42:50
393 16257 xen/smp: Use die_complete completion when taki... 2014-10-22 17:32:59
394 16289 x86/boot: fix reloc.S build dependencies 2014-10-22 11:58:39
395 16286 Xen and tools: Fix listing of vcpus when domai... 2014-10-21 09:09:45
396 16278 x86/setup: Correct register clobbers for the a... 2014-10-21 18:27:16
397 16185 Add support for Xen access to vmport 2014-10-20 12:30:06
398 16096 Introducing Xen PV block driver to OVMF 2014-10-23 20:17:32
399 16265 x86/hvm: Further restrict access to x2apic MSRs 2014-10-21 12:11:54
400 16220 xen/arm: introduce GNTTABOP_cache_flush 2014-10-23 10:34:48
401 16174 Use MSI chip framework to configure MSI/MSI-X ... 2014-10-15 19:06:59
402 16178 x86/viridian: Add Partition Reference Time enl... 2015-01-07 10:20:10

403 rows × 3 columns


In [46]:
query = """ SELECT COUNT(DISTINCT(ps.id)) as reviewed_patch_serie
            FROM patch_series ps, 
                 patch_series_version psv, 
                 patches p, 
                 flags f 
            WHERE ps.id = psv.ps_id AND 
                  psv.id = p.ps_version_id AND 
                  f.patch_id = p.id AND 
                  f.flag='Reviewed-by' AND 
                  ps.id not in 
                  (SELECT COUNT(DISTINCT(ps.id)) as merged_patch_series
                   FROM patch_series ps,
                        patch_series_version psv,
                        patches p,
                        commits c
                   WHERE ps.id = psv.ps_id AND
                         psv.id = p.ps_version_id AND
                         p.commit_id = c.id) AND 
                  f.date_utc BETWEEN DATE_SUB(NOW(), INTERVAL 365 DAY) AND DATE_SUB(NOW(), INTERVAL 60 DAY)  """ # INITIAL VALUE OF 60 DAYS
total_last_year_reviews = execute_query(cursor, query)

In [47]:
total_last_year_reviews


Out[47]:
((403L,),)

Reviews stalled


In [48]:
# Stalled list of reviews: those with review activity and older than one year
query = """ SELECT DISTINCT ps.id as reviewed_patch_serie,
                   ps.subject,
                   max(f.date_utc) as last_reviewed_date
            FROM patch_series ps, 
                 patch_series_version psv, 
                 patches p, 
                 flags f 
            WHERE ps.id = psv.ps_id AND 
                  psv.id = p.ps_version_id AND 
                  f.patch_id = p.id AND 
                  f.flag='Reviewed-by' AND 
                  ps.id not in 
                  (SELECT COUNT(DISTINCT(ps.id)) as merged_patch_series
                   FROM patch_series ps,
                        patch_series_version psv,
                        patches p,
                        commits c
                   WHERE ps.id = psv.ps_id AND
                         psv.id = p.ps_version_id AND
                         p.commit_id = c.id) AND 
                  f.date_utc < DATE_SUB(NOW(), INTERVAL 365 DAY)
            GROUP BY ps.id
            ORDER BY f.date_utc desc """ 
stalled_reviews = execute_query(cursor, query)

In [49]:
stalled_reviews_list = list(stalled_reviews)
stalled_reviews_df = DataFrame(stalled_reviews_list, columns=["patch_serie", "subject", "last_reviewed_date"])
stalled_reviews_df


Out[49]:
patch_serie subject last_reviewed_date
0 16243 move XENMEM_get_vnumainfo out of tools-only s... 2014-10-13 12:14:55
1 16178 x86/viridian: Add Partition Reference Time enl... 2014-10-10 15:55:22
2 16234 introduce GNTTABOP_cache_flush 2014-10-10 13:51:42
3 16236 Separate runtime debug output from debug symbols 2014-10-09 11:58:38
4 16227 xen/balloon: Don't continue ballooning when BP... 2014-10-08 11:30:11
5 16233 Decouple SandyBridge quirk from VTd timeout 2014-10-08 04:34:31
6 16177 xenstore: extend the xenstore ring with a 'clo... 2014-10-07 13:26:44
7 16224 kernel: Add support for poweroff handler call ... 2014-10-06 15:28:45
8 16211 introduce XENMEM_cache_flush 2014-10-03 17:05:14
9 16220 xen/arm: introduce GNTTABOP_cache_flush 2014-10-13 17:36:50
10 16219 amd/seattle: Initial revision of AMD Seattle s... 2014-10-03 16:31:21
11 16076 Xen VMware tools support 2014-10-03 00:21:15
12 16214 don't allow Dom0 access to IOMMUs' MMIO pages 2014-10-02 19:53:13
13 16212 x86: restore reserving of IO-APIC pages in XE... 2014-10-02 17:56:21
14 16215 x86: favor function parameter over global in ... 2014-10-02 17:46:56
15 16210 xen/arm: introduce XENMEM_cache_flush 2014-10-02 13:57:07
16 16208 x86/MSI: fix MSI-X case of freeing IRQ 2014-10-02 10:03:39
17 16200 xen/arm: Update early print console UART addre... 2014-10-01 17:30:31
18 16199 xen/arm: Broadcom 7445D0 ARM cpu support. 2014-10-03 12:41:02
19 16202 Introduce AMD Seattle platform support 2014-10-01 16:43:45
20 16196 Refactoring for future mem_event and mem_acces... 2014-09-29 19:55:13
21 16181 Improve "Emulation failed" error message 2014-09-29 11:46:29
22 16188 x86/hvm: remove stray lock release from hvm_io... 2014-09-26 14:43:38
23 16174 Use MSI chip framework to configure MSI/MSI-X ... 2014-09-26 12:09:49
24 16179 enable Cache Monitoring Technology (CMT) feature 2014-10-04 03:55:13
25 16176 Trivial patches for PVH doc and libxl Makefile 2014-09-25 06:56:36
26 16172 xen: Break multiboot (v1) dependency and add m... 2014-09-24 20:11:56
27 16171 x86/LAPIC: drop support for non-integrated APIC 2014-09-24 18:18:57
28 16169 x86: make dump_pageframe_info() slightly more... 2014-09-24 17:34:47
29 16165 tools/libxl: Fix two errors with libxl_userdat... 2014-09-24 14:39:08
... ... ... ...
570 11352 Modpost section mismatch fix 2011-07-08 02:46:48
571 11313 qemu_ram_ptr_length: take ram_addr_t as arguments 2011-06-28 12:52:09
572 11320 Linux: 6 arguments hypercall v3 2011-06-24 17:13:24
573 11254 x86: remove x86_init.mapping.pagetable_reserve 2011-06-07 20:13:29
574 10985 mm: Extend memory hotplug API to allow memory ... 2011-05-25 02:27:33
575 10984 mm: Add SECTION_ALIGN_UP() and SECTION_ALIGN_D... 2011-05-18 01:37:50
576 11160 nestedsvm: fix fpu context switch 2011-05-12 16:44:55
577 11109 Two patches fixing regression introduced by 'x... 2011-05-02 09:22:21
578 11095 pv-grub: Fix for incorrect dom->p2m_host[] lis... 2011-04-27 00:58:09
579 10983 mm: Optimize pfn calculation in online_page() 2011-03-29 02:44:39
580 10684 xen network backend driver 2011-03-15 10:06:18
581 10803 Xen PCI fronted fixes for 2.6.39 2011-02-18 14:15:43
582 10794 xen: suspend and resume system devices when ru... 2011-02-15 08:36:03
583 10731 Consider E820 non-RAM and E820 gaps as 1-1 map... 2011-01-31 12:44:34
584 10639 Xen VGA dirtybit support 2011-01-18 20:20:35
585 10616 Under Xen, consider E820 non-RAM and E820 gaps... 2011-01-10 07:17:39
586 10122 xen: initial domain support 2010-10-19 12:24:47
587 10136 Xen PCI + Xen PCI frontend driver. 2010-10-13 08:16:36
588 9969 PV on HVM: receive interrupts as xen events 2010-10-12 18:34:49
589 9422 svm: Avoid VINTR injection during NMI shadow 2010-05-07 18:41:49
590 8265 fix hang on migration 2009-07-29 23:50:45
591 7502 IRQ handling race and spurious IIR read in ser... 2009-03-12 19:57:27
592 7144 fix oshelp fcntl_setfd_cloexec typo 2009-01-09 11:35:29
593 7034 VT-d: check return value of pirq_guest_bind() 2008-12-09 10:24:09
594 5696 Fix AMD threshold register definitions and act... 2014-01-28 11:13:35
595 5450 Enable VirtualPC 2007 run on top of XEN: add_w... 2008-02-01 22:24:31
596 5449 Enable VirtualPC 2007 run on top of XEN: REPS ... 2008-02-01 22:24:24
597 5448 Enable VirtualPC 2007 run on top of XEN: fxsav... 2008-02-01 22:24:15
598 5447 Enable VirtualPC 2007 run on top of XEN : ltr ... 2008-02-01 22:24:08
599 4291 Add callgraph support to Xenoprofile 2007-06-07 22:45:38

600 rows × 3 columns


In [50]:
query = """ SELECT COUNT(DISTINCT(ps.id)) as reviewed_patch_serie
            FROM patch_series ps, 
                 patch_series_version psv, 
                 patches p, 
                 flags f 
            WHERE ps.id = psv.ps_id AND 
                  psv.id = p.ps_version_id AND 
                  f.patch_id = p.id AND 
                  f.flag='Reviewed-by' AND 
                  ps.id not in 
                  (SELECT COUNT(DISTINCT(ps.id)) as merged_patch_series
                   FROM patch_series ps,
                        patch_series_version psv,
                        patches p,
                        commits c
                   WHERE ps.id = psv.ps_id AND
                         psv.id = p.ps_version_id AND
                         p.commit_id = c.id) AND 
                  f.date_utc < DATE_SUB(NOW(), INTERVAL 365 DAY)"""
total_stalled_reviews = execute_query(cursor, query)

In [51]:
int(total_stalled_reviews[0][0])


Out[51]:
600

4) PATCH SERIES COMPLEXITY ANALYSIS

Metrics to retrieve:

  • Number of versions per patch serie
  • From patches merged: check number of 'touched' files plus added and removed lines
  • Comments received per patch
  • Number of patches per patch serie

Number of versions per patch serie


In [52]:
query = """ SELECT ps_id as patch_serie,
                   COUNT(DISTINCT(version)) as versions,
                   MIN(date_utc) as patch_serie_first_date
            FROM patch_series_version
            GROUP BY ps_id """
iterations = execute_query(cursor, query)

In [53]:
iterations_list = list(iterations)
iterations_df = DataFrame(iterations_list, columns=["patch_serie", "versions", "patch_serie_first_date"])

fig, axs = plt.subplots(1,2, figsize=(12,5))

hist_chart = iterations_df["versions"].plot(kind="hist", ax=axs[0])
boxplot_chart = iterations_df["versions"].plot(kind="box", ax=axs[1])

#chart labels
hist_chart.set_xlabel("Iterations")
hist_chart.set_ylabel("Patches")
hist_chart.set_title("Histogram: number of iterations per patch")

boxplot_chart.set_ylabel("Iterations per patch")
boxplot_chart.set_title("Log scale: Dist. iterations per patch")


Out[53]:
<matplotlib.text.Text at 0x7fcfaf91f990>

In [54]:
mean_iterations = iterations_df.set_index("patch_serie_first_date").resample('M', how={"versions":np.mean})
median_iterations = iterations_df.set_index("patch_serie_first_date").resample('M', how={"versions":np.median})

dates = pd.date_range('2005-01', '2015-11', freq='M') # data starting in 2010-10 and ending in 2015-08
iterationsTS = Series(mean_iterations["versions"], index=dates)
iterations_medianTS = Series(median_iterations["versions"], index=dates)

iterations_chart = iterationsTS.plot()
iterations_medianTS.plot()
iterations_chart.set_xlabel("Time evolution")
iterations_chart.set_ylabel("iterations")
iterations_chart.set_title("Evolution of the mean (blue) and median (green) iterations per patch")


Out[54]:
<matplotlib.text.Text at 0x7fcfaf875b90>

Number of 'touched' files, added and removed lines


In [55]:
query = """ SELECT psv.ps_id as patch_serie,
                   max(c.committer_date_utc) as merge_time, 
                   COUNT(DISTINCT(a.file_id)) as touched_files,
                   SUM(cl.added) as added_lines,
                   SUM(cl.removed) as removed_lines
            FROM patch_series_version psv, 
                 patches p, 
                 commits c, 
                 xen_reports_cvsanaly.scmlog s, 
                 xen_reports_cvsanaly.actions a, 
                 xen_reports_cvsanaly.commits_lines cl 
            WHERE psv.id = p.ps_version_id AND 
                  p.commit_id = c.id AND 
                  c.rev=s.rev AND 
                  s.id=a.commit_id AND 
                  s.id=cl.commit_id 
            GROUP BY psv.ps_id """
modified_code = execute_query(cursor, query)

In [56]:
modified_code_list = list(modified_code)
modified_code_df = DataFrame(modified_code_list, columns=["patch_serie", "merge_time", "touched_files", "added_lines", "removed_lines"])

fig, axs = plt.subplots(1,2, figsize=(12,5))

hist_chart = modified_code_df["touched_files"].plot(kind="hist", ax=axs[0])
boxplot_chart = modified_code_df["touched_files"].plot(kind="box", ax=axs[1], logy=True)

#chart labels
hist_chart.set_xlabel("Touched files")
hist_chart.set_ylabel("Patch serie")
hist_chart.set_title("Histogram: number of touched files per patch")

boxplot_chart.set_ylabel("Touched files per patch serie")
boxplot_chart.set_title("Log scale: Dist. touched files per patch serie")


Out[56]:
<matplotlib.text.Text at 0x7fcfaf3b1590>

In [57]:
# Touched files per patch serie in mean. This is also time series distributed to check if there are more touched files per patch series in mean and median
mean_modified_code = modified_code_df.set_index("merge_time").resample('M', how={"touched_files":np.mean})
median_modified_code = modified_code_df.set_index("merge_time").resample('M', how={"touched_files":np.median})

dates = pd.date_range('2005-01', '2015-11', freq='M') # data starting in 2010-10 and ending in 2015-08
modified_codeTS = Series(mean_modified_code["touched_files"], index=dates)
modified_code_medianTS = Series(median_modified_code["touched_files"], index=dates)

modified_code_chart = modified_codeTS.plot()
modified_code_medianTS.plot()
modified_code_chart.set_xlabel("Time evolution")
modified_code_chart.set_ylabel("Touched files")
modified_code_chart.set_title("Evolution of the mean (blue) and median (green) touched files per patch")


Out[57]:
<matplotlib.text.Text at 0x7fcfaf2ddad0>

Comments received per patch serie


In [58]:
# Let's calculate evolution of comments in general when they were sent
# We should ignore 1970 emails with wrong sent date
query = """ SELECT MIN(psv.date_utc) as patch_serie_first_date,
                   ps_id as patch_serie,
                   count(distinct(c.id)) as comments
            FROM patch_series_version psv,
                 patches p,
                 comments c 
            WHERE psv.id = p.ps_version_id and 
                  p.id = c.patch_id and
                  year(psv.date_utc) > 2004 and
                  year(psv.date_utc) < 2016
            GROUP BY ps_id;
           
        """
comments = execute_query(cursor, query)

In [59]:
comments_list = list(comments)
comments_df = DataFrame(comments_list, columns=["patch_serie_first_date", "patch_serie", "comments"])

fig, axs = plt.subplots(1,2, figsize=(12,5))

hist_chart = comments_df["comments"].plot(kind="hist", ax=axs[0])
boxplot_chart = comments_df["comments"].plot(kind="box", ax=axs[1], logy=True)

#chart labels
hist_chart.set_xlabel("Comments")
hist_chart.set_ylabel("Patch series")
hist_chart.set_title("Histogram: number of comments per patch serie")

boxplot_chart.set_ylabel("Comments per patch serie")
boxplot_chart.set_title("Log scale: Dist. comments per patch serie")


Out[59]:
<matplotlib.text.Text at 0x7fcfaee39310>

In [60]:
# Comments per patch serie in mean. This is also time series distributed to check if there are more comments per patch series in mean and median
mean_comments = comments_df.set_index("patch_serie_first_date").resample('M', how={"comments":np.mean})
median_comments = comments_df.set_index("patch_serie_first_date").resample('M', how={"comments":np.median})

dates = pd.date_range('2005-01', '2015-11', freq='M') # data starting in 2010-10 and ending in 2015-08
commentsTS = Series(mean_comments["comments"], index=dates)
comments_medianTS = Series(median_comments["comments"], index=dates)

comments_chart = commentsTS.plot()
comments_medianTS.plot()
comments_chart.set_xlabel("Time evolution")
comments_chart.set_ylabel("comments")
comments_chart.set_title("Evolution of the mean (blue) and median (green) comments per patch")


Out[60]:
<matplotlib.text.Text at 0x7fcfaec39c50>

Number of patches per patch serie


In [61]:
query = """ SELECT  distinct psv.ps_id as patch_serie,  
                   psv.date_utc as first_send, 
                   max(t.patches) as numpatches 
             FROM patch_series_version psv, 
                  patches p, 
                     (SELECT psv.ps_id, 
                             p.ps_version_id, 
                             count(distinct(p.id)) as patches 
                      FROM patch_series_version psv, 
                           patches p 
                      WHERE psv.id = p.ps_version_id  
                      GROUP BY psv.ps_id, 
                               p.ps_version_id) t 
              WHERE psv.ps_id = t.ps_id AND 
                    p.ps_version_id=t.ps_version_id AND 
                    psv.id = p.ps_version_id 
              GROUP BY psv.ps_id
          """
patches_ps = execute_query(cursor, query)

In [62]:
numpatches_list = list(patches_ps)
numpatches_df = DataFrame(numpatches_list, columns=["patch_serie", "first_send", "numpatches"])

fig, axs = plt.subplots(1,2, figsize=(12,5))

hist_chart = numpatches_df["numpatches"].plot(kind="hist", ax=axs[0])
boxplot_chart = numpatches_df["numpatches"].plot(kind="box", ax=axs[1], logy=True)

#chart labels
hist_chart.set_xlabel("numpatches")
hist_chart.set_ylabel("Patches")
hist_chart.set_title("Histogram: number of numpatches per patch serie")

boxplot_chart.set_ylabel("numpatches per patch serie")
boxplot_chart.set_title("Log scale: Dist. numpatches per patch serie")


Out[62]:
<matplotlib.text.Text at 0x7fcfae731910>

In [63]:
# numpatches per patch serie in mean. This is also time series distributed to check if there are more numpatches per patch series in mean and median
mean_numpatches = numpatches_df.set_index("first_send").resample('M', how={"numpatches":np.mean})
median_numpatches = numpatches_df.set_index("first_send").resample('M', how={"numpatches":np.median})

dates = pd.date_range('2005-01', '2015-11', freq='M') # data starting in 2010-10 and ending in 2015-08
numpatchesTS = Series(mean_numpatches["numpatches"], index=dates)
numpatches_medianTS = Series(median_numpatches["numpatches"], index=dates)

numpatches_chart = numpatchesTS.plot()
numpatches_medianTS.plot()
numpatches_chart.set_xlabel("Time evolution")
numpatches_chart.set_ylabel("numpatches")
numpatches_chart.set_title("Evolution of the mean (blue) and median (green) numpatches per patch")


Out[63]:
<matplotlib.text.Text at 0x7fcfae5d7290>

5) PATCH SERIES COMMUNITY

Metrics to retrieve:

  • Top people sending patches
  • Top people reviewing patches
  • For all of those, basic analysis with organization info based on email domain

Top people sending patches (with affiliation if possible)


In [64]:
#TODO

Top people reviewing patches (with affiliation if possible)


In [65]:
#TODO

6) ANALYSIS

Splitting by date


In [66]:
# Taking time2merge dataframe
time2merge_2015 = time2merge_df[time2merge_df["first_patch_date"] >= '2015-01-01']

time2merge_2014 = time2merge_df[time2merge_df["first_patch_date"] >= '2014-01-01']
time2merge_2014 = time2merge_2014[time2merge_2014["first_patch_date"] < '2015-01-01']

time2merge_2013 = time2merge_df[time2merge_df["first_patch_date"] >= '2013-01-01']
time2merge_2013 = time2merge_2013[time2merge_2013["first_patch_date"] < '2014-01-01']

time2merge_2012 = time2merge_df[time2merge_df["first_patch_date"] >= '2012-01-01']
time2merge_2012 = time2merge_2012[time2merge_2012["first_patch_date"] < '2013-01-01']


data=[time2merge_2012["time2merge_days"], time2merge_2013["time2merge_days"], time2merge_2014["time2merge_days"], time2merge_2015["time2merge_days"]]

figure = pylab.boxplot(data, sym="")
ticks = pylab.xticks([1, 2, 3, 4], ['2012', '2013', '2014', '2015'])
pylab.title("Time to merge divided by year (without outliers)")
pylab.ylabel("Days")


Out[66]:
<matplotlib.text.Text at 0x7fcfae4f0c10>

Splitting by semester and taking the time when the changeset was open


In [67]:
# Taking time2merge dataframe
time2merge_2015_s1 = time2merge_df[time2merge_df["first_patch_date"] >= '2015-01-01']
time2merge_2015_s1 = time2merge_2015_s1[time2merge_2015_s1["first_patch_date"] < '2015-07-01']

time2merge_2014_s2 = time2merge_df[time2merge_df["first_patch_date"] >= '2014-07-01']
time2merge_2014_s2 = time2merge_2014_s2[time2merge_2014_s2["first_patch_date"] < '2015-01-01']
time2merge_2014_s1 = time2merge_df[time2merge_df["first_patch_date"] >= '2014-01-01']
time2merge_2014_s1 = time2merge_2014_s1[time2merge_2014_s1["first_patch_date"] < '2014-07-01']

time2merge_2013_s2 = time2merge_df[time2merge_df["first_patch_date"] >= '2013-01-01']
time2merge_2013_s2 = time2merge_2013_s2[time2merge_2013_s2["first_patch_date"] < '2014-01-01']
time2merge_2013_s1 = time2merge_df[time2merge_df["first_patch_date"] >= '2013-01-01']
time2merge_2013_s1 = time2merge_2013_s1[time2merge_2013_s1["first_patch_date"] < '2014-01-01']

time2merge_2012_s2 = time2merge_df[time2merge_df["first_patch_date"] >= '2012-07-01']
time2merge_2012_s2 = time2merge_2012_s2[time2merge_2012_s2["first_patch_date"] < '2013-01-01']
time2merge_2012_s1 = time2merge_df[time2merge_df["first_patch_date"] >= '2012-01-01']
time2merge_2012_s1 = time2merge_2012_s1[time2merge_2012_s1["first_patch_date"] < '2012-07-01']


data=[time2merge_2012_s1["time2merge_days"], time2merge_2012_s2["time2merge_days"], time2merge_2013_s1["time2merge_days"], time2merge_2013_s2["time2merge_days"], time2merge_2014_s1["time2merge_days"], time2merge_2014_s2["time2merge_days"], time2merge_2015_s1["time2merge_days"]]

figure = pylab.boxplot(data, sym="")
ticks = pylab.xticks([1, 2, 3, 4, 5, 6, 7], ['2012S1', '2012S2', '2013S1', '2013S2', '2014S1', '2014S2', '2015S1'])
pylab.title("Time to merge divided by year (without outliers)")
pylab.ylabel("Days")


Out[67]:
<matplotlib.text.Text at 0x7fcfae2f5d50>

Splitting by semester and taking the time when the changeset was merged


In [68]:
# Taking time2merge dataframe
time2merge_2015_s1 = time2merge_df[time2merge_df["merge_time"] >= '2015-01-01']
time2merge_2015_s1 = time2merge_2015_s1[time2merge_2015_s1["merge_time"] < '2015-07-01']

time2merge_2014_s2 = time2merge_df[time2merge_df["merge_time"] >= '2014-07-01']
time2merge_2014_s2 = time2merge_2014_s2[time2merge_2014_s2["merge_time"] < '2015-01-01']
time2merge_2014_s1 = time2merge_df[time2merge_df["merge_time"] >= '2014-01-01']
time2merge_2014_s1 = time2merge_2014_s1[time2merge_2014_s1["merge_time"] < '2014-07-01']

time2merge_2013_s2 = time2merge_df[time2merge_df["merge_time"] >= '2013-01-01']
time2merge_2013_s2 = time2merge_2013_s2[time2merge_2013_s2["merge_time"] < '2014-01-01']
time2merge_2013_s1 = time2merge_df[time2merge_df["merge_time"] >= '2013-01-01']
time2merge_2013_s1 = time2merge_2013_s1[time2merge_2013_s1["merge_time"] < '2014-01-01']

time2merge_2012_s2 = time2merge_df[time2merge_df["merge_time"] >= '2012-07-01']
time2merge_2012_s2 = time2merge_2012_s2[time2merge_2012_s2["merge_time"] < '2013-01-01']
time2merge_2012_s1 = time2merge_df[time2merge_df["merge_time"] >= '2012-01-01']
time2merge_2012_s1 = time2merge_2012_s1[time2merge_2012_s1["merge_time"] < '2012-07-01']


data=[time2merge_2012_s1["time2merge_days"], time2merge_2012_s2["time2merge_days"], time2merge_2013_s1["time2merge_days"], time2merge_2013_s2["time2merge_days"], time2merge_2014_s1["time2merge_days"], time2merge_2014_s2["time2merge_days"], time2merge_2015_s1["time2merge_days"]]

figure = pylab.boxplot(data, sym="")
ticks = pylab.xticks([1, 2, 3, 4, 5, 6, 7], ['2012S1', '2012S2', '2013S1', '2013S2', '2014S1', '2014S2', '2015S1'])
pylab.title("Time to merge divided by year (without outliers)")
pylab.ylabel("Days")


Out[68]:
<matplotlib.text.Text at 0x7fcfae2cbc50>

Splitting by number of patches in the patch serie


In [69]:
query = """ SELECT psv.ps_id as patch_serie,
                   TIMESTAMPDIFF(SECOND, MIN(psv.date_utc), MAX(c.committer_date_utc)) as time2merge,
                   MIN(psv.date_utc) as first_patch_date, 
                   MAX(c.committer_date_utc) as merge_time,
                   MAX(t.patches) as numpatches 
             FROM patch_series_version psv, 
                  patches p, 
                     (SELECT psv.ps_id, 
                             p.ps_version_id, 
                             count(*) as patches 
                      FROM patch_series_version psv, 
                           patches p 
                      WHERE psv.id = p.ps_version_id  
                      GROUP BY psv.ps_id, 
                               p.ps_version_id) t,
                   commits c
              WHERE psv.ps_id = t.ps_id AND 
                    p.ps_version_id=t.ps_version_id AND 
                    psv.id = p.ps_version_id AND
                    p.commit_id = c.id
              GROUP BY psv.ps_id 
              HAVING time2merge > 0 and time2merge < 1000*3600*24"""
time2merge_extra = execute_query(cursor, query)

In [70]:
time2merge_extra_list = list(time2merge_extra)
time2merge_extra_df = DataFrame(time2merge_extra_list, columns=["patch_serie", "time2merge", "first_patch_date", "merge_time", "numpatches"])
time2merge_extra_df["time2merge_days"] = (time2merge_extra_df["time2merge"]/3600)/24

In [71]:
time2merge_extra_df_1patch = time2merge_extra_df[time2merge_extra_df["numpatches"]==1]
time2merge_extra_df_2patch = time2merge_extra_df[time2merge_extra_df["numpatches"]==2]
time2merge_extra_df_3patch = time2merge_extra_df[time2merge_extra_df["numpatches"]==3]
time2merge_extra_df_4patch = time2merge_extra_df[time2merge_extra_df["numpatches"]==4]
time2merge_extra_df_5patch = time2merge_extra_df[time2merge_extra_df["numpatches"]>4]

data=[time2merge_extra_df_1patch["time2merge_days"], time2merge_extra_df_2patch["time2merge_days"], time2merge_extra_df_3patch["time2merge_days"],
      time2merge_extra_df_4patch["time2merge_days"], time2merge_extra_df_5patch["time2merge_days"]]
figure = pylab.boxplot(data, sym="")
ticks = pylab.xticks([1, 2, 3, 4, 5], ['1 patch', '2 patches', '3 patches', '4 patches', '> 4 patches'])
pylab.title("Time to merge divided by number of patches (without outliers)")
pylab.ylabel("Days")


Out[71]:
<matplotlib.text.Text at 0x7fcfadf6b4d0>

Splitting by number of patchsets in the patch series and by semester


In [72]:
def draw_boxplots(title, data_df, field):
    # Taking data dataframe
    data_2015_s1 = data_df[data_df[field] >= '2015-01-01']
    data_2015_s1 = data_2015_s1[data_2015_s1[field] < '2015-07-01']

    data_2014_s2 = data_df[data_df[field] >= '2014-07-01']
    data_2014_s2 = data_2014_s2[data_2014_s2[field] < '2015-01-01']
    data_2014_s1 = data_df[data_df[field] >= '2014-01-01']
    data_2014_s1 = data_2014_s1[data_2014_s1[field] < '2014-07-01']

    data_2013_s2 = data_df[data_df[field] >= '2013-01-01']
    data_2013_s2 = data_2013_s2[data_2013_s2[field] < '2014-01-01']
    data_2013_s1 = data_df[data_df[field] >= '2013-01-01']
    data_2013_s1 = data_2013_s1[data_2013_s1[field] < '2014-01-01']

    data_2012_s2 = data_df[data_df[field] >= '2012-07-01']
    data_2012_s2 = data_2012_s2[data_2012_s2[field] < '2013-01-01']
    data_2012_s1 = data_df[data_df[field] >= '2012-01-01']
    data_2012_s1 = data_2012_s1[data_2012_s1[field] < '2012-07-01']


    data=[data_2012_s1["time2merge_days"], data_2012_s2["time2merge_days"], data_2013_s1["time2merge_days"], data_2013_s2["time2merge_days"], data_2014_s1["time2merge_days"], data_2014_s2["time2merge_days"], data_2015_s1["time2merge_days"]]

    figure = pylab.boxplot(data, sym="")
    ticks = pylab.xticks([1, 2, 3, 4, 5, 6, 7], ['2012S1', '2012S2', '2013S1', '2013S2', '2014S1', '2014S2', '2015S1'])
    pylab.title("Time to merge divided by year (without outliers)" + title)
    pylab.ylabel("Days")

In [73]:
draw_boxplots(" 1 patch", time2merge_extra_df_1patch, "first_patch_date")



In [74]:
draw_boxplots(" 2 patches", time2merge_extra_df_2patch, "first_patch_date")



In [75]:
draw_boxplots(" 3 patches", time2merge_extra_df_3patch, "first_patch_date")



In [76]:
draw_boxplots(" 4 patches", time2merge_extra_df_4patch, "first_patch_date")



In [77]:
draw_boxplots(" 5 patches", time2merge_extra_df_5patch, "first_patch_date")



In [78]:
draw_boxplots(" 5 patches", time2merge_extra_df_5patch, "merge_time")



In [79]:
draw_boxplots(" 1 patch", time2merge_extra_df_1patch, "merge_time")


Evolution of the number of patches per patch serie (so, are longer patch series taking place lately?)


In [80]:
total_patch_series_1 = time2merge_extra_df_1patch.set_index("first_patch_date").resample('M', how={"numpatches":np.size})
total_patch_series_2 = time2merge_extra_df_2patch.set_index("first_patch_date").resample('M', how={"numpatches":np.size})
total_patch_series_3 = time2merge_extra_df_3patch.set_index("first_patch_date").resample('M', how={"numpatches":np.size})
total_patch_series_4 = time2merge_extra_df_4patch.set_index("first_patch_date").resample('M', how={"numpatches":np.size})
total_patch_series_5 = time2merge_extra_df_5patch.set_index("first_patch_date").resample('M', how={"numpatches":np.size})

#fig, ax = plt.subplots()
dates = pd.date_range('2005-01', '2015-11', freq='M') # data starting in 2005-10 and ending in 2015-08

total_patch_series_1["numpatches1"] = total_patch_series_1["numpatches"]
total_patch_series_2["numpatches2"] = total_patch_series_2["numpatches"]
total_patch_series_3["numpatches3"] = total_patch_series_3["numpatches"]
total_patch_series_4["numpatches4"] = total_patch_series_4["numpatches"]
total_patch_series_5["numpatches5"] = total_patch_series_5["numpatches"]

patches1TS = Series(total_patch_series_1["numpatches1"], index=dates)
patches2TS = Series(total_patch_series_2["numpatches2"], index=dates)
patches3TS = Series(total_patch_series_3["numpatches3"], index=dates)
patches4TS = Series(total_patch_series_4["numpatches4"], index=dates)
patches5TS = Series(total_patch_series_5["numpatches5"], index=dates)

patches_chart = patches1TS.plot()
patches2TS.plot()
patches3TS.plot()
patches4TS.plot()
patches5TS.plot()
patches_chart.legend()


Out[80]:
<matplotlib.legend.Legend at 0x7fcfada6ea50>

Relationship between the median time to merge and the existence of patch series bigger than 4 patches


In [81]:
patches5TS.corr(time2merge_medianTS)


Out[81]:
0.45023097714471938

Coverage analysis


In [82]:
#unique patches
query = """select year(p.date_utc),
                  count(distinct(p.id)) 
           from patches p,
                patch_series_version psv 
           where psv.version=1 and 
                 psv.id=p.ps_version_id 
           group by year(p.date_utc) """
unique_patches = execute_query(cursor, query)
unique_patches


Out[82]:
((2004L, 15L),
 (2005L, 1312L),
 (2006L, 2003L),
 (2007L, 1969L),
 (2008L, 1934L),
 (2009L, 1915L),
 (2010L, 1948L),
 (2011L, 1559L),
 (2012L, 1907L),
 (2013L, 2345L),
 (2014L, 2035L),
 (2015L, 2060L),
 (2016L, 9L))

In [83]:
#commits detected by patch
query = """ select year(date_utc),
                   count(distinct(commit_id)) 
            from patches 
            group by year(date_utc) """
matched_commits = execute_query(cursor, query)
matched_commits


Out[83]:
((2001L, 2L),
 (2004L, 0L),
 (2005L, 37L),
 (2006L, 99L),
 (2007L, 277L),
 (2008L, 487L),
 (2009L, 468L),
 (2010L, 557L),
 (2011L, 581L),
 (2012L, 954L),
 (2013L, 1396L),
 (2014L, 1315L),
 (2015L, 1244L),
 (2016L, 0L))

In [84]:
#Actual commits
query = """ select year(s.date),
                   count(s.id)
            from xen_reports_cvsanaly.scmlog s
            group by year(s.date) """
current_commits = execute_query(cursor, query)
current_commits


Out[84]:
((1970L, 1L),
 (2002L, 82L),
 (2003L, 940L),
 (2004L, 2350L),
 (2005L, 5102L),
 (2006L, 4749L),
 (2007L, 3474L),
 (2008L, 2265L),
 (2009L, 1782L),
 (2010L, 2236L),
 (2011L, 2181L),
 (2012L, 2293L),
 (2013L, 2503L),
 (2014L, 2332L),
 (2015L, 2204L))

In [ ]: