In [1]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import numpy as np
import pandas as pd

In [2]:
data_dir = "../data/data_by_ocean/Eclipse_raw/"
bug_list_dir = "buglist/"
bug_description_dir = "description/"
bug_history_dir = "bughistory_raw/"

In [3]:
bug_list_record = pd.read_csv(data_dir+bug_list_dir+'bugs1.csv')

In [5]:
bug_list_record.head()


Out[5]:
   Bug ID  Classification  Product   Component  Assignee                        Status    Resolution  Number of Comments  Summary
0  1       Eclipse         Platform  Team       James_Moody@ca.ibm.com          CLOSED    FIXED       14                  Usability issue with external editors (1GE6IRL)
1  2       Eclipse         Platform  Team       James_Moody@ca.ibm.com          RESOLVED  FIXED       14                  Opening repository resources doesn't honor typ...
2  3       Eclipse         Platform  Team       James_Moody@ca.ibm.com          RESOLVED  FIXED       5                   Sync does not indicate deletion (1GIEN83)
3  4       Eclipse         Platform  Team       Michael_Valenta@ca.ibm.com      RESOLVED  FIXED       3                   need better error message if catching up over ...
4  10      Eclipse         Platform  Team       jean-michel_lemieux@ca.ibm.com  VERIFIED  FIXED       6                   API - VCM event notification (1G8G6RR)

In [4]:
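# Build the path from the directory variables defined above and join the
# stripped lines of the description file into a single string.
with open(data_dir + bug_description_dir + 'bugs1/1.txt', 'r', encoding='utf-8') as f:
    bug_description_record = " ".join(line.strip() for line in f)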
bug_description_record


Out[4]:
'Usability issue with external editors (1GE6IRL) [reply] Description Andre Weinand 2001-10-10 21:34:46 EDT - Setup a project that contains a *.gif resource - release project to CVS - edit the *.gif resource with an external editor, e.g. PaintShop - save and close external editor - in Navigator open the icon resource and verify that your changes are there - release project -> nothing to release! - in Navigator open the icon resource and verify that your changes are still there  Problem: because I never "Refreshed from local", the workspace hasn\'t changed so "Release" didn\'t find anything. However opening the resource with an external editor found the modified file on disk and showed the changes.  The real problem occurs if "Release" actually finds something to release but you don\'t spot that some resources are missing. This is extremely error prone: one of my changes didn\'t made it into build 110 because of this!  NOTES: EG (5/23/01 3:00:33 PM) Release should do a refresh from local before doing the release. Moving to VCM   KM (05/27/01 5:10:19 PM) Comments from JM in related email:  Should not do this for free.  Could have a setting which made it optoinal but should nt be mandatory.  Default setting could be to have it on. Consider the SWT team who keep their workspaces on network drives.  This will be slow.  Side effects will be that a build runs when the refresh is completed unless you somehow do it in a workspace runnable and don\'t end the runnable until after the release.  This would be less than optimal as some builders may be responsible for maintaining some invariants and deriving resources which are releasable.  If you don\'t run the builders before releasing, the invariants will not be maintained and you will release inconsistent state.  Summary:  Offer to "ensure local consistency" before releasing.  KM (5/31/01 1:30:35 PM) See also 1GEAG1A: ITPVCM:WINNT - Internal error comparing with a document which failed with an error.  
Never got log from Tod though.  [reply] Comment 1 James Moody 2001-10-19 10:32:10 EDT *** Bug 183 has been marked as a duplicate of this bug. ***  [reply] Comment 2 James Moody 2001-10-19 16:36:00 EDT Implemented \'auto refresh\' option. Default value is off.  [reply] Comment 3 DJ Houghton 2001-10-23 23:39:03 EDT PRODUCT VERSION: 109   [reply] Comment 4 James Moody 2001-10-25 10:19:43 EDT Fixed in v206  [reply] Comment 5 Boris Bokowski 2006-11-01 16:25:53 EST I looked at this because of the link in Ian\'s blog (http://ianskerrett.wordpress.com/2006/11/01/looking-back-in-time-at-eclipse/).  Much to my surprise, I can still reproduce the original issue with Eclipse 3.2.1!  Why didn\'t we turn auto-refresh on by default?  Does the SWT team still have their workspaces on network drives?  [reply] Comment 6 John Arthorne 2006-11-01 16:35:52 EST Yes, people still use network drives. In fact, since in 3.2 we now allow projects to be backed by arbitrary file systems via EFS, there are even more scenarios where refresh will be expensive.  Note also that James was referring to an "auto-refresh" option in the sync view. This was replaced with a global auto-refresh in the 3.0 release. This is still off by default, because as mentioned before, it can be expensive.  [reply] Comment 7 Boris Bokowski 2006-11-01 16:42:41 EST (In reply to comment #6) > This is still off by default, because as mentioned before, it can be expensive.  Couldn\'t we turn it on by default now that we have jobs, and a place where jobs show up in the UI? If the refresh happened in a background job that was displayed in the status line, people would have a way to find out what\'s going on and disable it if they don\'t like it.  [reply] Comment 8 John Arthorne 2006-11-01 17:22:38 EST Boris: see bug 89672  [reply] Comment 9 Philippe Ombredanne 2006-11-09 14:03:39 EST Happy birthday bug 1, you are five years old, and still kicking :-D  [reply] Comment 10 John Arthorne 2006-11-09 14:17:41 EST Closing.  
[reply] Comment 11 Eclipse Genie 2015-05-19 05:30:50 EDT New Gerrit change created: https://git.eclipse.org/r/48136  [reply] Comment 12 Eclipse Genie 2015-05-19 06:41:33 EDT Gerrit change https://git.eclipse.org/r/48136 was merged to [master]. Commit: http://git.eclipse.org/c/sirius/org.eclipse.sirius.git/commit/?id=980cf72c9ea237e5f896bc0cc74ec2b2dc05ccf5  [reply] Comment 13 Denis Roy 2015-05-19 09:47:26 EDT (In reply to Eclipse Genie from comment #11) > New Gerrit change created: https://git.eclipse.org/r/48136   I will fix her.  Add Comment  Collapse All Comments Expand All Comments'

In [21]:
bug_history_record = pd.read_csv(data_dir + bug_history_dir +'bugs1/1.csv' )

In [8]:
bug = bug_list_record.values[0]

In [9]:
bug


Out[9]:
array([1, 'Eclipse', 'Platform', 'Team', 'James_Moody@ca.ibm.com',
       'CLOSED', 'FIXED', 14,
       'Usability issue with external editors (1GE6IRL)'], dtype=object)

In [10]:
bug[8]

In [22]:
bug_history = bug_history_record.sort_values(by='When', ascending=False)
bug_history


Out[22]:
    When                     Who                        What        Added                                              Removed
17  2015-05-19 06:41:33 EDT  genie@eclipse.org          See Also    https://git.eclipse.org/c/sirius/org.eclipse.s...  NaN
16  2015-05-19 05:30:50 EDT  genie@eclipse.org          See Also    https://git.eclipse.org/r/48136                    NaN
15  2012-02-09 15:57:47 EST  milesparker@gmail.com      Depends on  NaN                                                371034
14  2012-02-09 13:17:24 EST  milesparker@gmail.com      Depends on  371034                                             NaN
13  2011-12-09 14:17:46 EST  denis.roy@eclipse.org      CC          denis.roy@eclipse.org                              NaN
12  2011-07-20 05:52:58 EDT  manfredend@yahoo.com.cn    CC          manfredend@yahoo.com.cn                            NaN
11  2010-07-08 01:01:19 EDT  hirujung@gmail.com         CC          hirujung@gmail.com                                 NaN
10  2008-07-03 16:01:52 EDT  bokowski@google.com        CC          mik.kersten@tasktop.com                            NaN
9   2007-10-31 04:31:53 EDT  mauromol@tiscali.it        CC          mauromol@tiscali.it                                NaN
8   2007-07-07 06:50:13 EDT  yagnesh@infoworldpune.com  CC          yagnesh@infoworldpune.com                          NaN
7   2006-11-09 14:17:41 EST  john_arthorne@ca.ibm.com   Status      CLOSED                                             VERIFIED
6   2006-11-09 14:01:54 EST  pombredanne@nexb.com       CC          pombredanne@nexb.com                               NaN
5   2006-11-01 16:35:52 EST  john_arthorne@ca.ibm.com   CC          john_arthorne@ca.ibm.com                           NaN
4   2006-11-01 16:25:53 EST  bokowski@google.com        CC          Boris_Bokowski@ca.ibm.com                          NaN
3   2001-10-25 10:40:50 EDT  James_Moody@ca.ibm.com     Status      VERIFIED                                           RESOLVED
1   2001-10-19 16:36:00 EDT  James_Moody@ca.ibm.com     Status      RESOLVED                                           ASSIGNED
2   2001-10-19 16:36:00 EDT  James_Moody@ca.ibm.com     Resolution  FIXED                                              ---
0   2001-10-19 10:32:09 EDT  James_Moody@ca.ibm.com     CC          Kevin_McGuire@oti.com                              NaN
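
The history above is ordered by the raw `When` strings, which works because the timestamps are written year-first. As a hedged alternative, the sketch below sorts on parsed datetimes instead. It assumes every timestamp ends in either `EDT` or `EST`, as in the output above; the sample frame, the `offsets` mapping, and the `When_parsed` column are illustrative and not part of the original notebook.

```python
import pandas as pd

# Illustrative sample in the same format as the history table above.
history = pd.DataFrame({
    "When": [
        "2001-10-19 16:36:00 EDT",
        "2015-05-19 06:41:33 EDT",
        "2006-11-09 14:17:41 EST",
    ],
    " What": ["Resolution", "See Also", "Status"],
})

# Assumption: each timestamp ends in a space plus a 3-letter zone (EDT or EST).
# Map the abbreviation to its fixed UTC offset so the strings parse unambiguously.
offsets = {"EDT": "-0400", "EST": "-0500"}
stamp = history["When"].str[:-4]              # e.g. "2001-10-19 16:36:00"
zone = history["When"].str[-3:].map(offsets)  # e.g. "-0400"

history["When_parsed"] = pd.to_datetime(
    stamp + " " + zone, format="%Y-%m-%d %H:%M:%S %z", utc=True
)

newest_first = history.sort_values(by="When_parsed", ascending=False)
```

Sorting on `When_parsed` stays correct even when two events fall within the hour that separates EDT from EST, where a plain string comparison could misorder them.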

In [28]:
bug_history_record = bug_history[(bug_history[' Added'] == 'FIXED') & (bug_history[' What'] == 'Resolution')]
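
The filter above indexes `' Added'` and `' What'` with a leading space, presumably because the history CSV header has a blank after each comma. The following is a minimal sketch of normalising such labels so the filter can use plain names. The in-memory CSV and its header layout are assumptions made for illustration, not a copy of the real file.

```python
import io
import pandas as pd

# Assumed layout: blanks after the commas in the header row only.
csv_text = (
    "When, Who, What, Added, Removed\n"
    "2001-10-19 16:36:00 EDT,James_Moody@ca.ibm.com,Resolution,FIXED,---\n"
    "2001-10-19 16:36:00 EDT,James_Moody@ca.ibm.com,Status,RESOLVED,ASSIGNED\n"
)

history = pd.read_csv(io.StringIO(csv_text))
# At this point the labels are 'When', ' Who', ' What', ' Added', ' Removed'.

# Strip surrounding whitespace from every column label.
history.columns = history.columns.str.strip()

fixed = history[(history["Added"] == "FIXED") & (history["What"] == "Resolution")]
```

Passing `skipinitialspace=True` to `pd.read_csv` is another option; it drops the blank after each delimiter while the file is being read.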

In [55]:
# Join the fix time, bug ID, summary, description, and resolution with commas.
line = ','.join([
    bug_history_record['When'].values[0],
    str(bug[0]),
    str(bug[8]),
    bug_description_record,
    bug_history_record[' Added'].values[0],
])

In [57]:
bug_history_record[' Added'].values[0]


Out[57]:
'FIXED'

In [4]:
bug_raw = pd.read_csv(data_dir+'raw/bug_raw.csv', sep='@@,,@@', engine='python', parse_dates=True, encoding='latin-1')
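
The `'@@,,@@'` separator above is longer than one character, so pandas treats it as a regular expression and needs `engine='python'`. The sketch below shows why such a delimiter is convenient for free text that itself contains commas. The sample row is invented and the column order is an assumption; only the column names come from this notebook.

```python
import io
import pandas as pd

SEP = "@@,,@@"

# Invented sample data; the column order is assumed for illustration.
rows = [
    ["when", "bug_id", "who", "summary", "description"],
    ["2001-10-10 21:34:46", "1", "someone@example.com",
     "Short summary, with a comma", "Longer text, also with commas"],
]

# DataFrame.to_csv only accepts a single-character sep, so the text is built by hand.
text = "\n".join(SEP.join(row) for row in rows)

# A multi-character sep is parsed as a regex, which requires the python engine.
bug_sample = pd.read_csv(io.StringIO(text), sep=SEP, engine="python")
```

Commas inside `summary` and `description` survive the round trip because only the full `@@,,@@` sequence splits fields.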

In [5]:
bug_sorted_raw = bug_raw.sort_values('when')

In [14]:
bug_sorted_raw.to_csv(data_dir + 'raw/sorted_summary_description.csv', columns=['description', 'summary'], header=False, index=False)

In [17]:
bug_sorted_raw.to_csv(data_dir + 'raw/sorted_bug_id_date_who.csv', columns=['when', 'bug_id', 'who'], index=False)

In [26]:
bug_sorted_raw.iloc[:18179, :].shape


Out[26]:
(18179, 5)

In [27]:
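# Split the time-sorted bugs into 11 consecutive parts of (almost) equal size.
bug_len = len(bug_sorted_raw)
bug_part_size = int((bug_len-1)/11) + 1
for i in range(11):
    begin_index = i * bug_part_size
    # NOTE: iloc's end bound is exclusive, so capping at bug_len-1 leaves the
    # final row out of the last part; cap at bug_len to include it.
    end_index = min((i+1)*bug_part_size, bug_len-1)
    bug_parted = bug_sorted_raw.iloc[begin_index:end_index]
    print(begin_index, end_index, end_index - begin_index)
    print(bug_parted.shape)
    bug_parted.to_csv(data_dir + 'raw/'+str(i)+'_summary_description.csv', columns=['description', 'summary'], header=False, index=False)
    bug_parted.to_csv(data_dir + 'raw/'+str(i)+'_bug_id_date_who.csv', columns=['when', 'bug_id', 'who'], index=False)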


0 18179 18179
(18179, 5)
18179 36358 18179
(18179, 5)
36358 54537 18179
(18179, 5)
54537 72716 18179
(18179, 5)
72716 90895 18179
(18179, 5)
90895 109074 18179
(18179, 5)
109074 127253 18179
(18179, 5)
127253 145432 18179
(18179, 5)
145432 163611 18179
(18179, 5)
163611 181790 18179
(18179, 5)
181790 199959 18169
(18169, 5)
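
The loop above cuts the sorted frame into consecutive parts of `bug_part_size` rows. As a sketch only, the helper below computes half-open `(begin, end)` bounds so that every row falls into exactly one part. `part_bounds` is a hypothetical name, and the small row count is chosen just to keep the example readable.

```python
def part_bounds(n_rows, n_parts):
    """Return (begin, end) pairs, end exclusive, that together cover range(n_rows)."""
    part_size = (n_rows + n_parts - 1) // n_parts  # ceiling division
    bounds = []
    for i in range(n_parts):
        begin = min(i * part_size, n_rows)
        end = min((i + 1) * part_size, n_rows)     # cap at n_rows, not n_rows - 1
        bounds.append((begin, end))
    return bounds


bounds = part_bounds(25, 4)
covered = sum(end - begin for begin, end in bounds)
```

Each pair can be passed straight to `iloc`, for example `bug_sorted_raw.iloc[begin:end]`, because `iloc` slices also exclude the end position.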

In [36]:
# This approach is obsolete: splitting by year yields unbalanced partitions,
# so it was abandoned.
begin = 2001
end = 2016
for year in range(begin, end):
    # `when` is compared as a string, so str(year) acts as the lower bound of that year.
    condition = (bug_sorted_raw.when > str(year)) & (bug_sorted_raw.when < str(year+1))
    bug_sorted_condition_raw = bug_sorted_raw[condition]
    bug_sorted_condition_raw.to_csv(data_dir + 'raw/' + str(year) + '_summary_description.csv',
                                    columns=['description', 'summary'], header=False, index=False)
    bug_sorted_condition_raw.to_csv(data_dir + 'raw/' + str(year) + '_bug_id_date_who.csv',
                                    columns=['when', 'bug_id', 'who'], index=False)
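
The per-year split above selects rows by comparing `when` against year strings. A hedged sketch of the same idea with `groupby` follows; it also makes the imbalance between years easy to inspect. It assumes `when` holds year-first timestamp strings that `pd.to_datetime` can parse, and the sample frame is illustrative.

```python
import pandas as pd

# Illustrative sample; the real frame is bug_sorted_raw.
bugs = pd.DataFrame({
    "when": ["2001-10-10 21:34:46", "2001-12-01 08:00:00", "2003-05-19 05:30:50"],
    "bug_id": [1, 2, 3],
})

# Assumption: every `when` value parses as a timestamp.
year = pd.to_datetime(bugs["when"]).dt.year

# One sub-frame per calendar year that actually occurs in the data.
per_year = {int(y): part for y, part in bugs.groupby(year)}

# Row counts per year show how uneven the partitions are.
sizes = {y: len(part) for y, part in per_year.items()}
```

Years with no bugs simply do not appear in `per_year`, so no empty files would be written for them.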
