Due Thursday October 24 at 9am
UC Berkeley "Python for Data Science" Seminar AY 250. Fall 2013.
Created by Guest Lecturer Wes McKinney.
Download a dump of data about closed GitHub issues for the pandas project here:
https://www.dropbox.com/s/pe6dqooznrfynii/closed.json
Use the built-in json library to read this file into memory. Each element in the list contains information about a GitHub issue and all developer comments that were made on it in the 'comments' field.
1) Make a DataFrame with one row per issue with the following columns extracted from the issue data:
ntitle, created_at, labels, closed_at, user, id
Transform the user values to be simply the 'login' string, so that the user column contains only string usernames.
2) Remove duplicate rows by id from the DataFrame you just created using the id column's duplicated method.
4) Convert the created_at
and closed_at columns
from string to datetime type.
5) Now construct appropriate time series and pandas functions to make the following plots:
Number of issues created by month
Number of distinct users creating issues each month (hint: you can pass a function to resample's how argument, and there's nothing wrong with having string values in a TimeSeries)
6) Make a table and an accompanying plot illustrating:
total_seconds
function on the
timedelta objects computed when subtracting datetime objects). Also show the
number of issues in each month in the table.7) Make a DataFrame containing all the comments for all of the issues. You will
want to add an id
attribute to each comment while doing so so that each row
contains a single comment and has the id of the issue it belongs to.
Convert the created
column to datetime format; note you will need to multiply
the values (appropriately converted to integers) by 1000000 to get them in
nanoseconds and pass to to_datetime.
8) For each month, compute a table summarizing the following for each month:
9) Create a helper labels
table from the issues data with two columns: id and
label. If an issue has 3 elements in its 'labels' value, add 3 rows to the
table. If an issue does not have any labels, place a single row with None as
the label (hint: construct a list of tuples, then make the DataFrame).
10) Now, join the issues data with the labels helper table (pandas.merge). Add a column to this table containing the number of days (as a floating point number) it took to close each issue.
11) Compute a table containing the average time to close for each label type. Now make a plot comparing mean time to close by month for Enhancement versus Bug issue types.
In [ ]: