In [89]:
import pandas as pd
commits = pd.read_csv("../../intellij-community/git_oneline.log")
commits.head()
Out[89]:
In [92]:
commits.info()
In [93]:
commits['author'].value_counts().head(10).plot(kind='pie', figsize=(5,5))
Out[93]:
In [94]:
commits.head()
Out[94]:
In [95]:
commits = commits.set_index(pd.to_datetime(commits['timestamp'], unit="s"))
commits.head()
Out[95]:
In [96]:
# commits_per_month isn't created above; a plausible reconstruction is to
# count the commits per month on the datetime index:
commits_per_month = commits[['author']].resample('M').count()
commits_per_month['commits_cum'] = commits_per_month['author'].cumsum()
commits_per_month
Out[96]:
In [38]:
commits_per_month['commits_cum'].plot()
Out[38]:
In [77]:
%matplotlib inline
commits_raw['time'].plot()
In [73]:
commits_raw['unix'].dtypes
Out[73]:
In [38]:
commits_raw['time'] = pd.to_datetime(commits_raw['timestamp'])
commits_raw.dtypes
Out[38]:
In [ ]:
commits_raw.head()
In [ ]:
commits_raw
In [1]:
import git
GIT_REPO_DIR = r'C:/dev/repos/intellij-community/'
repo = git.Repo(GIT_REPO_DIR, odbt=git.GitCmdObjectDB)
g = repo.git
log = g.log('--all', '--numstat', '--no-renames', '--pretty=format:#%h#%ad#%aN')
log[0:100]
Out[1]:
After this, we have to tell Git which information we want. We can do this via the pretty-format option.
For each commit, we choose to create a header line with the following commit info (by using --pretty=format:'#%h#%ad#%aN'), which gives us output like this:
#fa1ca6f#Thu Dec 22 08:04:18 2016 +0100#feststelltaste
It contains the SHA key, the timestamp as well as the author's name of the commit, separated by a character that certainly doesn't occur within this information itself (here: #).
We also want to have some details about the modifications of each file per commit. This is why we use the --numstat flag.
Together with the --all flag to get all commits and the --no-renames flag to avoid commits that only rename files, we retrieve all the needed information directly via Git.
Each of the other rows contains statistics about a modified file:
2 0 src/main/asciidoc/appendices/bibliography.adoc
It contains the number of lines inserted, the number of lines deleted and the relative path of the file. With a little trick and a little bit of data wrangling, we can read that information into a nicely structured DataFrame.
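The whole wrangling can be sketched compactly on a tiny, made-up log (the SHAs, dates, and file paths below are invented for illustration; the real notebook reads the output of git log instead):

```python
import pandas as pd
from io import StringIO

# Hypothetical two-commit log: each commit has a '#'-separated header line
# followed by tab-separated numstat lines.
log = (
    "#fa1ca6f#Thu Dec 22 08:04:18 2016 +0100#feststelltaste\n"
    "2\t0\tsrc/main/asciidoc/appendices/bibliography.adoc\n"
    "5\t1\tREADME.md\n"
    "#ab12cd3#Fri Dec 23 10:00:00 2016 +0100#feststelltaste\n"
    "1\t1\tpom.xml\n"
)

# header lines split into sha/date/author; numstat lines land in 'file_stats'
raw = pd.read_csv(StringIO(log), sep="#", header=None,
                  names=["file_stats", "sha", "date", "author"])

# forward-fill each commit's metadata down onto its numstat rows
meta = raw[["sha", "date", "author"]].ffill()

# split the tab-separated numstat rows into proper columns
stats = raw["file_stats"].dropna().str.split("\t", expand=True)
stats.columns = ["additions", "deletions", "filename"]
stats["additions"] = pd.to_numeric(stats["additions"], errors="coerce")
stats["deletions"] = pd.to_numeric(stats["deletions"], errors="coerce")

# keep only the file rows, each carrying its commit's metadata
commits = meta.join(stats, how="right")
```

This mirrors the steps below: read with an "impossible" separator, forward-fill the commit metadata, split the numstat column, and join.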
Let's get started!
First, I'll show you my approach for reading nearly everything into a DataFrame. The key is to use Pandas' read_csv for reading "non-character separated values". How does that work? We simply choose a separator that doesn't occur in the file that we want to read. My favorite character for this is the "DEVICE CONTROL TWO" character U+0012: I haven't yet encountered a data set that contains it.
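To see the separator trick in isolation, here is a tiny, hypothetical example (the input text is made up; the point is that sep="\u0012" never matches, so each line becomes one value, commas, tabs and all):

```python
import pandas as pd
from io import StringIO

# messy lines containing commas and tabs are read as single values,
# because the separator "\u0012" never occurs in the text
text = "a,b\tc\nsecond, messy line\n"
df = pd.read_csv(StringIO(text), sep="\u0012", header=None, names=["value"])
```

Each row of `df["value"]` now holds one complete, unsplit line, ready for custom string wrangling.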
We just read our git.log file without any headers (because there are none) and give the only column a nice name.
In [3]:
import pandas as pd
from io import StringIO
commits_raw = pd.read_csv(StringIO(log),
                          sep="#",
                          header=None,
                          names=['file_stats', 'sha', 'date', 'author'])
commits_raw.head()
Out[3]:
OK, but now we face a data wrangling challenge. We have the commit info as well as the statistics for the modified files in one column, but they don't belong together. What we want is the commit info alongside the file statistics in separate columns to get some serious analysis started.
In [5]:
commit_metadata = commits_raw[['sha', 'date', 'author']].ffill()
commit_metadata.head(5)
Out[5]:
With this, we can focus on extracting the information from a commit info row. The next command may look a little frightening, but don't worry: we'll go through it step by step.
In [6]:
file_info = commits_raw['file_stats'].dropna().str.split("\t", expand=True)
file_info.columns = ['additions', "deletions", "filename"]
file_info.head()
Out[6]:
In [7]:
file_info['additions'] = pd.to_numeric(file_info['additions'], errors='coerce')
file_info['deletions'] = pd.to_numeric(file_info['deletions'], errors='coerce')
file_info.dtypes
Out[7]:
In [8]:
commits = commit_metadata.join(file_info, how='right')
commits = commits.dropna()
commits.head()
Out[8]:
In [87]:
commits.groupby('author')[['additions']].sum().sort_values(by='additions', ascending=False)
Out[87]:
OK, this part is ready, let's have a look at the file statistics!
We're done!
Just some milliseconds to run through, not bad!
In this notebook, I showed you how to read imperfectly structured data via the non-character separator trick. I also showed you how to transform rows that contain multiple kinds of data into one nicely structured DataFrame.
Now that we have the Git repository DataFrame, we can do some nice things with it, e.g. visualizing the code churn of a project. But that's a story for another notebook! To give you a short preview:
In [88]:
%matplotlib inline
timed_commits = commits.set_index(pd.DatetimeIndex(commits['date']))[['additions', 'deletions']].resample('1D').sum()
(timed_commits['additions'] - timed_commits['deletions']).cumsum().ffill().plot()
In [ ]:
%matplotlib inline
commits['author'].value_counts().plot(kind='pie', figsize=(10,10))
Stay tuned!