In this worksheet, you will take your first steps with Jupyter, Python, pandas and matplotlib using a practical example: the analysis of the development history of the Linux kernel.
The complete and detailed mini-tutorial can also be found on my blog at https://www.feststelltaste.de/mini-tutorial-git-log-analyse-mit-python-und-pandas/.
As a starting point, we have a file that lists the timestamp and the author of the code change for each commit:
timestamp,author
2017-12-31 14:47:43,Linus Torvalds
2017-12-31 13:13:56,Linus Torvalds
2017-12-31 13:03:05,Linus Torvalds
2017-12-31 12:30:34,Linus Torvalds
2017-12-31 12:29:02,Linus Torvalds
This data was generated by Git (https://git-scm.com) from the GitHub repository https://github.com/torvalds/linux/ (and simplified a little for the mini-tutorial).
Let's get to know the tools we use!
In [1]:
"Hello World"
Out[1]:
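Switch to command mode with the ESC key.
Create a new cell below with the b key.
Change the cell type to Markdown with the m key.
Switch to edit mode with Enter (note the color to the left of the cell, which turns green instead of blue).
Type a text and execute the cell with Ctrl + Enter.
This is a text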
In [2]:
text = "Hello World"
text
Out[2]:
In [3]:
text[0]
Out[3]:
In [4]:
text[-1]
Out[4]:
In [5]:
text[2:4]
Out[5]:
In [6]:
text.upper()
Out[6]:
In [7]:
text.split("l", maxsplit=1)
Out[7]:
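The string operations from the cells above can be collected into one runnable sketch. The variable names on the left (first, last and so on) are chosen here for illustration only; the comments show the value each expression produces.

```python
text = "Hello World"

first = text[0]        # 'H'  (indexing starts at 0)
last = text[-1]        # 'd'  (negative indices count from the end)
middle = text[2:4]     # 'll' (start index included, end index excluded)
shouted = text.upper() # 'HELLO WORLD'
# Split at the first "l" only; the second part keeps the remaining "l" characters.
parts = text.split("l", maxsplit=1)  # ['He', 'lo World']

print(first, last, middle, shouted, parts)
```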
In [8]:
import pandas as pd
pd?
In [9]:
PATH = "https://github.com/feststelltaste/software-analytics/raw/master/demos/dataset/git_demo_timestamp_linux.gz"
git_log = pd.read_csv(PATH)
git_log.head()
Out[9]:
In [10]:
git_log.info()
We see that git_log is a DataFrame with two columns: timestamp (= commit time) and author (= programmer).
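To experiment without downloading the full dataset, the five sample rows from the beginning of this worksheet can be read from an in-memory string. This is only a sketch: the names SAMPLE and sample_log are made up for this example and do not appear in the original notebook.

```python
import io

import pandas as pd

# The sample rows shown at the top of the worksheet, kept in memory instead of a file.
SAMPLE = """timestamp,author
2017-12-31 14:47:43,Linus Torvalds
2017-12-31 13:13:56,Linus Torvalds
2017-12-31 13:03:05,Linus Torvalds
2017-12-31 12:30:34,Linus Torvalds
2017-12-31 12:29:02,Linus Torvalds
"""

# read_csv accepts file-like objects as well as paths and URLs.
sample_log = pd.read_csv(io.StringIO(SAMPLE))

print(sample_log.shape)          # (5, 2): five commits, two columns
print(list(sample_log.columns))  # ['timestamp', 'author']
```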
In [11]:
missing = git_log.author.isnull()
missing.head()
Out[11]:
In [12]:
git_log[missing]
Out[12]:
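How the selection with a boolean Series works is easier to see on a tiny table. The following sketch assumes a hypothetical three-row log (demo_log) in which one author is missing; it is not taken from the real dataset.

```python
import pandas as pd

# Hypothetical miniature log; the second row has no author.
demo_log = pd.DataFrame({
    "timestamp": ["2017-12-31 14:47:43", "2017-12-31 13:13:56", "2017-12-31 13:03:05"],
    "author": ["Linus Torvalds", None, "Linus Torvalds"],
})

# Boolean Series: True where the author is missing.
missing = demo_log.author.isnull()

# Using the boolean Series as a selector keeps only the rows marked True.
rows_without_author = demo_log[missing]

print(missing.tolist())          # [False, True, False]
print(len(rows_without_author))  # 1
```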
In [13]:
top10 = git_log.author.value_counts().head(10)
top10
Out[13]:
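The combination of value_counts() and head() can be tried on a handful of values. In this sketch the Series authors and the names "Author A" and "Author B" are placeholders, not entries from the Linux Git log.

```python
import pandas as pd

# Hypothetical author column with three distinct names.
authors = pd.Series(
    ["Linus Torvalds", "Author A", "Linus Torvalds",
     "Author B", "Linus Torvalds", "Author A"],
    name="author",
)

# value_counts() counts each distinct value and sorts by count, highest first.
counts = authors.value_counts()

# head(2) keeps the two most frequent authors.
top2 = counts.head(2)

print(top2.index.tolist())  # ['Linus Torvalds', 'Author A']
print(top2.tolist())        # [3, 2]
```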
In [14]:
%matplotlib inline
top10.plot()
Out[14]:
In [15]:
top10.plot.bar()
Out[15]:
In [16]:
top10.plot.bar();
In [17]:
top10.plot.pie();
In [18]:
top10.plot(
    kind='pie',
    title="TOP 10 authors",
    label="",
    figsize=[5,5]);
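Outside a notebook, the %matplotlib inline magic is not available. The following sketch draws the same kind of pie chart in a plain Python script, assuming matplotlib's non-interactive Agg backend. The commit counts, the author labels and the names top and buf are made up for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render without opening a window

import pandas as pd

# Hypothetical commit counts standing in for top10.
top = pd.Series({"Author A": 5, "Author B": 3, "Author C": 2}, name="author")

# Series.plot returns the matplotlib Axes the chart was drawn on.
ax = top.plot(kind="pie", title="TOP authors", label="", figsize=[5, 5])

# Render the figure as PNG into memory; pass a file name instead to write to disk.
buf = io.BytesIO()
ax.figure.savefig(buf, format="png")
print(ax.get_title())
```

In a notebook the semicolon after the plot call only suppresses the text representation of the returned Axes object; in a script it has no effect.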
In [19]:
git_log.timestamp.head()
Out[19]:
In [20]:
ts = pd.to_datetime(git_log.timestamp)
ts.head()
Out[20]:
In [21]:
ts.dt.hour.head()
Out[21]:
In [22]:
# sort_index() keeps the hours in ascending order regardless of the pandas version
commits_per_hour = ts.dt.hour.value_counts(sort=False).sort_index()
commits_per_hour.head()
Out[22]:
In [23]:
commits_per_hour.plot.bar();
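The whole chain from text to commits per hour can be followed on the five sample timestamps from the beginning of this worksheet. This is a minimal sketch: the Series timestamps is built by hand here, and sort_index() makes the ordering by hour explicit.

```python
import pandas as pd

# The five sample timestamps from the top of the worksheet, still plain text.
timestamps = pd.Series([
    "2017-12-31 14:47:43",
    "2017-12-31 13:13:56",
    "2017-12-31 13:03:05",
    "2017-12-31 12:30:34",
    "2017-12-31 12:29:02",
])

ts = pd.to_datetime(timestamps)  # text -> real datetime values
hours = ts.dt.hour               # the .dt accessor exposes date and time parts

# Count the commits per hour and order the result by hour.
commits_per_hour = hours.value_counts().sort_index()

print(commits_per_hour.index.tolist())  # [12, 13, 14]
print(commits_per_hour.tolist())        # [2, 2, 1]
```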
Well done! Congratulations!
You have now learned some basics about pandas. This will get us a long way in our daily work. Other important topics that are still missing are merge and join, groupby, and pivot_table.
I hope that this mini-tutorial has shown you the potential of data analysis using Jupyter, Python, pandas and matplotlib!
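As a small outlook, the following sketch gives a first impression of groupby, merge and pivot_table. All data in it is hypothetical: the tables log and teams, the author names and the team names are invented for this example.

```python
import pandas as pd

# Hypothetical miniature commit log; every row counts as one commit.
log = pd.DataFrame({
    "author": ["Author A", "Author B", "Author A", "Author A"],
    "year": [2016, 2016, 2017, 2017],
    "commits": [1, 1, 1, 1],
})

# Hypothetical lookup table with one team per author.
teams = pd.DataFrame({
    "author": ["Author A", "Author B"],
    "team": ["core", "drivers"],
})

# groupby: sum of commits per author
commits_per_author = log.groupby("author")["commits"].sum()

# merge: add the team of each author to every commit row
log_with_team = log.merge(teams, on="author", how="left")

# pivot_table: authors as rows, years as columns, summed commits as values
per_year = log.pivot_table(
    index="author", columns="year", values="commits",
    aggfunc="sum", fill_value=0,
)

print(commits_per_author.tolist())    # [3, 1]
print(log_with_team["team"].tolist()) # ['core', 'drivers', 'core', 'core']
print(per_year)
```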
I am looking forward to your comments and feedback!
Markus Harrer
Blog: https://www.feststelltaste.de
Mail: talk@markusharrer.de
Twitter: @feststelltaste
Consulting and training: http://markusharrer.de