In this worksheet, you will take your first steps with Jupyter, Python, pandas and matplotlib using a practical example: the analysis of the development history of the Linux kernel.
The complete and detailed mini-tutorial can also be found on my blog at https://www.feststelltaste.de/mini-tutorial-git-log-analyse-mit-python-und-pandas/.
As a starting point, we have a file that lists the timestamp and the author of the code change for each commit:
timestamp,author
2017-12-31 14:47:43,Linus Torvalds
2017-12-31 13:13:56,Linus Torvalds
2017-12-31 13:03:05,Linus Torvalds
2017-12-31 12:30:34,Linus Torvalds
2017-12-31 12:29:02,Linus Torvalds
This data was generated with git (https://git-scm.com) from the GitHub repository https://github.com/torvalds/linux/ and simplified a little for the mini-tutorial.
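To make the record format concrete, here is a minimal sketch that parses the five sample records shown above with pandas. The variable names `SAMPLE` and `sample_log` are chosen for this illustration only and are not part of the worksheet.

```python
import io

import pandas as pd

# The five sample records from above, copied verbatim into a string.
SAMPLE = """timestamp,author
2017-12-31 14:47:43,Linus Torvalds
2017-12-31 13:13:56,Linus Torvalds
2017-12-31 13:03:05,Linus Torvalds
2017-12-31 12:30:34,Linus Torvalds
2017-12-31 12:29:02,Linus Torvalds
"""

# parse_dates asks pandas to turn the text timestamps into datetime values.
sample_log = pd.read_csv(io.StringIO(SAMPLE), parse_dates=["timestamp"])

print(sample_log.shape)             # (rows, columns)
print(sample_log["author"].unique())
```

Each line of the file becomes one row, and the header line supplies the two column names.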
Let's get to know the tools we use!
In [ ]:
"Hello World"
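Press the ESC key (command mode), then the b key (insert a new cell below), then m (turn the cell into a Markdown cell), then Enter (edit mode; note the color to the left of the cell, which turns green instead of blue), and finally Ctrl + Enter (run the cell).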
In [ ]:
PATH = "https://github.com/feststelltaste/software-analytics/raw/master/demos/dataset/git_demo_timestamp_linux.gz"
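A file like this can be loaded with `pandas.read_csv`, which by default infers gzip compression from a `.gz` suffix. The sketch below writes a small local stand-in file so that it runs offline; the file name `git_demo_sample.gz` and the variable `local_path` are assumptions made for this illustration. Assuming network access and that the remote file carries the header line shown earlier, `pd.read_csv(PATH)` should behave the same way.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical local stand-in for PATH, so the sketch works without a network.
tmp_dir = tempfile.mkdtemp()
local_path = os.path.join(tmp_dir, "git_demo_sample.gz")

# Write two of the sample records as gzip-compressed CSV text.
with gzip.open(local_path, "wt", encoding="utf-8") as f:
    f.write(
        "timestamp,author\n"
        "2017-12-31 14:47:43,Linus Torvalds\n"
        "2017-12-31 13:13:56,Linus Torvalds\n"
    )

# The ".gz" suffix lets read_csv pick gzip decompression on its own.
git_log = pd.read_csv(local_path)
print(git_log.head())
```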
We see that git_log consists of two columns: timestamp (= commit time) and author (= programmer).
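The two columns can be inspected and used roughly as follows. This is a sketch on a tiny stand-in table, not the worksheet's own solution: the row for "Jane Doe" is invented purely so that the counts are not trivial, and the names `commits_per_author` and `commits_per_hour` are assumptions.

```python
import io

import pandas as pd

# Tiny stand-in for git_log; the "Jane Doe" row is hypothetical.
git_log = pd.read_csv(io.StringIO(
    "timestamp,author\n"
    "2017-12-31 14:47:43,Linus Torvalds\n"
    "2017-12-31 13:13:56,Linus Torvalds\n"
    "2017-12-30 09:00:00,Jane Doe\n"
))

# The table as a whole is a DataFrame; a single column is a Series.
print(type(git_log).__name__)
print(type(git_log["author"]).__name__)

# The timestamps are read as plain text; convert them to datetime values.
git_log["timestamp"] = pd.to_datetime(git_log["timestamp"])

# Count commits per author and per hour of the day.
commits_per_author = git_log["author"].value_counts()
commits_per_hour = git_log["timestamp"].dt.hour.value_counts().sort_index()

print(commits_per_author)
print(commits_per_hour)
```

Calling `.plot.bar()` on either result would draw the counts with matplotlib.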
Well done! Congratulations!
You have now learned some basics about pandas. This will get us a long way in our daily work. The other important topics that are still missing are:
merge and join, groupby, and pivot_table.
I hope that this mini-tutorial will show you the potential of data analysis using Jupyter, Python, pandas and matplotlib!
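As a small taste of groupby, one of the topics named above, here is a minimal sketch that summarizes commits per author. The data is a tiny stand-in with one invented row ("Jane Doe"), and the name `per_author` is an assumption made for this illustration.

```python
import io

import pandas as pd

# Tiny stand-in for the commit log; the "Jane Doe" row is hypothetical.
git_log = pd.read_csv(
    io.StringIO(
        "timestamp,author\n"
        "2017-12-31 14:47:43,Linus Torvalds\n"
        "2017-12-31 13:13:56,Linus Torvalds\n"
        "2017-12-30 09:00:00,Jane Doe\n"
    ),
    parse_dates=["timestamp"],
)

# Group the rows by author, then compute the number of commits and the
# earliest and latest commit time within each group.
per_author = git_log.groupby("author")["timestamp"].agg(["count", "min", "max"])
print(per_author)
```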
I am looking forward to your comments and feedback!
Markus Harrer
Blog: https://www.feststelltaste.de
Mail: talk@markusharrer.de
Twitter: @feststelltaste
Consulting and training: http://markusharrer.de