Michaël Defferrard, PhD student, EPFL LTS2
This short primer is an introduction to the scientific Python stack for Data Science. It is designed as a tour around the major Python packages used for the main computational tasks encountered in the sexiest job of the 21st century. At the end of this tour, you'll have a broad overview of the available libraries as well as why and how they are used for each task. This notebook aims at answering the following question: which tool should I use for which task and how. Before starting, two remarks:
This notebook will walk you through a typical Data Science process:
Our motivating example: predict whether a credit card client will default.
Before taking our tour, let's briefly talk about Python. First things first, the general characteristics of the language:
Technical details:
From those characteristics emerge the following advantages:
And the following disadvantages:
Let's state why Python is a language of choice for Data Scientists. Viable alternatives include MATLAB, R and Julia, and, for more statistical jobs, the SAS and SPSS statistical packages. The strengths of Python are:
Jupyter notebook is an HTML-based notebook which allows you to create and share documents that contain live code, equations, visualizations and explanatory text. It allows a clean presentation of computational results as HTML or PDF reports and is well suited for interactive tasks such as data cleaning, transformation and exploration, numerical simulation, statistical modeling, machine learning and more. It runs everywhere (Windows, Mac, Linux, Cloud) and supports multiple languages through various kernels, e.g. Python, R, Julia, MATLAB.
While Jupyter is itself becoming an Integrated Development Environment (IDE), alternative scientific IDEs include Spyder and Rodeo. Non-scientific IDEs include IDLE and PyCharm. Vim and Emacs lovers (or more recently Atom and Sublime Text users) will find full support for Python in their editor of choice. An interactive prompt, useful for experiments or as a calculator, is offered by Python itself or by IPython, the Jupyter kernel for Python.
During this tour, we'll need the packages shown below, which are best installed from PyPI in a virtual environment. Please see the instructions on the README.
In [ ]:
%%script sh
cat ../requirements.txt
In [ ]:
# Windows
# !type ..\requirements.txt
The statements starting with `%` or `%%` are built-in magic commands, i.e. commands interpreted by the IPython kernel. E.g. `%%script sh` tells IPython to run the cell with the shell `sh` (like the `#!` line at the beginning of a script).
The Python prompt is what you get when typing `python` in your terminal. It is useful to test commands and check your installation. We however prefer IPython for interactive work, see below.
Python files, with the extension `.py`, are either scripts or modules. A Python script is a file which gets executed with either `python myscript.py` or `./myscript.py` if it has execution permissions as well as a shebang (`#!`) indicating which interpreter should be used. Below is an example of a typical script. The `!` in front of a command tells IPython to execute the command with the system terminal.
In [ ]:
!cat ../check_install.py
!python ../check_install.py
!../check_install.py
# Windows
# !type ..\check_install.py
# !python ..\check_install.py
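To make the anatomy of a script concrete, here is a minimal sketch in plain Python. The script contents and the `myscript.py` file name are hypothetical and are not those of `check_install.py`. The sketch writes the script to a temporary directory and runs it with the current interpreter, which is roughly what `!python myscript.py` does from a notebook cell.

```python
import os
import stat
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical script contents, for illustration only (not check_install.py).
SCRIPT = textwrap.dedent('''\
    #!/usr/bin/env python3
    """Print the version of the interpreter running this script."""
    import sys


    def main():
        print("Python {}.{}".format(*sys.version_info[:2]))


    if __name__ == "__main__":
        # Only runs when the file is executed, not when it is imported.
        main()
    ''')


def run_script():
    """Write the script to a temporary file, run it and return its output."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "myscript.py")
        with open(path, "w") as f:
            f.write(SCRIPT)
        # Equivalent of `chmod +x`, needed to run ./myscript.py on Unix.
        os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
        # Equivalent of `python myscript.py`, using the current interpreter.
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, check=True)
    return result.stdout


print(run_script())
```

The `if __name__ == "__main__":` guard is a common convention: it lets the same file be executed as a script or imported as a module without side effects.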
A Python module is similar to a script, except that it is supposed to be imported and used by another script or module. It defines objects like classes or functions which are meant to be exported. Below is an example of a typical module, composed of only one function, `get_data()`. Note that the module itself imports other modules (`pandas`, `urllib` and `os.path`).
In [ ]:
!cat ../utils.py
# Windows: !type ..\utils.py
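The import mechanism can be sketched as follows. The module name `mymodule` and its function `file_extension()` are hypothetical and unrelated to `utils.py`. The module is written to a temporary directory, which is then added to `sys.path` so that Python can find it.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical module contents, for illustration only (not utils.py).
MODULE = '''\
"""A tiny module exporting a single function."""
import os.path


def file_extension(filename):
    """Return the extension of a filename, e.g. '.py'."""
    return os.path.splitext(filename)[1]
'''

with tempfile.TemporaryDirectory() as tmpdir:
    with open(os.path.join(tmpdir, "mymodule.py"), "w") as f:
        f.write(MODULE)
    # Python looks for modules in the directories listed in sys.path.
    sys.path.insert(0, tmpdir)
    try:
        # Make sure the import system notices the freshly created file.
        importlib.invalidate_caches()
        mymodule = importlib.import_module("mymodule")
    finally:
        sys.path.remove(tmpdir)

# The function defined in the module is now available to the importer.
extension = mymodule.file_extension("myscript.py")
print(extension)
```

In everyday use the module simply sits next to the importing script or notebook, and a plain `import mymodule` is enough.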
The IPython prompt is what you get when running `ipython` in your terminal. It is more convenient than the Python prompt and is useful for interactive work like small experiments or as a powerful calculator.
The Jupyter notebook is the web interface you get when running `jupyter notebook`. It features a file explorer, various kernels (for Python, R, Julia) and can export any notebook to HTML / PDF (via `jupyter nbconvert`). The basic document is a notebook, which is composed of cells that contain either code, results or markdown text / math. The Jupyter notebook is the interface we'll use for most of the course.
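As an illustration of the cell structure, a notebook file (`.ipynb`) is a JSON document. The sketch below builds a minimal two-cell notebook by hand with the standard library, assuming the version 4 notebook format. In practice the `nbformat` package is the safer way to create or validate notebooks.

```python
import json

# A minimal notebook, assuming the nbformat 4 JSON layout.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "Markdown text with math: $y = 2x$",
        },
        {
            "cell_type": "code",
            "execution_count": None,  # Not executed yet.
            "metadata": {},
            "outputs": [],  # Filled with results once the cell is executed.
            "source": "print(2 * 21)",
        },
    ],
}

# A .ipynb file is this structure serialized as JSON.
serialized = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in json.loads(serialized)["cells"]]
print(cell_types)
```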
Markdown is a lightweight markup language which is widely used to generate HTML documents (e.g. on GitHub or with static website generators). See this cheatsheet for a very short introduction, or simply edit the cells in this notebook. Markdown can include LaTeX math such as $y = 2x$.
As explained in the README, we prefer to work inside virtual environments. Installing packages, i.e. collections of modules, works however the same way inside or outside virtual environments.
Most of the packages, i.e. reusable pieces of code, are posted on PyPI, the Python Package Index, by their authors. The Python package manager, `pip`, is a command-line tool to search and download packages from PyPI.
Note that some packages, like NumPy, require native, i.e. compiled, dependencies. That is why installing with `pip install` may fail, as it only manages Python packages. In that case you need to install those dependencies by hand or with the help of a package manager like `brew` for Mac or whatever your Linux distribution uses.
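Searching for a package goes like this (can be typed in your terminal). Note that PyPI has since disabled the API behind `pip search`, so the command may now fail with an error; the search box on the PyPI website is an alternative.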
In [ ]:
!pip search music21
In [ ]:
!pip install numpy
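Once a package is installed, one way to check from Python that it can be imported, without actually importing it, is `importlib.util.find_spec()`. The helper name `is_installed()` and the second module name below are made up for illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


# `json` ships with Python, the second name is deliberately bogus.
print(is_installed("json"))
print(is_installed("surely_not_an_installed_module"))
```

Replacing `"json"` with `"numpy"` tells you whether the installation above succeeded in the environment the notebook is running in.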
You can get the list of installed packages with `pip freeze`. These are all the packages that are installed and available on your system. They could have been installed by `pip install packname` (maybe as a dependency), by `conda install packname` or by your system's package manager.
In [ ]:
!pip freeze
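A similar list can be obtained from within Python. The sketch below assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library; the function name `installed_packages()` is ours. The output format only imitates `pip freeze` and is not guaranteed to match it exactly.

```python
from importlib import metadata


def installed_packages():
    """Return a sorted list of 'name==version' strings, similar to pip freeze."""
    packages = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # Skip distributions with missing metadata.
            packages.add("{}=={}".format(name, dist.version))
    return sorted(packages, key=str.lower)


# Show the first few entries only, the full list can be long.
for line in installed_packages()[:5]:
    print(line)
```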
While mastering git is not a necessity to follow the exercises, it is a good practice to version the code you write and it will definitely be useful to you in the future.
The command you need for the exercises is
git clone https://github.com/mdeff/ntds_2016.git
the first time to copy the repository on your computer. Once you have it, you can simply download the updates every Monday morning with
git pull
Other commands of interest if you want to maintain your own repository are `add`, `commit` and `push`. The basic workflow is
git clone url
# make your changes
git commit -a -m "my first commit"
git push