Perhaps instead of telling you how to write a loop or a conditional in Python, it is a better option to put Python in context: to tell a bit about how programming languages are designed and why certain trade-offs are chosen. A programming language is something you can learn on your own once you understand why it works the way it does.
A compiler takes a piece of code written in a high-level language and translates it to binary machine code that the CPU can run. Compilation is a complex process that looks at the entire code, checks syntax, does optimizations, links to other binaries, and spits out an executable or some other form of binary code such as a dynamic library.
Interpreted languages parse the code line by line, translating to a machine-executable format one command at a time. This means that you can have an interactive shell: you can type in commands one by one and see the result immediately. If you make an error, your previous variables and computations are not lost: the interpreter keeps track of them and you can still access them. In contrast, unless you have some mechanism in your compiled code to save interim calculations, an error will terminate the program, its full memory space is freed, and control is returned to the operating system.
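For example, in an interactive Python session an error does not wipe out your workspace; a hypothetical transcript:

>>> x = 41
>>> x / 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero
>>> x + 1    # x survived the error
42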
For interactive work, an interpreter is much more suitable. This explains why scientific languages like R, Mathematica, and MATLAB work in this fashion. On the other hand, since there are no optimizations whatsoever, they tend to be sluggish. So for numerical calculations, compiled languages are better: Fortran, C, or newer ones that were designed with safety and concurrency in mind, such as Go and Rust.
A newer paradigm does just-in-time (JIT) compilation: you get an interactive shell, but everything you enter is actually compiled quickly and then run. That is, a JIT system combines the best of both worlds. Most modern languages are either written with JIT in mind, such as Scala and Julia, or were adapted to be used in this fashion.
Apart from these paradigms, there are abominations like Java: it is both compiled and slow, running on a perfectly horrific level of abstraction called the Java Virtual Machine. MATLAB is a multiparadigm language that is designed to maximize user frustration, although it is primarily interpreted. The following table gives a few examples of each paradigm in approximate temporal order.
| Compiled | Interpreted | JIT | Horror |
|---|---|---|---|
| Fortran (1957) | Lisp (1958) | | |
| | BASIC (1964) | | |
| C (1972) | S (1976) | | |
| C++ (1983) | Perl (1987), Mathematica (1988) | | MATLAB (1984) |
| Haskell (1990) | R (1993) | | Java (1995) |
| Go (2009) | | Scala (2004) | |
| Rust (2010) | | Julia (2012) | |
Python is a language specification born in 1991. As is the case with many languages (Fortran, C, C++, Haskell), the language specification and its actual implementations are developed independently, although the development is correlated. What you normally call Python (the Python that ships with your operating system or with Anaconda) is actually the reference implementation of the language, formally called CPython. This reference implementation is a Python interpreter written in the C language.
The Python language was designed with humans in mind first: code was meant to be easy for humans to read. This was a response to write-only languages that introduce tricky syntax that is difficult to decipher. Both Mathematica and MATLAB are guilty of being write-only languages, and so are the latest standards of C++. Here is a priceless Mathematica one-liner:
ArrayPlot@Log[BinCounts[Through@{Re,Im}@#&/@NestList[(5#c+Re@#^6-2.7)#+c^5&,.1+.2|,9^7],a={-1,1,0.001},a]+1]
You clearly don't need syntax highlighting for this. No matter how hard you try, it would be difficult to write something as convoluted as this in Python.
Python also wants to have exactly one obvious way to do something, which was anything but true for a similar scripting language called Perl, which many of us refuse to admit we ever used.
By some clever design decisions, it is extremely easy to call low-level code from Python, and this makes it the best glue language: you can call C, Fortran, Julia, Lisp, and whatever else from Python with ease. Try that from Mathematica.
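As a taste of the glue, here is a minimal sketch that calls the C standard math library directly through the built-in ctypes module (assuming a Unix-like system where find_library can locate libm):

import ctypes
import ctypes.util

libm = ctypes.CDLL(ctypes.util.find_library("m"))  # load the C math library
libm.sqrt.restype = ctypes.c_double                # declare the C signature
libm.sqrt.argtypes = [ctypes.c_double]
print(libm.sqrt(2.0))                              # 1.4142135623730951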
The default CPython implementation is an interpreter, and therefore it comes with a shell (the funny screen where you type stuff in). This shell, however, is not any good by today's standards. IPython was conceived to give Python a good shell. In principle, IPython can use any Python interpreter as its backend (CPython, PyPy, and others). Jupyter provides a notebook interface based on IPython that allows you to practice literate programming, that is, mixing code, text, mathematical formulas, and images in the same environment, making it attractive for scientists. Mathematica's notebook interface is far more advanced than that of Jupyter, but Jupyter's development is rapid and the functionality keeps expanding. Both IPython and Jupyter were conceived for Python, but now they work with many other languages.
Due to its ease of use and glue language nature, Python became massively popular among programmers. They developed thousands of packages for Python, everything from controlling robots to running websites. It was never designed to be a language for scientific computing. Yet it became the de facto next-best alternative to MATLAB after the unification of the various numerical libraries under the umbrella of numpy and SciPy. With the development of SymPy, it acquired symbolic functionality similar to that of Mathematica. With Pandas, it takes on R as the choice for statistical modelling. With TensorFlow, it is overtaking distributed frameworks like Hadoop and Spark in large-scale machine learning. We can keep listing packages, but you get the idea. The package ecosystem gives Python users superpowers.
The reference implementation, CPython, by virtue of being an interpreter, is slow, but it is not the only implementation of the language. PyPy is a JIT implementation of the language, started in 2007. It can be 20-40x faster on pure Python code. The problem is that its foreign language interface is incompatible with CPython, so the glue language nature is gone, and many important Python packages do not work with it. Cython is an extension of Python that generates C code, which in turn can be compiled for speed. As a user of Python, you probably don't want to deal with this directly, but it is nevertheless an option if you want speed. To put this together, we can extend the table above:
| Compiled | Interpreted | JIT | Horror |
|---|---|---|---|
| | CPython (0.9, 1991) | | |
| | CPython (1.0, 1994) | | |
| | CPython (2.0, 2000) | Jython (2001) | |
| Cython (2007) | CPython (3.0, 2008) | PyPy (2.7, 2007) | |
| | CPython (3.6, 2016) | PyPy (3.2, 2014), Pyston (2014) | |
Python was originally conceived in 1991: until the second half of the 2000s, consumer-grade CPUs were single core. Thus Python was not designed to be easy to parallelize. To understand what goes on here, we have to understand what "running parallel" means.
Conceptually the simplest case is when you have several computers: each one accesses its own memory space and communicates via the network. This is called the distributed memory model.
For the next level, we have to understand what a process is. The operating system that you run, be it Android, macOS, Linux, or even Windows on a good day, ensures that when you run a program, it has its own, protected memory space. It cannot access the memory space allocated to a different program, and other programs cannot access its own allocated memory space. In fact, the operating system itself cannot access the memory space of any of the running programs: it can terminate them and free the memory, but it cannot access the content of the memory (in principle). A thing that runs with its allocated, protected memory space is called a process.
Multiprocessing means running several processes at the same time. If the processes run on several cores of a multicore processor working on the same calculation, you end up with a scheme similar to the distributed memory model: the processes must communicate with one another if they want to exchange data. The communication does not happen through the network, but the operating system's help must be invoked: even though the hardware memory is shared, the memory spaces of the processes remain isolated. Going between multiprocessing and distributed memory processing is straightforward, at least from the user's perspective.
Multithreading means that one single process uses several CPU cores: each thread can access any piece of data belonging to the process. Now imagine you have some variable a and two threads want to increase its value by 1. First, thread 1 reads it, learns that the value is 5, and wants to write back 6. The second thread reads out 5 as well, and writes back 6. So the final value is 6, instead of 7. This is called a race condition. To get around it, a thread can declare a lock: no other thread can access that part of the code until the lock is released. If the thread that declared the lock itself waits for another lock to be released, a deadlock can occur: this is an infinite wait from which there is no exit.
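To make this concrete, here is a minimal sketch with Python's built-in threading module; the counter and the iteration count are made up for illustration, and whether you actually observe lost updates depends on your interpreter version:

import threading

counter = 0
lock = threading.Lock()

def increment(n, use_lock):
    global counter
    for _ in range(n):
        if use_lock:
            with lock:       # only one thread at a time gets past this point
                counter += 1
        else:
            counter += 1     # read-modify-write in three steps: not atomic

threads = [threading.Thread(target=increment, args=(100000, False))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # may be less than 400000: updates were lost in the race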
Python allows you to have multiprocessing, but true parallel multithreading is implicitly forbidden. To avoid race conditions and deadlocks, the interpreter maintains a single global lock: only one thread can execute Python bytecode at any given moment. This is called the global interpreter lock (GIL).
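For completeness, a minimal sketch of the multiprocessing route with the standard library, written as a standalone script (the function is made up for illustration):

from multiprocessing import Pool

def square(x):
    return x**2

if __name__ == '__main__':
    # each worker is a separate process with its own interpreter and GIL
    with Pool(4) as pool:
        print(pool.map(square, range(10)))  # [0, 1, 4, ..., 81]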
Multiprocessing is inherently less efficient, so there is increasing pressure to remove the 26-year-old GIL. PyPy introduced an experimental software transactional memory that replaces the GIL. It is an inefficient implementation and more of a proof of concept, but it works. Cython allows you to release the GIL and write multithreaded code in C, if that is your thing. There are also plans that upcoming releases of CPython would slowly phase out the GIL in favour of a software transactional memory, but it will take decades.
Python 3 is the present and future of the Python language. It is actively developed, whereas Python 2 only receives security updates, and its end-of-life has been declared several times (although it refuses to die). Python 3 is a more elegant and consistent language, and it is also faster than older versions, at least starting from version 3.5. Yet there are still some libraries out there that do not work with Python 3. Since the release of Python 3.5 in 2015, most people recommend Python 3. Anaconda changed to recommending Python 3 in January 2017.
The transition between Python 2 and 3 is a tale of how to do it wrong. Most people never asked for Python 3, and for the first seven years of Python 3, the changes were mainly under the hood. Perhaps the most important change was the proper handling of Unicode characters, which sounds abstract for a scientist, until you learn that you can type Greek characters in mathematical formulas if you use Python 3.
In any case, the two differences every Python-using scientist should be aware of are related to printing and integer division. If you start your code with this line, you ensure that your code will work in both versions identically:
In [1]:
from __future__ import print_function, division
Printing had a weird implementation in Python 2 that was rectified, and now printing is a function like every other. This means that you must use brackets when you print something. Then in Python 2, there were two ways of doing integer division: 2 / 3 and 2 // 3 both gave zero. In Python 3, the former triggers a type upgrade to floats. If you import division from future, you get the same behaviour in Python 2.
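A quick check of both differences; the printed values are what Python 3 produces:

print("print is a function, so the brackets are mandatory")
print(2 / 3)    # 0.6666666666666666: division upgrades to float
print(2 // 3)   # 0: explicit integer division, identical in both versions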
A good start for any programming language is a Jupyter kernel if the language has one. Jupyter was originally designed for Python, so naturally it has a matching kernel. Why Jupyter? It is a uniform interface for many languages (Python, Julia, R, Scala, Haskell, even bloody MATLAB has a Jupyter kernel), so you can play with a new language in a familiar, interpreter-oriented environment. If you never coded in your life, it is also a good start, as you get instant feedback on your initial steps in what essentially is a tab in your browser.
If you are coming from MATLAB, or you have advanced beyond writing a few dozen lines of code in Python, I recommend using Spyder. It is an awesome integrated environment for doing scientific work in Python: it includes instant access to documentation, variable inspection, code navigation, an IPython console, plus cool tools for writing beautiful and efficient code.
For tutorials, check out the Learning tab in Anaconda Navigator. Both videos and written tutorials are available in great numbers.
The fundamental difference between a computer scientist and an arbitrary other scientist is that the former will first try to find other people's code to achieve a task, whereas the latter type is suspicious of alien influence and will try to code up everything from scratch. Find a balance.
Here we are not talking about packages: we are talking about snippets of code. The chances are slim that you want to do something in Python that N+1 humans have not done before. Two and a half places to look for code:
The obvious internet search will point you to the exact solution on Stack Overflow.
Code search engines are junk, so even half-trivial queries that include idiomatic use of a programming language will not turn up much. This is when you can turn to GitHub's Advanced Search. It will not let you search directly for code, but you can restrict your search by language and look at relevant commits and issues. You have a good chance of finding what you want.
GitHub has a thing called gist. These are short snippets (1-150 lines) of code under git control. The gist search engine is awesome for finding good code.
Exercise 1. Find three different ways of iterating over a dictionary and printing out each key-value pair. Explain the design principle of having one obvious way to do something through this example. If you do not know what a dictionary is, that is even better.
In [ ]:
Hate speech follows:
Licence fee: MathWorks is the second biggest enemy of science after academic publishers. You need a pricey licence on every computer where you want to use it. Considering that the language has not seen much development since 1984, it does not seem like a great deal. They, however, ensure that subsequent releases break something, so open source replacement efforts like Octave will never be able to catch up.
Package management does not exist.
Maintenance: maintaining a toolbox is a major pain since the language forces you to have a very large number of files.
Slow: raw MATLAB code is on par with Python in terms of inefficiency. It can be fast, but only when the operations you use actually translate to low-level linear algebra operations.
MEX: this system was designed to interact with C code. In reality, it only ensures that you tear your hair out if you try to use it.
The interface is not decoupled correctly: you cannot use the editor while code is running in the interpreter. Seriously? In 2017?
Namespace mangling: imported functions override existing ones. There is no other option: you either overwrite, or you do not use a toolbox.
Write-only language: this one can be argued. With an excessive use of parentheses, MATLAB code can be pretty hard to parse, but allegedly some humans mastered it.
Once you go beyond the basic hurdles of Python, you definitely want to use packages. Many of them are extremely well written, efficient, and elegant, although most of the others are complete junk.
Package management in Python used to be terrible, but nowadays it is simply bad (this is already a step up from MATLAB or Mathematica). So where does the difficulty stem from? From compilation. Since Python interacts so well with compiled languages, it is the most natural thing to bypass the GIL with C or Cython code for some quick calculations, and then pull everything back to Python. The problem is that we have to deal with three major operating systems and at least three compiler chain families.
Python allows the distribution of pre-compiled packages through a system called wheels, which works okay if the developers have access to all the platforms. Anaconda itself is essentially a package management system for Python, shipping precompiled binaries that are supposed to work together well. So, assuming you have Anaconda, and you know which package you want to install, try this first:
conda install whatever_package
If the package is not in the Anaconda ecosystem, you can use the standard Python Package Index (PyPI) through the ultra-universal pip command:
pip install whatever_package
If you do not have Anaconda or you use some shared computer, change this to pip install whatever_package --user. This will install the package locally to your home folder.
Depending on your operating system, several things can happen.
Windows: if there are no binaries in Anaconda or on PyPI, good luck. Compilation is notoriously difficult to get right on Windows both for package developers and for users.
macOS: if there are no binaries in Anaconda or on PyPI, start scratching your head. There are two paths to follow: (i) the code will compile with Apple's purposefully maimed Clang variant. In this case, if you install Xcode, things will work with a high chance of success. The downside: Apple hates you. They keep removing support for compiling multithreaded code from Clang. (ii) Install the uncontaminated GNU Compiler Collection (gcc) with brew. You still have a high chance of making it work. The problems begin if the compilation requires many dependent libraries to be present, which may or may not be supported by brew.
Linux: there are no binaries by design. The compiler chain is probably already there. The pain comes from getting the development headers of all necessary libraries, not to mention the right versions of the libraries. Ubuntu tends to have outdated libraries.
Exercise 2. Install the conic optimization library Picos. In Anaconda, proceed in two steps: install cvxopt with conda, and then Picos from PyPI. If you are not using Anaconda, a pip install will be just fine.
Python has little syntactic sugar, precisely because it wants to keep code readable. One thing you can do, though, is define lists in a functional programming way, which will be familiar to Mathematica users. This is the crappy way of filling a list with values:
In [2]:
l = []
for i in range(10):
    l.append(i**2)
print(l)
This is more Pythonesque:
In [3]:
l = [i**2 for i in range(10)]
print(l)
What you have inside the square brackets is called a list comprehension. Sometimes you do not need the list, only its values. In such cases, it suffices to use the bare generator expression, without the brackets. The following two lines of code achieve the same thing:
In [4]:
print(sum([i for i in range(10)]))
print(sum(i for i in range(10)))
Which one is more efficient? Why?
You can also use conditionals in the generator expressions. For instance, this is a cheap way to get even numbers:
In [5]:
[i for i in range(10) if i % 2 == 0]
Out[5]: [0, 2, 4, 6, 8]
Exercise 3. List all odd square numbers below 1000.
In [ ]:
And on the seventh day, God created PEP8. Python Enhancement Proposal (PEP) is a series of ideas and good practices for writing nice Python code and evolving the language. PEP8 is the set of policies that tells you what makes Python syntax pretty (meaning it is easy to read for any other Python programmer). In an ideal world, everybody should follow it. Start programming in Python by keeping good practices in mind.
As a starter, Python uses indentation and indentation alone to express the hierarchy of the code. Use EXACTLY four space characters as indentation, always. If somebody tells you to use one tab, butcher the devil on the spot.
Bad:
In [6]:
for _ in range(10):
  print("Vomit")
Good:
In [7]:
for _ in range(10):
    print("OMG, the code generating this is so prettily indented")
The code is more readable if it is a bit leafy. For this reason, leave a space after every comma just as you would do in natural languages:
In [8]:
print([1,2,3,4]) # Ugly crap
print([1, 2, 3, 4]) # My god, this is so much easier to read!
Spyder has tools to help you keep to PEP8, but unfortunately it is not so straightforward in Jupyter.
Exercise 4. Clean up this horrific mess:
In [ ]:
for i in range(2,5):
    print(i)
for j in range( -10,0, 1):
    print(j )
Tuples are like lists, but immutable: they have a fixed number of entries, and the entries cannot be changed after creation. Technically, this is a tuple:
In [18]:
t = (2, 3, 4)
print(t)
print(type(t))
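Immutability is easy to check; the exact error message may vary between versions:

try:
    t[0] = 5
except TypeError as err:
    print(err)    # 'tuple' object does not support item assignment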
You would, however, seldom use it in this form, because you would just use a list. Tuples come in handy in certain scenarios, like enumerating a list:
In [9]:
very_interesting_list = [i**2-1 for i in range(10) if i % 2 != 0]
for i, e in enumerate(very_interesting_list):
    print(i, e)
Here enumerate returns you a tuple with the running index and the matching entry of the list. You can also zip several lists and create a stream of tuples:
In [10]:
another_interesting_list = [i**2+1 for i in range(10) if i % 2 == 0]
In [11]:
for i, j in zip(very_interesting_list, another_interesting_list):
    print(i, j)
You can use tuple-like assignment to initialize multiple variables:
In [12]:
a, b, c = 1, 2, 3
print(a, b, c)
This syntax in turn gives you the most elegant way of swapping the values of two variables:
In [13]:
a, b = b, a
print(a, b)
Lists can be sliced to access a range of entries:
In [14]:
l = [i for i in range(10)]
print(l)
print(l[2:5])
print(l[2:])
print(l[:-1])
In [16]:
l[-2]
Out[16]: 8
Note that the upper index is not inclusive (the same as in range). The index -1 refers to the last item, -2 to the second last, and so on. Python lists are zero-indexed.
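Slices also accept an optional third argument, the step; this goes beyond what we need here, but it is good to know:

print(l[2:8:2])    # every second element between indices 2 and 8: [2, 4, 6]
print(l[::-1])     # a negative step walks backwards: the whole list reversed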
Unfortunately, you cannot do convenient double indexing on multidimensional lists. For this, you need numpy.
In [17]:
import numpy as np
a = np.array([[(i+1)*(j+1) for j in range(5)]
              for i in range(3)])
print(a)
print(a[:, 0])
print(a[0, :])
Exercise 5. Get the bottom-right 2x2 submatrix of a.
In [ ]:
Python will hide the pain of working with types: you don't have to declare the type of any variable. But this does not mean values don't have a type: the type gets attached automatically at runtime, a scheme known as dynamic typing. To demonstrate this, we import the main numerical and symbolic packages, along with an option to pretty-print symbolic operations.
In [18]:
import sympy as sp
import numpy as np
from sympy.interactive import printing
printing.init_printing(use_latex='mathjax')
In [19]:
print(np.sqrt(2))
sp.sqrt(2)
Out[19]: √2
The types tell you why these two look different:
In [20]:
print(type(np.sqrt(2)))
print(type(sp.sqrt(2)))
The symbolic representation is, in principle, of infinite precision, whereas the numerical representation uses 64 bits.
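You can see the consequence by squaring both representations:

print(np.sqrt(2)**2)    # 2.0000000000000004: the 64-bit float picked up an error
print(sp.sqrt(2)**2)    # 2: the symbolic value squares back exactly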
As we said above, you can do some things with numpy arrays that you cannot do with lists. Their types can be checked:
In [21]:
a = [0. for _ in range(5)]
b = np.zeros(5)
print(a)
print(b)
print(type(a))
print(type(b))
There are many differences between numpy arrays and lists. The most important ones are that lists can expand, but arrays cannot, and lists can contain any object, whereas numpy arrays can only contain things of the same type.
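A quick check of both claims, reusing a and b from the previous cell (the exact error message may vary):

a.append(1.0)                   # the list happily grows
print(a)
try:
    b.append(1.0)               # the numpy array has no such method
except AttributeError as err:
    print(err)
print(np.array([1, 'text']))    # forced to a common type: both entries become strings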
Type conversion is (usually) easy:
In [22]:
print(type(list(b)))
print(type(np.array(a)))
This is where the trouble begins:
In [23]:
from sympy import sqrt
from numpy import sqrt
sqrt(2)
Out[23]: 1.4142135623730951
Because of this, never import everything from a package: from numpy import * is forbidden.
Exercise 6. What would you do to keep everything at infinite precision to ensure the correctness of a computational proof? This does not seem to be working:
In [57]:
b = np.zeros(3)
b[0] = sp.pi
b[1] = sqrt(2)
b[2] = 1/3
print(b)
Python packages and individual functions typically come with documentation. Documentation is often hosted on ReadTheDocs. For individual functions, you can get the matching documentation as you type. Just press Shift+Tab on a function:
In [ ]:
sp.sqrt
In Spyder, Ctrl+I will bring up the documentation of the function.
This documentation is called a docstring. It is extremely easy to write, it is epsilon effort, and you should do it yourself whenever you write a function. Here is an example:
In [24]:
def multiply(a, b):
    """Multiply two numbers together.

    :param a: The first number to be multiplied.
    :type a: float.
    :param b: The second number to be multiplied.
    :type b: float.
    :returns: the multiplication of the two numbers.
    """
    return a*b
Now you can press Shift+Tab to see the above documentation:
In [ ]:
multiply
Exercise 7. In the documentation above, it was specified that the types of the arguments are floats, but the actual implementation multiplies anything. Add a type check. Then extend the function and the documentation to handle three inputs.
In [ ]: