Threads are sequences of programmed instructions that can be managed independently by the OS. They share the same memory space (so we have to be careful and ensure thread safety, in order to avoid two threads writing to the same location at the same time), and they are the common way to deal with asynchronous computation and to improve performance when the machine has more than one CPU.
Processes are instances of a computer program being executed. They do not share memory and require special objects to exchange information, such as queues, shared memory objects, pipes, semaphores... They are heavier than threads, but safer due to the lack of a common memory space.
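To make the difference concrete, here is a minimal sketch (the counter variable and increment function are just toy names for the illustration) showing that a child process works on its own copy of the parent's memory:

import multiprocessing as mp

counter = 0

def increment():
    global counter
    counter += 1
    print("in the child process, counter = %d" % counter)   # prints 1

if __name__ == "__main__":  # required on platforms that spawn (e.g. Windows)
    process = mp.Process(target=increment)
    process.start()
    process.join()
    # The child incremented its own copy: the parent's value is unchanged.
    print("in the parent process, counter = %d" % counter)  # prints 0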
Due to the GIL (Global Interpreter Lock), multiple threads cannot execute Python bytecode at once. This reduces the usefulness of threading: only functions that release the GIL can run at the same time. That's the case for I/O operations (network requests over protocols such as HTTP/FTP, on-disk reading / writing), and for most numpy operations that rely on C routines.
That's why, in the Python ecosystem, multiprocessing is preferred over threading. NB: some attempts have been made to remove the GIL from CPython, and they led to drastic performance regressions (more info).
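As a rough illustration (the function names and iteration count below are arbitrary), threads give no speedup on pure-Python CPU-bound work, while GIL-releasing calls like sleep do run concurrently:

import time
from threading import Thread

def cpu_bound(n=10_000_000):
    while n:                      # pure Python bytecode: holds the GIL
        n -= 1

def io_bound(seconds=1):
    time.sleep(seconds)           # sleep releases the GIL

def run_in_threads(target, n_threads=4):
    threads = [Thread(target=target) for _ in range(n_threads)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start

print("4 CPU-bound threads: %.2fs" % run_in_threads(cpu_bound))  # ~4x a single run
print("4 I/O-bound threads: %.2fs" % run_in_threads(io_bound))   # ~1s in total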
Python has a nice standard library that allows multiprocessing computation: it's called multiprocessing. Good to know: the threading library is for multi-threaded computation and has a very similar API, even if the tendency is to use asyncio-based libraries to deal with I/O operations.
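For the record, a quick sketch of that similar API (reusing the same toy function as defined below): the Thread object exposes the same start/join/is_alive methods as Process.

from threading import Thread
from time import sleep

def a_long_running_function(time):
    sleep(time)
    return time

thread = Thread(target=a_long_running_function, args=(10, ))
thread.start()
print("thread.is_alive: %s" % thread.is_alive())
thread.join()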
Some libraries provide abstractions that help with multiprocessing computation, such as joblib (part of the sklearn ecosystem), concurrent.futures (in the stdlib, which provides a future-based API, see the sketch below), distributed (part of the dask ecosystem, which allows local and distributed computation that can live on other computers) and so on... I will focus on the stdlib multiprocessing first, then say a few words about the other tools.
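As a quick taste before those few words, here is a minimal concurrent.futures sketch (same toy function as below, other names are arbitrary): submit returns a Future, and map works like the builtin one.

from concurrent.futures import ProcessPoolExecutor
from time import sleep

def a_long_running_function(time):
    sleep(time)
    return time

# On platforms that spawn, this should live under an `if __name__ == "__main__":` guard.
with ProcessPoolExecutor() as executor:
    future = executor.submit(a_long_running_function, 2)
    print("future.result(): %s" % future.result())    # blocks until done
    results = executor.map(a_long_running_function, [1, 2, 3])
    print("results: %s" % list(results))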
A Process object represents a process that can be started and runs a function. It can be initialized in two different ways:
- by passing the function to run as the target argument,
- by subclassing it and overriding its run method.
The latter is useful for complex cases.
Process objects have some useful methods, among them start, join, is_alive and terminate (terminate is demonstrated after the snippets below).
For example, these two snippets do exactly the same thing:
In [14]:
import multiprocessing as mp
from time import sleep
In [42]:
def a_long_running_function(time):
    sleep(time)
    return time
In [30]:
# These lines are not blocking
process = mp.Process(target=a_long_running_function, args=(10, ))
process.start()
# These lines will block until `a_long_running_function` is over
print("before join, process.is_alive: %s" % process.is_alive())
process.join()
print("after join, process.is_alive: %s" % process.is_alive())
In [22]:
# These lines are not blocking
class ALongRunningProcess(mp.Process):
    def __init__(self, *args):
        super().__init__()
        self._args = args

    def run(self):
        a_long_running_function(*self._args)
process = ALongRunningProcess(10)
process.start()
# These lines will block until `a_long_running_function` is over
print("before join, process.is_alive: %s" % process.is_alive())
process.join()
print("after join, process.is_alive: %s" % process.is_alive())
Often, we do not want a complex workflow with a lot of different processes sharing information. We just want N independent computations of the same function with different inputs. In that case, managing the processes ourselves can be painful, all the more since we should avoid restarting a process each time, because it adds some overhead.
Here comes the Pool object: it's a pool of N processes (often the same number as the machine's CPUs) that can be fed with tasks (function and input), one by one or with a range of parameters.
That way:
In [45]:
with mp.Pool() as p:
    # A future is a result that we expect.
    future = p.apply_async(a_long_running_function, args=(5, ))
    print("future object: %s" % future)

    # We have to use the get method: otherwise,
    # the pool will be closed before we obtain the result.
    # We can use the wait method too: in that case, the result is not returned.
    result = future.get()
    print("future.get(): %s" % result)

    # The map method runs the function multiple times over a range of inputs,
    # then returns the results as a list. It can be blocking or not.
    # If it's async, it returns a MapResult, the equivalent of the future for
    # multiple results.
    results = p.map(a_long_running_function, [5] * mp.cpu_count())
    print("results: %s" % results)

    futures = p.map_async(a_long_running_function, [5] * mp.cpu_count())
    print("futures: %s" % futures)
    print("futures.get: %s" % futures.get())