This part of the tutorial requires NumPy.
For this part of the tutorial, we will imagine that we are still not convinced of the pressure-volume relation we just "discovered", and that calculating the volume is actually a very expensive procedure, such as a many-particle simulation.
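Recall the ideal gas law, pV = NkT; solving for the volume gives V = NkT/p, which is exactly what our volume function computes.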
We emulate this by adding an optional cost argument to our volume calculation function:
In [1]:
from time import sleep

def V_idg(N, p, kT, cost=0):
    sleep(cost)  # emulate an expensive computation
    return N * kT / p
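For example, V_idg(N=1000, p=1.0, kT=1.0) returns 1000.0 immediately (cost defaults to 0), whereas the same call with cost=1 returns the same value after sleeping for about one second.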
It is useful to think of each modification of the workspace, that is, any addition, change, or removal of data, in terms of an operation.
An operation should take only one(!) argument: the job handle.
Any additional arguments may represent hidden state point parameters, which would lead to a loss of provenance and possibly render our data space inconsistent.
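For contrast, consider a hypothetical counterexample (compute_volume_bad and its extra kT argument are our illustration, not part of the tutorial) that violates this rule:

def compute_volume_bad(job, kT):
    # Anti-pattern: 'kT' is a hidden parameter recorded nowhere in the
    # job, so the stored result cannot be reproduced from
    # job.statepoint() alone; rerunning with a different kT would
    # silently overwrite 'V' with an inconsistent value.
    job.document['V'] = job.sp.N * kT / job.sp.p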
The following function is an example of an operation:
In [2]:
def compute_volume(job):
    print('compute volume', job)
    V = V_idg(cost=1, **job.statepoint())
    job.document['V'] = V
    with open(job.fn('V.txt'), 'w') as file:
        file.write(str(V) + '\n')
This operation computes the volume solely based on the state point parameters and stores the result such that it is clearly associated with the job, i.e., in the job document and in a file within the job's workspace.
Please note that we store the same result in two different ways purely for demonstration purposes.
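Once the operation has been executed, both copies can be read back through the job handle; a minimal sketch:

for job in project:
    V_doc = job.document['V']
    with open(job.fn('V.txt')) as file:
        V_txt = float(file.read())
    assert V_doc == V_txt  # the two storage locations agree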
In [3]:
import signac

project = signac.get_project('projects/tutorial')

for job in project:
    compute_volume(job)
In [4]:
import signac
import numpy as np

project = signac.get_project(root='projects/tutorial')

def init_statepoints(n):
    for p in np.linspace(0.1, 10.0, n):
        sp = {'p': p, 'kT': 1.0, 'N': 1000}
        job = project.open_job(sp)
        job.init()
        print('initialize', job)

init_statepoints(5)
We see that initializing more jobs, and even reinitializing old jobs, is no problem. However, since our calculation is "expensive", we want to skip the computation whenever the result is already available.
One possibility is to add a simple check before executing the computation:
In [5]:
for job in project:
    if 'V' not in job.document:
        compute_volume(job)
In [6]:
init_statepoints(10)
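Note that np.linspace(0.1, 10.0, 10) reproduces only the endpoints p = 0.1 and p = 10.0 from the earlier set of five pressures, so eight of the ten state points are new and only the two existing jobs are reinitialized.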
Next, we implement a classify() generator function, which labels a job based on certain conditions:
In [7]:
def classify(job):
    yield 'init'
    if 'V' in job.document and job.isfile('V.txt'):
        yield 'volume-computed'
Our classifier will always yield the init label, but the volume-computed label is only yielded if the result has been computed and stored both in the job document and as a text file.
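For example, for a job whose volume has already been computed and stored, list(classify(job)) returns ['init', 'volume-computed']; for a freshly initialized job it returns just ['init'].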
We can then use this function to get an overview of our project's status.
In [8]:
print('Status: {}'.format(project))
for job in project:
    labels = ', '.join(classify(job))
    p = round(job.sp.p, 1)
    print(job, p, labels)
Using only simple classification functions, we already get a very good grasp of our project's overall status.
Furthermore, we can use the classification labels to control the execution of operations:
In [9]:
for job in project:
    labels = classify(job)
    if 'volume-computed' not in labels:
        compute_volume(job)
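Note that classify() returns a generator, so the membership test consumes it; this is fine for a single in check as above, but if you need the labels more than once, materialize them first with list(classify(job)).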
So far, we have executed all operations in serial using a simple for-loop. We will now learn how to easily parallelize the execution!
Instead of using a for-loop, we can also take advantage of Python's built-in map() function:
In [10]:
list(map(compute_volume, project))
print('Done.')
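In Python 3, map() is evaluated lazily, so the surrounding list() call is what actually forces all of the operations to execute.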
Using the map() expression makes it trivial to implement parallelization patterns, for example, using a process Pool:
In [11]:
from multiprocessing import Pool

with Pool() as pool:
    pool.map(compute_volume, project)
Or a ThreadPool:
In [12]:
from multiprocessing.pool import ThreadPool

with ThreadPool() as pool:
    pool.map(compute_volume, project)
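Whether threads or processes are preferable depends on the workload: a process Pool side-steps Python's global interpreter lock and suits CPU-bound operations, but it must pickle the mapped function, which can fail in interactive sessions on platforms that use the spawn start method. A ThreadPool is sufficient here, because our emulated workload (sleep()) releases the lock. The same pattern also works with the standard library's concurrent.futures module; a minimal sketch (not part of the original tutorial cells):

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    # map() submits one task per job immediately; leaving the
    # with-block waits for all of them to finish.
    list(executor.map(compute_volume, project))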
Uncomment and execute the following line if you want to remove all data and start over.
In [13]:
# % rm -r projects/tutorial/workspace
In this section we learned how to create a simple, yet complete workflow for our computational investigation.
In the next section we will learn how to adjust the data space, e.g., modify existing state point parameters.