This part of the tutorial requires NumPy.
For this part of the tutorial, we will imagine that we are still not convinced of the pressure-volume relation we just "discovered", and that calculating the volume is actually a very expensive procedure, such as a many-particle simulation.
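Recall the ideal gas law, pV = NkT; solving for the volume gives V = NkT/p, which is exactly what our volume function computes.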
We emulate this by adding an optional cost argument to our volume calculation function:
In [1]:
from time import sleep

def V_idg(N, p, kT, cost=0):
    sleep(cost)  # emulate an expensive computation
    return N * kT / p
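For example, V_idg(N=1000, p=1.0, kT=1.0) returns 1000.0 immediately (cost defaults to 0), whereas the same call with cost=1 returns the same value after sleeping for about one second.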
It is useful to think of each modification of the workspace, that is, any addition, change, or removal of data, in terms of an operation.
An operation should take only one(!) argument: the job handle.
Any additional arguments may represent hidden state point parameters, which would lead to a loss of provenance and possibly render our data space inconsistent.
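For contrast, consider a hypothetical counterexample (compute_volume_bad and its extra kT argument are our illustration, not part of the tutorial) that violates this rule:

def compute_volume_bad(job, kT):
    # Anti-pattern: 'kT' is a hidden parameter recorded nowhere in the
    # job, so the stored result cannot be reproduced from
    # job.statepoint() alone; rerunning with a different kT would
    # silently overwrite 'V' with an inconsistent value.
    job.document['V'] = job.sp.N * kT / job.sp.p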
The following function is an example of an operation:
In [2]:
def compute_volume(job):
    print('compute volume', job)
    V = V_idg(cost=1, **job.statepoint())
    job.document['V'] = V
    with open(job.fn('V.txt'), 'w') as file:
        file.write(str(V) + '\n')
This operation computes the volume solely based on the state point parameters and stores the result such that it is clearly associated with the job, i.e., in the job document and in a file within the job's workspace.
Please note that we store the same result in two different ways purely for demonstration purposes.
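Once the operation has been executed, both copies can be read back through the job handle; a minimal sketch:

for job in project:
    V_doc = job.document['V']
    with open(job.fn('V.txt')) as file:
        V_txt = float(file.read())
    assert V_doc == V_txt  # the two storage locations agree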
In [3]:
import signac

project = signac.get_project('projects/tutorial')

for job in project:
    compute_volume(job)
In [4]:
import signac
import numpy as np

project = signac.get_project(root='projects/tutorial')

def init_statepoints(n):
    for p in np.linspace(0.1, 10.0, n):
        sp = {'p': p, 'kT': 1.0, 'N': 1000}
        job = project.open_job(sp)
        job.init()
        print('initialize', job)

init_statepoints(5)
We see that initializing more jobs, and even reinitializing old jobs, is no problem. However, since our calculation is "expensive", we want to skip the computation whenever the result is already available.
One possibility is to add a simple check before executing the computation:
In [5]:
for job in project:
    if 'V' not in job.document:
        compute_volume(job)
In [6]:
init_statepoints(10)
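Note that np.linspace(0.1, 10.0, 10) reproduces only the endpoints p = 0.1 and p = 10.0 from the earlier set of five pressures, so eight of the ten state points are new and only the two existing jobs are reinitialized.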
Next, we implement a classify() generator function, which labels a job based on certain conditions:
In [7]:
def classify(job):
    yield 'init'
    if 'V' in job.document and job.isfile('V.txt'):
        yield 'volume-computed'
Our classifier will always yield the init label, but the volume-computed label is only yielded if the result has been computed and stored both in the job document and as a text file.
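For example, for a job whose volume has already been computed and stored, list(classify(job)) returns ['init', 'volume-computed']; for a freshly initialized job it returns just ['init'].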
We can then use this function to get an overview of our project's status.
In [8]:
print('Status: {}'.format(project))
for job in project:
    labels = ', '.join(classify(job))
    p = round(job.sp.p, 1)
    print(job, p, labels)
Using only simple classification functions, we already get a very good grasp of our project's overall status.
Furthermore, we can use the classification labels to control the execution of operations:
In [9]:
for job in project:
    labels = classify(job)
    if 'volume-computed' not in labels:
        compute_volume(job)
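Note that classify() returns a generator, so the membership test consumes it; this is fine for a single in check as above, but if you need the labels more than once, materialize them first with list(classify(job)).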
So far, we have executed all operations in serial using a simple for-loop. We will now learn how to easily parallelize the execution!
Instead of using a for-loop, we can also take advantage of Python's built-in map() function:
In [10]:
list(map(compute_volume, project))
print('Done.')
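In Python 3, map() is evaluated lazily, so the surrounding list() call is what actually forces all of the operations to execute.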
Using the map() expression makes it trivial to implement parallelization patterns, for example, using a process Pool:
In [11]:
from multiprocessing import Pool

with Pool() as pool:
    pool.map(compute_volume, project)
Or a ThreadPool:
In [12]:
from multiprocessing.pool import ThreadPool

with ThreadPool() as pool:
    pool.map(compute_volume, project)
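Whether threads or processes are preferable depends on the workload: a process Pool side-steps Python's global interpreter lock and suits CPU-bound operations, but it must pickle the mapped function, which can fail in interactive sessions on platforms that use the spawn start method. A ThreadPool is sufficient here, because our emulated workload (sleep()) releases the lock. The same pattern also works with the standard library's concurrent.futures module; a minimal sketch (not part of the original tutorial cells):

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    # map() submits one task per job immediately; leaving the
    # with-block waits for all of them to finish.
    list(executor.map(compute_volume, project))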
Uncomment and execute the following line if you want to remove all data and start over.
In [13]:
# % rm -r projects/tutorial/workspace
In this section we learned how to create a simple, yet complete workflow for our computational investigation.
In the next section we will learn how to adjust the data space, e.g., modify existing state point parameters.