This notebook demonstrates how to use Sumatra to capture simulation input data and metadata and then export these records into a Pandas data frame. Sumatra has a stand-alone web interface built with Django which allows users to view the data. Data can also be imported into Python, but manipulating and displaying it in useful custom formats requires a lot of code. Pandas seems like the ideal solution for manipulating Sumatra's data. In particular, the ability to quickly combine input data, metadata, and output data into custom data frames is really powerful for data analysis, reproducibility and sharing.
The first step in using Sumatra is to set up a simulation. Here the simulation just runs a diffusion problem using FiPy and outputs the time taken for a time step. The goal of the work is to test FiPy's parallel speed-up for different input parameters.
In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
import numpy as np
import matplotlib.pyplot as plt
Sumatra requires a file with the parameters specified.
In [2]:
import json
params = {'N' : 10, 'suite' : 'trilinos', 'iterations' : 100}
with open('params.json', 'w') as fp:
    json.dump(params, fp)
The script file for running the simulation is fipy_timing.py. It reads the JSON file, runs the simulation, and stores the run times in data.txt.
In [3]:
%%writefile fipy_timing.py
"""
Usage: fipy_timing.py [<jsonfile>]
"""
from docopt import docopt
import json
import timeit
import numpy as np
import fipy as fp
import os
arguments = docopt(__doc__, version='Run FiPy timing')
jsonfile = arguments['<jsonfile>']
if jsonfile:
    with open(jsonfile, 'rb') as ff:
        params = json.load(ff)
else:
    params = dict()
N = params.get('N', 10)
iterations = params.get('iterations', 100)
suite = params.get('suite', 'trilinos')
sumatra_label = params.get('sumatra_label', '')
attempts = 3
setup_str = '''
import fipy as fp
import numpy as np
np.random.seed(1)
L = 1.
N = {N:d}
m = fp.GmshGrid3D(nx=N, ny=N, nz=N, dx=L / N, dy=L / N, dz=L / N)
v0 = np.random.random(m.numberOfCells)
v = fp.CellVariable(mesh=m)
v0 = np.resize(v0, len(v)) ## Gmsh doesn't always give us the correct sized grid!
eqn = fp.TransientTerm(1e-3) == fp.DiffusionTerm()
v[:] = v0.copy()
import fipy.solvers.{suite} as solvers
solver = solvers.linearPCGSolver.LinearPCGSolver(precon=None, iterations={iterations}, tolerance=1e-100)
eqn.solve(v, dt=1., solver=solver)
v[:] = v0.copy()
'''
timeit_str = '''
eqn.solve(v, dt=1., solver=solver)
fp.parallelComm.Barrier()
'''
timer = timeit.Timer(timeit_str, setup=setup_str.format(N=N, suite=suite, iterations=iterations))
times = timer.repeat(attempts, 1)
# Only the first (rank 0) process writes the timing data.
if fp.parallelComm.procID == 0:
    filepath = os.path.join('Data', sumatra_label)
    filename = 'data.txt'
    np.savetxt(os.path.join(filepath, filename), times)
Without Sumatra, and in serial, the script is run with
In [145]:
!python fipy_timing.py params.json
and the output data file is
In [146]:
!more Data/data.txt
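The file contains one wall-clock time per line (written by np.savetxt in the script), so it can also be read straight back with NumPy. A minimal check, assuming the serial run above wrote Data/data.txt:
In [ ]:
# Read the recorded timings (one float per line) and report the fastest attempt.
times = np.loadtxt('Data/data.txt')
print times.min()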
In this demo, I'm assuming that the working directory is a Git repository set up with
$ git init
$ git add fipy_timing.py
$ git commit -m "Add timing script."
Sumatra requires that the script sits in a working copy of a repository.
In [3]:
!git log -1
Once the repository is set up, the Sumatra project can be configured. Here we are using the distributed launch mode as we want Sumatra to launch and record parallel jobs.
In [148]:
%%bash
\rm -rf .smt
smt init smt-demo
smt configure --executable=python --main=fipy_timing.py
smt configure --launch_mode=distributed
smt configure -g uuid
smt configure -c store-diff
smt configure --addlabel=parameters
Sumatra requires that a Data/ directory exists in the working copy.
In [ ]:
!mkdir Data
If we were not using Sumatra, we would launch the job with
$ mpirun -n 2 python fipy_timing.py params.json
The equivalent command using Sumatra is
$ smt run -n 2 params.json
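After a run completes, the captured record can also be inspected from the command line. Assuming the stock Sumatra CLI, something like
$ smt list --long
prints the stored metadata for each record; the export to Pandas below gives the same information in a far more malleable form.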
In [4]:
import itertools
nprocs = (1, 2, 4, 8)
iterations_ = (100,)
Ns = (10, 40)
suites = ('trilinos',)
tag = 'demo4'
for nproc, iterations, N, suite in itertools.product(nprocs, iterations_, Ns, suites):
    !smt run --tag=$tag -n $nproc params.json N=$N iterations=$iterations suite=$suite
In [5]:
import json
import pandas
!smt export
with open('.smt/records_export.json') as ff:
    data = json.load(ff)
df = pandas.DataFrame(data)
The Sumatra data is now in a Pandas data frame, albeit a touch raw.
In [6]:
print df
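The full frame is hard to read at a glance; listing the column index first gives a quicker view of which fields Sumatra captured (a small sketch using the frame built above):
In [ ]:
# List the fields Sumatra exported for each record.
print df.columns.tolist()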
In [7]:
print df[['label', 'duration']]
While all the metadata is important, often we want the input and output data combined into a data frame in a digestible form. Typically, we want a graph of reduced input versus reduced output. The first step is to introduce a column in the data frame for each of the input parameters (the input data). The input data is buried in the launch_mode and parameters columns of the raw data frame.
In [8]:
import json
df = df.copy()
df['nproc'] = df.launch_mode.map(lambda x: x['parameters']['n'])
for p in 'N', 'iterations', 'suite':
    df[p] = df.parameters.map(lambda x: json.loads(x['content'])[p])
We now have the input data exposed as columns in the data frame.
In [9]:
columns = ['label', 'nproc', 'N', 'iterations', 'suite', 'tags']
print df[columns].sort('nproc')
The following pulls the run times stored in each simulation's output file into a run_time column.
In [10]:
import os
datafiles = df['output_data'].map(lambda x: x[0]['path'])
datapaths = df['datastore'].map(lambda x: x['parameters']['root'])
data = [np.loadtxt(os.path.join(x, y)) for x, y in zip(datapaths, datafiles)]
df['run_time'] = [min(d) for d in data]
In [11]:
columns.append('run_time')
print df[columns].sort('nproc')
Create masks to select the simulation records tagged with demo4 and then split them by system size (N=10 and N=40). We want to plot these results as different curves on the same graph.
In [17]:
tag_mask = df.tags.map(lambda x: 'demo4' in x)
df_tmp = df[tag_mask]
m10 = df_tmp.N.map(lambda x: x == 10)
m40 = df_tmp.N.map(lambda x: x == 40)
df_N10 = df_tmp[m10]
df_N40 = df_tmp[m40]
print df_N10[columns].sort('nproc')
print df_N40[columns].sort('nproc')
We can plot the results we're interested in. Larger system size gives better parallel speed-up.
In [18]:
ax = df_N10.plot('nproc', 'run_time', label='N={0}'.format(df_N10.N.iat[0]))
df_N40.plot('nproc', 'run_time', ylim=0, ax=ax, label='N={0}'.format(df_N40.N.iat[0]))
plt.ylabel('Run Time (s)')
plt.xlabel('Number of Processes')
plt.legend()
Out[18]:
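Since the aim of the study is parallel speed-up rather than raw run time, the same frames can be normalised against the single-process runs. A short sketch, assuming the df_N40 frame built above includes an nproc=1 record:
In [ ]:
# Speed-up of the N=40 case relative to the single-process run.
df_N40s = df_N40.sort('nproc')
t_serial = df_N40s.run_time.iat[0]
print df_N40s.nproc.values
print (t_serial / df_N40s.run_time).values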
Using Pandas it is easy to store a custom data frame.
In [19]:
df.to_hdf('store.h5', 'df')
In [21]:
store = pandas.HDFStore('store.h5')
print store.df.dependencies
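The frame can then be reloaded in a later session without re-running smt export. A minimal sketch, assuming a pandas version that provides read_hdf (otherwise the HDFStore interface above works just as well):
In [ ]:
# Restore the custom data frame from the HDF5 store.
df_restored = pandas.read_hdf('store.h5', 'df')
print df_restored[['label', 'nproc', 'N', 'run_time']].sort('nproc')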