An introduction to Storage

Introduction

Everything we need is contained in the openpathsampling package.


In [1]:
import openpathsampling as paths

The storage itself is mainly a netCDF file and can also be used as such. Technically it is a subclass of netCDF4.Dataset, so all of its functions are available in case we want to add additional tables to the file beyond what we keep in stores. You can of course also add new stores to the storage. Using Storage() will automatically create the set of needed stores when a new file is created.

netCDF files are very generic, while our Storage is tuned to our needs. It has, for example, native support for simtk.units, and it can recursively store nested objects using JSON pickling. But we will get to that.

Open the output from the 'alanine.ipynb' notebook to have something to work with


In [2]:
storage = paths.Storage('mstis.nc')
storage


Out[2]:
Storage @ 'mstis.nc'

and have a look at what stores are available


In [3]:
print storage.list_stores()


['stores', u'cv1', u'cv2', u'trajectories', u'topologies', u'cvs', u'snapshots', u'samples', u'samplesets', u'movechanges', u'steps', u'details', u'pathmovers', u'shootingpointselectors', u'engines', u'pathsimulators', u'transitions', u'networks', u'cv0', u'schemes', u'interfacesets', u'msouters', u'volumes', u'ensembles', u'tag', u'snapshot0']
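
Since Storage is a subclass of netCDF4.Dataset (as noted above), we can also peek at the raw netCDF level. A quick sketch using only standard Dataset attributes (output omitted):

print storage.variables.keys()[0:5]   # raw netCDF variables backing the stores
print storage.dimensions.keys()[0:5]  # raw netCDF dimensions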

In [4]:
cv = storage.cvs[0]

In [5]:
cv(storage.trajectories[0])[0:10]


Out[5]:
[0.19958673417568207,
 0.2097039520740509,
 0.21874895691871643,
 0.22582970559597015,
 0.23340286314487457,
 0.24211987853050232,
 0.25293034315109253,
 0.2662259638309479,
 0.27970921993255615,
 0.29388806223869324]

and we can access all of these using


In [6]:
snapshot_store = storage.snapshots

Stores are lists

In general it is useful to think about the storage as a set of lists. Each of these lists contains objects of the same type, e.g. Sample, Trajectory, Ensemble, Volume, ... The class instances used to access elements from the storage are called stores. Imagine you go into a store to buy and sell objects (luckily our stores are free). All the stores share the same storage space, which is a netCDF file on disk.

Still, a store is not really a list, nor is it subclassed from list, but it almost acts like one.


In [7]:
print 'We have %d snapshots in our storage' % len(storage.snapshots)


We have 32882 snapshots in our storage

Loading objects

Just as with Python lists, we can access a store using slicing, and even using lists of indices.

Load by slicing


In [8]:
print storage.samples[2:4]


[<Sample @ 0x11e7921d0>, <Sample @ 0x11e792110>]

Load by list of indices


In [9]:
print storage.ensembles[[0,1,3]]


[<openpathsampling.ensemble.TISEnsemble object at 0x11e77fcd0>, <openpathsampling.ensemble.IntersectionEnsemble object at 0x11fc7cc50>, <openpathsampling.ensemble.AllInXEnsemble object at 0x11fc7c290>]

Saving objects

Saving is somewhat special, since we try to deal exclusively with immutable objects. That means that once an object is saved, it cannot be changed. This is not completely true, since the netCDF file allows changes, but we try not to make them. The only exception are collective variables: these can store their cached values, and since we want to keep intermediate states, we add new values once we have computed them. This should be the only exception, and we use the .sync command to update the stored state of an already saved collective variable.
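
A minimal sketch of such a sync call (assuming the cvs store exposes the .sync method mentioned above):

# write newly computed cached values of an already saved CV to disk
cv = storage.cvs[0]
storage.cvs.sync(cv)  # assumed call, per the .sync command described above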

Saving is easy. Just use .save on the store


In [10]:
# storage.samples.save(my_sample)

and it will add the object to the end of our store list, or do nothing if the object has already been stored. It is important to note that each object knows whether it has been stored already. This allows us to write nice recursive saving without worrying that we save the same object several times.

You can also store directly using the storage. Both are fine; the storage just delegates the task to the appropriate store.


In [11]:
# storage.save(my_sample)
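
Because every object knows whether it has been stored, saving twice is harmless. A commented sketch (my_sample is hypothetical, as above):

# storage.save(my_sample)  # stores the object and returns its identifier
# storage.save(my_sample)  # second call is a no-op: the object is already stored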

For completeness, you can also use __setitem__ to save, but since you cannot explicitly set the index, you have to use None or Ellipsis (written ...) as the key.


In [12]:
# storage.samples[None] = my_sample
# storage.samples[Ellipsis] = my_sample
# storage.samples[...] = my_sample

I mentioned recursive saving. This does the following: imagine a snapshot, which itself has a Configuration and a Momentum object. If you store the snapshot, its contents are also stored using the appropriate stores. This can be arbitrarily complex. Most objects can either be stored in a special way or be converted into a JSON string from which we can reconstruct the object. Python has something like this built in (pickle), which works similarly, but we needed something that adds the recursive storage connection and uses JSON. If you are curious, the JSON string can be accessed for some objects using .json, but it is only available for loaded or saved objects. It will not be computed unless it is used.


In [13]:
volume = storage.volumes[1]
print storage.repr_json(volume)


{"_cls":"CVDefinedVolume","_dict":{"collectivevariable":{"_store":"cvs","_obj_uuid":"cb90664f-80cc-11e6-90fa-00000000001e"},"lambda_max":0.3,"lambda_min":0.0}}

Naming

Load by name


In [14]:
print storage.volumes['A'].name


A

Equivalent to using .find(<name>)


In [15]:
print storage.volumes.find('A').name


A

Names can exist multiple times. To find all objects with a given name, use .find_all(), which returns a list of objects.


In [16]:
print storage.volumes.find_all('A')


[<openpathsampling.volume.CVDefinedVolume object at 0x11f8eb290>]

To get only the indices, use .find_indices(<name>).


In [17]:
print storage.volumes.find_indices('A')


[4]

A look at the internal name-to-index mapping


In [18]:
storage.volumes.name_idx


Out[18]:
{'A': {4},
 'B': {5},
 'C': {0},
 'all states except A': {12},
 'all states except B': {22},
 'all states except C': {3}}

Objects can be saved by name. To do this we need new objects that we can actually save; objects that were loaded cannot be saved again, of course.


In [19]:
empty_ensemble = paths.EmptyEnsemble()
full_ensemble = paths.EmptyEnsemble()

Now store it as you would set a dictionary entry


In [20]:
len(storage.ensembles)


Out[20]:
156

In [21]:
storage.ensembles['empty'] = empty_ensemble

In [22]:
print storage.ensembles.index[empty_ensemble.__uuid__]


156

In [23]:
storage.variables['ensembles_json'][70:80]


Out[23]:
array([ u'{"_cls":"AllInXEnsemble","_dict":{"volume":{"_store":"volumes","_obj_uuid":"6a5b9e8a-80d5-11e6-b0c9-0000029b24b0"},"trusted":true}}',
       u'{"_cls":"MinusInterfaceEnsemble","_dict":{"n_l":2,"state_vol":{"_store":"volumes","_obj_uuid":"cb90664f-80cc-11e6-90fa-000000000024"},"innermost_vols":[{"_store":"volumes","_obj_uuid":"6a5b9e8a-80d5-11e6-b0c9-0000029b23cc"}],"innermost_vol":{"_store":"volumes","_obj_uuid":"6a5b9e8a-80d5-11e6-b0c9-0000029b23cc"},"greedy":false}}',
       u'{"_cls":"TISEnsemble","_dict":{"orderparameter":null,"ensembles":[{"_store":"ensembles","_obj_uuid":"6a5b9e8a-80d5-11e6-b0c9-0000029b25d2"},{"_store":"ensembles","_obj_uuid":"6a5b9e8a-80d5-11e6-b0c9-0000029b25d8"},{"_store":"ensembles","_obj_uuid":"6a5b9e8a-80d5-11e6-b0c9-0000029b25de"}],"min_overlap":{"_tuple":[0,0]},"max_overlap":{"_tuple":[0,0]},"greedy":false,"final_states":[{"_store":"volumes","_obj_uuid":"cb90664f-80cc-11e6-90fa-000000000024"}],"interface":{"_store":"volumes","_obj_uuid":"6a5b9e8a-80d5-11e6-b0c9-0000029b23cc"},"lambda_i":null,"initial_states":[{"_store":"volumes","_obj_uuid":"cb90664f-80cc-11e6-90fa-000000000024"}]}}',
       u'{"_cls":"IntersectionEnsemble","_dict":{"ensemble2":{"_store":"ensembles","_obj_uuid":"6a5b9e8a-80d5-11e6-b0c9-0000029b25d0"},"ensemble1":{"_store":"ensembles","_obj_uuid":"6a5b9e8a-80d5-11e6-b0c9-0000029b25ce"}}}',
       u'{"_cls":"LengthEnsemble","_dict":{"length":1}}',
       u'{"_cls":"AllInXEnsemble","_dict":{"volume":{"_store":"volumes","_obj_uuid":"cb90664f-80cc-11e6-90fa-000000000024"},"trusted":true}}',
       u'{"_cls":"IntersectionEnsemble","_dict":{"ensemble2":{"_store":"ensembles","_obj_uuid":"6a5b9e8a-80d5-11e6-b0c9-0000029b25d6"},"ensemble1":{"_store":"ensembles","_obj_uuid":"6a5b9e8a-80d5-11e6-b0c9-0000029b25d4"}}}',
       u'{"_cls":"PartOutXEnsemble","_dict":{"volume":{"_store":"volumes","_obj_uuid":"6a5b9e8a-80d5-11e6-b0c9-0000029b23cc"},"trusted":true}}',
       u'{"_cls":"AllOutXEnsemble","_dict":{"volume":{"_store":"volumes","_obj_uuid":"cb90664f-80cc-11e6-90fa-000000000024"},"trusted":true}}',
       u'{"_cls":"IntersectionEnsemble","_dict":{"ensemble2":{"_store":"ensembles","_obj_uuid":"6a5b9e8a-80d5-11e6-b0c9-0000029b25dc"},"ensemble1":{"_store":"ensembles","_obj_uuid":"6a5b9e8a-80d5-11e6-b0c9-0000029b25da"}}}'], dtype=object)

Alternatively, you can pass a name directly to .save.


In [24]:
storage.ensembles.save(full_ensemble, 'full')


Out[24]:
16645015244662088194245276165176557816L

And the ensembles now have their .name property set


In [25]:
print empty_ensemble.name
print full_ensemble.name


empty
full

Tags

The storage contains a special store named tag. It can be used to reference arbitrary objects, mostly to name things for easy access later.


In [26]:
print len(storage.tag)


0

In [27]:
for t in storage.tag:
    print t

In [28]:
storage.tag.keys()


Out[28]:
[]

In [29]:
for name, obj in storage.tag.iteritems():
    print '{:20s} : {:20s}'.format(name, obj.__class__.__name__)

In [30]:
dict(storage.tag)


Out[30]:
{}
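
Since the tag store behaves like a dictionary, tagging an object looks just like the named-store assignment shown earlier. A sketch (the tag name is made up):

# storage.tag['initial_snapshot'] = storage.snapshots[0]
# storage.tag['initial_snapshot']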

Indexing

Each loaded object is equipped with .idx, which gives the index of that object in a specific storage or store. This is necessary since we can, in theory, store an object in several different stores at once, and these might have different indices.


In [31]:
samp = storage.samples[0]
print storage.idx(samp)
print samp.idx(storage)
print samp.idx(storage.samples)


0
0
None

Iterators

A list is iterable, and so is a store. Let's load all ensembles and list some of their names


In [32]:
[ens.name for ens in storage.ensembles][2:4]


Out[32]:
['[IntersectionEnsemble]', u'Out A 1']

Maybe you have noticed that some commands run slower the first time. This is because we use caching: once an object is loaded, it stays in memory and can be accessed much faster.
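
You can watch the cache at work with IPython's %time magic. A quick sketch (timings will vary):

# %time storage.ensembles[10]  # first access: read and deserialized from disk
# %time storage.ensembles[10]  # second access: served from the in-memory cache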

Searching for objects

One way to find objects is by name, as mentioned before, but in general there are no dedicated search functions; instead we use plain Python to load what we need. List comprehensions are the magic words. Say we want all snapshots that lie in state A. We could load all of them and filter, but there is a more elegant way to do that, or let's say a more elegant way of writing it in Python, because the underlying code does just that.


In [33]:
stA = storage.volumes['A']
first_5000_snaps = storage.snapshots[0:5000]
snapshots_in_A = [snapshot for snapshot in first_5000_snaps if stA(snapshot)]
print 'We found %d snapshots in StateA among %d total snapshots' % (len(snapshots_in_A), len(first_5000_snaps))


We found 1304 snapshots in StateA among 5000 total snapshots

Let's do something more useful: for a TIS ensemble we want statistics on the path lengths of the sampled trajectories, i.e. the Sample objects sampled for a specific ensemble. And we only want samples that have been generated in our production runs and are present in a SampleSet.

TODO: add a way to select only specific SampleSets


In [34]:
print storage.samplesets[0]


<openpathsampling.sample.SampleSet object at 0x1227aec90>

In [35]:
my_network = storage.networks[0]
my_ensemble = my_network.sampling_ensembles[0]
relevant_samples = [
    sample 
    for sample_set in storage.samplesets 
    for sample in sample_set 
    if sample.ensemble is my_ensemble
]
print len(relevant_samples)


0

and finally compute the average length


In [36]:
list_of_path_lengths = [
    len(sample.trajectory)
    for sample_set in storage.samplesets 
    for sample in sample_set 
    if sample.ensemble is my_ensemble
]
print list_of_path_lengths
if len(list_of_path_lengths) > 0:
    mean = float(sum(list_of_path_lengths))/len(list_of_path_lengths)
else:
    mean = 0.0 # actually, it is not defined, so we just set it to zero
print mean


[]
0.0

All right, we loaded from a bootstrapping sampling algorithm, so the analysis is pointless, but it is still rather short considering what we just did.

Generator expressions

There is another very cool Python feature worth noting: generator expressions. Before, we used list comprehensions to generate the whole list of what we need, but what if we don't want the whole list at once? Maybe it would not fit into memory, or building it is simply not desirable? We can do the same thing as above using a generator (although it would only be useful if we had to average over billions of samples). So assume the list of lengths is too large for memory. Summing does not mind working on little pieces, so we construct an object that always gives us the next element. These objects are called iterators, and there is a syntactic shortcut to create them easily: instead of square brackets, as in list comprehensions, use round brackets. The above example then looks like this


In [37]:
iterator_over_path_lengths = (
    len(sample.trajectory)
    for sample_set in storage.samplesets
    for sample in sample_set 
    if sample.ensemble is my_ensemble
)
total = float(sum(iterator_over_path_lengths))
print total


0.0

Note that we now have a generator and no computed values yet. If we iterate over it, it yields one value at a time, and we can pass it to sum as we did before. One important thing to note: once an iterator is used, it is consumed and cannot simply be run again, so we need to recreate it if we want to use it a second time.
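
A tiny pure-Python illustration of this consumption behavior:

gen = (x * x for x in range(4))
print sum(gen)  # 14 -- consumes the generator
print sum(gen)  # 0  -- already exhausted, nothing left to sum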


In [38]:
iterator_over_path_lengths = (
    len(sample.trajectory)
    for sample_set in storage.samplesets 
    for sample in sample_set 
    if sample.ensemble is my_ensemble
)
total = 0
count = 0
for length in iterator_over_path_lengths:
    total += length
    count += 1
    
if count > 0:
    mean = float(total)/count
else:
    mean = 0.0 # actually, it is not defined, so we just set it to zero
print mean


0.0

Voilà, this time without computing all lengths beforehand!

A last interesting example is statistics on acceptance. Each sample knows which mover was involved in its creation; this is stored as .details.mover in the .details attribute. Let us look at only forward shooting moves.


In [39]:
ff_movers = [mover for mover in storage.pathmovers if type(mover) is paths.ForwardShootMover]

In [40]:
ff_movers


Out[40]:
[<openpathsampling.pathmover.ForwardShootMover at 0x1254d1f90>,
 <openpathsampling.pathmover.ForwardShootMover at 0x1068a7d50>,
 <openpathsampling.pathmover.ForwardShootMover at 0x11fc7cb10>,
 <openpathsampling.pathmover.ForwardShootMover at 0x1227c1e50>,
 <openpathsampling.pathmover.ForwardShootMover at 0x1278a0750>,
 <openpathsampling.pathmover.ForwardShootMover at 0x1250222d0>,
 <openpathsampling.pathmover.ForwardShootMover at 0x11f8eb510>,
 <openpathsampling.pathmover.ForwardShootMover at 0x125022c90>,
 <openpathsampling.pathmover.ForwardShootMover at 0x125022110>,
 <openpathsampling.pathmover.ForwardShootMover at 0x1227c1650>,
 <openpathsampling.pathmover.ForwardShootMover at 0x12556e7d0>,
 <openpathsampling.pathmover.ForwardShootMover at 0x1250226d0>,
 <openpathsampling.pathmover.ForwardShootMover at 0x125022150>]

In [41]:
if len(ff_movers) > 2:
    mover = ff_movers[2]
    print "Use a '%s' for ensemble(s) '%s'" % ( mover.cls, mover.ensemble.name )


Use a 'ForwardShootMover' for ensemble(s) 'Out C 0'

A little trick worth knowing: a comprehension-style expression written directly inside a function call is actually a generator expression, so the resulting iterator is passed to the function (e.g. to str.join) without building an intermediate list.

Now to get statistics on acceptances
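
A minimal sketch of how that could look, assuming each step in storage.steps carries a move change whose canonical part exposes .mover and .accepted (these attribute names are assumptions, not shown above):

from collections import Counter

# tally accepted vs. trial moves per mover type (attribute names are assumptions)
n_trials = Counter()
n_accepted = Counter()
for step in storage.steps:
    change = step.change.canonical          # assumed: the canonical move change of a step
    name = change.mover.__class__.__name__
    n_trials[name] += 1
    if change.accepted:                     # assumed: acceptance flag on the change
        n_accepted[name] += 1

for name in n_trials:
    print '%-25s : %4d / %4d accepted' % (name, n_accepted[name], n_trials[name])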