Tutorial 1 - AdaptiveMD basics

First we cover some basics demonstrating the use of AdaptiveMD objects to get you going.

We will briefly talk about

  1. resources
  2. files
  3. generators
  4. how to run a simple trajectory

All of these objects from the adaptivemd package are used to organize a workflow by associating them with the adaptivemd.Project class. We will create a new project called "tutorial" that is used in this and subsequent tutorial notebooks, so, as with any project, be careful about deleting work if you revisit this notebook. All Python packages specified in the README must be installed on your local machine, along with MongoDB. The first tutorials can easily be modified to test other resources, so make any necessary adjustments. The MongoDB port must be reachable from both this notebook session and the resource that is defined at the time of execution.

Connecting AdaptiveMD to a MongoDB Instance

Before getting started, it is essential to connect your AdaptiveMD python session to a running mongo database. You must run the database outside AdaptiveMD, and then configure the connection to it. In the simplest (default) case, it will be located on your local host on the default MongoDB port, 27017, and you don't need to do anything to connect. In a practical scenario, the database will not be located on a local machine. You will need to determine the IP address of the database host, and port number used by the mongod instance.

Assuming you don't have a firewall issue, you can connect by changing the database network address set in the package. You can always check the current address by looking at the _db_url attribute of the MongoDB storage interface.


In [1]:
from adaptivemd import mongodb

In [2]:
mongodb.MongoDBStorage._db_url


Out[2]:
'mongodb://localhost:27017/'

To change the port number, use the set_port method of the MongoDBStorage interface class:


In [3]:
mongodb.MongoDBStorage.set_port(27018)
mongodb.MongoDBStorage._db_url


Out[3]:
'mongodb://localhost:27018/'

Likewise, reset the host address with set_host:


In [4]:
mongodb.MongoDBStorage.set_host('128.219.191.255')
mongodb.MongoDBStorage._db_url


Out[4]:
'mongodb://128.219.191.255:27017/'

Or the whole location at one time with set_location:


In [5]:
mongodb.MongoDBStorage.set_location('localhost:27017')
mongodb.MongoDBStorage._db_url


Out[5]:
'mongodb://localhost:27017'

The database URL used by AdaptiveMD is a class attribute of MongoDBStorage, so it is 'global' for a python session. You cannot use multiple databases in a single session. Before loading a Project, this must be set correctly, or else you will get a pymongo connection error.
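
To illustrate why a class attribute behaves as a session-wide setting, here is a minimal sketch in plain Python. StorageSketch and set_host_and_port are hypothetical names made up for this illustration; this is not the MongoDBStorage implementation.

```python
class StorageSketch:
    """Hypothetical stand-in for a storage class (not AdaptiveMD code)."""

    # Class attribute: a single value shared by everything in the session.
    _db_url = 'mongodb://localhost:27017/'

    @classmethod
    def set_host_and_port(cls, host, port):
        # Rebinding the attribute on the class changes it for all users.
        cls._db_url = 'mongodb://{}:{}/'.format(host, port)


first = StorageSketch()
second = StorageSketch()

StorageSketch.set_host_and_port('localhost', 27018)

# Both instances report the new URL because the value lives on the class.
print(first._db_url)
print(second._db_url)
```

Because there is only one such value, pointing it at a second database also redirects everything that was using the first one.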

The Project

Alright, let's load the package and import the Project class since we want to start a project.


In [6]:
from adaptivemd import Project

A bit more on connecting

Note that the best way to set the database location is probably through the Project class or a Project object. The convenience functions set_dbhost, set_dbport, and set_dblocation call the MongoDBStorage methods above.


In [7]:
Project.set_dbhost('128.219.186.38')
mongodb.MongoDBStorage._db_url


Out[7]:
'mongodb://128.219.186.38:27017/'

In [8]:
Project.set_dbport('27018')
mongodb.MongoDBStorage._db_url


Out[8]:
'mongodb://128.219.186.38:27018/'

With the Project, the whole URL can also be set at one time if you'd like:


In [9]:
Project.set_dburl('mongodb://not-correct:dburl/')
mongodb.MongoDBStorage._db_url


Out[9]:
'mongodb://not-correct:dburl/'

In [10]:
Project.set_dblocation('localhost:27017')
mongodb.MongoDBStorage._db_url


Out[10]:
'mongodb://localhost:27017'

Now for the fun stuff

Let's open a project with a UNIQUE name by instantiating Project with a name. This will be the name used in the DB, so make sure it is new and not too short. Calling adaptivemd.Project with a name to construct an instance will always create the project if it does not exist yet, or reopen an existing one. You cannot choose between opening modes as you would with a file. This is a precaution against accidentally deleting your project.

First, let's see what projects exist already by listing them from the database. Careful not to delete something you want to keep.


In [11]:
Project.list()


Out[11]:
[]

In [3]:
# Use this to completely remove the tutorial project from the database.
Project.delete('tutorial')
Project.list()


Out[3]:
[u'test']

Note that if you have trajectories or models saved in a pre-existing folder for a project named tutorial, they are not deleted. They will be overwritten as new data is produced. The new project will iterate through pre-existing trajectory names and overwrite the data as each next tutorial is run. The data must be manually moved if desired, or deleted. Only the MongoDB storage associated with the project has been affected by Project.delete('tutorial').


In [13]:
project = Project('tutorial')

In [14]:
project.list()


Out[14]:
[]

Now we have a handle for our project. First thing is to set it up to work on a resource.

The Resource Configuration

What is a configuration?

A Configuration specifies a shared filesystem and the cluster(s) attached to it. This can be your local machine, a regular cluster, or even a group of clusters that can access the same FS (like Titan, Eos and Rhea do). At least one is created for a project, and more may be specified via a configurations.cfg file or object creation in a session. Once created, these should remain static for a project, since AdaptiveMD doesn't yet move data around on one or more filesystems.

Let us look at the simplest cases, which use a local resource configuration: your laptop or desktop machine for now, with no cluster / HPC involved. In a later tutorial, we will consider more ways to set the configuration. shared_path is a required field/attribute that tells adaptivemd where to store project data such as simulations and models:

  • no configuration specified when initializing project --> local config with shared_path '$HOME/adaptivemd'
  • give configuration dict to initialize method --> {'shared_path': '$HOME/admd'}

Since this object defines the path where all files will be placed, let's get the path to the shared folder. This path must be accessible from all workers on the resource. When using a local resource make sure you have the default folder created on your machine.

initialize is only run once for a project. When this command is executed, the project is entered in the MongoDB under the name 'tutorial'. The directory for the project is not yet created, as workers manage the files and folders associated with a project.


In [15]:
project.initialize({'shared_path': '$HOME/admd'})

In [16]:
project.configurations.one.shared_path


Out[16]:
'$HOME/admd'
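
Note that the stored value still contains the literal "$HOME". As a rough sketch, the standard library can show which folder such a string refers to on your machine and create it if needed. The helper resolve_shared_path and the SKETCH_HOME variable are made up for this example, and the assumption that the path is expanded shell-style is ours, not something the package documents here.

```python
import os
import tempfile


def resolve_shared_path(shared_path, create=False):
    """Expand environment variables and '~', optionally creating the folder."""
    resolved = os.path.expanduser(os.path.expandvars(shared_path))
    if create:
        os.makedirs(resolved, exist_ok=True)
    return resolved


# Demonstrate against a throwaway directory instead of the real home folder.
with tempfile.TemporaryDirectory() as fake_home:
    os.environ['SKETCH_HOME'] = fake_home
    path = resolve_shared_path('$SKETCH_HOME/admd', create=True)
    created = os.path.isdir(path)
    print(path, created)
```

For the tutorial setup you would call resolve_shared_path('$HOME/admd', create=True) once before starting a worker.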

File Objects


In [10]:
from adaptivemd import File

First we define a File object. Instead of just a string, these are used to represent files anywhere, on the cluster or your local application. There are some subclasses or extensions of File that have additional meta information like Trajectory or Frame. The underlying base object of a File is called a Location.

We start with a first PDB file that is located on this machine at a relative path:


In [11]:
pdb_file = File('file://../files/alanine/alanine.pdb')

A File, like any complex object in adaptivemd, can have a .name attribute that makes it easier to find later. You can either set the .name property after creation, or use the little helper method .named() to get a one-liner. This method sets .name and returns the object itself.

For more information about the possible ways to specify a file location, consult the documentation for File.


In [12]:
pdb_file.name = 'initial_pdb'

The .load() call is important. It causes the File object to load the content of the file, and if you save the File object in the database, the actual file content is stored with it. This way it can simply be rewritten on the cluster or anywhere else.


In [13]:
pdb_file.load()


Out[13]:
'alanine.pdb'
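
The idea of a file reference that carries its content can be sketched without adaptivemd. LoadedFileSketch below is a hypothetical class written for this illustration; it only mimics the behaviour described above (load returns the object, and the content can be written out elsewhere).

```python
import os
import tempfile


class LoadedFileSketch:
    """Hypothetical file reference that can carry its own content."""

    def __init__(self, path):
        self.path = path
        self.content = None  # filled in by load()

    def load(self):
        with open(self.path, 'rb') as handle:
            self.content = handle.read()
        return self  # returning self allows one-line chaining

    def write_to(self, directory):
        target = os.path.join(directory, os.path.basename(self.path))
        with open(target, 'wb') as handle:
            handle.write(self.content)
        return target


with tempfile.TemporaryDirectory() as source_dir, \
        tempfile.TemporaryDirectory() as cluster_dir:
    source = os.path.join(source_dir, 'alanine.pdb')
    with open(source, 'wb') as handle:
        handle.write(b'ATOM      1  N   ALA\n')

    pdb = LoadedFileSketch(source).load()
    os.remove(source)  # the original may vanish; the content travels along
    copy = pdb.write_to(cluster_dir)
    with open(copy, 'rb') as handle:
        restored = handle.read()

print(restored == pdb.content)
```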

Generator Objects

TaskGenerators are instances whose purpose is to create tasks for execution. This is similar to the way Kernels work. A TaskGenerator will generate Task objects for you which will be translated into a TaskDescription and executed. In simple terms:

The task generator creates the bash scripts for you that run a simulation or run pyemma.

A task generator will be initialized with all parameters needed to make it work, and it will know what files need to be staged for the task to be used. adaptivemd relies primarily on two types of generators we will call:

  1. engine: for producing simulation data
  2. modeller: for analyzing simulation data
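
As a rough mental model, a task generator is an object that is configured once and then stamps out task descriptions on demand. The sketch below is purely illustrative: GeneratorSketch, make_task, the run_simulation.py script name, and the --length and --output flags are all hypothetical and do not reflect the commands adaptivemd actually writes.

```python
class GeneratorSketch:
    """Hypothetical task generator; names and command line are illustrative."""

    def __init__(self, staged_files, args):
        self.staged_files = list(staged_files)  # files every task needs
        self.args = args                        # fixed command-line arguments

    def make_task(self, output_name, n_frames):
        """Return a plain description of one unit of work."""
        command = 'python run_simulation.py {} --length {} --output {}'.format(
            self.args, n_frames, output_name)
        return {'stage': self.staged_files, 'commands': [command]}


generator_sketch = GeneratorSketch(
    staged_files=['alanine.pdb', 'system.xml', 'integrator.xml'],
    args='-p CPU',
)
task_sketch = generator_sketch.make_task('00000000', 100)
print(task_sketch['stage'])
print(task_sketch['commands'][0])
```

The point of the pattern is that all fixed parameters live in the generator, so each new task only needs the few values that differ.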

The Engine is used to run trajectories


In [14]:
from adaptivemd.engine.openmm import OpenMMEngine

A task generator will create tasks that workers use to run simulations. Currently, this means a little python script is created that will execute OpenMM. It requires conda to be added to the PATH variable, or at least openmm to be included in the python installation used by the resource. If you set up your resource correctly, then the task should execute automatically via a worker.

So let's do an example for the OpenMM engine. A small python script is created that makes OpenMM look like an executable. It runs a simulation from an initial frame, OpenMM-specific system.xml and integrator.xml files, and some additional parameters like the platform name, how often to store simulation frames, etc.


In [15]:
engine = OpenMMEngine(
    pdb_file=pdb_file,
    system_file=File('file://../files/alanine/system.xml').load(),
    integrator_file=File('file://../files/alanine/integrator.xml').load(),
    args='-r --report-interval 1 -p CPU'
).named('openmm')

We now have an OpenMMEngine that uses the previously made PDB File object, which will be placed in the location defined by the shared_path. The same goes for the OpenMM XML files, along with some args to run on the CPU platform, etc.

Lastly, we name the engine openmm so we can find it later when we reopen the project.


In [16]:
engine.name


Out[16]:
'openmm'

Next, we need to set the output types we want the engine to generate. We chose a stride of 10 for the master trajectory without selection, and save a second trajectory selecting only protein atoms and native stride.

Note that the stride and frame number ALWAYS refer to the native steps used in the engine. In our example the engine uses 2fs time steps. So master stores every 20fs and protein every 2fs.


In [17]:
engine.add_output_type('master', 'master.dcd', stride=10)
engine.add_output_type('protein', 'protein.dcd', stride=1, selection='protein')

The selection must be an mdtraj formatted atom selection string.
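
The stride arithmetic from the note above can be checked with a few lines of Python. The 2 fs time step and the strides are the values quoted in this tutorial; the frame counts are approximate because this sketch ignores how the first and last frames are handled.

```python
TIMESTEP_FS = 2  # native engine time step from the example, in femtoseconds

# Output types and strides as passed to add_output_type above.
output_strides = {'master': 10, 'protein': 1}


def spacing_fs(stride, timestep_fs=TIMESTEP_FS):
    """Simulated time between two consecutive stored frames."""
    return stride * timestep_fs


spacing = {name: spacing_fs(stride) for name, stride in output_strides.items()}

# Approximate number of stored frames for a run of 100 native frames.
approx_stored = {name: 100 // stride for name, stride in output_strides.items()}

print(spacing)        # {'master': 20, 'protein': 2}
print(approx_stored)  # {'master': 10, 'protein': 100}
```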

The PyEMMAAnalysis modeller


In [18]:
from adaptivemd.analysis.pyemma import PyEMMAAnalysis

The PyEMMAAnalysis object computes an MSM model from existing trajectories that you pass it. It is initialized with a .pdb file that is used to create features between the $c_\alpha$ atoms. This implementation requires a PDB, but in general this is not necessary. It is specific to this PyEMMAAnalysis showcase.


In [19]:
modeller = PyEMMAAnalysis(
    engine=engine,
    outtype='protein',
    features={'add_inverse_distances': {'select_Backbone': None}}
).named('pyemma')

Again we name it pyemma for later reference.

We specified which output type from the engine we want to analyse. We chose the protein trajectories since these are faster to load and have better time resolution.

The features dict expresses which features to use in the analysis. In this case we will use all inverse distances between backbone c_alpha atoms.

Add generators to project

Next step is to add the generators to the project for later usage. We pick the .generators store and just add it. Consider a store that works like a set() in python, where additionally when an object is added it is stored in the database. It contains objects only once and is not ordered. Therefore we need a name to find the objects later. Of course you can always iterate over all objects, but the order is not given.


In [20]:
#project.generators.add(engine)
#project.generators.add(modeller)
project.generators.add([engine, modeller])
len(project.generators)


Out[20]:
2

Note that you cannot add the same engine instance twice (or any stored object to its store). If you create a new but equivalent engine, it will be considered different and hence you can store it again.


In [21]:
project.generators.add(engine)
len(project.generators)


Out[21]:
2
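
The set-like behaviour can be reproduced with a plain Python set, since objects without a custom __eq__ are compared by identity. GeneratorSketch is a hypothetical placeholder class; the real store additionally writes each object to the database, which this sketch leaves out.

```python
class GeneratorSketch:
    """Hypothetical named object; compared by identity like any plain object."""

    def __init__(self, name):
        self.name = name


store = set()
engine_a = GeneratorSketch('openmm')

store.add(engine_a)
store.add(engine_a)                    # same instance: the set ignores it
count_after_duplicate = len(store)

store.add(GeneratorSketch('openmm'))   # equivalent but new instance: stored
count_after_copy = len(store)

# Sets are unordered, so objects are looked up by name rather than position.
found = [g for g in store if g.name == 'openmm']
print(count_after_duplicate, count_after_copy, len(found))
```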

Create one initial trajectory

Finally we are ready to run a first trajectory that we will store as a point of reference in the project. Also it is nice to see how it works in general.

We are using a Worker approach. This means simply that someone (in our case the user from inside a script or a notebook) creates a list of tasks to be done and some other instance (the worker) will actually do the work.

Create a Trajectory object

First we create the parameters for the engine to run the simulation. We use a Trajectory object (a special File with initial frame and length) as the input. You could of course pass these things separately, but this way, we can actually reference the not yet existing trajectory and do stuff with it.

A Trajectory should have a unique name, and so there is a project function to do this automatically. It uses numbers and makes sure that this number has not been used yet in the project. The data will be stored in the "$HOME/admd/projects/project.name/trajs/traj.name" directory we set in the configuration, i.e.:

$HOME/admd/project/tutorial/trajs/00000000/


In [22]:
trajectory = project.new_trajectory(engine['pdb_file'], 100, engine)
#trajectory = project.new_trajectory(pdb_file, 100, engine)
trajectory


Out[22]:
Trajectory('alanine.pdb' >> 00000000[0..100])

This says that the initial frame is alanine.pdb, that the trajectory runs for 100 frames, and that it is named xxxxxxxx (here 00000000). This is the name of a folder in the data directory, where trajectory files will be stored. Multiple atom selections, e.g. protein and all atoms, may be written to create multiple files in this folder. We will refer to these distinct trajectories as the outtypes later.
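
One simple way to produce such unique, zero-padded names is sketched below. The function next_trajectory_name is hypothetical, and picking the lowest unused number is an assumption made for the illustration; the project may keep a counter instead.

```python
def next_trajectory_name(used_names):
    """Return the lowest unused number, zero-padded to eight digits."""
    number = 0
    while '{:08d}'.format(number) in used_names:
        number += 1
    return '{:08d}'.format(number)


first_name = next_trajectory_name(set())
third_name = next_trajectory_name({'00000000', '00000001'})
print(first_name)  # 00000000
print(third_name)  # 00000002
```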

Why do we need a trajectory object?

You might wonder why a Trajectory object is necessary. You could just build a function that will take these parameters and run a simulation. At the end it will return the trajectory object. The same object we created just now.

One main reason is to use it as a so-called Promise in AdaptiveMD's asynchronous execution framework. The trajectory object we built acts as a Promise, so what is that exactly?

A Promise is a value (or an object) that represents the result of a function at some point in the future. In our case it represents a trajectory at some point in the future. Normal promises have specific functions to deal with the unknown result; for us this is a little different, but the general concept stands. We create an object that represents the specifications of a Trajectory, and so, regardless of whether the corresponding data file exists, we can use the trajectory as if it existed to build operations on it.

We see the second reason by considering the object after the promise is fulfilled. We now have an object that can offer a lightweight view on the trajectory data it represents for inspection and sampling. Later we will use it as a convenient way to view analysis results.
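
The promise idea can be sketched with a small class that describes a trajectory before any data exists. TrajectoryPromiseSketch and its frame method are hypothetical and written only for this illustration; they are not the adaptivemd Trajectory class.

```python
import os
import tempfile


class TrajectoryPromiseSketch:
    """Hypothetical description of a trajectory that may not exist yet."""

    def __init__(self, path, length):
        self.path = path      # where the data will eventually be written
        self.length = length  # number of frames requested

    @property
    def exists(self):
        # The promise counts as fulfilled once the data file is on disk.
        return os.path.exists(self.path)

    def frame(self, index):
        # A frame reference is just (path, index); no data is read.
        return (self.path, index)


with tempfile.TemporaryDirectory() as data_dir:
    promise = TrajectoryPromiseSketch(os.path.join(data_dir, 'master.dcd'), 100)

    # Plan follow-up work before any simulation has run.
    restart_point = promise.frame(20)
    exists_before = promise.exists

    # Stand-in for a worker finishing the simulation.
    with open(promise.path, 'wb') as handle:
        handle.write(b'placeholder trajectory data')
    exists_after = promise.exists

print(exists_before, exists_after, restart_point[1])
```

The same object serves both roles: a specification while the data is missing, and a lightweight handle once it has been produced.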

Trajectory objects are list-like, get the length:


In [23]:
print(trajectory.length)


100

and since the length is fixed, we know how many frames there are and can access them


In [24]:
print(trajectory[20].exists)
print(trajectory[20])
print(trajectory[19].exists)
print(trajectory[19])


True
Frame(sandbox:///{}/00000000/[20])
False
Frame(sandbox:///{}/00000000/[19])

The extend method gives us a task that elongates the trajectory:


In [25]:
print(trajectory.extend(100))


<adaptivemd.engine.engine.TrajectoryExtensionTask object at 0x112ab5490>

The run method gives us a task that will run an MD simulation and create the trajectory:


In [26]:
print(trajectory.run())


<adaptivemd.engine.engine.TrajectoryGenerationTask object at 0x112ab5a10>

We can ask to extend it, we can save it. We can reference specific frames in it before running a simulation. You could even build a whole set of related simulations this way without running a single frame. This is pretty powerful especially in the context of running asynchronous simulations.

Create a Task object

Now we want this trajectory to actually exist, so we have to make it. This requires a Task object that knows how to describe the execution of a simulation. Since Task objects are very flexible and can be complex, there are helper functions (i.e. factories) to create them easily, like the ones we used before.

Use the trajectory (which uses its engine) to call .run() and save the returned task object to directly work with it.


In [27]:
task = trajectory.run()

That's it, just take a trajectory description and turn it into a task that contains the shell commands and needed files, etc. Use the property trajectory.exists to see whether the trajectory object is associated with any data.


In [28]:
trajectory.exists


Out[28]:
False

Submit the task to the queue

Finally we need to add this task to the things we want to be done. This is easy and only requires saving the task to the project. This is done to the project.tasks bundle and once it has been stored it can be picked up by any worker to execute it.

Note that you should be able to submit a trajectory like this, however in practical situations it is likely some additional operations are required in the pre- and post- tasks (outside scope of this tutorial), so they will usually be converted to tasks prior to queueing in project.


In [29]:
#FIXME#project.queue(trajectory)
# shortcut for project.tasks.add(task)
project.queue(task)

In [30]:
len(project.tasks)


Out[30]:
2

That is all we can do from here. To execute the tasks you need to create a worker using the adaptivemdworker command from the shell:

adaptivemdworker tutorial --verbose

For the simple setup in this tutorial, you can just navigate to the directory as follows if it's not in your PATH already.

cd home/of/adaptivemd/scripts/

The worker is responsible for managing the project file structure, so once a worker is running and project entries are changed in the database, the directory and subdirectories of "$shared_path/project/tutorial" are created and modified. The project subfolders "trajs" and "models" are populated with trajectories and models of the corresponding names as workers complete tasks that are entered into the database.

If you have a database set up remotely, you will need to tell the worker where to look for this connection just as with the AdaptiveMD application. There are 3 options you can use for this:

  1. "-d" to set the database host IP
  2. "-p" to set the database port on its host
  3. "-l" to set the full database location, optionally with the URL prefix "mongodb://"

e.g. Host IP

adaptivemdworker tutorial -d 8.8.8.8

Host port number

adaptivemdworker tutorial -p 27018

Full location

adaptivemdworker tutorial -l 8.8.8.8:27018

or

adaptivemdworker tutorial -l mongodb://8.8.8.8:27018/

or

adaptivemdworker tutorial -d 8.8.8.8 -p 27018
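
If you start workers from a script, the command lines above can be assembled programmatically. The helper worker_command is hypothetical; it only uses the flags listed in this tutorial (-d, -p, -l, --verbose) and does not launch anything.

```python
import shlex


def worker_command(project_name, host=None, port=None, location=None,
                   verbose=False):
    """Build an adaptivemdworker command line from the documented flags."""
    parts = ['adaptivemdworker', project_name]
    if location is not None:
        # The full location replaces separate host and port settings.
        parts += ['-l', location]
    else:
        if host is not None:
            parts += ['-d', host]
        if port is not None:
            parts += ['-p', str(port)]
    if verbose:
        parts.append('--verbose')
    # Quote each part so the string is safe to paste into a shell.
    return ' '.join(shlex.quote(part) for part in parts)


split_form = worker_command('tutorial', host='8.8.8.8', port=27018)
full_form = worker_command('tutorial', location='mongodb://8.8.8.8:27018/')
print(split_form)
print(full_form)
```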

In [31]:
task.state
project.tasks.all.state


Out[31]:
[u'dummy', u'created']

In [32]:
# now there are data files & folders associated with the trajectory
project.wait_until(task.is_done)

In [33]:
task.state


Out[33]:
u'success'
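
Conceptually, waiting on a task amounts to polling a condition until it becomes true. The sketch below shows that pattern with hypothetical names (wait_until as a free function, TaskSketch); it is an assumption about the general idea, not a copy of project.wait_until.

```python
import time


def wait_until(condition, poll_interval=0.01, timeout=5.0):
    """Block until condition() returns True, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while not condition():
        if time.monotonic() > deadline:
            raise TimeoutError('condition was not met in time')
        time.sleep(poll_interval)


class TaskSketch:
    """Hypothetical task that reports success after a few polls."""

    def __init__(self, polls_until_done):
        self._remaining = polls_until_done
        self.state = 'created'

    def is_done(self):
        if self._remaining > 0:
            self._remaining -= 1
            return False
        self.state = 'success'
        return True


task_sketch = TaskSketch(polls_until_done=3)
wait_until(task_sketch.is_done)  # pass the method itself, do not call it
print(task_sketch.state)
```

As in the notebook cell, the condition is passed as a callable so that it can be evaluated repeatedly.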

If you are done for now, it's also good practice to relieve your workers (and save yourself some compute time charges on HPC resources!). You don't have to, even if you're closing the project's database connection. They are associated with the project and will accept any tasks that are entered later.


In [34]:
# use the 'one' method inherited from bundle to see available methods
# for the worker type, such as 'execute'
#project.workers.one.execute('shutdown')

# but use 'all' method in practice to apply across all members of
# the workers bundle in the typical case, where you have many workers
project.workers.all.execute('shutdown')


Out[34]:
[None]
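
The output is a list because the call is applied to every worker and the individual return values are collected. A minimal sketch of such a broadcast helper follows; AllSketch, WorkerSketch, and the state names are hypothetical and only illustrate the pattern, not the adaptivemd bundle classes.

```python
class WorkerSketch:
    """Hypothetical worker with a state and an execute method."""

    def __init__(self, name):
        self.name = name
        self.state = 'running'

    def execute(self, command):
        if command == 'shutdown':
            self.state = 'down'
        # No return statement, so each call yields None.


class AllSketch:
    """Forward attribute access and method calls to every member."""

    def __init__(self, members):
        self._members = list(members)

    def __getattr__(self, name):
        values = [getattr(member, name) for member in self._members]
        if values and all(callable(value) for value in values):
            # Methods: return a function that calls each one in turn.
            return lambda *args, **kwargs: [
                value(*args, **kwargs) for value in values]
        # Plain attributes: return the list of values.
        return values


workers = AllSketch([WorkerSketch('w1'), WorkerSketch('w2')])
states_before = workers.state
results = workers.execute('shutdown')
states_after = workers.state
print(states_before, results, states_after)
```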

The final project.close() will close the DB connection. The daemon outside the notebook would be closed separately.


In [35]:
project.close()
