AdaptiveMD basics
We will briefly talk about AdaptiveMD objects to get you going.
All of these objects from the adaptivemd package are used to organize a workflow by associating them with the adaptivemd.Project class. We will create a new project called "tutorial" that is used in this and subsequent tutorial notebooks, so as with any project, be careful about deleting work if revisiting this notebook. All Python packages specified in the README must be installed on your local machine, along with MongoDB. The first tutorials can easily be modified to test the use of other resources, so make any necessary adjustments. A MongoDB port must be visible both to this notebook session and to the resource defined at the time of execution.
Before getting started, it is essential to connect your AdaptiveMD Python session to a running MongoDB database. You must run the database outside AdaptiveMD, and then configure the connection to it. In the simplest (default) case, it will be located on your local host on the default MongoDB port, 27017, and you don't need to do anything to connect. In a practical scenario, the database will not be located on a local machine. You will need to determine the IP address of the database host and the port number used by the mongod instance.
Assuming you don't have a firewall issue, you can connect by changing the database network address set in the package. You can always check this by looking at the _db_url attribute of our MongoDB storage interface.
In [1]:
from adaptivemd import mongodb
In [2]:
mongodb.MongoDBStorage._db_url
Out[2]:
To change the port number, use the set_port
method of the MongoDBStorage
interface class:
In [3]:
mongodb.MongoDBStorage.set_port(27018)
mongodb.MongoDBStorage._db_url
Out[3]:
Likewise, reset the host address with set_host
:
In [4]:
mongodb.MongoDBStorage.set_host('128.219.191.255')
mongodb.MongoDBStorage._db_url
Out[4]:
Or the whole location at one time with set_location
:
In [5]:
mongodb.MongoDBStorage.set_location('localhost:27017')
mongodb.MongoDBStorage._db_url
Out[5]:
The database URL used by AdaptiveMD is a class attribute of MongoDBStorage, so it is 'global' for a Python session: you cannot use multiple databases in a single session. Before loading a Project, the URL must be set correctly, or else you will get a pymongo connection error.
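If you are unsure whether the database is reachable, you can test the connection directly with pymongo, the driver AdaptiveMD uses under the hood. A minimal sketch (standard pymongo API, not part of AdaptiveMD itself):
# Quick connectivity check with pymongo (illustrative)
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

client = MongoClient('mongodb://localhost:27017/', serverSelectionTimeoutMS=2000)
try:
    client.admin.command('ping')  # raises if no server answers in time
    print('MongoDB is reachable')
except ConnectionFailure:
    print('Cannot reach MongoDB; fix host/port before loading a Project')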
Alright, let's load the package and import the Project
class since we want to start a project.
In [6]:
from adaptivemd import Project
Note that the best way to set the database location is probably via the Project class or an instance of it. The convenience functions set_dbhost, set_dbport, and set_dblocation call the MongoDBStorage methods shown above.
In [7]:
Project.set_dbhost('128.219.186.38')
mongodb.MongoDBStorage._db_url
Out[7]:
In [8]:
Project.set_dbport('27018')
mongodb.MongoDBStorage._db_url
Out[8]:
With the Project, the whole URL can also be set at one time if you'd like:
In [9]:
Project.set_dburl('mongodb://not-correct:dburl/')
mongodb.MongoDBStorage._db_url
Out[9]:
In [10]:
Project.set_dblocation('localhost:27017')
mongodb.MongoDBStorage._db_url
Out[10]:
Let's open a project with a UNIQUE name by instantiating Project with a name. This will be the name used in the DB, so make sure it is new and not too short. Calling adaptivemd.Project with a name will create the project if it does not yet exist, or reopen it if it does. You cannot choose between opening modes as you would with a file. This is a precaution against accidentally deleting your project.
First, let's see what projects exist already by listing them from the database. Careful not to delete something you want to keep.
In [11]:
Project.list()
Out[11]:
In [3]:
# Use this to completely remove the tutorial project from the database.
Project.delete('tutorial')
Project.list()
Out[3]:
Note that if you have trajectories or models saved in a pre-existing folder for a project named tutorial, they are not deleted. They will be overwritten as new data is produced: the new project will iterate through the pre-existing trajectory names and overwrite the data as each subsequent tutorial is run. If you want to keep the data, move or delete it manually. Only the MongoDB storage associated with the project is affected by Project.delete('tutorial').
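If you also want to clear the on-disk data, remove the project folder manually. A minimal sketch, assuming the '$HOME/admd' shared_path used later in this tutorial (double-check the path before deleting anything):
# CAUTION: permanently deletes trajectory/model files on disk.
# The path below assumes the '$HOME/admd' shared_path from this tutorial.
import os
import shutil

shutil.rmtree(os.path.expandvars('$HOME/admd/projects/tutorial'),
              ignore_errors=True)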
In [13]:
project = Project('tutorial')
In [14]:
project.list()
Out[14]:
Now we have a handle for our project. First thing is to set it up to work on a resource.
A Configuration specifies a shared filesystem and the cluster attached to it. This can be your local machine, a regular cluster, or even a group of clusters that can access the same FS (like Titan, Eos and Rhea do). At least one is created for a project, and more may be specified via a configurations.cfg file or object creation in a session. Once created, these should remain static for a project, since AdaptiveMD does not yet move data between filesystems.
Let us look at the simplest case: a local resource configuration using your laptop or desktop machine, with no cluster / HPC involved. In a later tutorial, we will consider more ways to set the configuration. shared_path is a required field/attribute that tells adaptivemd where to store project data such as simulations and models:
- shared_path defaults to '$HOME/adaptivemd'
- pass a dict to the initialize method to override it --> {'shared_path': '$HOME/admd'}
Since this object defines the path where all files will be placed, let's get the path to the shared folder. This path must be accessible from all workers on the resource. When using a local resource, make sure the default folder exists on your machine.
initialize is only run once for a project. When this command is executed, the project is entered in the MongoDB under the name 'tutorial'. The directory for the project is not yet created, as workers manage the files and folders associated with a project.
In [15]:
project.initialize({'shared_path': '$HOME/admd'})
In [16]:
project.configurations.one.shared_path
Out[16]:
In [10]:
from adaptivemd import File
First we define a File object. Instead of just a string, these are used to represent files anywhere, whether on the cluster or in your local application. There are subclasses or extensions of File that carry additional meta information, like Trajectory or Frame. The underlying base object of a File is called a Location.
We start with a first PDB file that is located on this machine at a relative path:
In [11]:
pdb_file = File('file://../files/alanine/alanine.pdb')
File, like any complex object in adaptivemd, can have a .name attribute that makes it easier to find later. You can either set the .name property after creation, or use the little helper method .named() as a one-liner. This function sets .name and returns the object itself.
For more information about the possibilities to specify file locations, consult the documentation for File.
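For example, creation, naming, and loading (see below) can be chained into a one-liner, equivalent to the separate steps in the next cells (illustrative, assuming .load() also returns the object, as .named() does):
# One-liner: create, name, and load the file in a single expression
pdb_file = File('file://../files/alanine/alanine.pdb').named('initial_pdb').load()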
In [12]:
pdb_file.name = 'initial_pdb'
The .load() call is important. It causes the File object to load the content of the file, and if you save the File object in the database, the actual file is stored with it. This way it can simply be rewritten on the cluster or anywhere else.
In [13]:
pdb_file.load()
Out[13]:
TaskGenerators are instances whose purpose is to create tasks for execution. This is similar to the way Kernels work. A TaskGenerator will generate Task objects for you, which will be translated into a TaskDescription and executed. In simple terms:
The task generator creates the bash scripts for you that run a simulation or run pyemma.
A task generator is initialized with all parameters needed to make it work, and it knows what files need to be staged for the task to be used. adaptivemd relies primarily on two types of generators: engines, which run simulations, and analysis generators, which build models from them. We start with an engine:
In [14]:
from adaptivemd.engine.openmm import OpenMMEngine
A task generator will create tasks that workers use to run simulations. Currently, this means a little Python script is created that will execute OpenMM. It requires conda to be added to the PATH variable, or at least openmm to be included in the Python installation used by the resource. If you set up your resource correctly, the task should execute automatically via a worker.
So let's do an example for the OpenMM engine. A small Python script is created that makes OpenMM look like an executable. It runs a simulation given an initial frame, OpenMM-specific system.xml and integrator.xml files, and some additional parameters like the platform name, how often to store simulation frames, etc.
In [15]:
engine = OpenMMEngine(
pdb_file=pdb_file,
system_file=File('file://../files/alanine/system.xml').load(),
integrator_file=File('file://../files/alanine/integrator.xml').load(),
args='-r --report-interval 1 -p CPU'
).named('openmm')
We now have an OpenMMEngine which uses the previously made pdb File object, placed in the location defined by its shared_path. The same goes for the OpenMM XML files, along with some args to run using the CPU kernel, etc.
Lastly, we name the engine openmm so we can find it later, when we reopen the project.
In [16]:
engine.name
Out[16]:
Next, we need to set the output types we want the engine to generate. We choose a stride of 10 for the master trajectory without any atom selection, and save a second trajectory containing only the protein atoms at the native stride.
Note that the stride and frame number ALWAYS refer to the native steps used in the engine. In our example the engine uses 2 fs time steps, so master stores a frame every 20 fs and protein every 2 fs.
In [17]:
engine.add_output_type('master', 'master.dcd', stride=10)
engine.add_output_type('protein', 'protein.dcd', stride=1, selection='protein')
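As a quick sanity check of that arithmetic (a tiny illustrative snippet; the 2 fs timestep is the one stated above):
# Convert output strides into saving intervals, given the native timestep
timestep_fs = 2
for name, stride in [('master', 10), ('protein', 1)]:
    print('%s saves a frame every %d fs' % (name, stride * timestep_fs))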
The selection must be an mdtraj
formatted atom selection string.
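If you want to check what a selection matches before using it, you can test it with mdtraj directly (a small illustrative sketch, assuming mdtraj is installed; not required for the workflow):
# Test an mdtraj atom selection against the input structure
import mdtraj as md

ref = md.load('../files/alanine/alanine.pdb')
print(ref.topology.select('protein'))  # indices of all protein atoms
print(ref.topology.select('name CA'))  # just the alpha carbons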
In [18]:
from adaptivemd.analysis.pyemma import PyEMMAAnalysis
PyEMMAAnalysis is the object that computes an MSM model from the existing trajectories you pass it. It is initialized with a .pdb file that is used to create features between the $C_\alpha$ atoms. This implementation requires a PDB, but in general that is not necessary; it is specific to this PyEMMAAnalysis showcase.
In [19]:
modeller = PyEMMAAnalysis(
engine=engine,
outtype='protein',
features={'add_inverse_distances': {'select_Backbone': None}}
).named('pyemma')
Again we name it pyemma
for later reference.
We specified which output type from the engine we want to analyse. We chose the protein trajectories since these are faster to load and have better time resolution.
The features dict expresses which features to use in the analysis. In this case we will use all inverse distances between backbone $C_\alpha$ atoms.
The next step is to add the generators to the project for later use. We pick the .generators store and just add them. Think of a store as a set() in Python that additionally saves each added object to the database. It contains each object only once and is not ordered, so we need a name to find the objects later. Of course you can always iterate over all objects, but the order is not guaranteed.
In [20]:
#project.generators.add(engine)
#project.generators.add(modeller)
project.generators.add([engine, modeller])
len(project.generators)
Out[20]:
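Because the store is unordered, the name is what lets you find an object again, for instance after reopening the project. A minimal sketch that simply iterates the store:
# Retrieve a stored generator by its .name (stores are iterable but unordered)
openmm_engine = next(g for g in project.generators if g.name == 'openmm')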
Note that you cannot add the same engine instance twice (nor any stored object to its store again). If you create a new but equivalent engine, it will be considered different and can hence be stored again.
In [21]:
project.generators.add(engine)
len(project.generators)
Out[21]:
Finally, we are ready to run a first trajectory that we will store as a point of reference in the project. It is also nice to see how things work in general.
We are using a Worker approach. This simply means that someone (in our case the user, from inside a script or a notebook) creates a list of tasks to be done, and some other instance (the worker) actually does the work.
First we create the parameters for the engine to run the simulation. We use a Trajectory object (a special File with an initial frame and a length) as the input. You could of course pass these things separately, but this way we can reference the not-yet-existing trajectory and work with it.
A Trajectory should have a unique name, and so there is a project function to do this automatically. It uses numbers and makes sure that this number has not been used yet in the project. The data will be stored in the "$HOME/admd/projects/<project.name>/trajs/<traj.name>" directory we set in the configuration, i.e.:
$HOME/admd/projects/tutorial/trajs/00000000/
In [22]:
trajectory = project.new_trajectory(engine['pdb_file'], 100, engine)
#trajectory = project.new_trajectory(pdb_file, 100, engine)
trajectory
Out[22]:
This says: the trajectory starts from alanine.pdb, runs for 100 frames, and is named xxxxxxxx. This is the name of a folder in the data directory, where trajectory files will be stored. Multiple atom selections, e.g. protein and all atoms, may be written to create multiple files in this folder. We will refer to these distinct trajectories as the outtypes later.
You might wonder why a Trajectory object is necessary. You could just build a function that takes these parameters and runs a simulation, returning the trajectory object at the end, the same object we created just now.
One main reason is to use it as a so-called Promise in AdaptiveMD's asynchronous execution framework. The trajectory object we built acts as a Promise, so what is that exactly?
A Promise is a value (or an object) that represents the result of a function at some point in the future. In our case it represents a trajectory at some point in the future. Normal promises have specific functions to deal with the unknown result; for us this is a little different, but the general concept stands. We create an object that represents the specification of a Trajectory, so regardless of whether the corresponding data file exists yet, we can use the trajectory as if it already existed and build operations on it.
We see the second reason by considering the object after the promise is fulfilled. We then have an object that offers a lightweight view on the trajectory data it represents, for inspection and sampling. Later we will use it as a convenient way to view analysis results.
Trajectory objects are list-like. Get the length:
In [23]:
print(trajectory.length)
And since the length is fixed, we know how many frames there are and can access them:
In [24]:
print(trajectory[20].exists)
print(trajectory[20])
print(trajectory[19].exists)
print(trajectory[19])
The extend method elongates the trajectory in an additional task:
In [25]:
print(trajectory.extend(100))
The run method gives us a task that will run an MD simulation and create the trajectory:
In [26]:
print(trajectory.run())
We can ask to extend it, and we can save it. We can reference specific frames in it before running a simulation. You could even build a whole set of related simulations this way without running a single frame. This is quite powerful, especially in the context of running asynchronous simulations.
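To illustrate, here is a hedged sketch of chaining operations on trajectories that do not exist yet. Nothing runs until the resulting tasks are queued; seeding a new trajectory from a frame of a promised one follows the Promise semantics described above:
# Illustrative only: describe a small tree of related simulations
# without running a single frame of MD
first = project.new_trajectory(engine['pdb_file'], 100, engine)
extend_task = first.extend(100)  # a task that elongates `first` by 100 frames
branch = project.new_trajectory(first[50], 100, engine)  # seeded from frame 50 of the future `first`
branch_task = branch.run()  # a task that will produce `branch`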
Now we want this trajectory to actually exist, so we have to make it. This requires a Task object that knows how to describe the execution of a simulation. Since Task objects are very flexible and can be complex, there are helper functions (i.e. factories) to create them in an easy manner, like the ones we created before.
Use the trajectory (which knows its engine) to call .run() and save the returned task object so we can work with it directly.
In [27]:
task = trajectory.run()
That's it: we just take a trajectory description and turn it into a task that contains the shell commands, needed files, etc. Use the property trajectory.exists to see whether the trajectory object is associated with any data.
In [28]:
trajectory.exists
Out[28]:
Finally, we need to add this task to the things we want done. This is easy and only requires saving the task to the project. This is done via the project.tasks bundle, and once the task has been stored it can be picked up by any worker for execution.
Note that you should be able to submit a trajectory directly like this; however, in practical situations additional operations are often required in the pre- and post-task stages (outside the scope of this tutorial), so trajectories are usually converted to tasks prior to queueing in the project.
In [29]:
#FIXME#project.queue(trajectory)
# shortcut for project.tasks.add(task)
project.queue(task)
In [30]:
len(project.tasks)
Out[30]:
That is all we can do from here. To execute the tasks you need to create a worker using the adaptivemdworker
command from the shell:
adaptivemdworker tutorial --verbose
For the simple setup in this tutorial, if adaptivemdworker is not already in your PATH, you can navigate to the scripts directory first:
cd home/of/adaptivemd/scripts/
The worker is responsible for managing the project file structure. While a worker is running, and as project entries change in the database, the directory and subdirectories of "$shared_path/projects/tutorial" are created and modified. The project subfolders "trajs" and "models" are populated with trajectories and models of the corresponding names as workers complete tasks that are entered into the database.
If you have a database set up remotely, you will need to tell the worker where to look for this connection, just as with the AdaptiveMD application. There are 3 options you can use for this:
1. "-d" to set the database host IP
2. "-p" to set the database port on its host
3. "-l" to set the full database location, optionally with the URL prefix "mongodb://"
i.e. Host IP
adaptivemdworker tutorial -d 8.8.8.8
Host port number
adaptivemdworker tutorial -p 27018
Full location
adaptivemdworker tutorial -l 8.8.8.8:27018
or
adaptivemdworker tutorial -l mongodb://8.8.8.8:27018/
or
adaptivemdworker tutorial -d 8.8.8.8 -p 27018
In [31]:
task.state
project.tasks.all.state
Out[31]:
In [32]:
# block until the worker finishes the task; afterwards there are
# data files & folders associated with the trajectory
project.wait_until(task.is_done)
In [33]:
task.state
Out[33]:
If you are done for now, it's also good practice to relieve your workers (and save yourself some compute time charges on HPC resources!). You don't have to, even if you're closing the project's database connection: workers are associated with the project and will pick up any tasks entered at any later point.
In [34]:
# use the 'one' method inherited from bundle to see available methods
# for the worker type, such as 'execute'
#project.workers.one.execute('shutdown')
# but use 'all' method in practice to apply across all members of
# the workers bundle in the typical case, where you have many workers
project.workers.all.execute('shutdown')
Out[34]:
The final project.close() will close the DB connection. The database daemon running outside the notebook must be shut down separately.
In [35]:
project.close()
In [ ]: