2.1 Advanced Indexing

Indexing files

As was shown earlier, we can create an index of the data space using the index() method:


In [1]:
import signac

project = signac.get_project(root='projects/tutorial')
index = list(project.index())

for doc in index[:3]:
    print(doc)


{'fluid': 'ideal gas', 'root': '/home/johndoe/signac-examples/notebooks/projects/tutorial/workspace', 'V_liq': 0.0, 'signac_id': '0e909ffdba496bbb590fbce31f3a4563', 'V_gas': 294.1176470588235, 'statepoint': {'kT': 1.0, 'b': 0, 'p': 3.4000000000000004, 'a': 0, 'N': 1000}, '_id': '0e909ffdba496bbb590fbce31f3a4563'}
{'fluid': 'ideal gas', 'root': '/home/johndoe/signac-examples/notebooks/projects/tutorial/workspace', 'V_liq': 0.0, 'signac_id': '10743bc8b95bffab09503bce9abbe627', 'V_gas': 10000.0, 'statepoint': {'kT': 1.0, 'b': 0, 'p': 0.1, 'a': 0, 'N': 1000}, '_id': '10743bc8b95bffab09503bce9abbe627'}
{'fluid': 'water', 'root': '/home/johndoe/signac-examples/notebooks/projects/tutorial/workspace', 'V_liq': 30.659766945026785, 'signac_id': '11d8997f19b8ba53d2360ee9fb1606fa', 'V_gas': 416.5817831941532, 'statepoint': {'kT': 1.0, 'b': 0.03049, 'p': 1.2000000000000002, 'a': 5.536, 'N': 1000}, '_id': '11d8997f19b8ba53d2360ee9fb1606fa'}

We will use the Collection class to manage the index directly in memory:


In [2]:
index = signac.Collection(project.index())

This enables us, for example, to quickly search for all index documents related to a specific state point:


In [3]:
for doc in index.find({'statepoint.p': 0.1}):
    print(doc)


{'fluid': 'argon', 'root': '/home/johndoe/signac-examples/notebooks/projects/tutorial/workspace', 'V_liq': 32.804113976682224, 'signac_id': 'f803d91519e23a9eee19fd9e789eeb2e', 'V_gas': 8430.935727416612, 'statepoint': {'kT': 1.0, 'a': 1.355, 'p': 0.1, 'b': 0.03201, 'N': 1000}, '_id': 'f803d91519e23a9eee19fd9e789eeb2e'}
{'fluid': 'ideal gas', 'root': '/home/johndoe/signac-examples/notebooks/projects/tutorial/workspace', 'V_liq': 0.0, 'signac_id': '10743bc8b95bffab09503bce9abbe627', 'V_gas': 10000.0, 'statepoint': {'kT': 1.0, 'a': 0, 'p': 0.1, 'b': 0, 'N': 1000}, '_id': '10743bc8b95bffab09503bce9abbe627'}
{'fluid': 'water', 'root': '/home/johndoe/signac-examples/notebooks/projects/tutorial/workspace', 'V_liq': 30.659799008990184, 'signac_id': '40405b550e7cc2d127b9758d0e764672', 'V_gas': 4999.915100495509, 'statepoint': {'kT': 1.0, 'a': 5.536, 'p': 0.1, 'b': 0.03049, 'N': 1000}, '_id': '40405b550e7cc2d127b9758d0e764672'}
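
The key 'statepoint.p' in the query above addresses a value nested inside the statepoint dictionary. To illustrate the idea, here is a minimal sketch of such a dotted-key match in plain Python. The helpers resolve() and matches() are hypothetical names made up for this example; this is not how signac implements find() internally.

```python
def resolve(doc, dotted_key):
    """Follow a dotted key such as 'statepoint.p' through nested dicts."""
    value = doc
    for part in dotted_key.split('.'):
        value = value[part]
    return value


def matches(doc, query):
    """Return True if every dotted key in the query equals the given value."""
    try:
        return all(resolve(doc, key) == expected for key, expected in query.items())
    except (KeyError, TypeError):
        # The key is missing or an intermediate value is not a dict.
        return False


# Made-up documents shaped like the index documents shown above.
docs = [
    {'fluid': 'argon', 'statepoint': {'p': 0.1, 'N': 1000}},
    {'fluid': 'water', 'statepoint': {'p': 1.2, 'N': 1000}},
]
hits = [doc for doc in docs if matches(doc, {'statepoint.p': 0.1})]
print([doc['fluid'] for doc in hits])
```

Documents that lack the requested key simply do not match, which is why the sketch treats a failed lookup as "no match" rather than as an error.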

At this point the index contains information about the state point and all data stored in the job document. If we also want the index to include the V.txt text files that we used to store data, we need to tell signac the filename pattern and, optionally, the file format.


In [4]:
index = signac.Collection(project.index(r'.*\.txt'))
for doc in index.find(limit=2):
    print(doc)


{'root': '/home/johndoe/signac-examples/notebooks/projects/tutorial/workspace', 'signac_id': '304357838edbf2ec730f4847bb8a0e20', 'filename': '304357838edbf2ec730f4847bb8a0e20/V.txt', '_id': '0f0ea18abc2bf4eef4dfea3cc5f34547', 'file_id': '98f41c5bed6b5579285d113d2c36ffb9', 'md5': '98f41c5bed6b5579285d113d2c36ffb9', 'statepoint': {'kT': 1.0, 'a': 0, 'p': 10.0, 'b': 0, 'N': 1000}, 'format': 'File'}
{'fluid': 'argon', 'root': '/home/johndoe/signac-examples/notebooks/projects/tutorial/workspace', 'V_gas': 110.71550646813046, 'signac_id': '1f147aff97cbbda8aa7c4457a9b51159', 'V_liq': 32.801209285961185, 'statepoint': {'kT': 1.0, 'a': 1.355, 'p': 4.5, 'b': 0.03201, 'N': 1000}, '_id': '1f147aff97cbbda8aa7c4457a9b51159'}

The index contains basic information about the files within our data space, such as the path and the MD5 hash sum. The format field currently says File, which is the default value.
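
To get a feeling for where the filename and md5 fields come from, the following sketch walks a directory, keeps the files whose name matches a regular expression, and computes their MD5 hash sum with the standard library. The function describe_files() and the directory layout are assumptions made for this example. Whether signac matches the pattern against the bare file name or the full path is not covered here; the sketch uses the bare name.

```python
import hashlib
import os
import re
import tempfile


def describe_files(root, pattern):
    """Yield one small record per file below root whose name matches pattern."""
    regex = re.compile(pattern)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not regex.match(name):
                continue
            path = os.path.join(dirpath, name)
            with open(path, 'rb') as file:
                digest = hashlib.md5(file.read()).hexdigest()
            yield {'filename': os.path.relpath(path, root), 'md5': digest}


# Build a throwaway directory that mimics one job directory in a workspace.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'job_a'))
    with open(os.path.join(root, 'job_a', 'V.txt'), 'wb') as file:
        file.write(b'0.0,178.57\n')
    with open(os.path.join(root, 'job_a', 'notes.log'), 'wb') as file:
        file.write(b'not matched by the pattern\n')
    records = list(describe_files(root, r'.*\.txt'))

print(records)
```

Only V.txt ends up in the records, because notes.log does not match the pattern.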

We can specify that all files ending with .txt should be assigned the TextFile format:


In [5]:
index = signac.Collection(project.index({r'.*\.txt': 'TextFile'}))
print(index.find_one({'format': 'TextFile'}))


{'root': '/home/johndoe/signac-examples/notebooks/projects/tutorial/workspace', 'signac_id': '2f029fb9a2e67621efb884dd9906ceb6', 'filename': '2f029fb9a2e67621efb884dd9906ceb6/V.txt', '_id': '93de9c645b47ecb8c252b5a1f4468588', 'file_id': 'dfe07d8958168d574f62858008123ded', 'md5': 'dfe07d8958168d574f62858008123ded', 'statepoint': {'kT': 1.0, 'a': 0, 'p': 5.6, 'b': 0, 'N': 1000}, 'format': 'TextFile'}
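
The dictionary passed to index() maps filename patterns to format names. As a rough illustration of such a lookup, the sketch below returns the format of the first pattern that matches a file name. The helper format_for() is a hypothetical name, and the choice to return None for files that match no pattern is an assumption of this sketch, not documented signac behavior.

```python
import re

# Same shape as the argument used above: pattern -> format name.
definitions = {r'.*\.txt': 'TextFile'}


def format_for(filename, definitions):
    """Return the format of the first matching pattern, or None if none matches."""
    for pattern, format_name in definitions.items():
        if re.match(pattern, filename):
            return format_name
    return None


print(format_for('V.txt', definitions))
print(format_for('V.dat', definitions))
```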

Generating a Master Index

A master index is compiled from multiple other indexes, which is useful when operating on data compiled from multiple sources, such as multiple signac projects.

To make a data space part of a master index, we need to create a signac_access.py module. We use the access module to define how the index for that particular data space is to be generated. We can create a basic access module using the Project.create_access_module() function:


In [6]:
# Let's make sure to remove any remnants from previous runs...
% rm -f projects/tutorial/signac_access.py

# This will generate a minimal access module:
project.create_access_module(master=False)

% cat projects/tutorial/signac_access.py


import signac

def get_indexes(root):
    yield signac.get_project(root).index()

When compiling a master index, signac will search for access modules named signac_access.py. Whenever it finds a file with that name, it will import the module and compile all indexes yielded from a function called get_indexes() into the master index.

Let's try that!


In [7]:
master_index = signac.Collection(signac.index())
for doc in master_index.find(limit=2):
    print(doc)


{'root': '/home/johndoe/signac-examples/notebooks/projects/tutorial/workspace', 'signac_id': '7baa598db5b1f5c1405b75e3745bd148', 'V_liq': 0.0, '_id': '7baa598db5b1f5c1405b75e3745bd148', 'fluid': 'ideal gas', 'V_gas': 149.2537313432836, 'statepoint': {'kT': 1.0, 'a': 0, 'p': 6.7, 'b': 0, 'N': 1000}, 'format': None}
{'root': '/home/johndoe/signac-examples/notebooks/projects/tutorial/workspace', 'signac_id': '1f147aff97cbbda8aa7c4457a9b51159', 'V_liq': 32.801209285961185, '_id': '1f147aff97cbbda8aa7c4457a9b51159', 'fluid': 'argon', 'V_gas': 110.71550646813046, 'statepoint': {'kT': 1.0, 'a': 1.355, 'p': 4.5, 'b': 0.03201, 'N': 1000}, 'format': None}

Please note that we executed the index() function without specifying the project directory. The function crawled through all sub-directories below the root directory in an attempt to find access modules.
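
The crawl-and-import mechanism can be sketched with the standard library alone. The function compile_master_index() below is a hypothetical stand-in for what happens conceptually, not signac's implementation, and the access module it finds yields a hard-coded list instead of a real project index so that the example runs without any signac project.

```python
import importlib.util
import os
import tempfile

ACCESS_MODULE = 'signac_access.py'


def compile_master_index(root):
    """Collect the documents from every access module found below root."""
    documents = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if ACCESS_MODULE not in filenames:
            continue
        # Import the access module from its file path.
        path = os.path.join(dirpath, ACCESS_MODULE)
        spec = importlib.util.spec_from_file_location('access_module', path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # get_indexes() yields indexes; each index is an iterable of documents.
        for index in module.get_indexes(dirpath):
            documents.extend(index)
    return documents


# Stand-in access module: yields one hard-coded index with two documents.
STAND_IN = '''
def get_indexes(root):
    yield [{'_id': 'a', 'root': root}, {'_id': 'b', 'root': root}]
'''

with tempfile.TemporaryDirectory() as top:
    project_dir = os.path.join(top, 'projects', 'demo')
    os.makedirs(project_dir)
    with open(os.path.join(project_dir, ACCESS_MODULE), 'w') as file:
        file.write(STAND_IN)
    master = compile_master_index(top)

print([doc['_id'] for doc in master])
```

Because the access module is ordinary Python, the data space owner decides what get_indexes() yields, and the code that compiles the master index does not need to know anything about the underlying data.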

We can use the access module to control exactly how the index is generated, for example by adding filename and format definitions. Usually we would edit the file directly, but here we will just overwrite the old one:


In [8]:
# Raw strings keep the backslash in the regular expression intact.
access_module = \
r"""import signac

def get_indexes(root):
    yield signac.get_project(root).index({r'.*\.txt': 'TextFile'})
"""

with open('projects/tutorial/signac_access.py', 'w') as file:
    file.write(access_module)

Now files will also be part of the master index!


In [9]:
master_index = signac.Collection(signac.index())
print(master_index.find_one({'format': 'TextFile'}))


{'root': '/home/johndoe/signac-examples/notebooks/projects/tutorial/workspace', 'signac_id': '2f029fb9a2e67621efb884dd9906ceb6', 'filename': '2f029fb9a2e67621efb884dd9906ceb6/V.txt', '_id': '93de9c645b47ecb8c252b5a1f4468588', 'file_id': 'dfe07d8958168d574f62858008123ded', 'md5': 'dfe07d8958168d574f62858008123ded', 'statepoint': {'kT': 1.0, 'a': 0, 'p': 5.6, 'b': 0, 'N': 1000}, 'format': 'TextFile'}

We can use the signac.fetch() function to directly open files associated with a particular index document:


In [10]:
for doc in master_index.find({'format': 'TextFile'}, limit=3):
    with signac.fetch(doc) as file:
        p = doc['statepoint']['p']
        V = [float(v) for v in file.read().strip().split(',')]
        print(p, V)


5.6 [0.0, 178.57142857142858]
1.2000000000000002 [0.0, 833.3333333333333]
3.4000000000000004 [32.80193336746696, 146.6628568456784]

Think of fetch() like the built-in open() function. It allows us to retrieve and open files based on the index document (file id) instead of an absolute file path. This makes it easier to operate on data independently of its physical location.
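
As a simplified picture of this idea, the sketch below opens a file using only the root and filename fields that we saw in the index documents above. The helper open_from_doc() is a hypothetical name, and joining root and filename is an assumption made for illustration; the actual fetch() function may locate files differently.

```python
import os
import tempfile


def open_from_doc(doc, mode='r'):
    """Open the file described by an index document (hypothetical helper)."""
    return open(os.path.join(doc['root'], doc['filename']), mode)


with tempfile.TemporaryDirectory() as workspace:
    # Create a throwaway job directory containing one V.txt file.
    os.mkdir(os.path.join(workspace, 'job_a'))
    with open(os.path.join(workspace, 'job_a', 'V.txt'), 'w') as file:
        file.write('0.0,178.57142857142858\n')

    # A made-up index document with the same fields as the real ones.
    doc = {'root': workspace, 'filename': 'job_a/V.txt', 'statepoint': {'p': 5.6}}
    with open_from_doc(doc) as file:
        V = [float(v) for v in file.read().strip().split(',')]

print(doc['statepoint']['p'], V)
```

The calling code never builds a path by hand; everything it needs travels with the document.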

Please note that we can specify access modules for any kind of data space; it does not have to be a signac project!

In the next section, we will learn how to use indexes in combination with pandas dataframes.