134k Database Access

Han, Kehang (hkh12@mit.edu)

This notebook demonstrates how to access a centralized database hosted on the RMG server. The database currently contains several large tables; the most comprehensive, sdata134k_table, contains all 134k molecules. You are welcome to access the other tables as well. They are mostly subsets of sdata134k_table; for example, small_cyclic_table contains all the cyclic hydrocarbons with fewer than 3 rings.

Set up


In [1]:
from rmgpy.data.rmg import RMGDatabase
from rmgpy import settings
from rmgpy.species import Species
from rmgpy.rmg.main import RMG
from IPython.display import display
import numpy as np
import os
import pandas as pd
from pymongo import MongoClient
import logging
logging.disable(logging.CRITICAL)

In [2]:
def get_data(host, db_name, collection_name, port=27017):
    """Return every document in the given collection as a list of dicts."""
    # connect to the database and query the whole collection
    client = MongoClient(host, port)
    db = getattr(client, db_name)
    collection = getattr(db, collection_name)
    db_cursor = collection.find()

    # collect data (iterating the cursor pulls all documents from the server)
    print('reading data...')
    db_mols = list(db_cursor)
    print('done')

    return db_mols
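get_data reads an entire collection, which is fine for a small table but may be slow for the full 134k table. The sketch below shows one way to fetch only matching documents and selected fields. The helper name fetch_records and its parameters are hypothetical, not part of RMG; it assumes a pymongo-style collection object whose find(filter, projection) returns a cursor with a limit() method.

```python
def fetch_records(collection, query=None, fields=None, limit=0):
    """Return documents from a pymongo-style collection as a list.

    Hypothetical helper for illustration only.
    collection: object exposing find(filter, projection), e.g. a pymongo Collection
    query:      MongoDB filter dict; None or {} matches every document
    fields:     optional list of field names to return (MongoDB adds _id by default)
    limit:      maximum number of documents; 0 means no limit
    """
    # Build a projection dict such as {'Hf298': 1, 'S298': 1}, or None for all fields.
    projection = dict((name, 1) for name in fields) if fields else None
    cursor = collection.find(query or {}, projection)
    if limit:
        cursor = cursor.limit(limit)
    return list(cursor)

# Possible usage against the server (not executed here):
#   client = MongoClient(host, port)
#   collection = client[db_name][collection_name]
#   fetch_records(collection, {'SMILES_input': 'C1CC1'}, fields=['Hf298', 'S298'])
```

Passing the collection in, rather than the host and table names, keeps the connection logic separate from the query logic, so the same helper can be reused across tables.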

In [3]:
database = RMGDatabase()

In [4]:
database.load(settings['database.directory'], thermoLibraries=[],
              kineticsFamilies='none', kineticsDepositories='none', reactionLibraries=[])

thermoDatabase = database.thermo

In [5]:
# fetch testing dataset
db_name = 'sdata134k'
collection_name = 'small_cyclic_table'
host = 'mongodb://user:user@rmg.mit.edu/admin'
port = 27018
db_mols = get_data(host, db_name, collection_name, port)
len(db_mols)


reading data...
done
Out[5]:
2903

In [6]:
# Don't use G298: it is in hartrees and is not a formation quantity
# Hf298 in kcal/mol
# S298 in cal/(mol*K)
db_mols[0]


Out[6]:
{u'Cv298': 11.041,
 u'G298': -117.849087,
 u'Hf298': 10.906934730002945,
 u'S298': 60.20433858795673,
 u'SMILES_input': u'C1CC1',
 u'SMILES_output': u'C1CC1',
 u'_id': ObjectId('58d95b9cfa01b636114d63ce'),
 u'atom_list': [u'C', u'C', u'C', u'H', u'H', u'H', u'H', u'H', u'H'],
 u'mol_idx': u'16',
 u'mulliken_e_list': [-0.222941,
  -0.222793,
  -0.222925,
  0.111459,
  0.111459,
  0.111429,
  0.111428,
  0.111442,
  0.111442],
 u'x_list': [-0.0119327974,
  1.3029911772,
  0.0086717414,
  -0.3054146435,
  -0.3227549828,
  1.8849339609,
  1.9022663059,
  -0.2708804319,
  -0.2882198296],
 u'y_list': [1.5143319798,
  0.7788656204,
  0.0076704221,
  2.0170211723,
  2.0268015645,
  0.7914501983,
  0.7816539079,
  -0.5128912474,
  -0.503104088],
 u'z_list': [0.010316996,
  -0.0061784231,
  0.0020103247,
  0.9253323893,
  -0.8934783865,
  -0.9211939346,
  0.8976637317,
  0.9113747374,
  -0.9074082449]}
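Each record stores the geometry as parallel lists (atom_list, x_list, y_list, z_list). As a minimal sketch, the hypothetical helper below, record_to_xyz, zips those lists into an XYZ-format string that most molecular viewers can read. It assumes the coordinates are Cartesian values in angstroms, which the record itself does not state. The sample values are rounded from the cyclopropane record shown in Out[6].

```python
def record_to_xyz(record, comment=None):
    """Format one database record as an XYZ-format string (hypothetical helper)."""
    atoms = record['atom_list']
    xs, ys, zs = record['x_list'], record['y_list'], record['z_list']
    if not (len(atoms) == len(xs) == len(ys) == len(zs)):
        raise ValueError('atom and coordinate lists differ in length')
    if comment is None:
        # Use the SMILES string as the comment line when one is available.
        comment = record.get('SMILES_output', '')
    # XYZ layout: atom count, comment line, then one "symbol x y z" line per atom.
    lines = [str(len(atoms)), comment]
    for symbol, x, y, z in zip(atoms, xs, ys, zs):
        lines.append('{0:<2s} {1:14.8f} {2:14.8f} {3:14.8f}'.format(symbol, x, y, z))
    return '\n'.join(lines)

# Cyclopropane, coordinates rounded to 4 decimals from the record in Out[6].
sample = {
    'SMILES_output': 'C1CC1',
    'atom_list': ['C', 'C', 'C', 'H', 'H', 'H', 'H', 'H', 'H'],
    'x_list': [-0.0119, 1.3030, 0.0087, -0.3054, -0.3228, 1.8849, 1.9023, -0.2709, -0.2882],
    'y_list': [1.5143, 0.7789, 0.0077, 2.0170, 2.0268, 0.7915, 0.7817, -0.5129, -0.5031],
    'z_list': [0.0103, -0.0062, 0.0020, 0.9253, -0.8935, -0.9212, 0.8977, 0.9114, -0.9074],
}
print(record_to_xyz(sample))
```

With the live data, the same call would be record_to_xyz(db_mols[0]).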
