Jeff's initial writeup -- 2019_01_21
It's really important that both RDKitToolkitWrapper and OpenEyeToolkitWrapper load the same file to equivalent OFFMol objects. If the ToolkitWrappers don't create the same OFFMol from a single file/external molecule representation, it's possible that the OpenForceField toolkit will assign the same molecule different parameters depending on whether OE or RDK is running on the backend(1). This notebook is designed to help us see which molecule formats can be reliably loaded using both toolkits.
The first test that is performed is loading from 3D. We have OpenEye load MiniDrugBank_tripos.mol2
. RDKit can not load mol2, but can load SDF, so I've used openbabel(2) to convert MiniDrugBank_tripos.mol2
to SDF (MiniDrugBank.sdf
), and then we have RDKit load that.
This is, surprisingly, an unresolved question. "Equivalent" for our purposes means "would have the same SMIRKS-based parameters applied, for any valid SMIRKS". Since we don't have an offxml
file containing parameters for all valid SMIRKS (because there are infinity valid SMIRKS, and because we'd also need an infinitely large molecule test set to contain representatives of all those infinity substructures), I'd like to find some sort of test other than SMIRKS matching that will answer this question.
This notebook uses three comparison methods:
_As a note for future efforts, it's possible that, since each SMILES is also a valid SMIRKS, we could test for equality between OFFMol_1 and OFFMol_2 by seeing if OFFMol_1's SMILES is found in a SMIRKS substructure search of OFFMol2, and vice versa.
Each of these has some weaknesses.
to_rdkit
. or to_openeye
function not implemented correctly? Or is the mismatch due to some strange toolkit quirk or quiet sanitization that goes on?This isn't the end of the world. We'd just remove SDF from RDKit's ALLOWED_READ_FORMATS
list(4). RDKit's main input method would then just be SMILES. But actually we should ensure that...
My initial thought here is that we could have both toolkits load molecules from a SMILES database. "I converted MiniDrugBank_tripos.mol2
to SDF so RDKit could read it, why not convert it to SMILES too?".
But conversion from 3D to SMILES requires either OpenEye or RDKit, which relies on their 3D molecule interpretation being correct... which is what we're testing here in the first place!
So, I couldn't think of a clear resolution to this problem.
For background, Shuzhe Wang did some really good foundational work on toolkit molecule perception equivalence earlier in this project. Shuzhe's successful test of forcefield parameter application equivalence (5) used OE to load a test databse, convert the molecules to SMILES, and then had RDKit load those SMILES. It then tested whether the OE-loaded molecule and the RDKit-loaded molecule got identical forcefield parameters assigned. But this didn't test whether RDKit can load from 3D, just from SMILES. And not just any SMILES -- SMILES that came from OpenEye (including any sneaky sanitizaton that it could have done when loading from 3D). And since we didn't have a toolkit-independent molecule representation then (no OFFMols), he tested whether both molecules got the same forcefield parameters applied, which suffers from the "our current SMIRKS may not be fine-grained enough to catch all meaningful molecule differences" problem.
Shuzhe's approach does highlight one way to get a test database of SMILES -- Load a 3D database using OpenEyeToolkitWrapper, then write all the resulting OFFMols to SMILES, and then have RDKitToolkitWrapper read all the SMILES and create OFFMols, then test whether both sets of OFFMols are identical.
And, in case OE's molecule santiization is making us end up with unusually clean SMILES, we can do the same test in reverse -- that is, use RDKitToolkitWrapper to load a 3D SDF, write the resulting OFFMols to SMILES, and then have OpenEyeToolkitWrapper read the SMILES, and compare the resulting OFFMols.
oe_mols_from 3d
: The results of calling OFFMol.from_file('MiniDrugBank_tripos.mol2', toolkit_registry=an_oe_toolkit_wrapper)
.
rdk_mols_from 3d
: The results of calling OFFMol.from_file('MiniDrugBank.sdf', toolkit_registry=a_rdk_toolkit_wrapper)
.
rdk_mols_from_smiles_from_oe_mols_from_3d
: The result of taking oe_mols_from_3d
, converting them all to SMILES using OpenEyeToolkitWrapper, then reading SMILES to OFFMols using RDKitToolkitWrapper.
oe_mols_from_smiles_from_rdk_mols_from_3d
: The result of taking rdk_mols_from_3d
, converting them all to SMILES using RDKitToolkitWrapper, then reading SMILES to OFFMols using OpenEyeToolkitWrapper.
Graph isomorphism
OFFMol.to_smiles() using OpenEyeToolkitWrapper
OFFMol.to_smiles using RDKitToolkitWrapper
When the notebook is run using the current codebase, the important outputs are as follows:
In comparing oe_mols_from_3d to rdk_mols_from_3d, I found 284 graph matches, 293 OE SMILES matches (26 comparisons errored), and 293 RDKit SMILES matches (7 comparisons errored) out of 346 molecules
This is the core test that we wanted to do (loading the same molecules from 3D), and these are pretty good results. We pass the strictest test (graph isomorphism) for 284 out of 346 molecules. We see the the SMILES tests are a little more forgiving, both giving 293/346 successes.
In comparing oe_mols_from_3d to rdk_mols_from_smiles_from_oe_mols_from_3d, I found 293 graph matches, 293 OE SMILES matches (26 comparisons errored), and 332 RDKit SMILES matches (0 comparisons errored) out of 365 molecules
In comparing oe_mols_from_smiles_from_rdk_mols_from_3d to rdk_mols_from_3d, I found 310 graph matches, 219 OE SMILES matches (0 comparisons errored), and 304 RDKit SMILES matches (7 comparisons errored) out of 319 molecules
Based on this test, we can expect identical parameterization between RDKit and OpenEye ToolkitWrappers at least 82% (284/346) of the time. The real number is likely to be higher, as the SMILES comparisons are more realistic measures of identity. Also, this limit assumes infinitely specific SMIRKS.
Passing molecules FROM rdkit TO OE is troublesome largely because rdkit won't recognize stereo around trivalent nitrogens, whereas OE will. See discussion about "how much stereochemistry info do we want specified" here: https://github.com/openforcefield/openforcefield/issues/146
There is low-hanging fruit that will improve reliability.
To continue debugging differences between OE and RDK, consider setting the "verbose" argument to compare_mols_using_smiles
to True in the final cell of this workbook. Or, look at the molecule loading errors during dataset construction and add logic to catch those cases.
(1) Some molecule perception differences will be covered up by the fact that our SMIRKSes are somewhat coarse-grained currently (one SMIRKS can cover a lot of chemical space). But, this won't always be the case, as we may add finer-grained SMIRKS later. So, taking care of toolkit perception differences now can keep us from getting "surprised" in the future.
(2)conda list
--> openbabel 1!2.4.0 py37_2 omnia
(3) OFFMol.to_smiles()
actually operates by converting the OFFMol to an OEMol or RDMol (depending on which toolkit is available), and then uses the respective toolkit to create a SMILES. You can control which toolkit is used by instantialing an openforcefield.utils.toolkits.OpenEyeToolkitWrapper
or RDKitToolkitWrapper
, and passing it as the toolkit_registry
argument to to_smiles
(eg. OFFMol.to_smiles(toolkit_registry=my_toolkit_wrapper)
)
(4) If they really wanted to load from SDF using RDKit, a determined OFF toolkit user could just make an RDMol manually from SDK (Chem.MolSupplierFromFile()
or something like that), and then run the molecules through RDKitToolkitWrapper.from_rdkit()
to get OFFMol. However, this would put the burden of correct chemical perception on them.
(5) https://github.com/openforcefield/openforcefield/blob/swang/examples/forcefield_modification/RDKitComparison.ipynb, search for molecules =
In [1]:
from openforcefield.utils.toolkits import RDKitToolkitWrapper, OpenEyeToolkitWrapper, UndefinedStereochemistryError
from openforcefield.utils import get_data_file_path
import difflib
OETKW = OpenEyeToolkitWrapper()
RDKTKW = RDKitToolkitWrapper()
In [2]:
from openforcefield.topology.molecule import Molecule
from rdkit import Chem
rdk_mols_from_3d = Molecule.from_file(get_data_file_path('molecules/MiniDrugBank.sdf'),
toolkit_registry = RDKTKW,
file_format='sdf',
allow_undefined_stereo=True)
# Known loading problems (numbers mean X: "DrugBank_X"):
# Pre-condition violations ("Stereo atoms should be specified...") for 3787 and 2684 are due to R-C=C=C-R motifs
# Sanitization errors:
# Nitro : 2799, 5415, 472, 794, 3739, 1570, 1802, 4865, 2465
# C#N : 3046, 3655, 1594, 6947, 2467
# C-N(=O)(C) : 6353
# Complicated aromatic situation with nitrogen?: 1659, 1661(?), 7049
# Unknown: 4346
In [3]:
# NBVAL_SKIP
from openforcefield.topology import Molecule
#molecules = Molecule.from_file(get_data_file_path('molecules/MiniDrugBank.sdf'),
oe_mols_from_3d = Molecule.from_file(get_data_file_path('molecules/MiniDrugBank_tripos.mol2'),
toolkit_registry = OETKW,
file_format='sdf',
allow_undefined_stereo=False)
In [4]:
# NBVAL_SKIP
rdk_mols_from_smiles_from_oe_mols_from_3d = []
for oe_mol in oe_mols_from_3d:
if oe_mol is None:
continue
smiles = oe_mol.to_smiles(toolkit_registry=OETKW)
try:
new_mol = Molecule.from_smiles(smiles,
toolkit_registry=RDKTKW,
hydrogens_are_explicit=True
)
new_mol.name = oe_mol.name
rdk_mols_from_smiles_from_oe_mols_from_3d.append(new_mol)
except Exception as e:
print(smiles)
print(e)
print()
pass
In [5]:
oe_mols_from_smiles_from_rdk_mols_from_3d = []
for rdk_mol in rdk_mols_from_3d:
try:
smiles = rdk_mol.to_smiles(toolkit_registry=RDKTKW)
new_mol = Molecule.from_smiles(smiles,
toolkit_registry=OETKW,
hydrogens_are_explicit=True
)
new_mol.name = rdk_mol.name
oe_mols_from_smiles_from_rdk_mols_from_3d.append(new_mol)
except UndefinedStereochemistryError as e:
print(smiles)
print(e)
print()
pass
In [6]:
def print_smi_difference(mol_name,
smi_1, smi_2,
smi_1_label='OE: ',
smi_2_label='RDK:'):
print(mol_name)
print(smi_1_label, smi_1)
print(smi_2_label, smi_2)
differences = list(difflib.ndiff(smi_1, smi_2))
msg = ''
for i,s in enumerate(differences):
if s[0]==' ':
continue
elif s[0]=='-':
msg += u'Delete "{}" from position {}\n'.format(s[-1],i)
elif s[0]=='+':
msg += u'Add "{}" to position {}\n'.format(s[-1],i)
# Sometimes the diffs get really big. Skip printing in those cases
if msg.count('\n') < 5:
print(msg)
print()
def compare_mols_using_smiles(mol_1, mol_2,
toolkit_wrapper,
mol_1_label,
mol_2_label,
verbose=False):
mol_1_smi = mol_1.to_smiles(toolkit_registry=toolkit_wrapper)
mol_2_smi = mol_2.to_smiles(toolkit_registry=toolkit_wrapper)
if mol_1_smi == mol_2_smi:
return True
else:
if verbose:
print_smi_difference(mol_1.name,
mol_1_smi, mol_2_smi,
smi_1_label=mol_1_label,
smi_2_label=mol_2_label)
return False
def compare_mols_using_nx(mol_1, mol_2):
return mol_1 == mol_2
In [7]:
# NBVAL_SKIP
mol_sets_to_compare = ((('oe_mols_from_3d', oe_mols_from_3d),
('rdk_mols_from_3d', rdk_mols_from_3d)),
#(('oe_mols_from_smiles', oe_mols_from_smiles),
# ('rdk_mols_from_smiles', rdk_mols_from_smiles)),
(('oe_mols_from_3d', oe_mols_from_3d),
('rdk_mols_from_smiles_from_oe_mols_from_3d', rdk_mols_from_smiles_from_oe_mols_from_3d)),
(('oe_mols_from_smiles_from_rdk_mols_from_3d', oe_mols_from_smiles_from_rdk_mols_from_3d),
('rdk_mols_from_3d', rdk_mols_from_3d)),
)
for (set_1_name, mol_set_1), (set_2_name, mol_set_2) in mol_sets_to_compare:
set_1_name_to_mol = {mol.name:mol for mol in mol_set_1 if not(mol is None)}
set_2_name_to_mol = {mol.name:mol for mol in mol_set_2 if not(mol is None)}
names_in_common = set(set_1_name_to_mol.keys()) & set(set_2_name_to_mol.keys())
print()
print()
print()
print('There are {} molecules in the {} set'.format(len(mol_set_1), set_1_name))
print('There are {} molecules in the {} set'.format(len(mol_set_2), set_2_name))
print('These sets have {} molecules in common'.format(len(names_in_common)))
graph_matches = 0
rdk_smiles_matches = 0
oe_smiles_matches = 0
errored_graph_comparisons = 0
errored_rdk_smiles_comparisons = 0
errored_oe_smiles_comparisons = 0
for name in names_in_common:
set_1_mol = set_1_name_to_mol[name]
set_2_mol = set_2_name_to_mol[name]
nx_match = compare_mols_using_nx(set_1_mol, set_2_mol)
if nx_match:
graph_matches += 1
try:
rdk_smi_match = compare_mols_using_smiles(set_1_mol, set_2_mol,
RDKTKW,
'OE--(RDKTKW)-->SMILES: ','RDK--(RDKTKW)-->SMILES:',
verbose = False)
if rdk_smi_match:
rdk_smiles_matches += 1
except Exception as e:
errored_rdk_smiles_comparisons += 1
print(e)
try:
#if 1:
oe_smi_match = compare_mols_using_smiles(set_1_mol, set_2_mol,
OETKW,
'OE--(OETKW)-->SMILES: ','RDK--(OETKW)-->SMILES:',
verbose = False)
if oe_smi_match:
oe_smiles_matches += 1
except:
errored_oe_smiles_comparisons += 1
print("In comparing {} to {}, I found {} graph matches, {} OE SMILES matches ({} comparisons errored), and " \
"{} RDKit SMILES matches ({} comparisons errored) out of {} molecules".format(set_1_name,
set_2_name,
graph_matches,
rdk_smiles_matches,
errored_rdk_smiles_comparisons,
oe_smiles_matches,
errored_oe_smiles_comparisons,
len(names_in_common)))