2. Signatures and duplicates selection

(c) 2019, Dr. Ramil Nugmanov; Dr. Timur Madzhidov; Ravil Mukhametgaleev

For installation instructions for the CGRtools package and for the tutorial files, see https://github.com/cimm-kzn/CGRtools

NOTE: Run the tutorial cells sequentially from the start. Running cells out of order will lead to unexpected results.

In [ ]:
import pkg_resources
if pkg_resources.get_distribution('CGRtools').version.split('.')[:2] != ['3', '1']:
    print('WARNING. Tutorial was tested on 3.1 version of CGRtools')

In [ ]:
# load data for tutorial
from pickle import load
from traceback import format_exc

with open('molecules.dat', 'rb') as f:
    molecules = load(f) # list of MoleculeContainer objects
with open('reactions.dat', 'rb') as f:
    reactions = load(f) # list of ReactionContainer objects

m1, m2, m3 = molecules[:3] # molecule
m7 = m3.copy()
m11 = m3.copy()
r1 = reactions[0] # reaction
cgr2 = ~r1  
proj = m3.substructure([4,5,6,7,8,9])
benzene = m3.substructure([4,5,6,7,8,9], as_view=False)  # independent MoleculeContainer, not a view
proj_copy = proj.copy()
m3.delete_bond(4, 5) 

2.1. Molecule Signatures

MoleculeContainer has methods for generating a unique molecule signature. The signature is a SMILES string with explicit bond notation and canonical atom ordering. For pyrroles, the signature does not comply with the SMILES rules.

To generate a signature, call the str function on a MoleculeContainer object.
A fixed-length hash of the signature can be retrieved by calling the bytes function on the molecule (it corresponds to a SHA-512 digest).
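To illustrate why a hash is convenient as a fixed-length key, here is a minimal sketch using only the standard library. It does not call CGRtools: the input strings are hand-written stand-ins for signatures, and hashing the UTF-8 encoded text is an assumption about how the digest is produced.

```python
from hashlib import sha512


def signature_digest(signature: str) -> bytes:
    """Return the SHA-512 digest of a signature string.

    Encoding the text as UTF-8 before hashing is an assumption.
    """
    return sha512(signature.encode('utf-8')).digest()


# hand-written illustrative strings, not real CGRtools output
short_digest = signature_digest('C-C-O')
long_digest = signature_digest('C-C-C-C-C-C-C-C-O')

# the digest length does not depend on the signature length
print(len(short_digest), len(long_digest))
```

Both digests are 64 bytes (512 bits) long, whatever the size of the molecule.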

The order of atoms is calculated by a Morgan-like algorithm. In the initial state, an integer code is calculated for each atom based on its type. All bonds incident to the atom are also coded as integers and stored in a sorted tuple. The atom code and the tuple of its bonds are used for ordering atoms and for detecting similar ones. The ranks of the ordered atoms are then replaced with prime numbers from a prime number lookup table. Atoms of the same type with the same types of incident bonds get equal prime numbers.

These prime number codes are used in the Morgan algorithm loop.

On each iteration, for each atom the square of its prime number is multiplied by the prime numbers of its neighboring atoms; the resulting numbers are ranked and prime numbers are assigned again. The loop is repeated until all atoms are unique or the number of unique atoms has not changed in 3 subsequent iterations.
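The procedure described above can be sketched in plain Python. This is a simplified illustration, not the CGRtools implementation: the function names, the input format (atomic numbers as atom codes, bond orders as bond codes), and the short prime table are all assumptions made for the example.

```python
from math import prod

# small lookup table, enough for toy molecules of up to 16 atoms
PRIMES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53)


def rank_to_primes(keys):
    """Replace each atom's sortable key by a prime; equal keys share a prime."""
    table = {key: PRIMES[i] for i, key in enumerate(sorted(set(keys.values())))}
    return {atom: table[key] for atom, key in keys.items()}


def morgan_like(atoms, bonds):
    """atoms: {atom_number: integer code}; bonds: {(atom, atom): integer code}."""
    neighbors = {a: {} for a in atoms}
    for (a, b), order in bonds.items():
        neighbors[a][b] = order
        neighbors[b][a] = order

    # initial state: atom code plus sorted tuple of incident bond codes
    primes = rank_to_primes({a: (code, tuple(sorted(neighbors[a].values())))
                             for a, code in atoms.items()})
    classes = len(set(primes.values()))
    unchanged = 0
    while classes < len(atoms) and unchanged < 3:
        # square of own prime multiplied by the primes of all neighbors
        products = {a: primes[a] ** 2 * prod(primes[b] for b in neighbors[a])
                    for a in atoms}
        primes = rank_to_primes(products)
        new_classes = len(set(primes.values()))
        unchanged = unchanged + 1 if new_classes == classes else 0
        classes = new_classes
    return primes


# chain C1-C2-C3-O4: the oxygen breaks the symmetry, so all atoms become unique
chain = morgan_like({1: 6, 2: 6, 3: 6, 4: 8}, {(1, 2): 1, (2, 3): 1, (3, 4): 1})
# chain C1-C2-C3: the terminal carbons are equivalent and keep equal primes
propane = morgan_like({1: 6, 2: 6, 3: 6}, {(1, 2): 1, (2, 3): 1})
print(chain)
print(propane)
```

In the second example the loop stops through the "no change in 3 iterations" rule, because the two terminal atoms can never be told apart.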

In [ ]:
ms2 = str(m2)  # get signature
print(ms2)     # print signature
# or
print(m2)

hms2 = bytes(m2)  # get sha512 hash of signature as bytes-string

String formatting is supported, which is useful for reporting.

In [ ]:
print(f'f string {m2}')  # use signature in string formatting
print('C-style string %s' % m2)
print('format method {}'.format(m2))

The number of neighbors and the hybridization can be added to the signature. Note that in this case the signature is not readable as SMILES.

For MoleculeContainer and ReactionContainer, query marks are not included in signatures and are not printed by the str and print functions. However, they can be printed in the following way:

In [ ]:
m2.reset_query_marks() # calculate hybridization and number of neighbors
print(f'{m2:hn}')  # get signature with hybridization and neighbors data
print('{:h}'.format(m2))  # get signature with hybridization data only
# h - hybridization marks, n - neighbors marks
format(m2, 'n') # include only number of neighbors in signature
print(m2)   # notice that neighbors and hybridization are hidden, since molecule signatures do not contain this information

Atoms in the signature are represented in the following way: [element_symbol;hn;charge;multiplicity (if not None)], where h means hybridization and n means the number of neighbors. The notation for hybridization is the following:

s - all bonds of the atom are single
d - the atom has one double bond and the others are single
t - the atom has one triple or two double bonds and the others are single
a - the atom is in an aromatic ring

Examples:
s1 - the atom has s hybridization and one neighbor
d3 - the atom has d hybridization and 3 neighbors
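The notation above can be expressed as a small helper. This is a hypothetical sketch, not a CGRtools function: it assumes bond orders are given as integers with 4 standing for an aromatic bond, treats every listed bond as one neighbor, and lumps any combination beyond the listed ones (for example three double bonds) into t.

```python
def hybridization_mark(bond_orders):
    """Return s, d, t or a for the list of bond orders incident to one atom.

    Assumed encoding: 1 single, 2 double, 3 triple, 4 aromatic.
    """
    if 4 in bond_orders:
        return 'a'  # atom takes part in an aromatic ring
    doubles = bond_orders.count(2)
    triples = bond_orders.count(3)
    if triples or doubles >= 2:
        return 't'  # one triple or two double bonds
    if doubles == 1:
        return 'd'  # exactly one double bond
    return 's'      # only single bonds


def query_mark(bond_orders):
    """Combine the hybridization mark with the number of neighbors, e.g. 'd3'."""
    return f'{hybridization_mark(bond_orders)}{len(bond_orders)}'


print(query_mark([1]))        # terminal atom with one single bond
print(query_mark([2, 1, 1]))  # atom with one double and two single bonds
```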

Signatures of QueryContainer and QueryCGRContainer include query marks, and these are printed by the str and print functions.

Signatures of projections are also supported.

In [ ]:

Molecules are comparable and hashable

Comparison of MoleculeContainer objects is based on their signatures. Moreover, since strings in Python are hashable, MoleculeContainer is also hashable.

NOTE: MoleculeContainer is mutable. Changing a molecule can lead to unexpected behavior of the sets and dictionaries in which it was placed before the change. Avoid modifying molecules (standardization, aromatization, hydrogen handling, and atom or bond changes) that are stored inside sets and dictionaries.
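The pitfall from the note can be reproduced without CGRtools. In the sketch below, `Mol` is a hypothetical toy class whose equality and hash follow a mutable `signature` attribute, which mimics the behavior described above; the signature strings are hand-written.

```python
class Mol:
    """Toy stand-in for a mutable container compared and hashed by signature."""

    def __init__(self, signature):
        self.signature = signature

    def __eq__(self, other):
        return isinstance(other, Mol) and self.signature == other.signature

    def __hash__(self):
        return hash(self.signature)


ethanol = Mol('C-C-O')
unique = {ethanol}
found_before = ethanol in unique   # the object is found as expected

ethanol.signature = 'C-C=O'        # in-place change after insertion into the set

# the set still files the object under the hash of the old signature
found_after = ethanol in unique
# a fresh object with the old signature hits the old slot but fails the equality check
found_stale = Mol('C-C-O') in unique

print(found_before, found_after, found_stale, len(unique))
```

After the change the set still holds one element, yet neither the changed object nor an object with the old signature can be looked up reliably.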

In [ ]:
m1 != m2 # different molecules

In [ ]:
m7 == m11 # copies of the same molecule are equal

In [ ]:
m7 is m11  # these are not the same object!

In [ ]:
benzene == proj_copy   # the benzene substructure was extracted from the molecule as a projection and then converted to a MoleculeContainer

In [ ]:
# Simplest way to exclude duplicated structures
len({m1, m2, m7, m11}) == 3 # create set of unique molecules. Only 3 of them were different.
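A set discards the original order of the structures. When the order matters, duplicates can be removed while keeping the first occurrence of each item. The helper below is a sketch that works for any hashable objects; plain strings stand in for molecules here, and `unique_in_order` is a hypothetical name, not part of CGRtools.

```python
def unique_in_order(items):
    """Drop duplicates, keeping the first occurrence of each hashable item."""
    return list(dict.fromkeys(items))


# hand-written illustrative strings standing in for molecules
signatures = ['C-C-O', 'C-C=O', 'C-C-O', 'C-N', 'C-C=O']
deduplicated = unique_in_order(signatures)
print(deduplicated)
```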

2.2. Reaction signatures

ReactionContainer has its own signature. The signature is a SMIRKS string in which the molecules of reactants, reagents, and products are presented in canonical order.

The API is the same as for molecules.
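The effect of canonical ordering can be illustrated with plain strings. The sketch below is an assumption-laden toy, not the CGRtools algorithm: it takes "canonical order" to mean lexicographic sorting of molecule signatures within each part, joins molecules with `.` and parts with `>`, and uses hand-written strings.

```python
def reaction_signature(reactants, reagents, products):
    """Join sorted molecule signatures into a reactants>reagents>products string."""
    parts = ('.'.join(sorted(part)) for part in (reactants, reagents, products))
    return '>'.join(parts)


# the same reaction written with molecules listed in a different order
first = reaction_signature(['C-C-O', 'C-C(=O)-O'], [], ['C-C(=O)-O-C-C', 'O'])
second = reaction_signature(['C-C(=O)-O', 'C-C-O'], [], ['O', 'C-C(=O)-O-C-C'])
print(first)
print(first == second)
```

Because the order inside each part is fixed before joining, both inputs give the same string, which is what makes duplicate reactions detectable.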

In [ ]:

2.3. CGR signature

CGRContainer has its own signature. The signature is a SMIRKS-like string in which atoms in reactants and products have the same order and the two sides are separated by the >> symbol.

In [ ]:

If one aligns the left- and right-hand sides of the signature, the bond order changes become visible.
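The alignment can be done programmatically. The sketch below uses a hand-written string, not real CGRtools output, and assumes that both sides have the same length, which holds only when every atom and bond token is a single character. `align_sides` is a hypothetical helper.

```python
def align_sides(signature):
    """Split a 'left>>right' string and mark the positions where the sides differ."""
    left, right = signature.split('>>')
    marks = ''.join('^' if a != b else ' ' for a, b in zip(left, right))
    return left, right, marks


# illustrative string: the atom order is the same on both sides, only bonds change
left, right, marks = align_sides('C-C=O>>C=C-O')
print(left)
print(right)
print(marks)
```

The caret marks point at the two bonds whose order differs between the sides.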