(c) 2019, Dr. Ramil Nugmanov; Dr. Timur Madzhidov; Ravil Mukhametgaleev
Installation instructions for the CGRtools package and the tutorial files are available at https://github.com/cimm-kzn/CGRtools
NOTE: The tutorial should be run sequentially from the start. Running cells out of order will lead to unexpected results.
In [ ]:
import pkg_resources
if pkg_resources.get_distribution('CGRtools').version.split('.')[:2] != ['3', '1']:
    print('WARNING. Tutorial was tested on 3.1 version of CGRtools')
else:
    print('Welcome!')
In [ ]:
# load data for tutorial
from pickle import load
from traceback import format_exc
with open('molecules.dat', 'rb') as f:
    molecules = load(f) # list of MoleculeContainer objects
with open('reactions.dat', 'rb') as f:
    reactions = load(f) # list of ReactionContainer objects
m1, m2, m3 = molecules[:3] # molecule
m7 = m3.copy()
m11 = m3.copy()
m11.standardize()
m7.standardize()
r1 = reactions[0] # reaction
m1.reset_query_marks()
m1.flush_cache()
m1.delete_atom(3)
cgr2 = ~r1
cgr2.reset_query_marks()
m3.reset_query_marks()
proj = m3.substructure([4,5,6,7,8,9])
benzene = m3.substructure([4,5,6,7,8,9], as_view=False)
benzene.reset_query_marks()
proj_copy = proj.copy()
proj_copy.reset_query_marks()
m3.delete_bond(4, 5)
proj.flush_cache()
MoleculeContainer has methods for generating a unique molecule signature. The signature is a SMILES string with explicit bond notation and canonical atom ordering. For pyrroles, signatures do not comply with the SMILES rules.
To generate a signature, call the str function on a MoleculeContainer object.
A fixed-length hash of the signature can be retrieved by calling the bytes function on the molecule (it corresponds to the SHA-512 bit string of the signature).
The order of atoms is calculated by a Morgan-like algorithm. In the initial state, an integer code is calculated for each atom based on its type. All bonds incident to an atom are also coded as integers and stored in a sorted tuple. The atom code and the tuple of its bonds are used for ordering and for detecting similar atoms. The rank of each ordered atom is then replaced with a prime number from a prime number lookup table. Atoms of the same type with the same types of incident bonds get equal prime numbers.
The prime number codes found this way are used in the Morgan algorithm cycle.
In each loop, for each atom the square of its prime number is multiplied by the prime numbers of its neighboring atoms; the resulting numbers are ranked and prime numbers are assigned again. The loop is repeated until all atoms are unique or the number of unique atoms has not changed in 3 subsequent loops.
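The ordering procedure described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the CGRtools implementation: the names `PRIMES`, `rank_to_primes` and `morgan_like`, the dictionary-based graph layout and the short prime table are all hypothetical.

```python
from math import prod

# toy prime lookup table (assumption: large enough for the small examples here)
PRIMES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53)


def rank_to_primes(keys):
    """Replace the rank of each atom's key with a prime; equal keys share a prime."""
    ranks = {key: i for i, key in enumerate(sorted(set(keys.values())))}
    return {atom: PRIMES[ranks[key]] for atom, key in keys.items()}


def morgan_like(atoms, bonds):
    """atoms: {number: integer type code}; bonds: {number: {neighbor: integer bond code}}."""
    # initial key: atom code plus the sorted tuple of incident bond codes
    weights = rank_to_primes({n: (code, tuple(sorted(bonds[n].values())))
                              for n, code in atoms.items()})
    unique, stalled = len(set(weights.values())), 0
    while unique < len(atoms) and stalled < 3:
        # square of the atom's own prime times the primes of its neighbors
        weights = rank_to_primes({n: weights[n] ** 2 * prod(weights[m] for m in bonds[n])
                                  for n in atoms})
        now = len(set(weights.values()))
        stalled = stalled + 1 if now == unique else 0
        unique = now
    return weights


# C1-C2-C3-O4 chain: atoms 2 and 3 start with equal primes and are split in the first loop
chain = morgan_like({1: 6, 2: 6, 3: 6, 4: 8},
                    {1: {2: 1}, 2: {1: 1, 3: 1}, 3: {2: 1, 4: 1}, 4: {3: 1}})
# C1-C2-C3 chain: the terminal atoms are symmetric and keep equal primes
propane = morgan_like({1: 6, 2: 6, 3: 6}, {1: {2: 1}, 2: {1: 1, 3: 1}, 3: {2: 1}})
print(chain)
print(propane)
```

In the symmetric case the loop stops because the number of distinct primes stays the same for 3 loops, so equivalent atoms end with equal codes.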
In [ ]:
ms2 = str(m2) # get and print signature
print(ms2)
# or
print(m2)
hms2 = bytes(m2) # get sha512 hash of signature as bytes-string
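To see why the hash has a fixed length, here is a minimal sketch using the standard hashlib module on a plain string. The signature string is hypothetical and the UTF-8 encoding step is an assumption; how CGRtools encodes the signature internally is not stated here.

```python
from hashlib import sha512

signature = 'C-C-O'  # hypothetical signature string, stands in for str(m2)
digest = sha512(signature.encode()).digest()  # assumption: signature encoded as UTF-8

print(len(digest))  # a SHA-512 digest is always 64 bytes (512 bits)
print(digest.hex()[:16])  # first bytes of the digest in hexadecimal form
```

The digest length does not depend on the length of the signature, which makes it convenient as a compact key.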
String formatting is supported, which is useful for reporting.
In [ ]:
print(f'f string {m2}') # use signature in string formatting
print('C-style string %s' % m2)
print('format method {}'.format(m2))
The number of neighbors and the hybridization can be added to the signature. Note that in this case the signatures are not readable as SMILES.
For MoleculeContainer and ReactionContainer, query marks are not included in signatures and are not printed by the str and print functions. However, they can be printed in the following way:
In [ ]:
m2.reset_query_marks() # calculate hybridization and number of neighbors
print(f'{m2:hn}') # get signatures with hybridization and neighbors data
print('{:h}'.format(m2)) # get signature with hybridization only data
# h - hybridization marks, n - neighbors marks
format(m2, 'n') # include only number of neighbors in signature
print(m2) # notice that neighbors and hybridization are hidden, since signatures for molecules do not contain this information.
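The format specifiers used above rely on Python's `__format__` protocol. The toy class below is a hypothetical stand-in that shows how marks such as h and n can be routed through f-strings and format(). It is not the CGRtools code, and its output layout is simplified (all bonds are written as single bonds).

```python
class ToySignature:
    """Hypothetical stand-in for a container; not the CGRtools implementation."""

    def __init__(self, atoms):
        # atoms: list of (symbol, hybridization mark, number of neighbors)
        self.atoms = atoms

    def __format__(self, spec):
        # spec may contain 'h' (hybridization) and/or 'n' (neighbors);
        # an empty spec gives plain element symbols
        unknown = set(spec) - {'h', 'n'}
        if unknown:
            raise ValueError(f'unsupported format marks: {sorted(unknown)}')
        tokens = []
        for symbol, hyb, neighbors in self.atoms:
            marks = (hyb if 'h' in spec else '') + (str(neighbors) if 'n' in spec else '')
            tokens.append(f'[{symbol};{marks}]' if marks else symbol)
        return '-'.join(tokens)

    def __str__(self):
        return format(self, '')


toy = ToySignature([('C', 's', 1), ('C', 's', 2), ('O', 's', 1)])
print(f'{toy}')     # C-C-O
print(f'{toy:h}')   # [C;s]-[C;s]-[O;s]
print(f'{toy:hn}')  # [C;s1]-[C;s2]-[O;s1]
```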
Atoms in the signature are represented in the following way: [element_symbol;hn;charge;multiplicity (if not None)]. Here h means hybridization and n the number of neighbors. The notation for hybridization is as follows:
s - all bonds of atom are single
d - atom has one double bond and others are single
t - atom has one triple or two double bonds and others are single
a - atom is in aromatic ring
Examples:
s1 - atom has s hybridization and one neighbor
d3 - atom has d hybridization and 3 neighbors
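The four hybridization rules can be written as a small helper. This is a sketch under the assumption that an atom is described by the list of its bond orders plus an aromaticity flag. The function names are hypothetical and are not part of CGRtools.

```python
def hybridization_mark(bond_orders, aromatic=False):
    """Return 's', 'd', 't' or 'a' for one atom.

    bond_orders: orders (1, 2 or 3) of the bonds incident to the atom.
    aromatic: True if the atom is in an aromatic ring.
    """
    if aromatic:
        return 'a'
    doubles = bond_orders.count(2)
    triples = bond_orders.count(3)
    if (triples == 1 and doubles == 0) or (doubles == 2 and triples == 0):
        return 't'  # one triple or two double bonds, others single
    if doubles == 1 and triples == 0:
        return 'd'  # one double bond, others single
    if doubles == 0 and triples == 0:
        return 's'  # all bonds single
    raise ValueError('combination not covered by the four rules listed above')


def query_mark(bond_orders, aromatic=False):
    """Combine the hybridization letter with the number of neighbors, e.g. 'd3'."""
    return f'{hybridization_mark(bond_orders, aromatic)}{len(bond_orders)}'


print(query_mark([1]))        # s1
print(query_mark([2, 1, 1]))  # d3
print(query_mark([3, 1]))     # t2
```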
Signatures for QueryContainer and QueryCGRContainer include query marks, and these are printed by the str and print functions.
Signatures of projections are also supported.
In [ ]:
f'{proj:h}'
Molecules are comparable and hashable
Comparison of MoleculeContainer objects is based on their signatures. Moreover, since strings in Python are hashable, MoleculeContainer is also hashable.
NOTE: MoleculeContainer is mutable. Changing a molecule can lead to non-obvious behavior of the sets and dictionaries in which it was placed before the change. Avoid changing molecules (standardize, aromatize, hydrogens and atoms/bonds changes) placed inside sets and dictionaries.
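The pitfall from the note can be reproduced without CGRtools. The class below is a hypothetical mutable object that compares and hashes by a signature string, which mirrors the behavior described above. It only illustrates how Python sets react to such changes.

```python
class ToyMolecule:
    """Hypothetical mutable object that compares and hashes by its signature string."""

    def __init__(self, signature):
        self.signature = signature

    def __eq__(self, other):
        return isinstance(other, ToyMolecule) and self.signature == other.signature

    def __hash__(self):
        return hash(self.signature)


mol = ToyMolecule('C-C-O')
seen = {mol}
before = mol in seen  # True: lookup uses the hash of 'C-C-O'

mol.signature = 'C-C=O'  # modify the object after it was placed in the set
after = mol in seen  # lookup now uses the hash of the new signature
old_lookup = ToyMolecule('C-C-O') in seen  # hash matches, equality with changed object fails

print(before, after, old_lookup, len(seen))
```

The object is still stored in the set, but it can no longer be found by either its old or its new signature.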
In [ ]:
m1 != m2 # different molecules
In [ ]:
m7 == m11 # copy of the same molecule
In [ ]:
m7 is m11 # these are not the same object!
In [ ]:
benzene == proj_copy # the projection extracted the benzene structure from the molecule and was then transformed into a MoleculeContainer
In [ ]:
# Simplest way to exclude duplicated structures
len({m1, m2, m7, m11}) == 3 # create set of unique molecules. Only 3 of them were different.
In [ ]:
str(r1)
In [ ]:
str(cgr2)
If one aligns the left- and right-hand sides of the signature, the bond order changes become visible.
C-C-[O-].C(=O)(-O)-C(=O)(-O).O-C-C
C-C-O-C(=O)(.O)-C(=O)(.O)-O-C-C
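The alignment can also be done programmatically. The sketch below uses difflib from the standard library on the two strings shown above. It is a generic text comparison, not a CGRtools feature, so it reports character-level differences rather than chemically interpreted bond changes.

```python
from difflib import SequenceMatcher

left = 'C-C-[O-].C(=O)(-O)-C(=O)(-O).O-C-C'
right = 'C-C-O-C(=O)(.O)-C(=O)(.O)-O-C-C'

# collect only the segments where the two strings differ
matcher = SequenceMatcher(None, left, right, autojunk=False)
changes = [(tag, left[i1:i2], right[j1:j2])
           for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != 'equal']

for tag, old, new in changes:
    print(f'{tag:8} {old!r:>8} -> {new!r}')
```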