Molecules in scikit-chem

scikit-chem is first and foremost a wrapper around rdkit that makes it more Pythonic and more intuitive to users familiar with other libraries in the Scientific Python stack. The package implements a core Mol class that represents a molecule. It is a direct subclass of the rdkit.Chem.Mol class:


In [1]:
import rdkit.Chem
import skchem
issubclass(skchem.Mol, rdkit.Chem.Mol)


Out[1]:
True

As such, it has all the methods of rdkit.Chem.Mol available, for example:


In [2]:
hasattr(skchem.Mol, 'GetAromaticAtoms')


Out[2]:
True

Initializing new molecules

Constructors are provided as classmethods on the skchem.Mol class, in the same fashion as pandas objects are constructed. For example, to make a pandas.DataFrame from a dictionary, you call:


In [3]:
import pandas as pd
df = pd.DataFrame.from_dict({'a': [10, 20], 'b': [20, 40]}); df


Out[3]:
    a   b
0  10  20
1  20  40

Analogously, to make a skchem.Mol from a SMILES string, you call:


In [4]:
mol = skchem.Mol.from_smiles('CC(=O)Cl'); mol


Out[4]:
<Mol name="None" formula="C2H3ClO" at 0x11dc8f490>

The available constructors are:


In [5]:
[method for method in skchem.Mol.__dict__ if method.startswith('from_')]


Out[5]:
['from_tplblock',
 'from_molblock',
 'from_molfile',
 'from_binary',
 'from_tplfile',
 'from_mol2block',
 'from_pdbfile',
 'from_pdbblock',
 'from_smiles',
 'from_smarts',
 'from_mol2file',
 'from_inchi']
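One detail worth keeping in mind about the listing above: a class's `__dict__` only holds attributes defined directly on that class, so anything inherited from a parent class will not show up. The sketch below illustrates the difference with two toy classes; `Base` and `Child` are hypothetical stand-ins, not rdkit or scikit-chem classes.

```python
class Base:
    """Hypothetical stand-in for a parent class."""

    @classmethod
    def from_parent_format(cls, text):
        return cls()


class Child(Base):
    """Hypothetical stand-in for a subclass that adds its own constructors."""

    @classmethod
    def from_smiles(cls, text):
        return cls()

    @classmethod
    def from_inchi(cls, text):
        return cls()


# vars(Child) is Child.__dict__: only names defined on Child itself.
own = sorted(name for name in vars(Child) if name.startswith('from_'))

# dir(Child) also walks the parent classes, so inherited names are included.
everything = sorted(name for name in dir(Child) if name.startswith('from_'))

print(own)
print(everything)
```

If inherited constructors should be listed as well, filtering `dir(...)` instead of `__dict__` is one option.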

When a molecule fails to parse, a ValueError is raised:


In [6]:
skchem.Mol.from_smiles('NOTSMILES')


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-99e03ef822e7> in <module>()
----> 1 skchem.Mol.from_smiles('NOTSMILES')

/Users/rich/projects/scikit-chem/skchem/core/mol.py in constructor(_, in_arg, name, *args, **kwargs)
    419         m = getattr(rdkit.Chem, 'MolFrom' + constructor_name)(in_arg, *args, **kwargs)
    420         if m is None:
--> 421             raise ValueError('Failed to parse molecule, {}'.format(in_arg))
    422         m = Mol.from_super(m)
    423         m.name = name

ValueError: Failed to parse molecule, NOTSMILES
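The traceback above suggests how the constructors work: a parser named 'MolFrom' plus the format name is looked up, and a None result is turned into a ValueError. The following is a minimal sketch of that pattern only, not the actual scikit-chem source. `FakeParsers`, its toy validation rule, the simplified `Mol` class and `_make_constructor` are all assumptions made for illustration.

```python
class FakeParsers:
    """Hypothetical stand-in for a parser namespace such as rdkit.Chem."""

    @staticmethod
    def MolFromSmiles(text):
        # Toy rule (an assumption): accept only a handful of characters
        # and signal failure by returning None, as the real parsers do.
        allowed = set('CNOclH()=#')
        return text if text and set(text) <= allowed else None


class Mol:
    """Simplified molecule holder used only for this sketch."""

    def __init__(self, payload, name=None):
        self.payload = payload
        self.name = name


def _make_constructor(constructor_name):
    """Build a classmethod that wraps the parser for one input format."""

    def constructor(cls, in_arg, name=None, *args, **kwargs):
        parser = getattr(FakeParsers, 'MolFrom' + constructor_name)
        parsed = parser(in_arg, *args, **kwargs)
        if parsed is None:
            raise ValueError('Failed to parse molecule, {}'.format(in_arg))
        return cls(parsed, name=name)

    return classmethod(constructor)


# Attach one from_<format> constructor per supported format.
for _format in ['Smiles']:
    setattr(Mol, 'from_' + _format.lower(), _make_constructor(_format))

mol = Mol.from_smiles('CC(=O)Cl', name='acetyl chloride')

try:
    Mol.from_smiles('NOTSMILES')
except ValueError as err:
    message = str(err)
    print(message)
```

Raising an exception instead of returning None means a failed parse is caught where it happens, so callers can wrap the constructor in an ordinary try/except block.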

Molecule accessors

Atoms and bonds are accessible as properties:


In [7]:
mol.atoms


Out[7]:
<AtomView values="['C', 'C', 'O', 'Cl']" at 0x11dc9ac88>

In [8]:
mol.bonds


Out[8]:
<BondView values="['C-C', 'C=O', 'C-Cl']" at 0x11dc9abe0>

These are iterable:


In [9]:
[a for a in mol.atoms]


Out[9]:
[<Atom element="C" at 0x11dcfe8a0>,
 <Atom element="C" at 0x11dcfe9e0>,
 <Atom element="O" at 0x11dcfed00>,
 <Atom element="Cl" at 0x11dcfedf0>]

subscriptable:


In [10]:
mol.atoms[3]


Out[10]:
<Atom element="Cl" at 0x11dcfef30>

sliceable:


In [11]:
mol.atoms[:3]


Out[11]:
[<Atom element="C" at 0x11dcfebc0>,
 <Atom element="C" at 0x11de690d0>,
 <Atom element="O" at 0x11de693f0>]

indexable:


In [19]:
mol.atoms[[1, 3]]


Out[19]:
[<Atom element="C" at 0x11de74760>, <Atom element="Cl" at 0x11de7fe40>]

and maskable:


In [18]:
mol.atoms[[True, False, True, False]]


Out[18]:
[<Atom element="C" at 0x11de74ad0>, <Atom element="O" at 0x11de74f30>]
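The four access styles shown above (single index, slice, list of positions, boolean mask) can all be served by one `__getitem__` method. The sketch below shows one way to do that in plain Python; `ListView` is a hypothetical class written for this illustration and is not the scikit-chem AtomView implementation.

```python
from collections.abc import Sequence


class ListView(Sequence):
    """Hypothetical read-only view supporting four indexing styles."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        # Slice: delegate directly to the underlying list.
        if isinstance(index, slice):
            return self._items[index]
        # List or tuple: either a boolean mask or integer positions.
        if isinstance(index, (list, tuple)):
            # bool is a subclass of int, so the mask check must come first.
            if index and all(isinstance(i, bool) for i in index):
                if len(index) != len(self._items):
                    raise ValueError('mask length must match view length')
                return [item for item, keep in zip(self._items, index) if keep]
            return [self._items[i] for i in index]
        # Plain integer: raises IndexError when out of range, which is
        # what lets the Sequence mixin provide iteration.
        return self._items[index]


atoms = ListView(['C', 'C', 'O', 'Cl'])

print(list(atoms))
print(atoms[3])
print(atoms[:3])
print(atoms[[1, 3]])
print(atoms[[True, False, True, False]])
```

Deriving from `collections.abc.Sequence` supplies iteration, `in` checks and `reversed()` once `__getitem__` and `__len__` are defined.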

Properties on the rdkit objects are accessible through the props property:


In [11]:
mol.props['is_reactive'] = 'very!'

In [12]:
mol.atoms[1].props['kind'] = 'electrophilic'
mol.atoms[3].props['leaving group'] = 1
mol.bonds[2].props['bond strength'] = 'strong'

These use the rdkit property functionality internally:


In [13]:
mol.GetProp('is_reactive')


Out[13]:
'very!'
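A dictionary-style interface over getter and setter methods can be built with `collections.abc.MutableMapping`. The sketch below shows the general idea only. `FakeRDObject` is a hypothetical stand-in that mimics the rdkit method names SetProp, GetProp, ClearProp and GetPropNames, and `PropertyView` is an illustrative class, not the scikit-chem implementation. Unlike rdkit's SetProp, which expects string values, the stand-in stores any value.

```python
from collections.abc import MutableMapping


class FakeRDObject:
    """Hypothetical stand-in exposing rdkit-style property methods."""

    def __init__(self):
        self._store = {}

    def SetProp(self, key, value):
        self._store[key] = value

    def GetProp(self, key):
        return self._store[key]

    def ClearProp(self, key):
        del self._store[key]

    def GetPropNames(self):
        return list(self._store)


class PropertyView(MutableMapping):
    """Dict-like view that forwards every operation to the owner object."""

    def __init__(self, owner):
        self._owner = owner

    def __getitem__(self, key):
        if key not in self._owner.GetPropNames():
            raise KeyError(key)
        return self._owner.GetProp(key)

    def __setitem__(self, key, value):
        self._owner.SetProp(key, value)

    def __delitem__(self, key):
        if key not in self._owner.GetPropNames():
            raise KeyError(key)
        self._owner.ClearProp(key)

    def __iter__(self):
        return iter(self._owner.GetPropNames())

    def __len__(self):
        return len(self._owner.GetPropNames())


obj = FakeRDObject()
props = PropertyView(obj)
props['is_reactive'] = 'very!'

# The value is stored on the wrapped object, not in the view.
print(obj.GetProp('is_reactive'))
print(dict(props))
```

Because the view holds no state of its own, values set through it and values set through the underlying methods always agree.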

The properties of atoms and bonds are accessible molecule-wide:


In [14]:
mol.atoms.props


Out[14]:
<MolPropertyView values="{'leaving group': [nan, nan, nan, 1.0], 'kind': [None, 'electrophilic', None, None]}" at 0x11daf8390>

In [15]:
mol.bonds.props


Out[15]:
<MolPropertyView values="{'bond strength': [None, None, 'strong']}" at 0x11daf80f0>

These can be exported as pandas objects:


In [16]:
mol.atoms.props.to_frame()


Out[16]:
                   kind  leaving group
atom_idx
0                  None            NaN
1         electrophilic            NaN
2                  None            NaN
3                  None            1.0
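The molecule-wide view can be thought of as a column-wise gathering of each atom's own properties, with a placeholder for atoms that lack a given key. The sketch below reproduces that gathering in plain Python under a simplifying assumption: missing entries are always None, whereas the output above shows NaN for the numeric column. The `atom_props` list is hypothetical input mirroring the example molecule.

```python
# Per-atom property dictionaries, one per atom index (hypothetical input).
atom_props = [
    {},
    {'kind': 'electrophilic'},
    {},
    {'leaving group': 1},
]

# Collect every property name that appears on at least one atom.
keys = sorted({key for props in atom_props for key in props})

# Build one column per property, filling gaps with None.
columns = {key: [props.get(key) for props in atom_props] for key in keys}

for key, values in columns.items():
    print(key, values)
```

A dictionary of equal-length lists like `columns` is also a shape that `pandas.DataFrame` accepts directly, which is one plausible route to a table like the one above.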

Export and Serialization

Molecules are exported and serialized in much the same way as they are constructed, again taking inspiration from pandas:


In [17]:
df.to_csv()


Out[17]:
',a,b\n0,10,20\n1,20,40\n'

In [18]:
mol.to_inchi_key()


Out[18]:
'WETWJCDKMRHUPV-UHFFFAOYSA-N'

The full list of available export methods is:


In [19]:
[method for method in skchem.Mol.__dict__ if method.startswith('to_')]


Out[19]:
['to_inchi',
 'to_json',
 'to_smiles',
 'to_smarts',
 'to_inchi_key',
 'to_binary',
 'to_dict',
 'to_molblock',
 'to_tplfile',
 'to_formula',
 'to_molfile',
 'to_pdbblock',
 'to_tplblock']