In [15]:
import skchem
import pandas as pd
pd.options.display.max_rows = 10
pd.options.display.max_columns = 10
%matplotlib inline
Operations on compounds are implemented as Transformers in scikit-chem, which are analogous to Transformer objects in scikit-learn. These objects define a 1:1 mapping between input and output objects in a collection (i.e. the length of the collection remains the same during a transform). These mappings can vary widely, but the three main types currently implemented in scikit-chem are Standardizers, Forcefields and Featurizers.
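The 1:1 contract can be illustrated with a minimal sketch in plain Python. The class names ToyTransformer and WhitespaceStripper are hypothetical and are not part of scikit-chem; strings stand in for molecules, and a failed item is mapped to None so that the output keeps the same length as the input.

```python
class ToyTransformer:
    """Hypothetical base class: every input maps to exactly one output."""

    def _transform_one(self, item):
        raise NotImplementedError

    def _safe(self, item):
        # A failure yields None rather than dropping the item,
        # so the collection keeps its length.
        try:
            return self._transform_one(item)
        except ValueError:
            return None

    def transform(self, items):
        # A single string is treated as one item, anything else as a collection.
        if isinstance(items, str):
            return self._safe(items)
        return [self._safe(item) for item in items]


class WhitespaceStripper(ToyTransformer):
    """Toy transformer: strips whitespace and fails on empty input."""

    def _transform_one(self, item):
        cleaned = item.strip()
        if not cleaned:
            raise ValueError("empty input")
        return cleaned


inputs = [" CCO ", "   ", "c1ccccc1"]
outputs = WhitespaceStripper().transform(inputs)
```

The second input fails, but it still occupies a slot in the output, which is what makes the mapping 1:1.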
Chemical data curation is difficult: data may be formatted differently depending on the source, or even the habits of the curator.
For example, solvents or salts might be included in the representation. A modeller might consider these an unnecessary detail, and an experimentalist might consider them irrelevant if the compound is dissolved in a standard solvent during the protocol.
Even the structure of molecules that would be considered the 'same' can often be drawn very differently. For example, tautomers are arguably the same molecule under different conditions, and mesomers might be considered different aspects of the same molecule.
Often, it is sensible to canonicalize these compounds in a process called Standardization.
In scikit-chem, the standardizers package provides this functionality. Standardizer objects transform Mol objects into other Mol objects, which have their representation canonicalized (or into None if the protocol fails). The details of the canonicalization may be configured at object initialization, or by altering properties.
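To give a feel for what one standardization step involves, here is a rough sketch of salt stripping written in plain Python. It is not how scikit-chem or any real standardizer works internally: it only splits a SMILES string on "." and keeps the fragment with the most atom tokens, using a simplified regular expression for SMILES atoms. The function names are hypothetical, and the sketch does not neutralize charges or handle tautomers.

```python
import re

# Simplified SMILES atom tokens: bracket atoms, two-letter halogens,
# the organic subset, and aromatic lowercase atoms (an approximation).
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def count_atoms(fragment):
    """Count atom tokens in one SMILES fragment."""
    return len(ATOM_PATTERN.findall(fragment))


def keep_largest_fragment(smiles):
    """Return the dot-separated fragment with the most atoms, or None if empty."""
    if not smiles:
        return None  # mirrors the 'None on failure' convention
    return max(smiles.split("."), key=count_atoms)


stripped = keep_largest_fragment("CC(=O)[O-].[Na+]")
```

Here the sodium counter-ion is discarded and the acetate anion is kept. A full standardizer would go further, for example by neutralizing the anion.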
As an example, we will standardize sodium acetate:
In [3]:
mol = skchem.Mol.from_smiles('CC(=O)[O-].[Na+]', name='sodium acetate'); mol.to_smiles()
Out[3]:
A Standardizer object is initialized:
In [43]:
std = skchem.standardizers.ChemAxonStandardizer()
Calling transform on sodium acetate yields its 'canonical' form, the conjugate acid, acetic acid.
In [44]:
mol_std = std.transform(mol); mol_std.to_smiles()
Out[44]:
The standardization of a collection of Mols can be achieved by calling transform on a pandas.Series:
In [45]:
mols = skchem.read_smiles('https://archive.org/download/scikit-chem_example_files/example.smi',
name_column=1); mols
Out[45]:
In [46]:
std.transform(mols)
Out[46]:
A progress bar is shown by default, although this can be disabled by lowering the verbosity:
In [47]:
std.verbose = 0
std.transform(mols)
Out[47]:
Often the three-dimensional structure of a compound is of relevance, but many chemical formats, such as SMILES, do not encode this information (and even formats that serialize geometry often provide coordinates in only two dimensions).
To produce a reasonable three-dimensional conformer, a compound must first be roughly embedded in three dimensions according to local geometrical constraints, and a forcefield is then used to optimize its geometry.
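The optimization step can be sketched with a deliberately tiny model. The code below is an illustration only, assuming a 'forcefield' made of nothing but harmonic bond-stretch terms and minimized by plain steepest descent. Real forcefields such as UFF and MMFF also include angle, torsion and non-bonded terms, and use better minimizers. All names and numbers here are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def gradient(coords, bonds, k=1.0):
    """Gradient of E = sum over bonds of k/2 * (d - r0)^2 w.r.t. each coordinate."""
    grad = [[0.0, 0.0, 0.0] for _ in coords]
    for i, j, r0 in bonds:
        d = distance(coords[i], coords[j])
        for axis in range(3):
            g = k * (d - r0) * (coords[i][axis] - coords[j][axis]) / d
            grad[i][axis] += g
            grad[j][axis] -= g
    return grad


def optimize(coords, bonds, step=0.1, n_steps=500):
    """Steepest descent: repeatedly move every atom against its gradient."""
    coords = [list(c) for c in coords]
    for _ in range(n_steps):
        grad = gradient(coords, bonds)
        for atom, g in zip(coords, grad):
            for axis in range(3):
                atom[axis] -= step * g[axis]
    return coords


# A rough 'embedding' of three atoms, with two bonds of ideal length 1.5.
rough = [(0.0, 0.0, 0.0), (2.0, 0.1, 0.0), (2.5, 1.9, 0.3)]
bonds = [(0, 1, 1.5), (1, 2, 1.5)]
relaxed = optimize(rough, bonds)
```

After minimization, both bond lengths in relaxed are close to the ideal value, while the rough starting geometry only determined which local minimum was reached.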
In scikit-chem, the forcefields package provides access to this functionality. Two forcefields, the Universal Force Field (UFF) and the Merck Molecular Force Field (MMFF), are currently provided. We will use the UFF:
In [23]:
uff = skchem.forcefields.UFF()
mol = uff.transform(mol_std)
In [25]:
mol.atoms
Out[25]:
This uses the forcefield to generate a reasonable three-dimensional structure. In rdkit (and thus scikit-chem), conformers are separate entities. The forcefield creates a new conformer on the object:
In [27]:
mol.conformers[0].atom_positions
Out[27]:
The molecule can be visualized by drawing it:
In [35]:
skchem.vis.draw(mol)
Out[35]:
Chemical representation is not by itself very amenable to data analysis and mining techniques. Often, a fixed length vector representation is required. This is achieved by calculating features from the chemical representation.
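As a toy illustration of turning a variable-length representation into a fixed-length vector, the sketch below hashes character bigrams of a SMILES string into a fixed number of counting bins. This is not a descriptor that scikit-chem provides, and it is chemically naive. The function name and parameters are hypothetical, and zlib.crc32 is used only because it gives the same hash on every run.

```python
import zlib


def hashed_ngram_features(smiles, n=2, n_features=16):
    """Count character n-grams, folded into a vector of fixed length."""
    vector = [0] * n_features
    for start in range(len(smiles) - n + 1):
        ngram = smiles[start:start + n]
        index = zlib.crc32(ngram.encode("utf-8")) % n_features
        vector[index] += 1
    return vector


ethanol = hashed_ngram_features("CCO")
aspirin = hashed_ngram_features("CC(=O)Oc1ccccc1C(=O)O")
```

Both vectors have the same length regardless of the size of the molecule, which is the property most learning algorithms need.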
In scikit-chem, this is provided by the descriptors package. A selection of features is available:
In [11]:
skchem.descriptors.__all__
Out[11]:
Circular fingerprints (of which Morgan fingerprints are an example) are often considered the most consistently well-performing descriptors across a wide variety of compounds.
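The idea behind a circular fingerprint can be sketched in a few lines of plain Python. This is a simplified, Morgan-style illustration and not the rdkit or scikit-chem implementation. It assumes atoms are described only by their element symbol, ignores bond orders, and does not remove duplicate environments. The function names are hypothetical.

```python
import zlib


def stable_hash(value):
    """Deterministic hash of any value with a stable repr."""
    return zlib.crc32(repr(value).encode("utf-8"))


def circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Fold iteratively grown atom-environment identifiers into a bit vector."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Radius 0: each atom is identified by its element symbol alone.
    identifiers = [stable_hash(symbol) for symbol in atoms]
    seen = set(identifiers)

    # Each iteration combines an atom's identifier with those of its neighbours,
    # so the identifier describes an environment one bond larger.
    for _ in range(radius):
        identifiers = [
            stable_hash((identifiers[i], sorted(identifiers[j] for j in neighbours[i])))
            for i in range(len(atoms))
        ]
        seen.update(identifiers)

    bits = [0] * n_bits
    for identifier in seen:
        bits[identifier % n_bits] = 1
    return bits


# Heavy atoms of ethanol, written in two different atom orders.
ethanol = circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_reordered = circular_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
```

Because neighbour identifiers are sorted before hashing, the fingerprint depends only on the molecular graph and not on the order in which atoms are listed.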
In [12]:
mf = skchem.descriptors.MorganFeaturizer()
mf.transform(mol)
Out[12]:
We can also call the featurizer on a series of Mols:
In [13]:
mf.transform(mols.structure)
Out[13]: