In [15]:

    
import skchem
import pandas as pd
pd.options.display.max_rows = pd.options.display.max_columns = 10

Pipelining

scikit-chem extends the scikit-learn Pipeline object to support filtering. A pipeline is initialized with a list of Transformer objects.



In [10]:

    
pipeline = skchem.pipeline.Pipeline([
        skchem.standardizers.ChemAxonStandardizer(keep_failed=True),
        skchem.forcefields.UFF(),
        skchem.filters.OrganicFilter(),
        skchem.descriptors.MorganFeaturizer()])

The pipeline applies each step in turn to the input, calling the highest-priority method that the step implements, in the order transform_filter > filter > transform.
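
The priority rule can be illustrated without scikit-chem. The following is a minimal sketch of the idea only, not the library's implementation: the classes StripAndDrop, NonEmpty and Upper and the helpers apply_step and run_pipeline are hypothetical stand-ins that operate on strings rather than molecules.

```python
# Hypothetical stand-ins for pipeline steps; these are NOT scikit-chem classes.
class StripAndDrop:
    """Implements both transform and transform_filter."""

    def transform(self, items):
        return [s.strip() for s in items]

    def transform_filter(self, items):
        # Transform, then drop items that are empty afterwards.
        return [s for s in self.transform(items) if s]


class NonEmpty:
    """Implements only filter."""

    def filter(self, items):
        return [s for s in items if s]


class Upper:
    """Implements only transform."""

    def transform(self, items):
        return [s.upper() for s in items]


# Assumed priority order, taken from the text: transform_filter > filter > transform.
PRIORITY = ('transform_filter', 'filter', 'transform')


def apply_step(step, items):
    """Call the highest-priority method that the step implements."""
    for name in PRIORITY:
        method = getattr(step, name, None)
        if callable(method):
            return method(items)
    raise TypeError('step implements none of: ' + ', '.join(PRIORITY))


def run_pipeline(steps, items):
    """Feed the output of each step into the next one."""
    for step in steps:
        items = apply_step(step, items)
    return items


result = run_pipeline([StripAndDrop(), NonEmpty(), Upper()], [' a ', '   ', 'b'])
print(result)
```

In this sketch StripAndDrop also defines transform, but the pipeline never calls it directly, because transform_filter has higher priority. As a result the whitespace-only string is removed at the first step rather than being passed on as an empty string.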

For example, our pipeline can transform sodium acetate all the way to fingerprints:



In [11]:

    
mol = skchem.Mol.from_smiles('CC(=O)[O-].[Na+]')



In [4]:

    
pipeline.transform_filter(mol)









    Out[4]:





morgan_fp_idx
0       0
1       0
2       0
3       0
4       0
       ..
2043    0
2044    0
2045    0
2046    0
2047    0
Name: MorganFeaturizer, dtype: uint8

It also works on collections of molecules:



In [12]:

    
mols = skchem.read_smiles('https://archive.org/download/scikit-chem_example_files/example.smi', name_column=1)
mols









    Out[12]:





batch
ethane                          <Mol: CC>
propane                        <Mol: CCC>
benzene                   <Mol: c1ccccc1>
sodium acetate    <Mol: CC(=O)[O-].[Na+]>
serine                <Mol: NC(CO)C(=O)O>
Name: structure, dtype: object



In [16]:

    
pipeline.transform_filter(mols)









    



ChemAxonStandardizer: 100% (5 of 5) |##########################################| Elapsed Time: 0:00:04 Time: 0:00:04
UFF: 100% (5 of 5) |###########################################################| Elapsed Time: 0:00:00 Time: 0:00:00
OrganicFilter: 100% (5 of 5) |#################################################| Elapsed Time: 0:00:00 Time: 0:00:00
MorganFeaturizer: 100% (5 of 5) |##############################################| Elapsed Time: 0:00:00 Time: 0:00:00






    Out[16]:






  
    
morgan_fp_idx   0  1  2  3  4  ...  2043  2044  2045  2046  2047
batch
ethane          0  0  0  0  0  ...     0     0     0     0     0
propane         0  0  0  0  0  ...     0     0     0     0     0
benzene         0  0  0  0  0  ...     0     0     0     0     0
sodium acetate  0  0  0  0  0  ...     0     0     0     0     0
serine          0  0  0  0  0  ...     0     0     1     0     0
    
  

5 rows × 2048 columns
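
Because the display is truncated to 10 columns, most of the 2048 bits are hidden in the output above. A possible way to inspect which bits are set is sketched below with plain pandas. The frame fps is a small stand-in with made-up values and hypothetical row names, not the real fingerprints; it assumes only that the pipeline output is a DataFrame with one row per molecule and one column per bit.

```python
import numpy as np
import pandas as pd

# Stand-in for the pipeline output: 2 molecules x 8 bits.
# The values are hypothetical and do not come from a real featurizer.
fps = pd.DataFrame(
    np.zeros((2, 8), dtype=np.uint8),
    index=pd.Index(['mol_a', 'mol_b'], name='batch'),
    columns=pd.Index(range(8), name='morgan_fp_idx'),
)
fps.loc['mol_a', [1, 5]] = 1
fps.loc['mol_b', 3] = 1

# Number of set bits per molecule.
bits_per_mol = fps.sum(axis=1)

# Column labels of the set bits for each molecule.
on_bits = {
    name: list(row.index[row.to_numpy() != 0])
    for name, row in fps.iterrows()
}

print(bits_per_mol)
print(on_bits)
```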


