In [30]:
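import pandas as pd
import skchem.data  # makes skchem.data.BursiAmes available to the cells below

# Limit how many rows and columns pandas prints.
pd.options.display.max_columns = pd.options.display.max_rows = 10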

Data

scikit-chem provides a simple interface to chemical datasets and a framework for constructing them. The data module uses fuel to make complex out-of-memory, iterative functionality straightforward (see the fuel documentation). It also offers an abstraction for easily loading smaller datasets that fit in memory.

In-memory datasets

Datasets consist of sets and sources. Simply put, sets are collections of molecules in the dataset, and sources are types of data relating to these molecules.

For demonstration purposes, we will use the Bursi Ames dataset. This has 3 sets:


In [31]:
skchem.data.BursiAmes.available_sets()


Out[31]:
('train', 'valid', 'test')

And many sources:


In [32]:
skchem.data.BursiAmes.available_sources()


Out[32]:
('G', 'A', 'y', 'A_cx', 'G_d', 'X_morg', 'X_cx', 'X_pc')

For this example, we will load the X_morg and y sources for all the sets. These are, respectively, circular fingerprints and the target labels (in this case, whether the molecule is a mutagen).

We can load the data for the requested sets and sources using the in-memory API:


In [33]:
kws = {'sets': ('train', 'valid', 'test'), 'sources':('X_morg', 'y')}

(X_train, y_train), (X_valid, y_valid), (X_test, y_test) = skchem.data.BursiAmes.load_data(**kws)

The requested data is loaded as nested tuples, ordered first by set and then by source, which can easily be unpacked as above.


In [34]:
print('train shapes:', X_train.shape, y_train.shape)
print('valid shapes:', X_valid.shape, y_valid.shape)
print('test shapes:', X_test.shape, y_test.shape)


train shapes: (3007, 2048) (3007,)
valid shapes: (645, 2048) (645,)
test shapes: (645, 2048) (645,)
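
The nested-tuple layout described above can be pictured with placeholder values. The following is only a sketch: the strings stand in for the real arrays, and the `by_name` dictionary is a hypothetical convenience, not part of the scikit-chem API.

```python
# One inner tuple per requested set, each holding one entry per requested
# source, in the order the sets and sources were requested.
sets = ('train', 'valid', 'test')
sources = ('X_morg', 'y')

# Placeholder strings instead of real arrays, so the layout is easy to see.
data = tuple(
    tuple(f'{source}[{set_}]' for source in sources)
    for set_ in sets
)

# Unpack set by set, then source by source, as in the cell above.
(X_train, y_train), (X_valid, y_valid), (X_test, y_test) = data

# Hypothetical alternative: index the same result by set and source name.
by_name = {
    set_: dict(zip(sources, values))
    for set_, values in zip(sets, data)
}
```

Keeping the requested order means the unpacking pattern on the left-hand side only has to mirror the `sets` and `sources` arguments.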

The raw data is loaded as numpy arrays:


In [35]:
X_train


Out[35]:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [36]:
y_train


Out[36]:
array([1, 1, 1, ..., 0, 1, 1], dtype=uint8)

These arrays are ready to use as fuel for modelling!

Data as pandas objects

The data is originally saved as pandas objects, and can be retrieved as such using the read_frame class method.

Features are available under the 'feats' namespace:


In [37]:
skchem.data.BursiAmes.read_frame('feats/X_morg')


Out[37]:
morgan_fp_idx  0    1    2    3    4    ...  2043  2044  2045  2046  2047
batch
1728-95-6      0    0    0    0    0    ...  0     0     0     0     0
74550-97-3     0    0    0    0    0    ...  0     0     0     0     0
16757-83-8     0    0    0    0    0    ...  0     0     0     0     0
553-97-9       0    0    0    0    0    ...  0     0     0     0     0
115-39-9       0    0    0    0    0    ...  0     0     0     0     0
...            ...  ...  ...  ...  ...  ...  ...   ...   ...   ...   ...
874-60-2       0    0    0    0    0    ...  0     0     0     0     0
92-66-0        0    0    0    0    0    ...  0     0     0     0     0
594-71-8       0    0    0    0    0    ...  0     0     0     0     0
55792-21-7     0    0    0    0    0    ...  0     0     0     0     0
84987-77-9     0    0    0    0    0    ...  0     0     0     0     0

4297 rows × 2048 columns

Target variables are available under 'targets':


In [39]:
skchem.data.BursiAmes.read_frame('targets/y')


Out[39]:
batch
1728-95-6     1
74550-97-3    1
16757-83-8    1
553-97-9      0
115-39-9      0
             ..
874-60-2      1
92-66-0       0
594-71-8      1
55792-21-7    0
84987-77-9    1
Name: is_mutagen, dtype: uint8

Set membership masks are available under 'indices':


In [40]:
skchem.data.BursiAmes.read_frame('indices/train')


Out[40]:
batch
1728-95-6      True
74550-97-3     True
16757-83-8     True
553-97-9       True
115-39-9       True
              ...  
874-60-2      False
92-66-0       False
594-71-8      False
55792-21-7    False
84987-77-9    False
Name: split, dtype: bool
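
A boolean mask like the one above can be used to pick out the rows of a feature frame that belong to a set. The sketch below uses small made-up frames in place of 'feats/X_morg' and 'indices/train', and assumes the mask and the features share the same 'batch' index; it is an illustration of pandas boolean indexing, not a description of how load_data is implemented.

```python
import pandas as pd

# Toy stand-ins for the feature frame and the train mask (values are made up).
index = pd.Index(['mol-a', 'mol-b', 'mol-c', 'mol-d'], name='batch')
X = pd.DataFrame([[0, 1], [1, 0], [1, 1], [0, 0]], index=index)
train_mask = pd.Series([True, True, False, False], index=index, name='split')

# Boolean indexing keeps only the rows flagged as members of the set.
X_train = X.loc[train_mask]

# Summing a boolean mask counts the members of the set.
n_train = int(train_mask.sum())

# .values gives the underlying numpy array, as used by the in-memory API.
X_train_array = X_train.values
```

With the real data, the number of True entries across the train, valid and test masks would be expected to match the set sizes printed earlier.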

Finally, molecules are accessible via 'structure':


In [42]:
skchem.data.BursiAmes.read_frame('structure')


Out[42]:
batch
1728-95-6      <Mol: [H]c1c([H])c([H])c(-c2nc(-c3c([H])c([H])...
119-34-6       <Mol: [H]Oc1c([H])c([H])c(N([H])[H])c([H])c1[N...
371-40-4           <Mol: [H]c1c([H])c(N([H])[H])c([H])c([H])c1F>
2319-96-2      <Mol: [H]c1c([H])c([H])c2c([H])c3c(c([H])c(C([...
1822-51-1         <Mol: [H]c1nc([H])c([H])c(C([H])([H])Cl)c1[H]>
                                     ...                        
84-64-0        <Mol: [H]c1c([H])c([H])c(C(=O)OC2([H])C([H])([...
121808-62-6    <Mol: [H]OC(=O)C1([H])N(C(=O)C2([H])N([H])C(=O...
134-20-3       <Mol: [H]c1c([H])c([H])c(N([H])[H])c(C(=O)OC([...
6441-91-4      <Mol: [H]Oc1c([H])c(S(=O)(=O)O[H])c([H])c2c([H...
97534-21-9     <Mol: [H]Oc1nc(=S)n([H])c(O[H])c1C(=O)N([H])c1...
Name: structure, dtype: object