In [30]:
import pandas as pd
import skchem.data  # needed for the skchem.data.BursiAmes calls below
pd.options.display.max_columns = pd.options.display.max_rows = 10
scikit-chem provides a simple interface to chemical datasets, and a framework for constructing these datasets. The data module uses fuel to make complex out-of-memory iterative functionality straightforward (see the fuel documentation). It also offers an abstraction that allows easy loading of smaller datasets that fit in memory.
Datasets consist of sets and sources. Simply put, sets are collections of molecules in the dataset, and sources are types of data relating to these molecules.
For demonstration purposes, we will use the Bursi Ames dataset. This has 3 sets:
In [31]:
skchem.data.BursiAmes.available_sets()
Out[31]:
And many sources:
In [32]:
skchem.data.BursiAmes.available_sources()
Out[32]:
For this example, we will load the X_morg and y sources for all the sets. X_morg holds circular fingerprints, and y holds the target labels (in this case, whether each molecule is a mutagen).
We can load the data for the requested sets and sources using the in-memory API:
In [33]:
kws = {'sets': ('train', 'valid', 'test'), 'sources':('X_morg', 'y')}
(X_train, y_train), (X_valid, y_valid), (X_test, y_test) = skchem.data.BursiAmes.load_data(**kws)
The requested data is returned as nested tuples, ordered first by set and then by source, so it can be unpacked directly as above.
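To make the nesting order concrete, here is a minimal sketch in plain Python. The `TOY` dictionary and the `load_nested` helper are hypothetical stand-ins, not part of scikit-chem; they only mimic the set-then-source ordering described above.

```python
# Hypothetical toy data: TOY[set_name][source_name] -> list of values.
# This is NOT the scikit-chem implementation, only an illustration of the
# nesting order of the returned tuples.
TOY = {
    'train': {'X_morg': [[0, 1], [1, 1], [1, 0]], 'y': [0, 1, 1]},
    'valid': {'X_morg': [[0, 0]], 'y': [0]},
    'test': {'X_morg': [[1, 0]], 'y': [1]},
}


def load_nested(sets, sources):
    """Return one inner tuple per set, each holding one item per source."""
    return tuple(
        tuple(TOY[set_name][source] for source in sources)
        for set_name in sets
    )


kws = {'sets': ('train', 'valid', 'test'), 'sources': ('X_morg', 'y')}
(X_train, y_train), (X_valid, y_valid), (X_test, y_test) = load_nested(**kws)

print(len(X_train), len(y_train))  # 3 3
```

In this sketch the outer tuple follows the order of `sets` and each inner tuple follows the order of `sources`, which is why the unpacking pattern mirrors the two arguments.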
In [34]:
print('train shapes:', X_train.shape, y_train.shape)
print('valid shapes:', X_valid.shape, y_valid.shape)
print('test shapes:', X_test.shape, y_test.shape)
The raw data is loaded as NumPy arrays:
In [35]:
X_train
Out[35]:
In [36]:
y_train
Out[36]:
These arrays should be ready to use as input for modelling!
Individual parts of the dataset can also be read with read_frame. Features are under 'feats':
In [37]:
skchem.data.BursiAmes.read_frame('feats/X_morg')
Out[37]:
Target variables under 'targets':
In [39]:
skchem.data.BursiAmes.read_frame('targets/y')
Out[39]:
Set membership masks under 'indices':
In [40]:
skchem.data.BursiAmes.read_frame('indices/train')
Out[40]:
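How such a mask might be applied can be sketched with plain pandas. The frame `feats`, the series `train_mask`, and the molecule names below are made-up illustration data, and the assumption that a mask is a boolean series sharing the features' index is ours rather than something the library confirms here.

```python
import pandas as pd

# Made-up feature table indexed by molecule id (illustration only).
feats = pd.DataFrame(
    {'bit_0': [0, 1, 1, 0], 'bit_1': [1, 1, 0, 0]},
    index=['mol_a', 'mol_b', 'mol_c', 'mol_d'],
)

# Assumed form of a set membership mask: one boolean per molecule,
# True where the molecule belongs to the set.
train_mask = pd.Series([True, False, True, False], index=feats.index)

# Boolean indexing keeps only the rows whose mask value is True.
X_train = feats[train_mask]
print(X_train.index.tolist())  # ['mol_a', 'mol_c']
```

Under that assumption, the same mask could be reused to select the matching rows of any other frame that shares the index, such as the targets.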
Finally, molecules are accessible via 'structure':
In [42]:
skchem.data.BursiAmes.read_frame('structure')
Out[42]: