Here you can see how datapot works with the Mushroom Data Set. The important detail about this dataset is that all its features are categorical.
In [1]:
import datapot as dp
import pandas as pd
import time
Creating the DataPot object.
In [2]:
datapot = dp.DataPot()
In [3]:
import bz2
ftr = bz2.BZ2File('../data/mushrooms.jsonlines.bz2')
Let's call the detect method. It automatically finds appropriate transformers for the fields of the jsonlines file. The 'limit' parameter sets how many objects are used to detect the right transformers.
In [4]:
t0 = time.time()
datapot.detect(ftr, limit=1000)
print('detect time:', time.time() - t0)
datapot
Out[4]:
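To make the idea of detecting on a limited sample more concrete, here is a minimal sketch of how such a step could work. It is not datapot's actual implementation: the helper names `sample_field_values` and `guess_categorical`, the `max_unique` threshold, and the field names in the toy records are all assumptions made for illustration.

```python
import bz2
import io
import json
from collections import defaultdict


def sample_field_values(stream, limit=1000):
    """Collect the distinct values seen per field in the first `limit` records."""
    values = defaultdict(set)
    for i, line in enumerate(stream):
        if i >= limit:
            break
        record = json.loads(line)
        for field, value in record.items():
            values[field].add(value)
    return values


def guess_categorical(values, max_unique=20):
    """Hypothetical rule: a field with few distinct values is treated as categorical."""
    return sorted(f for f, seen in values.items() if len(seen) <= max_unique)


# Tiny in-memory stand-in for a bz2-compressed jsonlines file (made-up records).
raw = "\n".join(json.dumps(r) for r in [
    {"cap_shape": "x", "odor": "p"},
    {"cap_shape": "b", "odor": "a"},
    {"cap_shape": "x", "odor": "l"},
]).encode("utf-8")
stream = bz2.BZ2File(io.BytesIO(bz2.compress(raw)))

# Only the first two records are inspected, mirroring the role of 'limit'.
values = sample_field_values(stream, limit=2)
categorical_fields = guess_categorical(values)
print(categorical_fields)
```

A smaller limit makes detection faster but can miss rare values, so the sample should be large enough to be representative of the file.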
In [5]:
datapot.fit(ftr)
Out[5]:
In [6]:
datapot
Out[6]:
As a result, only categorical transformers were chosen.
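The column names used later in this notebook ('e_e', 'e_t') suggest that a categorical transformer produces one indicator column per field value. The sketch below reproduces that naming with pandas `get_dummies`. This is an assumption about the output shape, not a description of datapot's internals, and the toy records are made up.

```python
import pandas as pd

# Made-up records with two categorical fields.
raw = pd.DataFrame({"odor": ["p", "a", "p"], "e": ["e", "t", "e"]})

# One indicator column per (field, value) pair, named "<field>_<value>".
encoded = pd.get_dummies(raw, prefix_sep="_", dtype=int)
print(list(encoded.columns))

# Indicator columns of the same field are complementary: exactly one is set per row.
print((encoded["e_e"] + encoded["e_t"] == 1).all())
```

Because the indicators of one field determine each other, every column derived from the target field has to be removed from the feature matrix before training.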
In [7]:
data = datapot.transform(ftr)
In [8]:
data.head()
Out[8]:
In [9]:
data.columns
Out[9]:
Let's test the new features. The 'e' field is chosen as the prediction target.
In [10]:
X = data.drop(['e_e', 'e_t'], axis=1)
y = data['e_e']
In [11]:
from sklearn.model_selection import cross_val_score
In [12]:
from xgboost import XGBClassifier
clf = XGBClassifier(n_estimators=100)
cross_val_score(clf, X, y, cv=5)
Out[12]:
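`cross_val_score` returns one score per fold, so with `cv=5` the result is an array of five values. A common way to read it is as a mean with a spread. The sketch below shows this on a toy dataset with a scikit-learn decision tree standing in for the classifier. The data and the choice of model are assumptions for illustration and say nothing about the scores obtained on the mushroom data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for encoded features: one informative column, one noise column.
rng = np.random.default_rng(0)
y = np.array([0, 1] * 25)
X = np.column_stack([y, rng.integers(0, 2, size=50)])

clf = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # one accuracy value per fold

summary = f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}"
print(summary)
```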