Dataset with categorical features.

Here you can see how datapot works with the Mushroom Data Set. The important detail about this dataset is that all of its features are categorical.


In [1]:
import datapot as dp
import pandas as pd

import time

Creating the DataPot object.


In [2]:
datapot = dp.DataPot()

In [3]:
import bz2
ftr = bz2.BZ2File('../data/mushrooms.jsonlines.bz2')
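The file is in bz2-compressed JSON Lines format: one JSON object per line. As a quick illustration of that format (using two toy records in memory, not the real mushrooms file), the round trip looks like this:

```python
import bz2
import json

# Two toy records standing in for rows of the mushrooms dataset;
# the field names 't' and 'x' mimic the headers seen later in the notebook.
records = [{'t': 't', 'x': 'x'}, {'t': 'f', 'x': 'b'}]

# JSON Lines: one json.dumps per line, then bz2-compress the whole payload
payload = bz2.compress('\n'.join(json.dumps(r) for r in records).encode())

# Reading it back, one JSON object per decompressed line
lines = bz2.decompress(payload).decode().splitlines()
parsed = [json.loads(line) for line in lines]
```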

Let's call the detect method. It automatically finds appropriate transformers for the fields of the JSON Lines file. The 'limit' parameter sets how many objects are used to detect the right transformers.


In [4]:
t0 = time.time()
datapot.detect(ftr, limit=1000)
print('detect time:', time.time() - t0)
datapot


detect time: 0.1961979866027832
Out[4]:
DataPot class instance
 - number of features without transformation: 23
 - number of new features: 0
features to transform: 
	('t', [SVDOneHotTransformer])
	('p.1', [SVDOneHotTransformer])
	('p.2', [SVDOneHotTransformer])
	('n.1', [SVDOneHotTransformer])
	('x', [SVDOneHotTransformer])
	('o', [SVDOneHotTransformer])
	('p', [SVDOneHotTransformer])
	('e', [SVDOneHotTransformer])
	('k', [SVDOneHotTransformer])
	('s', [SVDOneHotTransformer])
	('f', [SVDOneHotTransformer])
	('u', [SVDOneHotTransformer])
	('s.3', [SVDOneHotTransformer])
	('w.2', [SVDOneHotTransformer])
	('s.1', [SVDOneHotTransformer])
	('n', [SVDOneHotTransformer])
	('w.1', [SVDOneHotTransformer])
	('c', [SVDOneHotTransformer])
	('e.1', [SVDOneHotTransformer])
	('k.1', [SVDOneHotTransformer])
	('w', [SVDOneHotTransformer])
	('s.2', [SVDOneHotTransformer])
	('p.3', [SVDOneHotTransformer])

In [5]:
datapot.fit(ftr)


Out[5]:
DataPot class instance
 - number of features without transformation: 23
 - number of new features: 72
features to transform: 
	('t', [SVDOneHotTransformer])
	('p.1', [SVDOneHotTransformer])
	('p.2', [SVDOneHotTransformer])
	('n.1', [SVDOneHotTransformer])
	('x', [SVDOneHotTransformer])
	('o', [SVDOneHotTransformer])
	('p', [SVDOneHotTransformer])
	('e', [SVDOneHotTransformer])
	('k', [SVDOneHotTransformer])
	('s', [SVDOneHotTransformer])
	('f', [SVDOneHotTransformer])
	('u', [SVDOneHotTransformer])
	('s.3', [SVDOneHotTransformer])
	('w.2', [SVDOneHotTransformer])
	('s.1', [SVDOneHotTransformer])
	('n', [SVDOneHotTransformer])
	('w.1', [SVDOneHotTransformer])
	('c', [SVDOneHotTransformer])
	('e.1', [SVDOneHotTransformer])
	('k.1', [SVDOneHotTransformer])
	('w', [SVDOneHotTransformer])
	('s.2', [SVDOneHotTransformer])
	('p.3', [SVDOneHotTransformer])

In [6]:
datapot


Out[6]:
DataPot class instance
 - number of features without transformation: 23
 - number of new features: 72
features to transform: 
	('t', [SVDOneHotTransformer])
	('p.1', [SVDOneHotTransformer])
	('p.2', [SVDOneHotTransformer])
	('n.1', [SVDOneHotTransformer])
	('x', [SVDOneHotTransformer])
	('o', [SVDOneHotTransformer])
	('p', [SVDOneHotTransformer])
	('e', [SVDOneHotTransformer])
	('k', [SVDOneHotTransformer])
	('s', [SVDOneHotTransformer])
	('f', [SVDOneHotTransformer])
	('u', [SVDOneHotTransformer])
	('s.3', [SVDOneHotTransformer])
	('w.2', [SVDOneHotTransformer])
	('s.1', [SVDOneHotTransformer])
	('n', [SVDOneHotTransformer])
	('w.1', [SVDOneHotTransformer])
	('c', [SVDOneHotTransformer])
	('e.1', [SVDOneHotTransformer])
	('k.1', [SVDOneHotTransformer])
	('w', [SVDOneHotTransformer])
	('s.2', [SVDOneHotTransformer])
	('p.3', [SVDOneHotTransformer])

As a result, only categorical transformers were chosen.


In [7]:
data = datapot.transform(ftr)


/usr/local/lib/python3.6/site-packages/datapot/__init__.py:137: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  return pd.DataFrame(data=np.hstack(columns), columns=names).convert_objects(convert_numeric=True)
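The warning above is about `DataFrame.convert_objects`, which pandas has deprecated in favor of the data-type-specific converters. A minimal sketch of the column-wise replacement using `pd.to_numeric` (on toy data, not datapot's actual code):

```python
import pandas as pd

# Toy frame with one numeric-looking column and one genuinely textual column
df = pd.DataFrame({'a': ['1', '2'], 'b': ['x', 'y']})

def to_numeric_safe(col):
    """Coerce a column to numeric where possible, otherwise leave it as-is."""
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

# Column-wise replacement for the deprecated convert_objects(convert_numeric=True)
converted = df.apply(to_numeric_safe)
```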

In [8]:
data.head()


Out[8]:
t_t t_f p.1_None p.2_None n.1_None x_x x_b x_s x_f x_k ... w_g w_p w_n w_b w_e w_o w_c w_y s.2_None p.3_None
0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
1 1.0 0.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
2 1.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
3 0.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
4 1.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0

5 rows × 72 columns


In [9]:
data.columns


Out[9]:
Index(['t_t', 't_f', 'p.1_None', 'p.2_None', 'n.1_None', 'x_x', 'x_b', 'x_s',
       'x_f', 'x_k', 'x_c', 'o_o', 'o_t', 'o_n', 'p_e', 'p_p', 'e_e', 'e_t',
       'k_k', 'k_n', 'k_g', 'k_p', 'k_w', 'k_h', 'k_u', 'k_e', 'k_b', 'k_r',
       'k_y', 'k_o', 's_s', 's_y', 's_f', 's_g', 'f_f', 'f_a', 'u_g', 'u_m',
       'u_u', 'u_d', 'u_p', 'u_w', 'u_l', 's.3_None', 'w.2_None', 's.1_None',
       'n_y', 'n_w', 'n_g', 'n_n', 'n_e', 'n_p', 'n_b', 'n_u', 'n_c', 'n_r',
       'w.1_None', 'c_c', 'c_w', 'e.1_None', 'k.1_None', 'w_w', 'w_g', 'w_p',
       'w_n', 'w_b', 'w_e', 'w_o', 'w_c', 'w_y', 's.2_None', 'p.3_None'],
      dtype='object')
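The `field_value` column naming seen here (e.g. `x_x`, `x_b`, `x_s`) is the usual one-hot convention. For comparison only (this is not datapot's implementation), pandas' own `get_dummies` produces the same scheme on a toy column:

```python
import pandas as pd

# A toy categorical column with the cap-shape-like values seen above
df = pd.DataFrame({'x': ['x', 'b', 's', 'x']})

# get_dummies names the indicator columns '<field>_<value>', e.g. x_b, x_s, x_x
dummies = pd.get_dummies(df, columns=['x'])
```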

Let's test the new features. The 'e' field is chosen as the prediction target.


In [10]:
X = data.drop(['e_e', 'e_t'], axis=1)
y = data['e_e']

In [11]:
from sklearn.model_selection import cross_val_score

In [12]:
from xgboost import XGBClassifier
clf = XGBClassifier(n_estimators=100)
cross_val_score(clf, X, y, cv=5)


Out[12]:
array([ 0.976     ,  0.92430769,  0.98523077,  0.9612069 ,  0.92118227])
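To summarize the five fold scores as a single figure, one can report their mean and standard deviation (the array below is copied from the output above):

```python
import numpy as np

# The five cross-validation fold scores from the cell above
scores = np.array([0.976, 0.92430769, 0.98523077, 0.9612069, 0.92118227])

# Mean accuracy with its spread across folds
print('accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))
```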
