Aim

The aim of this notebook is to give a brief overview of how to use the evolutionary-sampling-powered ensemble models developed as part of the EvoML research project.

The notebook will be made more verbose if time permits. The priority is to showcase the flexible API of the new estimators, which encourages research and tinkering.

Contents

  • Subsampling
  • Subspacing

1. Subsampling - Sampling in the example space - rows will be mutated and evolved.
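
To make "sampling in the example space" concrete, here is a minimal, hypothetical sketch (plain pandas/NumPy, not the EvoML API): an individual is simply a set of row indices, and mutation swaps a few sampled rows for unsampled ones.

In [ ]:
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df_toy = pd.read_csv('datasets/ozone.csv')

# An "individual" is a subsample of rows, represented by its row indices.
individual = rng.choice(df_toy.index, size=int(0.2 * len(df_toy)), replace=False)

# A toy mutation: swap a few sampled rows for out-of-sample ones.
def mutate(indices, n_swaps=5):
    out_of_sample = np.setdiff1d(df_toy.index, indices)
    drop = rng.choice(indices, size=n_swaps, replace=False)
    add = rng.choice(out_of_sample, size=n_swaps, replace=False)
    return np.union1d(np.setdiff1d(indices, drop), add)

mutated = mutate(individual)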


In [2]:
from evoml.subsampling import BasicSegmenter_FEMPO, BasicSegmenter_FEGT, BasicSegmenter_FEMPT


  File "C:\Users\harshnisar\Programming\bhanu\EvoML\evoml\subsampling\evaluators.py", line 201
    print 'Size of OOB', len(out_bag_idx)
                      ^
SyntaxError: invalid syntax

In [2]:
df = pd.read_csv('datasets/ozone.csv')

In [3]:
df.head(2)


Out[3]:
temp invHt press vis milPress hum invTemp wind output
0 0.220588 0.528124 0.250000 0.714286 0.619048 0.121622 0.313725 0.190476 3
1 0.294118 0.097975 0.255682 0.285714 0.603175 0.243243 0.428571 0.142857 5

In [4]:
X, y = df.iloc[:,:-1], df['output']

In [6]:
print(BasicSegmenter_FEGT.__doc__)


    Uses a basic evolutionary algorithm to find the best subsets of X and trains
    Linear Regression on each subset. For a given input row, the prediction
    is made by the model trained on the segment closest to that input.

    Same as the BasicSegmenter, but uses a list of trained models instead of
    DataFrames as each individual. This is done to boost performance.

    Parameters
    ----------
    n : Integer, optional, default: 10
        The number of segments you want in your dataset.

    base_estimator : estimator, optional, default: LinearRegression
        The base estimator trained on each segment.

    test_size : float, optional, default: 0.2
        Test size that the algorithm internally uses in its
        fitness function.

    n_population : Integer, optional, default: 30
        The number of ensembles present in the population.

    init_sample_percentage : float, optional, default: 0.2
    

    Attributes
    -----------
    best_enstimator_ : estimator 
    
    segments_ : list of DataFrames

    

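The "closest segment" rule in the docstring can be sketched as follows. This is an illustrative reimplementation that assumes closeness is measured from the input row to each segment's centroid; the actual distance measure lives inside evoml.

In [ ]:
import numpy as np

def predict_closest_segment(x_row, segments, models):
    # segments: list of DataFrames holding only the feature columns (assumed);
    # models: one fitted estimator per segment.
    # Euclidean distance to the segment mean is an illustrative choice.
    centroids = [seg.mean(axis=0).values for seg in segments]
    dists = [np.linalg.norm(x_row - c) for c in centroids]
    best = int(np.argmin(dists))
    return models[best].predict(x_row.reshape(1, -1))[0]
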
In [7]:
from sklearn.tree import DecisionTreeRegressor
clf_dt = DecisionTreeRegressor(max_depth=3)
clf = BasicSegmenter_FEGT(base_estimator=clf_dt, statistics=True)

In [8]:
clf.fit(X, y)


gen	nevals	avg   	std     	min    	max    
0  	30    	5.2781	0.589689	4.23039	6.80946
1  	22    	4.74899	0.469555	4.2272 	5.96946
2  	22    	4.57545	0.334731	4.09894	5.43991
3  	22    	4.44866	0.488645	4.03896	6.3982 
4  	19    	4.31265	0.222423	3.88692	4.74369
5  	27    	4.32705	0.449177	3.88692	6.41261
6  	25    	4.30957	0.465491	3.88692	5.8717 
7  	21    	4.26635	0.361028	3.88692	5.27358
8  	20    	4.251  	0.552977	3.89738	6.27109
9  	21    	4.15874	0.338961	3.79678	5.30784
10 	22    	4.10114	0.29433 	3.79678	5.05167
11 	25    	4.04121	0.253477	3.76978	4.70117
12 	20    	4.01039	0.363592	3.75632	5.65705
13 	23    	3.98229	0.272671	3.70865	4.60254
14 	24    	3.89309	0.226678	3.70865	4.78833
15 	20    	3.81467	0.219473	3.63065	4.8581 
16 	23    	3.8633 	0.270345	3.63065	4.59721
17 	25    	3.82304	0.197699	3.5993 	4.50707
18 	18    	3.81666	0.359854	3.55746	5.55283
19 	23    	3.80377	0.23816 	3.55746	4.75423
20 	22    	3.75274	0.146903	3.55746	4.16259
21 	24    	3.84858	0.300793	3.37172	4.88835
22 	19    	3.78003	0.218148	3.37172	4.30379
23 	18    	3.77996	0.356439	3.35977	4.90352
24 	26    	3.6638 	0.2453  	3.32211	4.39246
25 	24    	3.63329	0.295922	3.32211	4.42435
26 	22    	3.64748	0.395289	3.28616	4.86763
27 	18    	3.74302	0.757016	3.28616	7.2129 
28 	22    	3.4815 	0.286406	3.28616	4.75574
29 	19    	3.50169	0.417142	3.25784	5.15243
30 	23    	3.55941	0.431252	3.28178	5.18807
31 	22    	3.44066	0.372184	3.09765	5.04611
32 	21    	3.36931	0.266471	3.09765	4.40901
33 	23    	3.3757 	0.36565 	3.09765	4.82305
34 	20    	3.30006	0.283425	3.03929	4.23263
35 	21    	3.32881	0.337995	3.03929	4.48041
36 	23    	3.36321	0.355317	3.03929	4.45418
37 	18    	3.36533	0.456911	3.03929	4.73097
38 	20    	3.19259	0.164568	2.9845 	3.8426 
39 	21    	3.34565	0.367022	3.05676	4.66485
40 	19    	3.39098	0.472878	3.05676	4.92911
41 	24    	3.45124	0.683043	3.05676	6.61767
42 	22    	3.32108	0.46326 	2.99966	5.17868
43 	24    	3.25789	0.289575	3.05676	4.13871
44 	17    	3.317  	0.47207 	3.00231	4.74823
45 	23    	3.23767	0.387888	2.98618	5.05057
46 	19    	3.37091	0.478699	2.98618	4.65248
47 	19    	3.31203	0.477201	2.98618	4.90274
48 	23    	3.29415	0.348598	2.98618	4.12619
49 	26    	3.26899	0.349108	2.97331	4.36843
50 	20    	3.18521	0.237827	2.97331	3.88278
Out[8]:
BasicSegmenter_FEGT(base_estimator=DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best'),
          crossover_func=<function cxTwoPoint at 0x106C5B70>, cxpb=0.5,
          indpb=0.2, init_sample_percentage=0.2, mutpb=0.5, n=10,
          n_population=30, n_votes=1, ngen=50, statistics=True,
          test_size=0.2, tournsize=3)

In [9]:
clf.score(X, y)


Out[9]:
0.69093734986554811

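The score above is computed on the same data the ensemble was fit on, so it is optimistic (for sklearn regressors, score is R-squared; this estimator is assumed to follow that convention). A hypothetical held-out evaluation:

In [ ]:
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

# Hold out a quarter of the data and score the ensemble on it.
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
clf_holdout = BasicSegmenter_FEGT(base_estimator=clf_dt)
clf_holdout.fit(Xtr, ytr)
clf_holdout.score(Xte, yte)
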
In [10]:
EGs = clf.segments_

In [11]:
len(EGs)


Out[11]:
10

In [12]:
sampled_datasets = [eg.get_data() for eg in EGs]

In [13]:
[sd.shape for sd in sampled_datasets]


Out[13]:
[(27, 9),
 (66, 9),
 (40, 9),
 (118, 9),
 (66, 9),
 (53, 9),
 (53, 9),
 (53, 9),
 (66, 9),
 (66, 9)]

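The segments overlap and need not cover every row. Assuming get_data() preserves the original row index, a quick way to see how often each row was sampled:

In [ ]:
import pandas as pd

# How many segments does each original row appear in?
coverage = pd.concat(sampled_datasets).index.value_counts()
coverage.describe()
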
2. Subspacing - Sampling in the feature space - columns will be mutated and evolved.

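Analogous to subsampling, each individual here carries subsets of features. A hypothetical sketch of "sampling in the feature space", reusing the X defined above:

In [ ]:
import numpy as np

rng = np.random.RandomState(0)
features = list(X.columns)

# A "chromosome" is a subset of columns, drawn between featMin and featMax.
n_feats = rng.randint(1, len(features) + 1)
chromosome = list(rng.choice(features, size=n_feats, replace=False))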

In [14]:
from evoml.subspacing import FeatureStackerFEGT, FeatureStackerFEMPO

In [15]:
print(FeatureStackerFEGT.__doc__)


    Uses a basic evolutionary algorithm to find the best subspaces of X and trains
    a model on each subspace. For a given input row, the prediction is based on the
    ensemble that performed best on the test set, and is the average of all the
    chromosome predictions.

    Same as the BasicSegmenter, but uses a list of trained models instead of
    DataFrames as each individual. This is done to boost performance.

    Parameters
    ----------
    test_size: float, default = 0.2
        Test size that the algorithm internally uses in its fitness
        function
    
    N_population : Integer, default: 30
        The number of individuals in the population used by the
        evolutionary algorithm.

    N_individual : Integer, default: 5
        The number of chromosomes in each individual of the population.

    featMin : Integer, default: 1
        The minimum number of features in a subspace of the dataset.
        Values <= 0 are changed to 1.

    featMax : Integer, default: the number of features in the dataset
        The maximum number of features in a subspace of the dataset.
        Values < featMin are changed to featMin.

    indpb : float, default: 0.05
        The probability with which each attribute of a chromosome is mutated.

    ngen : Integer, default: 10
        The number of generations the evolutionary algorithm runs for.

    mutpb : float, default: 0.40
        The probability with which an individual undergoes mutation.

    cxpb : float, default: 0.50
        The probability with which two individuals undergo crossover.

    base_estimator : model, default: LinearRegression
        The type of model trained on each chromosome's subspace.

    crossover_func : crossover function, default: tools.cxTwoPoint
        The crossover function used between individuals (see the
        documentation of DEAP's eaSimple).

    test_frac, test_frac_flag : Parameters for experimenting with the
        test set. Not in use as of now.

    Attributes
    -----------
    segment: HallOfFame individual 
        Gives you the best individual from the whole population. 
        The best individual can be accessed via segment[0]

    

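The prediction rule described above - each chromosome predicts from its own feature subset and the results are averaged - can be sketched as follows; the (model, columns) pairing is an assumed representation for illustration, not the internal evoml structure.

In [ ]:
import numpy as np

def predict_feature_stack(X, chromosomes):
    # chromosomes: list of (fitted_model, feature_columns) pairs (assumed).
    preds = [model.predict(X[cols]) for model, cols in chromosomes]
    return np.mean(preds, axis=0)
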
In [16]:
clf = FeatureStackerFEGT(ngen=30)

In [17]:
clf.fit(X, y)


gen	nevals	avg    	min    	max    
0  	30    	4.80779	4.30355	5.31144
1  	14    	4.55898	4.30355	4.96747
2  	24    	4.47572	4.30232	5.01653
3  	30    	4.39705	4.24509	4.5792 
4  	13    	4.3305 	4.22728	4.70083
5  	22    	4.27701	4.22728	4.38708
6  	22    	4.25929	4.22728	4.38545
7  	21    	4.23435	4.21544	4.24509
8  	17    	4.23617	4.21544	4.38545
9  	18    	4.22293	4.21544	4.22728
10 	21    	4.21741	4.21544	4.22728
11 	27    	4.21559	4.21544	4.22013
12 	20    	4.21544	4.21544	4.21544
13 	20    	4.21544	4.21544	4.21544
14 	28    	4.21544	4.21544	4.21544
15 	17    	4.21536	4.21307	4.21544
16 	22    	4.21522	4.21307	4.21833
17 	26    	4.21459	4.21307	4.21831
18 	21    	4.21346	4.21307	4.21544
19 	19    	4.21307	4.21307	4.21307
20 	20    	4.21307	4.21307	4.21307
21 	24    	4.21307	4.21307	4.21307
22 	23    	4.21307	4.21307	4.21307
23 	18    	4.21328	4.21307	4.21833
24 	21    	4.21307	4.21307	4.21307
25 	23    	4.21307	4.21307	4.21307
26 	23    	4.21307	4.21307	4.21307
27 	20    	4.2131 	4.21307	4.21409
28 	17    	4.21314	4.21307	4.21532
29 	25    	4.21307	4.21307	4.21307
30 	21    	4.21307	4.21307	4.21307
Out[17]:
FeatureStackerFEGT(N_individual=5, N_population=30,
          base_estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
          crossover_func=<function cxTwoPoint at 0x106C5B70>, cxpb=0.5,
          featMax=7, featMin=1, indpb=0.05, mutpb=0.4, ngen=30,
          test_frac=0.3, test_frac_flag=False, test_size=0.2)

In [18]:
clf.score(X, y)


Out[18]:
0.65262771433009603

In [19]:
## Get the Hall of Fame individual
hof = clf.segment[0]

In [20]:
sampled_datasets = [eg.get_data() for eg in hof]

In [21]:
[data.columns.tolist() for data in sampled_datasets]


Out[21]:
[['hum', 'milPress', 'temp', 'invTemp', 'vis', 'invHt', 'press', 'output'],
 ['invHt', 'milPress', 'hum', 'temp', 'invTemp', 'vis', 'output'],
 ['invHt', 'output'],
 ['invHt', 'hum', 'vis', 'output'],
 ['hum', 'press', 'vis', 'milPress', 'invTemp', 'output']]

In [22]:
## Original X columns
X.columns


Out[22]:
Index([u'temp', u'invHt', u'press', u'vis', u'milPress', u'hum', u'invTemp',
       u'wind'],
      dtype='object')

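Comparing the subspaces above with the original columns shows which features the search kept. A quick check for features that no chromosome selected (here, 'wind'):

In [ ]:
# Features dropped by every chromosome of the best individual.
used = set().union(*[set(d.columns) for d in sampled_datasets]) - {'output'}
set(X.columns) - used
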
In [ ]:
# Exploration of the dataset with benchmark classifiers.
# Assumes a classification train/test split (X_train, X_test, y_train, y_test),
# e.g. the GAMETES split created in the FECV section below.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

clf = DecisionTreeClassifier(random_state=34092)
clf.fit(X_train, y_train)
pred_DTC = clf.predict(X_test)
print('Base DecisionTreeClassifier accuracy: {}'.format(clf.score(X_test, y_test)))

clf = RandomForestClassifier(random_state=34092)
clf.fit(X_train, y_train)
pred_RFC = clf.predict(X_test)
print('Base RandomForestClassifier accuracy: {}'.format(clf.score(X_test, y_test)))

clf = GradientBoostingClassifier(random_state=34092)
clf.fit(X_train, y_train)
pred_GBC = clf.predict(X_test)
print('Base GradientBoostingClassifier accuracy: {}'.format(clf.score(X_test, y_test)))

FECV - a FeatureStacker variant whose fitness is evaluated with cross-validation (note the folds_CV parameter below)



In [1]:
import pandas as pd
df = pd.read_csv('datasets/ozone.csv')
X, y = df.iloc[:,:-1], df['output']
from evoml.subspacing import FeatureStackerFEGT, FeatureStackerFEMPO, FeatureStackerFECV
#print(FeatureStackerFECV.__doc__)
clf = FeatureStackerFECV(ngen=3)
clf.fit(X, y)
clf.predict(X)


C:\Anaconda3\lib\site-packages\sklearn\cross_validation.py:516: Warning: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=3.
  % (min_labels, self.n_folds)), Warning)
gen	nevals	avg    	min    	max    
0  	40    	4.89781	4.76298	5.61928
C:\Anaconda3\lib\site-packages\sklearn\cross_validation.py:516: Warning: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=3.
  % (min_labels, self.n_folds)), Warning)
1  	27    	5.01939	4.796  	5.64465
2  	35    	5.15548	4.82603	5.61928
C:\Anaconda3\lib\site-packages\sklearn\cross_validation.py:516: Warning: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=3.
  % (min_labels, self.n_folds)), Warning)
3  	28    	5.38592	4.98559	5.93645
Out[1]:
array([  5.05453477,  10.65522201,  13.09773067,  12.69005093,
         8.29522015,   7.24764979,  11.68401972,  10.28011928,
        ...,
        12.34092824,  10.62054915])

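To put the FECV predictions above in context, one can compare them against y; this is again an in-sample check, hence optimistic:

In [ ]:
from sklearn.metrics import mean_squared_error

# In-sample RMSE of the FECV ensemble.
mean_squared_error(y, clf.predict(X)) ** 0.5
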
In [8]:
import numpy as np
import pandas as pd
from evoml.subspacing import FeatureStackerFEGT, FeatureStackerFEMPO, FeatureStackerFECV
from sklearn.cross_validation import train_test_split


data = pd.read_csv('datasets/GAMETES.csv',sep='\t')
headers_ = list(data.columns)

features = data[headers_[0:-1]]
output = data[headers_[-1]]

X_train, X_test, y_train, y_test = train_test_split(features, output, stratify=output,
                                                        train_size=0.75, test_size=0.25)


from sklearn.tree import DecisionTreeClassifier
clf_dt = DecisionTreeClassifier(max_features=None)
clf = FeatureStackerFECV(ngen=20, model_type='classification', base_estimator=clf_dt, folds_CV=10)
clf.fit(X_train, y_train)


gen	nevals	avg     	min     	max     
0  	40    	0.524286	0.463393	0.610714
1  	28    	0.54904 	0.49375 	0.610714
2  	28    	0.557679	0.491964	0.604464
3  	22    	0.579821	0.501786	0.666071
4  	32    	0.590603	0.507143	0.670536
5  	24    	0.603661	0.525893	0.670536
6  	29    	0.626585	0.550893	0.692857
7  	31    	0.642143	0.575893	0.686607
8  	27    	0.648862	0.521429	0.686607
9  	29    	0.651473	0.601786	0.692857
10 	25    	0.661116	0.592857	0.701786
11 	36    	0.654598	0.573214	0.723214
12 	32    	0.66558 	0.607143	0.723214
13 	23    	0.679531	0.619643	0.719643
14 	28    	0.680937	0.624107	0.719643
15 	27    	0.688058	0.629464	0.721429
16 	28    	0.684196	0.625893	0.719643
17 	27    	0.678058	0.627679	0.727679
18 	21    	0.688951	0.633929	0.729464
19 	30    	0.680513	0.621429	0.729464
20 	29    	0.687902	0.653571	0.727679
Out[8]:
FeatureStackerFECV(N_individual=5, N_population=40,
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
          crossover_func=<function cxTwoPoint at 0x000001A11E508B70>,
          cxpb=0.5, featMax=99, featMin=1, folds_CV=10,
          indiv_replace_flag=False, indpb=0.05, maxOrMin=1,
          model_type='classification', mutpb=0.4, ngen=20, test_size=0.3,
          verbose_flag=True)

In [9]:
from sklearn.metrics import accuracy_score
pred = clf.predict(X_test)
accuracy_score(pred, y_test)


Out[9]:
0.35999999999999999

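An accuracy of 0.36 is hard to interpret in isolation; a quick sanity check is to compare it against a majority-class baseline (this snippet is for context only, not part of the original run):

In [ ]:
# Majority-class baseline accuracy on the test set.
y_test.value_counts(normalize=True).max()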