In [1]:
import pandas as pd

Aim

The aim of this notebook is to give a brief overview of how to use the evolutionary-sampling-powered ensemble models developed as part of the EvoML research project.

The notebook may become more verbose if time permits. The priority is to showcase the flexible API of the new estimators, which encourages research and tinkering.

Contents

  • Subsampling
  • Subspacing

1. Subsampling - sampling in the example space: subsets of rows are mutated and evolved.


In [4]:
from evoml.subsampling import BasicSegmenter_FEMPO, BasicSegmenter_FEGT, BasicSegmenter_FEMPT

In [5]:
df = pd.read_csv('datasets/ozone.csv')

In [6]:
df.head(2)


Out[6]:
temp invHt press vis milPress hum invTemp wind output
0 0.220588 0.528124 0.250000 0.714286 0.619048 0.121622 0.313725 0.190476 3
1 0.294118 0.097975 0.255682 0.285714 0.603175 0.243243 0.428571 0.142857 5

In [7]:
X, y = df.iloc[:,:-1], df['output']
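
For intuition, each individual evolved by the subsampler can be thought of as a collection of row-subsets ("segments") of the training data. The following is a purely illustrative sketch of that idea, not the library's internal representation:

In [ ]:
## Hypothetical sketch: one segment is simply a row-subset of the data,
## and an individual is a list of such segments that evolution then refines.
segments = [df.sample(frac=0.2, random_state=i) for i in range(10)]
[s.shape for s in segments]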

In [8]:
print(BasicSegmenter_FEGT.__doc__)


    Uses a basic evolutionary algorithm to find the best subsets of X and trains
    a regression model on each subset to form an ensemble. For a given input row,
    the prediction comes from the model trained on the segment closest to the input.

    Same as the BasicSegmenter, but uses a list of trained models instead of
    DataFrames as each individual. This is done to boost performance.

    In Fitness Each Model Global Test (FEGT), the fitness of each ensemble is defined
    by its performance as a unit against a validation set carved out initially.
    The performance of constituent models is not taken into consideration (as it is
    in FEMPT or FEMPO).

    Inherits scikit-learn's BaseEstimator and RegressorMixin classes to provide
    sklearn-compatible APIs.

    Parameters
    ----------
    n : Integer, optional, default: 10
        The number of segments you want in your dataset.

    test_size : float, optional, default: 0.2
        Test size that the algorithm internally uses in its
        fitness function.

    n_population : Integer, optional, default: 30
        The number of ensembles present in the population.

    init_sample_percentage : float, optional, default: 0.2

    base_estimator : estimator, default: LinearRegression
        The base estimator for all segments.

    n_votes : Integer, default: 1
        The number of models in the ensemble that get to vote in the final
        prediction based on Nearest Neighbour. If equal to `n`, the final
        prediction is the average of all models in the ensemble.

    Attributes
    -----------
    best_estimator_ : estimator 
    
    segments_ : list of DataFrames

    

In [9]:
from sklearn.tree import DecisionTreeRegressor
clf_dt = DecisionTreeRegressor(max_depth=3)
clf = BasicSegmenter_FEGT(base_estimator=clf_dt, statistics=True)
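
Before fitting, the docstring's prediction rule (with n_votes=1) can be sketched roughly as below. This is a hypothetical illustration of the idea, not the library's actual implementation; it assumes each segment DataFrame includes the `output` column, as the `get_data()` results later suggest.

In [ ]:
import numpy as np

## Hypothetical sketch of nearest-segment prediction (n_votes=1); the
## library's internals may differ.
def predict_nearest_segment(x_row, models, segments):
    # centroid of each segment's feature columns (everything except 'output')
    centroids = [seg.drop('output', axis=1).mean(axis=0).values
                 for seg in segments]
    # pick the model whose segment centroid is nearest to the query row
    nearest = int(np.argmin([np.linalg.norm(x_row - c) for c in centroids]))
    return models[nearest].predict(x_row.reshape(1, -1))[0]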

In [10]:
clf.fit(X, y)


gen	nevals	avg    	std    	min    	max    
0  	30    	5.82547	0.62937	4.69311	7.37076
1  	25    	5.53367	0.409543	4.69311	6.92655
2  	20    	5.35782	0.34994 	4.69311	6.39671
3  	27    	5.26732	0.408422	4.47775	6.07493
4  	19    	5.11003	0.398512	4.47775	6.10377
5  	27    	5.14461	0.376102	4.42656	5.8455 
6  	16    	4.8434 	0.277614	4.42656	5.32259
7  	21    	4.83556	0.44338 	4.42656	6.66769
8  	22    	4.66358	0.35164 	4.26511	5.91133
9  	24    	4.70781	0.482996	4.32197	6.36382
10 	18    	4.59877	0.372972	4.29769	5.7697 
11 	21    	4.56512	0.417304	4.04641	5.74978
12 	24    	4.73707	0.552904	4.04641	6.33106
13 	28    	4.77419	0.474437	3.91406	5.75991
14 	28    	4.48796	0.412752	3.91406	5.3857 
15 	26    	4.32804	0.372563	3.8795 	5.32593
16 	22    	4.23388	0.383352	3.82731	5.25778
17 	24    	4.19033	0.442918	3.84661	5.40902
18 	25    	4.24188	0.66207 	3.84661	6.91599
19 	24    	4.0841 	0.362207	3.84661	5.31733
20 	22    	4.18025	0.401554	3.78823	5.19352
21 	21    	4.087  	0.35975 	3.78823	5.01075
22 	20    	4.04811	0.347232	3.78823	4.77439
23 	24    	4.12918	0.499687	3.77681	5.24749
24 	22    	4.0569 	0.445293	3.71059	5.49853
25 	21    	3.96571	0.321159	3.71059	4.95867
26 	18    	3.99797	0.518076	3.70182	6.06314
27 	27    	4.06872	0.632638	3.69894	6.35316
28 	22    	4.07523	0.521329	3.69894	5.49532
29 	23    	3.99405	0.410404	3.69894	5.44592
30 	25    	4.02384	0.423389	3.69067	5.06955
31 	24    	4.02481	0.413468	3.69067	5.0849 
32 	20    	4.146  	0.567882	3.69067	5.95125
33 	25    	4.09681	0.496527	3.68413	5.28622
34 	26    	4.07464	0.647953	3.67582	6.47308
35 	21    	3.9241 	0.533046	3.67582	6.32998
36 	22    	3.98568	0.451596	3.67042	5.13129
37 	18    	3.90932	0.342031	3.67042	4.9394 
38 	22    	4.03998	0.649033	3.67042	6.23846
39 	21    	4.0416 	0.580352	3.67042	6.11388
40 	22    	4.07196	0.498416	3.67042	5.64313
41 	22    	3.93747	0.342122	3.67042	5.0546 
42 	26    	4.18775	0.513987	3.67042	5.75726
43 	22    	4.13015	0.365232	3.67042	4.92223
44 	25    	4.29768	0.687381	3.67042	6.47451
45 	23    	4.16219	0.471584	3.66544	5.43127
46 	20    	4.16208	0.685945	3.66544	6.77116
47 	20    	4.05228	0.461079	3.66544	5.10527
48 	23    	3.99632	0.591556	3.66544	5.85828
49 	22    	3.92554	0.455539	3.66315	5.23587
50 	24    	4.13253	0.710503	3.66265	6.13344
Out[10]:
BasicSegmenter_FEGT(base_estimator=DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best'),
          crossover_func=<function cxTwoPoint at 0x0000026EFCC11F28>,
          cxpb=0.5, indpb=0.2, init_sample_percentage=0.2, mutpb=0.5, n=10,
          n_population=30, n_votes=1, ngen=50, statistics=True,
          test_size=0.2, tournsize=3)

In [11]:
clf.score(X, y)


Out[11]:
0.64161941917513921
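
Note that `clf.score(X, y)` evaluates on the same rows used for fitting (the value is R², via RegressorMixin), so it is optimistic. A quick held-out check could look like this (a sketch, assuming scikit-learn >= 0.18 for `model_selection`):

In [ ]:
from sklearn.model_selection import train_test_split

## Fit on 70% of the rows, score on the held-out 30%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
clf_ho = BasicSegmenter_FEGT(base_estimator=clf_dt)
clf_ho.fit(X_train, y_train)
clf_ho.score(X_test, y_test)  # R^2 on rows the ensemble never saw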

In [10]:
EGs = clf.segments_

In [11]:
len(EGs)


Out[11]:
10

In [12]:
sampled_datasets = [eg.get_data() for eg in EGs]

In [13]:
[sd.shape for sd in sampled_datasets]


Out[13]:
[(27, 9),
 (66, 9),
 (40, 9),
 (118, 9),
 (66, 9),
 (53, 9),
 (53, 9),
 (53, 9),
 (66, 9),
 (66, 9)]
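
The evolved segments vary in size and may share rows. Assuming `get_data()` preserves the original DataFrame index (an assumption about the library, not verified here), pairwise overlap can be inspected directly:

In [ ]:
import itertools

## Pairwise row overlap between evolved segments (assumes get_data()
## preserves the original row indices).
[len(a.index.intersection(b.index))
 for a, b in itertools.combinations(sampled_datasets, 2)]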

2. Subspacing - sampling in the feature space: subsets of columns are mutated and evolved.
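
For intuition, a chromosome here can be thought of as a subset of columns, e.g. a boolean mask that mutation flips bit by bit. A hypothetical sketch, not the library's actual encoding:

In [ ]:
import numpy as np

## Hypothetical sketch: one chromosome as a boolean mask over the features.
rng = np.random.RandomState(0)
mask = rng.rand(X.shape[1]) < 0.5
X.columns[mask].tolist()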


In [14]:
from evoml.subspacing import FeatureStackerFEGT, FeatureStackerFEMPO

In [15]:
print(FeatureStackerFEGT.__doc__)


    Uses a basic evolutionary algorithm to find the best subspaces of X and trains
    a model on each subspace. For a given input row, the prediction comes from the
    individual that performed best on the test set, computed as the average of all
    its chromosome predictions.

    Same as the BasicSegmenter, but uses a list of trained models instead of
    DataFrames as each individual. This is done to boost performance.

    Parameters
    ----------
    test_size: float, default = 0.2
        Test size that the algorithm internally uses in its fitness
        function
    
    N_population: Integer, default : 30
        The number of individuals in the population used by the evolutionary
        algorithm.

    N_individual: Integer, default : 5
        The number of chromosomes in each individual of the population.

    featMin: Integer, default : 1
        The minimum number of features for a subspace of the dataset.
        Cannot be <= 0; such values are changed to 1.

    featMax: Integer, default : max number of features in the dataset
        The maximum number of features for a subspace of the dataset.
        Cannot be less than featMin; such values are changed to featMin.

    indpb: float, default : 0.05
        The probability with which each gene of a chromosome is mutated.

    ngen: Integer, default : 10
        The number of generations the evolutionary algorithm runs for.

    mutpb: float, default : 0.40
        The probability with which an individual undergoes mutation.

    cxpb: float, default : 0.50
        The probability with which two individuals undergo crossover.

    base_estimator: model, default : LinearRegression
        The type of model trained on each chromosome's subspace.

    crossover_func: crossover function, default : tools.cxTwoPoint (see DEAP's
        eaSimple documentation)
        The crossover function used between individuals.

    test_frac, test_frac_flag: Parameters for experimenting with the test set.
        Not in use as of now.

    Attributes
    -----------
    segment: HallOfFame
        Holds the best individual from the whole population; it can be
        accessed via segment[0].

    

In [16]:
clf = FeatureStackerFEGT(ngen=30)
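
Before fitting, here is a rough sketch of the prediction rule described in the docstring: each chromosome is a model fitted on its column subset, and the ensemble's prediction is the mean of the chromosome predictions. This is hypothetical; the `(model, columns)` pair representation is an assumption for illustration, not the library's internals.

In [ ]:
import numpy as np

## Hypothetical: chromosomes = [(fitted_model, feature_columns), ...]
def ensemble_predict(row_df, chromosomes):
    # average the per-chromosome predictions for a single-row DataFrame
    return np.mean([m.predict(row_df[cols])[0] for m, cols in chromosomes])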

In [17]:
clf.fit(X, y)


gen	nevals	avg    	min    	max    
0  	30    	4.80779	4.30355	5.31144
1  	14    	4.55898	4.30355	4.96747
2  	24    	4.47572	4.30232	5.01653
3  	30    	4.39705	4.24509	4.5792 
4  	13    	4.3305 	4.22728	4.70083
5  	22    	4.27701	4.22728	4.38708
6  	22    	4.25929	4.22728	4.38545
7  	21    	4.23435	4.21544	4.24509
8  	17    	4.23617	4.21544	4.38545
9  	18    	4.22293	4.21544	4.22728
10 	21    	4.21741	4.21544	4.22728
11 	27    	4.21559	4.21544	4.22013
12 	20    	4.21544	4.21544	4.21544
13 	20    	4.21544	4.21544	4.21544
14 	28    	4.21544	4.21544	4.21544
15 	17    	4.21536	4.21307	4.21544
16 	22    	4.21522	4.21307	4.21833
17 	26    	4.21459	4.21307	4.21831
18 	21    	4.21346	4.21307	4.21544
19 	19    	4.21307	4.21307	4.21307
20 	20    	4.21307	4.21307	4.21307
21 	24    	4.21307	4.21307	4.21307
22 	23    	4.21307	4.21307	4.21307
23 	18    	4.21328	4.21307	4.21833
24 	21    	4.21307	4.21307	4.21307
25 	23    	4.21307	4.21307	4.21307
26 	23    	4.21307	4.21307	4.21307
27 	20    	4.2131 	4.21307	4.21409
28 	17    	4.21314	4.21307	4.21532
29 	25    	4.21307	4.21307	4.21307
30 	21    	4.21307	4.21307	4.21307
Out[17]:
FeatureStackerFEGT(N_individual=5, N_population=30,
          base_estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
          crossover_func=<function cxTwoPoint at 0x106C5B70>, cxpb=0.5,
          featMax=7, featMin=1, indpb=0.05, mutpb=0.4, ngen=30,
          test_frac=0.3, test_frac_flag=False, test_size=0.2)

In [18]:
clf.score(X, y)


Out[18]:
0.65262771433009603

In [19]:
## Get the Hall of Fame individual
hof = clf.segment[0]

In [20]:
sampled_datasets = [eg.get_data() for eg in hof]

In [21]:
[data.columns.tolist() for data in sampled_datasets]


Out[21]:
[['hum', 'milPress', 'temp', 'invTemp', 'vis', 'invHt', 'press', 'output'],
 ['invHt', 'milPress', 'hum', 'temp', 'invTemp', 'vis', 'output'],
 ['invHt', 'output'],
 ['invHt', 'hum', 'vis', 'output'],
 ['hum', 'press', 'vis', 'milPress', 'invTemp', 'output']]
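
Which features the evolution favoured can be read off by counting column occurrences across the chromosomes (each `get_data()` frame includes the `output` column, so it is excluded):

In [ ]:
from collections import Counter

## Frequency of each feature across the Hall of Fame chromosomes.
Counter(col for data in sampled_datasets
        for col in data.columns if col != 'output')

In this particular run, note that `wind` does not appear in any chromosome.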

In [22]:
## Original X columns
X.columns


Out[22]:
Index([u'temp', u'invHt', u'press', u'vis', u'milPress', u'hum', u'invTemp',
       u'wind'],
      dtype='object')
