In [1]:
import pandas as pd

Aim

The aim of this notebook is to give a brief overview of how to use the evolutionary-sampling-powered ensemble models built as part of the EvoML research project.

The notebook may be made more verbose if time permits; the priority is to showcase the flexible API of the new estimators, which encourages research and tinkering.

Contents

  • Subsampling
  • Subspacing

1. Subsampling - sampling in the example space: subsets of rows are mutated and evolved.
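
Conceptually, each individual in the population is a collection of row subsamples (segments) of the training data, and mutation/crossover perturb which rows each segment contains. A minimal sketch of the idea in plain pandas/numpy (an illustration only, not the evoml internals):

In [ ]:
import numpy as np

# Illustration only: an "individual" is a list of random row subsamples,
# and a "mutation" swaps a few rows in and out of a segment.
rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.rand(100, 3), columns=['a', 'b', 'c'])

individual = [toy.sample(frac=0.2, random_state=i) for i in range(10)]

def mutate_segment(segment, pool, n_swap=3):
    """Drop n_swap random rows and pull n_swap fresh rows from the pool."""
    kept = segment.drop(segment.sample(n=n_swap).index)
    return pd.concat([kept, pool.sample(n=n_swap)])

individual = [mutate_segment(seg, toy) for seg in individual]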


In [2]:
from evoml.subsampling import BasicSegmenter_FEMPO, BasicSegmenter_FEGT, BasicSegmenter_FEMPT

In [3]:
df = pd.read_csv('datasets/ozone.csv')

In [4]:
df.head(2)


Out[4]:
       temp     invHt     press       vis  milPress       hum   invTemp      wind  output
0  0.220588  0.528124  0.250000  0.714286  0.619048  0.121622  0.313725  0.190476       3
1  0.294118  0.097975  0.255682  0.285714  0.603175  0.243243  0.428571  0.142857       5

In [5]:
X, y = df.iloc[:,:-1], df['output']

In [6]:
print(BasicSegmenter_FEGT.__doc__)


    Uses a basic evolutionary algorithm to find the best subsets of X and trains
    a regression model on each subset to form an ensemble. For a given input row,
    the prediction is based on the model trained on the segment closest to the
    input.

    Same as the BasicSegmenter, but uses a list of trained models instead of
    DataFrames as each individual. This is done to boost performance.

    In Fitness Each Model Global Test (FEGT), the fitness of each ensemble is
    defined by its performance as a unit against a validation set carved out
    initially. The performance of the constituent models is not taken into
    consideration (unlike in FEMPT or FEMPO, where it is).

    Inherits scikit-learn's BaseEstimator and RegressorMixin classes to provide
    sklearn-compatible APIs.

    Parameters
    ----------
    n : Integer, optional, default: 10
        The number of segments you want in your dataset.

    test_size : float, optional, default: 0.2
        Test size that the algorithm internally uses in its
        fitness function.

    n_population : Integer, optional, default: 30
        The number of ensembles present in the population.

    init_sample_percentage : float, optional, default: 0.2
        The fraction of rows each segment is initialised with.

    base_estimator : estimator, default: LinearRegression
        The base estimator for all segments.

    n_votes : Integer, default: 1
        The number of models in the ensemble which get to vote in the final
        prediction, based on nearest neighbour. If equal to `n`, the final
        prediction is the average of all models in the ensemble.

    Attributes
    -----------
    best_estimator_ : estimator 
    
    segments_ : list of DataFrames

    

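The parameters above map directly onto the constructor. For example, a hypothetical configuration (not run here) with more segments and the three nearest segments voting on each prediction could look like this:

In [ ]:
# Sketch based on the documented parameters: 15 segments, a population of
# 50 ensembles, and the 3 segments nearest to an input vote on its prediction.
from sklearn.linear_model import Ridge
clf_voting = BasicSegmenter_FEGT(n=15, n_population=50, n_votes=3,
                                 base_estimator=Ridge(alpha=1.0))
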
In [7]:
from sklearn.tree import DecisionTreeRegressor
clf_dt = DecisionTreeRegressor(max_depth=3)
clf = BasicSegmenter_FEGT(base_estimator=clf_dt, statistics=True)

In [ ]:
clf.fit(X, y)


gen	nevals	avg    	std     	min    	max    
0  	30    	6.32876	0.492233	5.04799	7.43042
1  	21    	6.07521	0.573498	4.93253	7.31233
2  	25    	5.78503	0.494581	4.93253	6.99333
3  	20    	5.437  	0.391602	4.90363	6.30164
4  	24    	5.07619	0.176965	4.8006 	5.5222 
5  	22    	5.04254	0.260894	4.85295	6.21677
6  	19    	5.04885	0.266438	4.83477	5.89185
7  	20    	5.07402	0.238189	4.7992 	5.62234
8  	21    	5.01239	0.264842	4.72583	5.72276
9  	16    	4.93039	0.204229	4.71831	5.74288
10 	26    	4.90982	0.192576	4.68315	5.58544
11 	21    	4.88004	0.249679	4.5957 	5.76242
12 	25    	4.88742	0.226945	4.58175	5.65752
13 	20    	4.88799	0.270974	4.62788	5.77757
14 	21    	4.86156	0.427771	4.61906	6.96394
15 	23    	4.83864	0.300189	4.61991	5.61473
16 	23    	4.96292	0.517193	4.54947	6.351  
17 	23    	4.77844	0.359522	4.45723	6.08087
18 	23    	4.75444	0.273759	4.45723	5.58176

In [9]:
clf.score(X, y)


Out[9]:
0.69093734986554811
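
Note that this score is computed on the same rows the model was fitted on. Since the estimator is sklearn-compatible, a fairer held-out estimate can be obtained with the standard tooling (a sketch, not run here; assumes sklearn >= 0.18 for model_selection):

In [ ]:
# Sketch: score on a held-out split instead of the training data.
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf_holdout = BasicSegmenter_FEGT(base_estimator=clf_dt)
clf_holdout.fit(X_tr, y_tr)
clf_holdout.score(X_te, y_te)  # R^2 on unseen rows, via RegressorMixin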

In [10]:
EGs = clf.segments_

In [11]:
len(EGs)


Out[11]:
10

In [12]:
sampled_datasets = [eg.get_data() for eg in EGs]

In [13]:
[sd.shape for sd in sampled_datasets]


Out[13]:
[(27, 9),
 (66, 9),
 (40, 9),
 (118, 9),
 (66, 9),
 (53, 9),
 (53, 9),
 (53, 9),
 (66, 9),
 (66, 9)]
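
Several segments sit at 66 rows, consistent with the documented init_sample_percentage default of 0.2, while others have shrunk or grown during evolution. A quick check of how much of the original data the ensemble covers, assuming get_data() preserves the original row index:

In [ ]:
# Sketch: count the distinct original rows appearing in at least one segment.
covered = set()
for sd in sampled_datasets:
    covered.update(sd.index)
print('%d of %d rows appear in at least one segment' % (len(covered), len(X)))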

In [ ]:

2. Subspacing - sampling in the domain of features: subsets of columns are evolved and mutated.
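
As with subsampling, individuals are evolved, but here each chromosome is a subset of columns rather than rows. A minimal sketch of the idea (an illustration only, not the evoml internals):

In [ ]:
import random
random.seed(0)

# Illustration only: an "individual" is a list of feature subsets
# (chromosomes); a "mutation" toggles one feature in or out.
features = list(X.columns)
individual = [random.sample(features, random.randint(1, len(features)))
              for _ in range(5)]  # 5 chromosomes, cf. N_individual below

def mutate_chromosome(chromosome):
    """Toggle a random feature in or out (a real run would respect featMin)."""
    f = random.choice(features)
    return [c for c in chromosome if c != f] if f in chromosome else chromosome + [f]

individual = [mutate_chromosome(ch) for ch in individual]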


In [14]:
from evoml.subspacing import FeatureStackerFEGT, FeatureStackerFEMPO

In [15]:
print(FeatureStackerFEGT.__doc__)


    Uses a basic evolutionary algorithm to find the best subspaces of X and trains
    a model on each subspace. For a given input row, the prediction is based on the
    ensemble which has performed best on the test set, and is the average of all
    the chromosome predictions.

    Same as the BasicSegmenter, but uses a list of trained models instead of
    DataFrames as each individual. This is done to boost performance.

    Parameters
    ----------
    test_size : float, default: 0.2
        Test size that the algorithm internally uses in its fitness
        function.

    N_population : Integer, default: 30
        The number of individuals in the population used by the
        evolutionary algorithm.

    N_individual : Integer, default: 5
        The number of chromosomes in each individual of the population.

    featMin : Integer, default: 1
        The minimum number of features for a subspace of the dataset.
        Cannot be <= 0; smaller values are changed to 1.

    featMax : Integer, default: the number of features in the dataset
        The maximum number of features for a subspace of the dataset.
        Cannot be < featMin; smaller values are changed to featMin.

    indpb : float, default: 0.05
        The independent probability of mutating each attribute within
        a chromosome.

    ngen : Integer, default: 10
        The number of generations the evolutionary algorithm runs for.

    mutpb : float, default: 0.40
        The probability with which an individual undergoes mutation.

    cxpb : float, default: 0.50
        The probability with which two individuals undergo crossover.

    base_estimator : model, default: LinearRegression
        The type of model trained on each chromosome's subspace.

    crossover_func : crossover function, default: tools.cxTwoPoint
        The crossover function used between individuals (see the
        documentation of DEAP's eaSimple).

    test_frac, test_frac_flag : Parameters for experimenting with the
        test set. Not in use as of now.

    Attributes
    -----------
    segment: HallOfFame individual 
        Gives you the best individual from the whole population. 
        The best individual can be accessed via segment[0]

    
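The constructor mirrors these parameters. For instance, a hypothetical configuration (not run here) that constrains every chromosome to 2-5 features and swaps in a different DEAP crossover operator:

In [ ]:
# Sketch based on the documented parameters; cxOnePoint has the same
# two-argument signature as the default tools.cxTwoPoint.
from deap import tools
clf_constrained = FeatureStackerFEGT(featMin=2, featMax=5, ngen=20,
                                     crossover_func=tools.cxOnePoint)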

In [16]:
clf = FeatureStackerFEGT(ngen=30)

In [17]:
clf.fit(X, y)


gen	nevals	avg    	min    	max    
0  	30    	4.80779	4.30355	5.31144
1  	14    	4.55898	4.30355	4.96747
2  	24    	4.47572	4.30232	5.01653
3  	30    	4.39705	4.24509	4.5792 
4  	13    	4.3305 	4.22728	4.70083
5  	22    	4.27701	4.22728	4.38708
6  	22    	4.25929	4.22728	4.38545
7  	21    	4.23435	4.21544	4.24509
8  	17    	4.23617	4.21544	4.38545
9  	18    	4.22293	4.21544	4.22728
10 	21    	4.21741	4.21544	4.22728
11 	27    	4.21559	4.21544	4.22013
12 	20    	4.21544	4.21544	4.21544
13 	20    	4.21544	4.21544	4.21544
14 	28    	4.21544	4.21544	4.21544
15 	17    	4.21536	4.21307	4.21544
16 	22    	4.21522	4.21307	4.21833
17 	26    	4.21459	4.21307	4.21831
18 	21    	4.21346	4.21307	4.21544
19 	19    	4.21307	4.21307	4.21307
20 	20    	4.21307	4.21307	4.21307
21 	24    	4.21307	4.21307	4.21307
22 	23    	4.21307	4.21307	4.21307
23 	18    	4.21328	4.21307	4.21833
24 	21    	4.21307	4.21307	4.21307
25 	23    	4.21307	4.21307	4.21307
26 	23    	4.21307	4.21307	4.21307
27 	20    	4.2131 	4.21307	4.21409
28 	17    	4.21314	4.21307	4.21532
29 	25    	4.21307	4.21307	4.21307
30 	21    	4.21307	4.21307	4.21307
Out[17]:
FeatureStackerFEGT(N_individual=5, N_population=30,
          base_estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
          crossover_func=<function cxTwoPoint at 0x106C5B70>, cxpb=0.5,
          featMax=7, featMin=1, indpb=0.05, mutpb=0.4, ngen=30,
          test_frac=0.3, test_frac_flag=False, test_size=0.2)

In [18]:
clf.score(X, y)


Out[18]:
0.65262771433009603

In [19]:
## Get the Hall of Fame individual
hof = clf.segment[0]

In [20]:
sampled_datasets = [eg.get_data() for eg in hof]

In [21]:
[data.columns.tolist() for data in sampled_datasets]


Out[21]:
[['hum', 'milPress', 'temp', 'invTemp', 'vis', 'invHt', 'press', 'output'],
 ['invHt', 'milPress', 'hum', 'temp', 'invTemp', 'vis', 'output'],
 ['invHt', 'output'],
 ['invHt', 'hum', 'vis', 'output'],
 ['hum', 'press', 'vis', 'milPress', 'invTemp', 'output']]

In [22]:
## Original X columns
X.columns


Out[22]:
Index([u'temp', u'invHt', u'press', u'vis', u'milPress', u'hum', u'invTemp',
       u'wind'],
      dtype='object')
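
A quick way to see which of the original features the best individual actually uses, and how often (dropping the 'output' column that each chromosome's data carries):

In [ ]:
from collections import Counter

used = Counter(col for data in sampled_datasets
               for col in data.columns if col != 'output')
print(used)
print(set(X.columns) - set(used))  # features in no chromosome; 'wind' here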

In [ ]:


In [ ]: