0. Setup



In [1]:

    
# -*- coding: utf-8 -*-



In [2]:

    
import os
import sys
import numpy as np
import pandas as pd
import sklearn as sk
import sklearn.metrics  # imported explicitly because sk.metrics is used in the evaluation cells
import pickle as pkl

1. Data

About the dataset

Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.

Name / Data type / Unit / Description
Sex / nominal / -- / M, F, and I (infant)
Length / continuous / mm / longest shell measurement
Diameter / continuous / mm / perpendicular to length
Height / continuous / mm / with meat in shell
Whole weight / continuous / grams / whole abalone
Shucked weight / continuous / grams / weight of meat
Viscera weight / continuous / grams / gut weight (after bleeding)
Shell weight / continuous / grams / after being dried
Rings / integer / -- / +1.5 gives the age in years



In [3]:

    
# variable names
names = [
    'sex',
    'length',
    'diameter',
    'height',
    'whole_weight',
    'shucked_weight',
    'viscera_weight',
    'shell_weight',
    'rings'
]

# reading dataset
df = pd.read_csv('data/abalone.data', header=None, names=names)

# building prediction target
df['target'] = (df['rings'] >= 10).astype(int)
df = df.drop('rings', axis=1)
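The dataset description gives age as rings + 1.5, so the `rings >= 10` threshold used for the target corresponds to an estimated age of at least 11.5 years. A minimal sketch of that relationship, using a small illustrative series of ring counts rather than the full file:

```python
import pandas as pd

# Illustrative ring counts; in the notebook these come from the 'rings' column.
rings = pd.Series([15, 7, 9, 10, 7], name='rings')

# Age estimate from the dataset description: rings + 1.5 years.
age_years = rings + 1.5

# Binary target as built above: 1 when the abalone has 10 or more rings.
target = (rings >= 10).astype(int)

print(pd.DataFrame({'rings': rings, 'age_years': age_years, 'target': target}))
```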



In [4]:

    
df.head()









    Out[4]:

  sex  length  diameter  height  whole_weight  shucked_weight  viscera_weight  shell_weight  target
0   M   0.455     0.365   0.095        0.5140          0.2245          0.1010         0.150       1
1   M   0.350     0.265   0.090        0.2255          0.0995          0.0485         0.070       0
2   F   0.530     0.420   0.135        0.6770          0.2565          0.1415         0.210       0
3   M   0.440     0.365   0.125        0.5160          0.2155          0.1140         0.155       1
4   I   0.330     0.255   0.080        0.2050          0.0895          0.0395         0.055       0

2. Feature Preparation



In [5]:

    
# separating target from features
y = np.array(df['target'])
X = df.drop('target', axis=1)



In [6]:

    
# shuffling and splitting data into training and test sets
from sklearn.model_selection import train_test_split

SPLIT = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=SPLIT, random_state=42)

Pre-processing Categorical Features

Example: one-hot encoding



In [7]:

    
# one-hot encoding categorical features
CATEGORICALS = ['sex']
DUMMIES = {
    'sex':['M','F','I']
}

def dummy_encode(in_df, dummies):
    out_df = in_df.copy()
    
    for feature, values in dummies.items():
        for value in values:
            dummy_name = '{}__{}'.format(feature, value)
            out_df[dummy_name] = (out_df[feature] == value).astype(int)
            
        del out_df[feature]
        # print('Dummy-encoded feature\t\t{}'.format(feature))
    return out_df
        
X_train = dummy_encode(in_df=X_train, dummies=DUMMIES)

X_test = dummy_encode(in_df=X_test, dummies=DUMMIES)

X_train.head()









    Out[7]:

      length  diameter  height  whole_weight  shucked_weight  viscera_weight  shell_weight  sex__M  sex__F  sex__I
1593   0.525     0.380   0.135        0.6150          0.2610          0.1590        0.1750       0       0       1
111    0.465     0.360   0.105        0.4310          0.1720          0.1070        0.1750       1       0       0
3271   0.520     0.425   0.155        0.7735          0.2970          0.1230        0.2550       1       0       0
1089   0.450     0.330   0.105        0.3715          0.1865          0.0785        0.0975       0       0       1
2918   0.600     0.445   0.135        0.9205          0.4450          0.2035        0.2530       0       0       1
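Because the encoder compares each row against a fixed list of allowed values, a category that was never listed simply produces all-zero dummies instead of raising an error. The sketch below repeats `dummy_encode` so it runs on its own; the value `'X'` is a made-up category used only to illustrate the behavior.

```python
import pandas as pd

def dummy_encode(in_df, dummies):
    """One-hot encode each feature against a fixed list of allowed values."""
    out_df = in_df.copy()
    for feature, values in dummies.items():
        for value in values:
            dummy_name = '{}__{}'.format(feature, value)
            out_df[dummy_name] = (out_df[feature] == value).astype(int)
        del out_df[feature]
    return out_df

# 'X' is a hypothetical category that is not in the allowed list.
new_rows = pd.DataFrame({'sex': ['M', 'X']})
encoded = dummy_encode(new_rows, {'sex': ['M', 'F', 'I']})
print(encoded)
```

Whether an all-zero row is the right fallback depends on the application; it may be worth logging such rows so that unexpected categories do not go unnoticed.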

Pre-processing Numeric Features

Example: min-max scaling



In [8]:

    
# rescaling numerical features
NUMERICS = ['length','diameter','height','whole_weight','shucked_weight','viscera_weight','shell_weight']
BOUNDARIES = {
    'length': (0.075000, 0.815000),
    'diameter': (0.055000, 0.650000),
    'height': (0.000000, 1.130000),
    'whole_weight': (0.002000, 2.825500),
    'shucked_weight': (0.001000, 1.488000),
    'viscera_weight': (0.000500, 0.760000),
    'shell_weight': (0.001500, 1.005000)
}

def minmax_scale(in_df, boundaries):
    out_df = in_df.copy()
    
    for feature, (min_val, max_val) in boundaries.items():      
        col_name = '{}__norm'.format(feature)
        
        out_df[col_name] = round((out_df[feature] - min_val)/(max_val - min_val),3)
        out_df.loc[out_df[col_name] < 0, col_name] = 0
        out_df.loc[out_df[col_name] > 1, col_name] = 1
            
        del out_df[feature]
        # print('MinMax Scaled feature\t\t{}'.format(feature))
    return out_df
        
X_train = minmax_scale(in_df=X_train, boundaries=BOUNDARIES)

X_test = minmax_scale(in_df=X_test, boundaries=BOUNDARIES)

X_train.head()
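The boundaries above are hard-coded. One possible way to obtain such values is to take the minimum and maximum of each feature from the training split only, then clip anything outside that range when scaling new data. The sketch below assumes this approach and uses small made-up frames in place of `X_train` and `X_test`; the `boundaries` dictionary here is derived, not the notebook's `BOUNDARIES`.

```python
import pandas as pd

# Toy frames standing in for the training and test splits (values are made up).
train = pd.DataFrame({'length': [0.2, 0.4, 0.6]})
new = pd.DataFrame({'length': [0.1, 0.5, 0.8]})

# Derive (min, max) per feature from the training data only.
boundaries = {col: (train[col].min(), train[col].max()) for col in train.columns}

def minmax_scale(in_df, boundaries):
    """Scale each feature to [0, 1] using fixed boundaries, clipping outliers."""
    out_df = in_df.copy()
    for feature, (min_val, max_val) in boundaries.items():
        scaled = (out_df[feature] - min_val) / (max_val - min_val)
        out_df['{}__norm'.format(feature)] = scaled.round(3).clip(0, 1)
        del out_df[feature]
    return out_df

scaled_new = minmax_scale(new, boundaries)
print(scaled_new)
```

Values below the training minimum map to 0 and values above the training maximum map to 1, so the model never sees inputs outside the range it was trained on.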

Notes

Scikit-Learn already implements the most common variable transformers, but they tend to break down or behave unexpectedly when they encounter values different from those they were fitted on (for example, a category that never appeared in training).
Your transformations need to be executed in the same order in production, because Scikit-Learn matches features by position: older versions, such as the one used here, silently assume that the columns at prediction time are in the same order as they were during training.
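One way to guard against column-order drift is to record the column list at training time and re-apply it to every incoming frame before prediction. The following is a minimal sketch under that assumption; `TRAIN_COLUMNS` and `incoming` are hypothetical names, and the column list is a shortened version of the real one.

```python
import pandas as pd

# Column order recorded at training time (hypothetical, shortened list).
TRAIN_COLUMNS = ['sex__M', 'sex__F', 'sex__I', 'length__norm']

# A production frame whose columns arrive in a different order.
incoming = pd.DataFrame({
    'length__norm': [0.61],
    'sex__I': [0],
    'sex__M': [1],
    'sex__F': [0],
})

# Fail early if a column is missing, then restore the training order.
missing = set(TRAIN_COLUMNS) - set(incoming.columns)
if missing:
    raise ValueError('missing columns: {}'.format(sorted(missing)))
aligned = incoming[TRAIN_COLUMNS]

print(list(aligned.columns))
```

In the notebook, the recorded list could be taken from `X_train.columns` right before fitting.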

3. Training



In [10]:

    
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100, # number of trees
    n_jobs=-1, # parallelization
    random_state=1337, # random seed
    max_depth=10, # maximum tree depth
    min_samples_leaf=10
)



In [11]:

    
%time model = clf.fit(X_train, y_train)









    



CPU times: user 428 ms, sys: 16 ms, total: 444 ms
Wall time: 306 ms

4. Evaluation



In [12]:

    
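# computing ROC AUC over training set
# note: model.predict returns hard 0/1 labels, so this score reflects a single threshold;
# model.predict_proba(X_train)[:, 1] would score the full ranking instead
train_auc = sk.metrics.roc_auc_score(y_train, model.predict(X_train))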
print('Training ROC AUC:\t', round(train_auc, 3))









    



Training ROC AUC:	 0.851



In [13]:

    
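# computing ROC AUC over test set
# note: computed from hard 0/1 labels; model.predict_proba(X_test)[:, 1] would score the full ranking
test_auc = sk.metrics.roc_auc_score(y_test, model.predict(X_test))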
print('Test ROC AUC:\t\t', round(test_auc, 3))









    



Test ROC AUC:		 0.783
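The evaluation cells pass the output of `model.predict` to `roc_auc_score`, which means the score is computed from hard 0/1 labels. ROC AUC is usually computed from continuous scores, and the two can differ noticeably. The sketch below illustrates the gap with made-up labels and probabilities; it does not use the notebook's model, so the numbers say nothing about the abalone results.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and positive-class probabilities, for illustration only.
y_true = [0, 0, 0, 1, 1, 1]
proba = [0.1, 0.2, 0.6, 0.4, 0.7, 0.9]

# Hard labels at a 0.5 threshold, analogous to what predict() returns.
hard = [int(p >= 0.5) for p in proba]

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, hard)

print('AUC from probabilities:', round(auc_from_proba, 3))
print('AUC from hard labels:  ', round(auc_from_labels, 3))
```

With the fitted random forest, the equivalent scores would come from `model.predict_proba(X_test)[:, 1]`, the column holding the probability of class 1.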

5. Storing Model



In [14]:

    
# the 'pickles/' directory must already exist
with open('pickles/model_v1.pkl', 'wb') as f:
    pkl.dump(model, f)

6. Loading Model



In [15]:

    
with open('pickles/model_v1.pkl', 'rb') as f:
    m = pkl.load(f)
m









    Out[15]:





RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=10,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=-1, oob_score=False,
            random_state=1337, verbose=0, warm_start=False)
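Since the model only works when the same preprocessing is applied at prediction time, it can help to store the preprocessing configuration next to the model in a single artifact. The sketch below assumes that layout; the `artifact` dictionary, its keys, and the file name are hypothetical, and a placeholder object stands in for the fitted classifier so the example runs on its own.

```python
import os
import pickle as pkl
import tempfile

# Placeholder for the fitted model; any picklable object behaves the same way.
model = {'name': 'placeholder-model'}

# Bundle the model with the configuration needed to reproduce preprocessing.
artifact = {
    'model': model,
    'dummies': {'sex': ['M', 'F', 'I']},
    'columns': ['sex__M', 'sex__F', 'sex__I'],
}

# Hypothetical file name, written to a temporary directory for this example.
path = os.path.join(tempfile.mkdtemp(), 'model_bundle.pkl')

# Context managers close the file handles even if pickling fails.
with open(path, 'wb') as f:
    pkl.dump(artifact, f)

with open(path, 'rb') as f:
    restored = pkl.load(f)

print(sorted(restored.keys()))
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be unpickled.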
