0. Setup


In [1]:
# -*- coding: utf-8 -*-

In [2]:
import os
import sys
import numpy as np
import pandas as pd
import sklearn as sk
import sklearn.metrics  # binds sk.metrics, used in the evaluation cells below
import pickle as pkl

1. Data

About the dataset

The task is to predict the age of abalone from physical measurements. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings under a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age instead. Further information, such as weather patterns and location (and hence food availability), may be required to fully solve the problem.

  • Sex / nominal / -- / M, F, and I (infant)
  • Length / continuous / mm / Longest shell measurement
  • Diameter / continuous / mm / perpendicular to length
  • Height / continuous / mm / with meat in shell
  • Whole weight / continuous / grams / whole abalone
  • Shucked weight / continuous / grams / weight of meat
  • Viscera weight / continuous / grams / gut weight (after bleeding)
  • Shell weight / continuous / grams / after being dried
  • Rings / integer / -- / +1.5 gives the age in years

In [3]:
# variable names
names = [
    'sex',
    'length',
    'diameter',
    'height',
    'whole_weight',
    'shucked_weight',
    'viscera_weight',
    'shell_weight',
    'rings'
]

# reading dataset
df = pd.read_csv('data/abalone.data', header=None, names=names)

# building the binary prediction target: 1 if rings >= 10 (i.e. age >= 11.5 years), else 0
df['target'] = (df['rings'] >= 10).astype(int)
df = df.drop('rings', axis=1)

In [4]:
df.head()


Out[4]:
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight target
0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 1
1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 0
2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 0
3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 1
4 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 0

2. Feature Preparation


In [5]:
# separating target from features
y = np.array(df['target'])
X = df.drop('target', axis=1)

In [6]:
# shuffling and splitting data into training and test sets
from sklearn.model_selection import train_test_split

SPLIT = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=SPLIT, random_state=42)
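
Aside: the split above is purely random. For a more skewed target, `train_test_split` also accepts a `stratify` argument that preserves the class ratio in both splits; a minimal variant of the call above:

# stratified variant: keeps the same 0/1 ratio in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=SPLIT, random_state=42, stratify=y
)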

Pre-processing Categorical Features

Example: one-hot encoding


In [7]:
# one-hot encoding categorical features
CATEGORICALS = ['sex']
DUMMIES = {
    'sex':['M','F','I']
}

def dummy_encode(in_df, dummies):
    out_df = in_df.copy()
    
    for feature, values in dummies.items():
        for value in values:
            dummy_name = '{}__{}'.format(feature, value)
            out_df[dummy_name] = (out_df[feature] == value).astype(int)
            
        del out_df[feature]
        # print('Dummy-encoded feature\t\t{}'.format(feature))
    return out_df
        
X_train = dummy_encode(in_df=X_train, dummies=DUMMIES)

X_test = dummy_encode(in_df=X_test, dummies=DUMMIES)

X_train.head()


Out[7]:
length diameter height whole_weight shucked_weight viscera_weight shell_weight sex__M sex__F sex__I
1593 0.525 0.380 0.135 0.6150 0.2610 0.1590 0.1750 0 0 1
111 0.465 0.360 0.105 0.4310 0.1720 0.1070 0.1750 1 0 0
3271 0.520 0.425 0.155 0.7735 0.2970 0.1230 0.2550 1 0 0
1089 0.450 0.330 0.105 0.3715 0.1865 0.0785 0.0975 0 0 1
2918 0.600 0.445 0.135 0.9205 0.4450 0.2035 0.2530 0 0 1

Pre-processing Numeric Features

Example: min-max scaling


In [8]:
# rescaling numerical features
NUMERICS = ['length','diameter','height','whole_weight','shucked_weight','viscera_weight','shell_weight']
BOUNDARIES = {
    'length': (0.075000, 0.815000),
    'diameter': (0.055000, 0.650000),
    'height': (0.000000, 1.130000),
    'whole_weight': (0.002000, 2.825500),
    'shucked_weight': (0.001000, 1.488000),
    'viscera_weight': (0.000500, 0.760000),
    'shell_weight': (0.001500, 1.005000)
}

def minmax_scale(in_df, boundaries):
    out_df = in_df.copy()
    
    for feature, (min_val, max_val) in boundaries.items():
        col_name = '{}__norm'.format(feature)

        out_df[col_name] = ((out_df[feature] - min_val) / (max_val - min_val)).round(3)
        # clip values falling outside the training boundaries back into [0, 1]
        out_df.loc[out_df[col_name] < 0, col_name] = 0
        out_df.loc[out_df[col_name] > 1, col_name] = 1
            
        del out_df[feature]
        # print('MinMax Scaled feature\t\t{}'.format(feature))
    return out_df
        
X_train = minmax_scale(in_df=X_train, boundaries=BOUNDARIES)

X_test = minmax_scale(in_df=X_test, boundaries=BOUNDARIES)

X_train.head()

Notes

  • Scikit-Learn ships implementations of the most common variable transformers; however, they tend to break or behave unexpectedly when they encounter values at prediction time that differ from those they were fitted on (e.g. an unseen category).
  • Transformations also need to be executed in the same order in production, because Scikit-Learn (silently) assumes that the columns at prediction time are in the same order as they were during training. A sketch of the built-in alternative follows below.
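
For comparison, here is a minimal sketch of the same preprocessing done with the built-in transformers, assuming a reasonably recent Scikit-Learn (`OneHotEncoder(handle_unknown='ignore')` covers the unseen-category case; `MinMaxScaler(clip=True)` needs scikit-learn >= 0.24) and the raw, untransformed `X_train`/`X_test` from the split in section 2:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

preprocessor = ColumnTransformer([
    # unseen categories at prediction time become all-zero dummy columns
    ('onehot', OneHotEncoder(handle_unknown='ignore'), CATEGORICALS),
    # clip=True keeps out-of-range prediction values inside [0, 1]
    ('minmax', MinMaxScaler(clip=True), NUMERICS),
])

# fit on training data only; reuse the fitted object for test/production data
Z_train = preprocessor.fit_transform(X_train)
Z_test = preprocessor.transform(X_test)

Because the ColumnTransformer selects columns by name, it is also robust to column reordering, which addresses the second note above.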

3. Training


In [10]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100, # number of trees
    n_jobs=-1, # parallelization
    random_state=1337, # random seed
    max_depth=10, # maximum tree depth
    min_samples_leaf=10
)

In [11]:
%time model = clf.fit(X_train, y_train)


CPU times: user 428 ms, sys: 16 ms, total: 444 ms
Wall time: 306 ms
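
Aside: random forests also expose per-feature importances, a cheap sanity check that the model relies on sensible inputs. A sketch (not run above):

# per-feature importances of the fitted forest, largest first
importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))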

4. Evaluation


In [12]:
# computing ROC AUC over training set
train_auc = sk.metrics.roc_auc_score(y_train, model.predict(X_train))
print('Training ROC AUC:\t', round(train_auc, 3))


Training ROC AUC:	 0.851

In [13]:
# computing ROC AUC over test set
test_auc = sk.metrics.roc_auc_score(y_test, model.predict(X_test))
print('Test ROC AUC:\t\t', round(test_auc, 3))


Test ROC AUC:		 0.783
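
Note that `model.predict` returns hard 0/1 labels, so the AUC above is computed from a single operating point on the ROC curve. Scoring the predicted probabilities usually gives a more faithful ROC AUC; a minimal sketch:

# ROC AUC over probabilities of the positive class
proba_auc = sk.metrics.roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print('Test ROC AUC (probabilities):\t', round(proba_auc, 3))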

5. Storing Model


In [14]:
with open('pickles/model_v1.pkl', 'wb') as f:
    pkl.dump(model, f)
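
Aside: for Scikit-Learn estimators, the project documentation recommends `joblib` over plain pickle, as it handles the large NumPy arrays inside the forest more efficiently. An equivalent sketch (the `.joblib` filename is illustrative):

import joblib

# same round-trip via joblib
joblib.dump(model, 'pickles/model_v1.joblib')
model_jl = joblib.load('pickles/model_v1.joblib')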

6. Loading Model


In [15]:
with open('pickles/model_v1.pkl', 'rb') as f:
    m = pkl.load(f)
m


Out[15]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=10,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=-1, oob_score=False,
            random_state=1337, verbose=0, warm_start=False)
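
As a final sanity check, the unpickled model should reproduce the original model's predictions exactly:

# the round-tripped model must behave identically to the original
assert (m.predict(X_test) == model.predict(X_test)).all()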

All good. Let's build this bad boy into an API!