Baseline for ML



In [1]:

    
import warnings
warnings.filterwarnings('ignore')



In [2]:

    
%matplotlib inline
%pylab inline









    



Populating the interactive namespace from numpy and matplotlib



In [3]:

    
import pandas as pd
print(pd.__version__)

First Step: Load the data and split it into features and labels



In [4]:

    
df = pd.read_csv('./insurance-customers-300.csv', sep=';')



In [5]:

    
y=df['group']



In [7]:

    
df.drop('group', axis='columns', inplace=True)



In [6]:

    
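# as_matrix() was removed in pandas 1.0; .values returns the same NumPy array.
# Note: this cell ran as In [6], before the drop in In [7], so X still includes
# the 'group' column. That is why the shapes further down report 4 columns.
X = df.values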



In [8]:

    
df.describe()









    Out[8]:







  
    
      
        max speed         age  thousand km per year
count  300.000000  300.000000            300.000000
mean   171.863333   44.006667             31.220000
std     18.807545   16.191784             15.411792
min    132.000000   18.000000              5.000000
25%    159.000000   33.000000             18.000000
50%    171.000000   42.000000             30.000000
75%    187.000000   52.000000             43.000000
max    211.000000   90.000000             99.000000

Second Step: Visualizing Prediction



In [9]:

    
# ignore this, it is just technical code
# should come from a lib, consider it to appear magically 
# http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html

import numpy as np  # explicit import; otherwise np is only available through %pylab
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

cmap_print = ListedColormap(['#AA8888', '#004000', '#FFFFDD'])
cmap_bold = ListedColormap(['#AA4444', '#006000', '#AAAA00'])
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#FFFFDD'])
font_size=25

def meshGrid(x_data, y_data):
    h = 1  # step size in the mesh
    x_min, x_max = x_data.min() - 1, x_data.max() + 1
    y_min, y_max = y_data.min() - 1, y_data.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return (xx,yy)
    
def plotPrediction(clf, x_data, y_data, x_label, y_label, colors, title="", mesh=True, fname=None):
    xx,yy = meshGrid(x_data, y_data)
    plt.figure(figsize=(20,10))

    if clf and mesh:
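        # Features are passed as (y-axis, x-axis): the calls below plot age on x
        # and speed on y, which matches the (speed, age) column order of the data
        Z = clf.predict(np.c_[yy.ravel(), xx.ravel()])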
        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    if fname:
        plt.scatter(x_data, y_data, c=colors, cmap=cmap_print, s=200, marker='o', edgecolors='k')
    else:
        plt.scatter(x_data, y_data, c=colors, cmap=cmap_bold, s=80, marker='o', edgecolors='k')
    plt.xlabel(x_label, fontsize=font_size)
    plt.ylabel(y_label, fontsize=font_size)
    plt.title(title, fontsize=font_size)
    if fname:
        plt.savefig(fname)



In [10]:

    
from sklearn.model_selection import train_test_split



In [11]:

    
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)



In [12]:

    
X_train.shape, y_train.shape, X_test.shape, y_test.shape









    Out[12]:





((180, 4), (180,), (120, 4), (120,))



In [13]:

    
X_train_kmh_age = X_train[:, :2]
X_test_kmh_age = X_test[:, :2]
X_train_2_dim = X_train_kmh_age
X_test_2_dim = X_test_kmh_age



In [22]:

    
# 0: red
# 1: green
# 2: yellow

class ClassifierBase:
    def predict(self, X):
        return np.array([ self.predict_single(x) for x in X])
    def score(self, X, y):
        n = len(y)
        correct = 0
        predictions = self.predict(X)
        for prediction, ground_truth in zip(predictions, y):
            if prediction == ground_truth:
                correct = correct + 1
        return correct / n

from random import randrange

class RandomClassifier(ClassifierBase):
    def predict_single(self, x):
        return randrange(3)
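
Because `randrange` draws from the global, unseeded generator, the random classifier's score changes from run to run. A minimal sketch of a reproducible variant follows; the class name `SeededRandomClassifier`, its `seed` parameter, and the toy rows are assumptions made for illustration and are not part of the original notebook.

```python
from random import Random

class SeededRandomClassifier:
    """Hypothetical variant of RandomClassifier with its own seeded generator."""
    def __init__(self, n_classes=3, seed=42):
        self.n_classes = n_classes
        self.rng = Random(seed)  # private generator, independent of global state

    def predict_single(self, x):
        # The features are ignored on purpose: this is pure guessing
        return self.rng.randrange(self.n_classes)

    def predict(self, X):
        return [self.predict_single(x) for x in X]

    def score(self, X, y):
        predictions = self.predict(X)
        correct = sum(1 for p, t in zip(predictions, y) if p == t)
        return correct / len(y)

# Made-up (speed, age) rows, only to show that equal seeds give equal guesses
rows = [[150, 30], [190, 22], [170, 80]]
a = SeededRandomClassifier(seed=0).predict(rows)
b = SeededRandomClassifier(seed=0).predict(rows)
print(a == b)  # True
```

With a fixed seed the guesses, and therefore the score, stay the same between runs, which makes the random reference point easier to compare against.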



In [17]:

    
random_clf = RandomClassifier()



In [48]:

    
plotPrediction(random_clf, X_train_2_dim[:, 1], X_train_2_dim[:, 0], 
               'Age', 'Max Speed', y_train,
                title="Train Data Max Speed vs Age (Random)")

Guessing at random among three classes should get about 1/3 of the samples right, and the score below is close to that.



In [44]:

    
random_clf.score(X_test_2_dim, y_test)









    Out[44]:





0.275
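
To judge whether 0.275 is a plausible result for pure guessing, the expected accuracy and its spread can be worked out directly. This is a rough sketch that treats each guess as an independent trial with success probability 1/3; the helper name `random_guess_stats` is hypothetical, and the sample count of 120 is taken from the test set shape shown earlier.

```python
import math

def random_guess_stats(n_classes, n_samples):
    """Expected accuracy and its standard deviation for uniform random guessing."""
    p = 1 / n_classes
    # Each guess is an independent trial that is correct with probability p,
    # so the accuracy over n_samples has mean p and
    # standard deviation sqrt(p * (1 - p) / n_samples).
    std = math.sqrt(p * (1 - p) / n_samples)
    return p, std

expected, std = random_guess_stats(n_classes=3, n_samples=120)
observed = 0.275  # score reported above for the 120 test samples
z = (observed - expected) / std
print(f"expected {expected:.3f} +/- {std:.3f}, observed {observed}, z = {z:.2f}")
```

Under these assumptions the observed score lies roughly 1.4 standard deviations below 1/3, which is well within what chance alone produces on 120 samples.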

Third Step: Creating a Baseline

We create a naive classifier by hand. How much better is it than random guessing?



In [28]:

    
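class BaseLineClassifier(ClassifierBase):
    # Hand-written rules; labels are 0: red, 1: green, 2: yellow
    def predict_single(self, x):
        try:
            speed, age, km_per_year = x
        except ValueError:
            # Only two features (speed, age) were passed in
            speed, age = x
            km_per_year = 0
        # Young drivers: red if the car is fast, otherwise yellow
        if age < 25:
            if speed > 180:
                return 0
            else:
                return 2
        # Very old drivers: red
        if age > 75:
            return 0
        # High yearly mileage: red above 50, yellow above 35 (thousand km)
        if km_per_year > 50:
            return 0
        if km_per_year > 35:
            return 2
        return 1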



In [34]:

    
base_clf = BaseLineClassifier()



In [49]:

    
plotPrediction(base_clf, X_train_2_dim[:, 1], X_train_2_dim[:, 0], 
               'Age', 'Max Speed', y_train,
                title="Train Data Max Speed vs Age with Classification")

This is the baseline we have to beat



In [45]:

    
base_clf.score(X_test_2_dim, y_test)









    Out[45]:





0.43333333333333335

No overfitting: the train score is about the same as the test score. This is to be expected, as we use general rules rather than inferring from single data points.



In [46]:

    
base_clf.score(X_train_2_dim, y_train)









    Out[46]:





0.4111111111111111

Exercise in Code

Form a group where at least one of you knows a little bit of coding

Change the rules and try to beat our score

Be careful: tune on the training data only and use the test data for a single final validation (otherwise you are fooling yourself)
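
One way to follow this advice is to search over candidate thresholds using the training data alone and to score the test data exactly once at the end. The sketch below uses a simplified two-threshold rule and made-up toy rows. The names `make_rule`, `accuracy`, and `tune_on_train`, the candidate values, and the data are all assumptions for illustration, not the notebook's dataset or rules.

```python
def make_rule(age_young, speed_fast):
    """Return a predict function for a simplified two-threshold rule."""
    def predict_single(row):
        speed, age = row
        if age < age_young:
            return 0 if speed > speed_fast else 2
        return 1
    return predict_single

def accuracy(predict_single, X, y):
    hits = sum(1 for row, label in zip(X, y) if predict_single(row) == label)
    return hits / len(y)

def tune_on_train(X_train, y_train, age_candidates, speed_candidates):
    """Pick the threshold pair with the best accuracy on the training data only."""
    best = None
    for age_young in age_candidates:
        for speed_fast in speed_candidates:
            score = accuracy(make_rule(age_young, speed_fast), X_train, y_train)
            if best is None or score > best[0]:
                best = (score, age_young, speed_fast)
    return best

# Made-up toy rows (speed, age) and labels, only to show the workflow
X_train_toy = [(190, 20), (150, 22), (170, 40), (160, 50), (200, 28), (140, 29)]
y_train_toy = [0, 2, 1, 1, 0, 2]
X_test_toy = [(185, 21), (165, 45)]
y_test_toy = [0, 1]

train_score, age_young, speed_fast = tune_on_train(
    X_train_toy, y_train_toy, age_candidates=[25, 30], speed_candidates=[170, 180])

# The test data is touched exactly once, after the thresholds are fixed
test_score = accuracy(make_rule(age_young, speed_fast), X_test_toy, y_test_toy)
print(train_score, age_young, speed_fast, test_score)
```

The perfect scores on this toy data say nothing about the real dataset; the point is only the order of operations: choose thresholds on train, then evaluate on test once.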
