Some information


source: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data

Data Description

In this competition, you will predict the probability that an auto insurance policy holder files a claim.

In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc). In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policy holder.
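Since the naming convention carries the whole schema, the column groups can be recovered directly from the names. A minimal sketch (the file path is illustrative):

import pandas as pd

# treat -1 as missing while loading
df = pd.read_csv("train.csv", na_values='-1', index_col='id')

# bucket columns by the postfix encoded in their names
binary_cols = [c for c in df.columns if c.endswith('_bin')]
categorical_cols = [c for c in df.columns if c.endswith('_cat')]
numeric_cols = [c for c in df.columns
                if c not in binary_cols + categorical_cols + ['target']]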


source: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction#evaluation

Submissions are evaluated using the Normalized Gini Coefficient.

During scoring, observations are sorted from the largest to the smallest predictions. Predictions are only used for ordering observations; therefore, the relative magnitude of the predictions is not used during scoring. The scoring algorithm then compares the cumulative proportion of positive class observations to a theoretical uniform proportion.

The Gini Coefficient ranges from approximately 0 for random guessing to approximately 0.5 for a perfect score. The theoretical maximum for the discrete calculation is (1 - frac_pos) / 2.

The Normalized Gini Coefficient adjusts the score by the theoretical maximum so that the maximum score is 1.

source: https://www.kaggle.com/c/ClaimPredictionChallenge/discussion/703

import numpy as np

def gini(actual, pred):
    assert len(actual) == len(pred)
    # sort by prediction, descending, breaking ties by original order
    arr = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
    arr = arr[np.lexsort((arr[:, 2], -1 * arr[:, 1]))]
    totalLosses = arr[:, 0].sum()
    # sum of the cumulative share of losses captured at each cutoff
    giniSum = arr[:, 0].cumsum().sum() / totalLosses
    # subtract the expected value under a uniform (random) ordering
    giniSum -= (len(actual) + 1) / 2.
    return giniSum / len(actual)

def gini_normalized(a, p):
    return gini(a, p) / gini(a, a)

def test_gini():
    def fequ(a, b):
        return abs(a - b) < 1e-6
    def T(a, p, g, n):
        assert fequ(gini(a, p), g)
        assert fequ(gini_normalized(a, p), n)
    T([1, 2, 3], [10, 20, 30], 0.111111, 1)
    T([1, 2, 3], [30, 20, 10], -0.111111, -1)
    T([1, 2, 3], [0, 0, 0], -0.111111, -1)
    T([3, 2, 1], [0, 0, 0], 0.111111, 1)
    T([1, 2, 4, 3], [0, 0, 0, 0], -0.1, -0.8)
    T([2, 1, 4, 3], [0, 0, 2, 1], 0.125, 1)
    T([0, 20, 40, 0, 10], [40, 40, 10, 5, 5], 0, 0)
    T([40, 0, 20, 0, 10], [1000000, 40, 40, 5, 5], 0.171428, 0.6)
    T([40, 20, 10, 0, 0], [40, 20, 10, 0, 0], 0.285714, 1)
    T([1, 1, 0, 1], [0.86, 0.26, 0.52, 0.32], -0.041666, -0.333333)
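For binary targets this implementation is equivalent to 2 * AUC - 1, and the theoretical maximum quoted above can be checked numerically. A quick sanity check against scikit-learn's roc_auc_score, assuming the functions above are in scope:

import numpy as np
from sklearn.metrics import roc_auc_score

actual = np.array([1, 0, 1, 0, 1])
pred = np.array([0.9, 0.8, 0.7, 0.1, 0.6])
# for binary targets, normalized Gini == 2 * AUC - 1
assert abs(gini_normalized(actual, pred) - (2 * roc_auc_score(actual, pred) - 1)) < 1e-6
# the raw Gini of a perfectly ordered binary vector hits (1 - frac_pos) / 2
frac_pos = actual.mean()
assert abs(gini(actual, actual) - (1 - frac_pos) / 2) < 1e-6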

source: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/40222#225669

Hi, Daniel!

These names refer to the origin of the variables, but there is no reason to be concerned about it.

  • "ind" is related to individual or driver,
  • "reg" is related to region,
  • "car" is related to car itself and
  • "calc" is an calculated feature.

The features and rows are not time dependent, so "ps_ind_01" and "ps_ind_02" are just labels to hide real names.

Unfortunately I can't share the real meanings of variables.

Thanks for joining!

Baseline model - 0.252


In [1]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import LabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer
from sklearn_pandas import DataFrameMapper
import pandas as pd
import numpy as np
import regex as re

# metric used in the competition (same implementation as above)
def gini(actual, pred):
    arr = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
    arr = arr[np.lexsort((arr[:, 2], -1 * arr[:, 1]))]
    totalLosses = arr[:, 0].sum()
    giniSum = arr[:, 0].cumsum().sum() / totalLosses

    giniSum -= (len(actual) + 1) / 2.
    return giniSum / len(actual)

def gini_normalized(a, p):
    return gini(a, p) / gini(a, a)

gini_norm_scorer = make_scorer(gini_normalized)

# training dataset
df = pd.read_csv("~/ps_kaggle_data/train.csv", na_values='-1', index_col='id')

# extract metadata from the feature names
col_pattern = re.compile(r"^ps_(?P<class>\w+)_(?P<number>\d+)(_(?P<type>\w+))?$")
ci = pd.DataFrame({column: col_pattern.search(column).groupdict() for column in df.columns[1:]}).T
ci.loc[ci.type.isnull(), 'type'] = 'num'

# preprocess the features:
# numeric features -> mean imputer
# binary features -> mode imputer
# categorical features -> mode imputer + label binarizer
mapper = DataFrameMapper([([col], Imputer(strategy='mean')) for col in ci[ci.type == 'num'].index.tolist()] +\
                         [([col], Imputer(strategy='most_frequent')) for col in ci[ci.type == 'bin'].index.tolist()] +\
                         [([col], [Imputer(strategy='most_frequent'), LabelBinarizer()]) for col in ci[ci.type == 'cat'].index.tolist()])

df_final = mapper.fit_transform(df.drop('target', axis=1))
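
On scikit-learn 0.20+, Imputer was removed and LabelBinarizer no longer works inside transformer pipelines. A rough equivalent of the mapper above, sketched with ColumnTransformer, SimpleImputer and OneHotEncoder (an assumed port, not the original code):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline

preprocess = ColumnTransformer([
    ('num', SimpleImputer(strategy='mean'), ci[ci.type == 'num'].index.tolist()),
    ('bin', SimpleImputer(strategy='most_frequent'), ci[ci.type == 'bin'].index.tolist()),
    ('cat', make_pipeline(SimpleImputer(strategy='most_frequent'),
                          OneHotEncoder(handle_unknown='ignore')),
     ci[ci.type == 'cat'].index.tolist()),
])

df_final_alt = preprocess.fit_transform(df.drop('target', axis=1))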

In [37]:
# linear regression model
ln = LinearRegression()
# splits the dataset 10 times, preserving class proportions, with a 10% test set
sss = StratifiedShuffleSplit(n_splits=10, test_size=.1, random_state=100)
param_grid = {}

cv = GridSearchCV(ln, cv=sss, param_grid=param_grid, scoring=gini_norm_scorer, verbose=2)
cv.fit(df_final, df.target)


Fitting 10 folds for each of 1 candidates, totalling 10 fits
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  1.4min finished
Out[37]:
GridSearchCV(cv=StratifiedShuffleSplit(n_splits=10, random_state=100, test_size=0.1,
            train_size=None),
       error_score='raise',
       estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
       fit_params=None, iid=True, n_jobs=1, param_grid={},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=make_scorer(gini_normalized), verbose=2)
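
With an empty param_grid there is a single candidate, so its cross-validated score, presumably the ~0.252 in the title, can be read straight from cv_results_:

# mean and spread of the normalized Gini across the 10 splits
print("gini: %.4f +/- %.4f" % (cv.cv_results_['mean_test_score'][0],
                               cv.cv_results_['std_test_score'][0]))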

In [39]:
# train the best model
model = cv.best_estimator_
model.fit(df_final, df.target)

# predict on the test data (reuse the mapper fitted on the training set)
test = pd.read_csv("~/ps_kaggle_data/test.csv", na_values='-1', index_col='id')
test_final = mapper.transform(test)
predictions = model.predict(test_final)

# scale the predictions to the [0, 1] range
from sklearn.preprocessing import MinMaxScaler

mm = MinMaxScaler()
predictions_0_1 = mm.fit_transform(predictions.reshape(-1, 1))

# create the submission file
pd.DataFrame(predictions_0_1, index=test.index, columns=['target']).to_csv("~/prediction.csv")
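
Since the normalized Gini only uses the ordering of the predictions, the min-max rescaling does not change the score; it merely keeps the submission inside [0, 1]. A quick check that the ranking survives the transform:

# strictly monotone transforms preserve ranks, so the metric is unchanged
ranks_raw = np.argsort(predictions, kind='stable')
ranks_scaled = np.argsort(predictions_0_1.ravel(), kind='stable')
assert (ranks_raw == ranks_scaled).all()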