Prediction

If the average lifetime of a customer is 90 days since their first deposit :
build a model to predict the Lifetime Value of a customer given his first 15 days of activity. What other data could be pertinent to answer this question? Given the data provided is the assumption of 90 days valid?


In [115]:
import pandas as pd
import pandas_profiling
import matplotlib
from datetime import datetime
import numpy as np
import matplotlib.pyplot as plt
from datetime import timedelta
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.cluster import KMeans

pd.options.display.max_columns = None

matplotlib.style.use('ggplot')
%matplotlib inline

In [116]:
# get dataset
df_bc = pd.read_csv('data/dataset_betclic.csv')

# convert string date to date
df_bc['BirthDate'] = pd.to_datetime(df_bc['BirthDate'], format='%Y-%m-%dT%H:%M:%S.%fZ')
df_bc['FirstDepositDate'] = pd.to_datetime(df_bc['FirstDepositDate'], format='%Y-%m-%dT%H:%M:%S.%fZ')
df_bc['BetDate'] = pd.to_datetime(df_bc['BetDate'], format='%Y%m%d')

Variables

Lifetime

The lifetime of a customer is the difference between the date of the last bet and the first deposit date
$max(BetDate) - min(FirstDepositDate)$


In [117]:
df_max_BetDate = df_bc.groupby(['UserId'])['BetDate'].max().reset_index()
df_max_BetDate.columns = ['UserId', 'max_BetDate']

df_min_FirstDepositDate = df_bc.groupby(['UserId'])['FirstDepositDate'].min().reset_index()
df_min_FirstDepositDate.columns = ['UserId', 'min_FirstDepositDate']

df_ltv = pd.merge(df_bc, 
                  df_max_BetDate, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])

df_ltv = pd.merge(df_ltv, 
                  df_min_FirstDepositDate, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])

df_ltv['lifetime'] = df_ltv['max_BetDate'] - df_ltv['min_FirstDepositDate']

df_ltv.head()


Out[117]:
UserId Country Gender BirthDate PartnerType FirstDepositDate BetDate BetId Application ProductName IsLive BetAmount AmountWon Odds Cashout max_BetDate min_FirstDepositDate lifetime
0 10783501 Portugal M 1992-06-08 PPC 2017-01-03 2017-01-03 607023702 BETCLIC.PT SportsBook False 5.00 0.0 124.76 0.0 2017-09-10 2017-01-03 250 days
1 10782912 Ivory Coast M 1996-03-18 Internal 2017-01-03 2017-01-03 607023878 BETCLIC.COM SportsBook False 5.00 23.0 4.60 0.0 2017-10-21 2017-01-03 291 days
2 10780813 France F 1998-04-05 PPC 2017-01-01 2017-01-03 607024082 BETCLIC.FR SportsBook False 5.00 0.0 8.00 0.0 2017-09-04 2017-01-01 246 days
3 10783215 Portugal M 1976-12-17 Internal 2017-01-03 2017-01-03 607024088 BETCLIC.PT SportsBook False 1.00 0.0 41.08 0.0 2017-11-19 2017-01-03 320 days
4 10783401 Poland M 1995-04-16 Affiliation 2017-01-03 2017-01-03 607024113 EXPEKT.COM SportsBook False 5.67 0.0 3.75 0.0 2017-01-09 2017-01-03 6 days

We check if the average lifetime of a customer given by Betclick, 90 jours, is the same in our data :


In [118]:
df_ltv.drop_duplicates(subset='UserId')['lifetime'].mean()


Out[118]:
Timedelta('98 days 01:44:26.035675')

We find an average at 98 days

Value

I consider the Value of a customer the amount he lost during his active lifetime on Betclic
$BetAmount - AmountWon$


In [119]:
df_bet_amount = df_bc.groupby(['UserId'])['BetAmount'].sum().reset_index()
df_bet_amount.columns = ['UserId', 'tot_bet_amount']

df_ltv = pd.merge(df_ltv, 
                  df_bet_amount, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])

df_amount_won = df_bc.groupby(['UserId'])['AmountWon'].sum().reset_index()
df_amount_won.columns = ['UserId', 'tot_amount_won']

df_ltv = pd.merge(df_ltv, 
                  df_amount_won, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])

df_ltv['value_brut'] = df_ltv['tot_bet_amount'] - df_ltv['tot_amount_won']

df_ltv.head()


Out[119]:
UserId Country Gender BirthDate PartnerType FirstDepositDate BetDate BetId Application ProductName IsLive BetAmount AmountWon Odds Cashout max_BetDate min_FirstDepositDate lifetime tot_bet_amount tot_amount_won value_brut
0 10783501 Portugal M 1992-06-08 PPC 2017-01-03 2017-01-03 607023702 BETCLIC.PT SportsBook False 5.00 0.0 124.76 0.0 2017-09-10 2017-01-03 250 days 30.00 0.00 30.00
1 10782912 Ivory Coast M 1996-03-18 Internal 2017-01-03 2017-01-03 607023878 BETCLIC.COM SportsBook False 5.00 23.0 4.60 0.0 2017-10-21 2017-01-03 291 days 1044.00 900.16 143.84
2 10780813 France F 1998-04-05 PPC 2017-01-01 2017-01-03 607024082 BETCLIC.FR SportsBook False 5.00 0.0 8.00 0.0 2017-09-04 2017-01-01 246 days 966.71 481.26 485.45
3 10783215 Portugal M 1976-12-17 Internal 2017-01-03 2017-01-03 607024088 BETCLIC.PT SportsBook False 1.00 0.0 41.08 0.0 2017-11-19 2017-01-03 320 days 503.30 320.30 183.00
4 10783401 Poland M 1995-04-16 Affiliation 2017-01-03 2017-01-03 607024113 EXPEKT.COM SportsBook False 5.67 0.0 3.75 0.0 2017-01-09 2017-01-03 6 days 186.77 169.85 16.92

Lifetime Value

I consider the lifetime value as :
$\frac{Value}{Lifetime}$
I think this value is more interesting than :
$BetAmount - AmountWon$
as it allows us to compare two customers independantly of the time spend on the website.


In [120]:
# value/lifetime
df_ltv['lifetime_int'] = df_ltv['lifetime'].astype('<m8[D]')
# turn 1 the 0 lifetime
df_ltv['lifetime_int'] = np.where(df_ltv['lifetime_int'] == 0, 1, df_ltv['lifetime_int'])
df_ltv['lifetime_value'] = df_ltv['value_brut'] / df_ltv['lifetime_int']

In [121]:
df_ltv.head()


Out[121]:
UserId Country Gender BirthDate PartnerType FirstDepositDate BetDate BetId Application ProductName IsLive BetAmount AmountWon Odds Cashout max_BetDate min_FirstDepositDate lifetime tot_bet_amount tot_amount_won value_brut lifetime_int lifetime_value
0 10783501 Portugal M 1992-06-08 PPC 2017-01-03 2017-01-03 607023702 BETCLIC.PT SportsBook False 5.00 0.0 124.76 0.0 2017-09-10 2017-01-03 250 days 30.00 0.00 30.00 250.0 0.120000
1 10782912 Ivory Coast M 1996-03-18 Internal 2017-01-03 2017-01-03 607023878 BETCLIC.COM SportsBook False 5.00 23.0 4.60 0.0 2017-10-21 2017-01-03 291 days 1044.00 900.16 143.84 291.0 0.494296
2 10780813 France F 1998-04-05 PPC 2017-01-01 2017-01-03 607024082 BETCLIC.FR SportsBook False 5.00 0.0 8.00 0.0 2017-09-04 2017-01-01 246 days 966.71 481.26 485.45 246.0 1.973374
3 10783215 Portugal M 1976-12-17 Internal 2017-01-03 2017-01-03 607024088 BETCLIC.PT SportsBook False 1.00 0.0 41.08 0.0 2017-11-19 2017-01-03 320 days 503.30 320.30 183.00 320.0 0.571875
4 10783401 Poland M 1995-04-16 Affiliation 2017-01-03 2017-01-03 607024113 EXPEKT.COM SportsBook False 5.67 0.0 3.75 0.0 2017-01-09 2017-01-03 6 days 186.77 169.85 16.92 6.0 2.820000

In [122]:
# age
now = pd.Timestamp(datetime.now())
df_ltv['age'] = (now - df_bc['BirthDate']).astype('<m8[Y]')

15 jours


In [123]:
df_ltv_15 = df_ltv[df_ltv['BetDate'] <= (df_ltv['FirstDepositDate'] + timedelta(days=15))]

Building feature

Features on bets


In [124]:
mean_bet_by_user = df_ltv_15.groupby('UserId')['BetAmount'].mean().reset_index()
mean_bet_by_user.columns = ['UserId', 'mean_bet']

min_bet_by_user = df_ltv_15.groupby('UserId')['BetAmount'].min().reset_index()
min_bet_by_user.columns = ['UserId', 'min_bet']

max_bet_by_user = df_ltv_15.groupby('UserId')['BetAmount'].max().reset_index()
max_bet_by_user.columns = ['UserId', 'max_bet']

median_bet_by_user = df_ltv_15.groupby('UserId')['BetAmount'].median().reset_index()
median_bet_by_user.columns = ['UserId', 'median_bet']

sum_bet_by_user = df_ltv_15.groupby('UserId')['BetAmount'].sum().reset_index()
sum_bet_by_user.columns = ['UserId', 'sum_bet']

count_bet_by_user = df_ltv_15.groupby('UserId')['BetId'].count().reset_index()
count_bet_by_user.columns = ['UserId', 'count_bet']

Feature on bet won


In [125]:
mean_bet_won_by_user = df_ltv_15.groupby('UserId')['AmountWon'].mean().reset_index()
mean_bet_won_by_user.columns = ['UserId', 'mean_bet_won']

min_bet_won_by_user = df_ltv_15.groupby('UserId')['AmountWon'].min().reset_index()
min_bet_won_by_user.columns = ['UserId', 'min_bet_won']

max_bet_won_by_user = df_ltv_15.groupby('UserId')['AmountWon'].max().reset_index()
max_bet_won_by_user.columns = ['UserId', 'max_bet_won']

median_bet_won_by_user = df_ltv_15.groupby('UserId')['AmountWon'].median().reset_index()
median_bet_won_by_user.columns = ['UserId', 'median_bet_won']

sum_bet_won_by_user = df_ltv_15.groupby('UserId')['AmountWon'].sum().reset_index()
sum_bet_won_by_user.columns = ['UserId', 'sum_bet_won']

count_bet_won_by_user = df_ltv_15[df_ltv_15['AmountWon']>0].groupby('UserId')['BetId'].count().reset_index()
count_bet_won_by_user.columns = ['UserId', 'count_bet_won']

Other feature


In [126]:
mean_ods_by_user = df_ltv_15.groupby('UserId')['Odds'].mean()

In [127]:
islive_by_user = df_ltv_15.groupby('UserId')['IsLive'].max()

Categorical features


In [128]:
application_used = pd.get_dummies(df_ltv_15[['UserId', 'Application']]).groupby('UserId').sum()

partnertype_used = pd.get_dummies(df_ltv_15[['UserId', 'PartnerType']]).groupby('UserId').sum()

gender_used = pd.get_dummies(df_ltv_15[['UserId', 'Gender']]).groupby('UserId').sum()
gender_used = gender_used / gender_used
gender_used = gender_used.fillna(0)

Country_used = pd.get_dummies(df_ltv_15[['UserId', 'Country']]).groupby('UserId').sum()
Country_used = Country_used / Country_used
Country_used = Country_used.fillna(0)

Merging all features


In [129]:
df_res = pd.merge(mean_bet_by_user, 
                  min_bet_by_user, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])
df_res = pd.merge(df_res, 
                  max_bet_by_user, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])
df_res = pd.merge(df_res, 
                  median_bet_by_user, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])
df_res = pd.merge(df_res, 
                  sum_bet_by_user, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])
df_res = pd.merge(df_res, 
                  count_bet_by_user, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])
df_res = pd.merge(df_res, 
                  mean_bet_won_by_user, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])
df_res = pd.merge(df_res, 
                  min_bet_won_by_user, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])
df_res = pd.merge(df_res, 
                  max_bet_won_by_user, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])
df_res = pd.merge(df_res, 
                  median_bet_won_by_user, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])
df_res = pd.merge(df_res, 
                  sum_bet_won_by_user, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])
df_res = pd.merge(df_res, 
                  count_bet_won_by_user, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])
df_res = pd.merge(df_res, 
                  application_used, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])
df_res = pd.merge(df_res, 
                  mean_ods_by_user, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])
df_res = pd.merge(df_res, 
                  islive_by_user, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])
df_res = pd.merge(df_res, 
                  partnertype_used, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])
df_res = pd.merge(df_res, 
                  gender_used, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])
df_res = pd.merge(df_res, 
                  Country_used, 
                  how='left', 
                  left_on=['UserId'], 
                  right_on=['UserId'])

Adding interesting feature


In [130]:
df_res['ratio_win'] = df_res['count_bet_won'] / df_res['count_bet']

Final cleaning


In [131]:
df_res = df_res.fillna(0)

In [132]:
ltv_age = df_ltv[['UserId', 'age']].drop_duplicates()

In [133]:
df_final = pd.merge(df_res, 
                    ltv_age, 
                    how='left', 
                    left_on=['UserId'], 
                    right_on=['UserId'])

In [134]:
ltv_to_evaluation = df_ltv[['UserId', 'lifetime_value']].drop_duplicates()

In [135]:
df_final = pd.merge(df_res, 
                    ltv_to_evaluation, 
                    how='left', 
                    left_on=['UserId'], 
                    right_on=['UserId'])

Saving the dataset


In [136]:
df_final.to_csv('15_bc.csv', index=None, sep=';')

Clustering

N'etant pas satisfait de l'approche du probleme en regression (code dans le notebook regression.ipynb), je propose une autre facon de voir les choses.
En clusterisant les comportements, nous obtenons une vision des differents profils des clients et ainsi observer les differences de lifetime value de chaque groupe.
Le choix de 5 classes est arbitraire (ouvert à discussion et à une analyse plus poussée).

Kmeans #1


In [137]:
# compute kmeans
RANDOM_STATE = 50
NB_CLUSTERS = 5
X = df_final[list(set(df_final).difference({'UserId'}))]
kmeans = KMeans(n_clusters=NB_CLUSTERS,
                random_state=RANDOM_STATE, n_jobs=8)
kmeans.fit(X)
X['kmeans_cluster'] = kmeans.labels_

In [138]:
from matplotlib import colors as matplot_colors
import matplotlib.pyplot as plt

def background_gradient(row):
    cmap = 'YlGnBu'
    m = row.min()
    M = row.max()
    low = 0
    high = 0.2
    rng = M - m
    norm = matplot_colors.Normalize(m - (rng * low), M + (rng * high))
    normed = norm(list(row))
    c = [matplot_colors.rgb2hex(x) for x in plt.cm.get_cmap(cmap)(normed)]
    return ['background-color: %s' % color for color in c]


X.groupby('kmeans_cluster').mean().T.style.apply(background_gradient, axis=1)


Out[138]:
kmeans_cluster 0 1 2 3 4
min_bet 20.9906 44.85 172.86 14.2054 17.2431
median_bet 26.2083 1118 898.082 129.817 68.6359
Country_France 0.525399 0 0.666667 0.521739 0.555046
Country_Netherlands 0.00178389 0 0 0 0.00458716
PartnerType_Ad Networks 0.0155878 0 0 0 0
PartnerType_Sponsoring 0.0133367 0 0 0 0
Application_BETCLIC.PT 6.49291 0 0 13.9565 17.4472
min_bet_won 10.9776 0 0 0 11.578
Country_Hungary 0.00552158 0 0 0 0.00688073
sum_bet_won 96.3705 178384 46857.9 12045.7 2421.86
Gender_F 0.140418 0 0 0.0652174 0.0733945
mean_bet 28.059 1197.34 1001.02 188.304 86.5676
PartnerType_Expekt 4.24737e-05 0 0 0 0
Country_United Kingdom 0.0132943 0 0 0.0652174 0.0321101
Odds 68.6593 3.01406 1.58785 26.5616 30.1208
Country_Portugal 0.178432 0 0 0.0652174 0.130734
PartnerType_Referral 1.92903 0 0 14.9348 5.2156
lifetime_value 9.31106 2304.83 102.807 156.89 14.2549
PartnerType_betclic 0.00522426 0 0 0 0.25
Country_Norway 0.00522426 1 0 0.0217391 0.0321101
mean_bet_won 18.6237 991.02 950.463 170.221 97.9391
count_bet 19.7362 180 56.6667 161.174 87.3601
PartnerType_Emailing 0.009684 0 0 0 0
Country_Italy 0.0493969 0 0 0 0.0458716
ratio_win 0.235702 0.594444 0.686032 0.445024 0.436401
Country_Ivory Coast 0.0266735 0 0 0.0217391 0.0183486
PartnerType_Affiliation 4.63282 180 33.3333 19.5 26.344
Application_BETCLIC.COM 2.53772 0 33.3333 19.1522 17.617
Application_BETCLIC.IT 0.325433 0 0 0 2.47248
max_bet_won 46.2576 5645 3152.47 1847.01 520.848
max_bet 45.4971 3161.2 2925.82 1013.9 297.176
Application_BETCLIC.FR 8.90911 0 23.3333 86.6304 41.5688
sum_bet 130.297 215521 48678.7 12492.4 2589.42
Country_Finland 0.00777268 0 0 0.0217391 0.0114679
Country_Sweden 0.0304111 0 0 0.0869565 0.0321101
PartnerType_Offline 0.00114679 0 0 0 0
PartnerType_SEO 0.000679579 0 0 0 0
Country_Germany 0.0109157 0 0 0 0.0321101
Country_Poland 0.143221 0 0.333333 0.195652 0.0917431
IsLive 0.507051 1 1 1 0.915138
PartnerType_Mobile 0.000594631 0 0 0 0
PartnerType_PPC 4.54634 0 0 65.1957 18.2248
Gender_M 0.859582 1 1 0.934783 0.926606
count_bet_won 5.35623 107 42 65.6739 33.9817
Country_Switzerland 0.00195379 0 0 0 0.00688073
PartnerType_Partnerships 0.257221 0 0 0 1.12615
PartnerType_Internal 8.3245 0 23.3333 61.5435 36.1995
Application_EXPEKT.COM 1.47103 180 0 41.4348 8.25459
median_bet_won 14.7678 193.94 751.463 32.1745 45.6529

In [139]:
X.groupby('kmeans_cluster').count()


Out[139]:
min_bet median_bet Country_France Country_Netherlands PartnerType_Ad Networks PartnerType_Sponsoring Application_BETCLIC.PT min_bet_won Country_Hungary sum_bet_won Gender_F mean_bet PartnerType_Expekt Country_United Kingdom Odds Country_Portugal PartnerType_Referral lifetime_value PartnerType_betclic Country_Norway mean_bet_won count_bet PartnerType_Emailing Country_Italy ratio_win Country_Ivory Coast PartnerType_Affiliation Application_BETCLIC.COM Application_BETCLIC.IT max_bet_won max_bet Application_BETCLIC.FR sum_bet Country_Finland Country_Sweden PartnerType_Offline PartnerType_SEO Country_Germany Country_Poland IsLive PartnerType_Mobile PartnerType_PPC Gender_M count_bet_won Country_Switzerland PartnerType_Partnerships PartnerType_Internal Application_EXPEKT.COM median_bet_won
kmeans_cluster
0 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544 23544
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46
4 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436 436

Conclusion Kmeans #1

Les clusters sont beaucoup trop desequilibrés, les valeurs 'financieres' comportent des cas extremes qui polluent le bon placement des centroides.

Kmeans #2 : Robust scaler


In [146]:
X = df_final[list(set(df_final).difference({'UserId'}))]

transformer = RobustScaler().fit(X)
df_robust = pd.DataFrame(transformer.transform(X))
df_robust.columns = X.columns
df_robust#['mean_bet'].hist()

# compute kmeans
RANDOM_STATE = 50
NB_CLUSTERS = 5
X = df_robust[list(set(df_robust).difference({'UserId'}))]
kmeans = KMeans(n_clusters=NB_CLUSTERS,
                random_state=RANDOM_STATE, n_jobs=8)
kmeans.fit(X)
X['kmeans_cluster'] = kmeans.labels_
X.groupby('kmeans_cluster').mean().T.style.apply(background_gradient, axis=1)


Out[146]:
kmeans_cluster 0 1 2 3 4
min_bet 0.801304 7.32414 4.4175 -0.05431 2.36657
median_bet 0.484519 4.81751 2.6296 -0.12915 1.19718
Country_France -0.479141 -0.28169 -0.323034 -0.511111 -1
Country_Netherlands 0.00185697 0 0.00140449 0 0
PartnerType_Ad Networks 0.015849 0 0 0 0
PartnerType_Sponsoring 0.0135602 0 0 0 0
Application_BETCLIC.PT 6.92274 0.0140845 0.00561798 9 0
min_bet_won 1.18783 765.946 248.857 0 4429.5
Country_Hungary 0.00574365 0 0 0 0
sum_bet_won 1.23799 12.0724 4.01914 -0.00827777 37.9947
Application_EXPEKT.COM 1.7288 0.225352 0.289326 0.711111 1
Gender_F 0.1353 0.323944 0.251404 0.0666667 0
PartnerType_Expekt 4.31854e-05 0 0 0 0
Odds 4.11948 0.251596 -0.14996 763.074 9.06381
Country_Portugal 0.182631 0.0140845 0.00280899 0.322222 0
PartnerType_Referral 2.07717 0.140845 0.11236 2.1 0
lifetime_value 1.19338 -28.4483 -9.88006 0.252968 -407.97
PartnerType_betclic 0.00993263 0 0.00280899 0 0
Country_Norway 0.00583002 0.0140845 0.00421348 0 0
mean_bet_won 1.3362 120.902 40.8487 -0.0537441 654.835
count_bet 0.74395 -0.278169 -0.27816 1.18667 -0.3
PartnerType_Emailing 0.00984626 0 0 0 0
Country_Italy 0.0495336 0 0.0505618 0 0
ratio_win 0.12553 2.10923 2.0959 -0.344238 2.14286
Country_Ivory Coast 0.0269477 0 0.00421348 0.111111 0
PartnerType_Affiliation 5.23782 0.0704225 0.453652 1.36667 0
median_bet_won 4.82717 818.197 277.005 0 4429.5
Application_BETCLIC.COM 2.8989 0.084507 0.310393 11.7333 0
Application_BETCLIC.IT 0.374978 0 0.0800562 0 0
max_bet_won 0.94606 20.7541 6.99899 0.0390482 102.844
max_bet 0.297926 2.48993 1.00759 -0.0176593 0.297802
Application_BETCLIC.FR 1.27908 0.0160966 -0.0355136 1.18413 -0.142857
sum_bet 1.04566 5.94721 1.56802 -0.142439 -0.280992
Country_Finland 0.0079461 0 0.00561798 0 1
Country_Sweden 0.0290206 0.211268 0.0646067 0.0111111 0
PartnerType_Offline 0.001166 0 0 0 0
PartnerType_SEO 0.000690966 0 0 0 0
Country_Germany 0.0110123 0.028169 0.0168539 0.0222222 0
Country_Poland 0.142382 0 0.171348 0.0333333 0
IsLive -0.468777 -0.943662 -0.942416 -0.544444 -1
PartnerType_Mobile 0.000604595 0 0 0 0
PartnerType_PPC 5.06197 0.295775 0.217697 6.58889 0
Gender_M -0.1353 -0.323944 -0.251404 -0.0666667 0
count_bet_won 1.0349 0.0507042 0.0651685 0.168889 0
Country_Switzerland 0.00211608 0 0 0 0
PartnerType_Partnerships 0.278675 0 0.00561798 1 0
PartnerType_Internal 2.29292 0.232394 0.161166 4.91944 0.25
mean_bet 0.421251 4.33345 2.27924 -0.130744 0.991596
Country_United Kingdom 0.0141216 0.0140845 0.00140449 0.0111111 0

In [149]:
X.groupby('kmeans_cluster').count()


Out[149]:
min_bet median_bet Country_France Country_Netherlands PartnerType_Ad Networks PartnerType_Sponsoring Application_BETCLIC.PT min_bet_won Country_Hungary sum_bet_won Application_EXPEKT.COM Gender_F PartnerType_Expekt Odds Country_Portugal PartnerType_Referral lifetime_value PartnerType_betclic Country_Norway mean_bet_won count_bet PartnerType_Emailing Country_Italy ratio_win Country_Ivory Coast PartnerType_Affiliation median_bet_won Application_BETCLIC.COM Application_BETCLIC.IT max_bet_won max_bet Application_BETCLIC.FR sum_bet Country_Finland Country_Sweden PartnerType_Offline PartnerType_SEO Country_Germany Country_Poland IsLive PartnerType_Mobile PartnerType_PPC Gender_M count_bet_won Country_Switzerland PartnerType_Partnerships PartnerType_Internal mean_bet Country_United Kingdom
kmeans_cluster
0 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156 23156
1 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71
2 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712 712
3 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Conlusion Kmeans #2

Amelioration du kmeans #1 mais toujours aussi desequilibré, essayons avec un scaler plus classique.

Kmeans #3


In [142]:
X = df_final[list(set(df_final).difference({'UserId'}))]

transformer = MinMaxScaler().fit(X)
df_robust = pd.DataFrame(transformer.transform(X))
df_robust.columns = X.columns
df_robust#['mean_bet'].hist()

# compute kmeans
RANDOM_STATE = 50
NB_CLUSTERS = 5
X = df_robust[list(set(df_robust).difference({'UserId'}))]
kmeans = KMeans(n_clusters=NB_CLUSTERS,
                random_state=RANDOM_STATE, n_jobs=8)
kmeans.fit(X)
df_final['kmeans_cluster'] = kmeans.labels_
df_final.groupby('kmeans_cluster').mean().T.style.apply(background_gradient, axis=1)


/home/tom/source/smart_click/betclic/lib/python3.5/site-packages/sklearn/preprocessing/data.py:323: DataConversionWarning: Data with input dtype bool, int64, float64 were all converted to float64 by MinMaxScaler.
  return self.partial_fit(X, y)
Out[142]:
kmeans_cluster 0 1 2 3 4
UserId 1.06828e+07 1.07012e+07 1.07255e+07 1.0637e+07 1.07549e+07
mean_bet 47.9892 12.3854 26.7269 16.6599 42.6891
min_bet 39.1547 5.23521 20.646 4.98034 34.5676
max_bet 64.4959 31.2158 40.4294 59.5305 60.2931
median_bet 46.0023 10.6933 25.1792 12.9131 40.697
sum_bet 111.263 264.265 167.534 355.074 154.506
count_bet 6.62808 37.3089 13.2866 32.2382 14.7797
mean_bet_won 32.8917 8.09162 20.5327 10.6705 30.794
min_bet_won 22.7138 0.63661 10.6021 0.440544 20.6537
max_bet_won 57.3083 46.0442 46.7482 78.3702 62.097
median_bet_won 28.8674 3.11416 16.7909 3.16049 26.3678
sum_bet_won 81.6081 216.839 154.026 293.862 124.238
count_bet_won 1.3941 10.0048 4.51971 9.90222 4.05477
Application_BETCLIC.COM 1.04737 6.11771 7.30956 0 2.05447
Application_BETCLIC.FR 4.93233 0 0 32.2382 6.66627
Application_BETCLIC.IT 0.24766 1.13108 0 0 0.277761
Application_BETCLIC.PT 0 27.1704 0 0 4.42323
Application_EXPEKT.COM 0.40072 2.88971 5.97703 0 1.35798
Odds 73.4713 100.027 31.0029 62.6292 45.098
IsLive 0 0.843112 0.367501 1 0.398982
PartnerType_Ad Networks 0.0100792 0 0.000342818 0.0544018 0
PartnerType_Affiliation 1.69316 6.74545 7.02537 7.66348 3.42951
PartnerType_Emailing 0.000719942 0.006684 0.0589647 0.00275685 0
PartnerType_Expekt 0.000143988 0 0 0 0
PartnerType_Internal 2.84392 17.0472 5.37161 12.3404 6.07483
PartnerType_Mobile 0 0 0 0.00257306 0
PartnerType_Offline 0.00115191 0 0 0.00349201 0
PartnerType_PPC 1.27487 9.35815 0 8.83422 3.19066
PartnerType_Partnerships 0.0354212 0.598775 0.0294823 0.436133 0.184675
PartnerType_Referral 0.764003 3.47419 0.776826 2.90075 1.88896
PartnerType_SEO 0.00115191 0.00148533 0 0 0
PartnerType_Sponsoring 0 0.0527293 0 0 0.00897935
PartnerType_betclic 0.00345572 0.0243223 0.0239973 0 0.00209518
Gender_F 0 0 0 0 1
Gender_M 1 1 1 1 0
Country_Finland 0.0151188 0.0135537 0 0 0.00329243
Country_France 0.758531 0 0 1 0.577372
Country_Germany 0.0165587 0.020609 0 0 0.013469
Country_Hungary 0.00446364 0.0142963 0 0 0.00748279
Country_Italy 0.0767459 0.0748236 0 0 0.07393
Country_Ivory Coast 0.0358531 0.0616413 0 0 0.0167614
Country_Netherlands 0.0024478 0.00352766 0 0 0.00239449
Country_Norway 0.00921526 0.012811 0 0 0.00179587
Country_Poland 0 0 1 0 0.151152
Country_Portugal 0 0.719829 0 0 0.114936
Country_Sweden 0.0578834 0.0447456 0 0 0.0272374
Country_Switzerland 0.0024478 0.00538433 0 0 0.000897935
Country_United Kingdom 0.0207343 0.0287783 0 0 0.00927866
ratio_win 0.208766 0.23524 0.243643 0.276756 0.248235
lifetime_value 10.0264 9.04076 9.18415 8.72277 12.7769

In [152]:
df_final.groupby('kmeans_cluster').count()['UserId']


Out[152]:
kmeans_cluster
0    6945
1    5386
2    2917
3    5441
4    3341
Name: UserId, dtype: int64

Conclusion kmeans #3

Les clusters sont bien equilibrés et il est possible de raconter une histoire sur le cluster ayant une plus forte lifetime value:
Les clients generant un $\text{lifetime value}$ haut parient des montants elevés en moyenne, jouent assez peu souvant, sont des femmes, francaise/italienne, sur des Odds assez bas, rarement en live et ont un assez bon ratio de gain.

Un tel modèle permet de predire, suivant les actions effectuées par le client en 15 jours, potentiel $\text{lifetime value}$ d'un client.

Un tel model est sujet a etre amelioré mais cette piste peut servir de base solide pour la suite.
En temps normal j'utilise la librairie "pickle" pour sauvegarder mes modeles mais pour ne pas generer de fichier suplementaire sur vos machines, j'ai retiré toute sauvegarde.
Tous les modeles ont une seed fixe pour etre reproductible.


In [ ]:


In [ ]: