If the average lifetime of a customer is 90 days since their first deposit :
build a model to predict the Lifetime Value of a customer given his first 15 days of activity. What other data could be pertinent to answer this question? Given the data provided is the assumption of 90 days valid?
In [115]:
import pandas as pd
import pandas_profiling
import matplotlib
from datetime import datetime
import numpy as np
import matplotlib.pyplot as plt
from datetime import timedelta
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.cluster import KMeans
pd.options.display.max_columns = None
matplotlib.style.use('ggplot')
%matplotlib inline
In [116]:
# get dataset
df_bc = pd.read_csv('data/dataset_betclic.csv')
# convert string date to date
df_bc['BirthDate'] = pd.to_datetime(df_bc['BirthDate'], format='%Y-%m-%dT%H:%M:%S.%fZ')
df_bc['FirstDepositDate'] = pd.to_datetime(df_bc['FirstDepositDate'], format='%Y-%m-%dT%H:%M:%S.%fZ')
df_bc['BetDate'] = pd.to_datetime(df_bc['BetDate'], format='%Y%m%d')
In [117]:
df_max_BetDate = df_bc.groupby(['UserId'])['BetDate'].max().reset_index()
df_max_BetDate.columns = ['UserId', 'max_BetDate']
df_min_FirstDepositDate = df_bc.groupby(['UserId'])['FirstDepositDate'].min().reset_index()
df_min_FirstDepositDate.columns = ['UserId', 'min_FirstDepositDate']
df_ltv = pd.merge(df_bc,
df_max_BetDate,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_ltv = pd.merge(df_ltv,
df_min_FirstDepositDate,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_ltv['lifetime'] = df_ltv['max_BetDate'] - df_ltv['min_FirstDepositDate']
df_ltv.head()
Out[117]:
We check if the average lifetime of a customer given by Betclick, 90 jours, is the same in our data :
In [118]:
df_ltv.drop_duplicates(subset='UserId')['lifetime'].mean()
Out[118]:
We find an average at 98 days
In [119]:
df_bet_amount = df_bc.groupby(['UserId'])['BetAmount'].sum().reset_index()
df_bet_amount.columns = ['UserId', 'tot_bet_amount']
df_ltv = pd.merge(df_ltv,
df_bet_amount,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_amount_won = df_bc.groupby(['UserId'])['AmountWon'].sum().reset_index()
df_amount_won.columns = ['UserId', 'tot_amount_won']
df_ltv = pd.merge(df_ltv,
df_amount_won,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_ltv['value_brut'] = df_ltv['tot_bet_amount'] - df_ltv['tot_amount_won']
df_ltv.head()
Out[119]:
In [120]:
# value/lifetime
df_ltv['lifetime_int'] = df_ltv['lifetime'].astype('<m8[D]')
# turn 1 the 0 lifetime
df_ltv['lifetime_int'] = np.where(df_ltv['lifetime_int'] == 0, 1, df_ltv['lifetime_int'])
df_ltv['lifetime_value'] = df_ltv['value_brut'] / df_ltv['lifetime_int']
In [121]:
df_ltv.head()
Out[121]:
In [122]:
# age
now = pd.Timestamp(datetime.now())
df_ltv['age'] = (now - df_bc['BirthDate']).astype('<m8[Y]')
In [123]:
df_ltv_15 = df_ltv[df_ltv['BetDate'] <= (df_ltv['FirstDepositDate'] + timedelta(days=15))]
In [124]:
mean_bet_by_user = df_ltv_15.groupby('UserId')['BetAmount'].mean().reset_index()
mean_bet_by_user.columns = ['UserId', 'mean_bet']
min_bet_by_user = df_ltv_15.groupby('UserId')['BetAmount'].min().reset_index()
min_bet_by_user.columns = ['UserId', 'min_bet']
max_bet_by_user = df_ltv_15.groupby('UserId')['BetAmount'].max().reset_index()
max_bet_by_user.columns = ['UserId', 'max_bet']
median_bet_by_user = df_ltv_15.groupby('UserId')['BetAmount'].median().reset_index()
median_bet_by_user.columns = ['UserId', 'median_bet']
sum_bet_by_user = df_ltv_15.groupby('UserId')['BetAmount'].sum().reset_index()
sum_bet_by_user.columns = ['UserId', 'sum_bet']
count_bet_by_user = df_ltv_15.groupby('UserId')['BetId'].count().reset_index()
count_bet_by_user.columns = ['UserId', 'count_bet']
In [125]:
mean_bet_won_by_user = df_ltv_15.groupby('UserId')['AmountWon'].mean().reset_index()
mean_bet_won_by_user.columns = ['UserId', 'mean_bet_won']
min_bet_won_by_user = df_ltv_15.groupby('UserId')['AmountWon'].min().reset_index()
min_bet_won_by_user.columns = ['UserId', 'min_bet_won']
max_bet_won_by_user = df_ltv_15.groupby('UserId')['AmountWon'].max().reset_index()
max_bet_won_by_user.columns = ['UserId', 'max_bet_won']
median_bet_won_by_user = df_ltv_15.groupby('UserId')['AmountWon'].median().reset_index()
median_bet_won_by_user.columns = ['UserId', 'median_bet_won']
sum_bet_won_by_user = df_ltv_15.groupby('UserId')['AmountWon'].sum().reset_index()
sum_bet_won_by_user.columns = ['UserId', 'sum_bet_won']
count_bet_won_by_user = df_ltv_15[df_ltv_15['AmountWon']>0].groupby('UserId')['BetId'].count().reset_index()
count_bet_won_by_user.columns = ['UserId', 'count_bet_won']
In [126]:
mean_ods_by_user = df_ltv_15.groupby('UserId')['Odds'].mean()
In [127]:
islive_by_user = df_ltv_15.groupby('UserId')['IsLive'].max()
In [128]:
application_used = pd.get_dummies(df_ltv_15[['UserId', 'Application']]).groupby('UserId').sum()
partnertype_used = pd.get_dummies(df_ltv_15[['UserId', 'PartnerType']]).groupby('UserId').sum()
gender_used = pd.get_dummies(df_ltv_15[['UserId', 'Gender']]).groupby('UserId').sum()
gender_used = gender_used / gender_used
gender_used = gender_used.fillna(0)
Country_used = pd.get_dummies(df_ltv_15[['UserId', 'Country']]).groupby('UserId').sum()
Country_used = Country_used / Country_used
Country_used = Country_used.fillna(0)
In [129]:
df_res = pd.merge(mean_bet_by_user,
min_bet_by_user,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_res = pd.merge(df_res,
max_bet_by_user,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_res = pd.merge(df_res,
median_bet_by_user,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_res = pd.merge(df_res,
sum_bet_by_user,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_res = pd.merge(df_res,
count_bet_by_user,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_res = pd.merge(df_res,
mean_bet_won_by_user,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_res = pd.merge(df_res,
min_bet_won_by_user,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_res = pd.merge(df_res,
max_bet_won_by_user,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_res = pd.merge(df_res,
median_bet_won_by_user,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_res = pd.merge(df_res,
sum_bet_won_by_user,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_res = pd.merge(df_res,
count_bet_won_by_user,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_res = pd.merge(df_res,
application_used,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_res = pd.merge(df_res,
mean_ods_by_user,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_res = pd.merge(df_res,
islive_by_user,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_res = pd.merge(df_res,
partnertype_used,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_res = pd.merge(df_res,
gender_used,
how='left',
left_on=['UserId'],
right_on=['UserId'])
df_res = pd.merge(df_res,
Country_used,
how='left',
left_on=['UserId'],
right_on=['UserId'])
In [130]:
df_res['ratio_win'] = df_res['count_bet_won'] / df_res['count_bet']
In [131]:
df_res = df_res.fillna(0)
In [132]:
ltv_age = df_ltv[['UserId', 'age']].drop_duplicates()
In [133]:
df_final = pd.merge(df_res,
ltv_age,
how='left',
left_on=['UserId'],
right_on=['UserId'])
In [134]:
ltv_to_evaluation = df_ltv[['UserId', 'lifetime_value']].drop_duplicates()
In [135]:
df_final = pd.merge(df_res,
ltv_to_evaluation,
how='left',
left_on=['UserId'],
right_on=['UserId'])
In [136]:
df_final.to_csv('15_bc.csv', index=None, sep=';')
N'etant pas satisfait de l'approche du probleme en regression (code dans le notebook regression.ipynb), je propose une autre facon de voir les choses.
En clusterisant les comportements, nous obtenons une vision des differents profils des clients et ainsi observer les differences de lifetime value de chaque groupe.
Le choix de 5 classes est arbitraire (ouvert à discussion et à une analyse plus poussée).
In [137]:
# compute kmeans
RANDOM_STATE = 50
NB_CLUSTERS = 5
X = df_final[list(set(df_final).difference({'UserId'}))]
kmeans = KMeans(n_clusters=NB_CLUSTERS,
random_state=RANDOM_STATE, n_jobs=8)
kmeans.fit(X)
X['kmeans_cluster'] = kmeans.labels_
In [138]:
from matplotlib import colors as matplot_colors
import matplotlib.pyplot as plt
def background_gradient(row):
cmap = 'YlGnBu'
m = row.min()
M = row.max()
low = 0
high = 0.2
rng = M - m
norm = matplot_colors.Normalize(m - (rng * low), M + (rng * high))
normed = norm(list(row))
c = [matplot_colors.rgb2hex(x) for x in plt.cm.get_cmap(cmap)(normed)]
return ['background-color: %s' % color for color in c]
X.groupby('kmeans_cluster').mean().T.style.apply(background_gradient, axis=1)
Out[138]:
In [139]:
X.groupby('kmeans_cluster').count()
Out[139]:
In [146]:
X = df_final[list(set(df_final).difference({'UserId'}))]
transformer = RobustScaler().fit(X)
df_robust = pd.DataFrame(transformer.transform(X))
df_robust.columns = X.columns
df_robust#['mean_bet'].hist()
# compute kmeans
RANDOM_STATE = 50
NB_CLUSTERS = 5
X = df_robust[list(set(df_robust).difference({'UserId'}))]
kmeans = KMeans(n_clusters=NB_CLUSTERS,
random_state=RANDOM_STATE, n_jobs=8)
kmeans.fit(X)
X['kmeans_cluster'] = kmeans.labels_
X.groupby('kmeans_cluster').mean().T.style.apply(background_gradient, axis=1)
Out[146]:
In [149]:
X.groupby('kmeans_cluster').count()
Out[149]:
In [142]:
X = df_final[list(set(df_final).difference({'UserId'}))]
transformer = MinMaxScaler().fit(X)
df_robust = pd.DataFrame(transformer.transform(X))
df_robust.columns = X.columns
df_robust#['mean_bet'].hist()
# compute kmeans
RANDOM_STATE = 50
NB_CLUSTERS = 5
X = df_robust[list(set(df_robust).difference({'UserId'}))]
kmeans = KMeans(n_clusters=NB_CLUSTERS,
random_state=RANDOM_STATE, n_jobs=8)
kmeans.fit(X)
df_final['kmeans_cluster'] = kmeans.labels_
df_final.groupby('kmeans_cluster').mean().T.style.apply(background_gradient, axis=1)
Out[142]:
In [152]:
df_final.groupby('kmeans_cluster').count()['UserId']
Out[152]:
Les clusters sont bien equilibrés et il est possible de raconter une histoire sur le cluster ayant une plus forte lifetime value:
Les clients generant un $\text{lifetime value}$ haut parient des montants elevés en moyenne, jouent assez peu souvant, sont des femmes, francaise/italienne, sur des Odds assez bas, rarement en live et ont un assez bon ratio de gain.
Un tel modèle permet de predire, suivant les actions effectuées par le client en 15 jours, potentiel $\text{lifetime value}$ d'un client.
Un tel model est sujet a etre amelioré mais cette piste peut servir de base solide pour la suite.
En temps normal j'utilise la librairie "pickle" pour sauvegarder mes modeles mais pour ne pas generer de fichier suplementaire sur vos machines, j'ai retiré toute sauvegarde.
Tous les modeles ont une seed fixe pour etre reproductible.
In [ ]:
In [ ]: