In [1]:
import lifelines
import pymc as pm
from pyBMA.CoxPHFitter import CoxPHFitter
import matplotlib.pyplot as plt
import numpy as np
from numpy import log
from datetime import datetime
import pandas as pd
%matplotlib inline
The first step in any data analysis is acquiring and munging the data
Our starting data set can be found here: http://jakecoltman.com in the pyData post
It is designed to be roughly similar to the output from DCM's path to conversion
Download the file and transform it into something with the columns:
id,lifetime,age,male,event,search,brand
where lifetime is the total time that we observed someone not convert for and event should be 1 if we see a conversion and 0 if we don't. Note that all values should be converted into ints
It is useful to note that end_date = datetime.datetime(2016, 5, 3, 20, 36, 8, 92165)
In [2]:
running_id = 0
output = [[0]]
with open("E:/output.txt") as file_open:
for row in file_open.read().split("\n"):
cols = row.split(",")
if cols[0] == output[-1][0]:
output[-1].append(cols[1])
output[-1].append(True)
else:
output.append(cols)
output = output[1:]
for row in output:
if len(row) == 6:
row += [datetime(2016, 5, 3, 20, 36, 8, 92165), False]
output = output[1:-1]
In [3]:
def convert_to_days(dt):
day_diff = dt / np.timedelta64(1, 'D')
if day_diff == 0:
return 23.0
else:
return day_diff
df = pd.DataFrame(output, columns=["id", "advert_time", "male","age","search","brand","conversion_time","event"])
df["lifetime"] = pd.to_datetime(df["conversion_time"]) - pd.to_datetime(df["advert_time"])
df["lifetime"] = df["lifetime"].apply(convert_to_days)
df["male"] = df["male"].astype(int)
df["search"] = df["search"].astype(int)
df["brand"] = df["brand"].astype(int)
df["age"] = df["age"].astype(int)
df["event"] = df["event"].astype(int)
df = df.drop('advert_time', 1)
df = df.drop('conversion_time', 1)
df = df.set_index("id")
df = df.dropna(thresh=2)
df.median()
Out[3]:
In [ ]:
###Parametric Bayes
#Shout out to Cam Davidson-Pilon
In [14]:
## Example fully worked model using toy data
## Adapted from http://blog.yhat.com/posts/estimating-user-lifetimes-with-pymc.html
## Note that we've made some corrections
N = 2500
##Generate some random data
lifetime = pm.rweibull( 2, 5, size = N )
birth = pm.runiform(0, 10, N)
censor = ((birth + lifetime) >= 10)
lifetime_ = lifetime.copy()
lifetime_[censor] = 10 - birth[censor]
alpha = pm.Uniform('alpha', 0, 20)
beta = pm.Uniform('beta', 0, 20)
@pm.observed
def survival(value=lifetime_, alpha = alpha, beta = beta ):
return sum( (1-censor)*(log( alpha/beta) + (alpha-1)*log(value/beta)) - (value/beta)**(alpha))
mcmc = pm.MCMC([alpha, beta, survival ] )
mcmc.sample(50000, 30000)
In [ ]:
pm.Matplot.plot(mcmc)
mcmc.trace("alpha")[:]
Problems:
1 - Try to fit your data from section 1
2 - Use the results to plot the distribution of the median
Note that the media of a Weibull distribution is: $$β(log 2)^{1/α}$$
In [1]:
#### Fit to your data here
In [ ]:
#### Plot the distribution of the median
Problems:
4 - Try adjusting the number of samples for burning and thinnning
5 - Try adjusting the prior and see how it affects the estimate
In [ ]:
#### Adjust burn and thin, both paramters of the mcmc sample function
In [2]:
#### Narrow and broaden prior
Problems:
7 - Try testing whether the median is greater than a different values
In [ ]:
#### Hypothesis testing
If we want to look at covariates, we need a new approach.
We'll use Cox proprtional hazards, a very popular regression model.
To fit in python we use the module lifelines:
In [3]:
### Fit a cox proprtional hazards model
Once we've fit the data, we need to do something useful with it. Try to do the following things:
1 - Plot the baseline survival function
2 - Predict the functions for a particular set of features
3 - Plot the survival function for two different set of features
4 - For your results in part 3 caculate how much more likely a death event is for one than the other for a given period of time
In [4]:
#### Plot baseline hazard function
In [ ]:
#### Predict
In [ ]:
#### Plot survival functions for different covariates
In [ ]:
#### Plot some odds
Model selection
Difficult to do with classic tools (here)
Problem:
1 - Calculate the BMA coefficient values
2 - Try running with different priors
In [ ]:
#### BMA Coefficient values
In [ ]:
#### Different priors