In [ ]:
import lifelines
import pymc as pm
from pyBMA.CoxPHFitter import CoxPHFitter
import matplotlib.pyplot as plt
import numpy as np
from math import log
from datetime import datetime
import pandas as pd
%matplotlib inline
The first step in any data analysis is acquiring and munging the data.
An example data set can be found at: https://jakecoltman.gitlab.io/website/post/pydata/
Download the file output.txt and transform it into the format shown below. The event column should be 0 if there is only one entry for an id, and 1 if there are two entries:
End date = datetime.datetime(2016, 5, 3, 20, 36, 8, 92165)
id,time_to_convert,age,male,event,search,brand
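One possible shape for the munging step is sketched below. The layout of output.txt is not described here, so the raw columns (`id`, `timestamp`, plus the covariates repeated on every row), the choice of days as the time unit, and the use of the end date as the censoring time for ids with a single entry are all assumptions made for illustration. The `munge` helper is hypothetical.

```python
from datetime import datetime

import pandas as pd

# End date quoted above; assumed here to be the censoring time.
END_DATE = datetime(2016, 5, 3, 20, 36, 8, 92165)

# Hypothetical raw log: one row per entry, one or two entries per id.
raw = pd.DataFrame({
    "id": [1, 1, 2],
    "timestamp": pd.to_datetime([
        "2016-04-20 10:00:00",
        "2016-04-25 10:00:00",
        "2016-04-30 20:36:08",
    ]),
    "age": [30, 30, 45],
    "male": [1, 1, 0],
    "search": [0, 0, 1],
    "brand": [1, 1, 0],
})


def munge(raw, end_date):
    """Collapse the raw log to one row per id in the target format."""
    raw = raw.sort_values(["id", "timestamp"])
    out = raw.groupby("id").agg(
        first_seen=("timestamp", "first"),
        last_seen=("timestamp", "last"),
        n_rows=("timestamp", "size"),
        age=("age", "first"),
        male=("male", "first"),
        search=("search", "first"),
        brand=("brand", "first"),
    )
    # Two entries means the event was observed; one entry means censored.
    out["event"] = (out["n_rows"] >= 2).astype(int)
    # Censored ids are followed up to the end date instead.
    stop = out["last_seen"].where(out["event"] == 1, pd.Timestamp(end_date))
    # Time unit is an assumption: days.
    out["time_to_convert"] = (stop - out["first_seen"]).dt.total_seconds() / 86400.0
    cols = ["id", "time_to_convert", "age", "male", "event", "search", "brand"]
    return out.reset_index()[cols]


df = munge(raw, END_DATE)
```

The resulting `df` has the column order of the header line above, so it can be passed to the model cells that follow once the real file has been mapped onto the assumed raw layout.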
In [ ]:
####Data munging here
In [ ]:
###Parametric Bayes
#Shout out to Cam Davidson-Pilon
In [ ]:
## Example fully worked model using toy data
## Adapted from http://blog.yhat.com/posts/estimating-user-lifetimes-with-pymc.html
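# Priors on the Weibull shape (alpha) and scale (beta)
alpha = pm.Uniform("alpha", 0, 20)
beta = pm.Uniform("beta", 0, 20)
# Likelihood; df is the data frame built in the munging step above
obs = pm.Weibull("obs", alpha, beta, value=df["time_to_convert"], observed=True)

# Penalise the log-probability heavily if any observation exceeds 23
@pm.potential
def censorfactor(obs=obs):
    if np.any(obs > 23):
        return -100000
    else:
        return 0

mcmc = pm.MCMC([alpha, beta, obs, censorfactor])
mcmc.sample(5000, burn=0)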
In [ ]:
pm.Matplot.plot(mcmc)
mcmc.trace("alpha")[:]
Problems:
1 - Try to fit your data from section 1
2 - Use the results to plot the distribution of the median
Note that the median of a Weibull distribution is: $$\beta(\log 2)^{1/\alpha}$$
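The median formula can be checked numerically before it is applied to posterior samples. The sketch below uses only NumPy; the parameter values and the helper name `weibull_median` are illustrative assumptions, and the draws are simulated rather than taken from the example data set.

```python
import numpy as np


def weibull_median(alpha, beta):
    """Median of a Weibull with shape alpha and scale beta."""
    return beta * np.log(2.0) ** (1.0 / alpha)


# Illustrative parameter values, not estimates from any data set.
alpha_true, beta_true = 1.5, 10.0

rng = np.random.default_rng(0)
# rng.weibull draws from a Weibull with scale 1, so multiply by beta.
draws = beta_true * rng.weibull(alpha_true, size=200_000)

empirical = np.median(draws)
analytic = weibull_median(alpha_true, beta_true)
print(empirical, analytic)
```

Because the helper works elementwise on arrays, the same function can be applied to the posterior traces, for example `weibull_median(mcmc.trace("alpha")[:], mcmc.trace("beta")[:])`, and the result passed to `plt.hist` to plot the distribution of the median.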
In [1]:
#### Fit to your data here
In [ ]:
#### Plot the distribution of the median
Problems:
4 - Try adjusting the number of samples for burning and thinning
5 - Try adjusting the prior and see how it affects the estimate
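To see what burning and thinning do to a chain, the sketch below applies them to a simulated autocorrelated sequence rather than to a real MCMC trace. The AR(1) chain, its starting value, and the `burn` and `thin` settings are assumptions chosen only to make the effect visible.

```python
import numpy as np


def lag1_autocorr(x):
    """Lag-1 autocorrelation of a one-dimensional sequence."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))


rng = np.random.default_rng(0)
n, rho = 5000, 0.9

# Stand-in for a sampler trace: strongly correlated, started far from 0.
chain = np.empty(n)
chain[0] = 25.0
for t in range(1, n):
    chain[t] = rho * chain[t - 1] + rng.normal()

burn, thin = 500, 10
# Burning drops the first `burn` draws; thinning keeps every `thin`-th one.
kept = chain[burn::thin]

print(len(kept), lag1_autocorr(chain), lag1_autocorr(kept))
```

In PyMC 2 the equivalent settings are passed to the sampler, as in `mcmc.sample(5000, burn=500, thin=10)`, and the stored trace then already excludes the burned and thinned draws.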
In [ ]:
#### Adjust burn and thin, both parameters of the mcmc sample function
In [2]:
#### Narrow and broaden prior
Problems:
6 - Try using different distributions
7 - Bonus - try testing whether the median is greater than 14
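For the bonus problem, one Bayesian approach is to report the share of posterior draws whose implied median exceeds the threshold. The sketch below assumes a Weibull model; the helper `prob_median_exceeds` is hypothetical, and the sample arrays are random stand-ins, not a real posterior.

```python
import numpy as np


def prob_median_exceeds(alpha_samples, beta_samples, threshold):
    """Fraction of draws whose Weibull median is above threshold."""
    alpha_samples = np.asarray(alpha_samples, dtype=float)
    beta_samples = np.asarray(beta_samples, dtype=float)
    medians = beta_samples * np.log(2.0) ** (1.0 / alpha_samples)
    return float(np.mean(medians > threshold))


# Stand-in draws for illustration only; replace with the traces of a
# fitted model, e.g. mcmc.trace("alpha")[:] and mcmc.trace("beta")[:].
rng = np.random.default_rng(1)
alpha_samples = rng.normal(1.5, 0.05, size=5000)
beta_samples = rng.normal(18.0, 0.5, size=5000)

p = prob_median_exceeds(alpha_samples, beta_samples, 14.0)
print(p)
```

A value of `p` close to 1 would indicate strong posterior support for a median above 14, and a value close to 0 the opposite.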
In [ ]:
#### Different distributions
In [ ]:
#### Hypothesis testing
If we want to look at covariates, we need a new approach. We'll use Cox proportional hazards. More information is available at https://jakecoltman.gitlab.io/website/post/cox_proportional_hazard/
To fit in Python we use the module lifelines:
In [3]:
### Fit a Cox proportional hazards model
Once we've fit the data, we need to do something useful with it. Try to do the following things:
1 - Plot the baseline survival function
2 - Predict the functions for a particular set of features
3 - Plot the survival function for two different sets of features
4 - For your results in part 3 calculate how much more likely a death event is for one than the other for a given period of time
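For the last task, the proportional hazards assumption gives a direct comparison: the hazard ratio between two feature sets is exp(β·(x_a − x_b)), and the survival curve for features x is S0(t) raised to the power exp(β·x). The sketch below uses hypothetical coefficients and a hypothetical baseline survival value, with S0 taken as the baseline for all covariates equal to zero. lifelines reports its baseline at the mean covariate values, so with a fitted model it is safer to read the two survival values from `predict_survival_function` instead.

```python
import numpy as np

# Hypothetical coefficients for (age, male); not fitted values.
coefs = np.array([0.02, 0.5])
x_a = np.array([30.0, 1.0])
x_b = np.array([30.0, 0.0])

# Hypothetical baseline survival at the time of interest (covariates = 0).
s0_t = 0.8


def survival_at(s0_t, coefs, x):
    """S(t | x) = S0(t) ** exp(coefs . x) under proportional hazards."""
    return s0_t ** np.exp(np.dot(coefs, x))


# Constant-in-time ratio of instantaneous event rates.
hazard_ratio = float(np.exp(np.dot(coefs, x_a - x_b)))

# Ratio of the probabilities of an event having happened by time t.
p_a = 1.0 - survival_at(s0_t, coefs, x_a)
p_b = 1.0 - survival_at(s0_t, coefs, x_b)
event_prob_ratio = float(p_a / p_b)

print(hazard_ratio, event_prob_ratio)
```

The two ratios answer slightly different questions: the hazard ratio compares instantaneous risk, while the event probability ratio compares the chance of an event within the chosen period and moves toward 1 as the period lengthens.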
In [4]:
#### Plot baseline survival function
In [ ]:
#### Predict
In [ ]:
#### Plot survival functions for different covariates
In [ ]:
#### Plot some odds
Model selection
Difficult to do with classic tools (here)
Problems:
1 - Calculate the BMA coefficient values
2 - Try running with different priors
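The idea behind a BMA coefficient can be sketched independently of any package: each candidate model gets a posterior weight, and the averaged coefficient is the weighted sum across models, counting a coefficient as zero where the variable is excluded. The block below approximates the weights from BIC values and per-variable prior inclusion probabilities. That approximation, the helper `bma_coefficients`, and all numbers are assumptions for illustration and are not the pyBMA API.

```python
import numpy as np


def bma_coefficients(coefs, bics, included, inclusion_priors):
    """Return (averaged coefficients, model weights)."""
    coefs = np.asarray(coefs, dtype=float)      # models x variables
    bics = np.asarray(bics, dtype=float)
    included = np.asarray(included, dtype=bool)
    p = np.asarray(inclusion_priors, dtype=float)
    # Model prior: product of p for included and 1 - p for excluded variables.
    prior = np.prod(np.where(included, p, 1.0 - p), axis=1)
    # BIC approximation to the marginal likelihood, shifted for stability.
    weights = prior * np.exp(-0.5 * (bics - bics.min()))
    weights = weights / weights.sum()
    return weights @ coefs, weights


# Hypothetical fits of three candidate models over (age, male).
included = [[1, 0], [0, 1], [1, 1]]
coefs = [[0.03, 0.0], [0.0, 0.45], [0.02, 0.50]]
bics = [1012.0, 1005.0, 1003.0]

bma, weights = bma_coefficients(coefs, bics, included, [0.5, 0.5])
print(bma, weights)
```

Changing the last argument, for example to `[0.1, 0.9]`, shows how different priors shift the model weights and therefore the averaged coefficients.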
In [ ]:
#### BMA Coefficient values
In [ ]:
#### Different priors