

In [ ]:

    
import lifelines
import pymc as pm
from pyBMA.CoxPHFitter import CoxPHFitter
import matplotlib.pyplot as plt
import numpy as np
from math import log
from datetime import datetime
import pandas as pd
%matplotlib inline

The first step in any data analysis is acquiring and munging the data.

An example data set can be found at: https://jakecoltman.gitlab.io/website/post/pydata/

Download the file output.txt and transform it into the format below. The event column should be 0 if an id has only one entry and 1 if it has two:

End date = datetime.datetime(2016, 5, 3, 20, 36, 8, 92165)

id,time_to_convert,age,male,event,search,brand
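A minimal munging sketch follows. It assumes a hypothetical raw layout (one row per logged entry, with a timestamp column next to the covariates), treats the end date above as the censoring time for ids with a single entry, and measures time in days. None of these details is confirmed by output.txt, so adjust the column names and units to the real file.

```python
from datetime import datetime

import pandas as pd

END_DATE = datetime(2016, 5, 3, 20, 36, 8, 92165)

# Hypothetical raw layout: one row per logged entry (assumption, not the real file).
raw = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "timestamp": pd.to_datetime([
        "2016-04-01 10:00:00", "2016-04-05 10:00:00",
        "2016-04-20 12:00:00",
        "2016-04-10 08:00:00", "2016-04-10 20:00:00",
    ]),
    "age": [34, 34, 51, 27, 27],
    "male": [1, 1, 0, 1, 1],
    "search": [0, 0, 1, 1, 1],
    "brand": [1, 1, 0, 0, 0],
})


def munge(raw, end_date=END_DATE):
    """Collapse the entry log to one row per id with a duration and an event flag."""
    grouped = raw.sort_values("timestamp").groupby("id")
    out = grouped[["age", "male", "search", "brand"]].first()
    first_seen = grouped["timestamp"].min()
    last_seen = grouped["timestamp"].max()
    # Two entries means the id converted; one entry means it is censored.
    out["event"] = (grouped["timestamp"].size() >= 2).astype(int)
    converted_days = (last_seen - first_seen).dt.total_seconds() / 86400.0
    censored_days = (pd.Timestamp(end_date) - first_seen).dt.total_seconds() / 86400.0
    # Converted ids stop at their second entry, censored ids at the end date.
    out["time_to_convert"] = converted_days.where(out["event"] == 1, censored_days)
    out = out.reset_index()
    return out[["id", "time_to_convert", "age", "male", "event", "search", "brand"]]


df = munge(raw)
print(df)
```

Keeping the censored rows, rather than dropping them, is what lets the later models use the information that those ids had not converted by the end date.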



In [ ]:

    
####Data munging here



In [ ]:

    
###Parametric Bayes
#Shout out to Cam Davidson-Pilon



In [ ]:

    
## Example fully worked model using toy data
## Adapted from http://blog.yhat.com/posts/estimating-user-lifetimes-with-pymc.html

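# Uniform priors on the Weibull shape (alpha) and scale (beta)
alpha = pm.Uniform("alpha", 0, 20)
beta = pm.Uniform("beta", 0, 20)
# Observed durations; df must already hold a time_to_convert column
obs = pm.Weibull("obs", alpha, beta, value=df["time_to_convert"], observed=True)

@pm.potential
def censorfactor(obs=obs):
    # Heavily penalise any observation above the cutoff of 23
    if np.any(obs > 23):
        return -100000
    else:
        return 0

mcmc = pm.MCMC([alpha, beta, obs, censorfactor])
mcmc.sample(5000, burn=0)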



In [ ]:

    
pm.Matplot.plot(mcmc)
mcmc.trace("alpha")[:]

Problems:

1 - Try to fit your data from section 1 
2 - Use the results to plot the distribution of the median

Note that the median of a Weibull distribution is: $$\beta (\log 2)^{1/\alpha}$$
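As a sketch of problem 2, the formula can be applied sample by sample to the posterior draws, which gives a posterior distribution for the median. The arrays below are random stand-ins; with a fitted model they would be replaced by mcmc.trace("alpha")[:] and mcmc.trace("beta")[:]. The helper name weibull_median is hypothetical.

```python
import numpy as np


def weibull_median(alpha, beta):
    """Median of a Weibull with shape alpha and scale beta."""
    return beta * np.log(2.0) ** (1.0 / alpha)


# Stand-in posterior draws (assumption: replace with the real MCMC traces).
rng = np.random.default_rng(0)
alpha_samples = rng.uniform(1.4, 1.6, size=5000)
beta_samples = rng.uniform(9.0, 11.0, size=5000)

# One median per posterior draw.
median_samples = weibull_median(alpha_samples, beta_samples)
lower, upper = np.percentile(median_samples, [2.5, 97.5])
print("posterior mean of the median:", median_samples.mean())
print("95% credible interval:", (lower, upper))
```

Passing median_samples to plt.hist (for example with bins=50) then draws the distribution.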



In [1]:

    
#### Fit to your data here



In [ ]:

    
#### Plot the distribution of the median

Problems:

4 - Try adjusting the number of samples for burning and thinning
5 - Try adjusting the prior and see how it affects the estimate
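What burn and thin do to a trace can be sketched without PyMC: burning drops the first samples, which may still reflect the starting point, and thinning keeps every k-th sample to reduce autocorrelation. The chain below is a simulated AR(1) series standing in for a real trace, and the helper names are hypothetical.

```python
import numpy as np


def burn_and_thin(chain, burn, thin):
    """Drop the first `burn` samples, then keep every `thin`-th one."""
    return np.asarray(chain)[burn::thin]


def lag1_autocorr(x):
    """Lag-1 autocorrelation of a one-dimensional series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))


# Simulated, strongly autocorrelated chain that starts far from its long-run mean of 0.
rng = np.random.default_rng(1)
n = 20000
chain = np.empty(n)
chain[0] = 50.0
for i in range(1, n):
    chain[i] = 0.9 * chain[i - 1] + rng.normal()

kept = burn_and_thin(chain, burn=1000, thin=10)
print("samples kept:", len(kept))
print("lag-1 autocorrelation after burn only:", lag1_autocorr(chain[1000:]))
print("lag-1 autocorrelation after burn and thin:", lag1_autocorr(kept))
```

With the model above, the corresponding call would be mcmc.sample(5000, burn=1000, thin=10); the specific values are only illustrative.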



In [ ]:

    
#### Adjust burn and thin, both parameters of the mcmc sample function



In [2]:

    
#### Narrow and broaden prior

Problems:

6 - Try using different distributions
7 - Bonus - try testing whether the median is greater than 14
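For the bonus problem, one simple Bayesian approach is to report the share of posterior draws in which the median exceeds 14, which estimates the posterior probability of that hypothesis. The sketch below uses random stand-in draws rather than real traces, and prob_greater is a hypothetical helper.

```python
import numpy as np


def prob_greater(samples, threshold):
    """Fraction of posterior draws that exceed the threshold."""
    return float(np.mean(np.asarray(samples) > threshold))


# Stand-in posterior draws (assumption: replace with the real MCMC traces).
rng = np.random.default_rng(2)
alpha_samples = rng.normal(1.5, 0.05, size=5000)
beta_samples = rng.normal(18.0, 1.0, size=5000)

# Median of a Weibull with shape alpha and scale beta, one value per draw.
median_samples = beta_samples * np.log(2.0) ** (1.0 / alpha_samples)

p = prob_greater(median_samples, 14)
print("P(median > 14 | data) is approximately", p)
```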



In [ ]:

    
#### Different distributions



In [ ]:

    
#### Hypothesis testing

If we want to look at covariates, we need a new approach. We'll use Cox proportional hazards. More information is available at https://jakecoltman.gitlab.io/website/post/cox_proportional_hazard/

To fit the model in Python we use the lifelines module:

http://lifelines.readthedocs.io/en/latest/



In [3]:

    
### Fit a Cox proportional hazards model

Once we've fit the model, we need to do something useful with it. Try to do the following things:

1 - Plot the baseline survival function

2 - Predict the survival function for a particular set of features

3 - Plot the survival function for two different sets of features

4 - For your results in part 3, calculate how much more likely a death event is for one than the other over a given period of time
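The arithmetic behind these steps can be sketched with NumPy alone. Under proportional hazards, the survival function for features x is the baseline survival raised to the power exp(coefficients · x). The coefficients and baseline curve below are hypothetical placeholders, not fitted values.

```python
import numpy as np

# Hypothetical coefficients and baseline survival curve (assumptions for illustration).
coefs = {"age": 0.01, "male": -0.2, "search": 0.8, "brand": 0.3}
times = np.arange(0, 31)
baseline_survival = np.exp(-0.03 * times)


def partial_hazard(features):
    """exp(sum of coefficient * feature value): the hazard relative to baseline."""
    return float(np.exp(sum(coefs[name] * value for name, value in features.items())))


def survival_curve(features):
    """Proportional hazards: S(t | x) = S0(t) ** partial_hazard(x)."""
    return baseline_survival ** partial_hazard(features)


a = {"age": 30, "male": 1, "search": 1, "brand": 0}
b = {"age": 30, "male": 1, "search": 0, "brand": 0}

# Instantaneous comparison: constant over time under this model.
hazard_ratio = partial_hazard(a) / partial_hazard(b)

# Comparison over a period: probability of an event within the first 14 time units.
t = 14
risk_ratio = (1 - survival_curve(a)[t]) / (1 - survival_curve(b)[t])
print("hazard ratio:", hazard_ratio)
print("event probability ratio by t = 14:", risk_ratio)
```

In recent lifelines releases the fitted equivalents are available as cph.baseline_survival_ and cph.predict_survival_function(X), both of which return data frames that can be plotted directly.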



In [4]:

    
#### Plot baseline survival function



In [ ]:

    
#### Predict



In [ ]:

    
#### Plot survival functions for different covariates



In [ ]:

    
#### Plot some odds

Model selection

Difficult to do with classic tools (here)

Problems:

1 - Calculate the BMA coefficient values

2 - Try running with different priors
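To illustrate what a BMA coefficient is, the sketch below averages coefficients across candidate models, weighting each model by an approximate posterior probability proportional to prior × exp(-BIC / 2). This is the generic BIC approximation written in NumPy, not the pyBMA API, and the BIC values, coefficients, and model priors are all made up.

```python
import numpy as np

# Hypothetical candidate models (assumption: values invented for illustration).
models = [
    {"bic": 1012.4, "coefs": {"search": 0.80}},
    {"bic": 1010.1, "coefs": {"search": 0.78, "brand": 0.25}},
    {"bic": 1015.9, "coefs": {"search": 0.79, "brand": 0.24, "male": -0.05}},
]


def bma_coefficients(models, prior=None):
    """Return posterior model weights and the weighted-average coefficients."""
    bic = np.array([m["bic"] for m in models], dtype=float)
    if prior is None:
        prior = np.full(len(models), 1.0 / len(models))
    prior = np.asarray(prior, dtype=float)
    # Work in log space and subtract the maximum for numerical stability.
    log_w = -0.5 * (bic - bic.min()) + np.log(prior)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    names = sorted({name for m in models for name in m["coefs"]})
    # A variable missing from a model contributes a coefficient of 0.
    averaged = {
        name: float(sum(wi * m["coefs"].get(name, 0.0) for wi, m in zip(w, models)))
        for name in names
    }
    return w, averaged


w_uniform, coef_uniform = bma_coefficients(models)
w_sparse, coef_sparse = bma_coefficients(models, prior=[0.8, 0.1, 0.1])
print(w_uniform, coef_uniform)
print(w_sparse, coef_sparse)
```

A prior that favours the smallest model shifts weight towards it and shrinks the averaged coefficients of variables that model excludes.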



In [ ]:

    
#### BMA Coefficient values



In [ ]:

    
#### Different priors