KCBO is a toolkit for anyone who wants to do Bayesian data analysis without worrying about the implementation details of a particular test.
Currently KCBO is very much alpha/pre-alpha software and implements only three tests.
KCBO is available through PyPI and on GitHub. The following commands will install KCBO:
pip install kcbo
git clone https://github.com/HHammond/kcbo
cd kcbo
python setup.py sdist install
If any of this fails, you may need to install NumPy first (pip install numpy) so that KCBO's other dependencies can build, then retry the installation.
There are currently three tests implemented in the KCBO library:
Lognormal difference of medians: compares the medians of log-normally distributed data with equal variances.
Bayesian t-Test: an implementation of Kruschke's t-Test.
Conversion Test: a test of conversion rates using the Beta-Binomial model, popular in A/B testing and estimation.
In [1]:
from kcbo import lognormal_comparison_test, t_test, conversion_test
In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
Note: Because this test runs a Monte Carlo simulation on the conjugate prior of the lognormal distribution, it assumes that both distributions have the same variance.
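The median comparison works because a lognormal's median depends only on the log-scale mean: if log(x) ~ Normal(mu, sigma²), then the median of x is exp(mu). A quick NumPy sanity check of that relationship (illustrative only, not KCBO's internals):

```python
import numpy as np

# For lognormal data, the median equals exp(mu), where mu is the mean of log(x).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=3.0, sigma=1.0, size=200_000)

mu_hat = np.log(x).mean()           # estimate mu on the log scale
print(np.median(x), np.exp(mu_hat))  # both close to exp(3) ≈ 20.09
```

This is why comparing medians of same-variance lognormal groups reduces to comparing the posterior of mu for each group.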
In [3]:
# Generate some data
g1d = np.random.lognormal(mean=3, sigma=1, size=10000)
g1l = ['A'] * g1d.shape[0]
g2d = np.random.lognormal(mean=3.03, sigma=1, size=10000)
g2l = ['B'] * g2d.shape[0]
g1 = pd.DataFrame(data=g1d, columns=['value'])
g1['group'] = g1l
g2 = pd.DataFrame(data=g2d, columns=['value'])
g2['group'] = g2l
lognormal_data = pd.concat([g1, g2])
In [4]:
summary, data = lognormal_comparison_test(lognormal_data, samples=100000)
In [5]:
print(summary)
In [6]:
A,B = data['A']['median'], data['B']['median']
diff = data[('A','B')]['diff_medians']
f, axes = plt.subplots(1,2, figsize=(12, 7))
sns.set(style="white", palette="muted")
sns.despine(left=True)
sns.distplot(A, ax=axes[0], label='Median Estimate Density for A')
sns.distplot(B, ax=axes[0], label='Median Estimate Density for B')
sns.distplot(diff, ax=axes[1], label='Difference of Densities (B-A)')
axes[0].legend()
axes[1].legend()
plt.show()
Kruschke's Bayesian t-Test. Since the implementation of this test uses PyMC2's MCMC sampler, it can take a while to converge.
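The core idea can be sketched without PyMC: draw from each group's posterior mean and look at the distribution of the difference. This toy version uses a normal approximation with a flat prior; Kruschke's actual model is richer (Student-t likelihood with priors on the normality parameter and the standard deviations), so treat this only as an illustration of the question being asked.

```python
import numpy as np

rng = np.random.default_rng(2)
g1 = rng.normal(15.0, 2.0, 140)
g2 = rng.normal(15.7, 2.0, 200)

# Approximate posterior of each group mean: Normal(sample mean, standard error).
m1 = rng.normal(g1.mean(), g1.std(ddof=1) / np.sqrt(len(g1)), size=50_000)
m2 = rng.normal(g2.mean(), g2.std(ddof=1) / np.sqrt(len(g2)), size=50_000)

p = (m2 > m1).mean()  # posterior probability that group 2's mean is larger
print(p)
```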
In [7]:
n1,n2 = (140,200)
group1 = np.random.normal(15,2,n1)
group2 = np.random.normal(15.7,2,n2)
A = list(zip(['A'] * n1, group1))
B = list(zip(['B'] * n2, group2))
df = pd.concat([pd.DataFrame(A), pd.DataFrame(B)])
df.columns = ['group', 'value']
df.head()
Out[7]:
In [8]:
description, data = t_test(df,groupcol='group',valuecol='value', samples=60000, progress_bar=True)
In [9]:
print(description)
In [10]:
diff = data[('A', 'B')]['diff_means']
f, axes = plt.subplots(1,1, figsize=(12, 7))
sns.despine(left=True)
sns.distplot(diff, label='Difference of Means').legend()
plt.show()
A common A/B test compares the conversion rates of two features. Here we take the number of successes and the total number of trials for each group and apply the beta-binomial model.
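Under a beta-binomial model with a uniform Beta(1, 1) prior (an assumption here; KCBO's internal prior choice may differ), the posterior for each group's conversion rate is Beta(1 + successes, 1 + failures), and P(B converts better than A) can be estimated by sampling both posteriors:

```python
import numpy as np

rng = np.random.default_rng(1)

# Posterior: Beta(1 + successes, 1 + failures), assuming a Beta(1, 1) prior.
a_post = rng.beta(1 + 5000, 1 + (10000 - 5000), size=100_000)
b_post = rng.beta(1 + 4090, 1 + (8000 - 4090), size=100_000)

prob_b_better = (b_post > a_post).mean()  # estimate of P(rate_B > rate_A)
print(prob_b_better)
```

With the data below (A: 5000/10000, B: 4090/8000), B's observed rate is about 51.1% versus A's 50.0%, so this probability comes out well above one half.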
In [11]:
A = {'group':'A', 'trials': 10000, 'successes':5000}
B = {'group':'B', 'trials': 8000, 'successes':4090}
df = pd.DataFrame([A,B])
df
Out[11]:
In [12]:
summary, data = conversion_test(df, groupcol='group',successcol='successes',totalcol='trials')
In [13]:
print(summary)
In [14]:
A = data['A']['distribution']
B = data['B']['distribution']
diff = data[('A','B')]['distribution']
f, axes = plt.subplots(1,2, figsize=(12, 7))
sns.set(style="white", palette="muted")
sns.despine(left=True)
sns.distplot(A, ax=axes[0], label='Density Estimate for A')
sns.distplot(B, ax=axes[0], label='Density Estimate for B')
sns.distplot(diff, ax=axes[1], label='Difference of Densities (B-A)')
axes[0].legend()
axes[1].legend()
plt.show()