KCBO

Introducing a new Bayesian Data Analysis Toolkit

KCBO is a toolkit for anyone who wants to do Bayesian data analysis without worrying about the implementation details of each test.

Currently KCBO is very much alpha/pre-alpha software and implements only three tests. A list of future objectives for the project is available on its GitHub page.

Installation

KCBO is available through PyPI and on GitHub. The following commands will install KCBO:

From PyPI:

pip install kcbo

From Source:

git clone https://github.com/HHammond/kcbo
cd kcbo
python setup.py sdist install

If installation fails, you may need to install numpy first (pip install numpy) so that KCBO's other dependencies can build, then retry the installation.

Usage

There are currently three tests implemented in the KCBO library, all sharing the calling convention sketched after this list:

  • Lognormal difference of medians: compares the medians of lognormally distributed data that share the same variance.

  • Bayesian t-Test: an implementation of Kruschke's t-Test.

  • Conversion test: compares conversion (success) rates using the Beta-Binomial model, popular in A/B testing and estimation.
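
All three tests return a printable summary together with a dictionary of posterior samples. A minimal sketch of the shared interface, using the t-test exactly as it appears in the full examples below (df holds one observation per row plus a group label):

summary, data = t_test(df, groupcol='group', valuecol='value', samples=60000, progress_bar=True)
print(summary)                          # formatted summary tables
diff = data[('A', 'B')]['diff_means']   # posterior samples for the A vs. B comparison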

Examples:


In [1]:
from kcbo import lognormal_comparison_test, t_test, conversion_test

In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

Lognormal Difference of Medians Test

Note: Because this test uses Monte Carlo simulation with the lognormal distribution's conjugate prior, it assumes that both distributions have the same variance.


In [3]:
# Generate some data
g1d = np.random.lognormal(mean=3, sigma=1, size=10000)
g1l = ['A'] * g1d.shape[0]
g2d = np.random.lognormal(mean=3.03, sigma=1, size=10000)
g2l = ['B'] * g2d.shape[0]

g1 = pd.DataFrame(data=g1d, columns=['value'])
g1['group'] = g1l
g2 = pd.DataFrame(data=g2d, columns=['value'])
g2['group'] = g2l

lognormal_data = pd.concat([g1, g2])

In [4]:
summary, data = lognormal_comparison_test(lognormal_data, samples=100000)

In [5]:
print(summary)


                        Lognormal Median Comparison Test                        

Groups: A, B

Estimates:

| Group   |   Median |   95% CI Lower |   95% CI Upper |      Mu |   95% CI Lower |   95% CI Upper |
|:--------|---------:|---------------:|---------------:|--------:|---------------:|---------------:|
| A       |  19.7915 |        19.3976 |        20.1921 | 2.9852  |        2.96515 |        3.00529 |
| B       |  20.3804 |        19.9753 |        20.7924 | 3.01452 |        2.9945  |        3.03459 |

Comparisons:

| Hypothesis   |   Difference of Medians |   P.Value |   95% CI Lower |   95% CI Upper |
|:-------------|------------------------:|----------:|---------------:|---------------:|
| A < B        |                 0.58924 |   0.97753 |      0.0129132 |        1.15935 |
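
A quick sanity check on the estimates: for a lognormal distribution the median equals exp(mu), so the Median and Mu columns above should agree.

# The lognormal median is exp(mu); compare against the table above.
print(np.exp(2.9852))   # ~19.79, the reported median for group A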


In [6]:
A,B = data['A']['median'], data['B']['median']
diff = data[('A','B')]['diff_medians']

f, axes = plt.subplots(1,2, figsize=(12, 7))
sns.set(style="white", palette="muted")
sns.despine(left=True)

sns.distplot(A, ax=axes[0], label='Median Estimate Density for A')
sns.distplot(B, ax=axes[0], label='Median Estimate Density for B')
sns.distplot(diff, ax=axes[1], label='Difference of Medians (B-A)')

axes[0].legend()
axes[1].legend()

plt.show()
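
Note that the P.Value column in the summary appears to be the posterior probability of the stated hypothesis rather than a frequentist p-value (this interpretation is an assumption, not something the output states explicitly); it can be checked against the posterior samples directly:

# Fraction of posterior samples in which B's median exceeds A's median;
# this should be close to the P.Value reported for "A < B" above.
print(np.mean(np.asarray(diff) > 0))   # ~0.98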


Bayesian t-Test

An implementation of Kruschke's Bayesian t-Test. Since this implementation uses PyMC2 and MCMC sampling, it can take a while for the sampler to converge.


In [7]:
n1,n2 = (140,200)

group1 = np.random.normal(15,2,n1)
group2 = np.random.normal(15.7,2,n2)

A = zip(['A']*n1, group1)
B = zip(['B']*n2, group2)

df = pd.concat([pd.DataFrame(A), pd.DataFrame(B)])
df.columns = 'group','value'
df.head()


Out[7]:
  group      value
0     A  12.755334
1     A  17.371657
2     A  15.678301
3     A  13.358686
4     A  10.424609

In [8]:
description, data = t_test(df, groupcol='group', valuecol='value', samples=60000, progress_bar=True)


 [-----------------100%-----------------] 60000 of 60000 complete in 39.3 sec

In [9]:
print(description)


                                Bayesian t-Test                                 

 

| Hypothesis   |   Difference of Means |   P.Value |   95% CI Lower |   95% CI Upper |
|:-------------|----------------------:|----------:|---------------:|---------------:|
| A < B        |               1.02472 |         1 |        0.60196 |        1.44501 |

| Hypothesis   |   Difference of S.Dev |   P.Value |   95% CI Lower |   95% CI Upper |
|:-------------|----------------------:|----------:|---------------:|---------------:|
| A < B        |             0.0458748 |         1 |       -0.25691 |       0.339418 |

| Hypothesis   |   Effect Size |   P.Value |   95% CI Lower |   95% CI Upper |
|:-------------|--------------:|----------:|---------------:|---------------:|
| A < B        |      0.546456 |         1 |          0.317 |       0.775599 |


In [10]:
diff = data[('A', 'B')]['diff_means']

f, axes = plt.subplots(1,1, figsize=(12, 7))
sns.despine(left=True)
sns.distplot(diff, label='Difference of Means').legend()
plt.show()


Beta-Binomial Conversion Rate Test

A common task in A/B testing is comparing the conversion rates of two features. Here we take the number of successes and the total number of trials for each group and use the Beta-Binomial model.


In [11]:
A = {'group':'A', 'trials': 10000, 'successes':5000}
B = {'group':'B', 'trials': 8000, 'successes':4090}

df = pd.DataFrame([A,B])
df


Out[11]:
  group  successes  trials
0     A       5000   10000
1     B       4090    8000

In [12]:
summary, data = conversion_test(df, groupcol='group', successcol='successes', totalcol='trials')

In [13]:
print(summary)


                       Beta-Binomial Conversion Rate Test                       

Groups: A, B

Estimates:

| Group   |   Estimate |   95% CI Lower |   95% CI Upper |
|:--------|-----------:|---------------:|---------------:|
| A       |   0.500016 |       0.490261 |       0.509824 |
| B       |   0.511272 |       0.500279 |       0.522188 |

Comparisons:

| Hypothesis   |   Difference |   P.Value |   95% CI Lower |   95% CI Upper |
|:-------------|-------------:|----------:|---------------:|---------------:|
| A < B        |    0.0112787 |   0.93366 |    -0.00343106 |      0.0259026 |
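
The per-group estimates can be checked against the Beta-Binomial model's closed-form posterior: with a Beta(alpha, beta) prior, observing s successes in n trials gives a Beta(alpha + s, beta + n - s) posterior over the conversion rate. A quick sketch with scipy, assuming a uniform Beta(1, 1) prior (KCBO's actual default prior may differ):

from scipy import stats

# Posterior for group A under a uniform Beta(1, 1) prior:
# Beta(1 + successes, 1 + trials - successes)
posterior_A = stats.beta(1 + 5000, 1 + 10000 - 5000)
print(posterior_A.mean())                # ~0.5000
print(posterior_A.ppf([0.025, 0.975]))   # ~[0.490, 0.510], matching the table above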


In [14]:
A = data['A']['distribution']
B = data['B']['distribution']
diff = data[('A','B')]['distribution']

f, axes = plt.subplots(1,2, figsize=(12, 7))
sns.set(style="white", palette="muted")
sns.despine(left=True)

sns.distplot(A, ax=axes[0], label='Density Estimate for A')
sns.distplot(B, ax=axes[0], label='Density Estimate for B')
sns.distplot(diff, ax=axes[1], label='Difference of Conversion Rates (B-A)')

axes[0].legend()
axes[1].legend()

plt.show()


