Detailed A/B Test Experiment

Problem

  • There are 2 options for the landing page:
    • "start free trial"
      • If the visitor clicks this option, they will be asked for credit card info, and after 14 days they will be charged automatically.
    • "access course materials"
      • If the visitor clicks this option, they can watch videos and take the quiz for free, but won't get coaching or a certificate.
  • The goal of this A/B test is to see which option maximizes course completion. This is hard to guess in advance: the free option may draw more clicks but students may be less likely to finish, while the paid trial may draw fewer visitors but those who enroll may be more likely to finish.

Hypothesis

  • H0: P_control - P_experiment = 0
    • P_control is the conversion rate of control group, P_experiment is the conversion rate of experiment group.
  • H1: P_control - P_experiment = d
    • d is the detectable effect
  • When the results appear to be significant, we reject H0.

Metrics

Invariant Metrics

  • They are mainly used for sanity checks.
  • Pick metrics that are not expected to change, and make sure they do not differ dramatically between the control and experiment groups.

Evaluation Metrics

  • These metrics are expected to change between the control and the experiment group. They are tied to the business goals.
  • Each metric has a Dmin, indicating the minimum change that is practically significant to the business.

Overall Process

  1. Choose invariant metrics and evaluation metrics.
  2. Estimation of baselines and Sample Size
    • Estimate population metrics baseline
    • Estimate sample metrics baseline
    • Estimate the sample size for each evaluation metric, and keep only the metrics whose required sample size is practical.
  3. Verify the null hypothesis - Control Group vs. Experiment Group
    • Sanity check with invariant metrics
    • Differences in evaluation metrics, using confidence intervals and Dmin to check both statistical and practical significance.
    • Differences in trend, using the p-value of a sign test to check statistical significance.

In [119]:
import pandas as pd
import math
import numpy as np
from scipy.stats import norm

Data Overview

  • Each pageview is a unique cookie
  • Control group: "access course materials" option
  • Experiment group: "start free trial" option

In [6]:
# Control group
control_df = pd.read_csv('ab_control.csv')
control_df.head()


Out[6]:
          Date  Pageviews  Clicks  Enrollments  Payments
0  Sat, Oct 11       7723     687        134.0      70.0
1  Sun, Oct 12       9102     779        147.0      70.0
2  Mon, Oct 13      10511     909        167.0      95.0
3  Tue, Oct 14       9871     836        156.0     105.0
4  Wed, Oct 15      10014     837        163.0      64.0

In [9]:
# Experiment group
experiment_df = pd.read_csv('ab_experiment.csv')
experiment_df.head()


Out[9]:
          Date  Pageviews  Clicks  Enrollments  Payments
0  Sat, Oct 11       7716     686        105.0      34.0
1  Sun, Oct 12       9288     785        116.0      91.0
2  Mon, Oct 13      10480     884        145.0      79.0
3  Tue, Oct 14       9867     827        138.0      92.0
4  Wed, Oct 15       9793     832        140.0      94.0

In [10]:
control_df.describe()


Out[10]:
          Pageviews      Clicks  Enrollments    Payments
count     37.000000   37.000000    23.000000   23.000000
mean    9339.000000  766.972973   164.565217   88.391304
std      740.239563   68.286767    29.977000   20.650202
min     7434.000000  632.000000   110.000000   56.000000
25%     8896.000000  708.000000   146.500000   70.000000
50%     9420.000000  759.000000   162.000000   91.000000
75%     9871.000000  825.000000   175.000000  102.500000
max    10667.000000  909.000000   233.000000  128.000000

In [11]:
experiment_df.describe()


Out[11]:
          Pageviews      Clicks  Enrollments    Payments
count     37.000000   37.000000    23.000000   23.000000
mean    9315.135135  765.540541   148.826087   84.565217
std      708.070781   64.578374    33.234227   23.060841
min     7664.000000  642.000000    94.000000   34.000000
25%     8881.000000  722.000000   127.000000   69.000000
50%     9359.000000  770.000000   142.000000   91.000000
75%     9737.000000  827.000000   172.000000   99.000000
max    10551.000000  884.000000   213.000000  123.000000

Step 1 - Choose Metrics

Invariant Metrics

  • Ck = unique daily cookies (pageviews) on a page
    • Dmin = 3000
  • Cl = unique daily clicks on the free trial button
    • Dmin = 240
  • CTP = Cl/Ck, free trial button click through probability
    • Dmin = 0.01

Evaluation Metrics

  • GConversion = enrolled/Cl
    • Gross conversion
    • Dmin = 0.01
    • Daily probability of enrolling, among users who clicked the free trial button
  • Retention = paid/enrolled
    • Dmin = 0.01
    • Daily probability of paying, among users who enrolled
  • NConversion = paid/Cl
    • Net conversion
    • Dmin = 0.0075
    • Daily probability of paying, among users who clicked the free trial button
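
To make these definitions concrete, here is a minimal sketch (an illustration only; the helper name add_daily_metrics is hypothetical, and the column names are those shown in the data overview above) that computes the daily metrics for one group:

# Illustrative helper: compute the daily invariant and evaluation metrics for one group
def add_daily_metrics(df):
    out = df.copy()
    out['CTP'] = out['Clicks'] / out['Pageviews']            # free trial click-through probability
    out['GConversion'] = out['Enrollments'] / out['Clicks']  # gross conversion
    out['Retention'] = out['Payments'] / out['Enrollments']  # retention
    out['NConversion'] = out['Payments'] / out['Clicks']     # net conversion
    return out

add_daily_metrics(control_df).head()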

Step 2.1 - Estimate Metrics Baseline Values

  • The baseline values are the values of these metrics before the change.
  • These values are provided by the data provider Udacity as rough estimates.

In [28]:
baseline = {'Cookies': 40000, 'Clicks': 3200, 'Enrollments': 660, 'CTP': 0.08,
           'GConversion': 0.20625, 'Retention': 0.53, 'NConversion': 0.109313}

Step 2.2 - Estimate Standard Deviation & Sample Size

Estimate Standard Deviation of Metrics

  • This is used later for estimating the sample size and confidence intervals.
  • The more variable a metric is, the harder it is to reach a significant result.
  • To estimate the variance, assume each metric's probability `p_hat` follows a binomial distribution; the standard deviation of the proportion is then:
    • std = sqrt(p_hat*(1-p_hat)/n)
      • p_hat: baseline probability of the event occurring
      • n: sample size
    • The binomial assumption fits because each unit has exactly two outcomes: the event either happens (option A) or it doesn't (option B).
  • This assumption is only valid when the unit of diversion of the experiment equals the unit of analysis (the denominator of the metric formula). If it does not hold, the analytical std can be off, and it is better to estimate the variability empirically; see the sketch below for a simple day-level illustration.
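
For a quick empirical illustration of the variability (a sketch only, using the day-level data loaded above rather than the cookie-level unit of diversion), one can look at the spread of the daily metric values:

# Illustration only (not used in the rest of this analysis):
# spread of the daily gross conversion values in the control data
daily_gc = (control_df['Enrollments'] / control_df['Clicks']).dropna()
print(daily_gc.mean(), daily_gc.std())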

In [29]:
sample_baseline = baseline.copy()

sample_baseline['Cookies'] = 5000  # assume sample size is 5000
sample_baseline['Clicks'] = baseline['Clicks'] * 5000/baseline['Cookies']
sample_baseline['Enrollments'] = baseline['Enrollments'] * 5000/baseline['Cookies']

sample_baseline


Out[29]:
{'Cookies': 5000,
 'Clicks': 400.0,
 'Enrollments': 82.5,
 'CTP': 0.08,
 'GConversion': 0.20625,
 'Retention': 0.53,
 'NConversion': 0.109313}

In [33]:
def get_binomial_std(p_hat, n):
    """
    p_hat: baseline probability of the event to occur
    n: sample size
    return: the standard deviation
    """
    std = round(math.sqrt(p_hat*(1-p_hat)/n),4)
    
    return std

Gross Conversion std

  • p_hat = enrolled/Cl
    • Daily probability of enrolling, among users who clicked the free trial button

In [34]:
gross_conversion = {}

gross_conversion['d_min'] = 0.01
gross_conversion['p_hat'] = sample_baseline['GConversion']
gross_conversion['n'] = sample_baseline['Clicks']
gross_conversion['std'] = get_binomial_std(gross_conversion['p_hat'],
                                          gross_conversion['n'])

gross_conversion


Out[34]:
{'d_min': 0.01, 'p_hat': 0.20625, 'n': 400.0, 'std': 0.0202}

Retention std


In [37]:
retention = {}

retention['d_min'] = 0.01
retention['p_hat'] = sample_baseline['Retention']
retention['n'] = sample_baseline['Enrollments']
retention['std'] = get_binomial_std(retention['p_hat'],
                                   retention['n'])

retention


Out[37]:
{'d_min': 0.01, 'p_hat': 0.53, 'n': 82.5, 'std': 0.0549}

Net Conversion std


In [54]:
net_conversion = {}

net_conversion['d_min'] = 0.0075
net_conversion['p_hat'] = sample_baseline['NConversion']
net_conversion['n'] = sample_baseline['Clicks']
net_conversion['std'] = get_binomial_std(net_conversion['p_hat'],
                                        net_conversion['n'])

net_conversion


Out[54]:
{'d_min': 0.0075, 'p_hat': 0.109313, 'n': 400.0, 'std': 0.0156}

Estimate Sample Size

Hypothesis

  • H0: P_control - P_experiment = 0
    • P_control is the conversion rate of control group, P_experiment is the conversion rate of experiment group.
  • H1: P_control - P_experiment = d
    • d is the detectable effect

Sample Size Formula

  • n = pow(Z_{1-α/2}*std1 + Z_{1-β}*std2, 2) / pow(d, 2)

    • Z_{1-α/2} is the Z score at 1-α/2, where α is the probability of a Type I error
    • Z_{1-β} is the Z score at 1-β (the power), where β is the probability of a Type II error
    • std1 = sqrt(2*p*(1-p))
    • std2 = sqrt(p*(1-p) + (p+d)*(1-(p+d)))
      • p is the baseline conversion rate (the p_hat from above)
      • d is the detectable effect (the d_min from above)
  • Online calculator for sample size: https://www.evanmiller.org/ab-testing/sample-size.html

    • Given p, d, α and 1-β

In [42]:
def get_z_score(alpha):
    return norm.ppf(alpha)


def get_stds(p, d):
    std1 = math.sqrt(2*p*(1-p))
    std2 = math.sqrt(p*(1-p) + (p+d)*(1-(p+d)))
    
    std_lst = [std1, std2]
    return std_lst
    
    
def get_sample_size(std_lst, alpha, beta, d):
    n = pow(get_z_score(1-alpha/2)*std_lst[0] + get_z_score(1-beta)*std_lst[1], 2)/pow(d,2)
    return n
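
As a rough cross-check on this formula (an optional addition, not part of the original analysis, and assuming statsmodels is installed), statsmodels' power calculator can solve for the per-group sample size. It works with Cohen's effect size for two proportions, so its answer is close to, but not identical to, the value computed below:

# Cross-check sketch using statsmodels; the function name is illustrative, not from the original notebook
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def approx_sample_size(p, d, alpha=0.05, beta=0.2):
    # Effect size between baseline p and p + d, then solve for the per-group sample size
    effect = proportion_effectsize(p, p + d)
    return NormalIndPower().solve_power(effect_size=effect, alpha=alpha, power=1-beta, ratio=1)

# e.g. approx_sample_size(0.20625, 0.01) lands in the same ballpark as the gross conversion value below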

In [44]:
alpha = 0.05
beta = 0.2

Gross Conversion Sample Size

  • The calculated sample_size is the estimated number of clicks needed in each group
  • To get the estimated page_views (unique cookies), divide sample_size by the click-through rate (400/5000) and multiply by 2
    • Multiplying by 2 covers both the control and experiment groups

In [49]:
gross_conversion['sample_size'] = round(get_sample_size(get_stds(gross_conversion['p_hat'],
                                                          gross_conversion['d_min']), alpha, beta,
                                                 gross_conversion['d_min']))
gross_conversion['page_views'] = 2*round(gross_conversion['sample_size']/(gross_conversion['n']/5000))

gross_conversion


Out[49]:
{'d_min': 0.01,
 'p_hat': 0.20625,
 'n': 400.0,
 'std': 0.0202,
 'sample_size': 25835.0,
 'page_views': 645876.0}

Retention Sample Size

  • The calculated sample_size is the estimated number of enrollments needed in each group
  • To get the estimated page_views (unique cookies), divide sample_size by the enrollment rate (82.5/5000) and multiply by 2
    • Multiplying by 2 covers both the control and experiment groups
  • The required page_views is too large: at 40,000 pageviews a day it would take well over 100 days (about 118) to collect the data, so Retention is dropped as an evaluation metric.

In [52]:
retention['sample_size'] = round(get_sample_size(get_stds(retention['p_hat'],
                                                          retention['d_min']), alpha, beta,
                                                 retention['d_min']))
retention['page_views'] = 2*round(retention['sample_size']/(retention['n']/5000))
retention


Out[52]:
{'d_min': 0.01,
 'p_hat': 0.53,
 'n': 82.5,
 'std': 0.0549,
 'sample_size': 39087.0,
 'page_views': 4737818.0}

Net Conversion Sample Size

  • The calculated sample_size is the estimated number of clicks needed in each group
  • To get the estimated page_views (unique cookies), divide sample_size by the click-through rate (400/5000) and multiply by 2
    • Multiplying by 2 covers both the control and experiment groups
  • Assuming 40,000 page_views per day, it takes roughly two and a half weeks (about 17 days) to collect this many page views.

In [55]:
net_conversion['sample_size'] = round(get_sample_size(get_stds(net_conversion['p_hat'],
                                                         net_conversion['d_min']), alpha, beta,
                                                         net_conversion['d_min']))
net_conversion['page_views'] = 2*round(net_conversion['sample_size']/(net_conversion['n']/5000))

net_conversion


Out[55]:
{'d_min': 0.0075,
 'p_hat': 0.109313,
 'n': 400.0,
 'std': 0.0156,
 'sample_size': 27413.0,
 'page_views': 685324.0}
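
To make the duration estimates above concrete, here is a quick sketch assuming all 40,000 daily pageviews are split between the two groups:

# Rough duration estimate: required pageviews divided by assumed daily traffic
daily_pageviews = 40000
for name, metric in [('Gross Conversion', gross_conversion),
                     ('Retention', retention),
                     ('Net Conversion', net_conversion)]:
    print(name, round(metric['page_views'] / daily_pageviews, 1), 'days')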

Step 3 - Control Group vs. Experiment Group


In [56]:
control_df.head()


Out[56]:
          Date  Pageviews  Clicks  Enrollments  Payments
0  Sat, Oct 11       7723     687        134.0      70.0
1  Sun, Oct 12       9102     779        147.0      70.0
2  Mon, Oct 13      10511     909        167.0      95.0
3  Tue, Oct 14       9871     836        156.0     105.0
4  Wed, Oct 15      10014     837        163.0      64.0

In [57]:
experiment_df.head()


Out[57]:
          Date  Pageviews  Clicks  Enrollments  Payments
0  Sat, Oct 11       7716     686        105.0      34.0
1  Sun, Oct 12       9288     785        116.0      91.0
2  Mon, Oct 13      10480     884        145.0      79.0
3  Tue, Oct 14       9867     827        138.0      92.0
4  Wed, Oct 15       9793     832        140.0      94.0

Step 3.1 - Differences in Invariant Metrics (Sanity Check)

  • The goal is to verify that the experiment was conducted as expected and is not distorted by other factors, and that the data collection is correct.
  • Invariant Metrics
    • Ck = unique daily cookies (pageviews) on a page
    • Cl = unique daily clicks on the free trial button
    • CTP = Cl/Ck, free trial button click through probability
  • We compare the invariant metrics between the two groups to make sure the differences are not significant.

In [60]:
p=0.5
alpha=0.05

In [62]:
def get_std(p, total_size):
    std = math.sqrt(p*(1-p)/total_size)
    return std

def get_marginOferror(std, alpha):
    me = round(get_z_score(1-alpha/2)*std, 4)
    return me

Compare pageviews

  • We want to verify that the difference in pageview counts between the 2 groups is not significant.
  • When the sample size n is large enough, the binomial can be approximated by a normal distribution. We test whether the observed p_hat = control group pageviews / total pageviews of both groups is significantly different from p = 0.5.
    • Margin of Error ME = Z_{1-α/2} * std
      • We calculate ME at the 95% confidence level
    • This gives the Confidence Interval CI = [p_hat - ME, p_hat + ME]
      • If p = 0.5 is within the CI, the difference between the 2 groups is within the expected range

In [61]:
control_pageviews = control_df['Pageviews'].sum()
experiment_pageviews = experiment_df['Pageviews'].sum()

print(control_pageviews, experiment_pageviews)


345543 344660

In [67]:
total_pageviews = control_pageviews + experiment_pageviews
p_hat = control_pageviews/(total_pageviews)
std = get_std(p, total_pageviews)
me = get_marginOferror(std, alpha)

print('If ' + str(p) +' is within [' + str(round(p_hat - me, 4)) + ', ' + str(round(p_hat + me, 4)) + '], then the difference is expected.')


If 0.5 is within [0.4994, 0.5018], then the difference is expected.

Compare clicks

  • Similar to pageviews comparison above.

In [90]:
control_clicks = control_df['Clicks'].sum()
experiment_clicks = experiment_df['Clicks'].sum()

print(control_clicks, experiment_clicks)


28378 28325

In [91]:
total_clicks = control_clicks + experiment_clicks
p_hat = control_clicks/(total_clicks)
std = get_std(p, total_clicks)
me = get_marginOferror(std, alpha)

print('If ' + str(p) +' is within [' + str(round(p_hat - me, 4)) + ', ' + str(round(p_hat + me, 4)) + '], then the difference is expected.')


If 0.5 is within [0.4964, 0.5046], then the difference is expected.

Compare CTP (Click Through Probability)

  • Because CTP is a proportion compared across the two groups, we use the pooled standard deviation to calculate the margin of error.
    • p_pool = (experiment_clicks + control_clicks)/(experiment_pageviews + control_pageviews)
    • std_pool = sqrt(p_pool*(1-p_pool)*(1/experiment_pageviews + 1/control_pageviews))

In [92]:
control_ctp = control_clicks/control_pageviews
experiment_ctp = experiment_clicks/experiment_pageviews
p_pool = (control_clicks + experiment_clicks)/(control_pageviews + experiment_pageviews)
std_pool = math.sqrt(p_pool*(1-p_pool)*(1/experiment_pageviews + 1/control_pageviews))
me = get_marginOferror(std_pool, alpha)

diff = round(experiment_ctp - control_ctp, 4)

print('If ' + str(diff) +' is within [' + str(round(0 - me, 4)) + ', ' + str(round(0 + me, 4)) + '], then the difference is expected.')


If 0.0001 is within [-0.0013, 0.0013], then the difference is expected.
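
As an optional cross-check (not part of the original analysis, and assuming statsmodels is available), a two-proportion z-test on CTP should agree with the interval check above, returning a p-value well above alpha:

# Cross-check sketch: two-proportion z-test on click-through probability
from statsmodels.stats.proportion import proportions_ztest

z_stat, p_val = proportions_ztest(count=[control_clicks, experiment_clicks],
                                  nobs=[control_pageviews, experiment_pageviews])
print(z_stat, p_val)  # a large p-value (> 0.05) means the CTP difference is not significant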

Summary for Sanity Check

  • We have checked all 3 invariant metrics between the 2 groups; none of the differences are significant, so the comparison between these 2 groups is unlikely to be distorted by other factors.

Step 3.2 - Differences in Evaluation Metrics

  • Similar to the sanity check above, here we check the differences in evaluation metrics between the 2 groups, to see:

    • Whether the differences are statistically significant.
    • Whether the differences are practically significant.
      • i.e. whether the changes are big enough to be beneficial to the business.
      • The criterion used here: the change is practically significant if the observed difference falls outside the confidence interval [Dmin - ME, Dmin + ME]
  • As found in Step 2, Gross Conversion and Net Conversion serve as the evaluation metrics, while Retention is dropped because its required sample size is not practical to collect.

  • Evaluation Metrics

    • GConversion = enrolled/Cl
      • Gross conversion
      • Dmin = 0.01
      • Daily probability of enrolling, among users who clicked the free trial button
    • NConversion = paid/Cl
      • Net conversion
      • Dmin = 0.0075
      • Daily probability of paying, among users who clicked the free trial button

In [81]:
print(control_df.isnull().sum())
print()
print(experiment_df.isnull().sum())


Date            0
Pageviews       0
Clicks          0
Enrollments    14
Payments       14
dtype: int64

Date            0
Pageviews       0
Clicks          0
Enrollments    14
Payments       14
dtype: int64

Compare Gross Conversion

  • The method here is almost the same as the one used in "Compare CTP" above.

  • Observation

    • As the results show, the change in the experiment group is both statistically and practically significant.
    • From the control group to the experiment group, there is a 2.06% decrease in gross conversion, and it is significant. So fewer people enroll when the site shows the "start free trial" option, compared with the "access course materials" option.

In [93]:
control_clicks = control_df['Clicks'].loc[control_df['Enrollments'].notnull()].sum()
experiment_clicks = experiment_df['Clicks'].loc[experiment_df['Enrollments'].notnull()].sum()
print('Clicks', control_clicks, experiment_clicks)

control_enrolls = control_df['Enrollments'].sum()
experiment_enrolls = experiment_df['Enrollments'].sum()
print('Enrollments', control_enrolls, experiment_enrolls)

control_GC = control_enrolls/control_clicks
experiment_GC = experiment_enrolls/experiment_clicks
print('Gross Conversion', control_GC, experiment_GC)


Clicks 17293 17260
Enrollments 3785.0 3423.0
Gross Conversion 0.2188746891805933 0.19831981460023174

In [94]:
p_pool = (control_enrolls + experiment_enrolls)/(control_clicks + experiment_clicks)
std_pool = math.sqrt(p_pool*(1-p_pool)*(1/control_clicks + 1/experiment_clicks))
me = get_marginOferror(std_pool, alpha)

print(p_pool, std_pool, me)


0.20860706740369866 0.004371675385225936 0.0086

In [97]:
# Statistical significance
GC_diff = round(experiment_GC - control_GC, 4)

print('If ' + str(GC_diff) +' is within [' + str(round(0 - me, 4)) + ', ' + str(round(0 + me, 4)) + '], then the difference is expected, and the change is not significant.')


If -0.0206 is within [-0.0086, 0.0086], then the difference is expected, and the change is not significant.

In [100]:
# Practically significance
d_min = gross_conversion['d_min']

print('If ' + str(GC_diff) +' is within [' + str(round(d_min - me, 4)) + ', ' + str(round(d_min + me, 4)) + '], then the change is not practically significant.')


If -0.0206 is within [0.0014, 0.0186], then the change is not practically significant.

Compare Net Conversion

  • Observation
    • As the results show, the change is not statistically significant, and its magnitude (a 0.49% drop) is below Dmin = 0.75%, so it does not clear the practical significance bar either.
    • However, the confidence interval around the difference extends below -Dmin, so a practically significant drop in payments with the "start free trial" option (compared with the "access course materials" option) cannot be ruled out. This risk matters to the business.

In [105]:
control_clicks = control_df['Clicks'].loc[control_df['Payments'].notnull()].sum()
experiment_clicks = experiment_df['Clicks'].loc[experiment_df['Payments'].notnull()].sum()
print('Clicks', control_clicks, experiment_clicks)

control_paid = control_df['Payments'].sum()
experiment_paid = experiment_df['Payments'].sum()
print('Payments', control_paid, experiment_paid)

control_NC = control_paid/control_clicks
experiment_NC = experiment_paid/experiment_clicks
print('Net Conversion', control_NC, experiment_NC)


Clicks 17293 17260
Payments 2033.0 1945.0
Net Conversion 0.11756201931417337 0.1126882966396292

In [106]:
p_pool = (control_paid + experiment_paid)/(control_clicks + experiment_clicks)
std_pool = math.sqrt(p_pool*(1-p_pool)*(1/control_clicks + 1/experiment_clicks))
me = get_marginOferror(std_pool, alpha)

print(p_pool, std_pool, me)


0.1151274853124186 0.0034341335129324238 0.0067

In [107]:
# Statistical significance
NC_diff = round(experiment_NC - control_NC, 4)

print('If ' + str(NC_diff) +' is within [' + str(round(0 - me, 4)) + ', ' + str(round(0 + me, 4)) + '], then the difference is expected, and the change is not significant.')


If -0.0049 is within [-0.0067, 0.0067], then the difference is expected, and the change is not significant.

In [108]:
# Practically significance
d_min = net_conversion['d_min']

print('If ' + str(NC_diff) +' is within [' + str(round(d_min - me, 4)) + ', ' + str(round(d_min + me, 4)) + '], then the change is not practically significant.')


If -0.0049 is within [0.0008, 0.0142], then the change is not practically significant.
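
An alternative, commonly used formulation (an addition here, not one of the original checks) centers the confidence interval on the observed difference and compares it against 0 and ±Dmin: the change is statistically significant if the CI excludes 0, and practically significant if the CI lies entirely beyond Dmin (or below -Dmin). A minimal sketch, reusing the sums and helpers defined above:

# Sketch only: CI around the observed (experiment - control) difference
def diff_confidence_interval(x_control, n_control, x_experiment, n_experiment, alpha=0.05):
    p_pool = (x_control + x_experiment) / (n_control + n_experiment)
    std_pool = math.sqrt(p_pool * (1 - p_pool) * (1/n_control + 1/n_experiment))
    diff = x_experiment/n_experiment - x_control/n_control
    me = get_z_score(1 - alpha/2) * std_pool
    return round(diff - me, 4), round(diff + me, 4)

# Gross conversion: the CI should exclude 0 and lie entirely below -0.01
print(diff_confidence_interval(control_enrolls, control_clicks, experiment_enrolls, experiment_clicks))
# Net conversion: the CI should include 0 but still dip below -0.0075
print(diff_confidence_interval(control_paid, control_clicks, experiment_paid, experiment_clicks))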

Step 3.3 - Differences in Trend (Sign Test)

  • The purpose of this test is to check whether the decrease/increase trend is consistent in the daily data.
  • prob(x successes) = (n!/(x! * (n-x)!)) * pow(p, x) * pow(1-p, n-x)
    • x - number of "successes"; a day counts as a success when the experiment group's metric is higher than the control group's
    • n - total number of records (days)
    • p - probability of a success; for this binomial sign test, p = 0.5 under the null hypothesis
  • The one-sided p-value is the sum of prob(i successes) for i = 0..x; the code below doubles it to get a two-sided p-value. When the p-value is smaller than alpha, the change is significant.
  • This is the online calculator to get the p-value: https://www.graphpad.com/quickcalcs/binomial1/
    • Given x, n, p
    • It provides both one-sided and two-sided p-values
  • Observation
    • Consistent with Step 3.2, the change in Gross Conversion is significant, while the change in Net Conversion is not statistically significant.
    • Note that whether we count successes or failures here, the significance conclusion is the same, even though the p-value can differ.

In [110]:
control_experiment_df = control_df.join(experiment_df, lsuffix='_control', rsuffix='_experiment')
print(control_experiment_df.shape)
control_experiment_df.head()


(37, 10)
Out[110]:
Date_control Pageviews_control Clicks_control Enrollments_control Payments_control Date_experiment Pageviews_experiment Clicks_experiment Enrollments_experiment Payments_experiment
0 Sat, Oct 11 7723 687 134.0 70.0 Sat, Oct 11 7716 686 105.0 34.0
1 Sun, Oct 12 9102 779 147.0 70.0 Sun, Oct 12 9288 785 116.0 91.0
2 Mon, Oct 13 10511 909 167.0 95.0 Mon, Oct 13 10480 884 145.0 79.0
3 Tue, Oct 14 9871 836 156.0 105.0 Tue, Oct 14 9867 827 138.0 92.0
4 Wed, Oct 15 10014 837 163.0 64.0 Wed, Oct 15 9793 832 140.0 94.0

In [111]:
control_experiment_df.isnull().sum()


Out[111]:
Date_control               0
Pageviews_control          0
Clicks_control             0
Enrollments_control       14
Payments_control          14
Date_experiment            0
Pageviews_experiment       0
Clicks_experiment          0
Enrollments_experiment    14
Payments_experiment       14
dtype: int64

In [113]:
control_experiment_df.dropna(inplace=True)
print(control_experiment_df.shape)
control_experiment_df.isnull().sum()


(23, 10)
Out[113]:
Date_control              0
Pageviews_control         0
Clicks_control            0
Enrollments_control       0
Payments_control          0
Date_experiment           0
Pageviews_experiment      0
Clicks_experiment         0
Enrollments_experiment    0
Payments_experiment       0
dtype: int64

In [122]:
# If it's "success", assign 1, otherwise 0

control_experiment_df['GC_increase'] = np.where(
    control_experiment_df['Enrollments_experiment']/control_experiment_df['Clicks_experiment'] \
    > control_experiment_df['Enrollments_control']/control_experiment_df['Clicks_control'], 1, 0)

control_experiment_df['NC_increase'] = np.where(
    control_experiment_df['Payments_experiment']/control_experiment_df['Clicks_experiment'] \
    > control_experiment_df['Payments_control']/control_experiment_df['Clicks_control'], 1, 0)

control_experiment_df[['GC_increase', 'NC_increase']].head()


Out[122]:
   GC_increase  NC_increase
0            0            0
1            0            1
2            0            0
3            0            0
4            0            1

In [126]:
print(control_experiment_df['GC_increase'].value_counts())
print(control_experiment_df['NC_increase'].value_counts())


0    19
1     4
Name: GC_increase, dtype: int64
0    13
1    10
Name: NC_increase, dtype: int64

In [143]:
GC_success_ct = control_experiment_df['GC_increase'].value_counts()[1]
NC_success_ct = control_experiment_df['NC_increase'].value_counts()[1]

print(GC_success_ct, NC_success_ct)


4 10

In [144]:
p = 0.5
alpha = 0.05
n = control_experiment_df.shape[0]

print(n)


23

In [152]:
def get_probability(x, n):
    # Binomial pmf: probability of exactly x successes out of n trials, using the global p (0.5)
    prob = round(math.factorial(n)/(math.factorial(x)*math.factorial(n-x))*pow(p,x)*pow(1-p, n-x), 4)
    return prob

def get_p_value(x, n):
    # One-sided tail probability P(X <= x), doubled to get a two-sided p-value
    p_value = 0

    for i in range(0, x+1):
        p_value += get_probability(i, n)

    return round(p_value*2, 4)  # 2-sided p_value
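
As a sanity check on this hand-rolled p-value (an optional addition, assuming SciPy >= 1.7 is available), scipy.stats.binomtest should give nearly identical two-sided values, up to the rounding inside get_probability:

# Cross-check sketch using scipy's exact binomial test (not part of the original notebook)
from scipy.stats import binomtest

print(binomtest(GC_success_ct, n, p=0.5).pvalue)  # expected to be close to 0.0026
print(binomtest(NC_success_ct, n, p=0.5).pvalue)  # expected to be close to 0.68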

In [153]:
print ("GC Change is significant if", get_p_value(GC_success_ct,n), "is smaller than", alpha)
print ("NC Change is significant if", get_p_value(NC_success_ct,n), "is smaller than", alpha)


GC Change is significant if 0.0026 is smaller than 0.05
NC Change is significant if 0.6774 is smaller than 0.05

Reference