Over the course of a week, I divided invites from about 3000 requests among four new variations of the quote form as well as the baseline form we've been using for the last year. Here are my results:
- Baseline: 32 quotes out of 595 viewers
- Variation 1: 30 quotes out of 599 viewers
- Variation 2: 18 quotes out of 622 viewers
- Variation 3: 51 quotes out of 606 viewers
- Variation 4: 38 quotes out of 578 viewers
What's your interpretation of these results? What conclusions would you draw? What questions would you ask me about my goals and methodology? Do you have any thoughts on the experimental design? Please provide statistical justification for your conclusions and explain the choices you made in your analysis. For the sake of your analysis, you can make whatever assumptions are necessary to make the experiment valid, so long as you state them. So, for example, your response might follow the form "I would ask you A, B and C about your goals and methodology. Assuming the answers are X, Y and Z, then here's my analysis of the results... If I were to run it again, I would consider changing...".
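For reproducibility, the data can be reconstructed directly from the counts listed above (the column names `Bucket`, `Quotes`, and `Views` match those used in the notebook below; the CSV file itself is not included here):

```python
import pandas as pd

# Reconstruct the split-test data from the counts listed above.
split_test_df = pd.DataFrame({
    "Bucket": ["Baseline", "Variation 1", "Variation 2", "Variation 3", "Variation 4"],
    "Quotes": [32, 30, 18, 51, 38],
    "Views": [595, 599, 622, 606, 578],
})
split_test_df["conversion_rate"] = split_test_df["Quotes"] / split_test_df["Views"]
print(split_test_df)
```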
In [36]:
# imports and plotting setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [37]:
# I have stored the data in a CSV file. Let's load it into a pandas DataFrame
split_test_df = pd.read_csv("split_test.csv")
split_test_df['conversion_rate'] = split_test_df['Quotes'] / split_test_df['Views']
In [38]:
# Let's look at the dataframe
split_test_df
Out[38]:
In [39]:
# split_test_df.sort_values('conversion_rate').plot.bar(x='Bucket', y='conversion_rate')
split_test_df.plot(kind='bar', x = 'Bucket', y='conversion_rate')
Out[39]:
In [40]:
# percent difference from baseline, relative to the baseline conversion rate
split_test_df['percent_diff_from_base'] = (split_test_df['conversion_rate'] - split_test_df['conversion_rate'][0]) * 100 / split_test_df['conversion_rate'][0]
In [41]:
split_test_df
Out[41]:
In [42]:
# Let's plot percent difference from base in a barchart
split_test_df.plot(kind='bar', x = 'Bucket', y='percent_diff_from_base')
Out[42]:
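Because variation 3 stands out only after looking at the results, it is worth first running an omnibus chi-square test across all five buckets; a significant result here guards (somewhat) against the multiple-comparisons problem of cherry-picking the best variant for a pairwise test. A sketch using `scipy.stats.chi2_contingency` on the counts listed at the top (scipy is an added dependency, not used elsewhere in this notebook):

```python
import numpy as np
from scipy.stats import chi2_contingency

quotes = np.array([32, 30, 18, 51, 38])
views = np.array([595, 599, 622, 606, 578])

# 5x2 contingency table: successes (quotes) vs failures (views - quotes)
table = np.column_stack([quotes, views - quotes])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```

A small p-value here says the five conversion rates are unlikely to all be equal, which justifies digging into individual comparisons.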
To test the significance of the baseline vs. variation 3 conversion rates, I am going to use a two-proportion z-test. The null hypothesis is that there is no difference between the two conversion rates (i.e. the observed difference in quotes occurred by chance). The alternative hypothesis is that the difference did not occur by random chance. Because each bucket has hundreds of views (with well over 10 successes and failures each), the Central Limit Theorem lets us approximate the distribution of the test statistic as normal. For this test I am assuming a significance level of alpha = 5% (0.05) would be sufficient. The critical z value for a one-tailed test at alpha = 0.05 is 1.645.
z = (p̂1 - p̂2) / sqrt(p̂1(1 - p̂1)/n1 + p̂2(1 - p̂2)/n2)

Where: p1 is the proportion from the first population and p2 the proportion from the second, with sample sizes n1 and n2.
In [60]:
# Proportion of variation 3 (i.e. conversion rate)
p1_hat = split_test_df['conversion_rate'][3]
# Sample size of variation 3 (i.e. views)
n1 = split_test_df['Views'][3]
# Proportion of baseline (i.e. conversion rate)
p2_hat = split_test_df['conversion_rate'][0]
# Sample size of baseline (i.e. views)
n2 = split_test_df['Views'][0]
print(p1_hat, p2_hat, n1, n2)
In [61]:
# z statistic using the unpooled standard error
z = (p1_hat - p2_hat) / np.sqrt((p1_hat * (1 - p1_hat) / n1) + (p2_hat * (1 - p2_hat) / n2))
In [62]:
print(f"Z score is: {z}")
From the above results, z ≈ 2.1 > 1.645. Hence we can reject the null hypothesis: there is a statistically significant difference in conversion rates between the baseline and variation 3 at the 95% confidence level (one-tailed).
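As a cross-check, here is the same comparison using the pooled-variance form of the two-proportion z-test (the textbook default when the null hypothesis is equal proportions), with a one-tailed p-value computed from the normal CDF via `math.erfc`; only the counts from the table above are used:

```python
import math

q1, n1 = 51, 606   # variation 3: quotes, views
q2, n2 = 32, 595   # baseline: quotes, views

p1, p2 = q1 / n1, q2 / n2
p_pool = (q1 + q2) / (n1 + n2)               # pooled proportion under H0
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = 0.5 * math.erfc(z / math.sqrt(2))  # one-tailed P(Z > z)
print(f"z = {z:.3f}, one-tailed p = {p_value:.4f}")
```

The pooled and unpooled forms give nearly identical z values here, so the conclusion does not hinge on that choice.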
If I were to run it again, I would consider collecting more samples (i.e. viewers) for each variation in the experiment design. A larger sample would give a better picture of the distribution of each variation's conversion rate and more power to detect small differences.
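To make "more samples" concrete, a standard power calculation shows roughly how many viewers per bucket would be needed to detect the observed baseline-vs-variation-3 lift with 80% power at a one-tailed alpha of 0.05. The z quantiles 1.645 and 0.8416 are hard-coded; in practice you would size the test before running it, using a lift you care about rather than the lift you happened to observe:

```python
import math

p1, p2 = 51 / 606, 32 / 595          # observed conversion rates (variation 3, baseline)
z_alpha, z_beta = 1.645, 0.8416      # one-tailed alpha = 0.05, power = 0.80

# Required viewers per bucket, unpooled-variance approximation:
# n = (z_alpha + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
print(f"~{math.ceil(n)} viewers per bucket")
```

The ~600 viewers per bucket actually collected is in the right ballpark for a lift this large, but smaller lifts (e.g. variation 4's) would need substantially more traffic to resolve.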