Overview: Free Trial Screener

At the time of this experiment, Udacity courses had two options on the home page: "start free trial" and "access course materials". If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first. If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.

In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead. This screenshot shows what the experiment looks like.

The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.

The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

Background

  • "Start free trial": enroll in the free trial of the paid version
  • "Access course materials": view videos and take quizzes for free

Experiment

  • Users see the screener popup after clicking "Start free trial"
  • Users who indicate fewer than 5 hours per week are warned and pointed to the free materials

We have two initial hypotheses.

  • “..this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time”
  • ”..without significantly reducing the number of students to continue past the free trial and eventually complete the course. “

Unit of diversion: cookie

  • Users who enroll in the free trial are tracked by user-id from that point on
  • The same user-id cannot enroll in the free trial twice
  • Users who do not enroll are not tracked, even if they were signed in

Experiment Design

Metric Choice

Invariant metrics

  • Number of cookies
  • Number of clicks
  • Click through probability

A sanity check verifies that the traffic assigned to the experiment and control groups is comparable. This is done with the right invariant metrics: the three metrics above should not differ between groups because they are all measured before the user is exposed to the experiment.

The number of cookies that view the course overview page should be the same in both groups: at that point visitors have not yet clicked the "Start free trial" button, so they have not seen the Free Trial Screener. That makes number of cookies a valid invariant metric. Likewise, at the moment users click the button they still have not seen the experiment, so the number of clicks should not differ between the experiment and control groups.

Since the experiment only takes effect after users click the "Start free trial" button, the click-through-probability should also be the same in the experiment and control groups. Given that the number of cookies and the number of clicks should both match, their ratio, the click-through-probability, has to match as well.

Besides the cookie, there is also the user-id. However, user-id is not a good invariant metric: the course overview page is open to anyone, and user-ids are only tracked after enrollment, which happens after exposure to the screener, so the count of user-ids can be affected by the experiment.

Evaluation metrics

  • Gross Conversion
  • Net Conversion

Recall our two initial hypotheses.

  • “..this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time”
  • ”..without significantly reducing the number of students to continue past the free trial and eventually complete the course. “

For evaluation metrics, I initially consider Gross Conversion, Retention, and Net Conversion. All three are reasonable evaluation metrics because they are expected to change if the experiment has an effect. For Gross Conversion and Net Conversion, the unit of analysis (a cookie that clicks) matches the unit of diversion (a cookie), so their analytical standard errors should be close to the empirical ones; Retention uses user-ids as the unit of analysis, which does not match the unit of diversion.

Gross Conversion is the number of user-ids that complete checkout and enroll in the free trial, divided by the number of unique cookies that click the "Start free trial" button. After visitors click the button they see the screener and its warning, which should make visitors without a serious time commitment back out before enrolling. We therefore expect Gross Conversion to be lower in the experiment group.

Retention is the number of user-ids that remain enrolled past the 14-day boundary (and thus make at least one payment), divided by the number of user-ids that complete checkout. Since the experiment aims to keep only the visitors who can make a serious commitment, Retention should be higher in the experiment group than in the control group.

Net Conversion is the number of user-ids that remain enrolled past the 14-day boundary (and thus make at least one payment), divided by the number of unique cookies that click the "Start free trial" button. The experiment is not meant to reduce this: the fraction of clicking users who go on to make at least one payment should stay roughly the same, because the screener should mainly turn away users who would not have paid anyway.

Out of these metrics, Retention turns out to require a much longer experiment, roughly 118 days (see Duration vs. Exposure below). That is more time than Udacity is willing to give, so Retention is excluded.

The first hypothesis maps to Gross Conversion: after the experiment we expect a significant reduction, because fewer students without enough time should enroll in the free trial, so the experiment group should differ significantly from the control group. The second hypothesis maps to Net Conversion: we expect no significant reduction in the number of students who make at least one payment, so the experiment group should not differ significantly from the control group.

Measuring Standard Deviation

  • Gross Conversion: 0.0202
  • Net Conversion: 0.0156

We expect the analytical variance to match the empirical variance because the unit of analysis and the unit of diversion are the same.

To calculate the analytical standard deviation, we use the standard error of a binomial proportion,

SE = np.sqrt(p * (1-p) / n)

together with the baseline data below.


In [34]:
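import numpy as np  # used below for the standard error calculations
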
baselines= """Unique cookies to view page per day:	40000
Unique cookies to click "Start free trial" per day:	3200
Enrollments per day:	660
Click-through-probability on "Start free trial":	0.08
Probability of enrolling, given click:	0.20625
Probability of payment, given enroll:	0.53
Probability of payment, given click:	0.1093125"""

lines  = baselines.split('\n')
d_baseline = dict([(e.split(':\t')[0],float(e.split(':\t')[1])) for e in lines])

Since we are asked to estimate variability for a sample of 5,000 cookies rather than the 40,000 daily baseline, we scale the counts accordingly. Both evaluation metrics are divided by the number of cookies that click the "Start free trial" button, calculated as


In [79]:
n = 5000
n_click = n * d_baseline['Click-through-probability on "Start free trial"']
n_click


Out[79]:
400.0

Next, the standard deviation for Gross Conversion is


In [78]:
p = d_baseline['Probability of enrolling, given click']
round(np.sqrt(p * (1-p) / n_click),4)


Out[78]:
0.0202

and for Net Conversion,


In [77]:
p = d_baseline['Probability of payment, given click']
round(np.sqrt(p * (1-p) / n_click),4)


Out[77]:
0.0156

For Gross Conversion and Net Conversion, the empirical variance should approximate the analytical variance because the unit of analysis (a cookie that clicks) and the unit of diversion (a cookie) are effectively the same.

Sizing

Number of Samples vs. Power

  • Gross Conversion. Baseline: 0.20625, dmin: 0.01 → 25,839 cookies who click (per group).
  • Net Conversion. Baseline: 0.1093125, dmin: 0.0075 → 27,411 cookies who click (per group).
  • Not using the Bonferroni correction.
  • Using alpha = 0.05 and beta = 0.2.

The number of pageviews needed will then be 685,275.

We feed these values into an online sample size calculator.
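As a rough cross-check on the calculator, the normal-approximation sample-size formula for comparing two proportions can be computed directly. This is only a sketch (the z values for alpha = 0.05 two-sided and power = 0.8 are hard-coded), and it lands near, not exactly on, the calculator's figure.

# Approximate per-group sample size (in clicks) for Gross Conversion.
z_alpha, z_beta = 1.96, 0.84
p1 = d_baseline['Probability of enrolling, given click']   # 0.20625
dmin = 0.01
p2 = p1 + dmin
(z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / dmin ** 2
# roughly 26,000 clicks per group, in the same ballpark as the calculator's 25,839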

We use the larger of the two required sample sizes so that both metrics have enough power. The calculator output is per group, so it must be doubled. And since the requirement is in clicks rather than pageviews, we convert it to pageviews by dividing by the click-through-probability. The pageviews needed will then be:


In [89]:
(27411 * 2) / d_baseline['Click-through-probability on "Start free trial"']


Out[89]:
685275.0

Duration vs. Exposure

  • Fraction of traffic diverted: 0.8 (low risk)
  • Duration: 22 days (40,000 pageviews per day, 80% diverted)

The experiment will be shown to 80% of Udacity's visitors. The change is low risk: it only adds a small warning message, so it is unlikely to harm users or attract attention as news. With 40,000 pageviews per day and 80% of them diverted (32,000 per day), collecting the required 685,275 pageviews takes about 22 days.
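For reference, the duration arithmetic is a one-liner using the numbers above:

# 80% of the 40,000 daily pageviews are diverted into the experiment.
np.ceil(685275 / (40000 * 0.8))   # about 22 days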

This is where Retention fails as an evaluation metric: its required sample size translates into roughly 118 days of traffic, far longer than Udacity is willing to run the experiment. So Retention is excluded.

Experiment Analysis

Sanity Checks

  • Number of Cookies:

    • Bounds = (0.4988,0.5012)
    • Observed = 0.5006
    • Passes? Yes
  • Number of clicks on “Start free trial”:

    • Bounds = (0.4959,0.5041)
    • Observed = 0.5005
    • Passes? Yes
  • Click-through-probability on “Start free trial”:

    • Bounds = (0.0812,0.0830)
    • Observed = 0.0821
    • Passes? Yes

Since we have passed all of the sanity checks, we can continue to analyze the experiment.

We run the sanity checks to ensure that traffic is split evenly between the experiment and control groups, using the invariant metrics chosen earlier (the metrics that should not change when the experiment changes). First, let's look at the control and experiment data.


In [4]:
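import pandas as pd  # pandas is used to load the daily experiment data
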
control = pd.read_csv('control_data.csv')
experiment = pd.read_csv('experiment.csv')

In [5]:
control.head()


Out[5]:
Date Pageviews Clicks Enrollments Payments
0 Sat, Oct 11 7723 687 134 70
1 Sun, Oct 12 9102 779 147 70
2 Mon, Oct 13 10511 909 167 95
3 Tue, Oct 14 9871 836 156 105
4 Wed, Oct 15 10014 837 163 64

In [6]:
experiment.head()


Out[6]:
Date Pageviews Clicks Enrollments Payments
0 Sat, Oct 11 7716 686 105 34
1 Sun, Oct 12 9288 785 116 91
2 Mon, Oct 13 10480 884 145 79
3 Tue, Oct 14 9867 827 138 92
4 Wed, Oct 15 9793 832 140 94

Next, we count the total views and clicks for both control and experiment groups.


In [38]:
control_views = control.Pageviews.sum()
control_clicks = control.Clicks.sum()

experiment_views = experiment.Pageviews.sum()
experiment_clicks = experiment.Clicks.sum()

For counts such as the number of cookies and the number of clicks on the "Start free trial" button, we build a confidence interval around the fraction of traffic we expect in the control group and compare it with the observed fraction. Since we expect the control and experiment groups to receive equal traffic, the expected proportion is 0.5. For both invariant metrics, the sanity-check confidence interval uses the function below.


In [7]:
def sanity_check_CI(control, experiment, expected):
    # 95% CI around the expected share of total traffic in the control group
    SE = np.sqrt((expected * (1 - expected)) / (control + experiment))
    ME = 1.96 * SE
    return (expected - ME, expected + ME)

Now, the sanity-check confidence interval for the number of cookies that view the page:


In [42]:
sanity_check_CI(control_views,experiment_views,0.5)


Out[42]:
(0.49882039214902313, 0.50117960785097693)

The actual proportion is


In [60]:
float(control_views)/(control_views+experiment_views)


Out[60]:
0.5006396668806133

Since the observed 0.5006 is within the interval, the experiment passes the sanity check for number of cookies.

Next, we calculate the confidence interval for the number of clicks on the "Start free trial" button.


In [44]:
sanity_check_CI(control_clicks,experiment_clicks,0.5)


Out[44]:
(0.49588449572378945, 0.50411550427621055)

And the actual proportion,


In [61]:
float(control_clicks)/(control_clicks+experiment_clicks)


Out[61]:
0.5004673474066628

The observed proportion, 0.5005, is again within the interval, so this sanity check also passes.

The sanity check for click-through-probability is a slightly different calculation. For the simple counts above, we knew that with a proper setup the control group's true share should be 0.5. We do not know the true click-through-probability, so instead we build a confidence interval around the control group's CTP and treat the experiment group's CTP as the observed outcome. If the experiment's CTP falls outside the control CTP's confidence interval, the sanity check fails and we cannot continue the analysis.


In [21]:
ctp_control = float(control_clicks)/control_views
ctp_experiment = float(experiment_clicks)/experiment_views

In [3]:
from scipy.stats import norm  # for the normal critical value

c = 28378          # total clicks on "Start free trial" in the control group
n = 345543         # total pageviews in the control group
CL = 0.95          # confidence level

pe = c / n                                      # control click-through-probability
SE = np.sqrt(pe * (1 - pe) / n)
z_star = round(norm.ppf(1 - (1 - CL) / 2), 2)   # two-sided critical value, 1.96
ME = z_star * SE
(pe - ME, pe + ME)


Out[3]:
(0.0812103597525297, 0.0830412673966239)

In [4]:
ctp_experiment


Out[4]:
0.082125813574576823

As you can see, the experiment group's click-through-probability lies within the confidence interval around the control group's click-through-probability. Since all of the sanity checks pass, we can continue to analyze the experiment.

Effect Size Test

  • Did not use Bonferroni correction
  • Gross Conversion
    • Bounds = (-0.0291, -0.0120)
    • Statistical Significance? Yes
    • Practical Significance? Yes
  • Net Conversion
    • Bounds = (-0.0116,0.0019)
    • Statistical Significance? No
    • Practical Significance? No

In [29]:
# Per-group gross and net conversion, restricted to the days with complete
# enrollment/payment data (the final days have clicks but no enrollments yet).
get_gross = lambda group: float(group.dropna().Enrollments.sum()) / group.dropna().Clicks.sum()
get_net = lambda group: float(group.dropna().Payments.sum()) / group.dropna().Clicks.sum()

Gross Conversion

Keep in mind that the observed difference can be negative.


In [40]:
print('N_cont = %i'%control.dropna().Clicks.sum())
print('X_cont = %i'%control.dropna().Enrollments.sum())
print('N_exp =  %i'%experiment.dropna().Clicks.sum())
print('X_exp =  %i'%experiment.dropna().Enrollments.sum())


N_cont = 17293
X_cont = 3785
N_exp =  17260
X_exp =  3423

In [3]:
X_exp/N_exp


Out[3]:
0.198319814600232

In [4]:
X_cont/N_cont


Out[4]:
0.218874689180593

In [1]:
# Gross Conversion: experiment minus control, with a 95% confidence interval.
N_cont = 17293   # clicks, control
X_cont = 3785    # enrollments, control
N_exp = 17260    # clicks, experiment
X_exp = 3423     # enrollments, experiment

observed_diff = X_exp / N_exp - X_cont / N_cont
p_pool = (X_cont + X_exp) / (N_cont + N_exp)   # pooled probability under H0
SE = np.sqrt(p_pool * (1 - p_pool) * (1 / N_cont + 1 / N_exp))
ME = 1.96 * SE
(observed_diff - ME, observed_diff + ME)


Out[1]:
(-0.0291233583354044, -0.0119863908253187)

In [2]:
observed_diff


Out[2]:
-0.0205548745803616

The confidence interval does not include 0, so the decrease in Gross Conversion is statistically significant. The entire interval also lies beyond the practical significance boundary (dmin = 0.01), so the decrease is practically significant as well. Considering Gross Conversion alone, the screener does what it is meant to do and the result supports launching.

Net Conversion


In [43]:
print('N_cont = %i'%control.dropna().Clicks.sum())
print('X_cont = %i'%control.dropna().Payments.sum())
print('N_exp =  %i'%experiment.dropna().Clicks.sum())
print('X_exp =  %i'%experiment.dropna().Payments.sum())


N_cont = 17293
X_cont = 2033
N_exp =  17260
X_exp =  1945

In [5]:
X_exp/N_exp


Out[5]:
0.1126882966

In [6]:
X_cont/N_cont


Out[6]:
0.1175620193

In [11]:
# Net Conversion: experiment minus control, with a 95% confidence interval.
N_cont = 17293   # clicks, control
X_cont = 2033    # payments, control
N_exp = 17260    # clicks, experiment
X_exp = 1945     # payments, experiment

observed_diff = X_exp / N_exp - X_cont / N_cont
p_pool = (X_cont + X_exp) / (N_cont + N_exp)   # pooled probability under H0
SE = np.sqrt(p_pool * (1 - p_pool) * (1 / N_cont + 1 / N_exp))
ME = 1.96 * SE
(observed_diff - ME, observed_diff + ME)


Out[11]:
(-0.0116046243598917, 0.00185717901080338)

In [12]:
observed_diff


Out[12]:
-0.00487372267454417

The confidence interval includes 0, so the change in Net Conversion is not statistically significant. However, its lower bound (-0.0116) extends past the practical significance boundary (dmin = 0.0075), so we cannot rule out a practically significant drop in Net Conversion. The result is inconclusive on this business-critical metric.
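To make the decision rule explicit, here is a small sketch; the interval bounds are copied from the outputs above and the dmin values come from the metric setup.

def significance(ci_low, ci_high, dmin):
    # Statistically significant if the CI excludes 0; practically significant
    # if the whole CI lies beyond the practical significance boundary.
    stat = not (ci_low <= 0 <= ci_high)
    prac = ci_high < -dmin or ci_low > dmin
    return stat, prac

significance(-0.0291, -0.0120, 0.01)    # Gross Conversion -> (True, True)
significance(-0.0116, 0.0019, 0.0075)   # Net Conversion   -> (False, False)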

Sign Test

  • Did not use Bonferroni correction
  • Gross Conversion
    • p-value = 0.0026
    • Statistical Significance? Yes
  • Net Conversion
    • p-value = 0.6776
    • Statistical Significance? No

The sign test is a non-parametric check that should agree with the effect size test. I use an online calculator to compute the two-tailed binomial p-value: out of all days, how often is the experiment group's metric higher than the control group's, and how likely is that count under a 50/50 chance? If the count is too unlikely to be due to chance at the chosen 5% significance level, the day-by-day results confirm the effect.

I use a helper function to compare, day by day, whether the metric in question is higher in the experiment group than in the control group.


In [1]:
# True on days where the metric's rate is higher in the experiment group than in control.
compare_prob = lambda col: ((control.dropna()[col] / control.dropna().Clicks) <
                            (experiment.dropna()[col]/experiment.dropna().Clicks))

Gross Conversion

Counting the days for Gross Conversion, I get


In [4]:
compare_prob('Enrollments').value_counts()


Out[4]:
False    19
True      4
dtype: int64

Net Conversion


In [5]:
compare_prob('Payments').value_counts()


Out[5]:
False    13
True     10
dtype: int64

With 4 out of 23 days for Gross Conversion, the two-tailed binomial p-value is 0.0026; with 10 out of 23 days for Net Conversion, the p-value is 0.6776.
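The same p-values can be reproduced without the online calculator, assuming scipy is available; this is a small sketch using the two-tailed binomial test.

from scipy.stats import binomtest

print(binomtest(4, 23, 0.5).pvalue)    # Gross Conversion, ~0.0026
print(binomtest(10, 23, 0.5).pvalue)   # Net Conversion,   ~0.6776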

Conclusion

I would not use the Bonferroni correction in this case. Bonferroni guards against false positives when a launch would be triggered by any one of several metrics turning out significant. Our launch criteria are different: we need Gross Conversion to show a significant decrease and Net Conversion to show no significant decrease, i.e. both metrics must meet their criteria at once, so the correction would be overly conservative.

  • Do not use the Bonferroni correction.
  • Gross Conversion needs to be significant; Net Conversion needs to show no significant decrease.

Recommendation

Gross Conversion looks good: it passes Udacity's practical significance boundary, meaning the screener reduces the number of students who enroll without enough time to commit. However, even though the change in Net Conversion is not statistically significant, its confidence interval crosses the practical significance boundary, which is not what Udacity wants: launching could cost Udacity revenue. My recommendation is to run a further experiment, or to cancel the launch if Udacity does not have time for one.

  • Gross Conversion: passes
  • Net Conversion: inconclusive, with a risk of losing revenue

  • Decision: risky; delay for a further experiment or cancel the launch

Follow-Up Experiment

So what would it take to screen out less committed users without losing revenue? Every course overview page on Udacity already lists the expected hours per week, so warning students about the time commitment again may be unnecessary. Instead, we could offer an incentive after enrollment. In the proposed experiment, after students enroll they see a panel on the right side of the course video page: an offer of free access until they graduate, on the condition that they work as a Udacity Code Reviewer afterwards. Udacity already runs this program, which pays graduates a reasonable hourly rate for reviewing other students' code. If they agree, they click a "Start debt program" button below the information panel.

Participants can continue past the 14-day boundary and finish the course without paying upfront. In return, they must work as Code Reviewers and pay off the debt through their reviewer earnings; they receive no payout until the debt is cleared.

Yes, this seems risky for Udacity. But if users break the agreement, for example by not becoming a Code Reviewer within two months, they are automatically charged through the credit card they registered. They are also charged automatically if they cancel the program. So the risk of users walking away is handled, although enforcing this is not part of the experiment itself.

The hypothesis is that, given this incentive, students become more serious and committed to completing the course. The number of users who cancel early in the course should be significantly reduced, and less committed students should be boosted toward the behavior of the already committed ones.

The unit of diversion is a user-id. As with the free trial, the same user-id cannot join the debt program twice. A user-id works across devices and represents a person better than a cookie does. User-ids that do not enroll in the program are not tracked in the experiment, and user-ids that join the debt program but cancel at the end of the free trial are not tracked either.

  • Showing the time-commitment warning again may be unnecessary
  • "Start Debt Program" incentive after enrollment
  • Risk: users break the agreement
    • Do not become a Udacity Code Reviewer
    • Cancel midway through the program
  • Hypothesis
    • Less serious users become more committed after the incentive
    • The number of users who cancel early in the course is reduced
    • Less committed users are boosted toward the already committed ones

We can use the following invariant metrics for the follow-up experiment:

  • Number of cookies: That is, number of unique cookies to view the course overview page.
  • Number of clicks: That is, number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered).
  • Click-through-probability: That is, number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page.
  • Gross conversion: That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button.

And the evaluation metrics:

  • Debt Conversion: That is, number of user-ids to click “Start Debt Program” divided by number of user-ids that enroll in the free trial.
  • Debt-Net conversion: That is, number of user-ids to click “Start Debt Program” divided by number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment)
  • Net conversion: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button.

We use user-ids as the unit of diversion and expect all of the evaluation metrics to show practically significant changes.