Logging data



In [ ]:

    
from planout.ops.random import *
from planout.experiment import SimpleExperiment
import pandas as pd
import json

Log data

Here we look at the fields that appear in the log data. First, run this:



In [ ]:

    
class LoggedExperiment(SimpleExperiment):
    def assign(self, params, userid):
        params.x = UniformChoice(choices=["What's on your mind?", "Say something."], unit=userid)
        params.y = BernoulliTrial(p=0.5, unit=userid)

print(LoggedExperiment(userid=5).get('x'))

Then open your terminal, navigate to the directory this notebook is in, and type:

> tail -f LoggedExperiment.log

You can now watch data being logged for your experiment as it runs.

Exposure logs

Whenever you request a parameter, an exposure is automatically logged. In a production environment, one would use caching (e.g., memcache) so that an exposure is logged only once per unit. SimpleExperiment logs an exposure once per instance.



In [ ]:

    
e = LoggedExperiment(userid=4)
print(e.get('x'))
print(e.get('y'))
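
The once-per-unit idea mentioned above can be sketched without PlanOut at all. In this minimal illustration, a plain Python set stands in for a shared cache such as memcache. The class name `ExposureDeduper`, its method `log_once`, and the record layout are hypothetical and are not part of PlanOut's API or its log format.

```python
# Hypothetical sketch: a set stands in for a shared cache such as memcache.
class ExposureDeduper(object):
    """Log at most one exposure per (experiment name, unit) pair."""

    def __init__(self, log_fn):
        self._seen = set()      # stand-in for a shared cache
        self._log_fn = log_fn   # callable that writes one log record

    def log_once(self, experiment_name, unit):
        key = (experiment_name, unit)
        if key in self._seen:
            return False        # already logged for this unit; skip
        self._seen.add(key)
        # The record layout below is made up for illustration.
        self._log_fn({'event': 'exposure', 'name': experiment_name, 'unit': unit})
        return True


records = []
deduper = ExposureDeduper(records.append)
deduper.log_once('LoggedExperiment', 4)   # first request: logged
deduper.log_once('LoggedExperiment', 4)   # repeat request: skipped
deduper.log_once('LoggedExperiment', 5)   # different unit: logged
print(len(records))
```

With a real cache, the membership check and the insert would need to happen as a single atomic operation so that concurrent requests do not both log.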

Manual exposure logging

Calling log_exposure() will force PlanOut to log an exposure event. You can optionally pass in additional data.



In [ ]:

    
e.log_exposure()
e.log_exposure({'endpoint': 'home.py'})

Event logging

You can also log arbitrary events. The first argument to log_event() is a required parameter that specifies the event type.



In [ ]:

    
e.log_event('post_status_update')
e.log_event('post_status_update', {'type': 'photo'})

Putting it all together

We simulate the components of a PlanOut-driven website and show how to analyze the data that the simulation generates.

This hypothetical experiment looks at the effect of sorting a music album's songs by popularity (instead of, say, track number) in a web-based music store.

Our website simulation consists of four main parts:

Code to render the web page (which uses PlanOut to decide how to display items)
Code to handle item purchases (this logs the "conversion" event)
Code to simulate the process of users' purchase decision-making
A loop that simulates many users viewing many albums



In [ ]:

    
class MusicExperiment(SimpleExperiment):
    def assign(self, params, userid, albumid):
        params.sort_by_rating = BernoulliTrial(p=0.2, unit=[userid, albumid])



In [ ]:

    
import random

def get_price(albumid):
    "look up the price of an album"
    # this would realistically hook into a database
    return 11.99

Rendering the web page



In [ ]:

    
def render_webpage(userid, albumid):
    'simulated web page rendering function'
    
    # get experiment for the given user / album pair.
    e = MusicExperiment(userid=userid, albumid=albumid)
    
    # use log_exposure() so that we can also record the price
    e.log_exposure({'price': get_price(albumid)})
    
    # use a default value with get() in production settings, in case
    # your experimentation system goes down
    if e.get('sort_by_rating', False):
        songs = "some sorted songs" # this would sort the songs by rating
    else:
        songs = "some non-sorted songs"
    
    html = "some HTML code involving %s" % songs  # most valid html ever.
    # render html

Logging outcomes



In [ ]:

    
def handle_purchase(userid, albumid):
    'handles purchase of an album'
    e = MusicExperiment(userid=userid, albumid=albumid)
    e.log_event('purchase', {'price': get_price(albumid)})
    # start album download

Generative model of user decision making



In [ ]:

    
def simulate_user_decision(userid, albumid):
    'simulate user experience'
    # This function should be thought of as simulating a user's decision-making
    # process for the given stimulus - and so we don't actually want to do any
    # logging here.
    e = MusicExperiment(userid=userid, albumid=albumid)
    e.set_auto_exposure_logging(False)  # turn off auto-logging
    
    # users with sorted songs have a higher purchase rate
    if e.get('sort_by_rating'):
        prob_purchase = 0.15
    else:
        prob_purchase = 0.10
    
    # make purchase with probability prob_purchase
    return random.random() < prob_purchase

Running the simulation



In [ ]:

    
# Simulate 500 users each visiting 20 albums, and their decisions to purchase
random.seed(0)
for u in range(500):
    for a in range(20):
        render_webpage(u, a)
        if simulate_user_decision(u, a):
            handle_purchase(u, a)

Loading data into Python for analysis

Data is logged to MusicExperiment.log. Each line is a JSON-encoded dictionary that contains information about the event type, inputs, and parameter assignments.



In [ ]:

    
# read one JSON record per line, closing the file when done
with open('MusicExperiment.log') as f:
    raw_log_data = [json.loads(line) for line in f]
raw_log_data[:2]

It's preferable to deal with the data as a flat set of columns. We use this handy-dandy function Eytan found on Stack Overflow to flatten nested dictionaries.



In [ ]:

    
# stolen from http://stackoverflow.com/questions/23019119/converting-multilevel-nested-dictionaries-to-pandas-dataframe
from collections import OrderedDict
def flatten(d):
    "Flatten a nested dict into a single-level OrderedDict"
    result = OrderedDict()
    for k, v in d.items():
        if isinstance(v, dict):
            result.update(flatten(v))
        else:
            result[k] = v
    return result
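
To make the behavior of flatten() concrete, here is a small usage sketch. The record below is hand-written for illustration and is not copied from a real log, so its field names are assumptions. The function is repeated so that the example runs on its own.

```python
from collections import OrderedDict

def flatten(d):
    "Flatten a nested dict into a single-level OrderedDict"
    result = OrderedDict()
    for k, v in d.items():
        if isinstance(v, dict):
            result.update(flatten(v))   # lift nested keys to the top level
        else:
            result[k] = v
    return result

# Hypothetical record, hand-written for illustration only.
record = {
    'event': 'purchase',
    'inputs': {'userid': 3, 'albumid': 7},
    'params': {'sort_by_rating': 1},
    'extra_data': {'price': 11.99},
}
flat = flatten(record)
print(sorted(flat.keys()))
```

Note that the names of the nested containers ('inputs', 'params', 'extra_data' in this sketch) are dropped, so if the same key appears at two levels, one value silently replaces the other.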

Here is what the flattened dataframe looks like:



In [ ]:

    
log_data = pd.DataFrame.from_dict([flatten(i) for i in raw_log_data])
log_data[:5]

Joining exposure data with event data

We first extract all user-album pairs that were exposed to an experimental treatment, along with their parameter assignments.



In [ ]:

    
all_exposures = log_data[log_data.event=='exposure']
unique_exposures = all_exposures[['userid','albumid','sort_by_rating']].drop_duplicates()

Tabulating the assignments of the user-album pairs, we find that the assignment proportions correspond to the design in MusicExperiment (p=0.2 for sort_by_rating).



In [ ]:

    
unique_exposures[['userid','sort_by_rating']].groupby('sort_by_rating').agg(len)
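
One way to judge whether tabulated counts are consistent with the designed probability is to compare the treated count with its binomial expectation. The following is a minimal sketch under that assumption. The function name `assignment_z_score` and the counts passed to it are hypothetical, not output from this notebook.

```python
import math

def assignment_z_score(n_treated, n_total, p_design):
    """z-score of the observed treated count under the designed probability."""
    expected = n_total * p_design
    std_dev = math.sqrt(n_total * p_design * (1.0 - p_design))
    return (n_treated - expected) / std_dev

# Hypothetical counts: 10000 user-album pairs (500 users x 20 albums),
# of which 2050 were assigned sort_by_rating=1.
z = assignment_z_score(n_treated=2050, n_total=10000, p_design=0.2)
print(round(z, 2))
```

A z-score with a large absolute value (say, beyond 3) would suggest that the logged assignments do not match the design and that the logging or the join deserves a closer look.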

Now we can merge with the conversion data.



In [ ]:

    
conversions = log_data[log_data.event=='purchase'][['userid', 'albumid','price']]
df = pd.merge(unique_exposures, conversions, on=['userid', 'albumid'], how='left')
df['purchased'] = df.price.notnull()
df['revenue'] = df.purchased * df.price.fillna(0)

Here is a sample of the merged rows. Most rows contain missing values for price, because the user didn't purchase the item.



In [ ]:

    
df[:5]

Here are the rows restricted to those who bought something:



In [ ]:

    
df[df.price > 0][:5]

Analyzing the experimental results



In [ ]:

    
df.groupby('sort_by_rating')[['purchased', 'price', 'revenue']].agg('mean')

If you were actually analyzing the experiment, you would want to compute confidence intervals.
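
As a rough illustration of what such an interval might look like, the sketch below computes a normal-approximation (Wald) 95% confidence interval for the difference in purchase rates between the two groups. The function name `diff_in_proportions_ci` and the counts are hypothetical. The approximation also assumes independent observations, which may not hold here because each user appears in many user-album pairs.

```python
import math

def diff_in_proportions_ci(x_t, n_t, x_c, n_c, z=1.96):
    """Wald interval for (treatment rate - control rate).

    x_t, x_c: number of purchases in treatment and control
    n_t, n_c: number of exposed pairs in treatment and control
    z:        normal quantile (1.96 gives roughly 95% coverage)
    """
    p_t = x_t / float(n_t)
    p_c = x_c / float(n_c)
    diff = p_t - p_c
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    return diff - z * se, diff + z * se

# Hypothetical counts, chosen only to illustrate the calculation.
low, high = diff_in_proportions_ci(x_t=300, n_t=2000, x_c=800, n_c=8000)
print(low, high)
```

If the interval excludes zero, the data are inconsistent with no difference in purchase rates at roughly the 5% level, under the assumptions stated above. Treating each user as a cluster (for example, with a bootstrap over users) would be a more conservative choice.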