Lesson 03 - Choosing and Characterizing Metrics

Metric Definition

  • How are you going to use the metric?
    • invariant
      • is the number of users the same?
      • is the distribution of users (age, gender, geo, etc.) the same?
    • evaluation
      • business metric
        • how much money do you make
        • how many users you have
        • how many students get jobs after the course
      • detailed metric
        • how long people stay on page
        • useful for digging into problems e.g. UX research for why users are not finishing the class
  • defining metric
    • high level concept of a metric => 1 sentence summary that anyone can understand
      • active users
      • click through probability
    • details of metric
      • active
        • 7 day active?
        • 28 day active?
        • which events count towards activity
    • a number to measure this
      • sum
      • count
  • how many metrics?
    • there may be multiple sanity (invariant) metrics
    • evaluation metric
      • may be single. Some companies do this
      • may be multiple
      • leaders may want multiple metrics to see how things move
    • can create a composite metric
      • called objective function or OEC (Overall Evaluation Criterion)
      • people shy away from it because
        • getting everyone to agree on a definition is difficult
        • may over-optimize for one component at the expense of the others
        • when people ask why it is moving, you may have to look at the individual metrics anyway
    • sometimes a metric that is less optimal for your current test but works well enough across a suite of tests is more suitable, because results stay comparable; introducing anything custom introduces risk

This paper discusses the OEC, or overall evaluation criterion, in more detail.

High level concepts for Metrics

Business Objective

  • Help students get jobs
  • Financial sustainability of the company itself

  • May want to keep track across platforms, i.e. people may start on Android and continue on desktop
  • each stage of the funnel is a metric, i.e. how many people reached that point
  • may want to track the exact number at some key points and rates (or probabilities) for all other stages
    • because we want to track how many people enter the funnel (as an invariant metric) and then increase the rate at which people progress down the funnel (as an evaluation metric)

Difficult Metrics (difficult to measure)

  • Don't have access to the data
  • Will take too long to collect the data
  • e.g.
    • Rate of returning for 2nd course
      • Data present but will take too long
    • Average happiness of students
      • Do not have data
    • Probability of finding information via search
      • Do not have data

Other techniques for defining metrics

  • for generating new ideas for metrics as well as validating your existing metrics
  • e.g. survey, retrospective analysis, focus groups
  • external data
    • companies that collect fairly granular data about market share, verticals
    • companies that put together data for users that visit websites, like comScore, Nielsen
    • companies that run surveys of users
    • academic papers
      • may run experiments, correlation studies
      • e.g. for eye tracking you may have a study
        • get people into a lab and make them read a page
        • collect data about
          • how much time it took to read the page
          • how much time it took to turn pages
    • great in brainstorming
      • may give you an idea about how to analyze your data
    • great in validating
      • my numbers are X but my whole industry has Y which is totally different. So there is something going on
  • internal data
    • already gathered data or data that needs to be gathered
    • data captured from logs / logging data
      • how metrics that you are interested in change in response to
        • changes you made in the past
        • experiments that you have run
        • big spikes in your business
      • e.g. if in a class a lot of students are getting stuck on a quiz, then you may want to go and check your logs to see whether the quiz is taking more time for everyone or just a few students
      • these analyses show you correlation, not causation
  • talk with your colleagues
    • may have worked in different companies, in research and may have ideas about metrics that can be used
  • take into account company culture while defining business metrics
    • some may want to add more users
    • some may want to make existing users happy
  • user experience research
    • good for brainstorming
    • in-depth information
    • special equipment can be used, e.g. eye-tracking equipment to see what users look at even when they do not click on it
  • focus groups
    • bunch of users for group discussions
    • risk of group think
  • surveys
    • useful for metrics that you cannot directly measure
    • cannot be directly compared with any other data source

Additional techniques for the same are present here https://storage.googleapis.com/supplemental_media/udacityu/3954679115/additional_techniques.pdf

Applying other techniques

  • homepage views can be compared with external sources such as Hitwise, comScore etc.
  • if completion is low then user experience research can be done
    • if the video takes time to load then latency can be measured
    • if things are difficult to find then they can be moved around
  • if data is not present e.g. how many people got jobs then surveys can be used

Metric Definition and data capture

  • which data to look at?
    • do we filter data?
    • which events to count for numerator and denominator
    • summary statistics?
  • build intuition
    • what changes in your data and metrics your system can/cannot produce
      • a 10% change in CTP is unrealistic; most probably a bug in the experiment
  • how to define metrics?
    • decide which of the events count?
      • total clicks/total pageviews?
      • unique clicks per cookie / unique pageviews per cookie
    • other things
      • page load but no click
      • page load and then a click 15 minutes later
        • is the lag important?
        • if the two events fall on different days then do we consider them related or not?
      • may want to look at the data at the day level, by time of day, by minute; weekday vs. weekend effects
      • technology being used
        • some browsers do not support JavaScript, so events may not come through at all

Example of defining metric

  • High-level metric: click-through probability
  • Def 1 (Cookie probability): For each time interval, number of cookies that click divided by number of cookies
  • Def 2 (Pageview probability): Number of pageviews with a click within the time interval divided by number of pageviews
  • Def 3 (Rate): Number of clicks divided by number of pageviews (see the sketch below for computing all three)
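
The three definitions can give different numbers on the same data. A minimal sketch on a made-up event log, assuming pandas and hypothetical column names (cookie_id, pageview_id, event); the time-interval grouping from Def 1 and Def 2 is omitted for brevity:

```python
import pandas as pd

# Hypothetical event log: one row per event, tagged with the cookie that
# generated it and the pageview it belongs to (names are illustrative only).
events = pd.DataFrame({
    "cookie_id":   ["a", "a", "a", "b", "b", "c"],
    "pageview_id": [1, 1, 2, 3, 3, 4],
    "event":       ["pageview", "click", "pageview", "pageview", "click", "pageview"],
})

pageviews = events[events["event"] == "pageview"]
clicks = events[events["event"] == "click"]

# Def 1 (cookie probability): cookies that clicked / cookies that had a pageview
cookie_prob = clicks["cookie_id"].nunique() / pageviews["cookie_id"].nunique()

# Def 2 (pageview probability): pageviews with at least one click / all pageviews
pageview_prob = clicks["pageview_id"].nunique() / pageviews["pageview_id"].nunique()

# Def 3 (rate): total clicks / total pageviews
rate = len(clicks) / len(pageviews)

print(cookie_prob, pageview_prob, rate)  # ~0.667, 0.5, 0.5 on this toy log
```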

Filtering and Segmenting

  • filtering may need to happen due to external traffic
    • suspicious traffic needs to be filtered out or at least flagged
  • need to filter if your results only affect a subset of users
    • only English users
    • only Mobile users
  • filtering to only affected users can increase power and sensitivity of experiment
  • done to de-bias the data
  • need to take care that it does not introduce bias in your data
    • say a metric can only be measured on logged-in users; then you may bias the data, as some new or non-committal users may not have created an account, so you are biasing your data towards more committed users
  • one way to check is to slice your data before deciding whether to apply the filter (see the sketch after this list)
    • say by geography, language, platform
    • calculate the metric for each slice separately
    • if spam is coming from a particular country then it makes sense to filter it out; but if the filtering removes traffic disproportionately from some slices, the bias may be increasing
  • Build intuition. Get an idea of what is expected and what is unexpected
  • look at data over time in slices, see if there are patterns
  • looking at data as per segments can be useful as it can be good for evaluating definitions (looking at the data segment-wise) and helps in building intuition
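
A minimal sketch of slicing a metric before deciding on a filter, assuming a per-pageview log with made-up column names; if one slice looks wildly different, or a proposed filter would remove traffic mostly from one slice, investigate before applying it globally:

```python
import pandas as pd

# Hypothetical per-pageview log with a click indicator and two segment columns.
df = pd.DataFrame({
    "country":  ["US", "US", "IN", "IN", "DE", "DE"],
    "platform": ["web", "mobile", "web", "web", "mobile", "web"],
    "clicked":  [1, 0, 1, 1, 0, 1],
})

# Click-through probability and traffic volume per slice.
print(df.groupby("country")["clicked"].agg(["mean", "count"]))
print(df.groupby("platform")["clicked"].agg(["mean", "count"]))
```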

We look at total active cookies and we see a spike. Is the spike due to weekly variation? A good way to check is to look at a week-over-week plot, i.e. plot each day's data compared to the same day of the previous week. The spike is still there. Hence, it is not due to weekly variation.
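
A small sketch of that week-over-week check on made-up daily counts: dividing each day by the same weekday one week earlier flattens a purely weekly pattern to roughly 1.0, so a genuine spike still stands out.

```python
import pandas as pd

# Hypothetical daily counts of active cookies: a weekday/weekend pattern plus one real spike.
dates = pd.date_range("2024-01-01", periods=28, freq="D")
counts = pd.Series(10_000 + 2_000 * (dates.dayofweek < 5), index=dates, dtype=float)
counts.iloc[20] *= 1.5  # a genuine spike on one day

# Week-over-week ratio: each day divided by the same weekday one week earlier.
wow = counts / counts.shift(7)
print(wow.round(2).tail(14))
```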

We can look at the data by geography and we see it is due to a specific country. At this point, talking with the engineering team about the case would be helpful.

If you suspect a click-tracking issue then how can that be identified?

Having both CTR and CTP on same graph just tells us that they are in the same direction, nothing more.

The below is suspicious but still not enough as the user behaviour is expected to be different on different platforms.

Both desktop and mobile have similar results as expected from CTP.

The CTP here is noticeably lower than the CTR, which is not exactly what you would expect: CTP should be only slightly lower, not significantly lower, unless your users are clicking multiple times per pageview.

Summary Metrics

  • The metrics used so far are direct data measurements
    • page view, clicks
  • for many cases the summary metric is obvious
    • the CTR, CTP examples have averages built-in
    • if we are counting the unique number of cookies then it is the sum
  • other cases the summary metric is not obvious
    • when the measurement is itself a number
      • load time of a video
      • how many terms in a query
      • position of first click on the page
    • can choose from many - mean, median, 25th percentile, 90th percentile etc.
    • you want your metric to be sensitive enough to measure the effect of feature change
    • the distribution of the metric will help you choose (see the sketch after this list)
      • compute a histogram
      • if normal shape then mean/median is going to make more sense
      • if lopsided then 25th, 75th, 90th percentile
    • categories of metrics
      • sums and counts
        • e.g. count users who visit page
      • distributional like mean, median etc.
        • e.g. mean age of users, mean latency of page
      • probabilities, rates
      • ratios
        • business metrics make a lot of sense here
        • e.g. Prob of revenue generating clicks vs. prob. of any click
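
A minimal sketch of this workflow on simulated load times: compute a few candidate summaries and a histogram, and let the shape of the distribution guide which summary to use. The lognormal data here is just a stand-in for a skewed, long-tailed metric.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated page-load times (seconds): lognormal, i.e. lopsided with a long tail.
load_times = rng.lognormal(mean=0.5, sigma=0.8, size=10_000)

# Candidate summary statistics; for a skewed distribution the mean is pulled
# towards the tail, so the median or high percentiles are usually more informative.
print("mean:  ", np.mean(load_times))
print("median:", np.median(load_times))
print("90th:  ", np.percentile(load_times, 90))
print("99th:  ", np.percentile(load_times, 99))

# A quick histogram (counts per bin) reveals the shape without any plotting library.
counts, edges = np.histogram(load_times, bins=20)
print(counts)
```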

Common distributions in online data

Let’s talk about some common distributions that come up when you look at real user data.

For example, let’s measure the rate at which users click on a result on our search page; analogously, we could measure the average stay time on the results page before traveling to a result. In this case, you’d probably see what we call a Poisson distribution, or that the stay times would be exponentially distributed.

Another common distribution of user data is a “power-law,” Zipfian or Pareto distribution. That basically means that the probability of a more extreme value, z, decreases like 1/z (or 1/z^exponent). This distribution also comes up in other rare events such as the frequency of words in a text (the most common word is really really common compared to the next word on the list). These types of heavy-tailed distributions are common in internet data.

Finally, you may have data that is a composition of different distributions - latency often has this characteristic because users on fast internet connection form one group and users on dial-up or cell phone networks form another. Even on mobile phones you may have differences between carriers, or newer cell phones vs. older text-based displays. This forms what is called a mixture distribution that can be hard to detect or characterize well.

The key here is not to necessarily come up with a distribution to match if the answer isn’t clear - that can be helpful - but to choose summary statistics that make the most sense for what you do have. If you have a distribution that is lopsided with a very long tail, choosing the mean probably doesn’t work for you very well - and in the case of something like the Pareto, the mean may be infinite!
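
A quick illustration of that last point, with simulated data: for a Pareto-like distribution with a heavy enough tail, the sample mean keeps drifting upward as the sample grows, while the sample median settles down.

```python
import numpy as np

rng = np.random.default_rng(1)
# numpy's pareto() draws the shifted (Lomax) form; adding 1 gives the classic
# Pareto with minimum 1. With shape a <= 1 the theoretical mean is infinite.
for n in (1_000, 100_000, 10_000_000):
    x = rng.pareto(a=0.8, size=n) + 1.0
    print(n, "mean:", round(x.mean(), 1), "median:", round(float(np.median(x)), 2))
```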

Sensitivity and Robustness of metrics

  • sensitivity means it moves when interesting things happen
  • robustness means it doesn't move when non-interesting things happen
  • in case of load times
    • mean can be used but any outlier load time which is quite high will cause the mean to move a lot. It is not robust
    • median can be used as it won't move much due to few changes but that is a weakness also. If a small fraction of users are affected then it will not move much
    • maybe the 90th or 99th percentile is a better option in this case
  • how to measure
    • do an experiment
      • e.g. increase quality of video. The load time should increase. See whether the metric moves as we want it to move
    • do an A/A test
      • Compare without changing anything.
      • To ensure that the metric is not picking up spurious signals and does not jump around (see the sketch after this list)
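
A minimal simulation of both checks on made-up lognormal load times: an A/A comparison, where a robust metric should barely move, and a deliberate 10% change, which a sensitive metric should clearly pick up. This only sketches the procedure; real data is needed to see which summaries actually pass.

```python
import numpy as np

rng = np.random.default_rng(2)

def summaries(x):
    return {"mean": np.mean(x), "median": np.median(x), "p90": np.percentile(x, 90)}

# A/A: two samples of load times from the same distribution.
a1 = rng.lognormal(1.0, 0.6, 5_000)
a2 = rng.lognormal(1.0, 0.6, 5_000)

# A/B: the "experiment" serves lower-resolution video, so load times drop ~10%.
b = rng.lognormal(1.0, 0.6, 5_000) * 0.9

for name in ("mean", "median", "p90"):
    aa_shift = abs(summaries(a1)[name] - summaries(a2)[name])   # robustness: should be small
    ab_shift = abs(summaries(a1)[name] - summaries(b)[name])    # sensitivity: should be clear
    print(name, "A/A shift:", round(aa_shift, 3), "A/B shift:", round(ab_shift, 3))
```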

We make the density graphs of various videos based on the histograms

We can see the shape of the distribution is similar for the videos. Now to choose metrics we plot a bunch of them.

The ones that are moving around are not robust enough for comparable videos. So 90th and 99th percentile are not robust enough

Say we plot the same graph but for videos with different resolutions. The latency should go down for videos with lower resolution

By this we can see that the median and mean are not sensitive enough.

Absolute or Relative Difference

  • How to compute the comparison between your experiment and control groups
  • absolute difference
    • if you are starting out
  • relative difference
    • has the benefit that you need to choose only 1 practical significance boundary. You don't have to change it as the system changes
    • has the problem that ratios are not as well behaved as absolute differences

Absolute vs. relative difference

Suppose you run an experiment where you measure the number of visits to your homepage, and you measure 5000 visits in the control and 7000 in the experiment. Then the absolute difference is the result of subtracting one from the other, that is, 2000. The relative difference is the absolute difference divided by the control metric, that is, 40%.

Relative differences in probabilities

For probability metrics, people often use percentage points to refer to absolute differences and percentages to refer to relative differences. For example, if your control click-through-probability were 5%, and your experiment click-through-probability were 7%, the absolute difference would be 2 percentage points, and the relative difference would be 40 percent. However, sometimes people will refer to the absolute difference as a 2 percent change, so if someone gives you a percentage, it's important to clarify whether they mean a relative or absolute difference!
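
A tiny snippet restating the arithmetic above, just to keep the percentage-point vs. percent distinction explicit:

```python
control_ctp = 0.05
experiment_ctp = 0.07

absolute_diff = experiment_ctp - control_ctp        # 0.02
relative_diff = absolute_diff / control_ctp         # 0.40

print(f"absolute difference: {absolute_diff * 100:.0f} percentage points")  # 2 percentage points
print(f"relative difference: {relative_diff * 100:.0f} percent")            # 40 percent
```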

Variability

  • If a metric varies a lot under normal circumstances then it may not be a good choice, because the practical significance level may not be achievable with that metric
  • For well-behaved data and simple metrics, confidence intervals can be calculated analytically. For messier data or complex metrics (like ratios and percentiles), it needs to be done empirically

Calculating Variability

To calculate confidence interval we need

  • variance (or standard deviation)
  • distribution

  • Variability of different metrics may vary a lot
  • Variability of a metric depends on the underlying distribution
  • Variability can be computed empirically or analytically (a small analytic example follows below)
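
As a rough illustration of the analytic route, a minimal sketch for a probability metric such as CTP, using the normal approximation to the binomial; the counts are made up. For ratios or percentiles this approximation generally does not hold, which is when the empirical approach becomes necessary.

```python
import numpy as np

# Analytic 95% confidence interval for a probability metric (e.g. CTP),
# using the normal approximation to the binomial.
clicks, pageviews = 300, 5_000
p_hat = clicks / pageviews
se = np.sqrt(p_hat * (1 - p_hat) / pageviews)   # standard error of the proportion
z = 1.96                                        # z-value for 95% confidence
print(f"{p_hat:.4f} +/- {z * se:.4f}")
```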

Non-Parametric Methods

  • Metrics can be analyzed without assuming an underlying distribution
  • Sign test - run the test multiple times, note the number of times there was a difference in the expected direction, and work out how likely that many differences were to have occurred by chance (see the sketch below)
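
As a sketch of the sign test, suppose the experiment beat the control in 9 out of 12 independent runs (days, say). The question is how likely a split at least that lopsided would be if each run were a fair coin flip; scipy's binomial test answers exactly that.

```python
from scipy.stats import binomtest

# Experiment beat control in 9 of 12 runs; under the null hypothesis of
# "no difference", each run is a 50/50 coin flip.
result = binomtest(k=9, n=12, p=0.5, alternative="two-sided")
print(result.pvalue)  # probability of a split at least this extreme by chance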

Calculating variance empirically

  • calculating variance analytically means making assumptions about the underlying data
  • for simple distributions we can make those assumptions, but for complex distributions it is better to calculate the variance empirically
  • Uses of A/A tests
    • compare results to what you expect (sanity check w.r.t. the analytic estimate)
      • example present here
    • estimate the variance empirically, then use an assumption about the distribution to calculate the confidence interval
    • directly estimate the confidence interval without any distributional assumptions (see the sketch below)
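
A minimal sketch of the empirical route, assuming you have recorded the metric difference from a number of A/A runs (or bootstrapped A/A splits); the numbers below are simulated stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)

# Differences in CTP between the two identical groups of 40 A/A experiments.
aa_diffs = rng.normal(0.0, 0.004, size=40)   # stand-in for real A/A results

# Option 1: estimate the standard deviation empirically, then assume normality.
sd = aa_diffs.std(ddof=1)
print("95% CI half-width (normal assumption):", 1.96 * sd)

# Option 2: skip the distributional assumption and read the interval straight
# off the empirical A/A differences.
lo, hi = np.percentile(aa_diffs, [2.5, 97.5])
print("empirical 95% interval:", lo, hi)
```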

Lessons Learnt

  • Coming up with metrics & validating them can be tricky and may take more time than running the experiment itself
  • metrics that seem to make perfect business sense may not be good enough as a metric

  • data collection

    • click through rate
      • remove spam/don't remove spam
      • particular region or all region
      • impression or page view
      • first page or all next pages
    • a lot of considerations are needed to define these precisely
    • latency
      • first byte load
      • last byte load
      • how did you measure the load time
  • sensitivity/robustness
    • latency
      • mean does not move at all as people have different connections
      • higher percentiles make more sense here
    • search
      • tasks per user per day
        • does it have to be a day or a week? your searches may be more consistent week to week than day to day

  • variability
    • start analytically and start getting a feel for your data
    • sanity checking is very important
      • in the Google search results test, having latency as an invariant metric made all of it much more sensible
    • what makes a bad query result?
      • a user clicking on 2 or 3 results or going to the next page? Not really
      • porn search queries behave like that
      • people use search for navigational purposes
      • when looking for a person we may click on LinkedIn, Twitter, Facebook etc.