This paper discusses the OEC (overall evaluation criterion) in more detail. Additional techniques are described here: https://storage.googleapis.com/supplemental_media/udacityu/3954679115/additional_techniques.pdf
We look at total active cookies and we see a spike. Is the spike due to weekly variation? A good way to check is a week-over-week plot, i.e. plotting each day's value against the value on the same weekday of the previous week. The spike is still there, so it is not due to weekly variation.
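As a sketch of that check (with made-up daily counts), a week-over-week plot just divides each day's count by the count from the same weekday one week earlier:

```python
# Hypothetical daily active-cookie counts for two weeks,
# with an artificial spike on day 11.
daily_cookies = [1000, 1020, 980, 1010, 990, 600, 610,    # week 1
                 1005, 1015, 985, 2200, 995, 605, 615]    # week 2

# Week-over-week ratio: each day divided by the same weekday a week earlier.
wow_ratio = [cur / prev for cur, prev in zip(daily_cookies[7:], daily_cookies[:7])]
print(wow_ratio)
```

A purely weekly pattern (like the weekend dips above) divides out to roughly 1 everywhere, so if the ratio hovers near 1 except on the spike day, weekly seasonality is ruled out.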
We can also break the data down by geography, and we see the spike comes from a specific country. At this point, talking the case through with the engineering team would be helpful.
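A minimal sketch of that breakdown, with hypothetical per-country counts for the spike day against each country's recent baseline ("XX" stands in for the country driving the spike):

```python
# Hypothetical counts, chosen to illustrate a single-country spike.
spike_day = {"US": 1100, "UK": 320, "DE": 280, "XX": 1250}
baseline  = {"US": 1080, "UK": 310, "DE": 290, "XX": 90}

for country, count in spike_day.items():
    ratio = count / baseline[country]
    if ratio > 1.5:  # flag countries far above their own baseline
        print(country, round(ratio, 2))
```

Only the one anomalous country gets flagged; the others sit near their baselines.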
If you suspect a click-tracking issue, how can it be identified?
Plotting CTR and CTP on the same graph only tells us that they move in the same direction, nothing more.
The comparison below looks suspicious, but on its own it is not conclusive, since user behaviour is expected to differ across platforms. Both desktop and mobile show similar results, as you would expect for CTP.
The CTP is lower than the CTR. That by itself is expected, since a pageview with several clicks counts once for CTP but contributes every click to CTR. It should be only slightly lower, though; it should not be significantly lower unless you expect your users to be clicking multiple times.
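The distinction is easiest to see in code. A sketch with made-up click logs, where each entry is a pageview and the number of clicks it received:

```python
# Hypothetical click log: (pageview_id, clicks on that pageview).
pageviews = [("a", 0), ("b", 1), ("c", 3), ("d", 1), ("e", 0)]

clicks = sum(n for _, n in pageviews)                     # every click counts
views_with_click = sum(1 for _, n in pageviews if n > 0)  # each pageview counts once

ctr = clicks / len(pageviews)            # click-through rate
ctp = views_with_click / len(pageviews)  # click-through probability

print(ctr, ctp)  # prints: 1.0 0.6
```

CTR is always at least CTP; here the gap comes from the pageview with three clicks, the kind of multi-click behaviour (or a click-tracking bug firing duplicate events) that pushes the two apart.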
Let’s talk about some common distributions that come up when you look at real user data.
For example, let’s measure the rate at which users click on a result on our search page; analogously, we could measure the average stay time on the results page before traveling to a result. In the first case, you’d probably see what we call a Poisson distribution; in the second, the stay times would be exponentially distributed.
Another common distribution of user data is a “power-law,” Zipfian or Pareto distribution. That basically means that the probability of a more extreme value, z, decreases like 1/z (or 1/z^exponent). This distribution also comes up in other settings, such as the frequency of words in a text (the most common word is really, really common compared to the next word on the list). These types of heavy-tailed distributions are common in internet data.
Finally, you may have data that is a composition of different distributions - latency often has this characteristic because users on fast internet connections form one group and users on dial-up or cell phone networks form another. Even on mobile phones you may have differences between carriers, or newer cell phones vs. older text-based displays. This forms what is called a mixture distribution, which can be hard to detect or characterize well.
The key here is not to necessarily come up with a distribution to match if the answer isn’t clear - that can be helpful - but to choose summary statistics that make the most sense for what you do have. If you have a distribution that is lopsided with a very long tail, choosing the mean probably doesn’t work for you very well - and in the case of something like the Pareto, the mean may be infinite!
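To illustrate that last point (with simulated data, not real user data), here is how the mean and median behave on a Pareto-like sample whose theoretical mean barely exists:

```python
import random
import statistics

random.seed(0)
# paretovariate(alpha) samples a Pareto distribution with minimum 1;
# alpha = 1.1 puts us close to the infinite-mean regime.
samples = [random.paretovariate(1.1) for _ in range(100_000)]

median = statistics.median(samples)  # stable, near 2**(1/1.1), about 1.88
mean = statistics.mean(samples)      # dominated by a handful of huge values
print(median, mean)
```

Rerunning with a different seed barely moves the median, while the mean can swing wildly between runs; that is the sense in which the mean "doesn't work" as a summary for a lopsided, long-tailed distribution.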
We build density graphs for various videos from their latency histograms. We can see that the shape of the distribution is similar across the videos. Now, to choose a metric, we plot a bunch of candidate summary statistics. The ones that move around across comparable videos are not robust; here that rules out the 90th and 99th percentiles.
Now say we plot the same graph, but for videos with different resolutions. Latency should go down for the lower-resolution videos. From this we can see that the median and mean are not sensitive enough: they barely move even though latency genuinely changed.
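A toy version of the sensitivity check, with made-up latency samples in milliseconds; assume the lower-resolution video mostly speeds up the slowest loads:

```python
import statistics

# Hypothetical page-load latencies (ms); only the tail differs.
high_res  = [100, 120, 130, 140, 150, 160, 170, 180, 800, 1500]
lower_res = [100, 120, 130, 140, 150, 160, 170, 180, 300, 400]

def p90(xs):
    # crude 90th-percentile estimate: value at the 90% position
    return sorted(xs)[int(0.9 * len(xs)) - 1]

print(statistics.median(high_res), statistics.median(lower_res))  # prints: 155.0 155.0
print(p90(high_res), p90(lower_res))                              # prints: 800 300
```

The median is identical for both samples, so it is not sensitive to a real change in the tail; the 90th percentile does see the change, but, as noted above, high percentiles can also jump around between comparable videos, which is the robustness problem.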
Suppose you run an experiment where you measure the number of visits to your homepage, and you measure 5000 visits in the control and 7000 in the experiment. Then the absolute difference is the result of subtracting one from the other, that is, 2000. The relative difference is the absolute difference divided by the control metric, that is, 40%.
For probability metrics, people often use percentage points to refer to absolute differences and percentages to refer to relative differences. For example, if your control click-through-probability were 5%, and your experiment click-through-probability were 7%, the absolute difference would be 2 percentage points, and the relative difference would be 40 percent. However, sometimes people will refer to the absolute difference as a 2 percent change, so if someone gives you a percentage, it's important to clarify whether they mean a relative or absolute difference!
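Both examples from the text, spelled out:

```python
# Count metric: 5000 visits in control, 7000 in experiment.
control_visits, experiment_visits = 5000, 7000
abs_diff = experiment_visits - control_visits   # absolute difference: 2000
rel_diff = abs_diff / control_visits            # relative difference: 0.4, i.e. 40%

# Probability metric: 5% vs 7% click-through-probability.
control_ctp, experiment_ctp = 0.05, 0.07
abs_pp  = (experiment_ctp - control_ctp) * 100                 # ~2 percentage points
rel_pct = (experiment_ctp - control_ctp) / control_ctp * 100   # ~40 percent

print(abs_diff, rel_diff, round(abs_pp, 2), round(rel_pct, 2))  # prints: 2000 0.4 2.0 40.0
```

Note how the same 40% relative change corresponds to very different-looking absolute numbers, which is why it pays to clarify which one is meant.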
To calculate a confidence interval we need:
- a well-defined metric (metrics that seem to make perfect business sense may not be good enough as metrics)
- data collection
- variability (an estimate of the metric's variability)
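As a sketch of where those ingredients go, here is a 95% confidence interval for a click-through probability; the numbers are made up, and the normal approximation to the binomial is assumed:

```python
import math

clicks, pageviews = 120, 1000   # the collected data
p_hat = clicks / pageviews      # the metric: estimated click-through probability

# Variability: standard error of a proportion under the normal approximation.
se = math.sqrt(p_hat * (1 - p_hat) / pageviews)
margin = 1.96 * se              # z-multiplier for a 95% interval

print(round(p_hat - margin, 4), round(p_hat + margin, 4))  # prints: 0.0999 0.1401
```

With 1000 pageviews, the interval around a 12% estimate is roughly plus or minus 2 percentage points; more data shrinks the standard error and tightens the interval.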