It would be nice to draw a Judea Pearl-type DAG for this.

Let's say we're interested in predicting a college-football game. What are all the things that influence the outcome? Here's a list of things that come to mind:

  • Team A's offensive strength ($A_o$).
  • Team B's offensive strength ($B_o$).
  • Team A's defensive strength ($A_d$).
  • Team B's defensive strength ($B_d$).
  • Team A's special-teams strength ($A_s$).
  • Team B's special-teams strength ($B_s$).
  • Team A's "heart and determination" ($A_h$).
  • Team B's "heart and determination" ($B_h$).
  • Home-field advantage ($H$).
  • Referees ($R$).
  • Other influences, which I'll call The X Factor ($X$).

Obviously, this list is incomplete: there are missing variables (perhaps each team's previous-week result), and some variables are aggregates of more finely grained variables (for example, offensive ability is a combination of passing ability and rushing ability). But to keep things simple, pretend that only these variables determine the outcome of football games, and that they do so in the following way:

$MOV = (A_o − B_d) − (B_o − A_d) + (A_s − B_s) + (A_h − B_h) + H + R + X$,

where $MOV$ is Team A's margin of victory. $MOV$ can take positive and negative values. A negative $MOV$ means Team B wins.

We can use this equation to make predictions. For example, given two equal teams ($A_o = B_o$, $A_d = B_d$, $A_s = B_s$, and $A_h = B_h$) and unbiased refs ($R = 0$), A will win by $H + X$ points.
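
Here's the data-generating process as a quick Python sketch; the team-strength numbers below are made up purely for illustration:


In [ ]:
def mov(A_o, B_o, A_d, B_d, A_s, B_s, A_h, B_h, H=0, R=0, X=0):
    """Team A's margin of victory, per the data-generating process above."""
    return (A_o - B_d) - (B_o - A_d) + (A_s - B_s) + (A_h - B_h) + H + R + X

# Two equal teams and unbiased refs (R = 0): A wins by H + X points.
print(mov(A_o=20, B_o=20, A_d=10, B_d=10, A_s=5, B_s=5, A_h=3, B_h=3, H=3, X=1))  # prints 4 = H + X
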

Equations require consistency of units:

  1. To be equal, quantities must have the same units. Five oranges do not equal five apples, and five miles do not equal five miles per hour. This means the left- and right-hand sides of an equation must have the same units. In our football equation, the left-hand side is expressed in points, so the right-hand side must be measured in points too.
  2. We can only add and subtract quantities with the same units. Because the right-hand side is a sum measured in points, each term of the sum must itself be measured in points.

We call this precisely defined relationship between the causes (on the right) and the effect (on the left) the data-generating process. This equation is easily interpreted:

  1. A one-point change in any of these causes changes $MOV$ by one point; for example, increasing $H$ from 2 to 3, holding the other causes constant, increases $MOV$ by 1 (demonstrated in the code after this list).
  2. $(A_o − B_d)$: A's offensive contribution to $MOV$ depends not only on A's offensive strength but also on B's defensive strength. Each additional point of A's offensive strength increases $MOV$ by one, and each additional point of B's defensive strength decreases $MOV$ by one.
  3. $− (B_o − A_d)$: B's offensive contribution to $MOV$ depends not only on B's offensive strength but also on A's defensive strength. Each additional point of B's offensive strength decreases $MOV$ by one, and each additional point of A's defensive strength increases $MOV$ by one.
  4. $(A_s − B_s)$: A's special-teams contribution to $MOV$ depends not only on A's special-teams strength but also on B's special-teams strength. Each additional point of A's special-teams strength increases $MOV$ by one, and each additional point of B's special-teams strength decreases $MOV$ by one.
  5. $(A_h − B_h)$: A's heart and determination contributes to $MOV$ only as much as it exceeds B's heart and determination. If B's heart exceeds A's heart, this term is negative.
  6. $H$: conventional wisdom tells us that this term is positive if A is home and negative if A is away. An $H$ of 3 means that the home team gets the equivalent of an extra field goal by playing at home.
  7. $R$: The refs can be biased. A positive $R$ means the refs make calls in favor of A, and a negative $R$ means the refs make calls in favor of B.
  8. $X$: This is a catch-all. Many things can affect the outcome of the game, such as weather, injuries, and unlucky bounces. $X$ captures all of these influences. $X$ is positive if, in the aggregate, these things help A and negative if these things hurt A.
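
Using the hypothetical `mov` sketch above, we can check the one-point marginal effects directly; for example, raising $H$ from 2 to 3 while holding everything else fixed:


In [ ]:
# Marginal effect of home-field advantage: raising H from 2 to 3,
# holding the other causes constant, raises MOV by exactly 1.
base = dict(A_o=20, B_o=20, A_d=10, B_d=10, A_s=5, B_s=5, A_h=3, B_h=3, R=0, X=0)
print(mov(H=3, **base) - mov(H=2, **base))  # prints 1
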

Why Statistics

If we perfectly knew the values for each part of the right-hand side of the equation, we could perfectly predict the result of each game. Unfortunately, we never know those values with certainty, so we have to treat them as random variables and reason statistically.

Random Variable

A random variable or stochastic variable is, roughly speaking, a variable whose value results from a measurement on some type of random process.

It is easy to confuse random variables with algebraic variables, but the two differ. An algebraic variable is deterministic: it can take multiple values, but given the inputs to the deterministic process, only one value is possible. The value of a random variable is at least partly determined by a random process: even if a deterministic process underlies it, knowing the inputs to that process is not enough to know the variable's value with certainty. Here are a few examples:

  1. Algebraic variable $y$: $y = 2x + 3$. If we know $x$, then we know $y$ with certainty. If $x = 2$, $y$ must equal 7. If $x = 1$, $y$ cannot equal anything but 5. An algebraic variable like this has a two-way functional relationship: we can calculate $x$ given $y$ ($x = (y - 3)/2$).
  2. Random variable $y$: nature assigns $y$ such that $P(y = 1) = .5$ and $P(y = 0) = .5$. In this example, $y$ can take the value 0 or 1. No deterministic process determines whether $y$ will equal 1 or 0, so knowing $x$ (or any other potential input) does not tell us with certainty what value $y$ will take. Unless the process assigns an outcome with probability 1, it is random.
  3. Random variable $z$: $z = 2x + 3 + y$, with $y$ as in the previous example. Knowing $x$ does not tell us with certainty what value $z$ will take: if $x = 1$, $z$ could equal 5 or 6 (with equal probability); if $x = 4$, $z$ could equal 11 or 12 (with equal probability). Note that a function of a random variable is itself a random variable (simulated in the code below).
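
A short simulation makes example 3 concrete ($x$ held fixed at 1, $y$ drawn as a fair coin):


In [ ]:
import numpy

# y is Bernoulli(.5); z = 2x + 3 + y is random even though x is fixed.
x = 1
y = numpy.random.randint(0, 2, size=10)  # ten draws of y, each 0 or 1
z = 2*x + 3 + y  # each entry is 5 or 6 with equal probability
print(z)
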

Often we treat deterministic processes as random because it is simpler to think of them that way. For example, if we knew the exact weight and dimensions of a die and the speed, height, rotation, etc. at which it was tossed, we might be able to figure out exactly which side would come up (this has been demonstrated with coin tosses). But getting that information and doing those calculations is a burden, and treating the toss as random is simpler.

Formally, a random variable is a function from a probability space, typically to the real numbers, that is measurable. (For finite probability spaces, the measurable requirement is superfluous.) Random variables can be classified as either discrete (a random variable that may assume either a finite number of values or an infinite sequence of values) or as continuous (a variable that may assume any numerical value in an interval or collection of intervals). A random variable's possible values might represent the possible outcomes of a yet-to-be-performed experiment, or the potential values of a quantity whose already-existing value is uncertain (for example, as a result of incomplete information or imprecise measurements).



Power-Law Distribution

$f_X(X=x|k) = cx^{-k}$. Note that $x$ and $k$ need constraints. For example, if $k = -2$, the density is proportional to $x^2$, which grows without bound and doesn't integrate:


In [2]:
import numpy
import matplotlib.pyplot as plt
%matplotlib inline

# With k = -2, the "density" x**(-k) = x**2 grows without bound,
# so it cannot integrate to 1.
x = numpy.linspace(0.1, 10, 99)
y = x**2
plt.plot(x, y)
plt.show()


To force the density toward 0 for large $x$, $k$ needs to be positive; in fact, for the integral over the tail to converge, $k$ must exceed 1 (see the step below where $x^{1-k} \to 0$). And if $k$ is positive, $x^{-k}$ blows up near 0, so we restrict the support to $x \geq 1$.

Let's find the normalizing constant ($c$):

$1 = \int_{1}^{\infty} c x^{-k} dx$

$1 = c \bigl[ \frac{1}{1-k} x^{1-k} \bigr]_{1}^{\infty}$

$1 = \frac{c}{1-k} \bigl[ 0 - 1 \bigr]$ (since $k > 1$, $x^{1-k} \to 0$ as $x \to \infty$)

$1 = \frac{c}{k-1}$

$c = k-1$

So the power law density function is $\begin{equation} f_X(X=x | k)=\begin{cases} (k-1)x^{-k} & \text{if }1 \leq x < \infty \text{ and } k > 1 \\ 0 & \text{otherwise}. \end{cases} \end{equation}$
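
As a sanity check, the density should integrate to 1 for any $k > 1$. A quick numerical check (assuming scipy is available):


In [ ]:
import numpy
from scipy import integrate

# (k-1)*x**-k should integrate to 1 over [1, inf) whenever k > 1.
for k in (1.5, 2, 3):
    total, _ = integrate.quad(lambda x: (k - 1) * x**-k, 1, numpy.inf)
    print(k, round(total, 6))  # each line should print 1.0
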

Here's what $f_X(X=x | k=2)$ looks like:


In [12]:
# Density for k = 2: (k-1)*x**-k = x**-2 on [1, inf), 0 elsewhere.
power_k2 = lambda x: x**-2 if x >= 1 else 0
x = numpy.linspace(0.1, 10, 99)
y = [power_k2(z) for z in x]
plt.plot(x, y)
plt.show()


Nassim Taleb offered the following quiz that uses a power distribution. Note his typo ("$q=.07$" should be "$q=.007$"). Using our equation, what is $k$? <img src="taleb_tweet.png" width="500">

First find the probability that $X$ exceeds some value $y$ (one minus the CDF):

$1-F_X(y|k) = 1 - \int_1^y (k-1)x^{-k} dx$

$1-F_X(y|k) = 1 - \biggl[ (k-1) \bigl[ \frac{1}{1-k} x^{1-k} \bigr]_1^y \biggr]$

$1-F_X(y|k) = 1 - \biggl[ (-1) \bigl[ y^{1-k} - 1 \bigr] \biggr]$

$1-F_X(y|k) = 1 - \biggl[ 1 - y^{1-k} \biggr]$

$1-F_X(y|k) = y^{1-k}$

Then, plugging in the quiz's numbers:

$.45 = .007^{1-k}$

$\ln{.45} = (1-k) \ln{.007}$

$k = 1 - \frac{ \ln{.45} }{ \ln{.007} }$

$k = .84$
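
The same arithmetic in code, with a round-trip check on the survival probability:


In [ ]:
import numpy

k = 1 - numpy.log(.45) / numpy.log(.007)
print(k)              # roughly 0.84
print(.007**(1 - k))  # recovers .45
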

Convex combination of distributions

Is this a mixture distribution? (Yes: a convex combination of two densities is exactly a two-component mixture.)

What happens as data move from simple to complex? We can look at this using a convex combination of a simple distribution (uniform) and a complex distribution (power law).

First the uniform:

$\begin{equation} f_X(X=x)=\begin{cases} 1 & \text{if }1 \leq x \leq 2 \\ 0 & \text{otherwise}. \end{cases} \end{equation}$

Then the power:

$\begin{equation} g_X(X=x | k)=\begin{cases} (k-1)x^{-k} & \text{if }1 \leq x < \infty \text{ and } k > 1 \\ 0 & \text{otherwise}. \end{cases} \end{equation}$

And the convex combination:

$\begin{equation} h_X(X=x | \alpha, k)=\begin{cases} \alpha + (1-\alpha)(k-1)x^{-k} & \text{if }1 \leq x < 2 \text{ and } k > 1 \\ (1-\alpha)(k-1)x^{-k} & \text{if }2 \leq x < \infty \text{ and } k > 1 \\ 0 & \text{otherwise}. \end{cases} \end{equation}$

(No normalizing constant is needed: a convex combination of normalized densities is itself normalized.)


In [9]:
def convex_dist(x, alpha, k):
    # Mixture density: weight alpha on the uniform piece over [1, 2),
    # weight (1 - alpha) on the power-law piece.
    if 1 <= x < 2 and k > 1:
        return alpha + (1 - alpha)*(k - 1)*x**-k
    elif x >= 2 and k > 1:
        return (1 - alpha)*(k - 1)*x**-k
    else:
        return 0

x = numpy.linspace(0.1, 10, 99)
y0 = [convex_dist(z, 0, 2) for z in x]    # pure power law
y5 = [convex_dist(z, 0.5, 2) for z in x]  # 50/50 mixture
y1 = [convex_dist(z, 1, 2) for z in x]    # pure uniform
plt.plot(x, y0, label='alpha = 0')
plt.plot(x, y5, label='alpha = 0.5')
plt.plot(x, y1, label='alpha = 1')
plt.legend()
plt.show()
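

As a sanity check, $h$ should integrate to 1 for any $\alpha$ in $[0, 1]$, since it is a convex combination of two normalized densities. A quick numerical check using the `convex_dist` function above (splitting the integral at the kink at $x = 2$, and again assuming scipy is available):


In [ ]:
from scipy import integrate

# h should integrate to 1 for any alpha in [0, 1].
for alpha in (0, 0.5, 1):
    head, _ = integrate.quad(lambda x: convex_dist(x, alpha, 2), 1, 2)
    tail, _ = integrate.quad(lambda x: convex_dist(x, alpha, 2), 2, numpy.inf)
    print(alpha, round(head + tail, 6))  # each should print 1.0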