Sometimes, particularly when sampling uniformly at random from a population, every possible outcome is equally likely. Then, calculating the probability of an event is an exercise in counting:
$$P(\text{Event }E) = \frac{\text{# outcomes in }E}{\text{# possible outcomes}}$$

Vitamin 6 had the following problem:

Suppose we sample 4 people uniformly at random without replacement from the population of 103 DS100 students and faculty. What is the chance our sample contains all 4 faculty?

$$P(\text{All 4 faculty}) = \frac{\text{# of ways to get those 4 faculty}}{\text{# possible samples (not counting reshuffled samples)}}$$
In these situations, it's helpful to know the basic formula for the number of ways to choose $m$ things from $n$ things without replacement, if order doesn't matter (that is, if you don't count reshuffled copies as different samples). It's ${n \choose m} = \frac{n!}{m! (n-m)!}$.
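For the Vitamin 6 problem, the numerator is 1 (there is only one unordered group consisting of exactly the 4 faculty), so the answer is $1 / {103 \choose 4}$. Here's a quick way to compute counts like these; this is a sketch that assumes Python 3.8+ for math.comb (scipy.special.comb would work just as well).

In [ ]:
from math import comb

# Number of ways to choose 4 people from 103, ignoring order.
n_samples = comb(103, 4)

# Exactly one of those unordered samples is "all 4 faculty".
p_all_faculty = 1 / n_samples
print(n_samples, p_all_faculty)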
How about this harder question:
Suppose we sample 10 people uniformly at random without replacement from the population of 103 DS100 students and faculty. What is the chance our sample contains all 4 faculty?
First, what's the denominator?
There are ${103 \choose 10}$ possible samples.
What about the numerator?
It counts the number of ways to get a sample of 10 people that includes the 4 faculty. Of course, all of these ways have to involve the 4 faculty members being in the sample! Since order doesn't matter, we can just "fix" the first 4 people as the 4 faculty members. Then we're choosing the remaining 6 people from the remaining 99 nonfaculty.
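Putting the numerator and denominator together numerically (a quick check, again assuming math.comb is available; the formula is written out just below):

In [ ]:
from math import comb

# The 4 faculty are fixed in the sample; choose the other 6 from the 99 students.
p = comb(99, 6) / comb(103, 10)
print(p)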
$$P(\text{All 4 faculty in a sample of 10}) = \frac{{99 \choose 6}}{{103 \choose 10}}$$

Now here is a much harder question. We won't even answer it! But it's useful to illustrate some traps in combinatorial reasoning.
Suppose we sample 10 people uniformly at random with replacement from the population of 103 DS100 students and faculty. (So a sample can include a person multiple times.) What is the chance our sample contains all 4 faculty?
What's the denominator?
First, we really have to ask: What's the right way to think about the sample space?
The question doesn't talk about ordering. You can think of the samples as ordered or unordered, whichever is more convenient. But you have to be consistent in computing the numerator and denominator.
If order doesn't matter, the counting is harder: the unordered samples are no longer all equally likely, so the simple counting formula above doesn't apply directly. This isn't a combinatorics class, so we won't go there. (Thinking of the samples as unordered turns out to be the best way to go, though.)
If you let order matter (that is, if you count shuffled versions of the same sample as different), then the denominator is easy: For each slot there are 103 choices, so it's $103^{10}$.
If order matters, then what's the numerator?
Here is a tempting but incorrect line of reasoning. The 4 faculty are fixed, and there are $103^6$ ways to choose and order the other people. So if the faculty were first and in order [Hellerstein, Yu, Gonzales, Nolan], there would be $103^6$ ways to choose the others. But now the faculty can show up anywhere and in any order. There are ${10 \choose 4}$ ways to choose the faculty members' positions in the sample, and $4!$ ways to order them given those positions. So the numerator is $103^6 {10 \choose 4} 4!$.
This reasoning is incorrect. Can you find a sample that's counted more than once?
This double-counts situations like the following: Say we put the faculty in the first 4 positions in the order above. One possibility is that the next rest-of-sample member is Nolan again:
[Hellerstein, Yu, Gonzales, Nolan, Nolan, <anything>, <anything>, <anything>, <anything>, <anything>]
But we also counted this situation under the possibility that the faculty occupy the first 3 positions and the 5th position, and the first rest-of-sample member (in position 4) is Nolan.
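To see the overcounting concretely, here is a small brute-force check (a sketch, not from the original notes, assuming Python 3.8+ for math.comb) on a scaled-down version of the problem: a population of 4 people, 2 of whom play the role of "faculty", and an ordered sample of size 3 drawn with replacement. The tempting formula gives a larger count than direct enumeration does.

In [ ]:
from itertools import product
from math import comb, factorial

population = ['faculty_A', 'faculty_B', 'student_C', 'student_D']
faculty = {'faculty_A', 'faculty_B'}
k = 3  # ordered sample size, drawn with replacement

# Direct enumeration: count ordered samples that contain both faculty members.
direct = sum(1 for sample in product(population, repeat=k)
             if faculty <= set(sample))

# The tempting (incorrect) count: choose positions and an order for the
# faculty, then fill the remaining slots freely.
tempting = len(population)**(k - len(faculty)) * comb(k, len(faculty)) * factorial(len(faculty))

print(direct, tempting)  # the tempting count is larger: it double-counts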
Suppose you'd like to estimate the chance $p$ that a coin comes up heads, by flipping it many times. (It's not exactly .5!) Suppose flips is an array containing the flip outcomes, 0 for tails and 1 for heads.

An obvious estimate of $p$ is the proportion of times the coin came up heads when you flipped it, np.mean(flips). It turns out that this natural estimate is also the maximum likelihood estimate.
The idea behind maximum likelihood is to guess that a parameter's value is whatever maximizes the probability of the data we see. So we first have to write down the probability of the data, as a function of the parameter.
$$L(\pi) = P(\text{Our sequence of flip outcomes, assuming }p = \pi)$$

How do we calculate that probability?
Since the flips are independent, this is just:
$$\begin{align} L(\pi) = &P(\text{flip 0 was }\texttt{flips[0]}\text{, assuming }p = \pi) \\ \times &P(\text{flip 1 was }\texttt{flips[1]}\text{, assuming }p = \pi) \\ \times &\cdots \end{align}$$

Now, each term where the corresponding flip was heads (flips[i] == 1) will be $\pi$, and each other term will be $(1-\pi)$. Gathering together the terms with exponents, we get:

$$L(\pi) = \pi^{\texttt{sum(flips)}} (1-\pi)^{\texttt{len(flips)-sum(flips)}}$$
Here's a picture if there are 10 flips and 6 heads:
In [13]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
def plot_bernoulli_likelihood(n, k, log=False):
    # Plot the likelihood (or log-likelihood) of the heads probability π
    # for a specific sequence of n flips containing k heads.
    pis = np.linspace(.01, .99, 100)
    liks = k*np.log(pis) + (n-k)*np.log(1-pis) if log else pis**k * (1-pis)**(n-k)
    plt.plot(pis, liks)
    plt.xlabel("π")
    plt.ylabel("log(L(π))" if log else "L(π)")
    plt.title("{} function for {:d} flips and {:d} heads".format(
        "Log-likelihood" if log else "Likelihood",
        n,
        k))
    plt.show()
In [14]:
plot_bernoulli_likelihood(10, 6)
In [15]:
plot_bernoulli_likelihood(100, 60)
In [18]:
plot_bernoulli_likelihood(100, 60, log=True)
The MLE is the $\pi$ that maximizes $L$. What's the general strategy for finding the maximum value of a function?
Find places where its derivative with respect to $\pi$ is 0.
You do have to make sure you're in the right situation. For example, picture a function with two places where its derivative is 0: one is a local maximum, and the other is a local minimum. That local maximum needn't be the global maximum. And if we limit the domain of the function to a bounded region, the maximum can occur at the edge of the domain. Sometimes it's important to check for that.
In this class, we won't dwell on this problem too much. In this case, there is one place where the derivative is 0, and it's the maximum.
How do we do that for $L$?
It is very often useful to take the logarithm of $L$. That's okay because the likelihood is positive and $\log$ is strictly increasing: if $\log(x) > \log(y)$ then $x > y$. So $\arg \max \log(L) = \arg \max L$.
You could also look at $L^2$ or $\sqrt{L}$ for similar reasons, but those wouldn't be useful in this case. Likelihood functions are often probabilities of many independent events, because each observation is independent. So they often involve products. Logarithms are useful when they can convert products (which are hard to differentiate) into sums (which are easy to differentiate). Remember, sums are easy to differentiate because differentiation commutes with summation. If $g(x) = \texttt{np.sum([func(x) for func in funcs])}$, then:

$$[\frac{d}{dx} g](x) = \texttt{np.sum}([\frac{d}{dx}\texttt{func(x) for func in funcs}])$$

So:
$$\log L(\pi) = \texttt{sum(flips)} \log(\pi) + (\texttt{len(flips)-sum(flips)}) \log(1-\pi)$$

Now, derive. This isn't a calculus class, so it's okay to use something like Wolfram Alpha for this step. Cite your source, though! In this case, the derivative is not hard.
$$0 = [\frac{d}{d\pi} \log L](\pi^*) = \texttt{sum(flips)} \frac{1}{\pi^*} - (\texttt{len(flips)-sum(flips)}) \frac{1}{1-\pi^*}$$

$$\implies (1 - \pi^*) \texttt{sum(flips)} - \pi^* (\texttt{len(flips)-sum(flips)}) = 0$$

$$\implies \pi^* = \frac{\texttt{sum(flips)}}{\texttt{len(flips)}} = \texttt{np.mean(flips)}$$

In this case, the MLE is the obvious thing. This is true for many situations, but not for all.
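As a quick sanity check (a sketch with simulated data, not part of the original notes), we can compare np.mean(flips) to the grid point that maximizes the log-likelihood:

In [ ]:
import numpy as np

np.random.seed(0)
flips = np.random.binomial(1, 0.7, size=1000)  # hypothetical data with true p = 0.7 (made up for illustration)

pis = np.linspace(.001, .999, 999)
log_liks = np.sum(flips)*np.log(pis) + (len(flips) - np.sum(flips))*np.log(1 - pis)

print(np.mean(flips))            # the claimed MLE
print(pis[np.argmax(log_liks)])  # grid maximizer; should be very close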
Suppose you're trying to find a job as a data scientist. Job applications are pretty random, so even though you're a great candidate, the chance you're hired after each application is less than 1.
Let's make an heroic assumption:
Each time you apply, there's a fixed chance $p$ you're hired. The chance doesn't change over time. You continue to apply forever until you're hired.
Under this assumption, each application can be modeled with a Bernoulli($p$) distribution. These are called independent "Bernoulli($p$) trials".
The number of times you apply and fail before you get hired, $T$, also has a well-known distribution: the Geometric($p$) distribution. Let's figure out the chances.
$$P(T = 0) = P(\text{Hired on the first try}) = ?$$
$p$.
How about $P(T = 1)$?
$(1-p)p$. It's the chance you fail the first time and succeed the second time.
What's the general formula?

$P(T = t) = (1-p)^t p$ for $t = 0, 1, 2, \ldots$: you fail the first $t$ times, then succeed on try $t+1$.
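Here's a quick simulation check of that formula (a sketch, not part of the original notes; the hiring chance p = 0.3 is made up), counting failures before the first success in repeated Bernoulli trials:

In [ ]:
import numpy as np

np.random.seed(0)
p = 0.3  # hypothetical hiring chance (made up for illustration)

def failures_before_success(p):
    # Run Bernoulli(p) trials until the first success; return the number of failures.
    t = 0
    while np.random.random() >= p:
        t += 1
    return t

samples = np.array([failures_before_success(p) for _ in range(100000)])
for t in range(4):
    print(t, np.mean(samples == t), (1-p)**t * p)  # empirical frequency vs. (1-p)^t p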
Of course, this doesn't help much unless you know $p$. Suppose you identify a class of people who previously interviewed for jobs with the same hiring chance $p$. Their numbers of failed applications before being hired are in an array num_failures. What's the MLE of $p$?
In [49]:
def plot_geometric_likelihood(num_failures, log=False):
    # Plot the likelihood (or log-likelihood) of the hiring chance π,
    # given each person's number of failures before being hired.
    pis = np.linspace(.01, .99, 100)
    liks = (len(num_failures)*np.log(pis) + sum(num_failures)*np.log(1-pis)
            if log
            else pis**len(num_failures) * (1-pis)**sum(num_failures))
    plt.plot(pis, liks)
    plt.xlabel("π")
    plt.ylabel("log(L(π))" if log else "L(π)")
    plt.title("{} function for the Geometric\ndistribution given the data".format(
        "Log-likelihood" if log else "Likelihood"))
    plt.show()
In [50]:
num_failures_example = np.array([2, 5, 0, 0, 1, 4])
plot_geometric_likelihood(num_failures_example)
In [51]:
plot_geometric_likelihood(num_failures_example, log=True)
$L(π) =$ np.prod((1-π)**num_failures * π)
Again, it helps to take the log before differentiating.
$\log(L(π)) =$ sum(np.log((1-π)**num_failures * π))
... = sum(num_failures)*np.log(1-π) + len(num_failures)*np.log(π)
Setting the derivative with respect to π to 0 and solving, just as in the coin-flip case, gives:

$π^* =$ len(num_failures) / (len(num_failures) + sum(num_failures))

Can you interpret this formula?

It's the total number of hires (one per person) divided by the total number of applications. Equivalently, it's one over the average number of tries per person. That's the same thing you'd get if you treated all the applications as entirely separate Bernoulli trials. Does that make sense?
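As a quick check (a sketch, not part of the original notes), compare this formula to a grid search over the log-likelihood for num_failures_example:

In [ ]:
import numpy as np

num_failures_example = np.array([2, 5, 0, 0, 1, 4])

pis = np.linspace(.001, .999, 999)
log_liks = (len(num_failures_example)*np.log(pis)
            + sum(num_failures_example)*np.log(1-pis))

formula = len(num_failures_example) / (len(num_failures_example) + sum(num_failures_example))
print(formula)                   # 6 / (6 + 12) = 1/3
print(pis[np.argmax(log_liks)])  # grid maximizer; should be very close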
You may recall the "German tank problem" from Data 8. In short, the Germans have tanks labeled with serial numbers 1 through $N$, and we observe a random sample (with replacement) of the serial numbers. We'd like to estimate $N$, since that tells us how many tanks the Germans have. The observed serial numbers are in an array called serials
.
There are many estimates of $N$, and the MLE is one of them. Let's calculate it.
What's the likelihood function?
Say the largest serial number, the parameter we're estimating, is $\nu$. The chance of seeing any particular number from 1 to $\nu$ is $1/\nu$, and the chance of seeing a number above $\nu$ (or below 1) is 0. So the likelihood function is:
$$L(ν) = \texttt{np.prod((1/ν) * (serials <= ν))}$$

What's the MLE, $\nu^*$?
This one doesn't really require calculus or algebra. If any of the serial numbers are above $\nu$, then the likelihood is 0. Otherwise, it's $1/\nu^{\texttt{len(serials)}}$. That just decreases as $\nu$ gets bigger, so we should choose $\nu$ to be as small as possible without being less than any of the observations. In other words, $\nu^* = \texttt{max(serials)}$.
So the MLE is the obvious estimator.
In [165]:
def plot_uniform_likelihood(serials, log=False):
    # Plot the likelihood (or log-likelihood) of ν, the largest serial
    # number, given the observed serials.
    lower = max(int(.75*max(serials)), 2*min(serials) - max(serials))
    upper = 2*max(serials) - min(serials)
    if log:
        # The log-likelihood is -len(serials)*log(ν) for ν >= max(serials),
        # and -infinity for smaller ν.
        nus = np.arange(max(serials), upper+1, 1)
        liks = -len(serials)*np.log(nus)
        plt.scatter(nus, liks, label='likelihood', s=.03)
        plt.axvline(max(serials), linestyle='--', color='red',
                    label='likelihood is -$\\infty$ past here')
        plt.xlim([lower-10, upper+10])
        plt.legend(bbox_to_anchor=(1.6, .5))
    else:
        nus = np.arange(lower, upper+1, 1)
        liks = [np.prod((serials <= nu) / nu) for nu in nus]
        plt.scatter(nus, liks, s=.03)
        plt.ylim([-.2*max(liks), 1.5*max(liks)])
    plt.xlabel("$\\nu$")
    plt.ylabel("log(L($\\nu$))" if log else "L($\\nu$)")
    plt.title("{} function for the discrete uniform\ndistribution given the data".format(
        "Log-likelihood" if log else "Likelihood"))
    plt.show()
In [166]:
serials_example = np.array([10, 124, 111, 78, 80, 3])
plot_uniform_likelihood(serials_example)
In [167]:
plot_uniform_likelihood(serials_example, log=True)
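And a quick numerical confirmation (a sketch, not part of the original notes; it reuses np and serials_example from above) that the grid maximizer is max(serials_example):

In [ ]:
nus = np.arange(1, 2*max(serials_example) + 1)
liks = np.array([np.prod((serials_example <= nu) / nu) for nu in nus])

print(max(serials_example))  # the claimed MLE
print(nus[np.argmax(liks)])  # grid maximizer; should agree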
Does this seem like a good estimator to you?
It's okay, but not great. It seems likely there are serial numbers above max(serials). Many $N$s modestly larger than $\nu^*$ would reasonably explain the data, while any $N$ less than $\nu^*$ is ruled out entirely by the data.