To wrap up this week's introduction to the fundamental tools of data science, today we'll jump into probability and statistics. This is yet another topic that could span an entire course, and like linear algebra, I highly recommend taking a stats course at some point. By the end of this lecture, you should be able to
Of course, these definitions are gross oversimplifications, but I'll illustrate with an example.
You're an intern at the Large Hadron Collider (LHC), and you've been instrumental in devising the experiments that seem to have uncovered hard evidence for the existence of the Higgs Boson. This evidence consists of the results of billions of subatomic collisions taking place deep in the bowels of the LHC.
Some high-level executives from the funding agencies responsible for keeping the lights on at the LHC want to hear about these exciting results, but all your immediate bosses are indisposed and have asked you to step in.
How do you present the findings? Do you provide a raw list of the billions of collisions? Do you take some averages of the data? How many? How representative of the billions of data points is this handful of averages? Can you quantify the uncertainty in these averages, i.e. how far off they could potentially be?
First-order statistics are summaries of the data that rely only on the data itself.
In [2]:
import numpy as np
np.random.seed(29384924)
data = np.random.randint(10, size = 100)
print(data)
Some very straightforward statistics are the number of data points, the largest value, and the smallest value. These shouldn't be immediately ignored, but they are of limited utility.
In [5]:
print("Number of data points: {}".format(data.shape[0]))
print("Largest value: {}".format(data.max()))
print("Smallest value: {}".format(data.min()))
Mean: You see this one all the time--it's the average of all the data!
In [4]:
print(data.mean()) # Mean
The mean is very simple to compute: it's the sum of the data divided by the number of data points. The mean depends at least a little bit on every single data point. This can be advantageous in certain situations, as it varies smoothly as more data points are added or some are removed.
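As a quick sketch of the computation described above (using a small made-up toy array, not the lecture's `data`), the mean really is just the sum divided by the count:

```python
import numpy as np

# A small toy array for illustration (not the lecture's `data` array).
values = np.array([4.0, 8.0, 6.0, 5.0, 2.0])

# The mean: sum of the data divided by the number of data points.
manual_mean = values.sum() / values.shape[0]

print(manual_mean)                             # 5.0
print(np.isclose(manual_mean, values.mean()))  # True
```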
However, this property can also be problematic. A famous example that exploits this weakness is that in the mid-1980s, the major at UNC with the highest average starting salary was geography...however, this was mostly on account of a particular NBA star by the name of Michael Jordan.
In [9]:
outlier = np.array([1, 1, 2, 3, 2, 1, 3, 2, 38]) # Note the outlier of 38 at the end.
print(outlier.mean())
The mean is sensitive to outliers, meaning one or two data points that lie well beyond all the others can disproportionately affect the value of the mean. In the above simple example, the lone outlier of 38 pulls the mean to be larger than all the other data points except for 38; not exactly a representative sample!
Median: The "middle" data point.
In [12]:
print(data)
print(np.median(data))
The median is computed by sorting all your data and picking out the middle value (or averaging the two middle data points, if you have an even amount of data). In this way, the median does not directly depend on the vast majority of your data; just on whatever value(s) happen to be in the middle. It is also not trivial to compute: you have to sort your data first, which might be tricky if you have billions of data points.
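To make the sort-then-pick procedure concrete, here is a minimal hand-rolled median (the function name `manual_median` and the toy arrays are my own; `np.median` is the real tool you'd use):

```python
import numpy as np

def manual_median(arr):
    """Median via an explicit sort: the middle value for odd-length data,
    the average of the two middle values for even-length data."""
    s = np.sort(arr)
    n = s.shape[0]
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2.0

odd = np.array([3, 1, 2])
even = np.array([4, 1, 3, 2])
print(manual_median(odd), np.median(odd))    # 2 2.0
print(manual_median(even), np.median(even))  # 2.5 2.5
```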
On the other hand, the median is not sensitive to outliers. In fact, it's robust to outliers, meaning it wholesale ignores them.
In [13]:
print(outlier)
print(np.median(outlier))
Second-order statistics rely on both the data itself and first-order statistics to compute.
Variance: This measures how spread out your data are. More specifically, it measures how much your data varies from its mean.
In [3]:
print(data)
print(data.var())
The variance is computed by subtracting the mean of the whole dataset from each individual data point, squaring these differences, summing them all together, and finally dividing by the number of data points. In symbols: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$, where $\mu$ is the mean.
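The formula can be sketched directly in code (again with a made-up toy array) and checked against NumPy's built-in:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# Squared deviations from the mean, averaged over all data points.
deviations = values - values.mean()
manual_var = (deviations ** 2).sum() / values.shape[0]

print(manual_var)                            # 1.25
print(np.isclose(manual_var, values.var()))  # True
```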
Variance may not be familiar, but you've probably heard of its relative: standard deviation is just the square root of the variance!
In [4]:
print(np.sqrt(data.var()))
print(data.std())
Like the mean, the variance (and standard deviation) is computed from every single data point, and is therefore sensitive to outliers.
Interquartile Range: The difference between the 75th percentile and the 25th percentile of the data.
In [6]:
print(np.percentile(data, 75) - np.percentile(data, 25))
This, like the median, is robust to outliers. But also like the median, it relies on sorting the data first, then picking out the value 1/4 of the way down the sorted dataset and subtracting it from the value 3/4 of the way down. This can be expensive for large datasets.
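We can see this robustness directly. Below, `spiked` is the earlier outlier array, and `clean` is a version I made up with the 38 replaced by a 3: the IQR is identical for both, while the standard deviation balloons.

```python
import numpy as np

# `spiked` is the lecture's outlier example; `clean` replaces the 38 with a 3.
clean = np.array([1, 1, 2, 3, 2, 1, 3, 2, 3])
spiked = np.array([1, 1, 2, 3, 2, 1, 3, 2, 38])

def iqr(arr):
    # 75th percentile minus 25th percentile.
    return np.percentile(arr, 75) - np.percentile(arr, 25)

print(iqr(clean), iqr(spiked))    # identical: the IQR ignores the outlier
print(clean.std(), spiked.std())  # the standard deviation balloons
```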
So far, we've looked at ways of quantitatively summarizing datasets. These involve first and second-order statistics, including
These methods for describing and summarizing data will come in handy as we deal with random variables.
A random variable is, first and foremost, a variable: we don't know its value. It is often represented by a capital letter, e.g. $X$.
While we can't know the exact value of a random variable, we often have a handful of observations of the variable--usually denoted with a lowercase version of the corresponding capital letter, e.g. $x$--and it's from these observations that we compute things like mean and variance in order to describe the random variable itself.
Is this all coming together, or what?!
As a result of these observations $x_1, x_2, ..., x_n$ of the random variable $X$, we can usually say something about how $X$ is distributed.
In [7]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# Observe 100 data points from a Gaussian random variable with mean = 0.0 and variance = 1.0.
observations = np.random.normal(size = 100)
plt.hist(observations)
Out[7]:
It's tough to see, isn't it? Let's try 1000 observations.
In [8]:
# Observe **1000** data points from a Gaussian random variable with mean = 0.0 and variance = 1.0.
observations = np.random.normal(size = 1000)
plt.hist(observations)
Out[8]:
That looks a little better! Maybe 10,000 data points, just for grins?
In [10]:
# Observe **** 10,000 **** data points from a Gaussian random variable with mean = 0.0 and variance = 1.0.
observations = np.random.normal(size = 10000)
_, _, _ = plt.hist(observations, bins = 25)
There's the bell curve we know and love!
We had some observations $x$--a regular array with a bunch of numbers in it (10,000 by the end, to be exact). In fact, here's what the array looked like:
In [11]:
print(observations)
This could be any old dataset! In fact, forget for a moment that we generated this dataset ourselves, and instead think that this could be a dataset we picked up from the web.
We're able to compute some statistics from it:
In [12]:
print("Mean: {:.2f}".format(observations.mean()))
print("Variance: {:.2f}".format(observations.var()))
You'll notice the mean is very close to 0, and the variance is likewise very close to 1. Since we ourselves set the mean and variance for the random number generator, we know that these are very close to the true mean and true variance, but in general we wouldn't necessarily know that.
Instead, we'd have computed the sample mean and sample variance from the data, as we did above, and then assumed the data to be Gaussian; after all, it certainly looks like a bell curve!
This brings us to distributions. When we made those histograms of the observations, we were creating a distribution of the data. The mean and variance statistics are parameters that govern the shape of those distributions. Take a look at how the distribution changes when we change its parameters:
In [15]:
from scipy.stats import norm
xs = np.linspace(-5, 5, 100)
# Note: scipy's "scale" parameter is the standard deviation, so to get a
# desired variance we pass its square root.
plt.plot(xs, norm.pdf(xs, loc = 0, scale = 1), '-', label = "mean=0, var=1")
plt.plot(xs, norm.pdf(xs, loc = 0, scale = np.sqrt(2)), '--', label = "mean=0, var=2")
plt.plot(xs, norm.pdf(xs, loc = 0, scale = np.sqrt(0.5)), ':', label = "mean=0, var=0.5")
plt.plot(xs, norm.pdf(xs, loc = -1, scale = 1), '-.', label = "mean=-1, var=1")
plt.legend(loc = 0)
plt.title("Various Normal Distributions")
Out[15]:
Notice how changing the mean shifts the bell curve around along the x-axis, and changing the variance either squeezes the bell curve to be tall and skinny (small variance, tightly packed around the mean), or smashes it to be flat and short (large variance, spread out).
These distributions allow us to say something about the probability of a random variable taking a certain specific value.
In fact, if you look at the previous plot of the normal bell curves--picking a spot along one of those curves gives you the relative likelihood (strictly speaking, the probability *density*) of the random variable for that distribution taking on that particular value you picked!
I hope it's becoming clear that, for a Gaussian, the most likely value for the random variable to take is its mean. The mean even has a special name: the expected value. On average, this is the value we're going to get from this random variable.
We have a special notation for probability:
$P(X = mean)$
$P$ is the universal symbol for "probability of", followed by some event. In this case, our event is "$X = mean$".
A few other properties to be aware of:

- The probability of any event is between 0 and 1, inclusive: $0 \leq P(A) \leq 1$.
- The probability that *some* outcome occurs is 1: $P(\Omega) = 1$, where $\Omega$ is the set of all possible outcomes.
- If two events $A$ and $B$ are mutually exclusive (they cannot both happen), then $P(A \textrm{ or } B) = P(A) + P(B)$.

These three points are referred to as the Axioms of Probability and form the foundation for pretty much every other rule of probability that has ever been and will ever be discovered.
A good way of learning probability is to visualize it. Take this spinner:
It's split into 12 segments. You could consider each segment to be one particular "event", and that event is true if, when you spin it, the spinner stops on that segment. So the probability of landing on any one specific segment is $1/12$. The probability of landing on any segment at all is 1.
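We can check both of these claims with a quick simulation (the spinner model below is my own sketch: 12 equally likely segments labeled 0 through 11):

```python
import numpy as np

rng = np.random.default_rng(0)
n_spins = 120000

# Model the spinner as 12 equally likely segments, labeled 0 through 11.
spins = rng.integers(0, 12, size=n_spins)

# The empirical frequency of any one segment should approach 1/12, about 0.083.
print((spins == 0).mean())
# Landing on *some* segment is a certainty: probability 1.
print(np.isin(spins, np.arange(12)).all())  # True
```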
Two events $A$ and $B$ are dependent if having knowledge about one of them implicitly gives you knowledge about the other. On the other hand, they're independent if knowing one tells you nothing about the other. Take an example of flipping a coin:
I have a penny; a regular old penny. I flip it once, and it lands on Heads. I flip it 9 more times, and it lands on Heads each time. What is the probability that the next flip will be Heads?
If you said $1/2$, you're correct! Coin flips are independent events; you could flip the coin 100 times and get 100 heads, and the probability of tails would still be $1/2$. Knowing one coin flip or 100 coin flips tells you nothing about future coin flips.
Now, I want to know what the probability is of two consecutive coin flips returning Heads. If the first flip is Heads, what is the probability of both flips being Heads? What if the first flip is Tails?
In this case, the two coin flips are dependent. If the first flip is Tails, then it's impossible for both coin flips to be Heads. On the other hand, if the first coin flip is Heads, then while it's not certain that both coin flips can be Heads, it's still a possibility. Thus, knowing one can tell you something about the other.
If two events $A$ and $B$ are independent, their probability can be written as:
$P(A, B) = P(A) * P(B)$
This is a huge simplification that comes up in many data science algorithms: if you can prove two random variables in your data are statistically independent, analyzing their behavior in concert with each other becomes much easier.
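A simulation makes the product rule tangible. Here's a sketch with two independent fair coin flips per trial: the empirical $P(A, B)$ matches $P(A) * P(B)$, both near $0.25$.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100000

# Two independent fair coin flips per trial: 1 = Heads, 0 = Tails.
first = rng.integers(0, 2, size=n)
second = rng.integers(0, 2, size=n)

p_both = ((first == 1) & (second == 1)).mean()
p_product = (first == 1).mean() * (second == 1).mean()

# For independent events, P(A, B) should match P(A) * P(B) -- about 0.25.
print(p_both, p_product)
```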
On the other hand, if two events are dependent, then we can define the probabilities of these events in terms of their conditional probabilities:
$P(A, B) = P(A | B) * P(B)$
This says "the probability of $A$ and $B$ is the conditional probability of $A$ given $B$, multiplied by the probability of $B$."
Conditional probability is way of "fixing" a random variable(s) we don't know, so that we can (in some sense) "solve" for the other random variable(s). So when we say:
$P(A, B) = P(A | B) * P(B)$
This tells us that, for the sake of this computation, we're assuming we know what $B$ is in $P(A | B)$, as knowing $B$ gives us additional information in figuring out what $A$ is (again, since $A$ and $B$ are dependent).
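The chain rule can be verified on the two-coin-flip example from earlier, with $B$ = "the first flip is Heads" and $A$ = "both flips are Heads" (the variable names below are my own):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200000

flip1 = rng.integers(0, 2, size=n)  # 1 = Heads, 0 = Tails
flip2 = rng.integers(0, 2, size=n)

B = flip1 == 1                   # event B: the first flip is Heads
A = (flip1 == 1) & (flip2 == 1)  # event A: both flips are Heads

p_b = B.mean()
p_a_given_b = A[B].mean()  # among trials where B happened, how often A happened
p_a_and_b = A.mean()

# Chain rule: P(A, B) = P(A | B) * P(B); both sides come out near 0.25.
print(p_a_and_b, p_a_given_b * p_b)
```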
Which brings us, at last, to Bayes' Theorem and what is probably the hardest but most important part of this entire lecture.
(Thank you, Rev. Thomas Bayes)
Bayes' Theorem is a clever rearrangement of conditional probability, which allows you to update conditional probabilities as more information comes in. For two events, $A$ and $B$, Bayes' Theorem states:

$P(A | B) = \frac{P(B | A) * P(A)}{P(B)}$
As we've seen, $P(A)$ and $P(B)$ are the probabilities of those two events independent of each other, $P(B | A)$ is the probability of $B$ given that we know $A$, and $P(A | B)$ is the probability of $A$ given that we know $B$.
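Here is a classic worked example, with all numbers made up for illustration: $A$ = "has some condition", $B$ = "a test for the condition comes back positive".

```python
# A hypothetical worked example of Bayes' Theorem (all numbers made up):
# A = "has some condition", B = "a test for the condition is positive".
p_a = 0.01              # P(A): base rate of the condition
p_b_given_a = 0.95      # P(B | A): probability of a positive test given A
p_b_given_not_a = 0.05  # P(B | not A): false positive rate

# Total probability of a positive test:
# P(B) = P(B | A) * P(A) + P(B | not A) * P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' Theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print(round(p_a_given_b, 3))  # 0.161 -- surprisingly low, thanks to the base rate
```

Even with an accurate test, the low base rate means a positive result only implies about a 16% chance of actually having the condition, which is exactly the kind of belief update Bayes' Theorem formalizes.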
It may seem like a lot of math, but this is an incredibly useful formula that will come up later. And should you decide to pursue probability and statistics courses, you'll see this even more often!
Some questions to discuss and consider:
1: You have a very small dataset with a large variance. What first-order statistic would you use to summarize the data? Justify your answer.
2: Correlation is an analytical strategy to determine if, when one random variable changes, a second random variable also changes. Random variables can be positively correlated (as one goes up, the other goes up), negatively correlated (as one goes up, the other goes down), or uncorrelated (the behavior of one cannot predict the behavior of the other). Without consulting Google, what statistics that we have covered in this lecture do you think would be useful for computing correlation between two random variables? Why?
3: Go back to the spinner graphic. We know that the probability of landing on any specific segment is 1/12. What is the probability of landing on a blue segment? What about a red segment? What about a red OR yellow segment?
4: Recall the "conditional probability chain rule" from earlier in this lecture, i.e. that $P(A, B) = P(A | B) * P(B)$. Assume $A$ and $B$ are independent. Prove $P(A | B) = P(A)$.
5: Bayes' Theorem is a clever rearrangement of conditional probability using basic principles. Starting from there, $P(A, B) = P(A | B) * P(B)$, see if you can derive Bayes' Theorem for yourself.