In this lab we will introduce you to some basic probability and statistics concepts. You may have heard of many of the terms we plan to go over today, but if not, don't fret. Hopefully, by the end of this session, you will all leave feeling like stats masters!
By the end of this section, hopefully you will be able to answer the following questions with confidence:
1) What is the mean, standard deviation, and variance of a collection of data?
2) How do these concepts help me interpret and better understand the data I'm given?
Technical note: this notebook was written in Python 2 and then converted to Python 3, so you may notice some leftover Python 2 conventions in the code.
First off, you all may have heard of the mean value of a set of data, but if not, the "mean" is simply the average. Suppose you're given the following values:
$$ 5,3,7,4,2 $$What is their average? We can compute this by hand:
$$ \frac{1}{5} \left( 5 + 3 + 7 + 4 + 2 \right) = \frac{21}{5} = 4.2$$This is fine while we have a small set of values, but what if we had thousands of values, or even hundreds of thousands? This is where computers can be very handy.
Note: the concept of a "middle value" is often better captured by the median, which is the number such that half of the data set is above it, and half is below. The median is more robust than the mean -- for example, if you were typing $ 5,3,7,4,2 $ into your computer, but forgot a comma and typed $ 5,3,7,42 $ instead, the mean of $ 5,3,7,42 $ is $14.25$, which is very different from $4.2$. However, the median of $ 5,3,7,42 $ is 6, which is close to 4 (the median of $ 5,3,7,4,2 $). For example, to get a sense of "typical" housing prices, the median housing price is used instead of the mean housing price, otherwise a single \$10 million mansion will skew the whole dataset.
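As a quick preview of the computing we'll do below, NumPy's `mean` and `median` functions make this robustness easy to verify:

```python
import numpy as np

# The original values, and the same values with the missing-comma typo described above
values = np.array([5, 3, 7, 4, 2])
typo_values = np.array([5, 3, 7, 42])

print(np.mean(values), np.median(values))            # 4.2 and 4.0
print(np.mean(typo_values), np.median(typo_values))  # the mean jumps to 14.25, but the median only moves to 6.0
```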
Before we get into the coding part of the lesson, let's look at these values on a number line and get some intuition for what the mean value represents.
In [1]:
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
fig = plt.figure()
ax = plt.axes()
x = np.array((5,3,7,4,2))
ax.scatter(x,np.zeros(len(x)), s=75, label = 'Values')
ax.scatter(4.2,0, s=75, label = 'Mean')
ax.legend()
Out[1]:
Now let's generate some random numbers and calculate their means. I know this may seem tedious but the code will be useful later!
In [2]:
import random as rnd
for x in range(10):
    print(rnd.randint(1,101)) # rnd.randint(start, end) returns a random integer N with start <= N <= end (both endpoints included)
In [3]:
for x in range(10):
    print(rnd.random()) # rnd.random() returns a pseudo-random float in the half-open interval [0.0, 1.0)
In [ ]:
# Fill in the following code to compute the mean of a set of randomly generated numbers
num_values = 0 # change to reflect number of values
current_total = 0.0 # keep track of current total
for x in range(num_values):
    # generate a random number (either an integer or a decimal between 0 and 1) and store it in the variable named val
    # keep a running sum of the values
    print(val)
print("The average value is ", ...) # the mean: the running sum divided by the number of values
In [ ]:
# If we'd like to plot our values and their corresponding means we would need to store the generated values.
# Rewrite the code above to store the values as they are generated. Instead of keeping a running sum, use
# numpy's "sum" function to calculate the mean after the loop!
# Code goes here
Using the provided code above, can you plot the generated values with their mean?
In [ ]:
# Plot code goes here!
Now that we've got a better understanding for the mean, what about the standard deviation and variance?
Intuitively, the standard deviation is a measure of spread of the data.
Suppose we have a 3-ft ruler that we'd like to use to measure a friend's height. Let's also assume that the person of interest is taller than 3 feet and WE ONLY HAVE ONE RULER. Therefore, we need to be creative about how we measure our friend. We could use a 3-ft long string as a placeholder and place the ruler tip-to-tail to measure the remaining distance. As you can imagine, our measurements would be slightly off. However, it's reasonable to think that if we took many measurements, the mean measurement would be pretty close to the true height! In this case, the standard deviation would be a measure of the "spread" of all our (potentially) inaccurate measurements around the average (or mean) height.
Using the standard deviation, we can get an idea of the interval in which the majority of our measurements lie. If our measurement errors follow the classic bell-shaped (normal) distribution, then about 68% of our measurements fall within one standard deviation (on either side) of the mean!
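The 68% figure applies to normally distributed data; here is a quick empirical check, drawing many samples from a standard normal distribution:

```python
import numpy as np

# Draw many samples from a normal (bell-shaped) distribution
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100000)

# Fraction of samples within one standard deviation of the mean
mean = np.mean(samples)
std = np.std(samples)
within_one_std = np.mean(np.abs(samples - mean) <= std)
print(within_one_std)  # approximately 0.68
```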
Finally, the variance is simply the square of the standard deviation!
$$ \text{std dev} = \sqrt{\text{var}} $$
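This identity is easy to check with NumPy (note that `np.var` and `np.std` use the population formulas by default, dividing by n rather than n-1):

```python
import numpy as np

x = np.array([5, 3, 7, 4, 2])
var = np.var(x)  # mean of the squared deviations from the mean
std = np.std(x)  # square root of the variance

print(var, std)                     # 2.96 and about 1.72
print(np.isclose(std, var ** 0.5))  # True: std dev = sqrt(var)
```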
In [ ]:
fig = plt.figure()
ax = plt.axes()
x = np.array((5,3,7,4,2))
ax.scatter(x,np.zeros(len(x)), s=75, label = 'Values')
ax.scatter(4.2,0, s=75, label = 'Mean')
std = np.std(x)
x_ticks = np.append([2,3,4,5,7], [4.2-std,4.2+std])
plt.axvline(x=4.2-std,color='m')
plt.axvline(x=4.2+std,color='m')
ax.legend()
ax.set_xticks(x_ticks)
fig.suptitle('One Standard Deviation',fontsize=30)
In [ ]:
fig = plt.figure()
ax = plt.axes()
x = np.array((5,3,7,4,2))
ax.scatter(x,np.zeros(len(x)), s=75, label = 'Values')
ax.scatter(4.2,0, s=75, label = 'Mean')
std = np.std(x)
x_ticks = np.append([2,3,4,5,7], [4.2-2*std,4.2+2*std])
plt.axvline(x=4.2-2*std,color='m')
plt.axvline(x=4.2+2*std,color='m')
ax.legend()
ax.set_xticks(x_ticks)
fig.suptitle('Two Standard Deviations',fontsize=30)
In [ ]:
fig = plt.figure()
ax = plt.axes()
x = np.array((5,3,7,4,2))
ax.scatter(x,np.zeros(len(x)), s=75, label = 'Values')
ax.scatter(4.2,0, s=75, label = 'Mean')
std = np.std(x)
x_ticks = np.append([2,3,4,5,7], [4.2-3*std,4.2+3*std])
plt.axvline(x=4.2-3*std,color='m')
plt.axvline(x=4.2+3*std,color='m')
ax.legend()
ax.set_xticks(x_ticks)
fig.suptitle('Three Standard Deviations',fontsize=30)
In [5]:
from IPython.display import Image
Image(url = 'https://amp.businessinsider.com/images/546e68776bb3f74f68b7d0ba-750-544.jpg')
Out[5]:
In [ ]:
# Let's simulate this on our own!
num_draws = 10 # enter number of measurements to take
true_height, std_dev = 160, 1 # play around with changing the standard deviation. Do your results make sense?
height_msr = np.random.normal(true_height, std_dev,num_draws)
In [ ]:
fig = plt.figure()
ax = plt.axes()
plt.hist(height_msr)
x_ticks = np.append(ax.get_xticks(), true_height)
plt.axvline(x=true_height, color='m', label='True Height')
ax.legend()
ax.set_xticks(x_ticks)
fig.suptitle('Simulated Height Measurements',fontsize=30)
plt.ylabel('Number of Measurements',fontsize = 15)
plt.xlabel('Measured Height (cm)',fontsize = 15)
Khan Academy says it best:
"Probability is simply how likely something is to happen.
Whenever we're unsure about the outcome of an event, we can talk about the probabilities of certain outcomes - how likely they are. The analysis of events governed by probability is called statistics."
The best example for understanding probability is flipping a coin.
There are two possible outcomes - heads or tails.
What's the probability of the coin landing on heads, $P(H)$? You may already know the likelihood is 50% (assuming a fair coin). But how do we work that out?
$$ P(Event) = \frac{\text{number of ways it can happen}}{\text{total number of outcomes}} $$In the case of flipping a coin, $P(H) = \frac{1}{2}$.
Let's test this by simulation!
In [ ]:
# In this cell we will write a block of code to simulate flipping a coin and plotting the results.
p = ... # pick probability of getting heads
n = ... # pick number of trials
outcome = [] # empty container to store the results
for i in range(n):
    # generate a number between 0 and 1 (Hint: we used a built-in function before that can help us with this)
    result = ...
    if result <= p:
        outcome.append(1)
    else:
        outcome.append(0)
fig = plt.figure()
plt.hist(outcome)
fig.suptitle('Flipping a Coin, Heads = 1, Tails = 0',fontsize=20)
In [6]:
Image(url='http://www.milefoot.com/math/discrete/counting/images/cards.png')
Out[6]:
The standard deck of 52 playing cards in use today includes thirteen ranks in each of the four French suits: diamonds (♦), spades (♠), hearts (♥), and clubs (♣), with reversible Rouennais "court" or face cards.
Each suit includes an ace, a king, a queen, a jack, and ranks two through ten, with each numbered card depicting that many symbols of its suit.
Two (sometimes one or four) Jokers are included in commercial decks, but many games require them to be removed before play. For simplicity, we will also assume the Jokers have been removed from our deck.
Going back to our definition of probability
$$ P(Event) = \frac{\text{number of ways it can happen}}{\text{total number of outcomes}}, $$
what is the probability of drawing an ace?
$$ P(ace) = \frac{\text{number of aces}}{\text{total playing cards}} = \frac{4}{52} = \frac{1}{13}. $$
Similarly, we could compute the probability of NOT drawing an ace: $$ P(not \ ace) = \frac{\text{number of non-ace cards}}{\text{total playing cards}} = \frac{48}{52} = \frac{12}{13}. $$
Equivalently, since every card is either an ace or not an ace,
$$ P(not \ ace) = \frac{52}{52} - \frac{\text{number of aces}}{\text{total playing cards}} = \frac{48}{52} = \frac{12}{13}. $$
Notice that $P(ace) + P(not \ ace) = 1$. That is, the probability of either drawing an ace from a deck of cards or not drawing an ace is 100%!
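We can also check $P(ace) = \frac{1}{13}$ by simulation, in the same spirit as the coin flips above. A minimal sketch, where the integer 1 stands in for the ace:

```python
import random as rnd

num_draws = 100000
aces = 0
for _ in range(num_draws):
    card_num = rnd.randint(1, 13)  # each of the 13 ranks is equally likely; 1 represents the ace
    if card_num == 1:
        aces += 1

print("Empirical P(ace):", aces / num_draws)  # should be close to 1/13, or about 0.0769
```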
In [7]:
# The code below takes a random draw from a deck of cards
# Define the special face cards, suits and color
faces = {11: "Jack", 12: "Queen", 13: "King", 1: "Ace"}
suits = {1: "Spades", 2: "Hearts", 3: "Diamonds", 4: "Clubs"}
colors = {"Spades" : "Black", "Clubs" : "Black", "Diamonds" : "Red", "Hearts" : "Red"}
# Draw a random integer value. Card_num can be any of the 13 cards to choose from and suits picks a random suit
card_num = rnd.randint(1,13)
suit_num = rnd.randint(1,4)
# Get the card face, suit and corresponding color
card_face = faces.get(card_num,card_num)
card_suit = suits[suit_num]
card_color = colors.get(card_suit)
print(card_face, card_suit, card_color)
In [8]:
num_draws = 10 # Can change the number of draws, but large values are slow (on a personal computer)
i = 0
cards = []
while i < num_draws:
    # Draw random integer values: card_num picks one of the 13 ranks and suit_num picks a random suit
    card_num = rnd.randint(1,13)
    suit_num = rnd.randint(1,4)
    # Get the card face and suit
    card_face = faces.get(card_num,card_num)
    card_suit = suits[suit_num]
    cards.append([card_face,card_suit,1])
    i = i+1
    # remove duplicates (assuming we are using only one deck)
    ''' for j in range(i-1):
            if cards[j][0] == cards[i-1][0]:
                if cards[j][1] == cards[i-1][1]:
                    cards.remove(cards[i-1])
                    i = i-1
    '''
    # Note: We won't remove duplicates, so that we can see the probability of drawing a specific card.
    # Instead, count the number of occurrences of each card.
    for j in range(i-1):
        if cards[j][0] == cards[i-1][0]:
            if cards[j][1] == cards[i-1][1]:
                cards[j][2] = cards[j][2] + 1
                cards.remove(cards[i-1])
                i = i-1
                break # at most one earlier card can match, so stop looking
print(cards)
In [113]:
# Another way to draw, which should be faster
FullDeck = np.array( range(4*13) )
#print( FullDeck )
#print( (FullDeck // 4) + 1 ) # The ranks
#print( (FullDeck % 4) + 1 )  # The suits
ranks = {2:"2",3:"3",4:"4",5:"5",6:"6",7:"7",8:"8",9:"9",10:"10",11: "Jack", 12: "Queen", 13: "King", 1: "Ace"}
suits = {1: "Spades", 2: "Hearts", 3: "Diamonds", 4: "Clubs"}
# Take a random draw, without replacement. How do we draw without replacement efficiently? Shuffle.
nHands = round(1e5)
fourOfAKind = 0
fullHouse = 0
threeOfAKind = 0
twoPair = 0
for i in range(nHands):
    y = FullDeck.copy()
    np.random.shuffle(y)
    y = y[:5] # a five-card hand
    #print( y )
    ranksList = ( y//4 ) + 1
    suitsList = ( y%4 ) + 1
    # Sanity check:
    if nHands <= 2:
        print(">> Hand", i+1)
        for j in range(5):
            rk = ranks[ ranksList[j] ]
            st = suits[ suitsList[j] ]
            print("Card is", rk, "of", st)
    # Now, suppose we want to check how often we get three of a kind. The suit is irrelevant
    # (slight non-issue: you might worry that we do NOT want to count "better" hands, so what about flushes?
    # A 3-of-a-kind flush is impossible! Same for the others, except straights)
    #print( ranksList )
    unique, unique_counts = np.unique( ranksList, return_counts=True )
    #print( unique )
    #print( unique_counts )
    ln = len( unique )
    mx = max( unique_counts )
    if ln == 2:
        # it could be (A,B,B,B,B) or (A,A,B,B,B) or (A,A,A,B,B) or (A,A,A,A,B)
        # so either way, it's a four-of-a-kind or a full house
        if mx == 4:
            fourOfAKind = fourOfAKind + 1
        elif mx == 3:
            fullHouse = fullHouse + 1
        else:
            print("This should not happen!")
    elif ln == 3:
        # it could be two pair, or three-of-a-kind
        if mx == 3:
            threeOfAKind = threeOfAKind + 1
        elif mx == 2:
            twoPair = twoPair + 1
        else:
            print("This should not happen! Someone made a mistake")
# We can find the true answers here: https://en.wikipedia.org/wiki/Poker_probability
# For formatted output, see https://www.python-course.eu/python3_formatted_output.php
print("Empirical probabilities (in %):")
print("Four of a kind: {0:.4f} (true probability is {1:.4f})".format(fourOfAKind/nHands*100, 0.0240))
print("Full house: {0:.4f} (true probability is {1:.4f})".format(fullHouse/nHands*100, 0.1441))
print("Three of a kind: {0:.4f} (true probability is {1:.4f})".format(threeOfAKind/nHands*100, 2.1128))
print("Two pair: {0:.4f} (true probability is {1:.4f})".format(twoPair/nHands*100, 4.7539))
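Shuffling the full deck and keeping the top five cards is one way to sample without replacement; NumPy also provides this directly through `np.random.choice` with `replace=False`. A minimal sketch:

```python
import numpy as np

FullDeck = np.arange(4 * 13)
hand = np.random.choice(FullDeck, size=5, replace=False)  # five distinct cards, no repeats
print(hand)
```

Either approach gives an unbiased five-card hand; shuffling the whole deck also leaves the rest of the deck in random order, which is convenient if you want to deal multiple hands.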
Informally, we say that two variables are independent if they have nothing to do with each other. For example, if x is the number of fish caught in the Pacific Ocean last year, and y is the number of videos uploaded to YouTube last year, then x and y are independent variables.
If two variables are not independent, we call them dependent. If two variables are correlated, then they are dependent (and almost vice-versa: if they are dependent, then they are often, but not always, correlated). Phrased another way, uncorrelated variables are not necessarily independent, though in some special cases, such as jointly Gaussian variables, uncorrelated does imply independent.
Positive correlation means that if variable x is high (above its mean), then variable y is likely to be high too; and if x is low, then y is likely to be low too. Correlation is symmetric in the two variables.
Negative correlation means that if x is high, then y is likely to be low, and vice-versa.
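These definitions are easy to see with `np.corrcoef` on synthetic data (the variables here are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
noise = rng.normal(size=1000)

y_pos = x + 0.5 * noise        # tends to be high when x is high
y_neg = -x + 0.5 * noise       # tends to be low when x is high
y_ind = rng.normal(size=1000)  # generated with no relation to x

c_pos = np.corrcoef(x, y_pos)[0, 1]
c_neg = np.corrcoef(x, y_neg)[0, 1]
c_ind = np.corrcoef(x, y_ind)[0, 1]
print(c_pos, c_neg, c_ind)  # strongly positive, strongly negative, near zero
```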
Causation is the idea that one variable controls the other. For example, the amount of money I spend on ice cream is correlated with the amount of money I earn from my job, but the causation is one-way (not symmetric). That is, if I earn more or less money from my job, I have more or less money to spend on ice cream. But if I eat more ice cream, that does not mean that I will earn more money.
The false idea that "correlation means causation" is one of the biggest fallacies in science, and in machine learning as well.
Independence must be taken into account when doing things like surveys. For example, suppose I want to find out what percentage of people in the US are literate. I could mail a survey to everyone's house, ask if they are literate or not, and record the results. Of course, my results will tell me that nearly 100% of people are literate, because my survey suffers from extreme selection bias: only people who can read the survey are able to respond to it.