In this lab we will introduce you to some basic probability and statistics concepts. You may have heard of many of the terms we plan to go over today but if not, do not fret. Hopefully by the end of this session you will all leave here feeling like stats masters!
By the end of this section, hopefully you will be able to answer the following questions with confidence:
1) What is the mean, standard deviation, and variance of a collection of data?
2) How do these concepts help me interpret and better understand the data I'm given?
TODO: data8 Lab 5 may be helpful here! Just go over the basic intuition of what the mean and std. dev. are, maybe with some plots. Tie in the connection to variance and how together they help to distinguish rare and expected values.
First off, you all may have heard of the mean value of a set of data, but if not, the mean is simply the average. Suppose you're given the following values:
$$ 5,3,7,4,2 $$
What is their average? We can compute this by hand:
$$ \frac{1}{5} \left( 5 + 3 + 7 + 4 + 2 \right) = \frac{21}{5} = 4.2$$
This is fine while we have a small set of values, but what if we had thousands of values, or even hundreds of thousands? This is where computers can be very handy.
Before we get into the coding part of the lesson, let's look at these values on a number line and get some intuition for what the mean value represents.
In [2]:
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid') # on matplotlib 3.6 and later this style is named 'seaborn-v0_8-whitegrid'
import numpy as np
fig = plt.figure()
ax = plt.axes()
x = np.array((5,3,7,4,2))
ax.scatter(x,np.zeros(len(x)), s=75, label = 'Values')
ax.scatter(4.2,0, s=75, label = 'Mean')
ax.legend()
Out[2]:
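As a quick check on the value 4.2 marked in the plot above, the mean can be computed rather than typed in by hand. This is only a minimal sketch, assuming NumPy is installed; the names `values`, `mean_by_hand` and `mean_numpy` are illustrative and do not appear elsewhere in the lab.

```python
import numpy as np

values = np.array([5, 3, 7, 4, 2])

# The mean is the sum of the values divided by how many values there are.
mean_by_hand = sum(values) / float(len(values))

# NumPy's built-in mean should agree with the calculation above.
mean_numpy = np.mean(values)

print(mean_by_hand)
print(mean_numpy)
```

Both lines should agree with the hand calculation of 21/5.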
Now let's generate some random numbers and calculate their means. I know this may seem tedious but the code will be useful later!
In [3]:
import random as rnd
for x in range(10):
    print(rnd.randint(1, 101)) # rnd.randint(start, end) returns a random integer between start and end, both endpoints included
In [36]:
for x in range(10):
    print(rnd.random()) # rnd.random() returns a pseudo-random value in the interval [0.0, 1.0)
In [12]:
# Fill in the following code to compute the mean of a set of randomly generated numbers
num_values = 2 # change to reflect number of values
current_total = 0.0 # variable to keep track of current total
for x in range(num_values):
    val = rnd.random() # generate random number (either integer or decimal between 0 and 1) and store it in the variable named val
    current_total = current_total + val # keep a running sum of the values
    print(val)
print("The average value is {}".format(current_total / num_values))
In [ ]:
# If we'd like to plot our values and their corresponding means we would need to store the generated values.
# Rewrite the code above to store the values as they are generated. Instead of keeping a running sum, use
# numpy's "sum" function to calculate the mean after the loop!
# Code goes here
Using the provided code above, can you plot the generated values with their mean?
In [ ]:
# Plot code goes here!
Now that we have a better understanding of the mean, what about the standard deviation and variance?
Intuitively, the standard deviation is a measure of spread of the data.
Suppose we'd like to measure a friend's height with a 3-ft ruler. Let's also assume that our friend is taller than 3 feet and WE ONLY HAVE ONE RULER. Therefore, we need to be creative about how we measure our friend. We could use a 3-ft long string as a placeholder and place the ruler tip-to-tail to measure the remaining distance. As you can imagine, our measurements would be slightly off. However, it's reasonable to think that if we took many measurements, the mean measurement would be pretty close to the true height! In this case, the standard deviation would be a measure of the "spread" of all our (potentially) inaccurate measurements around the average (or mean) height.
Using the standard deviation, we can get an idea of the interval in which the majority of our measurements lie. If our measurements are roughly normally distributed (bell-shaped), as repeated measurement errors often are, then plotting them on a number line as before and including the mean would show that about 68% of our measurements fall within one standard deviation (on either side) of the mean!
Finally, the variance is simply the square of the standard deviation!
$$ std \ dev = \sqrt{var} $$
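To make the relationship concrete, here is a minimal sketch that works out the variance and standard deviation of the values 5, 3, 7, 4, 2 step by step. It assumes the "population" definition (divide by the number of values), which is also what `np.var` and `np.std` use by default; the variable names are illustrative.

```python
import numpy as np

values = np.array([5, 3, 7, 4, 2])
mean = np.mean(values)

# How far each value sits from the mean (some negative, some positive).
deviations = values - mean

# Variance: the average of the squared deviations.
variance = np.mean(deviations ** 2)

# Standard deviation: the square root of the variance.
std_dev = np.sqrt(variance)

print(variance, std_dev)
print(np.var(values), np.std(values)) # NumPy's built-ins should match
```

The squared deviations here are 0.64, 1.44, 7.84, 0.04 and 4.84, which sum to 14.8, so the variance is 14.8 / 5 = 2.96 and the standard deviation is roughly 1.72.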
In [60]:
fig = plt.figure()
ax = plt.axes()
x = np.array((5,3,7,4,2))
ax.scatter(x,np.zeros(len(x)), s=75, label = 'Values')
ax.scatter(4.2,0, s=75, label = 'Mean')
std = np.std(x)
x_ticks = np.append([1,3,4,5,7], [4.2-std,4.2+std])
plt.axvline(x=4.2-std,color='m')
plt.axvline(x=4.2+std,color='m')
ax.legend()
ax.set_xticks(x_ticks)
fig.suptitle('One Standard Deviation',fontsize=30)
Out[60]:
In [63]:
fig = plt.figure()
ax = plt.axes()
x = np.array((5,3,7,4,2))
ax.scatter(x,np.zeros(len(x)), s=75, label = 'Values')
ax.scatter(4.2,0, s=75, label = 'Mean')
std = np.std(x)
x_ticks = np.append([1,3,4,5,7], [4.2-2*std,4.2+2*std])
plt.axvline(x=4.2-2*std,color='m')
plt.axvline(x=4.2+2*std,color='m')
ax.legend()
ax.set_xticks(x_ticks)
fig.suptitle('Two Standard Deviations',fontsize=30)
Out[63]:
In [65]:
fig = plt.figure()
ax = plt.axes()
x = np.array((5,3,7,4,2))
ax.scatter(x,np.zeros(len(x)), s=75, label = 'Values')
ax.scatter(4.2,0, s=75, label = 'Mean')
std = np.std(x)
x_ticks = np.append([1,3,4,5,7], [4.2-3*std,4.2+3*std])
plt.axvline(x=4.2-3*std,color='m')
plt.axvline(x=4.2+3*std,color='m')
ax.legend()
ax.set_xticks(x_ticks)
fig.suptitle('Three Standard Deviations',fontsize=30)
Out[65]:
In [41]:
from IPython.display import Image
Image(url = 'https://amp.businessinsider.com/images/546e68776bb3f74f68b7d0ba-750-544.jpg')
Out[41]:
In [83]:
# Let's simulate this on our own!
num_draws = 10 # enter number of measurements to take
true_height, std_dev = 160, 1 # play around with changing the standard deviation. Do your results make sense?
height_msr = np.random.normal(true_height, std_dev,num_draws)
In [84]:
fig = plt.figure()
ax = plt.axes()
plt.hist(height_msr)
x_ticks = np.append(ax.get_xticks(), true_height)
plt.axvline(x=true_height, color='m', label='True height') # the label gives ax.legend() something to display
ax.legend()
ax.set_xticks(x_ticks)
fig.suptitle('Simulated Height Measurements',fontsize=30)
plt.ylabel('Number of Measurements',fontsize = 15)
plt.xlabel('Measured Height (cm)',fontsize = 15)
Out[84]:
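The simulated measurements can also be used to check the "68% within one standard deviation" rule of thumb numerically. The sketch below is an illustration only: it assumes normally distributed measurements, uses a much larger sample than the cell above, and the helper `fraction_within` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.RandomState(0) # fixed seed so the run is repeatable
true_height, std_dev = 160, 1
samples = rng.normal(true_height, std_dev, 100000)

def fraction_within(data, center, spread, k):
    """Return the fraction of data lying within k * spread of center."""
    return np.mean(np.abs(data - center) <= k * spread)

# Fractions within one, two and three standard deviations of the true height.
fractions = {k: fraction_within(samples, true_height, std_dev, k) for k in (1, 2, 3)}
for k in (1, 2, 3):
    print(k, fractions[k])
```

For normally distributed data the three fractions should come out near 0.68, 0.95 and 0.997. With only 10 draws, as in the cell above, the observed fraction can differ noticeably from these values.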
Khan Academy says it best:
"Probability is simply how likely something is to happen.
Whenever we're unsure about the outcome of an event, we can talk about the probabilities of certain outcomes - how likely they are. The analysis of events governed by probability is called statistics."
The best example for understanding probability is flipping a coin.
There are two possible outcomes - heads or tails.
What's the probability of the coin landing on heads? We can find out using the equation $$P(H) = ? $$ You may already know the likelihood is 50% (assuming a fair coin). But how do we work that out?
$$ P(Event) = \frac{\text{number of ways it can happen}}{\text{total number of outcomes}} $$
This formula applies when every outcome is equally likely. In the case of flipping a fair coin, $P(H) = \frac{1}{2}$.
Let's test this by simulation!
In [13]:
# In this cell we will write a block of code to simulate flipping a coin and plot the results.
p = 0.5 # pick probability of getting heads
n = 10 # pick number of trials
outcome = [] # empty container to store the results
for i in range(n):
    result = rnd.random() # generate a number between 0 and 1 (Hint: we used a built-in function before that can help us with this)
    if result < p: # this happens with probability p, so we call it heads
        outcome.append(1)
    else:
        outcome.append(0)
fig = plt.figure()
plt.hist(outcome)
fig.suptitle('Flipping a Coin, Heads = 1, Tails = 0',fontsize=20)
Out[13]:
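With only 10 flips the histogram can look quite lopsided. As a rough illustration of how the observed fraction of heads approaches $P(H)$ as the number of flips grows, here is a minimal sketch; the function `fraction_heads` and its `seed` argument are hypothetical names introduced for this example.

```python
import random

def fraction_heads(n, p=0.5, seed=0):
    """Simulate n coin flips and return the fraction that came up heads."""
    rng = random.Random(seed) # private generator with a fixed seed for repeatability
    heads = sum(1 for _ in range(n) if rng.random() < p)
    return heads / float(n)

# Try a small, a medium and a large number of flips.
for n in (10, 1000, 100000):
    print(n, fraction_heads(n))
```

The fraction for the largest run should sit close to 0.5, while the 10-flip run may land well away from it.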
In [1]:
Image(url='http://www.milefoot.com/math/discrete/counting/images/cards.png')
The primary deck of 52 playing cards in use today includes thirteen ranks of each of the four French suits, diamonds (♦), spades (♠), hearts (♥) and clubs (♣), with reversible Rouennais "court" or face cards.
Each suit includes an ace, a king, a queen, a jack, and ranks two through ten, with each card depicting that many symbols of its suit.
Two (sometimes one or four) Jokers are included in commercial decks, but many games require them to be removed before play. For simplicity, we will also assume that the Jokers have been removed from our deck.
Going back to our definition of probability
$$ P(Event) = \frac{\text{number of ways it can happen}}{\text{total number of outcomes}}, $$
what is the probability of drawing an ace?
$$ P(ace) = \frac{\text{number of aces}}{\text{total playing cards}} = \frac{4}{52} = \frac{1}{13}. $$
Similarly, we can compute the probability of NOT drawing an ace:
$$ P(not \ ace) = \frac{\text{number of non-ace cards}}{\text{total playing cards}} = \frac{48}{52} = \frac{12}{13}. $$
Equivalently, we can subtract the probability of drawing an ace from 1:
$$ P(not \ ace) = \frac{52}{52} - \frac{\text{number of aces}}{\text{total playing cards}} = \frac{52}{52} - \frac{4}{52} = \frac{48}{52} = \frac{12}{13}. $$
Notice that $P(ace) + P(not \ ace) = 1$. That is, the probability of either drawing an ace from a deck of cards or not drawing an ace is 100%!
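The counting argument above can be reproduced by listing every card in the deck and counting the ones we care about. The sketch below assumes a 52-card deck with no Jokers in which every card is equally likely to be drawn; `rank_names`, `suit_names` and `probability` are illustrative names, not part of the lab's own code.

```python
from fractions import Fraction

rank_names = ["Ace", 2, 3, 4, 5, 6, 7, 8, 9, 10, "Jack", "Queen", "King"]
suit_names = ["Spades", "Hearts", "Diamonds", "Clubs"]

# Every (rank, suit) combination appears exactly once in the deck.
deck = [(rank, suit) for rank in rank_names for suit in suit_names]

def probability(event, outcomes):
    """Count the outcomes where event holds and divide by the total number of outcomes."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))

p_ace = probability(lambda card: card[0] == "Ace", deck)
p_not_ace = probability(lambda card: card[0] != "Ace", deck)

print(p_ace, p_not_ace, p_ace + p_not_ace)
```

Using `Fraction` keeps the results as exact fractions, so they can be compared directly with 1/13 and 12/13.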
In [3]:
# The code below takes a random draw from a deck of cards
# Define the special face cards, suits and color
faces = {11: "Jack", 12: "Queen", 13: "King", 1: "Ace"}
suits = {1: "Spades", 2: "Hearts", 3: "Diamonds", 4: "Clubs"}
colors = {"Spades" : "Black", "Clubs" : "Black", "Diamonds" : "Red", "Hearts" : "Red"}
# Draw a random integer value. Card_num can be any of the 13 cards to choose from and suits picks a random suit
card_num = rnd.randint(1,13)
suit_num = rnd.randint(1,4)
# Get the card face, suit and corresponding color
card_face = faces.get(card_num,card_num)
card_suit = suits[suit_num]
card_color = colors.get(card_suit)
print("{} {} {}".format(card_face, card_suit, card_color))
In [ ]:
num_draws = 10 # Can change number of draws but slow (on personal computer)
cards = [] # each entry is [card face, card suit, number of times drawn]
for i in range(num_draws):
    # Draw a random integer value. Card_num can be any of the 13 cards to choose from and suits picks a random suit
    card_num = rnd.randint(1,13)
    suit_num = rnd.randint(1,4)
    # Get the card face and suit
    card_face = faces.get(card_num,card_num)
    card_suit = suits[suit_num]
    # Note: Won't remove duplicates so that we see probability of drawing a specific card
    # Instead, count number of occurrences
    for card in cards:
        if card[0] == card_face and card[1] == card_suit:
            card[2] = card[2] + 1 # this card was drawn before, so increase its count
            break
    else:
        cards.append([card_face,card_suit,1]) # first time this card has been drawn
print(cards)
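With many more draws, the observed frequency of aces can be compared against the value $\frac{1}{13}$ computed earlier. The following is a minimal sketch, assuming draws with replacement (as in the cell above) and using `collections.Counter` for the bookkeeping; `draw_card` is a hypothetical helper and rank 1 stands for the ace, matching the `faces` dictionary.

```python
import random
from collections import Counter

def draw_card(rng):
    """Draw one card at random (with replacement) as a (rank, suit) pair of integers."""
    return rng.randint(1, 13), rng.randint(1, 4)

rng = random.Random(0) # fixed seed so the run is repeatable
num_draws = 100000

# Counter maps each (rank, suit) pair to the number of times it was drawn.
counts = Counter(draw_card(rng) for _ in range(num_draws))

# Add up the counts of the four aces (rank 1 in any suit).
ace_draws = sum(count for (rank, suit), count in counts.items() if rank == 1)
p_ace_estimate = ace_draws / float(num_draws)

print(p_ace_estimate, 1.0 / 13)
```

The estimate should land close to 1/13, which is about 0.077, and it gets closer as `num_draws` increases.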