In [17]:
# HIDDEN
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import math
import numpy as np
from scipy import stats
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
import nbinteract as nbi
The correlation coefficient measures the strength of the linear relationship between two variables. Graphically, it measures how clustered the scatter diagram is around a straight line.
The term correlation coefficient isn't easy to say, so it is usually shortened to correlation and denoted by $r$.
Here are some mathematical facts about $r$ that we will just observe by simulation.
The function r_scatter takes an array of $x$-values and a value of $r$ as its arguments and generates $y$-values so that the resulting scatter plot has a correlation very close to $r$. Because of randomness in the simulation, the correlation is not expected to be exactly equal to $r$.
Call r_scatter a few times, with different values of $r$ as the argument, and see how the scatter plot changes.
When $r=1$ the scatter plot is perfectly linear and slopes upward. When $r=-1$, the scatter plot is perfectly linear and slopes downward. When $r=0$, the scatter plot is a formless cloud around the horizontal axis, and the variables are said to be uncorrelated.
In [6]:
z = np.random.normal(0, 1, 500)
def r_scatter(xs, r):
    """
    Generate y-values for a scatter plot with correlation approximately r
    """
    return r*xs + (np.sqrt(1-r**2))*z
corr_opts = {
    'aspect_ratio': 1,
    'xlim': (-3.5, 3.5),
    'ylim': (-3.5, 3.5),
}
nbi.scatter(np.random.normal(size=500), r_scatter, options=corr_opts, r=(-1, 1, 0.05))
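For readers without the interactive widget, the same simulation can be sketched with NumPy alone. This is a minimal illustration, not the notebook's own code: the name simulate_ys, the fixed seed, and the use of np.corrcoef to measure the observed correlation are assumptions introduced here.

```python
import numpy as np

# Fixed seed so the sketch is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(0)
xs = rng.normal(size=500)
noise = rng.normal(size=500)

def simulate_ys(xs, r, noise):
    """Return y-values whose correlation with xs is approximately r."""
    return r * xs + np.sqrt(1 - r**2) * noise

# Compare the requested r with the correlation actually observed.
observed = {}
for r in (-1, -0.5, 0, 0.5, 1):
    ys = simulate_ys(xs, r, noise)
    observed[r] = np.corrcoef(xs, ys)[0, 1]
    print(f"requested r = {r:5.2f}, observed r = {observed[r]:6.3f}")
```

At $r=1$ and $r=-1$ the noise term vanishes, so the observed correlation matches exactly; for values in between it should land close to, but not exactly on, the requested $r$.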
The formula for $r$ is not apparent from our observations so far. It has a mathematical basis that is outside the scope of this class. However, as you will see, the calculation is straightforward and helps us understand several of the properties of $r$.
Formula for $r$:
$r$ is the average of the products of the two variables, when both variables are measured in standard units.
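In symbols, assuming $n$ pairs $(x_i, y_i)$ with means $\bar{x}$, $\bar{y}$ and standard deviations $\sigma_x$, $\sigma_y$ computed by dividing by $n$ (the default behavior of np.std), the statement above can be written as follows. The notation is introduced here for illustration only.

$$
r = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{\sigma_x}\right)\left(\frac{y_i - \bar{y}}{\sigma_y}\right)
$$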
Here are the steps in the calculation. We will apply the steps to a simple table of values of $x$ and $y$.
In [18]:
x = np.arange(1, 7, 1)
y = make_array(2, 3, 1, 5, 2, 7)
t = Table().with_columns(
    'x', x,
    'y', y
)
t
Out[18]:
Based on the scatter diagram below, we expect that $r$ will be positive but not equal to 1.
In [19]:
nbi.scatter(t.column(0), t.column(1), options={'aspect_ratio': 1})
Step 1. Convert each variable to standard units.
In [21]:
def standard_units(nums):
    """Convert an array of numbers to standard units."""
    return (nums - np.mean(nums)) / np.std(nums)
In [22]:
t_su = t.with_columns(
    'x (standard units)', standard_units(x),
    'y (standard units)', standard_units(y)
)
t_su
Out[22]:
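A quick way to sanity-check Step 1 is to confirm that values in standard units have mean 0 and standard deviation 1. The sketch below uses plain NumPy arrays rather than a Table; the name y_su is introduced here for illustration.

```python
import numpy as np

def standard_units(nums):
    """Convert an array of numbers to standard units."""
    return (nums - np.mean(nums)) / np.std(nums)

y = np.array([2, 3, 1, 5, 2, 7])
y_su = standard_units(y)

# Up to floating-point rounding, the mean is 0 and the SD is 1.
print(np.mean(y_su), np.std(y_su))
```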
Step 2. Multiply each pair of standard units.
In [23]:
t_product = t_su.with_column('product of standard units', t_su.column(2) * t_su.column(3))
t_product
Out[23]:
Step 3. $r$ is the average of the products computed in Step 2.
In [24]:
# r is the average of the products of standard units
r = np.mean(t_product.column(4))
r
Out[24]:
As expected, $r$ is positive but not equal to 1.
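The three steps can also be cross-checked against NumPy's built-in np.corrcoef, which returns a matrix of correlations. This is a sketch on plain arrays; the names r_by_steps and r_numpy are introduced here for illustration.

```python
import numpy as np

def standard_units(nums):
    """Convert an array of numbers to standard units."""
    return (nums - np.mean(nums)) / np.std(nums)

x = np.arange(1, 7)
y = np.array([2, 3, 1, 5, 2, 7])

# Steps 1-3: standard units, pairwise products, then the average.
r_by_steps = np.mean(standard_units(x) * standard_units(y))

# The off-diagonal entry of the 2x2 correlation matrix is r.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_by_steps, r_numpy)  # both approximately 0.617
```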
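The calculation shows that $r$ is a pure number with no units, because it is based on standard units. For the same reason, $r$ is unaffected by changing the units on either axis. It is also unaffected by switching the axes, because the product of standard units does not depend on which variable is called $x$ and which is called $y$. The plot below switches the axes of the earlier scatter diagram.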
In [25]:
nbi.scatter(t.column(1), t.column(0), options={'aspect_ratio': 1})
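A small numerical check can make the role of standard units concrete: rescaling a variable by a positive factor, shifting it, or swapping the two variables should leave $r$ unchanged. The conversion factors below are hypothetical and chosen only for illustration, as is the helper name r_of.

```python
import numpy as np

def standard_units(nums):
    """Convert an array of numbers to standard units."""
    return (nums - np.mean(nums)) / np.std(nums)

def r_of(a, b):
    """Average of the products of a and b in standard units."""
    return np.mean(standard_units(a) * standard_units(b))

x = np.arange(1, 7)
y = np.array([2, 3, 1, 5, 2, 7])

r_original = r_of(x, y)
# Hypothetical unit changes: a positive rescaling and a shift.
# (A negative scale factor would flip the sign of r.)
r_rescaled = r_of(2.54 * x, 10 * y + 32)
# Same data with the roles of the two variables swapped.
r_swapped = r_of(y, x)

print(r_original, r_rescaled, r_swapped)
```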
The correlation function
We are going to be calculating correlations repeatedly, so it will help to define a function that computes it by performing all the steps described above. Let's define a function correlation that takes a table and the labels of two columns in the table. The function returns $r$, the mean of the products of those column values in standard units.
In [26]:
def correlation(t, x, y):
    return np.mean(standard_units(t.column(x))*standard_units(t.column(y)))
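When the data are plain arrays rather than a Table, the same idea can be sketched as below. The name correlation_arrays and the explicit error for constant input are assumptions made for this illustration; the point of the guard is that a variable with standard deviation 0 cannot be converted to standard units, so $r$ is undefined in that case.

```python
import numpy as np

def correlation_arrays(x, y):
    """Hypothetical array-based variant of correlation(t, x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same length")
    sd_x, sd_y = np.std(x), np.std(y)
    if sd_x == 0 or sd_y == 0:
        # Standard units would require dividing by zero.
        raise ValueError("correlation is undefined when a variable is constant")
    x_su = (x - np.mean(x)) / sd_x
    y_su = (y - np.mean(y)) / sd_y
    return np.mean(x_su * y_su)

r_example = correlation_arrays([1, 2, 3, 4, 5, 6], [2, 3, 1, 5, 2, 7])
print(r_example)  # approximately 0.617
```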
In [31]:
interact(correlation, t=fixed(t),
         x=widgets.ToggleButtons(options=['x', 'y'], description='x-axis'),
         y=widgets.ToggleButtons(options=['x', 'y'], description='y-axis'))
Out[31]:
Let's call the function on the x and y columns of t. The function returns the same value for the correlation between $x$ and $y$ as we got by direct application of the formula for $r$.
In [24]:
correlation(t, 'x', 'y')
Out[24]:
As we noticed, the order in which the variables are specified doesn't matter.
In [25]:
correlation(t, 'y', 'x')
Out[25]:
Calling correlation on columns of the table suv gives us the correlation between price and mileage as well as the correlation between price and acceleration.
In [34]:
suv = (Table.read_table('https://www.inferentialthinking.com/notebooks/hybrid.csv')
       .where('class', 'SUV'))
interact(correlation, t=fixed(suv),
         x=widgets.ToggleButtons(options=['mpg', 'msrp', 'acceleration'],
                                 description='x-axis'),
         y=widgets.ToggleButtons(options=['mpg', 'msrp', 'acceleration'],
                                 description='y-axis'))
Out[34]:
In [26]:
correlation(suv, 'mpg', 'msrp')
Out[26]:
In [27]:
correlation(suv, 'acceleration', 'msrp')
Out[27]:
These values confirm what we had observed:
Correlation is a simple and powerful concept, but it is sometimes misused. Before using $r$, it is important to be aware of what correlation does and does not measure.
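As one illustration of what $r$ does not capture, consider made-up data in which $y$ is completely determined by $x$ but the relationship is not linear. The data below are invented for this sketch. Because $r$ measures only linear association, it comes out at essentially zero even though the two variables are strongly related.

```python
import numpy as np

def standard_units(nums):
    """Convert an array of numbers to standard units."""
    return (nums - np.mean(nums)) / np.std(nums)

# Made-up data: y is a perfect function of x, but a curved one.
x_curve = np.arange(-3, 4)   # -3, -2, ..., 3
y_curve = x_curve ** 2

r_curve = np.mean(standard_units(x_curve) * standard_units(y_curve))
print(r_curve)  # zero, up to floating-point rounding
```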