In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='white')
from utils import decorate
from thinkstats2 import Pmf, Cdf
import thinkstats2
import thinkplot
In [ ]:
%time brfss = pd.read_hdf('brfss.hdf5', 'brfss')
In [ ]:
brfss.shape
In [ ]:
brfss.head()
In [ ]:
brfss.describe()
In [ ]:
height = brfss['HTM4']
weight = brfss['WTKG3']
In [ ]:
plt.plot(height, weight, 'o')
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg');
The center of this plot is saturated, so it is not as dark as it should be, which means the rest of the plot is relatively darker than it should be. It gives too much visual weight to the outliers and obscures the shape of the relationship.
Exercise: Use keywords alpha
and markersize
to avoid saturation.
In [ ]:
# Solution goes here
With transparency and smaller markers, you will be able to see that height and weight are discretized.
Exercise: Use np.random.normal
to add enough noise to height and weight so the vertical lines in the scatter plot are blurred out. Create variables named height_jitter
and weight_jitter
.
In [ ]:
# Solution goes here
In [ ]:
from scipy.stats import linregress
subset = brfss.dropna(subset=['WTKG3', 'HTM4'])
xs = subset['HTM4']
ys = subset['WTKG3']
res = linregress(xs, ys)
res
The LinregressResult
object contains the estimated parameters and a few other statistics.
We can use the estimated slope
and intercept
to plot the line of best fit.
In [ ]:
# jitter the data
height_jitter = height + np.random.normal(0, 2, size=len(height))
weight_jitter = weight + np.random.normal(0, 2, size=len(weight))
# make the scatter plot
plt.plot(height_jitter, weight_jitter, 'o', markersize=1, alpha=0.02)
plt.axis([140, 200, 0, 160])
# plot the line of best fit
fx = np.array([xs.min(), xs.max()])
fy = res.intercept + res.slope * fx
plt.plot(fx, fy, '-', alpha=0.5)
# label the axes
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')
plt.axis([140, 200, 0, 160]);
In [ ]:
# Solution goes here
Exercise: Use linregress
to estimate the slope and intercept of the line of best fit for this data.
Note: as in the previous example, use dropna
to drop rows that contain NaN for either variable, and use the resulting subset to compute the arguments for linregress
.
In [ ]:
# Solution goes here
Exercise: Generate a plot that shows the estimated line and a scatter plot of the data.
In [ ]:
# Solution goes here
The Seaborn package, which is usually imported as sns
, provides two functions used to show the distribution of one variable as a function of another variable.
The following box plot shows the distribution of weight in each age category. Read the documentation so you know what it means.
In [ ]:
data = brfss.dropna(subset=['AGE', 'WTKG3'])
sns.boxplot(x='AGE', y='WTKG3', data=data, whis=10)
sns.despine(left=True, bottom=True)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg');
This figure makes the shape of the relationship clearer; average weight increases between ages 20 and 50, and then decreases.
A violin plot is another way to show the same thing. Again, read the documentation so you know what it means.
In [ ]:
sns.violinplot(x='AGE', y='WTKG3', data=data, inner=None)
sns.despine(left=True, bottom=True)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg');
Exercise: Make a box plot that shows the distribution of weight as a function of income. The variable INCOME2
contains income codes with 8 levels.
Use dropna
to select the rows with valid income and weight information.
In [ ]:
# Solution goes here
Exercise: Make a violin plot with the same variables.
In [ ]:
# Solution goes here
In [ ]:
grouped = brfss.groupby('AGE')
for name, group in grouped['WTKG3']:
print(name, group.median())
To get the other percentiles, we can use a Cdf
.
In [ ]:
ps = [95, 75, 50, 25, 5]
for name, group in grouped['WTKG3']:
percentiles = Cdf(group).Percentiles(ps)
print(name, percentiles)
Now I'll collect those results in a list of arrays:
In [ ]:
res = []
for name, group in grouped['WTKG3']:
percentiles = Cdf(group).Percentiles(ps)
res.append(percentiles)
res
To get the age groups, we can extract the "keys" from the groupby object.
In [ ]:
xs = grouped.groups.keys()
xs
Now, we want to loop through the columns of the list of arrays; to do that, we want to transpose it.
In [ ]:
rows = np.transpose(res)
rows
Now we can plot the percentiles across the groups.
In [ ]:
width = [1,2,5,2,1]
for i, qs in enumerate(rows):
plt.plot(xs, qs, label=ps[i], linewidth=width[i], color='C4')
decorate(xlabel='Age (years)',
ylabel='Weight (kg)')
In my opinion, this plot shows the shape of the relationship most clearly.
In [ ]:
sns.boxplot(x='HTM4', y='WTKG3', data=data, whis=10)
sns.despine(left=True, bottom=True)
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg');
This would look better and mean more if there were fewer height groups. We can use pd.cut
to put people into height groups where each group spans 10 cm.
In [ ]:
bins = np.arange(0, height.max(), 10)
brfss['_HTMG10'] = pd.cut(brfss['HTM4'], bins=bins, labels=bins[:-1]).astype(float)
Now here's what the plot looks like.
In [ ]:
sns.boxplot(x='_HTMG10', y='WTKG3', data=brfss, whis=10)
plt.xticks(rotation=30)
sns.despine(left=True, bottom=True)
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg');
Exercise: Plot percentiles of weight versus these height groups.
In [ ]:
# Solution goes here
In [ ]:
# Solution goes here
In [ ]:
# Solution goes here
In [ ]:
# Solution goes here
Exercise: The variable _VEGESU1
contains the self-reported number of serving of vegetables each respondent eats per day. Explore relationships between this variable and the others variables in the dataset, and design visualizations that show any relationship you find as clearly as possible.
In [ ]:
# Solution goes here
In [ ]:
# Solution goes here
In [ ]:
# Solution goes here
In [ ]:
subset = brfss[['HTM4', 'WTKG3', 'AGE']]
subset.corr()
Exercise: Compute a correlation matrix for age, income, and vegetable servings.
In [ ]:
subset = brfss[['AGE', 'INCOME2', '_VEGESU1']]
subset.corr()
To calibrate your sense of correlation, let's look at scatter plots for fake data with different values of rho
.
The following function generates random normally-distributed data with approximately the given coefficient of correlation.
In [ ]:
def gen_corr(rho):
means = [0, 0]
covs = [[1, rho], [rho, 1]]
m = np.random.multivariate_normal(means, covs, 100)
return np.transpose(m)
This function makes a scatter plot and shows the actual value of rho
.
In [ ]:
def plot_scatter(rho, seed=1):
np.random.seed(seed)
xs, ys = gen_corr(rho)
rho = np.corrcoef(xs, ys)[0][1]
plt.plot(xs, ys, 'o', alpha=0.5)
plt.xlabel('x')
plt.ylabel('y')
ax = plt.gca()
label_rho(ax, rho)
return xs, ys
In [ ]:
def label_rho(ax, rho):
label = 'ρ = %0.2f' % rho
plt.text(0.05, 0.95, label,
horizontalalignment='left',
verticalalignment='top',
transform=ax.transAxes,
fontsize=12)
The following plots show what scatter plots look like with different values of rho
.
In [ ]:
res = []
xs, ys = plot_scatter(0, seed=18)
res.append((xs, ys))
In [ ]:
xs, ys = plot_scatter(0.25, seed=18)
res.append((xs, ys))
In [ ]:
xs, ys = plot_scatter(0.5, seed=18)
res.append((xs, ys))
In [ ]:
xs, ys = plot_scatter(0.75, seed=18)
res.append((xs, ys))
In [ ]:
xs, ys = plot_scatter(0.95, seed=18)
res.append((xs, ys))
Here are all the plots side-by-side for comparison.
In [ ]:
fig, axes = plt.subplots(ncols=5, sharey=True, figsize=(15,3))
for ax, (xs, ys) in zip(axes, res):
ax.plot(xs, ys, 'o', alpha=0.5)
rho = np.corrcoef(xs, ys)[0][1]
label_rho(ax, rho)
In [ ]:
np.random.seed(18)
xs = np.linspace(-1, 1)
ys = xs**2 + np.random.normal(0, 0.05, len(xs))
plt.plot(xs, ys, 'o', alpha=0.5)
plt.xlabel('x')
plt.ylabel('y');
This relationship is quite strong, in the sense that we can make a much better guess about y
if we know x
than if we don't.
But if we compute correlations, they don't show the relationship.
In [ ]:
df = pd.DataFrame(dict(xs=xs, ys=ys))
df.corr(method='pearson')
In [ ]:
df.corr(method='spearman')
In [ ]:
df.corr(method='kendall')
In [ ]:
np.random.seed(18)
xs = np.linspace(20, 50)
ys1 = 75 + 0.02 * xs + np.random.normal(0, 0.15, len(xs))
plt.plot(xs, ys1, 'o', alpha=0.5)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
rho = np.corrcoef(xs, ys1)[0][1]
label_rho(plt.gca(), rho)
In [ ]:
np.random.seed(18)
xs = np.linspace(20, 50)
ys2 = 65 + 0.2 * xs + np.random.normal(0, 3, len(xs))
plt.plot(xs, ys2, 'o', alpha=0.5)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
rho = np.corrcoef(xs, ys2)[0][1]
label_rho(plt.gca(), rho)
Which relationship is stronger?
It depends on what we mean. Clearly, the first one has a higher coefficient of correlation. In that world, knowing someone's age would allow you to make a better guess about their weight.
But look more closely at the y-axis in the two plots. How much weight do people gain per year in each of these hypothetical worlds?
In [ ]:
from scipy.stats import linregress
res = linregress(xs, ys1)
res
In [ ]:
res = linregress(xs, ys2)
res
In fact, the slope for the second data set is almost 10 times higher.
The following figures show the same data again, this time with the line of best fit and the estimated slope.
In [ ]:
def label_slope(ax, slope):
label = 'm = %0.3f' % slope
plt.text(0.05, 0.95, label,
horizontalalignment='left',
verticalalignment='top',
transform=ax.transAxes,
fontsize=12)
In [ ]:
res = linregress(xs, ys1)
fx = np.array([xs.min(), xs.max()])
fy = res.intercept + res.slope * fx
plt.plot(xs, ys1, 'o', alpha=0.5)
plt.plot(fx, fy, '-', alpha=0.5)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
label_slope(plt.gca(), res.slope)
plt.gca().get_ylim()
In [ ]:
res = linregress(xs, ys2)
fx = np.array([xs.min(), xs.max()])
fy = res.intercept + res.slope * fx
plt.plot(xs, ys2, 'o', alpha=0.5)
plt.plot(fx, fy, '-', alpha=0.5)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
label_slope(plt.gca(), res.slope)
plt.gca().get_ylim()
The difference is not obvious from looking at the figure; you have to look carefully at the y-axis labels and the estimated slope.
And you have to interpret the slope in context. In the first case, people gain about 0.019 kg per year, which works out to less than half a pound per decade. In the second case, they gain almost 4 pounds per decade.
But remember that in the first case, the coefficient of correlation is substantially higher.
Exercise: So, in which case is the relationship "stronger"? Write a sentence or two below to summarize your thoughts.