homework03


Homework 3

Visualizing relationships between variables

Allen Downey

MIT License


In [ ]:
%matplotlib inline

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='white')

from utils import decorate
from thinkstats2 import Pmf, Cdf

import thinkstats2
import thinkplot

Loading


In [ ]:
%time brfss = pd.read_hdf('brfss.hdf5', 'brfss')

In [ ]:
brfss.shape

In [ ]:
brfss.head()

In [ ]:
brfss.describe()

Scatter plot

Scatter plots are a good way to visualize the relationship between two variables, but it is surprising hard to make a good one.

Here's a simple plot of height and weight.


In [ ]:
height = brfss['HTM4']
weight = brfss['WTKG3']

In [ ]:
plt.plot(height, weight, 'o')

plt.xlabel('Height in cm')
plt.ylabel('Weight in kg');

The center of this plot is saturated, so it is not as dark as it should be, which means the rest of the plot is relatively darker than it should be. It gives too much visual weight to the outliers and obscures the shape of the relationship.

Exercise: Use keywords alpha and markersize to avoid saturation.


In [ ]:
# Solution goes here

With transparency and smaller markers, you will be able to see that height and weight are discretized.

Exercise: Use np.random.normal to add enough noise to height and weight so the vertical lines in the scatter plot are blurred out. Create variables named height_jitter and weight_jitter.


In [ ]:
# Solution goes here

Linear regression

We can use scipy.stats to find the linear least squares fit to weight as a function of height.


In [ ]:
from scipy.stats import linregress

subset = brfss.dropna(subset=['WTKG3', 'HTM4'])
xs = subset['HTM4']
ys = subset['WTKG3']

res = linregress(xs, ys)
res

The LinregressResult object contains the estimated parameters and a few other statistics.

We can use the estimated slope and intercept to plot the line of best fit.


In [ ]:
# jitter the data
height_jitter = height + np.random.normal(0, 2, size=len(height))
weight_jitter = weight + np.random.normal(0, 2, size=len(weight))

# make the scatter plot
plt.plot(height_jitter, weight_jitter, 'o', markersize=1, alpha=0.02)
plt.axis([140, 200, 0, 160])

# plot the line of best fit
fx = np.array([xs.min(), xs.max()])
fy = res.intercept + res.slope * fx
plt.plot(fx, fy, '-', alpha=0.5)

# label the axes
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')
plt.axis([140, 200, 0, 160]);

Weight and age

Exercise: Make a scatter plot of weight and age. The variable AGE is discretized in 5-year intervals, so you might want to jitter it.

Adjust transparency and marker size to generate the best view of the relationship.


In [ ]:
# Solution goes here

Exercise: Use linregress to estimate the slope and intercept of the line of best fit for this data.

Note: as in the previous example, use dropna to drop rows that contain NaN for either variable, and use the resulting subset to compute the arguments for linregress.


In [ ]:
# Solution goes here

Exercise: Generate a plot that shows the estimated line and a scatter plot of the data.


In [ ]:
# Solution goes here

Box and violin plots

The Seaborn package, which is usually imported as sns, provides two functions used to show the distribution of one variable as a function of another variable.

The following box plot shows the distribution of weight in each age category. Read the documentation so you know what it means.


In [ ]:
data = brfss.dropna(subset=['AGE', 'WTKG3'])

sns.boxplot(x='AGE', y='WTKG3', data=data, whis=10)

sns.despine(left=True, bottom=True)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg');

This figure makes the shape of the relationship clearer; average weight increases between ages 20 and 50, and then decreases.

A violin plot is another way to show the same thing. Again, read the documentation so you know what it means.


In [ ]:
sns.violinplot(x='AGE', y='WTKG3', data=data, inner=None)

sns.despine(left=True, bottom=True)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg');

Exercise: Make a box plot that shows the distribution of weight as a function of income. The variable INCOME2 contains income codes with 8 levels.

Use dropna to select the rows with valid income and weight information.


In [ ]:
# Solution goes here

Exercise: Make a violin plot with the same variables.


In [ ]:
# Solution goes here

Plotting percentiles

One more way to show the relationship between two variables is to break one variables into groups and plot percentiles of the other variable across groups.

As a starting place, here's the median weight in each age group.


In [ ]:
grouped = brfss.groupby('AGE')

for name, group in grouped['WTKG3']:
    print(name, group.median())

To get the other percentiles, we can use a Cdf.


In [ ]:
ps = [95, 75, 50, 25, 5]

for name, group in grouped['WTKG3']:
    percentiles = Cdf(group).Percentiles(ps)
    print(name, percentiles)

Now I'll collect those results in a list of arrays:


In [ ]:
res = []
for name, group in grouped['WTKG3']:
    percentiles = Cdf(group).Percentiles(ps)
    res.append(percentiles)
    
res

To get the age groups, we can extract the "keys" from the groupby object.


In [ ]:
xs = grouped.groups.keys()
xs

Now, we want to loop through the columns of the list of arrays; to do that, we want to transpose it.


In [ ]:
rows = np.transpose(res)
rows

Now we can plot the percentiles across the groups.


In [ ]:
width = [1,2,5,2,1]

for i, qs in enumerate(rows):
    plt.plot(xs, qs, label=ps[i], linewidth=width[i], color='C4')
    
decorate(xlabel='Age (years)',
         ylabel='Weight (kg)')

In my opinion, this plot shows the shape of the relationship most clearly.

Discretizing variables

Box plot, violin plots, and percentile line plots don't work as well if the number of groups on the x-axis is too big. For example, here's a box plot of weight versus height.


In [ ]:
sns.boxplot(x='HTM4', y='WTKG3', data=data, whis=10)

sns.despine(left=True, bottom=True)
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg');

This would look better and mean more if there were fewer height groups. We can use pd.cut to put people into height groups where each group spans 10 cm.


In [ ]:
bins = np.arange(0, height.max(), 10)
brfss['_HTMG10'] = pd.cut(brfss['HTM4'], bins=bins, labels=bins[:-1]).astype(float)

Now here's what the plot looks like.


In [ ]:
sns.boxplot(x='_HTMG10', y='WTKG3', data=brfss, whis=10)
plt.xticks(rotation=30)

sns.despine(left=True, bottom=True)
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg');

Exercise: Plot percentiles of weight versus these height groups.


In [ ]:
# Solution goes here

In [ ]:
# Solution goes here

In [ ]:
# Solution goes here

In [ ]:
# Solution goes here

Vegetables

Exercise: The variable _VEGESU1 contains the self-reported number of serving of vegetables each respondent eats per day. Explore relationships between this variable and the others variables in the dataset, and design visualizations that show any relationship you find as clearly as possible.


In [ ]:
# Solution goes here

In [ ]:
# Solution goes here

In [ ]:
# Solution goes here

Correlation

One way to compute correlations is the Pandas method corr, which returns a correlation matrix.


In [ ]:
subset = brfss[['HTM4', 'WTKG3', 'AGE']]
subset.corr()

Exercise: Compute a correlation matrix for age, income, and vegetable servings.


In [ ]:
subset = brfss[['AGE', 'INCOME2', '_VEGESU1']]
subset.corr()

Correlation calibration

To calibrate your sense of correlation, let's look at scatter plots for fake data with different values of rho.

The following function generates random normally-distributed data with approximately the given coefficient of correlation.


In [ ]:
def gen_corr(rho):
    means = [0, 0]
    covs = [[1, rho], [rho, 1]]
    m = np.random.multivariate_normal(means, covs, 100)
    return np.transpose(m)

This function makes a scatter plot and shows the actual value of rho.


In [ ]:
def plot_scatter(rho, seed=1):
    np.random.seed(seed)
    xs, ys = gen_corr(rho)
    rho = np.corrcoef(xs, ys)[0][1]

    plt.plot(xs, ys, 'o', alpha=0.5)
    plt.xlabel('x')
    plt.ylabel('y')
    ax = plt.gca()
    label_rho(ax, rho)
    
    return xs, ys

In [ ]:
def label_rho(ax, rho):
    label = 'ρ = %0.2f' % rho
    plt.text(0.05, 0.95, label, 
             horizontalalignment='left', 
             verticalalignment='top', 
             transform=ax.transAxes,
             fontsize=12)

The following plots show what scatter plots look like with different values of rho.


In [ ]:
res = []
xs, ys = plot_scatter(0, seed=18)
res.append((xs, ys))

In [ ]:
xs, ys = plot_scatter(0.25, seed=18)
res.append((xs, ys))

In [ ]:
xs, ys = plot_scatter(0.5, seed=18)
res.append((xs, ys))

In [ ]:
xs, ys = plot_scatter(0.75, seed=18)
res.append((xs, ys))

In [ ]:
xs, ys = plot_scatter(0.95, seed=18)
res.append((xs, ys))

Here are all the plots side-by-side for comparison.


In [ ]:
fig, axes = plt.subplots(ncols=5, sharey=True, figsize=(15,3)) 

for ax, (xs, ys) in zip(axes, res):
    ax.plot(xs, ys, 'o', alpha=0.5)
    rho = np.corrcoef(xs, ys)[0][1]
    label_rho(ax, rho)

Nonlinear relationships

Here an example that generates fake data with a nonlinear relationship.


In [ ]:
np.random.seed(18)
xs = np.linspace(-1, 1)
ys = xs**2 + np.random.normal(0, 0.05, len(xs))

plt.plot(xs, ys, 'o', alpha=0.5)
plt.xlabel('x')
plt.ylabel('y');

This relationship is quite strong, in the sense that we can make a much better guess about y if we know x than if we don't.

But if we compute correlations, they don't show the relationship.


In [ ]:
df = pd.DataFrame(dict(xs=xs, ys=ys))
df.corr(method='pearson')

In [ ]:
df.corr(method='spearman')

In [ ]:
df.corr(method='kendall')

Correlation strength

Here are two fake datasets showing hypothetical relationships between weight and age.


In [ ]:
np.random.seed(18)
xs = np.linspace(20, 50)
ys1 = 75 + 0.02 * xs + np.random.normal(0, 0.15, len(xs))

plt.plot(xs, ys1, 'o', alpha=0.5)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg')

rho = np.corrcoef(xs, ys1)[0][1]
label_rho(plt.gca(), rho)

In [ ]:
np.random.seed(18)
xs = np.linspace(20, 50)
ys2 = 65 + 0.2 * xs + np.random.normal(0, 3, len(xs))

plt.plot(xs, ys2, 'o', alpha=0.5)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg')

rho = np.corrcoef(xs, ys2)[0][1]
label_rho(plt.gca(), rho)

Which relationship is stronger?

It depends on what we mean. Clearly, the first one has a higher coefficient of correlation. In that world, knowing someone's age would allow you to make a better guess about their weight.

But look more closely at the y-axis in the two plots. How much weight do people gain per year in each of these hypothetical worlds?


In [ ]:
from scipy.stats import linregress
        
res = linregress(xs, ys1)
res

In [ ]:
res = linregress(xs, ys2)
res

In fact, the slope for the second data set is almost 10 times higher.

The following figures show the same data again, this time with the line of best fit and the estimated slope.


In [ ]:
def label_slope(ax, slope):
    label = 'm = %0.3f' % slope
    plt.text(0.05, 0.95, label, 
             horizontalalignment='left', 
             verticalalignment='top', 
             transform=ax.transAxes,
             fontsize=12)

In [ ]:
res = linregress(xs, ys1)
fx = np.array([xs.min(), xs.max()])
fy = res.intercept + res.slope * fx

plt.plot(xs, ys1, 'o', alpha=0.5)
plt.plot(fx, fy, '-', alpha=0.5)

plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
label_slope(plt.gca(), res.slope)

plt.gca().get_ylim()

In [ ]:
res = linregress(xs, ys2)
fx = np.array([xs.min(), xs.max()])
fy = res.intercept + res.slope * fx

plt.plot(xs, ys2, 'o', alpha=0.5)
plt.plot(fx, fy, '-', alpha=0.5)

plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
label_slope(plt.gca(), res.slope)

plt.gca().get_ylim()

The difference is not obvious from looking at the figure; you have to look carefully at the y-axis labels and the estimated slope.

And you have to interpret the slope in context. In the first case, people gain about 0.019 kg per year, which works out to less than half a pound per decade. In the second case, they gain almost 4 pounds per decade.

But remember that in the first case, the coefficient of correlation is substantially higher.

Exercise: So, in which case is the relationship "stronger"? Write a sentence or two below to summarize your thoughts.