Relationships between variables

Copyright 2015 Allen Downey

License: Creative Commons Attribution 4.0 International


In [1]:
# this future import makes this code mostly compatible with Python 2 and 3
from __future__ import print_function, division

import numpy as np
import pandas as pd
import math

import thinkplot
import thinkstats2

np.random.seed(17)

%matplotlib inline

To explore the relationship between height and weight, I'll load data from the Behavioral Risk Factor Surveillance System (BRFSS).


In [2]:
def ReadBrfss(filename='CDBRFS08.ASC.gz', compression='gzip', nrows=None):
    """Reads the BRFSS data.

    filename: string
    compression: string
    nrows: int number of rows to read, or None for all

    returns: DataFrame
    """
    var_info = [
        ('age', 101, 102, int),
        ('sex', 143, 143, int),
        ('wtyrago', 127, 130, int),
        ('finalwt', 799, 808, int),
        ('wtkg2', 1254, 1258, int),
        ('htm3', 1251, 1253, int),
        ]
    columns = ['name', 'start', 'end', 'type']
    variables = pd.DataFrame(var_info, columns=columns)
    variables.end += 1
    dct = thinkstats2.FixedWidthVariables(variables, index_base=1)

    df = dct.ReadFixedWidth(filename, compression=compression, nrows=nrows)
    CleanBrfssFrame(df)
    return df
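Under the hood, `FixedWidthVariables` slices each record by column position. The same kind of parsing can be sketched with `pandas.read_fwf`; the two-record snippet and column positions below are made up for illustration, not the real BRFSS layout:

```python
import io
import pandas as pd

# Hypothetical fixed-width snippet: age in columns 1-3, height (cm) in
# columns 4-6 (1-based, inclusive, the way a codebook describes them).
raw = io.StringIO(" 42170\n 65163\n")

# read_fwf uses 0-based, half-open column spans, so a codebook entry
# (start=1, end=3) becomes the colspec (0, 3).
df = pd.read_fwf(raw, colspecs=[(0, 3), (3, 6)], names=['age', 'htm3'])
print(df)
```

The `variables.end += 1` line in `ReadBrfss` does the same off-by-one adjustment, converting inclusive codebook positions to half-open slices.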

The following function cleans some of the variables we'll need.


In [3]:
def CleanBrfssFrame(df):
    """Recodes BRFSS variables.

    df: DataFrame
    """
    # clean age
    df.age.replace([7, 9], float('NaN'), inplace=True)

    # clean height
    df.htm3.replace([999], float('NaN'), inplace=True)

    # clean weight
    df.wtkg2.replace([99999], float('NaN'), inplace=True)
    df.wtkg2 /= 100.0

    # clean weight a year ago: values below 9000 are in pounds,
    # values of 9000+ encode kilograms as 9000 + kg
    df.wtyrago.replace([7777, 9999], float('NaN'), inplace=True)
    df['wtyrago'] = df.wtyrago.apply(lambda x: x/2.2 if x < 9000 else x-9000)

Now we'll read the data into a Pandas DataFrame.


In [4]:
brfss = ReadBrfss(nrows=None)
brfss.shape


Out[4]:
(414509, 6)

And drop any rows that are missing height or weight (about 5%).


In [5]:
complete = brfss.dropna(subset=['htm3', 'wtkg2'])
complete.shape


Out[5]:
(395832, 6)

Here's what the first few rows look like.


In [6]:
complete.head()


Out[6]:
age sex wtyrago finalwt wtkg2 htm3
0 82 2 76.363636 185.870345 70.91 157
1 65 2 72.727273 126.603027 72.73 163
3 61 1 73.636364 517.926275 73.64 170
4 26 1 88.636364 1252.624630 88.64 185
5 42 1 118.181818 415.161314 109.09 183

And we can summarize each of the columns.


In [7]:
complete.describe()


Out[7]:
age sex wtyrago finalwt wtkg2 htm3
count 393518.000000 395832.000000 388137.000000 395832.000000 395832.000000 395832.000000
mean 54.891207 1.612730 79.766921 562.527274 79.044413 168.956188
std 16.742237 0.487127 20.574880 1076.206594 19.547890 10.390752
min 18.000000 1.000000 22.727273 1.695143 20.000000 61.000000
25% 43.000000 1.000000 64.545455 97.241254 64.550000 163.000000
50% 55.000000 2.000000 77.272727 234.914579 77.270000 168.000000
75% 67.000000 2.000000 90.909091 593.585278 90.910000 178.000000
max 99.000000 2.000000 342.272727 60995.111700 300.000000 236.000000

Since the data set is large, I'll start with a small random subset and we'll work our way up.


In [8]:
sample = thinkstats2.SampleRows(complete, 1000)

For convenience, I'll extract the columns we want as Pandas Series.


In [9]:
heights = sample.htm3
weights = sample.wtkg2

And then we can look at a scatterplot. By default, Scatter uses alpha=0.2, so where multiple data points overlap, their intensities add up (at least approximately).


In [10]:
thinkplot.Scatter(heights, weights)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 legend=False)


The outliers stretch the bounds of the figure, making it harder to see the shape of the core. We can adjust the limits by hand.


In [11]:
thinkplot.Scatter(heights, weights)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)


The data points fall in columns because the heights were collected in inches and converted to cm. We can smooth this out by jittering the data.


In [12]:
heights = thinkstats2.Jitter(heights, 2.0)
weights = thinkstats2.Jitter(weights, 0.5)
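`thinkstats2.Jitter` presumably adds uniform random noise to each value; here's a minimal stand-in, under that assumption:

```python
import numpy as np

def jitter(values, amount=1.0):
    """Assumed behavior: add uniform noise in [-amount, +amount) to each value."""
    values = np.asarray(values, dtype=float)
    return values + np.random.uniform(-amount, amount, size=len(values))

np.random.seed(17)
heights_cm = np.array([163.0, 163.0, 170.0, 170.0, 178.0])
jittered = jitter(heights_cm, 2.0)
print(jittered)
```

The noise spreads each column of identical values over a small interval, which is enough to break up the vertical stripes in the scatterplot without visibly changing the distribution.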

The following figure shows the effect of jittering.


In [13]:
thinkplot.Scatter(heights, weights)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)


With only 1000 samples, this works fine, but if we scale up to 10,000, we have a problem.


In [14]:
sample = thinkstats2.SampleRows(complete, 10000)
heights = sample.htm3
weights = sample.wtkg2
heights = thinkstats2.Jitter(heights, 2.0)
weights = thinkstats2.Jitter(weights, 0.5)

In the highest-density parts of the figure, the ink is saturated, so those regions are not as dark as they should be, and by comparison the outliers are darker than they should be.


In [15]:
thinkplot.Scatter(heights, weights)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)


This problem -- saturated scatter plots -- is amazingly common. I see it all the time in published papers, even in good journals.

With moderate data sizes, you can avoid saturation by decreasing the marker size and alpha.


In [16]:
thinkplot.Scatter(heights, weights, alpha=0.1, s=10)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)


That's better, although now the horizontal lines are more apparent, probably because people report their weights rounded to round numbers (in pounds). We could address that with more jittering, but I'll leave it alone for now.

If we increase the sample size again, to 100,000, we have to decrease the marker size and alpha level even more.


In [17]:
sample = thinkstats2.SampleRows(complete, 100000)
heights = sample.htm3
weights = sample.wtkg2
heights = thinkstats2.Jitter(heights, 3.5)
weights = thinkstats2.Jitter(weights, 1.5)

In [18]:
thinkplot.Scatter(heights, weights, alpha=0.1, s=1)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)


Finally, we can generate a plot with the entire sample, about 395,000 respondents.


In [19]:
sample = complete
heights = sample.htm3
weights = sample.wtkg2
heights = thinkstats2.Jitter(heights, 3.5)
weights = thinkstats2.Jitter(weights, 1.5)

And I decreased the marker size one more time.


In [20]:
thinkplot.Scatter(heights, weights, alpha=0.07, s=0.5)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)


This is about the best we can do with a scatterplot, but it still has problems: it takes a long time to generate, and the resulting figure is big.

An alternative to a scatterplot is a hexbin plot, which divides the plane into hexagonal bins, counts the number of entries in each bin, and colors the hexagons in proportion to the number of entries.


In [21]:
thinkplot.HexBin(heights, weights)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)


The resulting figure is smaller and faster to generate, but it doesn't show all features of the scatterplot clearly.
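The counting step behind HexBin can be sketched with `np.histogram2d`, which does the same thing with square bins instead of hexagons (the data and bin edges below are made up):

```python
import numpy as np

# Each (height, weight) pair is assigned to a 2-D bin, and the bin counts
# are what a hexbin plot maps to color intensity.
heights = np.array([150, 152, 171, 172, 173, 190])
weights = np.array([ 55,  60,  70,  72,  74,  90])
counts, xedges, yedges = np.histogram2d(heights, weights,
                                        bins=[[140, 160, 180, 200],
                                              [40, 70, 100]])
print(counts)
```

Because color maps to a count rather than to overlapping ink, a binned plot can't saturate the way a scatterplot does.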

There are a few other options for visualizing relationships between variables. One is to group respondents by height and compute the CDF of weight for each group.

I use np.digitize and DataFrame.groupby to group respondents by height:


In [22]:
bins = np.arange(135, 210, 10)
indices = np.digitize(complete.htm3, bins)
groups = complete.groupby(indices)
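To see what `np.digitize` returns, here's a tiny example with the same bin edges (the heights are made up):

```python
import numpy as np

bins = np.arange(135, 210, 10)          # edges 135, 145, ..., 205
heights = np.array([120, 140, 168, 185, 250])

# For each value, digitize returns the index of the bin it falls in:
# 0 means below the first edge, len(bins) means at or above the last edge.
print(np.digitize(heights, bins))       # [0 1 4 6 8]
```

Grouping by these indices puts everyone shorter than 135 cm in group 0 and everyone 205 cm or taller in the last group, which is why those two open-ended groups get dropped below.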

Then I compute a CDF for each group (except the first and last).


In [23]:
mean_heights = [group.htm3.mean() for i, group in groups][1:-1]
cdfs = [thinkstats2.Cdf(group.wtkg2) for i, group in groups][1:-1]

The following plot shows the distributions of weight.


In [24]:
thinkplot.PrePlot(7)
for mean, cdf in zip(mean_heights, cdfs):
    thinkplot.Cdf(cdf, label='%.0f cm' % mean)
    
thinkplot.Config(xlabel='weight (kg)',
                 ylabel='CDF',
                 axis=[20, 200, 0, 1],
                 legend=True)


Using the CDFs, we can read off the percentiles of weight for each height group, and plot these weights against the mean height in each group.


In [25]:
thinkplot.PrePlot(5)
for percent in [90, 75, 50, 25, 10]:
    weight_percentiles = [cdf.Percentile(percent) for cdf in cdfs]
    label = '%dth' % percent
    thinkplot.Plot(mean_heights, weight_percentiles, label=label)
    
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[135, 220, 35, 145],
                 legend=True)


This figure shows more clearly that the relationship between these variables is non-linear. Based on background information, I expect the distribution of weight to be lognormal, so I would try plotting weight on a log scale.


In [26]:
thinkplot.PrePlot(5)
for percent in [90, 75, 50, 25, 10]:
    weight_percentiles = [cdf.Percentile(percent) for cdf in cdfs]
    label = '%dth' % percent
    thinkplot.Plot(mean_heights, weight_percentiles, label=label)
    
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 yscale='log',
                 axis=[135, 220, 35, 145],
                 legend=True)


That relationship looks more linear, although still not perfectly so.

Correlation

After looking at a scatterplot, if you conclude that the relationship is at least approximately linear, you could compute a coefficient of correlation to quantify the strength of the relationship.


In [27]:
heights.corr(weights)


Out[27]:
0.4800073405376597

A correlation of $\rho = 0.48$ is moderately strong -- I'll say more about what that means in a minute.

Since the relationship is more linear under a log transform, we might transform weight first, before computing the correlation.


In [28]:
heights.corr(np.log(weights))


Out[28]:
0.50125544452164073

As expected, the correlation is a little higher with the transform.
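For reference, Pearson's correlation is just the covariance of the standardized variables. A minimal sketch on made-up data (not the BRFSS sample):

```python
import numpy as np

def pearson(xs, ys):
    """Pearson correlation from the definition: the mean product of
    the standardized (z-scored) variables."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    x_std = (xs - xs.mean()) / xs.std()
    y_std = (ys - ys.mean()) / ys.std()
    return np.mean(x_std * y_std)

xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([2.0, 4.0, 6.0, 8.0])
print(pearson(xs, ys))   # perfectly linear, so 1.0
```

Because both variables are standardized first, the result is unitless and always falls between -1 and 1.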

Spearman's rank correlation can measure the strength of a non-linear relationship, provided it is monotonic.


In [29]:
heights.corr(weights, method='spearman')


Out[29]:
0.50941335170837654

And Spearman's correlation is a little stronger still.
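Spearman's correlation is Pearson's correlation applied to the ranks of the values, which is why it tolerates non-linearity as long as the relationship is monotonic. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    xranks = pd.Series(xs).rank()
    yranks = pd.Series(ys).rank()
    return xranks.corr(yranks)

xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 125]      # ys = xs**3: monotonic but far from linear
print(spearman(xs, ys))       # the rank correlation is still 1
```

Ranking also makes the result robust to outliers, since an extreme value moves to the end of the rank order but no further.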

Remember that correlation measures the strength of a linear relationship, but says nothing about the slope of the line that relates the variables.

We can use LeastSquares to estimate the slope of the least squares fit.


In [30]:
inter, slope = thinkstats2.LeastSquares(heights, weights)
inter, slope


Out[30]:
(-66.028661453741748, 0.85866796179740601)

So each additional cm of height adds almost a kilo of weight!
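The least-squares fit has a closed form: the slope is Cov(x, y) / Var(x), and the intercept makes the line pass through the point of means. A sketch on made-up data, assuming that's what thinkstats2 computes:

```python
import numpy as np

def least_squares(xs, ys):
    """Closed-form least-squares fit:
    slope = Cov(x, y) / Var(x), inter = mean(y) - slope * mean(x)."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    slope = np.cov(xs, ys, bias=True)[0, 1] / np.var(xs)
    inter = ys.mean() - slope * xs.mean()
    return inter, slope

xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([3.0, 5.0, 7.0, 9.0])   # exactly ys = 2*xs + 1
print(least_squares(xs, ys))
```

Note the units: the slope carries units of y per x (here, kg per cm), which is exactly the information the correlation coefficient throws away.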

Here's what that line looks like, superimposed on the scatterplot:


In [31]:
fit_xs, fit_ys = thinkstats2.FitLine(heights, inter, slope)
thinkplot.Scatter(heights, weights, alpha=0.07, s=0.5)
thinkplot.Plot(fit_xs, fit_ys, color='gray')
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)


The fit line is a little higher than the visual center of mass because it is being pulled up by the outliers.

Here's the same thing using the log transform:


In [32]:
log_weights = np.log(weights)
inter, slope = thinkstats2.LeastSquares(heights, log_weights)
fit_xs, fit_ys = thinkstats2.FitLine(heights, inter, slope)
thinkplot.Scatter(heights, log_weights, alpha=0.07, s=0.5)
thinkplot.Plot(fit_xs, fit_ys, color='gray')
thinkplot.Config(xlabel='height (cm)',
                 ylabel='log weight (kg)',
                 axis=[140, 210, 3.5, 5.5],
                 legend=False)


That looks better, although maybe still not the line a person would have drawn.

The residuals are the distances between each point and the fitted line.


In [33]:
inter, slope = thinkstats2.LeastSquares(heights, weights)
res = thinkstats2.Residuals(heights, weights, inter, slope)
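`Residuals` presumably computes the vertical distance from each observed point to the fitted line; a minimal stand-in under that assumption:

```python
import numpy as np

def residuals(xs, ys, inter, slope):
    """Vertical distance from each observed y to the line's prediction."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    return ys - (inter + slope * xs)

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([3.2, 4.8, 7.1])
print(residuals(xs, ys, inter=1.0, slope=2.0))
```

A point above the line gets a positive residual, a point below gets a negative one, and for a least-squares fit the residuals average to zero.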

The coefficient of determination $R^2$ is the fraction of the variance in weight we can eliminate by taking height into account.


In [34]:
var_y = weights.var()
var_res = res.var()
R2 = 1 - var_res / var_y
R2


Out[34]:
0.23040899121143543

The value $R^2 = 0.23$ indicates a moderately strong relationship.

Note that the coefficient of determination is related to the coefficient of correlation: $R^2 = \rho^2$. So if we take the square root of $R^2$, we should get $\rho$.


In [35]:
math.sqrt(R2)


Out[35]:
0.480009365753873

And here's the correlation again:


In [36]:
thinkstats2.Corr(heights, weights)


Out[36]:
0.48000734053765987

If you see a high value of $\rho$, you should not be too impressed. If you square it, you get $R^2$, which you can interpret as the decrease in variance if you use the predictor (height) to guess the weight.

But even the decrease in variance overstates the practical effect of the predictor. A better measure is the decrease in root mean squared error (RMSE).


In [37]:
RMSE_without = weights.std()
RMSE_without


Out[37]:
19.6035063029834

If you guess weight without knowing height, you expect to be off by 19.6 kg on average.


In [38]:
RMSE_with = res.std()
RMSE_with


Out[38]:
17.197437874064118

If height is known, you can decrease the error to 17.2 kg on average.


In [39]:
(1 - RMSE_with / RMSE_without) * 100


Out[39]:
12.273663658593115

So knowing height improves your guesses/predictions by about 12%.
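That 12% figure follows directly from the correlation: since $R^2 = 1 - \mathrm{Var}(res)/\mathrm{Var}(y)$, the ratio of the two RMSEs is $\sqrt{1 - \rho^2}$, so the fractional improvement is $1 - \sqrt{1 - \rho^2}$:

```python
import math

rho = 0.48                                 # correlation of height and weight from above
improvement = 1 - math.sqrt(1 - rho**2)    # = 1 - RMSE_with / RMSE_without
print(improvement * 100)                   # about 12.3, matching the direct computation
```

This is why a "moderately strong" correlation of 0.5 buys only a modest improvement in prediction error: the square root flattens the benefit until $\rho$ gets close to 1.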

