Relationships between variables

Copyright 2015 Allen Downey

License: Creative Commons Attribution 4.0 International


In [1]:
# this future import makes this code mostly compatible with Python 2 and 3
from __future__ import print_function, division

import numpy as np
import pandas as pd
import math

import thinkplot
import thinkstats2

np.random.seed(17)

%matplotlib inline

To explore the relationship between height and weight, I'll load data from the Behavioral Risk Factor Surveillance System (BRFSS).


In [2]:
def ReadBrfss(filename='CDBRFS08.ASC.gz', compression='gzip', nrows=None):
    """Reads the BRFSS data.

    filename: string
    compression: string
    nrows: int number of rows to read, or None for all

    returns: DataFrame
    """
    var_info = [
        ('age', 101, 102, int),
        ('sex', 143, 143, int),
        ('wtyrago', 127, 130, int),
        ('finalwt', 799, 808, int),
        ('wtkg2', 1254, 1258, int),
        ('htm3', 1251, 1253, int),
        ]
    columns = ['name', 'start', 'end', 'type']
    variables = pd.DataFrame(var_info, columns=columns)
    variables.end += 1
    dct = thinkstats2.FixedWidthVariables(variables, index_base=1)

    df = dct.ReadFixedWidth(filename, compression=compression, nrows=nrows)
    CleanBrfssFrame(df)
    return df
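Under the hood, `FixedWidthVariables` slices each record by column position. The same kind of parsing can be sketched with `pandas.read_fwf`; the two-record snippet and column positions below are made up for illustration, not the real BRFSS layout:

```python
import io
import pandas as pd

# Hypothetical fixed-width snippet: age in columns 1-3, height (cm) in
# columns 4-6 (1-based, inclusive, the way a codebook describes them).
raw = io.StringIO(" 42170\n 65163\n")

# read_fwf uses 0-based, half-open column spans, so a codebook entry
# (start=1, end=3) becomes the colspec (0, 3).
df = pd.read_fwf(raw, colspecs=[(0, 3), (3, 6)], names=['age', 'htm3'])
print(df)
```

The `variables.end += 1` line in `ReadBrfss` does the same off-by-one adjustment, converting inclusive codebook positions to half-open slices.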

The following function cleans some of the variables we'll need.


In [3]:
def CleanBrfssFrame(df):
    """Recodes BRFSS variables.

    df: DataFrame
    """
    # clean age
    df.age.replace([7, 9], float('NaN'), inplace=True)

    # clean height
    df.htm3.replace([999], float('NaN'), inplace=True)

    # clean weight
    df.wtkg2.replace([99999], float('NaN'), inplace=True)
    df.wtkg2 /= 100.0

    # clean weight a year ago: values below 9000 are in pounds,
    # values of 9000+ encode kilograms as 9000 + kg
    df.wtyrago.replace([7777, 9999], float('NaN'), inplace=True)
    df['wtyrago'] = df.wtyrago.apply(lambda x: x/2.2 if x < 9000 else x-9000)

Now we'll read the data into a Pandas DataFrame.


In [4]:
brfss = ReadBrfss(nrows=None)
brfss.shape


Out[4]:
(414509, 6)

And drop any rows that are missing height or weight (about 5%).


In [5]:
complete = brfss.dropna(subset=['htm3', 'wtkg2'])
complete.shape


Out[5]:
(395832, 6)

Here's what the first few rows look like.


In [6]:
complete.head()


Out[6]:
age sex wtyrago finalwt wtkg2 htm3
0 82 2 76.363636 185.870345 70.91 157
1 65 2 72.727273 126.603027 72.73 163
3 61 1 73.636364 517.926275 73.64 170
4 26 1 88.636364 1252.624630 88.64 185
5 42 1 118.181818 415.161314 109.09 183

And we can summarize each of the columns.


In [7]:
complete.describe()


Out[7]:
age sex wtyrago finalwt wtkg2 htm3
count 393518.000000 395832.000000 388137.000000 395832.000000 395832.000000 395832.000000
mean 54.891207 1.612730 79.766921 562.527274 79.044413 168.956188
std 16.742237 0.487127 20.574880 1076.206594 19.547890 10.390752
min 18.000000 1.000000 22.727273 1.695143 20.000000 61.000000
25% 43.000000 1.000000 64.545455 97.241254 64.550000 163.000000
50% 55.000000 2.000000 77.272727 234.914579 77.270000 168.000000
75% 67.000000 2.000000 90.909091 593.585278 90.910000 178.000000
max 99.000000 2.000000 342.272727 60995.111700 300.000000 236.000000

Since the data set is large, I'll start with a small random subset and we'll work our way up.


In [8]:
sample = thinkstats2.SampleRows(complete, 1000)

For convenience, I'll extract the columns we want as Pandas Series.


In [9]:
heights = sample.htm3
weights = sample.wtkg2

And then we can look at a scatterplot. By default, Scatter uses alpha=0.2, so where multiple data points overlap, their intensities add up (at least approximately).


In [10]:
thinkplot.Scatter(heights, weights)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 legend=False)


The outliers stretch the bounds of the figure, making it harder to see the shape of the core. We can adjust the limits by hand.


In [11]:
thinkplot.Scatter(heights, weights)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)


The data points fall in columns because the heights were collected in inches and converted to cm. We can smooth this out by jittering the data.


In [12]:
heights = thinkstats2.Jitter(heights, 2.0)
weights = thinkstats2.Jitter(weights, 0.5)
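`thinkstats2.Jitter` presumably adds uniform random noise to each value; here's a minimal stand-in, under that assumption:

```python
import numpy as np

def jitter(values, amount=1.0):
    """Assumed behavior: add uniform noise in [-amount, +amount) to each value."""
    values = np.asarray(values, dtype=float)
    return values + np.random.uniform(-amount, amount, size=len(values))

np.random.seed(17)
heights_cm = np.array([163.0, 163.0, 170.0, 170.0, 178.0])
jittered = jitter(heights_cm, 2.0)
print(jittered)
```

The noise spreads each column of identical values over a small interval, which is enough to break up the vertical stripes in the scatterplot without visibly changing the distribution.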

The following figure shows the effect of jittering.


In [13]:
thinkplot.Scatter(heights, weights)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)


With only 1000 samples, this works fine, but if we scale up to 10,000, we have a problem.


In [14]:
sample = thinkstats2.SampleRows(complete, 10000)
heights = sample.htm3
weights = sample.wtkg2
heights = thinkstats2.Jitter(heights, 2.0)
weights = thinkstats2.Jitter(weights, 0.5)

In the highest-density parts of the figure, the ink is saturated, so those regions are not as dark as they should be, and by comparison the outliers are darker than they should be.


In [15]:
thinkplot.Scatter(heights, weights)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)


This problem -- saturated scatter plots -- is amazingly common. I see it all the time in published papers, even in good journals.

With moderate data sizes, you can avoid saturation by decreasing the marker size and alpha.


In [16]:
thinkplot.Scatter(heights, weights, alpha=0.1, s=10)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)


That's better, although now the horizontal lines are more apparent, probably because people report their weights rounded to round numbers (in pounds). We could address that with more jittering, but I'll leave it alone for now.

If we increase the sample size again, to 100,000, we have to decrease the marker size and alpha level even more.


In [17]:
sample = thinkstats2.SampleRows(complete, 100000)
heights = sample.htm3
weights = sample.wtkg2
heights = thinkstats2.Jitter(heights, 3.5)
weights = thinkstats2.Jitter(weights, 1.5)

In [18]:
thinkplot.Scatter(heights, weights, alpha=0.1, s=1)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)


Finally, we can generate a plot with the entire sample, about 395,000 respondents.


In [19]:
sample = complete
heights = sample.htm3
weights = sample.wtkg2
heights = thinkstats2.Jitter(heights, 3.5)
weights = thinkstats2.Jitter(weights, 1.5)

And I decreased the marker size one more time.


In [20]:
thinkplot.Scatter(heights, weights, alpha=0.07, s=0.5)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)


This is about the best we can do with a scatterplot, but it still has problems: it takes a long time to generate, and the resulting figure is big.

An alternative to a scatterplot is a hexbin plot, which divides the plane into hexagonal bins, counts the number of entries in each bin, and colors the hexagons in proportion to the number of entries.


In [21]:
thinkplot.HexBin(heights, weights)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)


The resulting figure is smaller and faster to generate, but it doesn't show all features of the scatterplot clearly.
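The counting step behind HexBin can be sketched with `np.histogram2d`, which does the same thing with square bins instead of hexagons (the data and bin edges below are made up):

```python
import numpy as np

# Each (height, weight) pair is assigned to a 2-D bin, and the bin counts
# are what a hexbin plot maps to color intensity.
heights = np.array([150, 152, 171, 172, 173, 190])
weights = np.array([ 55,  60,  70,  72,  74,  90])
counts, xedges, yedges = np.histogram2d(heights, weights,
                                        bins=[[140, 160, 180, 200],
                                              [40, 70, 100]])
print(counts)
```

Because color maps to a count rather than to overlapping ink, a binned plot can't saturate the way a scatterplot does.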

There are a few other options for visualizing relationships between variables. One is to group respondents by height and compute the CDF of weight for each group.

I use np.digitize and DataFrame.groupby to group respondents by height:


In [22]:
bins = np.arange(135, 210, 10)
indices = np.digitize(complete.htm3, bins)
groups = complete.groupby(indices)
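To see what `np.digitize` returns, here's a tiny example with the same bin edges (the heights are made up):

```python
import numpy as np

bins = np.arange(135, 210, 10)          # edges 135, 145, ..., 205
heights = np.array([120, 140, 168, 185, 250])

# For each value, digitize returns the index of the bin it falls in:
# 0 means below the first edge, len(bins) means at or above the last edge.
print(np.digitize(heights, bins))       # [0 1 4 6 8]
```

Grouping by these indices puts everyone shorter than 135 cm in group 0 and everyone 205 cm or taller in the last group, which is why those two open-ended groups get dropped below.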

Then I compute a CDF for each group (except the first and last).


In [23]:
mean_heights = [group.htm3.mean() for i, group in groups][1:-1]
cdfs = [thinkstats2.Cdf(group.wtkg2) for i, group in groups][1:-1]

The following plot shows the distributions of weight.


In [24]:
thinkplot.PrePlot(7)
for mean, cdf in zip(mean_heights, cdfs):
    thinkplot.Cdf(cdf, label='%.0f cm' % mean)
    
thinkplot.Config(xlabel='weight (kg)',
                 ylabel='CDF',
                 axis=[20, 200, 0, 1],
                 legend=True)


Using the CDFs, we can read off the percentiles of weight for each height group, and plot these weights against the mean height in each group.


In [25]:
thinkplot.PrePlot(5)
for percent in [90, 75, 50, 25, 10]:
    weight_percentiles = [cdf.Percentile(percent) for cdf in cdfs]
    label = '%dth' % percent
    thinkplot.Plot(mean_heights, weight_percentiles, label=label)
    
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[135, 220, 35, 145],
                 legend=True)


This figure shows more clearly that the relationship between these variables is non-linear. Based on background information, I expect the distribution of weight to be lognormal, so I would try plotting weight on a log scale.


In [26]:
thinkplot.PrePlot(5)
for percent in [90, 75, 50, 25, 10]:
    weight_percentiles = [cdf.Percentile(percent) for cdf in cdfs]
    label = '%dth' % percent
    thinkplot.Plot(mean_heights, weight_percentiles, label=label)
    
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 yscale='log',
                 axis=[135, 220, 35, 145],
                 legend=True)


That relationship looks more linear, although still not perfectly so.

Correlation

After looking at a scatterplot, if you conclude that the relationship is at least approximately linear, you could compute a coefficient of correlation to quantify the strength of the relationship.


In [27]:
heights.corr(weights)


Out[27]:
0.4800073405376597

A correlation of $\rho = 0.48$ is moderately strong -- I'll say more about what that means in a minute.

Since the relationship is more linear under a log transform, we might transform weight first, before computing the correlation.


In [28]:
heights.corr(np.log(weights))


Out[28]:
0.50125544452164073

As expected, the correlation is a little higher with the transform.
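For reference, Pearson's correlation is just the covariance of the standardized variables. A minimal sketch on made-up data (not the BRFSS sample):

```python
import numpy as np

def pearson(xs, ys):
    """Pearson correlation from the definition: the mean product of
    the standardized (z-scored) variables."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    x_std = (xs - xs.mean()) / xs.std()
    y_std = (ys - ys.mean()) / ys.std()
    return np.mean(x_std * y_std)

xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([2.0, 4.0, 6.0, 8.0])
print(pearson(xs, ys))   # perfectly linear, so 1.0
```

Because both variables are standardized first, the result is unitless and always falls between -1 and 1.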

Spearman's rank correlation can measure the strength of a non-linear relationship, provided it is monotonic.


In [29]:
heights.corr(weights, method='spearman')


Out[29]:
0.50941335170837654

And Spearman's correlation is a little stronger still.
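Spearman's correlation is Pearson's correlation applied to the ranks of the values, which is why it tolerates non-linearity as long as the relationship is monotonic. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    xranks = pd.Series(xs).rank()
    yranks = pd.Series(ys).rank()
    return xranks.corr(yranks)

xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 125]      # ys = xs**3: monotonic but far from linear
print(spearman(xs, ys))       # the rank correlation is still 1
```

Ranking also makes the result robust to outliers, since an extreme value moves to the end of the rank order but no further.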

Remember that correlation measures the strength of a linear relationship, but says nothing about the slope of the line that relates the variables.

We can use LeastSquares to estimate the slope of the least squares fit.


In [30]:
inter, slope = thinkstats2.LeastSquares(heights, weights)
inter, slope


Out[30]:
(-66.028661453741748, 0.85866796179740601)

So each additional cm of height adds almost a kilo of weight!
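The least-squares fit has a closed form: the slope is Cov(x, y) / Var(x), and the intercept makes the line pass through the point of means. A sketch on made-up data, assuming that's what thinkstats2 computes:

```python
import numpy as np

def least_squares(xs, ys):
    """Closed-form least-squares fit:
    slope = Cov(x, y) / Var(x), inter = mean(y) - slope * mean(x)."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    slope = np.cov(xs, ys, bias=True)[0, 1] / np.var(xs)
    inter = ys.mean() - slope * xs.mean()
    return inter, slope

xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([3.0, 5.0, 7.0, 9.0])   # exactly ys = 2*xs + 1
print(least_squares(xs, ys))
```

Note the units: the slope carries units of y per x (here, kg per cm), which is exactly the information the correlation coefficient throws away.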

Here's what that line looks like, superimposed on the scatterplot:


In [31]:
fit_xs, fit_ys = thinkstats2.FitLine(heights, inter, slope)
thinkplot.Scatter(heights, weights, alpha=0.07, s=0.5)
thinkplot.Plot(fit_xs, fit_ys, color='gray')
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)


The fit line is a little higher than the visual center of mass because it is being pulled up by the outliers.

Here's the same thing using the log transform:


In [32]:
log_weights = np.log(weights)
inter, slope = thinkstats2.LeastSquares(heights, log_weights)
fit_xs, fit_ys = thinkstats2.FitLine(heights, inter, slope)
thinkplot.Scatter(heights, log_weights, alpha=0.07, s=0.5)
thinkplot.Plot(fit_xs, fit_ys, color='gray')
thinkplot.Config(xlabel='height (cm)',
                 ylabel='log weight (kg)',
                 axis=[140, 210, 3.5, 5.5],
                 legend=False)


That looks better, although maybe still not the line a person would have drawn.

The residuals are the distances between each point and the fitted line.


In [33]:
inter, slope = thinkstats2.LeastSquares(heights, weights)
res = thinkstats2.Residuals(heights, weights, inter, slope)
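`Residuals` presumably computes the vertical distance from each observed point to the fitted line; a minimal stand-in under that assumption:

```python
import numpy as np

def residuals(xs, ys, inter, slope):
    """Vertical distance from each observed y to the line's prediction."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    return ys - (inter + slope * xs)

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([3.2, 4.8, 7.1])
print(residuals(xs, ys, inter=1.0, slope=2.0))
```

A point above the line gets a positive residual, a point below gets a negative one, and for a least-squares fit the residuals average to zero.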

The coefficient of determination $R^2$ is the fraction of the variance in weight we can eliminate by taking height into account.


In [34]:
var_y = weights.var()
var_res = res.var()
R2 = 1 - var_res / var_y
R2


Out[34]:
0.23040899121143543

The value $R^2 = 0.23$ indicates a moderately strong relationship.

Note that the coefficient of determination is related to the coefficient of correlation: $R^2 = \rho^2$. So if we take the square root of $R^2$, we should get $\rho$.


In [35]:
math.sqrt(R2)


Out[35]:
0.480009365753873

And here's the correlation again:


In [36]:
thinkstats2.Corr(heights, weights)


Out[36]:
0.48000734053765987

If you see a high value of $\rho$, you should not be too impressed. If you square it, you get $R^2$, which you can interpret as the decrease in variance if you use the predictor (height) to guess the weight.

But even the decrease in variance overstates the practical effect of the predictor. A better measure is the decrease in root mean squared error (RMSE).


In [37]:
RMSE_without = weights.std()
RMSE_without


Out[37]:
19.6035063029834

If you guess weight without knowing height, you expect to be off by 19.6 kg on average.


In [38]:
RMSE_with = res.std()
RMSE_with


Out[38]:
17.197437874064118

If height is known, you can decrease the error to 17.2 kg on average.


In [39]:
(1 - RMSE_with / RMSE_without) * 100


Out[39]:
12.273663658593115

So knowing height improves your guesses/predictions by about 12%.
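That 12% figure follows directly from the correlation: since $R^2 = 1 - \mathrm{Var}(res)/\mathrm{Var}(y)$, the ratio of the two RMSEs is $\sqrt{1 - \rho^2}$, so the fractional improvement is $1 - \sqrt{1 - \rho^2}$:

```python
import math

rho = 0.48                                 # correlation of height and weight from above
improvement = 1 - math.sqrt(1 - rho**2)    # = 1 - RMSE_with / RMSE_without
print(improvement * 100)                   # about 12.3, matching the direct computation
```

This is why a "moderately strong" correlation of 0.5 buys only a modest improvement in prediction error: the square root flattens the benefit until $\rho$ gets close to 1.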

