In [1]:
```
# this future import makes this code mostly compatible with Python 2 and 3
from __future__ import print_function, division
import numpy as np
import pandas as pd
import math
import thinkplot
import thinkstats2
np.random.seed(17)
%matplotlib inline

```
In [2]:
```
def ReadBrfss(filename='CDBRFS08.ASC.gz', compression='gzip', nrows=None):
    """Reads the BRFSS data.

    filename: string
    compression: string
    nrows: int number of rows to read, or None for all

    returns: DataFrame
    """
    var_info = [
        ('age', 101, 102, int),
        ('sex', 143, 143, int),
        ('wtyrago', 127, 130, int),
        ('finalwt', 799, 808, int),
        ('wtkg2', 1254, 1258, int),
        ('htm3', 1251, 1253, int),
    ]
    columns = ['name', 'start', 'end', 'type']
    variables = pd.DataFrame(var_info, columns=columns)
    variables.end += 1

    dct = thinkstats2.FixedWidthVariables(variables, index_base=1)
    df = dct.ReadFixedWidth(filename, compression=compression, nrows=nrows)
    CleanBrfssFrame(df)
    return df
```
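`thinkstats2.FixedWidthVariables` handles the fixed-width parsing; the underlying idea is the same as `pandas.read_fwf`, which takes 0-based, half-open column spans (which is why the code above adds 1 to `end` and passes `index_base=1`). A minimal sketch with a made-up two-field record:

```python
import io

import pandas as pd

# hypothetical record: age in columns 1-3, height in columns 4-6 (1-based)
line = "045182\n"
colspecs = [(0, 3), (3, 6)]  # pandas wants 0-based, half-open spans
df = pd.read_fwf(io.StringIO(line), colspecs=colspecs, names=['age', 'htm3'])
```

The field names and column positions here are invented for illustration; the real BRFSS positions are the ones in `var_info` above.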
The following function cleans some of the variables we'll need.
In [3]:
```
def CleanBrfssFrame(df):
    """Recodes BRFSS variables.

    df: DataFrame
    """
    # clean age
    df.age.replace([7, 9], float('NaN'), inplace=True)

    # clean height
    df.htm3.replace([999], float('NaN'), inplace=True)

    # clean weight
    df.wtkg2.replace([99999], float('NaN'), inplace=True)
    df.wtkg2 /= 100.0

    # clean weight a year ago
    df.wtyrago.replace([7777, 9999], float('NaN'), inplace=True)
    df['wtyrago'] = df.wtyrago.apply(lambda x: x/2.2 if x < 9000 else x-9000)
```
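In the BRFSS coding, values like 7777 and 9999 are sentinels for "don't know" or "refused", and `wtyrago` is reported in pounds unless it is offset by 9000, in which case it is in kilograms. Here is the same recode pattern on a few made-up values:

```python
import numpy as np
import pandas as pd

wtyrago = pd.Series([154.0, 7777.0, 9070.0, 9999.0])  # hypothetical raw values
wtyrago = wtyrago.replace([7777, 9999], np.nan)       # sentinel codes -> NaN
# pounds if below 9000, otherwise kilograms offset by 9000
wtyrago = wtyrago.apply(lambda x: x / 2.2 if x < 9000 else x - 9000)
```

After the recode, 154 lb and the offset value 9070 both come out near 70 kg, and the two sentinel values become NaN.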
Now we'll read the data into a Pandas DataFrame.
In [4]:
```
brfss = ReadBrfss(nrows=None)
brfss.shape
```

Out[4]:

And drop any rows that are missing height or weight (about 5%).
In [5]:
```
complete = brfss.dropna(subset=['htm3', 'wtkg2'])
complete.shape
```

Out[5]:

Here's what the first few rows look like.
In [6]:
```
complete.head()
```

Out[6]:

And we can summarize each of the columns.
In [7]:
```
complete.describe()
```

Out[7]:

Since the data set is large, I'll start with a small random subset and we'll work our way up.
In [8]:
```
sample = thinkstats2.SampleRows(complete, 1000)
```

For convenience, I'll extract the columns we want as Pandas Series.
In [9]:
```
heights = sample.htm3
weights = sample.wtkg2
```

`Scatter` uses `alpha=0.2`, so when multiple data points are stacked, the intensity of the plot adds up (at least approximately).
In [10]:
```
thinkplot.Scatter(heights, weights)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 legend=False)
```

Zooming in on the typical range of heights and weights makes the structure easier to see.
In [11]:
```
thinkplot.Scatter(heights, weights)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)
```

The points fall in columns and rows because the measurements are rounded off; jittering adds random noise to compensate for that.
In [12]:
```
heights = thinkstats2.Jitter(heights, 2.0)
weights = thinkstats2.Jitter(weights, 0.5)
```
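`thinkstats2.Jitter` adds random noise to break up the columns caused by rounding. A minimal version (my sketch, assuming uniform noise; the library's actual implementation may differ) looks like this:

```python
import numpy as np

def jitter(values, amount=0.5):
    """Return values plus uniform random noise in [-amount, +amount)."""
    values = np.asarray(values, dtype=float)
    return values + np.random.uniform(-amount, amount, len(values))
```

Jittering like this is only for visualization; any analysis should run on the un-jittered values.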
The following figure shows the effect of jittering.
In [13]:
```
thinkplot.Scatter(heights, weights)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)
```

With only 1000 samples, this works fine, but if we scale up to 10,000, we have a problem.
In [14]:
```
sample = thinkstats2.SampleRows(complete, 10000)
heights = sample.htm3
weights = sample.wtkg2
heights = thinkstats2.Jitter(heights, 2.0)
weights = thinkstats2.Jitter(weights, 0.5)
```
In [15]:
```
thinkplot.Scatter(heights, weights)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)
```

This problem -- saturated scatter plots -- is amazingly common. I see it all the time in published papers, even in good journals.

With moderate data sizes, you can avoid saturation by decreasing the marker size and `alpha`.
In [16]:
```
thinkplot.Scatter(heights, weights, alpha=0.1, s=10)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)
```

That's better, although now the horizontal lines are more apparent, probably because people round their weight off to round numbers (in pounds). We could address that by adding more jittering, but I will leave it alone for now.

If we increase the sample size again, to 100,000, we have to decrease the marker size and alpha level even more.
In [17]:
```
sample = thinkstats2.SampleRows(complete, 100000)
heights = sample.htm3
weights = sample.wtkg2
heights = thinkstats2.Jitter(heights, 3.5)
weights = thinkstats2.Jitter(weights, 1.5)
```
In [18]:
```
thinkplot.Scatter(heights, weights, alpha=0.1, s=1)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)
```

Finally, we can generate a plot with the entire sample, about 395,000 respondents.
In [19]:
```
sample = complete
heights = sample.htm3
weights = sample.wtkg2
heights = thinkstats2.Jitter(heights, 3.5)
weights = thinkstats2.Jitter(weights, 1.5)
```

And I decreased the marker size one more time.
In [20]:
```
thinkplot.Scatter(heights, weights, alpha=0.07, s=0.5)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)
```

This is about the best we can do, but it still has a few problems: it takes a long time to generate, and the resulting figure is big.

An alternative to a scatterplot is a hexbin plot, which divides the plane into hexagonal bins, counts the number of entries in each bin, and colors the hexagons in proportion to the number of entries.
In [21]:
```
thinkplot.HexBin(heights, weights)
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)
```

The resulting figure is smaller and faster to generate, but it doesn't show all features of the scatterplot clearly.

There are a few other options for visualizing relationships between variables. One is to group respondents by height and compute the CDF of weight for each group.

I use `np.digitize` and `DataFrame.groupby` to group respondents by height:
In [22]:
```
bins = np.arange(135, 210, 10)
indices = np.digitize(complete.htm3, bins)
groups = complete.groupby(indices)
```
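`np.digitize` maps each height to the index of the bin it falls in, and `groupby` on those indices collects respondents of similar height. With made-up values:

```python
import numpy as np

bins = np.arange(135, 210, 10)            # 135, 145, ..., 205
heights = np.array([150, 162, 171, 120])  # hypothetical heights in cm
indices = np.digitize(heights, bins)
# 150 falls in [145, 155) -> index 2; 120 is below every bin -> index 0
```

Values below the first bin get index 0 and values above the last get the highest index, which is why the code below discards the first and last groups.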
Then I compute a CDF for each group (except the first and last).
In [23]:
```
mean_heights = [group.htm3.mean() for i, group in groups][1:-1]
cdfs = [thinkstats2.Cdf(group.wtkg2) for i, group in groups][1:-1]
```
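If you only need a few summary statistics per group, rather than whole CDFs, pandas can compute them directly. For example, per-bin median weight on a toy DataFrame (values invented for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical mini data set: heights in cm, weights in kg
df = pd.DataFrame({'htm3': [150, 152, 171, 173],
                   'wtkg2': [50.0, 60.0, 80.0, 90.0]})
bins = np.arange(135, 210, 10)
indices = np.digitize(df.htm3, bins)
medians = df.groupby(indices).wtkg2.median()
```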
The following plot shows the distributions of weight.
In [24]:
```
thinkplot.PrePlot(7)
for mean, cdf in zip(mean_heights, cdfs):
    thinkplot.Cdf(cdf, label='%.0f cm' % mean)

thinkplot.Config(xlabel='weight (kg)',
                 ylabel='CDF',
                 axis=[20, 200, 0, 1],
                 legend=True)
```

Another way to visualize the relationship is to plot percentiles of weight versus height.
In [25]:
```
thinkplot.PrePlot(5)
for percent in [90, 75, 50, 25, 10]:
    weight_percentiles = [cdf.Percentile(percent) for cdf in cdfs]
    label = '%dth' % percent
    thinkplot.Plot(mean_heights, weight_percentiles, label=label)

thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[135, 220, 35, 145],
                 legend=True)
```

Here's the same plot with weight on a log scale.
In [26]:
```
thinkplot.PrePlot(5)
for percent in [90, 75, 50, 25, 10]:
    weight_percentiles = [cdf.Percentile(percent) for cdf in cdfs]
    label = '%dth' % percent
    thinkplot.Plot(mean_heights, weight_percentiles, label=label)

thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 yscale='log',
                 axis=[135, 220, 35, 145],
                 legend=True)
```

To quantify the strength of the relationship, we can compute the coefficient of correlation.
In [27]:
```
heights.corr(weights)
```

Out[27]:

A correlation of $\rho = 0.48$ is moderately strong -- I'll say more about what that means in a minute.

Since the relationship is more linear under a log transform, we might transform weight first, before computing the correlation.
In [28]:
```
heights.corr(np.log(weights))
```

Out[28]:

As expected, the correlation is a little higher with the transform.

Spearman's rank correlation can measure the strength of a non-linear relationship, provided it is monotonic.
In [29]:
```
heights.corr(weights, method='spearman')
```

Out[29]:

And Spearman's correlation is a little stronger still.

Remember that correlation measures the strength of a linear relationship, but says nothing about the slope of the line that relates the variables.
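A quick numerical check of that point: rescaling one variable changes the slope of the fitted line but leaves the correlation unchanged (synthetic data, invented for this check):

```python
import numpy as np

rng = np.random.default_rng(17)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)

r = np.corrcoef(x, y)[0, 1]
r_scaled = np.corrcoef(x, 10 * y)[0, 1]     # identical correlation
slope = np.polyfit(x, y, 1)[0]
slope_scaled = np.polyfit(x, 10 * y, 1)[0]  # ten times the slope
```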

We can use `LeastSquares` to estimate the slope of the least squares fit.
In [30]:
```
inter, slope = thinkstats2.LeastSquares(heights, weights)
inter, slope
```

Out[30]:

So each additional cm of height adds almost a kilo of weight!
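`LeastSquares` presumably implements the standard closed form for simple linear regression, slope $= \mathrm{Cov}(x, y) / \mathrm{Var}(x)$ and inter $= \bar{y} - \mathrm{slope} \cdot \bar{x}$. A self-contained sketch (not the library's actual code):

```python
import numpy as np

def least_squares(xs, ys):
    """Closed-form simple linear regression; returns (inter, slope)."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    slope = np.cov(xs, ys, bias=True)[0, 1] / np.var(xs)
    inter = ys.mean() - slope * xs.mean()
    return inter, slope
```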

Here's what that line looks like, superimposed on the scatterplot:
In [31]:
```
fit_xs, fit_ys = thinkstats2.FitLine(heights, inter, slope)
thinkplot.Scatter(heights, weights, alpha=0.07, s=0.5)
thinkplot.Plot(fit_xs, fit_ys, color='gray')
thinkplot.Config(xlabel='height (cm)',
                 ylabel='weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)
```

The fit line is a little higher than the visual center of mass because it is being pulled up by the outliers.

Here's the same thing using the log transform:
In [32]:
```
log_weights = np.log(weights)
inter, slope = thinkstats2.LeastSquares(heights, log_weights)
fit_xs, fit_ys = thinkstats2.FitLine(heights, inter, slope)
thinkplot.Scatter(heights, log_weights, alpha=0.07, s=0.5)
thinkplot.Plot(fit_xs, fit_ys, color='gray')
thinkplot.Config(xlabel='height (cm)',
                 ylabel='log weight (kg)',
                 axis=[140, 210, 3.5, 5.5],
                 legend=False)
```

That looks better, although maybe still not the line a person would have drawn.

The residuals are the distances between each point and the fitted line.
In [33]:
```
inter, slope = thinkstats2.LeastSquares(heights, weights)
res = thinkstats2.Residuals(heights, weights, inter, slope)
```
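`Residuals` presumably just subtracts the fitted values from the observed values, something like this sketch:

```python
import numpy as np

def residuals(xs, ys, inter, slope):
    """Vertical distances from each point to the line y = inter + slope*x."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    return ys - (inter + slope * xs)
```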
In [34]:
```
var_y = weights.var()
var_res = res.var()
R2 = 1 - var_res / var_y
R2
```

Out[34]:

The value $R^2 = 0.23$ indicates a moderately strong relationship.

Note that the coefficient of determination is related to the coefficient of correlation: $\rho^2 = R^2$. So if we compute the square root of $R^2$, we should get $\rho$.
In [35]:
```
math.sqrt(R2)
```

Out[35]:

And here's the correlation again:
In [36]:
```
thinkstats2.Corr(heights, weights)
```

Out[36]:

If you see a high value of $\rho$, you should not be too impressed. If you square it, you get $R^2$, which you can interpret as the decrease in variance if you use the predictor (height) to guess the weight.

But even the decrease in variance overstates the practical effect of the predictor. A better measure is the decrease in root mean squared error (RMSE).
In [37]:
```
RMSE_without = weights.std()
RMSE_without
```

Out[37]:

If you guess weight without knowing height, you expect to be off by 19.6 kg on average.
In [38]:
```
RMSE_with = res.std()
RMSE_with
```

Out[38]:

If height is known, you can decrease the error to 17.2 kg on average.
In [39]:
```
(1 - RMSE_with / RMSE_without) * 100
```

Out[39]:

So knowing height improves your guesses/predictions by about 12%.
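The whole RMSE comparison is easy to reproduce end to end on synthetic data (the relationship below is invented purely to illustrate the computation):

```python
import numpy as np

rng = np.random.default_rng(17)
x = rng.normal(size=10000)
y = 0.9 * x + rng.normal(size=10000)  # made-up linear relationship

slope, inter = np.polyfit(x, y, 1)
res = y - (inter + slope * x)

rmse_without = y.std()   # typical error guessing the mean every time
rmse_with = res.std()    # typical error using the linear predictor
improvement = (1 - rmse_with / rmse_without) * 100
```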
