Copyright 2016 Allen B. Downey

MIT License: https://opensource.org/licenses/MIT

```
In [1]:
```from __future__ import print_function, division
%matplotlib inline
import numpy as np
import brfss
import thinkstats2
import thinkplot

```
In [2]:
```df = brfss.ReadBrfss(nrows=None)

The following function selects a random subset of a `DataFrame`

.

```
In [3]:
```def SampleRows(df, nrows, replace=False):
indices = np.random.choice(df.index, nrows, replace=replace)
sample = df.loc[indices]
return sample

I'll extract the height in cm and the weight in kg of the respondents in the sample.

```
In [4]:
```sample = SampleRows(df, 5000)
heights, weights = sample.htm3, sample.wtkg2

Here's a simple scatter plot with `alpha=1`

, so each data point is fully saturated.

```
In [5]:
```thinkplot.Scatter(heights, weights, alpha=1)
thinkplot.Config(xlabel='Height (cm)',
ylabel='Weight (kg)',
axis=[140, 210, 20, 200],
legend=False)

```
```

The data fall in obvious columns because they were rounded off. We can reduce this visual artifact by adding some random noice to the data.

NOTE: The version of `Jitter`

in the book uses noise with a uniform distribution. Here I am using a normal distribution. The normal distribution does a better job of blurring artifacts, but the uniform distribution might be more true to the data.

```
In [6]:
```def Jitter(values, jitter=0.5):
n = len(values)
return np.random.normal(0, jitter, n) + values

```
In [7]:
```heights = Jitter(heights, 1.4)
weights = Jitter(weights, 0.5)

And here's what the jittered data look like.

```
In [8]:
```thinkplot.Scatter(heights, weights, alpha=1.0)
thinkplot.Config(xlabel='Height (cm)',
ylabel='Weight (kg)',
axis=[140, 210, 20, 200],
legend=False)

```
```

The columns are gone, but now we have a different problem: saturation. Where there are many overlapping points, the plot is not as dark as it should be, which means that the outliers are darker than they should be, which gives the impression that the data are more scattered than they actually are.

This is a surprisingly common problem, even in papers published in peer-reviewed journals.

We can usually solve the saturation problem by adjusting `alpha`

and the size of the markers, `s`

.

```
In [9]:
```thinkplot.Scatter(heights, weights, alpha=0.1, s=10)
thinkplot.Config(xlabel='Height (cm)',
ylabel='Weight (kg)',
axis=[140, 210, 20, 200],
legend=False)

```
```

`HexBin`

plot, which breaks the plane into bins, counts the number of respondents in each bin, and colors each bin in proportion to its count.

```
In [10]:
```thinkplot.HexBin(heights, weights)
thinkplot.Config(xlabel='Height (cm)',
ylabel='Weight (kg)',
axis=[140, 210, 20, 200],
legend=False)

```
```

**Exercise:** So far we have been working with a subset of only 5000 respondents. When we include the entire dataset, making an effective scatterplot can be tricky. As an exercise, experiment with `Scatter`

and `HexBin`

to make a plot that represents the entire dataset well.

```
In [11]:
```# Solution
# With smaller markers, I needed more aggressive jittering to
# blur the measurement artifacts
# With this dataset, using all of the rows might be more trouble
# than it's worth. Visualizing a subset of the data might be
# more practical and more effective.
heights = Jitter(df.htm3, 2.8)
weights = Jitter(df.wtkg2, 1.0)
thinkplot.Scatter(heights, weights, alpha=0.01, s=2)
thinkplot.Config(xlabel='Height (cm)',
ylabel='Weight (kg)',
axis=[140, 210, 20, 200],
legend=False)

```
```

```
In [12]:
```cleaned = df.dropna(subset=['htm3', 'wtkg2'])

Then I'll divide the dataset into groups by height.

```
In [13]:
```bins = np.arange(135, 210, 5)
indices = np.digitize(cleaned.htm3, bins)
groups = cleaned.groupby(indices)

Here are the number of respondents in each group:

```
In [14]:
```for i, group in groups:
print(i, len(group))

```
```

Now we can compute the CDF of weight within each group.

```
In [15]:
```mean_heights = [group.htm3.mean() for i, group in groups]
cdfs = [thinkstats2.Cdf(group.wtkg2) for i, group in groups]

And then extract the 25th, 50th, and 75th percentile from each group.

```
In [16]:
```for percent in [75, 50, 25]:
weight_percentiles = [cdf.Percentile(percent) for cdf in cdfs]
label = '%dth' % percent
thinkplot.Plot(mean_heights, weight_percentiles, label=label)
thinkplot.Config(xlabel='Height (cm)',
ylabel='Weight (kg)',
axis=[140, 210, 20, 200],
legend=False)

```
```

**Exercise:** Yet another option is to divide the dataset into groups and then plot the CDF for each group. As an exercise, divide the dataset into a smaller number of groups and plot the CDF for each group.

```
In [17]:
```# Solution
bins = np.arange(140, 210, 10)
indices = np.digitize(cleaned.htm3, bins)
groups = cleaned.groupby(indices)
cdfs = [thinkstats2.Cdf(group.wtkg2) for i, group in groups]
thinkplot.PrePlot(len(cdfs))
thinkplot.Cdfs(cdfs)
thinkplot.Config(xlabel='Weight (kg)',
ylabel='CDF',
axis=[20, 200, 0, 1],
legend=False)

```
```

The following function computes the covariance of two variables using NumPy's `dot`

function.

```
In [18]:
```def Cov(xs, ys, meanx=None, meany=None):
xs = np.asarray(xs)
ys = np.asarray(ys)
if meanx is None:
meanx = np.mean(xs)
if meany is None:
meany = np.mean(ys)
cov = np.dot(xs-meanx, ys-meany) / len(xs)
return cov

And here's an example:

```
In [19]:
```heights, weights = cleaned.htm3, cleaned.wtkg2
Cov(heights, weights)

```
Out[19]:
```

```
In [20]:
```def Corr(xs, ys):
xs = np.asarray(xs)
ys = np.asarray(ys)
meanx, varx = thinkstats2.MeanVar(xs)
meany, vary = thinkstats2.MeanVar(ys)
corr = Cov(xs, ys, meanx, meany) / np.sqrt(varx * vary)
return corr

The correlation of height and weight is about 0.51, which is a moderately strong correlation.

```
In [21]:
```Corr(heights, weights)

```
Out[21]:
```

NumPy provides a function that computes correlations, too:

```
In [22]:
```np.corrcoef(heights, weights)

```
Out[22]:
```

Pearson's correlation is not robust in the presence of outliers, and it tends to underestimate the strength of non-linear relationships.

Spearman's correlation is more robust, and it can handle non-linear relationships as long as they are monotonic. Here's a function that computes Spearman's correlation:

```
In [23]:
```import pandas as pd
def SpearmanCorr(xs, ys):
xranks = pd.Series(xs).rank()
yranks = pd.Series(ys).rank()
return Corr(xranks, yranks)

For heights and weights, Spearman's correlation is a little higher:

```
In [24]:
```SpearmanCorr(heights, weights)

```
Out[24]:
```

`Series`

provides a method that computes correlations, and it offers `spearman`

as one of the options.

```
In [25]:
```def SpearmanCorr(xs, ys):
xs = pd.Series(xs)
ys = pd.Series(ys)
return xs.corr(ys, method='spearman')

The result is the same as for the one we wrote.

```
In [26]:
```SpearmanCorr(heights, weights)

```
Out[26]:
```

```
In [27]:
```Corr(cleaned.htm3, np.log(cleaned.wtkg2))

```
Out[27]:
```

```
In [28]:
```import first
live, firsts, others = first.MakeFrames()
live = live.dropna(subset=['agepreg', 'totalwgt_lb'])

```
In [29]:
```# Solution
ages = live.agepreg
weights = live.totalwgt_lb
print('Corr', Corr(ages, weights))
print('SpearmanCorr', SpearmanCorr(ages, weights))

```
```

```
In [30]:
```# Solution
def BinnedPercentiles(df):
"""Bin the data by age and plot percentiles of weight for each bin.
df: DataFrame
"""
bins = np.arange(10, 48, 3)
indices = np.digitize(df.agepreg, bins)
groups = df.groupby(indices)
ages = [group.agepreg.mean() for i, group in groups][1:-1]
cdfs = [thinkstats2.Cdf(group.totalwgt_lb) for i, group in groups][1:-1]
thinkplot.PrePlot(3)
for percent in [75, 50, 25]:
weights = [cdf.Percentile(percent) for cdf in cdfs]
label = '%dth' % percent
thinkplot.Plot(ages, weights, label=label)
thinkplot.Config(xlabel="Mother's age (years)",
ylabel='Birth weight (lbs)',
xlim=[14, 45], legend=True)
BinnedPercentiles(live)

```
```

```
In [31]:
```# Solution
def ScatterPlot(ages, weights, alpha=1.0, s=20):
"""Make a scatter plot and save it.
ages: sequence of float
weights: sequence of float
alpha: float
"""
thinkplot.Scatter(ages, weights, alpha=alpha)
thinkplot.Config(xlabel='Age (years)',
ylabel='Birth weight (lbs)',
xlim=[10, 45],
ylim=[0, 15],
legend=False)
ScatterPlot(ages, weights, alpha=0.05, s=10)

```
```

```
In [32]:
```# Solution
# My conclusions:
# 1) The scatterplot shows a weak relationship between the variables but
# it is hard to see clearly.
# 2) The correlations support this. Pearson's is around 0.07, Spearman's
# is around 0.09. The difference between them suggests some influence
# of outliers or a non-linear relationsip.
# 3) Plotting percentiles of weight versus age suggests that the
# relationship is non-linear. Birth weight increases more quickly
# in the range of mother's age from 15 to 25. After that, the effect
# is weaker.

```
In [ ]:
```