```
In [ ]:
```%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='white')
from utils import decorate
from thinkstats2 import Pmf, Cdf
import thinkstats2
import thinkplot

```
In [ ]:
```%time brfss = pd.read_hdf('brfss.hdf5', 'brfss')

```
In [ ]:
```brfss.shape

```
In [ ]:
```brfss.head()

```
In [ ]:
```brfss.describe()

```
In [ ]:
```height = brfss['HTM4']
weight = brfss['WTKG3']

```
In [ ]:
```plt.plot(height, weight, 'o')
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg');

The center of this plot is saturated, so it is not as dark as it should be, which means the rest of the plot is relatively darker than it should be. It gives too much visual weight to the outliers and obscures the shape of the relationship.

**Exercise:** Use keywords `alpha`

and `markersize`

to avoid saturation.

```
In [ ]:
``````
# Solution goes here
```

With transparency and smaller markers, you will be able to see that height and weight are discretized.

**Exercise:** Use `np.random.normal`

to add enough noise to height and weight so the vertical lines in the scatter plot are blurred out. Create variables named `height_jitter`

and `weight_jitter`

.

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
```from scipy.stats import linregress
subset = brfss.dropna(subset=['WTKG3', 'HTM4'])
xs = subset['HTM4']
ys = subset['WTKG3']
res = linregress(xs, ys)
res

The `LinregressResult`

object contains the estimated parameters and a few other statistics.

We can use the estimated `slope`

and `intercept`

to plot the line of best fit.

```
In [ ]:
```# jitter the data
height_jitter = height + np.random.normal(0, 2, size=len(height))
weight_jitter = weight + np.random.normal(0, 2, size=len(weight))
# make the scatter plot
plt.plot(height_jitter, weight_jitter, 'o', markersize=1, alpha=0.02)
plt.axis([140, 200, 0, 160])
# plot the line of best fit
fx = np.array([xs.min(), xs.max()])
fy = res.intercept + res.slope * fx
plt.plot(fx, fy, '-', alpha=0.5)
# label the axes
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')
plt.axis([140, 200, 0, 160]);

```
In [ ]:
``````
# Solution goes here
```

**Exercise:** Use `linregress`

to estimate the slope and intercept of the line of best fit for this data.

Note: as in the previous example, use `dropna`

to drop rows that contain NaN for either variable, and use the resulting subset to compute the arguments for `linregress`

.

```
In [ ]:
``````
# Solution goes here
```

**Exercise:** Generate a plot that shows the estimated line and a scatter plot of the data.

```
In [ ]:
``````
# Solution goes here
```

The Seaborn package, which is usually imported as `sns`

, provides two functions used to show the distribution of one variable as a function of another variable.

The following box plot shows the distribution of weight in each age category. Read the documentation so you know what it means.

```
In [ ]:
```data = brfss.dropna(subset=['AGE', 'WTKG3'])
sns.boxplot(x='AGE', y='WTKG3', data=data, whis=10)
sns.despine(left=True, bottom=True)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg');

This figure makes the shape of the relationship clearer; average weight increases between ages 20 and 50, and then decreases.

A violin plot is another way to show the same thing. Again, read the documentation so you know what it means.

```
In [ ]:
```sns.violinplot(x='AGE', y='WTKG3', data=data, inner=None)
sns.despine(left=True, bottom=True)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg');

**Exercise:** Make a box plot that shows the distribution of weight as a function of income. The variable `INCOME2`

contains income codes with 8 levels.

Use `dropna`

to select the rows with valid income and weight information.

```
In [ ]:
``````
# Solution goes here
```

**Exercise:** Make a violin plot with the same variables.

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
```grouped = brfss.groupby('AGE')
for name, group in grouped['WTKG3']:
print(name, group.median())

To get the other percentiles, we can use a `Cdf`

.

```
In [ ]:
```ps = [95, 75, 50, 25, 5]
for name, group in grouped['WTKG3']:
percentiles = Cdf(group).Percentiles(ps)
print(name, percentiles)

Now I'll collect those results in a list of arrays:

```
In [ ]:
```res = []
for name, group in grouped['WTKG3']:
percentiles = Cdf(group).Percentiles(ps)
res.append(percentiles)
res

To get the age groups, we can extract the "keys" from the groupby object.

```
In [ ]:
```xs = grouped.groups.keys()
xs

Now, we want to loop through the columns of the list of arrays; to do that, we want to transpose it.

```
In [ ]:
```rows = np.transpose(res)
rows

Now we can plot the percentiles across the groups.

```
In [ ]:
```width = [1,2,5,2,1]
for i, qs in enumerate(rows):
plt.plot(xs, qs, label=ps[i], linewidth=width[i], color='C4')
decorate(xlabel='Age (years)',
ylabel='Weight (kg)')

In my opinion, this plot shows the shape of the relationship most clearly.

```
In [ ]:
```sns.boxplot(x='HTM4', y='WTKG3', data=data, whis=10)
sns.despine(left=True, bottom=True)
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg');

`pd.cut`

to put people into height groups where each group spans 10 cm.

```
In [ ]:
```bins = np.arange(0, height.max(), 10)
brfss['_HTMG10'] = pd.cut(brfss['HTM4'], bins=bins, labels=bins[:-1]).astype(float)

Now here's what the plot looks like.

```
In [ ]:
```sns.boxplot(x='_HTMG10', y='WTKG3', data=brfss, whis=10)
plt.xticks(rotation=30)
sns.despine(left=True, bottom=True)
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg');

**Exercise:** Plot percentiles of weight versus these height groups.

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
``````
# Solution goes here
```

**Exercise:** The variable `_VEGESU1`

contains the self-reported number of serving of vegetables each respondent eats per day. Explore relationships between this variable and the others variables in the dataset, and design visualizations that show any relationship you find as clearly as possible.

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
```subset = brfss[['HTM4', 'WTKG3', 'AGE']]
subset.corr()

**Exercise:** Compute a correlation matrix for age, income, and vegetable servings.

```
In [ ]:
```subset = brfss[['AGE', 'INCOME2', '_VEGESU1']]
subset.corr()

To calibrate your sense of correlation, let's look at scatter plots for fake data with different values of `rho`

.

The following function generates random normally-distributed data with approximately the given coefficient of correlation.

```
In [ ]:
```def gen_corr(rho):
means = [0, 0]
covs = [[1, rho], [rho, 1]]
m = np.random.multivariate_normal(means, covs, 100)
return np.transpose(m)

This function makes a scatter plot and shows the actual value of `rho`

.

```
In [ ]:
```def plot_scatter(rho, seed=1):
np.random.seed(seed)
xs, ys = gen_corr(rho)
rho = np.corrcoef(xs, ys)[0][1]
plt.plot(xs, ys, 'o', alpha=0.5)
plt.xlabel('x')
plt.ylabel('y')
ax = plt.gca()
label_rho(ax, rho)
return xs, ys

```
In [ ]:
```def label_rho(ax, rho):
label = 'ρ = %0.2f' % rho
plt.text(0.05, 0.95, label,
horizontalalignment='left',
verticalalignment='top',
transform=ax.transAxes,
fontsize=12)

The following plots show what scatter plots look like with different values of `rho`

.

```
In [ ]:
```res = []
xs, ys = plot_scatter(0, seed=18)
res.append((xs, ys))

```
In [ ]:
```xs, ys = plot_scatter(0.25, seed=18)
res.append((xs, ys))

```
In [ ]:
```xs, ys = plot_scatter(0.5, seed=18)
res.append((xs, ys))

```
In [ ]:
```xs, ys = plot_scatter(0.75, seed=18)
res.append((xs, ys))

```
In [ ]:
```xs, ys = plot_scatter(0.95, seed=18)
res.append((xs, ys))

Here are all the plots side-by-side for comparison.

```
In [ ]:
```fig, axes = plt.subplots(ncols=5, sharey=True, figsize=(15,3))
for ax, (xs, ys) in zip(axes, res):
ax.plot(xs, ys, 'o', alpha=0.5)
rho = np.corrcoef(xs, ys)[0][1]
label_rho(ax, rho)

```
In [ ]:
```np.random.seed(18)
xs = np.linspace(-1, 1)
ys = xs**2 + np.random.normal(0, 0.05, len(xs))
plt.plot(xs, ys, 'o', alpha=0.5)
plt.xlabel('x')
plt.ylabel('y');

This relationship is quite strong, in the sense that we can make a much better guess about `y`

if we know `x`

than if we don't.

But if we compute correlations, they don't show the relationship.

```
In [ ]:
```df = pd.DataFrame(dict(xs=xs, ys=ys))
df.corr(method='pearson')

```
In [ ]:
```df.corr(method='spearman')

```
In [ ]:
```df.corr(method='kendall')

```
In [ ]:
```np.random.seed(18)
xs = np.linspace(20, 50)
ys1 = 75 + 0.02 * xs + np.random.normal(0, 0.15, len(xs))
plt.plot(xs, ys1, 'o', alpha=0.5)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
rho = np.corrcoef(xs, ys1)[0][1]
label_rho(plt.gca(), rho)

```
In [ ]:
```np.random.seed(18)
xs = np.linspace(20, 50)
ys2 = 65 + 0.2 * xs + np.random.normal(0, 3, len(xs))
plt.plot(xs, ys2, 'o', alpha=0.5)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
rho = np.corrcoef(xs, ys2)[0][1]
label_rho(plt.gca(), rho)

Which relationship is stronger?

It depends on what we mean. Clearly, the first one has a higher coefficient of correlation. In that world, knowing someone's age would allow you to make a better guess about their weight.

But look more closely at the y-axis in the two plots. How much weight do people gain per year in each of these hypothetical worlds?

```
In [ ]:
```from scipy.stats import linregress
res = linregress(xs, ys1)
res

```
In [ ]:
```res = linregress(xs, ys2)
res

In fact, the slope for the second data set is almost 10 times higher.

```
In [ ]:
```def label_slope(ax, slope):
label = 'm = %0.3f' % slope
plt.text(0.05, 0.95, label,
horizontalalignment='left',
verticalalignment='top',
transform=ax.transAxes,
fontsize=12)

```
In [ ]:
```res = linregress(xs, ys1)
fx = np.array([xs.min(), xs.max()])
fy = res.intercept + res.slope * fx
plt.plot(xs, ys1, 'o', alpha=0.5)
plt.plot(fx, fy, '-', alpha=0.5)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
label_slope(plt.gca(), res.slope)
plt.gca().get_ylim()

```
In [ ]:
```res = linregress(xs, ys2)
fx = np.array([xs.min(), xs.max()])
fy = res.intercept + res.slope * fx
plt.plot(xs, ys2, 'o', alpha=0.5)
plt.plot(fx, fy, '-', alpha=0.5)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
label_slope(plt.gca(), res.slope)
plt.gca().get_ylim()

The difference is not obvious from looking at the figure; you have to look carefully at the y-axis labels and the estimated slope.

And you have to interpret the slope in context. In the first case, people gain about 0.019 kg per year, which works out to less than half a pound per decade. In the second case, they gain almost 4 pounds per decade.

But remember that in the first case, the coefficient of correlation is substantially higher.

**Exercise:** So, in which case is the relationship "stronger"? Write a sentence or two below to summarize your thoughts.