In [1]:
from __future__ import print_function, division
%matplotlib inline
import numpy as np
import nsfg
import first
import analytic
import thinkstats2
import seaborn
Let's read data from the BRFSS:
In [2]:
import brfss
df = brfss.ReadBrfss()
In [3]:
df.describe()
Out[3]:
If we group by sex, we get a DataFrameGroupBy object.
In [4]:
groupby = df.groupby('sex')
groupby
Out[4]:
If we select a particular column from the GroupBy, we get a SeriesGroupBy object.
In [5]:
seriesgroupby = groupby.htm3
seriesgroupby
Out[5]:
If you invoke a reduce method on a DataFrameGroupBy, you get a DataFrame:
In [6]:
groupby.mean()
Out[6]:
If you invoke a reduce method on a SeriesGroupBy, you get a Series:
In [7]:
seriesgroupby.mean()
Out[7]:
You can use aggregate to apply a collection of reduce methods:
In [8]:
groupby.aggregate(['mean', 'std'])
Out[8]:
In [9]:
seriesgroupby.aggregate(['mean', 'std'])
Out[9]:
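Since the cell outputs are not shown here, a small made-up DataFrame may help make the return types concrete. This is only a sketch: the `toy` data is invented for illustration and reuses the BRFSS column names `sex`, `htm3`, and `wtkg2`.

```python
import pandas as pd

# Synthetic stand-in for the BRFSS data; these values are made up.
toy = pd.DataFrame({
    'sex': [1, 1, 2, 2],
    'htm3': [180.0, 170.0, 160.0, 164.0],
    'wtkg2': [80.0, 70.0, 60.0, 64.0],
})

grouped = toy.groupby('sex')

# Reducing a DataFrameGroupBy gives a DataFrame with one row per group.
means = grouped.mean()

# Reducing a SeriesGroupBy gives a Series with one entry per group.
height_means = grouped.htm3.mean()

# aggregate with a list of names gives one column per (variable, statistic) pair.
summary = grouped.aggregate(['mean', 'std'])
print(summary)
```

In the aggregated result the columns form a MultiIndex, so a single cell is addressed with a tuple, for example summary.loc[1, ('htm3', 'mean')].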
If the reduce method you want is not available, you can make your own:
In [10]:
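def trimmed_mean(series):
    # Find the 5th and 95th percentiles of the series.
    lower, upper = series.quantile([0.05, 0.95])
    # Clip values outside that range to the bounds, then average.
    return series.clip(lower, upper).mean()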
Here's how it works when we apply it directly:
In [11]:
trimmed_mean(df.htm3)
Out[11]:
And we can use apply to apply it to each group:
In [12]:
seriesgroupby.apply(trimmed_mean)
Out[12]:
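One detail worth noting: clip replaces extreme values with the bounds rather than discarding them, which is often called a winsorized mean. The sketch below compares that with a variant that drops the extremes. Both function names and the `toy` series are hypothetical, made up for this comparison.

```python
import pandas as pd

def winsorized_mean(series, p=0.05):
    # Same approach as trimmed_mean above: pull extremes in to the bounds.
    lower, upper = series.quantile([p, 1 - p])
    return series.clip(lower, upper).mean()

def dropping_mean(series, p=0.05):
    # Alternative: discard values outside the bounds before averaging.
    lower, upper = series.quantile([p, 1 - p])
    return series[series.between(lower, upper)].mean()

# Made-up data: nineteen ordinary values and one large outlier.
toy = pd.Series([10.0] * 19 + [1000.0])

print(toy.mean(), winsorized_mean(toy), dropping_mean(toy))
```

On this toy series the clipped version still moves somewhat toward the outlier, while the dropping version ignores it. Either function can be passed to apply on a SeriesGroupBy in the same way.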
Let's say we want to group people into deciles (bottom 10%, next 10%, and so on).
We can start by defining the cumulative probabilities that mark the borders between deciles.
In [13]:
ps = np.linspace(0, 1, 11)
ps
Out[13]:
And then use quantile to find the values that correspond to those cumulative probabilities.
In [14]:
series = df.htm3
bins = series.quantile(ps)
bins
Out[14]:
np.digitize takes a series and a sequence of bin boundaries, and computes the bin index for each element in the series.
In [15]:
np.digitize(series, bins)
Out[15]:
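To see which indices come back, here is a minimal sketch on five made-up heights, using quartile boundaries instead of deciles so the result is easy to check by hand. The `toy` values are an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Five made-up heights in cm.
toy = pd.Series([150.0, 160.0, 170.0, 180.0, 190.0])

# Five cumulative probabilities mark the borders of four quartile groups.
ps = np.linspace(0, 1, 5)
bins = toy.quantile(ps)

# With the default right=False, index i means bins[i-1] <= x < bins[i].
indices = np.digitize(toy, bins)
print(indices)
```

Because the last boundary is the maximum of the series, the largest value is not below any boundary and gets the index len(bins). So with boundaries at 0 and 1 included, the indices run from 1 to len(bins), not from 1 to the number of groups.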
Exercise: Collect the code snippets from the previous cells to write a function called digitize that takes a Series and a number of bins and returns the results from np.digitize.
In [16]:
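def digitize(series, n=11):
    # n evenly spaced cumulative probabilities give n - 1 quantile bins.
    ps = np.linspace(0, 1, n)
    bins = series.quantile(ps)
    # Bin index of each element, as computed by np.digitize.
    return np.digitize(series, bins)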
Now, if your digitize function is working, we can assign the results to a new column in the DataFrame:
In [17]:
df['height_decile'] = digitize(df.htm3)
df.height_decile.describe()
Out[17]:
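As an aside, pandas also provides pd.qcut, which cuts a Series into quantile-based groups directly. This is a different technique from the np.digitize approach used in this notebook, sketched here on made-up data; the `toy` series is an assumption.

```python
import numpy as np
import pandas as pd

# Twenty distinct made-up values, so each decile holds exactly two of them.
toy = pd.Series(np.arange(20.0))

# labels=False returns integer group codes instead of interval labels.
codes = pd.qcut(toy, 10, labels=False)
print(codes.value_counts().sort_index())
```

Here the codes run from 0 to 9, and the maximum value falls in the top group rather than in a group of its own. With heavily tied data, pd.qcut can raise an error about duplicate bin edges, so it is not a drop-in replacement in every case. The rest of this notebook keeps the np.digitize version.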
And then group by height_decile:
In [18]:
groupby = df.groupby('height_decile')
Now we can compute means for each variable in each group:
In [19]:
groupby.mean()
Out[19]:
It looks like:
The shortest people are older than the tallest people, on average.
The shortest people are much more likely to be female (no surprise there).
The shortest people are lighter than the tallest people (wtkg2), and they were lighter last year, too (wtyrago).
Shorter people are more oversampled, so they have lower final weights. This is at least partly, and maybe entirely, due to the relationship with sex.
The fact that all of these variables are associated with height suggests that it will be important to control for age and sex for almost any analysis we want to do with this data.
Nevertheless, we'll start with a simple analysis looking at weights within each height group.
In [20]:
weights = groupby.wtkg2
weights
Out[20]:
If we apply quantile to a SeriesGroupBy, we get back a Series with a MultiIndex.
In [21]:
quantiles = weights.quantile([0.25, 0.5, 0.75])
quantiles
Out[21]:
In [22]:
type(quantiles.index)
Out[22]:
If you unstack a Series with a MultiIndex, the inner level of the MultiIndex gets broken out into columns.
In [23]:
quantiles.unstack()
Out[23]:
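The reshaping is easier to follow on a tiny example. The sketch below uses two made-up height groups with three weights each; the `toy` DataFrame is invented for illustration.

```python
import pandas as pd

# Made-up weights (kg) in two height groups.
toy = pd.DataFrame({
    'height_decile': [1, 1, 1, 2, 2, 2],
    'wtkg2': [50.0, 60.0, 70.0, 80.0, 90.0, 100.0],
})

# One entry per (group, quantile) pair, indexed by a two-level MultiIndex.
quantiles = toy.groupby('height_decile').wtkg2.quantile([0.25, 0.5, 0.75])

# unstack moves the inner level (the quantile) into the columns.
table = quantiles.unstack()
print(table)
```

The result has one row per height group and one column per quantile, which is the layout DataFrame.plot expects when drawing one line per column.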
Which makes it convenient to plot each of the columns as a line.
In [24]:
quantiles.unstack().plot()
Out[24]:
The other view of this data we might like is the CDF of weight within each height group.
We can use apply with the Cdf constructor from thinkstats2. The result is a Series of Cdf objects.
In [25]:
from thinkstats2 import Cdf
cdfs = weights.apply(Cdf)
cdfs
Out[25]:
And now we can plot the CDFs:
In [26]:
import thinkplot
thinkplot.Cdfs(cdfs[1:11:2])
thinkplot.Config(xlabel='Weight (kg)', ylabel='Cdf')
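For readers without thinkstats2 installed, the idea of one CDF per group can be sketched with NumPy alone. The `ecdf` helper below is hypothetical and is not how thinkstats2.Cdf is implemented; the `toy` data is made up.

```python
import numpy as np
import pandas as pd

def ecdf(values):
    # Sort the non-missing values and pair each sorted position
    # with its cumulative fraction of the sample.
    xs = np.asarray(values, dtype=float)
    xs = np.sort(xs[~np.isnan(xs)])
    ps = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ps

# Made-up weights (kg) in two height groups.
toy = pd.DataFrame({
    'height_decile': [1, 1, 1, 2, 2, 2],
    'wtkg2': [70.0, 50.0, 60.0, 100.0, 80.0, 90.0],
})

# Iterating over a SeriesGroupBy yields (group name, Series) pairs.
curves = {name: ecdf(group) for name, group in toy.groupby('height_decile').wtkg2}
print(curves[1])
```

Each (xs, ps) pair could then be drawn as a step line, one per group, to get a picture comparable to the thinkplot figure.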
Exercise: Plot CDFs of weight for men and women separately, broken out by decile of height.
In [27]:
groupby = df.groupby(['sex', 'height_decile'])
In [28]:
groupby.mean()
Out[28]:
In [29]:
groupby.wtkg2.mean()
Out[29]:
In [30]:
cdfs = groupby.wtkg2.apply(Cdf)
cdfs
Out[30]:
In [31]:
cdfs.unstack()
Out[31]:
In [32]:
men = cdfs.unstack().loc[1]
men
Out[32]:
In [33]:
thinkplot.Cdfs(men[1:11:2])
In [34]:
women = cdfs.unstack().loc[2]
women
Out[34]:
In [35]:
thinkplot.Cdfs(women[1:11:2])