Pandas Examples

http://thinkstats2.com

Copyright 2017 Allen B. Downey

MIT License: https://opensource.org/licenses/MIT


In [1]:
from __future__ import print_function, division

%matplotlib inline

import numpy as np

import nsfg
import first
import analytic

import thinkstats2

import seaborn

Let's read data from the BRFSS:


In [2]:
import brfss
df = brfss.ReadBrfss()

In [3]:
df.describe()


Out[3]:
age sex wtyrago finalwt wtkg2 htm3
count 410856.000000 414509.000000 390399.000000 414509.000000 398484.000000 409129.000000
mean 54.862180 1.624368 79.721319 561.774700 78.992453 168.825190
std 16.737702 0.484286 20.565164 1076.538764 19.546157 10.352653
min 18.000000 1.000000 22.727273 1.695143 20.000000 61.000000
25% 43.000000 1.000000 64.545455 97.006804 64.550000 160.000000
50% 55.000000 2.000000 77.272727 234.010543 77.270000 168.000000
75% 67.000000 2.000000 90.909091 590.775576 90.910000 175.000000
max 99.000000 2.000000 342.272727 60995.111700 309.090000 236.000000

If we group by sex, we get a DataFrameGroupBy object.


In [4]:
groupby = df.groupby('sex')
groupby


Out[4]:
<pandas.core.groupby.DataFrameGroupBy object at 0x7f6c6c1cf2d0>

If we select a particular column from the GroupBy, we get a SeriesGroupBy object.


In [5]:
seriesgroupby = groupby.htm3
seriesgroupby


Out[5]:
<pandas.core.groupby.SeriesGroupBy object at 0x7f6c3c88d590>

If you invoke a reduce method on a DataFrameGroupBy, you get a DataFrame:


In [6]:
groupby.mean()


Out[6]:
age wtyrago finalwt wtkg2 htm3
sex
1 54.114237 89.880639 727.426119 89.016966 178.066221
2 55.314401 73.278344 462.115407 72.684712 163.223475

If you invoke a reduce method on a SeriesGroupBy, you get a Series:


In [7]:
seriesgroupby.mean()


Out[7]:
sex
1    178.066221
2    163.223475
Name: htm3, dtype: float64

You can use aggregate to apply a collection of reduce methods:


In [8]:
groupby.aggregate(['mean', 'std'])


Out[8]:
age wtyrago finalwt wtkg2 htm3
mean std mean std mean std mean std mean std
sex
1 54.114237 16.383712 89.880639 19.260805 727.426119 1370.544678 89.016966 18.240528 178.066221 7.723563
2 55.314401 16.932158 73.278344 18.678964 462.115407 836.457063 72.684712 17.609006 163.223475 7.269156

In [9]:
seriesgroupby.aggregate(['mean', 'std'])


Out[9]:
mean std
sex
1 178.066221 7.723563
2 163.223475 7.269156

If the reduce method you want is not available, you can make your own:


In [10]:
def trimmed_mean(series):
    lower, upper = series.quantile([0.05, 0.95])
    return series.clip(lower, upper).mean()

Here's how it works when we apply it directly:


In [11]:
trimmed_mean(df.htm3)


Out[11]:
168.65401132650092

And we can use apply to apply it to each group:


In [12]:
seriesgroupby.apply(trimmed_mean)


Out[12]:
sex
1    178.109121
2    163.210210
Name: htm3, dtype: float64

The digitize-groupby combo

Let's say we want to group people into deciles (bottom 10%, next 10%, and so on).

We can start by defining the cumulative probabilities that mark the borders between deciles.


In [13]:
ps = np.linspace(0, 1, 11)
ps


Out[13]:
array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ])

And then use deciles to find the values that correspond to those cumulative probabilities.


In [14]:
series = df.htm3
bins = series.quantile(ps)
bins


Out[14]:
0.0     61.0
0.1    157.0
0.2    160.0
0.3    163.0
0.4    165.0
0.5    168.0
0.6    170.0
0.7    175.0
0.8    178.0
0.9    183.0
1.0    236.0
Name: htm3, dtype: float64

digitize takes a series and a sequence of bin boundaries, and computes the bin index for each element in the series.


In [15]:
np.digitize(series, bins)


Out[15]:
array([2, 4, 5, ..., 9, 2, 9])

Exercise: Collect the code snippets from the previous cells to write a function called digitize that takes a Series and a number of bins and return the results from np.digitize.


In [16]:
def digitize(series, n=11):
    ps = np.linspace(0, 1, n)
    bins = series.quantile(ps)
    return np.digitize(series, bins)

Now, if your digitize function is working, we can assign the results to a new column in the DataFrame:


In [17]:
df['height_decile'] = digitize(df.htm3)
df.height_decile.describe()


Out[17]:
count    414509.000000
mean          5.897512
std           2.924237
min           1.000000
25%           4.000000
50%           6.000000
75%           9.000000
max          11.000000
Name: height_decile, dtype: float64

And then group by height_decile


In [18]:
groupby = df.groupby('height_decile')

Now we can compute means for each variable in each group:


In [19]:
groupby.mean()


Out[19]:
age sex wtyrago finalwt wtkg2 htm3
height_decile
1 59.343555 1.978576 65.732170 482.633403 65.071502 152.147807
2 57.708848 1.976869 68.925145 460.942597 68.415169 157.001891
3 56.693958 1.964062 71.014838 461.123901 70.420280 160.002595
4 55.411945 1.943195 72.962619 481.562164 72.417512 163.000728
5 55.010687 1.890204 74.834351 498.016131 74.251253 165.002383
6 54.469226 1.781678 77.115205 534.779185 76.563233 168.000509
7 53.708454 1.573876 80.716256 570.787161 80.072338 171.405279
8 53.354186 1.306171 85.399642 643.930094 84.663829 175.001213
9 53.838675 1.134095 89.774340 651.359825 88.932201 178.888785
10 51.531986 1.034134 98.338009 714.941775 97.387332 186.214057
11 55.246926 1.758967 71.630761 792.074824 71.236649 236.000000

It looks like:

  1. The shortest people are older than the tallest people, on average.

  2. The shortest people are much more likely to be female (no surprise there).

  3. The shortest people are lighter than the tallest people (wtkg2), and they were lighter last year, too (wtyrago).

  4. Shorter people are more oversampled, so they have lower final weights. This is at least partly, and maybe entirely, due to the relationship with sex.

The fact that all of these variables are associates with height suggests that it will be important to control for age and sex for almost any analysis we want to do with this data.

Nevertheless, we'll start with a simple analysis looking at weights within each height group.


In [20]:
weights = groupby.wtkg2
weights


Out[20]:
<pandas.core.groupby.SeriesGroupBy object at 0x7f6c34e6a6d0>

If we apply quantile to a SeriesGroupBy, we get back a Series with a MultiIndex.


In [21]:
quantiles = weights.quantile([0.25, 0.5, 0.75])
quantiles


Out[21]:
height_decile      
1              0.25     54.55
               0.50     62.73
               0.75     72.73
2              0.25     57.73
               0.50     65.91
               0.75     76.36
3              0.25     59.09
               0.50     68.18
               0.75     78.64
4              0.25     61.36
               0.50     68.18
               0.75     81.36
5              0.25     63.18
               0.50     71.36
               0.75     81.82
6              0.25     64.09
               0.50     72.73
               0.75     84.09
7              0.25     68.18
               0.50     77.27
               0.75     88.64
8              0.25     72.73
               0.50     81.82
               0.75     90.91
9              0.25     77.27
               0.50     86.36
               0.75     97.27
10             0.25     84.09
               0.50     95.45
               0.75    106.82
11             0.25     59.09
               0.50     69.09
               0.75     80.00
dtype: float64

In [22]:
type(quantiles.index)


Out[22]:
pandas.indexes.multi.MultiIndex

If you unstack a MultiIndex, the inner level of the MultIndex gets broken out into columns.


In [23]:
quantiles.unstack()


Out[23]:
0.25 0.5 0.75
height_decile
1 54.55 62.73 72.73
2 57.73 65.91 76.36
3 59.09 68.18 78.64
4 61.36 68.18 81.36
5 63.18 71.36 81.82
6 64.09 72.73 84.09
7 68.18 77.27 88.64
8 72.73 81.82 90.91
9 77.27 86.36 97.27
10 84.09 95.45 106.82
11 59.09 69.09 80.00

Which makes it convenient to plot each of the columns as a line.


In [24]:
quantiles.unstack().plot()


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6c34e43890>

The other view of this data we might like is the CDF of weight within each height group.

We can use apply with the Cdf constructor from thinkstats2. The results is a Series of Cdf objects.


In [25]:
from thinkstats2 import Cdf

cdfs = weights.apply(Cdf)
cdfs


Out[25]:
height_decile
1     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
3     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
4     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
5     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
6     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
7     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
8     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
9     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
10    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
11    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
Name: wtkg2, dtype: object

And now we can plot the CDFs


In [26]:
import thinkplot

thinkplot.Cdfs(cdfs[1:11:2])
thinkplot.Config(xlabel='Weight (kg)', ylabel='Cdf')


Exercise: Plot CDFs of weight for men and women separately, broken out by decile of height.


In [27]:
groupby = df.groupby(['sex', 'height_decile'])

In [28]:
groupby.mean()


Out[28]:
age wtyrago finalwt wtkg2 htm3
sex height_decile
1 1 53.844743 71.977922 1191.234678 71.514913 150.233293
2 56.595658 71.746929 1003.079353 71.003320 157.025469
3 54.869454 72.944564 964.772919 72.326525 160.028037
4 56.042489 74.553472 879.420399 74.016273 163.005126
5 55.165862 76.385110 917.796857 75.760722 165.013377
6 56.300059 79.337985 814.286624 78.624502 168.001398
7 56.170706 82.956049 705.872453 82.235605 171.753016
8 55.213608 86.572882 706.591295 85.843598 175.001515
9 54.734034 90.392173 676.266016 89.530755 178.925710
10 51.677645 98.621484 721.615568 97.668112 186.240144
11 50.650927 79.295724 1195.072324 78.971384 236.000000
2 1 59.464339 65.595065 467.120026 64.928121 152.189721
2 57.735134 68.857631 448.105460 68.351735 157.001333
3 56.762345 70.942543 442.349122 70.347362 160.001647
4 55.373809 72.864145 457.600597 72.317857 163.000463
5 54.991480 74.636768 446.241040 74.057330 165.001027
6 53.955434 76.467033 456.712813 75.963038 168.000260
7 51.872079 78.972655 470.481305 78.390760 171.147071
8 49.126449 82.582291 501.930594 81.844168 175.000528
9 48.030870 85.524509 490.531017 84.845692 178.650349
10 47.388489 89.769177 526.098271 89.063530 185.475904
11 56.722523 69.371896 664.090555 68.941935 NaN

In [29]:
groupby.wtkg2.mean()


Out[29]:
sex  height_decile
1    1                71.514913
     2                71.003320
     3                72.326525
     4                74.016273
     5                75.760722
     6                78.624502
     7                82.235605
     8                85.843598
     9                89.530755
     10               97.668112
     11               78.971384
2    1                64.928121
     2                68.351735
     3                70.347362
     4                72.317857
     5                74.057330
     6                75.963038
     7                78.390760
     8                81.844168
     9                84.845692
     10               89.063530
     11               68.941935
Name: wtkg2, dtype: float64

In [30]:
cdfs = groupby.wtkg2.apply(Cdf)
cdfs


Out[30]:
sex  height_decile
1    1                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     2                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     3                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     4                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     5                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     6                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     7                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     8                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     9                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     10               [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     11               [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2    1                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     2                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     3                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     4                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     5                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     6                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     7                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     8                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     9                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     10               [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     11               [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
Name: wtkg2, dtype: object

In [31]:
cdfs.unstack()


Out[31]:
height_decile 1 2 3 4 5 6 7 8 9 10 11
sex
1 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...

In [32]:
men = cdfs.unstack().loc[1]
men


Out[32]:
height_decile
1     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
3     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
4     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
5     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
6     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
7     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
8     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
9     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
10    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
11    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
Name: 1, dtype: object

In [33]:
thinkplot.Cdfs(men[1:11:2])



In [34]:
women = cdfs.unstack().loc[2]
women


Out[34]:
height_decile
1     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
3     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
4     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
5     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
6     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
7     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
8     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
9     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
10    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
11    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
Name: 2, dtype: object

In [35]:
thinkplot.Cdfs(women[1:11:2])



In [ ]: