Title: Group-wise Models
Date: 2013-06-28 12:00
Author: cfarmer
Email: carson.farmer@gmail.com
Category: Statistical Modeling for Python
Tags: Helpful tips, Python, Statistical Modeling, Teaching
Slug: statistical-modeling-python-group-wise
Latex: yes
Status: draft

Group-wise Models

Calculating means and other simple statistics is a matter of using the right function in Python. The 'pandas' package - which you should import into Python as shown in previous tutorials - makes it straightforward to calculate either a "grand" statistic or a "group-wise" statistic. To illustrate:

Load the pandas and numpy libraries, needed only once per session:


In [1]:
import numpy as np
import pandas as pd

Read in the data you are interested in analyzing, for instance the Cherry-Blossom 2008 data described in the textbook, which give the running times of all 12,302 participants in a ten-mile running race held in Washington, DC in March 2008:

`Cherry-Blossom-2008.csv`


In [4]:
run = pd.read_csv("http://www.mosaic-web.org/go/datasets/Cherry-Blossom-2008.csv")
run.columns


Out[4]:
array(['position', 'division', 'total', 'name', 'age', 'place', 'net',
       'gun', 'sex'], dtype=object)

Calculate a grand mean on the gun time - the time from the start of the race, signalled by a gun, to when each runner crossed the finish line:


In [5]:
run.gun.mean()


Out[5]:
93.72503928900467

Other 'grand' statistics include:


In [6]:
run.gun.median()


Out[6]:
93.6833333333333

In [8]:
run.gun.std()


Out[8]:
14.96898682484778
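
As an aside, pandas can report several of these summary statistics at once: the describe() method returns the count, mean, standard deviation, minimum, quartiles, and maximum in a single call (a quick sketch; output omitted):


In [ ]:
run.gun.describe()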

To tell Python that you want to break the statistics down by groups, use the groupby() method. You will be using this notation frequently when looking at group-wise models.


In [13]:
run.groupby("sex").gun.mean()


Out[13]:
sex
F      98.772771
M      88.256734
Name: gun, dtype: float64
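
Before comparing group statistics, it is often worth checking how many cases fall into each group. The size() method on the grouped object gives the counts (a quick sketch; output omitted):


In [ ]:
run.groupby("sex").size()  # number of runners in each group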

Other statistics work the same way, for instance:


In [12]:
run.groupby("sex").gun.std()


Out[12]:
sex
F      13.337132
M      14.718514
Name: gun, dtype: float64

Another example: wage broken down by sector of the economy, using the 'Current Population Survey' wage data used in the textbook:

`cps.csv`


In [14]:
cps = pd.read_csv("http://www.mosaic-web.org/go/datasets/cps.csv")
cps.groupby("sector").wage.mean()


Out[14]:
sector
clerical     7.422577
const        9.502000
manag       12.704000
manuf        8.036029
other        8.500588
prof        11.947429
sales        7.592632
service      6.537470
Name: wage, dtype: float64

A study done in the early 1970s in Whickham, UK, examined the health consequences of smoking. The method of the study was simple: interview women to find out who smokes and who doesn’t. Then, 20 years later, follow up to find out who is still living.

`whickham.csv`

In the Whickham smoking data example from the textbook, the outcome for each person was not a number but a category: Alive or Dead at the time of the follow-up.


In [15]:
w = pd.read_csv("http://www.mosaic-web.org/go/datasets/whickham.csv")
w.columns


Out[15]:
Index([outcome, smoker, age], dtype=object)

In [16]:
w.outcome.unique()


Out[16]:
array(['Alive', 'Dead'], dtype=object)

To find the proportion of people who were alive at the end of the 20-year follow-up period, you can use a computational trick. Convert the outcome variable to True or False to indicate whether an individual is alive, then take the mean of the True/False values. Python, like many computer languages, treats True as 1 and False as 0 for the purposes of doing arithmetic.


In [29]:
w["alive"] = (w.outcome=="Alive")
w.alive.mean()


Out[29]:
0.71917808219178081

This can also be done using the following shorthand version:


In [30]:
(w.outcome=="Alive").mean()


Out[30]:
0.71917808219178081

Here's the breakdown according to smoking status:


In [31]:
w.groupby("smoker").alive.mean()


Out[31]:
smoker
No        0.685792
Yes       0.761168
Name: alive, dtype: float64

A more meaningful question is whether smokers are different from non-smokers when holding other variables constant, such as age. To address this question, you need to add age into the model.

It might be natural to consider each age - 35, 36, 37, and so on - as a separate group, but you won’t get very many members of each group. And, likely, the data for 35 year-olds has quite a lot to say about 36 year-olds, so it doesn’t make sense to treat them as completely separate groups.

You can use the cut() function to divide a quantitative variable into groups, specifying the breaks between groups with the bins argument. Using functionality we have already seen (e.g., when we created the new variable alive above), you can add the new variable to an existing data frame:


In [34]:
w["agegroups"] = pd.cut(w.age, bins=(0,30,40,53,64,75,100))
w.groupby("agegroups").alive.mean()


Out[34]:
agegroups
(0, 30]      0.979167
(30, 40]     0.948413
(40, 53]     0.832143
(53, 64]     0.625498
(64, 75]     0.201183
(75, 100]    0.000000
Name: alive, dtype: float64

In [36]:
w.groupby(("agegroups", "smoker")).alive.mean()


Out[36]:
agegroups  smoker
(0, 30]    No        0.981818
           Yes       0.975610
(30, 40]   No        0.955224
           Yes       0.940678
(40, 53]   No        0.876106
           Yes       0.802395
(53, 64]   No        0.669291
           Yes       0.580645
(64, 75]   No        0.213740
           Yes       0.157895
(75, 100]  No        0.000000
           Yes       0.000000
Name: alive, dtype: float64
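
If you prefer to see a two-way breakdown laid out as a table, with one grouping variable along the rows and the other along the columns, the unstack() method will reshape the result (a quick sketch; output omitted):


In [ ]:
w.groupby(["agegroups", "smoker"]).alive.mean().unstack()  # 'smoker' levels become columns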

The mean has been calculated group-by-group. This is a very widely used technique, but there is a better approach that will be introduced in later chapters: using quantitative variables directly, without dividing them into groups.
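
As a rough preview of that approach (a sketch only, not yet the textbook's method), a quantitative variable like age can enter a model directly. Here the True/False alive variable is converted to a 1/0 helper column (introduced just for this sketch) so it can serve as a numeric response:


In [ ]:
import statsmodels.formula.api as sm  # introduced properly in the next section
w["alive01"] = w.alive.astype(int)    # helper column for this sketch: True/False -> 1/0
sm.ols(formula='alive01 ~ age', data=w).fit().params  # survival as a linear function of age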

Model Values and Residuals

A group-wise model tells you the model value for each group. There is additional information that you will want to generate about models. Two fundamental aspects of a model are the 'fitted' model values for each case and the 'residual' for each case. To make these calculations easy, the 'statsmodels' library provides a set of modeling functions that keep track of the data used in creating and fitting models. Later tutorials will introduce the ols() function - ordinary least squares - more thoroughly, as it is central to statistical modeling.

For now, we will use ols() to create a simple model based on group-wise means, and worry about all the additional diagnostics and summary statistics later.

From a very young age, shoes for boys tend to be wider than shoes for girls. Is this because boys have wider feet, or because it is assumed that girls are willing to sacrifice comfort for fashion, even in elementary school? To assess the former, Mary C. Meyer, a statistician from Ann Arbor, MI, [measured kids' feet](http://www.amstat.org/publications/jse/v14n1/datasets.meyer.html) in a fourth grade classroom back in October 1997. Her ultimate goal was to answer the question: "do boys have wider feet than girls?". We are going to work with this data throughout these tutorials to build simple models and test for various statistical relationships.

`kidsfeet.csv`


In [40]:
import statsmodels.formula.api as sm # import statsmodels
kids = pd.read_csv("http://www.mosaic-web.org/go/datasets/kidsfeet.csv")

In [50]:
mod = sm.ols(formula='width ~ sex - 1', data=kids).fit()
mod.params


Out[50]:
sex[B]    9.190000
sex[G]    8.784211
dtype: float64

Note the use of the ~ (pronounced "tilde") notation above. You will be using this notation frequently in building models. It means, "model by", "broken down by", "versus", or "as a function of".
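
The -1 in the formula tells ols() to drop the intercept, so the reported coefficients are the group means themselves. Leaving out the -1 fits an equivalent model, but reports one group as a baseline: the intercept is the mean width for the B group, and the sex[T.G] coefficient is the difference between the G and B group means. A quick sketch (output omitted):


In [ ]:
sm.ols(formula='width ~ sex', data=kids).fit().params  # Intercept plus difference of means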

The fittedvalues attribute of a fitted model, such as the one generated by ols().fit(), returns the fitted model value for each case in the data used to fit the model. To do this, the model takes each case in turn, figures out which group it belongs to and the corresponding model value, and then returns the set of model values for each of the cases. For example:


In [52]:
mod.fittedvalues.values # use `values` attribute to return simple array of fitted model values


Out[52]:
array([ 9.19      ,  9.19      ,  9.19      ,  9.19      ,  9.19      ,
        9.19      ,  9.19      ,  8.78421053,  8.78421053,  9.19      ,
        9.19      ,  9.19      ,  9.19      ,  9.19      ,  8.78421053,
        8.78421053,  8.78421053,  8.78421053,  8.78421053,  8.78421053,
        9.19      ,  9.19      ,  8.78421053,  8.78421053,  8.78421053,
        9.19      ,  8.78421053,  9.19      ,  9.19      ,  9.19      ,
        8.78421053,  8.78421053,  8.78421053,  9.19      ,  9.19      ,
        8.78421053,  8.78421053,  8.78421053,  8.78421053])

The residuals are found by subtracting the case-by-case model value from the actual values for each case. This is accomplished by the resid attribute of the model, which works in much the same way as fittedvalues:


In [54]:
mod.resid.values # again, use `values` to return simple array of residuals


Out[54]:
array([-0.79      , -0.39      ,  0.51      ,  0.61      , -0.29      ,
        0.51      ,  0.41      ,  0.01578947,  0.51578947, -0.39      ,
        0.61      , -0.29      , -0.09      ,  0.61      ,  0.51578947,
       -0.88421053, -0.08421053,  0.01578947,  0.21578947,  0.71578947,
        0.01      , -0.59      , -0.48421053,  0.21578947, -0.68421053,
        0.21      ,  0.71578947,  0.31      , -0.29      ,  0.11      ,
        0.51578947, -0.18421053, -0.18421053, -0.19      , -0.59      ,
       -0.28421053,  0.21578947, -0.88421053,  0.01578947])

For each case, adding the residual to the fitted value will produce the value of the response variable.
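
You can check this identity directly; a minimal sketch using numpy's allclose() to allow for floating-point error:


In [ ]:
np.allclose(mod.fittedvalues + mod.resid, kids.width)  # returns True: fitted + residual reproduces width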

The var() method calculates variance. There is a simple additive relationship among the variances of the response variable, the model's fitted values, and the corresponding residuals:


In [55]:
kids.width.var()


Out[55]:
0.25967611336032431

In [56]:
mod.fittedvalues.var()


Out[56]:
0.042221819731342917

In [57]:
mod.resid.var()


Out[57]:
0.21745429362880894

The variance of the response variable equals the variance of the fitted values plus the variance of the residuals. Below, we verify this additive relationship, rounding to 8 decimal places to avoid numerical roundoff errors:


In [64]:
kids.width.var().round(8) == (mod.fittedvalues.var() + mod.resid.var()).round(8)


Out[64]:
True

Next time on Statistical Modeling: A Fresh Approach for Python...

  • Confidence Intervals

Reference

As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.

I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.