Calculating means and other simple statistics is a matter of using the right function in Python. The 'pandas' package - which you should import
in to Python as shown in previous tutorials - makes it straightforward to calculate either a "grand" statistic or a "group-wise" statistic. To illustrate:
Load the pandas and numpy libraries, needed only once per session:
In [1]:
import numpy as np
import pandas as pd
Read in data you are interested in analyzing, for instance the Cherry-Blossom 2008 data described in the textbook, which give the running times of all 12,302 participants in a ten-mile long running race held in Washington, DC in March 2008:
`Cherry-Blossom-2008.csv`
In [4]:
run = pd.read_csv("http://www.mosaic-web.org/go/datasets/Cherry-Blossom-2008.csv")
run.columns
Out[4]:
Calculate a grand mean on the gun
time - the time from the start of the race, signalled by a gun, and when each runner crossed the finish line:
In [5]:
run.gun.mean()
Out[5]:
Other 'grand' statistics include:
In [6]:
run.gun.median()
Out[6]:
In [8]:
run.gun.std()
Out[8]:
To tell Python that you want to break the statistics down by groups, use the groupby()
function. You will be using this notation frequently in looking at group-wise models.
In [13]:
run.groupby("sex").gun.mean()
Out[13]:
Other statistics work the same way, for instance:
In [12]:
run.groupby("sex").gun.std()
Out[12]:
Another example ... wage
broken down by sector
of the economy, using the 'Current Population Survey' wage data used in the textbook:
`cps.csv`
In [14]:
cps = pd.read_csv("http://www.mosaic-web.org/go/datasets/cps.csv")
cps.groupby("sector").wage.mean()
Out[14]:
A study done in the early 1970s in Whickham, UK, examined the health consequences of smoking. The method of the study was simple: interview women to find out who smokes and who doesn’t. Then, 20 years later, follow-up to find out who is still living.
`whickham.csv`
In the Whickham smoking data example from the textbook, the outcome for each person was not a number but a category: Alive or Dead at the time of the follow-up.
In [15]:
w = pd.read_csv("http://www.mosaic-web.org/go/datasets/whickham.csv")
w.columns
Out[15]:
In [16]:
w.outcome.unique()
Out[16]:
To find the proportion of people who were alive at the end of the 20-year followup period, you can use a computational trick. Convert the outcome variable to True
or False
to indicate whether an individual is alive, then take the mean of the True/False values. Python, like many computer languages, treats True
as 1 and False
as 0 for the purposes of doing arithmetic.
In [29]:
w["alive"] = (w.outcome=="Alive")
w.alive.mean()
Out[29]:
This can also be done using the following 'short hand' version:
In [30]:
(w.outcome=="Alive").mean()
Out[30]:
Here's the breakdown according to smoking status:
In [31]:
w.groupby("smoker").alive.mean()
Out[31]:
A more meaningful question is whether smokers are different from non-smokers when holding other variables constant, such as age. To address this question, you need to add age into the model.
It might be natural to consider each age - 35, 36, 37, and so on - as a separate group, but you won’t get very many members of each group. And, likely, the data for 35 year-olds has quite a lot to say about 36 year-olds, so it doesn’t make sense to treat them as completely separate groups.
You can use the cut()
function to divide up a quantitative variable into groups. You get to specify the breaks between groups. Using functionality we have already seen (e.g., when we created the new variable alive
above), you can add the new variable to an existing data frame:
In [34]:
w["agegroups"] = pd.cut(w.age, bins=(0,30,40,53,64,75,100))
w.groupby("agegroups").alive.mean()
Out[34]:
In [36]:
w.groupby(("agegroups", "smoker")).alive.mean()
Out[36]:
The mean has been calculated group-by-group. This is a very widely used technique, but there is a better approach that will be introduced in later chapters: use quantitative variables directly without dividing them into groups.
A group-wise model tells you the model value for each group. There is additional information that you will want to generate about models. Two fundamental aspects of a model are the 'fitted' model values for each case and the 'residual' for each case. To make it easy to do these calculations, Python has a set of modeling functions from the 'statsmodels' library that helps keep track of the data used in creating and fitting models. Later tutorials will introduce the ols()
function more thouroughly - ordinary least squares - that is central to statistical modeling.
For now, we will use ols()
to create a simple model based on groupwise means, and worry about all the additional diagnostics and summary statistics later.
From a very young age, shoes for boys tend to be wider than shoes for girls. Is this because boys have wider feet, or because it is assumed that girls are willing to sacrifice comfort for fashion, even in elementary school? To assess the former, Mary C. Meyer, a statistician from Arbor, MI [measured kids' feet](http://www.amstat.org/publications/jse/v14n1/datasets.meyer.html) in a fourth grade classroom back in October 1997. Her ultimate goal was to answer the question: "do boys have wider feet than girls?". We are going to work with this data throughout these tutorials to build simple models and test for various statistical relationships.
`kidsfeet.csv`
In [40]:
import statsmodels.formula.api as sm # import statsmodels
kids = pd.read_csv("http://www.mosaic-web.org/go/datasets/kidsfeet.csv")
In [50]:
mod = sm.ols(formula='width ~ sex -1', df=kids).fit()
mod.params
Out[50]:
Note the use of the ~
(pronounced "tilde") notation above. You will be using this notation frequently in building models. It means, "model by", "broken down by", "versus", or "as a function of".
The fittedvalues
attribute of a model, such as generated by ols().fit()
, returns the fitted model value for each case in the data used to fit the model. To do this, the model takes each case in turn, figures out which group it belongs to and the corresponding model value, and then returns the set of model value for each of the cases. For example:
In [52]:
mod.fittedvalues.values # use `values` attribute to return simple array of fitted model values
Out[52]:
The residuals are found by subtracting the case-by-case model value from the actual values for each case. This is accomplished by the resid
attribute of the mod
model, which works in much the same way as fitted
:
In [54]:
mod.resid.values # again, use `values` to return simple array of residuals
Out[54]:
For each case, adding the residual to the fitted value will produce the value of the response variable.
The var()
function calculates variance. There is a simple additive relationship among the variances of the response variable, the model's fitted values, and the corresponding residual values:
In [55]:
kids.width.var()
Out[55]:
In [56]:
mod.fittedvalues.var()
Out[56]:
In [57]:
mod.resid.var()
Out[57]:
The variance of the response variable is equal to the variance of the fitted values and the variance of the residuals. Below, we show this (response var equal to fitted var plus residual var), with rounding to 8 significant digits to avoid numerical roundoff errors:
In [64]:
kids.width.var().round(8) == (mod.fittedvalues.var() + mod.resid.var()).round(8)
Out[64]:
As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.
I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.