Anecdotal evidence usually fails because:
To address the limitations of anecdotes, we will use the tools of statistics, which include:
In [1]:
import matplotlib
import pandas as pd
%matplotlib inline
We can easily access the data frame and its columns with scripts in the https://github.com/AllenDowney/ThinkStats2 repo.
In [2]:
import nsfg
df = nsfg.ReadFemPreg()
df.head()
Out[2]:
In [3]:
pregordr = df['pregordr']
pregordr[2:5]
Out[3]:
Print value counts for birthord and compare to results published in the codebook.
In [4]:
birthord_counts = df.birthord.value_counts().sort_index()
birthord_counts
Out[4]:
In [5]:
birthord_counts.plot(kind='bar')
Out[5]:
Print value counts for prglngth and compare to results published in the codebook.
In [6]:
df['prglngth_cut'] = pd.cut(df.prglngth,bins=[0,13,26,50])
df.prglngth_cut.value_counts().sort_index()
Out[6]:
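A note on how `pd.cut` treats bin edges may help when comparing against the codebook. By default each interval is open on the left and closed on the right, so a value equal to the lowest edge falls in no bin. The sketch below uses a small made-up Series (not NSFG data) to show the difference that `include_lowest=True` makes.

```python
import pandas as pd

# Hypothetical pregnancy lengths in weeks (illustrative values, not NSFG data).
weeks = pd.Series([0, 13, 14, 26, 27, 40])

# Default bins are (0, 13], (13, 26], (26, 50], so the value 0 matches no bin.
default_counts = pd.cut(weeks, bins=[0, 13, 26, 50]).value_counts().sort_index()

# include_lowest=True widens the first interval so that 0 is counted.
inclusive_counts = pd.cut(
    weeks, bins=[0, 13, 26, 50], include_lowest=True
).value_counts().sort_index()

print(default_counts)
print(inclusive_counts)
```

If the counts for the first bin come out lower than the codebook's, records with a length of exactly 0 are one thing worth checking.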
Compute the mean birthweight.
In [7]:
df.totalwgt_lb.mean()
Out[7]:
Create a new column named totalwgt_kg that contains birth weight in kilograms. Compute its mean. Remember that when you create a new column, you have to use dictionary syntax, not dot notation.
In [8]:
df['totalwgt_kg'] = 0.45359237 * df.totalwgt_lb
df.totalwgt_kg.mean()
Out[8]:
One important note: when you add a new column to a DataFrame, you must use dictionary syntax, like this
# CORRECT
df['totalwgt_lb'] = df.birthwgt_lb + df.birthwgt_oz / 16.0
Not dot notation, like this:
# WRONG!
df.totalwgt_lb = df.birthwgt_lb + df.birthwgt_oz / 16.0
The version with dot notation adds an attribute to the DataFrame object, but that attribute is not treated as a new column.
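The difference is easy to check on a toy DataFrame. This is a minimal sketch with made-up weights; the column names `total_dot` and `total_dict` are hypothetical and chosen only to tell the two assignments apart. Recent versions of pandas emit a `UserWarning` for the dot-notation assignment, which the sketch silences.

```python
import warnings

import pandas as pd

# Hypothetical birth weights (illustrative values, not NSFG data).
df = pd.DataFrame({'birthwgt_lb': [7, 6], 'birthwgt_oz': [8, 0]})

with warnings.catch_warnings():
    # Silence the warning pandas raises for attribute-style column creation.
    warnings.simplefilter('ignore')
    # Dot notation: sets a plain attribute on the object, not a column.
    df.total_dot = df.birthwgt_lb + df.birthwgt_oz / 16.0

# Dictionary syntax: creates a real column.
df['total_dict'] = df.birthwgt_lb + df.birthwgt_oz / 16.0

print(list(df.columns))
```

Only `total_dict` shows up in `df.columns`; `total_dot` exists as an attribute but is ignored by operations such as `df.head()` or `df.mean()`.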
Create a boolean Series.
In [9]:
live_birth = df.outcome == 1
live_birth.tail()
Out[9]:
Use a boolean Series to select the records for the pregnancies that ended in live birth.
In [10]:
live = df[df.outcome == 1]
len(live)
Out[10]:
Count the number of live births with birthwgt_lb between 0 and 5 pounds (including both). The result should be 1125.
In [11]:
len(live[(0<=live.birthwgt_lb) & (live.birthwgt_lb<=5)])
Out[11]:
Count the number of live births with birthwgt_lb between 9 and 95 pounds (including both). The result should be 798.
In [12]:
len(live[(9<=live.birthwgt_lb) & (live.birthwgt_lb<=95)])
Out[12]:
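For inclusive range counts like the two above, `Series.between` is a more compact alternative to combining two comparisons with `&`; by default it includes both endpoints. The sketch below uses a tiny made-up DataFrame (not NSFG data) to show that the two forms select the same rows.

```python
import pandas as pd

# Hypothetical weights in pounds (illustrative values, not NSFG data).
live = pd.DataFrame({'birthwgt_lb': [5, 9, 12, 95, 97]})

# Explicit form: two comparisons combined with &; the parentheses are required.
mask = (9 <= live.birthwgt_lb) & (live.birthwgt_lb <= 95)
explicit = live[mask]

# Compact form: between() includes both endpoints by default.
concise = live[live.birthwgt_lb.between(9, 95)]

print(len(explicit), len(concise))
```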
Use birthord to select the records for first babies and others. How many are there of each?
In [13]:
firsts = df[df.birthord==1]
others = df[df.birthord>1]
len(firsts), len(others)
Out[13]:
Compute the mean weight for first babies and others.
In [14]:
firsts.totalwgt_lb.mean(), others.totalwgt_lb.mean()
Out[14]:
Compute the mean prglngth for first babies and others. Compute the difference in means, expressed in hours.
In [15]:
firsts.prglngth.mean(), others.prglngth.mean()
Out[15]:
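The cell above reports the two means but not the difference in hours. Since prglngth is recorded in weeks, the conversion is a multiplication by 7 × 24 = 168. A minimal sketch, using made-up stand-ins for `firsts` and `others` rather than the NSFG data:

```python
import pandas as pd

HOURS_PER_WEEK = 7 * 24

# Hypothetical stand-ins for the firsts and others DataFrames (not NSFG data).
firsts = pd.DataFrame({'prglngth': [39, 40, 41]})
others = pd.DataFrame({'prglngth': [38, 39, 40]})

# Difference in mean pregnancy length, first in weeks, then in hours.
diff_weeks = firsts.prglngth.mean() - others.prglngth.mean()
diff_hours = diff_weeks * HOURS_PER_WEEK

print(diff_weeks, diff_hours)
```

The same two lines applied to the real `firsts` and `others` give the answer to the exercise.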
In [16]:
import thinkstats2
resp = thinkstats2.ReadStataDct('2002FemResp.dct').ReadFixedWidth('2002FemResp.dat.gz', compression='gzip')
In [17]:
preg = nsfg.ReadFemPreg()
preg_map = nsfg.MakePregMap(preg)
In [18]:
for index, pregnum in resp.pregnum.items():
    caseid = resp.caseid[index]
    indices = preg_map[caseid]
    # check that pregnum from the respondent file equals
    # the number of records in the pregnancy file
    if len(indices) != pregnum:
        print(caseid, len(indices), pregnum)
        break
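The body of `nsfg.MakePregMap` is not shown here. The sketch below assumes it maps each caseid to the list of row indices for that respondent in the pregnancy file, which is what the use of `len(indices)` in the loop suggests. It rebuilds such a map on two small made-up tables and collects every mismatch instead of stopping at the first one.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical respondent and pregnancy tables (illustrative values, not NSFG data).
resp = pd.DataFrame({'caseid': [1, 2, 3], 'pregnum': [2, 0, 1]})
preg = pd.DataFrame({'caseid': [1, 1, 3]})

# Assumed structure: caseid -> list of row indices in the pregnancy table.
preg_map = defaultdict(list)
for index, caseid in preg.caseid.items():
    preg_map[caseid].append(index)

# Collect (caseid, records found, pregnum reported) wherever the counts disagree.
mismatches = [
    (caseid, len(preg_map[caseid]), pregnum)
    for caseid, pregnum in zip(resp.caseid, resp.pregnum)
    if len(preg_map[caseid]) != pregnum
]

print(mismatches)
```

Using a `defaultdict` means a respondent with no pregnancy records gets an empty list, so a reported pregnum of 0 still validates.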
Governments are good sources because data from public research is often freely available. Good places to start include:
Two of the book author's favorite data sets are: