This notebook contains a "one-day paper", my attempt to pose a research question, answer it, and publish the results in one work day.
Copyright 2016 Allen B. Downey
MIT License: https://opensource.org/licenses/MIT
In [1]:
from __future__ import print_function, division
import thinkstats2
import thinkplot
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
%matplotlib inline
According to Wikipedia, the Trivers-Willard hypothesis:
"...suggests that female mammals are able to adjust offspring sex ratio in response to their maternal condition. For example, it may predict greater parental investment in males by parents in 'good conditions' and greater investment in females by parents in 'poor conditions' (relative to parents in good condition)."
For humans, the hypothesis suggests that people with relatively high social status might be more likely to have boys. Some studies have shown evidence for this hypothesis, but based on my very casual survey, it is not persuasive.
To test whether the T-W hypothesis holds up in humans, I downloaded birth data for the nearly 4 million babies born in the U.S. in 2014.
I selected variables that seemed likely to be related to social status and used logistic regression to identify variables associated with sex ratio.
Summary of results
Running regression with one variable at a time, many of the variables have a statistically significant effect on sex ratio, with the sign of the effect generally in the direction predicted by T-W.
However, many of the variables are also correlated with race. If we control for either the mother's race or the father's race, or both, most other variables have no additional predictive power.
Contrary to other reports, the age of the parents seems to have no predictive power.
Strangely, the variable that shows the strongest and most consistent relationship with sex ratio is the number of prenatal visits. Although it seems obvious that prenatal visits are a proxy for quality of health care and general socioeconomic status, the sign of the effect is opposite what T-W predicts; that is, more prenatal visits is a strong predictor of lower sex ratio (more girls).
Following convention, I report sex ratio in terms of boys per 100 girls. The overall sex ratio at birth is about 105; that is, 105 boys are born for every 100 girls.
In [2]:
names = ['year', 'mager9', 'mnativ', 'restatus', 'mbrace', 'mhisp_r',
'mar_p', 'dmar', 'meduc', 'fagerrec11', 'fbrace', 'fhisp_r', 'feduc',
'lbo_rec', 'previs_rec', 'wic', 'height', 'bmi_r', 'pay_rec', 'sex']
colspecs = [(9, 12),
(79, 79),
(84, 84),
(104, 104),
(110, 110),
(115, 115),
(119, 119),
(120, 120),
(124, 124),
(149, 150),
(156, 156),
(160, 160),
(163, 163),
(179, 179),
(242, 243),
(251, 251),
(280, 281),
(287, 287),
(436, 436),
(475, 475),
]
colspecs = [(start-1, end) for start, end in colspecs]
In [3]:
df = None
In [4]:
filename = 'Nat2014PublicUS.c20150514.r20151022.txt.gz'
#df = pd.read_fwf(filename, compression='gzip', header=None, names=names, colspecs=colspecs)
#df.head()
In [5]:
# store the dataframe for faster loading
#store = pd.HDFStore('store.h5')
#store['births2014'] = df
#store.close()
In [6]:
# load the dataframe
store = pd.HDFStore('store.h5')
df = store['births2014']
store.close()
In [7]:
def series_to_ratio(series):
"""Takes a boolean series and computes sex ratio.
"""
boys = np.mean(series)
return np.round(100 * boys / (1-boys)).astype(int)
I have to recode sex as 0
or 1
to make logit
happy.
In [8]:
df['boy'] = (df.sex=='M').astype(int)
df.boy.value_counts().sort_index()
Out[8]:
All births are from 2014.
In [9]:
df.year.value_counts().sort_index()
Out[9]:
Mother's age:
In [10]:
df.mager9.value_counts().sort_index()
Out[10]:
In [11]:
var = 'mager9'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[11]:
In [12]:
df.mager9.isnull().mean()
Out[12]:
In [13]:
df['youngm'] = df.mager9<=2
df['oldm'] = df.mager9>=7
df.youngm.mean(), df.oldm.mean()
Out[13]:
Mother's nativity (1 = born in the U.S.)
In [14]:
df.mnativ.replace([3], np.nan, inplace=True)
df.mnativ.value_counts().sort_index()
Out[14]:
In [15]:
var = 'mnativ'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[15]:
Residence status (1=resident)
In [16]:
df.restatus.value_counts().sort_index()
Out[16]:
In [17]:
var = 'restatus'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[17]:
Mother's race (1=White, 2=Black, 3=American Indian or Alaskan Native, 4=Asian or Pacific Islander)
In [18]:
df.mbrace.value_counts().sort_index()
Out[18]:
In [19]:
var = 'mbrace'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[19]:
Mother's Hispanic origin (0=Non-Hispanic)
In [20]:
df.mhisp_r.replace([9], np.nan, inplace=True)
df.mhisp_r.value_counts().sort_index()
Out[20]:
In [21]:
def copy_null(df, oldvar, newvar):
df.loc[df[oldvar].isnull(), newvar] = np.nan
In [22]:
df['mhisp'] = df.mhisp_r > 0
copy_null(df, 'mhisp_r', 'mhisp')
df.mhisp.isnull().mean(), df.mhisp.mean()
Out[22]:
In [23]:
var = 'mhisp'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[23]:
Marital status (1=Married)
In [24]:
df.dmar.value_counts().sort_index()
Out[24]:
In [25]:
var = 'dmar'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[25]:
Paternity acknowledged, if unmarried (Y=yes, N=no, X=not applicable, U=unknown).
I recode X (not applicable because married) as Y (paternity acknowledged).
In [26]:
df.mar_p.replace(['U'], np.nan, inplace=True)
df.mar_p.replace(['X'], 'Y', inplace=True)
df.mar_p.value_counts().sort_index()
Out[26]:
In [27]:
var = 'mar_p'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[27]:
Mother's education level
In [28]:
df.meduc.replace([9], np.nan, inplace=True)
df.meduc.value_counts().sort_index()
Out[28]:
In [29]:
var = 'meduc'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[29]:
In [30]:
df['lowed'] = df.meduc <= 2
copy_null(df, 'meduc', 'lowed')
df.lowed.isnull().mean(), df.lowed.mean()
Out[30]:
Father's age, in 10 ranges
In [31]:
df.fagerrec11.replace([11], np.nan, inplace=True)
df.fagerrec11.value_counts().sort_index()
Out[31]:
In [32]:
var = 'fagerrec11'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[32]:
In [33]:
df['youngf'] = df.fagerrec11<=2
copy_null(df, 'fagerrec11', 'youngf')
df.youngf.isnull().mean(), df.youngf.mean()
Out[33]:
In [34]:
df['oldf'] = df.fagerrec11>=8
copy_null(df, 'fagerrec11', 'oldf')
df.oldf.isnull().mean(), df.oldf.mean()
Out[34]:
Father's race
In [35]:
df.fbrace.replace([9], np.nan, inplace=True)
df.fbrace.value_counts().sort_index()
Out[35]:
In [36]:
var = 'fbrace'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[36]:
Father's Hispanic origin (0=non-hispanic, other values indicate country of origin)
In [37]:
df.fhisp_r.replace([9], np.nan, inplace=True)
df.fhisp_r.value_counts().sort_index()
Out[37]:
In [38]:
df['fhisp'] = df.fhisp_r > 0
copy_null(df, 'fhisp_r', 'fhisp')
df.fhisp.isnull().mean(), df.fhisp.mean()
Out[38]:
In [39]:
var = 'fhisp'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[39]:
Father's education level
In [40]:
df.feduc.replace([9], np.nan, inplace=True)
df.feduc.value_counts().sort_index()
Out[40]:
In [41]:
var = 'feduc'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[41]:
Live birth order.
In [42]:
df.lbo_rec.replace([9], np.nan, inplace=True)
df.lbo_rec.value_counts().sort_index()
Out[42]:
In [43]:
var = 'lbo_rec'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[43]:
In [44]:
df['highbo'] = df.lbo_rec >= 5
copy_null(df, 'lbo_rec', 'highbo')
df.highbo.isnull().mean(), df.highbo.mean()
Out[44]:
Number of prenatal visits, in 11 ranges
In [45]:
df.previs_rec.replace([12], np.nan, inplace=True)
df.previs_rec.value_counts().sort_index()
Out[45]:
In [46]:
df.previs_rec.mean()
df['previs'] = df.previs_rec - 7
In [47]:
var = 'previs'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[47]:
In [48]:
df['no_previs'] = df.previs_rec <= 1
copy_null(df, 'previs_rec', 'no_previs')
df.no_previs.isnull().mean(), df.no_previs.mean()
Out[48]:
Whether the mother is eligible for food stamps
In [49]:
df.wic.replace(['U'], np.nan, inplace=True)
df.wic.value_counts().sort_index()
Out[49]:
In [50]:
var = 'wic'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[50]:
Mother's height in inches
In [51]:
df.height.replace([99], np.nan, inplace=True)
df.height.value_counts().sort_index()
Out[51]:
In [52]:
df['mshort'] = df.height<60
copy_null(df, 'height', 'mshort')
df.mshort.isnull().mean(), df.mshort.mean()
Out[52]:
In [53]:
df['mtall'] = df.height>=70
copy_null(df, 'height', 'mtall')
df.mtall.isnull().mean(), df.mtall.mean()
Out[53]:
In [54]:
var = 'mshort'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[54]:
In [55]:
var = 'mtall'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[55]:
Mother's BMI in 6 ranges
In [56]:
df.bmi_r.replace([9], np.nan, inplace=True)
df.bmi_r.value_counts().sort_index()
Out[56]:
In [57]:
var = 'bmi_r'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[57]:
In [58]:
df['obese'] = df.bmi_r >= 4
copy_null(df, 'bmi_r', 'obese')
df.obese.isnull().mean(), df.obese.mean()
Out[58]:
Payment method (1=Medicaid, 2=Private insurance, 3=Self pay, 4=Other)
In [59]:
df.pay_rec.replace([9], np.nan, inplace=True)
df.pay_rec.value_counts().sort_index()
Out[59]:
In [60]:
var = 'pay_rec'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[60]:
Sex of baby
In [61]:
df.sex.value_counts().sort_index()
Out[61]:
In [62]:
def logodds_to_ratio(logodds):
"""Convert from log odds to probability."""
odds = np.exp(logodds)
return 100 * odds
def summarize(results):
"""Summarize parameters in terms of birth ratio."""
inter_or = results.params['Intercept']
inter_rat = logodds_to_ratio(inter_or)
for value, lor in results.params.iteritems():
if value=='Intercept':
continue
rat = logodds_to_ratio(inter_or + lor)
code = '*' if results.pvalues[value] < 0.05 else ' '
print('%-20s %0.1f %0.1f' % (value, inter_rat, rat), code)
Now I'll run models with each variable, one at a time.
Mother's age seems to have no predictive value:
In [63]:
model = smf.logit('boy ~ mager9', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[63]:
The estimated ratios for young mothers is higher, and the ratio for older mothers is lower, but neither is statistically significant.
In [64]:
model = smf.logit('boy ~ youngm + oldm', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[64]:
Whether the mother was born in the U.S. has no predictive value
In [65]:
model = smf.logit('boy ~ C(mnativ)', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[65]:
Neither does residence status
In [66]:
model = smf.logit('boy ~ C(restatus)', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[66]:
Mother's race seems to have predictive value. Relative to whites, black and Native American mothers have more girls; Asians have more boys.
In [67]:
model = smf.logit('boy ~ C(mbrace)', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[67]:
Hispanic mothers have more girls.
In [68]:
model = smf.logit('boy ~ mhisp', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[68]:
If the mother is married or unmarried but paternity is acknowledged, the sex ratio is higher (more boys)
In [69]:
model = smf.logit('boy ~ C(mar_p)', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[69]:
Being unmarried predicts more girls.
In [70]:
model = smf.logit('boy ~ C(dmar)', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[70]:
Each level of mother's education predicts a small increase in the probability of a boy.
In [71]:
model = smf.logit('boy ~ meduc', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[71]:
In [72]:
model = smf.logit('boy ~ lowed', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[72]:
Older fathers are slightly more likely to have girls (but this apparent effect could be due to chance).
In [73]:
model = smf.logit('boy ~ fagerrec11', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[73]:
In [74]:
model = smf.logit('boy ~ youngf + oldf', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[74]:
Predictions based on father's race are similar to those based on mother's race: more girls for black and Native American fathers; more boys for Asian fathers.
In [75]:
model = smf.logit('boy ~ C(fbrace)', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[75]:
If the father is Hispanic, that predicts more girls.
In [76]:
model = smf.logit('boy ~ fhisp', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[76]:
Father's education level might predict more boys, but the apparent effect could be due to chance.
In [77]:
model = smf.logit('boy ~ feduc', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[77]:
Babies with high birth order are slightly more likely to be girls.
In [78]:
model = smf.logit('boy ~ lbo_rec', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[78]:
In [79]:
model = smf.logit('boy ~ highbo', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[79]:
Strangely, prenatal visits are associated with an increased probability of girls.
In [80]:
model = smf.logit('boy ~ previs', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[80]:
The effect seems to be non-linear at zero, so I'm adding a boolean for no prenatal visits.
In [81]:
model = smf.logit('boy ~ no_previs + previs', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[81]:
If the mother qualifies for food stamps, she is more likely to have a girl.
In [82]:
model = smf.logit('boy ~ wic', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[82]:
Mother's height seems to have no predictive value.
In [83]:
model = smf.logit('boy ~ height', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[83]:
In [84]:
model = smf.logit('boy ~ mtall + mshort', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[84]:
Mother's with higher BMI are more likely to have girls.
In [85]:
model = smf.logit('boy ~ bmi_r', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[85]:
In [86]:
model = smf.logit('boy ~ obese', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[86]:
If payment was made by Medicaid, the baby is more likely to be a girl. Private insurance, self-payment, and other payment method are associated with more boys.
In [87]:
model = smf.logit('boy ~ C(pay_rec)', data=df)
results = model.fit()
summarize(results)
results.summary()
Out[87]:
However, none of the previous results should be taken too seriously. We only tested one variable at a time, and many of these apparent effects disappear when we add control variables.
In particular, if we control for father's race and Hispanic origin, the mother's race has no additional predictive value.
In [88]:
formula = ('boy ~ C(fbrace) + fhisp + C(mbrace) + mhisp')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Out[88]:
In fact, once we control for father's race and Hispanic origin, almost every other variable becomes statistically insignificant, including acknowledged paternity.
In [89]:
formula = ('boy ~ C(fbrace) + fhisp + mar_p')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Out[89]:
Being married still predicts more boys.
In [90]:
formula = ('boy ~ C(fbrace) + fhisp + dmar')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Out[90]:
The effect of education disappears.
In [91]:
formula = ('boy ~ C(fbrace) + fhisp + lowed')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Out[91]:
The effect of birth order disappears.
In [92]:
formula = ('boy ~ C(fbrace) + fhisp + highbo')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Out[92]:
WIC is no longer associated with more girls.
In [93]:
formula = ('boy ~ C(fbrace) + fhisp + wic')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Out[93]:
The effect of obesity disappears.
In [94]:
formula = ('boy ~ C(fbrace) + fhisp + obese')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Out[94]:
The effect of payment method is diminished, but self-payment is still associated with more boys.
In [95]:
formula = ('boy ~ C(fbrace) + fhisp + C(pay_rec)')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Out[95]:
But the effect of prenatal visits is still a strong predictor of more girls.
In [96]:
formula = ('boy ~ C(fbrace) + fhisp + previs')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Out[96]:
And the effect is even stronger if we add a boolean to capture the nonlinearity at 0 visits.
In [97]:
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Out[97]:
Now if we control for father's race and Hispanic origin as well as number of prenatal visits, the effect of marriage disappears.
In [98]:
formula = ('boy ~ C(fbrace) + fhisp + previs + dmar')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Out[98]:
The effect of payment method disappears.
In [99]:
formula = ('boy ~ C(fbrace) + fhisp + previs + C(pay_rec)')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Out[99]:
Here's a version with the addition of a boolean for no prenatal visits.
In [100]:
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Out[100]:
Now, surprisingly, the mother's age has a small effect.
In [101]:
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs + mager9')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Out[101]:
So does the father's age. But both age effects are small and borderline significant.
In [104]:
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs + fagerrec11')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()
Out[104]:
In [110]:
white = df[(df.mbrace==1) & (df.fbrace==1)]
len(white)
Out[110]:
And compute sex ratios for each level of previs
In [111]:
var = 'previs'
white[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
Out[111]:
The effect holds up. People with fewer than average prenatal visits are substantially more likely to have boys.
In [112]:
formula = ('boy ~ previs + no_previs')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Out[112]:
In [113]:
inter = results.params['Intercept']
slope = results.params['previs']
inter, slope
Out[113]:
In [114]:
previs = np.arange(-5, 5)
logodds = inter + slope * previs
odds = np.exp(logodds)
odds * 100
Out[114]:
In [116]:
formula = ('boy ~ dmar')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Out[116]:
In [117]:
formula = ('boy ~ lowed')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Out[117]:
In [118]:
formula = ('boy ~ highbo')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Out[118]:
In [119]:
formula = ('boy ~ wic')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Out[119]:
In [120]:
formula = ('boy ~ obese')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Out[120]:
In [123]:
formula = ('boy ~ C(pay_rec)')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Out[123]:
In [124]:
formula = ('boy ~ mager9')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Out[124]:
In [125]:
formula = ('boy ~ youngm + oldm')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Out[125]:
In [126]:
formula = ('boy ~ youngf + oldf')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()
Out[126]:
In [ ]:
In [ ]: