Does Trivers-Willard apply to people?

This notebook contains a "one-day paper", my attempt to pose a research question, answer it, and publish the results in one work day.

MIT License: https://opensource.org/licenses/MIT



In [1]:

    
from __future__ import print_function, division

import thinkstats2
import thinkplot

import pandas as pd
import numpy as np

import statsmodels.formula.api as smf

%matplotlib inline

Trivers-Willard

According to Wikipedia, the Trivers-Willard hypothesis:

"...suggests that female mammals are able to adjust offspring sex ratio in response to their maternal condition. For example, it may predict greater parental investment in males by parents in 'good conditions' and greater investment in females by parents in 'poor conditions' (relative to parents in good condition)."

For humans, the hypothesis suggests that people with relatively high social status might be more likely to have boys. Some studies have reported evidence for this hypothesis, but based on my very casual survey, the evidence is not persuasive.

To test whether the T-W hypothesis holds up in humans, I downloaded birth data for the nearly 4 million babies born in the U.S. in 2013.

I selected variables that seemed likely to be related to social status and used logistic regression to identify variables associated with sex ratio.

Summary of results

Running regression with one variable at a time, many of the variables have a statistically significant effect on sex ratio, with the sign of the effect generally in the direction predicted by T-W.
However, many of the variables are also correlated with race. If we control for either the mother's race or the father's race, or both, most other variables have no additional predictive power.
Contrary to other reports, the age of the parents seems to have no predictive power.
Strangely, the variable that shows the strongest and most consistent relationship with sex ratio is the number of prenatal visits. Although it seems obvious that prenatal visits are a proxy for quality of health care and general socioeconomic status, the sign of the effect is the opposite of what T-W predicts; that is, more prenatal visits strongly predict a lower sex ratio (more girls).

Following convention, I report sex ratio in terms of boys per 100 girls. The overall sex ratio at birth is about 105; that is, 105 boys are born for every 100 girls.
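
Since the rest of the notebook moves back and forth between sex ratio and the probability of a boy, it may help to see the conversion spelled out. This is only a minimal sketch; the function names `ratio_to_prob` and `prob_to_ratio` are illustrative and do not appear elsewhere in the notebook.

```python
def ratio_to_prob(ratio):
    """Convert a sex ratio (boys per 100 girls) to the probability of a boy."""
    return ratio / (ratio + 100.0)


def prob_to_ratio(prob):
    """Convert the probability of a boy to a sex ratio (boys per 100 girls)."""
    return 100.0 * prob / (1.0 - prob)


p = ratio_to_prob(105)
print(round(p, 4))                  # probability of a boy at a ratio of 105
print(round(prob_to_ratio(p), 1))   # round trip back to the ratio
```

A ratio of 105 corresponds to a probability of a boy of 105/205, or a little over 51%.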

Data cleaning

Here's how I loaded the data:



In [2]:

    
names = ['year', 'mager9', 'restatus', 'mbrace', 'mhisp_r',
        'mar_p', 'dmar', 'meduc', 'fagerrec11', 'fbrace', 'fhisp_r', 'feduc', 
        'lbo_rec', 'previs_rec', 'wic', 'height', 'bmi_r', 'pay_rec', 'sex']
colspecs = [(15, 18),
            (93, 93),
            (138, 138),
            (143, 143),
            (148, 148),
            (152, 152),
            (153, 153),
            (155, 155),
            (186, 187),
            (191, 191),
            (195, 195),
            (197, 197),
            (212, 212),
            (272, 273),
            (281, 281),
            (555, 556),
            (533, 533),
            (413, 413),
            (436, 436),
           ]

colspecs = [(start-1, end) for start, end in colspecs]
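
The last line of the cell above shifts each start position by one. A plausible reading is that the positions were copied as 1-based, inclusive ranges, while `pd.read_fwf` expects 0-based, half-open intervals. The sketch below illustrates that conversion with plain string slicing; the helper name `to_zero_based` and the 20-character record are made up for the example.

```python
def to_zero_based(specs):
    """Convert 1-based inclusive (start, end) pairs to 0-based half-open pairs."""
    return [(start - 1, end) for start, end in specs]


# Hypothetical 20-character record, used only for illustration.
record = "ABCDEFGHIJKLMN2013XY"

# Positions 15-18 (1-based, inclusive) hold the four-digit year.
spec = to_zero_based([(15, 18)])[0]
year_field = record[spec[0]:spec[1]]
print(spec, year_field)
```

A single-character field such as `(93, 93)` becomes `(92, 93)`, which selects exactly one character.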



In [3]:

    
df = None



In [4]:

    
filename = 'Nat2013PublicUS.r20141016.gz'
#df = pd.read_fwf(filename, compression='gzip', header=None, names=names, colspecs=colspecs)
#df.head()



In [5]:

    
# store the dataframe for faster loading

#store = pd.HDFStore('store.h5')
#store['births2013'] = df
#store.close()



In [6]:

    
# load the dataframe

store = pd.HDFStore('store.h5')
df = store['births2013']
store.close()



In [7]:

    
def series_to_ratio(series):
    """Takes a boolean series and computes sex ratio.
    """
    boys = np.mean(series)
    return np.round(100 * boys / (1-boys)).astype(int)

I have to recode sex as 0 or 1 to make logit happy.



In [8]:

    
df['boy'] = (df.sex=='M').astype(int)
df.boy.value_counts().sort_index()









    Out[8]:





0    1923390
1    2017374
Name: boy, dtype: int64

All births are from 2013.



In [9]:

    
df.year.value_counts().sort_index()









    Out[9]:





2013    3940764
Name: year, dtype: int64

Mother's age:



In [10]:

    
df.mager9.value_counts().sort_index()









    Out[10]:





1       3099
2     273598
3     898163
4    1123370
5    1039480
6     485088
7     109738
8       7539
9        689
Name: mager9, dtype: int64



In [11]:

    
var = 'mager9'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)
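
The group-and-aggregate pattern used here (and repeated for each variable below) can be pictured in plain Python. The following sketch is not the notebook's code: it swaps the pandas `groupby` for a dictionary of lists, and the records are invented purely for illustration.

```python
from collections import defaultdict


def sex_ratio(flags):
    """Boys per 100 girls for a sequence of 0/1 flags (1 = boy)."""
    boys = sum(flags)
    girls = len(flags) - boys
    return round(100 * boys / girls)


# Hypothetical (group, boy) records, invented purely for illustration.
records = [("a", 1), ("a", 1), ("a", 0), ("a", 0), ("a", 1),
           ("b", 0), ("b", 1), ("b", 0), ("b", 1)]

# Collect the 0/1 flags for each group, then reduce each list to a ratio.
groups = defaultdict(list)
for group, boy in records:
    groups[group].append(boy)

ratios = {group: sex_ratio(flags) for group, flags in groups.items()}
print(ratios)
```

Dividing boys by girls is equivalent to the `boys / (1-boys)` expression in `series_to_ratio`, where `boys` is the fraction of boys. Note that this sketch would fail for a group containing no girls.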



In [12]:

    
df.mager9.isnull().mean()









    Out[12]:





0.0



In [13]:

    
df['youngm'] = df.mager9<=2
df['oldm'] = df.mager9>=7
df.youngm.mean(), df.oldm.mean()









    Out[13]:





(0.070214049864442532, 0.029934804520138733)

Residence status (1=resident)



In [14]:

    
df.restatus.value_counts().sort_index()









    Out[14]:





1    2847673
2     999320
3      85188
4       8583
Name: restatus, dtype: int64



In [15]:

    
var = 'restatus'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Mother's race (1=White, 2=Black, 3=American Indian or Alaskan Native, 4=Asian or Pacific Islander)



In [16]:

    
df.mbrace.value_counts().sort_index()









    Out[16]:





1    2993686
2     635120
3      46011
4     265947
Name: mbrace, dtype: int64



In [17]:

    
var = 'mbrace'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Mother's Hispanic origin (0=Non-Hispanic)



In [18]:

    
df.mhisp_r.replace([9], np.nan, inplace=True)
df.mhisp_r.value_counts().sort_index()









    Out[18]:





0    3005012
1     552005
2      68313
3      18855
4     131436
5     137518
Name: mhisp_r, dtype: int64



In [19]:

    
def copy_null(df, oldvar, newvar):
    """Set newvar to NaN wherever oldvar is missing."""
    df.loc[df[oldvar].isnull(), newvar] = np.nan
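
The reason a helper like `copy_null` is useful: a comparison such as `x > 0` evaluates to False when `x` is NaN, so a derived flag would silently count unknown values as "no". The plain-Python sketch below mimics that behavior with `math.nan`; the values are hypothetical, and the notebook itself works on pandas columns instead.

```python
import math

# Hypothetical recoded values; math.nan stands in for a missing code.
values = [0, 3, math.nan, 1]

# A plain comparison silently turns the missing value into False.
naive = [v > 0 for v in values]

# Restoring the missing marker keeps unknowns out of later averages.
flags = [math.nan if math.isnan(v) else float(v > 0) for v in values]
known = [f for f in flags if not math.isnan(f)]

print(naive)
print(sum(naive) / len(naive), sum(known) / len(known))
```

In this toy case the naive flag gives a fraction of 1/2, while excluding the unknown value gives 2/3.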



In [20]:

    
df['mhisp'] = df.mhisp_r > 0
copy_null(df, 'mhisp_r', 'mhisp')
df.mhisp.isnull().mean(), df.mhisp.mean()









    Out[20]:





(0.007010062008280628, 0.23207123488329956)



In [21]:

    
var = 'mhisp'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Marital status (1=Married)



In [22]:

    
df.dmar.value_counts().sort_index()









    Out[22]:





1    2342660
2    1598104
Name: dmar, dtype: int64



In [23]:

    
var = 'dmar'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Paternity acknowledged, if unmarried (Y=yes, N=no, X=not applicable, U=unknown).

I recode X (not applicable because married) as Y (paternity acknowledged).



In [24]:

    
df.mar_p.replace(['U'], np.nan, inplace=True)
df.mar_p.replace(['X'], 'Y', inplace=True)
df.mar_p.value_counts().sort_index()









    Out[24]:





N     429652
Y    3127707
Name: mar_p, dtype: int64



In [25]:

    
var = 'mar_p'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Mother's education level



In [26]:

    
df.meduc.replace([9], np.nan, inplace=True)
df.meduc.value_counts().sort_index()









    Out[26]:





1    136701
2    421293
3    879956
4    753056
5    280660
6    669170
7    297054
8     84707
Name: meduc, dtype: int64



In [27]:

    
var = 'meduc'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [28]:

    
df['lowed'] = df.meduc <= 2
copy_null(df, 'meduc', 'lowed')
df.lowed.isnull().mean(), df.lowed.mean()









    Out[28]:





(0.1061131800838619, 0.15840415466202917)

Father's age, in 10 ranges



In [29]:

    
df.fagerrec11.replace([11], np.nan, inplace=True)
df.fagerrec11.value_counts().sort_index()









    Out[29]:





1        368
2      92982
3     510562
4     862475
5     993126
6     603109
7     257493
8      84476
9      27431
10     11627
Name: fagerrec11, dtype: int64



In [30]:

    
var = 'fagerrec11'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [31]:

    
df['youngf'] = df.fagerrec11<=2
copy_null(df, 'fagerrec11', 'youngf')
df.youngf.isnull().mean(), df.youngf.mean()









    Out[31]:





(0.12614685883244975, 0.027107873073010633)



In [32]:

    
df['oldf'] = df.fagerrec11>=8
copy_null(df, 'fagerrec11', 'oldf')
df.oldf.isnull().mean(), df.oldf.mean()









    Out[32]:





(0.12614685883244975, 0.03587299402465234)

Father's race



In [33]:

    
df.fbrace.replace([9], np.nan, inplace=True)
df.fbrace.value_counts().sort_index()









    Out[33]:





1    2466993
2     476535
3      35143
4     222529
Name: fbrace, dtype: int64



In [34]:

    
var = 'fbrace'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Father's Hispanic origin (0=non-hispanic, other values indicate country of origin)



In [35]:

    
df.fhisp_r.replace([9], np.nan, inplace=True)
df.fhisp_r.value_counts().sort_index()









    Out[35]:





0    2604861
1     491433
2      58210
3      17820
4     104385
5     120142
Name: fhisp_r, dtype: int64



In [36]:

    
df['fhisp'] = df.fhisp_r > 0
copy_null(df, 'fhisp_r', 'fhisp')
df.fhisp.isnull().mean(), df.fhisp.mean()









    Out[36]:





(0.13802222107185308, 0.2331541772070662)



In [37]:

    
var = 'fhisp'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Father's education level



In [38]:

    
df.feduc.replace([9], np.nan, inplace=True)
df.feduc.value_counts().sort_index()









    Out[38]:





1    136789
2    326194
3    879276
4    591364
5    212259
6    564045
7    220241
8     99780
Name: feduc, dtype: int64



In [39]:

    
var = 'feduc'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Live birth order.



In [40]:

    
df.lbo_rec.replace([9], np.nan, inplace=True)
df.lbo_rec.value_counts().sort_index()









    Out[40]:





1    1550114
2    1246847
3     654946
4     276936
5     108168
6      44188
7      20301
8      20732
Name: lbo_rec, dtype: int64



In [41]:

    
var = 'lbo_rec'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [42]:

    
df['highbo'] = df.lbo_rec >= 5
copy_null(df, 'lbo_rec', 'highbo')
df.highbo.isnull().mean(), df.highbo.mean()









    Out[42]:





(0.0047026414167405106, 0.04930585442166603)

Number of prenatal visits, in 11 ranges



In [43]:

    
df.previs_rec.replace([12], np.nan, inplace=True)
df.previs_rec.value_counts().sort_index()









    Out[43]:





1      55475
2      41960
3      92649
4     191376
5     356722
6     806697
7     992507
8     677219
9     385084
10     98410
11    124040
Name: previs_rec, dtype: int64



In [44]:

    
df.previs_rec.mean()
df['previs'] = df.previs_rec - 7
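
The mean computed in this cell is not displayed, but it can be reconstructed from the category counts printed above. The sketch below does that; the interpretation that 7 was chosen as the offset because it is both the most common category and close to the mean is an assumption, not something the notebook states.

```python
# Category counts copied from the previs_rec output above.
counts = {1: 55475, 2: 41960, 3: 92649, 4: 191376, 5: 356722, 6: 806697,
          7: 992507, 8: 677219, 9: 385084, 10: 98410, 11: 124040}

total = sum(counts.values())
mean_code = sum(code * n for code, n in counts.items()) / total
modal_code = max(counts, key=counts.get)

print(total, round(mean_code, 2), modal_code)
```

These are means of category codes, not of the number of visits. Centering the variable this way changes the intercept of a regression (it then describes a birth in category 7) but not the slope.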



In [45]:

    
var = 'previs'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [46]:

    
df['no_previs'] = df.previs_rec <= 1
copy_null(df, 'previs_rec', 'no_previs')
df.no_previs.isnull().mean(), df.no_previs.mean()









    Out[46]:





(0.030102030976734459, 0.014514124159273119)

Whether the mother received WIC food assistance during the pregnancy (Y=yes, N=no)



In [47]:

    
df.wic.replace(['U'], np.nan, inplace=True)
df.wic.value_counts().sort_index()









    Out[47]:





N    1906790
Y    1568093
Name: wic, dtype: int64



In [48]:

    
var = 'wic'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Mother's height in inches



In [49]:

    
df.height.replace([99], np.nan, inplace=True)
df.height.value_counts().sort_index()









    Out[49]:





30         8
31         3
32         2
33         1
36         9
37         4
38         6
39        16
40         8
41        14
42         6
43         7
44         2
45         7
46         7
47        20
48       759
49       492
50       346
51       362
52       450
53      1360
54      1301
55      2532
56      6411
57     17089
58     19417
59     74060
60    193506
61    246137
62    436950
63    448774
64    512975
65    415945
66    397583
67    308439
68    177404
69    117775
70     57814
71     30853
72     14269
73      4872
74      2369
75       955
76       540
77       626
78      1000
Name: height, dtype: int64



In [50]:

    
df['mshort'] = df.height<60
copy_null(df, 'height', 'mshort')
df.mshort.isnull().mean(), df.mshort.mean()









    Out[50]:





(0.11350058009056112, 0.03569472890251425)



In [51]:

    
df['mtall'] = df.height>=70
copy_null(df, 'height', 'mtall')
df.mtall.isnull().mean(), df.mtall.mean()









    Out[51]:





(0.11350058009056112, 0.032431225552707395)



In [52]:

    
var = 'mshort'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [53]:

    
var = 'mtall'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Mother's BMI in 6 ranges



In [54]:

    
df.bmi_r.replace([9], np.nan, inplace=True)
df.bmi_r.value_counts().sort_index()









    Out[54]:





1     128922
2    1578486
3     869407
4     458241
5     217090
6     149572
Name: bmi_r, dtype: int64



In [55]:

    
var = 'bmi_r'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [56]:

    
df['obese'] = df.bmi_r >= 4
copy_null(df, 'bmi_r', 'obese')
df.obese.isnull().mean(), df.obese.mean()









    Out[56]:





(0.13678718136889192, 0.2424959976106191)

Payment method (1=Medicaid, 2=Private insurance, 3=Self pay, 4=Other)



In [57]:

    
df.pay_rec.replace([9], np.nan, inplace=True)
df.pay_rec.value_counts().sort_index()









    Out[57]:





1    1530635
2    1663943
3     153560
4     171457
Name: pay_rec, dtype: int64



In [58]:

    
var = 'pay_rec'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Sex of baby



In [59]:

    
df.sex.value_counts().sort_index()









    Out[59]:





F    1923390
M    2017374
Name: sex, dtype: int64

Regression models

Here are some functions I'll use to interpret the results of logistic regression:



In [60]:

    
def logodds_to_ratio(logodds):
    """Convert from log odds to sex ratio (boys per 100 girls)."""
    odds = np.exp(logodds)
    return 100 * odds

def summarize(results):
    """Summarize parameters in terms of birth ratio."""
    inter_or = results.params['Intercept']
    inter_rat = logodds_to_ratio(inter_or)
    
    for value, lor in results.params.items():
        if value=='Intercept':
            continue
        
        rat = logodds_to_ratio(inter_or + lor)
        code = '*' if results.pvalues[value] < 0.05 else ' '
        
        print('%-20s   %0.1f   %0.1f' % (value, inter_rat, rat), code)
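
To see why multiplying the exponentiated log odds by 100 yields a sex ratio: the odds of a boy are p / (1 - p), which equals boys divided by girls. The sketch below reproduces the two numbers that `summarize` prints for the `boy ~ mager9` model further down, using the rounded coefficients from that model's summary table and `math.exp` in place of `np.exp`.

```python
import math


def logodds_to_ratio(logodds):
    """Convert log odds of a boy to a sex ratio (boys per 100 girls)."""
    return 100 * math.exp(logodds)


# Coefficients as printed (rounded) for the `boy ~ mager9` model below.
intercept = 0.0504
slope = -0.0006

# Ratio when the predictor is zero, and after adding one coefficient.
baseline = logodds_to_ratio(intercept)
shifted = logodds_to_ratio(intercept + slope)
print('%0.1f   %0.1f' % (baseline, shifted))
```

The first number is the ratio implied by the intercept alone; the second is the ratio after a one-unit increase in the predictor (or, for a categorical term, for the indicated category).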

Now I'll run models with each variable, one at a time.

Mother's age seems to have no predictive value:



In [61]:

    
model = smf.logit('boy ~ mager9', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692863
         Iterations 3
mager9                 105.2   105.1  






    Out[61]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3940764  


  Model:                Logit         Df Residuals:         3940762  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.110e-07 


  Time:               16:24:34        Log-Likelihood:     -2.7304e+06


  converged:            True          LL-Null:            -2.7304e+06


                                    LLR p-value:          0.4363   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0504      0.004     13.906   0.000      0.043     0.058


  mager9        -0.0006      0.001     -0.778   0.436     -0.002     0.001
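
The `Pseudo R-squ.` and `LLR p-value` entries in these summaries are linked. Assuming the pseudo R-squared is McFadden's (1 minus the ratio of model to null log-likelihood), the likelihood-ratio statistic can be recovered from the printed values and compared with a chi-square distribution. The sketch below does this for the one-parameter model above; because the inputs are rounded, the result should only approximately match the printed 0.4363.

```python
import math

# Values as printed (rounded) in the `boy ~ mager9` summary above.
pseudo_r2 = 1.110e-07
ll_null = -2.7304e+06

# McFadden: pseudo_r2 = 1 - ll_model / ll_null, so
# LLR = 2 * (ll_model - ll_null) = -2 * pseudo_r2 * ll_null.
llr = -2 * pseudo_r2 * ll_null

# Survival function of a chi-square with 1 degree of freedom.
p_value = math.erfc(math.sqrt(llr / 2))

print(round(llr, 3), round(p_value, 3))
```

The `erfc` shortcut is valid only for one degree of freedom; models with more terms would need a general chi-square survival function.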

The estimated ratio for young mothers is higher, and the ratio for older mothers is lower, but neither difference is statistically significant.



In [62]:

    
model = smf.logit('boy ~ youngm + oldm', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692862
         Iterations 3
youngm[T.True]         104.9   105.4  
oldm[T.True]           104.9   104.1  






    Out[62]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3940764  


  Model:                Logit         Df Residuals:         3940761  


  Method:                MLE          Df Model:                  2   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       6.086e-07 


  Time:               16:24:39        Log-Likelihood:     -2.7304e+06


  converged:            True          LL-Null:            -2.7304e+06


                                    LLR p-value:          0.1898   




                    coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept           0.0476      0.001     44.817   0.000      0.046     0.050


  youngm[T.True]      0.0047      0.004      1.189   0.234     -0.003     0.012


  oldm[T.True]       -0.0078      0.006     -1.323   0.186     -0.019     0.004

Residence status has little predictive value, although category 2 shows a small, statistically significant difference.



In [63]:

    
model = smf.logit('boy ~ C(restatus)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692861
         Iterations 3
C(restatus)[T.2]       104.7   105.5 *
C(restatus)[T.3]       104.7   105.3  
C(restatus)[T.4]       104.7   106.2  






    Out[63]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3940764  


  Model:                Logit         Df Residuals:         3940760  


  Method:                MLE          Df Model:                  3   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.140e-06 


  Time:               16:25:07        Log-Likelihood:     -2.7304e+06


  converged:            True          LL-Null:            -2.7304e+06


                                    LLR p-value:         0.008550  




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0456      0.001     38.456   0.000      0.043     0.048


  C(restatus)[T.2]      0.0077      0.002      3.329   0.001      0.003     0.012


  C(restatus)[T.3]      0.0057      0.007      0.819   0.413     -0.008     0.019


  C(restatus)[T.4]      0.0143      0.022      0.662   0.508     -0.028     0.057
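
The `P>|z|` column, which drives the `*` markers printed by `summarize`, follows from the `z` column under a standard normal approximation. As a sketch, the two-sided p-values for the three residence-status terms can be recomputed from the printed z statistics; the helper name `two_sided_p` is illustrative.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


# z statistics as printed in the residence-status summary above.
for name, z in [('C(restatus)[T.2]', 3.329),
                ('C(restatus)[T.3]', 0.819),
                ('C(restatus)[T.4]', 0.662)]:
    print('%-20s %0.3f' % (name, two_sided_p(z)))
```

Only the first term falls below the 0.05 threshold, which is why it alone is starred.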

Mother's race seems to have predictive value. Relative to white mothers, black mothers have more girls and Asian mothers have more boys; the difference for American Indian or Alaskan Native mothers is not statistically significant.



In [64]:

    
model = smf.logit('boy ~ C(mbrace)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692855
         Iterations 3
C(mbrace)[T.2]         105.1   103.2 *
C(mbrace)[T.3]         105.1   105.5  
C(mbrace)[T.4]         105.1   106.5 *






    Out[64]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3940764  


  Model:                Logit         Df Residuals:         3940760  


  Method:                MLE          Df Model:                  3   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.126e-05 


  Time:               16:25:34        Log-Likelihood:     -2.7304e+06


  converged:            True          LL-Null:            -2.7304e+06


                                    LLR p-value:         2.843e-13 




                    coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept           0.0497      0.001     43.014   0.000      0.047     0.052


  C(mbrace)[T.2]     -0.0184      0.003     -6.659   0.000     -0.024    -0.013


  C(mbrace)[T.3]      0.0039      0.009      0.412   0.680     -0.015     0.022


  C(mbrace)[T.4]      0.0132      0.004      3.266   0.001      0.005     0.021

Hispanic mothers have more girls.



In [65]:

    
model = smf.logit('boy ~ mhisp', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692863
         Iterations 3
mhisp                  105.1   104.3 *






    Out[65]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3913139  


  Model:                Logit         Df Residuals:         3913137  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.830e-06 


  Time:               16:25:39        Log-Likelihood:     -2.7113e+06


  converged:            True          LL-Null:            -2.7113e+06


                                    LLR p-value:         0.001634  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0494      0.001     42.780   0.000      0.047     0.052


  mhisp         -0.0075      0.002     -3.150   0.002     -0.012    -0.003
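
Note that `No. Observations` in this summary is smaller than the full dataset. The statsmodels formula interface drops rows with missing values by default, so the shortfall should match the fraction of missing `mhisp` values computed during data cleaning. A quick check using the printed counts:

```python
# Counts as printed in the notebook output.
total_births = 3940764      # rows in the full dataset
rows_in_model = 3913139     # "No. Observations" for `boy ~ mhisp`

dropped = total_births - rows_in_model
fraction_dropped = dropped / total_births

print(dropped, round(fraction_dropped, 6))
```

The fraction agrees with the 0.00701 null rate reported for `mhisp` earlier, so each single-variable model is fit on a slightly different subset of births.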

If the mother is married, or unmarried with paternity acknowledged, the sex ratio is higher (more boys).



In [66]:

    
model = smf.logit('boy ~ C(mar_p)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692857
         Iterations 3
C(mar_p)[T.Y]          102.9   105.2 *






    Out[66]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3557359  


  Model:                Logit         Df Residuals:         3557357  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       8.735e-06 


  Time:               16:26:06        Log-Likelihood:     -2.4647e+06


  converged:            True          LL-Null:            -2.4648e+06


                                    LLR p-value:         5.309e-11 




                   coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept          0.0289      0.003      9.480   0.000      0.023     0.035


  C(mar_p)[T.Y]      0.0214      0.003      6.562   0.000      0.015     0.028

Being unmarried predicts more girls.



In [67]:

    
model = smf.logit('boy ~ C(dmar)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692860
         Iterations 3
C(dmar)[T.2]           105.3   104.3 *






    Out[67]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3940764  


  Model:                Logit         Df Residuals:         3940762  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       4.091e-06 


  Time:               16:26:33        Log-Likelihood:     -2.7304e+06


  converged:            True          LL-Null:            -2.7304e+06


                                    LLR p-value:         2.286e-06 




                  coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept         0.0516      0.001     39.508   0.000      0.049     0.054


  C(dmar)[T.2]     -0.0097      0.002     -4.726   0.000     -0.014    -0.006

Each level of mother's education predicts a small increase in the probability of a boy.



In [68]:

    
model = smf.logit('boy ~ meduc', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692863
         Iterations 3
meduc                  103.9   104.1 *






    Out[68]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3522597  


  Model:                Logit         Df Residuals:         3522595  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.858e-06 


  Time:               16:26:38        Log-Likelihood:     -2.4407e+06


  converged:            True          LL-Null:            -2.4407e+06


                                    LLR p-value:         0.0001875 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0379      0.003     13.590   0.000      0.032     0.043


  meduc          0.0023      0.001      3.735   0.000      0.001     0.003



In [69]:

    
model = smf.logit('boy ~ lowed', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692863
         Iterations 3
lowed                  105.0   104.1 *






    Out[69]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3522597  


  Model:                Logit         Df Residuals:         3522595  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.737e-06 


  Time:               16:26:43        Log-Likelihood:     -2.4407e+06


  converged:            True          LL-Null:            -2.4407e+06


                                    LLR p-value:         0.003591  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0489      0.001     42.080   0.000      0.047     0.051


  lowed         -0.0085      0.003     -2.912   0.004     -0.014    -0.003

Older fathers are slightly more likely to have girls (but this apparent effect could be due to chance).



In [70]:

    
model = smf.logit('boy ~ fagerrec11', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692832
         Iterations 3
fagerrec11             106.0   105.8 *






    Out[70]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3443649  


  Model:                Logit         Df Residuals:         3443647  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.017e-06 


  Time:               16:26:47        Log-Likelihood:     -2.3859e+06


  converged:            True          LL-Null:            -2.3859e+06


                                    LLR p-value:          0.02764  




                coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept       0.0583      0.004     15.100   0.000      0.051     0.066


  fagerrec11     -0.0017      0.001     -2.202   0.028     -0.003    -0.000



In [71]:

    
model = smf.logit('boy ~ youngf + oldf', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692831
         Iterations 3
youngf                 105.2   106.5  
oldf                   105.2   103.5 *






    Out[71]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3443649  


  Model:                Logit         Df Residuals:         3443646  


  Method:                MLE          Df Model:                  2   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.423e-06 


  Time:               16:26:51        Log-Likelihood:     -2.3859e+06


  converged:            True          LL-Null:            -2.3859e+06


                                    LLR p-value:         0.003089  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0504      0.001     45.235   0.000      0.048     0.053


  youngf         0.0127      0.007      1.908   0.056     -0.000     0.026


  oldf          -0.0160      0.006     -2.752   0.006     -0.027    -0.005

Predictions based on father's race are similar to those based on mother's race: more girls for black fathers and more boys for Asian fathers, with no statistically significant difference for American Indian or Alaskan Native fathers.



In [72]:

    
model = smf.logit('boy ~ C(fbrace)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692818
         Iterations 3
C(fbrace)[T.2.0]       105.4   103.5 *
C(fbrace)[T.3.0]       105.4   105.2  
C(fbrace)[T.4.0]       105.4   107.0 *






    Out[72]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3201200  


  Model:                Logit         Df Residuals:         3201196  


  Method:                MLE          Df Model:                  3   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.082e-05 


  Time:               16:27:18        Log-Likelihood:     -2.2179e+06


  converged:            True          LL-Null:            -2.2179e+06


                                    LLR p-value:         2.121e-10 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0523      0.001     41.046   0.000      0.050     0.055


  C(fbrace)[T.2.0]     -0.0176      0.003     -5.566   0.000     -0.024    -0.011


  C(fbrace)[T.3.0]     -0.0012      0.011     -0.114   0.909     -0.022     0.020


  C(fbrace)[T.4.0]      0.0153      0.004      3.455   0.001      0.007     0.024

If the father is Hispanic, that predicts more girls.



In [73]:

    
model = smf.logit('boy ~ fhisp', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692829
         Iterations 3
fhisp                  105.4   104.3 *






    Out[73]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3396851  


  Model:                Logit         Df Residuals:         3396849  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       3.902e-06 


  Time:               16:27:24        Log-Likelihood:     -2.3534e+06


  converged:            True          LL-Null:            -2.3534e+06


                                    LLR p-value:         1.824e-05 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0528      0.001     42.594   0.000      0.050     0.055


  fhisp         -0.0110      0.003     -4.286   0.000     -0.016    -0.006

Father's education level might predict more boys, but the apparent effect could be due to chance.



In [74]:

    
model = smf.logit('boy ~ feduc', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692828
         Iterations 3
feduc                  104.5   104.6 *






    Out[74]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3029948  


  Model:                Logit         Df Residuals:         3029946  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.471e-06 


  Time:               16:27:28        Log-Likelihood:     -2.0992e+06


  converged:            True          LL-Null:            -2.0992e+06


                                    LLR p-value:          0.01294  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0437      0.003     14.857   0.000      0.038     0.049


  feduc          0.0016      0.001      2.485   0.013      0.000     0.003

Babies with high birth order are slightly more likely to be girls.



In [75]:

    
model = smf.logit('boy ~ lbo_rec', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692860
         Iterations 3
lbo_rec                105.7   105.3 *






    Out[75]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3922232  


  Model:                Logit         Df Residuals:         3922230  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       4.349e-06 


  Time:               16:27:35        Log-Likelihood:     -2.7176e+06


  converged:            True          LL-Null:            -2.7176e+06


                                    LLR p-value:         1.163e-06 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0559      0.002     28.436   0.000      0.052     0.060


  lbo_rec       -0.0039      0.001     -4.862   0.000     -0.005    -0.002



In [76]:

    
model = smf.logit('boy ~ highbo', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692863
         Iterations 3
highbo                 104.9   104.0  






    Out[76]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3922232  


  Model:                Logit         Df Residuals:         3922230  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       6.688e-07 


  Time:               16:27:41        Log-Likelihood:     -2.7176e+06


  converged:            True          LL-Null:            -2.7176e+06


                                    LLR p-value:          0.05657  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0481      0.001     46.434   0.000      0.046     0.050


  highbo        -0.0089      0.005     -1.907   0.057     -0.018     0.000
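
The `P>|z|` column follows from the `z` column: it is the two-sided tail probability of a standard normal distribution. Here is a minimal sketch using only the standard library. The 0.05 cutoff is an assumption about what the `*` printed by `summarize` marks, inferred from the output; the function name `two_sided_p` is hypothetical.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# z statistics copied from the tables for feduc and highbo.
for name, z in [('feduc', 2.485), ('highbo', -1.907)]:
    p = two_sided_p(z)
    label = 'significant' if p < 0.05 else 'not significant'
    print(name, round(p, 3), label)
```

This reproduces the tabulated values of about 0.013 for `feduc` and 0.057 for `highbo`, which is why the first gets an asterisk and the second does not.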

Strangely, prenatal visits are associated with an increased probability of girls.



In [77]:

    
model = smf.logit('boy ~ previs', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692829
         Iterations 3
previs                 104.6   103.6 *






    Out[77]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3822139  


  Model:                Logit         Df Residuals:         3822137  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       5.759e-05 


  Time:               16:27:46        Log-Likelihood:     -2.6481e+06


  converged:            True          LL-Null:            -2.6482e+06


                                    LLR p-value:         2.660e-68 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0453      0.001     44.004   0.000      0.043     0.047


  previs        -0.0095      0.001    -17.463   0.000     -0.011    -0.008

The effect seems to be non-linear at zero, so I'm adding a boolean for no prenatal visits.



In [78]:

    
model = smf.logit('boy ~ no_previs + previs', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692828
         Iterations 3
no_previs              104.7   102.4 *
previs                 104.7   103.6 *






    Out[78]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3822139  


  Model:                Logit         Df Residuals:         3822136  


  Method:                MLE          Df Model:                  2   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       5.862e-05 


  Time:               16:27:52        Log-Likelihood:     -2.6481e+06


  converged:            True          LL-Null:            -2.6482e+06


                                    LLR p-value:         3.790e-68 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0455      0.001     44.041   0.000      0.043     0.048


  no_previs     -0.0216      0.009     -2.339   0.019     -0.040    -0.004


  previs        -0.0101      0.001    -17.064   0.000     -0.011    -0.009
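
To see what the boolean contributes, here is a sketch of the fitted model as a prediction function. It assumes `no_previs` is 1 exactly when `previs` is 0, which is how the text describes it, and it uses `previs` in whatever units the notebook codes it; the function name `predicted_sex_ratio` is hypothetical.

```python
import math

# Coefficients copied from the table above for boy ~ no_previs + previs.
INTERCEPT = 0.0455
B_NO_PREVIS = -0.0216
B_PREVIS = -0.0101

def predicted_sex_ratio(previs):
    """Predicted boys per 100 girls for a given value of previs."""
    no_previs = 1 if previs == 0 else 0   # assumed definition of the boolean
    log_odds = INTERCEPT + B_NO_PREVIS * no_previs + B_PREVIS * previs
    return 100 * math.exp(log_odds)

for previs in [0, 1, 2, 3]:
    print(previs, round(predicted_sex_ratio(previs), 1))
```

The values at 0 and 1 come out to 102.4 and 103.6, matching the second column printed by `summarize`. The linear term alone would put zero visits at the intercept, about 104.7, so the boolean pulls that one point down.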

If the mother receives WIC (a nutrition assistance program for women, infants, and children), she is more likely to have a girl.



In [79]:

    
model = smf.logit('boy ~ wic', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692864
         Iterations 3
wic[T.Y]               105.2   104.4 *






    Out[79]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3474883  


  Model:                Logit         Df Residuals:         3474881  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.939e-06 


  Time:               16:28:18        Log-Likelihood:     -2.4076e+06


  converged:            True          LL-Null:            -2.4076e+06


                                    LLR p-value:         0.0001686 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0511      0.001     35.284   0.000      0.048     0.054


  wic[T.Y]      -0.0081      0.002     -3.762   0.000     -0.012    -0.004

Mother's height seems to have no predictive value.



In [80]:

    
model = smf.logit('boy ~ height', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692865
         Iterations 3
height                 102.6   102.7  






    Out[80]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3493485  


  Model:                Logit         Df Residuals:         3493483  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.650e-07 


  Time:               16:28:24        Log-Likelihood:     -2.4205e+06


  converged:            True          LL-Null:            -2.4205e+06


                                    LLR p-value:          0.3714   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0260      0.024      1.079   0.280     -0.021     0.073


  height         0.0003      0.000      0.894   0.371     -0.000     0.001



In [81]:

    
model = smf.logit('boy ~ mtall + mshort', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692864
         Iterations 3
mtall                  104.9   104.8  
mshort                 104.9   103.6 *






    Out[81]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3493485  


  Model:                Logit         Df Residuals:         3493482  


  Method:                MLE          Df Model:                  2   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       9.267e-07 


  Time:               16:28:30        Log-Likelihood:     -2.4205e+06


  converged:            True          LL-Null:            -2.4205e+06


                                    LLR p-value:          0.1061   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0480      0.001     43.296   0.000      0.046     0.050


  mtall         -0.0015      0.006     -0.242   0.809     -0.013     0.010


  mshort        -0.0122      0.006     -2.111   0.035     -0.024    -0.001

Mothers with higher BMI are more likely to have girls.



In [82]:

    
model = smf.logit('boy ~ bmi_r', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692864
         Iterations 3
bmi_r                  106.0   105.6 *






    Out[82]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3401718  


  Model:                Logit         Df Residuals:         3401716  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       3.511e-06 


  Time:               16:28:36        Log-Likelihood:     -2.3569e+06


  converged:            True          LL-Null:            -2.3569e+06


                                    LLR p-value:         4.736e-05 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0581      0.003     20.400   0.000      0.053     0.064


  bmi_r         -0.0038      0.001     -4.068   0.000     -0.006    -0.002



In [83]:

    
model = smf.logit('boy ~ obese', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692865
         Iterations 3
obese                  105.1   104.2 *






    Out[83]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3401718  


  Model:                Logit         Df Residuals:         3401716  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.997e-06 


  Time:               16:28:41        Log-Likelihood:     -2.3569e+06


  converged:            True          LL-Null:            -2.3569e+06


                                    LLR p-value:         0.002152  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0493      0.001     39.558   0.000      0.047     0.052


  obese         -0.0078      0.003     -3.068   0.002     -0.013    -0.003

If payment was made by Medicaid, the baby is more likely to be a girl. Private insurance, self-payment, and other payment methods are associated with more boys.



In [84]:

    
model = smf.logit('boy ~ C(pay_rec)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692861
         Iterations 3
C(pay_rec)[T.2.0]      104.4   105.1 *
C(pay_rec)[T.3.0]      104.4   106.4 *
C(pay_rec)[T.4.0]      104.4   105.7 *






    Out[84]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3519595  


  Model:                Logit         Df Residuals:         3519591  


  Method:                MLE          Df Model:                  3   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       4.468e-06 


  Time:               16:29:09        Log-Likelihood:     -2.4386e+06


  converged:            True          LL-Null:            -2.4386e+06


                                    LLR p-value:         7.204e-05 




                       coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept              0.0428      0.002     26.465   0.000      0.040     0.046


  C(pay_rec)[T.2.0]      0.0070      0.002      3.116   0.002      0.003     0.011


  C(pay_rec)[T.3.0]      0.0195      0.005      3.633   0.000      0.009     0.030


  C(pay_rec)[T.4.0]      0.0128      0.005      2.518   0.012      0.003     0.023
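
With a categorical variable, each coefficient is an offset from the reference level, which is absorbed into the intercept. The sketch below turns the table into one sex ratio per payment category. The labels are an assumption: they take codes 1 through 4 of `pay_rec` to mean Medicaid, private insurance, self-pay, and other, in the order the text lists them.

```python
import math

# Coefficients copied from the table above for boy ~ C(pay_rec).
intercept = 0.0428
offsets = {
    'Medicaid (reference)': 0.0,     # assumed to be pay_rec == 1
    'Private insurance': 0.0070,     # assumed to be pay_rec == 2
    'Self-pay': 0.0195,              # assumed to be pay_rec == 3
    'Other': 0.0128,                 # assumed to be pay_rec == 4
}

# Boys per 100 girls for each category.
sex_ratios = {label: 100 * math.exp(intercept + offset)
              for label, offset in offsets.items()}

for label, ratio in sex_ratios.items():
    print('%-22s %.1f' % (label, ratio))
```

The four values, 104.4, 105.1, 106.4, and 105.7, match the numbers printed by `summarize` above the table.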

Adding controls

However, none of the previous results should be taken too seriously. We only tested one variable at a time, and many of these apparent effects disappear when we add control variables.

In particular, if we control for father's race and Hispanic origin, the mother's race has no additional predictive value.



In [85]:

    
formula = ('boy ~ C(fbrace) + fhisp + C(mbrace) + mhisp')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692815
         Iterations 3
C(fbrace)[T.2.0]       105.7   103.1 *
C(fbrace)[T.3.0]       105.7   104.5  
C(fbrace)[T.4.0]       105.7   106.8  
C(mbrace)[T.2]         105.7   106.3  
C(mbrace)[T.3]         105.7   107.7  
C(mbrace)[T.4]         105.7   106.0  
fhisp                  105.7   104.7 *
mhisp                  105.7   105.2  






    Out[85]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3178957  


  Model:                Logit         Df Residuals:         3178948  


  Method:                MLE          Df Model:                  8   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.589e-05 


  Time:               16:30:01        Log-Likelihood:     -2.2024e+06


  converged:            True          LL-Null:            -2.2025e+06


                                    LLR p-value:         4.910e-12 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0552      0.001     37.128   0.000      0.052     0.058


  C(fbrace)[T.2.0]     -0.0244      0.006     -4.319   0.000     -0.036    -0.013


  C(fbrace)[T.3.0]     -0.0112      0.012     -0.903   0.366     -0.036     0.013


  C(fbrace)[T.4.0]      0.0101      0.007      1.382   0.167     -0.004     0.024


  C(mbrace)[T.2]        0.0058      0.006      0.968   0.333     -0.006     0.018


  C(mbrace)[T.3]        0.0187      0.013      1.434   0.152     -0.007     0.044


  C(mbrace)[T.4]        0.0027      0.007      0.383   0.702     -0.011     0.016


  fhisp                -0.0090      0.004     -2.042   0.041     -0.018    -0.000


  mhisp                -0.0047      0.004     -1.076   0.282     -0.013     0.004
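
One way an apparent effect can vanish under a control is confounding: a variable looks predictive only because it is unevenly distributed across groups that have different sex ratios. The sketch below uses invented counts, not the birth data, for two hypothetical groups `a` and `b` and a hypothetical binary variable `x`.

```python
# Hypothetical counts of (boys, girls), keyed by (group, x).
# These numbers are invented for illustration; they are not from the birth data.
counts = {
    ('a', 0): (5150, 4850),
    ('a', 1): (1030, 970),
    ('b', 0): (1015, 985),
    ('b', 1): (5075, 4925),
}

def sex_ratio(boys, girls):
    """Boys per 100 girls."""
    return 100 * boys / girls

def pooled_ratio(x):
    """Sex ratio for everyone with the given x, ignoring group."""
    boys = sum(b for (_, xv), (b, _) in counts.items() if xv == x)
    girls = sum(g for (_, xv), (_, g) in counts.items() if xv == x)
    return sex_ratio(boys, girls)

# Ignoring group, x appears to lower the sex ratio.
print(round(pooled_ratio(0), 1), round(pooled_ratio(1), 1))

# Within each group, x makes no difference.
for group in ['a', 'b']:
    print(group, [round(sex_ratio(*counts[group, x]), 1) for x in (0, 1)])
```

In this toy example `x` is common in the group with the lower sex ratio and rare in the other, so a one-variable comparison attributes the group difference to `x`. Adding the group as a control removes it.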

In fact, once we control for father's race and Hispanic origin, almost every other variable becomes statistically insignificant, including acknowledged paternity.



In [86]:

    
formula = ('boy ~ C(fbrace) + fhisp + mar_p')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692814
         Iterations 3
C(fbrace)[T.2.0]       106.0   103.9 *
C(fbrace)[T.3.0]       106.0   105.7  
C(fbrace)[T.4.0]       106.0   107.0  
mar_p[T.Y]             106.0   105.7  
fhisp                  106.0   104.6 *






    Out[86]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2865016  


  Model:                Logit         Df Residuals:         2865010  


  Method:                MLE          Df Model:                  5   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.324e-05 


  Time:               16:30:50        Log-Likelihood:     -1.9849e+06


  converged:            True          LL-Null:            -1.9850e+06


                                    LLR p-value:         4.153e-10 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0579      0.015      3.834   0.000      0.028     0.088


  C(fbrace)[T.2.0]     -0.0195      0.003     -5.734   0.000     -0.026    -0.013


  C(fbrace)[T.3.0]     -0.0027      0.012     -0.232   0.816     -0.026     0.020


  C(fbrace)[T.4.0]      0.0093      0.005      1.936   0.053     -0.000     0.019


  mar_p[T.Y]           -0.0023      0.015     -0.156   0.876     -0.032     0.027


  fhisp                -0.0128      0.003     -4.120   0.000     -0.019    -0.007

With the same controls, marital status is not statistically significant (p = 0.317).



In [87]:

    
formula = ('boy ~ C(fbrace) + fhisp + dmar')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692816
         Iterations 3
C(fbrace)[T.2.0]       105.3   103.2 *
C(fbrace)[T.3.0]       105.3   105.1  
C(fbrace)[T.4.0]       105.3   106.7 *
fhisp                  105.3   103.9 *
dmar                   105.3   105.6  






    Out[87]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3182886  


  Model:                Logit         Df Residuals:         3182880  


  Method:                MLE          Df Model:                  5   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.516e-05 


  Time:               16:31:20        Log-Likelihood:     -2.2052e+06


  converged:            True          LL-Null:            -2.2052e+06


                                    LLR p-value:         4.631e-13 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0521      0.003     15.120   0.000      0.045     0.059


  C(fbrace)[T.2.0]     -0.0210      0.003     -6.234   0.000     -0.028    -0.014


  C(fbrace)[T.3.0]     -0.0021      0.011     -0.190   0.849     -0.023     0.019


  C(fbrace)[T.4.0]      0.0125      0.004      2.794   0.005      0.004     0.021


  fhisp                -0.0133      0.003     -4.459   0.000     -0.019    -0.007


  dmar                  0.0026      0.003      1.001   0.317     -0.002     0.008

The effect of education disappears.



In [88]:

    
formula = ('boy ~ C(fbrace) + fhisp + lowed')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692813
         Iterations 3
C(fbrace)[T.2.0]       105.7   103.6 *
C(fbrace)[T.3.0]       105.7   105.4  
C(fbrace)[T.4.0]       105.7   106.7  
fhisp                  105.7   104.2 *
lowed                  105.7   106.2  






    Out[88]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2844274  


  Model:                Logit         Df Residuals:         2844268  


  Method:                MLE          Df Model:                  5   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.362e-05 


  Time:               16:31:47        Log-Likelihood:     -1.9706e+06


  converged:            True          LL-Null:            -1.9706e+06


                                    LLR p-value:         2.422e-10 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0554      0.002     35.743   0.000      0.052     0.058


  C(fbrace)[T.2.0]     -0.0197      0.003     -5.751   0.000     -0.026    -0.013


  C(fbrace)[T.3.0]     -0.0025      0.012     -0.215   0.830     -0.026     0.021


  C(fbrace)[T.4.0]      0.0091      0.005      1.882   0.060     -0.000     0.019


  fhisp                -0.0141      0.003     -4.359   0.000     -0.020    -0.008


  lowed                 0.0045      0.004      1.196   0.232     -0.003     0.012

The effect of birth order disappears.



In [89]:

    
formula = ('boy ~ C(fbrace) + fhisp + highbo')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692816
         Iterations 3
C(fbrace)[T.2.0]       105.7   103.6 *
C(fbrace)[T.3.0]       105.7   105.7  
C(fbrace)[T.4.0]       105.7   107.0 *
fhisp                  105.7   104.3 *
highbo                 105.7   105.5  






    Out[89]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3170216  


  Model:                Logit         Df Residuals:         3170210  


  Method:                MLE          Df Model:                  5   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.527e-05 


  Time:               16:32:15        Log-Likelihood:     -2.1964e+06


  converged:            True          LL-Null:            -2.1964e+06


                                    LLR p-value:         4.151e-13 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0553      0.001     37.705   0.000      0.052     0.058


  C(fbrace)[T.2.0]     -0.0202      0.003     -6.216   0.000     -0.027    -0.014


  C(fbrace)[T.3.0]     -0.0002      0.011     -0.016   0.987     -0.022     0.021


  C(fbrace)[T.4.0]      0.0123      0.004      2.735   0.006      0.003     0.021


  fhisp                -0.0129      0.003     -4.410   0.000     -0.019    -0.007


  highbo               -0.0022      0.005     -0.404   0.686     -0.013     0.009

WIC is no longer associated with more girls.



In [90]:

    
formula = ('boy ~ C(fbrace) + fhisp + wic')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692815
         Iterations 3
C(fbrace)[T.2.0]       105.7   103.6 *
C(fbrace)[T.3.0]       105.7   105.3  
C(fbrace)[T.4.0]       105.7   106.8 *
wic[T.Y]               105.7   105.7  
fhisp                  105.7   104.3 *






    Out[90]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2798726  


  Model:                Logit         Df Residuals:         2798720  


  Method:                MLE          Df Model:                  5   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.387e-05 


  Time:               16:33:05        Log-Likelihood:     -1.9390e+06


  converged:            True          LL-Null:            -1.9390e+06


                                    LLR p-value:         2.328e-10 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0554      0.002     32.654   0.000      0.052     0.059


  C(fbrace)[T.2.0]     -0.0200      0.004     -5.620   0.000     -0.027    -0.013


  C(fbrace)[T.3.0]     -0.0036      0.012     -0.299   0.765     -0.027     0.020


  C(fbrace)[T.4.0]      0.0102      0.005      2.103   0.035      0.001     0.020


  wic[T.Y]              0.0004      0.003      0.151   0.880     -0.005     0.006


  fhisp                -0.0129      0.003     -3.938   0.000     -0.019    -0.006

The effect of obesity disappears.



In [91]:

    
formula = ('boy ~ C(fbrace) + fhisp + obese')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692816
         Iterations 3
C(fbrace)[T.2.0]       105.8   103.7 *
C(fbrace)[T.3.0]       105.8   105.8  
C(fbrace)[T.4.0]       105.8   106.7  
fhisp                  105.8   104.5 *
obese                  105.8   105.3  






    Out[91]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2744537  


  Model:                Logit         Df Residuals:         2744531  


  Method:                MLE          Df Model:                  5   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.454e-05 


  Time:               16:33:32        Log-Likelihood:     -1.9015e+06


  converged:            True          LL-Null:            -1.9015e+06


                                    LLR p-value:         1.139e-10 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0566      0.002     33.872   0.000      0.053     0.060


  C(fbrace)[T.2.0]     -0.0200      0.004     -5.681   0.000     -0.027    -0.013


  C(fbrace)[T.3.0]     -0.0001      0.012     -0.009   0.993     -0.024     0.023


  C(fbrace)[T.4.0]      0.0078      0.005      1.589   0.112     -0.002     0.018


  fhisp                -0.0128      0.003     -4.036   0.000     -0.019    -0.007


  obese                -0.0046      0.003     -1.588   0.112     -0.010     0.001

The effect of payment method is diminished, but self-payment is still associated with more boys.



In [92]:

    
formula = ('boy ~ C(fbrace) + fhisp + C(pay_rec)')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692811
         Iterations 3
C(fbrace)[T.2.0]       106.0   103.8 *
C(fbrace)[T.3.0]       106.0   105.3  
C(fbrace)[T.4.0]       106.0   106.9  
C(pay_rec)[T.2.0]      106.0   105.4  
C(pay_rec)[T.3.0]      106.0   108.2 *
C(pay_rec)[T.4.0]      106.0   106.9  
fhisp                  106.0   104.3 *






    Out[92]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2833065  


  Model:                Logit         Df Residuals:         2833057  


  Method:                MLE          Df Model:                  7   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.928e-05 


  Time:               16:34:22        Log-Likelihood:     -1.9628e+06


  converged:            True          LL-Null:            -1.9628e+06


                                    LLR p-value:         1.047e-13 




                       coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept              0.0579      0.002     23.562   0.000      0.053     0.063


  C(fbrace)[T.2.0]      -0.0207      0.004     -5.863   0.000     -0.028    -0.014


  C(fbrace)[T.3.0]      -0.0065      0.012     -0.548   0.584     -0.030     0.017


  C(fbrace)[T.4.0]       0.0089      0.005      1.858   0.063     -0.000     0.018


  C(pay_rec)[T.2.0]     -0.0051      0.003     -1.901   0.057     -0.010     0.000


  C(pay_rec)[T.3.0]      0.0213      0.006      3.366   0.001      0.009     0.034


  C(pay_rec)[T.4.0]      0.0084      0.006      1.456   0.145     -0.003     0.020


  fhisp                 -0.0161      0.003     -4.953   0.000     -0.022    -0.010

But the number of prenatal visits is still a strong predictor of more girls.



In [93]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692772
         Iterations 3
C(fbrace)[T.2.0]       105.7   103.0 *
C(fbrace)[T.3.0]       105.7   104.8  
C(fbrace)[T.4.0]       105.7   106.9 *
fhisp                  105.7   104.0 *
previs                 105.7   104.5 *






    Out[93]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3096464  


  Model:                Logit         Df Residuals:         3096458  


  Method:                MLE          Df Model:                  5   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       9.021e-05 


  Time:               16:34:51        Log-Likelihood:     -2.1451e+06


  converged:            True          LL-Null:            -2.1453e+06


                                    LLR p-value:         1.832e-81 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0552      0.001     37.584   0.000      0.052     0.058


  C(fbrace)[T.2.0]     -0.0255      0.003     -7.701   0.000     -0.032    -0.019


  C(fbrace)[T.3.0]     -0.0082      0.011     -0.747   0.455     -0.030     0.013


  C(fbrace)[T.4.0]      0.0111      0.005      2.436   0.015      0.002     0.020


  fhisp                -0.0159      0.003     -5.363   0.000     -0.022    -0.010


  previs               -0.0115      0.001    -17.928   0.000     -0.013    -0.010

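And the coefficient for prenatal visits is slightly larger in magnitude if we add a boolean to capture the nonlinearity at 0 visits, although the boolean itself is not statistically significant here.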



In [94]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692772
         Iterations 3
C(fbrace)[T.2.0]       105.7   103.0 *
C(fbrace)[T.3.0]       105.7   104.8  
C(fbrace)[T.4.0]       105.7   106.9 *
fhisp                  105.7   104.0 *
previs                 105.7   104.5 *
no_previs              105.7   103.8  






    Out[94]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3096464  


  Model:                Logit         Df Residuals:         3096457  


  Method:                MLE          Df Model:                  6   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       9.072e-05 


  Time:               16:35:20        Log-Likelihood:     -2.1451e+06


  converged:            True          LL-Null:            -2.1453e+06


                                    LLR p-value:         5.706e-81 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0554      0.001     37.601   0.000      0.052     0.058


  C(fbrace)[T.2.0]     -0.0254      0.003     -7.688   0.000     -0.032    -0.019


  C(fbrace)[T.3.0]     -0.0083      0.011     -0.754   0.451     -0.030     0.013


  C(fbrace)[T.4.0]      0.0110      0.005      2.423   0.015      0.002     0.020


  fhisp                -0.0159      0.003     -5.344   0.000     -0.022    -0.010


  previs               -0.0118      0.001    -17.440   0.000     -0.013    -0.010


  no_previs            -0.0183      0.012     -1.485   0.138     -0.042     0.006

More controls

Now if we control for father's race and Hispanic origin as well as number of prenatal visits, the estimated effect of marriage shrinks to essentially zero.



In [95]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs + dmar')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692772
         Iterations 3
C(fbrace)[T.2.0]       105.7   103.0 *
C(fbrace)[T.3.0]       105.7   104.8  
C(fbrace)[T.4.0]       105.7   106.9 *
fhisp                  105.7   104.0 *
previs                 105.7   104.5 *
dmar                   105.7   105.7  






    Out[95]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3096464  


  Model:                Logit         Df Residuals:         3096457  


  Method:                MLE          Df Model:                  6   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       9.021e-05 


  Time:               16:35:49        Log-Likelihood:     -2.1451e+06


  converged:            True          LL-Null:            -2.1453e+06


                                    LLR p-value:         1.697e-80 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0553      0.003     15.816   0.000      0.048     0.062


  C(fbrace)[T.2.0]     -0.0254      0.003     -7.384   0.000     -0.032    -0.019


  C(fbrace)[T.3.0]     -0.0082      0.011     -0.743   0.458     -0.030     0.013


  C(fbrace)[T.4.0]      0.0111      0.005      2.431   0.015      0.002     0.020


  fhisp                -0.0159      0.003     -5.242   0.000     -0.022    -0.010


  previs               -0.0115      0.001    -17.900   0.000     -0.013    -0.010


  dmar              -8.777e-05      0.003     -0.034   0.973     -0.005     0.005

The effect of payment method disappears.



In [96]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs + C(pay_rec)')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692775
         Iterations 3
C(fbrace)[T.2.0]       105.7   103.1 *
C(fbrace)[T.3.0]       105.7   104.3  
C(fbrace)[T.4.0]       105.7   106.6  
C(pay_rec)[T.2.0]      105.7   105.6  
C(pay_rec)[T.3.0]      105.7   106.8  
C(pay_rec)[T.4.0]      105.7   106.6  
fhisp                  105.7   103.9 *
previs                 105.7   104.6 *






    Out[96]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2752104  


  Model:                Logit         Df Residuals:         2752095  


  Method:                MLE          Df Model:                  8   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       8.519e-05 


  Time:               16:36:40        Log-Likelihood:     -1.9066e+06


  converged:            True          LL-Null:            -1.9068e+06


                                    LLR p-value:         2.081e-65 




                       coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept              0.0557      0.002     22.311   0.000      0.051     0.061


  C(fbrace)[T.2.0]      -0.0249      0.004     -6.923   0.000     -0.032    -0.018


  C(fbrace)[T.3.0]      -0.0132      0.012     -1.089   0.276     -0.037     0.011


  C(fbrace)[T.4.0]       0.0081      0.005      1.651   0.099     -0.002     0.018


  C(pay_rec)[T.2.0]     -0.0011      0.003     -0.397   0.691     -0.006     0.004


  C(pay_rec)[T.3.0]      0.0106      0.006      1.642   0.101     -0.002     0.023


  C(pay_rec)[T.4.0]      0.0083      0.006      1.412   0.158     -0.003     0.020


  fhisp                 -0.0170      0.003     -5.174   0.000     -0.023    -0.011


  previs                -0.0109      0.001    -15.813   0.000     -0.012    -0.010
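The two numeric columns printed before each table can be reproduced from the coefficients. The sketch below assumes (it is not confirmed by the code shown in this section) that the first column is the baseline sex ratio `100 * exp(Intercept)`, the second is the ratio after adding one coefficient, and the star marks p < 0.05. The helper name `ratio_per_100` is hypothetical, and the inputs are the rounded values from the table above.

```python
import math

def ratio_per_100(log_odds):
    """Convert the log-odds of having a boy into boys per 100 girls."""
    return 100 * math.exp(log_odds)

# Rounded (coefficient, p-value) pairs copied from the table above.
intercept = 0.0557
rows = {
    'C(fbrace)[T.2.0]': (-0.0249, 0.000),
    'C(pay_rec)[T.2.0]': (-0.0011, 0.691),
    'fhisp': (-0.0170, 0.000),
    'previs': (-0.0109, 0.000),
}

for name, (coef, p) in rows.items():
    baseline = ratio_per_100(intercept)        # all other terms at zero
    shifted = ratio_per_100(intercept + coef)  # one-unit change in this term
    star = '*' if p < 0.05 else ' '            # assumed significance threshold
    print('%-22s %6.1f %6.1f %s' % (name, baseline, shifted, star))
```

Under these assumptions the printed rows agree with the summary lines above to one decimal place, which suggests the columns are sex ratios in boys per 100 girls rather than probabilities.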

Here's a version with the addition of a boolean for no prenatal visits.



In [97]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692772
         Iterations 3
C(fbrace)[T.2.0]       105.7   103.0 *
C(fbrace)[T.3.0]       105.7   104.8  
C(fbrace)[T.4.0]       105.7   106.9 *
fhisp                  105.7   104.0 *
previs                 105.7   104.5 *
no_previs              105.7   103.8  






    Out[97]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3096464  


  Model:                Logit         Df Residuals:         3096457  


  Method:                MLE          Df Model:                  6   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       9.072e-05 


  Time:               16:37:07        Log-Likelihood:     -2.1451e+06


  converged:            True          LL-Null:            -2.1453e+06


                                    LLR p-value:         5.706e-81 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0554      0.001     37.601   0.000      0.052     0.058


  C(fbrace)[T.2.0]     -0.0254      0.003     -7.688   0.000     -0.032    -0.019


  C(fbrace)[T.3.0]     -0.0083      0.011     -0.754   0.451     -0.030     0.013


  C(fbrace)[T.4.0]      0.0110      0.005      2.423   0.015      0.002     0.020


  fhisp                -0.0159      0.003     -5.344   0.000     -0.022    -0.010


  previs               -0.0118      0.001    -17.440   0.000     -0.013    -0.010


  no_previs            -0.0183      0.012     -1.485   0.138     -0.042     0.006
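The tiny pseudo R-squared and the tiny LLR p-value in the same table can look contradictory, so here is a minimal sketch of how they relate. It uses McFadden's definition, pseudo R² = 1 - LL/LL-Null, and a likelihood ratio statistic of 2(LL - LL-Null) compared against a chi-squared distribution with `Df Model` degrees of freedom. The helper `chi2_sf_even_df` is a hypothetical name and uses a closed form that only holds for even degrees of freedom. Because the inputs are the rounded values printed above, the result only approximates the reported p-value.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper tail probability of a chi-squared variable (even df only)."""
    if df % 2 != 0:
        raise ValueError('this closed form requires an even df')
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum (x/2)^k / k! for k = 0 .. df/2 - 1.
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# Rounded values copied from the table above.
pseudo_r2 = 9.072e-05
ll_null = -2.1453e+06
df_model = 6

# LL = LL-Null * (1 - pseudo R^2), so LL - LL-Null = -pseudo R^2 * LL-Null.
llr = -2 * pseudo_r2 * ll_null
p_value = chi2_sf_even_df(llr, df_model)
print(llr, p_value)
```

With about 3 million observations, even a model that explains less than 0.01% of the log-likelihood produces a test statistic in the hundreds, which is why the p-value is so small while the effect sizes remain small.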

Now the mother's age has a small estimated effect, although in this model it is not statistically significant (p = 0.22).



In [98]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs + mager9')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692772
         Iterations 3
C(fbrace)[T.2.0]       106.3   103.5 *
C(fbrace)[T.3.0]       106.3   105.3  
C(fbrace)[T.4.0]       106.3   107.5 *
fhisp                  106.3   104.5 *
previs                 106.3   105.0 *
no_previs              106.3   104.3  
mager9                 106.3   106.1  






    Out[98]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3096464  


  Model:                Logit         Df Residuals:         3096456  


  Method:                MLE          Df Model:                  7   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       9.107e-05 


  Time:               16:37:37        Log-Likelihood:     -2.1451e+06


  converged:            True          LL-Null:            -2.1453e+06


                                    LLR p-value:         2.300e-80 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0606      0.005     13.327   0.000      0.052     0.070


  C(fbrace)[T.2.0]     -0.0258      0.003     -7.770   0.000     -0.032    -0.019


  C(fbrace)[T.3.0]     -0.0089      0.011     -0.805   0.421     -0.030     0.013


  C(fbrace)[T.4.0]      0.0114      0.005      2.505   0.012      0.002     0.020


  fhisp                -0.0162      0.003     -5.437   0.000     -0.022    -0.010


  previs               -0.0117      0.001    -17.323   0.000     -0.013    -0.010


  no_previs            -0.0182      0.012     -1.479   0.139     -0.042     0.006


  mager9               -0.0012      0.001     -1.221   0.222     -0.003     0.001

So does the father's age. Both age effects are small: the father's is borderline significant (p = 0.041), while the mother's is not significant (p = 0.22).



In [99]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs + fagerrec11')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692772
         Iterations 3
C(fbrace)[T.2.0]       106.6   103.9 *
C(fbrace)[T.3.0]       106.6   105.6  
C(fbrace)[T.4.0]       106.6   107.8 *
fhisp                  106.6   104.8 *
previs                 106.6   105.3 *
no_previs              106.6   104.5  
fagerrec11             106.6   106.4 *






    Out[99]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3088921  


  Model:                Logit         Df Residuals:         3088913  


  Method:                MLE          Df Model:                  7   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       9.189e-05 


  Time:               16:38:07        Log-Likelihood:     -2.1399e+06


  converged:            True          LL-Null:            -2.1401e+06


                                    LLR p-value:         6.432e-81 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0637      0.004     14.796   0.000      0.055     0.072


  C(fbrace)[T.2.0]     -0.0258      0.003     -7.763   0.000     -0.032    -0.019


  C(fbrace)[T.3.0]     -0.0094      0.011     -0.856   0.392     -0.031     0.012


  C(fbrace)[T.4.0]      0.0116      0.005      2.527   0.012      0.003     0.021


  fhisp                -0.0165      0.003     -5.530   0.000     -0.022    -0.011


  previs               -0.0118      0.001    -17.371   0.000     -0.013    -0.010


  no_previs            -0.0194      0.012     -1.569   0.117     -0.044     0.005


  fagerrec11           -0.0017      0.001     -2.046   0.041     -0.003 -7.15e-05

What's up with prenatal visits?

The predictive power of prenatal visits is still surprising to me. To make sure we have controlled for race, I'll select cases where both parents are white:



In [100]:

    
white = df[(df.mbrace==1) & (df.fbrace==1)]
len(white)









    Out[100]:





2373016

And compute sex ratios for each level of previs:



In [101]:

    
var = 'previs'
white[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

The effect holds up. People with fewer than average prenatal visits are substantially more likely to have boys.
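For readers without the helper at hand, the per-level computation can be sketched as follows. The function `sex_ratio` is a hypothetical stand-in for `series_to_ratio`, assumed here to return boys per 100 girls for a 0/1 column, and the small DataFrame is made-up toy data rather than the birth records.

```python
import pandas as pd

def sex_ratio(boy):
    """Boys per 100 girls for a Series of 0/1 values (assumed encoding)."""
    boys = boy.sum()
    girls = boy.count() - boys  # count() ignores missing values
    return 100.0 * boys / girls if girls else float('nan')

# Toy data only, to show the shape of the computation.
toy = pd.DataFrame({
    'previs': [-1, -1, -1, 0, 0, 0, 0, 1, 1, 1],
    'boy':    [ 1,  1,  0, 1, 0, 1, 0, 0, 1, 0],
})

ratios = toy.groupby('previs')['boy'].aggregate(sex_ratio)
print(ratios)
```

With real data, levels of `previs` that contain few births will have noisy ratios, so the trend across levels matters more than any single value.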



In [102]:

    
formula = ('boy ~ previs + no_previs')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692766
         Iterations 3
previs                 105.3   104.0 *
no_previs              105.3   103.4  






    Out[102]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2315682  


  Model:                Logit         Df Residuals:         2315679  


  Method:                MLE          Df Model:                  2   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       7.429e-05 


  Time:               16:38:13        Log-Likelihood:     -1.6042e+06


  converged:            True          LL-Null:            -1.6043e+06


                                    LLR p-value:         1.719e-52 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0516      0.001     39.039   0.000      0.049     0.054


  previs        -0.0119      0.001    -14.973   0.000     -0.013    -0.010


  no_previs     -0.0180      0.015     -1.188   0.235     -0.048     0.012
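The standard error for `previs` is printed as 0.001, which is too coarse to check the interval by hand. As a rough sketch, assuming the usual Wald relationships (z = coef / std err, and a 95% interval of coef ± 1.96 × std err), the standard error can be recovered from the coefficient and z value. The function name `wald_summary` is hypothetical.

```python
def wald_summary(coef, z, z_crit=1.96):
    """Recover the standard error and a Wald interval from coef and z."""
    se = abs(coef / z)
    return se, coef - z_crit * se, coef + z_crit * se

# Rounded values for previs copied from the table above.
se, lo, hi = wald_summary(-0.0119, -14.973)
print('std err %.6f, 95%% interval (%.3f, %.3f)' % (se, lo, hi))
```

The recovered interval rounds to the one printed in the table, and it excludes zero by a wide margin, unlike the interval for `no_previs`.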



In [103]:

    
inter = results.params['Intercept']
slope = results.params['previs']
inter, slope









    Out[103]:





(0.051571918028023904, -0.01191111958144815)



In [104]:

    
previs = np.arange(-5, 5)
logodds = inter + slope * previs
odds = np.exp(logodds)
odds * 100









    Out[104]:





array([ 111.75374016,  110.43052413,  109.1229756 ,  107.83090904,
        106.55414115,  105.29249079,  104.04577894,  102.81382875,
        101.59646541,  100.39351622])
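The array above is the fitted odds of a boy times 100, that is, boys per 100 girls at each value of `previs`. The sketch below repeats that calculation with the standard library and also converts the log-odds to a probability. It assumes, as the range from -5 to 4 suggests, that `previs` is measured relative to a typical number of visits, so 0 is the reference level. The function names are hypothetical.

```python
import math

# Fitted parameters copied from Out[103].
inter = 0.051571918028023904
slope = -0.01191111958144815

def boys_per_100_girls(previs):
    """Fitted sex ratio at a given (assumed centered) previs value."""
    return 100 * math.exp(inter + slope * previs)

def prob_boy(previs):
    """Fitted probability of a boy, via the logistic function."""
    log_odds = inter + slope * previs
    return 1 / (1 + math.exp(-log_odds))

for v in (-5, 0, 4):
    print(v, boys_per_100_girls(v), prob_boy(v))
```

Expressed as probabilities, the differences look smaller than they do as ratios, which is worth keeping in mind when judging how "substantial" the effect is.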



In [105]:

    
formula = ('boy ~ dmar')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692808
         Iterations 3
dmar                   105.3   105.3  






    Out[105]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2373016  


  Model:                Logit         Df Residuals:         2373014  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.132e-08 


  Time:               16:38:17        Log-Likelihood:     -1.6440e+06


  converged:            True          LL-Null:            -1.6440e+06


                                    LLR p-value:          0.8470   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0514      0.004     13.066   0.000      0.044     0.059


  dmar           0.0006      0.003      0.193   0.847     -0.005     0.006



In [106]:

    
formula = ('boy ~ lowed')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692803
         Iterations 3
lowed                  105.4   105.5  






    Out[106]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2128894  


  Model:                Logit         Df Residuals:         2128892  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.716e-08 


  Time:               16:38:19        Log-Likelihood:     -1.4749e+06


  converged:            True          LL-Null:            -1.4749e+06


                                    LLR p-value:          0.8220   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0523      0.001     35.788   0.000      0.049     0.055


  lowed          0.0009      0.004      0.225   0.822     -0.007     0.009



In [107]:

    
formula = ('boy ~ highbo')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692808
         Iterations 3
highbo                 105.4   105.2  






    Out[107]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2364914  


  Model:                Logit         Df Residuals:         2364912  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.675e-08 


  Time:               16:38:22        Log-Likelihood:     -1.6384e+06


  converged:            True          LL-Null:            -1.6384e+06


                                    LLR p-value:          0.8148   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0521      0.001     39.220   0.000      0.050     0.055


  highbo        -0.0015      0.006     -0.234   0.815     -0.014     0.011



In [108]:

    
formula = ('boy ~ wic')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692806
         Iterations 3
wic[T.Y]               105.4   105.3  






    Out[108]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2096546  


  Model:                Logit         Df Residuals:         2096544  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.034e-07 


  Time:               16:38:38        Log-Likelihood:     -1.4525e+06


  converged:            True          LL-Null:            -1.4525e+06


                                    LLR p-value:          0.5836   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0528      0.002     30.597   0.000      0.049     0.056


  wic[T.Y]      -0.0016      0.003     -0.548   0.584     -0.007     0.004



In [109]:

    
formula = ('boy ~ obese')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692806
         Iterations 3
obese                  105.4   105.1  






    Out[109]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2061631  


  Model:                Logit         Df Residuals:         2061629  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       3.055e-07 


  Time:               16:38:41        Log-Likelihood:     -1.4283e+06


  converged:            True          LL-Null:            -1.4283e+06


                                    LLR p-value:          0.3502   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0530      0.002     33.495   0.000      0.050     0.056


  obese         -0.0031      0.003     -0.934   0.350     -0.010     0.003



In [110]:

    
formula = ('boy ~ C(pay_rec)')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692804
         Iterations 3
C(pay_rec)[T.2.0]      105.3   105.2  
C(pay_rec)[T.3.0]      105.3   106.8  
C(pay_rec)[T.4.0]      105.3   106.5  






    Out[110]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2118925  


  Model:                Logit         Df Residuals:         2118921  


  Method:                MLE          Df Model:                  3   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.490e-06 


  Time:               16:38:58        Log-Likelihood:     -1.4680e+06


  converged:            True          LL-Null:            -1.4680e+06


                                    LLR p-value:          0.06264  




                       coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept              0.0519      0.002     21.966   0.000      0.047     0.057


  C(pay_rec)[T.2.0]     -0.0012      0.003     -0.416   0.677     -0.007     0.005


  C(pay_rec)[T.3.0]      0.0134      0.007      1.853   0.064     -0.001     0.028


  C(pay_rec)[T.4.0]      0.0112      0.007      1.641   0.101     -0.002     0.024



In [111]:

    
formula = ('boy ~ mager9')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692806
         Iterations 3
mager9                 106.9   106.5 *






    Out[111]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2373016  


  Model:                Logit         Df Residuals:         2373014  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.743e-06 


  Time:               16:39:02        Log-Likelihood:     -1.6440e+06


  converged:            True          LL-Null:            -1.6440e+06


                                    LLR p-value:         0.002671  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0666      0.005     13.343   0.000      0.057     0.076


  mager9        -0.0033      0.001     -3.003   0.003     -0.005    -0.001



In [112]:

    
formula = ('boy ~ youngm + oldm')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692806
         Iterations 3
youngm[T.True]         105.3   106.7 *
oldm[T.True]           105.3   103.9  






    Out[112]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2373016  


  Model:                Logit         Df Residuals:         2373013  


  Method:                MLE          Df Model:                  2   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.490e-06 


  Time:               16:39:07        Log-Likelihood:     -1.6440e+06


  converged:            True          LL-Null:            -1.6440e+06


                                    LLR p-value:          0.01668  




                    coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept           0.0519      0.001     38.350   0.000      0.049     0.055


  youngm[T.True]      0.0129      0.006      2.146   0.032      0.001     0.025


  oldm[T.True]       -0.0137      0.008     -1.805   0.071     -0.029     0.001



In [113]:

    
formula = ('boy ~ youngf + oldf')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692807
         Iterations 3
youngf                 105.3   107.3 *
oldf                   105.3   103.9  






    Out[113]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2368037  


  Model:                Logit         Df Residuals:         2368034  


  Method:                MLE          Df Model:                  2   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.483e-06 


  Time:               16:39:11        Log-Likelihood:     -1.6406e+06


  converged:            True          LL-Null:            -1.6406e+06


                                    LLR p-value:          0.01701  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0520      0.001     38.943   0.000      0.049     0.055


  youngf         0.0182      0.009      2.137   0.033      0.002     0.035


  oldf          -0.0140      0.008     -1.834   0.067     -0.029     0.001


