Does Trivers-Willard apply to people?

This notebook contains a "one-day paper", my attempt to pose a research question, answer it, and publish the results in one work day.

MIT License: https://opensource.org/licenses/MIT



In [1]:

    
from __future__ import print_function, division

import thinkstats2
import thinkplot

import pandas as pd
import numpy as np

import statsmodels.formula.api as smf

%matplotlib inline

Trivers-Willard

According to Wikipedia, the Trivers-Willard hypothesis:

"...suggests that female mammals are able to adjust offspring sex ratio in response to their maternal condition. For example, it may predict greater parental investment in males by parents in 'good conditions' and greater investment in females by parents in 'poor conditions' (relative to parents in good condition)."

For humans, the hypothesis suggests that people with relatively high social status might be more likely to have boys. Some studies have shown evidence for this hypothesis, but based on my very casual survey, the evidence is not persuasive.

To test whether the T-W hypothesis holds up in humans, I downloaded birth data for the nearly 4 million babies born in the U.S. in 2012.

I selected variables that seemed likely to be related to social status and used logistic regression to identify variables associated with sex ratio.

Summary of results

Running regression with one variable at a time, many of the variables have a statistically significant effect on sex ratio, with the sign of the effect generally in the direction predicted by T-W.
However, many of the variables are also correlated with race. If we control for either the mother's race or the father's race, or both, most other variables have no additional predictive power.
Contrary to other reports, the age of the parents seems to have no predictive power.
Strangely, the variable that shows the strongest and most consistent relationship with sex ratio is the number of prenatal visits. Although it seems obvious that prenatal visits are a proxy for quality of health care and general socioeconomic status, the sign of the effect is opposite what T-W predicts; that is, a higher number of prenatal visits is a strong predictor of lower sex ratio (more girls).

Following convention, I report sex ratio in terms of boys per 100 girls. The overall sex ratio at birth is about 105; that is, 105 boys are born for every 100 girls.

Data cleaning

Here's how I loaded the data:



In [2]:

    
names = ['year', 'mager9', 'restatus', 'mbrace', 'mhisp_r',
        'mar_p', 'dmar', 'meduc', 'fagerrec11', 'fbrace', 'fhisp_r', 'feduc', 
        'lbo_rec', 'previs_rec', 'wic', 'height', 'bmi_r', 'pay_rec', 'sex']
colspecs = [(15, 18),
            (93, 93),
            (138, 138),
            (143, 143),
            (148, 148),
            (152, 152),
            (153, 153),
            (155, 155),
            (186, 187),
            (191, 191),
            (195, 195),
            (197, 197),
            (212, 212),
            (272, 273),
            (281, 281),
            (555, 556),
            (533, 533),
            (413, 413),
            (436, 436),
           ]

colspecs = [(start-1, end) for start, end in colspecs]
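
The last line of the cell above converts column positions from the 1-based, inclusive form used in the file layout into the 0-based, half-open intervals that `pd.read_fwf` expects for `colspecs`. Here is a minimal sketch of the same conversion in plain Python; the sample record and its field layout are made up for illustration and do not come from the natality file.

```python
# Hypothetical 10-character record, not taken from the natality file.
record = "2012M03xyz"

# Positions as a codebook would list them: 1-based, inclusive on both ends.
codebook_specs = {"year": (1, 4), "sex": (5, 5), "month": (6, 7)}

# Same conversion as in the cell above: shift the start by one so that
# Python-style slicing (0-based, end exclusive) selects the same characters.
slices = {name: (start - 1, end) for name, (start, end) in codebook_specs.items()}

fields = {name: record[lo:hi] for name, (lo, hi) in slices.items()}
print(fields)
```

A single-character field such as `(5, 5)` becomes `(4, 5)`, which is why the end position is left unchanged.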



In [3]:

    
df = None



In [4]:

    
filename = 'Nat2012PublicUS.r20131217.gz'
#df = pd.read_fwf(filename, compression='gzip', header=None, names=names, colspecs=colspecs)
#df.head()



In [5]:

    
# store the dataframe for faster loading

#store = pd.HDFStore('store.h5')
#store['births2013'] = df
#store.close()









    






In [6]:

    
# load the dataframe

store = pd.HDFStore('store.h5')
df = store['births2013']
store.close()



In [6]:

    
def series_to_ratio(series):
    """Takes a boolean series and computes sex ratio.
    """
    boys = np.mean(series)
    return np.round(100 * boys / (1-boys)).astype(int)

I have to recode sex as 0 or 1 to make logit happy.



In [7]:

    
df['boy'] = (df.sex=='M').astype(int)
df.boy.value_counts().sort_index()









    Out[7]:





0    1935228
1    2025568
Name: boy, dtype: int64
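
As a quick check, the overall sex ratio can be computed by hand from these two counts, following the same arithmetic as `series_to_ratio`. This is only a sketch with the counts copied from the output above.

```python
boys = 2025568   # births coded boy == 1 in the output above
girls = 1935228  # births coded boy == 0

p_boy = boys / float(boys + girls)     # fraction of births that are boys
ratio = 100 * p_boy / (1 - p_boy)      # boys per 100 girls, as in series_to_ratio

print(round(p_boy, 4), round(ratio, 1))
```

The ratio comes out a little under 105, and it rounds to 105 boys per 100 girls.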

All births are from 2012.



In [8]:

    
df.year.value_counts().sort_index()









    Out[8]:





2012    3960796
Name: year, dtype: int64

Mother's age:



In [9]:

    
df.mager9.value_counts().sort_index()









    Out[9]:





1       3676
2     305837
3     918221
4    1126139
5    1015784
6     473533
7     109807
8       7187
9        612
Name: mager9, dtype: int64



In [10]:

    
var = 'mager9'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [11]:

    
df.mager9.isnull().mean()









    Out[11]:





0.0



In [12]:

    
df['youngm'] = df.mager9<=2
df['oldm'] = df.mager9>=7
df.youngm.mean(), df.oldm.mean()









    Out[12]:





(0.078144140723228367, 0.029692516352773535)

Residence status (1=resident)



In [13]:

    
df.restatus.value_counts().sort_index()









    Out[13]:





1    2874513
2     993222
3      85106
4       7955
Name: restatus, dtype: int64



In [14]:

    
var = 'restatus'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Mother's race (1=White, 2=Black, 3=American Indian or Alaskan Native, 4=Asian or Pacific Islander)



In [15]:

    
df.mbrace.value_counts().sort_index()









    Out[15]:





1    3007229
2     634411
3      46105
4     273051
Name: mbrace, dtype: int64



In [16]:

    
var = 'mbrace'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Mother's Hispanic origin (0=Non-Hispanic)



In [17]:

    
df.mhisp_r.replace([9], np.nan, inplace=True)
df.mhisp_r.value_counts().sort_index()









    Out[17]:





0    3015510
1     562250
2      67192
3      17400
4     131955
5     135597
Name: mhisp_r, dtype: int64



In [18]:

    
def copy_null(df, oldvar, newvar):
    df.loc[df[oldvar].isnull(), newvar] = np.nan
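
The reason a helper like `copy_null` is needed is that comparing a missing value (NaN) with a number evaluates to False, so a derived boolean column would silently count missing answers as "no". The sketch below shows the same problem and the same fix using a plain Python list instead of a DataFrame; the values are hypothetical.

```python
import math

nan = float("nan")

# Hypothetical recoded values; nan marks a missing answer.
mhisp_r = [0, 3, nan, 1, 0]

# A plain comparison turns the missing entry into False, i.e. "non-Hispanic".
naive = [value > 0 for value in mhisp_r]

# Restoring the missing marker afterwards, which is what copy_null does
# for a DataFrame column.
fixed = [nan if math.isnan(value) else flag
         for value, flag in zip(mhisp_r, naive)]

print(naive)
print(fixed)
```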



In [19]:

    
df['mhisp'] = df.mhisp_r > 0
copy_null(df, 'mhisp_r', 'mhisp')
df.mhisp.isnull().mean(), df.mhisp.mean()









    Out[19]:





(0.0077994423343186571, 0.23267591269405055)
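
The null fraction tells us how many rows a model using `mhisp` will have to set aside. As far as I know, the statsmodels formula interface drops rows with missing values by default, so the number of complete rows should match the number of observations the regression on `mhisp` reports later in the notebook (3929904). A small sketch of the arithmetic, using the figures printed above:

```python
n_total = 3960796                      # rows in the DataFrame
null_fraction = 0.0077994423343186571  # df.mhisp.isnull().mean() from above

n_missing = int(round(n_total * null_fraction))
n_complete = n_total - n_missing

print(n_missing, n_complete)
```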



In [20]:

    
var = 'mhisp'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Marital status (1=Married)



In [21]:

    
df.dmar.value_counts().sort_index()









    Out[21]:





1    2349102
2    1611694
Name: dmar, dtype: int64



In [22]:

    
var = 'dmar'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Paternity acknowledged, if unmarried (Y=yes, N=no, X=not applicable, U=unknown).

I recode X (not applicable because married) as Y (paternity acknowledged).



In [23]:

    
df.mar_p.replace(['U'], np.nan, inplace=True)
df.mar_p.replace(['X'], 'Y', inplace=True)
df.mar_p.value_counts().sort_index()









    Out[23]:





N     430123
Y    3058398
Name: mar_p, dtype: int64



In [24]:

    
var = 'mar_p'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Mother's education level



In [25]:

    
df.meduc.replace([9], np.nan, inplace=True)
df.meduc.value_counts().sort_index()









    Out[25]:





1    144045
2    443007
3    858548
4    732444
5    266066
6    644497
7    282351
8     81074
Name: meduc, dtype: int64



In [26]:

    
var = 'meduc'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [27]:

    
df['lowed'] = df.meduc <= 2
copy_null(df, 'meduc', 'lowed')
df.lowed.isnull().mean(), df.lowed.mean()









    Out[27]:





(0.12844993784077746, 0.17005983722051243)

Father's age, in 10 ranges



In [28]:

    
df.fagerrec11.replace([11], np.nan, inplace=True)
df.fagerrec11.value_counts().sort_index()









    Out[28]:





1        422
2     104428
3     527157
4     871442
5     977564
6     591733
7     257619
8      84016
9      26361
10     11389
Name: fagerrec11, dtype: int64



In [29]:

    
var = 'fagerrec11'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [30]:

    
df['youngf'] = df.fagerrec11<=2
copy_null(df, 'fagerrec11', 'youngf')
df.youngf.isnull().mean(), df.youngf.mean()









    Out[30]:





(0.12842494286502007, 0.03037254379975731)



In [31]:

    
df['oldf'] = df.fagerrec11>=8
copy_null(df, 'fagerrec11', 'oldf')
df.oldf.isnull().mean(), df.oldf.mean()









    Out[31]:





(0.12842494286502007, 0.03527270546801382)

Father's race



In [32]:

    
df.fbrace.replace([9], np.nan, inplace=True)
df.fbrace.value_counts().sort_index()









    Out[32]:





1    2475018
2     469930
3      35175
4     227463
Name: fbrace, dtype: int64



In [33]:

    
var = 'fbrace'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Father's Hispanic origin (0=non-hispanic, other values indicate country of origin)



In [34]:

    
df.fhisp_r.replace([9], np.nan, inplace=True)
df.fhisp_r.value_counts().sort_index()









    Out[34]:





0    2603738
1     500926
2      57417
3      16953
4     105376
5     116056
Name: fhisp_r, dtype: int64



In [35]:

    
df['fhisp'] = df.fhisp_r > 0
copy_null(df, 'fhisp_r', 'fhisp')
df.fhisp.isnull().mean(), df.fhisp.mean()









    Out[35]:





(0.14146903804184816, 0.23429965187124352)



In [36]:

    
var = 'fhisp'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Father's education level



In [37]:

    
df.feduc.replace([9], np.nan, inplace=True)
df.feduc.value_counts().sort_index()









    Out[37]:





1    142003
2    336801
3    852923
4    574580
5    201888
6    544487
7    210268
8     97452
Name: feduc, dtype: int64



In [38]:

    
var = 'feduc'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Live birth order.



In [39]:

    
df.lbo_rec.replace([9], np.nan, inplace=True)
df.lbo_rec.value_counts().sort_index()









    Out[39]:





1    1574534
2    1248053
3     651817
4     276179
5     106197
6      43907
7      19899
8      20268
Name: lbo_rec, dtype: int64



In [40]:

    
var = 'lbo_rec'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [41]:

    
df['highbo'] = df.lbo_rec >= 5
copy_null(df, 'lbo_rec', 'highbo')
df.highbo.isnull().mean(), df.highbo.mean()









    Out[41]:





(0.0050348465308488492, 0.04828166686713083)

Number of prenatal visits, in 11 ranges



In [42]:

    
df.previs_rec.replace([12], np.nan, inplace=True)
df.previs_rec.value_counts().sort_index()









    Out[42]:





1       53862
2       39409
3       90791
4      191909
5      361056
6      809787
7     1023277
8      659674
9      385390
10      98941
11     124582
Name: previs_rec, dtype: int64



In [43]:

    
df.previs_rec.mean()
df['previs'] = df.previs_rec - 7



In [44]:

    
var = 'previs'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [45]:

    
df['no_previs'] = df.previs_rec <= 1
copy_null(df, 'previs_rec', 'no_previs')
df.no_previs.isnull().mean(), df.no_previs.mean()









    Out[45]:





(0.030831681308504656, 0.014031393099395157)

Whether the mother received WIC food assistance (Y=yes, N=no)



In [46]:

    
df.wic.replace(['U'], np.nan, inplace=True)
df.wic.value_counts().sort_index()









    Out[46]:





N    1820030
Y    1591601
Name: wic, dtype: int64



In [47]:

    
var = 'wic'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Mother's height in inches



In [48]:

    
df.height.replace([99], np.nan, inplace=True)
df.height.value_counts().sort_index()









    Out[48]:





30        14
31         3
32         2
34         1
36        17
37         5
38         9
39         4
40        13
41        18
42         8
43         8
44         6
45        15
46         9
47        21
48       732
49       505
50       335
51       414
52       480
53      1384
54      1434
55      2561
56      6587
57     17396
58     19343
59     71557
60    190472
61    240815
62    424926
63    442238
64    505897
65    404563
66    390878
67    303110
68    174629
69    116518
70     56687
71     30085
72     14269
73      4971
74      2381
75       895
76       526
77       584
78      1011
Name: height, dtype: int64



In [49]:

    
df['mshort'] = df.height<60
copy_null(df, 'height', 'mshort')
df.mshort.isnull().mean(), df.mshort.mean()









    Out[49]:





(0.13443257365438666, 0.03584275286903034)



In [50]:

    
df['mtall'] = df.height>=70
copy_null(df, 'height', 'mtall')
df.mtall.isnull().mean(), df.mtall.mean()









    Out[50]:





(0.13443257365438666, 0.03249652309458583)



In [51]:

    
var = 'mshort'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [52]:

    
var = 'mtall'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Mother's BMI in 6 ranges



In [53]:

    
df.bmi_r.replace([9], np.nan, inplace=True)
df.bmi_r.value_counts().sort_index()









    Out[53]:





1     129937
2    1573715
3     849357
4     442695
5     206615
6     141411
Name: bmi_r, dtype: int64



In [54]:

    
var = 'bmi_r'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [55]:

    
df['obese'] = df.bmi_r >= 4
copy_null(df, 'bmi_r', 'obese')
df.obese.isnull().mean(), df.obese.mean()









    Out[55]:





(0.15579343142136076, 0.23647872286338908)

Payment method (1=Medicaid, 2=Private insurance, 3=Self pay, 4=Other)



In [56]:

    
df.pay_rec.replace([9], np.nan, inplace=True)
df.pay_rec.value_counts().sort_index()









    Out[56]:





1    1497162
2    1628336
3     147475
4     174821
Name: pay_rec, dtype: int64



In [57]:

    
var = 'pay_rec'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Sex of baby



In [58]:

    
df.sex.value_counts().sort_index()









    Out[58]:





F    1935228
M    2025568
Name: sex, dtype: int64

Regression models

Here are some functions I'll use to interpret the results of logistic regression:



In [59]:

    
def logodds_to_ratio(logodds):
    """Convert from log odds to sex ratio (boys per 100 girls)."""
    odds = np.exp(logodds)
    return 100 * odds

def summarize(results):
    """Summarize parameters in terms of birth ratio."""
    inter_or = results.params['Intercept']
    inter_rat = logodds_to_ratio(inter_or)
    
    for value, lor in results.params.iteritems():
        if value=='Intercept':
            continue
        
        rat = logodds_to_ratio(inter_or + lor)
        code = '*' if results.pvalues[value] < 0.05 else ' '
        
        print('%-20s   %0.1f   %0.1f' % (value, inter_rat, rat), code)
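
To make the two columns printed by `summarize` concrete, here is a worked example in plain Python. It assumes the intercept and coefficient reported for the `mager9` model below (0.0475 and -0.0005, both rounded), and it uses `math.exp` in place of `np.exp` so that it runs on its own.

```python
import math

def logodds_to_ratio(logodds):
    """Convert log odds of a boy into boys per 100 girls."""
    return 100 * math.exp(logodds)

intercept = 0.0475   # coefficient values copied from the mager9 model
slope = -0.0005      # change in log odds per age category

baseline = logodds_to_ratio(intercept)          # ratio when the predictor is 0
shifted = logodds_to_ratio(intercept + slope)   # ratio one category higher

print("%0.1f   %0.1f" % (baseline, shifted))
```

Note that for a numeric predictor the first column is the ratio at a predictor value of 0, which may not correspond to any real category; the useful information is the difference between the two columns.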

Now I'll run models with each variable, one at a time.

Mother's age seems to have no predictive value:



In [60]:

    
model = smf.logit('boy ~ mager9', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692887
         Iterations 3
mager9                 104.9   104.8  






    Out[60]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3960796  


  Model:                Logit         Df Residuals:         3960794  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       5.778e-08 


  Time:               14:54:16        Log-Likelihood:     -2.7444e+06


  converged:            True          LL-Null:            -2.7444e+06


                                    LLR p-value:          0.5733   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0475      0.004     13.358   0.000      0.041     0.055


  mager9        -0.0005      0.001     -0.563   0.573     -0.002     0.001
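
The "LLR p-value" in the summary can be reconstructed approximately from the other reported numbers. The sketch below assumes that the pseudo R-squared is McFadden's (1 - LL/LL-Null), which is my understanding of what statsmodels reports, and uses the rounded values from the table, so the result only agrees to a few digits.

```python
import math

ll_null = -2.7444e6      # LL-Null from the table above (rounded)
pseudo_r2 = 5.778e-08    # Pseudo R-squ. from the table above

# McFadden's pseudo R-squared is 1 - LL/LL_null, so the likelihood-ratio
# statistic 2 * (LL - LL_null) can be recovered from the two reported values.
llr = -2 * pseudo_r2 * ll_null

# For one degree of freedom, the chi-squared survival function reduces to
# erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(llr / 2))

print(round(llr, 3), round(p_value, 3))
```

With a single predictor, this p-value is essentially the same test as the one reported in the `P>|z|` column for `mager9`.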

The estimated ratio for young mothers is higher, and the difference is statistically significant (p=0.011); the ratio for older mothers is lower, but that difference is not statistically significant.



In [61]:

    
model = smf.logit('boy ~ youngm + oldm', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692886
         Iterations 3
youngm[T.True]         104.6   105.6 *
oldm[T.True]           104.6   104.4  






    Out[61]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3960796  


  Model:                Logit         Df Residuals:         3960793  


  Method:                MLE          Df Model:                  2   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       1.205e-06 


  Time:               14:54:22        Log-Likelihood:     -2.7444e+06


  converged:            True          LL-Null:            -2.7444e+06


                                    LLR p-value:          0.03667  




                    coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept           0.0449      0.001     42.231   0.000      0.043     0.047


  youngm[T.True]      0.0095      0.004      2.529   0.011      0.002     0.017


  oldm[T.True]       -0.0020      0.006     -0.334   0.739     -0.014     0.010
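
The asterisk printed by `summarize` marks p-values below 0.05. As a sketch of where those p-values come from, the following reproduces them from the z statistics in the table, assuming the usual two-sided test against a standard normal distribution. Because the z values are rounded, the results match the table only approximately.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

p_young = two_sided_p(2.529)    # z for youngm[T.True]
p_old = two_sided_p(-0.334)     # z for oldm[T.True]

print(round(p_young, 3), round(p_old, 2))
```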

Residence status has no predictive value either.



In [62]:

    
model = smf.logit('boy ~ C(restatus)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692887
         Iterations 3
C(restatus)[T.2]       104.6   104.7  
C(restatus)[T.3]       104.6   105.4  
C(restatus)[T.4]       104.6   108.2  






    Out[62]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3960796  


  Model:                Logit         Df Residuals:         3960792  


  Method:                MLE          Df Model:                  3   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       6.393e-07 


  Time:               14:54:48        Log-Likelihood:     -2.7444e+06


  converged:            True          LL-Null:            -2.7444e+06


                                    LLR p-value:          0.3196   




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0452      0.001     38.300   0.000      0.043     0.048


  C(restatus)[T.2]      0.0008      0.002      0.338   0.735     -0.004     0.005


  C(restatus)[T.3]      0.0078      0.007      1.126   0.260     -0.006     0.021


  C(restatus)[T.4]      0.0335      0.022      1.493   0.136     -0.011     0.078
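
With a categorical variable wrapped in `C(...)`, the intercept is the log odds for the reference category (here `restatus` 1) and each other coefficient is an offset from it. The sketch below turns the rounded coefficients from the table into a sex ratio for each category; it is plain Python and only restates the arithmetic that `summarize` performs.

```python
import math

intercept = 0.0452   # log odds for the reference category, restatus == 1
offsets = {2: 0.0008, 3: 0.0078, 4: 0.0335}   # C(restatus)[T.k] coefficients

ratios = {1: 100 * math.exp(intercept)}
for level, offset in sorted(offsets.items()):
    # Each coefficient is a difference in log odds from the reference level.
    ratios[level] = 100 * math.exp(intercept + offset)

for level in sorted(ratios):
    print(level, "%0.1f" % ratios[level])
```

The ratio for category 4 looks high, but that category contains fewer than 8,000 births, which is why its confidence interval is wide.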

Mother's race seems to have predictive value. Relative to whites, black and Native American mothers have more girls (although the difference for Native American mothers is not statistically significant); Asians have more boys.



In [63]:

    
model = smf.logit('boy ~ C(mbrace)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692881
         Iterations 3
C(mbrace)[T.2]         104.8   103.3 *
C(mbrace)[T.3]         104.8   104.0  
C(mbrace)[T.4]         104.8   106.3 *






    Out[63]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3960796  


  Model:                Logit         Df Residuals:         3960792  


  Method:                MLE          Df Model:                  3   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       8.640e-06 


  Time:               14:55:15        Log-Likelihood:     -2.7444e+06


  converged:            True          LL-Null:            -2.7444e+06


                                    LLR p-value:         2.829e-10 




                    coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept           0.0471      0.001     40.838   0.000      0.045     0.049


  C(mbrace)[T.2]     -0.0149      0.003     -5.382   0.000     -0.020    -0.009


  C(mbrace)[T.3]     -0.0075      0.009     -0.799   0.424     -0.026     0.011


  C(mbrace)[T.4]      0.0143      0.004      3.567   0.000      0.006     0.022

Hispanic mothers have more girls.



In [64]:

    
model = smf.logit('boy ~ mhisp', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692884
         Iterations 3
mhisp                  105.0   103.6 *






    Out[64]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3929904  


  Model:                Logit         Df Residuals:         3929902  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       5.225e-06 


  Time:               14:55:20        Log-Likelihood:     -2.7230e+06


  converged:            True          LL-Null:            -2.7230e+06


                                    LLR p-value:         9.580e-08 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0485      0.001     42.133   0.000      0.046     0.051


  mhisp         -0.0127      0.002     -5.335   0.000     -0.017    -0.008
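
The confidence interval in the table is in log odds, which is hard to read. As a rough sketch, it can be translated to the sex-ratio scale by adding each end of the interval to the intercept. This treats the intercept as fixed, which is a simplification I am making for illustration.

```python
import math

intercept = 0.0485                 # log odds for non-Hispanic mothers
ci_low, ci_high = -0.017, -0.008   # reported 95% interval for mhisp

# Holding the intercept fixed is a simplification: it ignores the
# (small) uncertainty in the intercept itself.
ratio_low = 100 * math.exp(intercept + ci_low)
ratio_high = 100 * math.exp(intercept + ci_high)

print("%0.1f to %0.1f" % (ratio_low, ratio_high))
```

Under that simplification, the whole interval for Hispanic mothers lies below the baseline ratio of about 105.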

If the mother is married, or unmarried but paternity is acknowledged, the sex ratio is higher (more boys).



In [65]:

    
model = smf.logit('boy ~ C(mar_p)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692875
         Iterations 3
C(mar_p)[T.Y]          103.4   104.9 *






    Out[65]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3488521  


  Model:                Logit         Df Residuals:         3488519  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       4.062e-06 


  Time:               14:55:45        Log-Likelihood:     -2.4171e+06


  converged:            True          LL-Null:            -2.4171e+06


                                    LLR p-value:         9.370e-06 




                   coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept          0.0338      0.003     11.071   0.000      0.028     0.040


  C(mar_p)[T.Y]      0.0144      0.003      4.431   0.000      0.008     0.021

Being unmarried predicts more girls.



In [66]:

    
model = smf.logit('boy ~ C(dmar)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692885
         Iterations 3
C(dmar)[T.2]           105.0   104.2 *






    Out[66]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3960796  


  Model:                Logit         Df Residuals:         3960794  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       2.561e-06 


  Time:               14:56:11        Log-Likelihood:     -2.7444e+06


  converged:            True          LL-Null:            -2.7444e+06


                                    LLR p-value:         0.0001776 




                  coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept         0.0487      0.001     37.345   0.000      0.046     0.051


  C(dmar)[T.2]     -0.0077      0.002     -3.749   0.000     -0.012    -0.004

Each level of mother's education predicts a small increase in the probability of a boy.



In [67]:

    
model = smf.logit('boy ~ meduc', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692874
         Iterations 3
meduc                  103.4   103.7 *






    Out[67]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3452032  


  Model:                Logit         Df Residuals:         3452030  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       5.742e-06 


  Time:               14:56:15        Log-Likelihood:     -2.3918e+06


  converged:            True          LL-Null:            -2.3918e+06


                                    LLR p-value:         1.599e-07 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0330      0.003     11.862   0.000      0.028     0.038


  meduc          0.0032      0.001      5.241   0.000      0.002     0.004



In [68]:

    
model = smf.logit('boy ~ lowed', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692875
         Iterations 3
lowed                  105.0   103.7 *






    Out[68]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3452032  


  Model:                Logit         Df Residuals:         3452030  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       3.472e-06 


  Time:               14:56:19        Log-Likelihood:     -2.3918e+06


  converged:            True          LL-Null:            -2.3918e+06


                                    LLR p-value:         4.594e-05 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0484      0.001     40.975   0.000      0.046     0.051


  lowed         -0.0117      0.003     -4.075   0.000     -0.017    -0.006

Older fathers are slightly more likely to have girls (but this apparent effect could be due to chance).



In [69]:

    
model = smf.logit('boy ~ fagerrec11', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692865
         Iterations 3
fagerrec11             105.5   105.3  






    Out[69]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3452131  


  Model:                Logit         Df Residuals:         3452129  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       5.250e-07 


  Time:               14:56:23        Log-Likelihood:     -2.3919e+06


  converged:            True          LL-Null:            -2.3919e+06


                                    LLR p-value:          0.1130   




                coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept       0.0533      0.004     13.960   0.000      0.046     0.061


  fagerrec11     -0.0012      0.001     -1.585   0.113     -0.003     0.000



In [70]:

    
model = smf.logit('boy ~ youngf + oldf', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692865
         Iterations 3
youngf                 104.9   105.8  
oldf                   104.9   104.2  






    Out[70]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3452131  


  Model:                Logit         Df Residuals:         3452128  


  Method:                MLE          Df Model:                  2   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       7.160e-07 


  Time:               14:56:28        Log-Likelihood:     -2.3919e+06


  converged:            True          LL-Null:            -2.3919e+06


                                    LLR p-value:          0.1804   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0474      0.001     42.574   0.000      0.045     0.050


  youngf         0.0088      0.006      1.405   0.160     -0.003     0.021


  oldf          -0.0068      0.006     -1.156   0.248     -0.018     0.005

Predictions based on father's race are similar to those based on mother's race: more girls for black and Native American fathers (the difference for Native American fathers is not statistically significant); more boys for Asian fathers.



In [71]:

    
model = smf.logit('boy ~ C(fbrace)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692850
         Iterations 3
C(fbrace)[T.2.0]       105.0   103.4 *
C(fbrace)[T.3.0]       105.0   104.7  
C(fbrace)[T.4.0]       105.0   107.1 *






    Out[71]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3207586  


  Model:                Logit         Df Residuals:         3207582  


  Method:                MLE          Df Model:                  3   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       1.138e-05 


  Time:               14:56:53        Log-Likelihood:     -2.2224e+06


  converged:            True          LL-Null:            -2.2224e+06


                                    LLR p-value:         6.021e-11 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0492      0.001     38.677   0.000      0.047     0.052


  C(fbrace)[T.2.0]     -0.0161      0.003     -5.070   0.000     -0.022    -0.010


  C(fbrace)[T.3.0]     -0.0035      0.011     -0.328   0.743     -0.025     0.018


  C(fbrace)[T.4.0]      0.0191      0.004      4.360   0.000      0.011     0.028

If the father is Hispanic, that predicts more girls.



In [72]:

    
model = smf.logit('boy ~ fhisp', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692864
         Iterations 3
fhisp                  105.2   103.6 *






    Out[72]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3400466  


  Model:                Logit         Df Residuals:         3400464  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       8.006e-06 


  Time:               14:56:57        Log-Likelihood:     -2.3561e+06


  converged:            True          LL-Null:            -2.3561e+06


                                    LLR p-value:         8.137e-10 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0508      0.001     41.012   0.000      0.048     0.053


  fhisp         -0.0157      0.003     -6.142   0.000     -0.021    -0.011

Each level of father's education predicts a small increase in the probability of a boy; the effect is small but statistically significant (p<0.001).



In [73]:

    
model = smf.logit('boy ~ feduc', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692855
         Iterations 3
feduc                  103.9   104.1 *






    Out[73]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2960402  


  Model:                Logit         Df Residuals:         2960400  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       3.476e-06 


  Time:               14:57:00        Log-Likelihood:     -2.0511e+06


  converged:            True          LL-Null:            -2.0511e+06


                                    LLR p-value:         0.0001591 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0379      0.003     12.866   0.000      0.032     0.044


  feduc          0.0025      0.001      3.776   0.000      0.001     0.004

Babies with high birth order are slightly more likely to be girls.



In [74]:

    
model = smf.logit('boy ~ lbo_rec', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692885
         Iterations 3
lbo_rec                105.5   105.1 *






    Out[74]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3940854  


  Model:                Logit         Df Residuals:         3940852  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       4.164e-06 


  Time:               14:57:05        Log-Likelihood:     -2.7306e+06


  converged:            True          LL-Null:            -2.7306e+06


                                    LLR p-value:         1.855e-06 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0536      0.002     27.348   0.000      0.050     0.057


  lbo_rec       -0.0038      0.001     -4.769   0.000     -0.005    -0.002



In [75]:

    
model = smf.logit('boy ~ highbo', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692887
         Iterations 3
highbo                 104.7   103.6 *






    Out[75]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3940854  


  Model:                Logit         Df Residuals:         3940852  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       8.626e-07 


  Time:               14:57:10        Log-Likelihood:     -2.7306e+06


  converged:            True          LL-Null:            -2.7306e+06


                                    LLR p-value:          0.02997  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0460      0.001     44.570   0.000      0.044     0.048


  highbo        -0.0102      0.005     -2.171   0.030     -0.019    -0.001
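
The `P>|z|` column can be checked by hand: under the usual normal approximation it is the two-sided tail probability of the z statistic. This is a small sketch using only the standard library (Python 3); `two_sided_p` is a hypothetical helper, not something from statsmodels.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

z_highbo = -2.171   # z statistic copied from the Out[75] table
p = two_sided_p(z_highbo)
print('%.3f' % p)   # 0.030
```

With about 4 million births, even an effect this small clears the 5% threshold, so the p-value says more about sample size than about practical importance.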

Strangely, prenatal visits are associated with an increased probability of girls.



In [76]:

    
model = smf.logit('boy ~ previs', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692859
         Iterations 3
previs                 104.5   103.6 *






    Out[76]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3838678  


  Model:                Logit         Df Residuals:         3838676  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       4.565e-05 


  Time:               14:57:15        Log-Likelihood:     -2.6597e+06


  converged:            True          LL-Null:            -2.6598e+06


                                    LLR p-value:         9.364e-55 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0436      0.001     42.437   0.000      0.042     0.046


  previs        -0.0086      0.001    -15.583   0.000     -0.010    -0.007
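
The LLR p-value in the table can be roughly reconstructed from the other summary numbers. The two printed log-likelihoods are rounded to five digits, so their difference is not recoverable directly, but McFadden's pseudo R-squared (which is what statsmodels reports for logit models) carries the same information. The sketch below assumes Python 3 and uses the rounded values from the table, so it only approximates the reported 9.364e-55.

```python
import math

# Values copied (rounded) from the Out[76] table.
pseudo_r2 = 4.565e-05
ll_null = -2.6598e+06

# McFadden's pseudo R-squared is 1 - LL_model / LL_null, so the
# likelihood-ratio statistic 2 * (LL_model - LL_null) can be rewritten as:
llr = 2 * pseudo_r2 * abs(ll_null)

# With one added parameter the statistic is chi-squared with 1 degree of
# freedom, whose survival function has a closed form via erfc.
p_value = math.erfc(math.sqrt(llr / 2))

print('LLR = %.1f, p = %.2e' % (llr, p_value))
```

The closed form only works for one degree of freedom; models with more terms would need a general chi-squared survival function.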

The effect seems to be non-linear at zero, so I'm adding a boolean for no prenatal visits.



In [77]:

    
model = smf.logit('boy ~ no_previs + previs', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692856
         Iterations 3
no_previs              104.5   99.7 *
previs                 104.5   103.5 *






    Out[77]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3838678  


  Model:                Logit         Df Residuals:         3838675  


  Method:                MLE          Df Model:                  2   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       5.047e-05 


  Time:               14:57:21        Log-Likelihood:     -2.6597e+06


  converged:            True          LL-Null:            -2.6598e+06


                                    LLR p-value:         5.053e-59 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0440      0.001     42.713   0.000      0.042     0.046


  no_previs     -0.0473      0.009     -5.061   0.000     -0.066    -0.029


  previs        -0.0097      0.001    -16.347   0.000     -0.011    -0.009

If the mother received WIC benefits (a supplemental nutrition program for low-income women, infants, and children), she is more likely to have a girl.



In [78]:

    
model = smf.logit('boy ~ wic', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692878
         Iterations 3
wic[T.Y]               105.2   104.2 *






    Out[78]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3411631  


  Model:                Logit         Df Residuals:         3411629  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       3.607e-06 


  Time:               14:57:47        Log-Likelihood:     -2.3638e+06


  converged:            True          LL-Null:            -2.3639e+06


                                    LLR p-value:         3.635e-05 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0504      0.001     33.979   0.000      0.047     0.053


  wic[T.Y]      -0.0090      0.002     -4.130   0.000     -0.013    -0.005

Mother's height seems to have little predictive value: the coefficient is tiny and only marginally significant (p = 0.026).



In [79]:

    
model = smf.logit('boy ~ height', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692877
         Iterations 3
height                 99.3   99.3 *






    Out[79]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3428336  


  Model:                Logit         Df Residuals:         3428334  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       1.043e-06 


  Time:               14:57:51        Log-Likelihood:     -2.3754e+06


  converged:            True          LL-Null:            -2.3754e+06


                                    LLR p-value:          0.02598  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept     -0.0075      0.024     -0.309   0.757     -0.055     0.040


  height         0.0008      0.000      2.226   0.026      0.000     0.002



In [80]:

    
model = smf.logit('boy ~ mtall + mshort', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692876
         Iterations 3
mtall                  104.8   104.0  
mshort                 104.8   103.3 *






    Out[80]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3428336  


  Model:                Logit         Df Residuals:         3428333  


  Method:                MLE          Df Model:                  2   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       1.593e-06 


  Time:               14:57:55        Log-Likelihood:     -2.3754e+06


  converged:            True          LL-Null:            -2.3754e+06


                                    LLR p-value:          0.02272  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0472      0.001     42.200   0.000      0.045     0.049


  mtall         -0.0076      0.006     -1.249   0.212     -0.020     0.004


  mshort        -0.0145      0.006     -2.494   0.013     -0.026    -0.003

Mothers with higher BMI are more likely to have girls.



In [81]:

    
model = smf.logit('boy ~ bmi_r', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692879
         Iterations 3
bmi_r                  105.4   105.1 *






    Out[81]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3343730  


  Model:                Logit         Df Residuals:         3343728  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       1.109e-06 


  Time:               14:57:59        Log-Likelihood:     -2.3168e+06


  converged:            True          LL-Null:            -2.3168e+06


                                    LLR p-value:          0.02338  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0523      0.003     18.191   0.000      0.047     0.058


  bmi_r         -0.0021      0.001     -2.267   0.023     -0.004    -0.000



In [82]:

    
model = smf.logit('boy ~ obese', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692878
         Iterations 3
obese                  104.9   104.1 *






    Out[82]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3343730  


  Model:                Logit         Df Residuals:         3343728  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       1.833e-06 


  Time:               14:58:03        Log-Likelihood:     -2.3168e+06


  converged:            True          LL-Null:            -2.3168e+06


                                    LLR p-value:         0.003567  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0481      0.001     38.389   0.000      0.046     0.051


  obese         -0.0075      0.003     -2.914   0.004     -0.013    -0.002

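If payment was made by Medicaid (the reference category), the baby is more likely to be a girl. Private insurance, self-payment, and other payment methods are all associated with more boys, although only the private-insurance effect is statistically significant.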



In [83]:

    
model = smf.logit('boy ~ C(pay_rec)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692877
         Iterations 3
C(pay_rec)[T.2.0]      104.4   105.1 *
C(pay_rec)[T.3.0]      104.4   105.3  
C(pay_rec)[T.4.0]      104.4   104.7  






    Out[83]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3447794  


  Model:                Logit         Df Residuals:         3447790  


  Method:                MLE          Df Model:                  3   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       2.074e-06 


  Time:               14:58:29        Log-Likelihood:     -2.3889e+06


  converged:            True          LL-Null:            -2.3889e+06


                                    LLR p-value:          0.01934  




                       coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept              0.0427      0.002     26.107   0.000      0.039     0.046


  C(pay_rec)[T.2.0]      0.0067      0.002      2.944   0.003      0.002     0.011


  C(pay_rec)[T.3.0]      0.0094      0.005      1.720   0.085     -0.001     0.020


  C(pay_rec)[T.4.0]      0.0033      0.005      0.645   0.519     -0.007     0.013
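
With a categorical term like `C(pay_rec)`, the first level is absorbed into the intercept and every other coefficient is an offset from it. A quick sketch of how the four sex ratios follow from the table above (Python 3, rounded coefficients). The category labels are my assumption: they follow the order in which the payment methods are mentioned in the text, not a codebook lookup.

```python
import math

# Coefficients copied (rounded) from the Out[83] table.
intercept = 0.0427
offsets = {
    'Medicaid (reference)': 0.0,   # level 1 is absorbed into the intercept
    'private insurance': 0.0067,   # C(pay_rec)[T.2.0]
    'self-pay': 0.0094,            # C(pay_rec)[T.3.0]
    'other': 0.0033,               # C(pay_rec)[T.4.0]
}

# Labels are an assumption: they follow the order of the prose, not a codebook.
ratios = {label: 100 * math.exp(intercept + b) for label, b in offsets.items()}

for label, ratio in ratios.items():
    print('%-22s %.1f' % (label, ratio))
```

Each p-value in the table therefore tests a difference from the reference level, not a difference from the overall average.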

Adding controls

However, none of the previous results should be taken too seriously. We only tested one variable at a time, and many of these apparent effects disappear when we add control variables.

In particular, if we control for father's race and Hispanic origin, the mother's race has no additional predictive value.



In [84]:

    
formula = ('boy ~ C(fbrace) + fhisp + C(mbrace) + mhisp')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692846
         Iterations 3
C(fbrace)[T.2.0]       105.5   103.3 *
C(fbrace)[T.3.0]       105.5   104.1  
C(fbrace)[T.4.0]       105.5   107.0  
C(mbrace)[T.2]         105.5   105.7  
C(mbrace)[T.3]         105.5   106.9  
C(mbrace)[T.4]         105.5   105.6  
fhisp                  105.5   104.1 *
mhisp                  105.5   105.0  






    Out[84]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3184121  


  Model:                Logit         Df Residuals:         3184112  


  Method:                MLE          Df Model:                  8   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       1.935e-05 


  Time:               14:59:16        Log-Likelihood:     -2.2061e+06


  converged:            True          LL-Null:            -2.2061e+06


                                    LLR p-value:         3.988e-15 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0531      0.001     35.736   0.000      0.050     0.056


  C(fbrace)[T.2.0]     -0.0211      0.006     -3.688   0.000     -0.032    -0.010


  C(fbrace)[T.3.0]     -0.0125      0.013     -1.002   0.316     -0.037     0.012


  C(fbrace)[T.4.0]      0.0142      0.007      1.936   0.053     -0.000     0.029


  C(mbrace)[T.2]        0.0022      0.006      0.367   0.714     -0.010     0.014


  C(mbrace)[T.3]        0.0140      0.013      1.076   0.282     -0.012     0.040


  C(mbrace)[T.4]        0.0013      0.007      0.186   0.853     -0.012     0.015


  fhisp                -0.0132      0.004     -2.951   0.003     -0.022    -0.004


  mhisp                -0.0046      0.004     -1.045   0.296     -0.013     0.004
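
To see how an association found in a single-variable model can vanish once a control is added, here is a toy calculation with entirely made-up numbers (none of them come from the birth data). Two groups have different sex ratios and different rates of enrolling in some program; within each group, enrollment has no effect at all. Pooling the groups still produces an apparent enrollment effect.

```python
def ratio(p_boy):
    """Boys per 100 girls for a given probability of a boy."""
    return 100 * p_boy / (1 - p_boy)

# Made-up population: within a group, enrollment does not change P(boy).
groups = {
    #     births,  P(boy), share enrolled
    'A': (1000000, 0.5134, 0.2),
    'B': (1000000, 0.5074, 0.6),
}

def crude_ratio(enrolled):
    """Sex ratio among enrolled (or non-enrolled) births, pooling groups."""
    boys = births = 0.0
    for n, p_boy, share in groups.values():
        size = n * (share if enrolled else 1 - share)
        births += size
        boys += size * p_boy
    return ratio(boys / births)

print('pooled:  enrolled %.1f, not enrolled %.1f'
      % (crude_ratio(True), crude_ratio(False)))
for name, (n, p_boy, share) in groups.items():
    # Within a group the ratio is the same for both enrollment statuses.
    print('group %s: %.1f regardless of enrollment' % (name, ratio(p_boy)))
```

The pooled gap is about one point even though enrollment does nothing within either group, which is the pattern a regression with a group control would expose.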

In fact, once we control for father's race and Hispanic origin, almost every other variable becomes statistically insignificant, including acknowledged paternity.



In [85]:

    
formula = ('boy ~ C(fbrace) + fhisp + mar_p')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692837
         Iterations 3
C(fbrace)[T.2.0]       105.1   103.0 *
C(fbrace)[T.3.0]       105.1   104.0  
C(fbrace)[T.4.0]       105.1   106.6 *
mar_p[T.Y]             105.1   105.6  
fhisp                  105.1   103.3 *






    Out[85]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2798315  


  Model:                Logit         Df Residuals:         2798309  


  Method:                MLE          Df Model:                  5   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       1.968e-05 


  Time:               15:00:03        Log-Likelihood:     -1.9388e+06


  converged:            True          LL-Null:            -1.9388e+06


                                    LLR p-value:         4.935e-15 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0497      0.014      3.433   0.001      0.021     0.078


  C(fbrace)[T.2.0]     -0.0201      0.003     -5.761   0.000     -0.027    -0.013


  C(fbrace)[T.3.0]     -0.0104      0.012     -0.858   0.391     -0.034     0.013


  C(fbrace)[T.4.0]      0.0144      0.005      3.013   0.003      0.005     0.024


  mar_p[T.Y]            0.0045      0.014      0.310   0.757     -0.024     0.033


  fhisp                -0.0177      0.003     -5.694   0.000     -0.024    -0.012

Being married still predicts slightly more boys, but the effect is no longer statistically significant at the 5% level (p = 0.096).



In [86]:

    
formula = ('boy ~ C(fbrace) + fhisp + dmar')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692846
         Iterations 3
C(fbrace)[T.2.0]       104.9   102.7 *
C(fbrace)[T.3.0]       104.9   104.1  
C(fbrace)[T.4.0]       104.9   106.6 *
fhisp                  104.9   103.0 *
dmar                   104.9   105.3  






    Out[86]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3188403  


  Model:                Logit         Df Residuals:         3188397  


  Method:                MLE          Df Model:                  5   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       1.937e-05 


  Time:               15:00:29        Log-Likelihood:     -2.2091e+06


  converged:            True          LL-Null:            -2.2091e+06


                                    LLR p-value:         5.665e-17 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0478      0.003     13.880   0.000      0.041     0.055


  C(fbrace)[T.2.0]     -0.0209      0.003     -6.174   0.000     -0.028    -0.014


  C(fbrace)[T.3.0]     -0.0079      0.011     -0.728   0.467     -0.029     0.013


  C(fbrace)[T.4.0]      0.0159      0.004      3.589   0.000      0.007     0.025


  fhisp                -0.0177      0.003     -5.947   0.000     -0.024    -0.012


  dmar                  0.0043      0.003      1.667   0.096     -0.001     0.009
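
One way to read the `dmar` row is through its confidence interval, which can be rebuilt from the coefficient and the z statistic. This sketch assumes Python 3.8 or later for `statistics.NormalDist`, and it recovers the standard error from the z statistic because the printed value is rounded too coarsely to reuse.

```python
from statistics import NormalDist

coef = 0.0043   # dmar coefficient from the Out[86] table
z = 1.667       # its z statistic from the same table

# The printed std err (0.003) is rounded too coarsely to reuse, so recover
# it from the coefficient and the z statistic instead.
std_err = coef / z

# Critical value for a two-sided 95% interval (about 1.96).
z_crit = NormalDist().inv_cdf(0.975)

low = coef - z_crit * std_err
high = coef + z_crit * std_err
print('[%.3f, %.3f]' % (low, high))   # [-0.001, 0.009]
```

The interval includes zero, so at the 5% level the data are compatible with marriage having no effect once father's race and Hispanic origin are in the model.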

The effect of education disappears.



In [87]:

    
formula = ('boy ~ C(fbrace) + fhisp + lowed')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692836
         Iterations 3
C(fbrace)[T.2.0]       105.6   103.6 *
C(fbrace)[T.3.0]       105.6   104.6  
C(fbrace)[T.4.0]       105.6   107.1 *
fhisp                  105.6   103.9 *
lowed                  105.6   105.0  






    Out[87]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2777435  


  Model:                Logit         Df Residuals:         2777429  


  Method:                MLE          Df Model:                  5   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       1.992e-05 


  Time:               15:00:55        Log-Likelihood:     -1.9243e+06


  converged:            True          LL-Null:            -1.9243e+06


                                    LLR p-value:         4.189e-15 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0546      0.002     34.777   0.000      0.052     0.058


  C(fbrace)[T.2.0]     -0.0198      0.004     -5.634   0.000     -0.027    -0.013


  C(fbrace)[T.3.0]     -0.0100      0.012     -0.823   0.410     -0.034     0.014


  C(fbrace)[T.4.0]      0.0141      0.005      2.925   0.003      0.005     0.024


  fhisp                -0.0163      0.003     -4.999   0.000     -0.023    -0.010


  lowed                -0.0055      0.004     -1.471   0.141     -0.013     0.002

The effect of birth order disappears.



In [88]:

    
formula = ('boy ~ C(fbrace) + fhisp + highbo')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692847
         Iterations 3
C(fbrace)[T.2.0]       105.5   103.5 *
C(fbrace)[T.3.0]       105.5   104.7  
C(fbrace)[T.4.0]       105.5   107.1 *
fhisp                  105.5   103.8 *
highbo                 105.5   104.8  






    Out[88]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3175026  


  Model:                Logit         Df Residuals:         3175020  


  Method:                MLE          Df Model:                  5   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       1.881e-05 


  Time:               15:01:20        Log-Likelihood:     -2.1998e+06


  converged:            True          LL-Null:            -2.1998e+06


                                    LLR p-value:         2.209e-16 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0531      0.001     36.240   0.000      0.050     0.056


  C(fbrace)[T.2.0]     -0.0192      0.003     -5.879   0.000     -0.026    -0.013


  C(fbrace)[T.3.0]     -0.0074      0.011     -0.683   0.495     -0.029     0.014


  C(fbrace)[T.4.0]      0.0154      0.004      3.457   0.001      0.007     0.024


  fhisp                -0.0163      0.003     -5.586   0.000     -0.022    -0.011


  highbo               -0.0062      0.005     -1.127   0.260     -0.017     0.005

WIC is no longer associated with more girls.



In [89]:

    
formula = ('boy ~ C(fbrace) + fhisp + wic')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692838
         Iterations 3
C(fbrace)[T.2.0]       105.5   103.4 *
C(fbrace)[T.3.0]       105.5   104.7  
C(fbrace)[T.4.0]       105.5   107.1 *
wic[T.Y]               105.5   105.6  
fhisp                  105.5   103.6 *






    Out[89]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2735525  


  Model:                Logit         Df Residuals:         2735519  


  Method:                MLE          Df Model:                  5   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       2.029e-05 


  Time:               15:02:07        Log-Likelihood:     -1.8953e+06


  converged:            True          LL-Null:            -1.8953e+06


                                    LLR p-value:         3.710e-15 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0539      0.002     31.172   0.000      0.050     0.057


  C(fbrace)[T.2.0]     -0.0209      0.004     -5.723   0.000     -0.028    -0.014


  C(fbrace)[T.3.0]     -0.0078      0.012     -0.636   0.525     -0.032     0.016


  C(fbrace)[T.4.0]      0.0148      0.005      3.044   0.002      0.005     0.024


  wic[T.Y]              0.0007      0.003      0.264   0.792     -0.004     0.006


  fhisp                -0.0181      0.003     -5.484   0.000     -0.025    -0.012

The effect of obesity disappears.



In [90]:

    
formula = ('boy ~ C(fbrace) + fhisp + obese')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692838
         Iterations 3
C(fbrace)[T.2.0]       105.7   103.5 *
C(fbrace)[T.3.0]       105.7   104.2  
C(fbrace)[T.4.0]       105.7   107.2 *
fhisp                  105.7   103.9 *
obese                  105.7   105.1  






    Out[90]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2686167  


  Model:                Logit         Df Residuals:         2686161  


  Method:                MLE          Df Model:                  5   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       2.202e-05 


  Time:               15:02:31        Log-Likelihood:     -1.8611e+06


  converged:            True          LL-Null:            -1.8611e+06


                                    LLR p-value:         3.274e-16 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0552      0.002     32.697   0.000      0.052     0.059


  C(fbrace)[T.2.0]     -0.0210      0.004     -5.842   0.000     -0.028    -0.014


  C(fbrace)[T.3.0]     -0.0137      0.012     -1.109   0.267     -0.038     0.011


  C(fbrace)[T.4.0]      0.0145      0.005      2.949   0.003      0.005     0.024


  fhisp                -0.0174      0.003     -5.490   0.000     -0.024    -0.011


  obese                -0.0052      0.003     -1.770   0.077     -0.011     0.001

The effect of payment method is diminished. Self-payment is still associated with more boys, but the effect is not statistically significant (p = 0.114).



In [91]:

    
formula = ('boy ~ C(fbrace) + fhisp + C(pay_rec)')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692835
         Iterations 3
C(fbrace)[T.2.0]       105.9   103.6 *
C(fbrace)[T.3.0]       105.9   104.7  
C(fbrace)[T.4.0]       105.9   107.4 *
C(pay_rec)[T.2.0]      105.9   105.3  
C(pay_rec)[T.3.0]      105.9   107.0  
C(pay_rec)[T.4.0]      105.9   105.7  
fhisp                  105.9   103.8 *






    Out[91]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2763347  


  Model:                Logit         Df Residuals:         2763339  


  Method:                MLE          Df Model:                  7   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       2.100e-05 


  Time:               15:03:17        Log-Likelihood:     -1.9145e+06


  converged:            True          LL-Null:            -1.9146e+06


                                    LLR p-value:         1.132e-14 




                       coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept              0.0571      0.002     22.914   0.000      0.052     0.062


  C(fbrace)[T.2.0]      -0.0214      0.004     -5.920   0.000     -0.028    -0.014


  C(fbrace)[T.3.0]      -0.0113      0.012     -0.915   0.360     -0.035     0.013


  C(fbrace)[T.4.0]       0.0142      0.005      2.955   0.003      0.005     0.024


  C(pay_rec)[T.2.0]     -0.0050      0.003     -1.839   0.066     -0.010     0.000


  C(pay_rec)[T.3.0]      0.0103      0.007      1.580   0.114     -0.002     0.023


  C(pay_rec)[T.4.0]     -0.0016      0.006     -0.274   0.784     -0.013     0.010


  fhisp                 -0.0193      0.003     -5.917   0.000     -0.026    -0.013

But the number of prenatal visits is still a strong predictor of more girls.



In [92]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692809
         Iterations 3
C(fbrace)[T.2.0]       105.5   103.0 *
C(fbrace)[T.3.0]       105.5   104.1  
C(fbrace)[T.4.0]       105.5   107.0 *
fhisp                  105.5   103.4 *
previs                 105.5   104.4 *






    Out[92]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3097584  


  Model:                Logit         Df Residuals:         3097578  


  Method:                MLE          Df Model:                  5   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       7.830e-05 


  Time:               15:03:43        Log-Likelihood:     -2.1460e+06


  converged:            True          LL-Null:            -2.1462e+06


                                    LLR p-value:         1.719e-70 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0532      0.001     36.168   0.000      0.050     0.056


  C(fbrace)[T.2.0]     -0.0237      0.003     -7.129   0.000     -0.030    -0.017


  C(fbrace)[T.3.0]     -0.0129      0.011     -1.170   0.242     -0.035     0.009


  C(fbrace)[T.4.0]      0.0141      0.005      3.112   0.002      0.005     0.023


  fhisp                -0.0193      0.003     -6.533   0.000     -0.025    -0.014


  previs               -0.0103      0.001    -16.043   0.000     -0.012    -0.009

And the effect is even stronger if we add a boolean to capture the nonlinearity at 0 visits.



In [93]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692805
         Iterations 3
C(fbrace)[T.2.0]       105.5   103.1 *
C(fbrace)[T.3.0]       105.5   104.1  
C(fbrace)[T.4.0]       105.5   107.0 *
fhisp                  105.5   103.5 *
previs                 105.5   104.3 *
no_previs              105.5   99.6 *






    Out[93]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3097584  


  Model:                Logit         Df Residuals:         3097577  


  Method:                MLE          Df Model:                  6   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       8.320e-05 


  Time:               15:04:09        Log-Likelihood:     -2.1460e+06


  converged:            True          LL-Null:            -2.1462e+06


                                    LLR p-value:         4.542e-74 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0536      0.001     36.382   0.000      0.051     0.057


  C(fbrace)[T.2.0]     -0.0235      0.003     -7.087   0.000     -0.030    -0.017


  C(fbrace)[T.3.0]     -0.0131      0.011     -1.188   0.235     -0.035     0.009


  C(fbrace)[T.4.0]      0.0139      0.005      3.070   0.002      0.005     0.023


  fhisp                -0.0191      0.003     -6.468   0.000     -0.025    -0.013


  previs               -0.0113      0.001    -16.666   0.000     -0.013    -0.010


  no_previs            -0.0573      0.012     -4.587   0.000     -0.082    -0.033

More controls

Now if we control for father's race and Hispanic origin as well as number of prenatal visits, the effect of marriage disappears.



In [94]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs + dmar')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692808
         Iterations 3
C(fbrace)[T.2.0]       105.2   102.6 *
C(fbrace)[T.3.0]       105.2   103.8  
C(fbrace)[T.4.0]       105.2   106.7 *
fhisp                  105.2   103.1 *
previs                 105.2   104.1 *
dmar                   105.2   105.4  






    Out[94]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3097584  


  Model:                Logit         Df Residuals:         3097577  


  Method:                MLE          Df Model:                  6   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       7.846e-05 


  Time:               15:04:35        Log-Likelihood:     -2.1460e+06


  converged:            True          LL-Null:            -2.1462e+06


                                    LLR p-value:         1.058e-69 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0506      0.004     14.449   0.000      0.044     0.057


  C(fbrace)[T.2.0]     -0.0245      0.003     -7.072   0.000     -0.031    -0.018


  C(fbrace)[T.3.0]     -0.0136      0.011     -1.227   0.220     -0.035     0.008


  C(fbrace)[T.4.0]      0.0142      0.005      3.151   0.002      0.005     0.023


  fhisp                -0.0198      0.003     -6.561   0.000     -0.026    -0.014


  previs               -0.0103      0.001    -15.969   0.000     -0.012    -0.009


  dmar                  0.0022      0.003      0.828   0.408     -0.003     0.007

The effect of payment method disappears.



In [95]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs + C(pay_rec)')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692799
         Iterations 3
C(fbrace)[T.2.0]       105.7   103.1 *
C(fbrace)[T.3.0]       105.7   104.0  
C(fbrace)[T.4.0]       105.7   107.0 *
C(pay_rec)[T.2.0]      105.7   105.6  
C(pay_rec)[T.3.0]      105.7   105.7  
C(pay_rec)[T.4.0]      105.7   105.3  
fhisp                  105.7   103.6 *
previs                 105.7   104.6 *






    Out[95]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2679860  


  Model:                Logit         Df Residuals:         2679851  


  Method:                MLE          Df Model:                  8   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       7.905e-05 


  Time:               15:05:21        Log-Likelihood:     -1.8566e+06


  converged:            True          LL-Null:            -1.8568e+06


                                    LLR p-value:         9.714e-59 




                       coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept              0.0553      0.003     21.819   0.000      0.050     0.060


  C(fbrace)[T.2.0]      -0.0248      0.004     -6.723   0.000     -0.032    -0.018


  C(fbrace)[T.3.0]      -0.0166      0.012     -1.326   0.185     -0.041     0.008


  C(fbrace)[T.4.0]       0.0128      0.005      2.610   0.009      0.003     0.022


  C(pay_rec)[T.2.0]     -0.0012      0.003     -0.436   0.663     -0.007     0.004


  C(pay_rec)[T.3.0]   3.729e-05      0.007      0.006   0.996     -0.013     0.013


  C(pay_rec)[T.4.0]     -0.0035      0.006     -0.589   0.556     -0.015     0.008


  fhisp                 -0.0203      0.003     -6.114   0.000     -0.027    -0.014


  previs                -0.0103      0.001    -14.715   0.000     -0.012    -0.009

Here's a version with the addition of a boolean for no prenatal visits.



In [96]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692805
         Iterations 3
C(fbrace)[T.2.0]       105.5   103.1 *
C(fbrace)[T.3.0]       105.5   104.1  
C(fbrace)[T.4.0]       105.5   107.0 *
fhisp                  105.5   103.5 *
previs                 105.5   104.3 *
no_previs              105.5   99.6 *






    Out[96]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3097584  


  Model:                Logit         Df Residuals:         3097577  


  Method:                MLE          Df Model:                  6   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       8.320e-05 


  Time:               15:05:46        Log-Likelihood:     -2.1460e+06


  converged:            True          LL-Null:            -2.1462e+06


                                    LLR p-value:         4.542e-74 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0536      0.001     36.382   0.000      0.051     0.057


  C(fbrace)[T.2.0]     -0.0235      0.003     -7.087   0.000     -0.030    -0.017


  C(fbrace)[T.3.0]     -0.0131      0.011     -1.188   0.235     -0.035     0.009


  C(fbrace)[T.4.0]      0.0139      0.005      3.070   0.002      0.005     0.023


  fhisp                -0.0191      0.003     -6.468   0.000     -0.025    -0.013


  previs               -0.0113      0.001    -16.666   0.000     -0.013    -0.010


  no_previs            -0.0573      0.012     -4.587   0.000     -0.082    -0.033
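As a sanity check on the table above, the LLR p-value can be approximately recovered from the printed Pseudo R-squ., LL-Null and Df Model. This is only a sketch: it assumes the pseudo R-squared is McFadden's (1 - LL/LL-Null), it uses the rounded values as printed, and the helper `chi2_sf_even_df` is a hypothetical name for a closed form that is valid only when the degrees of freedom are even.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-squared variable (even df only)."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even df")
    half = x / 2.0
    # For df = 2k: sf(x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)**i / i!
    total = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * total

# Values copied from the summary table above (rounded as printed).
ll_null = -2.1462e6
pseudo_r2 = 8.320e-05
df_model = 6

# McFadden: pseudo_r2 = 1 - LL/LL_null, so LL - LL_null = -pseudo_r2 * LL_null.
llr_stat = -2.0 * pseudo_r2 * ll_null
p_value = chi2_sf_even_df(llr_stat, df_model)
print(llr_stat, p_value)
```

Because the inputs are rounded, the result should agree with the printed 4.542e-74 only in order of magnitude and leading digits. The point is that an R-squared this tiny can still produce a vanishingly small p-value when there are about 3 million observations.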

Now, surprisingly, the mother's age has a small negative coefficient, although it falls short of statistical significance (p = 0.116).



In [97]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs + mager9')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692805
         Iterations 3
C(fbrace)[T.2.0]       106.2   103.7 *
C(fbrace)[T.3.0]       106.2   104.8  
C(fbrace)[T.4.0]       106.2   107.8 *
fhisp                  106.2   104.2 *
previs                 106.2   105.0 *
no_previs              106.2   100.3 *
mager9                 106.2   106.1  






    Out[97]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3097584  


  Model:                Logit         Df Residuals:         3097576  


  Method:                MLE          Df Model:                  7   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       8.378e-05 


  Time:               15:06:13        Log-Likelihood:     -2.1460e+06


  converged:            True          LL-Null:            -2.1462e+06


                                    LLR p-value:         1.081e-73 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0603      0.004     13.417   0.000      0.051     0.069


  C(fbrace)[T.2.0]     -0.0241      0.003     -7.209   0.000     -0.031    -0.018


  C(fbrace)[T.3.0]     -0.0139      0.011     -1.255   0.209     -0.036     0.008


  C(fbrace)[T.4.0]      0.0144      0.005      3.176   0.001      0.006     0.023


  fhisp                -0.0196      0.003     -6.592   0.000     -0.025    -0.014


  previs               -0.0113      0.001    -16.525   0.000     -0.013    -0.010


  no_previs            -0.0571      0.012     -4.578   0.000     -0.082    -0.033


  mager9               -0.0015      0.001     -1.573   0.116     -0.003     0.000

So does the father's age (p = 0.037). But both age effects are small and at best borderline significant.



In [98]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs + fagerrec11')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692804
         Iterations 3
C(fbrace)[T.2.0]       106.4   103.8 *
C(fbrace)[T.3.0]       106.4   105.0  
C(fbrace)[T.4.0]       106.4   107.9 *
fhisp                  106.4   104.3 *
previs                 106.4   105.2 *
no_previs              106.4   100.4 *
fagerrec11             106.4   106.2 *






    Out[98]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3088740  


  Model:                Logit         Df Residuals:         3088732  


  Method:                MLE          Df Model:                  7   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       8.510e-05 


  Time:               15:06:39        Log-Likelihood:     -2.1399e+06


  converged:            True          LL-Null:            -2.1401e+06


                                    LLR p-value:         1.099e-74 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0620      0.004     14.546   0.000      0.054     0.070


  C(fbrace)[T.2.0]     -0.0243      0.003     -7.284   0.000     -0.031    -0.018


  C(fbrace)[T.3.0]     -0.0137      0.011     -1.236   0.217     -0.035     0.008


  C(fbrace)[T.4.0]      0.0143      0.005      3.143   0.002      0.005     0.023


  fhisp                -0.0197      0.003     -6.622   0.000     -0.026    -0.014


  previs               -0.0113      0.001    -16.639   0.000     -0.013    -0.010


  no_previs            -0.0581      0.013     -4.637   0.000     -0.083    -0.034


  fagerrec11           -0.0017      0.001     -2.082   0.037     -0.003    -0.000
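To see how borderline the two age results are, the P>|z| column can be recomputed from the z statistics alone. This is a minimal sketch, assuming the reported p-values are two-sided tail probabilities of a standard normal; `two_sided_p` is a hypothetical helper name, and the z values are copied from the tables above.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal z statistic."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2.0))

p_mother = two_sided_p(-1.573)  # z for mager9
p_father = two_sided_p(-2.082)  # z for fagerrec11
print(round(p_mother, 3), round(p_father, 3))
```

The father's age coefficient lands just under the conventional 0.05 threshold and the mother's lands above it.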

What's up with prenatal visits?

The predictive power of prenatal visits is still surprising to me. To make sure we've controlled for race, I'll select cases where both parents are white:



In [99]:

    
white = df[(df.mbrace==1) & (df.fbrace==1)]
len(white)









    Out[99]:





2381977

Then compute the sex ratio for each level of previs:



In [100]:

    
var = 'previs'
white[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

The effect holds up. People with fewer than average prenatal visits are substantially more likely to have boys.
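For readers without the data, here is a dependency-free sketch of what a grouped sex-ratio computation like the one above does. The function `boys_per_100_girls` is a hypothetical stand-in, not the notebook's `series_to_ratio`, and the toy records are invented for illustration.

```python
from collections import defaultdict

def boys_per_100_girls(records):
    """Group (level, is_boy) pairs by level; return boys per 100 girls."""
    counts = defaultdict(lambda: [0, 0])  # level -> [boys, girls]
    for level, is_boy in records:
        counts[level][0 if is_boy else 1] += 1
    ratios = {}
    for level, (boys, girls) in sorted(counts.items()):
        # A level with no girls has an undefined ratio; report NaN.
        ratios[level] = 100.0 * boys / girls if girls else float('nan')
    return ratios

# Toy data, invented for illustration only (not from the birth records).
toy = [(0, True), (0, True), (0, False), (1, True), (1, False)]
ratios = boys_per_100_girls(toy)
print(ratios)
```

With real data, levels that contain few births will have noisy ratios, so it helps to look at group sizes alongside the ratios.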



In [101]:

    
formula = ('boy ~ previs + no_previs')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692804
         Iterations 3
previs                 105.1   103.8 *
no_previs              105.1   98.9 *






    Out[101]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2320227  


  Model:                Logit         Df Residuals:         2320224  


  Method:                MLE          Df Model:                  2   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       6.584e-05 


  Time:               15:06:43        Log-Likelihood:     -1.6075e+06


  converged:            True          LL-Null:            -1.6076e+06


                                    LLR p-value:         1.073e-46 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0493      0.001     37.359   0.000      0.047     0.052


  previs        -0.0116      0.001    -14.535   0.000     -0.013    -0.010


  no_previs     -0.0608      0.015     -3.966   0.000     -0.091    -0.031
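The two numeric columns printed by `summarize` appear consistent with exponentiating the coefficients: the first is the baseline sex ratio from the intercept, and the second is the ratio after adding one coefficient. The sketch below is my reconstruction of that arithmetic from the rounded table values, not the code of `summarize` itself; `ratio_columns` is a hypothetical name.

```python
import math

def ratio_columns(intercept, coef):
    """Return (baseline, shifted) sex ratios in boys per 100 girls."""
    baseline = 100.0 * math.exp(intercept)
    shifted = 100.0 * math.exp(intercept + coef)
    return baseline, shifted

# Coefficients copied from the table above (rounded as printed).
intercept = 0.0493
columns = {}
for name, coef in [('previs', -0.0116), ('no_previs', -0.0608)]:
    columns[name] = ratio_columns(intercept, coef)
    print('%-12s %6.1f %6.1f' % ((name,) + columns[name]))
```

The printed values can be compared with the `previs` and `no_previs` rows in the output of cell 101.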



In [102]:

    
inter = results.params['Intercept']
slope = results.params['previs']
inter, slope









    Out[102]:





(0.04929183382635937, -0.011584489975776435)



In [103]:

    
previs = np.arange(-5, 5)
logodds = inter + slope * previs
odds = np.exp(logodds)
odds * 100









    Out[103]:





array([ 111.31727637,  110.03516315,  108.76781686,  107.51506742,
        106.2767467 ,  105.05268853,  103.84272863,  102.64670462,
        101.46445599,  100.29582409])
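Odds are easy to over-read, so it may help to express the same sweep as the probability that a birth is a boy. This sketch reuses the intercept and slope from Out[102] and the same offsets as the cell above; `prob_boy` is a hypothetical helper.

```python
import math

def prob_boy(log_odds):
    """Convert log-odds of a male birth to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

inter = 0.04929183382635937
slope = -0.011584489975776435

# Same offsets as np.arange(-5, 5) in the cell above.
probs = [prob_boy(inter + slope * x) for x in range(-5, 5)]
for x, p in zip(range(-5, 5), probs):
    print('%3d  %.4f' % (x, p))
```

Across the whole range, the fitted probability moves from roughly 0.527 to roughly 0.501. That is a shift of a few percentage points, even though the odds fall from about 111 to about 100 boys per 100 girls.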



In [104]:

    
formula = ('boy ~ dmar')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692845
         Iterations 3
dmar                   105.2   105.1  






    Out[104]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2381977  


  Model:                Logit         Df Residuals:         2381975  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       3.675e-08 


  Time:               15:06:46        Log-Likelihood:     -1.6503e+06


  converged:            True          LL-Null:            -1.6503e+06


                                    LLR p-value:          0.7276   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0505      0.004     12.847   0.000      0.043     0.058


  dmar          -0.0010      0.003     -0.348   0.728     -0.007     0.005



In [105]:

    
formula = ('boy ~ lowed')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692830
         Iterations 3
lowed                  105.3   103.9 *






    Out[105]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2089901  


  Model:                Logit         Df Residuals:         2089899  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       4.146e-06 


  Time:               15:06:48        Log-Likelihood:     -1.4479e+06


  converged:            True          LL-Null:            -1.4480e+06


                                    LLR p-value:         0.0005303 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0520      0.001     35.035   0.000      0.049     0.055


  lowed         -0.0142      0.004     -3.465   0.001     -0.022    -0.006



In [106]:

    
formula = ('boy ~ highbo')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692845
         Iterations 3
highbo                 105.1   104.1  






    Out[106]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2373894  


  Model:                Logit         Df Residuals:         2373892  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       6.498e-07 


  Time:               15:06:50        Log-Likelihood:     -1.6447e+06


  converged:            True          LL-Null:            -1.6447e+06


                                    LLR p-value:          0.1437   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0496      0.001     37.359   0.000      0.047     0.052


  highbo        -0.0095      0.006     -1.462   0.144     -0.022     0.003



In [107]:

    
formula = ('boy ~ wic')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692836
         Iterations 3
wic[T.Y]               105.3   104.8  






    Out[107]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2059437  


  Model:                Logit         Df Residuals:         2059435  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       1.267e-06 


  Time:               15:07:06        Log-Likelihood:     -1.4269e+06


  converged:            True          LL-Null:            -1.4269e+06


                                    LLR p-value:          0.05720  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0519      0.002     29.448   0.000      0.048     0.055


  wic[T.Y]      -0.0055      0.003     -1.902   0.057     -0.011     0.000



In [108]:

    
formula = ('boy ~ obese')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692834
         Iterations 3
obese                  105.2   104.8  






    Out[108]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2029161  


  Model:                Logit         Df Residuals:         2029159  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       4.153e-07 


  Time:               15:07:08        Log-Likelihood:     -1.4059e+06


  converged:            True          LL-Null:            -1.4059e+06


                                    LLR p-value:          0.2798   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0509      0.002     31.979   0.000      0.048     0.054


  obese         -0.0037      0.003     -1.081   0.280     -0.010     0.003



In [109]:

    
formula = ('boy ~ C(pay_rec)')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692834
         Iterations 3
C(pay_rec)[T.2.0]      105.0   105.2  
C(pay_rec)[T.3.0]      105.0   105.8  
C(pay_rec)[T.4.0]      105.0   104.8  






    Out[109]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2077652  


  Model:                Logit         Df Residuals:         2077648  


  Method:                MLE          Df Model:                  3   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       5.425e-07 


  Time:               15:07:23        Log-Likelihood:     -1.4395e+06


  converged:            True          LL-Null:            -1.4395e+06


                                    LLR p-value:          0.6681   




                       coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept              0.0486      0.002     20.446   0.000      0.044     0.053


  C(pay_rec)[T.2.0]      0.0021      0.003      0.684   0.494     -0.004     0.008


  C(pay_rec)[T.3.0]      0.0076      0.007      1.036   0.300     -0.007     0.022


  C(pay_rec)[T.4.0]     -0.0020      0.007     -0.296   0.767     -0.015     0.011



In [110]:

    
formula = ('boy ~ mager9')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692845
         Iterations 3
mager9                 105.8   105.6  






    Out[110]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2381977  


  Model:                Logit         Df Residuals:         2381975  


  Method:                MLE          Df Model:                  1   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       6.201e-07 


  Time:               15:07:27        Log-Likelihood:     -1.6503e+06


  converged:            True          LL-Null:            -1.6503e+06


                                    LLR p-value:          0.1525   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0559      0.005     11.397   0.000      0.046     0.066


  mager9        -0.0016      0.001     -1.431   0.153     -0.004     0.001



In [111]:

    
formula = ('boy ~ youngm + oldm')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692844
         Iterations 3
youngm[T.True]         105.0   106.0  
oldm[T.True]           105.0   104.9  






    Out[111]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2381977  


  Model:                Logit         Df Residuals:         2381974  


  Method:                MLE          Df Model:                  2   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       9.503e-07 


  Time:               15:07:30        Log-Likelihood:     -1.6503e+06


  converged:            True          LL-Null:            -1.6503e+06


                                    LLR p-value:          0.2084   




                    coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept           0.0486      0.001     35.884   0.000      0.046     0.051


  youngm[T.True]      0.0101      0.006      1.766   0.077     -0.001     0.021


  oldm[T.True]       -0.0004      0.008     -0.055   0.956     -0.015     0.014



In [112]:

    
formula = ('boy ~ youngf + oldf')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692843
         Iterations 3
youngf                 105.1   105.6  
oldf                   105.1   104.0  






    Out[112]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2376438  


  Model:                Logit         Df Residuals:         2376435  


  Method:                MLE          Df Model:                  2   


  Date:           Wed, 18 May 2016    Pseudo R-squ.:       7.327e-07 


  Time:               15:07:34        Log-Likelihood:     -1.6465e+06


  converged:            True          LL-Null:            -1.6465e+06


                                    LLR p-value:          0.2993   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0495      0.001     37.030   0.000      0.047     0.052


  youngf         0.0053      0.008      0.652   0.514     -0.011     0.021


  oldf          -0.0107      0.008     -1.390   0.164     -0.026     0.004


