Does Trivers-Willard apply to people?

This notebook contains a "one-day paper", my attempt to pose a research question, answer it, and publish the results in one work day.

MIT License: https://opensource.org/licenses/MIT



In [1]:

    
from __future__ import print_function, division

import thinkstats2
import thinkplot

import pandas as pd
import numpy as np

import statsmodels.formula.api as smf

%matplotlib inline

Trivers-Willard

According to Wikipedia, the Trivers-Willard hypothesis:

"...suggests that female mammals are able to adjust offspring sex ratio in response to their maternal condition. For example, it may predict greater parental investment in males by parents in 'good conditions' and greater investment in females by parents in 'poor conditions' (relative to parents in good condition)."

For humans, the hypothesis suggests that people with relatively high social status might be more likely to have boys. Some studies have shown evidence for this hypothesis, but based on my very casual survey, the evidence is not persuasive.

To test whether the T-W hypothesis holds up in humans, I downloaded birth data for the nearly 4 million babies born in the U.S. in 2014.

I selected variables that seemed likely to be related to social status and used logistic regression to identify variables associated with sex ratio.

Summary of results

Running regression with one variable at a time, many of the variables have a statistically significant effect on sex ratio, with the sign of the effect generally in the direction predicted by T-W.
However, many of the variables are also correlated with race. If we control for either the mother's race or the father's race, or both, most other variables have no additional predictive power.
Contrary to other reports, the age of the parents seems to have no predictive power.
Strangely, the variable that shows the strongest and most consistent relationship with sex ratio is the number of prenatal visits. Although it seems obvious that prenatal visits are a proxy for quality of health care and general socioeconomic status, the sign of the effect is the opposite of what T-W predicts; that is, a higher number of prenatal visits is a strong predictor of lower sex ratio (more girls).

Following convention, I report sex ratio in terms of boys per 100 girls. The overall sex ratio at birth is about 105; that is, 105 boys are born for every 100 girls.

Data cleaning

Here's how I loaded the data:



In [2]:

    
names = ['year', 'mager9', 'mnativ', 'restatus', 'mbrace', 'mhisp_r',
        'mar_p', 'dmar', 'meduc', 'fagerrec11', 'fbrace', 'fhisp_r', 'feduc', 
        'lbo_rec', 'previs_rec', 'wic', 'height', 'bmi_r', 'pay_rec', 'sex']
colspecs = [(9, 12),
            (79, 79),
            (84, 84),
            (104, 104),
            (110, 110),
            (115, 115),
            (119, 119),
            (120, 120),
            (124, 124),
            (149, 150),
            (156, 156),
            (160, 160),
            (163, 163),
            (179, 179),
            (242, 243),
            (251, 251),
            (280, 281),
            (287, 287),
            (436, 436),
            (475, 475),
           ]

# Convert 1-based inclusive positions to the 0-based half-open intervals read_fwf expects
colspecs = [(start-1, end) for start, end in colspecs]
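The list comprehension above appears to convert 1-based, inclusive column positions into the 0-based, half-open intervals that Python slicing and `pd.read_fwf` use. Here is a minimal sketch of that conversion using plain string slicing. The record and the field positions are made up for illustration and are not taken from the natality file.

```python
# Hypothetical fixed-width record: year in columns 1-4, sex in column 7
# (1-based, inclusive). These positions are illustrative only.
record = "2014xxM"
specs_1based = {"year": (1, 4), "sex": (7, 7)}

# Shift only the start: (start, end) inclusive -> [start-1, end) half-open.
specs_0based = {name: (start - 1, end)
                for name, (start, end) in specs_1based.items()}

# Slice each field out of the record using the converted intervals.
fields = {name: record[lo:hi] for name, (lo, hi) in specs_0based.items()}
print(fields)
```

A one-character field such as `(7, 7)` becomes `(6, 7)`, which is why the end position is left unchanged.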



In [3]:

    
df = None



In [4]:

    
filename = 'Nat2014PublicUS.c20150514.r20151022.txt.gz'
#df = pd.read_fwf(filename, compression='gzip', header=None, names=names, colspecs=colspecs)
#df.head()



In [5]:

    
# store the dataframe for faster loading

#store = pd.HDFStore('store.h5')
#store['births2014'] = df
#store.close()



In [6]:

    
# load the dataframe

store = pd.HDFStore('store.h5')
df = store['births2014']
store.close()



In [7]:

    
def series_to_ratio(series):
    """Takes a boolean series and computes sex ratio.
    """
    boys = np.mean(series)
    return np.round(100 * boys / (1-boys)).astype(int)

I have to recode sex as 0 or 1 to make logit happy.



In [8]:

    
df['boy'] = (df.sex=='M').astype(int)
df.boy.value_counts().sort_index()









    Out[8]:





0    1952273
1    2045902
Name: boy, dtype: int64
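As a quick check, the overall sex ratio can be computed by hand from these two counts. This sketch applies the same formula as `series_to_ratio`, without the final rounding to an integer.

```python
# Counts copied from the value_counts output above.
girls = 1952273
boys = 2045902

p_boy = boys / (boys + girls)        # fraction of births that are boys
ratio = 100 * p_boy / (1 - p_boy)    # boys per 100 girls

print(round(p_boy, 4), round(ratio, 1))
```

Because `p / (1 - p)` is the odds of a boy, the result is the same as `100 * boys / girls`.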

All births are from 2014.



In [9]:

    
df.year.value_counts().sort_index()









    Out[9]:





2014    3998175
Name: year, dtype: int64

Mother's age:



In [10]:

    
df.mager9.value_counts().sort_index()









    Out[10]:





1       2777
2     249581
3     884246
4    1148469
5    1084064
6     510214
7     110318
8       7750
9        756
Name: mager9, dtype: int64



In [11]:

    
var = 'mager9'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [12]:

    
df.mager9.isnull().mean()









    Out[12]:





0.0



In [13]:

    
df['youngm'] = df.mager9<=2
df['oldm'] = df.mager9>=7
df.youngm.mean(), df.oldm.mean()









    Out[13]:





(0.06311829772333627, 0.029719559549044251)
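These two fractions can be reproduced from the `mager9` counts shown earlier, which is a cheap way to confirm that the cutoffs select the intended categories. A small sketch in plain Python:

```python
# Counts per mager9 category, copied from the value_counts output above.
counts = {1: 2777, 2: 249581, 3: 884246, 4: 1148469, 5: 1084064,
          6: 510214, 7: 110318, 8: 7750, 9: 756}
total = sum(counts.values())

# Same cutoffs as youngm (<= 2) and oldm (>= 7).
young = sum(n for code, n in counts.items() if code <= 2) / total
old = sum(n for code, n in counts.items() if code >= 7) / total

print(total, young, old)
```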

Mother's nativity (1 = born in the U.S.)



In [14]:

    
df.mnativ.replace([3], np.nan, inplace=True)
df.mnativ.value_counts().sort_index()









    Out[14]:





1    3106689
2     881662
Name: mnativ, dtype: int64



In [15]:

    
var = 'mnativ'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Residence status (1=resident)



In [16]:

    
df.restatus.value_counts().sort_index()









    Out[16]:





1    2873404
2    1025766
3      88906
4      10099
Name: restatus, dtype: int64



In [17]:

    
var = 'restatus'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Mother's race (1=White, 2=Black, 3=American Indian or Alaskan Native, 4=Asian or Pacific Islander)



In [18]:

    
df.mbrace.value_counts().sort_index()









    Out[18]:





1    3029013
2     641089
3      44962
4     283111
Name: mbrace, dtype: int64



In [19]:

    
var = 'mbrace'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Mother's Hispanic origin (0=Non-Hispanic)



In [20]:

    
df.mhisp_r.replace([9], np.nan, inplace=True)
df.mhisp_r.value_counts().sort_index()









    Out[20]:





0    3045419
1     553738
2      69894
3      20165
4     136785
5     141497
Name: mhisp_r, dtype: int64



In [21]:

    
def copy_null(df, oldvar, newvar):
    """Sets newvar to NaN wherever oldvar is missing."""
    df.loc[df[oldvar].isnull(), newvar] = np.nan



In [22]:

    
df['mhisp'] = df.mhisp_r > 0
copy_null(df, 'mhisp_r', 'mhisp')
df.mhisp.isnull().mean(), df.mhisp.mean()









    Out[22]:





(0.0076727506925034546, 0.23240818268843488)
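The call to `copy_null` matters because a comparison with NaN evaluates to False, so `df.mhisp_r > 0` on its own would count births with unknown Hispanic origin as non-Hispanic. The sketch below uses the counts shown above to compare the two ways of computing the fraction. It is plain arithmetic, not the pandas code itself.

```python
total = 3998175          # all births
known = 3967498          # births with a stated Hispanic origin (sum of Out[20])
non_hispanic = 3045419   # code 0 in Out[20]
hispanic = known - non_hispanic

missing_frac = (total - known) / total
masked_mean = hispanic / known      # missing rows excluded, as after copy_null
unmasked_mean = hispanic / total    # missing rows silently counted as False

print(missing_frac, masked_mean, unmasked_mean)
```

The first two values match the output above. The third is what the mean would be if the missing values were left as False.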



In [23]:

    
var = 'mhisp'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Marital status (1=Married)



In [24]:

    
df.dmar.value_counts().sort_index()









    Out[24]:





1    2390630
2    1607545
Name: dmar, dtype: int64



In [25]:

    
var = 'dmar'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Paternity acknowledged, if unmarried (Y=yes, N=no, X=not applicable, U=unknown).

I recode X (not applicable because married) as Y (paternity acknowledged).



In [26]:

    
df.mar_p.replace(['U'], np.nan, inplace=True)
df.mar_p.replace(['X'], 'Y', inplace=True)
df.mar_p.value_counts().sort_index()









    Out[26]:





N     462627
Y    3386542
Name: mar_p, dtype: int64



In [27]:

    
var = 'mar_p'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Mother's education level



In [28]:

    
df.meduc.replace([9], np.nan, inplace=True)
df.meduc.value_counts().sort_index()









    Out[28]:





1    138589
2    437081
3    957265
4    815688
5    308384
6    732661
7    326800
8     94057
Name: meduc, dtype: int64



In [29]:

    
var = 'meduc'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [30]:

    
df['lowed'] = df.meduc <= 2
copy_null(df, 'meduc', 'lowed')
df.lowed.isnull().mean(), df.lowed.mean()









    Out[30]:





(0.046933913598079122, 0.15107367095085322)

Father's age, in 10 ranges



In [31]:

    
df.fagerrec11.replace([11], np.nan, inplace=True)
df.fagerrec11.value_counts().sort_index()









    Out[31]:





1         277
2       84852
3      498779
4      869280
5     1025631
6      631685
7      262169
8       87432
9       28465
10      12490
Name: fagerrec11, dtype: int64



In [32]:

    
var = 'fagerrec11'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [33]:

    
df['youngf'] = df.fagerrec11<=2
copy_null(df, 'fagerrec11', 'youngf')
df.youngf.isnull().mean(), df.youngf.mean()









    Out[33]:





(0.12433547806186572, 0.024315207394332003)



In [34]:

    
df['oldf'] = df.fagerrec11>=8
copy_null(df, 'fagerrec11', 'oldf')
df.oldf.isnull().mean(), df.oldf.mean()









    Out[34]:





(0.12433547806186572, 0.036670893957829916)

Father's race



In [35]:

    
df.fbrace.replace([9], np.nan, inplace=True)
df.fbrace.value_counts().sort_index()









    Out[35]:





1    2497901
2     482433
3      35408
4     238394
Name: fbrace, dtype: int64



In [36]:

    
var = 'fbrace'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Father's Hispanic origin (0=non-hispanic, other values indicate country of origin)



In [37]:

    
df.fhisp_r.replace([9], np.nan, inplace=True)
df.fhisp_r.value_counts().sort_index()









    Out[37]:





0    2649007
1     493497
2      59137
3      19128
4     108111
5     124172
Name: fhisp_r, dtype: int64



In [38]:

    
df['fhisp'] = df.fhisp_r > 0
copy_null(df, 'fhisp_r', 'fhisp')
df.fhisp.isnull().mean(), df.fhisp.mean()









    Out[38]:





(0.13634295647389122, 0.23285053338322156)



In [39]:

    
var = 'fhisp'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Father's education level



In [40]:

    
df.feduc.replace([9], np.nan, inplace=True)
df.feduc.value_counts().sort_index()









    Out[40]:





1    141654
2    342061
3    951980
4    643118
5    232622
6    616187
7    242022
8    109482
Name: feduc, dtype: int64



In [41]:

    
var = 'feduc'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Live birth order.



In [42]:

    
df.lbo_rec.replace([9], np.nan, inplace=True)
df.lbo_rec.value_counts().sort_index()









    Out[42]:





1    1555006
2    1270496
3     669016
4     284435
5     110708
6      46093
7      20786
8      21610
Name: lbo_rec, dtype: int64



In [43]:

    
var = 'lbo_rec'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [44]:

    
df['highbo'] = df.lbo_rec >= 5
copy_null(df, 'lbo_rec', 'highbo')
df.highbo.isnull().mean(), df.highbo.mean()









    Out[44]:





(0.0050085351441595226, 0.050072772519889897)

Number of prenatal visits, in 11 ranges



In [45]:

    
df.previs_rec.replace([12], np.nan, inplace=True)
df.previs_rec.value_counts().sort_index()









    Out[45]:





1      59670
2      44923
3      98141
4     201032
5     366887
6     826908
7     998330
8     684997
9     379305
10     99067
11    128805
Name: previs_rec, dtype: int64



In [46]:

    
df.previs_rec.mean()
df['previs'] = df.previs_rec - 7



In [47]:

    
var = 'previs'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [48]:

    
df['no_previs'] = df.previs_rec <= 1
copy_null(df, 'previs_rec', 'no_previs')
df.no_previs.isnull().mean(), df.no_previs.mean()









    Out[48]:





(0.027540065154726845, 0.015346965650008423)

Whether the mother received WIC food assistance (the Special Supplemental Nutrition Program for Women, Infants, and Children)



In [49]:

    
df.wic.replace(['U'], np.nan, inplace=True)
df.wic.value_counts().sort_index()









    Out[49]:





N    2124143
Y    1634978
Name: wic, dtype: int64



In [50]:

    
var = 'wic'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Mother's height in inches



In [51]:

    
df.height.replace([99], np.nan, inplace=True)
df.height.value_counts().sort_index()









    Out[51]:





30        28
31         1
34         2
36        14
37         7
38         7
39         7
40         6
41        10
42        13
43         3
44         8
45        11
46        14
47        22
48       857
49       544
50       357
51       422
52       493
53      1503
54      1414
55      2762
56      6678
57     18359
58     21019
59     81588
60    209490
61    269142
62    474306
63    485840
64    559249
65    453503
66    429253
67    334485
68    189690
69    127789
70     62364
71     33428
72     15323
73      5200
74      2538
75      1019
76       590
77       593
78       941
Name: height, dtype: int64



In [52]:

    
df['mshort'] = df.height<60
copy_null(df, 'height', 'mshort')
df.mshort.isnull().mean(), df.mshort.mean()









    Out[52]:





(0.051844404009329256, 0.0359147662344377)



In [53]:

    
df['mtall'] = df.height>=70
copy_null(df, 'height', 'mtall')
df.mtall.isnull().mean(), df.mtall.mean()









    Out[53]:





(0.051844404009329256, 0.03218134412692316)



In [54]:

    
var = 'mshort'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [55]:

    
var = 'mtall'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Mother's BMI in 6 ranges



In [56]:

    
df.bmi_r.replace([9], np.nan, inplace=True)
df.bmi_r.value_counts().sort_index()









    Out[56]:





1     140142
2    1702519
3     949075
4     506017
5     242957
6     168515
Name: bmi_r, dtype: int64



In [57]:

    
var = 'bmi_r'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)



In [58]:

    
df['obese'] = df.bmi_r >= 4
copy_null(df, 'bmi_r', 'obese')
df.obese.isnull().mean(), df.obese.mean()









    Out[58]:





(0.07227047340349034, 0.2473532880857861)

Payment method (1=Medicaid, 2=Private insurance, 3=Self pay, 4=Other)



In [59]:

    
df.pay_rec.replace([9], np.nan, inplace=True)
df.pay_rec.value_counts().sort_index()









    Out[59]:





1    1665161
2    1824151
3     162650
4     167806
Name: pay_rec, dtype: int64



In [60]:

    
var = 'pay_rec'
df[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

Sex of baby



In [61]:

    
df.sex.value_counts().sort_index()









    Out[61]:





F    1952273
M    2045902
Name: sex, dtype: int64

Regression models

Here are some functions I'll use to interpret the results of logistic regression:



In [62]:

    
def logodds_to_ratio(logodds):
    """Convert from log odds of a boy to sex ratio (boys per 100 girls)."""
    odds = np.exp(logodds)
    return 100 * odds

def summarize(results):
    """Summarize parameters in terms of birth ratio."""
    inter_or = results.params['Intercept']
    inter_rat = logodds_to_ratio(inter_or)
    
    for value, lor in results.params.items():
        if value=='Intercept':
            continue
        
        rat = logodds_to_ratio(inter_or + lor)
        code = '*' if results.pvalues[value] < 0.05 else ' '
        
        print('%-20s   %0.1f   %0.1f' % (value, inter_rat, rat), code)
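To see what `summarize` prints, it helps to work one case by hand. The odds of a boy are p / (1 - p), which is the number of boys per girl, so exponentiating the log odds and multiplying by 100 gives boys per 100 girls. The sketch below uses the rounded coefficients reported later in this notebook for the mother's race model. It reimplements the conversion with the standard library so that it runs on its own.

```python
import math

def logodds_to_ratio(logodds):
    """Convert log odds of a boy to boys per 100 girls."""
    return 100 * math.exp(logodds)

intercept = 0.0497      # baseline log odds (reference category, mbrace == 1)
coef_black = -0.0214    # coefficient for C(mbrace)[T.2]

# Ratio for the reference group, then for the group with the coefficient added.
baseline = logodds_to_ratio(intercept)
black = logodds_to_ratio(intercept + coef_black)

print(round(baseline, 1), round(black, 1))
```

These are the two columns `summarize` prints for each parameter: the ratio for the reference group, then the ratio with that parameter's coefficient added to the intercept.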

Now I'll run models with each variable, one at a time.

Mother's age seems to have no predictive value:



In [63]:

    
model = smf.logit('boy ~ mager9', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692873
         Iterations 3
mager9                 105.1   105.0  






    Out[63]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3998175  


  Model:                Logit         Df Residuals:         3998173  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.129e-07 


  Time:               14:18:28        Log-Likelihood:     -2.7702e+06


  converged:            True          LL-Null:            -2.7702e+06


                                    LLR p-value:          0.4290   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0496      0.004     13.550   0.000      0.042     0.057


  mager9        -0.0007      0.001     -0.791   0.429     -0.002     0.001

The estimated ratio for young mothers is higher, and the ratio for older mothers is lower, but neither difference is statistically significant.



In [64]:

    
model = smf.logit('boy ~ youngm + oldm', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692873
         Iterations 3
youngm[T.True]         104.8   104.9  
oldm[T.True]           104.8   103.9  






    Out[64]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3998175  


  Model:                Logit         Df Residuals:         3998172  


  Method:                MLE          Df Model:                  2   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       3.813e-07 


  Time:               14:18:33        Log-Likelihood:     -2.7702e+06


  converged:            True          LL-Null:            -2.7702e+06


                                    LLR p-value:          0.3478   




                    coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept           0.0470      0.001     44.772   0.000      0.045     0.049


  youngm[T.True]      0.0010      0.004      0.240   0.810     -0.007     0.009


  oldm[T.True]       -0.0084      0.006     -1.421   0.155     -0.020     0.003

Whether the mother was born in the U.S. has no predictive value.



In [65]:

    
model = smf.logit('boy ~ C(mnativ)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692873
         Iterations 3
C(mnativ)[T.2.0]       104.8   104.9  






    Out[65]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3988351  


  Model:                Logit         Df Residuals:         3988349  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       4.566e-08 


  Time:               14:19:00        Log-Likelihood:     -2.7634e+06


  converged:            True          LL-Null:            -2.7634e+06


                                    LLR p-value:          0.6154   




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0466      0.001     41.050   0.000      0.044     0.049


  C(mnativ)[T.2.0]      0.0012      0.002      0.502   0.615     -0.004     0.006

Neither does residence status.



In [66]:

    
model = smf.logit('boy ~ C(restatus)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692872
         Iterations 3
C(restatus)[T.2]       104.8   104.7  
C(restatus)[T.3]       104.8   106.0  
C(restatus)[T.4]       104.8   106.2  






    Out[66]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3998175  


  Model:                Logit         Df Residuals:         3998171  


  Method:                MLE          Df Model:                  3   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       6.716e-07 


  Time:               14:19:28        Log-Likelihood:     -2.7702e+06


  converged:            True          LL-Null:            -2.7702e+06


                                    LLR p-value:          0.2932   




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0468      0.001     39.653   0.000      0.044     0.049


  C(restatus)[T.2]     -0.0010      0.002     -0.418   0.676     -0.005     0.004


  C(restatus)[T.3]      0.0117      0.007      1.718   0.086     -0.002     0.025


  C(restatus)[T.4]      0.0132      0.020      0.663   0.507     -0.026     0.052

Mother's race seems to have predictive value. Relative to whites, black and Native American mothers have more girls; Asians have more boys.



In [67]:

    
model = smf.logit('boy ~ C(mbrace)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692863
         Iterations 3
C(mbrace)[T.2]         105.1   102.9 *
C(mbrace)[T.3]         105.1   103.1 *
C(mbrace)[T.4]         105.1   106.3 *






    Out[67]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3998175  


  Model:                Logit         Df Residuals:         3998171  


  Method:                MLE          Df Model:                  3   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.401e-05 


  Time:               14:19:55        Log-Likelihood:     -2.7702e+06


  converged:            True          LL-Null:            -2.7702e+06


                                    LLR p-value:         1.007e-16 




                    coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept           0.0497      0.001     43.250   0.000      0.047     0.052


  C(mbrace)[T.2]     -0.0214      0.003     -7.770   0.000     -0.027    -0.016


  C(mbrace)[T.3]     -0.0195      0.010     -2.049   0.041     -0.038    -0.001


  C(mbrace)[T.4]      0.0109      0.004      2.777   0.005      0.003     0.019

Hispanic mothers have more girls.



In [68]:

    
model = smf.logit('boy ~ mhisp', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692874
         Iterations 3
mhisp                  105.0   104.1 *






    Out[68]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3967498  


  Model:                Logit         Df Residuals:         3967496  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.998e-06 


  Time:               14:19:59        Log-Likelihood:     -2.7490e+06


  converged:            True          LL-Null:            -2.7490e+06


                                    LLR p-value:         0.0009174 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0485      0.001     42.263   0.000      0.046     0.051


  mhisp         -0.0079      0.002     -3.315   0.001     -0.013    -0.003

If the mother is married, or unmarried but paternity is acknowledged, the sex ratio is higher (more boys).



In [69]:

    
model = smf.logit('boy ~ C(mar_p)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692864
         Iterations 3
C(mar_p)[T.Y]          102.8   105.1 *






    Out[69]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3849169  


  Model:                Logit         Df Residuals:         3849167  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       9.129e-06 


  Time:               14:20:27        Log-Likelihood:     -2.6670e+06


  converged:            True          LL-Null:            -2.6670e+06


                                    LLR p-value:         2.990e-12 




                   coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept          0.0278      0.003      9.446   0.000      0.022     0.034


  C(mar_p)[T.Y]      0.0219      0.003      6.978   0.000      0.016     0.028

Being unmarried predicts more girls.



In [70]:

    
model = smf.logit('boy ~ C(dmar)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692871
         Iterations 3
C(dmar)[T.2]           105.1   104.3 *






    Out[70]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3998175  


  Model:                Logit         Df Residuals:         3998173  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       3.001e-06 


  Time:               14:20:54        Log-Likelihood:     -2.7702e+06


  converged:            True          LL-Null:            -2.7702e+06


                                    LLR p-value:         4.555e-05 




                  coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept         0.0502      0.001     38.789   0.000      0.048     0.053


  C(dmar)[T.2]     -0.0083      0.002     -4.077   0.000     -0.012    -0.004

Each level of mother's education predicts a small increase in the probability of a boy.



In [71]:

    
model = smf.logit('boy ~ meduc', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692874
         Iterations 3
meduc                  104.1   104.2 *






    Out[71]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3810525  


  Model:                Logit         Df Residuals:         3810523  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.416e-06 


  Time:               14:20:59        Log-Likelihood:     -2.6402e+06


  converged:            True          LL-Null:            -2.6402e+06


                                    LLR p-value:         0.006248  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0398      0.003     14.711   0.000      0.034     0.045


  meduc          0.0016      0.001      2.734   0.006      0.000     0.003
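Because `meduc` enters this model as a numeric variable, the two ratios printed by `summarize` correspond to education codes 0 and 1, and code 0 is not an actual category. A rough sketch of the predicted ratio at each real level, using the rounded coefficients from the table above and assuming the codes are equally spaced (which the linear term implies):

```python
import math

intercept = 0.0398   # from the summary table above
slope = 0.0016       # change in log odds per education level

# Predicted boys per 100 girls at each education code, 1 through 8.
ratios = {level: 100 * math.exp(intercept + slope * level)
          for level in range(1, 9)}

for level, ratio in ratios.items():
    print(level, round(ratio, 1))
```

Across the full range of education codes, the predicted ratio changes by a little more than one boy per 100 girls.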



In [72]:

    
model = smf.logit('boy ~ lowed', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692874
         Iterations 3
lowed                  104.9   104.1 *






    Out[72]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3810525  


  Model:                Logit         Df Residuals:         3810523  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.431e-06 


  Time:               14:21:03        Log-Likelihood:     -2.6402e+06


  converged:            True          LL-Null:            -2.6402e+06


                                    LLR p-value:         0.005983  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0478      0.001     43.002   0.000      0.046     0.050


  lowed         -0.0079      0.003     -2.749   0.006     -0.013    -0.002

Older fathers are slightly more likely to have girls (but this apparent effect could be due to chance).



In [73]:

    
model = smf.logit('boy ~ fagerrec11', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692840
         Iterations 3
fagerrec11             105.9   105.7 *






    Out[73]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3501060  


  Model:                Logit         Df Residuals:         3501058  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       8.226e-07 


  Time:               14:21:08        Log-Likelihood:     -2.4257e+06


  converged:            True          LL-Null:            -2.4257e+06


                                    LLR p-value:          0.04575  




                coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept       0.0570      0.004     14.707   0.000      0.049     0.065


  fagerrec11     -0.0015      0.001     -1.998   0.046     -0.003  -2.9e-05



In [74]:

    
model = smf.logit('boy ~ youngf + oldf', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692840
         Iterations 3
youngf                 105.1   106.3  
oldf                   105.1   105.0  






    Out[74]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3501060  


  Model:                Logit         Df Residuals:         3501057  


  Method:                MLE          Df Model:                  2   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       5.807e-07 


  Time:               14:21:12        Log-Likelihood:     -2.4257e+06


  converged:            True          LL-Null:            -2.4257e+06


                                    LLR p-value:          0.2445   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0493      0.001     44.656   0.000      0.047     0.051


  youngf         0.0116      0.007      1.673   0.094     -0.002     0.025


  oldf          -0.0005      0.006     -0.086   0.932     -0.012     0.011

Predictions based on father's race are similar to those based on mother's race: more girls for black and Native American fathers; more boys for Asian fathers.



In [75]:

    
model = smf.logit('boy ~ C(fbrace)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692818
         Iterations 3
C(fbrace)[T.2.0]       105.5   103.1 *
C(fbrace)[T.3.0]       105.5   102.9 *
C(fbrace)[T.4.0]       105.5   106.6 *






    Out[75]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3254136  


  Model:                Logit         Df Residuals:         3254132  


  Method:                MLE          Df Model:                  3   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.504e-05 


  Time:               14:21:38        Log-Likelihood:     -2.2545e+06


  converged:            True          LL-Null:            -2.2546e+06


                                    LLR p-value:         1.256e-14 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0533      0.001     42.144   0.000      0.051     0.056


  C(fbrace)[T.2.0]     -0.0227      0.003     -7.221   0.000     -0.029    -0.017


  C(fbrace)[T.3.0]     -0.0250      0.011     -2.335   0.020     -0.046    -0.004


  C(fbrace)[T.4.0]      0.0106      0.004      2.479   0.013      0.002     0.019

If the father is Hispanic, that predicts more girls.



In [76]:

    
model = smf.logit('boy ~ fhisp', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692839
         Iterations 3
fhisp                  105.4   104.0 *






    Out[76]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3453052  


  Model:                Logit         Df Residuals:         3453050  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       5.800e-06 


  Time:               14:21:42        Log-Likelihood:     -2.3924e+06


  converged:            True          LL-Null:            -2.3924e+06


                                    LLR p-value:         1.378e-07 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0525      0.001     42.696   0.000      0.050     0.055


  fhisp         -0.0134      0.003     -5.268   0.000     -0.018    -0.008

Father's education level might predict more boys, but the apparent effect could be due to chance.



In [77]:

    
model = smf.logit('boy ~ feduc', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692840
         Iterations 3
feduc                  104.6   104.7  






    Out[77]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3279126  


  Model:                Logit         Df Residuals:         3279124  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       8.046e-07 


  Time:               14:21:46        Log-Likelihood:     -2.2719e+06


  converged:            True          LL-Null:            -2.2719e+06


                                    LLR p-value:          0.05587  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0445      0.003     15.630   0.000      0.039     0.050


  feduc          0.0012      0.001      1.912   0.056  -3.02e-05     0.002
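To see why this result counts as "could be due to chance", here is a minimal sketch that recomputes the two-sided p-value and an approximate 95% interval from the `z` column, assuming the usual normal approximation. The helper name `two_sided_p` is hypothetical, and the standard error is backed out from the rounded coefficient and z statistic, so the results are only approximate.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

# Values copied from the `boy ~ feduc` summary table above
coef = 0.0012
z = 1.912

# The table rounds the standard error to 0.001, so recover it from coef / z
std_err = coef / z

p = two_sided_p(z)
low = coef - 1.96 * std_err
high = coef + 1.96 * std_err

print(p, low, high)
```

The interval straddles zero and the p-value sits just above the conventional 0.05 threshold, which matches the missing `*` on the `feduc` line.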

Babies with high birth order are slightly more likely to be girls.



In [78]:

    
model = smf.logit('boy ~ lbo_rec', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692872
         Iterations 3
lbo_rec                105.3   105.1 *






    Out[78]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3978150  


  Model:                Logit         Df Residuals:         3978148  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.576e-06 


  Time:               14:21:51        Log-Likelihood:     -2.7563e+06


  converged:            True          LL-Null:            -2.7564e+06


                                    LLR p-value:         0.003206  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0518      0.002     26.529   0.000      0.048     0.056


  lbo_rec       -0.0023      0.001     -2.947   0.003     -0.004    -0.001



In [79]:

    
model = smf.logit('boy ~ highbo', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692872
         Iterations 3
highbo                 104.9   103.4 *






    Out[79]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3978150  


  Model:                Logit         Df Residuals:         3978148  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.647e-06 


  Time:               14:21:56        Log-Likelihood:     -2.7563e+06


  converged:            True          LL-Null:            -2.7564e+06


                                    LLR p-value:         0.002584  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0475      0.001     46.200   0.000      0.046     0.050


  highbo        -0.0139      0.005     -3.013   0.003     -0.023    -0.005

Strangely, prenatal visits are associated with an increased probability of girls.



In [80]:

    
model = smf.logit('boy ~ previs', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692847
         Iterations 3
previs                 104.6   103.8 *






    Out[80]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3888065  


  Model:                Logit         Df Residuals:         3888063  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       3.975e-05 


  Time:               14:22:01        Log-Likelihood:     -2.6938e+06


  converged:            True          LL-Null:            -2.6939e+06


                                    LLR p-value:         1.677e-48 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0449      0.001     43.933   0.000      0.043     0.047


  previs        -0.0079      0.001    -14.634   0.000     -0.009    -0.007

The effect seems to be non-linear at zero, so I'm adding a boolean for no prenatal visits.



In [81]:

    
model = smf.logit('boy ~ no_previs + previs', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692842
         Iterations 3
no_previs              104.6   98.9 *
previs                 104.6   103.7 *






    Out[81]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3888065  


  Model:                Logit         Df Residuals:         3888062  


  Method:                MLE          Df Model:                  2   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       4.717e-05 


  Time:               14:22:07        Log-Likelihood:     -2.6938e+06


  converged:            True          LL-Null:            -2.6939e+06


                                    LLR p-value:         6.538e-56 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0454      0.001     44.310   0.000      0.043     0.047


  no_previs     -0.0564      0.009     -6.322   0.000     -0.074    -0.039


  previs        -0.0093      0.001    -15.938   0.000     -0.010    -0.008

If the mother received WIC benefits (a nutrition assistance program for women, infants, and children, distinct from food stamps), she is more likely to have a girl.



In [82]:

    
model = smf.logit('boy ~ wic', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692869
         Iterations 3
wic[T.Y]               105.2   104.3 *






    Out[82]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3759121  


  Model:                Logit         Df Residuals:         3759119  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       3.051e-06 


  Time:               14:22:35        Log-Likelihood:     -2.6046e+06


  converged:            True          LL-Null:            -2.6046e+06


                                    LLR p-value:         6.700e-05 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0506      0.001     36.886   0.000      0.048     0.053


  wic[T.Y]      -0.0083      0.002     -3.987   0.000     -0.012    -0.004

Mother's height seems to have no predictive value.



In [83]:

    
model = smf.logit('boy ~ height', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692873
         Iterations 3
height                 102.4   102.5  






    Out[83]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3790892  


  Model:                Logit         Df Residuals:         3790890  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.853e-07 


  Time:               14:22:39        Log-Likelihood:     -2.6266e+06


  converged:            True          LL-Null:            -2.6266e+06


                                    LLR p-value:          0.3238   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0240      0.023      1.038   0.299     -0.021     0.069


  height         0.0004      0.000      0.987   0.324     -0.000     0.001



In [84]:

    
model = smf.logit('boy ~ mtall + mshort', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692872
         Iterations 3
mtall                  104.8   104.1  
mshort                 104.8   104.3  






    Out[84]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3790892  


  Model:                Logit         Df Residuals:         3790889  


  Method:                MLE          Df Model:                  2   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       4.560e-07 


  Time:               14:22:43        Log-Likelihood:     -2.6266e+06


  converged:            True          LL-Null:            -2.6266e+06


                                    LLR p-value:          0.3019   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0473      0.001     44.433   0.000      0.045     0.049


  mtall         -0.0071      0.006     -1.212   0.226     -0.018     0.004


  mshort        -0.0056      0.006     -1.005   0.315     -0.016     0.005

Mothers with higher BMI are more likely to have girls.



In [85]:

    
model = smf.logit('boy ~ bmi_r', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692870
         Iterations 3
bmi_r                  105.7   105.4 *






    Out[85]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3709225  


  Model:                Logit         Df Residuals:         3709223  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.168e-06 


  Time:               14:22:48        Log-Likelihood:     -2.5700e+06


  converged:            True          LL-Null:            -2.5700e+06


                                    LLR p-value:         0.0008442 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0554      0.003     20.336   0.000      0.050     0.061


  bmi_r         -0.0029      0.001     -3.338   0.001     -0.005    -0.001



In [86]:

    
model = smf.logit('boy ~ obese', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692870
         Iterations 3
obese                  105.0   104.2 *






    Out[86]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3709225  


  Model:                Logit         Df Residuals:         3709223  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.347e-06 


  Time:               14:22:53        Log-Likelihood:     -2.5700e+06


  converged:            True          LL-Null:            -2.5700e+06


                                    LLR p-value:         0.0005139 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0491      0.001     40.976   0.000      0.047     0.051


  obese         -0.0084      0.002     -3.473   0.001     -0.013    -0.004

If payment was made by Medicaid, the baby is more likely to be a girl. Relative to Medicaid, private insurance and self-payment are associated with more boys; the estimate for other payment methods points the same way but is not statistically significant.



In [87]:

    
model = smf.logit('boy ~ C(pay_rec)', data=df)    
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692869
         Iterations 3
C(pay_rec)[T.2.0]      104.2   105.1 *
C(pay_rec)[T.3.0]      104.2   106.6 *
C(pay_rec)[T.4.0]      104.2   104.7  






    Out[87]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3819768  


  Model:                Logit         Df Residuals:         3819764  


  Method:                MLE          Df Model:                  3   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       5.306e-06 


  Time:               14:23:19        Log-Likelihood:     -2.6466e+06


  converged:            True          LL-Null:            -2.6466e+06


                                    LLR p-value:         3.482e-06 




                       coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept              0.0416      0.002     26.840   0.000      0.039     0.045


  C(pay_rec)[T.2.0]      0.0085      0.002      3.982   0.000      0.004     0.013


  C(pay_rec)[T.3.0]      0.0222      0.005      4.272   0.000      0.012     0.032


  C(pay_rec)[T.4.0]      0.0047      0.005      0.925   0.355     -0.005     0.015

Adding controls

However, none of the previous results should be taken too seriously. We only tested one variable at a time, and many of these apparent effects disappear when we add control variables.

In particular, if we control for father's race and Hispanic origin, the mother's race has no additional predictive value.



In [88]:

    
formula = ('boy ~ C(fbrace) + fhisp + C(mbrace) + mhisp')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692816
         Iterations 3
C(fbrace)[T.2.0]       105.8   103.1 *
C(fbrace)[T.3.0]       105.8   103.5  
C(fbrace)[T.4.0]       105.8   106.9  
C(mbrace)[T.2]         105.8   105.9  
C(mbrace)[T.3]         105.8   104.5  
C(mbrace)[T.4]         105.8   105.6  
fhisp                  105.8   104.2 *
mhisp                  105.8   106.0  






    Out[88]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3231530  


  Model:                Logit         Df Residuals:         3231521  


  Method:                MLE          Df Model:                  8   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.087e-05 


  Time:               14:24:08        Log-Likelihood:     -2.2389e+06


  converged:            True          LL-Null:            -2.2389e+06


                                    LLR p-value:         9.292e-17 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0566      0.001     38.234   0.000      0.054     0.060


  C(fbrace)[T.2.0]     -0.0260      0.006     -4.668   0.000     -0.037    -0.015


  C(fbrace)[T.3.0]     -0.0221      0.012     -1.793   0.073     -0.046     0.002


  C(fbrace)[T.4.0]      0.0097      0.007      1.344   0.179     -0.004     0.024


  C(mbrace)[T.2]        0.0004      0.006      0.075   0.940     -0.011     0.012


  C(mbrace)[T.3]       -0.0130      0.013     -0.994   0.320     -0.039     0.013


  C(mbrace)[T.4]       -0.0026      0.007     -0.375   0.708     -0.016     0.011


  fhisp                -0.0156      0.004     -3.591   0.000     -0.024    -0.007


  mhisp                 0.0018      0.004      0.422   0.673     -0.007     0.010
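The way a one-variable effect can vanish once a control is added can be illustrated with a toy example. The counts below are made up, NOT taken from the birth data: two hypothetical groups have different sex ratios, and a hypothetical flag is simply more common in one group. Within each group the flag makes no difference, but pooling the groups makes it look predictive.

```python
# Made-up counts, NOT from the birth data: (group, flag) -> (births, boys)
counts = {
    ('A', 0): (8000, 4120),
    ('A', 1): (2000, 1030),
    ('B', 0): (2000, 1010),
    ('B', 1): (8000, 4040),
}

def sex_ratio(births, boys):
    """Boys per 100 girls."""
    return 100.0 * boys / (births - boys)

def pooled_ratio(flag):
    """Sex ratio for one value of the flag, ignoring group."""
    births = sum(n for (g, f), (n, b) in counts.items() if f == flag)
    boys = sum(b for (g, f), (n, b) in counts.items() if f == flag)
    return sex_ratio(births, boys)

# Ignoring group, the flag appears to predict more girls
print(pooled_ratio(0), pooled_ratio(1))

# Within each group, the flag makes no difference
for group in ['A', 'B']:
    print(group, sex_ratio(*counts[group, 0]), sex_ratio(*counts[group, 1]))
```

Adding the group as a control in a regression plays the same role as the within-group comparison here.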

In fact, once we control for father's race and Hispanic origin, almost every other variable becomes statistically insignificant, including acknowledged paternity.



In [89]:

    
formula = ('boy ~ C(fbrace) + fhisp + mar_p')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692814
         Iterations 3
C(fbrace)[T.2.0]       108.2   105.5 *
C(fbrace)[T.3.0]       108.2   105.2 *
C(fbrace)[T.4.0]       108.2   109.1  
mar_p[T.Y]             108.2   105.8  
fhisp                  108.2   106.7 *






    Out[89]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3112362  


  Model:                Logit         Df Residuals:         3112356  


  Method:                MLE          Df Model:                  5   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.117e-05 


  Time:               14:24:56        Log-Likelihood:     -2.1563e+06


  converged:            True          LL-Null:            -2.1563e+06


                                    LLR p-value:         3.558e-18 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0792      0.015      5.155   0.000      0.049     0.109


  C(fbrace)[T.2.0]     -0.0258      0.003     -7.860   0.000     -0.032    -0.019


  C(fbrace)[T.3.0]     -0.0283      0.011     -2.594   0.009     -0.050    -0.007


  C(fbrace)[T.4.0]      0.0074      0.004      1.662   0.097     -0.001     0.016


  mar_p[T.Y]           -0.0225      0.015     -1.464   0.143     -0.053     0.008


  fhisp                -0.0148      0.003     -4.982   0.000     -0.021    -0.009

Being married still predicts more boys.



In [90]:

    
formula = ('boy ~ C(fbrace) + fhisp + dmar')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692814
         Iterations 3
C(fbrace)[T.2.0]       105.0   102.2 *
C(fbrace)[T.3.0]       105.0   101.9 *
C(fbrace)[T.4.0]       105.0   105.9  
fhisp                  105.0   103.4 *
dmar                   105.0   105.7 *






    Out[90]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3235798  


  Model:                Logit         Df Residuals:         3235792  


  Method:                MLE          Df Model:                  5   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.183e-05 


  Time:               14:25:22        Log-Likelihood:     -2.2418e+06


  converged:            True          LL-Null:            -2.2419e+06


                                    LLR p-value:         1.485e-19 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0492      0.003     14.375   0.000      0.042     0.056


  C(fbrace)[T.2.0]     -0.0278      0.003     -8.324   0.000     -0.034    -0.021


  C(fbrace)[T.3.0]     -0.0301      0.011     -2.778   0.005     -0.051    -0.009


  C(fbrace)[T.4.0]      0.0081      0.004      1.871   0.061     -0.000     0.017


  fhisp                -0.0156      0.003     -5.270   0.000     -0.021    -0.010


  dmar                  0.0062      0.003      2.416   0.016      0.001     0.011

The effect of education disappears.



In [91]:

    
formula = ('boy ~ C(fbrace) + fhisp + lowed')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692816
         Iterations 3
C(fbrace)[T.2.0]       105.8   103.1 *
C(fbrace)[T.3.0]       105.8   102.8 *
C(fbrace)[T.4.0]       105.8   106.5  
fhisp                  105.8   104.2 *
lowed                  105.8   106.0  






    Out[91]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3091385  


  Model:                Logit         Df Residuals:         3091379  


  Method:                MLE          Df Model:                  5   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.076e-05 


  Time:               14:25:47        Log-Likelihood:     -2.1418e+06


  converged:            True          LL-Null:            -2.1418e+06


                                    LLR p-value:         1.130e-17 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0566      0.001     37.993   0.000      0.054     0.060


  C(fbrace)[T.2.0]     -0.0259      0.003     -7.838   0.000     -0.032    -0.019


  C(fbrace)[T.3.0]     -0.0287      0.011     -2.624   0.009     -0.050    -0.007


  C(fbrace)[T.4.0]      0.0067      0.004      1.487   0.137     -0.002     0.015


  fhisp                -0.0152      0.003     -4.927   0.000     -0.021    -0.009


  lowed                 0.0017      0.004      0.462   0.644     -0.006     0.009

The effect of birth order disappears.



In [92]:

    
formula = ('boy ~ C(fbrace) + fhisp + highbo')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692816
         Iterations 3
C(fbrace)[T.2.0]       105.8   103.2 *
C(fbrace)[T.3.0]       105.8   102.9 *
C(fbrace)[T.4.0]       105.8   106.6  
fhisp                  105.8   104.4 *
highbo                 105.8   105.6  






    Out[92]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3221819  


  Model:                Logit         Df Residuals:         3221813  


  Method:                MLE          Df Model:                  5   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.029e-05 


  Time:               14:26:13        Log-Likelihood:     -2.2321e+06


  converged:            True          LL-Null:            -2.2322e+06


                                    LLR p-value:         5.072e-18 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0566      0.001     38.815   0.000      0.054     0.060


  C(fbrace)[T.2.0]     -0.0253      0.003     -7.841   0.000     -0.032    -0.019


  C(fbrace)[T.3.0]     -0.0284      0.011     -2.616   0.009     -0.050    -0.007


  C(fbrace)[T.4.0]      0.0077      0.004      1.758   0.079     -0.001     0.016


  fhisp                -0.0139      0.003     -4.785   0.000     -0.020    -0.008


  highbo               -0.0026      0.005     -0.483   0.629     -0.013     0.008

WIC is no longer associated with more girls.



In [93]:

    
formula = ('boy ~ C(fbrace) + fhisp + wic')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692813
         Iterations 3
C(fbrace)[T.2.0]       105.8   103.0 *
C(fbrace)[T.3.0]       105.8   103.0 *
C(fbrace)[T.4.0]       105.8   106.6  
wic[T.Y]               105.8   106.1  
fhisp                  105.8   104.1 *






    Out[93]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3040527  


  Model:                Logit         Df Residuals:         3040521  


  Method:                MLE          Df Model:                  5   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.175e-05 


  Time:               14:27:01        Log-Likelihood:     -2.1065e+06


  converged:            True          LL-Null:            -2.1066e+06


                                    LLR p-value:         3.031e-18 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0564      0.002     34.772   0.000      0.053     0.060


  C(fbrace)[T.2.0]     -0.0271      0.003     -7.892   0.000     -0.034    -0.020


  C(fbrace)[T.3.0]     -0.0267      0.011     -2.405   0.016     -0.048    -0.005


  C(fbrace)[T.4.0]      0.0076      0.005      1.670   0.095     -0.001     0.016


  wic[T.Y]              0.0025      0.003      0.975   0.330     -0.002     0.007


  fhisp                -0.0161      0.003     -5.153   0.000     -0.022    -0.010

The effect of obesity disappears.



In [94]:

    
formula = ('boy ~ C(fbrace) + fhisp + obese')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692815
         Iterations 3
C(fbrace)[T.2.0]       105.9   103.3 *
C(fbrace)[T.3.0]       105.9   103.1 *
C(fbrace)[T.4.0]       105.9   106.5  
fhisp                  105.9   104.3 *
obese                  105.9   105.7  






    Out[94]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3005073  


  Model:                Logit         Df Residuals:         3005067  


  Method:                MLE          Df Model:                  5   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.947e-05 


  Time:               14:27:26        Log-Likelihood:     -2.0820e+06


  converged:            True          LL-Null:            -2.0820e+06


                                    LLR p-value:         5.013e-16 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0571      0.002     35.622   0.000      0.054     0.060


  C(fbrace)[T.2.0]     -0.0247      0.003     -7.305   0.000     -0.031    -0.018


  C(fbrace)[T.3.0]     -0.0266      0.011     -2.410   0.016     -0.048    -0.005


  C(fbrace)[T.4.0]      0.0056      0.005      1.217   0.224     -0.003     0.015


  fhisp                -0.0151      0.003     -4.996   0.000     -0.021    -0.009


  obese                -0.0014      0.003     -0.524   0.600     -0.007     0.004

The effect of payment method is diminished, but self-payment is still associated with more boys.



In [95]:

    
formula = ('boy ~ C(fbrace) + fhisp + C(pay_rec)')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692812
         Iterations 3
C(fbrace)[T.2.0]       106.1   103.3 *
C(fbrace)[T.3.0]       106.1   103.0 *
C(fbrace)[T.4.0]       106.1   106.7  
C(pay_rec)[T.2.0]      106.1   105.7  
C(pay_rec)[T.3.0]      106.1   108.3 *
C(pay_rec)[T.4.0]      106.1   105.4  
fhisp                  106.1   104.4 *






    Out[95]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3086812  


  Model:                Logit         Df Residuals:         3086804  


  Method:                MLE          Df Model:                  7   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.500e-05 


  Time:               14:28:14        Log-Likelihood:     -2.1386e+06


  converged:            True          LL-Null:            -2.1386e+06


                                    LLR p-value:         3.965e-20 




                       coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept              0.0593      0.002     25.249   0.000      0.055     0.064


  C(fbrace)[T.2.0]      -0.0271      0.003     -7.980   0.000     -0.034    -0.020


  C(fbrace)[T.3.0]      -0.0297      0.011     -2.696   0.007     -0.051    -0.008


  C(fbrace)[T.4.0]       0.0056      0.004      1.239   0.216     -0.003     0.014


  C(pay_rec)[T.2.0]     -0.0043      0.003     -1.680   0.093     -0.009     0.001


  C(pay_rec)[T.3.0]      0.0203      0.006      3.331   0.001      0.008     0.032


  C(pay_rec)[T.4.0]     -0.0063      0.006     -1.094   0.274     -0.018     0.005


  fhisp                 -0.0167      0.003     -5.378   0.000     -0.023    -0.011

But the number of prenatal visits is still a strong predictor of more girls.



In [96]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692778
         Iterations 3
C(fbrace)[T.2.0]       105.8   102.8 *
C(fbrace)[T.3.0]       105.8   102.3 *
C(fbrace)[T.4.0]       105.8   106.4  
fhisp                  105.8   104.0 *
previs                 105.8   104.8 *






    Out[96]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3155440  


  Model:                Logit         Df Residuals:         3155434  


  Method:                MLE          Df Model:                  5   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       7.997e-05 


  Time:               14:28:40        Log-Likelihood:     -2.1860e+06


  converged:            True          LL-Null:            -2.1862e+06


                                    LLR p-value:         2.081e-73 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0567      0.001     38.800   0.000      0.054     0.060


  C(fbrace)[T.2.0]     -0.0295      0.003     -9.008   0.000     -0.036    -0.023


  C(fbrace)[T.3.0]     -0.0341      0.011     -3.114   0.002     -0.056    -0.013


  C(fbrace)[T.4.0]      0.0058      0.004      1.314   0.189     -0.003     0.014


  fhisp                -0.0172      0.003     -5.862   0.000     -0.023    -0.011


  previs               -0.0102      0.001    -16.235   0.000     -0.011    -0.009
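To get a feel for the size of these coefficients, the sketch below converts them to predicted probabilities with the inverse logit. It assumes the baseline is the reference category of `fbrace` with `fhisp` and `previs` both zero, and that "one step" means a one-unit increase in `previs` in whatever coding the notebook uses. The helper name `prob_boy` is hypothetical.

```python
import math

def prob_boy(log_odds):
    """Inverse logit: convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Coefficients copied from the `boy ~ C(fbrace) + fhisp + previs` table above
intercept = 0.0567
fhisp_coef = -0.0172
previs_coef = -0.0102

# In a logit model, effects add on the log-odds scale
base = prob_boy(intercept)
one_step = prob_boy(intercept + previs_coef)
hisp_one_step = prob_boy(intercept + fhisp_coef + previs_coef)

print(base, one_step, hisp_one_step)
```

The predicted probabilities stay close to one half in every case, even though the coefficients are estimated with very small p-values.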

And the effect is even stronger if we add a boolean to capture the nonlinearity at 0 visits.



In [97]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692776
         Iterations 3
C(fbrace)[T.2.0]       105.9   102.8 *
C(fbrace)[T.3.0]       105.9   102.3 *
C(fbrace)[T.4.0]       105.9   106.5  
fhisp                  105.9   104.1 *
previs                 105.9   104.7 *
no_previs              105.9   101.0 *






    Out[97]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3155440  


  Model:                Logit         Df Residuals:         3155433  


  Method:                MLE          Df Model:                  6   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       8.351e-05 


  Time:               14:29:06        Log-Likelihood:     -2.1860e+06


  converged:            True          LL-Null:            -2.1862e+06


                                    LLR p-value:         8.674e-76 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0570      0.001     38.973   0.000      0.054     0.060


  C(fbrace)[T.2.0]     -0.0294      0.003     -8.984   0.000     -0.036    -0.023


  C(fbrace)[T.3.0]     -0.0342      0.011     -3.123   0.002     -0.056    -0.013


  C(fbrace)[T.4.0]      0.0056      0.004      1.270   0.204     -0.003     0.014


  fhisp                -0.0171      0.003     -5.817   0.000     -0.023    -0.011


  previs               -0.0111      0.001    -16.625   0.000     -0.012    -0.010


  no_previs            -0.0469      0.012     -3.936   0.000     -0.070    -0.024
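One way to check whether the extra boolean improves the fit is a likelihood-ratio test between the models with and without `no_previs`, which were fit to the same number of observations. The sketch below is an approximation: it recovers each log-likelihood from the rounded Pseudo R-squ. and LL-Null values printed above, using the fact that McFadden's pseudo R-squared is 1 - LL/LL-Null. In practice one would read `llf` directly from the fitted results.

```python
import math

# Values copied (rounded) from the two summaries above
ll_null = -2.1862e6
r2_without_flag = 7.997e-05   # boy ~ C(fbrace) + fhisp + previs
r2_with_flag = 8.351e-05      # same model plus no_previs

# Pseudo R-squared = 1 - LL / LL_null, so LL = LL_null * (1 - R2)
ll_without = ll_null * (1 - r2_without_flag)
ll_with = ll_null * (1 - r2_with_flag)

# Likelihood-ratio statistic for one added parameter
lr_stat = 2 * (ll_with - ll_without)

# Survival function of chi-squared with 1 degree of freedom
p_value = math.erfc(math.sqrt(lr_stat / 2))

print(lr_stat, p_value)
```

With one added parameter, the square root of the statistic should be close to the absolute z value reported for `no_previs`, which gives a quick sanity check on the arithmetic.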

More controls

Now if we control for father's race and Hispanic origin as well as number of prenatal visits, the effect of marriage disappears.



In [98]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs + dmar')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692778
         Iterations 3
C(fbrace)[T.2.0]       105.3   102.1 *
C(fbrace)[T.3.0]       105.3   101.7 *
C(fbrace)[T.4.0]       105.3   106.0  
fhisp                  105.3   103.5 *
previs                 105.3   104.3 *
dmar                   105.3   105.7  






    Out[98]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3155440  


  Model:                Logit         Df Residuals:         3155433  


  Method:                MLE          Df Model:                  6   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       8.045e-05 


  Time:               14:29:32        Log-Likelihood:     -2.1860e+06


  converged:            True          LL-Null:            -2.1862e+06


                                    LLR p-value:         6.525e-73 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0521      0.003     15.015   0.000      0.045     0.059


  C(fbrace)[T.2.0]     -0.0309      0.003     -9.058   0.000     -0.038    -0.024


  C(fbrace)[T.3.0]     -0.0353      0.011     -3.210   0.001     -0.057    -0.014


  C(fbrace)[T.4.0]      0.0062      0.004      1.394   0.163     -0.002     0.015


  fhisp                -0.0181      0.003     -6.033   0.000     -0.024    -0.012


  previs               -0.0102      0.001    -16.122   0.000     -0.011    -0.009


  dmar                  0.0037      0.003      1.446   0.148     -0.001     0.009
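One way to read "the effect disappears" is that the 95% confidence interval for dmar now includes zero. The sketch below rebuilds that interval from the coefficient and z statistic printed in the table, using a normal approximation. The helper name wald_interval is hypothetical, and the standard error is recovered as coef / z because the printed value (0.003) is rounded too coarsely to use directly.

```python
def wald_interval(coef, z, crit=1.96):
    """Approximate 95% interval from a coefficient and its z statistic."""
    se = coef / z              # standard error implied by the printed coef and z
    half = crit * abs(se)      # half-width of the interval
    return coef - half, coef + half

# dmar row of the table: coef = 0.0037, z = 1.446
low, high = wald_interval(0.0037, 1.446)
print(round(low, 3), round(high, 3))   # interval straddles zero
```

Because the interval contains zero, the data are consistent with marital status having no effect once race, Hispanic origin, and prenatal visits are in the model.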

The effect of payment method (pay_rec) also disappears.



In [99]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs + C(pay_rec)')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692777
         Iterations 3
C(fbrace)[T.2.0]       105.8   102.8 *
C(fbrace)[T.3.0]       105.8   102.2 *
C(fbrace)[T.4.0]       105.8   106.3  
C(pay_rec)[T.2.0]      105.8   105.9  
C(pay_rec)[T.3.0]      105.8   106.9  
C(pay_rec)[T.4.0]      105.8   105.0  
fhisp                  105.8   104.0 *
previs                 105.8   104.8 *






    Out[99]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3009712  


  Model:                Logit         Df Residuals:         3009703  


  Method:                MLE          Df Model:                  8   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       8.163e-05 


  Time:               14:30:20        Log-Likelihood:     -2.0851e+06


  converged:            True          LL-Null:            -2.0852e+06


                                    LLR p-value:         1.004e-68 




                       coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept              0.0566      0.002     23.765   0.000      0.052     0.061


  C(fbrace)[T.2.0]      -0.0295      0.003     -8.509   0.000     -0.036    -0.023


  C(fbrace)[T.3.0]      -0.0345      0.011     -3.090   0.002     -0.056    -0.013


  C(fbrace)[T.4.0]       0.0046      0.005      1.012   0.312     -0.004     0.014


  C(pay_rec)[T.2.0]      0.0005      0.003      0.174   0.862     -0.005     0.006


  C(pay_rec)[T.3.0]      0.0100      0.006      1.619   0.105     -0.002     0.022


  C(pay_rec)[T.4.0]     -0.0074      0.006     -1.260   0.208     -0.019     0.004


  fhisp                 -0.0178      0.003     -5.687   0.000     -0.024    -0.012


  previs                -0.0101      0.001    -15.540   0.000     -0.011    -0.009

Here's a version with the addition of a boolean for no prenatal visits.



In [100]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692776
         Iterations 3
C(fbrace)[T.2.0]       105.9   102.8 *
C(fbrace)[T.3.0]       105.9   102.3 *
C(fbrace)[T.4.0]       105.9   106.5  
fhisp                  105.9   104.1 *
previs                 105.9   104.7 *
no_previs              105.9   101.0 *






    Out[100]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3155440  


  Model:                Logit         Df Residuals:         3155433  


  Method:                MLE          Df Model:                  6   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       8.351e-05 


  Time:               14:30:47        Log-Likelihood:     -2.1860e+06


  converged:            True          LL-Null:            -2.1862e+06


                                    LLR p-value:         8.674e-76 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0570      0.001     38.973   0.000      0.054     0.060


  C(fbrace)[T.2.0]     -0.0294      0.003     -8.984   0.000     -0.036    -0.023


  C(fbrace)[T.3.0]     -0.0342      0.011     -3.123   0.002     -0.056    -0.013


  C(fbrace)[T.4.0]      0.0056      0.004      1.270   0.204     -0.003     0.014


  fhisp                -0.0171      0.003     -5.817   0.000     -0.023    -0.011


  previs               -0.0111      0.001    -16.625   0.000     -0.012    -0.010


  no_previs            -0.0469      0.012     -3.936   0.000     -0.070    -0.024

Now, surprisingly, the mother's age has a small effect.



In [101]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs + mager9')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692775
         Iterations 3
C(fbrace)[T.2.0]       106.8   103.6 *
C(fbrace)[T.3.0]       106.8   103.1 *
C(fbrace)[T.4.0]       106.8   107.4  
fhisp                  106.8   104.9 *
previs                 106.8   105.6 *
no_previs              106.8   101.9 *
mager9                 106.8   106.6 *






    Out[101]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3155440  


  Model:                Logit         Df Residuals:         3155432  


  Method:                MLE          Df Model:                  7   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       8.440e-05 


  Time:               14:31:14        Log-Likelihood:     -2.1860e+06


  converged:            True          LL-Null:            -2.1862e+06


                                    LLR p-value:         1.043e-75 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0656      0.005     14.344   0.000      0.057     0.075


  C(fbrace)[T.2.0]     -0.0300      0.003     -9.123   0.000     -0.036    -0.024


  C(fbrace)[T.3.0]     -0.0351      0.011     -3.200   0.001     -0.057    -0.014


  C(fbrace)[T.4.0]      0.0062      0.004      1.413   0.158     -0.002     0.015


  fhisp                -0.0176      0.003     -5.974   0.000     -0.023    -0.012


  previs               -0.0110      0.001    -16.456   0.000     -0.012    -0.010


  no_previs            -0.0468      0.012     -3.926   0.000     -0.070    -0.023


  mager9               -0.0019      0.001     -1.970   0.049     -0.004 -9.69e-06

So does the father's age. But both age effects are small and borderline significant.



In [104]:

    
formula = ('boy ~ C(fbrace) + fhisp + previs + no_previs + fagerrec11')
model = smf.logit(formula, data=df)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692775
         Iterations 3
C(fbrace)[T.2.0]       106.9   103.7 *
C(fbrace)[T.3.0]       106.9   103.2 *
C(fbrace)[T.4.0]       106.9   107.6  
fhisp                  106.9   105.0 *
previs                 106.9   105.7 *
no_previs              106.9   101.8 *
fagerrec11             106.9   106.7 *






    Out[104]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     3148537  


  Model:                Logit         Df Residuals:         3148529  


  Method:                MLE          Df Model:                  7   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       8.517e-05 


  Time:               14:32:34        Log-Likelihood:     -2.1812e+06


  converged:            True          LL-Null:            -2.1814e+06


                                    LLR p-value:         2.924e-76 




                      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept             0.0663      0.004     15.399   0.000      0.058     0.075


  C(fbrace)[T.2.0]     -0.0299      0.003     -9.100   0.000     -0.036    -0.023


  C(fbrace)[T.3.0]     -0.0348      0.011     -3.170   0.002     -0.056    -0.013


  C(fbrace)[T.4.0]      0.0067      0.004      1.518   0.129     -0.002     0.015


  fhisp                -0.0176      0.003     -5.974   0.000     -0.023    -0.012


  previs               -0.0110      0.001    -16.545   0.000     -0.012    -0.010


  no_previs            -0.0483      0.012     -4.039   0.000     -0.072    -0.025


  fagerrec11           -0.0019      0.001     -2.278   0.023     -0.003    -0.000
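To see why these two age effects count as borderline, it can help to convert the printed z statistics into two-sided p-values. The following is a minimal sketch using only the standard library and a standard normal approximation; the function name two_sided_p is an illustrative choice, not something defined in this notebook.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

p_mother = two_sided_p(-1.970)   # z for mager9 in the previous model
p_father = two_sided_p(-2.278)   # z for fagerrec11 in this model
print(round(p_mother, 3), round(p_father, 3))
```

Both values fall just under the conventional 0.05 threshold, in contrast to previs, whose z statistic is around -16.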

What's up with prenatal visits?

The predictive power of prenatal visits is still surprising to me. To make sure we've controlled for race, I'll select cases where both parents are white:



In [110]:

    
white = df[(df.mbrace==1) & (df.fbrace==1)]
len(white)









    Out[110]:





2400787

And compute sex ratios for each level of previs:



In [111]:

    
var = 'previs'
white[[var, 'boy']].groupby(var).aggregate(series_to_ratio)

The effect holds up. People with fewer than average prenatal visits are substantially more likely to have boys.
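The helper series_to_ratio is defined earlier in the notebook and is not shown in this section. As a rough illustration of the per-level computation, here is a plain-Python sketch that assumes the ratio is reported as boys per 100 girls, which is how sex ratios appear elsewhere in these results. The function names and the toy data are made up for the example.

```python
from collections import defaultdict

def sex_ratio(boys):
    """Boys per 100 girls for a sequence of 0/1 flags (1 = boy)."""
    n_boys = sum(boys)
    n_girls = len(boys) - n_boys
    return float('nan') if n_girls == 0 else 100 * n_boys / n_girls

def ratio_by_level(rows):
    """Group (level, boy) pairs by level and compute the ratio per group."""
    groups = defaultdict(list)
    for level, boy in rows:
        groups[level].append(boy)
    return {level: sex_ratio(flags) for level, flags in sorted(groups.items())}

# Made-up toy data, not taken from the birth records
toy = [(-1, 1), (-1, 1), (-1, 0), (0, 1), (0, 0), (1, 0), (1, 1), (1, 0)]
print(ratio_by_level(toy))
```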



In [112]:

    
formula = ('boy ~ previs + no_previs')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692749
         Iterations 3
previs                 105.5   104.3 *
no_previs              105.5   100.4 *






    Out[112]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2346785  


  Model:                Logit         Df Residuals:         2346782  


  Method:                MLE          Df Model:                  2   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       6.418e-05 


  Time:               14:40:39        Log-Likelihood:     -1.6257e+06


  converged:            True          LL-Null:            -1.6258e+06


                                    LLR p-value:         4.790e-46 




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0534      0.001     40.728   0.000      0.051     0.056


  previs        -0.0113      0.001    -14.378   0.000     -0.013    -0.010


  no_previs     -0.0490      0.015     -3.352   0.001     -0.078    -0.020
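The two columns printed by summarize appear to be sex ratios in boys per 100 girls: the baseline implied by the intercept, and the ratio after adding one coefficient. The summarize function is defined earlier in the notebook, so this reading is an inference from the printed numbers rather than a statement of how it is implemented. The sketch below applies it to the coefficients in this table.

```python
import math

def ratio_per_100(log_odds):
    """Convert log odds of a boy into boys per 100 girls."""
    return 100 * math.exp(log_odds)

# Coefficients copied from the table for the white-only model
intercept = 0.0534
coefs = {'previs': -0.0113, 'no_previs': -0.0490}

baseline = ratio_per_100(intercept)
shifted = {name: ratio_per_100(intercept + coef) for name, coef in coefs.items()}
for name, value in shifted.items():
    print(name, round(baseline, 1), round(value, 1))
```

The rounded values agree with the summarize output for this model (105.5 against 104.3 for previs and 100.4 for no_previs), which supports this interpretation.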



In [113]:

    
inter = results.params['Intercept']
slope = results.params['previs']
inter, slope









    Out[113]:





(0.053449172473506806, -0.011302385985286368)



In [114]:

    
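# Predicted sex ratio (boys per 100 girls) over a range of previs values,
# with no_previs held at 0
previs = np.arange(-5, 5)
logodds = inter + slope * previs
odds = np.exp(logodds)   # odds of a boy, i.e. boys per girl
odds * 100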









    Out[114]:





array([ 111.62346508,  110.36895641,  109.12854687,  107.90207798,
        106.68939307,  105.49033723,  104.30475728,  103.13250177,
        101.97342096,  100.82736677])
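These ratios can be easier to interpret as probabilities. Since the odds of a boy are p / (1 - p), the probability is odds / (1 + odds). The sketch below applies that conversion to a few of the ratios above, rounded to two decimals; prob_boy is a hypothetical helper name.

```python
def prob_boy(ratio_per_100):
    """Probability of a boy given a sex ratio in boys per 100 girls."""
    odds = ratio_per_100 / 100
    return odds / (1 + odds)

# Highest, middle, and lowest ratios from the array above (rounded)
for ratio in (111.62, 105.49, 100.83):
    print(ratio, round(prob_boy(ratio), 4))
```

Across this range of previs, the predicted probability of a boy moves from roughly 52.7% down to roughly 50.2%.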



In [116]:

    
formula = ('boy ~ dmar')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692788
         Iterations 3
dmar                   105.3   105.5  






    Out[116]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2400787  


  Model:                Logit         Df Residuals:         2400785  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       7.406e-08 


  Time:               15:27:21        Log-Likelihood:     -1.6632e+06


  converged:            True          LL-Null:            -1.6632e+06


                                    LLR p-value:          0.6196   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0518      0.004     13.234   0.000      0.044     0.059


  dmar           0.0014      0.003      0.496   0.620     -0.004     0.007



In [117]:

    
formula = ('boy ~ lowed')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692788
         Iterations 3
lowed                  105.6   105.0  






    Out[117]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2301234  


  Model:                Logit         Df Residuals:         2301232  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       4.759e-07 


  Time:               15:28:01        Log-Likelihood:     -1.5943e+06


  converged:            True          LL-Null:            -1.5943e+06


                                    LLR p-value:          0.2180   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0542      0.001     38.603   0.000      0.051     0.057


  lowed         -0.0051      0.004     -1.232   0.218     -0.013     0.003



In [118]:

    
formula = ('boy ~ highbo')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692788
         Iterations 3
highbo                 105.5   105.6  






    Out[118]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2391630  


  Model:                Logit         Df Residuals:         2391628  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       4.564e-09 


  Time:               15:28:25        Log-Likelihood:     -1.6569e+06


  converged:            True          LL-Null:            -1.6569e+06


                                    LLR p-value:          0.9021   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0535      0.001     40.493   0.000      0.051     0.056


  highbo         0.0008      0.006      0.123   0.902     -0.012     0.013



In [119]:

    
formula = ('boy ~ wic')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692786
         Iterations 3
wic[T.Y]               105.6   105.3  






    Out[119]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2266424  


  Model:                Logit         Df Residuals:         2266422  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       3.840e-07 


  Time:               15:28:57        Log-Likelihood:     -1.5701e+06


  converged:            True          LL-Null:            -1.5701e+06


                                    LLR p-value:          0.2721   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0548      0.002     33.369   0.000      0.052     0.058


  wic[T.Y]      -0.0031      0.003     -1.098   0.272     -0.009     0.002



In [120]:

    
formula = ('boy ~ obese')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692788
         Iterations 3
obese                  105.6   105.3  






    Out[120]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2244349  


  Model:                Logit         Df Residuals:         2244347  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.725e-07 


  Time:               15:29:20        Log-Likelihood:     -1.5549e+06


  converged:            True          LL-Null:            -1.5549e+06


                                    LLR p-value:          0.4639   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0542      0.002     35.607   0.000      0.051     0.057


  obese         -0.0023      0.003     -0.732   0.464     -0.009     0.004



In [123]:

    
formula = ('boy ~ C(pay_rec)')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692786
         Iterations 3
C(pay_rec)[T.2.0]      105.4   105.5  
C(pay_rec)[T.3.0]      105.4   107.1 *
C(pay_rec)[T.4.0]      105.4   105.3  






    Out[123]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2295681  


  Model:                Logit         Df Residuals:         2295677  


  Method:                MLE          Df Model:                  3   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.666e-06 


  Time:               15:30:06        Log-Likelihood:     -1.5904e+06


  converged:            True          LL-Null:            -1.5904e+06


                                    LLR p-value:          0.1511   




                       coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept              0.0529      0.002     23.356   0.000      0.048     0.057


  C(pay_rec)[T.2.0]      0.0004      0.003      0.147   0.883     -0.005     0.006


  C(pay_rec)[T.3.0]      0.0159      0.007      2.235   0.025      0.002     0.030


  C(pay_rec)[T.4.0]     -0.0013      0.007     -0.197   0.844     -0.015     0.012



In [124]:

    
formula = ('boy ~ mager9')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692786
         Iterations 3
mager9                 107.0   106.7 *






    Out[124]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2400787  


  Model:                Logit         Df Residuals:         2400785  


  Method:                MLE          Df Model:                  1   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.516e-06 


  Time:               15:30:32        Log-Likelihood:     -1.6632e+06


  converged:            True          LL-Null:            -1.6632e+06


                                    LLR p-value:         0.003813  




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0677      0.005     13.452   0.000      0.058     0.078


  mager9        -0.0032      0.001     -2.893   0.004     -0.005    -0.001



In [125]:

    
formula = ('boy ~ youngm + oldm')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692787
         Iterations 3
youngm[T.True]         105.6   105.5  
oldm[T.True]           105.6   103.8 *






    Out[125]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2400787  


  Model:                Logit         Df Residuals:         2400784  


  Method:                MLE          Df Model:                  2   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       1.549e-06 


  Time:               15:31:04        Log-Likelihood:     -1.6632e+06


  converged:            True          LL-Null:            -1.6632e+06


                                    LLR p-value:          0.07608  




                    coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept           0.0542      0.001     40.370   0.000      0.052     0.057


  youngm[T.True]     -0.0011      0.006     -0.170   0.865     -0.013     0.011


  oldm[T.True]       -0.0173      0.008     -2.268   0.023     -0.032    -0.002



In [126]:

    
formula = ('boy ~ youngf + oldf')
model = smf.logit(formula, data=white)
results = model.fit()
summarize(results)
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.692787
         Iterations 3
youngf                 105.5   106.4  
oldf                   105.5   105.7  






    Out[126]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:     2396141  


  Model:                Logit         Df Residuals:         2396138  


  Method:                MLE          Df Model:                  2   


  Date:           Tue, 17 May 2016    Pseudo R-squ.:       2.717e-07 


  Time:               15:31:50        Log-Likelihood:     -1.6600e+06


  converged:            True          LL-Null:            -1.6600e+06


                                    LLR p-value:          0.6370   




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept      0.0534      0.001     40.229   0.000      0.051     0.056


  youngf         0.0082      0.009      0.924   0.355     -0.009     0.026


  oldf           0.0018      0.008      0.242   0.809     -0.013     0.017


