This notebook presents a second-pass analysis of the association between Internet use and religion in Europe, using data from the European Social Survey (http://www.europeansocialsurvey.org).
Copyright 2015 Allen Downey
MIT License: http://opensource.org/licenses/MIT
In [1]:
from __future__ import print_function, division
import numpy as np
import pandas as pd
import thinkstats2
import thinkplot
import statsmodels.formula.api as smf
from iso_country_codes import COUNTRY
%matplotlib inline
The following function selects the columns I need.
In [2]:
def read_cycle(filename):
    """Reads a file containing ESS data and selects columns.

    filename: string path to a Stata (.dta) file

    returns: DataFrame with one row per respondent
    """
    data = pd.read_stata(filename, convert_categoricals=False)

    # early cycles use different names for income and interview year
    if 'hinctnta' not in data.columns:
        data['hinctnta'] = data.hinctnt
    if 'inwyr' not in data.columns:
        data['inwyr'] = data.inwyye

    # keep only the variables used in the analysis
    keep = ['cntry', 'inwyr', 'tvtot', 'tvpol', 'rdtot', 'rdpol',
            'nwsptot', 'nwsppol', 'netuse',
            'rlgblg', 'rlgdgr', 'eduyrs', 'hinctnta', 'yrbrn',
            'eisced', 'pspwght', 'pweight']
    return data[keep]
Read data from Cycle 1.
In [3]:
# Load Cycle 1 and preview the selected columns.
df1 = read_cycle('ESS1e06_4.dta')
df1.head()
Out[3]:
Read data from Cycle 2.
In [4]:
# Load Cycle 2 and preview the selected columns.
df2 = read_cycle('ESS2e03_4.dta')
df2.head()
Out[4]:
Read data from Cycle 3.
In [5]:
# Load Cycle 3 and preview the selected columns.
df3 = read_cycle('ESS3e03_5.dta')
df3.head()
Out[5]:
Read data from Cycle 4.
In [6]:
# Load Cycle 4 and preview the selected columns.
df4 = read_cycle('ESS4e04_3.dta')
df4.head()
Out[6]:
Read data from Cycle 5.
In [7]:
# Load Cycle 5 and preview the selected columns.
df5 = read_cycle('ESS5e03_2.dta')
df5.head()
Out[7]:
In [8]:
def clean_cycle(df):
    """Cleans data from one cycle, in place.

    Replaces the survey's special codes for missing answers with NaN
    and adds derived columns:

    hasrelig: 1 if the respondent belongs to a religion, 0 otherwise,
              NaN if rlgblg is missing
    yrbrn60: year born shifted so the mean is near 0
    inwyr07: interview year shifted by 2007 and jittered to break ties

    df: DataFrame (modified in place)
    """
    # NOTE: the original used chained attribute access with
    # Series.replace(inplace=True), which is deprecated chained
    # assignment in modern pandas; explicit column assignment is safe.

    # these variables code missing answers as 77/88/99
    for var in ['tvtot', 'rdtot', 'nwsptot', 'netuse',
                'eduyrs', 'rlgdgr', 'hinctnta']:
        df[var] = df[var].replace([77, 88, 99], np.nan)

    # the political-content variables also use 66 as a missing code
    for var in ['tvpol', 'rdpol', 'nwsppol']:
        df[var] = df[var].replace([66, 77, 88, 99], np.nan)

    df['rlgblg'] = df.rlgblg.replace([7, 8, 9], np.nan)
    df['yrbrn'] = df.yrbrn.replace([7777, 8888, 9999], np.nan)
    df['inwyr'] = df.inwyr.replace([9999], np.nan)

    # binary religion indicator; NaN propagates from rlgblg
    # (np.where yields a float column up front, avoiding the
    # deprecated int -> float upcast when assigning NaN afterwards)
    df['hasrelig'] = np.where(df.rlgblg.isnull(), np.nan,
                              (df.rlgblg == 1).astype(int))

    df['yrbrn60'] = df.yrbrn - 1960
    # jitter interview year so it behaves like a continuous variable
    df['inwyr07'] = df.inwyr - 2007 + np.random.uniform(-0.5, 0.5, len(df))
In [9]:
# Apply missing-value cleaning and derived columns to every cycle.
cycles = [df1, df2, df3, df4, df5]
for cycle in cycles:
    clean_cycle(cycle)
In [10]:
def resample(df):
"""Resample data by country.
df: DataFrame
returns: map from country code to DataFrame
"""
res = {}
grouped = df.groupby('cntry')
for name, group in grouped:
sample = group.sample(len(group), weights=group.pspwght, replace=True)
sample.index = range(len(group))
res[name] = sample
return res
# each cycle_map is a map from country code to DataFrame
cycle_maps = [resample(cycle) for cycle in cycles]

# sanity check: number of countries in each cycle
for cycle_map in cycle_maps:
    print(len(cycle_map), 'countries')
In each cycle, a few questions were not asked in some countries.
In [11]:
def check_variables(name, group):
    """Print variables with fewer than 100 valid responses in a group.

    name: group name (country code)
    group: DataFrame
    """
    varnames = ['cntry', 'tvtot', 'tvpol', 'rdtot', 'rdpol',
                'nwsptot', 'nwsppol', 'netuse', 'inwyr07',
                'rlgblg', 'rlgdgr', 'eduyrs', 'hinctnta',
                'yrbrn', 'pspwght', 'pweight']
    for var in varnames:
        # count non-missing responses once and reuse it in the report
        # (the original recomputed the dropna inside the print call)
        n = len(group[var].dropna())
        if n < 100:
            print(name, var, n)
# Report sparse variables for every country in every cycle.
for i, cycle_map in enumerate(cycle_maps):
    print('Cycle', i+1)
    for name, group in cycle_map.items():
        check_variables(name, group)
I'll remove the cycle/country groups that are missing netuse or rlgblg.
The ones that are missing income information undermine the ability to control for income, but it's a pretty weak effect, so I don't think it's a problem.
inwyr07 is the interview year shifted by 2007, the approximate mean, and jittered.
In [12]:
# Drop the country/cycle groups that are missing netuse or rlgblg.
del cycle_maps[0]['FR']
del cycle_maps[0]['DE']
del cycle_maps[1]['FR']
del cycle_maps[1]['FI']

# EE is missing interview year in cycle 5; fill with 3
# (i.e., 2010, since inwyr07 is inwyr - 2007) plus jitter.
ee = cycle_maps[4]['EE']
ee.inwyr07 = 3 + np.random.uniform(-0.5, 0.5, len(ee))
Income is reported on a different scale in different cycles, and differs substantially across countries. So I'm replacing it with rank on a per-country basis: that is, each respondent is mapped to their income rank, from 0 to 1, within their country.
I do the same thing with years of education, in part because there are some suspiciously high values. At the very least, mapping to rank reduces the effect of outliers.
In [13]:
def replace_var_with_rank(name, df, old, new):
    """Replaces a scale variable with a rank from 0-1.

    Creates a new column.

    name: country code (currently unused; kept for interface stability)
    df: DataFrame
    old: old variable name
    new: new variable name
    """
    # jitter to break ties before ranking
    jittered = df[old] + np.random.uniform(-0.25, 0.25, len(df))

    # with (almost) no data, fall back to uniform random ranks
    if len(jittered.dropna()) < 10:
        df[new] = np.random.random(len(df))
        return

    # map each value to its percentile rank via the empirical CDF
    cdf = thinkstats2.Cdf(jittered)
    df[new] = cdf.Probs(jittered)

    # make sure NaN maps to NaN
    df.loc[df[old].isnull(), new] = np.nan
def replace_with_ranks(cycle_map):
    """Replace variables within countries.

    cycle_map: map from country code to DataFrame
    """
    rank_vars = [('hinctnta', 'hincrank'), ('eduyrs', 'edurank')]
    for code, group in cycle_map.items():
        for old, new in rank_vars:
            replace_var_with_rank(code, group, old, new)
# Convert income and education to within-country ranks in every cycle.
for cycle_map in cycle_maps:
    replace_with_ranks(cycle_map)
To fill missing values, I am drawing random samples from the available values, on a per-country basis.
Since I am not modeling relationships between variables, I am losing some information with this replacement policy; the net effect is to understate the actual strength of all relationships.
In [14]:
def fill_var(df, old, new):
    """Fills missing values.

    Creates a new column; missing entries are filled with random
    draws (with replacement) from the observed values.

    df: DataFrame
    old: old variable name
    new: new variable name
    """
    # rows where the variable is missing
    missing = df[df[old].isnull()]

    # draw replacements from the non-missing values
    replacements = df[old].dropna().sample(len(missing), replace=True)
    replacements.index = missing.index

    # fill the gaps with the random draws
    df[new] = df[old].fillna(replacements)
def fill_all_vars(df):
    """Fills missing values in the variables we need.

    Adds a '_f' (filled) column for each variable.

    df: DataFrame
    """
    varnames = ['hasrelig', 'rlgdgr', 'yrbrn60', 'edurank', 'hincrank',
                'tvtot', 'rdtot', 'nwsptot', 'netuse', 'inwyr07']
    for varname in varnames:
        fill_var(df, varname, varname + '_f')
In [15]:
def fill_vars_by_country(cycle_map):
    """Fills missing values in each country group.

    cycle_map: map from country code to DataFrame
    """
    for name, group in cycle_map.items():
        fill_all_vars(group)
# Fill missing values, per country, in every cycle.
for cycle_map in cycle_maps:
    fill_vars_by_country(cycle_map)
Concatenate the cycles.
In [16]:
def concat_groups(cycle_map):
    """Concat all countries in a cycle.

    cycle_map: map from country code to DataFrame

    returns: DataFrame
    """
    groups = list(cycle_map.values())
    return pd.concat(groups, ignore_index=True)
# Build one DataFrame per cycle and report the number of respondents.
dfs = [concat_groups(cycle_map) for cycle_map in cycle_maps]
for df in dfs:
    print(len(df))
In [17]:
# Combine all cycles into a single DataFrame.
df = pd.concat(dfs, ignore_index=True)
print(df.shape)
df.head()
Out[17]:
TV watching time on average weekday
In [18]:
# Tabulate TV-watching responses.
df.tvtot.value_counts().sort_index()
Out[18]:
Radio listening, total time on average weekday.
In [19]:
# Tabulate radio-listening responses.
df.rdtot.value_counts().sort_index()
Out[19]:
Newspaper reading, total time on average weekday.
In [20]:
# Tabulate newspaper-reading responses.
df.nwsptot.value_counts().sort_index()
Out[20]:
Personal use of Internet, email, www
In [21]:
# Tabulate Internet-use responses.
df.netuse.value_counts().sort_index()
Out[21]:
Belong to a particular religion or denomination
In [22]:
# Tabulate religious-belonging responses.
df.rlgblg.value_counts().sort_index()
Out[22]:
How religious
In [23]:
# Tabulate degree-of-religiosity responses.
df.rlgdgr.value_counts().sort_index()
Out[23]:
Total household net income, all sources
In [24]:
# Summary statistics of the income rank (0-1).
df.hincrank.describe()
Out[24]:
Year born
In [25]:
# Summary statistics of year born.
df.yrbrn.describe()
Out[25]:
In [26]:
# Plot the CDF of year born.
cdf = thinkstats2.Cdf(df.yrbrn)
thinkplot.PrePlot(1)
thinkplot.Cdf(cdf)
# fixed typo in the displayed title ('Disrtribution')
thinkplot.Config(xlabel='Year born', ylabel='CDF',
                 title='Distribution of year born', legend=False)
Shifted to mean near 0
In [27]:
# Summary statistics of the shifted year born.
df.yrbrn60.describe()
Out[27]:
Number of years of education, converted to ranks.
In [28]:
# Summary statistics of the education rank (0-1).
df.edurank.describe()
Out[28]:
Country codes
In [29]:
# Number of respondents per country.
df.cntry.value_counts().sort_index()
Out[29]:
In [30]:
# Tabulate degree-of-religiosity responses (repeated view).
df.rlgdgr.value_counts().sort_index()
Out[30]:
In [31]:
# Summary statistics of the shifted, jittered interview year.
df.inwyr07.describe()
Out[31]:
Run the model
In [32]:
def run_model(df, formula):
    """Fits a logistic regression model.

    df: DataFrame
    formula: Patsy formula string

    returns: fitted results object
    """
    # disp=False suppresses the optimizer's convergence output
    return smf.logit(formula, data=df).fit(disp=False)
Here's the model with all control variables and all media variables:
In [33]:
# Logit of religious belonging on controls plus all media variables
# (unfilled versions of the variables).
formula = ('hasrelig ~ yrbrn60 + edurank + hincrank +'
           'tvtot + rdtot + nwsptot + netuse')
res = run_model(df, formula)
res.summary()
Out[33]:
Now using the filled variables
In [34]:
# Same model using the filled ('_f') versions of the variables.
formula = ('hasrelig_f ~ yrbrn60_f + edurank_f + hincrank_f +'
           'tvtot_f + rdtot_f + nwsptot_f + netuse_f')
res = run_model(df, formula)
res.summary()
Out[34]:
Now adding inwyr07
In [35]:
# Same model with interview year added as a control.
formula = ('hasrelig_f ~ inwyr07_f + yrbrn60_f + edurank_f + hincrank_f +'
           'tvtot_f + rdtot_f + nwsptot_f + netuse_f')
res = run_model(df, formula)
res.summary()
Out[35]:
In [36]:
def extract_res(res, var='netuse_f'):
    """Extracts sample size, coefficient, and significance stars.

    res: fitted regression results
    var: name of the variable of interest

    returns: (nobs, parameter, stars) tuple, where stars is
             '**' for p < 0.01, '*' for p < 0.05, '' otherwise
    """
    coef = res.params[var]
    p = res.pvalues[var]
    if p < 0.01:
        stars = '**'
    elif p < 0.05:
        stars = '*'
    else:
        stars = ''
    return res.nobs, coef, stars
# Example: extract the netuse_f coefficient from the last model.
extract_res(res)
Out[36]:
Group by country:
In [37]:
# Group the combined data by country for per-country models.
grouped = df.groupby('cntry')
Run a sample country
In [38]:
# Example: fit the model for one country ('IS').
group = grouped.get_group('IS')
run_model(group, formula).summary()
Out[38]:
Run all countries
In [39]:
def run_logits(grouped, formula, var):
    """Fits a logit model per country and prints one coefficient each.

    Prints the country name, sample size, coefficient, significance
    stars, and an arrow flagging significant positive associations.

    grouped: GroupBy of per-country DataFrames
    formula: Patsy formula string
    var: name of the variable whose coefficient is reported
    """
    for code, country_df in grouped:
        # pad/truncate the readable country name to 14 characters
        label = '%14.14s' % COUNTRY[code]
        fit = smf.logit(formula, data=country_df).fit(disp=False)
        nobs, param, stars = extract_res(fit, var=var)
        flag = '<--' if stars and param > 0 else ''
        print(label, nobs, '%0.3g' % param, stars, flag, sep='\t')
In [40]:
# Per-country logits: coefficient of Internet use.
formula = ('hasrelig_f ~ inwyr07_f + yrbrn60_f + edurank_f + hincrank_f +'
           'tvtot_f + rdtot_f + nwsptot_f + netuse_f')
run_logits(grouped, formula, 'netuse_f')
In [41]:
# Per-country logits: coefficient of income rank.
run_logits(grouped, formula, 'hincrank_f')
In [42]:
# Per-country logits: coefficient of education rank.
run_logits(grouped, formula, 'edurank_f')
Run OLS model with rlgdgr
: "Regardless of whether you belong to a particular religion, how religious would you say you are?"
In [43]:
# Pooled OLS of degree of religiosity on the same predictors.
formula = ('rlgdgr_f ~ inwyr07_f + yrbrn60_f + edurank_f + hincrank_f +'
           'tvtot_f + rdtot_f + nwsptot_f + netuse_f')
model = smf.ols(formula, data=df)
results = model.fit(disp=False)
results.summary()
Out[43]:
In [44]:
def run_ols(grouped, formula, var):
    """Fits an OLS model per country and prints one coefficient each.

    Output format matches run_logits: country name, sample size,
    coefficient, significance stars, and an arrow flagging
    significant positive associations.

    grouped: GroupBy of per-country DataFrames
    formula: Patsy formula string
    var: name of the variable whose coefficient is reported
    """
    for name, group in grouped:
        # use the readable country name and nobs, consistent with
        # run_logits (the original printed the raw code and len(group))
        country = '%14.14s' % COUNTRY[name]
        model = smf.ols(formula, data=group)
        results = model.fit(disp=False)
        nobs, param, stars = extract_res(results, var=var)
        arrow = '<--' if stars and param > 0 else ''
        print(country, nobs, '%0.3g' % param, stars, arrow, sep='\t')
In [45]:
# Per-country OLS: coefficient of Internet use.
run_ols(grouped, formula, 'netuse_f')
In [46]:
# Per-country OLS: coefficient of education rank.
run_ols(grouped, formula, 'edurank_f')
In [47]:
# Per-country OLS: coefficient of income rank.
run_ols(grouped, formula, 'hincrank_f')
Let's see what happens if we add quadratic terms for edurank and hincrank:
In [48]:
# Quadratic terms for education and income ranks.
df['edurank_f2'] = df.edurank_f**2
df['hincrank_f2'] = df.hincrank_f**2
In [49]:
# Model with the education quadratic term added.
formula = ('rlgdgr_f ~ inwyr07_f + yrbrn60_f + edurank_f + edurank_f2 + hincrank_f +'
           'tvtot_f + rdtot_f + nwsptot_f + netuse_f')
In [50]:
# Per-country OLS: linear education coefficient (with quadratic in model).
run_ols(grouped, formula, 'edurank_f')
In [51]:
# Per-country OLS: quadratic education coefficient.
run_ols(grouped, formula, 'edurank_f2')
In [52]:
# Model with both quadratic terms added.
formula = ('rlgdgr_f ~ inwyr07_f + yrbrn60_f + edurank_f + edurank_f2 + '
           'hincrank_f + hincrank_f2 + '
           'tvtot_f + rdtot_f + nwsptot_f + netuse_f')
In [53]:
# Per-country OLS: linear income coefficient (with both quadratics).
run_ols(grouped, formula, 'hincrank_f')
In [54]:
# Per-country OLS: linear education coefficient (with both quadratics).
run_ols(grouped, formula, 'edurank_f')
In [55]:
# Per-country OLS: Internet-use coefficient (with both quadratics).
run_ols(grouped, formula, 'netuse_f')
In [ ]: