This notebook presents a quick-and-dirty analysis of the association between Internet use and religion in Europe, using data from the European Social Survey (http://www.europeansocialsurvey.org).
Copyright 2015 Allen Downey
MIT License: http://opensource.org/licenses/MIT
In [1]:
from __future__ import print_function, division
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
%matplotlib inline
The following function selects the columns I need.
In [2]:
def select_cols(df):
    cols = ['cntry', 'tvtot', 'tvpol', 'rdtot', 'rdpol', 'nwsptot', 'nwsppol', 'netuse',
            'rlgblg', 'rlgdgr', 'eduyrs', 'hinctnta', 'yrbrn', 'eisced', 'pspwght', 'pweight']
    df = df[cols]
    return df
Read data from Cycle 1.
TODO: investigate the difference between hinctnt and hinctnta; is there a recode that reconciles them?
In [3]:
df1 = pd.read_stata('ESS1e06_4.dta', convert_categoricals=False)
df1['hinctnta'] = df1.hinctnt
df1 = select_cols(df1)
df1.head()
Out[3]:
Read data from Cycle 2.
In [4]:
df2 = pd.read_stata('ESS2e03_4.dta', convert_categoricals=False)
df2['hinctnta'] = df2.hinctnt
df2 = select_cols(df2)
df2.head()
Out[4]:
Read data from Cycle 3.
In [5]:
df3 = pd.read_stata('ESS3e03_5.dta', convert_categoricals=False)
df3['hinctnta'] = df3.hinctnt
df3 = select_cols(df3)
df3.head()
Out[5]:
Read data from Cycle 4.
In [6]:
df4 = pd.read_stata('ESS4e04_3.dta', convert_categoricals=False)
df4 = select_cols(df4)
df4.head()
Out[6]:
Read data from Cycle 5.
In [7]:
df5 = pd.read_stata('ESS5e03_2.dta', convert_categoricals=False)
df5 = select_cols(df5)
df5.head()
Out[7]:
Concatenate the cycles.
TODO: Have to resample each cycle before concatenating.
In [8]:
df = pd.concat([df1, df2, df3, df4, df5], ignore_index=True)
df.head()
Out[8]:
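The resampling TODO above could be handled with pandas' weighted sampling, using the post-stratification weights (pspwght) selected earlier, run once per cycle before pd.concat. A minimal, self-contained sketch on toy data (the toy frame and numbers are made up for illustration):

```python
import pandas as pd

# Toy stand-in for one ESS cycle: two respondents with unequal
# post-stratification weights (the real data would use pspwght).
cycle = pd.DataFrame({'netuse': [0, 7], 'pspwght': [0.5, 1.5]})

# Resample with replacement, probability proportional to weight.
resampled = cycle.sample(n=2000, replace=True,
                         weights=cycle.pspwght, random_state=17)

# The heavily weighted respondent should appear about 3x as often.
ratio = (resampled.netuse == 7).sum() / (resampled.netuse == 0).sum()
```

The same call, applied to each of df1..df5 before concatenating, would give a weighted bootstrap of each cycle.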
TV watching time on average weekday
In [9]:
df.tvtot.replace([77, 88, 99], np.nan, inplace=True)
df.tvtot.value_counts().sort_index()
Out[9]:
Radio listening, total time on average weekday.
In [10]:
df.rdtot.replace([77, 88, 99], np.nan, inplace=True)
df.rdtot.value_counts().sort_index()
Out[10]:
Newspaper reading, total time on average weekday.
In [11]:
df.nwsptot.replace([77, 88, 99], np.nan, inplace=True)
df.nwsptot.value_counts().sort_index()
Out[11]:
TV watching: news, politics, current affairs
In [12]:
df.tvpol.replace([66, 77, 88, 99], np.nan, inplace=True)
df.tvpol.value_counts().sort_index()
Out[12]:
Radio listening: news, politics, current affairs
In [13]:
df.rdpol.replace([66, 77, 88, 99], np.nan, inplace=True)
df.rdpol.value_counts().sort_index()
Out[13]:
Newspaper reading: politics, current affairs
In [14]:
df.nwsppol.replace([66, 77, 88, 99], np.nan, inplace=True)
df.nwsppol.value_counts().sort_index()
Out[14]:
Personal use of Internet, email, www
In [15]:
df.netuse.replace([77, 88, 99], np.nan, inplace=True)
df.netuse.value_counts().sort_index()
Out[15]:
Belong to a particular religion or denomination
In [16]:
df.rlgblg.replace([7, 8, 9], np.nan, inplace=True)
df.rlgblg.value_counts().sort_index()
Out[16]:
How religious
In [17]:
df.rlgdgr.replace([77, 88, 99], np.nan, inplace=True)
df.rlgdgr.value_counts().sort_index()
Out[17]:
Total household net income, all sources
TODO: It looks like one cycle measured HINCTNT on a 12-point scale rather than 10; might need to reconcile the scales before pooling.
In [18]:
df.hinctnta.replace([77, 88, 99], np.nan, inplace=True)
df.hinctnta.value_counts().sort_index()
Out[18]:
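If one cycle really does use a 12-point income scale, a crude way to reconcile it is a linear recode onto 10 points (quantile matching against the other cycles' distributions would be better). A toy sketch of the idea, not the actual ESS recode:

```python
import numpy as np
import pandas as pd

# Toy responses on a hypothetical 12-point income scale (1..12).
inc12 = pd.Series(range(1, 13))

# Crude linear recode onto a 10-point scale (1..10); it is monotone,
# but it ignores the actual category boundaries.
inc10 = np.ceil(inc12 * 10 / 12).astype(int)
```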
Shift income to mean near 0.
In [19]:
df['hinctnta5'] = df.hinctnta - 5
df.hinctnta5.describe()
Out[19]:
Year born
In [20]:
df.yrbrn.replace([7777, 8888, 9999], np.nan, inplace=True)
df.yrbrn.describe()
Out[20]:
Shifted to mean near 0
In [21]:
df['yrbrn60'] = df.yrbrn - 1960
df.yrbrn60.describe()
Out[21]:
Number of years of education
In [22]:
df.eduyrs.replace([77, 88, 99], np.nan, inplace=True)
df.loc[df.eduyrs > 25, 'eduyrs'] = 25
df.eduyrs.value_counts().sort_index()
Out[22]:
There were a bunch of really high values for eduyrs, which I capped at 25 above; need to investigate.
In [23]:
df.eduyrs.describe()
Out[23]:
Shift to mean near 0
In [24]:
df['eduyrs12'] = df.eduyrs - 12
In [25]:
df.eduyrs12.describe()
Out[25]:
Country codes
In [26]:
df.cntry.value_counts().sort_index()
Out[26]:
Make a binary dependent variable
In [27]:
df['hasrelig'] = (df.rlgblg == 1).astype(int)
Run the model
In [28]:
def run_model(df, formula):
    model = smf.logit(formula, data=df)
    results = model.fit(disp=False)
    return results
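run_model wraps statsmodels' formula-based logit. A self-contained sanity check on synthetic data with a known slope (the variable names and numbers here are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: probability of y=1 falls as x grows (true slope -1).
rng = np.random.RandomState(17)
x = rng.normal(size=2000)
p = 1 / (1 + np.exp(-(0.5 - 1.0 * x)))
y = (rng.uniform(size=2000) < p).astype(int)
toy = pd.DataFrame({'y': y, 'x': x})

res = smf.logit('y ~ x', data=toy).fit(disp=False)
# res.params['x'] should come out near the true slope of -1.
```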
Here's the model with all control variables and all media variables:
In [29]:
formula = ('hasrelig ~ yrbrn60 + eduyrs12 + hinctnta5 + '
           'tvtot + tvpol + rdtot + rdpol + nwsptot + nwsppol + netuse')
res = run_model(df, formula)
res.summary()
Out[29]:
Most of the media variables are not statistically significant. If we drop the political media variables, we get a cleaner model:
In [30]:
formula = ('hasrelig ~ yrbrn60 + eduyrs12 + hinctnta5 + '
           'tvtot + rdtot + nwsptot + netuse')
res = run_model(df, formula)
res.summary()
Out[30]:
And if we fill missing values for income, cleaner still.
In [31]:
def fill_var(df, var):
    fill = df[var].dropna().sample(len(df), replace=True)
    fill.index = df.index
    df[var].fillna(fill, inplace=True)
In [32]:
fill_var(df, var='hinctnta5')
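fill_var does simple random (hot-deck style) imputation: each missing value is replaced by a draw, with replacement, from the observed values, which roughly preserves the marginal distribution but ignores relationships with other variables. A self-contained version of the same idea on toy data (written with assignment rather than inplace fillna, so it behaves the same across pandas versions):

```python
import numpy as np
import pandas as pd

def fill_var(df, var):
    # Draw replacements at random, with replacement, from observed values.
    fill = df[var].dropna().sample(len(df), replace=True, random_state=17)
    fill.index = df.index
    df[var] = df[var].fillna(fill)

toy = pd.DataFrame({'x': [1.0, 2.0, np.nan, 2.0, np.nan]})
fill_var(toy, 'x')
# toy.x now has no NaNs, and every filled value comes from {1.0, 2.0}.
```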
In [33]:
formula = ('hasrelig ~ yrbrn60 + eduyrs12 + hinctnta5 + '
           'tvtot + rdtot + nwsptot + netuse')
res = run_model(df, formula)
res.summary()
Out[33]:
Now all variables have small p-values, and all parameters have the expected signs.
The effect of Internet use is stronger than that of the other media.
These results are consistent in each cycle of the data, and across a few changes I've made in the cleaning process.
However, these results should be considered preliminary.
Nevertheless, I'll run a breakdown by country.
Here's a function to extract the parameter associated with netuse:
In [34]:
def extract_res(res, var='netuse'):
    param = res.params[var]
    pvalue = res.pvalues[var]
    stars = '**' if pvalue < 0.01 else '*' if pvalue < 0.05 else ''
    return res.nobs, param, stars

extract_res(res)
Out[34]:
Run a similar model with degree of religiosity as the dependent variable.
In [35]:
formula = ('rlgdgr ~ yrbrn60 + eduyrs12 + hinctnta5 + '
           'tvtot + rdtot + nwsptot + netuse')
model = smf.ols(formula, data=df)
res = model.fit()
res.summary()
Out[35]:
Group by country:
In [36]:
grouped = df.groupby('cntry')
for name, group in grouped:
    print(name, len(group))
Run the model for a sample country.
In [37]:
gb = grouped.get_group('DK')
run_model(gb, formula).summary()
Run all countries
In [ ]:
for name, group in grouped:
    try:
        fill_var(group, var='hinctnta5')
        res = run_model(group, formula)
        nobs, param, stars = extract_res(res)
        arrow = '<--' if stars and param > 0 else ''
        print(name, len(group), nobs, '%0.3g' % param, stars, arrow, sep='\t')
    except Exception:
        print(name, len(group), ' ', 'NA', sep='\t')
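To make the country-by-country output easier to scan, the tuples from extract_res could be collected into a DataFrame and sorted by the netuse parameter. A sketch with made-up numbers (real rows would come from the loop above):

```python
import pandas as pd

# Hypothetical (country, nobs, netuse parameter, stars) rows,
# standing in for the values printed by the loop.
rows = [('PL', 5102, 0.013, ''),
        ('DK', 4321, -0.062, '**'),
        ('FR', 3890, -0.041, '*')]
results = pd.DataFrame(rows, columns=['cntry', 'nobs', 'param', 'stars'])
results = results.sort_values('param').reset_index(drop=True)
```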
In more than half of the countries, the association between Internet use and religious affiliation is statistically significant. In all except two, the association is negative.
In many countries we've lost a substantial number of observations due to missing data. Really need to fill that in!
In [ ]:
group = grouped.get_group('FR')
len(group)
In [ ]:
for col in group.columns:
    print(col, sum(group[col].isnull()))
In [ ]:
fill_var(group, 'hinctnta5')
In [ ]:
formula
In [ ]:
res = run_model(group, formula)
res.summary()
In [ ]: