For the Week 4 assignment I test whether gender moderates the strength or the direction of the relationship between beer consumption and a lifetime diagnosis of major depression.
In [1]:
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
data = pandas.read_csv('nesarc_pds.csv', low_memory=False)
# MAJORDEPLIFE - Major depression diagnosed in lifetime
# S2AQ5A - Drank beer in last 12 months
# S2AQ5B - How often drank beer in last 12 months
# convert the depression variable itself (not TAB12MDX) so the data match the stated question
data['MAJORDEPLIFE'] = pandas.to_numeric(data['MAJORDEPLIFE'], errors='coerce')
data['S2AQ5A'] = pandas.to_numeric(data['S2AQ5A'], errors='coerce')
data['S2AQ5B'] = pandas.to_numeric(data['S2AQ5B'], errors='coerce')
data['AGE'] = pandas.to_numeric(data['AGE'], errors='coerce')
The following block prepares the data and prints a contingency table with the number of people in each category:
In [2]:
# subset data to young adults aged 18 to 25 who drank beer in the past 12 months
sub1=data[(data['AGE']>=18) & (data['AGE']<=25) & (data['S2AQ5A']==1)]
sub2 = sub1.copy()
# recode missing values to python missing (NaN)
sub2['S2AQ5B']=sub2['S2AQ5B'].replace(99, numpy.nan)
# contingency table of observed counts
ct1=pandas.crosstab(sub2['MAJORDEPLIFE'], sub2['S2AQ5B'])
print (ct1)
The next block prints the same contingency table as column proportions (the share of each depression status within every beer-frequency category) instead of absolute counts:
In [3]:
# column percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)
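As a side note, the column proportions can probably also be obtained in one step with the `normalize` argument of `pandas.crosstab`. The sketch below uses a small made-up data frame (the `toy` frame and its column names are illustrative assumptions, not NESARC values) and checks that both routes give the same result:

```python
import pandas

# Made-up toy data for illustration only (not NESARC values)
toy = pandas.DataFrame({
    'depressed': [0, 0, 1, 1, 0, 1, 0, 0],
    'beer_freq': [1, 1, 1, 2, 2, 2, 2, 3],
})

# Manual column proportions, the same way as in the cell above
counts = pandas.crosstab(toy['depressed'], toy['beer_freq'])
manual_pct = counts / counts.sum(axis=0)

# The same proportions computed directly by crosstab
direct_pct = pandas.crosstab(toy['depressed'], toy['beer_freq'],
                             normalize='columns')

print(direct_pct)
```

Each column of the resulting table sums to 1, so the values are proportions; multiply by 100 to read them as percentages.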
Now we split the dataset into two subgroups according to the moderator variable, gender:
In [4]:
sub3 = sub2[(sub2['SEX'] == 1)] # Males only
sub4 = sub2[(sub2['SEX'] == 2)] # Females only
print('Contingency tables in absolute and percentage values for MALE subset')
# both variables must come from the male subset
ct2 = pandas.crosstab(sub3['MAJORDEPLIFE'], sub3['S2AQ5B'])
print(ct2)
# column percentages
colsum2 = ct2.sum(axis=0)
colpct2 = ct2/colsum2
print(colpct2)
print('Contingency tables in absolute and percentage values for FEMALE subset')
# both variables must come from the female subset
ct3 = pandas.crosstab(sub4['MAJORDEPLIFE'], sub4['S2AQ5B'])
print(ct3)
# column percentages
colsum3 = ct3.sum(axis=0)
colpct3 = ct3/colsum3
print(colpct3)
In [5]:
# chi-square tests
# chi2_contingency returns (chi-square statistic, p value, degrees of freedom, expected counts)
cs2 = scipy.stats.chi2_contingency(ct2)
print('[Males dataset] Chi-square value: {}'.format(cs2[0]))
print('[Males dataset] p value: {}'.format(cs2[1]))
print('[Males dataset] Expected counts: {}'.format(cs2[3]))
cs3 = scipy.stats.chi2_contingency(ct3)
print('[Females dataset] Chi-square value: {}'.format(cs3[0]))
print('[Females dataset] p value: {}'.format(cs3[1]))
print('[Females dataset] Expected counts: {}'.format(cs3[3]))
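The per-group steps above (subset, cross-tabulate, test) could also be wrapped in one helper that loops over the levels of the moderator. This is only a sketch: the function name `chi_square_by_group` and the toy data are my own assumptions for illustration, not part of the assignment code.

```python
import pandas
import scipy.stats


def chi_square_by_group(df, row_var, col_var, moderator):
    """Run a chi-square test of row_var vs. col_var within each moderator level."""
    results = {}
    for level, group in df.groupby(moderator):
        table = pandas.crosstab(group[row_var], group[col_var])
        chi2, p, dof, expected = scipy.stats.chi2_contingency(table)
        results[level] = {'chi2': chi2, 'p': p, 'dof': dof,
                          'n': int(table.values.sum())}
    return results


# Made-up toy data for illustration only (not NESARC values)
toy = pandas.DataFrame({
    'SEX': [1] * 12 + [2] * 12,
    'depressed': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1] * 2,
    'beer_freq': [1, 1, 1, 2, 2, 3, 1, 2, 2, 3, 3, 3]
                 + [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3],
})

results = chi_square_by_group(toy, 'depressed', 'beer_freq', 'SEX')
for level, res in results.items():
    print(level, res)
```

With the real data the call would presumably be `chi_square_by_group(sub2, 'MAJORDEPLIFE', 'S2AQ5B', 'SEX')`, which avoids mixing up the subsets when building the tables.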
Both chi-square tests are significant: the p value in each test is much smaller than 0.05.
The relationship between beer consumption and major depression diagnosis is similar in the male and female groups, although the test result is more strongly significant for the male subgroup. Because the association is significant in both subgroups and follows a similar pattern, we can say that gender does not moderate the relationship between beer consumption and major depression diagnosis.
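One caveat when comparing the two subgroups: the chi-square statistic grows with the sample size, so a larger value in one group does not by itself mean a stronger association. An effect size such as Cramér's V, sqrt(chi2 / (n * (min(rows, cols) - 1))), is one common way to compare strength. The sketch below is not part of the original analysis; the helper `cramers_v` and the counts are made up for illustration.

```python
import numpy
import scipy.stats


def cramers_v(table):
    """Cramér's V effect size for a contingency table of counts."""
    table = numpy.asarray(table)
    # correction=False: use the plain chi-square statistic (no Yates correction)
    chi2 = scipy.stats.chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return numpy.sqrt(chi2 / (n * k))


# Made-up counts: same proportions, but the second table has 10x the sample size
small = numpy.array([[30, 20, 10],
                     [10, 20, 30]])
large = small * 10

chi2_small = scipy.stats.chi2_contingency(small, correction=False)[0]
chi2_large = scipy.stats.chi2_contingency(large, correction=False)[0]

print('chi-square:', chi2_small, chi2_large)              # differs with sample size
print("Cramér's V:", cramers_v(small), cramers_v(large))  # same for both tables
```

With the real tables, `cramers_v(ct2)` and `cramers_v(ct3)` would give comparable strength measures for the male and female subsets.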