Data analysis tools

For the Week 4 assignment I am testing whether gender affects either the strength or the direction of the relationship between beer consumption and whether a person has ever had a major depression diagnosis.


In [1]:
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

# MAJORDEPLIFE - Diagnosed major depressions in lifetime
# S2AQ5A - Drink beer in last 12 months
# S2AQ5B - How often drank a beer in last year

data['MAJORDEPLIFE'] = pandas.to_numeric(data['TAB12MDX'], errors='coerce')
data['S2AQ5A'] = pandas.to_numeric(data['S2AQ5A'], errors='coerce')
data['S2AQ5B'] = pandas.to_numeric(data['S2AQ5B'], errors='coerce')
data['AGE'] = pandas.to_numeric(data['AGE'], errors='coerce')

The following block prepares the data we will use and prints a contingency table with the number of people in each category:


In [2]:
# subset data to young adults aged 18 to 25 who have drunk beer in the past 12 months
sub1=data[(data['AGE']>=18) & (data['AGE']<=25) & (data['S2AQ5A']==1)]
sub2 = sub1.copy()

# recode missing values to python missing (NaN)
sub2['S2AQ5B']=sub2['S2AQ5B'].replace(99, numpy.nan)

# contingency table of observed counts
ct1=pandas.crosstab(sub2['MAJORDEPLIFE'], sub2['S2AQ5B'])
print (ct1)


S2AQ5B        1   2    3    4    5    6    7    8    9    10
MAJORDEPLIFE                                                
0             41  48  157  271  325  355  289  130  228  312
1             34  31   86  128  107  116   66   28   57   49
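
As a small illustration of the recoding step, the sketch below uses a made-up five-row frame (not NESARC data) to show that replacing the code 99 with NaN keeps those rows out of the contingency table. The column names mirror the ones used above; the frame `toy` and its values are invented for the example.

```python
import numpy
import pandas

# Hypothetical toy records; 99 plays the role of the "unknown" code in S2AQ5B.
toy = pandas.DataFrame({
    'MAJORDEPLIFE': [0, 1, 0, 1, 0],
    'S2AQ5B': [1, 1, 2, 99, 99],
})

# Recode 99 to NaN so the unknown answers are treated as missing.
toy['S2AQ5B'] = toy['S2AQ5B'].replace(99, numpy.nan)

# crosstab leaves out rows where either variable is NaN,
# so only the three fully observed rows are counted.
toy_ct = pandas.crosstab(toy['MAJORDEPLIFE'], toy['S2AQ5B'])
print(toy_ct)
```

Without the recoding, 99 would show up as an eleventh frequency category and distort the later chi-square tests.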

The next block prints the same contingency table as column proportions instead of absolute counts:


In [3]:
# column percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)


S2AQ5B              1         2         3         4         5         6   \
MAJORDEPLIFE                                                               
0             0.546667  0.607595  0.646091  0.679198  0.752315  0.753715   
1             0.453333  0.392405  0.353909  0.320802  0.247685  0.246285   

S2AQ5B              7         8    9         10  
MAJORDEPLIFE                                     
0             0.814085  0.822785  0.8  0.864266  
1             0.185915  0.177215  0.2  0.135734  
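
The column proportions can also be requested directly from `pandas.crosstab` through its `normalize` argument. The sketch below compares both routes on a made-up six-row frame (not NESARC data); `toy`, `counts`, `manual` and `direct` are names introduced only for this example.

```python
import pandas

# Hypothetical toy records, not NESARC data.
toy = pandas.DataFrame({
    'MAJORDEPLIFE': [0, 0, 1, 0, 1, 1],
    'S2AQ5B': [1, 1, 1, 2, 2, 2],
})

counts = pandas.crosstab(toy['MAJORDEPLIFE'], toy['S2AQ5B'])

# Manual column proportions: divide each column by its total.
manual = counts / counts.sum(axis=0)

# The same proportions in one call via the normalize argument.
direct = pandas.crosstab(toy['MAJORDEPLIFE'], toy['S2AQ5B'],
                         normalize='columns')

print(manual)
print(direct)
```

Either way, each column sums to 1, so the values read as the share of people with and without a diagnosis within one frequency category.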

Now we split the dataset into two subgroups based on the moderator, gender:


In [4]:
sub3 = sub2[(sub2['SEX'] == 1)] # Males only
sub4 = sub2[(sub2['SEX'] == 2)] # Females only

print('Contingency tables in absolute and percentage values for MALE subset')
ct2 = pandas.crosstab(sub3['MAJORDEPLIFE'], sub3['S2AQ5B'])
print (ct2)

# column percentages
colsum2 = ct2.sum(axis=0)
colpct2 = ct2/colsum2
print(colpct2)

print('Contingency tables in absolute and percentage values for FEMALE subset')
ct3 = pandas.crosstab(sub4['MAJORDEPLIFE'], sub4['S2AQ5B'])
print (ct3)

# column percentages
colsum3 = ct3.sum(axis=0)
colpct3 = ct3/colsum3
print(colpct3)


Contingency tables in absolute and percentage values for MALE subset
S2AQ5B        1   2    3    4    5    6    7   8    9    10
MAJORDEPLIFE                                               
0             36  42  117  188  201  215  165  66  111  148
1             27  24   58   82   68   61   34  13   20   17
S2AQ5B              1         2         3         4         5         6   \
MAJORDEPLIFE                                                               
0             0.571429  0.636364  0.668571  0.696296  0.747212  0.778986   
1             0.428571  0.363636  0.331429  0.303704  0.252788  0.221014   

S2AQ5B              7         8         9        10  
MAJORDEPLIFE                                         
0             0.829146  0.835443  0.847328  0.89697  
1             0.170854  0.164557  0.152672  0.10303  
Contingency tables in absolute and percentage values for FEMALE subset
S2AQ5B        1   2   3   4    5    6    7   8    9    10
MAJORDEPLIFE                                             
0              5   6  40  83  124  140  124  64  117  164
1              7   7  28  46   39   55   32  15   37   32
S2AQ5B              1         2         3         4         5         6   \
MAJORDEPLIFE                                                               
0             0.416667  0.461538  0.588235  0.643411  0.760736  0.717949   
1             0.583333  0.538462  0.411765  0.356589  0.239264  0.282051   

S2AQ5B              7         8        9         10  
MAJORDEPLIFE                                         
0             0.794872  0.810127  0.75974  0.836735  
1             0.205128  0.189873  0.24026  0.163265  
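
The per-subgroup tables can also be built in a loop rather than by writing one block per level of the moderator. The following is only a sketch under stated assumptions: the helper `crosstabs_by_group` and the frame `toy` are hypothetical names, and the records are invented rather than taken from NESARC.

```python
import pandas


def crosstabs_by_group(df, group_col, row_col, col_col):
    """Return {group value: contingency table} for each level of group_col."""
    tables = {}
    for level, part in df.groupby(group_col):
        tables[level] = pandas.crosstab(part[row_col], part[col_col])
    return tables


# Hypothetical toy records, not NESARC data (SEX: 1 = male, 2 = female).
toy = pandas.DataFrame({
    'SEX':          [1, 1, 1, 2, 2, 2, 2],
    'MAJORDEPLIFE': [0, 1, 0, 1, 1, 0, 0],
    'S2AQ5B':       [1, 1, 2, 1, 2, 2, 2],
})

tables = crosstabs_by_group(toy, 'SEX', 'MAJORDEPLIFE', 'S2AQ5B')
for level, table in tables.items():
    print('SEX =', level)
    print(table)
```

Because both variables of each table come from the same subgroup, there is no need to rely on index alignment between the full data and the subset.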

Chi-square tests for both subgroups


In [5]:
# chi-square tests
cs2 = scipy.stats.chi2_contingency(ct2)
print('[Males dataset] Chi-square value: ', cs2[0])
print('[Males dataset] p value: ', cs2[1])
print('[Males dataset] Expected counts:', cs2[3])

cs3 = scipy.stats.chi2_contingency(ct3)
print('[Females dataset] Chi-square value: ', cs3[0])
print('[Females dataset] p value: ', cs3[1])
print('[Females dataset] Expected counts:', cs3[3])


[Males dataset] Chi-square value:  62.9857170406
[Males dataset] p value:  3.55051057886e-10
[Males dataset] Expected counts: [[  47.96633196   50.250443    133.23981099  205.56999409  204.80862374
   210.13821618  151.51269935   60.14825753   99.73951565  125.6261075 ]
 [  15.03366804   15.749557     41.76018901   64.43000591   64.19137626
    65.86178382   47.48730065   18.85174247   31.26048435   39.3738925 ]]
[Females dataset] Chi-square value:  41.6512786359
[Females dataset] p value:  3.80571273777e-06
[Females dataset] Expected counts: [[   8.9304721     9.67467811   50.60600858   96.00257511  121.3055794
   145.12017167  116.09613734   58.79227468  114.60772532  145.86437768]
 [   3.0695279     3.32532189   17.39399142   32.99742489   41.6944206
    49.87982833   39.90386266   20.20772532   39.39227468   50.13562232]]
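
To make the expected counts less of a black box, the sketch below recomputes them by hand for the male table and compares the result with `scipy.stats.chi2_contingency`. The observed counts are copied from the male contingency table printed earlier; the variable names are introduced only for this example.

```python
import numpy
import scipy.stats

# Observed counts copied from the male contingency table above.
observed = numpy.array([
    [36, 42, 117, 188, 201, 215, 165, 66, 111, 148],
    [27, 24, 58, 82, 68, 61, 34, 13, 20, 17],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total.
expected = row_totals * col_totals / grand_total

# Pearson chi-square statistic, degrees of freedom and p value.
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = scipy.stats.chi2.sf(chi2, dof)

# Cross-check against scipy's own result.
chi2_sp, p_sp, dof_sp, expected_sp = scipy.stats.chi2_contingency(observed)
print(chi2, chi2_sp)
print(p_value, p_sp)
```

With 2 rows and 10 columns the test has 9 degrees of freedom, and the hand-computed statistic should agree with the value scipy reports for the male subgroup.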

Both chi-square tests are significant: the p value in each test is much smaller than 0.05.

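The relationship between beer consumption and major depression diagnosis is similar in the male and female groups: in both, the proportion with a diagnosis generally falls as the S2AQ5B category increases. The chi-square statistic is larger for the male subgroup (62.99, p = 3.6e-10) than for the female subgroup (41.65, p = 3.8e-06), but the association is significant and runs in the same direction in both. We can therefore say that gender does not moderate the relationship between beer consumption and major depression diagnosis.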
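
Since the chi-square statistic grows with sample size and the two subgroups differ in size, one possible way to compare the strength of the association on a common scale is Cramér's V. This measure is not part of the analysis above; the sketch below is an optional supplement, with the helper `cramers_v` introduced only for illustration and the counts copied from the two subgroup tables printed earlier.

```python
import math

import numpy
import scipy.stats


def cramers_v(observed):
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    observed = numpy.asarray(observed)
    chi2, _, _, _ = scipy.stats.chi2_contingency(observed)
    n = observed.sum()
    k = min(observed.shape) - 1
    return math.sqrt(chi2 / (n * k))


# Counts copied from the male and female contingency tables above.
male = [
    [36, 42, 117, 188, 201, 215, 165, 66, 111, 148],
    [27, 24, 58, 82, 68, 61, 34, 13, 20, 17],
]
female = [
    [5, 6, 40, 83, 124, 140, 124, 64, 117, 164],
    [7, 7, 28, 46, 39, 55, 32, 15, 37, 32],
]

v_male = cramers_v(male)
v_female = cramers_v(female)
print('Male V:', v_male)
print('Female V:', v_female)
```

Similar values of V in the two subgroups would be consistent with the conclusion that gender does not moderate the relationship.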