Data analysis tools

For the Week 2 assignment I'm testing:

Null hypothesis - there is no difference in beer consumption between people with and without a major depression diagnosis

Alternate hypothesis - there is a difference in beer consumption between people with and without a major depression diagnosis


In [1]:
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

# MAJORDEPLIFE - Diagnosed major depressions in lifetime
# S2AQ5A - Drink beer in last 12 months
# S2AQ5B - How often drank a beer in last year

data['MAJORDEPLIFE'] = pandas.to_numeric(data['TAB12MDX'], errors='coerce')
data['S2AQ5A'] = pandas.to_numeric(data['S2AQ5A'], errors='coerce')
data['S2AQ5B'] = pandas.to_numeric(data['S2AQ5B'], errors='coerce')
data['AGE'] = pandas.to_numeric(data['AGE'], errors='coerce')

The following block prepares the data and prints a contingency table with the number of people in each category:


In [2]:
# subset data to young adults aged 18 to 25 who have drunk beer in the past 12 months
sub1=data[(data['AGE']>=18) & (data['AGE']<=25) & (data['S2AQ5A']==1)]
sub2 = sub1.copy()

# recode missing values to python missing (NaN)
sub2['S2AQ5B']=sub2['S2AQ5B'].replace(99, numpy.nan)

# contingency table of observed counts
ct1=pandas.crosstab(sub2['MAJORDEPLIFE'], sub2['S2AQ5B'])
print (ct1)


S2AQ5B        1   2    3    4    5    6    7    8    9    10
MAJORDEPLIFE                                                
0             41  48  157  271  325  355  289  130  228  312
1             34  31   86  128  107  116   66   28   57   49

The next block prints the same contingency table as column proportions (each count divided by its column total) instead of absolute counts:


In [5]:
# column proportions: each cell divided by its column total
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)


S2AQ5B              1         2         3         4         5         6   \
MAJORDEPLIFE                                                               
0             0.546667  0.607595  0.646091  0.679198  0.752315  0.753715   
1             0.453333  0.392405  0.353909  0.320802  0.247685  0.246285   

S2AQ5B              7         8    9         10  
MAJORDEPLIFE                                     
0             0.814085  0.822785  0.8  0.864266  
1             0.185915  0.177215  0.2  0.135734  
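
As a quick sanity check on the table above, the column proportions can be reproduced by hand from the observed counts. This is only a plain Python 3 sketch that hard-codes the counts printed earlier; the variable names are my own and are not part of the notebook.

```python
# Observed counts copied from the contingency table above.
# Keys are MAJORDEPLIFE values; lists follow S2AQ5B categories 1..10.
observed = {
    0: [41, 48, 157, 271, 325, 355, 289, 130, 228, 312],
    1: [34, 31, 86, 128, 107, 116, 66, 28, 57, 49],
}

# Column totals: number of respondents in each S2AQ5B category.
col_totals = [a + b for a, b in zip(observed[0], observed[1])]

# Column proportions: each cell divided by its column total.
col_props = {
    group: [count / total for count, total in zip(counts, col_totals)]
    for group, counts in observed.items()
}

for group, props in col_props.items():
    print(group, [round(p, 6) for p in props])
```

Within each column the two proportions add up to 1, which is why the rows of the printed table mirror each other.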

Chi-square test


In [10]:
# chi-square test
cs1= scipy.stats.chi2_contingency(ct1)
# chi2_contingency returns (statistic, p value, degrees of freedom, expected counts)
print('Chi-square value: ', cs1[0])
print('p value: ', cs1[1])
print('Expected counts:', cs1[3])


Chi-square value:  91.7559212919
p value:  7.22908728572e-16
Expected counts: [[  56.57802659   59.59552134  183.31280616  300.99510147  325.88943317
   355.310007    267.8026592   119.19104269  214.99650105  272.32890133]
 [  18.42197341   19.40447866   59.68719384   98.00489853  106.11056683
   115.689993     87.1973408    38.80895731   70.00349895   88.67109867]]

The test is significant: the p value (about 7.2e-16) is far below 0.05, so we reject the null hypothesis in favor of the alternate one.
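
To make the numbers above less of a black box, here is a hedged plain Python 3 sketch that recomputes the expected counts and the chi-square statistic by hand from the observed table. It assumes the usual Pearson formula (expected = row total * column total / grand total); the variable names are illustrative only.

```python
# Observed counts copied from the contingency table printed earlier.
observed = [
    [41, 48, 157, 271, 325, 355, 289, 130, 228, 312],  # MAJORDEPLIFE == 0
    [34, 31, 86, 128, 107, 116, 66, 28, 57, 49],       # MAJORDEPLIFE == 1
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum over all cells of (O - E)^2 / E.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (len(observed) - 1) * (len(col_totals) - 1)

print('chi-square:', chi2, 'degrees of freedom:', dof)
print('expected count, first cell:', expected[0][0])
```

The statistic and the first expected count should agree with the values scipy printed above, up to floating point rounding.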

Bonferroni adjusted p - post-hoc test


In [14]:
# 10 frequency categories give 10 * 9 / 2 = 45 pairwise comparisons
bp_adjusted = 0.05 / 45
print('adjusted p value: ', bp_adjusted)


adjusted p value:  0.00111111111111

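The adjusted threshold applies to the 45 pairwise post-hoc comparisons below (10 categories give 10 * 9 / 2 = 45 pairs), not to the overall chi-square test, which was already significant at 0.05. A pairwise difference counts as significant only if its p value is below 0.0011.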
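
Where the 45 comes from can be checked with a few lines of Python 3. This is just an illustrative sketch; `categories`, `pairs` and `bonferroni_alpha` are names I made up for it.

```python
from itertools import combinations

categories = range(1, 11)                   # S2AQ5B frequency codes 1..10
pairs = list(combinations(categories, 2))   # every unordered pair of categories

alpha = 0.05
bonferroni_alpha = alpha / len(pairs)       # Bonferroni: divide alpha by the number of tests

print('number of pairwise comparisons:', len(pairs))
print('adjusted threshold:', bonferroni_alpha)
```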

Individual category pairs post-hoc test


In [21]:
for idx1 in range(1, 11):
    for idx2 in range(idx1 + 1, 11):
        # keep only the two categories being compared; everything else becomes NaN
        map_filter = {idx1: idx1, idx2: idx2}
        sub2['COMP1v2'] = sub2['S2AQ5B'].map(map_filter)

        # contingency table of observed counts
        ct2 = pandas.crosstab(sub2['MAJORDEPLIFE'], sub2['COMP1v2'])
        # chi-square test, judged against the Bonferroni adjusted threshold
        cs2 = scipy.stats.chi2_contingency(ct2)
        print('Category1:', idx1, 'Category2:', idx2, 'P-value:', cs2[1], 'Rejected:', cs2[1] < bp_adjusted)


Category1: 1 Category2: 2 P-value: 0.547186731823 Rejected: False
Category1: 1 Category2: 3 P-value: 0.156616903798 Rejected: False
Category1: 1 Category2: 4 P-value: 0.0368415781231 Rejected: False
Category1: 1 Category2: 5 P-value: 0.000416443651544 Rejected: True
Category1: 1 Category2: 6 P-value: 0.000328566460663 Rejected: True
Category1: 1 Category2: 7 P-value: 1.36159833394e-06 Rejected: True
Category1: 1 Category2: 8 P-value: 1.72911221141e-05 Rejected: True
Category1: 1 Category2: 9 P-value: 1.41010285511e-05 Rejected: True
Category1: 1 Category2: 10 P-value: 5.18542935195e-10 Rejected: True
Category1: 2 Category2: 3 P-value: 0.628841311934 Rejected: False
Category1: 2 Category2: 4 P-value: 0.269844416553 Rejected: False
Category1: 2 Category2: 5 P-value: 0.0115372823375 Rejected: False
Category1: 2 Category2: 6 P-value: 0.00992353822001 Rejected: False
Category1: 2 Category2: 7 P-value: 0.000125483472413 Rejected: True
Category1: 2 Category2: 8 P-value: 0.000555862101885 Rejected: True
Category1: 2 Category2: 9 P-value: 0.000709597273352 Rejected: True
Category1: 2 Category2: 10 P-value: 2.02940227136e-07 Rejected: True
Category1: 3 Category2: 4 P-value: 0.4372899875 Rejected: False
Category1: 3 Category2: 5 P-value: 0.00446964612246 Rejected: False
Category1: 3 Category2: 6 P-value: 0.0033076578016 Rejected: False
Category1: 3 Category2: 7 P-value: 5.66391841695e-06 Rejected: True
Category1: 3 Category2: 8 P-value: 0.000199486220784 Rejected: True
Category1: 3 Category2: 9 P-value: 0.000109607355374 Rejected: True
Category1: 3 Category2: 10 P-value: 5.23681988891e-10 Rejected: True
Category1: 4 Category2: 5 P-value: 0.0237498460441 Rejected: False
Category1: 4 Category2: 6 P-value: 0.0181636872504 Rejected: False
Category1: 4 Category2: 7 P-value: 3.38679931987e-05 Rejected: True
Category1: 4 Category2: 8 P-value: 0.00097638863584 Rejected: True
Category1: 4 Category2: 9 P-value: 0.000628032752038 Rejected: True
Category1: 4 Category2: 10 P-value: 2.81824803258e-09 Rejected: True
Category1: 5 Category2: 6 P-value: 0.977276237751 Rejected: False
Category1: 5 Category2: 7 P-value: 0.0459712143641 Rejected: False
Category1: 5 Category2: 8 P-value: 0.0903191359526 Rejected: False
Category1: 5 Category2: 9 P-value: 0.162449381314 Rejected: False
Category1: 5 Category2: 10 P-value: 0.000113533666137 Rejected: True
Category1: 6 Category2: 7 P-value: 0.0468670925646 Rejected: False
Category1: 6 Category2: 8 P-value: 0.0932079333443 Rejected: False
Category1: 6 Category2: 9 P-value: 0.167946370294 Rejected: False
Category1: 6 Category2: 10 P-value: 0.000106273018234 Rejected: True
Category1: 7 Category2: 8 P-value: 0.911174832533 Rejected: False
Category1: 7 Category2: 9 P-value: 0.727455646198 Rejected: False
Category1: 7 Category2: 10 P-value: 0.0842294361331 Rejected: False
Category1: 8 Category2: 9 P-value: 0.647361007788 Rejected: False
Category1: 8 Category2: 10 P-value: 0.276066950601 Rejected: False
Category1: 9 Category2: 10 P-value: 0.0372587697866 Rejected: False

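Here we ran all 45 pairwise comparisons. Judged against the Bonferroni adjusted threshold of 0.0011, the null hypothesis can be rejected for 20 pairs and cannot be rejected for the remaining 25. Every one of categories 1 to 4 differs significantly from every one of categories 7 to 10, while categories 5 and 6 differ significantly only from categories 1 and 10.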
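
For readers who want to see what a single pairwise comparison does, here is a hedged plain Python 3 sketch of a 2x2 chi-square test for categories 1 and 2, using the counts from the contingency table. For tables with one degree of freedom, scipy.stats.chi2_contingency applies Yates' continuity correction by default, so the sketch does the same. The helper name `chi2_2x2_yates` is my own invention and not a library function.

```python
import math

def chi2_2x2_yates(table):
    """Return (statistic, p_value) for a 2x2 table with Yates' correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            # Yates: shrink |O - E| by 0.5, but never past zero.
            diff = max(abs(obs - exp) - 0.5, 0.0)
            stat += diff ** 2 / exp
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# S2AQ5B categories 1 and 2; rows are MAJORDEPLIFE 0 and 1.
stat, p = chi2_2x2_yates([[41, 48], [34, 31]])
bonferroni_alpha = 0.05 / 45
print('p value:', p, 'significant after Bonferroni:', p < bonferroni_alpha)
```

The p value should come out close to the one listed for Category1: 1, Category2: 2, and it is far above the adjusted threshold, so this pair is not significantly different.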