For Week 2 assignment I'm testing:
Null hypothesis: there is no difference in beer consumption between people with and without a major depression diagnosis.
Alternate hypothesis: there is a difference in beer consumption between people with and without a major depression diagnosis.
In [1]:
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
data = pandas.read_csv('nesarc_pds.csv', low_memory=False)
# MAJORDEPLIFE - Diagnosed major depressions in lifetime
# S2AQ5A - Drink beer in last 12 months
# S2AQ5B - How often drank a beer in last year
data['MAJORDEPLIFE'] = pandas.to_numeric(data['MAJORDEPLIFE'], errors='coerce')
data['S2AQ5A'] = pandas.to_numeric(data['S2AQ5A'], errors='coerce')
data['S2AQ5B'] = pandas.to_numeric(data['S2AQ5B'], errors='coerce')
data['AGE'] = pandas.to_numeric(data['AGE'], errors='coerce')
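A quick note on `errors='coerce'`: it converts anything that cannot be parsed as a number into `NaN` rather than raising an exception, which is why it is used above on the raw survey columns. A toy illustration (made-up values, not the NESARC data):

```python
import pandas

# errors='coerce' turns unparseable entries into NaN instead of raising.
s = pandas.Series(['1', '2', 'BL', '9'])
converted = pandas.to_numeric(s, errors='coerce')
print(converted)
```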
The following block prepares the data we will use and prints a contingency table with the number of people within each category:
In [2]:
# subset data to young adults age 18 to 25 who drank beer in the past 12 months
sub1=data[(data['AGE']>=18) & (data['AGE']<=25) & (data['S2AQ5A']==1)]
sub2 = sub1.copy()
# recode missing values to python missing (NaN)
sub2['S2AQ5B']=sub2['S2AQ5B'].replace(99, numpy.nan)
# contingency table of observed counts
ct1=pandas.crosstab(sub2['MAJORDEPLIFE'], sub2['S2AQ5B'])
print(ct1)
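Two details of this cell are easy to miss: after `replace(99, numpy.nan)` the "unknown" codes become missing values, and `pandas.crosstab` then silently drops any row containing `NaN`. A small sketch with made-up data (not the NESARC frame) shows both effects:

```python
import pandas
import numpy

# Toy frame: the code 99 marks an "unknown" drinking frequency.
df = pandas.DataFrame({'dep':  [0, 0, 1, 1, 0],
                       'freq': [1, 99, 2, 1, 2]})
# recode 99 to NaN, as done for S2AQ5B above
df['freq'] = df['freq'].replace(99, numpy.nan)
# crosstab drops the NaN row, so only 4 of the 5 people are counted
ct = pandas.crosstab(df['dep'], df['freq'])
print(ct)
```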
The next block prints the same contingency table, but with column percentages instead of absolute counts:
In [5]:
# column percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)
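Dividing the table by its column sums gives, within each drinking-frequency column, the share of people with and without a depression diagnosis; each column then sums to 1. A toy example with made-up counts:

```python
import pandas

# Toy contingency table (invented counts): rows = diagnosis (0/1),
# columns = two frequency categories.
ct = pandas.DataFrame({1: [30, 10], 2: [16, 4]}, index=[0, 1])
colsum = ct.sum(axis=0)
colpct = ct / colsum  # each column now sums to 1
print(colpct)
```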
In [10]:
# chi-square test
cs1= scipy.stats.chi2_contingency(ct1)
print('Chi-square value:', cs1[0])
print('p value:', cs1[1])
print('Expected counts:', cs1[3])
The test is significant: the p value is much smaller than 0.05, so we can reject the null hypothesis and accept the alternate one.
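For readability, the four return values of `scipy.stats.chi2_contingency` can also be unpacked into named variables instead of indexed as `cs1[0]`, `cs1[1]`, `cs1[3]`. A self-contained sketch on an invented 2x2 table (not the NESARC counts):

```python
import numpy
import scipy.stats

# Toy 2x2 table of observed counts (made-up numbers)
observed = numpy.array([[30, 10],
                        [20, 40]])
# returns: statistic, p value, degrees of freedom, expected counts
chi2, p, dof, expected = scipy.stats.chi2_contingency(observed)
print('Chi-square value:', chi2)
print('p value:', p)
print('Expected counts:', expected)
```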
In [14]:
# Bonferroni adjustment for 45 pairwise comparisons (10 choose 2)
bp_adjusted = 0.05 / 45
print('Bonferroni-adjusted significance level:', bp_adjusted)
The overall chi-square p value is still smaller than the Bonferroni-adjusted significance level, so we were right to reject the null hypothesis.
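Where does 45 come from? With 10 drinking-frequency categories there are C(10, 2) = 45 unordered pairs to compare, and the Bonferroni correction divides the significance level by the number of comparisons. The arithmetic can be checked directly:

```python
from math import comb

# 10 categories -> C(10, 2) = 45 pairwise comparisons
n_pairs = comb(10, 2)
adjusted_alpha = 0.05 / n_pairs
print(n_pairs, adjusted_alpha)
```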
In [21]:
for idx1 in range(1, 11):
    for idx2 in range(idx1 + 1, 11):
        # keep only the two categories being compared; everything else maps to NaN
        map_filter = {idx1: idx1, idx2: idx2}
        sub2['COMP1v2'] = sub2['S2AQ5B'].map(map_filter)
        # contingency table of observed counts
        ct2 = pandas.crosstab(sub2['MAJORDEPLIFE'], sub2['COMP1v2'])
        # chi-square test, compared against the Bonferroni-adjusted level
        cs2 = scipy.stats.chi2_contingency(ct2)
        print('Category1:', idx1, 'Category2:', idx2,
              'P-value:', cs2[1], 'Rejected:', cs2[1] < bp_adjusted)
Here we tested all pairwise combinations of categories and identified the pairs for which the null hypothesis can be rejected, as well as those for which it cannot.
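The nested `range` loops above can also be written with `itertools.combinations`, which enumerates the unordered pairs directly. A self-contained sketch on randomly generated toy data (three made-up categories, not the NESARC variables):

```python
import itertools

import numpy
import pandas
import scipy.stats

# Toy data: binary group vs. a 3-category variable
rng = numpy.random.default_rng(0)
df = pandas.DataFrame({'group': rng.integers(0, 2, 300),
                       'cat': rng.integers(1, 4, 300)})
alpha = 0.05 / 3  # Bonferroni: C(3, 2) = 3 pairwise comparisons
results = []
for a, b in itertools.combinations(sorted(df['cat'].unique()), 2):
    # restrict to the two categories being compared
    pair = df[df['cat'].isin([a, b])]
    ct = pandas.crosstab(pair['group'], pair['cat'])
    chi2, p, dof, expected = scipy.stats.chi2_contingency(ct)
    results.append((a, b, p, p < alpha))
for a, b, p, rejected in results:
    print('Categories:', a, b, 'P-value:', p, 'Rejected:', rejected)
```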