For Week 1 assignment I'm checking correlation between race and amphetamines using the NESARC data
N0 hypothesis - there is no difference in alcohol usage between races.
Na hypothesis - there is difference in alcohol usage between races
In [10]:
import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
data = pandas.read_csv('nesarc_pds.csv', low_memory=False)
# S2AQ8A - HOW OFTEN DRANK ANY ALCOHOL IN LAST 12 MONTHS (99 - Unknown)
# S2AQ8B - NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS (99 - Unknown)
# S2AQ3 - DRANK AT LEAST 1 ALCOHOLIC DRINK IN LAST 12 MONTHS
#setting variables you will be working with to numeric
data['S2AQ8A'] = data['S2AQ8A'].convert_objects(convert_numeric=True)
data['S2AQ8B'] = data['S2AQ8B'].convert_objects(convert_numeric=True)
data['S2AQ3'] = data['S2AQ3'].convert_objects(convert_numeric=True)
#subset data to young adults age 18 to 27 who have drank alcohol in the past 12 months
subset=data[(data['AGE']>=19) & (data['AGE']<=34) & (data['S2AQ3']==1)]
subset['S2AQ8A']=subset['S2AQ8A'].replace(99, numpy.nan)
subset['S3BD4Q2DR']=subset['S3BD4Q2DR'].replace(99, numpy.nan)
alcohol_usage_map = {
1: 365,
2: 330,
3: 182,
4: 104,
5: 52,
6: 30,
7: 12,
8: 9,
9: 5,
10: 2,
}
subset['ALCO_FREQMO'] = subset['S2AQ8A'].map(alcohol_usage_map)
#converting new variable ALCO_FREQMO to numeric
subset['ALCO_FREQMO'] = subset['ALCO_FREQMO'].convert_objects(convert_numeric=True)
subset['ALCO_NUM_EST'] = subset['ALCO_FREQMO'] * subset['S2AQ8B']
ct1 = subset.groupby('ALCO_NUM_EST').size()
subset_race = subset[['ALCO_NUM_EST', 'ETHRACE2A']].dropna()
Then OLS regression test is run
In [6]:
# using ols function for calculating the F-statistic and associated p value
model1 = smf.ols(formula='ALCO_NUM_EST ~ C(ETHRACE2A)', data=subset_race)
results1 = model1.fit()
print (results1.summary())
And as Prob (F-statistics) is less than 0.05, I can discard null hypothesis.
Following block gives means and std deviations:
In [7]:
print ('means for ALCO_NUM_EST by race')
m2= subset_race.groupby('ETHRACE2A').mean()
print (m2)
print ('standard dev for ALCO_NUM_EST by race')
sd2 = subset_race.groupby('ETHRACE2A').std()
print (sd2)
In [11]:
mc1 = multi.MultiComparison(subset_race['ALCO_NUM_EST'], subset_race['ETHRACE2A'])
res1 = mc1.tukeyhsd()
print(res1.summary())
We see that Tukey HSD didn’t give us verification that we can reject null hypothesis for any combination of two races, which is probably result of p vlaue being near the limit (4.13%)
In [ ]: