Data analysis tools

For the Week 1 assignment I'm checking the association between race and alcohol consumption using the NESARC data.

Null hypothesis (H0): there is no difference in mean alcohol usage between races.

Alternative hypothesis (Ha): there is a difference in mean alcohol usage between at least two races.


In [10]:
import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi 

data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

# S2AQ8A - HOW OFTEN DRANK ANY ALCOHOL IN LAST 12 MONTHS (99 - Unknown)
# S2AQ8B - NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS (99 - Unknown)
# S2AQ3 - DRANK AT LEAST 1 ALCOHOLIC DRINK IN LAST 12 MONTHS

# setting the variables we will be working with to numeric
# (unparseable values become NaN)
data['S2AQ8A'] = pandas.to_numeric(data['S2AQ8A'], errors='coerce')
data['S2AQ8B'] = pandas.to_numeric(data['S2AQ8B'], errors='coerce')
data['S2AQ3'] = pandas.to_numeric(data['S2AQ3'], errors='coerce')

# subset data to adults aged 19 to 34 who have drunk alcohol in the past 12 months;
# .copy() keeps pandas from warning about assignments on a slice
subset = data[(data['AGE'] >= 19) & (data['AGE'] <= 34) & (data['S2AQ3'] == 1)].copy()

# 99 codes 'Unknown', so recode it as missing
subset['S2AQ8A'] = subset['S2AQ8A'].replace(99, numpy.nan)
subset['S3BD4Q2DR'] = subset['S3BD4Q2DR'].replace(99, numpy.nan)

# approximate number of drinking days per year for each S2AQ8A frequency category
alcohol_usage_map = {
    1: 365,
    2: 330,
    3: 182,
    4: 104,
    5: 52,
    6: 30,
    7: 12,
    8: 9,
    9: 5,
    10: 2,
}

# map the frequency categories to estimated drinking days per year
subset['ALCO_FREQMO'] = subset['S2AQ8A'].map(alcohol_usage_map)

# converting new variable ALCO_FREQMO to numeric
subset['ALCO_FREQMO'] = pandas.to_numeric(subset['ALCO_FREQMO'], errors='coerce')

# estimated drinks per year = drinking days per year * usual drinks per drinking day
subset['ALCO_NUM_EST'] = subset['ALCO_FREQMO'] * subset['S2AQ8B']


# frequency distribution of the estimated yearly number of drinks
ct1 = subset.groupby('ALCO_NUM_EST').size()

# keep only the response and the race variable, dropping rows with missing values
subset_race = subset[['ALCO_NUM_EST', 'ETHRACE2A']].dropna()
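
A quick sanity check of the recoded frequency variable can be run at this point; this is a minimal sketch, not part of the original notebook, using the subset frame built above:

# check the recoding: counts of estimated drinking days per year,
# which should mirror the S2AQ8A frequency categories
print (subset['ALCO_FREQMO'].value_counts(dropna=False).sort_index())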



Then an OLS regression is run to get the F-statistic and the associated p-value:


In [6]:
# using ols function for calculating the F-statistic and associated p value
model1 = smf.ols(formula='ALCO_NUM_EST ~ C(ETHRACE2A)', data=subset_race)
results1 = model1.fit()

print (results1.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:           ALCO_NUM_EST   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     2.489
Date:                Sat, 27 Feb 2016   Prob (F-statistic):             0.0413
Time:                        22:01:31   Log-Likelihood:                -75816.
No. Observations:                8797   AIC:                         1.516e+05
Df Residuals:                    8792   BIC:                         1.517e+05
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept           354.6143     19.439     18.243      0.000       316.510   392.719
C(ETHRACE2A)[T.2]   -20.5087     39.805     -0.515      0.606       -98.536    57.519
C(ETHRACE2A)[T.3]    86.8109    120.399      0.721      0.471      -149.199   322.821
C(ETHRACE2A)[T.4]  -149.7788     78.497     -1.908      0.056      -303.651     4.093
C(ETHRACE2A)[T.5]   -88.4542     34.929     -2.532      0.011      -156.923   -19.985
==============================================================================
Omnibus:                    18608.088   Durbin-Watson:                   1.958
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         68158886.865
Skew:                          18.547   Prob(JB):                         0.00
Kurtosis:                     432.622   Cond. No.                         8.86
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

And since Prob (F-statistic) = 0.0413 is less than 0.05, I can reject the null hypothesis.
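
If you only want the two numbers that decision rests on, they can also be read directly off the fitted results object; a minimal sketch using the standard statsmodels attributes on results1 from the cell above:

# F-statistic and its p-value straight from the fitted OLS results
print ('F-statistic:', results1.fvalue)
print ('p-value:', results1.f_pvalue)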

The following block gives the means and standard deviations of ALCO_NUM_EST by race:


In [7]:
print ('means for ALCO_NUM_EST by race')
m2= subset_race.groupby('ETHRACE2A').mean()
print (m2)

print ('standard dev for ALCO_NUM_EST by race')
sd2 = subset_race.groupby('ETHRACE2A').std()
print (sd2)


means for ALCO_NUM_EST by race
           ALCO_NUM_EST
ETHRACE2A              
1            354.614331
2            334.105653
3            441.425197
4            204.835484
5            266.160169
standard dev for ALCO_NUM_EST by race
           ALCO_NUM_EST
ETHRACE2A              
1           1419.794119
2           1575.251108
3           1054.538239
4            520.285056
5           1037.883732
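
The group sizes also matter when reading the post hoc comparisons below, since small groups produce wide confidence intervals; a minimal sketch, not part of the original notebook, on the same subset_race frame:

# number of observations per race category
print ('counts for ALCO_NUM_EST by race')
print (subset_race.groupby('ETHRACE2A').size())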

Tukey's HSD post hoc test is run to check which pairs of races differ:


In [11]:
mc1 = multi.MultiComparison(subset_race['ALCO_NUM_EST'], subset_race['ETHRACE2A'])
res1 = mc1.tukeyhsd()
print(res1.summary())


Multiple Comparison of Means - Tukey HSD,FWER=0.05
=================================================
group1 group2  meandiff   lower    upper   reject
-------------------------------------------------
  1      2     -20.5087 -129.1102 88.0928  False 
  1      3     86.8109  -241.6761 415.2978 False 
  1      4    -149.7788 -363.9426  64.385  False 
  1      5     -88.4542  -183.752  6.8437  False 
  2      3     107.3195 -230.4266 445.0657 False 
  2      4    -129.2702 -357.3818 98.8414  False 
  2      5     -67.9455 -191.4382 55.5472  False 
  3      4    -236.5897 -621.4849 148.3054 False 
  3      5     -175.265 -508.9712 158.4412 False 
  4      5     61.3247  -160.7616 283.4109 False 
-------------------------------------------------

We see that the Tukey HSD test does not let us reject the null hypothesis for any pair of races, which is probably a result of the overall p-value being close to the 0.05 limit (0.0413).
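
One way to see how close each pairwise comparison comes to the threshold, which the Tukey table above does not print, is to run pairwise t-tests with a Bonferroni correction on the same MultiComparison object; a minimal sketch, assuming scipy is available (not part of the original assignment):

import scipy.stats

# Bonferroni-corrected pairwise t-tests on the same race groups
res2 = mc1.allpairtest(scipy.stats.ttest_ind, method='bonf')
print (res2[0])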

