BDA_Fall17: Data for Final Project

Sean M. Shiverick, IU-Bloomington

2015 National Survey on Drug Abuse and Health (NSDUH)

Data Cleaning and Preparation

Step 1. Download data from URL, unzip files, write data to csv

  • get_data() function retrieves datafiles from URL, unzips files, extracts data
  • Reads NSDUH-2015-DS0001-data-excel.tsv file, converts to dataFrame object
  • Print data frame shape, and exports dataframe to CSV file as nsduh15-dataset.csv

In [1]:
import requests, zipfile, io
import pandas as pd

URL = 'http://samhda.s3-us-gov-west-1.amazonaws.com/s3fs-public/field-uploads-protected/studies/NSDUH-2015/NSDUH-2015-datasets/NSDUH-2015-DS0001/NSDUH-2015-DS0001-bundles-with-study-info/NSDUH-2015-DS0001-bndl-data-tsv.zip'

def get_data():
    r = requests.get('http://samhda.s3-us-gov-west-1.amazonaws.com/s3fs-public/field-uploads-protected/studies/NSDUH-2015/NSDUH-2015-datasets/NSDUH-2015-DS0001/NSDUH-2015-DS0001-bundles-with-study-info/NSDUH-2015-DS0001-bndl-data-tsv.zip')
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall()

    file = pd.read_table('~/NSDUH-2015-DS0001-bndl-data-tsv/NSDUH-2015-DS0001-data/NSDUH-2015-DS0001-data-excel.tsv', low_memory=False)
    data = pd.DataFrame(file)
    print(data.shape)
    
    data.to_csv('nsduh15-dataset.csv', sep=',', encoding='utf-8')
    
get_data()


(57146, 2666)

Step 2. Use Pandas to Subset dataset as data frame

  • Import python modules
  • load data file and save as DataFrame object
  • Subset dataframe by column

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
file = pd.read_csv('nsduh15-dataset.csv', low_memory=False)
data = pd.DataFrame(file)

In [4]:
data.shape


Out[4]:
(57146, 2667)

In [5]:
df = pd.DataFrame(data, columns=['QUESTID2', 'CATAG6', 'IRSEX','IRMARITSTAT',
        'EDUHIGHCAT', 'IRWRKSTAT18', 'COUTYP2', 'HEALTH2','STDANYYR1',
        'HEPBCEVER1','HIVAIDSEV1','CANCEREVR1','INHOSPYR','AMDELT',
        'AMDEYR','ADDPR2WK1','ADWRDST1','DSTWORST1','IMPGOUTM1',
        'IMPSOCM1','IMPRESPM1','SUICTHNK1','SUICPLAN1','SUICTRY1',
        'PNRNMLIF','PNRNM30D','PNRWYGAMT','PNRNMFLAG','PNRNMYR',
        'PNRNMMON','OXYCNNMYR','DEPNDPYPNR','ABUSEPYPNR','PNRRSHIGH',
        'HYDCPDAPYU','OXYCPDAPYU','OXCNANYYR2','TRAMPDAPYU','MORPPDAPYU',
        'FENTPDAPYU','BUPRPDAPYU','OXYMPDAPYU','DEMEPDAPYU','HYDMPDAPYU',
        'HERFLAG','HERYR','HERMON','ABODHER', 'MTDNPDAPYU',
        'IRHERFY','TRBENZAPYU','ALPRPDAPYU','LORAPDAPYU','CLONPDAPYU',
        'DIAZPDAPYU','SVBENZAPYU','TRIAPDAPYU','TEMAPDAPYU','BARBITAPYU',
        'SEDOTANYR2','COCFLAG','COCYR','COCMON','CRKFLAG',
        'CRKYR','AMMEPDAPYU','METHAMFLAG','METHAMYR','METHAMMON',
        'HALLUCFLAG','LSDFLAG','ECSTMOFLAG','DAMTFXFLAG','KETMINFLAG',
        'TXYRRESOV1','TXYROUTPT1','TXYRMHCOP1','TXYREMRGN1','TXCURRENT1',
        'TXLTYPNRL1','TXYRNOSPIL','AUOPTYR1','MHLMNT3','MHLTHER3',
        'MHLDOC3','MHLCLNC3','MHLDTMT3','AUINPYR1','AUALTYR1'])
df.shape


Out[5]:
(57146, 89)

In [6]:
df.head()


Out[6]:
QUESTID2 CATAG6 IRSEX IRMARITSTAT EDUHIGHCAT IRWRKSTAT18 COUTYP2 HEALTH2 STDANYYR1 HEPBCEVER1 ... TXLTYPNRL1 TXYRNOSPIL AUOPTYR1 MHLMNT3 MHLTHER3 MHLDOC3 MHLCLNC3 MHLDTMT3 AUINPYR1 AUALTYR1
0 25095143 1 1 4 5 99 3 2.0 NaN NaN ... NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN
1 13005143 4 1 3 2 1 2 1.0 NaN NaN ... NaN 1 NaN 0.0 0.0 0.0 0.0 0.0 NaN NaN
2 67415143 3 2 4 4 4 3 2.0 NaN NaN ... NaN 0 NaN 1.0 0.0 0.0 0.0 0.0 NaN NaN
3 70925143 1 2 99 5 99 2 3.0 NaN NaN ... NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN
4 75235143 2 2 1 3 4 3 3.0 NaN NaN ... NaN 0 NaN 0.0 0.0 0.0 1.0 0.0 NaN NaN

5 rows × 89 columns


In [7]:
df.tail()


Out[7]:
QUESTID2 CATAG6 IRSEX IRMARITSTAT EDUHIGHCAT IRWRKSTAT18 COUTYP2 HEALTH2 STDANYYR1 HEPBCEVER1 ... TXLTYPNRL1 TXYRNOSPIL AUOPTYR1 MHLMNT3 MHLTHER3 MHLDOC3 MHLCLNC3 MHLDTMT3 AUINPYR1 AUALTYR1
57141 57863730 5 1 4 4 1 2 2.0 NaN NaN ... NaN 0 NaN 0.0 0.0 0.0 0.0 0.0 NaN NaN
57142 97294730 2 1 4 2 1 1 3.0 NaN NaN ... NaN 0 NaN 0.0 0.0 0.0 0.0 0.0 NaN NaN
57143 31894730 2 2 4 3 2 2 2.0 NaN NaN ... NaN 0 NaN 0.0 0.0 0.0 0.0 0.0 NaN NaN
57144 98524730 1 1 4 5 99 2 3.0 NaN NaN ... NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN
57145 80134730 1 2 4 5 99 2 3.0 NaN NaN ... NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 89 columns

Step 3. Remove Missing Values and Recode Values

  • Recode null and NaN missing values
  • Replace values for Bad Data, Don't know, Refused, Blank, Skip with NaN
  • Replace NaN with 0

In [8]:
df.replace([83, 85, 91, 93, 94, 97, 98, 99, 991, 993], np.nan, inplace=True)
df.fillna(0, inplace=True)
df.head()


Out[8]:
QUESTID2 CATAG6 IRSEX IRMARITSTAT EDUHIGHCAT IRWRKSTAT18 COUTYP2 HEALTH2 STDANYYR1 HEPBCEVER1 ... TXLTYPNRL1 TXYRNOSPIL AUOPTYR1 MHLMNT3 MHLTHER3 MHLDOC3 MHLCLNC3 MHLDTMT3 AUINPYR1 AUALTYR1
0 25095143 1 1 4.0 5 0.0 3 2.0 0.0 0.0 ... 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 13005143 4 1 3.0 2 1.0 2 1.0 0.0 0.0 ... 0.0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 67415143 3 2 4.0 4 4.0 3 2.0 0.0 0.0 ... 0.0 0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
3 70925143 1 2 0.0 5 0.0 2 3.0 0.0 0.0 ... 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 75235143 2 2 1.0 3 4.0 3 3.0 0.0 0.0 ... 0.0 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

5 rows × 89 columns

3.2 Recode values for selected features:

Order matters here, because some variables were recoded into new variables

  • Recode 2=0: ['STDANYYR1','HEPBCEVER1', 'HIVAIDSEV1', 'CANCEREVR1', 'INHOSPYR ', 'AMDELT','AMDEYR','ADDPR2WK1','DSTWORST1', 'IMPGOUTM1', 'IMPSOCM1','IMPRESPM1','SUICTHNK1','SUICPLAN1','SUICTRY1', 'PNRNMLIF','PNRNM30D','PNRWYGAMT','PNRRSHIGH' 'TXYRRESOV1','TXYROUTPT1','TXYRMHCOP1','TXYREMRGN1', 'TXCURRENT1', 'TXLTYPNRL1','AUOPTYR1','AUINPYR1','AUALTYR1']
  • Recode ['PNRRSHIGH', 'TXLTYPNRL1','TXYREMRGN1', 'AUOPTYR1','AUALTYR1']: 3=1
  • Recode ['TXYRRESOV1', 'TXYROUTPT1','TXYRMHCOP1']: 5=1
  • Recode TXLTYPNRL: 6=0
  • Recode IRSEX to male=0, female=1
  • Recode IRMARITSTAT: 1=4, 2=3, 3=2, 4=1
  • Recode EDUHIGHCAT: 5=0
  • Recode IRWRKSTAT18: 1=2, 2=1, 3=0, 4=0
  • Recode COUTYP2: 1=3, 3=1
  • Recode ADWRDST1: 1=0, 2=1, 3=2, 4=3

In [9]:
df['STDANYYR1'].replace(2,0,inplace=True)
df['HEPBCEVER1'].replace(2,0,inplace=True)
df['HIVAIDSEV1'].replace(2,0,inplace=True)
df['CANCEREVR1'].replace(2,0,inplace=True)
df['INHOSPYR'].replace(2,0,inplace=True)
df['AMDELT'].replace(2,0,inplace=True)
df['AMDEYR'].replace(2,0,inplace=True)
df['ADDPR2WK1'].replace(2,0,inplace=True)
df['DSTWORST1'].replace(2,0,inplace=True)
df['IMPGOUTM1'].replace(2,0,inplace=True)

df['IMPSOCM1'].replace(2,0,inplace=True)
df['IMPRESPM1'].replace(2,0,inplace=True)
df['SUICTHNK1'].replace(2,0,inplace=True)
df['SUICPLAN1'].replace(2,0,inplace=True)
df['SUICTRY1'].replace(2,0,inplace=True)
df['PNRNMLIF'].replace(2,0,inplace=True)
df['PNRNM30D'].replace(2,0,inplace=True)
df['PNRWYGAMT'].replace(2,0,inplace=True)
df['PNRRSHIGH'].replace(2,0,inplace=True)

df['TXYRRESOV1'].replace(2,0,inplace=True)
df['TXYROUTPT1'].replace(2,0,inplace=True)
df['TXYRMHCOP1'].replace(2,0,inplace=True)
df['TXYREMRGN1'].replace(2,0,inplace=True)
df['TXCURRENT1'].replace(2,0,inplace=True)
df['TXLTYPNRL1'].replace(2,0,inplace=True)
df['AUOPTYR1'].replace(2,0,inplace=True)
df['AUINPYR1'].replace(2,0,inplace=True)
df['AUALTYR1'].replace(2,0,inplace=True)


df.head()


Out[9]:
QUESTID2 CATAG6 IRSEX IRMARITSTAT EDUHIGHCAT IRWRKSTAT18 COUTYP2 HEALTH2 STDANYYR1 HEPBCEVER1 ... TXLTYPNRL1 TXYRNOSPIL AUOPTYR1 MHLMNT3 MHLTHER3 MHLDOC3 MHLCLNC3 MHLDTMT3 AUINPYR1 AUALTYR1
0 25095143 1 1 4.0 5 0.0 3 2.0 0.0 0.0 ... 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 13005143 4 1 3.0 2 1.0 2 1.0 0.0 0.0 ... 0.0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 67415143 3 2 4.0 4 4.0 3 2.0 0.0 0.0 ... 0.0 0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
3 70925143 1 2 0.0 5 0.0 2 3.0 0.0 0.0 ... 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 75235143 2 2 1.0 3 4.0 3 3.0 0.0 0.0 ... 0.0 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

5 rows × 89 columns


In [10]:
df['PNRRSHIGH'].replace(3,1,inplace=True)
df['TXLTYPNRL1'].replace(3,1,inplace=True)
df['TXYREMRGN1'].replace(3,1,inplace=True)
df['AUOPTYR1'].replace(3,1,inplace=True)
df['AUALTYR1'].replace(3,1,inplace=True)

df.head()


Out[10]:
QUESTID2 CATAG6 IRSEX IRMARITSTAT EDUHIGHCAT IRWRKSTAT18 COUTYP2 HEALTH2 STDANYYR1 HEPBCEVER1 ... TXLTYPNRL1 TXYRNOSPIL AUOPTYR1 MHLMNT3 MHLTHER3 MHLDOC3 MHLCLNC3 MHLDTMT3 AUINPYR1 AUALTYR1
0 25095143 1 1 4.0 5 0.0 3 2.0 0.0 0.0 ... 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 13005143 4 1 3.0 2 1.0 2 1.0 0.0 0.0 ... 0.0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 67415143 3 2 4.0 4 4.0 3 2.0 0.0 0.0 ... 0.0 0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
3 70925143 1 2 0.0 5 0.0 2 3.0 0.0 0.0 ... 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 75235143 2 2 1.0 3 4.0 3 3.0 0.0 0.0 ... 0.0 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

5 rows × 89 columns


In [11]:
df['SEX'] = df['IRSEX'].replace([1,2], [0,1])
df['MARRIED'] = df['IRMARITSTAT'].replace([1,2,3,4], [4,3,2,1])
df['EDUCAT'] = df['EDUHIGHCAT'].replace([1,2,3,4,5], [2,3,4,5,1])
df['EMPLOY18'] = df['IRWRKSTAT18'].replace([1,2,3,4], [2,1,0,0])
df['CTYMETRO'] = df['COUTYP2'].replace([1,2,3],[3,2,1])

df['EMODSWKS'] = df['ADWRDST1'].replace([1,2,3,4], [0,1,2,3])
df['TXLTPNRL'] = df['TXLTYPNRL1'].replace(6,0)

df['TXYRRESOV'] = df['TXYRRESOV1'].replace(5,1)
df['TXYROUTPT'] = df['TXYROUTPT1'].replace(5,1)
df['TXYRMHCOP'] = df['TXYRMHCOP1'].replace(5,1)

df.head()


Out[11]:
QUESTID2 CATAG6 IRSEX IRMARITSTAT EDUHIGHCAT IRWRKSTAT18 COUTYP2 HEALTH2 STDANYYR1 HEPBCEVER1 ... SEX MARRIED EDUCAT EMPLOY18 CTYMETRO EMODSWKS TXLTPNRL TXYRRESOV TXYROUTPT TXYRMHCOP
0 25095143 1 1 4.0 5 0.0 3 2.0 0.0 0.0 ... 0 1.0 1 0.0 1 0.0 0.0 0.0 0.0 0.0
1 13005143 4 1 3.0 2 1.0 2 1.0 0.0 0.0 ... 0 2.0 3 2.0 2 0.0 0.0 0.0 0.0 0.0
2 67415143 3 2 4.0 4 4.0 3 2.0 0.0 0.0 ... 1 1.0 5 0.0 1 0.0 0.0 0.0 0.0 0.0
3 70925143 1 2 0.0 5 0.0 2 3.0 0.0 0.0 ... 1 0.0 1 0.0 2 0.0 0.0 0.0 0.0 0.0
4 75235143 2 2 1.0 3 4.0 3 3.0 0.0 0.0 ... 1 4.0 4 0.0 1 0.0 0.0 0.0 0.0 0.0

5 rows × 99 columns


In [12]:
df.shape


Out[12]:
(57146, 99)

In [13]:
df.columns


Out[13]:
Index(['QUESTID2', 'CATAG6', 'IRSEX', 'IRMARITSTAT', 'EDUHIGHCAT',
       'IRWRKSTAT18', 'COUTYP2', 'HEALTH2', 'STDANYYR1', 'HEPBCEVER1',
       'HIVAIDSEV1', 'CANCEREVR1', 'INHOSPYR', 'AMDELT', 'AMDEYR', 'ADDPR2WK1',
       'ADWRDST1', 'DSTWORST1', 'IMPGOUTM1', 'IMPSOCM1', 'IMPRESPM1',
       'SUICTHNK1', 'SUICPLAN1', 'SUICTRY1', 'PNRNMLIF', 'PNRNM30D',
       'PNRWYGAMT', 'PNRNMFLAG', 'PNRNMYR', 'PNRNMMON', 'OXYCNNMYR',
       'DEPNDPYPNR', 'ABUSEPYPNR', 'PNRRSHIGH', 'HYDCPDAPYU', 'OXYCPDAPYU',
       'OXCNANYYR2', 'TRAMPDAPYU', 'MORPPDAPYU', 'FENTPDAPYU', 'BUPRPDAPYU',
       'OXYMPDAPYU', 'DEMEPDAPYU', 'HYDMPDAPYU', 'HERFLAG', 'HERYR', 'HERMON',
       'ABODHER', 'MTDNPDAPYU', 'IRHERFY', 'TRBENZAPYU', 'ALPRPDAPYU',
       'LORAPDAPYU', 'CLONPDAPYU', 'DIAZPDAPYU', 'SVBENZAPYU', 'TRIAPDAPYU',
       'TEMAPDAPYU', 'BARBITAPYU', 'SEDOTANYR2', 'COCFLAG', 'COCYR', 'COCMON',
       'CRKFLAG', 'CRKYR', 'AMMEPDAPYU', 'METHAMFLAG', 'METHAMYR', 'METHAMMON',
       'HALLUCFLAG', 'LSDFLAG', 'ECSTMOFLAG', 'DAMTFXFLAG', 'KETMINFLAG',
       'TXYRRESOV1', 'TXYROUTPT1', 'TXYRMHCOP1', 'TXYREMRGN1', 'TXCURRENT1',
       'TXLTYPNRL1', 'TXYRNOSPIL', 'AUOPTYR1', 'MHLMNT3', 'MHLTHER3',
       'MHLDOC3', 'MHLCLNC3', 'MHLDTMT3', 'AUINPYR1', 'AUALTYR1', 'SEX',
       'MARRIED', 'EDUCAT', 'EMPLOY18', 'CTYMETRO', 'EMODSWKS', 'TXLTPNRL',
       'TXYRRESOV', 'TXYROUTPT', 'TXYRMHCOP'],
      dtype='object')

Step 4. Rename Select Features for Description


In [14]:
df =  df.rename(columns={'QUESTID2':'QID','CATAG6':'AGECAT',
     'STDANYYR1':'STDPYR','HEPBCEVER1':'HEPEVR','CANCEREVR1':'CANCEVR','INHOSPYR':'HOSPYR', 
     'AMDELT':'DEPMELT','AMDEYR':'DEPMEYR','ADDPR2WK1':'DEPMWKS','DSTWORST1':'DEPWMOS',
     'IMPGOUTM1':'EMOPGOUT','IMPSOCM1':'EMOPSOC','IMPRESPM1':'EMOPWRK',
     'SUICTHNK1':'SUICTHT','SUICPLAN1':'SUICPLN','SUICTRY1':'SUICATT',
     'PNRNMLIF':'PRLUNDR','PNRNM30D':'PRLUNDR30','PNRWYGAMT':'PRLGRTYR',
     'PNRNMFLAG':'PRLMISEVR','PNRNMYR':'PRLMISYR','PNRNMMON':'PRLMISMO',
     'OXYCNNMYR':'PRLOXYMSYR','DEPNDPYPNR':'PRLDEPYR','ABUSEPYPNR':'PRLABSRY',     
     'PNRRSHIGH':'PRLHIGH','HYDCPDAPYU':'HYDRCDYR','OXYCPDAPYU':'OXYCDPRYR', 
     'OXCNANYYR2':'OXYCTNYR','TRAMPDAPYU':'TRMADLYR','MORPPDAPYU':'MORPHPRYR',
     'FENTPDAPYU':'FENTNYLYR','BUPRPDAPYU':'BUPRNRPHN','OXYMPDAPYU':'OXYMORPHN',
     'DEMEPDAPYU':'DEMEROL','HYDMPDAPYU':'HYDRMRPHN','HERFLAG':'HEROINEVR',
     'HERYR':'HEROINYR', 'HERMON':'HEROINMO','ABODHER':'HEROINAB',
     'MTDNPDAPYU':'METHADONE','IRHERFY':'HEROINFQY',
     'TRBENZAPYU':'TRQBENZODZ','ALPRPDAPYU':'TRQALPRZM','LORAPDAPYU':'TRQLRZPM',
     'CLONPDAPYU':'TRQCLNZPM','DIAZPDAPYU':'TRQDIAZPM','SVBENZAPYU':'SDBENZDPN',
     'TRIAPDAPYU':'SDTRZLM','TEMAPDAPYU':'SDTMZPM','BARBITAPYU':'SDBARBTS', 
     'SEDOTANYR2':'SDOTHYR','COCFLAG':'COCNEVR','COCYR':'COCNYR','COCMON':'COCNMO',
     'CRKFLAG':'CRACKEVR','CRKYR':'CRACKYR','AMMEPDAPYU':'AMPHTMNYR', 
     'METHAMFLAG':'METHEVR','METHAMYR':'METHYR','METHAMMON':'METHMO',
     'HALLUCFLAG':'HLCNEVR','LSDFLAG':'LSDEVR','ECSTMOFLAG':'MDMAEVR',
     'DAMTFXFLAG':'DMTEVR','KETMINFLAG':'KETMNEVR', 
     'TXYRRESOV':'TRTRHBOVN','TXYROUTPT':'TRTRHBOUT','TXYRMHCOP':'TRTMHCTR',
     'TXYREMRGN1':'TRTERYR','TXCURRENT1':'TRTCURRCV','TXLTPNRL':'TRTCURPRL',
     'TXYRNOSPIL':'TRTGAPYR','AUOPTYR1':'MHTRTOYR','MHLMNT3':'MHTRTCLYR',
     'MHLTHER3':'MHTRTTHPY','MHLDOC3':'MHTRTDRYR', 'MHLCLNC3':'MHTRTMDOUT',
     'MHLDTMT3':'MHTRTHPPGM','AUINPYR1':'MHTRTHSPON','AUALTYR1':'MHTRTALT'})
     
df.shape


Out[14]:
(57146, 99)

Step 5. Revised Data Frame with updated features


In [15]:
df1 = df[['QID','AGECAT','SEX', 'MARRIED', 'EDUCAT', 
     'EMPLOY18','CTYMETRO','HEALTH2','STDPYR','HEPEVR','CANCEVR','HOSPYR', 
     'DEPMELT','DEPMEYR','DEPMWKS','DEPWMOS','EMODSWKS','EMOPGOUT',
     'EMOPSOC','EMOPWRK','SUICTHT','SUICPLN','SUICATT',
     'PRLUNDR','PRLUNDR30','PRLGRTYR','PRLMISEVR','PRLMISYR',
     'PRLMISMO','PRLOXYMSYR','PRLDEPYR','PRLABSRY','PRLHIGH',
     'HYDRCDYR','OXYCDPRYR','OXYCTNYR','TRMADLYR','MORPHPRYR',
     'FENTNYLYR','BUPRNRPHN','OXYMORPHN','DEMEROL','HYDRMRPHN',
     'HEROINEVR','HEROINYR','HEROINMO','HEROINAB','METHADONE','HEROINFQY',
     'TRQBENZODZ','TRQALPRZM','TRQLRZPM','TRQCLNZPM','TRQDIAZPM',
     'SDBENZDPN','SDTRZLM','SDTMZPM','SDBARBTS','SDOTHYR',
     'COCNEVR','COCNYR','COCNMO','CRACKEVR','CRACKYR',
     'AMPHTMNYR','METHEVR','METHYR','METHMO',
     'HLCNEVR','LSDEVR','MDMAEVR','DMTEVR','KETMNEVR', 
     'TRTRHBOVN','TRTRHBOUT','TRTMHCTR','TRTERYR','TRTCURRCV',
     'TRTCURPRL','TRTGAPYR','MHTRTOYR','MHTRTCLYR','MHTRTTHPY',
     'MHTRTDRYR','MHTRTMDOUT','MHTRTHPPGM','MHTRTHSPON','MHTRTALT']]
df1.shape


Out[15]:
(57146, 88)

In [16]:
df1.head()


Out[16]:
QID AGECAT SEX MARRIED EDUCAT EMPLOY18 CTYMETRO HEALTH2 STDPYR HEPEVR ... TRTCURPRL TRTGAPYR MHTRTOYR MHTRTCLYR MHTRTTHPY MHTRTDRYR MHTRTMDOUT MHTRTHPPGM MHTRTHSPON MHTRTALT
0 25095143 1 0 1.0 1 0.0 1 2.0 0.0 0.0 ... 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 13005143 4 0 2.0 3 2.0 2 1.0 0.0 0.0 ... 0.0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 67415143 3 1 1.0 5 0.0 1 2.0 0.0 0.0 ... 0.0 0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
3 70925143 1 1 0.0 1 0.0 2 3.0 0.0 0.0 ... 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 75235143 2 1 4.0 4 0.0 1 3.0 0.0 0.0 ... 0.0 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

5 rows × 88 columns

Step 6. Export data file to CSV


In [17]:
df1.to_csv('nsduh-2015.csv', sep=',', encoding='utf-8')

Step 7. Sum selected columns to create aggregate variables

Several ways to create new variables based on sum of related columns:

  1. Simple way to add columns in new variables: df['C'] = df['A'] + df['B']
  2. Use sum function to sum columns: df['C'] = df[['A', 'B']].sum(axis=1)
  3. Use lambda function across rows, using axis=1 for columns: df['C'] = df.apply(lambda row: row['A']+row['B'], axis=1)

In [18]:
df1['HEALTH'] = df1['HEALTH2']+df1['STDPYR']+df1['HEPEVR']+df1['CANCEVR']+df1['HOSPYR']


//anaconda/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

In [20]:
df1['MENTHLTH'] =  df1[['DEPMELT', 'DEPMEYR', 'DEPMWKS', 'DEPWMOS', 'EMODSWKS', 
                        'EMOPGOUT','EMOPSOC', 'EMOPWRK','SUICTHT', 'SUICPLN']].sum(axis=1)

df1['PRLMISAB'] =  df1[['PRLUNDR', 'PRLUNDR30', 'PRLGRTYR', 'PRLMISEVR', 'PRLMISYR', 
                          'PRLMISMO', 'PRLOXYMSYR','PRLDEPYR', 'PRLABSRY','PRLHIGH']].sum(axis=1)

df1['PRLANY'] = df1[['HYDRCDYR', 'OXYCDPRYR', 'OXYCTNYR', 'TRMADLYR', 'MORPHPRYR', 
                        'FENTNYLYR','BUPRNRPHN', 'OXYMORPHN','DEMEROL', 'HYDRMRPHN']].sum(axis=1)

df1['HEROINUSE'] =  df1[['HEROINEVR', 'HEROINYR', 'HEROINMO', 'HEROINAB', 'METHADONE']].sum(axis=1)

df1['TRQLZRS'] =  df1[['TRQBENZODZ', 'TRQALPRZM', 'TRQLRZPM', 'TRQCLNZPM', 'TRQDIAZPM']].sum(axis=1)

df1['SEDATVS'] =  df1[['SDBENZDPN','SDTRZLM', 'SDTMZPM','SDBARBTS', 'SDOTHYR', 'SDOTHYR']].sum(axis=1)

df1['COCAINE'] =  df1[['COCNEVR', 'COCNYR', 'COCNMO', 'CRACKEVR', 'CRACKYR']].sum(axis=1)

df1['AMPHETMN'] =  df1[['AMPHTMNYR','METHEVR', 'METHYR','METHMO']].sum(axis=1)

df1['HALUCNG'] =  df1[['HLCNEVR', 'LSDEVR','MDMAEVR', 'DMTEVR', 'KETMNEVR']].sum(axis=1)

df1['TRTMENT'] =  df1[['TRTRHBOVN', 'TRTRHBOUT', 'TRTMHCTR','TRTERYR',
                         'TRTCURRCV', 'TRTCURPRL', 'TRTGAPYR']].sum(axis=1)
        
df1['MHTRTMT'] =  df1[['MHTRTOYR','MHTRTCLYR', 'MHTRTTHPY', 'MHTRTDRYR', 
                          'MHTRTMDOUT', 'MHTRTHPPGM','MHTRTHSPON', 'MHTRTALT']].sum(axis=1)

In [21]:
df1.shape


Out[21]:
(57146, 100)

In [22]:
df1.keys()


Out[22]:
Index(['QID', 'AGECAT', 'SEX', 'MARRIED', 'EDUCAT', 'EMPLOY18', 'CTYMETRO',
       'HEALTH2', 'STDPYR', 'HEPEVR', 'CANCEVR', 'HOSPYR', 'DEPMELT',
       'DEPMEYR', 'DEPMWKS', 'DEPWMOS', 'EMODSWKS', 'EMOPGOUT', 'EMOPSOC',
       'EMOPWRK', 'SUICTHT', 'SUICPLN', 'SUICATT', 'PRLUNDR', 'PRLUNDR30',
       'PRLGRTYR', 'PRLMISEVR', 'PRLMISYR', 'PRLMISMO', 'PRLOXYMSYR',
       'PRLDEPYR', 'PRLABSRY', 'PRLHIGH', 'HYDRCDYR', 'OXYCDPRYR', 'OXYCTNYR',
       'TRMADLYR', 'MORPHPRYR', 'FENTNYLYR', 'BUPRNRPHN', 'OXYMORPHN',
       'DEMEROL', 'HYDRMRPHN', 'HEROINEVR', 'HEROINYR', 'HEROINMO', 'HEROINAB',
       'METHADONE', 'HEROINFQY', 'TRQBENZODZ', 'TRQALPRZM', 'TRQLRZPM',
       'TRQCLNZPM', 'TRQDIAZPM', 'SDBENZDPN', 'SDTRZLM', 'SDTMZPM', 'SDBARBTS',
       'SDOTHYR', 'COCNEVR', 'COCNYR', 'COCNMO', 'CRACKEVR', 'CRACKYR',
       'AMPHTMNYR', 'METHEVR', 'METHYR', 'METHMO', 'HLCNEVR', 'LSDEVR',
       'MDMAEVR', 'DMTEVR', 'KETMNEVR', 'TRTRHBOVN', 'TRTRHBOUT', 'TRTMHCTR',
       'TRTERYR', 'TRTCURRCV', 'TRTCURPRL', 'TRTGAPYR', 'MHTRTOYR',
       'MHTRTCLYR', 'MHTRTTHPY', 'MHTRTDRYR', 'MHTRTMDOUT', 'MHTRTHPPGM',
       'MHTRTHSPON', 'MHTRTALT', 'HEALTH', 'MENTHLTH', 'PRLMISAB', 'PRLANY',
       'HEROINUSE', 'TRQLZRS', 'SEDATVS', 'COCAINE', 'AMPHETMN', 'HALUCNG',
       'TRTMENT', 'MHTRTMT'],
      dtype='object')

7.1 Recode health variable: higher score == better health


In [23]:
df1['HEALTH'].replace([0,1,2,3,4,5], [5,4,3,2,1,0])


Out[23]:
0        3.0
1        4.0
2        3.0
3        2.0
4        1.0
5        2.0
6        3.0
7        4.0
8        3.0
9        1.0
10       1.0
11       4.0
12       3.0
13       3.0
14       3.0
15       3.0
16       4.0
17       2.0
18       1.0
19       4.0
20       1.0
21       2.0
22       2.0
23       4.0
24       2.0
25       3.0
26       3.0
27       3.0
28       1.0
29       4.0
        ... 
57116    1.0
57117    3.0
57118    3.0
57119    2.0
57120    4.0
57121    3.0
57122    2.0
57123    4.0
57124    3.0
57125    2.0
57126    3.0
57127    2.0
57128    2.0
57129    4.0
57130    3.0
57131    2.0
57132    1.0
57133    3.0
57134    2.0
57135    3.0
57136    4.0
57137    3.0
57138    0.0
57139    4.0
57140    2.0
57141    3.0
57142    2.0
57143    3.0
57144    2.0
57145    2.0
Name: HEALTH, Length: 57146, dtype: float64

Step 8. Save Data Subset as Data Frame


In [24]:
df2 = pd.DataFrame(df1, columns=['QID', 'AGECAT', 'SEX', 'MARRIED', 
    'EDUCAT', 'EMPLOY18', 'CTYMETRO','HEALTH', 'MENTHLTH','SUICATT',
    'PRLMISEVR','PRLMISAB','PRLANY','HEROINEVR','HEROINUSE','HEROINFQY',    
    'TRQLZRS', 'SEDATVS', 'COCAINE', 'AMPHETMN','TRTMENT','MHTRTMT'  
    ])
df2.shape


Out[24]:
(57146, 22)

In [25]:
df2.keys()


Out[25]:
Index(['QID', 'AGECAT', 'SEX', 'MARRIED', 'EDUCAT', 'EMPLOY18', 'CTYMETRO',
       'HEALTH', 'MENTHLTH', 'SUICATT', 'PRLMISEVR', 'PRLMISAB', 'PRLANY',
       'HEROINEVR', 'HEROINUSE', 'HEROINFQY', 'TRQLZRS', 'SEDATVS', 'COCAINE',
       'AMPHETMN', 'TRTMENT', 'MHTRTMT'],
      dtype='object')

Step 9. Export data frame to CSV file


In [26]:
df2.to_csv('project-data.csv', sep=',', encoding='utf-8')