In [1]:
import os
from dotenv import load_dotenv, find_dotenv

# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()

# load up the entries as environment variables
load_dotenv(dotenv_path)


Out[1]:
True

First some environment variables

We now use the files that are stored in the RAW directory.

If we decide to change the data format by changing names, adding features, creating summary data frames, etc., we will save those files in the INTERIM directory.


In [2]:
PROJECT_DIR = os.path.dirname(dotenv_path)
RAW_DATA_DIR = PROJECT_DIR + os.environ.get("RAW_DATA_DIR")
INTERIM_DATA_DIR = PROJECT_DIR + os.environ.get("INTERIM_DATA_DIR")
files=os.environ.get("FILES").split()

print("Project directory is  : {0}".format(PROJECT_DIR))
print("Raw data directory is : {0}".format(RAW_DATA_DIR))
print("Interim directory is  : {0}".format(INTERIM_DATA_DIR))


Project directory is  : /home/gsentveld/lunch_and_learn
Raw data directory is : /home/gsentveld/lunch_and_learn/data/raw
Interim directory is  : /home/gsentveld/lunch_and_learn/data/interim

Importing pandas and matplotlib.pyplot


In [3]:
# The following jupyter notebook magic makes the plots appear in the notebook. 
# If you run in batch mode, you have to save your plots as images.
%matplotlib inline

# matplotlib.pyplot is traditionally imported as plt
import matplotlib.pyplot as plt

# numpy is imported as np
import numpy as np

# Pandas is traditionally imported as pd.
import pandas as pd
from pylab import rcParams

# some display options to size the figures. feel free to experiment
pd.set_option('display.max_columns', 25)
rcParams['figure.figsize'] = (17, 7)

Reading a file in Pandas

Reading a CSV file is really easy in Pandas. There are several formats that Pandas can deal with.

Format Type   Data Description       Reader          Writer
text          CSV                    read_csv        to_csv
text          JSON                   read_json       to_json
text          HTML                   read_html       to_html
text          Local clipboard        read_clipboard  to_clipboard
binary        MS Excel               read_excel      to_excel
binary        HDF5 Format            read_hdf        to_hdf
binary        Feather Format         read_feather    to_feather
binary        Msgpack                read_msgpack    to_msgpack
binary        Stata                  read_stata      to_stata
binary        SAS                    read_sas
binary        Python Pickle Format   read_pickle     to_pickle
SQL           SQL                    read_sql        to_sql
SQL           Google Big Query       read_gbq        to_gbq
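
Most of these readers share a similar set of keyword arguments. As a hypothetical sketch (the file name and column names below are placeholders, not part of this project), a read_csv call with a few commonly used options looks like this, using the pd imported above:

# Hypothetical example: 'some_file.csv' and the column names are placeholders.
example = pd.read_csv('some_file.csv',
                      usecols=['COL_A', 'COL_B'],   # read only the columns you need
                      nrows=1000,                   # limit rows while exploring a big file
                      na_values=['.', 'NA'])        # extra strings to treat as missing
example.head()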

Psychological well-being among US adults with arthritis and the unmet need for mental health care

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5436776/pdf/oarrr-9-101.pdf

This article suggests a relationship between arthritis and serious psychological distress (SPD).

First we will look at the article to recreate the data set from the NHIS data we got in session 2.

We will use pd.read_csv().

As you will see, the Jupyter notebook prints out a very nice rendition of the DataFrame object that is the result.


In [4]:
family=pd.read_csv(RAW_DATA_DIR+'/familyxx.csv')
samadult=pd.read_csv(RAW_DATA_DIR+'/samadult.csv')

In [5]:
# Start with a data frame to collect all the data in
df =  pd.DataFrame()

Mental health conditions

Individuals were determined to have SPD using the validated Kessler 6 (K6) scale [31, 32]. K6 scores are derived from responses to six questions asking how often in the past 30 days the individual felt “nervous”, “restless”, “hopeless”, “worthless”, “everything feels like an effort”, and “so sad that nothing cheers them up”, with responses ranging from 0 (none of the time) to 4 (all of the time). The responses for these six variables are summed to obtain the K6 score (maximum possible score of 24), and individuals with a score of ≥13 are considered to have SPD.

Corresponding columns:

ASINERV, ASIRSTLS, ASIHOPLS, ASIWTHLS, ASIEFFRT, ASISAD


In [6]:
# Calculate Kessler 6
# How often did you feel:
# nervous, restless, hopeless, worthless, everything is an effort, so sad nothing mattered.
# ASINERV,  ASIRSTLS, ASIHOPLS, ASIWTHLS, ASIEFFRT, ASISAD

kessler_6_questions=['ASINERV',  'ASIRSTLS', 'ASIHOPLS', 'ASIWTHLS', 'ASIEFFRT', 'ASISAD']

# 1 ALL of the time
# 2 MOST of the time
# 3 SOME of the time
# 4 A LITTLE of the time
# 5 NONE of the time
# 7 Refused
# 8 Not ascertained
# 9 Don't know 

# These have to be encoded as:
# 7, 8, 9 -> NaN
# 5 -> 0
# 4 -> 1
# 3 -> 2
# 2 -> 3
# 1 -> 4

In [7]:
kessler_6_map = { 1:4, 2:3, 3:2, 4:1, 5:0}
kessler_6=pd.DataFrame()

for col in kessler_6_questions:
    kessler_6[col]=[ kessler_6_map.get(x, None) for x in samadult[col]]
    
df['SPD']= kessler_6.sum(axis=1)>=13

df['SPD'] = np.where(df['SPD'], 'Yes', 'No')
del kessler_6
df.head(5)


Out[7]:
SPD
0 No
1 No
2 No
3 No
4 No
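
As a side note, the whole recode can also be done without the Python list comprehension, by combining where() and replace(); a sketch that should produce the same SPD flags as above:

# A sketch of the same recode done directly in pandas (not used further below).
# where() turns the codes 7, 8 and 9 (which are not <= 5) into NaN,
# and replace() then applies the same 1->4 ... 5->0 mapping.
k6_alt = samadult[kessler_6_questions].where(samadult[kessler_6_questions] <= 5)
k6_alt = k6_alt.replace(kessler_6_map)
spd_alt = np.where(k6_alt.sum(axis=1) >= 13, 'Yes', 'No')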

The arthritis indicator itself is very simple.


In [8]:
# Arthritis Status
arth_map= {1:'Yes', 2:'No'}

df['ARTH1']=[ arth_map.get(x, None) for x in samadult['ARTH1']]
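
Pandas can also do this lookup with Series.map(), which is equivalent to the list comprehension (unmatched codes become NaN instead of None, which makes no difference here); a small sketch, kept in a separate variable so df is not touched twice:

# Equivalent mapping with Series.map() (not used further below)
arth_alt = samadult['ARTH1'].map(arth_map)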

Chronic condition count

From the article:

We created a chronic condition count based on the following eight nonarthritis chronic conditions: cancer (except nonmelanoma skin); heart condition (including coronary heart disease, angina, myocardial infarction, or any other heart condition); diabetes; hepatitis or liver condition; hypertension (on at least two different visits); respiratory conditions (current asthma, emphysema, or chronic bronchitis); stroke; and weak or failing kidneys, defined similar to the recommendations of Goodman et al.

From the NHIS file:

  • CANEV, - CNKIND22: cancer (except nonmelanoma skin)
  • CHDEV: heart condition (including coronary heart disease, angina, myocardial infarction, or any other heart condition)
  • DIBEV: diabetes
  • AHEP, LIVEV: hepatitis or liver condition
  • HYPDIFV: hypertension (on at least two different visits)
  • AASMEV, EPHEV, CBRCHYR: respiratory conditions (current asthma, emphysema, or chronic bronchitis)
  • STREV, ALCHRC8: stroke
  • KIDWKYR: and weak or failing kidneys

In [9]:
# the following variables are used for the chronic condition count
straight_chronic_condition_questions = ['CHDEV','DIBEV','HYPDIFV', 'KIDWKYR']
cancer_nonmelanoma_skin= ['CANEV','CNKIND22'] # CANEV minus CNKIND22
hep_liver=['AHEP','LIVEV']
respiratory=['AASMEV','EPHEV', 'CBRCHYR']
stroke=['STREV','ALCHRC8']

# Create a temporary dataframe and collect the straight forward conditions
chronic_ind=pd.DataFrame()

# this could be a bit too liberal with the Unknown and Refused to answer values
for col in straight_chronic_condition_questions:
    chronic_ind[col]=samadult[col]==1

In [10]:
# Assume CANCER is false. Set to True for those diagnosed, and reset a few that were CNKIND22
chronic_ind['CANCER']=False
chronic_ind.loc[samadult['CANEV']==1,'CANCER'] = True
# override a few that have nonmelanoma skin
chronic_ind.loc[samadult['CNKIND22']==1, 'CANCER'] = False

In [11]:
# Assume Hepatitis or Liver condition is false and then set to True if either is reported
chronic_ind['HEPLIVER']=False
chronic_ind.loc[(samadult['AHEP']==1) | (samadult['LIVEV']==1), 'HEPLIVER'] = True

# Assume Respiratory condition is False and set to True if either of the three is reported
chronic_ind['RESPIRATORY']=False
chronic_ind.loc[(samadult['AASMEV']==1) | (samadult['EPHEV']==1) | (samadult['CBRCHYR']==1), 'RESPIRATORY'] = True

# Assume Stroke condition is false and then set to True if either flag is reported
chronic_ind['STROKE']=False
chronic_ind.loc[(samadult['STREV']==1) | (samadult['ALCHRC8']==1), 'STROKE'] = True

chronic_ind.head()


Out[11]:
CHDEV DIBEV HYPDIFV KIDWKYR CANCER HEPLIVER RESPIRATORY STROKE
0 False False False False False False False False
1 False False True False False False False False
2 True True True False False False True False
3 False False False False False False False False
4 False False True False False False True False

Now count the True values across this dataframe.

Keep the counts 0, 1 and 2 as they are, and label everything else '>=3'.


In [12]:
# Now count the chronic conditions and assign to df
chronic_ind['CHRONIC_CT']=np.array(np.sum(chronic_ind, axis=1))

chron_map = {0:'0',1:'1', 2:'2'}
df['CHRONIC_CT']=[chron_map.get(x, '>=3') for x in chronic_ind['CHRONIC_CT']] 
del chronic_ind

df.head(10)


Out[12]:
SPD ARTH1 CHRONIC_CT
0 No No 0
1 No No 1
2 No Yes >=3
3 No No 0
4 No Yes 2
5 No No 0
6 No No 0
7 No No 0
8 No No 0
9 Yes No >=3
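
Before moving on, a couple of quick sanity checks are easy to do; output not shown here:

# Distribution of the chronic condition categories (output not shown)
df['CHRONIC_CT'].value_counts()

# Cross-tabulation of arthritis status against SPD (output not shown)
pd.crosstab(df['ARTH1'], df['SPD'])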

In [13]:
# General Health Status does not exist as in the study (Very Good/Excellent, Good, Poor/Fair).
# AHSTATYR is only an indicator of whether health got better, got worse, or stayed the same;
# we will use it as a proxy.
status_map={1:"Very Good", 2:"Poor", 3: "Good"}
df['GENERAL_HEALTH_STATUS']=[status_map.get(x, None) for x in samadult['AHSTATYR']]

Another Pandas manipulation trick

Here we have a numerical range that we want to transform into three categories.

We could write a loop, but Pandas allows for a more Pythonic way to do this with boolean indexing and .loc.


In [14]:
# BMI
bmi=pd.DataFrame()
bmi['BMI']=samadult['BMI']

bmi.loc[bmi['BMI'] < 2500, 'BMI_C'] = '<25'
bmi.loc[(bmi['BMI'] >= 2500)&(bmi['BMI'] < 3000), 'BMI_C'] = '25<30'
bmi.loc[(bmi['BMI'] >= 3000)&(bmi['BMI'] < 9999), 'BMI_C'] = '>30'

df['BMI_C']=bmi['BMI_C']

del bmi
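
An alternative to the three .loc assignments is pd.cut(), which bins a numeric column in one call. A sketch that should yield the same categories (remember that BMI is stored with two implied decimals, hence the 2500 and 3000 cut points):

# Sketch using pd.cut, kept in a separate variable rather than added to df.
# With right=False the bins are [0, 2500), [2500, 3000) and [3000, 9999),
# so a value of 9999 falls outside the bins and becomes NaN, as above.
bmi_c_alt = pd.cut(samadult['BMI'],
                   bins=[0, 2500, 3000, 9999],
                   labels=['<25', '25<30', '>30'],
                   right=False)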

Physical Activity Level

The threshold is at least 150 moderate or 75 vigorous minutes per week. The questions can be answered per day, per week, per month, or per year, but the file also has the answers recoded to units per week, where the units are either minutes or hours.

So we have to do some math to figure out whether someone gets at least 150 moderate-equivalent minutes (vigorous minutes count double).

Another interesting way to manipulate data, this time using 'apply' and a user-defined function.


In [15]:
def determine_activity(x):
    minutes = 0
    if x['VIGLNGTP']==1:
        minutes = minutes + x['VIGLNGNO']*2
    elif x['VIGLNGTP']==2:
        minutes = minutes + x['VIGLNGNO']*120

    if x['MODLNGTP']==1:
        minutes = minutes + x['MODLNGNO']
    elif x['MODLNGTP']==2:
        minutes = minutes + x['MODLNGNO']*60

    return 'Meets' if minutes >= 150 else 'Does not meet'

physical_activity = samadult[['VIGLNGNO', 'VIGLNGTP', 'MODLNGNO', 'MODLNGTP']].copy()

physical_activity['ACTIVITY']=physical_activity.apply(determine_activity, axis=1)

df['ACTIVITY']=physical_activity['ACTIVITY']
del physical_activity
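
apply() with a row-wise function is flexible but can be slow on large files. The same minute calculation can be vectorized with np.select(); a sketch that should match the function above:

# Vectorized sketch of the same calculation (not used further below).
vig_minutes = np.select([samadult['VIGLNGTP'] == 1, samadult['VIGLNGTP'] == 2],
                        [samadult['VIGLNGNO'] * 2, samadult['VIGLNGNO'] * 120],
                        default=0)
mod_minutes = np.select([samadult['MODLNGTP'] == 1, samadult['MODLNGTP'] == 2],
                        [samadult['MODLNGNO'], samadult['MODLNGNO'] * 60],
                        default=0)
activity_alt = np.where(vig_minutes + mod_minutes >= 150, 'Meets', 'Does not meet')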

In [16]:
df.head(20)


Out[16]:
SPD ARTH1 CHRONIC_CT GENERAL_HEALTH_STATUS BMI_C ACTIVITY
0 No No 0 Good <25 Does not meet
1 No No 1 Good <25 Does not meet
2 No Yes >=3 Very Good >30 Does not meet
3 No No 0 Good <25 Does not meet
4 No Yes 2 Good >30 Does not meet
5 No No 0 Good <25 Does not meet
6 No No 0 Good 25<30 Does not meet
7 No No 0 Good 25<30 Does not meet
8 No No 0 Very Good <25 Does not meet
9 Yes No >=3 Poor 25<30 Does not meet
10 Yes No 1 Good 25<30 Does not meet
11 No Yes >=3 Good >30 Does not meet
12 No No 1 Good 25<30 Does not meet
13 No No 0 Good <25 Does not meet
14 Yes Yes >=3 Poor >30 Does not meet
15 No No 0 Good <25 Does not meet
16 No Yes 1 Poor >30 Does not meet
17 No No 1 Good 25<30 Does not meet
18 No No 0 Good 25<30 Meets
19 No Yes 0 Very Good <25 Does not meet

Similar activities for Age, Sex, and Race

Here we do similar recoding for Age, Sex and Race, and we start to see that for some fields the coding of the data is slightly different from what the article suggests. This is interesting, as it will skew the category probabilities.


In [17]:
# Age
age=pd.DataFrame()
age['AGE_P']=samadult['AGE_P']

age.loc[age['AGE_P'] < 45, 'AGE_C'] = '18-44'
age.loc[(age['AGE_P'] >= 45)&(age['AGE_P'] < 65), 'AGE_C'] = '45-64'
age.loc[age['AGE_P'] >= 65, 'AGE_C'] = '65-'

df['AGE_C']=age['AGE_C']
del age

In [18]:
# Sex
df['SEX']=[ 'Male' if x == 1 else 'Female' for x in samadult['SEX']]

In [19]:
# Race. Not exactly a match with the study. Not sure why.
# RACERPI2

race_map= {1: 'White', 2: 'Black/African American', 3:'AIAN', 4: 'Asian',5: 'not releasable',6: 'Multiple'}


df['RACE']=[ race_map.get(x, None) for x in samadult['RACERPI2']]

Some fields are not found or hard to reconstruct

Education Level and Employment Status are not encoded as expected. Education level can't be found at all, and employment status is a mix of work status and the reason for not working last week, which is an odd way to determine whether someone is retired or a student.


In [20]:
# Educational level:
# Less than high school
# High school diploma
# Some college or Associates degree
# College or greater
# Can't find it in data?

In [21]:
# Employment status: a complex mix of work status and the reason for not working last week; the logic is not described
# Maybe at least get "Out of Work", "Retired", "Other"?

Do the same for the other fields


In [22]:
# marital status
# R_MARITL

# 0 Under 14 years   -> will combine that with Never Married
# 1 Married - spouse in household         \
# 2 Married - spouse not in household      > -- will combine these
# 3 Married - spouse in household unknown /
# 4 Widowed 
# 5 Divorced   \ will combine these
# 6 Separated  /
# 7 Never married
# 8 Living with partner
# 9 Unknown marital status -> will combine with 7
marital_map = { 0: "Never Married"
              , 1: "Married"
              , 2: "Married"
              , 3: "Married"
              , 4: "Widowed"
              , 5: "Divorced/Separated"
              , 6: "Divorced/Separated"
              , 7: "Never Married"
              , 8: "Living with Partner"
              , 9: "Never Married"}
df['MARITAL_STATUS']=[ marital_map.get(x, "Never Married") for x in samadult['R_MARITL']]

In [23]:
# Functional limitation score
fl_columns=['FLWALK','FLCLIMB','FLSTAND','FLSIT','FLSTOOP','FLREACH','FLGRASP','FLCARRY','FLPUSH']

fl_cols=samadult[fl_columns].copy()

for col in fl_columns:
    # treat codes of 6 and higher as 0, in this column only
    fl_cols.loc[fl_cols[col] >= 6, col] = 0

fl_cols['FL_AVG']=fl_cols.mean(axis=1)

fl_cols.loc[fl_cols['FL_AVG'] == 0,'FUNC_LIMIT'] = 'None'
fl_cols.loc[(fl_cols['FL_AVG'] > 0)&(fl_cols['FL_AVG'] <=1),'FUNC_LIMIT'] = 'Low'
fl_cols.loc[(fl_cols['FL_AVG'] > 1)&(fl_cols['FL_AVG'] <=2),'FUNC_LIMIT'] = 'Medium'
fl_cols.loc[fl_cols['FL_AVG'] > 2,'FUNC_LIMIT'] = 'High'

df['FUNC_LIMIT']=fl_cols['FUNC_LIMIT']

del fl_cols

In [24]:
# Social participation restriction
# We defined social participation restriction as
# difficulty or inability to shop, go to events, or participate in
# social activities without special equipment, per previously
# published analyses.
# FLSHOP and FLSOCL
restr_map={1:"Yes", 2:"Yes", 3: "Yes", 4: "Yes"}

social_cols=pd.DataFrame()
social_cols['FLSHOP']=[restr_map.get(x, 'No') for x in samadult['FLSHOP']]
social_cols['FLSOCL']=[restr_map.get(x, 'No') for x in samadult['FLSOCL']]

social_cols.loc[(social_cols['FLSHOP']=='Yes')|(social_cols['FLSOCL']=='Yes'), 'SOC_RESTR']='Yes'
social_cols.loc[(social_cols['FLSHOP']=='No')&(social_cols['FLSOCL']=='No'), 'SOC_RESTR']='No'

df['SOC_RESTR']=social_cols['SOC_RESTR']

In [25]:
#Could not afford mental health care, past 12 months
# AHCAFYR2
# No  = 2
# Yes = 1
df['NOT_AFFORD']=[ 'Yes' if x == 1 else 'No' for x in samadult['AHCAFYR2']]

#Seen a mental health professional, past 12 months
# AHCSYR1
#No = 2
#Yes = 1
df['SEEN_MENTAL_DR']=[ 'Yes' if x == 1 else 'No' for x in samadult['AHCSYR1']]

What do we have so far?


In [26]:
df.head(34)


Out[26]:
SPD ARTH1 CHRONIC_CT GENERAL_HEALTH_STATUS BMI_C ACTIVITY AGE_C SEX RACE MARITAL_STATUS FUNC_LIMIT SOC_RESTR NOT_AFFORD SEEN_MENTAL_DR
0 No No 0 Good <25 Does not meet 18-44 Male White Never Married None No No No
1 No No 1 Good <25 Does not meet 45-64 Female White Never Married None No No No
2 No Yes >=3 Very Good >30 Does not meet 45-64 Male Asian Never Married None No No No
3 No No 0 Good <25 Does not meet 45-64 Male Black/African American Divorced/Separated None No No No
4 No Yes 2 Good >30 Does not meet 65- Female Black/African American Widowed High Yes No No
5 No No 0 Good <25 Does not meet 18-44 Female White Married None No No No
6 No No 0 Good 25<30 Does not meet 45-64 Female Black/African American Married None No No No
7 No No 0 Good 25<30 Does not meet 18-44 Male White Married None No No Yes
8 No No 0 Very Good <25 Does not meet 45-64 Female White Never Married None No No No
9 Yes No >=3 Poor 25<30 Does not meet 45-64 Female White Divorced/Separated High Yes Yes No
10 Yes No 1 Good 25<30 Does not meet 18-44 Female White Never Married Low Yes No Yes
11 No Yes >=3 Good >30 Does not meet 65- Male Black/African American Married Medium No No No
12 No No 1 Good 25<30 Does not meet 18-44 Male White Married None No No No
13 No No 0 Good <25 Does not meet 45-64 Male White Never Married None No No No
14 Yes Yes >=3 Poor >30 Does not meet 45-64 Female White Married High Yes No No
15 No No 0 Good <25 Does not meet 18-44 Female White Married None No No No
16 No Yes 1 Poor >30 Does not meet 65- Male White Married Medium No No No
17 No No 1 Good 25<30 Does not meet 65- Male Multiple Married None No No No
18 No No 0 Good 25<30 Meets 18-44 Male White Married None No No No
19 No Yes 0 Very Good <25 Does not meet 45-64 Male Asian Divorced/Separated None No No No
20 No Yes 2 Poor >30 Does not meet 65- Male White Divorced/Separated None No No No
21 No No 0 Good <25 Does not meet 45-64 Male White Never Married None Yes No No
22 No Yes 2 Poor 25<30 Does not meet 45-64 Male White Never Married Low No No Yes
23 No No 1 Good 25<30 Does not meet 45-64 Male White Married Low No No No
24 No Yes 2 Very Good >30 Does not meet 65- Female Black/African American Widowed None No No No
25 No No 1 Good <25 Does not meet 45-64 Male Black/African American Never Married Low Yes Yes No
26 No No 0 Good >30 Does not meet 18-44 Male White Married None No No No
27 No Yes >=3 Poor 25<30 Does not meet 65- Female White Married Medium Yes No No
28 No Yes 0 Good >30 Does not meet 45-64 Female White Married None No No No
29 No No 0 Good <25 Does not meet 18-44 Female White Married None No No No
30 No No 1 Poor <25 Does not meet 65- Male White Never Married None No No No
31 No No 0 Good <25 Does not meet 18-44 Male White Widowed None No No No
32 No No 0 Good 25<30 Does not meet 18-44 Male White Married None No No No
33 No No 2 Good >30 Does not meet 18-44 Female Black/African American Never Married High No No Yes
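
Since every column so far is categorical, describe() gives a quick overview of the number of distinct values and the most frequent category per column; output not shown:

# Per-column counts, unique values and top categories (output not shown)
df.describe(include='all')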

Now get the Insurance and Poverty Ratio fields from the Family file.


In [27]:
#From Familyxx get poverty ratio

fam_df=pd.DataFrame()

ratio_map={
1: '<1' # Under 0.50
,2: '<1' # 0.50 - 0.74
,3: '<1' # 0.75 - 0.99
,4: '1 to <2' # 1.00 - 1.24
,5: '1 to <2' # 1.25 - 1.49
,6: '1 to <2' # 1.50 - 1.74
,7: '1 to <2' # 1.75 - 1.99
,8: '>=2' # 2.00 - 2.49
,9: '>=2' # 2.50 - 2.99
,10: '>=2' # 3.00 - 3.49
,11: '>=2' # 3.50 - 3.99
,12: '>=2' # 4.00 - 4.49
,13: '>=2' # 4.50 - 4.99
,14: '>=2' # 5.00 and over
,15: '<1' # Less than 1.00 (no further detail)
,16: '1 to <2' # 1.00 - 1.99 (no further detail)
,17: '>=2' # 2.00 and over (no further detail)
,96:  '1 to <2' # Undefinable
,99:  '1 to <2' # Unknown 
}

fam_df['POV_RATIO']=[ratio_map.get(x, None) for x in family['RAT_CAT4']]

In [28]:
# Just going to go for Yes and No, treating any unknown/refused answers as No
# Health insurance
#Any private 
#Public only 
# Not covered 
# FHICOVYN
fam_df['INSURANCE']=['Yes' if x == 1 else 'No' for x in family['FHICOVYN']]

This is how you join two datasets in Pandas.

To join two datasets in Pandas, you can merge them based on key fields.

In the NHIS datasets, the fields that link a person to a family are the Household Key (HHX) and the Family Key (FMX).


In [29]:
df['HHX']=samadult['HHX']
df['FMX']=samadult['FMX']
fam_df['HHX']=family['HHX']
fam_df['FMX']=family['FMX']

You then do a merge, with the key columns given in the on= parameter and, very importantly, how='left' so that you don't lose any people if no matching family record is found.


In [30]:
joined_df=pd.merge(df, fam_df, on=['HHX','FMX'],how='left', sort=False)
joined_df.drop(['HHX','FMX'], axis=1,inplace=True )
joined_df.head()


Out[30]:
SPD ARTH1 CHRONIC_CT GENERAL_HEALTH_STATUS BMI_C ACTIVITY AGE_C SEX RACE MARITAL_STATUS FUNC_LIMIT SOC_RESTR NOT_AFFORD SEEN_MENTAL_DR POV_RATIO INSURANCE
0 No No 0 Good <25 Does not meet 18-44 Male White Never Married None No No No 1 to <2 Yes
1 No No 1 Good <25 Does not meet 45-64 Female White Never Married None No No No >=2 Yes
2 No Yes >=3 Very Good >30 Does not meet 45-64 Male Asian Never Married None No No No >=2 Yes
3 No No 0 Good <25 Does not meet 45-64 Male Black/African American Divorced/Separated None No No No >=2 Yes
4 No Yes 2 Good >30 Does not meet 65- Female Black/African American Widowed High Yes No No >=2 Yes
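
With a left join it is also worth checking that every adult actually found a family row. One way is to redo the merge with indicator=True, which adds a _merge column; a quick sketch (output not shown):

# Check how many adults did not match a family record (output not shown).
# At this point df still contains the HHX and FMX key columns.
check = pd.merge(df, fam_df, on=['HHX', 'FMX'], how='left', indicator=True)
check['_merge'].value_counts()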

Save the result in the INTERIM data directory


In [31]:
df=joined_df
df.to_csv(INTERIM_DATA_DIR+'/arthritis_study.csv')
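
Note that to_csv() also writes the DataFrame index as an unnamed first column. If you read the file back in a later notebook, pass index_col=0 (or write it here with index=False) to avoid a stray Unnamed: 0 column.

# Reading the interim file back later, for example:
df_check = pd.read_csv(INTERIM_DATA_DIR + '/arthritis_study.csv', index_col=0)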