Notebook Author: Melissa Burn
Georgetown University School of Continuing Studies, Certificate in Data Science, Cohort 11 (Spring 2018)
Data Source:
Specific dataset used: Data from the Johnson (2005) JRP study, plus the documentation for those files. The file ipip20993.dat contains 20,993 cases of item responses to the IPIP-NEO-300 in ASCII format. It also contains facet and domain scale scores and two measures of intra-individual reliability described in the publication. Variables are listed at the top of the file. ipip20993.doc is a Word document describing the dataset.
Note that, before reading the data into this Notebook, I opened the ASCII file in Excel, kept roughly the first 3,000 instances, and discarded the rest. I deleted 300+ columns I didn't need, added an ID column, and adopted the IMMODERA and EXCITE columns as stand-ins for "Impulsiveness" and "Sensation Seeking". The columns will be renamed below.
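The manual Excel trimming described above could also be done in pandas. This is only a minimal sketch under stated assumptions: `trim_dataset` is a hypothetical helper, the tiny `raw` frame and its `UNUSED_ITEM` column are made up for illustration, and the 1-based ID numbering is a guess at what the added ID column looks like. It does not show how to parse ipip20993.dat itself, since the file layout isn't described here.

```python
import pandas as pd


def trim_dataset(raw, keep_cols, n_rows=3000):
    """Keep the first n_rows rows and only keep_cols, then add an ID column.

    Hypothetical helper; the 1-based ID numbering is an assumption.
    """
    trimmed = raw.loc[:, keep_cols].head(n_rows).reset_index(drop=True)
    trimmed.insert(0, "ID", range(1, len(trimmed) + 1))
    return trimmed


# Made-up stand-in for the full file, just to show the mechanics
raw = pd.DataFrame({
    "AGE": [21, 34, 45, 52],
    "GENDER": [0, 1, 1, 0],
    "IMMODERA": [30, 25, 41, 18],
    "EXCITE": [35, 22, 40, 27],
    "UNUSED_ITEM": [1, 2, 3, 4],  # hypothetical column to be dropped
})

small = trim_dataset(raw, ["AGE", "GENDER", "IMMODERA", "EXCITE"], n_rows=3)
print(small)
```

Doing the trimming in code rather than by hand would make the step repeatable if the row count or column selection changes.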
In [10]:
import numpy as np
import pandas as pd
from numpy import random
from random import randint
pd.options.mode.chained_assignment = None # get rid of this pesky warning; default='warn'
This Notebook moves through the following steps to ingest, sort, and wrangle the dataset so it fits into the Drug Use Predictor model:
Grab the dataset from the data subdirectory
In [25]:
data = pd.read_excel('data/Johnson_ipip3K_partial.xlsx')
data.head()
In [26]:
# There's an order of magnitude difference in the scale of the numbers and df needs normalizing
import sklearn
from sklearn import preprocessing
In [37]:
# I have learned that preprocessing strips the column headings, so create a working array
X = np.array(data)
X = X.astype(np.float64)
# Scale every column to the -3 to +3 range of the UCI dataset
X = preprocessing.minmax_scale(X, feature_range=(-3,3))
# Make a df again and restore the headings
df = pd.DataFrame(X, columns = data.columns)
print(df.describe())
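Because the whole array is passed to `minmax_scale`, the ID column is rescaled along with the real features. One possible way around that is to scale only a chosen subset of columns and leave the rest alone. The sketch below assumes the identifier column is literally named "ID"; `scale_except` is a hypothetical helper and the three-row frame is made up for illustration.

```python
import pandas as pd
from sklearn import preprocessing


def scale_except(data, skip=("ID",), feature_range=(-3, 3)):
    """Min-max scale every column except those named in `skip` (hypothetical helper)."""
    cols = [c for c in data.columns if c not in skip]
    # Cast the columns to float first so the scaled values are stored cleanly
    scaled = data.astype({c: "float64" for c in cols})
    # minmax_scale returns a plain array; assigning it back keeps the headings
    scaled[cols] = preprocessing.minmax_scale(scaled[cols], feature_range=feature_range)
    return scaled


# Made-up data: ID should come through untouched, the rest should span -3..3
demo = pd.DataFrame({"ID": [1, 2, 3], "AGE": [20, 30, 40], "IMMODERA": [10, 20, 40]})
scaled_demo = scale_except(demo)
print(scaled_demo)
```

Assigning the scaled array back into the selected columns also avoids the round trip through `np.array` and the step of restoring the headings.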
In [38]:
# Aaack! How do I avoid scaling the index? I couldn't find the answer through much googling
# Below is the features list I need. So, I'll have to invent data for the missing columns
# Note, this isn't the same order as in the UCI database but that shouldn't matter
FEATURES = [
"ID", # May not be used to identify respondents
"Age", # 18-24, 25-34, 35-44, 45-54, 55-64, 65+
"Gender", # Female, Male
"NS", # Neuroticism Score
"ES", # Extroversion Score
"OS", # Openness to experience Score
"AS", # Agreeableness Score
"CS", # Conscientiousness Score
"Imp", # Impulsivity, Likert scale with -3 = least impulsive, +3 = most impulsive
"SS", # Sensation seeking, part of the Impulsiveness assessment, -3 <= score <= +3
"Cntry", # Country: AUS, CAN, NZ, Other, IRE, UK, USA
"Educ", # Left before age 16, left @ 16, @ 17, @ 18, some college, prof cert, univ degree, masters, doctorate
"Ethn", # Ethnicity: Asian, Black, Mixed Bla/As, Mixed Whi/As, Mixed Whi/Bla, Other
"Alcohol", # Class of alcohol consumption
"Caffeine", # Class of caffeine consumption
"Choco", # Class of chocolate consumption
"Nicotine", # Class of nicotine consumption
]
print("{} instances with {} features\n".format(*df.shape))
In [39]:
# Rename the two columns I'm adopting to match the Drug Use Predictor format, and correct upper/lower of others
df.rename(columns={'IMMODERA': 'Imp', 'EXCITE': 'SS', 'AGE':'Age', 'GENDER':'Gender'}, inplace=True)
# Take a look at the data again
print(df.describe())
In [40]:
# I'll make all these people Americans for Cntry = 3
df['Cntry'] = 3
# Now I need to generate data for the Educ, Ethn, Alcohol, Caffeine, Choco, and Nicotine features
# HOWEVER, it will help to ensure they're the same scale as the other data in the df
# np.random.uniform(-3, 3) keeps every value inside [-3, 3); np.random.normal(-3, 3)
# would instead center the draws on -3 with a standard deviation of 3
# Assigning a whole column at once means the columns don't need to exist in advance
for col in ['Educ', 'Ethn', 'Alcohol', 'Caffeine', 'Choco', 'Nicotine']:
    df[col] = np.random.uniform(-3, 3, size=len(df))
print(df.describe())
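The FEATURES list defined earlier is never actually applied to the df. A quick check along the following lines could confirm that every expected column is present and put the columns in the listed order before saving. This is a sketch only: `align_to_features` is a hypothetical helper, the demo frame is made up, and dropping any column not in the list is an assumption about what the Drug Use Predictor expects.

```python
import pandas as pd

FEATURES = [
    "ID", "Age", "Gender", "NS", "ES", "OS", "AS", "CS", "Imp", "SS",
    "Cntry", "Educ", "Ethn", "Alcohol", "Caffeine", "Choco", "Nicotine",
]


def align_to_features(df, features):
    """Return df with exactly the listed columns, in the listed order.

    Hypothetical helper: raises KeyError if a feature column is missing
    and silently drops any column that is not in the list.
    """
    missing = [c for c in features if c not in df.columns]
    if missing:
        raise KeyError(f"Missing feature columns: {missing}")
    return df[features]


# Made-up frame with the columns in reverse order plus one extra column
demo = pd.DataFrame({name: [0.0, 1.0] for name in reversed(FEATURES)})
demo["Extra"] = [9, 9]

aligned = align_to_features(demo, FEATURES)
print(list(aligned.columns))
```

Failing loudly on a missing column is a deliberate choice here, so that a misspelled or unrenamed heading shows up before the CSV is written.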
In [41]:
# Now, save this df in a file that can be read by the Drug Use Predictor
df.to_csv('data/Johnny_data_out.csv', index=False)