Reading in and wrangling a completely different dataset to use for the Drug Use Predictor

Notebook Author: Melissa Burn | Georgetown University School of Continuing Studies, Certificate in Data Science, Cohort 11 (Spring 2018)

Data Source:

  • Johnson's IPIP-NEO data repository
  • Contributors: John Anthony Johnson
  • Date created: 2015-09-22 04:21 PM | Last Updated: 2015-11-04 06:25 PM
  • Description: This project makes available information about International Personality Item Pool (IPIP) versions of the NEO Personality Inventory.
  • URL: https://osf.io/sxeq5/

Specific dataset used: Data from the Johnson (2005) JRP study and documentation for those files. File ipip20993.dat contains 20,993 cases of item responses to the IPIP-NEO-300 in ASCII format. The file also contains facet and domain scale scores and two measures of intra-individual reliability described in the publication. Variables are listed at the top of the file; ipip20993.doc is a Word document describing the dataset.

Note that, prior to reading into this Notebook, I opened the ASCII file in Excel, kept the top ~3,000 instances, and discarded the rest. I deleted the 300+ columns I didn't need, added an ID column, and adopted the IMMODERA and EXCITE columns as stand-ins for "Impulsiveness" and "Sensation Seeking". The columns will be renamed below.
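The same trimming could alternatively be done in pandas rather than Excel. A minimal sketch, assuming ipip20993.dat is whitespace-delimited with variable names in a header row (the actual layout should be checked against ipip20993.doc); the `trim_ipip` helper name and the file path are mine:

```python
import pandas as pd

def trim_ipip(raw: pd.DataFrame, n_cases: int = 3000) -> pd.DataFrame:
    """Keep the first n_cases rows and only the columns needed downstream,
    then add a 1-based ID column at the front."""
    cols = ['AGE', 'GENDER', 'NS', 'ES', 'OS', 'AS', 'CS', 'IMMODERA', 'EXCITE']
    subset = raw.head(n_cases)[cols].copy()
    subset.insert(0, 'ID', range(1, len(subset) + 1))
    return subset

# Hypothetical usage, assuming a whitespace-delimited file with a header row:
# raw = pd.read_csv('data/ipip20993.dat', sep=r'\s+')
# trimmed = trim_ipip(raw)
```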


In [10]:
import numpy as np
import pandas as pd

pd.options.mode.chained_assignment = None  # suppress the SettingWithCopyWarning; default='warn'

Project Workflow

This Notebook moves through the following steps to ingest, sort, and wrangle the dataset so it fits into the Drug Use Predictor model:

  1. Ingest the required xlsx data into a dataframe
  2. Wrangle the data to provide the right format and column structure, keeping the age, gender and personality test scores
  3. Use a random number generator to fill in any features needed for the Drug Use Predictor that don't exist in the Johnson dataset

Data Ingestion

Grab the dataset from the data subdirectory


In [25]:
data = pd.read_excel('data/Johnson_ipip3K_partial.xlsx') 
data.head()


Out[25]:
ID AGE GENDER NS ES OS AS CS IMMODERA EXCITE
0 1 37 2 122 158 251 204 266 33 15
1 2 31 2 130 162 256 241 218 13 28
2 3 24 2 127 161 229 221 244 26 16
3 4 23 2 158 241 225 244 168 39 40
4 5 33 2 186 189 232 227 230 30 24

In [26]:
# The columns differ in scale by an order of magnitude, so the df needs normalizing
from sklearn import preprocessing

In [37]:
# sklearn's preprocessing returns a plain NumPy array (no column headings), so create a working array
X = np.array(data)
X = X.astype(np.float64)

# Scale the data in the range of the UCI dataset
X = preprocessing.minmax_scale(X, feature_range=(-3,3)) 

# Make a df again and restore the headings
df = pd.DataFrame(X, columns = data.columns)
print(df.describe())


                 ID          AGE       GENDER           NS           ES  \
count  3.167000e+03  3167.000000  3167.000000  3167.000000  3167.000000   
mean   1.158250e-16    -1.754436     0.707610    -0.070051     0.396142   
std    1.732871e+00     0.800656     2.915814     1.024715     0.946197   
min   -3.000000e+00    -3.000000    -3.000000    -3.000000    -3.000000   
25%   -1.500000e+00    -2.400000    -3.000000    -0.773756    -0.270142   
50%    0.000000e+00    -2.025000     3.000000    -0.149321     0.440758   
75%    1.500000e+00    -1.350000     3.000000     0.610860     1.094787   
max    3.000000e+00     3.000000     3.000000     3.000000     3.000000   

                OS           AS           CS     IMMODERA       EXCITE  
count  3167.000000  3167.000000  3167.000000  3167.000000  3167.000000  
mean      0.497913     0.904180     0.388287     0.317141     0.318330  
std       0.964612     0.884711     0.965160     1.088756     1.215007  
min      -3.000000    -3.000000    -3.000000    -3.000000    -3.000000  
25%      -0.127273     0.350254    -0.264249    -0.384615    -0.600000  
50%       0.527273     0.959391     0.419689     0.230769     0.300000  
75%       1.181818     1.507614     1.072539     1.153846     1.200000  
max       3.000000     3.000000     3.000000     3.000000     3.000000  

In [38]:
# Aaack! How do I avoid scaling the ID column? I couldn't find the answer through much googling

# Below is the features list I need, so I'll have to invent data for the missing columns
# Note, this isn't the same order as in the UCI database but that shouldn't matter

FEATURES  = [
    "ID", # May not be used to identify respondents
    "Age", # 18-24, 25-34, 35-44, 45-54, 55-64, 65+
    "Gender", # Female, Male
    "NS",  # Neuroticism Score
    "ES",  # Extraversion Score
    "OS",  # Openness to experience Score
    "AS",  # Agreeableness Score
    "CS",  # Conscientiousness Score
    "Imp", # Impulsivity, Likert scale with -3 = least impulsive, +3 = most impulsive
    "SS",  # Sensation seeking, part of the impulsiveness assessment; scores range from -3 to +3
    "Cntry", # Country: AUS, CAN, NZ, Other, IRE, UK, USA
    "Educ", # Left before age 16, left @ 16, @ 17, @ 18, some college, prof cert, univ degree, masters, doctorate
    "Ethn", # Ethnicity: Asian, Black, Mixed Bla/As, Mixed Whi/As, Mixed Whi/Bla, Other
    "Alcohol", # Class of alcohol consumption
    "Caffeine", # Class of caffeine consumption
    "Choco", # Class of chocolate consumption
    "Nicotine", # Class of nicotine consumption
]

print("{} instances with {} features\n".format(*df.shape))


3167 instances with 10 features
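On the question of keeping ID out of the scaling: one option is to apply minmax_scale to every column except ID. A sketch of that idea, wrapped in a helper (the `scale_except_id` name is mine):

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

def scale_except_id(data: pd.DataFrame, feature_range=(-3, 3)) -> pd.DataFrame:
    """Min-max scale every column except ID, leaving ID untouched."""
    df = data.copy()
    cols = [c for c in df.columns if c != 'ID']
    # minmax_scale works column-wise on the 2-D array and returns an ndarray,
    # which pandas assigns back under the original column names
    df[cols] = preprocessing.minmax_scale(df[cols].astype(np.float64),
                                          feature_range=feature_range)
    return df
```

Assigning the scaled array back into a column subset of a copied frame keeps the headings, so no separate restore step is needed.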


In [39]:
# Rename the two columns I'm adopting to match the Drug Use Predictor format, and correct upper/lower of others
df.rename(columns={'IMMODERA': 'Imp', 'EXCITE': 'SS', 'AGE':'Age', 'GENDER':'Gender'}, inplace=True)

# Take a look at the data again
print(df.describe())


                 ID          Age       Gender           NS           ES  \
count  3.167000e+03  3167.000000  3167.000000  3167.000000  3167.000000   
mean   1.158250e-16    -1.754436     0.707610    -0.070051     0.396142   
std    1.732871e+00     0.800656     2.915814     1.024715     0.946197   
min   -3.000000e+00    -3.000000    -3.000000    -3.000000    -3.000000   
25%   -1.500000e+00    -2.400000    -3.000000    -0.773756    -0.270142   
50%    0.000000e+00    -2.025000     3.000000    -0.149321     0.440758   
75%    1.500000e+00    -1.350000     3.000000     0.610860     1.094787   
max    3.000000e+00     3.000000     3.000000     3.000000     3.000000   

                OS           AS           CS          Imp           SS  
count  3167.000000  3167.000000  3167.000000  3167.000000  3167.000000  
mean      0.497913     0.904180     0.388287     0.317141     0.318330  
std       0.964612     0.884711     0.965160     1.088756     1.215007  
min      -3.000000    -3.000000    -3.000000    -3.000000    -3.000000  
25%      -0.127273     0.350254    -0.264249    -0.384615    -0.600000  
50%       0.527273     0.959391     0.419689     0.230769     0.300000  
75%       1.181818     1.507614     1.072539     1.153846     1.200000  
max       3.000000     3.000000     3.000000     3.000000     3.000000  

In [40]:
# I'll make all these people Americans for Cntry = 3
df['Cntry'] = 3

# Generate data for the Educ, Ethn, Alcohol, Caffeine, Choco, and Nicotine features.
# Note: np.random.normal(-3, 3) draws from a normal with mean -3 and std 3, so the
# values center near -3 and can spill outside the [-3, 3] range of the scaled columns
for col in ['Educ', 'Ethn', 'Alcohol', 'Caffeine', 'Choco', 'Nicotine']:
    df[col] = np.random.normal(-3, 3, size=len(df))

print(df.describe())


                 ID          Age       Gender           NS           ES  \
count  3.167000e+03  3167.000000  3167.000000  3167.000000  3167.000000   
mean   1.158250e-16    -1.754436     0.707610    -0.070051     0.396142   
std    1.732871e+00     0.800656     2.915814     1.024715     0.946197   
min   -3.000000e+00    -3.000000    -3.000000    -3.000000    -3.000000   
25%   -1.500000e+00    -2.400000    -3.000000    -0.773756    -0.270142   
50%    0.000000e+00    -2.025000     3.000000    -0.149321     0.440758   
75%    1.500000e+00    -1.350000     3.000000     0.610860     1.094787   
max    3.000000e+00     3.000000     3.000000     3.000000     3.000000   

                OS           AS           CS          Imp           SS  \
count  3167.000000  3167.000000  3167.000000  3167.000000  3167.000000   
mean      0.497913     0.904180     0.388287     0.317141     0.318330   
std       0.964612     0.884711     0.965160     1.088756     1.215007   
min      -3.000000    -3.000000    -3.000000    -3.000000    -3.000000   
25%      -0.127273     0.350254    -0.264249    -0.384615    -0.600000   
50%       0.527273     0.959391     0.419689     0.230769     0.300000   
75%       1.181818     1.507614     1.072539     1.153846     1.200000   
max       3.000000     3.000000     3.000000     3.000000     3.000000   

        Cntry         Educ         Ethn      Alcohol     Caffeine  \
count  3167.0  3167.000000  3167.000000  3167.000000  3167.000000   
mean      3.0    -3.053803    -2.924421    -3.033622    -2.971994   
std       0.0     2.983656     2.990705     2.931054     2.983613   
min       3.0   -12.396163   -13.860629   -12.442075   -11.750012   
25%       3.0    -5.097133    -4.971756    -5.036436    -5.084885   
50%       3.0    -3.098763    -2.904049    -3.066110    -2.887685   
75%       3.0    -1.010245    -0.928653    -1.061128    -0.970281   
max       3.0     6.315914     8.225119     6.344689     6.682071   

             Choco     Nicotine  
count  3167.000000  3167.000000  
mean     -2.990313    -2.941755  
std       3.004880     3.022518  
min     -14.382956   -13.050001  
25%      -4.981528    -4.993154  
50%      -3.072716    -2.957549  
75%      -0.972593    -0.845246  
max       8.767262     6.877888  
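Because np.random.normal(-3, 3) has mean -3 and standard deviation 3, the invented features range well beyond [-3, 3] (note the minima near -12 in the describe output). If they should stay strictly within the scaled range, a uniform draw, or clipping a normal, would do it. A sketch using a seeded generator (the variable names and parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility
n = 3167

# A uniform draw stays strictly within the scaled range [-3, 3]
educ_uniform = rng.uniform(-3, 3, size=n)

# Alternatively, keep a bell shape but clip the tails into range
educ_clipped = np.clip(rng.normal(0, 1.5, size=n), -3, 3)
```

Either array can be assigned directly to a dataframe column, e.g. `df['Educ'] = educ_uniform`.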

In [41]:
# Now, save this df in a file that can be read by the Drug Use Predictor
df.to_csv('data/Johnny_data_out.csv', index=False)
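As a final sanity check, the exported file can be read back to confirm the round trip preserved the shape and column names. A small helper sketch (the `save_and_verify` name is mine):

```python
import pandas as pd

def save_and_verify(df: pd.DataFrame, path: str) -> pd.DataFrame:
    """Write df without the index, read it back, and confirm the round
    trip preserved the shape and column names."""
    df.to_csv(path, index=False)
    back = pd.read_csv(path)
    assert back.shape == df.shape
    assert list(back.columns) == list(df.columns)
    return back

# Usage: save_and_verify(df, 'data/Johnny_data_out.csv')
```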

In [ ]: