Author: Melissa Burn
Georgetown University School of Continuing Studies, Certificate in Data Science, Cohort 11 (Spring 2018)
Please see the associated README and LICENSE files in this directory for important information about this project.
In [1]:
%matplotlib inline
import os
import pickle
import requests
import numpy as np
import pandas as pd
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
import sklearn
from sklearn import preprocessing, tree
from sklearn.preprocessing import scale
from sklearn.metrics import confusion_matrix, classification_report
from scipy import cluster
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
pd.options.mode.chained_assignment = None # get rid of this pesky warning; default='warn'
This Notebook moves through the following steps to feed, run, test, and validate the model:
Grab the dataset from the UCI Machine Learning Repository and look it over.
In [2]:
# Location of the drug consumption data in the UCI machine learning Repo
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00373/drug_consumption.data"
# Function to fetch the data from the UCI Repo and store it in a .txt file in the data subdir
def fetch_data(fname='data/drug_dataset.txt'):
    """
    Retrieve the ML Repository dataset and save it locally.
    """
    response = requests.get(URL)
    outpath = os.path.abspath(fname)
    with open(outpath, 'wb') as f:
        f.write(response.content)
    return outpath
DATA = fetch_data()
The research data were collected in 2016 by Fehrman (UK), who used the NEO 5-factor inventory to assess personality traits that might be associated with substance abuse. She also assessed respondents using the Barratt Impulsiveness Scale.
FROM THE REFERENCE ARTICLE:
"The five factors are: N, E, O, A, and C with 12 items per domain. The five traits can be summarized as Neuroticism, Extraversion, Openness to experience, Agreeableness, and Conscientiousness.
Participants were questioned concerning their use of 18 legal and illegal drugs (alcohol, amphetamines, amyl nitrite, benzodiazepines, cannabis, chocolate, cocaine, caffeine, crack, ecstasy, heroin, ketamine, legal highs, LSD, methadone, magic mushrooms, nicotine and volatile substance abuse (VSA)) and one fictitious drug (Semeron) which was introduced to identify over-claimers."
While the factors above may be distinct for the purpose of psychological assessment, they might be highly correlated for machine learning purposes and so may not all contribute useful information. I'll need to examine them to see if the features should be combined, dropped, or supplemented.
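One quick way to flag redundant features is a pairwise correlation check. The sketch below uses synthetic stand-ins for two highly correlated personality scores and one independent score (the column names echo the feature list but the data are fabricated for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ns = rng.normal(size=100)
# "Imp" is deliberately built to track "NS"; "ES" is independent noise
demo = pd.DataFrame({
    "NS": ns,
    "Imp": ns * 0.9 + rng.normal(scale=0.1, size=100),
    "ES": rng.normal(size=100),
})

# Absolute pairwise correlations; values near 1 suggest a column adds little new information
corr = demo.corr().abs()
```

A pair with correlation near 1 is a candidate for dropping or combining; the heatmap later in this Notebook does the same check on the real features.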
In [3]:
# Create the features list using information from the UCI Data Set Description
FEATURES = [
"ID", # May not be used to identify respondents
"Age", # 18-24, 25-34, 35-44, 45-54, 55-64, 65+
"Gender", # Female, Male
"Educ", # Left before age 16, left @ 16, @ 17, @ 18, some college, prof cert, univ degree, masters, doctorate
"Cntry", # Country: AUS, CAN, NZ, Other, IRE, UK, USA
"Ethn", # Ethnicity: Asian, Black, Mixed Bla/As, Mixed Whi/As, Mixed Whi/Bla, White, Other
"NS", # Neuroticism Score
"ES", # Extroversion Score
"OS", # Openness to experience Score
"AS", # Agreeableness Score
"CS", # Conscientiousness Score
"Imp", # Impulsivity, Likert scale with -3 = least impulsive, +3 = most impulsive
"SS", # Sensation seeking, part of the Impulsiveness assessment, score ranges from -3 to +3
"Alcohol", # Class of alcohol consumption
"Amphet", # Class of amphetamine consumption
"Amyl", # Class of amyl nitrite consumption
"Benzos", # Class of benzodiazepine consumption
"Caffeine", # Class of caffeine consumption
"Cannabis", # Class of cannabis consumption
"Choco", # Class of chocolate consumption
"Coke", # Class of cocaine consumption
"Crack", # Class of crack cocaine consumption
"Ecstasy", # Class of ecstasy consumption
"Heroin", # Class of heroin consumption
"Ketamine", # Class of ketamine consumption
"LegalH", # Class of legal highs consumption
"LSD", # Class of LSD consumption
"Meth", # Class of methadone consumption
"Shrooms", # Class of mushrooms consumption
"Nicotine", # Class of nicotine consumption
"Semer", # Class of fictitious drug Semeron consumption
"VSA" # Class of volatile substance abuse consumption
]
USE_CLASSES = [
"CL0", # Never used this substance
"CL1", # Used over a decade ago
"CL2", # Used in last decade
"CL3", # Used in last year
"CL4", # Used in last month
"CL5", # Used in last week
"CL6" # Used in last day
]
# Read the data into a DataFrame -- the Respondent characteristics are all integer or real
df = pd.read_csv(DATA, sep=',', header=None, names=FEATURES)
# Determine the shape of the data
print("{} instances with {} features\n".format(*df.shape))
# I'd like to look at this data a little more
df.head()
Out[3]:
In [4]:
df.index.values
Out[4]:
In [5]:
# Describe the dataset numeric features (won't include the drug use classes, which are still coded as text)
print(df.describe())
Because the UK researchers coded all numeric data into the range -3 to 3, I don't think the dataset needs to be standardized or normalized.
However, because I want to create a predictive model, and the most interesting thing (to me) to predict is drug usage, I need to create a target variable reflecting that. The simplest case is to use the demographic and personality data to make a guess, but I think people's use of socially acceptable substances like nicotine and alcohol might also be useful for the prediction, so I'll keep those in the feature set too.
The desired endstate for wrangling then is to keep all the potentially relevant features plus one target that reflects any illegal drug usage in the training dataset.
Tasks being completed in this section include:
Note several versions of the dataframe:
In [6]:
# Preserve the original df by creating a true copy to work on (plain assignment would only alias it)
dfcopy = df.copy()
# Replace all the non-numeric features (14 - 32) with numbers, using a range consistent with other variables
dfcopy.replace(["CL0","CL1","CL2","CL3","CL4","CL5","CL6"],[-3.0,-2.0,-1.0,0.0,1.0,2.0,3.0],inplace=True)
dfcopy.head()
Out[6]:
In [7]:
# I need a copy of the dataframe that has the first 13 demographic and personality features, plus
# the 4 responses indicating use of legal drugs (alcohol, caffeine, chocolate, and nicotine)
# NOTE: I'm treating "LegalH" as illegal when assessing likelihood of illegal use
feature_names = ['Age','Gender','Educ','Cntry','Ethn','NS','ES','OS','AS','CS','Imp',\
'SS',"Alcohol","Caffeine","Choco","Nicotine"]
part_df = pd.DataFrame.copy(dfcopy[feature_names])
print(part_df.head())
In [8]:
# Checking the data characteristics of part_df
print(part_df.describe())
In [9]:
# Try a scatter matrix
scatter_matrix(part_df, alpha=0.2, figsize=(13, 13), diagonal='kde')
plt.show()
In [10]:
# No "screaming" patterns jump out, but the values look reasonably distributed
In [11]:
# Create a radial plot focusing on relationships to one feature
from pandas.plotting import radviz
plt.figure(figsize=(13,13))
pd.plotting.radviz(part_df, "Nicotine")
plt.show()
In [12]:
# The dots are pretty well mixed up. Not much hope of segregating them, based on this view. Keep going.
In [13]:
# Try a different scatter plot
part_df.plot(kind='scatter', x='Age', y='Nicotine', s=part_df['NS']*100,figsize=[20,10])
Out[13]:
In [14]:
# Interesting that the size of the blobs (NS = negativity, anxiety) is high for all who smoke a lot.
# Also, both smoking and blob size seem to decrease with age, as though young people
# smoke more and are more NS = neurotic, negative, anxious, fearful. Wow!
# Try a different plot to see whether NS changes with increased age
sns.regplot(x='Age', y='NS', data=part_df, x_estimator=np.mean)
Out[14]:
In [15]:
# So, older people are a little less "neurotic", though the scatter increases with age. This is completely
# counter to the popular perception of younger people being more optimistic until age hardens them. Wow!
# Is there anything that increases with age?
sns.regplot(x='Age', y='CS', data=part_df, x_estimator=np.mean)
Out[15]:
In [16]:
# So, conscientiousness increases with age, which is reassuring, and the scatter seems a little narrower
# Is there anything at all that varies LESS with age, or doesn't change?
sns.regplot(x='Age', y='ES', data=part_df, x_estimator=np.mean)
Out[16]:
Hmmm. Apparently extroversion remains stable until it seems to drop for the oldest age group. Get off my lawn!!
Add "WillUse" column to part_df
Set drug use assessment as follows:
Note: I'm treating LegalHighs as illegal since, as far as I know, they're not accepted as normal recreation in any community (as the other 4 are).
In [17]:
# Create a list of the illegal drug names. These are the responses I want to compress into one factor that
# gets triggered if any of these are used within the past decade
drug_names = ["Amphet","Amyl","Benzos","Cannabis","Coke","Crack","Ecstasy","Heroin","Ketamine",
"LegalH","LSD","Meth","Shrooms","Semer","VSA"]
# To part_df, add a column for indicating illegal drug use (the target), set to zero ("not likely to use")
part_df["WillUse"] = int(0)
In [18]:
# Cycle through the instances and illegal drug name features to switch part_df['WillUse'] to 1
# if any illegal drugs were used in the past decade
for i in dfcopy.index.values:
    for drug in drug_names:
        # Use within the past decade or more recently is coded > -2.0
        if dfcopy.loc[i, drug] > -2.0:
            # Flag as likely to use illegal drugs
            part_df.loc[i, 'WillUse'] = 1
            break
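The loop above can also be collapsed into a single vectorized pandas expression. A minimal sketch on a hypothetical two-column frame using the same -3..3 coding:

```python
import pandas as pd

# Hypothetical mini-frame standing in for dfcopy, same -3..3 coding as the real data
demo = pd.DataFrame({"Cannabis": [-3.0, 0.0, -2.0],
                     "Heroin":   [-3.0, -3.0, 1.0]})
drug_cols = ["Cannabis", "Heroin"]

# A respondent is flagged 1 if ANY illegal-drug column exceeds -2.0
# (i.e., use within the past decade) -- exactly the loop's logic
will_use = (demo[drug_cols] > -2.0).any(axis=1).astype(int)
```

On the real data this would be `part_df["WillUse"] = (dfcopy[drug_names] > -2.0).any(axis=1).astype(int)`, which avoids the row-by-row `.loc` writes.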
In [19]:
print(part_df.describe()) # Check to see what this stuff looks like
In [20]:
# Judging from the mean of WillUse, quite a few of the respondents in the sample used in the past decade
# Basic data info
part_df.info()
In [21]:
# Maybe a pairplot will help see patterns better?
vis1 = sns.pairplot(data=part_df)
In [22]:
# Correlation heatmap
sns.heatmap(part_df.corr())
Out[22]:
In [23]:
# Interesting - shows greatest correlation between WillUse and Nicotine, SS, Imp and OS-openness to experience
# Create a radial plot focusing on relationships to WillUse
from pandas.plotting import radviz
plt.figure(figsize=(13,13))
pd.plotting.radviz(part_df, "WillUse")
plt.show()
In [49]:
# Create arrays to put into the models
X = np.array(part_df[feature_names])
y = np.array(part_df['WillUse'])
In [50]:
# Split the X and y arrays into two roughly equal parts, one to train and one to test
X_train, X_test = X[:942, ...], X[942:, ...]
y_train = y[:942]
y_test = y[942:]
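A fixed positional split like this assumes the rows are in random order; scikit-learn's `train_test_split` shuffles before splitting, which guards against any ordering in the source file. A sketch on synthetic arrays (the 50/50 split mirrors the halving above; `random_state=42` is an arbitrary choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(40).reshape(20, 2)   # 20 synthetic instances, 2 features
y_demo = np.array([0, 1] * 10)          # alternating labels

# Shuffled 50/50 split; stratify=y_demo would also preserve class balance
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42)
```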
In [51]:
# Run the decision tree classifier on the train data and test the accuracy using the test data.
dtc = tree.DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)
# Check the accuracy
accuracy = dtc.score(X_test, y_test)
print("Decision Tree Accuracy = ", accuracy)
# Print classification report
target_names = ['Unlikely to Use', 'Will Use']
clr = classification_report(y_test, y_pred, target_names=target_names)
print(clr)
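A single fixed split can be optimistic or pessimistic by chance; cross-validation averages the score over several splits. A sketch on a synthetic problem (the real `X` and `y` could be substituted directly):

```python
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Synthetic stand-in: 200 instances, 16 features like part_df's feature set
Xs, ys = make_classification(n_samples=200, n_features=16, random_state=0)

# 5-fold cross-validated accuracy for the decision tree
scores = cross_val_score(tree.DecisionTreeClassifier(random_state=0), Xs, ys, cv=5)
mean_acc = scores.mean()
```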
In [52]:
# Function code copied from jhboyle's "1984_Congressional_Voting_Classification" -- THANKS!
def plot_confusion_matrix(cm, title='Confusion Matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, size=15)
    plt.colorbar()
    tick_marks = np.arange(len(target_names))
    plt.xticks(tick_marks, target_names, rotation=0, size=12)
    plt.yticks(tick_marks, target_names, rotation=90, size=12)
    plt.tight_layout()
    plt.ylabel('True Label', size=15)
    plt.xlabel('Predicted Label', size=15)
    plt.savefig('plot_confusion_matrix')
In [53]:
# Plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print('Confusion matrix for un-scaled decision tree')
print(cm)
plt.figure()
plot_confusion_matrix(cm)
In [54]:
# Following jhboyle's example ...
# Normalize the confusion matrix by row (i.e by the number of samples in each class)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('Normalized Confusion Matrix for Decision Tree')
print(cm_normalized)
plt.figure()
plot_confusion_matrix(cm_normalized, title='Normalized Confusion Matrix')
plt.savefig('plot_norm_confusion_matrix')
plt.show()
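Row normalization just divides each row of the confusion matrix by its row total, so each row reads as per-class recall. A tiny worked example with made-up counts:

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: 10 true negatives-row, 10 true positives-row
cm_demo = np.array([[8, 2],
                    [1, 9]], dtype=float)

# Divide each row by its total: entry (i, j) becomes P(predicted j | true i)
cm_norm = cm_demo / cm_demo.sum(axis=1)[:, np.newaxis]
```

Each row of `cm_norm` sums to 1, and the diagonal entries are the recall for each class.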
In [55]:
# Would it improve the results if I scale X? (Decision trees are invariant to monotonic
# feature scaling, so I don't expect much change.)
Xp = scale(X)
# Split the X array into two, one to train and one to test
Xp_train, Xp_test = Xp[:942, ...], Xp[942:, ...]
y_train = y[:942]
y_test = y[942:]
# Run the decision tree classifier on the train data and test the accuracy on the test data.
dtcX = tree.DecisionTreeClassifier()
dtcX.fit(Xp_train, y_train)
y_pred = dtcX.predict(Xp_test)
# Check the accuracy (note: score the newly fitted dtcX, not the earlier dtc)
accuracy = dtcX.score(Xp_test, y_test)
print("Scaled Decision Tree Accuracy = ", accuracy)
# Print classification report
clr = classification_report(y_test, y_pred, target_names=target_names)
print(clr)
In [57]:
# Plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('Normalized Confusion Matrix for Scaled Decision Tree')
print(cm_normalized)
plt.figure()
plot_confusion_matrix(cm_normalized, title='Normalized Confusion Matrix')
plt.savefig('plot_norm_confusion_matrix')
plt.show()
In [59]:
# To try a Naive Bayes model, verify the train arrays have matching numbers of rows
print(X_train.shape, y_train.shape)
# Run a Gaussian Naive Bayes model
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
# Print classification report
clr = classification_report(y_test, y_pred, target_names=target_names)
print(clr)
In [61]:
# Plot the normalized confusion matrix
cm = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('Normalized Confusion Matrix for Naive Bayes')
print(cm_normalized)
plt.figure()
plot_confusion_matrix(cm_normalized, title='Normalized Confusion Matrix for Naive Bayes')
plt.savefig('plot_norm_confusion_matrix')
plt.show()
In [64]:
# I'll try a K Nearest Neighbor. For this, I'll go back to the scaled training dataset
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xp_train, y_train)
y_pred = knn.predict(Xp_test)
# Print classification report
clr = classification_report(y_test, y_pred, target_names=target_names)
print(clr)
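The choice of `n_neighbors=5` above is a default guess; a small sweep over candidate k values shows how sensitive the score is to that choice. A sketch on synthetic data (on the real data, `Xp_train`/`Xp_test` would be used instead):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

Xs, ys = make_classification(n_samples=300, n_features=16, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(Xs, ys, test_size=0.5, random_state=1)

# Score each candidate k on the held-out half and keep the best
scores = {k: KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr).score(Xte, yte)
          for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)
```

For a more rigorous choice, the sweep would use cross-validation on the training half only, so the test set stays untouched until the end.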
In [65]:
# Plot the normalized confusion matrix
cm = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('Normalized Confusion Matrix for K-Nearest Neighbor')
print(cm_normalized)
plt.figure()
plot_confusion_matrix(cm_normalized, title='Normalized Confusion Matrix for K-Nearest Neighbor')
plt.savefig('plot_norm_confusion_matrix')
plt.show()
In [66]:
# How about a Multi-layer Perceptron neural network model?
from sklearn.neural_network import MLPClassifier
mlpnn = MLPClassifier(solver='sgd', alpha=1e-5, random_state=1)
mlpnn.fit(X_train, y_train)
Out[66]:
MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False,
       epsilon=1e-08, hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
       solver='sgd', tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)
In [67]:
# Run the fitted model to get a prediction for the test data
y_pred = mlpnn.predict(X_test)
# Print classification report
clr = classification_report(y_test, y_pred, target_names=target_names)
print(clr)
In [68]:
# Plot the normalized confusion matrix
cm = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('Normalized Confusion Matrix')
print(cm_normalized)
plt.figure()
plot_confusion_matrix(cm_normalized, title='Normalized Confusion Matrix')
plt.savefig('plot_norm_confusion_matrix')
plt.show()
MLP results varied with the solver but not with the other parameters. I started with 'lbfgs', since that's what my example code used. That didn't work very well (f1-score 0.82), so on the advice of Stack Overflow I tried 'sgd', giving an f1-score of 0.86. I also tried the 'adam' solver, but it warned that it wasn't converging, and giving it four hidden layers didn't help. Going back to the 'sgd' solver, I adjusted the number of hidden layers a few times, going up to four, but f1 stayed the same. I doubled the number of iterations to 400, which also had no effect on f1, so I set 'sgd' back to one hidden layer and 200 iterations.
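The solver comparison described above can be scripted rather than run by hand. A sketch on small synthetic data (so the scores won't match the notebook's numbers; `max_iter=300` is an arbitrary cap and non-convergence warnings are suppressed for brevity):

```python
import warnings
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

Xs, ys = make_classification(n_samples=200, n_features=16, random_state=2)
Xtr, Xte, ytr, yte = train_test_split(Xs, ys, test_size=0.5, random_state=2)

results = {}
with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # hide non-convergence warnings from 'sgd'/'adam'
    for solver in ("lbfgs", "sgd", "adam"):
        clf = MLPClassifier(solver=solver, alpha=1e-5, random_state=1,
                            max_iter=300).fit(Xtr, ytr)
        results[solver] = f1_score(yte, clf.predict(Xte))
```

Looping like this makes it easy to extend the grid (hidden layer sizes, iteration counts) without re-editing the cell each time.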
This dataset was built around data taken from John Anthony Johnson's IPIP-NEO data repository, using results from his 2005 JRP study. The Notebook used to create the dataset is called "Reading & Wrangling Diff Dataset for Drug Use Predictor"; it added random integers to his actual data to provide all needed features.
In [82]:
new_df = pd.read_csv('data/Johnny_data_out.csv')
new_df.head()
Out[82]:
In [83]:
# I'll try the Naive Bayes since the MLP had trouble converging under certain conditions
# and throwing in a bunch of random data won't help
# Create array to put into the model
X_new = np.array(new_df[feature_names])
# Run the Gaussian Naive Bayes model, fitting it again with the UCI data
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_new)
In [84]:
from scipy import stats
stats.describe(y_pred)
Out[84]:
In [86]:
# That's not very helpful. I like df.describe() better
dfy = pd.DataFrame(y_pred)
print(dfy.describe())
Hmmmm. The low mean suggests the people in the new sample are much less likely to use illegal drugs than the people in the UCI sample. Is that realistic? Such a big difference?
In [ ]: