Following is the Python program I wrote to fulfill the second assignment of the Machine Learning for Data Analysis online course.
I decided to use Jupyter Notebook as it is a convenient way to write code and present the results alongside it.
For this assignment, I kept the same research question as for the previous assignment. I decided to use the NESARC database with the following question: are people of white ethnicity more likely to have ever used cannabis?
The explanatory variables will be those selected in the features list below: sex, age, ethnicity, family income, tobacco use, upbringing, highest grade completed, and employment status.
The data will be managed so that cannabis usage is recoded to 0 (never used cannabis) and 1 (used cannabis). Records with the non-answer code (reported as 9) will be discarded.
As the response variable has only two categories, no category grouping is needed.
In [1]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline
# Load a useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display, Image
In [3]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
In [4]:
nesarc = pd.read_csv('nesarc_pds.csv')
In [45]:
boolean_flip_ignore = {1 : 1, 2 : 0, 9 : np.nan}
boolean_flip_ignore_str = {'1' : 1, '2' : 0, '9' : np.nan, ' ' : 0}
boolean_flip = {1 : 1, 2 : 0}
# Group the family income categories into five broader ordinal groups
def income_group(value):
    if value < 5:
        return 0
    elif value < 9:
        return 1
    elif value < 13:
        return 2
    elif value < 17:
        return 3
    else:
        return 4
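# Select the variables of interest, recode them into clean 0/1 indicators
# (or ordinal groups for income) and drop records with missing values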
subnesarc = (nesarc[['AGE', 'SEX', 'S1Q7D', 'S3BQ1A5', 'S1Q11A', 'S1Q11B', 'S3AQ1A',
'S1Q1C', 'S1Q1D1', 'S1Q1D2', 'S1Q1D3', 'S1Q1D4', 'S1Q1D5',
'S1Q2A', 'S1Q2C1', 'S1Q2C2', 'S1Q2C3', 'S1Q2C4', 'S1Q2C5',
'S1Q2D', 'S1Q6A',
'S1Q7A1', 'S1Q7A2', 'S1Q7A6', 'S1Q7A7']]
.assign(sex=lambda x: pd.to_numeric(x['SEX'].map(boolean_flip)),
native=lambda x: pd.to_numeric(x['S1Q1D1'].map(boolean_flip)),
asian=lambda x: pd.to_numeric(x['S1Q1D2'].map(boolean_flip)),
black=lambda x: pd.to_numeric(x['S1Q1D3'].map(boolean_flip)),
pacific=lambda x: pd.to_numeric(x['S1Q1D4'].map(boolean_flip)),
white=lambda x: pd.to_numeric(x['S1Q1D5'].map(boolean_flip)),
used_canabis=lambda x: (pd.to_numeric(x['S3BQ1A5'], errors='coerce')
.map(boolean_flip_ignore)),
family_income=lambda x: pd.to_numeric(x['S1Q11B'].map(income_group)),
smoked_100cigarettes=lambda x: (pd.to_numeric(x['S3AQ1A'], errors='coerce')
.map(boolean_flip_ignore)),
bioparent=lambda x: (pd.to_numeric(x['S1Q2A'], errors='coerce')
.map(boolean_flip_ignore)),
adoptiveparent=lambda x: (pd.to_numeric(x['S1Q2C1'].map(boolean_flip_ignore_str),
errors='coerce')),
byrelative=lambda x: (pd.to_numeric(x['S1Q2C2'].map(boolean_flip_ignore_str),
errors='coerce')),
byfoster=lambda x: (pd.to_numeric(x['S1Q2C3'].map(boolean_flip_ignore_str),
errors='coerce')),
ininstitution=lambda x: (pd.to_numeric(x['S1Q2C4'].map(boolean_flip_ignore_str),
errors='coerce')),
otherraised=lambda x: (pd.to_numeric(x['S1Q2C5'].map(boolean_flip_ignore_str),
errors='coerce')),
divorced=lambda x: (pd.to_numeric(x['S1Q2D'].map(boolean_flip_ignore_str),
errors='coerce')),
grade=lambda x: (pd.to_numeric(x['S1Q6A'], errors='coerce')),
fulltime=lambda x: pd.to_numeric(x['S1Q7A1'].map(boolean_flip)),
parttime=lambda x: pd.to_numeric(x['S1Q7A2'].map(boolean_flip)),
unemployedsearch=lambda x: pd.to_numeric(x['S1Q7A6'].map(boolean_flip)),
unemployed=lambda x: pd.to_numeric(x['S1Q7A7'].map(boolean_flip))
)
.dropna())[['sex', 'AGE', 'native', 'asian', 'black', 'pacific', 'white',
'used_canabis', 'family_income', 'smoked_100cigarettes',
'bioparent', 'adoptiveparent', 'byrelative', 'byfoster', 'ininstitution',
'otherraised', 'divorced', 'grade', 'fulltime', 'parttime',
'unemployedsearch', 'unemployed']]
In [46]:
subnesarc.describe()
Out[46]:
In [47]:
# Selection of explanatory variables
features = ['sex', 'AGE', 'native', 'asian', 'black', 'pacific', 'white', 'family_income',
'smoked_100cigarettes',
'bioparent', 'adoptiveparent', 'byrelative', 'byfoster', 'ininstitution',
'otherraised', 'divorced', 'grade', 'fulltime', 'parttime',
'unemployedsearch', 'unemployed']
predictors = subnesarc[features]
# Response variable
targets = subnesarc['used_canabis']
# Split the data into a training set (60%) and a test set (40%)
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
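As a quick sanity check, the sizes of the two resulting sets can be inspected (a small sketch; the exact counts depend on the rows dropped during data management):
# Roughly 60% of the records should end up in the training set and 40% in the test set
print(pred_train.shape, pred_test.shape)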
The following commands train the Random Forest classifier on the training set (60% of the sample) and generate predictions on the test set.
In [48]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
In [49]:
cm = sklearn.metrics.confusion_matrix(tar_test, predictions)
nice_cm = pd.DataFrame({'Never used cannabis' : cm[:, 0], 'Used cannabis' : cm[:, 1]},
index=('Never used cannabis', 'Used cannabis'))
nice_cm.index.name = 'True/predicted value'
nice_cm
Out[49]:
The confusion matrix above does not mean much in the case of a Random Forest, as it reflects a single train/test split. Still, it shows that 13169 records were predicted correctly, while 2245 were false negatives and 1512 false positives.
The accuracy of the model is good, as shown below.
In [50]:
display(Markdown("Accuracy score of the model: {:.3g}".format(sklearn.metrics.accuracy_score(tar_test, predictions))))
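For a per-class breakdown, the classification_report function imported earlier can be applied to the same predictions (a small sketch; the exact figures depend on the random train/test split):
# Precision, recall and F1-score for each of the two classes
print(classification_report(tar_test, predictions,
                            target_names=['Never used cannabis', 'Used cannabis']))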
In [62]:
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train,tar_train)
# display the relative importance of each attribute
cm = sns.light_palette("yellow", as_cmap=True)
(pd.Series(model.feature_importances_, index=features, name='importance')
.to_frame()
.style.background_gradient(cmap=cm))
Out[62]:
The table above is the most interesting result. It shows the relative importance of the explanatory variables. The main message here is that I forgot to consider the highest grade completed variable when running the classification tree last week... The second one is the preponderant role of age in the trees.
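To make the ranking easier to read, the importances can also be sorted in decreasing order (a small sketch reusing the fitted model from the previous cell):
# Rank the explanatory variables by decreasing importance
importances = pd.Series(model.feature_importances_, index=features, name='importance')
print(importances.sort_values(ascending=False))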
In [63]:
# Accuracy score as a function of the number of trees in the forest
trees = range(1, 26)
accuracy = np.zeros(len(trees))
for idx, n_trees in enumerate(trees):
    classifier = RandomForestClassifier(n_estimators=n_trees)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)
plt.plot(trees, accuracy)
plt.xlabel("Number of trees")
plt.ylabel("Accuracy score");
This week's analysis shows that the four most important variables for telling whether someone has ever used cannabis are, in that order:
From the evolution of the accuracy score with the number of trees, it looks like a single decision tree already performs very well (the accuracy score is even the highest in that case). But to start to see the accuracy stabilise, 10 or more trees should be used.
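As a side note, a single train/test split makes this curve quite noisy, so the comparison between a single tree and a larger forest could be made more robust with cross-validation. Here is a minimal sketch, assuming scikit-learn's cross_val_score; the resulting scores will differ slightly from the plot above.
from sklearn.model_selection import cross_val_score
# Average 5-fold cross-validated accuracy for a single decision tree and a 25-tree forest
single_tree_score = cross_val_score(DecisionTreeClassifier(), predictors, targets, cv=5).mean()
forest_score = cross_val_score(RandomForestClassifier(n_estimators=25), predictors, targets, cv=5).mean()
print("Single decision tree: {:.3f}".format(single_tree_score))
print("25-tree random forest: {:.3f}".format(forest_score))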