This is it: the last course of the Data Analysis specialization. Below is the Python program I wrote for the first assignment of the Machine Learning for Data Analysis online course.
I decided to use Jupyter Notebook because it is a neat way to write code and present results.
For this assignment, I kept the same research question as for the previous assignment. I use the NESARC database to answer the following question: are people of white ethnicity more likely to have ever used cannabis?
The explanatory variables will be age, sex, white ethnicity, family income group, and having smoked 100 cigarettes.
The data will be managed so that cannabis usage is recoded as 0 (never used cannabis) or 1 (used cannabis). Records without an answer (coded as 9) will be discarded.
Since the response variable has only two categories, no grouping of categories is needed.
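The recoding described above can be sketched on a handful of made-up answers. This is only an illustration under the stated coding (1 = yes, 2 = no, 9 = unknown); the values below are invented and are not NESARC records.

```python
import pandas as pd

# Toy answers using the coding described above: 1 = yes, 2 = no, 9 = unknown.
# These values are made up for illustration only.
raw = pd.Series([1, 2, 9, 2, 1], name="S3BQ1A5")

# Map yes/no to 1/0; any code absent from the mapping (here 9) becomes NaN.
recoded = raw.map({1: 1, 2: 0})

# Discard the records without an answer.
clean = recoded.dropna().astype(int)
print(clean.tolist())  # [1, 0, 0, 1]
```

Leaving 9 out of the mapping is a small shortcut compared with mapping it to itself and replacing it afterwards; both approaches end with the same rows dropped.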
In [9]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline
# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display, Image
In [10]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
In [11]:
nesarc = pd.read_csv('nesarc_pds.csv')
In [12]:
canabis_usage = {1 : 1, 2 : 0, 9 : 9}
sex_shift = {1 : 1, 2 : 0}
white_race = {1 : 1, 2 : 0}
# Group the family income classes into five groups
def income_group(value):
    if value < 5:
        return 0
    elif value < 9:
        return 1
    elif value < 13:
        return 2
    elif value < 17:
        return 3
    else:
        return 4
subnesarc = (nesarc[['AGE', 'SEX', 'S1Q1D5', 'S1Q7D', 'S3BQ1A5', 'S1Q11A', 'S1Q11B', 'S3AQ1A']]
             .assign(sex=lambda x: pd.to_numeric(x['SEX'].map(sex_shift)),
                     white_ethnicity=lambda x: pd.to_numeric(x['S1Q1D5'].map(white_race)),
                     used_canabis=lambda x: (pd.to_numeric(x['S3BQ1A5'], errors='coerce')
                                               .map(canabis_usage)
                                               .replace(9, np.nan)),
                     family_income=lambda x: (x['S1Q11B'].map(income_group)),
                     smoked_100cigarettes=lambda x: (pd.to_numeric(x['S3AQ1A'], errors='coerce')
                                                       .map(canabis_usage)
                                                       .replace(9, np.nan))
                    )
             .dropna())
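The income grouping done by `income_group` can probably also be written with `pd.cut`. The sketch below is an alternative, not what the notebook uses; it assumes the same thresholds (5, 9, 13 and 17) and runs on made-up income class codes.

```python
import numpy as np
import pandas as pd

# Made-up income class codes, chosen to hit both sides of every threshold.
income_class = pd.Series([1, 4, 5, 8, 9, 12, 13, 16, 17, 21])

# Left-closed bins [-inf, 5), [5, 9), [9, 13), [13, 17), [17, inf)
# reproduce the "value < threshold" tests of income_group.
bins = [-np.inf, 5, 9, 13, 17, np.inf]
groups = pd.cut(income_class, bins=bins, right=False, labels=False)
print(groups.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```

With `labels=False`, `pd.cut` returns the integer index of each bin, which matches the 0 to 4 codes returned by the function.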
In [13]:
subnesarc.describe()
Out[13]:
In [14]:
features = ['AGE', 'sex', 'S1Q1D5', 'family_income', 'smoked_100cigarettes']
predictors = subnesarc[features]
targets = subnesarc['used_canabis']
# Split the data into training and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
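The split above is random, so the results change at every run. As a hedged sketch on synthetic data (not the NESARC sample), fixing `random_state` makes the split repeatable, and `stratify` keeps the proportion of each class the same in both sets. The seed value and the 20 % share of positives are arbitrary assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, unbalanced binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 20 + [0] * 80)

# Same 60/40 split as in the notebook, but seeded and stratified on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)    # (60, 3) (40, 3)
print(y_train.mean(), y_test.mean())  # both sets keep 20 % of positives
```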
Let's build the decision tree.
In [15]:
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
Now that the decision tree has been built, we can evaluate it on the test sample.
In [16]:
cm = sklearn.metrics.confusion_matrix(tar_test, predictions)
nice_cm = pd.DataFrame({'Never used cannabis' : cm[:, 0], 'Used cannabis' : cm[:, 1]},
                       index=('Never used cannabis', 'Used cannabis'))
nice_cm.index.name = 'True/predicted value'
nice_cm
Out[16]:
In [17]:
display(Markdown("Accuracy score of the model: {:.3g}".format(sklearn.metrics.accuracy_score(tar_test, predictions))))
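To make the link between the confusion matrix and the accuracy score explicit, here is a small sketch on invented labels (not the model's actual predictions): the accuracy is the sum of the diagonal of the matrix divided by the total number of observations.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented true and predicted labels, for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Rows are the true classes, columns the predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[3 1]
           #  [1 3]]

# Correct predictions sit on the diagonal.
accuracy_from_cm = np.trace(cm) / cm.sum()
print(accuracy_from_cm, accuracy_score(y_true, y_pred))  # 0.75 0.75
```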
In [10]:
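from sklearn import tree
from io import StringIO
# Export the fitted tree in Graphviz DOT format into an in-memory buffer
out = StringIO()
tree.export_graphviz(classifier, out_file=out,
                     filled=True, rotate=True,
                     feature_names=features)
# Build a graph object from the DOT text (requires pydotplus and Graphviz)
import pydotplus
graph = pydotplus.graph_from_dot_data(out.getvalue())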
In [11]:
Image(graph.create_png())
Out[11]:
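When Graphviz or pydotplus is not available, the rules of a tree can also be printed as plain text with `sklearn.tree.export_text` (scikit-learn 0.21 or later). The sketch below is an alternative to the image above, run on a tiny made-up dataset; the rows and the fitted tree are illustrative and unrelated to the NESARC model.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up dataset: two features named after the notebook's columns.
X = [[20, 1], [25, 0], [40, 1], [60, 0], [35, 1], [50, 0]]
y = [1, 1, 0, 0, 1, 0]
feature_names = ["AGE", "sex"]

# A shallow tree keeps the printed rules short.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# One line per split or leaf, indented by depth.
rules = export_text(clf, feature_names=feature_names)
print(rules)
```

The text output is less pretty than the rendered graph, but it needs no extra dependency and stays readable for shallow trees.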