Machine Learning for Data Analysis

Assignment: Running a Classification Tree

This is it. The last course of the specialization on Data Analysis. Following is the Python program I wrote to fulfill the first assignment of the Machine Learning for Data Analysis online course.

I decided to use Jupyter Notebook as it is a pretty way to write code and present results.

Research question for this assignment

For this assignment, I took the same research question as for the previous assignment. I decided to use the NESARC database with the following question: Are people of white ethnicity more likely to have ever used cannabis?

The explanatory variables will be:

  • Age (Quantitative variable)
  • Sex (0 = Female, 1 = Male)
  • Family income (grouped in 5 categories)
  • Ever smoked 100+ cigarettes (0 = No, 1 = Yes)
  • White Ethnicity (2 = No, 1 = Yes)

Data management

The data will be managed so that cannabis usage is recoded to 0 (never used cannabis) and 1 (used cannabis). Records with no answer (coded as 9) will be discarded.

Since the response variable has only two categories, no grouping of categories is needed.


In [9]:
# Magic command to insert the graph directly in the notebook
%matplotlib inline
# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display, Image

In [10]:
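# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split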
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

In [11]:
nesarc = pd.read_csv('nesarc_pds.csv')


C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (76) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

In [12]:
canabis_usage = {1 : 1, 2 : 0, 9 : 9}
sex_shift = {1 : 1, 2 : 0}
white_race = {1 : 1, 2 : 0}

# Group the family income codes into 5 categories (0 to 4)
def income_group(value):
    if value < 5:
        return 0
    elif value < 9:
        return 1
    elif value < 13:
        return 2
    elif value < 17:
        return 3
    else:
        return 4

subnesarc = (nesarc[['AGE', 'SEX', 'S1Q1D5', 'S1Q7D', 'S3BQ1A5', 'S1Q11A', 'S1Q11B', 'S3AQ1A']]
             .assign(sex=lambda x: pd.to_numeric(x['SEX'].map(sex_shift)),
                     white_ethnicity=lambda x: pd.to_numeric(x['S1Q1D5'].map(white_race)),
                     used_canabis=lambda x: (pd.to_numeric(x['S3BQ1A5'], errors='coerce')
                                                .map(canabis_usage)
                                                .replace(9, np.nan)),
                     family_income=lambda x: (x['S1Q11B'].map(income_group)),
                     smoked_100cigarettes=lambda x: (pd.to_numeric(x['S3AQ1A'], errors='coerce')
                                                .map(canabis_usage)
                                                .replace(9, np.nan))
                    )
             .dropna())
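The five-way income grouping above can also be expressed with `pd.cut`. The following is only a sketch of an alternative, not what the notebook ran: the bin edges are my reading of the `income_group` thresholds, and the sample of codes 1 to 21 is a hypothetical stand-in for the `S1Q11B` column.

```python
import numpy as np
import pandas as pd


def income_group(value):
    """Same five-way grouping as in the cell above."""
    if value < 5:
        return 0
    elif value < 9:
        return 1
    elif value < 13:
        return 2
    elif value < 17:
        return 3
    else:
        return 4


# Hypothetical sample: the integer income codes 1..21 (the range seen in S1Q11B)
codes = pd.Series(range(1, 22))

# Right-closed bins: (-inf, 4], (4, 8], (8, 12], (12, 16], (16, inf)
# labels=False returns the bin number (0 to 4) instead of an interval
binned = pd.cut(codes, bins=[-np.inf, 4, 8, 12, 16, np.inf], labels=False)

# Check that both approaches give the same category for every code
same = bool((binned == codes.map(income_group)).all())
print(same)
```

For integer codes the two approaches agree; `pd.cut` just avoids the chain of `if`/`elif` and works on the whole column at once.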

In [13]:
subnesarc.describe()


Out[13]:
                AGE           SEX        S1Q1D5       S3BQ1A5        S1Q11A        S1Q11B        S3AQ1A  family_income           sex  smoked_100cigarettes  used_canabis  white_ethnicity
count  42467.000000  42467.000000  42467.000000  42467.000000  4.246700e+04  42467.000000  42467.000000   42467.000000  42467.000000          42467.000000  42467.000000     42467.000000
mean      46.411355      1.570985      1.238444      1.807639  4.565991e+04      9.427273      1.578072       1.714696      0.429015              0.421928      0.192361         0.761556
std       18.192126      0.494941      0.426137      0.394160  5.783819e+04      4.843027      0.493873       1.195342      0.494941              0.493873      0.394160         0.426137
min       18.000000      1.000000      1.000000      1.000000  2.400000e+01      1.000000      1.000000       0.000000      0.000000              0.000000      0.000000         0.000000
25%       32.000000      1.000000      1.000000      2.000000  1.700000e+04      6.000000      1.000000       1.000000      0.000000              0.000000      0.000000         1.000000
50%       44.000000      2.000000      1.000000      2.000000  3.300000e+04      9.000000      2.000000       2.000000      0.000000              0.000000      0.000000         1.000000
75%       59.000000      2.000000      1.000000      2.000000  6.000000e+04     13.000000      2.000000       3.000000      1.000000              1.000000      0.000000         1.000000
max       98.000000      2.000000      2.000000      2.000000  3.000000e+06     21.000000      2.000000       4.000000      1.000000              1.000000      1.000000         1.000000

Modeling and prediction


In [14]:
features = ['AGE', 'sex', 'S1Q1D5', 'family_income', 'smoked_100cigarettes']
predictors = subnesarc[features]

targets = subnesarc['used_canabis']

# Split the data into a training set (60%) and a test set (40%)
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
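Because `train_test_split` shuffles the rows at random, each run of the cell above produces a different split and therefore slightly different scores. Here is a minimal sketch of a reproducible split, assuming a fixed seed is acceptable. The `demo` frame is a hypothetical stand-in for `subnesarc`, the seed value 123 is arbitrary, and `stratify` is an optional extra that the original cell does not use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for subnesarc: 1000 rows, roughly 19% positive targets
rng = np.random.RandomState(0)
demo = pd.DataFrame({'AGE': rng.randint(18, 99, size=1000),
                     'used_canabis': (rng.rand(1000) < 0.19).astype(int)})


def split(frame):
    # random_state fixes the shuffle; stratify keeps the class ratio in both parts
    return train_test_split(frame[['AGE']], frame['used_canabis'],
                            test_size=.4, random_state=123,
                            stratify=frame['used_canabis'])


# Each call returns (pred_train, pred_test, tar_train, tar_test)
first = split(demo)
second = split(demo)

# Same seed, same rows in the test set
print(first[1].index.equals(second[1].index))
# Share of cannabis users in the training and in the test targets
print(first[2].mean(), first[3].mean())
```

With a fixed `random_state`, the confusion matrix and the figures quoted in the text stay in sync from one run to the next.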

Let's build the decision tree


In [15]:
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

Now that the decision tree is created, we can try it on the test sample.


In [16]:
cm = sklearn.metrics.confusion_matrix(tar_test, predictions)
nice_cm = pd.DataFrame({'Never used cannabis' : cm[:, 0], 'Used cannabis' : cm[:, 1]},
                       index=('Never used cannabis', 'Used cannabis'))
nice_cm.index.name = 'True/predicted value'
nice_cm


Out[16]:
True/predicted value   Never used cannabis   Used cannabis
Never used cannabis                  13103             651
Used cannabis                         2638             595

In [17]:
display(Markdown("Accuracy score of the model: {:.3g}".format(sklearn.metrics.accuracy_score(tar_test, predictions))))


Accuracy score of the model: 0.806
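Accuracy alone hides how the model behaves on each class. As a quick sketch, the per-class figures can be derived by hand from the confusion matrix displayed above. The variable names are mine, and the "always predict never used" baseline is an illustrative comparison, not something the notebook computed.

```python
import numpy as np

# Confusion matrix shown above: rows are true classes, columns are predictions
cm = np.array([[13103, 651],
               [2638, 595]])

total = cm.sum()
# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / total
# Share of actual cannabis users that the tree identifies (recall)
recall_users = cm[1, 1] / cm[1].sum()
# Share of "used cannabis" predictions that are correct (precision)
precision_users = cm[1, 1] / cm[:, 1].sum()
# Accuracy of a trivial model that always answers "never used cannabis"
baseline = cm[0].sum() / total

print("accuracy          {:.3f}".format(accuracy))
print("recall (users)    {:.3f}".format(recall_users))
print("precision (users) {:.3f}".format(precision_users))
print("baseline accuracy {:.3f}".format(baseline))
```

The recall for cannabis users comes out at roughly 18%, and the trivial baseline reaches about 81%, which is on par with the tree. The same per-class figures are what `classification_report(tar_test, predictions)` would print.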

Displaying the decision tree


In [10]:
from sklearn import tree
from io import StringIO
out = StringIO()
tree.export_graphviz(classifier, out_file=out,
                     filled=True, rotate=True,
                     feature_names=features)
import pydotplus

graph=pydotplus.graph_from_dot_data(out.getvalue())

In [11]:
Image(graph.create_png())


dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.428192 to fit

Out[11]:

Summary

The Decision Tree algorithm was applied to the NESARC database to study the influence of the following explanatory variables on the 'ever used cannabis' indicator:

  • Age (Quantitative variable)
  • Sex (0 = Female, 1 = Male)
  • Family income (grouped in 5 categories)
  • Ever smoked 100+ cigarettes (0 = No, 1 = Yes)
  • White Ethnicity (2 = No, 1 = Yes)

The decision tree is built using the Gini index as the splitting criterion (the default for DecisionTreeClassifier).
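To make the Gini criterion concrete: the impurity of a node is 1 minus the sum of the squared class proportions, and a split is scored by the size-weighted impurity of its children. The sketch below uses hypothetical class counts, and the helper names `gini_impurity` and `split_impurity` are mine, not scikit-learn functions.

```python
def gini_impurity(counts):
    """Gini impurity of a node given its class counts: 1 - sum(p_k ** 2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_impurity(left, right):
    """Impurity after a split: the children's impurities weighted by their size."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini_impurity(left) + (n_right / n) * gini_impurity(right)


# A perfectly mixed two-class node has the maximum impurity, a pure node has none
print(gini_impurity([50, 50]))   # 0.5
print(gini_impurity([100, 0]))   # 0.0

# Hypothetical split of the mixed node into two mostly homogeneous children
parent = [50, 50]
left, right = [40, 10], [10, 40]
decrease = gini_impurity(parent) - split_impurity(left, right)
print(decrease)
```

At each node the tree keeps the split with the largest impurity decrease among the candidate thresholds.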

The confusion matrix shows that, on the test sample, the model correctly predicts 13103 people as having never used cannabis and 595 as having ever used it. But 2638 people who had actually used cannabis were misclassified as never having used it, and 651 people who had never used cannabis were misclassified as users.

The accuracy score is about 81%. But this high result is due to the large proportion of people who have never used cannabis: they make up 13754 of the 16987 people in the test sample, so always predicting "never used cannabis" would score about as well. Indeed, the prediction for people who have ever used cannabis is very poor: only 595 of the 3233 actual users (about 18%) are identified, the others being predicted as never having used cannabis.

The decision tree is huge with lots of nodes, as expected with the Python approach: scikit-learn grows the tree fully unless a limit such as max_depth is set. I used the option filled=True to fill the nodes with a color indicating the majority class.
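The size of the tree can be kept in check by limiting its growth. Here is a minimal sketch on synthetic data, not the NESARC sample: the generating rule, the limits `max_depth=3` and `min_samples_leaf=50`, and the use of `export_text` instead of Graphviz are all illustrative assumptions on my part.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical synthetic data, NOT the NESARC sample
rng = np.random.RandomState(0)
n = 2000
age = rng.randint(18, 99, size=n)
smoker = rng.randint(0, 2, size=n)
# Made-up rule: usage is more likely for younger people and for smokers
prob = 0.05 + 0.25 * (age < 55) + 0.20 * smoker
used = (rng.rand(n) < prob).astype(int)
X = np.column_stack([age, smoker])

# Default settings: the tree grows until it cannot split any further
full = DecisionTreeClassifier(random_state=0).fit(X, used)
# Limited growth: at most 3 levels, at least 50 training rows per leaf
small = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50,
                               random_state=0).fit(X, used)

print("nodes:", full.tree_.node_count, "vs", small.tree_.node_count)
# Plain-text rendering of the small tree, readable without Graphviz
print(export_text(small, feature_names=['AGE', 'smoked_100cigarettes']))
```

A tree of depth 3 has at most 15 nodes, so it fits on screen and its first splits can be read directly.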

The first node is a split on the AGE criterion <= 55.5. For people younger than the threshold (13313 having never used cannabis and 4577 having used it), the second criterion is whether or not they ever smoked 100+ cigarettes. Those who have smoked 100+ cigarettes are more likely to have ever used cannabis. Then comes the third level for that group, which splits according to sex. This time the tree implies a higher chance for males to have ever used cannabis. Those men are then split according to family income. This split falls at the fourth category: the richer ones are less likely to have ever tried cannabis.

I could go on. But I will stop here, as a more precise description does not really add much.

The main conclusion from the decision tree is a split of behavior depending on age. Being a smoker or a non-smoker is the second most important criterion. After that comes sex.
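Reading the first levels of the tree is one way to rank the variables. Another is the `feature_importances_` attribute of a fitted tree, which sums the impurity decrease contributed by each feature. The sketch below runs on hypothetical synthetic data with a made-up generating rule and an extra `noise` column, so its numbers say nothing about NESARC.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data, NOT the NESARC sample
rng = np.random.RandomState(1)
n = 5000
age = rng.randint(18, 99, size=n)
smoker = rng.randint(0, 2, size=n)
noise = rng.rand(n)  # a column unrelated to the target
# Made-up rule: age and smoking both raise the probability of usage
prob = 0.05 + 0.30 * (age < 55) + 0.20 * smoker
used = (rng.rand(n) < prob).astype(int)

names = ['AGE', 'smoked_100cigarettes', 'noise']
X = np.column_stack([age, smoker, noise])
model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, used)

# Importances are normalised impurity decreases, so they sum to 1
ranking = sorted(zip(names, model.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print("{:<22} {:.3f}".format(name, score))
```

On the notebook's own model, the equivalent would be to pair `features` with `classifier.feature_importances_`.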