Machine Learning for Data Analysis

Assignment: Running a Classification Tree

This is it: the last course of the Data Analysis specialization. Below is the Python program I wrote for the first assignment of the Machine Learning for Data Analysis online course.

I decided to use Jupyter Notebook, as it is a neat way to write code and present results together.

Research question for this assignment

For this assignment, I kept the same research question as in the previous assignment. I decided to use the NESARC database with the following question: Are people of white ethnicity more likely to have ever used cannabis?

The explanatory variables will be:

Age (Quantitative variable)
Sex (0 = Female, 1 = Male)
Family income (grouped in 5 categories)
Ever smoked 100+ cigarettes (0 = No, 1 = Yes)
White Ethnicity (2 = No, 1 = Yes)

Data management

The data will be managed so that cannabis usage is recoded to 0 (never used cannabis) and 1 (used cannabis). Records with no answer (coded as 9) will be discarded.

Since the response variable has only 2 categories, no grouping of categories is needed.
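The recoding described above can be sketched on a toy frame. The column name S3BQ1A5 and the codes 1 = yes, 2 = no, 9 = unknown follow the program in this post; the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample using the NESARC coding: 1 = yes, 2 = no, 9 = unknown
toy = pd.DataFrame({'S3BQ1A5': [1, 2, 9, 2, 1]})

# Recode to 1 = used, 0 = never used; 9 becomes NaN and is then dropped
recode = {1: 1, 2: 0, 9: np.nan}
toy['used_canabis'] = toy['S3BQ1A5'].map(recode)
toy = toy.dropna()

print(toy['used_canabis'].tolist())
```

The row coded 9 disappears, and the remaining values are 0 or 1.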



In [9]:

    
# Magic command to insert the graph directly in the notebook
%matplotlib inline
# Load useful Python libraries for handling data
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display, Image



In [10]:

    
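# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split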
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics



In [11]:

    
nesarc = pd.read_csv('nesarc_pds.csv')









    



C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (76) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)



In [12]:

    
canabis_usage = {1 : 1, 2 : 0, 9 : 9}
sex_shift = {1 : 1, 2 : 0}
white_race = {1 : 1, 2 : 0}

# Group the family income codes into five groups (0 to 4), four income codes per group
def income_group(value):
    if value < 5:
        return 0
    elif value < 9:
        return 1
    elif value < 13:
        return 2
    elif value < 17:
        return 3
    else:
        return 4

subnesarc = (nesarc[['AGE', 'SEX', 'S1Q1D5', 'S1Q7D', 'S3BQ1A5', 'S1Q11A', 'S1Q11B', 'S3AQ1A']]
             .assign(sex=lambda x: pd.to_numeric(x['SEX'].map(sex_shift)),
                     white_ethnicity=lambda x: pd.to_numeric(x['S1Q1D5'].map(white_race)),
                     used_canabis=lambda x: (pd.to_numeric(x['S3BQ1A5'], errors='coerce')
                                                .map(canabis_usage)
                                                .replace(9, np.nan)),
                     family_income=lambda x: (x['S1Q11B'].map(income_group)),
                     smoked_100cigarettes=lambda x: (pd.to_numeric(x['S3AQ1A'], errors='coerce')
                                                .map(canabis_usage)
                                                .replace(9, np.nan))
                    )
             .dropna())
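The five income groups built by income_group can also be obtained with pandas.cut. A minimal sketch, assuming the S1Q11B codes are integers from 1 to 21 (the range shown by describe() in this post); the bin edges mirror the thresholds of income_group and are not taken from the NESARC codebook.

```python
import numpy as np
import pandas as pd

def income_group(value):
    # Same thresholds as in the data management cell
    if value < 5:
        return 0
    elif value < 9:
        return 1
    elif value < 13:
        return 2
    elif value < 17:
        return 3
    else:
        return 4

# All possible income codes, assumed to be the integers 1..21
codes = pd.Series(range(1, 22))

# Right-closed bins: (-inf, 4], (4, 8], (8, 12], (12, 16], (16, inf)
binned = pd.cut(codes, bins=[-np.inf, 4, 8, 12, 16, np.inf], labels=False)

# Both approaches give the same group for every code
print((binned == codes.map(income_group)).all())
```

With integer codes the two approaches agree; pandas.cut avoids the row-by-row Python function call.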



In [13]:

    
subnesarc.describe()









Out[13]:

                AGE           SEX        S1Q1D5       S3BQ1A5        S1Q11A        S1Q11B
count  42467.000000  42467.000000  42467.000000  42467.000000  4.246700e+04  42467.000000
mean      46.411355      1.570985      1.238444      1.807639  4.565991e+04      9.427273
std       18.192126      0.494941      0.426137      0.394160  5.783819e+04      4.843027
min       18.000000      1.000000      1.000000      1.000000  2.400000e+01      1.000000
25%       32.000000      1.000000      1.000000      2.000000  1.700000e+04      6.000000
50%       44.000000      2.000000      1.000000      2.000000  3.300000e+04      9.000000
75%       59.000000      2.000000      1.000000      2.000000  6.000000e+04     13.000000
max       98.000000      2.000000      2.000000      2.000000  3.000000e+06     21.000000

             S3AQ1A  family_income           sex  smoked_100cigarettes  used_canabis  white_ethnicity
count  42467.000000   42467.000000  42467.000000          42467.000000  42467.000000     42467.000000
mean       1.578072       1.714696      0.429015              0.421928      0.192361         0.761556
std        0.493873       1.195342      0.494941              0.493873      0.394160         0.426137
min        1.000000       0.000000      0.000000              0.000000      0.000000         0.000000
25%        1.000000       1.000000      0.000000              0.000000      0.000000         1.000000
50%        2.000000       2.000000      0.000000              0.000000      0.000000         1.000000
75%        2.000000       3.000000      1.000000              1.000000      0.000000         1.000000
max        2.000000       4.000000      1.000000              1.000000      1.000000         1.000000

Modeling and prediction



In [14]:

    
features = ['AGE', 'sex', 'S1Q1D5', 'family_income', 'smoked_100cigarettes']
predictors = subnesarc[features]

targets = subnesarc['used_canabis']

# Split the data into a training set (60%) and a test set (40%)
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
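Because train_test_split shuffles the rows randomly, each run of the notebook yields a different split and slightly different results. A minimal sketch, on made-up data, of fixing the seed with random_state and preserving the class proportions with stratify; neither option is used in the program above, so this is a suggestion rather than what was actually run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 20% of them in the positive class
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 20 + [0] * 80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

# Same seed, same split: the test rows are identical on a second call
_, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

print(np.array_equal(X_test, X_test_again))
print(y_test.mean())
```

With stratify=y the test set keeps the 20% share of positives, which matters when one class is much rarer than the other.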

Let's build the decision tree



In [15]:

    
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

Now that the decision tree is built and has made its predictions, we can evaluate it on the test sample.



In [16]:

    
cm = sklearn.metrics.confusion_matrix(tar_test, predictions)
nice_cm = pd.DataFrame({'Never used cannabis' : cm[:, 0], 'Used cannabis' : cm[:, 1]},
                       index=('Never used cannabis', 'Used cannabis'))
nice_cm.index.name = 'True/predicted value'
nice_cm









Out[16]:

                      Never used cannabis  Used cannabis
True/predicted value
Never used cannabis                 13103            651
Used cannabis                        2638            595



In [17]:

    
display(Markdown("Accuracy score of the model: {:.3g}".format(sklearn.metrics.accuracy_score(tar_test, predictions))))









    




Accuracy score of the model: 0.806

Displaying the decision tree



In [10]:

    
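from io import StringIO

import pydotplus
from sklearn import tree

# Export the fitted tree in Graphviz DOT format into an in-memory buffer
out = StringIO()
tree.export_graphviz(classifier, out_file=out,
                     filled=True, rotate=True,
                     feature_names=features)

# Build a pydotplus graph from the DOT source so it can be rendered as a PNG
graph = pydotplus.graph_from_dot_data(out.getvalue())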



In [11]:

    
Image(graph.create_png())









    



dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.428192 to fit







Out[11]: [image of the rendered decision tree]

Summary

The decision tree algorithm was applied to the NESARC database to assess the influence of the following explanatory variables on the 'ever used cannabis' indicator:

Age (Quantitative variable)
Sex (0 = Female, 1 = Male)
Family income (grouped in 5 categories)
Ever smoked 100+ cigarettes (0 = No, 1 = Yes)
White Ethnicity (2 = No, 1 = Yes)

The decision tree is built using the Gini index as the splitting criterion.
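For reference, the Gini impurity of a node whose classes have proportions p_k is 1 minus the sum of the p_k squared; a split is chosen so that the weighted impurity of the child nodes is as low as possible. A minimal sketch, where the helper gini is illustrative and not part of scikit-learn:

```python
def gini(counts):
    """Gini impurity of a node given the number of samples in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# A perfectly mixed two-class node has the maximum impurity
print(gini([50, 50]))   # 0.5
# A pure node has zero impurity
print(gini([100, 0]))   # 0.0
```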

The confusion matrix shows that, on the test sample, the model correctly predicts 13103 people as never having used cannabis and 595 as having used it. But 2638 people who had actually used cannabis were misclassified, as were 651 people who had never used it.

The accuracy score is about 80%. But this high result is mostly due to the large proportion of people who have never used cannabis. Indeed, the prediction for people who have used cannabis is very poor: most of them are predicted as never having used it.
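The counts displayed in Out[16] make this concrete. A short sketch computing the accuracy, the accuracy of a trivial model that always answers "never used", and the recall on actual cannabis users:

```python
# Counts copied from the confusion matrix displayed in Out[16]
tn, fp = 13103, 651    # true class: never used cannabis
fn, tp = 2638, 595     # true class: used cannabis

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
# A model that always answers "never used" is right for every true non-user
baseline = (tn + fp) / total
# Share of actual users that the tree identifies
recall_users = tp / (tp + fn)

print("accuracy: {:.3f}".format(accuracy))
print("baseline: {:.3f}".format(baseline))
print("recall on users: {:.3f}".format(recall_users))
```

On these counts the accuracy (0.806) is slightly below the trivial baseline (0.810), and only about 18% of actual users are identified.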

The decision tree is huge, with lots of nodes, as expected with the Python approach: by default scikit-learn grows the tree without pruning it. I used the option filled=True to color each node according to its majority class.
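One way to get a tree small enough to read is to cap its depth. A minimal sketch on synthetic data (not the NESARC data), comparing the number of nodes with and without the max_depth parameter; the value 3 is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with some label noise, for illustration only
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=0)

# Default settings: the tree grows until its leaves are pure
full = DecisionTreeClassifier(random_state=0).fit(X, y)
# Capped at three levels of splits: at most 15 nodes
small = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print("nodes without limit:", full.tree_.node_count)
print("nodes with max_depth=3:", small.tree_.node_count)
```

A shallower tree is easier to display and interpret, at the cost of a coarser fit to the training sample.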

The first node is a split on AGE <= 55.5. For people younger than the threshold (13313 having never used cannabis and 4577 having used it), the second criterion is whether or not they ever smoked 100+ cigarettes. Those having smoked 100+ cigarettes are more likely to have ever used cannabis. Then comes the third level for that group, which splits according to sex. This time the tree implies a higher chance for males to have ever used cannabis. Those men are then split according to family income. This split separates out the fourth category: the richer ones are less likely to have ever tried cannabis.
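When the rendered image is too large to read, the first levels of a tree can also be printed as text. A minimal sketch on synthetic data, using export_text from sklearn.tree; the feature names f0, f1, f2 are placeholders, not NESARC variables.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder feature names and synthetic data, for illustration only
names = ['f0', 'f1', 'f2']
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Indented text rendering of the splits, one line per node
report = export_text(clf, feature_names=names)
print(report)

# The root split can also be read directly from the fitted tree
root_feature = names[clf.tree_.feature[0]]
root_threshold = clf.tree_.threshold[0]
print("root split: {} <= {:.3f}".format(root_feature, root_threshold))
```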

I could go on, but I will stop here, as a more detailed description would not add much.

The main conclusion from the decision tree is that behavior splits first on age. Whether the person is a smoker or a non-smoker is the second important criterion, and after that comes sex.
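The ranking of criteria read off the top of the tree can be cross-checked with the feature_importances_ attribute of a fitted tree. A minimal sketch on synthetic data; the feature names are placeholders, and the importances for the NESARC model were not computed in this post.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Placeholder feature names and synthetic data, for illustration only
names = ['f0', 'f1', 'f2', 'f3']
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Importance of each feature: its normalized total impurity reduction
importances = (pd.Series(clf.feature_importances_, index=names)
               .sort_values(ascending=False))
print(importances)
```

The importances sum to 1, so each value can be read as the share of the total impurity reduction attributed to that feature.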
