Lab 4- Decision Trees

This assignment uses 2012 data obtained from the Federal Election Commission on contributions to candidates from committees. The data dictionary is available at http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionstoCandidates.shtml. The file we've given you has been subset to 10,000 randomly sampled rows, wth some columns removed



In [48]:

    
from __future__ import division, print_function
from collections import Counter, defaultdict
from itertools import combinations 
import pandas as pd
import numpy as np
import itertools



In [49]:

    
import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer #to turn categorial variables into numeric arrays
from sklearn import preprocessing #to transform the feature labels
from sklearn.feature_extraction import DictVectorizer



In [50]:

    
df = pd.read_csv('lab4_candidate_contributions.csv')

#convert zip code and transaction date from floats to strings (since we wnat to treat them as categorical)
df.ZIP_CODE = df.ZIP_CODE.astype('int').astype('str')
df.TRANSACTION_DT = df.TRANSACTION_DT.astype('int').astype('str')

df.head()









    Out[50]:






  
    
      
      CMTE_ID
      AMNDT_IND
      RPT_TP
      ENTITY_TP
      NAME
      CITY
      STATE
      ZIP_CODE
      TRANSACTION_DT
      TRANSACTION_AMT
      CAND_ID
    
  
  
    
      0
       C90011156
       N
       Q3
       IND
       ROULAND, PRESTON
       CLEVELAND
       OH
       44135
        9182012
       26
        Obama
    
    
      1
       C90011156
       N
       Q3
       IND
         DANFORD, KAYLA
          OXFORD
       OH
       45056
        9242012
       17
        Obama
    
    
      2
       C90011156
       N
       YE
       IND
         CLARK, BARBARA
       WHITEHALL
       OH
       43213
       10032012
       18
       Romney
    
    
      3
       C90011156
       N
       YE
       ORG
        JENNIFER JANNON
         DORMONT
       PA
       15216
       10262012
       11
        Obama
    
    
      4
       C90011156
       N
       YE
       IND
          ARNDI, PHILIP
          BOSTON
       MA
        2115
       11062012
       57
        Obama



In [ ]:

Calculating Gini Index

Question 1: How many rows are there in the dataset for Obama? For Romney?



In [51]:

    
obama = 0
romney = 0
for i in range(df.CAND_ID.size):
    if df.CAND_ID.get_value(i) == "Obama":
        obama += 1
    else:
        romney += 1
print("Obama: %d, Romney: %d"%(obama, romney))









    



Obama: 5761, Romney: 4239

Question 2: What is the Gini Index of the this dataset, using Romney and Obama as the target classes?



In [52]:

    
def gini(D):
    obama = sum(D.CAND_ID == "Obama")
    romney = sum(D.CAND_ID == "Romney")
    total = obama+romney
    return 1-((obama/total)**2+(romney/total)**2)
    
print("gini index: %f"%gini(df))









    



gini index: 0.488418

Best Split of a Numeric Feature



In [53]:

    
sortd = df.sort(columns="TRANSACTION_AMT")
mingini = 1
mini = 0
ob1 = 0
ob2 = sum(df.CAND_ID == "Obama")
ro1 = 0
ro2 = sum(df.CAND_ID == "Romney")
total = ob2+ro2
for i in range(df.CAND_ID.size-1):
    if sortd.CAND_ID.get_value(i) == "Obama":
        ob1 += 1
        ob2 -= 1
    else:
        ro1 += 1
        ro2 -= 1
    low = df.TRANSACTION_AMT.get_value(i)
    high = df.TRANSACTION_AMT.get_value(i+1)
    if low != high:
        tot1 = ob1+ro1
        tot2 = ob2+ro2
        gini1 = 1-((ob1/tot1)**2+(ro1/tot1)**2)
        gini2 = 1-((ob2/tot2)**2+(ro2/tot2)**2)
        ginit = gini1*((ob1+ro1)/total) + gini2*((ob2+ro2)/total)
        if ginit < mingini:
            mini = i
            mingini = ginit
            minob1 = ob1
            minob2 = ob2
            minro1 = ro1
            minro2 = ro2

print("split after: %d, gini score: %f, gini reduced by: %f, Obama below: %d, Obama above: %d, Romney below: %d, Romney above: %d"%(mini, mingini, (gini(df)-mingini), minob1, minob2, minro1, minro2))









    



split after: 8040, gini score: 0.488082, gini reduced by: 0.000336, Obama below: 4581, Obama above: 1180, Romney below: 3460, Romney above: 779

Question 3: What is the best split point of the TRANSACTION_AMT feature.



In [54]:

    
mini









    Out[54]:





8040

Question 4: What is the Gini Index of this best split?



In [55]:

    
mingini









    Out[55]:





0.48808188000763558

Question 5: How much does this partitioning reduce the Gini Index over that of the overall dataset?



In [56]:

    
(gini(df)-mingini)









    Out[56]:





0.00033569999236454651

Question 6: How many Romney rows are below your best split point? Obama rows?



In [57]:

    
print("Romney Rows: ", minro1)
print("Obama Rows:", minob1)









    



Romney Rows:  3460
Obama Rows: 4581

Question 7: How many Romney rows are above your best split point? Obama rows?

Recall that, to calculate the best split of this numeric field, you'll need to order your data by TRANSACTION AMT, then consider the midpoint between each pair of consecutive transaction amounts as a potential split point, then calculate the Gini Index for that partitioning. You'll want to keep track of the best split point and its Gini Index (remember that you are trying to minimize the Gini Index).

There are a lot of ways to do this. Some are very fast, others very slow. One tip to make this run quickly is, as you consecutively step through the data and calculate the Gini Index of each possible split point, keep a running total of the number of rows for each candidate that are located above and below the split point.

Some Python tips:

Counter(), from the collections module, is a special dictionary for counting values of a key
zip() lets you concatenate lists into a list of tuples (for example, if we have a list of the candidates and a list of transaction amounts, zip(candidate_list, transaction_amount) would give us a list of (candidate, transaction amount) pairs



In [58]:

    
sortd = df.sort(columns="TRANSACTION_AMT")
mingini = 1
mini = 0
ob1 = 0
ob2 = sum(df.CAND_ID == "Obama")
ro1 = 0
ro2 = sum(df.CAND_ID == "Romney")
total = ob2+ro2
for i in range(df.CAND_ID.size-1):
    if sortd.CAND_ID.get_value(i) == "Obama":
        ob1 += 1
        ob2 -= 1
    else:
        ro1 += 1
        ro2 -= 1
    low = df.TRANSACTION_AMT.get_value(i)
    high = df.TRANSACTION_AMT.get_value(i+1)
    if low != high:
        tot1 = ob1+ro1
        tot2 = ob2+ro2
        gini1 = 1-((ob1/tot1)**2+(ro1/tot1)**2)
        gini2 = 1-((ob2/tot2)**2+(ro2/tot2)**2)
        ginit = gini1*((ob1+ro1)/total) + gini2*((ob2+ro2)/total)
        if ginit < mingini:
            mini = i
            mingini = ginit
            minob1 = ob1
            minob2 = ob2
            minro1 = ro1
            minro2 = ro2

print("split after: %d, gini score: %f, gini reduced by: %f, Obama below: %d, Obama above: %d, Romney below: %d, Romney above: %d"%(mini, mingini, (gini(df)-mingini), minob1, minob2, minro1, minro2))









    



split after: 8040, gini score: 0.488082, gini reduced by: 0.000336, Obama below: 4581, Obama above: 1180, Romney below: 3460, Romney above: 779

Best Split of a Categorial Variable



In [59]:

    
import functools
# question 8
entity_vals = pd.unique(df["ENTITY_TP"])
combinations = functools.reduce(lambda x,y: x+y, [list(itertools.combinations(entity_vals, r)) for r in range(1, len(entity_vals)//2 + 1)])

# question 9
mingini = 1
mincomb = None
for comb in combinations:
    indices = df["ENTITY_TP"].isin(comb)
    split1 = df.loc[indices,:]
    split2 = df.loc[~indices,:]
    cur_gini = (len(split1) * gini(split1) + len(split2) * gini(split2)) / len(df)
    if cur_gini < mingini:
        mingini = cur_gini
        mincomb = comb
        min_split1 = split1
        min_split2 = split2
    
    #print "%d, %d"%(len(split1), len(split2))

Question 8: How many possible splits are there of the ENTITY_TP feature?



In [60]:

    
len(combinations)
# (2**7 - 2) / 2 (because we optimize by throwing out half)









    Out[60]:





63

Question 9: Which split of ENTITY_TP best splits the Obama and Romney rows, as measured by the Gini Index?



In [61]:

    
# question 9
mincomb









    Out[61]:





('CCM', 'COM', 'CAN')

Question 10: What is the Gini Index of this best split?



In [62]:

    
# question 10
mingini









    Out[62]:





0.48314520835547992

Question 11: How much does this partitioning reduce the Gini Index over that of the overall data set?



In [63]:

    
# question 11
gini(df) - mingini









    Out[63]:





0.0052723716445202129

Question 12: How many Romney rows and Obama rows are in your first partition? How many Romney rows and Obama rows are in your second partition?



In [64]:

    
# question 12
print("Romney: %s, Obama: %s"%(sum(min_split1.CAND_ID == "Romney"), sum(min_split1.CAND_ID == "Obama")))
print("Romney: %s, Obama: %s"%(sum(min_split2.CAND_ID == "Romney"), sum(min_split2.CAND_ID == "Obama")))









    



Romney: 147, Obama: 37
Romney: 4092, Obama: 5724

In this exercise, you will be partitioning the original dataset (as opposed to further partitioning the transaction amount partitions from the previous set of questions).

Python tip: the combinations function of the itertools module allows you to enumerate combinations of a list

Training a decision tree

Question 13: Using all of the features in the original dataframe read in at the top of this notebook, train a decision tree classifier that has a depth of three (including the root node and leaf nodes). What is the accuracy of this classifier on the training data?



In [65]:

    
from random import sample
def trainingSample(n, size):
    rows = sample(range(n), size)
    return rows

def separateRows(training_rows, data):
    """ Return (training set, prediction set) with n% of rows in training set"""
    training = data.ix[training_rows]
    prediction = data.drop(training_rows)
    return (training, prediction)



In [66]:

    
from datetime import datetime
from sklearn import preprocessing
from sklearn.feature_extraction import DictVectorizer

classifier = DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=3, min_samples_split=2, min_samples_leaf=1, max_features=None, random_state=None, min_density=None, compute_importances=None, max_leaf_nodes=None)
df_new = df.copy()
df_new.TRANSACTION_DT = df_new.TRANSACTION_DT.apply(lambda x: x if len(x) == 8 else "0" + x)
df_new.TRANSACTION_DT = df_new.TRANSACTION_DT.apply(lambda x: datetime.strptime(x, "%m%d%Y").toordinal())

CAND_ID = df_new.CAND_ID
X = df_new.drop("CAND_ID", axis = 1)

vec = DictVectorizer()
X = pd.DataFrame(vec.fit_transform(X.to_dict("records")).toarray())
X.columns = vec.get_feature_names()

train_size = 0.75
training_rows = trainingSample(X.shape[0], int(train_size * X.shape[0]))
train_rows, pred_rows = separateRows(training_rows, X)
train_Y, pred_Y = separateRows(training_rows, CAND_ID)


train_Y = train_Y == 'Obama'
train_Y = train_Y.astype(int)
pred_Y = pred_Y == 'Obama'
pred_Y = pred_Y.astype(int)



In [67]:

    
clf = classifier.fit(train_rows, train_Y)



In [68]:

    
classifier.score(train_rows, train_Y)









    Out[68]:





0.57906666666666662



In [69]:

    
classifier.score(pred_rows, pred_Y)









    Out[69]:





0.56720000000000004

Question 14: Export your decision tree to graphviz. Please submit a png file of this graphic to bcourses. In your write-up, write down the interpretation of the rule at each node (for example, 'Root node: rows from state AL go the the left, rows from all other states go to the right. Left child of root node: ... etc



In [70]:

    
from sklearn.externals.six import StringIO
with open("obama_romney.dot", 'w') as f:
    f = sklearn.tree.export_graphviz(clf, out_file=f)



In [71]:

    
print(train_rows.columns[2768])
print(train_rows.columns[7])
print(train_rows.columns[745])
print(train_rows.columns[706])
print(train_rows.columns[810])
print(train_rows.columns[2745])









    



TRANSACTION_DT
CITY=AKRON
CMTE_ID=C90011156
CMTE_ID=C00521013
ENTITY_TP=IND
STATE=NC

The topmost split depends on the date of the contribution. The next highest are whether a contribution was from Akron, and whether CMTE_ID equals C90011156 (which we think has to do with who the contributor is). The final row splits on whether CMTE_ID equals C00521013, whether the donor is an individual, and whether the state is North Carolina.

Question 15: For each of your leaf nodes, specify the percentage of Obama rows in that node (out of the total number of rows at that node).



In [72]:

    
print(3419 / (3079 + 3419))
print(135 / (0 + 135))
print(75 / (56 + 75))
print(68/ (1 + 68))
print(364 / (17 + 364))
print(9 / (7 + 9))









    



0.526161895968
1.0
0.572519083969
0.985507246377
0.955380577428
0.5625

See this notebook for the basics of training a decision tree in scikit-learn and exporting the outputs to view in graphviz: http://nbviewer.ipython.org/gist/tebarkley/b68c04d9b31e64ce6023

Scikit-learn classifiers require class labels and features to be in numeric arrays. As such, you will need to turn your categorical features into numeric arrays using DictVectorizer. This is a helpful notebook for understanding how to do this: http://nbviewer.ipython.org/gist/sarguido/7423289. You can turn a pandas dataframe of features into a dictionary of the form needed by DictVectorizer by using df.to_dict('records'). Make sure you remove the class label first (in this case, CAND_ID). If you use the class label as a feature, your classifier will have a training accuracy of 100%! The example notebook link also shows how to turn your class labels into a numeric array using sklearn.preprocessing.LabelEncoder().

We already did this for you at the top of the notebook, but before you convert your features into numeric arrays, you should always make sure they are of the correct type (ie zip code should be a string, not a float, because it is a categorical variable).

Question 14 asks you to interpret the rules at each decision tree node using the graphviz output. The graphviz output looks cryptic (ie it might tell you that X[1014] < 0.5 is the best split for a particular node. To figure out what feature that corresponds to, use the .get_feature_names() function of your DictVectorizer object. If that returns something like 'CITY=PHOENIX', then you know that the left child of the node contains rows not in Phoenix ('CITY=PHOENIX' ==0) and the right child of the node contains rows in Phoenix ('CITY=PHOENIX' == 1).



In [162]:

	CMTE_ID	AMNDT_IND	RPT_TP	ENTITY_TP	NAME	CITY	STATE	ZIP_CODE	TRANSACTION_DT	TRANSACTION_AMT	CAND_ID
0	C90011156	N	Q3	IND	ROULAND, PRESTON	CLEVELAND	OH	44135	9182012	26	Obama
1	C90011156	N	Q3	IND	DANFORD, KAYLA	OXFORD	OH	45056	9242012	17	Obama
2	C90011156	N	YE	IND	CLARK, BARBARA	WHITEHALL	OH	43213	10032012	18	Romney
3	C90011156	N	YE	ORG	JENNIFER JANNON	DORMONT	PA	15216	10262012	11	Obama
4	C90011156	N	YE	IND	ARNDI, PHILIP	BOSTON	MA	2115	11062012	57	Obama