Week 5

There are two main kinds of decision trees depending on the type of output (numeric vs. categorical). What are they?

Classification trees and regression trees.

Explain in your own words: Why is entropy useful when deciding where to split the data?

Entropy is useful for building a decision tree because it measures how mixed the labels are after splitting the data on a given feature: the lower the entropy of the resulting subsets, the purer they are, and the better that split predicts the target. ID3 therefore chooses, at each node, the feature whose split yields the lowest weighted entropy (equivalently, the highest information gain).
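To make this concrete, here is a minimal pure-Python (Python 3) sketch of entropy and of the weighted entropy of a split, not the DSFS implementation itself: ID3 would pick the split whose weighted entropy is lowest.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a collection of class labels, in bits."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def split_entropy(pairs):
    """Weighted average entropy after splitting (feature_value, label) pairs on the feature."""
    total = len(pairs)
    groups = {}
    for value, label in pairs:
        groups.setdefault(value, []).append(label)
    return sum(len(g) / total * entropy(g) for g in groups.values())

# A 50/50 label mix carries one full bit of entropy:
assert entropy(['a', 'a', 'b', 'b']) == 1.0
# Splitting on the feature separates the classes perfectly, leaving zero entropy:
assert split_entropy([(0, 'a'), (0, 'a'), (1, 'b'), (1, 'b')]) == 0.0
```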

Why are trees prone to overfitting?

Overfitting happens when the decision tree fits the training dataset too closely: it keeps splitting until it has memorized noise and quirks that are specific to the training data. When you then test with other data, the predictions are poor.
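As an illustration, here is a deliberately overfit "tree" on synthetic data (a pure-Python sketch, not the crime data): one leaf per training example. It scores perfectly on the training set but no better than guessing the base rate on held-out data.

```python
import random
from collections import Counter

random.seed(42)

# Each record has a unique id (which carries no signal) and a noisy binary label.
def make_data(ids):
    return [(i, 1 if random.random() < 0.6 else 0) for i in ids]

train = make_data(range(1000))
test = make_data(range(1000, 2000))

# An 'overfit tree': one leaf per training id. It memorizes the training labels
# perfectly, but must fall back to the majority class for ids it never saw.
leaves = dict(train)
majority = Counter(y for _, y in train).most_common(1)[0][0]

def predict(i):
    return leaves.get(i, majority)

train_acc = sum(predict(i) == y for i, y in train) / len(train)
test_acc = sum(predict(i) == y for i, y in test) / len(test)
# train_acc is a perfect 1.0, while test_acc collapses to roughly the 0.6 base rate.
```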

Explain (in your own words) how random forests help prevent overfitting.

Instead of creating a single 'perfect' tree, you randomly create a lot of 'weak' trees, each trained on a random subset of the data (and often a random subset of the features). You then combine them into an ensemble: every weak tree 'votes' on how a sample should be classified, and the majority wins. Because each tree overfits in its own random way, the individual errors tend to cancel out in the vote.
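A toy sketch of that voting idea, using one-threshold stumps as stand-ins for real trees (scikit-learn's RandomForestClassifier is the production version of this):

```python
import random
from collections import Counter

random.seed(1)

# Each 'weak tree' is just a one-threshold stump fit on a bootstrap sample --
# a stand-in for the deeper trees a real random forest would grow.
def fit_stump(sample):
    threshold = sum(x for x, _ in sample) / len(sample)
    return lambda x: int(x >= threshold)

data = [(x, int(x >= 5)) for x in range(10)] * 20
stumps = [fit_stump(random.choices(data, k=len(data))) for _ in range(25)]

# The forest's answer is a plurality vote over all the weak trees.
def forest_predict(x):
    votes = [stump(x) for stump in stumps]
    return Counter(votes).most_common(1)[0][0]
```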

Use the category of the crimes to build a decision tree that predicts the corresponding district. You can implement the ID3 tree in the DSFS book, or use the DecisionTreeClassifier class in scikit-learn. For training, you can use 90% of the data and test the tree prediction on the remaining 10%.

What is the fraction of correct predictions?


In [1]:
from sklearn import tree, preprocessing
from sklearn.feature_extraction import DictVectorizer
import numpy as np
from collections import OrderedDict
import pandas as pd
import itertools
from __future__ import division
from sklearn.externals.six import StringIO
import os
import pydot
from IPython.display import Image

In [2]:
data_path = 'SFPD_Incidents_-_from_1_January_2003.csv'

data = pd.read_csv(data_path)

In [3]:
def encode_target(df, target_column):
    df_mod = df.copy()
    targets = df_mod[target_column].unique()
    map_to_int = {name: n for n, name in enumerate(targets)}
    df_mod[target_column+"_encoded"] = df_mod[target_column].replace(map_to_int)

    return (df_mod, targets)

data, districts = encode_target(data, 'PdDistrict')
data, categories = encode_target(data, 'Category')
data, days = encode_target(data, 'DayOfWeek')

# sneak peek of data
data.head()


Out[3]:
IncidntNum Category Descript DayOfWeek Date Time PdDistrict Resolution Address X Y Location PdId PdDistrict_encoded Category_encoded DayOfWeek_encoded
0 160095193 OTHER OFFENSES DRIVERS LICENSE, SUSPENDED OR REVOKED Monday 02/01/2016 23:51 TENDERLOIN ARREST, BOOKED HYDE ST / GROVE ST -122.414744 37.778719 (37.778719262789, -122.414743835382) 16009519365016 0 0 0
1 160095171 OTHER OFFENSES DRIVERS LICENSE, SUSPENDED OR REVOKED Monday 02/01/2016 23:44 SOUTHERN ARREST, BOOKED 13TH ST / BRYANT ST -122.410931 37.769411 (37.7694111951212, -122.410931084001) 16009517165016 1 0 0
2 160095262 OTHER OFFENSES POSSESSION OF BURGLARY TOOLS Monday 02/01/2016 23:43 CENTRAL ARREST, BOOKED 600 Block of LEAVENWORTH ST -122.414971 37.786987 (37.7869870915274, -122.414971182724) 16009526227130 2 0 0
3 160095262 STOLEN PROPERTY STOLEN PROPERTY, POSSESSION WITH KNOWLEDGE, RE... Monday 02/01/2016 23:43 CENTRAL ARREST, BOOKED 600 Block of LEAVENWORTH ST -122.414971 37.786987 (37.7869870915274, -122.414971182724) 16009526211012 2 1 0
4 160095262 BURGLARY BURGLARY, VEHICLE (ARREST MADE) Monday 02/01/2016 23:43 CENTRAL ARREST, BOOKED 600 Block of LEAVENWORTH ST -122.414971 37.786987 (37.7869870915274, -122.414971182724) 16009526205014 2 2 0

In [4]:
training_data = data.head(int(data.Category.count() * 0.9))
test_data = data.tail(int(data.Category.count() * 0.1))
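Note that the head/tail split above keeps the incidents in file order, so the test set covers one contiguous stretch of time. A shuffled split avoids that; a minimal pure-Python sketch (scikit-learn's `train_test_split`, which shuffles by default, does the same job):

```python
import random

def split_data(rows, test_fraction=0.1, seed=0):
    """Shuffle a copy of the rows and carve off the tail as the test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

train, test = split_data(range(100))
assert len(train) == 90 and len(test) == 10
```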

In [5]:
def train_tree(prediction, features, dataset):
    clf = tree.DecisionTreeClassifier()
    print "TRAINING WITH %d SAMPLES" % len(dataset)
    X = np.array(dataset[features])
    Y = dataset[prediction].values
    return clf.fit(X, Y)

def test_tree(clf, test_data, features):
    return clf.predict(test_data[features])

def convert_encoded_district_to_str(predictions):
    return map(lambda p: districts[p], predictions)

def test_prediction(clf, test_data, features):
    corrects = 0
    predictions = test_tree(clf, test_data, features)
    for i in range(0, len(predictions)):
        if predictions[i] == test_data.iloc[i].PdDistrict_encoded:
            corrects += 1
    print "FOUND %d CORRECT PREDICTIONS" % corrects
    return corrects / len(predictions)

In [6]:
# The features we build our model from
features = ['Category_encoded']

# Train a tree that predicts the district
clf = train_tree('PdDistrict_encoded', features, training_data)

# test prediction accuracy 
print "Prediction accuracy %f" % test_prediction(clf, test_data, features)


TRAINING WITH 1685385 SAMPLES
FOUND 35959 CORRECT PREDICTIONS
Prediction accuracy 0.192022

In [7]:
for dis in districts[:1]:
    clf = train_tree('PdDistrict_encoded', features, training_data[training_data.PdDistrict == dis])
    print "Prediction accuracy %f, trained for %s\n" % (test_prediction(clf, test_data, features), dis)


TRAINING WITH 154794 SAMPLES
FOUND 16097 CORRECT PREDICTIONS
Prediction accuracy 0.085958, trained for TENDERLOIN


In [8]:
# The tree essentially just guesses SOUTHERN, the most common district in the test set:
len(test_data[test_data.PdDistrict == 'SOUTHERN'])


Out[8]:
31724
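That count is close to the number of correct predictions above, so the ~19% accuracy is worth comparing against the simplest baseline: always predicting the most common district. A sketch of that baseline with toy labels (not the real data):

```python
from collections import Counter

def majority_baseline(labels):
    """Label and accuracy of a classifier that always predicts the most common label."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Toy example: always answering 'SOUTHERN' is right 3 times out of 5.
label, accuracy = majority_baseline(
    ['SOUTHERN', 'SOUTHERN', 'SOUTHERN', 'CENTRAL', 'MISSION'])
assert label == 'SOUTHERN' and accuracy == 0.6
```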

In [12]:
# The features we build our model from
features = ['Category_encoded','DayOfWeek_encoded']

# Train a tree that predicts the district
clf = train_tree('PdDistrict_encoded', features, training_data)

# test prediction accuracy 
print "Prediction accuracy %f" % test_prediction(clf, test_data, features)


TRAINING WITH 1685385 SAMPLES
FOUND 35784 CORRECT PREDICTIONS
Prediction accuracy 0.191087

In [16]:
with open("tree.dot", 'w') as f:
    f = tree.export_graphviz(clf, out_file=f)


In [18]:
# NOTE: this fails because Graphviz is not installed on this machine (see traceback below)
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("iris.pdf")

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-18-ffee0829ae90> in <module>()
      2 tree.export_graphviz(clf, out_file=dot_data)
      3 graph = pydot.graph_from_dot_data(dot_data.getvalue())
----> 4 graph.write_pdf("iris.pdf")

C:\Users\Casper\Anaconda2\lib\site-packages\pydot.pyc in <lambda>(path, f, prog)
   1600 
   1601         for frmt in self.formats+['raw']:
-> 1602             self.__setattr__(
   1603                 'write_'+frmt,
   1604                 lambda path, f=frmt, prog=self.prog : self.write(path, format=f, prog=prog))

C:\Users\Casper\Anaconda2\lib\site-packages\pydot.pyc in write(self, path, prog, format)
   1694         dot_fd = file(path, "w+b")
   1695         if format == 'raw':
-> 1696             dot_fd.write(self.to_string())
   1697         else:
   1698             dot_fd.write(self.create(prog, format))

C:\Users\Casper\Anaconda2\lib\site-packages\pydot.pyc in create(self, prog, format)
   1722         if prog is None:
   1723             prog = self.prog
-> 1724 
   1725         if self.progs is None:
   1726             self.progs = find_graphviz()

C:\Users\Casper\Anaconda2\lib\site-packages\pydot.pyc in find_graphviz()
    407             #
    408             hkey = win32api.RegOpenKeyEx( win32con.HKEY_LOCAL_MACHINE,
--> 409                 "SOFTWARE\ATT\Graphviz", 0, win32con.KEY_QUERY_VALUE )
    410 
    411             path = win32api.RegQueryValueEx( hkey, "InstallPath" )[0]

error: (2, 'RegOpenKeyEx', 'The system cannot find the file specified.')

In [ ]:
for dis in districts:
    clf2 = train_tree('PdDistrict_encoded', features, training_data[training_data.PdDistrict == dis])
    print "Prediction accuracy %f, trained for %s\n" % (test_prediction(clf2, test_data, features), dis)

In [ ]: