Week 5

There are two main kinds of decision trees depending on the type of output (numeric vs. categorical). What are they?

Classification trees and regression trees.

Explain in your own words: Why is entropy useful when deciding where to split the data?

Entropy is useful for building a decision tree because it measures how mixed the labels are after splitting the data on a given feature: the lower the entropy of the resulting subsets, the purer they are, and the better that split predicts the target. ID3 therefore chooses, at each node, the feature whose split yields the lowest weighted entropy (equivalently, the highest information gain).
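To make this concrete, here is a minimal pure-Python (Python 3) sketch of entropy and of the weighted entropy of a split, not the DSFS implementation itself: ID3 would pick the split whose weighted entropy is lowest.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a collection of class labels, in bits."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def split_entropy(pairs):
    """Weighted average entropy after splitting (feature_value, label) pairs on the feature."""
    total = len(pairs)
    groups = {}
    for value, label in pairs:
        groups.setdefault(value, []).append(label)
    return sum(len(g) / total * entropy(g) for g in groups.values())

# A 50/50 label mix carries one full bit of entropy:
assert entropy(['a', 'a', 'b', 'b']) == 1.0
# Splitting on the feature separates the classes perfectly, leaving zero entropy:
assert split_entropy([(0, 'a'), (0, 'a'), (1, 'b'), (1, 'b')]) == 0.0
```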

Why are trees prone to overfitting?

Overfitting happens when the decision tree fits the training dataset too closely: it keeps splitting until it has memorized noise and quirks that are specific to the training data. When you then test with other data, the predictions are poor.
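As an illustration, here is a deliberately overfit "tree" on synthetic data (a pure-Python sketch, not the crime data): one leaf per training example. It scores perfectly on the training set but no better than guessing the base rate on held-out data.

```python
import random
from collections import Counter

random.seed(42)

# Each record has a unique id (which carries no signal) and a noisy binary label.
def make_data(ids):
    return [(i, 1 if random.random() < 0.6 else 0) for i in ids]

train = make_data(range(1000))
test = make_data(range(1000, 2000))

# An 'overfit tree': one leaf per training id. It memorizes the training labels
# perfectly, but must fall back to the majority class for ids it never saw.
leaves = dict(train)
majority = Counter(y for _, y in train).most_common(1)[0][0]

def predict(i):
    return leaves.get(i, majority)

train_acc = sum(predict(i) == y for i, y in train) / len(train)
test_acc = sum(predict(i) == y for i, y in test) / len(test)
# train_acc is a perfect 1.0, while test_acc collapses to roughly the 0.6 base rate.
```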

Explain (in your own words) how random forests help prevent overfitting.

Instead of creating a single 'perfect' tree, you randomly create a lot of 'weak' trees, each trained on a random subset of the data (and often a random subset of the features). You then combine them into an ensemble: every weak tree 'votes' on how a sample should be classified, and the majority wins. Because each tree overfits in its own random way, the individual errors tend to cancel out in the vote.
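A toy sketch of that voting idea, using one-threshold stumps as stand-ins for real trees (scikit-learn's RandomForestClassifier is the production version of this):

```python
import random
from collections import Counter

random.seed(1)

# Each 'weak tree' is just a one-threshold stump fit on a bootstrap sample --
# a stand-in for the deeper trees a real random forest would grow.
def fit_stump(sample):
    threshold = sum(x for x, _ in sample) / len(sample)
    return lambda x: int(x >= threshold)

data = [(x, int(x >= 5)) for x in range(10)] * 20
stumps = [fit_stump(random.choices(data, k=len(data))) for _ in range(25)]

# The forest's answer is a plurality vote over all the weak trees.
def forest_predict(x):
    votes = [stump(x) for stump in stumps]
    return Counter(votes).most_common(1)[0][0]
```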

Use the category of the crimes to build a decision tree that predicts the corresponding district. You can implement the ID3 tree in the DSFS book, or use the DecisionTreeClassifier class in scikit-learn. For training, you can use 90% of the data and test the tree prediction on the remaining 10%.

What is the fraction of correct predictions?


In [1]:
from sklearn import tree, preprocessing
from sklearn.feature_extraction import DictVectorizer
import numpy as np
from collections import OrderedDict
import pandas as pd
import itertools
from __future__ import division
from sklearn.externals.six import StringIO
import os
import pydot
from IPython.display import Image

In [2]:
data_path = 'SFPD_Incidents_-_from_1_January_2003.csv'

data = pd.read_csv(data_path)

In [3]:
def encode_target(df, target_column):
    df_mod = df.copy()
    targets = df_mod[target_column].unique()
    map_to_int = {name: n for n, name in enumerate(targets)}
    df_mod[target_column+"_encoded"] = df_mod[target_column].replace(map_to_int)

    return (df_mod, targets)

data, districts = encode_target(data, 'PdDistrict')
data, categories = encode_target(data, 'Category')
data, days = encode_target(data, 'DayOfWeek')

# sneak peek of data
data.head()


Out[3]:
IncidntNum Category Descript DayOfWeek Date Time PdDistrict Resolution Address X Y Location PdId PdDistrict_encoded Category_encoded DayOfWeek_encoded
0 160095193 OTHER OFFENSES DRIVERS LICENSE, SUSPENDED OR REVOKED Monday 02/01/2016 23:51 TENDERLOIN ARREST, BOOKED HYDE ST / GROVE ST -122.414744 37.778719 (37.778719262789, -122.414743835382) 16009519365016 0 0 0
1 160095171 OTHER OFFENSES DRIVERS LICENSE, SUSPENDED OR REVOKED Monday 02/01/2016 23:44 SOUTHERN ARREST, BOOKED 13TH ST / BRYANT ST -122.410931 37.769411 (37.7694111951212, -122.410931084001) 16009517165016 1 0 0
2 160095262 OTHER OFFENSES POSSESSION OF BURGLARY TOOLS Monday 02/01/2016 23:43 CENTRAL ARREST, BOOKED 600 Block of LEAVENWORTH ST -122.414971 37.786987 (37.7869870915274, -122.414971182724) 16009526227130 2 0 0
3 160095262 STOLEN PROPERTY STOLEN PROPERTY, POSSESSION WITH KNOWLEDGE, RE... Monday 02/01/2016 23:43 CENTRAL ARREST, BOOKED 600 Block of LEAVENWORTH ST -122.414971 37.786987 (37.7869870915274, -122.414971182724) 16009526211012 2 1 0
4 160095262 BURGLARY BURGLARY, VEHICLE (ARREST MADE) Monday 02/01/2016 23:43 CENTRAL ARREST, BOOKED 600 Block of LEAVENWORTH ST -122.414971 37.786987 (37.7869870915274, -122.414971182724) 16009526205014 2 2 0

In [4]:
training_data = data.head(int(data.Category.count() * 0.9))
test_data = data.tail(int(data.Category.count() * 0.1))
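Note that the head/tail split above keeps the incidents in file order, so the test set covers one contiguous stretch of time. A shuffled split avoids that; a minimal pure-Python sketch (scikit-learn's `train_test_split`, which shuffles by default, does the same job):

```python
import random

def split_data(rows, test_fraction=0.1, seed=0):
    """Shuffle a copy of the rows and carve off the tail as the test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

train, test = split_data(range(100))
assert len(train) == 90 and len(test) == 10
```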

In [5]:
def train_tree(prediction, features, dataset):
    clf = tree.DecisionTreeClassifier()
    print "TRAINING WITH %d SAMPLES" % len(dataset)
    X = np.array(dataset[features])
    Y = dataset[prediction].values
    return clf.fit(X, Y)

def test_tree(clf, test_data, features):
    return clf.predict(test_data[features])

def convert_encoded_district_to_str(predictions):
    return map(lambda p: districts[p], predictions)

def test_prediction(clf, test_data, features):
    corrects = 0
    predictions = test_tree(clf, test_data, features)
    for i in range(0, len(predictions)):
        if predictions[i] == test_data.iloc[i].PdDistrict_encoded:
            corrects += 1
    print "FOUND %d CORRECT PREDICTIONS" % corrects
    return corrects / len(predictions)

In [6]:
# The features we build our model from
features = ['Category_encoded']

# Train a tree that predicts the district
clf = train_tree('PdDistrict_encoded', features, training_data)

# test prediction accuracy 
print "Prediction accuracy %f" % test_prediction(clf, test_data, features)


TRAINING WITH 1685385 SAMPLES
FOUND 35959 CORRECT PREDICTIONS
Prediction accuracy 0.192022

In [7]:
for dis in districts[:1]:
    clf = train_tree('PdDistrict_encoded', features, training_data[training_data.PdDistrict == dis])
    print "Prediction accuracy %f, trained for %s\n" % (test_prediction(clf, test_data, features), dis)


TRAINING WITH 154794 SAMPLES
FOUND 16097 CORRECT PREDICTIONS
Prediction accuracy 0.085958, trained for TENDERLOIN


In [8]:
# The tree essentially just guesses SOUTHERN, the most common district in the test set:
len(test_data[test_data.PdDistrict == 'SOUTHERN'])


Out[8]:
31724
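That count is close to the number of correct predictions above, so the ~19% accuracy is worth comparing against the simplest baseline: always predicting the most common district. A sketch of that baseline with toy labels (not the real data):

```python
from collections import Counter

def majority_baseline(labels):
    """Label and accuracy of a classifier that always predicts the most common label."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Toy example: always answering 'SOUTHERN' is right 3 times out of 5.
label, accuracy = majority_baseline(
    ['SOUTHERN', 'SOUTHERN', 'SOUTHERN', 'CENTRAL', 'MISSION'])
assert label == 'SOUTHERN' and accuracy == 0.6
```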

In [12]:
# The features we build our model from
features = ['Category_encoded','DayOfWeek_encoded']

# Train a tree that predicts the district
clf = train_tree('PdDistrict_encoded', features, training_data)

# test prediction accuracy 
print "Prediction accuracy %f" % test_prediction(clf, test_data, features)


TRAINING WITH 1685385 SAMPLES
FOUND 35784 CORRECT PREDICTIONS
Prediction accuracy 0.191087

In [16]:
with open("tree.dot", 'w') as f:
    f = tree.export_graphviz(clf, out_file=f)


In [18]:
# NOTE: this fails because Graphviz is not installed on this machine (see traceback below)
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("iris.pdf")

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-18-ffee0829ae90> in <module>()
      2 tree.export_graphviz(clf, out_file=dot_data)
      3 graph = pydot.graph_from_dot_data(dot_data.getvalue())
----> 4 graph.write_pdf("iris.pdf")

C:\Users\Casper\Anaconda2\lib\site-packages\pydot.pyc in <lambda>(path, f, prog)
   1600 
   1601         for frmt in self.formats+['raw']:
-> 1602             self.__setattr__(
   1603                 'write_'+frmt,
   1604                 lambda path, f=frmt, prog=self.prog : self.write(path, format=f, prog=prog))

C:\Users\Casper\Anaconda2\lib\site-packages\pydot.pyc in write(self, path, prog, format)
   1694         dot_fd = file(path, "w+b")
   1695         if format == 'raw':
-> 1696             dot_fd.write(self.to_string())
   1697         else:
   1698             dot_fd.write(self.create(prog, format))

C:\Users\Casper\Anaconda2\lib\site-packages\pydot.pyc in create(self, prog, format)
   1722         if prog is None:
   1723             prog = self.prog
-> 1724 
   1725         if self.progs is None:
   1726             self.progs = find_graphviz()

C:\Users\Casper\Anaconda2\lib\site-packages\pydot.pyc in find_graphviz()
    407             #
    408             hkey = win32api.RegOpenKeyEx( win32con.HKEY_LOCAL_MACHINE,
--> 409                 "SOFTWARE\ATT\Graphviz", 0, win32con.KEY_QUERY_VALUE )
    410 
    411             path = win32api.RegQueryValueEx( hkey, "InstallPath" )[0]

error: (2, 'RegOpenKeyEx', 'The system cannot find the file specified.')

In [ ]:
for dis in districts:
    clf2 = train_tree('PdDistrict_encoded', features, training_data[training_data.PdDistrict == dis])
    print "Prediction accuracy %f, trained for %s\n" % (test_prediction(clf2, test_data, features), dis)

In [ ]: