Decision trees are susceptible to overfitting. To mitigate this, we can use the Random Forest technique: construct several alternative decision trees and let them vote on the final classification. Random Forests rely on bagging (bootstrap aggregating), which means that many trees are built by training on randomly drawn subsets of the data, and their votes are combined into the final prediction.
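To make the voting idea concrete, here is a minimal sketch of bagging with plain decision trees (the toy dataset, the variable names, and the choice of 10 trees are made up purely for illustration; scikit-learn's RandomForestClassifier, used at the end of this notebook, handles all of this for us):
In [ ]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data for illustration only -- not the hiring data loaded below
rng = np.random.RandomState(0)
X_toy = rng.rand(100, 3)
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1).astype(int)

# Bagging: train each tree on a bootstrap sample (rows drawn with replacement)
trees = []
for _ in range(10):
    idx = rng.randint(0, len(X_toy), len(X_toy))
    trees.append(DecisionTreeClassifier().fit(X_toy[idx], y_toy[idx]))

# Each tree votes; the majority class is the ensemble's prediction
votes = np.array([t.predict(X_toy) for t in trees])
ensemble_prediction = (votes.mean(axis=0) >= 0.5).astype(int)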
In [64]:
%pylab inline
In [65]:
import pandas as pd
In [70]:
df = pd.read_csv('../data/SliceQuality.csv', header=0)
df.head(5) # return the first 5 rows
Out[70]:
Let's map the categorical columns to numbers
In [42]:
myEducation = {'BS':0, 'MS':1, 'PhD':2}
df['Level of Education'] = df['Level of Education'].map(myEducation)
df.head(5)
Out[42]:
In [44]:
myOutPut = {'Y':1, 'N':0}
for key in ['Employed?', 'Top-tier school', 'Interned', 'Hired']:
    df[key] = df[key].map(myOutPut)
df.head(5)
Out[44]:
In [49]:
# take the first six columns as the features (the 'Hired' column is the target)
features = list(df.columns[:6])
features
Out[49]:
In [51]:
df['Years Experience']
Out[51]:
In [52]:
X = df[features]
Y = df['Hired']
In [53]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
In [54]:
clf = clf.fit(X,Y) # so easy!
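As a quick, purely illustrative sanity check, we can score the fitted tree on its own training data; a single unpruned tree will typically score 1.0 here, which is exactly the overfitting risk discussed above (it says nothing about performance on unseen candidates):
In [ ]:
# Training accuracy only -- an unpruned tree can simply memorize the training set
clf.score(X, Y)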
In [57]:
from IPython.display import Image
from io import StringIO  # sklearn.externals.six has been removed from recent scikit-learn releases
import pydot
In [61]:
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data, feature_names=features)
# Recent pydot releases return a list of graphs, so take the first one
graph = pydot.graph_from_dot_data(dot_data.getvalue())[0]
Image(graph.create_png())
Out[61]:
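If pydot or Graphviz is not installed, a plain-text view of the same tree can be produced with scikit-learn's export_text (available since scikit-learn 0.21) as an optional alternative to the PNG above:
In [ ]:
from sklearn.tree import export_text

# One line per split, indented by depth
print(export_text(clf, feature_names=features))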
In [62]:
from sklearn.ensemble import RandomForestClassifier
In [63]:
clf = RandomForestClassifier(n_estimators=10)  # an ensemble of 10 trees, each trained on a bootstrap sample
clf = clf.fit(X,Y)
In [ ]:
# predict
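As a hedged sketch of the prediction step, we can ask the fitted forest about a single row; here the first training row is used purely as an example input. For a new candidate you would build a one-row frame with the same `features` columns and pass it to predict():
In [ ]:
sample = X.iloc[[0]]                 # one-row DataFrame with the feature columns
print(clf.predict(sample))           # majority vote of the 10 trees
print(clf.predict_proba(sample))     # fraction of trees voting for each class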