Decision trees are susceptible to overfitting. To mitigate this, we can use the Random Forest technique: construct several alternative decision trees and let them vote on the final classification. Random Forests rely on bagging (bootstrap aggregating), which means that many trees are built by training on randomly drawn subsets of the data, and their votes are combined into the final prediction.
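To make the voting idea concrete, here is a minimal sketch of bagging with plain decision trees (the toy dataset, the variable names, and the choice of 10 trees are made up purely for illustration; scikit-learn's RandomForestClassifier, used at the end of this notebook, handles all of this for us):
In [ ]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data for illustration only -- not the hiring data loaded below
rng = np.random.RandomState(0)
X_toy = rng.rand(100, 3)
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1).astype(int)

# Bagging: train each tree on a bootstrap sample (rows drawn with replacement)
trees = []
for _ in range(10):
    idx = rng.randint(0, len(X_toy), len(X_toy))
    trees.append(DecisionTreeClassifier().fit(X_toy[idx], y_toy[idx]))

# Each tree votes; the majority class is the ensemble's prediction
votes = np.array([t.predict(X_toy) for t in trees])
ensemble_prediction = (votes.mean(axis=0) >= 0.5).astype(int)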
In [64]:
%pylab inline
In [65]:
import pandas as pd
In [70]:
df = pd.read_csv('../data/SliceQuality.csv', header=0)
df.head(5) # return the first 5 rows
Out[70]:
Let's map the categorical columns to numbers
In [42]:
myEducation = {'BS':0, 'MS':1, 'PhD':2}
df['Level of Education'] = df['Level of Education'].map(myEducation)
df.head(5)
Out[42]:
In [44]:
myOutPut = {'Y':1, 'N':0}
for key in ['Employed?', 'Top-tier school', 'Interned', 'Hired']:
    df[key] = df[key].map(myOutPut)
df.head(5)
Out[44]:
In [49]:
# take the first six columns as the features (the 'Hired' column is the target)
features = list(df.columns[:6])
features
Out[49]:
In [51]:
df['Years Experience']
Out[51]:
In [52]:
X = df[features]
Y = df['Hired']
In [53]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
In [54]:
clf = clf.fit(X,Y) # so easy!
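As a quick, purely illustrative sanity check, we can score the fitted tree on its own training data; a single unpruned tree will typically score 1.0 here, which is exactly the overfitting risk discussed above (it says nothing about performance on unseen candidates):
In [ ]:
# Training accuracy only -- an unpruned tree can simply memorize the training set
clf.score(X, Y)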
In [57]:
from IPython.display import Image
from io import StringIO  # sklearn.externals.six has been removed from recent scikit-learn releases
import pydot
In [61]:
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data, feature_names=features)
# Recent pydot releases return a list of graphs, so take the first one
graph = pydot.graph_from_dot_data(dot_data.getvalue())[0]
Image(graph.create_png())
Out[61]:
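If pydot or Graphviz is not installed, a plain-text view of the same tree can be produced with scikit-learn's export_text (available since scikit-learn 0.21) as an optional alternative to the PNG above:
In [ ]:
from sklearn.tree import export_text

# One line per split, indented by depth
print(export_text(clf, feature_names=features))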
In [62]:
from sklearn.ensemble import RandomForestClassifier
In [63]:
clf = RandomForestClassifier(n_estimators=10)  # an ensemble of 10 trees, each trained on a bootstrap sample
clf = clf.fit(X,Y)
In [ ]:
# predict
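As a hedged sketch of the prediction step, we can ask the fitted forest about a single row; here the first training row is used purely as an example input. For a new candidate you would build a one-row frame with the same `features` columns and pass it to predict():
In [ ]:
sample = X.iloc[[0]]                 # one-row DataFrame with the feature columns
print(clf.predict(sample))           # majority vote of the 10 trees
print(clf.predict_proba(sample))     # fraction of trees voting for each class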