Nonlinear Classification and Regression with Decision Trees

Decision trees

Decision trees are commonly learned by recursively splitting the set of training instances into subsets based on the instances' values for the explanatory variables.

In classification tasks, the leaf nodes of the decision tree represent classes. In regression tasks, the values of the response variable for the instances contained in a leaf node may be averaged to produce the estimate for the response variable. After the decision tree has been constructed, making a prediction for a test instance requires only following the edges until a leaf node is reached.
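
To make the prediction mechanics concrete, here is a minimal, hypothetical sketch of a tree node and the prediction walk; the Node class and predict function are illustrative only and are not part of scikit-learn.

# Hypothetical node structure, purely to illustrate how prediction follows edges to a leaf.
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature, self.threshold = feature, threshold  # test applied at an internal node
        self.left, self.right = left, right                # child nodes (the edges)
        self.value = value                                 # class label or averaged response at a leaf

def predict(node, instance):
    """Follow edges until a leaf node is reached, then return its stored value."""
    if node.value is not None:
        return node.value
    branch = node.left if instance[node.feature] <= node.threshold else node.right
    return predict(branch, instance)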

Let's create a decision tree using an algorithm called Iterative Dichotomiser 3 (ID3). Invented by Ross Quinlan, ID3 was one of the first algorithms used to train decision trees.

But how do we choose the variable to split on first so that the resulting tree is as small as possible?

Measured in bits, entropy quantifies the amount of uncertainty in a variable. Entropy is given by the following equation, where n is the number of possible outcomes and P(x_i) is the probability of outcome x_i. Common values for the base b are 2, e, and 10. Because the logarithm of a probability less than one is negative, the sum is negated to return a positive value.

$$ H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i) $$
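
As a quick sanity check, the entropy calculation can be sketched in a few lines of Python; the entropy helper and the toy label lists below are illustrative, not part of the original text.

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(entropy(['heads', 'tails']))              # 1.0 -- a fair coin flip carries one bit
print(entropy(['ad.', 'ad.', 'ad.', 'nonad.'])) # ~0.811 -- a mostly pure set has low entropy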

Information gain

Selecting the test whose subsets have the lowest average entropy, without accounting for the sizes of those subsets, can produce a suboptimal tree. Instead, we will measure the reduction in entropy using a metric called information gain.
Calculated with the following equation, information gain is the difference between the entropy of the parent node, H(T), and the weighted average of the child nodes' entropies.
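
$$ IG(T, a) = H(T) - \sum_{v \in \mathrm{vals}(a)} \frac{|T_v|}{|T|} H(T_v) $$

Here $T_v$ is the subset of instances for which the candidate splitting variable $a$ takes the value $v$. As a rough illustration, the same calculation in Python, reusing the entropy helper sketched above (both helpers are illustrative, not from scikit-learn):

def information_gain(parent_labels, child_label_subsets):
    """Parent entropy minus the weighted average of the child nodes' entropies."""
    total = len(parent_labels)
    weighted_child_entropy = sum(
        (len(subset) / total) * entropy(subset) for subset in child_label_subsets
    )
    return entropy(parent_labels) - weighted_child_entropy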

ID3 is one of the most commonly used algorithms for building decision trees. C4.5 is a modified version of ID3 that can be used with continuous explanatory variables and can accommodate missing feature values; it can also prune trees.
Pruning reduces the size of a tree by replacing branches that classify few instances with leaf nodes. CART, the learning algorithm used by scikit-learn's implementation of decision trees, also supports pruning.

Gini impurity

Gini impurity measures the proportions of classes in a set. Gini impurity is given by the following equation, where j is the number of classes, t is the subset of instances for the node, and P(i|t) is the probability of selecting an element of class i from the node's subset:

$$ Gini (t) = 1 - \sum_{i=1}^{j} P(i|t)^2 $$

Intuitively, Gini impurity is zero when all of the elements of the set are the same class, as the probability of selecting an element of that class is equal to one. Like entropy, Gini impurity is greatest when each class has an equal probability of being selected. The maximum value of Gini impurity depends on the number of possible classes, and it is given by the following equation:

$$ Gini_{max} = 1 - \frac{1}{n} $$
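
The Gini calculation is equally short to sketch in Python; as before, the helper and the toy labels are illustrative only.

from collections import Counter

def gini_impurity(labels):
    """One minus the sum of squared class proportions in a set."""
    counts = Counter(labels)
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in counts.values())

print(gini_impurity(['ad.', 'ad.', 'ad.']))  # 0.0 -- a pure node
print(gini_impurity(['ad.', 'nonad.']))      # 0.5 -- the maximum, 1 - 1/n, for n = 2 classes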

In [22]:
# imports
import pandas as pd

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

In [3]:
df = pd.read_csv("data/ad.data", header=None)

explanatory_variable_columns = set(df.columns.values)
response_variable_column = df[len(df.columns.values)-1]
# The last column describes the targets
explanatory_variable_columns.remove(len(df.columns.values)-1)
y = [1 if e == 'ad.' else 0 for e in response_variable_column]
X = df[list(explanatory_variable_columns)]


C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

In [20]:
# Replace missing values, encoded as '?' (possibly padded with spaces), with -1
X = X.replace(to_replace=r' *\?', value=-1, regex=True)
X


Out[20]:
0 1 2 3 4 5 6 7 8 9 ... 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557
0 125 125 1.0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 57 468 8.2105 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 33 230 6.9696 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 60 468 7.8 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 60 468 7.8 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 60 468 7.8 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 59 460 7.7966 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 60 234 3.9 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 60 468 7.8 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9 60 468 7.8 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
10 -1 -1 -1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
11 90 52 0.5777 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
12 90 60 0.6666 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
13 90 60 0.6666 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
14 33 230 6.9696 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
15 60 468 7.8 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
16 60 468 7.8 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
17 125 125 1.0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
18 60 468 7.8 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 1 0
19 30 585 19.5 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
20 90 60 0.6666 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
21 90 60 0.6666 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
22 90 60 0.6666 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
23 90 60 0.6666 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
24 -1 -1 -1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
25 90 52 0.5777 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
26 90 60 0.6666 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
27 60 468 7.8 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
28 60 234 3.9 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
29 60 234 3.9 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3249 -1 -1 -1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3250 -1 -1 -1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3251 16 16 1.0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3252 24 75 3.125 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3253 -1 -1 -1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3254 25 100 4.0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3255 -1 -1 -1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3256 55 175 3.1818 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3257 -1 -1 -1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3258 -1 -1 -1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3259 10 600 60.0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3260 11 64 5.8181 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3261 -1 -1 -1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3262 150 200 1.3333 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3263 16 16 1.0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3264 134 184 1.3731 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3265 23 26 1.1304 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3266 40 130 3.25 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3267 158 192 1.2151 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3268 25 100 4.0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3269 -1 -1 -1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3270 -1 -1 -1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3271 -1 -1 -1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3272 106 110 1.0377 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3273 30 30 1.0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
3274 170 94 0.5529 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3275 101 140 1.3861 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3276 23 120 5.2173 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3277 -1 -1 -1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3278 40 40 1.0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

3279 rows × 1558 columns


In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [16]:
pipeline = Pipeline([
    ('clf', DecisionTreeClassifier(criterion='entropy'))
])

parameters = {
    'clf__max_depth': (150, 155, 160),
    'clf__min_samples_split': (2, 3, 4),
    'clf__min_samples_leaf': (1, 2, 3)
}

In [17]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')

In [21]:
grid_search.fit(X_train, y_train)

In [ ]:
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))

predictions = grid_search.predict(X_test)

print('Accuracy:', accuracy_score(y_test, predictions))
print('Confusion Matrix:', confusion_matrix(y_test, predictions))
print('Classification Report:', classification_report(y_test, predictions))

Tree ensembles (RandomForestClassifier)

Ensemble learning methods combine a set of models to produce an estimator that has better predictive performance than its individual components. A random forest is a collection of decision trees that have been trained on randomly selected subsets of the training instances and explanatory variables. Random forests usually make predictions by returning the mode or mean of the predictions of their constituent trees.

Random forests are less prone to overfitting than decision trees because no single tree can learn from all of the instances and explanatory variables; no single tree can memorize all of the noise in the representation.


In [23]:
pipeline = Pipeline([
    ('clf', RandomForestClassifier(criterion='entropy'))
])

parameters = {
    'clf__n_estimators': (5, 10, 20, 50),
    'clf__max_depth': (50, 150, 250),
    'clf__min_samples_split': (2, 3, 4),
    'clf__min_samples_leaf': (1, 2, 3)
}

In [26]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')
grid_search.fit(X_train, y_train)
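
Once the random forest grid search has been fit, it can be evaluated exactly like the single decision tree above; a minimal sketch:

print('Best score: %0.3f' % grid_search.best_score_)
predictions = grid_search.predict(X_test)
print(classification_report(y_test, predictions))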

The advantages and disadvantages of decision trees

Decision trees are easy to use. Unlike many learning algorithms, decision trees do not require the data to have zero mean and unit variance. While decision trees can tolerate missing values for explanatory variables, scikit-learn's current implementation cannot. Decision trees can even learn to ignore explanatory variables that are not relevant to the task.

Small decision trees can be easy to interpret and visualize with the export_graphviz function from scikit-learn's tree module. The branches of a decision tree are conjunctions of logical predicates, and they are easily visualized as flowcharts. Decision trees support multioutput tasks, and a single decision tree can be used for multiclass classification without employing a strategy like one-versus-all.
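
For example, a small tree trained on this dataset could be exported along these lines; the max_depth value and the output file name are arbitrary choices, and Graphviz itself is needed to render the resulting .dot file.

from sklearn.tree import DecisionTreeClassifier, export_graphviz

small_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3)
small_tree.fit(X_train, y_train)
export_graphviz(small_tree, out_file='ad_tree.dot')  # render with, e.g., dot -Tpng ad_tree.dot -o ad_tree.png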

Decision trees are eager learners. Eager learners must build an input-independent model from the training data before they can be used to estimate the values of test instances, but can predict relatively quickly once the model has been built. In contrast, lazy learners such as the k-nearest neighbors algorithm defer all generalization until they must make a prediction. Lazy learners do not spend time training, but often predict slowly compared to eager learners.

Decision trees are more prone to overfitting than many other models. Pruning, a common strategy that removes some of the deepest nodes and leaves of a decision tree, was not implemented in older versions of scikit-learn (recent versions support minimal cost-complexity pruning through the ccp_alpha parameter). However, similar effects can be achieved by setting a maximum depth for the tree or by creating child nodes only when the number of training instances they will contain exceeds a threshold, as in the sketch below.
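
Both constraints can be set directly on the estimator; the values below are illustrative rather than tuned.

regularized_tree = DecisionTreeClassifier(
    criterion='entropy',
    max_depth=10,          # cap the depth of the tree
    min_samples_split=20   # only split nodes that contain at least 20 training instances
)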

Commonly used decision tree algorithms include ID3, C4.5 (and its Java implementation in Weka, J48), and CART; random forests combine many such trees into an ensemble.

