Decision trees are commonly learned by recursively splitting the set of training instances into subsets based on the instances' values for the explanatory variables.
In classification tasks, the leaf nodes of the decision tree represent classes. In regression tasks, the values of the response variable for the instances contained in a leaf node may be averaged to produce the estimate for the response variable. After the decision tree has been constructed, making a prediction for a test instance requires only following the edges until a leaf node is reached.
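As a minimal sketch on made-up toy data, the following fits both kinds of tree with scikit-learn; each prediction is made by following the fitted tree's edges to a leaf:
In [ ]:
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy training instances with two explanatory variables each (made up for illustration)
X_toy = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_class = [0, 1, 1, 0]         # class labels for the classification tree
y_reg = [1.0, 2.0, 1.5, 3.0]   # continuous responses for the regression tree

clf = DecisionTreeClassifier().fit(X_toy, y_class)
reg = DecisionTreeRegressor().fit(X_toy, y_reg)

print(clf.predict([[1, 1]]))   # the class stored at the leaf that [1, 1] reaches
print(reg.predict([[1, 1]]))   # the mean response of the training instances in that leaf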
Let's create a decision tree using an algorithm called Iterative Dichotomiser 3 (ID3). Invented by Ross Quinlan, ID3 was one of the first algorithms used to train decision trees.
But how do we choose the explanatory variable to split on first so that the resulting tree is as small as possible? To answer this, we need a way to measure how mixed a set of instances is.
Measured in bits, entropy quantifies the amount of uncertainty in a variable. Entropy is given by the following equation, where n is the number of outcomes and P(x_i) is the probability of the outcome x_i. Common values for b are 2, e, and 10. Because the log of a number less than one will be negative, the entire sum is negated to return a positive value.
$$ H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i) $$
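As a quick sketch (the class probabilities below are made up for illustration), entropy can be computed directly from the outcome probabilities:
In [ ]:
import math

def entropy(probabilities, base=2):
    # H(X) = -sum(P(x_i) * log_b P(x_i)); zero-probability outcomes contribute nothing
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit for a fair coin
print(entropy([0.9, 0.1]))  # ~0.469 bits for a heavily biased coin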
Selecting the test that produces the subsets with the lowest average entropy can produce a suboptimal tree.
Instead, we will measure the reduction in entropy using a metric called information gain. Calculated with the following equation, information gain is the difference between the entropy of the parent node, H(T), and the weighted average of the children nodes' entropies.
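One common way of writing this, for a candidate split on an attribute a whose values partition the parent's instances T into subsets T_v, is:

$$ IG(T, a) = H(T) - \sum_{v \in \mathrm{values}(a)} \frac{|T_v|}{|T|} H(T_v) $$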
ID3 is among the most commonly used algorithms for building decision trees. C4.5 is a modified version of ID3
that can be used with continuous explanatory variables and can accommodate
missing values for features. C4.5 can also prune trees.
Pruning reduces the size of a tree by replacing branches that classify few instances with leaf nodes. Used by
scikit-learn's implementation of decision trees, CART is another learning algorithm
that supports pruning.
Gini impurity measures the proportions of classes in a set. Gini impurity is given by the following equation, where j is the number of classes, t is the subset of instances for the node, and P(i|t) is the probability of selecting an element of class i from the node's subset:
$$ Gini(t) = 1 - \sum_{i=1}^{j} P(i|t)^2 $$

Intuitively, Gini impurity is zero when all of the elements of the set are the same class, as the probability of selecting an element of that class is equal to one. Like entropy, Gini impurity is greatest when each class has an equal probability of being selected. The maximum value of Gini impurity depends on the number of possible classes, n, and it is given by the following equation:

$$ Gini_{max} = 1 - \frac{1}{n} $$
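As a small sketch (the class proportions below are made up for illustration), Gini impurity and its maximum can be computed directly:
In [ ]:
def gini(proportions):
    # Gini(t) = 1 - sum(P(i|t)^2) over the class proportions in a node
    return 1 - sum(p ** 2 for p in proportions)

print(gini([1.0, 0.0]))          # 0.0: a pure node
print(gini([0.5, 0.5]))          # 0.5: two equally likely classes, the maximum 1 - 1/2
print(gini([1/3, 1/3, 1/3]))     # ~0.667: three equally likely classes, the maximum 1 - 1/3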
In [22]:
# imports
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline
In [3]:
df = pd.read_csv("data/ad.data", header=None)
explanatory_variable_columns = set(df.columns.values)
# The last column describes the targets ('ad.' or not)
response_variable_column = df[len(df.columns.values) - 1]
explanatory_variable_columns.remove(len(df.columns.values) - 1)
y = [1 if e == 'ad.' else 0 for e in response_variable_column]
X = df[list(explanatory_variable_columns)]
In [20]:
# Missing feature values in ad.data are marked with '?' (sometimes padded with
# spaces); replace them with -1 so the estimators can handle these columns.
X = X.replace(to_replace=r' *\?', value=-1, regex=True)
In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
In [16]:
pipeline = Pipeline([
    ('clf', DecisionTreeClassifier(criterion='entropy'))
])
parameters = {
    'clf__max_depth': (150, 155, 160),
    'clf__min_samples_split': (2, 3, 4),
    'clf__min_samples_leaf': (1, 2, 3)
}
In [17]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')
In [21]:
grid_search.fit(X_train, y_train)
In [ ]:
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
predictions = grid_search.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))
print('Confusion Matrix:', confusion_matrix(y_test, predictions))
print('Classification Report:', classification_report(y_test, predictions))
Ensemble learning methods combine a set of models to produce an estimator that has better predictive performance than its individual components. A random forest is a collection of decision trees that have been trained on randomly selected subsets of the training instances and explanatory variables. Random forests usually make predictions by returning the mode or mean of the predictions of their constituent trees.
Random forests are less prone to overfitting than decision trees because no single tree can learn from all of the instances and explanatory variables; no single tree can memorize all of the noise in the representation.
In [23]:
pipeline = Pipeline([
    ('clf', RandomForestClassifier(criterion='entropy'))
])
parameters = {
    'clf__n_estimators': (5, 10, 20, 50),
    'clf__max_depth': (50, 150, 250),
    'clf__min_samples_split': (2, 3, 4),
    'clf__min_samples_leaf': (1, 2, 3)
}
In [26]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')
grid_search.fit(X_train, y_train)
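The fitted random forest grid search can then be inspected and evaluated in the same way as the decision tree above; the sketch below simply reuses the held-out split and the metrics already imported:
In [ ]:
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
predictions = grid_search.predict(X_test)
print('Classification Report:', classification_report(y_test, predictions))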
Decision trees are easy to use. Unlike many learning algorithms, decision trees do not require the data to have zero mean and unit variance. While decision trees can tolerate missing values for explanatory variables, scikit-learn's current implementation cannot. Decision trees can even learn to ignore explanatory variables that are not relevant to the task.
Small decision trees can be easy to interpret and visualize with the export_graphviz function from scikit-learn's tree module. The branches of a decision tree are conjunctions of logical predicates, and they are easily visualized as flowcharts. Decision trees support multioutput tasks, and a single decision tree can be used for multiclass classification without employing a strategy like one-versus-all.
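As a small sketch (the file name ad_tree.dot is just illustrative, and rendering the output requires the Graphviz tools), a shallow tree fitted on the training split above can be exported like this:
In [ ]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Fit a deliberately shallow tree so the exported flowchart stays readable
small_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3)
small_tree.fit(X_train, y_train)
export_graphviz(small_tree, out_file='ad_tree.dot')
# Render outside Python, e.g.: dot -Tpng ad_tree.dot -o ad_tree.png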
Decision trees are eager learners. Eager learners must build an input-independent model from the training data before they can be used to estimate the values of test instances, but can predict relatively quickly once the model has been built. In contrast, lazy learners such as the k-nearest neighbors algorithm defer all generalization until they must make a prediction. Lazy learners do not spend time training, but often predict slowly compared to eager learners.
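A rough way to see this trade-off on the same split (a sketch; absolute timings will vary by machine and dataset):
In [ ]:
import time
from sklearn.neighbors import KNeighborsClassifier

for name, model in [('eager (decision tree)', DecisionTreeClassifier()),
                    ('lazy (k-NN)', KNeighborsClassifier())]:
    start = time.time()
    model.fit(X_train, y_train)
    fit_seconds = time.time() - start
    start = time.time()
    model.predict(X_test)
    predict_seconds = time.time() - start
    print('%s: fit %.3fs, predict %.3fs' % (name, fit_seconds, predict_seconds))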
Decision trees are more prone to overfitting than many other models. Pruning is a common strategy that removes some of the tallest nodes and leaves of a decision tree, but it is not currently implemented in scikit-learn. However, similar effects can be achieved by setting a maximum depth for the tree or by creating child nodes only when the number of training instances they will contain exceeds a threshold.
Some commonly used algorithms are ID3, C4.5 (implemented as J48 in Weka), CART, and ensembles of trees such as random forests.