Homepage: https://github.com/tien-le/kaggle-titanic
Unbelievable: some leaderboard entries achieve a score of 1.000. How did they do this?
Just curious, how did they cheat the score? ANS: maybe because the true outcomes are publicly available, e.g. at https://www.encyclopedia-titanica.org/titanic-victims/
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
https://www.kaggle.com/c/titanic
https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/
https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import random
In [2]:
#Training Corpus
trn_corpus_after_preprocessing = pd.read_csv("output/trn_corpus_after_preprocessing.csv")
#Testing Corpus
tst_corpus_after_preprocessing = pd.read_csv("output/tst_corpus_after_preprocessing.csv")
In [3]:
#tst_corpus_after_preprocessing[tst_corpus_after_preprocessing["Fare"].isnull()]
In [4]:
trn_corpus_after_preprocessing.info()
print("-"*36)
tst_corpus_after_preprocessing.info()
One definition: "Machine learning is the semi-automated extraction of knowledge from data"
Unsupervised learning: Extracting structure from data
High-level steps of supervised learning:
First, train a machine learning model using labeled data
Then, make predictions on new data for which the label is unknown
The primary goal of supervised learning is to build a model that "generalizes": It accurately predicts the future rather than the past!
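The two steps above can be sketched on a small synthetic dataset. This is only an illustration: the data comes from make_classification (an assumption, not the Titanic corpus), and the held-out rows stand in for "new data". Comparing accuracy on the training rows with accuracy on the held-out rows gives a rough idea of how well the model generalizes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic corpus).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out 25% of the labeled rows to play the role of "new" data.
X_fit, X_new, y_fit, y_new = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 1: train a model using labeled data.
model = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)

# Step 2: make predictions on rows the model has not seen.
y_new_pred = model.predict(X_new)

# Accuracy on seen rows vs. unseen rows: the gap hints at over-fitting.
train_acc = model.score(X_fit, y_fit)
new_acc = model.score(X_new, y_new)
print("accuracy on training rows:", train_acc)
print("accuracy on held-out rows:", new_acc)
```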
In [5]:
trn_corpus_after_preprocessing.columns
Out[5]:
In [6]:
list_of_non_preditor_variables = ['Survived','PassengerId']
In [7]:
#Method 1 (.ix was removed in pandas 1.0; .loc is the label-based replacement)
#x_train = trn_corpus_after_preprocessing.loc[:, trn_corpus_after_preprocessing.columns != 'Survived']
#y_train = trn_corpus_after_preprocessing.loc[:, "Survived"]
#Method 2
x_train = trn_corpus_after_preprocessing[trn_corpus_after_preprocessing.columns.difference(list_of_non_preditor_variables)].copy()
y_train = trn_corpus_after_preprocessing['Survived'].copy()
#y_train = trn_corpus_after_preprocessing.iloc[:,-1]
#y_train = trn_corpus_after_preprocessing[trn_corpus_after_preprocessing.columns[-1]]
#x_train
In [8]:
#y_train
In [9]:
x_train.columns
Out[9]:
In [10]:
# check the types of the features and response
#print(type(x_train))
#print(type(x_test))
In [11]:
#Method 1 (.ix was removed in pandas 1.0; .loc is the label-based replacement)
#x_test = tst_corpus_after_preprocessing.loc[:, tst_corpus_after_preprocessing.columns != 'Survived']
#y_test = tst_corpus_after_preprocessing.loc[:, "Survived"]
#Method 2
x_test = tst_corpus_after_preprocessing[tst_corpus_after_preprocessing.columns.difference(list_of_non_preditor_variables)].copy()
y_test = tst_corpus_after_preprocessing['Survived'].copy()
#y_test = tst_corpus_after_preprocessing.iloc[:,-1]
#y_test = tst_corpus_after_preprocessing[tst_corpus_after_preprocessing.columns[-1]]
In [12]:
#x_test
In [13]:
#y_test
In [14]:
# display the first 5 rows
x_train.head()
Out[14]:
In [15]:
# display the last 5 rows
x_train.tail()
Out[15]:
In [16]:
# check the shape of the DataFrame (rows, columns)
x_train.shape
Out[16]:
What are the features?
What is the response?
What else do we know?
Note that the response variable here (Survived) is categorical, so this is a classification problem. If the response variable were continuous, it would be a regression problem.
In [17]:
print(x_train.shape)
display(x_train.head())
display(x_train.describe())
In [18]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x_train, y_train)
In [19]:
# Once trained, we can export the tree in Graphviz format using the export_graphviz exporter.
# Below, the tree trained on the Titanic training data is written to a .dot file:
with open("output/titanic.dot", 'w') as f:
    tree.export_graphviz(clf, out_file=f)
# Then we can use Graphviz's dot tool to create a PDF file (or any other supported file type):
# dot -Tpdf titanic.dot -o titanic.pdf
import os
os.unlink('output/titanic.dot')  # remove the intermediate .dot file
# Alternatively, if we have the Python module pydotplus installed, we can generate a PDF file
# (or any other supported file type) directly in Python:
import pydotplus
dot_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("output/titanic.pdf")
Out[19]:
In [20]:
#The export_graphviz exporter also supports a variety of aesthetic options,
#including coloring nodes by their class (or value for regression)
#and using explicit variable and class names if desired.
#IPython notebooks can also render these plots inline using the Image() function.
#feature_names must list every training column, and class_names needs one name per class (0, then 1):
"""from IPython.display import Image
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=list(x_train.columns),
                                class_names=["Not survived", "Survived"],
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())"""
Out[20]:
In [21]:
print("accuracy score: ", clf.score(x_test,y_test))
Classification accuracy: percentage of correct predictions
In [22]:
#After being fitted, the model can then be used to predict the class of samples:
y_pred_class = clf.predict(x_test);
#Alternatively, the probability of each class can be predicted,
#which is the fraction of training samples of the same class in a leaf:
clf.predict_proba(x_test);
In [23]:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
Null accuracy: accuracy that could be achieved by always predicting the most frequent class
In [24]:
# examine the class distribution of the testing set (using a Pandas Series method)
y_test.value_counts()
Out[24]:
In [25]:
# calculate the percentage of ones
y_test.mean()
Out[25]:
In [26]:
# calculate the percentage of zeros
1 - y_test.mean()
Out[26]:
In [27]:
# calculate null accuracy (for binary classification problems coded as 0/1)
max(y_test.mean(), 1 - y_test.mean())
Out[27]:
In [28]:
# calculate null accuracy (for multi-class classification problems)
y_test.value_counts().head(1) / len(y_test)
Out[28]:
Comparing the true and predicted response values
In [29]:
# print the first 25 true and predicted responses
from __future__ import print_function
print('True:', y_test.values[0:25])
print('Pred:', y_pred_class[0:25])
Conclusion: ???
In [30]:
# IMPORTANT: first argument is true values, second argument is predicted values
print(metrics.confusion_matrix(y_test, y_pred_class))
Basic terminology
In [31]:
# save confusion matrix and slice into four pieces
confusion = metrics.confusion_matrix(y_test, y_pred_class)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
In [32]:
print(TP, TN, FP, FN)
Classification Accuracy: Overall, how often is the classifier correct?
In [33]:
print((TP + TN) / float(TP + TN + FP + FN))
print(metrics.accuracy_score(y_test, y_pred_class))
Classification Error: Overall, how often is the classifier incorrect?
In [34]:
print((FP + FN) / float(TP + TN + FP + FN))
print(1 - metrics.accuracy_score(y_test, y_pred_class))
Specificity: When the actual value is negative, how often is the prediction correct?
In [35]:
print(TN / float(TN + FP))
False Positive Rate: When the actual value is negative, how often is the prediction incorrect?
In [36]:
print(FP / float(TN + FP))
Precision: When a positive value is predicted, how often is the prediction correct?
In [37]:
print(TP / float(TP + FP))
print(metrics.precision_score(y_test, y_pred_class))
In [38]:
print("Precision: ", metrics.precision_score(y_test, y_pred_class))
print("Recall: ", metrics.recall_score(y_test, y_pred_class))
print("F1 score: ", metrics.f1_score(y_test, y_pred_class))
Many other metrics can be computed: F1 score, Matthews correlation coefficient, etc.
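As a rough illustration of the two metrics just named, both can be computed directly from the four confusion-matrix counts. The helper names and the counts below are hypothetical (they are not results from this notebook); in practice metrics.f1_score and metrics.matthews_corrcoef do the same job from y_test and y_pred_class.

```python
import math

def f1_from_counts(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def matthews_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to +1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denom

# Hypothetical counts, chosen only to exercise the formulas.
tp, tn, fp, fn = 80, 90, 10, 20
f1 = f1_from_counts(tp, fp, fn)
mcc = matthews_from_counts(tp, tn, fp, fn)
print("F1 score:", f1)
print("Matthews correlation coefficient:", mcc)
```

Unlike accuracy, the Matthews coefficient uses all four counts symmetrically, which makes it a reasonable choice when the classes are imbalanced.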
Conclusion:
Which metrics should you focus on?
Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.
Ref: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC
In [39]:
from sklearn import svm
model = svm.LinearSVC()
# fit a model to the data
model.fit(x_train, y_train)
Out[39]:
In [40]:
acc_score = model.score(x_test, y_test)
print("Accuracy score: ", acc_score)
In [41]:
y_pred_class = model.predict(x_test)
In [42]:
from sklearn import metrics
In [43]:
confusion_matrix = metrics.confusion_matrix(y_test, y_pred_class)
print(confusion_matrix)
In [44]:
# summarize the fit of the model
print(metrics.classification_report(y_test, y_pred_class))
print(metrics.confusion_matrix(y_test, y_pred_class))
http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
A comparison of several classifiers in scikit-learn on synthetic datasets. The point of this example is to illustrate the nature of decision boundaries of different classifiers. This should be taken with a grain of salt, as the intuition conveyed by these examples does not necessarily carry over to real datasets.
Particularly in high-dimensional spaces, data can more easily be separated linearly and the simplicity of classifiers such as naive Bayes and linear SVMs might lead to better generalization than is achieved by other classifiers.
The plots show training points in solid colors and testing points semi-transparent. The lower right shows the classification accuracy on the test set.
In [45]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from matplotlib.colors import ListedColormap
In [46]:
#classifiers
In [47]:
#x_train
In [48]:
#sns.pairplot(x_train)
In [49]:
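# Fit the scaler on the training data only, then apply the same transform to the test data
scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)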
In [50]:
x_train_scaled[0]
Out[50]:
In [51]:
len(x_train_scaled[0])
Out[51]:
In [52]:
df_x_train_scaled = pd.DataFrame(columns=x_train.columns, data=x_train_scaled)
In [53]:
df_x_train_scaled.head()
Out[53]:
In [54]:
#sns.pairplot(df_x_train_scaled)
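When scaled features are later used inside cross validation, one option is to wrap the scaler and the model in a pipeline so that the scaler is refit on the training part of each fold only. The sketch below assumes synthetic data from make_classification and an SVC as the model; neither choice comes from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the Titanic corpus).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The pipeline fits StandardScaler on each training fold, then the SVC,
# so statistics from the validation fold never leak into the scaling.
pipe = make_pipeline(StandardScaler(), SVC())

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(pipe, X, y, cv=kfold, scoring="accuracy")
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```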
Ref: http://machinelearningmastery.com/compare-machine-learning-algorithms-python-scikit-learn/
How do you choose the best model for your problem?
When you work on a machine learning project, you often end up with multiple good models to choose from. Each model will have different performance characteristics.
Using resampling methods like cross validation, you can get an estimate for how accurate each model may be on unseen data. You need to be able to use these estimates to choose one or two best models from the suite of models that you have created.
When you have a new dataset, it is a good idea to visualize the data using different techniques in order to look at the data from different perspectives.
The same idea applies to model selection. You should use a number of different ways of looking at the estimated accuracy of your machine learning algorithms in order to choose the one or two to finalize.
A way to do this is to use different visualization methods to show the average accuracy, variance and other properties of the distribution of model accuracies.
The key to a fair comparison of machine learning algorithms is ensuring that each algorithm is evaluated in the same way on the same data.
You can achieve this by forcing each algorithm to be evaluated on a consistent test harness.
In the example below 6 different algorithms are compared:
Logistic Regression
Linear Discriminant Analysis
K-Nearest Neighbors
Classification and Regression Trees
Naive Bayes
Support Vector Machines
The referenced tutorial applies this harness to the Pima Indians onset of diabetes dataset from the UCI machine learning repository. Here the same harness is applied to the preprocessed Titanic training data, which is also a binary classification problem (Survived or not).
The 10-fold cross validation procedure is used to evaluate each algorithm, importantly configured with the same random seed to ensure that the same splits to the training data are performed and that each algorithm is evaluated in precisely the same way.
Each algorithm is given a short name, useful for summarizing results afterward.
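The point about the random seed can be checked directly: two KFold objects built with the same seed should produce the same folds. The sketch below uses a toy array and a hypothetical helper, fold_test_indices, purely for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 30 rows, 2 columns (values are irrelevant to the split).
X = np.arange(60).reshape(30, 2)

def fold_test_indices(seed):
    """Return the test indices of each of the 10 folds for a given seed."""
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    return [test_idx.tolist() for _, test_idx in kfold.split(X)]

same_a = fold_test_indices(7)
same_b = fold_test_indices(7)
other = fold_test_indices(8)

print("seed 7 vs seed 7 identical:", same_a == same_b)
print("seed 7 vs seed 8 identical:", same_a == other)
```

Note that shuffle=True is needed for random_state to have any effect; without shuffling, KFold always cuts the rows into consecutive blocks.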
In [55]:
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn import model_selection
In [56]:
# prepare configuration for cross validation test harness
seed = 7
# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    # shuffle=True is required for random_state to take effect in recent scikit-learn versions
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, x_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (+-%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
In [57]:
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM",
"Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
"Naive Bayes", "QDA", "Gaussian Process"]
classifiers = [
KNeighborsClassifier(3),
SVC(kernel="linear", C=0.025),
SVC(gamma=2, C=1),
DecisionTreeClassifier(max_depth=5),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
MLPClassifier(alpha=1),
AdaBoostClassifier(),
GaussianNB(),
QuadraticDiscriminantAnalysis()
#, GaussianProcessClassifier(1.0 * RBF(1.0), warm_start=True), # Take too long...
]
# iterate over classifiers
for name, model in zip(names, classifiers):
    # fit a model to the scaled training data
    model.fit(x_train_scaled, y_train)
    # summarize the fit of the model on the scaled test data
    acc_score = model.score(x_test_scaled, y_test)
    print(name, " - accuracy score: ", acc_score)
In [58]:
names_classifiers = ["Nearest Neighbors", "Linear SVM", "RBF SVM",
"Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
"Naive Bayes", "QDA", "Gaussian Process"]
classifiers = [
KNeighborsClassifier(3),
SVC(kernel="linear", C=0.025),
SVC(gamma=2, C=1),
DecisionTreeClassifier(max_depth=5),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
MLPClassifier(alpha=1),
AdaBoostClassifier(),
GaussianNB(),
QuadraticDiscriminantAnalysis()
#, GaussianProcessClassifier(1.0 * RBF(1.0), warm_start=True), # Take too long...
]
# prepare configuration for cross validation test harness
seed = 7
models = zip(names_classifiers, classifiers)
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    # shuffle=True is required for random_state to take effect in recent scikit-learn versions
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, x_train_scaled, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (+-%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# boxplot algorithm comparison
fig = plt.figure(figsize=(16, 6))
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
In [59]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
# fit model on training data
model = XGBClassifier()
model.fit(x_train, y_train)
# make predictions for test data
y_pred = model.predict(x_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
In [60]:
# Save Model Using pickle
import pickle
# fit model on training data
model = XGBClassifier()
model.fit(x_train, y_train)
# save the model to disk (the with block closes the file handle)
filename = 'output/XGBClassifier_model-pickle.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)
# some time later...
# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
# make predictions for test data
y_pred = loaded_model.predict(x_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
In [61]:
# Save Model Using joblib
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
# fit model on training data
model = XGBClassifier()
model.fit(x_train, y_train)
# save the model to disk
filename = 'output/XGBClassifier_model-joblib.sav'
joblib.dump(model, filename)
# some time later...
# load the model from disk
loaded_model = joblib.load(filename)
# make predictions for test data
y_pred = loaded_model.predict(x_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
In [62]:
# Using data after scaling ...
# fit model on training data
model = XGBClassifier()
model.fit(x_train_scaled, y_train)
# make predictions for test data
y_pred = model.predict(x_test_scaled)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
In [63]:
from sklearn.model_selection import KFold, cross_val_score
seed = 7
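# shuffle=True is required for random_state to take effect in recent scikit-learn versions
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)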
model = XGBClassifier()
results = cross_val_score(model, x_train, y_train, cv=kfold)
print(results)
print("max: ", results.max())
print("min: ", results.min())
print("mean: ", results.mean())
print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean(), results.std() * 2))
In [64]:
#for train_indices, test_indices in kfold.split(x_train):
# print('Train: %s | test: %s' % (train_indices, test_indices))
Ref: http://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/
The three most popular methods for combining the predictions from different models are:
Bagging
Boosting
Voting
Each ensemble algorithm is demonstrated using 10 fold cross validation, a standard technique used to estimate the performance of any machine learning algorithm on unseen data.
In this part, we look at ensemble machine learning algorithms for improving the performance of models on our problem.
Bootstrap Aggregation or bagging involves taking multiple samples from your training dataset (with replacement) and training a model for each sample.
The final output prediction is averaged across the predictions of all of the sub-models.
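The procedure just described can be sketched by hand: draw bootstrap samples, fit one tree per sample, and average the votes. This is a simplified illustration on synthetic data (make_classification is an assumption, as are the 25 trees and the 0.5 threshold), not the internals of scikit-learn's BaggingClassifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic corpus).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_fit, y_fit, X_new, y_new = X[:300], y[:300], X[300:], y[300:]

rng = np.random.default_rng(7)
n_models = 25
trees = []
for _ in range(n_models):
    # Bootstrap sample: same size as the training set, drawn with replacement.
    idx = rng.integers(0, len(X_fit), size=len(X_fit))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_fit[idx], y_fit[idx]))

# Each row of votes holds one tree's 0/1 predictions for the new rows.
votes = np.stack([t.predict(X_new) for t in trees])

# Average the votes and predict class 1 where at least half the trees agree.
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
bagged_acc = (bagged_pred == y_new).mean()
print("bagged accuracy on held-out rows:", bagged_acc)
```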
The three bagging models covered in this section are as follows:
Bagged Decision Trees
Random Forest
Extra Trees
Bagging performs best with algorithms that have high variance. A popular example is decision trees, often constructed without pruning.
The example below uses the BaggingClassifier with the Classification and Regression Trees algorithm (DecisionTreeClassifier). A total of 100 trees are created.
In [65]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import model_selection
seed = 7
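# shuffle=True is required for random_state to take effect in recent scikit-learn versions
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
num_trees = 100
clf = DecisionTreeClassifier()
# the base estimator is passed positionally because the keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2
model = BaggingClassifier(clf, n_estimators=num_trees, random_state=seed)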
results = model_selection.cross_val_score(model, x_train, y_train, cv=kfold)
print(results)
print("max: ", results.max())
print("min: ", results.min())
print("mean: ", results.mean())
# We get a robust estimate of model accuracy.
print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean(), results.std() * 2))
Random forest is an extension of bagged decision trees.
Samples of the training dataset are taken with replacement, but the trees are constructed in a way that reduces the correlation between individual classifiers. Specifically, rather than greedily choosing the best split point in the construction of the tree, only a random subset of features are considered for each split.
We can construct a Random Forest model for classification using the RandomForestClassifier (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) class. A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
The example below builds a Random Forest for classification with 100 trees and split points chosen from a random selection of 3 features.
In [66]:
# Random Forest Classification
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
seed = 7
num_trees = 100
max_features = 3
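# shuffle=True is required for random_state to take effect in recent scikit-learn versions
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)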
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
results = model_selection.cross_val_score(model, x_train, y_train, cv=kfold)
print(results)
print("max: ", results.max())
print("min: ", results.min())
print("mean: ", results.mean())
# We get a mean estimate of classification accuracy.
print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean(), results.std() * 2))
Extra Trees are another modification of bagging where random trees are constructed from samples of the training dataset.
You can construct an Extra Trees model for classification using the ExtraTreesClassifier (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html) class. This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
The example below provides a demonstration of extra trees with the number of trees set to 100 and splits chosen from 7 random features.
In [67]:
# Extra Trees Classification
import pandas
from sklearn import model_selection
from sklearn.ensemble import ExtraTreesClassifier
seed = 7
num_trees = 100
max_features = 7
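# shuffle=True is required for random_state to take effect in recent scikit-learn versions
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)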
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)
results = model_selection.cross_val_score(model, x_train, y_train, cv=kfold)
print(results)
print("max: ", results.max())
print("min: ", results.min())
print("mean: ", results.mean())
# We get a mean estimate of classification accuracy.
print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean(), results.std() * 2))
Boosting ensemble algorithms create a sequence of models that attempt to correct the mistakes of the models before them in the sequence.
Once created, the models make predictions which may be weighted by their demonstrated accuracy and the results are combined to create a final output prediction.
The two most common boosting ensemble machine learning algorithms are:
AdaBoost
Stochastic Gradient Boosting
AdaBoost was perhaps the first successful boosting ensemble algorithm. It generally works by weighting instances in the dataset by how easy or difficult they are to classify, allowing the algorithm to pay more or less attention to them in the construction of subsequent models.
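The instance weighting can be made concrete with one round of the textbook discrete AdaBoost update, using labels coded as -1/+1. The labels and weak-learner predictions below are made up for illustration, and scikit-learn's SAMME implementation uses a slightly different form of the update, so treat this as a sketch of the idea only.

```python
import numpy as np

# Made-up labels and weak-learner predictions, coded as -1 / +1.
y_true = np.array([1, 1, -1, -1, 1, -1, 1, -1])
y_weak = np.array([1, -1, -1, -1, 1, 1, 1, -1])  # wrong at positions 1 and 5

# Start with uniform instance weights.
weights = np.full(len(y_true), 1.0 / len(y_true))

# Weighted error of the weak learner and its resulting vote strength.
miss = y_weak != y_true
err = weights[miss].sum()
alpha = 0.5 * np.log((1.0 - err) / err)

# Misclassified instances (y_true * y_weak == -1) are scaled up,
# correctly classified ones are scaled down, then weights are renormalized.
new_weights = weights * np.exp(-alpha * y_true * y_weak)
new_weights /= new_weights.sum()

print("weighted error:", err)
print("weights before:", weights)
print("weights after: ", new_weights)
```

After the update the misclassified instances carry half of the total weight, so the next weak learner cannot do well by repeating the same mistakes.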
You can construct an AdaBoost model for classification using the AdaBoostClassifier (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) class. An AdaBoost [1] classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases. This class implements the algorithm known as AdaBoost-SAMME [2].
The example below demonstrates the construction of 30 decision trees in sequence using the AdaBoost algorithm.
In [68]:
# AdaBoost Classification
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier
seed = 7
num_trees = 30
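# shuffle=True is required for random_state to take effect in recent scikit-learn versions
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)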
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, x_train, y_train, cv=kfold)
print(results)
print("max: ", results.max())
print("min: ", results.min())
print("mean: ", results.mean())
# We get a mean estimate of classification accuracy.
print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean(), results.std() * 2))
Stochastic Gradient Boosting (also called Gradient Boosting Machines) is one of the most sophisticated ensemble techniques. It is also proving to be perhaps one of the best techniques available for improving performance via ensembles.
You can construct a Gradient Boosting model for classification using the GradientBoostingClassifier (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) class. Gradient Boosting for classification - GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage n_classes_ regression trees are fit on the negative gradient of the binomial or multinomial deviance loss function. Binary classification is a special case where only a single regression tree is induced.
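The phrase "regression trees are fit on the negative gradient" can be illustrated with a stripped-down loop for the binary case. For the log loss, the negative gradient with respect to the raw score is simply y minus the predicted probability. The sketch below assumes synthetic data, a fixed learning rate of 0.1, and depth-3 trees; it omits the per-leaf value optimization and the row subsampling (the "stochastic" part) that real implementations use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the Titanic corpus).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

learning_rate = 0.1
n_stages = 20

# Stage 0: a constant raw score equal to the log-odds of the positive class.
p0 = y.mean()
raw = np.full(len(y), np.log(p0 / (1 - p0)))
losses = [log_loss(y, sigmoid(raw))]

for _ in range(n_stages):
    # Negative gradient of the log loss with respect to the raw score.
    residual = y - sigmoid(raw)
    # Fit a small regression tree to that negative gradient.
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residual)
    # Take a small step in the direction the tree suggests.
    raw += learning_rate * tree.predict(X)
    losses.append(log_loss(y, sigmoid(raw)))

print("training log loss, first and last stage:", losses[0], losses[-1])
```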
The example below demonstrates Stochastic Gradient Boosting for classification with 100 trees.
In [69]:
# Stochastic Gradient Boosting Classification
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier
seed = 7
num_trees = 100
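# shuffle=True is required for random_state to take effect in recent scikit-learn versions
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)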
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, x_train, y_train, cv=kfold)
print(results)
print("max: ", results.max())
print("min: ", results.min())
print("mean: ", results.mean())
# We get a mean estimate of classification accuracy.
print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean(), results.std() * 2))
Voting is one of the simplest ways of combining the predictions from multiple machine learning algorithms.
It works by first creating two or more standalone models from your training dataset. A Voting Classifier can then be used to wrap your models and average the predictions of the sub-models when asked to make predictions for new data.
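Averaging the predictions of standalone models can be sketched without the wrapper class: fit each model, average their predicted class probabilities, and take the most probable class. The data and the three models below are assumptions chosen for illustration; VotingClassifier with voting='soft' is the packaged way to do this.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic corpus).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_fit, y_fit, X_new, y_new = X[:300], y[:300], X[300:], y[300:]

# Standalone models of differing types, each fit on the same training rows.
models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=3, random_state=0),
    GaussianNB(),
]
for m in models:
    m.fit(X_fit, y_fit)

# Stack the class probabilities: shape (n_models, n_rows, n_classes).
probas = np.stack([m.predict_proba(X_new) for m in models])

# Unweighted average across models, then pick the most probable class.
avg_proba = probas.mean(axis=0)
voted_pred = avg_proba.argmax(axis=1)
voted_acc = (voted_pred == y_new).mean()
print("soft-voting accuracy on held-out rows:", voted_acc)
```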
The predictions of the sub-models can be weighted, but specifying the weights for classifiers manually or even heuristically is difficult. More advanced methods can learn how to best weight the predictions from submodels; this is called stacking (stacked aggregation). It was not provided in scikit-learn when the referenced tutorial was written, but a StackingClassifier was added in scikit-learn 0.22.
You can create a voting ensemble model for classification using the VotingClassifier (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html) class. Soft Voting/Majority Rule classifier for unfitted estimators. New in version 0.17.
The code below provides an example of combining the predictions of logistic regression, classification and regression trees and support vector machines together for a classification problem.
In [70]:
# Voting Ensemble for Classification
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
seed = 7
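# shuffle=True is required for random_state to take effect in recent scikit-learn versions
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)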
# create the sub models
estimators = []
model1 = LogisticRegression()
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
model3 = SVC()
estimators.append(('svm', model3))
# create the ensemble model
ensemble = VotingClassifier(estimators)
results = model_selection.cross_val_score(ensemble, x_train, y_train, cv=kfold)
print(results)
print("max: ", results.max())
print("min: ", results.min())
print("mean: ", results.mean())
# We get a mean estimate of classification accuracy.
print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean(), results.std() * 2))