When the trees in the forest are trees of depth 1 (also known as decision stumps) and we perform boosting instead of bagging, the resulting algorithm is called AdaBoost.
AdaBoost adjusts the dataset at each iteration: it increases the weights of the training samples that the current stump misclassified, and it decreases the weights of the samples it classified correctly.
This iterative weight adjustment causes each new classifier in the ensemble to prioritize the incorrectly labeled cases during training. As a result, the model adapts by focusing on the highly weighted data points.
Eventually, the stumps are combined to form a final classifier.
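To see what this weight adjustment looks like in code, here is a minimal sketch of discrete (two-class) AdaBoost built on decision stumps from scikit-learn. Everything below, including the names adaboost_sketch and adaboost_predict, is an illustrative assumption rather than the book's implementation, and the labels y are assumed to be -1/+1:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_sketch(X, y, n_rounds=10):
    # Illustrative sketch; y is expected to contain the labels -1 and +1.
    y = np.asarray(y)
    n_samples = len(X)
    weights = np.full(n_samples, 1.0 / n_samples)    # start with uniform weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)  # a decision stump
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)        # the stump's vote strength
        # Misclassified samples get heavier, correctly classified ones lighter:
        weights *= np.exp(-alpha * y * pred)
        weights /= weights.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # The final classifier is a weighted vote over all stumps.
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)

In practice, we will not hand-roll this loop; scikit-learn ships it as AdaBoostClassifier, which we use later in this section.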
In [1]:
import cv2
img_bgr = cv2.imread('data/lena.jpg', cv2.IMREAD_COLOR)
img_gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
After loading the image in both color and grayscale, we load a pretrained Haar cascade:
In [2]:
filename = 'data/haarcascade_frontalface_default.xml'
face_cascade = cv2.CascadeClassifier(filename)
The classifier will then detect faces present in the image using the following function call:
In [3]:
faces = face_cascade.detectMultiScale(img_gray, scaleFactor=1.1, minNeighbors=5)
Note that the algorithm operates only on grayscale images. That's why we loaded two versions of the Lena picture: one to which we can apply the classifier (img_gray), and one on which we can draw the resulting bounding box (img_bgr):
In [4]:
color = (255, 0, 0)
thickness = 2
for (x, y, w, h) in faces:
    cv2.rectangle(img_bgr, (x, y), (x + w, y + h),
                  color, thickness)
Then we can plot the image using the following code:
In [5]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(10, 6))
plt.imshow(cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB));
Obviously, this picture contains only a single face. However, the preceding code works just as well on images containing multiple faces, since one rectangle is drawn per detection. Try it out on pictures of your own, for example by wrapping the steps in a small helper as sketched below.
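As a minimal sketch (the helper name detect_and_draw and the example path 'data/group.jpg' are hypothetical, and it reuses the face_cascade loaded above):

import cv2
import matplotlib.pyplot as plt

def detect_and_draw(path, cascade, scale_factor=1.1, min_neighbors=5):
    # Load the image in color for drawing, and convert to grayscale for detection.
    img = cv2.imread(path, cv2.IMREAD_COLOR)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # One rectangle per detected face, so multiple faces are handled naturally.
    for (x, y, w, h) in cascade.detectMultiScale(gray, scale_factor, min_neighbors):
        cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)
    return img

# 'data/group.jpg' is a placeholder; point this at any picture of your own.
result = detect_and_draw('data/group.jpg', face_cascade)
plt.imshow(cv2.cvtColor(result, cv2.COLOR_BGR2RGB));

AdaBoost itself is not limited to face detection. scikit-learn provides a general-purpose AdaBoostClassifier, which we set up with 100 boosting rounds: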
In [6]:
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(n_estimators=100,
                         random_state=456)
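Note that, by default, AdaBoostClassifier boosts decision stumps (trees of depth 1), which matches the description at the start of this section. If you want to spell the base learner out, something like the following works; the keyword is estimator in recent scikit-learn releases and base_estimator in older ones, so adjust to your version:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# Equivalent to the cell above: 100 boosted decision stumps.
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=100,
                         random_state=456)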
We can load the breast cancer dataset once more and split it 75-25 (the default split used by train_test_split):
In [7]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=456
)
Then fit and score AdaBoost using the familiar procedure:
In [9]:
ada.fit(X_train, y_train)
ada.score(X_test, y_test)
Out[9]:
The result is remarkable: 97.9% accuracy!
We might want to compare this result to a random forest. However, to be fair, we should make all the trees in the forest decision stumps; then the comparison really measures the difference between bagging and boosting:
In [10]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100,
                                max_depth=1,
                                random_state=456)
forest.fit(X_train, y_train)
forest.score(X_test, y_test)
Out[10]:
Of course, if we let the trees be as deep as needed, we might get a better score:
In [11]:
forest = RandomForestClassifier(n_estimators=100,
                                random_state=456)
forest.fit(X_train, y_train)
forest.score(X_test, y_test)
Out[11]:
As a last step in this chapter, let's talk about how to combine different types of models into an ensemble.