In the last chapter, we built our very first supervised learning models and applied them to
some classic datasets, such as the Iris and the Boston datasets. However, in the real world,
data rarely comes in a neat <n_samples x n_features> feature matrix that is part of a pre-packaged database. Instead, it is our own responsibility to find a way to represent the
data in a meaningful way. The process of finding the best way to represent our data is
known as feature engineering, and it is one of the main tasks of data scientists and machine
learning practitioners trying to solve real-world problems.
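To make the <n_samples x n_features> convention concrete, here is a minimal sketch using scikit-learn's bundled Iris dataset from the last chapter; the variable names are our own:

```python
from sklearn.datasets import load_iris

# Load the Iris dataset used in the previous chapter
iris = load_iris()

# The feature matrix has one row per sample and one column per feature
X = iris.data
print(X.shape)  # (150, 4): 150 samples with 4 features each
```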
I know you would rather jump right to the end and build the deepest neural network mankind has ever seen. But, trust me, this stuff is important! Representing our data in the right way can have a much greater influence on the performance of our supervised model than the exact parameters we choose. And we get to invent our own features, too.
In this chapter, we will therefore go over some common feature engineering tasks, covering how to preprocess data and how to transform it to fit a specific machine learning algorithm.
Feature engineering comes in two stages:
- Feature selection: the process of identifying the attributes in the data that are most important for the task at hand (see the sketch after this list).
- Feature extraction: the process of transforming the raw data into the desired feature space.
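As a quick illustration of feature selection, the following is a minimal sketch, assuming we use scikit-learn's SelectKBest (our choice of tool here, not prescribed by the book) to keep the two most informative Iris features:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()

# Score every feature against the class labels and keep the best two
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(iris.data, iris.target)
print(X_selected.shape)  # (150, 2)
```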
The preprocessing pipeline:
- Data formatting: the data may not be in a format that we can work with; for example, it might be stored in a proprietary file format.
- Data cleaning: the data may contain invalid or missing entries, which need to be cleaned up or removed (as sketched below).
- Data sampling: the data may be far too numerous for our purpose, forcing us to sample it in a smart way.
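To make the data cleaning step concrete, here is a minimal sketch, assuming we use scikit-learn's SimpleImputer to replace missing entries with the mean of their column; the tiny array is made up purely for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A toy feature matrix with missing entries marked as NaN
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace every missing value with the mean of its column
imputer = SimpleImputer(strategy='mean')
X_clean = imputer.fit_transform(X)
print(X_clean)
# [[1.   2. ]
#  [4.   3. ]
#  [7.   2.5]]
```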
Once the data has been preprocessed, you are ready for the actual feature engineering: to transform the preprocessed data to fit your specific machine learning algorithm. This step usually involves one or more of three possible procedures (see the sketch after this list):
- Scaling: many machine learning algorithms require the features to lie within a common range, for example, to have zero mean and unit variance.
- Decomposition: datasets often have far more features than we could possibly process; feature decomposition compresses the data into a smaller number of highly informative components.
- Aggregation: sometimes it is possible to group multiple features into a single, more meaningful one.
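As a minimal sketch of the first two procedures, the following standardizes the Iris features and then compresses them with a principal component analysis (PCA); keeping two components is our choice, purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()

# Scaling: transform every feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(iris.data)

# Decomposition: compress the four features into two principal components
X_pca = PCA(n_components=2).fit_transform(X_scaled)
print(X_pca.shape)  # (150, 2)
```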