This notebook contains an excerpt from the book Machine Learning for OpenCV by Michael Beyeler. The code is released under the MIT license, and is available on GitHub.

Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations. If you find this content useful, please consider supporting the work by buying the book!

Representing Data and Engineering Features

In the last chapter, we built our very first supervised learning models and applied them to some classic datasets, such as the Iris and the Boston datasets. However, in the real world, data rarely comes in a neat <n_samples x n_features> feature matrix that is part of a pre-packaged database. Instead, it is our own responsibility to find a way to represent the data in a meaningful way. The process of finding the best way to represent our data is known as feature engineering, and it is one of the main tasks of data scientists and machine learning practitioners trying to solve real-world problems.
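To make the <n_samples x n_features> layout concrete, here is a minimal NumPy sketch. The measurement values are illustrative (typical of Iris-like flower measurements), not taken from any particular dataset:

```python
import numpy as np

# A feature matrix has one row per sample and one column per feature.
# Hypothetical data: three flowers, each described by four measurements
# (sepal length, sepal width, petal length, petal width).
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [6.3, 3.3, 6.0, 2.5]])

n_samples, n_features = X.shape  # (3, 4)
```

Every supervised learning estimator we met in the last chapter expects its input in exactly this two-dimensional shape.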

I know you would rather jump right to the end and build the deepest neural network mankind has ever seen. But, trust me, this stuff is important! Representing our data in the right way can have a much greater influence on the performance of our supervised model than the exact parameters we choose. And we get to invent our own features, too.

In this chapter, we will therefore go over some common feature engineering tasks. Specifically, we want to answer the following questions:

  • What are some common preprocessing techniques that everyone uses but nobody talks about?
  • How do we represent categorical variables, such as the names of products, colors, or fruits?
  • How would we even go about representing text?
  • What is the best way to encode images, and what do SIFT and SURF stand for?

Outline

The book features a detailed treatment of feature engineering, data preprocessing, and data transformation. Below is a short summary of these topics. For more information, please refer to the book.

Feature engineering

Feature engineering comes in two stages:

  • Feature selection: This is the process of identifying important attributes (or features) in the data. Possible features of an image might be the location of edges, corners, or ridges. In this chapter, we will look at some of the more advanced feature descriptors that OpenCV provides, such as the Scale-Invariant Feature Transform (SIFT) and Speeded Up Robust Features (SURF).
  • Feature extraction: This is the actual process of transforming the raw data into the desired feature space used to feed a machine learning algorithm, as illustrated in the figure above. An example would be the Harris operator, which allows you to extract corners (that is, a particular type of feature) in an image.

Preprocessing data

The preprocessing pipeline:

  • Data formatting: The data may not be in a format that is suitable for you to work with. For example, the data might be provided in a proprietary file format, which your favorite machine learning algorithm does not understand.
  • Data cleaning: The data may contain invalid or missing entries, which need to be cleaned up or removed.
  • Data sampling: The data may be far too numerous for your specific purpose, forcing you to sample the data in a smart way.
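The cleaning and sampling steps can be sketched in a few lines of NumPy. The data array and the mean-imputation strategy below are illustrative assumptions, chosen only to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical raw data with missing entries (NaN).
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan],
              [5.0, 6.0]])

# Data cleaning: replace each missing entry with its column mean.
col_means = np.nanmean(X, axis=0)
X_clean = np.where(np.isnan(X), col_means, X)

# Data sampling: keep a random subset of rows (here, 2 of 4).
idx = rng.choice(len(X_clean), size=2, replace=False)
X_sample = X_clean[idx]
```

Dropping incomplete rows instead of imputing them is an equally valid cleaning strategy; which one is appropriate depends on how much data you can afford to lose.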

Transforming data

Once the data has been preprocessed, you are ready for the actual feature engineering: to transform the preprocessed data to fit your specific machine learning algorithm. This step usually involves one or more of three possible procedures:

  • Scaling: Machine learning algorithms often require the data to be within a common range, for example with zero mean and unit variance. Scaling is the process of bringing all features (which might have different physical units) into a common range of values.
  • Decomposition: Datasets often have many more features than you could possibly process. Feature decomposition is the process of compressing data into a smaller number of highly informative data components.
  • Aggregation: There may be features that can be aggregated into a single feature that would be more meaningful to the problem you are trying to solve. For example, a database might contain the date and time each user logged into a web-based system. Depending on the task, this data might be better represented by simply counting the number of logins per user.
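The three procedures above can be sketched with plain NumPy. The random data, the SVD-based decomposition (the idea behind principal component analysis), and the login example are illustrative assumptions, not the book's code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Scaling: standardize each feature to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Decomposition: project onto the top-2 principal components,
# computed here via singular value decomposition.
U, S, Vt = np.linalg.svd(X_scaled, full_matrices=False)
X_reduced = X_scaled @ Vt[:2].T  # shape (100, 2)

# Aggregation: count the number of logins per user
# (hypothetical user IDs, one entry per login event).
user_ids = np.array([0, 1, 0, 2, 1, 0])
logins_per_user = np.bincount(user_ids)
```

In practice you would typically reach for `sklearn.preprocessing.StandardScaler` and `sklearn.decomposition.PCA` rather than writing these steps by hand, but the operations underneath are exactly the ones shown here.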