This notebook contains an excerpt from the book Machine Learning for OpenCV by Michael Beyeler. The code is released under the MIT license, and is available on GitHub.

Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations. If you find this content useful, please consider supporting the work by buying the book!

Working with Data in OpenCV

Now that we have whetted our appetite for machine learning, it is time to delve a little deeper into the different parts that make up a typical machine learning system.

Machine learning is all about building mathematical models in order to understand data. The learning aspect enters this process when we give a machine learning model the capability to adjust its internal parameters; we can tweak these parameters so that the model explains the data better. In a sense, this can be understood as the model learning from the data. Once the model has learned enough—whatever that means—we can ask it to explain newly observed data.
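To make the idea of adjusting internal parameters concrete, here is a minimal sketch in plain Python. It is not taken from the book: the toy data, the one-parameter model `y = w * x`, the helper names, and the learning rate are all assumptions chosen for illustration. The model starts with `w = 0` and repeatedly nudges `w` in the direction that reduces the mean squared error on the data.

```python
# Toy data generated by the rule y = 2 * x (an assumption made for this sketch).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def mean_squared_error(w, xs, ys):
    """Average squared gap between the predictions w * x and the targets y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit(xs, ys, learning_rate=0.01, n_steps=200):
    """Adjust the single parameter w by gradient descent on the mean squared error."""
    w = 0.0  # initial guess for the internal parameter
    for _ in range(n_steps):
        # Derivative of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad  # step against the gradient
    return w


w = fit(xs, ys)
error_before = mean_squared_error(0.0, xs, ys)  # error of the untrained model
error_after = mean_squared_error(w, xs, ys)     # error after learning
prediction = w * 5.0                            # explain a newly observed input
print(w, error_before, error_after, prediction)
```

After training, the error on the data is far smaller than before, and the learned parameter can be used to make a prediction for an input the model has not seen.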

Hence, machine learning problems are always split into (at least) two distinct phases:

  • A training phase, during which we aim to train a machine learning model on a set of data that we call the training dataset.
  • A test phase, during which we evaluate the learned (or finalized) machine learning model on a new set of never-before-seen data that we call the test dataset.

The importance of splitting our data into a training set and a test set cannot be overstated. We always evaluate our models on an independent test set because we are interested in knowing how well our models generalize to new data. In the end, isn't this what learning is all about, be it machine learning or human learning?
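As a rough illustration of such a split, the following sketch shuffles a list of samples and holds out a fraction of them as the test set. It uses only the Python standard library; the helper name `split_train_test`, the 25% test fraction, and the fixed seed are assumptions made for this example, not something prescribed by the text.

```python
import random


def split_train_test(samples, test_fraction=0.25, seed=42):
    """Shuffle a copy of the samples and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(samples)               # copy, so the input stays untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(20))  # stand-in for 20 data samples
train, test = split_train_test(data)
print(len(train), len(test))
```

Shuffling before splitting matters whenever the samples are stored in some meaningful order; the two resulting lists never share a sample, so the test set stays unseen during training.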

Machine learning is also all about the data. Data can be anything from images and movies to text documents and audio files. Therefore, in its raw form, data might be made of pixels, letters, words, or even worse: pure bits. It is easy to see that data in such a raw form might not be very convenient to work with. Instead, we have to find ways to preprocess the data in order to bring it into a form that is easy to parse.
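One simple way to picture such preprocessing is to turn raw text into a vector of word counts. The sketch below is only an illustration under assumptions of our own: the helper `bag_of_words`, the tiny vocabulary, and the example sentence are all made up for this purpose, and real preprocessing pipelines are usually more involved.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text (case-insensitive)."""
    counts = Counter(text.lower().split())
    # Words missing from the text get a count of 0.
    return [counts[word] for word in vocabulary]


vocabulary = ["data", "learning", "machine", "opencv"]  # hypothetical vocabulary
document = "Machine learning is all about the data and machine learning needs data"
features = bag_of_words(document, vocabulary)
print(features)  # one number per vocabulary word
```

The resulting list of numbers is a fixed-length representation that is far easier for a model to work with than the raw string.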

In this chapter, we want to learn how data fits in with machine learning, and how to work with data using the tools of our choice: OpenCV and Python.

Specifically, we want to address the following questions:

  • What does a typical machine learning workflow look like?
  • What are training data, validation data, and test data - and what are they good for?
  • How do I load, store, and work with such data in OpenCV using Python?

Starting a new IPython or Jupyter session

Before we can get started, we need to open an IPython shell or start a Jupyter Notebook:

  1. Open a terminal like we did in the previous chapter, and navigate to the opencv-machine-learning directory:

     $ cd Desktop/opencv-machine-learning
  2. Activate the conda environment we created in the previous chapter:

     $ source activate Python3 # Mac OS X / Linux
     $ activate Python3 # Windows
  3. Start a new IPython or Jupyter session:

     $ ipython # for an IPython session
     $ jupyter notebook # for a Jupyter session

If you chose to start an IPython session, the program should have greeted you with a welcome message such as the following:

$ ipython
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul 5 2016, 11:41:13)
[MSC v.1900 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.
IPython 3.5.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.

In [1]:

The line starting with In [1] is where you type in your regular Python commands. You can also press the Tab key while typing the names of variables and functions to have IPython complete them automatically.

If you chose to start a Jupyter session, a new window should have opened in your web browser, pointing to http://localhost:8888. Create a new notebook by clicking on New in the top-right corner and selecting Notebooks (Python3).

This will open a new window that contains an empty page with the same command line as in an IPython session:

In [ ]:
