This notebook contains an excerpt from the book Machine Learning for OpenCV by Michael Beyeler. The code is released under the MIT license, and is available on GitHub.

Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations. If you find this content useful, please consider supporting the work by buying the book!

Using Decision Trees to Make a Medical Diagnosis

Now that we know how to handle data in all shapes and forms, be it numerical, categorical, text, or image data, it is time to put our newly gained knowledge to good use.

In this chapter, we will learn how to build a machine learning system that can make a medical diagnosis. We aren't all doctors, but we've probably all been to one at some point in our lives. Typically, a doctor would gain as much information as possible about a patient's history and symptoms in order to make an informed diagnosis. We will mimic a doctor's decision-making process with the help of what is known as decision trees.

A decision tree is a simple yet powerful supervised learning algorithm that resembles a flow chart; we will talk more about this in just a minute. Beyond medicine, decision trees are commonly used in fields such as astronomy (for example, for filtering noise from Hubble Space Telescope images or for star-galaxy classification), manufacturing and production (for example, by Boeing to discover flaws in the manufacturing process), and object recognition (for example, for recognizing 3D objects).

Specifically, we want to address the following questions in this chapter:

  • How do we build simple decision trees from data, and use them for either classification or regression?
  • How do we decide which decision to make next?
  • How do we prune a tree, and what is that good for?

Outline

But first, let's talk about what decision trees actually are.

Understanding Decision Trees

Decision trees are simple yet powerful models for supervised learning problems. As the name suggests, you can think of them as a tree in which information flows along different branches - starting at the trunk and going all the way to the individual leaves.

Here the book offers a detailed treatment of the inner workings of decision trees, along with illustrations and simple examples. For more information on decision trees, please refer to the book.
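
To get a first feel for the flow-chart analogy, here is a minimal sketch (not from the book) that fits a shallow decision tree on scikit-learn's built-in Iris dataset and prints its branching structure with export_text:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Keep the tree shallow so the printed flow chart stays readable
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Each indented block is one branch: information enters at the root
# and flows down to a leaf, where a class is assigned
print(export_text(clf, feature_names=list(iris.feature_names)))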

Let's say we have a dataset consisting of a single e-mail:


In [1]:
data = [
    'I am Mohammed Abacha, the son of the late Nigerian Head of '
    'State who died on the 8th of June 1998. Since i have been '
    'unsuccessful in locating the relatives for over 2 years now '
    'I seek your consent to present you as the next of kin so '
    'that the proceeds of this account valued at US$15.5 Million '
    'Dollars can be paid to you. If you are capable and willing '
    'to assist, contact me at once via email with following '
    'details: 1. Your full name, address, and telephone number. '
    '2. Your Bank Name, Address. 3.Your Bank Account Number and '
    'Beneficiary Name - You must be the signatory.'
]

This data can be vectorized using scikit-learn's CountVectorizer, which splits the e-mail into its individual words and counts how often each word occurs:


In [2]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(data)

Let's look at the first five words in the dictionary and their word counts (note that in newer scikit-learn versions, get_feature_names has been replaced by get_feature_names_out):


In [3]:
vec.get_feature_names()[:5]


Out[3]:
['15', '1998', '8th', 'abacha', 'account']

In [4]:
X.toarray()[0, :5]


Out[4]:
array([1, 1, 1, 1, 2], dtype=int64)
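
To make the correspondence explicit, we can pair each feature name with its count - a quick sketch that reuses the vec and X objects from above:

words = vec.get_feature_names()
counts = X.toarray()[0].tolist()  # plain Python ints
# Pair each word with its count; 'account', for example, appears twice
print(dict(zip(words[:5], counts[:5])))
# {'15': 1, '1998': 1, '8th': 1, 'abacha': 1, 'account': 2}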

So, how would you check whether this e-mail is from a Nigerian prince?

One way to do this is to check whether the e-mail contains both of the words 'nigerian' and 'prince':


In [5]:
'nigerian' in vec.get_feature_names()


Out[5]:
True

In [6]:
'prince' in vec.get_feature_names()


Out[6]:
False

And what do we find? To our surprise, the word 'prince' does not occur in the e-mail.

Does this mean the message is legit?

No, of course not. Instead of 'prince', the e-mail used the words 'head of state', effectively circumventing our all-too-simple spam detector.
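
Expressed as code, the naive rule we just applied by hand flags an e-mail only if both trigger words are present - a sketch, again reusing the fitted vectorizer from above:

# Flag the e-mail only if BOTH trigger words are present
words = set(vec.get_feature_names())
is_spam = 'nigerian' in words and 'prince' in words
print(is_spam)  # False - the scam slips right through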

Luckily, the theoretical framework behind decision trees helps us with both tasks: finding the right decision rules and deciding which decision to tackle next.
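
As a closing sketch - using a made-up toy corpus, purely for illustration - we can let a decision tree pick such decision rules on its own from labeled e-mails:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy corpus, purely for illustration: 1 = spam, 0 = not spam
emails = [
    'nigerian head of state proceeds of this account',
    'nigerian prince requests your bank account details',
    'lunch tomorrow at noon?',
    'meeting notes attached, see you monday',
]
labels = [1, 1, 0, 0]

vec_toy = CountVectorizer()
X_toy = vec_toy.fit_transform(emails)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_toy.toarray(), labels)

# The learned splits are decision rules the tree chose on its own
names = list(vec_toy.get_feature_names())  # get_feature_names_out() on newer scikit-learn
print(export_text(clf, feature_names=names))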