1. Introduction:

Machine Learning is a vast area of Computer Science concerned with designing algorithms that form good models of the world around us, i.e. of the data coming from the world around us.

Within Machine Learning many tasks are - or can be reformulated as - classification tasks.

In classification tasks we try to produce a model which maps each entry of the input data $X$ to the class $C$ it belongs to. This model is built from the feature values of the input data. For example, if a dataset contains datapoints belonging to the classes Apples, Pears and Oranges, we try to predict the class of each datapoint based on its features (weight, color, size, etc.).

We need some amount of training data to train the Classifier, i.e. to form a correct model of the data. We can then use the trained Classifier to classify new data. If the training dataset is chosen correctly, the Classifier should predict the classes of new data with an accuracy similar to the one it achieves on the training examples.

After construction, such a Classifier could for example tell us that a document containing the words "Bose-Einstein condensate" should be categorized as a Physics article, while a document containing the words "Arbitrage" and "Hedging" should be categorized as a Finance article.

Another Classifier (whose dataset is illustrated below) could tell whether or not a person makes more than 50K, based on features such as Age, Education, Marital Status, Occupation etc.

As we can see, there is an input dataset $X$ which corresponds to an output $Y$. The dataset $X$ contains $m$ input examples $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$, and each input example has $n$ feature values $x_1, x_2, \ldots, x_n$ (here $n = 7$).
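For concreteness, here is a minimal sketch (with made-up values, and only a subset of the features) of how such a dataset can be represented as NumPy arrays; this is also the form the training code further below expects:

import numpy as np

# Made-up example data; each row of X is one input example x^(i),
# each column is one feature (only four features are shown here).
X = np.array([
    ['39', 'Bachelors', 'Adm-clerical',    'United-States'],
    ['52', 'HS-grad',   'Exec-managerial', 'United-States'],
    ['31', 'Masters',   'Prof-specialty',  'India'],
])
# Y contains the class ('output') of each input example.
Y = np.array(['<=50K', '>50K', '>50K'])

print(np.shape(X))  # (m, n) = (3, 4)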

There are three popular Classifiers within Machine Learning, which use three different mathematical approaches to classify data:

  • Naive Bayes, which uses a statistical (Bayesian) approach,
  • Logistic Regression, which uses a functional approach and
  • Support Vector Machines, which uses a geometrical approach.

We have previously looked at Logistic Regression. Here we will see the theory behind the Naive Bayes Classifier, together with its implementation in Python.

2. Naive Bayes Classification:

Naive Bayes classifiers approach the classification problem from a statistical point of view.

The starting point is that the probability that a datapoint $x^{(i)}$ belongs to a class $C\ =\ c_j$ is given by the posterior probability $P(C = c_j\ |\ x^{(i)})$. Here $x^{(i)}$ refers to an entry in the dataset, consisting of $n$ feature values $x_1, x_2, \ldots, x_n$.

Using Bayes' rule, this posterior probability can be rewritten as:

$ P(C=c_j\ |\ x^{(i)}) = \frac{P(x^{(i)}\ |\ C=c_j) \cdot P(C=c_j)}{P(x^{(i)})} $


Since the marginal probability $P(x^{(i)})$ does not depend on the class, it can be disregarded when comparing classes, and we obtain:

$ P(C=c_j\ |\ x^{(i)}) \propto P(x^{(i)}\ |\ C=c_j) \cdot P(C=c_j) $


The example $x^{(i)}$ is assigned to the class $c_j$ which maximizes this probability, so:

$ C_{NB} = \underset{c_j}{argmax}\ P(x^{(i)}\ |\ C=c_j) \cdot P(C=c_j) $

$ C_{NB} = \underset{c_j}{argmax}\ P(x_1, x_2, \ldots, x_n\ |\ C=c_j) \cdot P(C=c_j) $


Assuming conditional independence of the features $x_k$ given the class, this equation simplifies to:

$ C_{NB} = \underset{C}{argmax}\ P(x_1|C) \cdot P(x_2|C) \cdots P(x_n|C) \cdot P(C) $

$ C_{NB} = \underset{C}{argmax}\ P(C) \cdot \prod_{k=1}^{n} P(x_k|C) $


Here $P(x_k\ |\ C)$ is the conditional probability of observing the value $x_k$ of feature $k$, given that the entry belongs to class $C$.

This probability can simply be estimated by calculating the relative frequencies of the values of feature $k$ within each class. This should become clearer if we look at our '50K income' example from above:


First, we select all of the entries belonging to one class:



Then we calculate the relative frequency of the values of each feature (per class):
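As a small sketch (with hypothetical feature values of the entries belonging to one class), such relative frequencies can be computed with collections.Counter, which is also what the training code further below does:

from collections import Counter

# hypothetical 'occupation' values of the entries belonging to one class
occupation_values = ['Exec-managerial', 'Prof-specialty', 'Exec-managerial']

counts = Counter(occupation_values)
relative_frequencies = {value: count / len(occupation_values) for value, count in counts.items()}
print(relative_frequencies)  # {'Exec-managerial': 0.666..., 'Prof-specialty': 0.333...}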

New entries can be classified by multiplying, per class, the probabilities of their feature values. For example, if a new entry has the following values for the three features illustrated above:

  • native-country: United-states,
  • hours-per-week: 40,
  • occupation: Exec-managerial.

Then, based on these features, the class probabilities are proportional to:
$P( C = C_{>50K})\ \propto\ (1/3) \cdot (2/3) \cdot (1/3) = 2/27 $
$P( C = C_{<=50K})\ \propto\ (2/3) \cdot (2/3) \cdot (2/3) = 8/27 $

The predicted class for this new entry therefore would be '<=50K'.
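The same calculation in a few lines of Python, using the relative frequencies from the worked example above:

# Multiply, per class, the relative frequencies of the observed feature values:
score_gt_50K = (1/3) * (2/3) * (1/3)   # = 2/27, roughly 0.074
score_le_50K = (2/3) * (2/3) * (2/3)   # = 8/27, roughly 0.296

print('>50K' if score_gt_50K > score_le_50K else '<=50K')   # prints '<=50K'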

In practice we of course have many more features and thousands or millions of training examples, but the way Naive Bayes classification works remains the same.

So we need to build a Hash table containing the relative frequencies of the feature values, per class.

Once such a Hash table is made, new entries can be classified by multiplying the probabilities of each feature value, per class.
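For a hypothetical dataset with only two feature columns, such a trained Hash table could look as follows (the numbers are made up); per class, each feature column maps to a dictionary of relative feature-value frequencies:

# Hypothetical trained hash table (the numbers are made up):
nb_dict = {
    '>50K':  {0: {'United-States': 1/3, 'Mexico': 2/3},
              1: {'Exec-managerial': 2/3, 'Adm-clerical': 1/3}},
    '<=50K': {0: {'United-States': 2/3, 'Mexico': 1/3},
              1: {'Exec-managerial': 1/3, 'Adm-clerical': 2/3}},
}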

The code to train a Naive Bayes Classifier looks as follows.


In [1]:
from collections import Counter, defaultdict
import numpy as np

class NaiveBaseClass:
    def calculate_relative_occurences(self, list1):
        no_examples = len(list1)
        ro_dict = dict(Counter(list1))
        for key in ro_dict.keys():
            ro_dict[key] = ro_dict[key] / float(no_examples)
        return ro_dict

    def get_max_value_key(self, d1):
        #return the key with the largest value; in Python 3, dict.values() is a
        #view without an index() method, so we use max() with a key function
        return max(d1, key=d1.get)
       
    def initialize_nb_dict(self):
        self.nb_dict = {}
        for label in self.labels:
            self.nb_dict[label] = defaultdict(list)


class NaiveBayes(NaiveBaseClass):
    """
    Naive Bayes Classifier method:
    It is trained with a 2D array X (dimensions m,n) and a 1D array Y (length m).
    X should have one column per feature (total n) and one row per training example (total m).
    After training, a hash table is filled with the relative frequencies of the feature values, per feature, per class.
    We start with an empty hash table nb_dict, which has the form:

    nb_dict = {
        'class1': {
            'feature1': [],
            'feature2': [],
            (...)
            'featuren': []
        }
        'class2': {
            'feature1': [],
            'feature2': [],
            (...)
            'featuren': []
        }
    }
    """
    
    def train(self, X, Y):
        self.labels = np.unique(Y)
        no_rows, no_cols = np.shape(X)
        self.initialize_nb_dict()
        self.class_probabilities = self.calculate_relative_occurences(Y)
        #iterate over all classes
        for label in self.labels:
            #first we get a list of indices per class, so we can take a subset X_ of the matrix X, containing data of only that class.
            row_indices = np.where(Y == label)[0]
            X_ = X[row_indices, :]

            #in this subset, we iterate over all the columns/features, and add all values of each feature to the hash table nb_dict
            no_rows_, no_cols_ = np.shape(X_)
            for jj in range(0,no_cols_):
                self.nb_dict[label][jj] += list(X_[:, jj])

        #Now we have a hash table containing all occurrences of feature values, per feature, per class.
        #We transform it into a hash table with the relative feature value occurrences per class.
        for label in self.labels:
            for jj in range(0,no_cols):
                self.nb_dict[label][jj] = self.calculate_relative_occurences(self.nb_dict[label][jj])

Once the Naive Bayes Classifier has been trained with the train() method, we can use it to classify new elements:


In [2]:
# classify_single_elem is a method of the NaiveBayes class defined above
def classify_single_elem(self, X_elem):
    Y_dict = {}
    #First we determine the score of each class (class probability times feature probabilities),
    #then we return the class with the highest score
    for label in self.labels:
        class_probability = self.class_probabilities[label]
        for ii in range(0,len(X_elem)):
            relative_feature_values = self.nb_dict[label][ii]
            if X_elem[ii] in relative_feature_values:
                class_probability *= relative_feature_values[X_elem[ii]]
            else:
                #this feature value was never seen for this class during training
                class_probability *= 0
        Y_dict[label] = class_probability
    return self.get_max_value_key(Y_dict)
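As a small usage sketch (with made-up data, and assuming classify_single_elem has been added as a method of the NaiveBayes class above), training and classification then look like this:

In [3]:
# Made-up training data with three feature columns:
# native-country, hours-per-week, occupation.
X = np.array([
    ['United-States', '40', 'Exec-managerial'],
    ['United-States', '40', 'Adm-clerical'],
    ['Mexico',        '60', 'Exec-managerial'],
])
Y = np.array(['<=50K', '<=50K', '>50K'])

nb = NaiveBayes()
nb.train(X, Y)
# classify a new entry; with this toy data the predicted class is '<=50K'
print(nb.classify_single_elem(['United-States', '40', 'Exec-managerial']))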

For the rest of the Python code, including batch classification, the Naive Bayes classification of text data, and worked-out examples of both Classifiers, please have a look at the GitHub repository.

