Visual Categorization with Bags of Keypoints in Shogun

By Abhijeet Kislay (GitHub ID: kislayabhi) as a GSoC'14 project under Kevin Hughes (GitHub ID: pickle27)

This notebook is about performing Object Categorization using SIFT descriptors of keypoints as features, and SVMs to predict the category of the object present in the image. Shogun's K-Means clustering is employed for generating the bag of keypoints and its k-nearest neighbours module is extensively used to construct the feature vectors.

Background

This notebook presents a bag of keypoints approach to visual categorization. A bag of keypoints corresponds to a histogram of the number of occurrences of particular image patterns in a given image. The main advantages of the method are its simplicity, its computational efficiency and its invariance to affine transformations, as well as to occlusion, lighting and intra-class variations.
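
As a toy illustration (the pattern ids below are made up purely to show the idea, they are not part of the pipeline), a bag of keypoints is nothing more than a count vector over a fixed vocabulary of visual words:


In [ ]:
import numpy as np

# hypothetical example: 8 image patches, each already mapped to one of 5 visual words (0..4)
pattern_ids=np.array([0, 2, 2, 1, 4, 2, 0, 4])

# the bag-of-keypoints descriptor is simply the histogram of these assignments
bag=np.bincount(pattern_ids, minlength=5)
print bag   # -> [2 1 3 0 2]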

Strategy

1. Compute (SIFT) descriptors at keypoints in all the template images and pool all of them together

SIFT

SIFT extracts keypoints and computes their descriptors. It involves the following steps:

  • Scale-space Extrema Detection: The Difference of Gaussians (DoG) is used to search for local extrema over scale and space.
  • Keypoint Localization: Once potential keypoints are found, we refine them by eliminating low-contrast keypoints and edge keypoints.
  • Orientation Assignment: An orientation is then assigned to each keypoint to achieve invariance to image rotation.
  • Keypoint Descriptor: Finally, a keypoint descriptor is created. Each keypoint is described by a 128-element vector.

For more details about SIFT in OpenCV, refer to the OpenCV Python documentation.
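
As a minimal, self-contained sketch (the synthetic square image below is just a placeholder; the real template images are used further down), the OpenCV 2.4.x SIFT API used in this notebook boils down to a single detectAndCompute() call that returns the keypoints together with one 128-element descriptor per keypoint:


In [ ]:
import numpy as np
import cv2

# a synthetic grayscale image with a bright square, just so SIFT has some structure to find
toy=np.zeros((200, 200), dtype=np.uint8)
cv2.rectangle(toy, (60, 60), (140, 140), 255, -1)

sift_sketch=cv2.SIFT()
kp, des=sift_sketch.detectAndCompute(toy, None)
# des has shape (number_of_keypoints, 128); it is None if no keypoint was found
print len(kp), None if des is None else des.shape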

OpenCV has a nice API for using SIFT. Let's see what we are looking at:


In [ ]:
#import the OpenCV library
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
try:
    import cv2
except ImportError:
    print "You must have OpenCV installed"
    exit(1)

#check the OpenCV version
try:
    v=cv2.__version__
    assert (tuple(map(int,v.split(".")))>(2,4,2))
except (AssertionError, ValueError):
    print "Install newer version of OpenCV than 2.4.2, i.e from 2.4.3"
    exit(1)
    
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from shogun import *

# get the list of all jpg and png images from the path provided
def get_imlist(path):
    return [[os.path.join(path,f) for f in os.listdir(path) if (f.endswith('.jpg') or f.endswith('.png'))]]

#Use the following function when reading an image through OpenCV and displaying through plt.
def showfig(image, ucmap):
    #There is a difference in pixel ordering in OpenCV and Matplotlib.
    #OpenCV follows BGR order, while matplotlib follows RGB order.
    if len(image.shape)==3 :
        b,g,r = cv2.split(image)       # get b,g,r
        image = cv2.merge([r,g,b])     # switch it to rgb
    imgplot=plt.imshow(image, ucmap)
    imgplot.axes.get_xaxis().set_visible(False)
    imgplot.axes.get_yaxis().set_visible(False)

We construct the vocabulary from a set of template images: three general images belonging to the categories car, plane and train.

OpenCV also provides the cv2.drawKeypoints() function, which draws small circles at the locations of the keypoints. If you pass the flag cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS, it draws a circle with the size of each keypoint and also shows its orientation. See the example below.


In [ ]:
plt.rcParams['figure.figsize'] = 17, 4
filenames=get_imlist(os.path.join(SHOGUN_DATA_DIR, 'SIFT/template/'))
filenames=np.array(filenames)

# for keeping all the descriptors from the template images
descriptor_mat=[]

# initialise OpenCV's SIFT
sift=cv2.SIFT()
fig = plt.figure()
plt.title('SIFT detected Keypoints')
plt.xticks(())
plt.yticks(())
for image_no in xrange(3):
    img=cv2.imread(filenames[0][image_no])
    img=cv2.resize(img, (500, 300), interpolation=cv2.INTER_AREA)
    gray=cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray=cv2.equalizeHist(gray)
    
    #detect the SIFT keypoints and the descriptors.
    kp, des=sift.detectAndCompute(gray,None)
    # store the descriptors.
    descriptor_mat.append(des)
    # here we draw the keypoints
    img=cv2.drawKeypoints(img, kp, flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
    fig.add_subplot(1, 3, image_no+1)
    showfig(img, None)

2. Group similar descriptors into an arbitrary number of clusters.

We take all the descriptors obtained from the three images above and measure the similarity between them. Here, similarity is measured by the Euclidean distance between the 128-element SIFT descriptors. Similar descriptors are clustered into k groups using Shogun's **KMeans class**. These clusters are called bags of keypoints or visual words, and they collectively represent the vocabulary of the program. Each cluster has a cluster center, which can be thought of as the representative descriptor of all the descriptors belonging to that cluster. These cluster centers can be obtained using the **get_cluster_centers()** method.
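
As a small sketch of this API on toy 2-D data (the real pipeline clusters 128-dimensional SIFT descriptors), Shogun's KMeans is driven by a distance object and exposes the learned centers via get_cluster_centers():


In [ ]:
# toy sketch: cluster six 2-D points (columns are samples) into 2 groups
toy_points=np.array([[0.0, 0.2, 0.1, 5.0, 5.2, 4.9],
                     [0.0, 0.1, 0.3, 5.0, 4.8, 5.1]])
toy_features=RealFeatures(toy_points)
toy_distance=EuclideanDistance(toy_features, toy_features)
toy_kmeans=KMeans(2, toy_distance)
toy_kmeans.train()
# one 2-D center per cluster, stored as columns
print toy_kmeans.get_cluster_centers()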

To perform clustering into k groups, we define the get_similar_descriptors() function below.


In [ ]:
def get_similar_descriptors(k, descriptor_mat):

    #stack the descriptors from all template images; Shogun expects samples as columns
    descriptor_mat=np.double(np.vstack(descriptor_mat))
    descriptor_mat=descriptor_mat.T

    #wrap the descriptors as Shogun features
    sg_descriptor_mat_features=RealFeatures(descriptor_mat)

    #EuclideanDistance is used for the distance measurement.
    distance=EuclideanDistance(sg_descriptor_mat_features, sg_descriptor_mat_features)

    #group the descriptors into k clusters.
    kmeans=KMeans(k, distance)
    kmeans.train()

    #get the cluster centers.
    cluster_centers=(kmeans.get_cluster_centers())
    
    return cluster_centers

In [ ]:
cluster_centers=get_similar_descriptors(100, descriptor_mat)

3. Now, compute the training data for the SVM classifiers.

Since we have already constructed the vocabulary, our next step is to generate feature vectors that represent each training image, which we will use for multiclass classification later in the code.

  • We begin by computing SIFT descriptors for each training image.
  • For each training image, associate each of its descriptors with one of the clusters in the vocabulary. The simplest way to do this is the k-Nearest Neighbour approach. This can be done using Shogun's KNN class. The Euclidean distance is used here to find the neighbours.
  • Make a histogram from this association. The histogram has as many bins as there are clusters in the vocabulary. Each bin counts how many descriptors in the training image are associated with the cluster corresponding to that bin. Intuitively, this histogram describes the image in the visual words of the vocabulary, and is called the bag of visual words descriptor of the image.

In short, we approximate each training image by a k-element vector. This can be used to train any multiclass classifier.
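
As a compact sketch of that association step (the 2-D "descriptors" and "centers" below are made up purely for illustration), each descriptor is assigned to its nearest cluster center with Shogun's KNN, and the assignments are counted into a k-bin histogram:


In [ ]:
# toy sketch: 3 cluster centers and 5 descriptors, both 2-D and stored as columns
toy_centers=np.array([[0.0, 5.0, 10.0],
                      [0.0, 5.0, 10.0]])
toy_descriptors=np.array([[0.1, 4.9, 9.8, 0.2, 5.1],
                          [0.2, 5.2, 9.9, 0.1, 4.8]])

toy_dist=EuclideanDistance()
toy_cluster_ids=MulticlassLabels(np.double(range(3)))   # one label per cluster center
toy_knn=KNN(1, toy_dist, toy_cluster_ids)
toy_knn.train(RealFeatures(toy_centers))

assignments=toy_knn.apply_multiclass(RealFeatures(toy_descriptors)).get_labels()
# the bag-of-keypoints histogram simply counts descriptors per cluster
print np.bincount(assignments.astype(int), minlength=3)   # -> [2 2 1]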

First, let us look at a few training images.


In [ ]:
# name of all the folders together
folders=['cars','planes','trains']
training_sample=[]
for folder in folders:
    #get all the training images from a particular class 
    filenames=get_imlist(os.path.join(SHOGUN_DATA_DIR, 'SIFT/%s'%folder))
    for i in xrange(10):
        temp=cv2.imread(filenames[0][i])
        training_sample.append(temp)

plt.rcParams['figure.figsize']=21,16
fig=plt.figure()
plt.xticks(())
plt.yticks(())
plt.title('10 training images for each class')
for image_no in xrange(30):
    fig.add_subplot(6,5, image_no+1)
    showfig(training_sample[image_no], None)

Here we define the get_sift_training() function, which collects all the SIFT descriptors present in the training images.


In [ ]:
def get_sift_training():
    
    # name of all the folders together
    folders=['cars','planes','trains']
    
    folder_number=-1
    des_training=[]
      
    for folder in folders:
        folder_number+=1

        #get all the training images from a particular class 
        filenames=get_imlist(os.path.join(SHOGUN_DATA_DIR, 'SIFT/%s'%folder))
        filenames=np.array(filenames)
        
        des_per_folder=[]
        for image_name in filenames[0]:
            img=cv2.imread(image_name)

            # carry out normal preprocessing routines
            gray= cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
            gray=cv2.resize(gray, (500, 300), interpolation=cv2.INTER_AREA)
            gray=cv2.equalizeHist(gray)

            #get all the SIFT descriptors for an image
            _, des=sift.detectAndCompute(gray, None)
            des_per_folder.append(des)
    
        des_training.append(des_per_folder)
    return des_training

In [ ]:
descriptor_training=get_sift_training()

We define the compute_training_data() function which returns the training data required for multiclass classification in the later stages.

Inputs are:

  • k=number of clusters

  • cluster_centers=the cluster centers obtained earlier; each one is a representative descriptor for all descriptors belonging to its cluster

  • descriptors=SIFT descriptors of the training images, obtained from the function above.


In [ ]:
def compute_training_data(k, cluster_centers, descriptors):
    
    # a list to hold histograms of all the training images
    all_histograms=[]
    # labels for all of the training images
    final_labels=[]
    # to hold the cluster number a descriptor belongs to
    cluster_labels=[]

    #initialize a KNN in Shogun
    dist=EuclideanDistance()
    labels=MulticlassLabels(np.double(range(k)))
    knn=KNN(1, dist, labels)

    #Target descriptors are the cluster_centers that we got earlier. 
    #All the descriptors of an image are matched against these for 
    #calculating the histogram.
    sg_cluster_centers=RealFeatures(cluster_centers)
    knn.train(sg_cluster_centers)

    # name of all the folders together
    folders=['cars','planes','trains']
    folder_number=-1

    for folder in folders:
        folder_number+=1

        #get all the training images from a particular class 
        filenames=get_imlist(os.path.join(SHOGUN_DATA_DIR, 'SIFT/%s'%folder))

        for image_name in xrange(len(filenames[0])):
            
            des=descriptors[folder_number][image_name]
            
            #Shogun works in a way in which columns are samples and rows are features.
            #Hence we need to transpose the observation matrix
            des=(np.double(des)).T
            sg_des=RealFeatures(np.array(des))

            #find all the labels of cluster_centers that are nearest to the descriptors present in the current image. 
            cluster_labels=(knn.apply_multiclass(sg_des)).get_labels()

            histogram_per_image=[]
            for i in xrange(k):
                #find the histogram for the current image
                histogram_per_image.append(sum(cluster_labels==i))

            all_histograms.append(np.array(histogram_per_image))
            final_labels.append(folder_number)

    # we now have the training features(all_histograms) and labels(final_labels) 
    all_histograms=np.array(all_histograms)
    final_labels=np.array(final_labels)
    return all_histograms, final_labels, knn

In [ ]:
all_histograms, final_labels, knn=compute_training_data(100, cluster_centers, descriptor_training)

We have to solve a multiclass classification problem here. In Shogun, multiclass classifiers are implemented under MulticlassMachine.

4. We train a multiclass SVM for the object categories using the training data:

The following function returns a trained GMNPSVM, Shogun's true multiclass SVM. Inputs are:

  • all_histograms=the feature vectors (bag-of-keypoints histograms) of the images on which the SVM is trained.
  • final_labels=the labels corresponding to the above feature vectors

In [ ]:
def train_svm(all_histograms, final_labels):
    
    # we will use GMNPSVM class of Shogun for one vs rest multiclass classification
    obs_matrix=np.double(all_histograms.T)
    sg_features=RealFeatures(obs_matrix)
    sg_labels=MulticlassLabels(np.double(final_labels))
    kernel=LinearKernel(sg_features, sg_features)
    C=1
    gsvm=GMNPSVM(C, kernel, sg_labels)
    _=gsvm.train(sg_features)
    return gsvm

In [ ]:
gsvm=train_svm(all_histograms, final_labels)
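
GMNPSVM is just one of Shogun's multiclass machines. As a hedged sketch (assuming MulticlassLibLinear is available in your Shogun build), any other MulticlassMachine could be trained on the same bag-of-keypoints histograms:


In [ ]:
# optional sketch: a linear multiclass machine on the same training histograms
sketch_features=RealFeatures(np.double(all_histograms.T))
sketch_labels=MulticlassLabels(np.double(final_labels))
lin_machine=MulticlassLibLinear(1.0, sketch_features, sketch_labels)
lin_machine.train()
# predictions on the training histograms themselves, just to show the API
print lin_machine.apply_multiclass(sketch_features).get_labels()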

5. Now, classify by using the trained SVM:

First, let us look at all the test images.


In [ ]:
# Let's look at the testing images
testing_sample=[]
#get all the testing images  
filenames=get_imlist(os.path.join(SHOGUN_DATA_DIR, 'SIFT/test_image/'))
for i in xrange(len(filenames[0])):
    temp=cv2.imread(filenames[0][i])
    testing_sample.append(temp)

plt.rcParams['figure.figsize']=20,8
fig=plt.figure()
plt.xticks(())
plt.yticks(())
plt.title('Test Images')
for image_no in xrange(len(filenames[0])):
    fig.add_subplot(3,8, image_no+1)
    showfig(testing_sample[image_no], None)

We define the function get_sift_testing() which returns all the descriptors present in the testing images.


In [ ]:
def get_sift_testing():
    filenames=get_imlist(os.path.join(SHOGUN_DATA_DIR, 'SIFT/test_image/'))
    filenames=np.array(filenames)
    des_testing=[]
    for image_name in filenames[0]:
        #read the test image
        img=cv2.imread(image_name)

        #follow the normal preprocessing routines 
        gray= cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
        gray=cv2.resize(gray, (500, 300), interpolation=cv2.INTER_AREA)
        gray=cv2.equalizeHist(gray)

        #compute all the descriptors of the test images
        _, des=sift.detectAndCompute(gray, None)
        des_testing.append(des)
    return des_testing

In [ ]:
descriptor_testing=get_sift_testing()

In the following classify_svm() function, we use the trained GMNPSVM for classifying the test images. It returns the predictions from our trained SVM.


In [ ]:
def classify_svm(k, knn, des_testing):
    
    # a list to hold histograms of all the test images
    all_histograms=[]
    filenames=get_imlist(os.path.join(SHOGUN_DATA_DIR, 'SIFT/test_image/'))
    
    for image_name in xrange(len(filenames[0])):
        
        des=des_testing[image_name]
        
        #Shogun works in a way in which columns are samples and rows are features.
        #Hence we need to transpose the observation matrix
        des=(np.double(des)).T
        sg_des=RealFeatures(np.array(des))

        #cluster all the above found descriptors into the vocabulary
        cluster_labels=(knn.apply_multiclass(sg_des)).get_labels()

        #get the histogram for the current test image
        histogram=[]
        for i in xrange(k):
            histogram.append(sum(cluster_labels==i))
        
        all_histograms.append(np.array(histogram))

    all_histograms=np.double(np.array(all_histograms))
    all_histograms=all_histograms.T
    sg_testfeatures=RealFeatures(all_histograms)
    return gsvm.apply(sg_testfeatures).get_labels()

In [ ]:
predicted=classify_svm(100, knn, descriptor_testing)
print "the predicted labels for k=100 are as follows: "
print predicted

6. Selecting the classifier that gives the best overall classification accuracy with respect to the number of clusters (k):

We define the function create_conf_matrix() which creates the confusion matrix.

Inputs are:

  • expected=the actual labels of the test images
  • predicted=the output of our SVM
  • n_classes=number of classes (here 3, i.e. cars, planes and trains)

In [ ]:
def create_conf_matrix(expected, predicted, n_classes):
    m = [[0] * n_classes for i in range(n_classes)]
    for pred, exp in zip(predicted, expected):
        m[exp][int(pred)] += 1
    return np.array(m)
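
As a quick sanity check on hypothetical counts (not results from this notebook), the overall accuracy is simply the diagonal of such a matrix divided by the total number of test images:


In [ ]:
# hypothetical 3x3 confusion matrix: rows = expected class, columns = predicted class
m_example=np.array([[7, 1, 0],
                    [0, 8, 1],
                    [1, 0, 2]])
accuracy_example=np.trace(m_example)/float(m_example.sum())
print "accuracy = %.2f"%accuracy_example   # (7+8+2)/20 = 0.85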

Form the expected list.

  • 0 represents cars
  • 1 represents planes
  • 2 represents trains

In [ ]:
filenames=get_imlist(os.path.join(SHOGUN_DATA_DIR, 'SIFT/test_image/'))
# extract the image number embedded in each test filename; it is used to look up the expected label
formation=([int(''.join(x for x in filename if x.isdigit())) for filename in filenames[0]])
    
# associate them with the correct labels by making a dictionary
keys=range(len(filenames[0]))

values=[0,1,0,2,1,0,1,0,0,0,1,2,2,2,2,1,1,1,1,1]
label_dict=dict(zip(keys, values))

# the following list holds the actual labels
expected=[]
for i in formation:
    expected.append(label_dict[i-1])

We extend all the steps that we did for k=100 to a few other values of k and check their accuracies against the expected labels. Alongside, we also compute their respective confusion matrices.


In [ ]:
best_k=1
max_accuracy=0

for k in [100, 200, 300, 400]:
    
    # step 2
    cluster_centers=get_similar_descriptors(k, descriptor_mat)
    
    # step 3
    all_histograms, final_labels, knn=compute_training_data(k, cluster_centers, descriptor_training)
    
    # step 4
    gsvm=train_svm(all_histograms, final_labels)
    
    # step 5
    predicted=classify_svm(k, knn, descriptor_testing)
    accuracy=sum(predicted==expected)*100/float(len(expected))
    print "for a k=%d, accuracy is %d%%"%(k, accuracy)
    
    #step 6
    m=create_conf_matrix(expected, predicted, 3)

    if accuracy>max_accuracy:
        best_k=k
        max_accuracy=accuracy
        best_prediction=predicted
    
    print "confusion matrix for k=%d"%k
    print m

From all the above values of k we choose the one with the best accuracy. The range of k can be extended further to try to improve the overall accuracy.

Test images along with their predicted labels are shown below for the best value of k:


In [ ]:
plt.rcParams['figure.figsize']=20,8
fig=plt.figure()
for image_no in xrange(len(filenames[0])):
    fig.add_subplot(3,8, image_no+1)
    plt.title('pred. class: '+folders[int(best_prediction[image_no])])
    showfig(testing_sample[image_no], None)

Conclusion

Here we have presented a simple but novel approach to generic visual categorization using feature vectors constructed from clustered descriptors of image patches.

References:

  • Visual Categorization with Bags of Keypoints by Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, Cédric Bray

  • Distinctive Image Features from Scale-Invariant Keypoints by David G. Lowe

  • Practical OpenCV by Samarth Brahmbhatt, University of Pennsylvania