The video above consists of several multiple-choice questions, all presented in a rectangle. Each question has a keyword in its body, and different questions have different keywords. My task is to retrieve all frames that display a wanted question, based on a keyword I input. After that, I select one concrete frame that shows both the wanted question and its associated answers. For example, the output I expect when I use Wine as the keyword is shown below.

Output video:

Output image:


In [1]:
%load_ext autotime

In [2]:
import os
import cv2
from matplotlib import pyplot as plt
import numpy as np
from tqdm.notebook import tqdm


time: 2.53 s

In [3]:
def generate_image_path(index, folder):
    return f"{folder}/frame_{index}.jpg"


time: 542 µs

I assume that readers have installed OpenCV, Tesseract, PyTorch, and the other libraries, and have cloned CRAFT-pytorch into the current working folder.


In [36]:
# notebook parameters

args = {
    'canvas_size': 1280, 
    'cuda': False, 
    'link_threshold': 0.4, 
    'low_text': 0.4, 
    'mag_ratio': 1.5, 
    'poly': False, 
    'refine': False,
    'show_time': False, 
    "test_folder": './figures', 
    'text_threshold': 0.7, 
    'trained_model': './weights/craft_mlt_25k.pth',
    'video': './video.mp4',
    'sampling_rate': 17,
    'keyword': 'wine',
    'word_similarity': 0.7
}


time: 985 µs

analysis

text detection & recognition

It is natural to use a text detection approach to extract text, then use keyword matching to find the targeted question. In the literature, text detection achieves good performance with CNN-based methods that stack consecutive convolution blocks with a classifier on the top layer to predict text areas. EAST has been a popular method since 2017. More recently, a newer technique called CRAFT was introduced with a significant improvement, especially on images captured under highly variable environmental conditions.

EAST: https://arxiv.org/abs/1704.03155

CRAFT: https://arxiv.org/abs/1904.01941

I ran a quick evaluation to determine which method is more appropriate for my task. Below is my comparison of the output of each method, which shows that CRAFT outperformed EAST.

Once text areas are detected, I feed the bounding boxes through text recognition. Text recognition can be done with a CNN+LSTM (e.g., https://arxiv.org/pdf/1507.05717.pdf). In particular, I use Tesseract's implementation.

Tesseract's Architecture: https://tesseract-ocr.github.io/docs/das_tutorial2016/2ArchitectureAndDataStructures.pdf
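
The keyword matching step can be sketched as below. This is only a minimal illustration, assuming the recognized text is available as a plain string; the helper names `word_similarity` and `contains_keyword` are hypothetical, and `difflib.SequenceMatcher` is my stand-in for whatever similarity measure is actually used. The 0.7 threshold mirrors the `word_similarity` notebook parameter.

```python
from difflib import SequenceMatcher


def word_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two words, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def contains_keyword(ocr_text: str, keyword: str, threshold: float = 0.7) -> bool:
    """True if any whitespace-separated token of ocr_text is similar enough to keyword."""
    # strip common punctuation so "wine?" is compared as "wine"
    tokens = (token.strip(".,:;!?()\"'") for token in ocr_text.split())
    return any(
        word_similarity(token, keyword) >= threshold for token in tokens if token
    )


# OCR often confuses characters, e.g. "i" read as "1"
print(contains_keyword("Which w1ne pairs best with cheese?", "wine"))
print(contains_keyword("What is the capital of France?", "wine"))
```

A fuzzy match is used rather than an exact comparison because recognition errors are common on frames with a busy background.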

histogram difference

Some observations can help:

  1. A segment can be divided into 2 phases, one of which is when no answer has been given yet. There are short periods, meant to give time for responses, in which the box containing the question doesn't change much in size or color.
  2. Detecting the rectangle and its size can track the phases and potentially help finalize a frame.
  3. The question box is semi-transparent and strongly influenced by the background colors, which makes it hard for image processing techniques such as the Canny edge detector and the Hough transform.
  4. Following point #1, we can leverage the difference between frames.

sampling_video writes every frame of the video to disk and records every freq-th frame (default is 17) as a sample. It returns the sampled frames together with the total number of frames.


In [5]:
def sampling_video(video_path: str, extracted_frames_path: str, freq: int = 17) -> tuple:
    # make sure the output folder exists; cv2.imwrite does not create it
    os.makedirs(extracted_frames_path, exist_ok=True)
    vidcap = cv2.VideoCapture(video_path)
    success, image = vidcap.read()
    count = 0
    frames_dict = {}  # sampled frame index -> image path
    while success:
        file_name = f"{extracted_frames_path}/frame_{count}.jpg"
        if count % freq == 0:
            frames_dict[count] = file_name
        cv2.imwrite(file_name, image)    # save every frame as a JPEG file

        success, image = vidcap.read()
        count += 1

    vidcap.release()
    return frames_dict, count

image_paths, total_frames = sampling_video(args['video'], args['test_folder'], args['sampling_rate'])


time: 17.5 s

explore observation #4


In [6]:
seed_list = [f"{args['test_folder']}/frame_{i}.jpg" for i in range(50,100, 10)]

n = len(seed_list)
color = ('b','g','r')

fig, axs = plt.subplots(2, n, figsize=(30, 20))

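for i in range(n):
    m = cv2.imread(seed_list[i])
    hsv = cv2.cvtColor(m, cv2.COLOR_BGR2HSV)

    # one curve per HSV channel: blue = H, green = S, red = V
    for j, col in enumerate(color):
        histr = cv2.calcHist([hsv], [j], None, [256], [0, 256])
        axs[0][i].plot(histr, color=col)
    axs[0][i].set_title(seed_list[i])
    # OpenCV loads images as BGR; convert to RGB so matplotlib shows true colors
    axs[1][i].imshow(cv2.cvtColor(m, cv2.COLOR_BGR2RGB))
plt.show()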


time: 2.05 s

Use the hue histogram to measure the difference between frames, examining frames whose indexes run from 48 to 103. Each frame i is compared with frame i+3.


In [7]:
diffs = []

start = 48
end = 103


def hist_diff(img1_path, img2_path):
    # Load the images
    img1 = cv2.imread(img1_path)
    img2 = cv2.imread(img2_path)

    # Convert it to HSV
    img1_hsv = cv2.cvtColor(img1, cv2.COLOR_BGR2HSV)
    img2_hsv = cv2.cvtColor(img2, cv2.COLOR_BGR2HSV)

    # Calculate the hue histogram and normalize it
    # (8-bit hue in OpenCV spans 0-179, so bins above 179 stay empty)
    hist_img1 = cv2.calcHist([img1_hsv], [0], None, [256], [0, 256])
    cv2.normalize(hist_img1, hist_img1, alpha=0, beta=1, norm_type=cv2.NORM_MINMAX)
    hist_img2 = cv2.calcHist([img2_hsv], [0], None, [256], [0, 256])
    cv2.normalize(hist_img2, hist_img2, alpha=0, beta=1, norm_type=cv2.NORM_MINMAX)

    # Hellinger distance: 0 for identical histograms, 1 for no overlap
    metric_val = cv2.compareHist(hist_img1, hist_img2, cv2.HISTCMP_HELLINGER)
    
    return metric_val

for i in tqdm(range(start, end)):
    img1 = generate_image_path(i, args['test_folder'])
    img2 = generate_image_path(i+3, args['test_folder'])
    diff = hist_diff(img1, img2)
    diffs.append(diff)
    
plt.plot(range(start, end), diffs)



Out[7]:
[<matplotlib.lines.Line2D at 0x123cf0eb8>]
time: 2.26 s
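
To make the metric less of a black box, here is a NumPy sketch of the Hellinger distance as I understand it from the formula in the OpenCV documentation for HISTCMP_BHATTACHARYYA (of which HISTCMP_HELLINGER is a synonym). The function name `hellinger_distance` is hypothetical, returning 1.0 for empty histograms is my own convention, and I have not checked that the result matches cv2.compareHist bit for bit.

```python
import numpy as np


def hellinger_distance(h1, h2) -> float:
    """Hellinger distance between two histograms: 0 = identical shape, 1 = no overlap."""
    h1 = np.asarray(h1, dtype=float).ravel()
    h2 = np.asarray(h2, dtype=float).ravel()
    norm = np.sqrt(h1.sum() * h2.sum())
    if norm == 0:
        return 1.0  # assumption: treat an empty histogram as maximally different
    # Bhattacharyya coefficient, normalized so the overall scale does not matter
    bc = np.sum(np.sqrt(h1 * h2)) / norm
    # clamp at 0 to guard against tiny negative values from rounding
    return float(np.sqrt(max(1.0 - bc, 0.0)))


print(hellinger_distance([1, 0], [0, 1]))      # disjoint histograms
print(hellinger_distance([1, 2, 3], [2, 4, 6]))  # same shape, different scale
```

Because the coefficient is normalized by the histogram sums, two frames with the same hue distribution score close to 0 regardless of how the histograms were scaled.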

Let me smooth the histogram-difference curve with a moving average.


In [8]:
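def moving_average(a, n=5):
    # running sum; subtracting the sum shifted by n leaves the sum of each window
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    # drop the first n-1 entries, which do not cover a full window
    return ret[n - 1:] / n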

avg_diffs = moving_average(diffs, 5)

plt.plot(avg_diffs)


Out[8]:
[<matplotlib.lines.Line2D at 0x1238ff2e8>]
time: 110 ms
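
One detail worth keeping in mind is that the smoothed curve is shorter than the input, so its x-axis no longer lines up with frame indexes directly. The sketch below repeats moving_average on a toy signal (the values are made up for illustration) and compares it with np.convolve in "valid" mode, which I believe computes the same windows.

```python
import numpy as np


def moving_average(a, n=5):
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n


signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # toy data, not real diffs
smoothed = moving_average(signal, n=3)
reference = np.convolve(signal, np.ones(3) / 3, mode="valid")

# len(smoothed) == len(signal) - n + 1, and smoothed[k] averages signal[k:k+n]
print(len(signal), len(smoothed))
print(np.allclose(smoothed, reference))
```

Applied to this notebook, avg_diffs[k] averages diffs[k] through diffs[k+4], which correspond to frames start+k through start+k+4.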

A small histogram difference reflects short periods in which consecutive frames stay the same; otherwise, an event occurs that changes the image details dramatically (red pixels increase as an answer is checked, or white pixels decrease as the rectangle shrinks).

Recalling observation #1, my target frame lies in the second flat phase of the histogram difference. Local minima can be leveraged to spot the frame of interest.
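
One way to locate the flat phases automatically is to look for runs where the smoothed difference stays under a threshold. The sketch below is a simple illustration of that idea, not the method used in the rest of this notebook: the helper names `flat_phases` and `pick_frame`, the threshold value, the minimum run length, and the toy signal are all assumptions of mine.

```python
import numpy as np


def flat_phases(signal, threshold, min_length=3):
    """Return (start, end) index pairs, end exclusive, of runs where signal < threshold."""
    below = np.asarray(signal) < threshold
    phases, run_start = [], None
    for i, is_low in enumerate(below):
        if is_low and run_start is None:
            run_start = i  # a new low run begins
        elif not is_low and run_start is not None:
            if i - run_start >= min_length:
                phases.append((run_start, i))
            run_start = None
    # close a run that lasts until the end of the signal
    if run_start is not None and len(below) - run_start >= min_length:
        phases.append((run_start, len(below)))
    return phases


def pick_frame(signal, threshold, phase_index=1, offset=0):
    """Return the middle index of the chosen flat phase, shifted by offset, or None."""
    phases = flat_phases(signal, threshold)
    if len(phases) <= phase_index:
        return None
    start, end = phases[phase_index]
    return offset + (start + end) // 2


# toy difference curve: two flat phases separated by a burst of change
toy = [0.5, 0.05, 0.04, 0.05, 0.6, 0.7, 0.03, 0.02, 0.03, 0.04, 0.5]
print(flat_phases(toy, threshold=0.1))
print(pick_frame(toy, threshold=0.1, phase_index=1, offset=48))
```

Here `offset` stands for the frame index of the first entry of the curve, and `phase_index=1` selects the second flat phase. A suitable threshold would have to be read off the plot for each video.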

Double-check the frames in the second flat phase of the histogram difference to make sure there is NO SIGNIFICANT DIFFERENCE between them, which means I can use an arbitrary frame in this range.


In [9]:
seed_list = [f"{args['test_folder']}/frame_{i}.jpg" for i in range(95,100)]
n = len(seed_list)

fig, axs = plt.subplots(1, n, figsize=(25, 25))

for i in range(n):
    m = cv2.imread(seed_list[i])
    axs[i].set_title(seed_list[i])
    # OpenCV loads images as BGR; convert to RGB so matplotlib shows true colors
    axs[i].imshow(cv2.cvtColor(m, cv2.COLOR_BGR2RGB))
plt.show()