The video above consists of several multiple-choice questions, each presented in a rectangle. Each question includes a keyword in its body, and different questions have different keywords. I define the task as retrieving all frames that display a wanted question, given a keyword as input. From those, I then select a concrete frame that exhibits the wanted question together with its associated answers. For example, if I use Wine as the keyword, my expected result is below.
Output video:
Output image:
In [1]:
%load_ext autotime
In [2]:
import os
import cv2
from matplotlib import pyplot as plt
import numpy as np
from tqdm.notebook import tqdm
In [3]:
def generate_image_path(index, folder):
    return f"{folder}/frame_{index}.jpg"
I assume that readers have installed OpenCV, Tesseract, PyTorch, and the other required libraries, and have cloned CRAFT-pytorch into the current working folder.
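For completeness, a minimal setup sketch (the exact package list is my assumption; adjust it to your environment):
In [ ]:
!pip install opencv-python pytesseract torch torchvision tqdm matplotlib numpy
!git clone https://github.com/clovaai/CRAFT-pytorch.git
# download craft_mlt_25k.pth following the repo's README and place it in ./weights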
In [36]:
# notebook parameters
args = {
    'canvas_size': 1280,
    'cuda': False,
    'link_threshold': 0.4,
    'low_text': 0.4,
    'mag_ratio': 1.5,
    'poly': False,
    'refine': False,
    'show_time': False,
    'test_folder': './figures',
    'text_threshold': 0.7,
    'trained_model': './weights/craft_mlt_25k.pth',
    'video': './video.mp4',
    'sampling_rate': 17,
    'keyword': 'wine',
    'word_similarity': 0.7
}
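The last two parameters drive the retrieval step: a recognized word counts as a hit when its similarity to args['keyword'] reaches word_similarity. A minimal sketch of such a matcher using difflib (is_keyword_match is a hypothetical helper for illustration, not necessarily the exact matcher used later):
In [ ]:
from difflib import SequenceMatcher

def is_keyword_match(word: str, keyword: str, threshold: float) -> bool:
    # similarity ratio in [0, 1]; lowercase both sides so 'Wine' matches 'wine'
    return SequenceMatcher(None, word.lower(), keyword.lower()).ratio() >= threshold

is_keyword_match('Wines', args['keyword'], args['word_similarity'])  # -> True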
It is natural to use a text detection approach to extract text, then apply keyword matching to find the targeted question. In the literature, text detection achieves good performance with CNN-based methods that stack consecutive convolution blocks with a classifier on the top layer to predict text areas. EAST has been a popular method since 2017. More recently, a new technique named CRAFT was introduced with a significant improvement, especially on images captured under highly variable environmental conditions.
I ran a quick evaluation to determine which method is more appropriate for my task. Below is a comparison of the output of each method, which shows that CRAFT outperformed EAST.
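For reference, running CRAFT on a frame roughly follows the cloned repo's test.py. A sketch under that assumption (CRAFT, imgproc, and craft_utils come from clovaai/CRAFT-pytorch; the repo's test.py parses CLI arguments on import, so I inline the essential steps rather than importing them):
In [ ]:
import torch
from collections import OrderedDict
from craft import CRAFT          # model definition from the cloned repo
import imgproc, craft_utils      # pre/post-processing helpers from the repo

def copy_state_dict(state_dict):
    # strip the 'module.' prefix left by DataParallel checkpoints
    return OrderedDict((k[7:] if k.startswith('module.') else k, v)
                       for k, v in state_dict.items())

net = CRAFT()
net.load_state_dict(copy_state_dict(torch.load(args['trained_model'], map_location='cpu')))
net.eval()

def detect_text(image_path):
    img = imgproc.loadImage(image_path)
    img_resized, target_ratio, _ = imgproc.resize_aspect_ratio(
        img, args['canvas_size'], interpolation=cv2.INTER_LINEAR, mag_ratio=args['mag_ratio'])
    x = torch.from_numpy(imgproc.normalizeMeanVariance(img_resized)).permute(2, 0, 1).unsqueeze(0)
    with torch.no_grad():
        y, _ = net(x)
    score_text = y[0, :, :, 0].cpu().numpy()   # per-pixel character score
    score_link = y[0, :, :, 1].cpu().numpy()   # per-pixel affinity score
    boxes, _ = craft_utils.getDetBoxes(score_text, score_link, args['text_threshold'],
                                       args['link_threshold'], args['low_text'], args['poly'])
    # map box coordinates back to the original image scale
    return craft_utils.adjustResultCoordinates(boxes, 1 / target_ratio, 1 / target_ratio)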
Once text areas are detected, I need to feed the bounding boxes through text recognition. Text recognition can be done via CNN+LSTM models (e.g., CRNN: https://arxiv.org/pdf/1507.05717.pdf). In particular, I use Tesseract's implementation.
Tesseract's Architecture: https://tesseract-ocr.github.io/docs/das_tutorial2016/2ArchitectureAndDataStructures.pdf
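With boxes in hand, each region can be cropped and passed to Tesseract. A sketch using the standard pytesseract wrapper (recognize_box is a hypothetical helper; --psm 7 tells Tesseract to treat the crop as a single text line):
In [ ]:
import pytesseract

def recognize_box(image, box):
    # box: 4 corner points from CRAFT; take its axis-aligned bounding rectangle
    xs, ys = box[:, 0], box[:, 1]
    crop = image[int(ys.min()):int(ys.max()), int(xs.min()):int(xs.max())]
    return pytesseract.image_to_string(crop, config='--psm 7').strip()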
There are some observations that can help.
sampling_video saves frames from the video and records every freq-th frame (default 17) in an index dict.
In [5]:
def sampling_video(video_path: str, extracted_frames_path: str, freq: int = 17) -> tuple:
    vidcap = cv2.VideoCapture(video_path)
    success, image = vidcap.read()
    count = 0
    frames_dict = {}
    while success:
        file_name = f"{extracted_frames_path}/frame_{count}.jpg"
        if count % freq == 0:
            frames_dict[count] = file_name
        cv2.imwrite(file_name, image)  # save every frame as a JPEG file
        success, image = vidcap.read()
        count += 1
    vidcap.release()
    return frames_dict, count
image_paths, total_frames = sampling_video(args['video'], args['test_folder'], args['sampling_rate'])
In [6]:
seed_list = [f"{args['test_folder']}/frame_{i}.jpg" for i in range(50, 100, 10)]
n = len(seed_list)
color = ('b', 'g', 'r')  # plot colors for the H, S, V channels
fig, axs = plt.subplots(2, n, figsize=(30, 20))
for i in range(1, n + 1):
    m = cv2.imread(seed_list[i - 1])
    hsv = cv2.cvtColor(m, cv2.COLOR_BGR2HSV)
    for j, col in enumerate(color):
        histr = cv2.calcHist([hsv], [j], None, [256], [0, 256])
        axs[0][i - 1].plot(histr, color=col)
    axs[0][i - 1].set_title(seed_list[i - 1])
    axs[1][i - 1].imshow(cv2.cvtColor(m, cv2.COLOR_BGR2RGB))  # matplotlib expects RGB
plt.show()
Use histogram differences to measure the change between frames; examine the frames whose indexes run from 48 to 103.
In [7]:
diffs = []
start = 48
end = 103

def hist_diff(img1_path, img2_path):
    # Load the images
    img1 = cv2.imread(img1_path)
    img2 = cv2.imread(img2_path)
    # Convert them to HSV
    img1_hsv = cv2.cvtColor(img1, cv2.COLOR_BGR2HSV)
    img2_hsv = cv2.cvtColor(img2, cv2.COLOR_BGR2HSV)
    # Calculate the hue histograms and normalize them
    hist_img1 = cv2.calcHist([img1_hsv], [0], None, [256], [0, 256])
    cv2.normalize(hist_img1, hist_img1, alpha=0, beta=1, norm_type=cv2.NORM_MINMAX)
    hist_img2 = cv2.calcHist([img2_hsv], [0], None, [256], [0, 256])
    cv2.normalize(hist_img2, hist_img2, alpha=0, beta=1, norm_type=cv2.NORM_MINMAX)
    # Hellinger distance between the histograms: 0 = identical, 1 = disjoint
    metric_val = cv2.compareHist(hist_img1, hist_img2, cv2.HISTCMP_HELLINGER)
    return metric_val

for i in tqdm(range(start, end)):
    img1 = generate_image_path(i, args['test_folder'])
    img2 = generate_image_path(i + 3, args['test_folder'])
    diff = hist_diff(img1, img2)
    diffs.append(diff)
plt.plot(range(start, end), diffs)
Out[7]:
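For reference, OpenCV's HISTCMP_HELLINGER (the same computation as HISTCMP_BHATTACHARYYA) is

$$d(H_1, H_2) = \sqrt{1 - \frac{1}{\sqrt{\bar{H_1}\,\bar{H_2}\,N^2}} \sum_I \sqrt{H_1(I)\,H_2(I)}},$$

where N is the number of bins, so identical frames score near 0 and very different frames approach 1.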
Let me smooth the curve with a moving average.
In [8]:
def moving_average(a, n=5):
    # simple moving average computed via a cumulative sum
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

avg_diffs = moving_average(diffs, 5)
plt.plot(avg_diffs)
Out[8]:
A small histogram difference reflects a short period where consecutive frames stay the same; otherwise, an event occurs that changes image details dramatically (red pixels increase as an answer is checked, or white pixels decrease as the rectangle shrinks).
Recalling observation #1, my target frame is located in the second flat phase of the histogram difference. Local minima can be leveraged to spot the frame of interest, as sketched below.
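A minimal sketch of picking those local minima with NumPy (local_minima is a hypothetical helper; scipy.signal.argrelmin would also work):
In [ ]:
def local_minima(values, offset=0):
    # indexes of interior points strictly lower than both neighbors;
    # offset maps positions back to frame indexes (approximately,
    # ignoring the shift introduced by the moving-average window)
    v = np.asarray(values)
    idx = np.where((v[1:-1] < v[:-2]) & (v[1:-1] < v[2:]))[0] + 1
    return idx + offset

candidate_frames = local_minima(avg_diffs, offset=start)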
Double-check the frames in the second flat phase of the histogram difference to confirm there is NO SIGNIFICANT DIFFERENCE; that being so, I can use an arbitrary frame in this area.
In [9]:
seed_list = [f"{args['test_folder']}/frame_{i}.jpg" for i in range(95, 100)]
n = len(seed_list)
fig, axs = plt.subplots(1, n, figsize=(25, 25))
for i in range(1, n + 1):
    m = cv2.imread(seed_list[i - 1])
    axs[i - 1].set_title(seed_list[i - 1])
    axs[i - 1].imshow(cv2.cvtColor(m, cv2.COLOR_BGR2RGB))  # matplotlib expects RGB
plt.show()