Scanner walkthrough

To explore how Scanner fits into a bigger pipeline, we're going to walk through a simple video analysis application. A common unit of analysis for film is the shot: a short segment of video, often delineated by the camera cutting to a different angle or location. In this walkthrough, we're going to use Scanner to implement shot segmentation, i.e. breaking a video up into shots. To start, we need a video. We'll use a scene from Baby Driver:


In [ ]:
%%html
<video width="560" height="315" src="https://storage.googleapis.com/scanner-data/public/sample-clip.mp4?ignore_cache=1" controls />

We've set up some scripts to help you download the video in the snippet below.


In [ ]:
import util
path = util.download_video()
print(path)

# Read all the frames
%matplotlib inline
import matplotlib.pyplot as plt
import cv2
from timeit import default_timer as now

print('Reading frames from video...')
start = now()
video = cv2.VideoCapture(path)
frames = []
while True:
    ret, frame = video.read()
    if not ret: break
    frames.append(frame)
print('Read {} frames'.format(len(frames)))
video.release()
read_frame_time = now() - start
print('Time to read frames: {:.3f}s'.format(read_frame_time))

# Display frame 10 (OpenCV decodes frames as BGR; convert to RGB for matplotlib)
plt.imshow(cv2.cvtColor(frames[10], cv2.COLOR_BGR2RGB))
_ = plt.axis('off')

Take another look at the video and see if you can identify when shots change. Our shot segmentation algorithm uses the following intuition: in a video, each frame is usually similar to the one that follows it. Because most shot changes happen with cuts (as opposed to dissolves or fades), there's an immediate visual break from one frame to the next. We want to identify when the change in visual content between two adjacent frames is substantially larger than normal. One way to estimate the change in visual content is to compute a color histogram for each frame, i.e. count how many pixels fall into each intensity bin for each color channel (red/green/blue), and then measure the distance between adjacent frames' histograms. Let's visualize this for the above video:


In [ ]:
import numpy as np
from scipy.spatial import distance
from tqdm import tqdm

histograms = []
N = len(frames)

# Compute 3 color histograms (one for each channel) for each video frame
print('Computing color histograms...')
start = now()
for frame in tqdm(frames):
    # calcHist returns a (bins, 1) array; flatten to 1-D so the scipy distance functions accept it
    hists = [cv2.calcHist([frame], [channel], None, [16], [0, 256]).flatten()
             for channel in range(3)]
    histograms.append(hists)
compute_hist_time = now() - start
print('Time to compute histograms: {:.3f}s'.format(compute_hist_time))

# Compute differences between adjacent pairs of histograms
def compute_histogram_diffs(histograms):
    diffs = []
    # Use the length of the histogram list rather than the global N, so this
    # also works on histograms computed elsewhere (e.g. by Scanner below)
    for i in range(1, len(histograms)):
        frame_diffs = [distance.chebyshev(histograms[i-1][channel], histograms[i][channel])
                       for channel in range(3)]
        avg_diff = np.mean(frame_diffs)
        diffs.append(avg_diff)
    return diffs
        
diffs = compute_histogram_diffs(histograms)

# Plot the differences
plt.rcParams["figure.figsize"] = [16, 9]
plt.xlabel("Frame number")
plt.ylabel("Difference from previous frame")
_ = plt.plot(range(1, N), diffs)

This plot shows, for each frame, the difference between its color histograms and the previous frame's color histograms. As you can see, there are a number of sharp peaks interspersed throughout the video that likely correspond to shot boundaries; below, we run a sliding window over this signal to find those peaks. It's also worth playing around with the number of histogram bins and the distance metric to see how the plot changes.

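As a rough sketch of that kind of experiment (not part of the original pipeline), the cell below recomputes the differences with coarser 8-bin histograms and Euclidean distance; both choices are arbitrary and only meant to show where the knobs are.


In [ ]:
# Illustrative only: coarser histograms (8 bins) and a different distance metric (Euclidean)
coarse_hists = []
for frame in tqdm(frames):
    coarse_hists.append([cv2.calcHist([frame], [channel], None, [8], [0, 256]).flatten()
                         for channel in range(3)])

euclid_diffs = [np.mean([distance.euclidean(coarse_hists[i-1][c], coarse_hists[i][c])
                         for c in range(3)])
                for i in range(1, len(coarse_hists))]

plt.xlabel("Frame number")
plt.ylabel("Euclidean difference from previous frame")
_ = plt.plot(range(1, len(coarse_hists)), euclid_diffs)

Back to the main pipeline: the next cell runs the sliding window over the original diffs to pick out the peaks.
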

In [ ]:
import math

WINDOW_SIZE = 500   # Half-width of the sliding window (the window spans up to 2x this many data points)
OUTLIER_STDDEV = 3  # A frame is an outlier if it is this many standard deviations above the window mean

def find_shot_boundaries(diffs):
    boundaries = []
    n = len(diffs) + 1  # diffs[i-1] is the difference between frames i-1 and i
    for i in range(1, n):
        window = diffs[max(i-WINDOW_SIZE, 0):min(i+WINDOW_SIZE, n)]
        if diffs[i-1] - np.mean(window) > OUTLIER_STDDEV * np.std(window):
            boundaries.append(i)
    return boundaries

boundaries = find_shot_boundaries(diffs)        

print('Shot boundaries are:')
print(boundaries)


def tile(imgs, rows=None, cols=None):
    # If neither rows/cols is specified, make a square
    if rows is None and cols is None:
        rows = int(math.sqrt(len(imgs)))

    if rows is None:
        rows = (len(imgs) + cols - 1) // cols
    else:
        cols = (len(imgs) + rows - 1) // rows

    # Pad missing frames with black
    diff = rows * cols - len(imgs)
    if diff != 0:
        imgs.extend([np.zeros(imgs[0].shape, dtype=imgs[0].dtype) for _ in range(diff)])

    return np.vstack([np.hstack(imgs[i * cols:(i + 1) * cols]) for i in range(rows)])


montage = tile([frames[i] for i in boundaries])
plt.imshow(cv2.cvtColor(montage, cv2.COLOR_BGR2RGB))
_ = plt.axis('off')

And we've done it! The video is now segmented into shots. At this point, you're probably wondering: "...but I thought this was a Scanner tutorial!" Well, consider this: what if you wanted to run this pipeline over a second video? A movie? A thousand movies? The simple Python code we wrote above is great for experimenting, but it doesn't scale. To accelerate the analysis, we need to speed up the core work: decoding frames and computing color histograms. Here are some ways to make that faster:

  • Use a faster histogram implementation, e.g. using the GPU.
  • Use a faster video decoder, e.g. the hardware decoder.
  • Parallelize the histogram pipeline on multiple CPUs or GPUs.
  • Parallelize the histogram pipeline across a cluster of machines.

All of that is fairly difficult to do with Python, but easy with Scanner.

Now I'm going to walk you through running the histogram computation in Scanner. First, we set up our inputs.


In [ ]:
from scannerpy import Client, DeviceType, PerfParams, CacheMode
from scannerpy.storage import NamedVideoStream, NamedStream
import scannertools.imgproc

sc = Client()
stream = NamedVideoStream(sc, 'example', path)

In Scanner, all data is organized into streams, or lazy lists of elements. Videos are streams where each element is a frame. We can create a stream from a video by defining a NamedVideoStream pointing to the video path. The name allows Scanner to store some metadata about the video in a local database that we use to optimize video decode at runtime.
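
If you had more than one video, you could create a NamedVideoStream for each file in the same way. Just as a sketch, the paths and stream names below are placeholders rather than files shipped with this tutorial:


In [ ]:
# Hypothetical example: one NamedVideoStream per video file (placeholder paths)
more_paths = ['videos/clip1.mp4', 'videos/clip2.mp4']
more_streams = [NamedVideoStream(sc, 'clip{}'.format(i), p)
                for i, p in enumerate(more_paths)]

A list of streams like this can later be passed to sc.io.Input, just like the single-element list in the next cell.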


In [ ]:
frame = sc.io.Input([stream])
histogram = sc.ops.Histogram(
    frame = frame,
    device = DeviceType.CPU) # Change this to DeviceType.GPU if you have a GPU
output = NamedStream(sc, 'example_hist')
output_op = sc.io.Output(sc.streams.Range(histogram, [(0, 2000)]), [output])

start = now()
sc.run(output_op, PerfParams.estimate(), cache_mode=CacheMode.Overwrite)
scanner_time = now() - start
print('Time to decode + compute histograms: {:.3f}s'.format(scanner_time))
print('Scanner was {:.2f}x faster'.format((read_frame_time + compute_hist_time) / scanner_time))

Computations in Scanner are defined in a data-parallel manner: you write a computation that takes in one (or a few) frames at a time, and the Scanner runtime runs it in parallel across your video. Here, we define a computation that computes a color histogram for each frame in the video. This is done by defining a series of "ops" (operators, similar to those in TensorFlow):

  1. The Input source represents a stream of frames, the input to our computation. This will be drawn from a video.
  2. Histogram is an op that computes a color histogram over the input frame. We specify that it should run on the CPU.
  3. Output represents the final output of our computation: the data that will be written back to disk, in this case a stream containing the color histograms for each frame of the input stream.

We use sc.run(...) with the computation graph (given by the output node) to execute the computation. The sc.streams.Range op between Histogram and Output simply restricts the job to elements 0-2000 of the stream. Next, we want to load the results of our computation into Python for further processing:


In [ ]:
from pprint import pprint
histograms = list(output.load())

# Run the same shot detection pipeline as before
diffs = compute_histogram_diffs(histograms)
boundaries = find_shot_boundaries(diffs)
montage = tile([frames[i] for i in boundaries])
plt.imshow(cv2.cvtColor(montage, cv2.COLOR_BGR2RGB))
_ = plt.axis('off')

Loading output is as simple as calling output.load(), which returns a generator that reads elements of the stored stream from disk (or wherever it was written).
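
Since load() returns a generator, you also don't have to materialize everything with list(...) as we did above. Here's a minimal sketch of streaming over the elements instead; the exact layout of each element depends on the op that produced it.


In [ ]:
# Peek at the first few stored elements without loading the whole stream into memory
for i, hist in enumerate(output.load()):
    if i >= 3:
        break
    print('Element {}: {}'.format(i, type(hist)))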

Let's reflect for a moment on the script we just made. Is it any faster than before? Going back to our four bullet points:

  • Scanner will run your computation on the GPU (device=DeviceType.GPU).
  • Scanner will use accelerated hardware video decode behind the scenes.
  • Scanner will automatically run on all of your CPU cores and on multiple GPUs.
  • Scanner will automatically distribute the work across a cluster.

That's what you get for free using Scanner for your video analyses. All of the code for organizing, distributing, and decoding your videos is taken care of by the Scanner runtime. As an exercise, download a long video like a movie and try running both our Python histogram pipeline and the Scanner pipeline. You'll likely notice a substantial difference!
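
A sketch of the Scanner side of that exercise might look like the cell below. Here, 'movie.mp4' is a placeholder path for whatever long video you download, and the graph is the same one we built above (without the Range op, so it processes every frame):


In [ ]:
# Sketch: run the same histogram graph over a longer, user-supplied video
# 'movie.mp4' and the stream names are placeholders
movie = NamedVideoStream(sc, 'movie', 'movie.mp4')
movie_frame = sc.io.Input([movie])
movie_hist = sc.ops.Histogram(frame=movie_frame, device=DeviceType.CPU)
movie_output = NamedStream(sc, 'movie_hist')
movie_output_op = sc.io.Output(movie_hist, [movie_output])
sc.run(movie_output_op, PerfParams.estimate(), cache_mode=CacheMode.Overwrite)

You can then time this the same way as before and compare it against the pure-Python loop.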

So, where should you go from here? I would check out:

  • Extended tutorial: covers more Scanner features like sampling patterns and building custom ops.
  • Example applications: other applications like face detection and reverse image search implemented with Scanner.