Segmentation into scenes using audio

This tutorial addresses the following points:

  • audio feature extraction
  • temporal segmentation using "sliding window" approach
  • evaluation of the segmentation result

In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
# pyannote.core package provides core pyannote data structures.
# (available at http://github.com/pyannote)
from pyannote.core import Segment, Timeline

uri stands for uniform resource identifier: it uniquely identifies the document being processed.


In [3]:
uri = 'GameOfThrones.Season01.Episode01'

Let's start by loading the reference (i.e. manual) segmentation into scenes.
It is stored in data/GameOfThrones.Season01.Episode01/scenes.txt, with one scene per line: a start time, an end time (both in seconds), and a label.


In [4]:
with open('data/GameOfThrones.Season01.Episode01/scenes.txt', 'r') as f:
    lines = [line.split() for line in f.readlines()]

Timeline objects are used to store a set of Segment instances (one per scene).
A Segment corresponds to a time range, with a start time and an end time in seconds.
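For readers without pyannote installed, here is a minimal pure-Python sketch of these two concepts (hypothetical stand-ins for illustration only, not the actual pyannote.core API):

```python
from collections import namedtuple

# Stand-in for pyannote's Segment: a time range in seconds.
Segment = namedtuple('Segment', ['start', 'end'])

def duration(segment):
    """Duration of a segment, in seconds."""
    return segment.end - segment.start

# A Timeline is essentially an ordered collection of such segments,
# one per scene; here a plain list plays that role.
timeline = []
timeline.append(Segment(start=0.0, end=12.5))
timeline.append(Segment(start=12.5, end=47.0))

total = sum(duration(s) for s in timeline)  # total annotated duration
```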


In [5]:
reference = Timeline(uri=uri)
for start_time, end_time, _ in lines:
    segment = Segment(start=float(start_time), end=float(end_time))
    reference.add(segment)

Now, we will initialize an extractor of MFCC features (including energy and the first 12 coefficients).


In [6]:
# pyannote.features package provides feature extraction tools.
# (available at http://github.com/pyannote)
from pyannote.features.audio.yaafe import YaafeMFCC
mfcc_extractor = YaafeMFCC(e=True, coefs=12)

Once initialized, it can be used to extract the actual features.
Beware, it may take a while (a few seconds for a one-hour episode).


In [7]:
features = mfcc_extractor.extract('data/GameOfThrones.Season01.Episode01/english.wav')

Features instances have several handy methods.
crop is one of them: it returns all features for a given pyannote Segment as a numpy.array.


In [8]:
x = features.crop(Segment(0., 10.))
print(x.shape)


(625, 13)
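The (625, 13) shape corresponds to 10 seconds of 13-dimensional feature vectors, i.e. one frame every 16 ms. Assuming such a fixed frame step, crop essentially maps a time range to a slice of rows in the feature matrix. A sketch of the idea (not pyannote's actual implementation; the 16 ms step is inferred from the shape above):

```python
import numpy as np

FRAME_STEP = 0.016  # seconds per frame, inferred from 625 frames / 10 s

def crop(features, start, end):
    """Return the rows of `features` covering [start, end) seconds."""
    i = int(np.floor(start / FRAME_STEP))
    j = int(np.floor(end / FRAME_STEP))
    return features[i:j]

# Fake feature matrix: one hour of 13-dimensional frames.
features = np.zeros((int(3600 / FRAME_STEP), 13))
x = crop(features, 0.0, 10.0)  # 625 frames of 13 features
```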

Let's plot the audio signal energy (the first feature dimension) for a 20-second excerpt, from 40 s to 60 s.


In [9]:
plt.plot(features.crop(Segment(40, 60))[:,0])


Out[9]:
[<matplotlib.lines.Line2D at 0x11a434a90>]

Now, we are going to segment the episode using Gaussian divergence.
Two sliding windows (left and right) of 20 seconds each are used, with a step of 1 second.


In [10]:
# pyannote.algorithms provides algorithms for multimedia document processing.
# (available at http://github.com/pyannote)
from pyannote.algorithms.segmentation.sliding_window import SegmentationGaussianDivergence
segmenter = SegmentationGaussianDivergence(duration=20, step=1)

One can use segmenter to compute the Gaussian divergence d between the left and right windows for each position t of the sliding windows...
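The exact divergence formula used by pyannote is not spelled out here, but the sliding-window machinery can be sketched as follows: model each window as a diagonal Gaussian and compare the two with, for instance, a symmetric Kullback-Leibler divergence (the function names and the 16 ms frame step are assumptions for illustration, not pyannote's actual code):

```python
import numpy as np

def symmetric_kl(left, right, eps=1e-6):
    """Symmetric KL divergence between two diagonal Gaussians
    fitted on the left and right windows (frames x features)."""
    m1, m2 = left.mean(axis=0), right.mean(axis=0)
    v1 = left.var(axis=0) + eps  # eps avoids division by zero
    v2 = right.var(axis=0) + eps
    d = 0.5 * ((v1 + (m1 - m2) ** 2) / v2
               + (v2 + (m1 - m2) ** 2) / v1) - 1.0
    return d.sum()

def iterdiff(features, duration=20.0, step=1.0, frame_step=0.016):
    """Yield (t, d) for each position t of the two sliding windows."""
    n = int(duration / frame_step)  # frames per window
    hop = int(step / frame_step)    # frames per step
    for i in range(n, features.shape[0] - n, hop):
        yield i * frame_step, symmetric_kl(features[i - n:i],
                                           features[i:i + n])
```

On synthetic features with an abrupt change, the resulting d(t) curve peaks near the change point, which is exactly what the boundary detector looks for.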


In [11]:
T, D = zip(*segmenter.iterdiff(features))


/Volumes/home/Development/virtualenv/pyannote.algorithms/lib/python2.7/site-packages/numpy/core/_methods.py:55: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)

... and consequently plot $d = f(t)$ alongside the actual position of scene boundaries.


In [12]:
for segment in reference:
    plt.plot([segment.start, segment.start], [0, 20], 'r')
plt.plot(T, D)
plt.xlim(0, 2000)
plt.ylim(0,20);


It looks like setting a detection threshold $\theta = 7$ might do (some of) the trick.
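Internally, turning the d(t) curve into boundaries amounts to keeping the local maxima that exceed the threshold. A minimal sketch of that peak-picking step (assumed logic, not pyannote's actual apply implementation):

```python
import numpy as np

def pick_peaks(T, D, threshold=7.0):
    """Return the times t where d(t) is a local maximum above threshold."""
    D = np.asarray(D)
    boundaries = []
    for i in range(1, len(D) - 1):
        if D[i] >= threshold and D[i] > D[i - 1] and D[i] >= D[i + 1]:
            boundaries.append(T[i])
    return boundaries
```

Each returned time is a hypothesized scene boundary; segments are then obtained by cutting the episode at those times.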


In [13]:
segmenter = SegmentationGaussianDivergence(duration=20, step=1, threshold=7)
hypothesis = segmenter.apply(features)

Let's evaluate the results visually (reference in green, hypothesis in red).


In [14]:
for segment in hypothesis:
    plt.plot([segment.start, segment.start], [-10, -0.5], 'r')
for segment in reference:
    plt.plot([segment.start, segment.start], [0.5, 10], 'g')
plt.ylim(-11, 11);
plt.xlim(0, segment.end);
plt.xlabel('Time (seconds)');


One can also use evaluation metrics:


In [15]:
# pyannote.metrics provides evaluation metrics for various tasks.
# (available at http://github.com/pyannote)
from pyannote.metrics.segmentation import SegmentationPurity, SegmentationCoverage
from pyannote.metrics import f_measure
purity = SegmentationPurity()
coverage = SegmentationCoverage()

In [16]:
p = purity(reference, hypothesis)
c = coverage(reference, hypothesis)
f = f_measure(p, c)
print("Purity {p:.1f}% / Coverage {c:.1f}% / F-Measure {f:.1f}%".format(p=100*p, c=100*c, f=100*f))


Purity 77.6% / Coverage 85.9% / F-Measure 81.5%
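To build intuition for these numbers: coverage measures how well each reference scene is covered by a single hypothesis segment, purity is the dual (reference and hypothesis swapped), and f_measure is their harmonic mean. A toy sketch over plain (start, end) interval lists, using simplified definitions (pyannote.metrics handles the general case):

```python
def overlap(a, b):
    """Duration of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def coverage(reference, hypothesis):
    """For each reference segment, keep its largest overlap with a single
    hypothesis segment; normalize by total reference duration."""
    covered = sum(max(overlap(r, h) for h in hypothesis) for r in reference)
    total = sum(r[1] - r[0] for r in reference)
    return covered / total

def purity(reference, hypothesis):
    # Purity is the dual of coverage.
    return coverage(hypothesis, reference)

def f_measure(p, c):
    # Harmonic mean of purity and coverage.
    return 2 * p * c / (p + c)

reference = [(0.0, 10.0), (10.0, 30.0)]
hypothesis = [(0.0, 20.0), (20.0, 30.0)]
```

Here the hypothesis cuts at 20 s instead of 10 s, so both purity and coverage drop to 2/3.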