pyannote.audio is an open-source toolkit written in Python for speaker diarization.

Based on the PyTorch machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines.

pyannote.audio also comes with pre-trained models covering a wide range of domains for voice activity detection, speaker change detection, overlapped speech detection, and speaker embedding, reaching state-of-the-art performance for most of them.

This notebook will teach you how to apply those pretrained pipelines on your own data.

Make sure you run it using a GPU (otherwise it might be slow).
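If you want to check that a GPU is indeed available, here is a quick sketch (assuming PyTorch, which ships with Colab, is already installed):

import torch
print(torch.cuda.is_available())  # should print True when a GPU runtime is selected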

Installation

Until a proper version is released on PyPI, pyannote.audio should be installed from the develop branch of the GitHub repository, like this:


In [0]:
!pip install -q https://github.com/pyannote/pyannote-audio/tarball/develop

Visualization with pyannote.core

For the purpose of this notebook, we will download and use an audio file coming from the AMI corpus, which contains a conversation between 4 people in a meeting room.


In [0]:
!wget -q http://groups.inf.ed.ac.uk/ami/AMICorpusMirror/amicorpus/ES2004a/audio/ES2004a.Mix-Headset.wav
DEMO_FILE = {'uri': 'ES2004a.Mix-Headset', 'audio': 'ES2004a.Mix-Headset.wav'}

Because AMI is a benchmarking dataset, it comes with manual annotations (a.k.a. groundtruth).
Let us load and visualize the expected output of the speaker diarization pipeline.


In [0]:
!wget -q https://raw.githubusercontent.com/pyannote/pyannote-audio/develop/tutorials/data_preparation/AMI/MixHeadset.test.rttm

In [4]:
# load groundtruth
from pyannote.database.util import load_rttm
groundtruth = load_rttm('MixHeadset.test.rttm')[DEMO_FILE['uri']]

# visualize groundtruth
groundtruth


Out[4]:

For the rest of this notebook, we will only listen to and visualize a one-minute-long excerpt of the file (but will process the whole file anyway).


In [5]:
from pyannote.core import Segment, notebook
# make notebook visualization zoom on 600s < t < 660s time range
EXCERPT = Segment(600, 660)
notebook.crop = EXCERPT

# visualize excerpt groundtruth
groundtruth


Out[5]:

This nice visualization is brought to you by pyannote.core and basically indicates when each speaker speaks.
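If you prefer numbers over pictures, the same information can be read programmatically from the groundtruth annotation. Here is a small sketch relying on pyannote.core's itertracks API:

# print one line per groundtruth speech turn in the 600s-660s excerpt
for segment, _, speaker in groundtruth.crop(EXCERPT).itertracks(yield_label=True):
    print(f'{segment.start:.1f}s - {segment.end:.1f}s: {speaker}')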


In [6]:
from pyannote.audio.features import RawAudio
from IPython.display import Audio

# load audio waveform, crop excerpt, and play it
waveform = RawAudio(sample_rate=16000).crop(DEMO_FILE, EXCERPT)
Audio(data=waveform.squeeze(), rate=16000, autoplay=True)


Out[6]:

Processing your own audio file (optional)

In case you just want to go ahead with the demo file, skip this section entirely.

In case you want to try processing your own audio file, proceed with running this section. It will prompt you to upload an audio file (preferably a WAV file, but any format supported by SoundFile should work just fine).

Upload audio file


In [0]:
import google.colab
own_file, _ = google.colab.files.upload().popitem()
OWN_FILE = {'audio': own_file}
notebook.reset()

# load audio waveform and play it
waveform = RawAudio(sample_rate=16000)(OWN_FILE).data
Audio(data=waveform.squeeze(), rate=16000, autoplay=True)



Simply replace DEMO_FILE with OWN_FILE in the rest of the notebook.

Note, however, that unless you provide a groundtruth annotation in the next cell, you will (obviously) not be able to visualize the groundtruth annotation or evaluate the performance of the diarization pipeline quantitatively.

Upload groundtruth (optional)

The groundtruth file is expected to use the RTTM format, with one line per speech turn, following this convention:

SPEAKER {file_name} 1 {start_time} {duration} <NA> <NA> {speaker_name} <NA> <NA>
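For instance, a (made-up) speech turn by a speaker called spk_A, starting at 12.3s and lasting 4.5s in a file called my_file, would read:

SPEAKER my_file 1 12.3 4.5 <NA> <NA> spk_A <NA> <NA>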

In [0]:
groundtruth_rttm, _ = google.colab.files.upload().popitem()
groundtruths = load_rttm(groundtruth_rttm)
if OWN_FILE['audio'] in groundtruths:
  groundtruth = groundtruths[OWN_FILE['audio']]
else:
  _, groundtruth = groundtruths.popitem()
groundtruth

Speaker diarization with pyannote.pipeline

We are about to run a full speaker diarization pipeline that includes speech activity detection, speaker change detection, speaker embedding, and a final clustering step. Brace yourself!


In [0]:
import torch
pipeline = torch.hub.load('pyannote/pyannote-audio', 'dia')
diarization = pipeline(DEMO_FILE)

That's it? Yes, that's it :-)


In [8]:
diarization


Out[8]:
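Should you want to keep the result around, you can dump the hypothesized diarization to an RTTM file of your own. This is only a sketch (the hypothesis.rttm file name is arbitrary), reusing the itertracks API shown earlier:

# write one RTTM line per hypothesized speech turn
with open('hypothesis.rttm', 'w') as f:
    for segment, _, speaker in diarization.itertracks(yield_label=True):
        f.write(f'SPEAKER {DEMO_FILE["uri"]} 1 {segment.start:.3f} {segment.duration:.3f} <NA> <NA> {speaker} <NA> <NA>\n')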

Evaluation with pyannote.metrics

Because groundtruth is available, we can evaluate the quality of the diarization pipeline by computing the diarization error rate.
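As a reminder, the diarization error rate adds up three kinds of errors and normalizes them by the total duration of groundtruth speech:

DER = (false alarm + missed detection + speaker confusion) / total

where false alarm is speech wrongly detected, missed detection is speech that went undetected, and speaker confusion is speech attributed to the wrong speaker.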


In [0]:
from pyannote.metrics.diarization import DiarizationErrorRate
metric = DiarizationErrorRate()
der = metric(groundtruth, diarization)

In [16]:
print(f'diarization error rate = {100 * der:.1f}%')


diarization error rate = 42.9%

This implementation of diarization error rate is brought to you by pyannote.metrics.
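If you are curious about where those 42.9% come from, the metric can also return a detailed breakdown of its components (false alarm, missed detection, confusion, in seconds). A sketch, assuming the detailed=True option of pyannote.metrics is available in the installed version:

# break the diarization error rate down into its components
components = metric(groundtruth, diarization, detailed=True)
print(components)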

The metric can also be used to improve the visualization by finding the optimal one-to-one mapping between groundtruth and hypothesized speakers.


In [10]:
mapping = metric.optimal_mapping(groundtruth, diarization)
diarization.rename_labels(mapping=mapping)


Out[10]:

In [11]:
groundtruth


Out[11]:

Going further

We have only scratched the surface in this introduction.

More details about pyannote.audio can be found in the paper, while tutorials (for training or fine-tuning models on your own data) are available on the pyannote.audio GitHub repository.

Teaser: overlap detection

It can even do overlapped speech detection (which would definitely come in very handy for this messy meeting conversation).


In [13]:
groundtruth


Out[13]:

In [12]:
overlap_detection = torch.hub.load('pyannote/pyannote-audio', 'ovl_ami', pipeline=True)
overlap_detection(DEMO_FILE).get_timeline()


Using cache found in /root/.cache/torch/hub/pyannote_pyannote-audio_master
Downloading pretrained model "ovl_ami" to "/root/.pyannote/hub/models/ovl_ami.zip".

Out[12]:
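To get a rough idea of how much overlapped speech was found, one can for instance sum the duration of the detected regions. A sketch relying on pyannote.core's Timeline API (note that this re-runs the pipeline on the whole file):

# total duration (in seconds) of regions where at least two people speak at the same time
overlap_timeline = overlap_detection(DEMO_FILE).get_timeline()
print(f'{overlap_timeline.duration():.1f} seconds of overlapped speech detected')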