pyannote.audio is an open-source toolkit written in Python for speaker diarization.
Based on the PyTorch machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines.
pyannote.audio also comes with pre-trained models covering a wide range of domains (voice activity detection, speaker change detection, overlapped speech detection, speaker embedding), reaching state-of-the-art performance for most of them.
This notebook will teach you how to apply those pretrained pipelines to your own data.
Make sure you run it using a GPU (otherwise it might be slow...).
In [0]:
!pip install -q https://github.com/pyannote/pyannote-audio/tarball/develop
For the purpose of this notebook, we will download and use an audio file coming from the AMI corpus, which contains a conversation between 4 people in a meeting room.
In [0]:
!wget -q http://groups.inf.ed.ac.uk/ami/AMICorpusMirror/amicorpus/ES2004a/audio/ES2004a.Mix-Headset.wav
DEMO_FILE = {'uri': 'ES2004a.Mix-Headset', 'audio': 'ES2004a.Mix-Headset.wav'}
Because AMI is a benchmarking dataset, it comes with manual annotations (a.k.a. groundtruth).
Let us load and visualize the expected output of the speaker diarization pipeline.
In [0]:
!wget -q https://raw.githubusercontent.com/pyannote/pyannote-audio/develop/tutorials/data_preparation/AMI/MixHeadset.test.rttm
In [4]:
# load groundtruth
from pyannote.database.util import load_rttm
groundtruth = load_rttm('MixHeadset.test.rttm')[DEMO_FILE['uri']]
# visualize groundtruth
groundtruth
Out[4]:
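Before going further, it can be handy to inspect the reference annotation programmatically. Here is a minimal sketch (not part of the original notebook) relying on the labels() and chart() methods of pyannote.core annotations to list the speakers and their total speech duration:
In [ ]:
# quick sanity check on the reference annotation:
# list the speakers and how much each of them speaks overall (in seconds)
print('speakers:', groundtruth.labels())
for speaker, duration in groundtruth.chart():
    print(f'{speaker}: {duration:.1f}s of speech')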
For the rest of this notebook, we will only listen to and visualize a one-minute long excerpt of the file (but will process the whole file anyway).
In [5]:
from pyannote.core import Segment, notebook
# make notebook visualization zoom on 600s < t < 660s time range
EXCERPT = Segment(600, 660)
notebook.crop = EXCERPT
# visualize excerpt groundtruth
groundtruth
Out[5]:
This nice visualization is brought to you by pyannote.core and basically indicates when each speaker speaks.
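If you prefer numbers over pictures, the same information can be read off the annotation directly. A minimal sketch using the itertracks API of pyannote.core, restricted to the excerpt with crop:
In [ ]:
# print the speaker turns of the excerpt as (start, end, speaker) triples
for segment, _, speaker in groundtruth.crop(EXCERPT).itertracks(yield_label=True):
    print(f'{segment.start:6.1f}s - {segment.end:6.1f}s: {speaker}')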
In [6]:
from pyannote.audio.features import RawAudio
from IPython.display import Audio
# load audio waveform, crop excerpt, and play it
waveform = RawAudio(sample_rate=16000).crop(DEMO_FILE, EXCERPT)
Audio(data=waveform.squeeze(), rate=16000, autoplay=True)
Out[6]:
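As a small side experiment (not in the original notebook), you can also isolate and listen to a single speaker: label_timeline gives that speaker's speech regions, and the same RawAudio.crop call as above extracts the corresponding waveform. This is only a sketch; the choice of speaker and turn is arbitrary:
In [ ]:
# pick one groundtruth speaker and listen to their first speech turn
speaker = groundtruth.labels()[0]
first_turn = next(iter(groundtruth.label_timeline(speaker)))
speaker_waveform = RawAudio(sample_rate=16000).crop(DEMO_FILE, first_turn)
Audio(data=speaker_waveform.squeeze(), rate=16000)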
In case you just want to go ahead with the demo file, skip this section entirely.
In case you want to try processing your own audio file, proceed with running this section. It will prompt you to upload an audio file (preferably a wav file, but all formats supported by SoundFile should work just fine).
In [0]:
import google.colab
own_file, _ = google.colab.files.upload().popitem()
OWN_FILE = {'audio': own_file}
notebook.reset()
# load audio waveform and play it
waveform = RawAudio(sample_rate=16000)(OWN_FILE).data
Audio(data=waveform.squeeze(), rate=16000, autoplay=True)
Simply replace DEMO_FILE with OWN_FILE in the rest of the notebook (a small convenience snippet is also sketched below, right after the groundtruth cell).
Note, however, that unless you provide a groundtruth annotation in the next cell, you will (obviously) not be able to visualize the groundtruth annotation nor evaluate the performance of the diarization pipeline quantitatively.
In [0]:
# upload a groundtruth RTTM file and load the annotation for the uploaded audio file
groundtruth_rttm, _ = google.colab.files.upload().popitem()
groundtruths = load_rttm(groundtruth_rttm)
if OWN_FILE['audio'] in groundtruths:
    groundtruth = groundtruths[OWN_FILE['audio']]
else:
    _, groundtruth = groundtruths.popitem()
groundtruth
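If you would rather not edit every cell by hand, the following (hypothetical) convenience snippet picks the uploaded file when it exists and falls back to the demo file otherwise; FILE is a name introduced here for illustration only, and you would then use it in place of DEMO_FILE below:
In [ ]:
# use the uploaded file if the cells above were run, otherwise keep the demo file
try:
    FILE = OWN_FILE
except NameError:
    FILE = DEMO_FILE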
In [0]:
import torch
# load the pretrained speaker diarization pipeline from torch.hub
pipeline = torch.hub.load('pyannote/pyannote-audio', 'dia')
# apply it to the whole demo file
diarization = pipeline(DEMO_FILE)
That's it? Yes, that's it :-)
In [8]:
diarization
Out[8]:
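A common next step is to save the hypothesis to disk in RTTM format, the same format as the groundtruth file downloaded earlier. Here is a minimal sketch that writes the RTTM lines by hand using itertracks (the output filename is arbitrary):
In [ ]:
# dump the diarization hypothesis to an RTTM file, one line per speaker turn
with open('hypothesis.rttm', 'w') as f:
    for segment, _, label in diarization.itertracks(yield_label=True):
        f.write(f"SPEAKER {DEMO_FILE['uri']} 1 "
                f"{segment.start:.3f} {segment.duration:.3f} "
                f"<NA> <NA> {label} <NA> <NA>\n")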
Because groundtruth is available, we can evaluate the quality of the diarization pipeline by computing the diarization error rate.
In [0]:
from pyannote.metrics.diarization import DiarizationErrorRate
metric = DiarizationErrorRate()
der = metric(groundtruth, diarization)
In [16]:
print(f'diarization error rate = {100 * der:.1f}%')
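For a finer-grained picture, the same metric object can break the error down into its components (missed detection, false alarm, speaker confusion). A small sketch, assuming the detailed=True option of pyannote.metrics:
In [ ]:
# break the diarization error rate down into its components
components = metric(groundtruth, diarization, detailed=True)
for name, value in components.items():
    print(f'{name}: {value:.2f}')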
This implementation of diarization error rate is brought to you by pyannote.metrics.
It can also be used to improve visualization by finding the optimal one-to-one mapping between groundtruth and hypothesized speakers.
In [10]:
mapping = metric.optimal_mapping(groundtruth, diarization)
diarization.rename_labels(mapping=mapping)
Out[10]:
In [11]:
groundtruth
Out[11]:
We have only scratched the surface in this introduction.
More details about pyannote.audio can be found in the paper, while tutorials (for training or fine-tuning models on your own data) are available in the pyannote.audio GitHub repository.
As a bonus, let us compare the groundtruth (displayed again below) with the output of a pretrained overlapped speech detection pipeline trained on AMI.
In [13]:
groundtruth
Out[13]:
In [12]:
# load the pretrained overlapped speech detection pipeline trained on AMI,
# apply it, and visualize the detected overlapped speech regions
overlap_detection = torch.hub.load('pyannote/pyannote-audio', 'ovl_ami', pipeline=True)
overlap_detection(DEMO_FILE).get_timeline()
Out[12]:
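To put a number on the detected overlap, a short sketch that simply sums the durations of the detected regions (Timeline objects from pyannote.core are iterable, yielding segments):
In [ ]:
# total amount of overlapped speech detected over the whole file
# (this applies the pipeline again; store the timeline above to avoid recomputation)
overlap = overlap_detection(DEMO_FILE).get_timeline()
total = sum(segment.duration for segment in overlap)
print(f'detected {total:.1f}s of overlapped speech in {len(overlap)} regions')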