In [ ]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
One of the biggest challenges in Automatic Speech Recognition is the preparation and augmentation of audio data. Audio data analysis can be in the time or frequency domain, which adds complexity compared with other data sources such as images.
As part of the TensorFlow ecosystem, the tensorflow-io package provides a number of useful audio-related APIs that help ease the preparation and augmentation of audio data.
In [ ]:
!pip install tensorflow-io
In [ ]:
import tensorflow as tf
import tensorflow_io as tfio

# open the audio file lazily as an AudioIOTensor
audio = tfio.audio.AudioIOTensor('gs://cloud-samples-tests/speech/brooklyn.flac')
print(audio)
In the above example, the Flac file brooklyn.flac is from a publicly accessible audio clip on Google Cloud. The GCS address gs://cloud-samples-tests/speech/brooklyn.flac can be used directly because GCS is a supported file system in TensorFlow. In addition to the Flac format, WAV, Ogg, MP3, and MP4A are also supported by AudioIOTensor with automatic file format detection.
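Because the format is detected from the file content, the same API works for local files as well; a sketch with a hypothetical local path:
In [ ]:
# 'sample.wav' is a hypothetical local file; the format is detected
# automatically, just as with the GCS address above
local_audio = tfio.audio.AudioIOTensor('sample.wav')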
AudioIOTensor is lazy-loaded, so only the shape, dtype, and sample rate are shown initially. The shape of the AudioIOTensor is represented as [samples, channels], which means the audio clip we loaded is mono channel with 28979 samples in int16.
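The metadata can be inspected without reading any samples:
In [ ]:
# shape, dtype, and sample rate are available before any audio is read
print(audio.shape)  # [samples, channels]
print(audio.dtype)  # int16
print(audio.rate)   # sample rate in Hz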
The content of the audio clip is only read as needed, either by converting the AudioIOTensor to a Tensor through to_tensor(), or through slicing. Slicing is especially useful when only a small portion of a large audio clip is needed:
In [ ]:
# keep the samples from position 100 onward
audio_slice = audio[100:]
# remove the last dimension (channels), since the clip is mono
audio_tensor = tf.squeeze(audio_slice, axis=[-1])
print(audio_tensor)
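Alternatively, the whole clip can be read eagerly with to_tensor():
In [ ]:
# read the entire clip into memory at once; shape is [samples, channels]
full_tensor = audio.to_tensor()
print(full_tensor.shape)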
The audio can be played back with IPython's Audio display:
In [ ]:
from IPython.display import Audio
Audio(audio_tensor.numpy(), rate=audio.rate.numpy())
It is more convenient to convert the tensor into float numbers and show the audio clip as a graph:
In [ ]:
import matplotlib.pyplot as plt
# normalize the int16 samples into the [-1.0, 1.0] float range
tensor = tf.cast(audio_tensor, tf.float32) / 32768.0
plt.figure()
plt.plot(tensor.numpy())
Sometimes it makes sense to trim the noise from the audio, which can be done through the tfio.experimental.audio.trim API. It returns a [start, stop] pair of positions for the segment:
In [ ]:
# find the [start, stop] positions of the segment above the noise floor
position = tfio.experimental.audio.trim(tensor, axis=0, epsilon=0.1)
print(position)

start = position[0]
stop = position[1]
print(start, stop)

# keep only the non-silent segment
processed = tensor[start:stop]
plt.figure()
plt.plot(processed.numpy())
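The epsilon argument is the noise threshold: a smaller value treats less of the signal as noise and keeps quieter parts of the clip. A quick sketch with a hypothetical lower threshold:
In [ ]:
# with a smaller (hypothetical) epsilon, the trimmed segment grows
position_low = tfio.experimental.audio.trim(tensor, axis=0, epsilon=0.01)
print(position_low)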
Another useful API is tfio.experimental.audio.fade, which supports different shapes of fades such as linear, logarithmic, or exponential:
In [ ]:
fade = tfio.experimental.audio.fade(
    processed, fade_in=1000, fade_out=2000, mode="logarithmic")
plt.figure()
plt.plot(fade.numpy())
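The same fade lengths can be applied with another shape, for example mode="linear":
In [ ]:
# same fade lengths with a linear shape, for comparison
fade_linear = tfio.experimental.audio.fade(
    processed, fade_in=1000, fade_out=2000, mode="linear")
plt.figure()
plt.plot(fade_linear.numpy())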
Advanced audio processing often works on frequency changes over time. In tensorflow-io a waveform can be converted to a spectrogram through tfio.experimental.audio.spectrogram:
In [ ]:
# Convert to spectrogram
spectrogram = tfio.experimental.audio.spectrogram(
    fade, nfft=512, window=512, stride=256)
plt.figure()
plt.imshow(tf.math.log(spectrogram).numpy())
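The result is a 2-D tensor of time frames by frequency bins; with nfft=512 each frame should contain nfft // 2 + 1 = 257 bins:
In [ ]:
# expected shape: [time_frames, nfft // 2 + 1]
print(spectrogram.shape)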
Additional transformations to different scales are also possible:
In [ ]:
# Convert to mel-spectrogram
mel_spectrogram = tfio.experimental.audio.melscale(
    spectrogram, rate=16000, mels=128, fmin=0, fmax=8000)
plt.figure()
plt.imshow(tf.math.log(mel_spectrogram).numpy())

# Convert to db-scale mel-spectrogram
dbscale_mel_spectrogram = tfio.experimental.audio.dbscale(
    mel_spectrogram, top_db=80)
plt.figure()
plt.imshow(dbscale_mel_spectrogram.numpy())
In addition to the above-mentioned data preparation and augmentation APIs, the tensorflow-io package also provides advanced spectrogram augmentations, most notably the frequency and time masking discussed in SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition (Park et al., 2019).
In [ ]:
# Frequency masking
freq_mask = tfio.experimental.audio.freq_mask(dbscale_mel_spectrogram, param=10)
plt.figure()
plt.imshow(freq_mask.numpy())
In [ ]:
# Time masking
time_mask = tfio.experimental.audio.time_mask(dbscale_mel_spectrogram, param=10)
plt.figure()
plt.imshow(time_mask.numpy())
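In practice, both maskings are often chained so that each training example receives freshly randomized masks. A minimal sketch, assuming spectrogram_ds is a hypothetical tf.data.Dataset of db-scale mel spectrograms:
In [ ]:
# apply both maskings as augmentation; each call draws new random masks
def augment(spec):
    spec = tfio.experimental.audio.freq_mask(spec, param=10)
    spec = tfio.experimental.audio.time_mask(spec, param=10)
    return spec

# spectrogram_ds = spectrogram_ds.map(augment)  # hypothetical dataset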