Prepare audio

This notebook shows how to prepare your audio data for use with Kapre.


In [29]:
import librosa
"""
# You might consider soundfile unless loading mp3 is your concern.
import soundfile as sf
# macOS, Windows: pip install soundfile
# Linux: pip install soundfile && sudo apt-get install libsndfile1
"""
import keras
import kapre
from keras.models import Sequential
from kapre.time_frequency import Spectrogram
import numpy as np

from datetime import datetime
now = datetime.now()

def print_version_info():
    print('%s/%s/%s' % (now.year, now.month, now.day))
    print('Keras version: {}'.format(keras.__version__))
    if keras.backend._BACKEND == 'tensorflow':
        import tensorflow
        print('Keras backend: {}: {}'.format(keras.backend._backend, tensorflow.__version__))
    else:
        import theano
        print('Keras backend: {}: {}'.format(keras.backend._backend, theano.__version__))
    print('Keras image dim ordering: {}'.format(keras.backend.image_dim_ordering()))
    print('Kapre version: {}'.format(kapre.__version__))


print_version_info()


2018/10/25
Keras version: 2.2.0
Keras backend: tensorflow: 1.9.0
Keras image dim ordering: tf
Kapre version: 0.1.3.1

Loading an mp3 file


In [30]:
src, sr = librosa.load('bensound-cute.mp3', sr=None, mono=True)
print(src.shape)
print(sr)


(8559410,)
44100

Trim it and make it 2D.

If your file is mono, librosa.load returns a 1D array. Kapre always expects a 2D array of shape (channel, time), so add a channel axis.


In [31]:
len_second = 1.0 # 1 second
src = src[:int(sr*len_second)]
src = src[np.newaxis, :]
input_shape = src.shape
print(input_shape)


(1, 44100)
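If you load a stereo file with mono=False, librosa returns a (2, n_samples) array, which is already in the (channel, time) layout Kapre expects; only mono signals need the extra axis. A small sketch using stand-in random arrays (not the actual file), with a hypothetical helper to_kapre_2d:

```python
import numpy as np

sr = 44100
mono = np.random.randn(sr).astype(np.float32)       # stand-in for a mono clip
stereo = np.random.randn(2, sr).astype(np.float32)  # stand-in for a stereo clip

def to_kapre_2d(y):
    """Return y as a (n_channels, n_samples) array."""
    return y[np.newaxis, :] if y.ndim == 1 else y

print(to_kapre_2d(mono).shape)    # (1, 44100)
print(to_kapre_2d(stereo).shape)  # (2, 44100)
```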

Let's assume we have 16 copies of this clip

to make it look more like a proper dataset. In practice you would, of course, have many different files.


In [32]:
x = np.array([src] * 16)
print(x.shape)


(16, 1, 44100)
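Real files rarely share the exact same length, so before stacking them into one batch array you would trim or zero-pad each clip. A sketch with stand-in arrays; fix_length is a hypothetical helper defined here, not a Kapre function:

```python
import numpy as np

def fix_length(y, n):
    """Trim or zero-pad a (channels, samples) clip to exactly n samples."""
    if y.shape[-1] >= n:
        return y[..., :n]
    return np.pad(y, ((0, 0), (0, n - y.shape[-1])), mode='constant')

# Stand-ins for clips of slightly different lengths
clips = [np.random.randn(1, n).astype(np.float32) for n in (44100, 43000, 45000)]
x = np.stack([fix_length(c, 44100) for c in clips])
print(x.shape)  # (3, 1, 44100)
```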

Now get a keras model using kapre

A simple model for 10-class, single-label classification.


In [33]:
model = Sequential()
model.add(Spectrogram(n_dft=512, n_hop=256, input_shape=input_shape, 
          return_decibel_spectrogram=True, power_spectrogram=2.0, 
          trainable_kernel=False, name='static_stft'))
model.add(keras.layers.Convolution2D(32, (3, 3), name='conv1', activation='relu'))
model.add(keras.layers.MaxPooling2D((25, 17)))
model.add(keras.layers.Convolution2D(32, (10, 10), name='conv2', activation='relu'))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(10, activation='softmax'))
model.summary(line_length=80, positions=[.33, .65, .8, 1.])


________________________________________________________________________________
Layer (type)              Output Shape              Param #     
================================================================================
static_stft (Spectrogram) (None, 257, 173, 1)       263168      
________________________________________________________________________________
conv1 (Conv2D)            (None, 255, 171, 32)      320         
________________________________________________________________________________
max_pooling2d_5 (MaxPooli (None, 10, 10, 32)        0           
________________________________________________________________________________
conv2 (Conv2D)            (None, 1, 1, 32)          102432      
________________________________________________________________________________
flatten_5 (Flatten)       (None, 32)                0           
________________________________________________________________________________
dense_5 (Dense)           (None, 10)                330         
================================================================================
Total params: 366,250
Trainable params: 103,082
Non-trainable params: 263,168
________________________________________________________________________________
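The Spectrogram layer's numbers in the summary can be sanity-checked by hand. Assuming Kapre 0.1.x pads the signal so there is roughly one frame per hop, and stores the real and imaginary DFT kernels as non-trainable weights, the shape (257, 173) and the 263,168 non-trainable parameters come out as:

```python
import numpy as np

n_dft, n_hop, n_samples = 512, 256, 44100
n_freq = n_dft // 2 + 1                     # one-sided spectrum: 257 bins
n_frames = int(np.ceil(n_samples / n_hop))  # 173 frames
n_params = 2 * n_freq * n_dft               # real + imaginary DFT kernels
print(n_freq, n_frames, n_params)           # 257 173 263168
```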

Training

With real labels, you would train the model here; this notebook skips that step.


In [34]:
# model.fit()
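If you did train, the final Dense(10, softmax) layer expects one-hot targets. A sketch of generating random one-hot labels for the 16-clip batch (with real data these would come from your annotations); the compile/fit calls are left commented since nothing meaningful can be learned from identical clips:

```python
import numpy as np

n_examples, n_classes = 16, 10
labels = np.random.randint(0, n_classes, size=n_examples)  # hypothetical class ids
y_true = np.eye(n_classes)[labels]                         # one-hot, shape (16, 10)
print(y_true.shape)

# model.compile(optimizer='adam', loss='categorical_crossentropy',
#               metrics=['accuracy'])
# model.fit(x, y_true, batch_size=8, epochs=2)
```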

Prediction

The model hasn't actually been trained, so the predictions below are arbitrary; they only demonstrate how to run inference.


In [35]:
y = model.predict(x)
print(np.argmax(y, axis=1))


[5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5]
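np.argmax above picks the index of the largest softmax output per example; np.max over the same rows gives a rough confidence score. A toy illustration with a made-up 2x3 output:

```python
import numpy as np

y = np.array([[0.1, 0.7, 0.2],
              [0.5, 0.3, 0.2]])  # made-up softmax outputs for 2 examples
classes = np.argmax(y, axis=1)   # predicted class per example
confidence = np.max(y, axis=1)   # probability assigned to that class
print(classes)      # [1 0]
print(confidence)   # [0.7 0.5]
```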

In [ ]: