Linear regression - audio

Use linear regression to recover, or 'fill in', a completely deleted portion of an audio file! We'll use the FSDD (Free Spoken Digit Dataset), an audio dataset put together by Zohar Jackson: cleaned-up audio samples (no dead space, roughly the same length, same bitrate, same sample rate, etc.) ready for machine learning.

get the data


In [47]:
import os
import scipy.io.wavfile as wavfile


zero = []
directory = "../datasets/free-spoken-digit-dataset-master/recordings/"
for fname in os.listdir(directory):
    # keep only the takes of the digit 'zero' spoken by jackson
    if fname.startswith("0_jackson"):
        fullname = os.path.join(directory, fname)
        sample_rate, data = wavfile.read(fullname)
        zero.append(data)

There are 500 recordings per speaker, 50 of each digit.
Each .wav file is just a series of numeric samples, "sampled" from the analog signal. Sampling is a type of discretization. When we say 'samples', we mean observations; when we say 'audio samples', we mean the actual features of the audio file.
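
As a quick sanity check, we can inspect the raw samples of the first clip (the exact values shown will vary per recording):

In [ ]:
# each recording is a 1-D array of int16 amplitudes sampled at `sample_rate` Hz
print(sample_rate)    # 8000 for the FSDD recordings
print(zero[0][:10])   # the first ten audio samples of the first recording
print(len(zero[0]))   # total number of audio samples in that clip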

The goal of this notebook is to use multi-target linear regression to generate, by extrapolation, the missing portion of the test audio file.

Each audio sample in the missing portion will be the output of an equation that is a function of the provided portion of the audio samples:

missing_samples = f(provided_samples)
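
Concretely, multi-target linear regression learns one linear function per missing sample. Here is a minimal sketch of the idea, with toy sizes chosen purely for illustration:

In [ ]:
import numpy as np

n_provided, n_missing = 4, 6               # toy sizes, for illustration only
X = np.random.randn(10, n_provided)        # 10 toy "clips", provided portion only
W = np.random.randn(n_provided, n_missing) # one weight column per missing sample
b = np.random.randn(n_missing)
y = X @ W + b  # each missing sample is its own linear function of the provided ones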

prepare the data

Convert zero into a DataFrame and set the dtype to np.int16, since the input audio files are 16 bits per sample. This is important: otherwise the produced audio samples would be encoded as 64 bits per sample and wouldn't play back correctly.


In [48]:
import numpy as np
import pandas as pd

zeroDF = pd.DataFrame(zero, dtype=np.int16)

In [49]:
zeroDF.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Columns: 6273 entries, 0 to 6272
dtypes: float64(2186), int16(4087)
memory usage: 1.2 MB

Since these audio clips are unfortunately not length-normalized, we're going to have to hard-chop them all to the same length. Pandas inserted NaNs wherever needed to make zero a perfectly rectangular [n_observed_samples, n_audio_samples] array (that's why info() reports float64 columns above: NaN forces a float dtype), so do a dropna along the column axis here. Then convert zero back into an NDArray using .values


In [50]:
if zeroDF.isnull().values.any():
  print("Preprocessing data: dropping all NaN")
  zeroDF.dropna(axis=1, inplace=True)
else:
  print("Preprocessing data: No NaN found!")

zero = zeroDF.values # back to an ndarray (not a list)


Preprocessing data: dropping all NaN

In [51]:
n_audio_samples = zero.shape[1]

In [52]:
n_audio_samples


Out[52]:
4087

split the data into training and testing sets

There are 50 takes of each clip. You want to pull out just one of them, at random, and that one will NOT be used in the training of the model. In other words, the file we'll be testing / scoring on will be an unseen sample, independent of the rest of the training set.


In [53]:
from sklearn.utils.validation import check_random_state

rng = check_random_state(7)                   # seeded for reproducibility
random_idx = rng.randint(zero.shape[0])       # pick one take at random

test  = zero[random_idx]                      # the test sample
train = np.delete(zero, [random_idx], axis=0) # everything else is for training

In [54]:
print(train.shape)
print(test.shape)


(49, 4087)
(4087,)

Save the original 'test' clip, the one you're about to delete half of, so that you can compare it to the 'patched' clip once you've generated it. This assumes the sample rate is the same for all samples.


In [55]:
wavfile.write('../outputs/OriginalTestClip.wav', sample_rate, test)

Embedding the audio file.
Note that the player doesn't work directly on GitHub (all JavaScript is stripped out); fork or download the notebook to play the audio.


In [73]:
from IPython.display import Audio
Audio("../outputs/OriginalTestClip.wav")


Out[73]:
(embedded audio player)

carve out the labels Y

The data will have two parts: X and y (the true labels).
X is going to be the first portion of the audio file, which we will be providing the computer as input (the "chopped" audio).
y, the "label", is going to be the remaining portion of the audio file. In this way the computer will use linear regression to derive the missing portion of the sound file based off of the training data it has received!

Provided_Portion is how much of the audio file will be provided, as a fraction. The remaining portion of the file will be generated via linear extrapolation.


In [57]:
Provided_Portion = 0.5 # let's delete half of the audio

test_samples = int(Provided_Portion * n_audio_samples)
X_test = test[0:test_samples] # the provided (first) portion of the clip

In [58]:
Audio(data=X_test, rate=sample_rate)


Out[58]:
(embedded audio player)

Can you hear it? Now it's only the first syllable, "ze" ...
But we can even delete more and leave only the first quarter!


In [59]:
Provided_Portion = 0.25 # let's delete three quarters of the audio!

test_samples = int(Provided_Portion * n_audio_samples)
X_test = test[0:test_samples] # the provided (first) portion of the clip

In [60]:
wavfile.write('../outputs/ChoppedTestClip.wav', sample_rate, X_test)
IPython.display.Audio("../outputs/ChoppedTestClip.wav")


Out[60]:
(embedded audio player)

Almost unrecognisable.
Will the linear regression model be able to reconstruct the audio?


In [61]:
y_test = test[test_samples:] # remaining audio part is the label

Repeat the same process for X_train and y_train.


In [62]:
X_train = train[:, 0:test_samples] # first ones: data
y_train = train[:, test_samples:]  # remaining ones: label

SciKit-Learn gets mad if you don't supply your training data in the form of 2D arrays: [n_samples, n_features].

So if you only have one SAMPLE, as is the case with X_test and y_test, then by calling .reshape(1, -1) you can turn [n_features] into [1, n_features].


In [63]:
X_test = X_test.reshape(1,-1)
y_test = y_test.reshape(1,-1)

Create and train the linear regression model


In [64]:
from sklearn import linear_model

model = linear_model.LinearRegression()

In [65]:
model.fit(X_train, y_train)


Out[65]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
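
Under the hood, the multi-target model stores one weight vector per output audio sample. A quick (illustrative) inspection of the fitted attributes:

In [ ]:
# LinearRegression with multiple targets keeps one row of weights per output
print(model.coef_.shape)       # (n_audio_samples - test_samples, test_samples)
print(model.intercept_.shape)  # (n_audio_samples - test_samples,)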

Use the model to predict the 'label' of X_test.
SciKit-Learn will use float64 to generate the predictions, so let's cast those values back to int16.


In [66]:
y_test_prediction = model.predict(X_test)

In [67]:
y_test_prediction = y_test_prediction.astype(dtype=np.int16)

Evaluate the result


In [68]:
score = model.score(X_test, y_test) # R^2 on the held-out sample
print("Extrapolation R^2 Score: ", score)


Extrapolation R^2 Score:  0.0

Obviously, if you look only at R-squared, the result seems totally useless. (With a single test sample, R-squared is ill-defined anyway: the total sum of squares of one observation is zero, so scikit-learn falls back to reporting 0.0.)
But let's listen to the generated audio.
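
Before stitching, a complementary numeric check (this cell is an illustrative addition, not part of the original flow): the root-mean-square error in raw int16 amplitude units gives a rough sense of how far off the prediction is.

In [ ]:
# RMSE between the true and predicted halves, in raw amplitude units
rmse = np.sqrt(np.mean((y_test.astype(np.float64) - y_test_prediction) ** 2))
print("Extrapolation RMSE:", rmse)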

First, take the first Provided_Portion of the test clip, the part you fed into your linear regression model. Then stitch that together with the abomination the predictor model generated for you, and save the completed audio clip:


In [69]:
completed_clip = np.hstack((X_test, y_test_prediction))
wavfile.write('../outputs/ExtrapolatedClip.wav', sample_rate, completed_clip[0]) # [0]: back to a 1-D array

In [70]:
IPython.display.Audio("../outputs/ExtrapolatedClip.wav")


Out[70]:
(embedded audio player)

Well, not bad!