Essentia.standard Python tutorial

This tutorial will show how to use Essentia in standard mode.

We will have a look at some basic functionality:

  • how to load audio
  • how to perform some numerical operations, such as the FFT
  • how to plot results
  • how to output results to a file

Exploring the Python module


In [1]:
# first, we need to import our essentia module. It is aptly named 'essentia'!
import essentia

# as there are 2 operating modes in essentia that share the same algorithms,
# the algorithms are dispatched into 2 submodules:
import essentia.standard
import essentia.streaming

# let's have a look at what is in there
print(dir(essentia.standard))

# you can also do this with autocompletion in IPython: type "essentia.standard." and press Tab


['AfterMaxToBeforeMaxEnergyRatio', 'AllPass', 'AudioLoader', 'AudioOnsetsMarker', 'AudioWriter', 'AutoCorrelation', 'BPF', 'BandPass', 'BandReject', 'BarkBands', 'BeatTrackerDegara', 'BeatTrackerMultiFeature', 'Beatogram', 'BeatsLoudness', 'BinaryOperator', 'BinaryOperatorStream', 'BpmHistogramDescriptors', 'BpmRubato', 'CartesianToPolar', 'CentralMoments', 'Centroid', 'ChordsDescriptors', 'ChordsDetection', 'ChordsDetectionBeats', 'Clipper', 'Crest', 'CrossCorrelation', 'CubicSpline', 'DCRemoval', 'DCT', 'Danceability', 'Decrease', 'Derivative', 'DerivativeSFX', 'Dissonance', 'DistributionShape', 'Duration', 'DynamicComplexity', 'ERBBands', 'EasyLoader', 'EffectiveDuration', 'Energy', 'EnergyBand', 'EnergyBandRatio', 'Entropy', 'Envelope', 'EqloudLoader', 'EqualLoudness', 'Extractor', 'FFT', 'FadeDetection', 'Flatness', 'FlatnessDB', 'FlatnessSFX', 'Flux', 'FrameCutter', 'FrameGenerator', 'FrameToReal', 'FrequencyBands', 'GFCC', 'GeometricMean', 'HFC', 'HPCP', 'HarmonicBpm', 'HarmonicMask', 'HarmonicPeaks', 'HighPass', 'HighResolutionFeatures', 'IFFT', 'IIR', 'Inharmonicity', 'InstantPower', 'Intensity', 'Key', 'KeyExtractor', 'LPC', 'Larm', 'Leq', 'LevelExtractor', 'LogAttackTime', 'Loudness', 'LoudnessVickers', 'LowLevelSpectralEqloudExtractor', 'LowLevelSpectralExtractor', 'LowPass', 'MFCC', 'Magnitude', 'MaxFilter', 'MaxMagFreq', 'MaxToTotal', 'Mean', 'Median', 'MelBands', 'MetadataReader', 'Meter', 'MinToTotal', 'MonoLoader', 'MonoMixer', 'MonoWriter', 'MovingAverage', 'MultiPitchMelodia', 'Multiplexer', 'NoiseAdder', 'NoveltyCurve', 'NoveltyCurveFixedBpmEstimator', 'OddToEvenHarmonicEnergyRatio', 'OnsetDetection', 'OnsetDetectionGlobal', 'OnsetRate', 'Onsets', 'OverlapAdd', 'PCA', 'Panning', 'PeakDetection', 'PitchContourSegmentation', 'PitchContours', 'PitchContoursMelody', 'PitchContoursMonoMelody', 'PitchContoursMultiMelody', 'PitchFilterMakam', 'PitchMelodia', 'PitchSalience', 'PitchSalienceFunction', 'PitchSalienceFunctionPeaks', 'PitchYin', 
'PitchYinFFT', 'PolarToCartesian', 'PoolAggregator', 'PowerMean', 'PowerSpectrum', 'PredominantPitchMelodia', 'RMS', 'RawMoments', 'ReplayGain', 'Resample', 'RhythmDescriptors', 'RhythmExtractor', 'RhythmExtractor2013', 'RhythmTransform', 'RollOff', 'SBic', 'Scale', 'SilenceRate', 'SineModel', 'SingleBeatLoudness', 'SingleGaussian', 'Slicer', 'SpectralComplexity', 'SpectralContrast', 'SpectralPeaks', 'SpectralWhitening', 'Spectrum', 'Spline', 'StartStopSilence', 'StereoDemuxer', 'StrongDecay', 'StrongPeak', 'SuperFluxExtractor', 'SuperFluxNovelty', 'SuperFluxPeaks', 'TCToTotal', 'TempoScaleBands', 'TempoTap', 'TempoTapDegara', 'TempoTapMaxAgreement', 'TempoTapTicks', 'TonalExtractor', 'TonicIndianArtMusic', 'TriangularBands', 'Trimmer', 'Tristimulus', 'TuningFrequency', 'TuningFrequencyExtractor', 'UnaryOperator', 'UnaryOperatorStream', 'Variance', 'Vibrato', 'WarpedAutoCorrelation', 'Windowing', 'YamlInput', 'YamlOutput', 'ZeroCrossingRate', '__builtins__', '__doc__', '__file__', '__name__', '__package__', '_c', '_create_essentia_class', '_create_python_algorithms', '_essentia', '_reloadAlgorithms', '_sys', 'algorithmInfo', 'algorithmNames', 'essentia']

Instantiating our first algorithm, loading some audio

Let's start doing some useful things now!

Before you can use algorithms in Essentia, you first need to instantiate (create) them. When doing so, you can give them parameters which they may need to work properly, such as the filename of the audio file in the case of an audio loader.

Instantiating an algorithm does not compute anything yet; it merely configures the algorithm, which then works like a function: you have to call it to make things happen (technically, it is a function object).
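This "function object" behavior can be sketched in plain Python: parameters go to the constructor, and the computation happens when the instance is called. (Gain is a hypothetical example for illustration, not an Essentia algorithm.)

```python
# A plain-Python sketch of the "configure, then call" pattern that
# Essentia algorithms follow (Gain is a made-up algorithm):
class Gain:
    def __init__(self, factor=1.0):
        # instantiation only stores the parameters; nothing is computed yet
        self.factor = factor

    def __call__(self, samples):
        # calling the instance performs the actual computation
        return [s * self.factor for s in samples]

double = Gain(factor=2.0)    # instantiate and configure
print(double([0.5, -0.25]))  # -> [1.0, -0.5]
```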

Essentia has a selection of audio loaders:

  • AudioLoader: the most generic one, returns the audio samples, sampling rate and number of channels, and some other related information
  • MonoLoader: returns audio down-mixed to mono and resampled to a given sampling rate
  • EasyLoader: a MonoLoader which can optionally trim start/end slices and rescale according to a ReplayGain value
  • EqloudLoader: an EasyLoader that applies an equal-loudness filtering on the audio
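Conceptually, the down-mix that MonoLoader performs is just an average of the channels. A rough numpy sketch of that single step (the real loader, of course, also decodes the file and resamples):

```python
import numpy as np

# a stereo buffer, shaped (numSamples, 2), as AudioLoader would return it
stereo = np.array([[0.25, 0.75],
                   [1.0,  0.0]], dtype=np.float32)

# down-mixing to mono averages the left and right channels
mono = stereo.mean(axis=1)
print(mono)  # -> [0.5 0.5]
```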

In [3]:
# we start by instantiating the audio loader:
loader = essentia.standard.MonoLoader(filename='../../../test/audio/recorded/musicbox.wav')

# and then we actually perform the loading:
audio = loader()

By default, the MonoLoader will output audio with a 44100 Hz sample rate. To make sure this actually worked, let's plot a 1-second slice of the audio, from t = 1 sec to t = 2 sec:


In [4]:
# pylab contains the plot() function, as well as figure, etc... (same names as Matlab)
from pylab import plot, show, figure, imshow

plot(audio[1*44100:2*44100])
show() # unnecessary if you started "ipython --pylab"


Note that if you have started IPython with the --pylab option, the call to show() is not necessary, and you don't have to close the plot to regain control of your terminal.

Setting the stage for our future computations

So let's say that we want to compute the MFCCs for the frames in our audio.

We will need the following algorithms: Windowing, Spectrum, MFCC:


In [5]:
from essentia.standard import *

w = Windowing(type = 'hann')
spectrum = Spectrum()  # FFT() would return the complex FFT, here we just want the magnitude spectrum
mfcc = MFCC()
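For intuition about what the first two algorithms do: Windowing followed by Spectrum is roughly a Hann window followed by the magnitude of a real FFT. A numpy approximation (Essentia's implementations may differ in normalization and exact window shape):

```python
import numpy as np

frame = np.random.randn(1024).astype(np.float32)

# window the frame, then take the magnitude of the real FFT
windowed = frame * np.hanning(1024)
magnitude = np.abs(np.fft.rfft(windowed))

# a frame of N samples yields a magnitude spectrum of N/2 + 1 bins
print(len(magnitude))  # -> 513
```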

Let's have a look at the inline help using the help() command (you can also see it by typing "MFCC?" in IPython):


In [5]:
help(MFCC)


Help on class Algo in module essentia.standard:

class Algo(Algorithm)
 |  MFCC
 |  
 |  
 |  Inputs:
 |  
 |    [vector_real] spectrum - the audio spectrum
 |  
 |  
 |  Outputs:
 |  
 |    [vector_real] bands - the energies in mel bands
 |    [vector_real] mfcc - the mel frequency cepstrum coefficients
 |  
 |  
 |  Parameters:
 |  
 |    highFrequencyBound:
 |      real ∈ (0,inf) (default = 11000)
 |      the upper bound of the frequency range [Hz]
 |  
 |    inputSize:
 |      integer ∈ (1,inf) (default = 1025)
 |      the size of input spectrum
 |  
 |    lowFrequencyBound:
 |      real ∈ [0,inf) (default = 0)
 |      the lower bound of the frequency range [Hz]
 |  
 |    numberBands:
 |      integer ∈ [1,inf) (default = 40)
 |      the number of mel-bands in the filter
 |  
 |    numberCoefficients:
 |      integer ∈ [1,inf) (default = 13)
 |      the number of output mel coefficients
 |  
 |    sampleRate:
 |      real ∈ (0,inf) (default = 44100)
 |      the sampling rate of the audio signal [Hz]
 |  
 |  
 |  Description:
 |  
 |    This algorithm computes the mel-frequency cepstrum coefficients.
 |    As there is no standard implementation, the MFCC-FB40 is used by default:
 |      - filterbank of 40 bands from 0 to 11000Hz
 |      - take the log value of the spectrum energy in each mel band
 |      - DCT of the 40 bands down to 13 mel coefficients
 |    There is a paper describing various MFCC implementations [1].
 |    
 |    This algorithm depends on the algorithms MelBands and DCT and therefore
 |    inherits their parameter restrictions. An exception is thrown if any of these
 |    restrictions are not met. The input "spectrum" is passed to the MelBands
 |    algorithm and thus imposes MelBands' input requirements. Exceptions are
 |    inherited by MelBands as well as by DCT.
 |    
 |    References:
 |      [1] T. Ganchev, N. Fakotakis, and G. Kokkinakis, "Comparative evaluation
 |      of various MFCC implementations on the speaker verification task," in
 |      International Conference on Speech and Computer (SPECOM’05), 2005,
 |      vol. 1, pp. 191–194.
 |    
 |      [2] Mel-frequency cepstrum - Wikipedia, the free encyclopedia,
 |      http://en.wikipedia.org/wiki/Mel_frequency_cepstral_coefficient
 |  
 |  Method resolution order:
 |      Algo
 |      Algorithm
 |      __builtin__.object
 |  
 |  Methods defined here:
 |  
 |  __call__(self, *args)
 |  
 |  __init__(self, **kwargs)
 |  
 |  __str__(self)
 |  
 |  compute(self, *args)
 |  
 |  configure(self, **kwargs)
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __struct__ = {'description': 'This algorithm computes the mel-frequenc...
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from Algorithm:
 |  
 |  __compute__(...)
 |      compute the algorithm
 |  
 |  __configure__(...)
 |      Configure the algorithm
 |  
 |  getDoc(...)
 |      Returns the doc string for the algorithm
 |  
 |  getStruct(...)
 |      Returns the doc struct for the algorithm
 |  
 |  inputNames(...)
 |      Returns the names of the inputs of the algorithm.
 |  
 |  inputType(...)
 |      Returns the type of the input given by its name
 |  
 |  name(...)
 |      Returns the name of the algorithm.
 |  
 |  outputNames(...)
 |      Returns the names of the outputs of the algorithm.
 |  
 |  paramType(...)
 |      Returns the type of the parameter given by its name
 |  
 |  paramValue(...)
 |      Returns the value of the parameter or None if not yet configured
 |  
 |  parameterNames(...)
 |      Returns the names of the parameters for this algorithm.
 |  
 |  reset(...)
 |      Reset the algorithm to its initial state (if any).
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes inherited from Algorithm:
 |  
 |  __new__ = <built-in method __new__ of type object>
 |      T.__new__(S, ...) -> a new object with type S, a subtype of T

Once algorithms have been instantiated, they work like normal functions:


In [6]:
frame = audio[5*44100 : 5*44100 + 1024]
spec = spectrum(w(frame))

plot(spec)
show() # unnecessary if you started "ipython --pylab"


Computing MFCCs the Matlab way

Now let's compute the MFCCs the way we would do it in Matlab, slicing the frames manually:


In [7]:
mfccs = []
frameSize = 1024
hopSize = 512

for fstart in range(0, len(audio)-frameSize, hopSize):
    frame = audio[fstart:fstart+frameSize]
    mfcc_bands, mfcc_coeffs = mfcc(spectrum(w(frame)))
    mfccs.append(mfcc_coeffs)

# and plot them...
# as this is a 2D array, we need to use imshow() instead of plot()
imshow(mfccs, aspect = 'auto')
show() # unnecessary if you started "ipython --pylab"


Note also that the MFCC algorithm returns 2 values: the band energies and the coefficients, and that you unpack them the same way as in Matlab.

Let's see if we can write this in a nicer way, though.

Computing MFCCs the Essentia way

Essentia has been designed for audio processing, and as such it provides a large set of readily available related algorithms, so you don't have to chase around various toolboxes to achieve what you want. For more details, have a look at the algorithms_overview or at the complete reference.

In particular, we will use the FrameGenerator here:


In [8]:
mfccs = []

for frame in FrameGenerator(audio, frameSize = 1024, hopSize = 512):
    mfcc_bands, mfcc_coeffs = mfcc(spectrum(w(frame)))
    mfccs.append(mfcc_coeffs)

# transpose to have it in a better shape
# we need to convert the list to an essentia.array first (== numpy.array of floats)
mfccs = essentia.array(mfccs).T

# and plot
imshow(mfccs[1:,:], aspect = 'auto') 
show() # unnecessary if you started "ipython --pylab"


We ignored the first MFCC coefficient, which mostly reflects the overall power of the signal, so that the plot only shows the spectral shape.
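For intuition, FrameGenerator behaves much like this simplified pure-Python generator (a sketch only: the real algorithm also zero-pads the last frames and can center frames):

```python
def frame_generator(audio, frameSize, hopSize):
    # yield successive frames, advancing by hopSize samples each time;
    # simplified: frames that would run past the end are simply dropped
    for start in range(0, len(audio) - frameSize + 1, hopSize):
        yield audio[start:start + frameSize]

frames = list(frame_generator(list(range(10)), frameSize=4, hopSize=2))
print(frames)  # -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```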

Introducing the Pool - a versatile data container

A Pool is a container similar to a C++ map or a Python dict which can contain any type of values (easy in Python, not so much in C++...). Values are stored under a name which represents the full path to them; dot ('.') characters are used as separators. You can think of it as a directory tree, or as namespace(s) + local name.

Examples of valid names are: "bpm", "lowlevel.mfcc", "highlevel.genre.rock.probability", etc...
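The Pool's basic behavior can be sketched with a plain Python dict (a toy stand-in for illustration, not the real Pool, which also enforces a consistent type per descriptor):

```python
class MiniPool:
    """A toy stand-in for essentia.Pool: add() appends a value under a
    dot-separated descriptor name; [] retrieves everything added so far."""

    def __init__(self):
        self._data = {}

    def add(self, name, value):
        self._data.setdefault(name, []).append(value)

    def __getitem__(self, name):
        return self._data[name]

    def descriptorNames(self):
        return sorted(self._data)

pool = MiniPool()
pool.add('lowlevel.mfcc', [1.0, 2.0])
pool.add('lowlevel.mfcc', [3.0, 4.0])
print(pool['lowlevel.mfcc'])   # -> [[1.0, 2.0], [3.0, 4.0]]
print(pool.descriptorNames())  # -> ['lowlevel.mfcc']
```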

So let's redo the previous computations using a pool:


In [9]:
pool = essentia.Pool()

for frame in FrameGenerator(audio, frameSize = 1024, hopSize = 512):
    mfcc_bands, mfcc_coeffs = mfcc(spectrum(w(frame)))
    pool.add('lowlevel.mfcc', mfcc_coeffs)
    pool.add('lowlevel.mfcc_bands', mfcc_bands)

imshow(pool['lowlevel.mfcc'].T[1:,:], aspect = 'auto')
show() # unnecessary if you started "ipython --pylab"
figure()
imshow(pool['lowlevel.mfcc_bands'].T, aspect = 'auto', interpolation = 'nearest')


Out[9]:
<matplotlib.image.AxesImage at 0xea9aa50>

The pool also has the nice advantage that the data you get out of it is already an essentia.array (which is just a numpy array of floats), so you can call transpose (.T) on it directly.

Aggregation and file output

Let's finish this tutorial by writing our results to a file. As we are using such a nice language as Python, we could use its own facilities for writing data to a file, but for the sake of this tutorial let's do it with the YamlOutput algorithm, which writes a pool to a file in YAML or JSON format.


In [10]:
output = YamlOutput(filename = 'mfcc.sig') # use "format = 'json'" for JSON output
output(pool)

# or as a one-liner:
YamlOutput(filename = 'mfcc.sig')(pool)

This should take a while as we actually write the MFCCs for all the frames, which can be quite heavy depending on the duration of your audio file.

Now let's assume we do not want all the frames, but only the mean and variance of the descriptors across frames. We can do this with the PoolAggregator algorithm, which takes a pool and returns a new pool with the aggregated descriptors:


In [12]:
# compute mean and variance of the frames
aggrPool = PoolAggregator(defaultStats = [ 'mean', 'var' ])(pool)

print('Original pool descriptor names:')
print(pool.descriptorNames())
print()
print('Aggregated pool descriptor names:')
print(aggrPool.descriptorNames())

# and output those results to a file
YamlOutput(filename = 'mfccaggr.sig')(aggrPool)


Original pool descriptor names:
['lowlevel.mfcc', 'lowlevel.mfcc_bands']

Aggregated pool descriptor names:
['lowlevel.mfcc.mean', 'lowlevel.mfcc.var', 'lowlevel.mfcc_bands.mean', 'lowlevel.mfcc_bands.var']
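What "mean" and "var" mean here: the statistics are taken per coefficient, across frames. A small numpy illustration of that idea (not the PoolAggregator implementation; numpy's population variance is used here, and the aggregator's exact convention may differ):

```python
import numpy as np

# pretend these are the MFCC vectors of three frames (2 coefficients each)
frames = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])

# aggregating collapses the frame axis: one mean and one variance
# per coefficient
mean = frames.mean(axis=0)
var = frames.var(axis=0)
print(mean)
print(var)
```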

In [13]:
!cat mfccaggr.sig


metadata:
    version:
        essentia: "2.1-dev"

lowlevel:
    mfcc:
        mean: [-770.771728516, 246.557647705, 53.5677185059, 1.70909059048, -35.5930786133, -27.0709495544, -12.4148387909, -19.2304668427, -33.986038208, -23.4126434326, -15.8186225891, -5.1132478714, -2.86430335045]
        var: [9531.9296875, 2612.62597656, 1268.72875977, 442.906768799, 258.520568848, 229.063858032, 168.463638306, 126.90486145, 172.914840698, 142.858963013, 209.542709351, 237.36315918, 588.467102051]
    mfcc_bands:
        mean: [3.09789697894e-06, 0.0018204189837, 0.00687531381845, 0.00559488125145, 0.00746234040707, 0.00762519706041, 0.00263760378584, 0.00176807912067, 0.00187411252409, 0.0010101441294, 0.000384628627216, 8.97606587387e-05, 0.000103173675598, 0.000462994532427, 0.000481149676489, 0.000150407780893, 4.3479638407e-05, 9.8532436823e-06, 3.4149172734e-06, 4.67248537461e-06, 3.91657658838e-06, 1.8994775246e-06, 2.5756589821e-06, 1.52094037276e-06, 1.15387149435e-06, 3.3445369354e-06, 1.65835001553e-06, 2.04684874916e-06, 1.96311066247e-06, 1.5418397652e-06, 1.18413072414e-06, 1.06164293356e-06, 5.61618151096e-07, 5.55542726488e-07, 1.16678609174e-06, 9.67434175436e-07, 5.79169636694e-07, 4.31736594919e-07, 3.07267100652e-07, 1.74535870201e-07]
        var: [5.49797274374e-09, 2.71185967904e-06, 3.54826916009e-05, 1.4284183635e-05, 5.45716975466e-05, 5.73034012632e-05, 1.09859265649e-05, 5.82664006288e-06, 6.33308718534e-06, 2.67984637503e-06, 9.67446794675e-07, 1.09977271734e-07, 4.5999644982e-08, 1.01382534012e-06, 2.36311120716e-06, 1.19640631624e-07, 2.97720443854e-08, 1.96152050158e-09, 4.33603347672e-10, 2.12821246737e-10, 1.17798229504e-10, 3.65933568169e-11, 6.12296602309e-11, 2.86074445383e-11, 1.31610087412e-11, 1.15442384818e-10, 6.51916090555e-11, 8.07163641481e-11, 7.98162577698e-11, 4.8446673756e-11, 2.83182435834e-11, 3.88169184296e-11, 9.29356158003e-12, 1.18757833428e-11, 4.44560638302e-11, 3.39185728115e-11, 8.12059273991e-12, 6.04115542313e-12, 3.49459740659e-12, 1.1808879612e-12]

And this closes the tutorial!

There is not much more you need to know to use Essentia in a Python environment; the basics are:

  • instantiate and configure algorithms
  • use them to compute some results
  • and that's pretty much it!

The big strength of Essentia is that it provides a very large collection of algorithms, from low-level to high-level descriptors, which have been thoroughly optimized and tested, and on which you can rely to build your own signal analysis.