Y. Aytar, C. Vondrick, and A. Torralba. SoundNet: Learning sound representations from unlabeled video. In NIPS, 2016

A student-teacher training procedure

Look, Listen and Learn (2017)

Audio-Visual Correspondence (AVC) learning task

  • A large amount of unlabelled videos

    • e.g. a car in the video frame corresponds to engine noise in the audio.
  • 3 Motivations

    • Self-supervision ("free" supervision) task
    • Resembles the supervision available to infants
    • Yields good visual and audio representations.
      • New state of the art on two sound classification benchmarks
      • On par with state-of-the-art self-supervised approaches on ImageNet classification.
  • Prior work

    • Unsupervised visual representation
    • Audio-Visual
      • Train a visual network to generate sounds
      • Train an audio network to correlate with visual outputs
  • Train both visual and audio networks

    • Performance improves substantially over previous work.
  • As an added benefit

    • Localize the source of the audio event in the video frame
    • Localize the corresponding regions of the sound source using activation visualization.
  • cf. co-training

    • This method is entirely unsupervised
    • Co-training is more commonly applied in a semi-supervised scenario.
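
As a rough illustration of the correspondence task described above, the sketch below draws one training example: a frame paired either with audio from the same moment of the same video (label 1) or with audio from a different video (label 0). The data layout (`frames` and `audio` lists aligned by time index), the function name `sample_avc_pair`, and the 50/50 balance between positives and negatives are assumptions made for this example, not details taken from these notes.

```python
import random

# Hypothetical toy dataset: each video has frames and audio clips aligned by index.
toy_videos = [
    {"frames": [f"car/frame{i}" for i in range(3)],
     "audio": [f"car/audio{i}" for i in range(3)]},
    {"frames": [f"dog/frame{i}" for i in range(3)],
     "audio": [f"dog/audio{i}" for i in range(3)]},
    {"frames": [f"guitar/frame{i}" for i in range(3)],
     "audio": [f"guitar/audio{i}" for i in range(3)]},
]

def sample_avc_pair(videos, rng):
    """Return (frame, audio, label) for one correspondence example.

    label 1: frame and audio come from the same video and time index.
    label 0: audio comes from a different, randomly chosen video.
    """
    vid = rng.choice(videos)
    t = rng.randrange(len(vid["frames"]))
    frame = vid["frames"][t]
    if rng.random() < 0.5:  # assumed 50/50 split of positives and negatives
        return frame, vid["audio"][t], 1
    other = rng.choice([v for v in videos if v is not vid])
    return frame, other["audio"][rng.randrange(len(other["audio"]))], 0

rng = random.Random(0)
for _ in range(5):
    print(sample_avc_pair(toy_videos, rng))
```

No labels are needed beyond knowing which video each frame and audio clip came from, which is what makes the supervision "free".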

Implement “look, listen and learn” network (L3-Net)

Log-spectrogram computation.

  • 199 time-windows with 257 frequency bands.
    • Time: the spectrogram is computed with a window length of 0.01 seconds and a half-window overlap, so a 1-second clip at 48 kHz (480-sample windows, 240-sample hop) yields 199 windows.
    • Freq: the 1-second audio clip is resampled to 48 kHz.
      • 48 kHz is the standard audio sampling rate used by professional digital video equipment such as tape recorders, video servers and vision mixers.
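
The 199 × 257 shape can be checked with a minimal NumPy sketch. The window length (0.01 s), half-window overlap and 48 kHz rate come from the notes above; the FFT size of 512 (the next power of two above the 480-sample window, which gives 512/2 + 1 = 257 bands), the Hann window, the small `eps` inside the logarithm and the function name `log_spectrogram` are assumptions of this example rather than confirmed details of L3-Net.

```python
import numpy as np

def log_spectrogram(audio, sample_rate=48_000, window_sec=0.01, n_fft=512, eps=1e-6):
    """Log-magnitude spectrogram with half-window overlap, shape (freq, time)."""
    win = int(round(sample_rate * window_sec))  # 480 samples at 48 kHz
    hop = win // 2                              # half-window overlap: 240 samples
    n_frames = 1 + (len(audio) - win) // hop    # 1 + (48000 - 480) // 240 = 199
    # Index matrix that slices the signal into overlapping frames.
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hanning(win)       # assumed Hann window
    # Zero-pad each frame to n_fft; rfft returns n_fft // 2 + 1 frequency bands.
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return np.log(magnitude + eps).T

# One second of synthetic noise standing in for a resampled audio clip.
clip = np.random.default_rng(0).standard_normal(48_000)
spec = log_spectrogram(clip)
print(spec.shape)  # (257, 199)
```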

Discussion

1. Performance of the network on the audio-visual correspondence task

  • Comparison with supervised baselines.

2. The quality of the learnt visual and audio features is tested in a transfer learning setting, on visual and audio classification tasks.

  • Audio
    • Beats the previous state of the art, SoundNet, by 5.1%.
  • Video
    • On par with the state-of-the-art self-supervised approaches on ImageNet classification
      • cf. VGG, Inception, ResNet...

3. Qualitative analysis

