Y. Aytar, C. Vondrick, and A. Torralba. SoundNet: Learning sound representations from unlabeled video. In NIPS, 2016

A student-teacher training procedure

Visual (Teacher)
- Use a state-of-the-art visual recognition network
Sound (Students)
- Transfer discriminative visual knowledge from well-established visual recognition models into the sound modality, using unlabeled video as a bridge.
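The student-teacher transfer above can be sketched as a distillation loss. This is a minimal illustration, not the authors' Torch implementation: it assumes the student is trained by minimizing the KL divergence between the teacher's class distribution for a video frame and the student's distribution for the accompanying audio. The function names and the three-category scores are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for two discrete distributions."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )

def distillation_loss(teacher_logits, student_logits):
    """Loss for one video: the teacher sees a frame, the student hears the audio."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

# Hypothetical scores over three made-up categories (assumption, not real data).
teacher_logits = [4.0, 1.0, 0.5]  # visual network output for a frame
student_good = [3.5, 1.2, 0.4]    # audio network that agrees with the teacher
student_bad = [0.2, 3.0, 1.0]     # audio network that disagrees

loss_good = distillation_loss(teacher_logits, student_good)
loss_bad = distillation_loss(teacher_logits, student_bad)
print(loss_good < loss_bad)
```

No labels appear anywhere in the loss: the only training signal is the teacher's prediction on the frame, which is why unlabeled video is enough.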
Official Resources
- Webpage: https://projects.csail.mit.edu/soundnet/
- Source Code: https://github.com/cvondrick/soundnet (torch)

A large amount of unlabeled video
- e.g. a car in the video frame co-occurs with engine noise in the audio track.
3 Motivations
- Self-supervision (free supervision) task
- Supervision of Infants
- Good vision and audio representations.
  - New state of the art on two sound classification benchmarks
  - On par with the state-of-the-art self-supervised approaches on ImageNet classification.
Prior Work
- Unsupervised visual representation
- Audio-Visual
  - Train a visual network to generate sounds
  - Train an audio network to correlate with visual outputs
Train both visual and audio networks
- Performance improves substantially over previous work.
As an added benefit
- Localize the source of the audio event in the video frame
- Localize the corresponding regions of the sound source using activation visualization.
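A rough sketch of localization by activation visualization follows. The notes do not say how the activations are turned into a map, so this assumes one simple option: take the maximum response over channels at each spatial position of a convolutional feature map, rescale to [0, 1], and read off the peak. The function names and the toy 2 x 2 x 3 feature map are hypothetical.

```python
def activation_heatmap(feature_map):
    """Collapse a C x H x W activation volume into an H x W map scaled to [0, 1]."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    # Assumption: the strongest channel response marks how "active" a location is.
    heat = [
        [max(feature_map[c][y][x] for c in range(channels)) for x in range(width)]
        for y in range(height)
    ]
    lo = min(min(row) for row in heat)
    hi = max(max(row) for row in heat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[(v - lo) / span for v in row] for row in heat]

def peak_location(heat):
    """Return (row, col) of the strongest response in the heatmap."""
    positions = ((y, x) for y in range(len(heat)) for x in range(len(heat[0])))
    return max(positions, key=lambda p: heat[p[0]][p[1]])

# Toy activations: 2 channels over a 2 x 3 spatial grid (made-up values).
feature_map = [
    [[0.1, 0.2, 0.0], [0.0, 0.9, 0.1]],
    [[0.3, 0.1, 0.0], [0.2, 0.4, 0.0]],
]

heat = activation_heatmap(feature_map)
print(peak_location(heat))
```

In practice the low-resolution map would be upsampled to the frame size and overlaid on the image; that step is omitted here.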
cf. Co-training
- This method is entirely unsupervised
- Co-training is more commonly used in a semi-supervised scenario.

Audio
- Beats the previous state of the art, SoundNet, by 5.1%.
Video
- On par with the state-of-the-art self-supervised approaches on ImageNet classification
  - cf. VGG, Inception, ResNet...


