Transfers discriminative visual knowledge from well-established visual recognition models into the sound modality, using unlabeled video as a bridge.
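A minimal sketch of how such a transfer could be set up, assuming a teacher-student arrangement: a visual model (teacher) produces a class distribution for a video frame, and an audio model (student) is trained to match it from the paired sound. The KL-divergence loss and all logit values here are assumptions for illustration and are not stated in these notes.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student): zero when the two distributions match."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher_probs, student_probs)
    )

# Hypothetical logits: the visual teacher sees a frame, the audio student hears its sound.
teacher_probs = softmax([2.0, 0.5, -1.0])
student_probs = softmax([1.5, 0.7, -0.5])

# The student would be trained to drive this value toward zero.
loss = kl_divergence(teacher_probs, student_probs)
```

No labels appear anywhere in this loss: the only supervision is the teacher's output on the synchronized frame, which is why unlabeled video is enough.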
Official Resource
A large amount of unlabeled video
3 Motivations
Histories
Train both visual and audio networks
As an added benefit
cf. Co-training
48 kHz
The standard audio sampling rate used by professional digital video equipment such as tape recorders, video servers, vision mixers and so on.
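As a quick arithmetic check, the sketch below converts between clip duration and sample count at 48 kHz. The 5-second clip length is a hypothetical value chosen for illustration, not a figure from these notes.

```python
SAMPLE_RATE_HZ = 48_000  # 48 kHz, as noted above

def num_samples(duration_s, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of audio samples in a clip of the given duration."""
    return round(duration_s * sample_rate_hz)

def duration_seconds(n_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Duration in seconds covered by the given number of samples."""
    return n_samples / sample_rate_hz

# A hypothetical 5-second clip holds 5 * 48,000 samples per channel.
clip_samples = num_samples(5)
```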