L7126 - DEEP LEARNING FOR IMAGE AND VIDEO CAPTIONING

Presented by NVIDIA's Deep Learning Institute

Created by Allison Gray and Myrieme Demouth

This lab was prepared for NVIDIA's 2017 GPU Technology Conference and consists of four main tasks.

Deep learning is used today in a wide variety of ways. In this lab, attendees will learn how to use convolutional neural networks (CNNs) to perform classification and recurrent neural networks (RNNs) for character and sentence generation. We will then combine these two techniques to generate captions for images and videos.

Task 1 Convolutional Neural Networks

This exercise explores how to get started with TensorFlow and CNNs. A dataset of paintings by Van Gogh and other artists is used to create a binary classifier.
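For orientation, here is a minimal sketch of this kind of binary classifier written with the tf.keras API; the layer sizes and the 224x224 input resolution are illustrative assumptions, not the lab's exact network:

    import tensorflow as tf

    # Small CNN that maps a 224x224 RGB image to a single probability:
    # "painted by Van Gogh" vs. "not Van Gogh".
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # binary output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])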

Task 2 Recurrent Neural Networks

In this part of the lab, we will demonstrate the power of RNNs with character and code generation examples. Participants will explore and experiment with RNNs, learn how to configure them, and use them to generate sentences. Captions from the MSCOCO dataset will be used to train an RNN with different network parameters.
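A minimal sketch of a character-level RNN of the kind explored here, again using tf.keras; the vocabulary size, embedding width, and LSTM size are assumptions:

    import tensorflow as tf

    vocab_size, embed_dim, rnn_units = 128, 64, 256

    # Predict a distribution over the next character at every position.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim),
        tf.keras.layers.LSTM(rnn_units, return_sequences=True),
        tf.keras.layers.Dense(vocab_size),  # logits over the character set
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

To generate text, the trained model is fed one character at a time and the next character is sampled from the softmax over its output logits.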

Task 3 Image Captioning

Image captioning can be performed by combining a CNN with an RNN: the CNN encodes the image into a feature vector, and the RNN decodes that vector into a sentence. Attendees will get hands-on experience combining CNN features with RNNs. MSCOCO images and captions will be used to train and fine-tune a network that generates captions.
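One common way to wire this up is to use the projected CNN feature vector as the initial state of the caption decoder, sketched below with tf.keras; the feature, vocabulary, and layer sizes are assumptions for illustration:

    import tensorflow as tf

    feat_dim, vocab_size, embed_dim, rnn_units = 2048, 10000, 256, 512

    image_feats = tf.keras.Input(shape=(feat_dim,))            # from a pretrained CNN
    caption_in = tf.keras.Input(shape=(None,), dtype="int32")  # word ids so far

    # Project the image features to the LSTM state size so the decoder
    # starts out conditioned on the image content.
    init_state = tf.keras.layers.Dense(rnn_units, activation="tanh")(image_feats)
    x = tf.keras.layers.Embedding(vocab_size, embed_dim)(caption_in)
    x = tf.keras.layers.LSTM(rnn_units, return_sequences=True)(
        x, initial_state=[init_state, init_state])
    logits = tf.keras.layers.Dense(vocab_size)(x)              # next-word logits

    model = tf.keras.Model([image_feats, caption_in], logits)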

Task 4 Video Captioning

In this last part, we will combine everything learned in the previous three tasks to generate captions for video clips.
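One simple approach, assuming the Task 3 captioning model is reused, is to run the CNN over individual frames and average the per-frame features into a single clip-level vector before decoding; the feature width and the mean-pooling strategy below are assumptions:

    import tensorflow as tf

    # Per-frame CNN features for a clip, shaped (num_frames, 2048).
    frame_feats = tf.keras.Input(shape=(None, 2048))
    # Average over frames to get one clip-level feature vector.
    clip_feat = tf.keras.layers.GlobalAveragePooling1D()(frame_feats)
    pool = tf.keras.Model(frame_feats, clip_feat)

    # clip_feat then plays the role of the single-image feature vector
    # that the image-captioning model consumes.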