Papers and articles:

-  MS COCO - Lin et al. 2015; Russell et al. 2008; Deng et al. 2009; Everingham et al. 2010; Xiao et al. 2010
-  Discriminative clustering (DIFFRAC) - Bach & Harchaoui 2007
-  Multi-class cosegmentation - Joulin et al.
-  Scripts as a source of supervision - Laptev et al., Sivic et al.
-  Duchenne, Laptev, Sivic, Bach, Ponce 2009
-  A latent SVM model for temporal localization - Ramanan et al. 2008 : IMP
-  Temporal action localization - Bojanowski et al. ECCV 14 : IMP
-  How much supervision do we really need? - Cho et al. CVPR 15
-  Girshick et al. 2014
-  Consensus for curating training data - AAAI 18, NCETIS, IITB
-  Incorporating domain patterns in seq-to-seq models - ICDAR 17, Interspeech 18
-  Martingale boosting - Long et al. 2005
-  Optimizing multivariate performance measures for multi-instance... - NAACL 2015, AAAI 2017
-  Monocular depth estimation - Liu et al. 2015
-  Boundary detection - Xie & Tu 2015
-  DeepLab - Chen et al.
-  MIL-based segmentation - Pathak et al. ICLR 15
-  Pinheiro & Collobert CVPR 15
-  M-CNN - Tokmakov et al. 2016
-  OpenOCRCorrect - Rohit Saluja, GitHub
-  Automatic question generation - PAKDD 2018
-  Neural sentence matcher, decomposable attention model - Parikh et al. 2016
-  Snorkel - Stanford
-  Data programming - Stanford

Datasets:

- YouTube-Objects dataset
- 10 categories from PASCAL VOC
- 155 raw videos from YouTube
- IMNET - ImageNet video dataset
- QG - Xinya 2017

Lecture 1

  • A key outcome of the ImageNet challenges is that features from networks trained on ImageNet can be reused elsewhere.
  • Any visual task can be approached by first constructing a large-scale labelled dataset, then specifying the network architecture and the training loss, and finally training and deploying (see the sketch after this list).
  • Alternatives to strong supervision are weaker forms of supervision, semi-supervised methods, and totally unsupervised (self-supervised) methods.
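
A minimal sketch of this recipe in PyTorch; the tiny random dataset, the architecture, and the hyper-parameters below are placeholders, not anything used in the lecture:

```python
# Minimal sketch of the "standard recipe": dataset -> architecture + loss -> train.
# The data here is random and stands in for a large-scale labelled dataset.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# 1. A (toy) labelled dataset: 64x64 RGB images with 10 class labels.
images = torch.randn(256, 3, 64, 64)
labels = torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

# 2. A network architecture and a training loss.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# 3. Train (and then deploy).
for epoch in range(2):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```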

Using Weaker Supervision

  • E.g. image-level labels, metadata.
  • Cosegmentation: a set of photos share the same class, but the "where" is not known, so the network has to learn to segment that class on its own. This is handled with DIFFRAC, i.e. discriminative clustering (see the sketch after this list). Multi-class cosegmentation is possible as well; you specify the maximum number of classes.
  • Are semantics important for doing a CV task?
  • Movie scripts and/or subtitles as a source of supervision.
  • Temporal localization as classification.
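
A minimal sketch of a DIFFRAC-style discriminative-clustering cost (Bach & Harchaoui 2007), assuming the ridge-regression formulation: a candidate cluster assignment is scored by how well a linear classifier trained on it can reproduce it. The toy data and the `diffrac_cost` helper are illustrative, not the lecture's code:

```python
import numpy as np

def diffrac_cost(X, Y, lam=1e-2):
    """Cost = min_W ||Y - Xc W||_F^2 + n*lam*||W||_F^2 = trace(Y^T A Y),
    with A = I - Xc (Xc^T Xc + n*lam*I)^(-1) Xc^T and Xc the centred features.
    Lower cost = the assignment Y is easier to predict linearly from X."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    A = np.eye(n) - Xc @ np.linalg.solve(Xc.T @ Xc + n * lam * np.eye(d), Xc.T)
    return np.trace(Y.T @ A @ Y)

# Toy example: two well-separated blobs; the partition aligned with the blobs
# scores lower than a random relabelling of the same points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 5)), rng.normal(2, 0.5, (20, 5))])
good = np.repeat(np.eye(2), 20, axis=0)      # one-hot assignment per point
bad = good[rng.permutation(40)]              # random relabelling

print("cost(correct partition):", diffrac_cost(X, good))
print("cost(random partition): ", diffrac_cost(X, bad))
```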

A Type of Unsupervised Learning: Self-Supervision

  • Why?
    1. Expensive to produce a new dataset for each new task.
    2. Untapped potential.
    3. It mirrors how infants learn.
  • What?
    1. A form of unsupervised learning where the data itself provides the supervision.
  • Semantics from non-semantic data: take a grid of patches from an image and learn to predict where each patch lies relative to the others (see the sketch after this list). A bug was found: because of chromatic aberration the network could tell where a patch came from (e.g. that it was taken from a corner), short-cutting the task. The solution was to feed only one channel out of RGB to the network.
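
A minimal sketch of the patch-location pretext task described above, assuming a siamese-style CNN over single-channel patches; the architecture, patch size, and the `PatchSiamese` class are illustrative choices, not the lecture's model:

```python
# Given a centre patch and one of its 8 neighbours, predict which of the
# 8 relative positions the neighbour came from. The position label is "free"
# supervision extracted from the image itself.
import torch
import torch.nn as nn

class PatchSiamese(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(            # shared weights for both patches
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, 8)             # 8 possible relative positions

    def forward(self, centre, neighbour):
        feats = torch.cat([self.encoder(centre), self.encoder(neighbour)], dim=1)
        return self.head(feats)

# Single input channel: feeding only one colour channel, as noted above,
# removes the chromatic-aberration shortcut.
model = PatchSiamese()
centre, neighbour = torch.randn(8, 1, 32, 32), torch.randn(8, 1, 32, 32)
position = torch.randint(0, 8, (8,))             # free supervision signal
loss = nn.CrossEntropyLoss()(model(centre, neighbour), position)
loss.backward()
```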

Lecture 2

  • Where does the human assistance come in?
  • Problem: multi-label classification. The motivation is that there are multiple labelers, multiple labels, and little or no data. The challenges are inter-label correlation, labeler reliability, collective prediction, and active-learning sampling.
  • Consensus models, for single labels and for multiple labels.
  • Evaluation function vs. loss function:
    1. Frequently used losses: usually decomposable measures that are easy (convenient) to optimize, e.g. Hamming distance, precision, recall, cross entropy.
    2. Evaluation measures: often non-decomposable measures that cannot be decomposed into a sum of per-instance terms, e.g. F-measure, diversity measures, BLEU score, confusion matrix (see the sketch after this list).
  • There are two general ways to optimize a non-decomposable objective function: direct optimization and indirect optimization.
  • Martingale boosting: expanding scope for bargaining.
  • How to extract latent labels given a knowledge base and sentences? It is a multi-instance labeling problem, with multiple labels per instance bag.
  • Automatic question and answer generation: the problem is challenging because questions must be relevant to the text and answers must be unambiguous.
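
A small numeric illustration of non-decomposability using the F-measure: the F1 computed over a whole set differs from the mean of F1 over its parts, so it cannot be written as a sum of per-example (or per-batch) losses. The toy labels below are made up purely for illustration:

```python
import numpy as np

def f1(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

y_true = np.array([1, 1, 0, 0, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 0])

whole = f1(y_true, y_pred)
halves = np.mean([f1(y_true[:4], y_pred[:4]), f1(y_true[4:], y_pred[4:])])
print("F1 on the whole set:      ", whole)    # ~0.571
print("Mean F1 over two batches: ", halves)   # ~0.583, not the same
```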

Lecture 3

Deep Learning with Weak Supervision / Self-Supervision:

  • Learning motion patterns and their use for semantic segmentation:
    1. Goal: pixels in, pixels out.
    2. You want to keep the input and output spatial dimensions the same; fully convolutional networks can be used for this (see the sketch after this list). They give excellent results but need pixel-wise ground-truth annotations.
    3. Weakly supervised approaches for fully convolutional nets: MIL-based, image-level aggregation (image-level classes are provided), and constraint-based (e.g. specify how many pixels should belong to the new class; results are not very good).
    4. Motion can be used as a supervisory signal; it is not perfect, so imperfect segmentations must be handled.
    5. M-CNN: a video frame is passed through an FCN, which gives a category appearance, while the motion segmentation is modelled with a Gaussian Mixture Model (GMM) to give a foreground appearance. Graph-based inference then combines the category appearance and the foreground appearance obtained from the GMM. The labels are still at the image level; there are no pixel-level annotations.
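
A minimal sketch of the "pixels in, pixels out" idea with a fully convolutional network; the layer sizes below are illustrative, not the M-CNN architecture:

```python
# A fully convolutional network: no fully-connected layers, so the output has
# the same spatial size as the input and gives a per-pixel class score map.
import torch
import torch.nn as nn

fcn = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),            # downsample x2
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),           # downsample x2
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # upsample x2
    nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),              # 2 classes: fg/bg
)

frame = torch.randn(1, 3, 128, 128)   # any H x W works
scores = fcn(frame)
print(scores.shape)                   # torch.Size([1, 2, 128, 128]): pixels in, pixels out
```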

Lecture 4

Illustration: OCR Correction

  • Problem analysis: Indian languages have short primary vowels, long primary vowels, and secondary vowels, as well as consonants. Indic languages are rich in inflections; inflections (sandhi) lead to out-of-vocabulary problems.
  • Motivation for using LSTMs:
    1. The basic building block is the RNN; the input and the output can span multiple cells (time steps). LSTMs are RNNs with the capability of deciding how much to remember, i.e. how much of the previous cell state to maintain. They have forget and update gates and learn when those gates should act.
    2. Directly using LSTMs did not result in convergence. Hardwiring a delay at both ends (to let the network warm up) does not really increase accuracy, but hardwiring some delay at the output does (see the sketch after this list). The optimal delay value is around the edit distance for that language.
  • The model gives you an error-prediction model; it mostly corrects the incorrect OCR words.
  • A simple ensemble of multiple OCR systems also helps, instead of using complex LSTMs. If tuned on the right performance measure, a binary classifier gives as much accuracy as a complex LSTM on the OCR dataset.
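
A rough sketch of the hardwired output-delay idea, assuming a character-level LSTM corrector; the vocabulary size, dimensions, delay value, and the `DelayedLSTMCorrector` class are illustrative assumptions, not the lecture's model:

```python
# The target sequence is shifted by a fixed delay, so the network has read a few
# more input characters before it must commit to each corrected character.
import torch
import torch.nn as nn

VOCAB, HIDDEN, DELAY = 64, 128, 3            # DELAY ~ typical edit distance

class DelayedLSTMCorrector(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 32)
        self.lstm = nn.LSTM(32, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, noisy_chars):
        h, _ = self.lstm(self.embed(noisy_chars))
        return self.out(h)                   # one prediction per input step

noisy = torch.randint(0, VOCAB, (4, 40))     # OCR output characters (with errors)
clean = torch.randint(0, VOCAB, (4, 40))     # ground-truth characters

model = DelayedLSTMCorrector()
logits = model(noisy)

# Hardwired delay: at step t the model predicts the clean character at t - DELAY,
# so the first DELAY outputs (the warm-up) are ignored in the loss.
loss = nn.CrossEntropyLoss()(
    logits[:, DELAY:].reshape(-1, VOCAB),
    clean[:, :-DELAY].reshape(-1),
)
loss.backward()
```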

Illustration: Automatic Question Generation

  • Problem: generating natural-language questions for a given paragraph, in order to check a reader's comprehension. Questions must be relevant to the text.
  • A sequence-to-sequence model with attention, augmented with a rich set of linguistic features in the encoding (see the sketch after this list).
  • The LSTM operates on words instead of characters.
  • The model: answer selection -> answer and feature encoding -> sentence encoder -> question decoder.
  • The encoder and decoder are both LSTMs.
  • Answer selection can be a named-entity selection or a pointer network.
  • The named-entity recognizer's output states whether a token is a location, organisation, etc., whereas the pointer network's output is a span (a token sequence).
  • The features are POS tags, named-entity tags, and dependency labels.
  • The sentence encoder is a bi-LSTM; attention is a distribution over the encoder hidden states. The key idea of attention is deciding which part of the input should be represented in each part of the output; it need not be the same word, it could be a synonym as well.
  • Limitations:
    1. These models optimize the cross-entropy loss during training.
    2. But the BLEU metric is used for evaluation, hence the need for a better-matched loss.
    3. Copy and coverage mechanisms need to be incorporated.
  • There is an RL angle here as well.
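
A condensed sketch of such a feature-augmented encoder-decoder with attention; all vocabulary sizes, dimensions, and the `QGModel` class are illustrative assumptions, not the actual model from the lecture:

```python
# Bi-LSTM sentence encoder over word embeddings concatenated with linguistic
# feature embeddings (POS / NER / dependency tags), and an LSTM question
# decoder with attention over the encoder hidden states (teacher forcing).
import torch
import torch.nn as nn
import torch.nn.functional as F

W_VOCAB, F_VOCAB, Q_VOCAB, EMB, FEAT, HID = 1000, 50, 1000, 64, 16, 128

class QGModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.word_emb = nn.Embedding(W_VOCAB, EMB)
        self.feat_emb = nn.Embedding(F_VOCAB, FEAT)     # POS / NER / dep labels
        self.encoder = nn.LSTM(EMB + FEAT, HID // 2, bidirectional=True,
                               batch_first=True)
        self.dec_emb = nn.Embedding(Q_VOCAB, EMB)
        self.decoder = nn.LSTMCell(EMB + HID, HID)
        self.out = nn.Linear(HID, Q_VOCAB)

    def forward(self, words, feats, question):
        enc_in = torch.cat([self.word_emb(words), self.feat_emb(feats)], dim=-1)
        enc_states, _ = self.encoder(enc_in)            # (B, S, HID)

        B = words.size(0)
        h = torch.zeros(B, HID)
        c = torch.zeros(B, HID)
        context = torch.zeros(B, HID)
        logits = []
        for t in range(question.size(1)):               # teacher forcing
            x = torch.cat([self.dec_emb(question[:, t]), context], dim=-1)
            h, c = self.decoder(x, (h, c))
            # Attention: which encoder positions matter for this output step?
            scores = torch.bmm(enc_states, h.unsqueeze(-1)).squeeze(-1)   # (B, S)
            alpha = F.softmax(scores, dim=-1)
            context = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)               # (B, T, Q_VOCAB)

words = torch.randint(0, W_VOCAB, (2, 12))
feats = torch.randint(0, F_VOCAB, (2, 12))
question = torch.randint(0, Q_VOCAB, (2, 8))
print(QGModel()(words, feats, question).shape)          # torch.Size([2, 8, 1000])
```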