In [ ]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"
This homework and associated bake-off are oriented toward building an effective system for generating color descriptions that are pragmatic in the sense that they would help a reader/listener figure out which color was being referred to in a shared context consisting of a target color (whose identity is known only to the describer/speaker) and a set of distractors.
The notebook colors_overview.ipynb should be studied before work on this homework begins. That notebook provides backgroud on the task, the dataset, and the modeling code that you will be using and adapting.
The homework questions are more open-ended than previous ones have been. Rather than asking you to implement pre-defined functionality, they ask you to try to improve baseline components of the full system in ways that you find to be effective. As usual, this culiminates in a prompt asking you to develop a novel system for entry into the bake-off. In this case, though, the work you do for the homework will likely be directly incorporated into that system.
See colors_overview.ipynb for set-up in instructions and other background details.
In [ ]:
from colors import ColorsCorpusReader
import os
from sklearn.model_selection import train_test_split
from torch_color_describer import (
ContextualColorDescriber, create_example_dataset)
import utils
from utils import START_SYMBOL, END_SYMBOL, UNK_SYMBOL
In [ ]:
utils.fix_random_seeds()
In [ ]:
COLORS_SRC_FILENAME = os.path.join(
"data", "colors", "filteredCorpus.csv")
In [ ]:
dev_corpus = ColorsCorpusReader(
COLORS_SRC_FILENAME,
word_count=2,
normalize_colors=True)
In [ ]:
dev_examples = list(dev_corpus.read())
This subset has about one-third the examples of the full corpus:
In [ ]:
len(dev_examples)
We should worry that it's not a fully representative sample. Most of the descriptions in the full corpus are shorter, and a large proportion are longer. So this dataset is mainly for debugging, development, and general hill-climbing. All findings should be validated on the full dataset at some point.
In [ ]:
dev_rawcols, dev_texts = zip(*[[ex.colors, ex.contents] for ex in dev_examples])
The raw color representations are suitable inputs to a model, but the texts are just strings, so they can't really be processed as-is. Question 1 asks you to do some tokenizing!
In [ ]:
dev_rawcols_train, dev_rawcols_test, dev_texts_train, dev_texts_test = \
train_test_split(dev_rawcols, dev_texts)
In [ ]:
def tokenize_example(s):
# Improve me!
return [START_SYMBOL] + s.split() + [END_SYMBOL]
In [ ]:
tokenize_example(dev_texts_train[376])
Your task: Modify tokenize_example
so that it does something more sophisticated with the input text.
Notes:
UNK_SYMBOL
.Important: don't forget to add the start and end symbols, else the resulting models will definitely be terrible!
Once the tokenizer is working, run the following cell to tokenize your inputs:
In [ ]:
dev_seqs_train = [tokenize_example(s) for s in dev_texts_train]
dev_seqs_test = [tokenize_example(s) for s in dev_texts_test]
We use only the train set to derive a vocabulary for the model:
In [ ]:
dev_vocab = sorted({w for toks in dev_seqs_train for w in toks}) + [UNK_SYMBOL]
It's important that the UNK_SYMBOL
is included somewhere in this list. Test examples with word not seen in training will be mapped to UNK_SYMBOL
. If you model's vocab is the same as your train vocab, then UNK_SYMBOL
will never be encountered during training, so it will be a random vector at test time.
In [ ]:
len(dev_vocab)
In [ ]:
def represent_color_context(colors):
# Improve me!
return [represent_color(color) for color in colors]
def represent_color(color):
# Improve me!
return color
In [ ]:
represent_color_context(dev_rawcols_train[0])
Your task: Modify represent_color_context
and/or represent_color
to represent colors in a new way.
Notes:
represent_color
. This might be unnatural if you want to perform an operation on each color trio all at once.The following cell just runs your represent_color_context
on the train and test sets:
In [ ]:
dev_cols_train = [represent_color_context(colors) for colors in dev_rawcols_train]
dev_cols_test = [represent_color_context(colors) for colors in dev_rawcols_test]
At this point, our preprocessing steps are complete, and we can fit a first model.
In [ ]:
dev_mod = ContextualColorDescriber(
dev_vocab,
embed_dim=10,
hidden_dim=10,
max_iter=5,
batch_size=128)
In [ ]:
_ = dev_mod.fit(dev_cols_train, dev_seqs_train)
As discussed in colors_overview.ipynb, our primary metric is listener_accuracy
:
In [ ]:
dev_mod.listener_accuracy(dev_cols_test, dev_seqs_test)
We can also see the model's predicted sequences given color context inputs:
In [ ]:
dev_mod.predict(dev_cols_test[:1])
In [ ]:
dev_seqs_test[:1]
The above model uses a random initial embedding, as configured by the decoder used by ContextualColorDescriber
. This homework question asks you to consider using GloVe inputs.
Your task: Complete create_glove_embedding
so that it creates a GloVe embedding based on your model vocabulary. This isn't mean to be analytically challenging, but rather just to create a basis for you to try out other kinds of rich initialization.
In [ ]:
GLOVE_HOME = os.path.join('data', 'glove.6B')
In [ ]:
def create_glove_embedding(vocab, glove_base_filename='glove.6B.50d.txt'):
# Use `utils.glove2dict` to read in the GloVe file:
##### YOUR CODE HERE
# Use `utils.create_pretrained_embedding` to create the embedding.
# This function will, by default, ensure that START_TOKEN,
# END_TOKEN, and UNK_TOKEN are included in the embedding.
##### YOUR CODE HERE
# Be sure to return the embedding you create as well as the
# vocabulary returned by `utils.create_pretrained_embedding`,
# which is likely to have been modified from the input `vocab`.
##### YOUR CODE HERE
Let's see if GloVe helped for our development data:
In [ ]:
dev_glove_embedding, dev_glove_vocab = create_glove_embedding(dev_vocab)
The above might dramatically change your vocabulary, depending on how many items from your vocab are in the Glove space:
In [ ]:
len(dev_vocab)
In [ ]:
len(dev_glove_vocab)
In [ ]:
dev_mod_glove = ContextualColorDescriber(
dev_glove_vocab,
embedding=dev_glove_embedding,
hidden_dim=10,
max_iter=5,
batch_size=128)
In [ ]:
_ = dev_mod_glove.fit(dev_cols_train, dev_seqs_train)
In [ ]:
dev_mod_glove.listener_accuracy(dev_cols_test, dev_seqs_test)
You probably saw a small boost, assuming your tokeization scheme leads to good overlap with the GloVe vocabulary. The input representations are larger than in our previous model (at least as I configured things), so we would need to do more runs with higher max_iter
values to see whether this is worthwhile overall.
The final required homework question is the most challenging, but it should set you up to think in much more flexible ways about the underlying model we're using.
The question asks you to modify various model components in torch_color_describer.py
. The section called Modifying the core model from the core unit notebook provides a number of examples illustrating the basic techniques, so you might review that material if you get stuck here.
Your task: Monroe et al. 2017 append the target color (the final one in the context) to each input token that gets processed by the decoder. The question asks you to subclass the Decoder
and EncoderDecoder
from torch_color_describer.py
so that you can build models that do this.
Step 1: Modify the Decoder
so that the input vector to the model at each timestep is not just a token representation x
but the concatenation of x
with the representation of the target color.
Notes:
You might notice at this point that the original Decoder.forward
method has an optional keyword argument target_colors
that is passed to Decoder.get_embeddings
. Because this is already in place, all you have to do is modify the get_embeddings
method to use this argument.
The change affects the configuration of self.rnn
, so you need to subclass the __init__
method as well, so that its input_size
argument accomodates the embedding as well as the color representations.
You can do the relevant operations efficiently in pure PyTorch using repeat_interleave
and cat
, but the important thing is to get a working implementation – you can always optimize the code later if the ideas prove useful to you.
Here's skeleton code for you to flesh out:
In [ ]:
import torch
import torch.nn as nn
from torch_color_describer import Decoder
class ColorContextDecoder(Decoder):
def __init__(self, color_dim, *args, **kwargs):
self.color_dim = color_dim
super().__init__(*args, **kwargs)
# Fix the `self.rnn` attribute:
##### YOUR CODE HERE
def get_embeddings(self, word_seqs, target_colors=None):
"""You can assume that `target_colors` is a tensor of shape
(m, n), where m is the length of the batch (same as
`word_seqs.shape[0]`) and n is the dimensionality of the
color representations the model is using. The goal is
to attached each color vector i to each of the tokens in
the ith sequence of (the embedded version of) `word_seqs`.
"""
##### YOUR CODE HERE
Step 2: Modify the EncoderDecoder
. For this, you just need to make a small change to the forward
method: extract the target colors from color_seqs
and feed them to the decoder.
In [ ]:
from torch_color_describer import EncoderDecoder
class ColorizedEncoderDecoder(EncoderDecoder):
def forward(self,
color_seqs,
word_seqs,
seq_lengths=None,
hidden=None,
targets=None):
if hidden is None:
hidden = self.encoder(color_seqs)
# Extract the target colors from `color_seqs` and
# feed them to the decoder, which already has a
# `target_colors` keyword.
##### YOUR CODE HERE
return output, hidden, targets
Step 3: Finally, as in the examples in Modifying the core model, you need to modify the build_graph
method of ContextualColorDescriber
so that it uses your new ColorContextDecoder
and ColorizedEncoderDecoder
. Here's starter code:
In [ ]:
from torch_color_describer import Encoder
class ColorizedInputDescriber(ContextualColorDescriber):
def build_graph(self):
# We didn't modify the encoder, so this is
# just copied over from the original:
encoder = Encoder(
color_dim=self.color_dim,
hidden_dim=self.hidden_dim)
# Use your `ColorContextDecoder`, making sure
# to pass in all the keyword arguments coming
# from `ColorizedInputDescriber`:
##### YOUR CODE HERE
# Return a `ColorizedEncoderDecoder` that uses
# your encoder and decoder:
##### YOUR CODE HERE
That's it! Since these modifications are pretty intricate, you might want to use a toy dataset to debug it:
In [ ]:
toy_color_seqs, toy_word_seqs, toy_vocab = create_example_dataset(
group_size=50, vec_dim=2)
In [ ]:
toy_color_seqs_train, toy_color_seqs_test, toy_word_seqs_train, toy_word_seqs_test = \
train_test_split(toy_color_seqs, toy_word_seqs)
In [ ]:
toy_mod = ColorizedInputDescriber(
toy_vocab,
embed_dim=10,
hidden_dim=10,
max_iter=100,
batch_size=128)
In [ ]:
_ = toy_mod.fit(toy_color_seqs_train, toy_word_seqs_train)
In [ ]:
toy_mod.listener_accuracy(toy_color_seqs_test, toy_word_seqs_test)
If that worked, then you can now try this model on SCC problems!
There are many options for your original system, which consists of the full pipeline – all preprocessing and modeling steps. You are free to use any model you like, as long as you subclass ContextualColorDescriber
in a way that allows its listener_accuracy
method to behave in the expected way.
So that we can evaluate models in a uniform way for the bake-off, we ask that you modify the function my_original_system
below so that it accepts a trained instance of your model and does any preprocessing steps required by your model.
If we seek to reproduce your results, we will rerun this entire notebook. Thus, it is fine if your my_original_system
makes use of functions you wrote or modified above this cell.
In [ ]:
def my_original_system(trained_model, color_seqs_test, texts_test):
"""Feel free to modify this code to accommodate the needs of
your system. Just keep in mind that it will get raw corpus
examples as inputs for the bake-off.
"""
# `word_seqs_test` is a list of strings, so tokenize each of
# its elements:
tok_seqs = [tokenize_example(s) for s in texts_test]
col_seqs = [represent_color_context(colors)
for colors in color_seqs_test]
# Return the `listener_accuracy` for your model:
return trained_model.listener_accuracy(col_seqs, tok_seqs)
If my_original_system
works on test sets you create from the corpus distribution, then it will works for the bake-off, so consider checking that. For example, this would check that dev_mod
above passes muster:
In [ ]:
my_original_system(dev_mod, dev_rawcols_test, dev_texts_test)
In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies.
In [ ]:
# Enter your system description in this cell.
# Please do not remove this comment.
For the bake-off, we will release a test set. The announcement will go out on the discussion forum. You will evaluate your custom model from the previous question on these new datasets using your my_original_system
function. Rules:
The cells below this one constitute your bake-off entry.
People who enter will receive the additional homework point, and people whose systems achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.
Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.
The announcement will include the details on where to submit your entry.
In [ ]:
# Enter your bake-off assessment code in this cell.
# Please do not remove this comment.
In [ ]:
# On an otherwise blank line in this cell, please enter
# your listener_accuracy score as reported by the code
# above. Please enter only a number between 0 and 1 inclusive.
# Please do not remove this comment.